
Context Analysis in Text Mining and Search

Qiaozhu Mei
Department of Computer Science

University of Illinois at Urbana-Champaign

http://sifaka.cs.uiuc.edu/~qmei2, [email protected]


Joint work with ChengXiang Zhai

2008 © Qiaozhu Mei University of Illinois at Urbana-Champaign

Motivating Example: Personalized Search


Mountain safety research

Metropolis Street Racer

Molten salt reactor

Mars Sample Return

Magnetic Stripe Reader

MSR

Actually Looking for Microsoft Research…


Motivating Example: Comparing Product Reviews

Common Themes | “IBM” specific     | “APPLE” specific    | “DELL” specific
Battery Life  | Long, 4-3 hrs      | Medium, 3-2 hrs     | Short, 2-1 hrs
Hard disk     | Large, 80-100 GB   | Small, 5-10 GB      | Medium, 20-50 GB
Speed         | Slow, 100-200 MHz  | Very Fast, 3-4 GHz  | Moderate, 1-2 GHz

IBM Laptop Reviews

APPLE Laptop Reviews

DELL Laptop Reviews

Unsupervised discovery of common topics and their variations


Motivating Example: Discovering Topical Trends in Literature

Unsupervised discovery of topics and their temporal variations

[Figure: topic strength over time (1980, 1990, 1998, 2003) for SIGIR topics: TF-IDF Retrieval, IR Applications, Language Model, Text Categorization.]


Motivating Example:Analyzing Spatial Topic Patterns

• How do bloggers in different states respond to topics such as “oil price increase during Hurricane Katrina”?

• Unsupervised discovery of topics and their variations in different locations


Motivating Example: Summarizing Sentiments

Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

[Figure: topic-sentiment dynamics over time (Topic = Price), with positive, negative, and neutral strength curves.]

Query: Dell Laptops

Topic-sentiment summary: facets (Facet 1: Price; Facet 2: Battery) × sentiments (neutral / positive / negative), with example sentences:

• my Dell battery sucks

• Stupid Dell laptop battery

• One thing I really like about this Dell battery is the Express Charge feature.

• i still want a free battery from dell..

• …… • ……

• it is the best site and they show Dell coupon code as early as possible

• Even though Dell's price is cheaper, we still don't want it.

• ……

• mac pro vs. dell precision: a price comparis..

• DELL is trading at $24.66

Motivating Example: Analyzing Topics on a Social Network


Publications of Gerard Salton

Publications of Bruce Croft

Unsupervised discovery of topics and correlated research communities

Data miningMachine learning

Information retrieval

Bruce Croft

Gerard Salton


Research Questions

• What do these problems have in common?

• Can we model all these problems generally?

• Can we solve these problems with a unified approach?

• How can we bring humans into the loop?


Rest of Talk

• Background: Language Models in Text Mining and Retrieval

• Definition of context

• General methodology to model context

– Models, example applications, results

• Conclusion and Discussion

Generative Models of Text

• Text as observations: words; tags; links, etc

• Use a unified probabilistic model to explain the appearance (generation) of observations

• Documents are generated by sampling every observation from such a generative model

• Different generation assumptions → different models

– Document Language Models

– Probabilistic Topic Models: PLSA, LDA, etc.

– Hidden Markov Models …


Multinomial Language Models


Known as a Topic model when there are k of them in text:

A multinomial distribution of words as a text representation

retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, …

e.g., semi-supervised learning; boosting; spectral clustering, etc.
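A topic model in this sense is just a multinomial distribution p(w|θ) over words. A minimal Python sketch (word probabilities taken from the distribution above; the `<other>` catch-all token is an illustrative assumption that collapses the rest of the vocabulary):

```python
import random

# A topic as a multinomial distribution over words (values from the slide).
topic_ir = {
    "retrieval": 0.20, "information": 0.15, "model": 0.08,
    "query": 0.07, "language": 0.06, "feedback": 0.03,
}
# Remaining probability mass, collapsed into a single catch-all token.
topic_ir["<other>"] = 1.0 - sum(topic_ir.values())

def sample_words(topic, n, seed=0):
    """Generate n words by sampling i.i.d. from the multinomial."""
    rng = random.Random(seed)
    words, probs = zip(*topic.items())
    return rng.choices(words, weights=probs, k=n)

words = sample_words(topic_ir, 10)
```

High-probability words like “retrieval” and “information” dominate the samples, which is exactly what makes the distribution usable as a text representation.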


Language Models in Information Retrieval (e.g., KL-Div. Method)


Document d: a text mining paper

Query q: “data mining”

Doc Language Model (LM) θd : p(w|d): text = 4/100 = 0.04, mining = 3/100 = 0.03, clustering = 1/100 = 0.01, …, data = 0, computing = 0, …

Smoothed Doc LM θd' : p(w|d'): text = 0.039, mining = 0.028, clustering = 0.01, …, data = 0.001, computing = 0.0005, …

Query Language Model θq : p(w|q): data = 1/2 = 0.5, mining = 1/2 = 0.5

Expanded query model p(w|q'): data = 0.4, mining = 0.4, clustering = 0.1, …

Similarity function (KL divergence between query and smoothed document models):

D(θq || θd) = Σ_{w∈V} p(w|θq) log [ p(w|θq) / p(w|θd) ]
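The KL-divergence ranking step can be sketched in a few lines of Python (probabilities adapted from the slide; the `<other>` token collapsing the remaining vocabulary mass is an assumption for illustration):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)); q must be smoothed so q(w) > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

# Expanded query model and smoothed document model (values from the slide,
# remaining mass collapsed into an <other> token).
theta_q = {"data": 0.4, "mining": 0.4, "clustering": 0.1, "<other>": 0.1}
theta_d = {"data": 0.001, "mining": 0.028, "clustering": 0.01,
           "text": 0.039, "<other>": 0.922}

score = -kl_divergence(theta_q, theta_d)  # higher score = more relevant document
```

Documents are then ranked by the negative divergence, so smaller D(θq || θd) means a better match.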


Probabilistic Topic Models for Text Mining

Text Collections

ProbabilisticTopic Modeling

web 0.21, search 0.10, link 0.08, graph 0.05, …

term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03, …

Topic models(Multinomial distributions)

PLSA [Hofmann 99]

LDA [Blei et al. 03]

Author-Topic [Steyvers et al. 04]

CPLSA [Mei & Zhai 06]

Pachinko allocation[Li & McCallum 06]

CTM[Blei et al. 06]


Subtopic discovery

Opinion comparison

Summarization

Topical pattern analysis

Passage segmentation

Importance of Context


• Science in the year 2000 and Science in the year 1500:

Are we still working on the same topics?

• For a computer scientist and a gardener: does “tree, root, prune” mean the same?

• “Football” means soccer in Europe. What about in US?

Context affects topics!


Context Features of Text (Meta-data)

Weblog Article

Author

Author’s Occupation
Location
Time

communities

source


Context = Partitioning of Text

1999

2005

2006

1998

…… ……

papers written in 1998

WWW SIGIR ACL KDD SIGMOD

papers written by authors in US

Papers about Web

Rich Context Information in Text

• News articles: time, publisher, etc.

• Blogs: time, location, author, …

• Scientific Literature: author, publication year, conference, citations, …

• Query Logs: time, IP address, user, clicks, …

• Customer reviews: product, source, time, sentiments..

• Emails: sender, receiver, time, thread, …

• Web pages: domain, time, click rate, etc.

• More? entity-relations, social networks, ……


Categories of Context

• Some partitions of text are explicit → explicit context

– Time; location; author; conference; user; IP; etc

– Similar to metadata

• Some partitions are implicit → implicit context

– Sentiments; missions; goals; intents;

• Some partitions are at document level

• Some are at a finer granularity

– Context of a word; an entity; a pattern; a query, etc.

– Sentences; sliding windows; adjacent words; etc


Context Analysis

• Use context to infer semantics

– Annotating frequent patterns; labeling of topic models

• Use context to provide targeted service

– Personalized search; intent-based search; etc.

• Compare contextual patterns of topics

– Evolutionary topic patterns; spatiotemporal topic patterns; topic-sentiment patterns; etc.

• Use context to help other tasks

– Social network analysis; impact summarization; etc.


General Methodology to Model Context

• Context Generative Model

– Observations in the same context are generated with a unified model

– Observations in different contexts are generated with different models

– Observations in similar contexts are generated with similar models

• Text is generated with a mixture of such generative models

– Example Task; Model; Sample results


Model a unique context with a unified model (Generation)


Probabilistic Latent Semantic Analysis (Hofmann ’99)


A Document d

Topics θ1…k

government

donation

New Orleans

government 0.3, response 0.2, …

donate 0.1, relief 0.05, help 0.02, …

city 0.2, new 0.1, orleans 0.05, …

πd : P(θi|d)

government

donate

new

Draw a word from θi

response

aid help

Orleans

Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …

Choose a topic


[Plate notation: for each of D documents, each of N words w_{d,n} is generated by choosing a topic z_{d,n} from the document’s mixture πd and drawing the word from topic θk, k = 1..K.]

p(d, w_{d,n}) = p(d) Σ_{k=1..K} p(w_{d,n} | z_{d,n} = k) p(z_{d,n} = k | d)

Parameters: topic word distributions θk = p(w|θk); document topic mixtures πd = p(z|d).

Documents about “Hurricane Katrina”
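As a rough illustration of how PLSA’s parameters θk and πd are fit, here is a minimal EM implementation in pure Python. This is a sketch, not the original experiments; the toy document/word counts are invented:

```python
import random
from collections import defaultdict

def plsa(counts, k, iters=30, seed=0):
    """Fit PLSA with EM. counts: {(doc, word): count}. Returns p(w|z), p(z|d)."""
    rng = random.Random(seed)
    docs = sorted({d for d, _ in counts})
    vocab = sorted({w for _, w in counts})

    def normalize(dist):
        s = sum(dist.values())
        return {key: val / s for key, val in dist.items()}

    # Random (strictly positive) initialization, then normalize.
    p_wz = {z: normalize({w: rng.random() + 0.1 for w in vocab}) for z in range(k)}
    p_zd = {d: normalize({z: rng.random() + 0.1 for z in range(k)}) for d in docs}

    for _ in range(iters):
        nw = {z: defaultdict(float) for z in range(k)}
        nd = {d: defaultdict(float) for d in docs}
        for (d, w), c in counts.items():
            # E-step: posterior p(z | d, w) for this doc-word pair.
            post = [p_zd[d][z] * p_wz[z][w] for z in range(k)]
            s = sum(post)
            for z in range(k):
                nw[z][w] += c * post[z] / s
                nd[d][z] += c * post[z] / s
        # M-step: re-estimate p(w|z) and p(z|d) from expected counts.
        p_wz = {z: normalize({w: nw[z][w] for w in vocab}) for z in range(k)}
        p_zd = {d: normalize(dict(nd[d])) for d in docs}
    return p_wz, p_zd

counts = {("d1", "government"): 5, ("d1", "response"): 3,
          ("d2", "donate"): 4, ("d2", "relief"): 4}
p_wz, p_zd = plsa(counts, k=2)
```

After EM converges, each p(w|z) is a topic like the “government” and “donation” distributions above, and each p(z|d) is that document’s πd.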

Example: Topics in Science (D. Blei 05)


Label a Multinomial Topic Model

• Semantically close (relevance)

• Understandable – phrases?

• High coverage inside topic

• Discriminative across topics

term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

iPod Nano

Pseudo-feedback

Information Retrieval

Retrieval models

じょうほうけんさく (Japanese for “information retrieval”) – Mei and Zhai 06: a topic in SIGIR


Automatic Labeling of Topics

Collection (e.g., SIGIR)

term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …

filtering 0.21, collaborative 0.15, …

trec 0.18, evaluation 0.10, …

NLP Chunker / Ngram Stat.

information retrieval, retrieval model, index structure, relevance feedback, …

Candidate label pool (step 1): information retrieval, retrieval model, index structure, relevance feedback, …

Relevance Score (step 2): information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …

Discrimination (step 3): information retrieval 0.26 → 0.01, retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …

Coverage (step 4): retrieval models 0.20, IR models 0.18 → 0.02, pseudo feedback 0.09, …, information retrieval 0.01
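A toy sketch of the relevance and discrimination steps above. This is a simplified zero-order scoring, not the paper’s exact method; the helper names and the penalty weight `mu` are hypothetical, while the word probabilities come from the slide’s lists:

```python
def relevance(label_words, topic):
    """Zero-order relevance: how much topic mass the label's words cover."""
    return sum(topic.get(w, 0.0) for w in label_words)

def discriminate(scores, other_scores, mu=0.7):
    """Penalize labels that also score high for competing topics."""
    return {l: s - mu * other_scores.get(l, 0.0) for l, s in scores.items()}

topic = {"term": 0.16, "relevance": 0.07, "weight": 0.07, "feedback": 0.04}
other = {"filtering": 0.21, "collaborative": 0.15}
labels = {"relevance feedback": ["relevance", "feedback"],
          "collaborative filtering": ["collaborative", "filtering"]}

scores = {l: relevance(ws, topic) for l, ws in labels.items()}
other_scores = {l: relevance(ws, other) for l, ws in labels.items()}
final = discriminate(scores, other_scores)
```

“relevance feedback” wins because it covers mass inside the target topic while “collaborative filtering” belongs to the competing one, mirroring how the discrimination step demotes labels shared across topics.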


Clustering

hash

dimension

algorithm

partition

p(w | clustering algorithm )

Good Label (l1)“clustering algorithm”

Clustering

hash

dimension

key

algorithm…

p(w | hash join)

key …hash join… code …hashtable …search…hash join…

map key…hash…algorithm…key

…hash…keytable…join…

l2: “hash join”

Label Relevance: Context Comparison

• Intuition: prefer the label whose context has a word distribution similar to the topic’s

Clustering

dimension

partition

algorithm

hash

Topic

Rank labels by comparing the topic’s word distribution p(w|θ) with the label’s context distribution:

Score(l, θ) = Σ_w p(w|θ) PMI(w, l | C) ≈ -D(θ || θl)
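The context-comparison score can be sketched as follows (the topic and label-context distributions are invented toy values; `eps` smooths unseen words and is an assumption not on the slide):

```python
import math

def neg_kl(topic, label_ctx, eps=1e-6):
    """Score(l, theta) = -D(theta || p(w | context of l)); higher is better."""
    return -sum(p * math.log(p / label_ctx.get(w, eps))
                for w, p in topic.items() if p > 0)

topic = {"clustering": 0.4, "algorithm": 0.3, "hash": 0.2, "dimension": 0.1}
# Hypothetical context distributions for two candidate labels.
ctx_clustering_algorithm = {"clustering": 0.35, "algorithm": 0.3,
                            "hash": 0.2, "dimension": 0.15}
ctx_hash_join = {"hash": 0.5, "join": 0.3, "key": 0.2}

good = neg_kl(topic, ctx_clustering_algorithm)
bad = neg_kl(topic, ctx_hash_join)
```

“clustering algorithm” scores near zero (its context closely matches the topic), while “hash join” is heavily penalized because most of the topic’s words are absent from its context.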


Results: Sample Topic Labels

Topic 1: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → labels: “r tree”, “b tree”, …, indexing methods

Topic 2: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007 → label: “iran contra”, …

Topic 3: the, of, a, and, to, data (each > 0.02), …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005 → labels: “clustering algorithm”, “clustering structure”; also: “large data”, “data quality”, “high data”, “data application”, …


Model different contexts with different models

(Discrimination, Comparison)


Example: Finding Evolutionary Patterns of Topics


T

SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …

decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …

classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …

information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …

……

1999

web 0.009, classification 0.007, features 0.006, topic 0.005, …

mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …

topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …

2000 2001 2002 2003 2004

KDD


Content Variations

over Contexts

Example: Finding Evolutionary Patterns of Topics (II)


[Figure from (Mei ’05): normalized strength of themes in KDD over time (1999-2004); themes: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business.]

Strength Variations over Contexts

View of Topics: Context-Specific Version of Views


One context, one view: a document selects from a mix of views

Topic 1: Retrieval Model: retrieve, model, relevance, document, query

Topic 2: Feedback: feedback, judge, expansion, pseudo, query

Context 1: 1998 ~ 2006 (e.g., after “Language Modeling”): language, model, smoothing, query, generation, feedback, mixture, estimate, EM, pseudo

Context 2: 1977 ~ 1998 (i.e., before “Language Modeling”): vector, Rocchio, weighting, feedback, term, vector space, TF-IDF, Okapi, LSI, retrieval


Coverage of Topics: Distribution over Topics


Background

• A coverage of topics: a (strength) distribution over the topics.
• One context, one coverage.
• A document selects from a mix of multiple coverages.

Oil Price

Government Response

Aid and donation

Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …

Background

Oil PriceGovernment Response

Aid and donation

Context: Texas

Context: Louisiana


A General Solution: CPLSA

• CPLSA = Contextual Probabilistic Latent Semantic Analysis

• An extension of PLSA model ([Hofmann 99]) by

– Introducing context variables

– Modeling views of topics

– Modeling coverage variations of topics

• Process of contextual text mining

– Instantiation of CPLSA (context, views, coverage)

– Fit the model to text data (EM algorithm)

– Compare a topic from different views

– Compute strength dynamics of topics from coverages

– Compute other probabilistic topic patterns

The “Generation” Process


View1 View2 View3

Texas July 2005

sociologist

Context of Document: Time = July 2005; Location = Texas; Author = Eric Brill; Occup. = Sociologist; Age = 45+; …

Topics

government

donation

New Orleans

government 0.3, response 0.2, …

donate 0.1, relief 0.05, help 0.02, …

city 0.2, new 0.1, orleans 0.05, …

Choose a view

Choose a Coverage

government

donate

new

Draw a word from θi

response

aid help

Orleans

Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …

Choose a theme

Topic coverages:

Texas July 2005 document

……

sociologist


An Intuitive Example

• Two topics: web search; machine learning

• I am writing a WWW paper. I will cover more about “web search” instead of “machine learning”.

– But of course I have my own taste.

• I am from a search engine company, so when I write about “web search”, I will focus on “search engine” and “online advertisements”…


Coverage

donate 0.1, relief 0.05, help 0.02, …

city 0.2, new 0.1, orleans 0.05, …

View


The Probabilistic Model

• A probabilistic model explaining the generation of a document D and its context features C: to write such a document, the author will
– Choose a view vi according to the view distribution p(vi | D, C)
– Choose a coverage κj according to the coverage distribution p(κj | D, C)
– Choose a theme θl according to the coverage κj, i.e., with probability p(θl | κj)
– Generate a word w using the view-specific theme, p(w | θl^(i))
• The log-likelihood of the document collection is:

log p(C) = Σ_{D∈C} Σ_{w∈V} c(w, D) log Σ_{i=1..m} p(vi | D, C) Σ_{j=1..n} p(κj | D, C) Σ_{l=1..k} p(θl | κj) p(w | θl^(i))
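A minimal sampler for this generation process. The view/coverage/theme tables are loosely based on the slide’s Hurricane Katrina example; all structures and probabilities here are illustrative assumptions:

```python
import random

def generate_word(views, coverages, p_view, p_cov, rng):
    """One CPLSA-style generation step: view -> coverage -> theme -> word."""
    view = rng.choices(list(p_view), weights=p_view.values(), k=1)[0]
    cov = rng.choices(list(p_cov), weights=p_cov.values(), k=1)[0]
    themes = coverages[cov]                     # p(theme | coverage)
    theme = rng.choices(list(themes), weights=themes.values(), k=1)[0]
    word_dist = views[view][theme]              # view-specific p(w | theme)
    return rng.choices(list(word_dist), weights=word_dist.values(), k=1)[0]

# Toy tables: one view (Texas) with two themes, one coverage (July 2005).
views = {"Texas": {
    "government": {"government": 0.3, "response": 0.2, "criticism": 0.5},
    "donation": {"donate": 0.4, "relief": 0.3, "help": 0.3},
}}
coverages = {"July 2005": {"government": 0.6, "donation": 0.4}}

rng = random.Random(0)
doc = [generate_word(views, coverages, {"Texas": 1.0}, {"July 2005": 1.0}, rng)
       for _ in range(8)]
```

Fitting the model reverses this process: EM infers which view, coverage, and theme most plausibly produced each observed word.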


Example Results: Query Log Analysis (Context = Days of Week)

Day-Week Pattern of Search Difficulty

[Figure: total clicks and search difficulty H(Url | IP, Q) by day over Jan 2006 (Jan 1st is a Sunday).]

Query & Clicks: more query/clicks on weekdays

Search Difficulty: more difficult to predict on weekends
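Search difficulty here is measured by a conditional entropy over the click distribution. A small sketch of computing H(Url | IP, Q) from click counts (the click records below are made-up):

```python
import math
from collections import defaultdict

def conditional_entropy(clicks):
    """H(Url | IP, Q) in bits from (ip, query, url, count) click records."""
    ctx_total = defaultdict(float)
    for ip, q, url, c in clicks:
        ctx_total[(ip, q)] += c
    total = sum(ctx_total.values())
    h = 0.0
    for ip, q, url, c in clicks:
        p_joint = c / total                 # p(ip, q, url)
        p_cond = c / ctx_total[(ip, q)]     # p(url | ip, q)
        h -= p_joint * math.log2(p_cond)
    return h

clicks = [("1.2.3.4", "msg", "msg.com", 8),
          ("1.2.3.4", "msg", "thegarden.com", 2),
          ("5.6.7.8", "msg", "thegarden.com", 10)]
difficulty = conditional_entropy(clicks)
```

Zero entropy means clicks are perfectly predictable given (IP, query); higher values mean harder, more ambiguous search, which is the weekday/weekend contrast the slide points at.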


[Figures: query frequency over Jan 2006 (Jan 1st is a Sunday) for the queries “yahoo”, “mapquest”, “cnn” and for “sex”, “movie”, “mp3”.]

Business Queries: clear day-week pattern; weekdays more frequent than weekends

Consumer Queries: no clear day-week pattern; weekends are comparable, even more frequent than weekdays

Query Log Analysis (Context = Type of Query)

Bursting Topics in SIGMOD: Context = Time (Years)

[Figure: bursting topics in SIGMOD over time: Sensor data, XML data, Web data, Data Streams, Ranking/Top-K.]

Spatiotemporal Text Mining: Context = Time & Location


Week4: The theme is again strong along the east coast and the Gulf of Mexico

Week3: The theme distributes more uniformly over the states

Week2: The discussion moves towards the north and west

Week5: The theme fades out in most states

Week1: The theme is the strongest along the Gulf of Mexico

About Government Responsein Hurricane Katrina


Faceted Opinions (Context = Sentiments)

Neutral Positive Negative

Topic 1:Movie

... Ron Howards selection of Tom Hanks to play Robert Langdon.

Tom Hanks stars in the movie,who can be mad at that?

But the movie might get delayed, and even killed off if he loses.

Directed by: Ron Howard Writing credits: Akiva Goldsman ...

Tom Hanks, who is my favorite movie star act the leading role.

protesting ... will lose your faith by ... watching the movie.

After watching the movie I went online and some research on ...

Anybody is interested in it?

... so sick of people making such a big deal about a FICTION book and movie.

Topic 2:Book

I remembered when i first read the book, I finished the book in two days.

Awesome book. ... so sick of people making such a big deal about a FICTION book and movie.

I’m reading “Da Vinci Code” now.

So still a good book to past time.

This controversy book cause lots conflict in west society.


Sentiment Dynamics (Context = Time & Sentiments)


Facet: the book “ the da vinci code”. ( Bursts during the movie, Pos > Neg )

Facet: the impact on religious beliefs. ( Bursts during the movie, Neg > Pos )


“ the da vinci code”


Event Impact Analysis: IR Research

Theme: retrieval models: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

Theme snapshots around the two events (word distributions):

vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …

probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …

model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …

xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …

Events (SIGIR papers): year 1992: starting of the TREC conferences; 1998: publication of the paper “A language modeling approach to information retrieval”.

Model similar context with similar models

(Smoothing, Regularization)


Personalization with Backoff

• Ambiguous query: MSG
– Madison Square Garden
– Monosodium Glutamate

• Disambiguate based on user’s prior clicks

• We don’t have enough data for everyone!
– Backoff to classes of users

• Proof of Concept:
– Classes defined by IP addresses

• Better:
– Market Segmentation (Demographics)
– Collaborative Filtering (Other users who click like me)

Context = IP


P(Url | IP, Q) = λ4 P(Url | IP4, Q) + λ3 P(Url | IP3, Q) + λ2 P(Url | IP2, Q) + λ1 P(Url | IP1, Q) + λ0 P(Url | IP0, Q)

Backoff levels of the IP address:
IP4 = 156.111.188.243
IP3 = 156.111.188.*
IP2 = 156.111.*.*
IP1 = 156.*.*.*
IP0 = *.*.*.*

Full personalization: every context has a different model: sparse data!

No personalization: all contexts share the same model.

Personalization with backoff: similar contexts have similar models.
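The interpolation above can be sketched as follows (the level models and λ values here are invented for illustration; in the actual system the λs are estimated with EM):

```python
def backoff_prob(url, ip, query, level_models, lambdas):
    """Interpolated personalization: mix P(url | IP prefix, query) over IP levels.
    level_models[i] maps (ip_prefix, query) -> {url: prob}, keeping i IP bytes."""
    parts = ip.split(".")
    p = 0.0
    for i, lam in enumerate(lambdas):                 # lambdas[i]: first i bytes
        prefix = ".".join(parts[:i] + ["*"] * (4 - i))
        dist = level_models[i].get((prefix, query), {})
        p += lam * dist.get(url, 0.0)
    return p

# Hypothetical per-level click models for the ambiguous query "msg".
level_models = {
    0: {("*.*.*.*", "msg"): {"msg.com": 0.7, "thegarden.com": 0.3}},
    1: {("156.*.*.*", "msg"): {"msg.com": 0.6, "thegarden.com": 0.4}},
    2: {("156.111.*.*", "msg"): {"thegarden.com": 0.8, "msg.com": 0.2}},
    3: {("156.111.188.*", "msg"): {"thegarden.com": 0.9}},
    4: {("156.111.188.243", "msg"): {"thegarden.com": 1.0}},
}
lambdas = [0.3, 0.2, 0.2, 0.2, 0.1]  # weights for IP0 .. IP4, summing to 1

p = backoff_prob("thegarden.com", "156.111.188.243", "msg", level_models, lambdas)
```

Users with sparse history lean on the coarse levels (IP0, IP1), while heavy users get most of their mass from the fine-grained ones.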


Backing Off by IP

• λs estimated with EM and CV

• A little bit of personalization– Better than too much

– Or too little

[Figure: estimated mixture weights λ4, λ3, λ2, λ1, λ0.]

λ4: weight for first 4 bytes of IP
λ3: weight for first 3 bytes of IP
λ2: weight for first 2 bytes of IP
……

P(Url | IP, Q) = Σ_{i=0..4} λi P(Url | IPi, Q)

Sparse Data → Missed Opportunity

Social Network as Correlated Contexts


Optimization of Relevance Feedback Weights

Parallel Architecture in IR ...

Predicting query performance

…A Language

Modeling Approach to Information

Retrieval...


Linked contexts are similar to each other

Social Network Context for Topic Modeling


• Context = author

• Coauthor = similar contexts

• Intuition: I work on similar topics to my neighbors

Smoothed Topic distributions over context

e.g. coauthor network



Topic Modeling with Network Regularization (NetPLSA)


• Basic Assumption (e.g., co-author graph): related authors work on similar topics

PLSA

Graph Harmonic Regularizer,

Generalization of [Zhu ’03],

O(C, G) = -(1 - λ) Σ_d Σ_w c(w, d) log Σ_{j=1..k} p(θj | d) p(w | θj)
          + λ · (1/2) Σ_{(u,v)∈E} w(u, v) Σ_{j=1..k} (p(θj | u) - p(θj | v))²

where:
– w(u, v): importance (weight) of an edge
– (p(θj | u) - p(θj | v))²: difference of topic distribution on neighbor vertices
– λ: tradeoff between topic likelihood and smoothness
– p(θj | d): topic distribution of a document

The regularizer equals Σ_{j=1..k} fjᵀ Δ fj, where fj,u = p(θj | u), j = 1..k, and Δ is the graph Laplacian.
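The regularizer itself is straightforward to compute directly; a minimal sketch (the toy author graph and topic distributions are invented):

```python
def harmonic_regularizer(edges, p_topic):
    """R = 1/2 * sum_{(u,v) in E} w(u,v) * sum_j (p(theta_j|u) - p(theta_j|v))^2."""
    r = 0.0
    for u, v, w in edges:
        r += 0.5 * w * sum((p_topic[u][j] - p_topic[v][j]) ** 2
                           for j in range(len(p_topic[u])))
    return r

# Toy coauthor graph: a and b work on similar topics; c does not.
p_topic = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9]}

r_smooth = harmonic_regularizer([("a", "b", 1.0)], p_topic)  # similar neighbors
r_rough = harmonic_regularizer([("a", "c", 1.0)], p_topic)   # dissimilar neighbors
```

Minimizing O(C, G) therefore pushes linked authors toward similar topic distributions: the smooth configuration costs far less than the rough one.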


Topical Communities with PLSA

Topic 1 Topic 2 Topic 3 Topic 4

term 0.02 peer 0.02 visual 0.02 interface 0.02

question 0.02 patterns 0.01 analog 0.02 towards 0.02

protein 0.01 mining 0.01 neurons 0.02 browsing 0.02

training 0.01 clusters 0.01 vlsi 0.01 xml 0.01

weighting 0.01 stream 0.01 motion 0.01 generation 0.01

multiple 0.01 frequent 0.01 chip 0.01 design 0.01

recognition 0.01 e 0.01 natural 0.01 engine 0.01

relations 0.01 page 0.01 cortex 0.01 service 0.01

library 0.01 gene 0.01 spike 0.01 social 0.01


?? ? ?

Noisy community assignment


Topical Communities with NetPLSA


Topic 1 Topic 2 Topic 3 Topic 4

retrieval 0.13 mining 0.11 neural 0.06 web 0.05

information 0.05 data 0.06 learning 0.02 services 0.03

document 0.03 discovery 0.03 networks 0.02 semantic 0.03

query 0.03 databases 0.02 recognition 0.02 services 0.03

text 0.03 rules 0.02 analog 0.01 peer 0.02

search 0.03 association 0.02 vlsi 0.01 ontologies 0.02

evaluation 0.02 patterns 0.02 neurons 0.01 rdf 0.02

user 0.02 frequent 0.01 gaussian 0.01 management 0.01

relevance 0.02 streams 0.01 network 0.01 ontology 0.01

Information Retrieval

Data mining Machine learning

Web

Coherent community assignment


Smoothed Topic Map


Map a topic on the network (e.g., using p(θ|a))

PLSA(Topic : “information retrieval”)

NetPLSA

Core contributors

Irrelevant

Intermediate


Smoothed Topic Map


The Windy States. Blog articles: “weather”; network: US states; topic: “windy”.

PLSA NetPLSA

Real reference


2007 © ChengXiang Zhai LLNL, Aug 15, 2007 56

Related Work

• Specific Contextual Text Mining Problems– Multi-collection Comparative Mining (e.g., [Zhai et al. 04])

– Temporal theme pattern (e.g., [Mei et al. 05], [Blei et al. 06], [Wang et al. 06])

– Spatiotemporal theme analysis (e.g., [Mei et al. 06], [Wang et al. 07])

– Author-topic analysis (e.g., [Steyvers et al. 04], [Zhou et al 06])

– …

• Probabilistic topic models:– Probabilistic latent semantic analysis (PLSA) (e.g. [Hofmann 99])

– Latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03])

– Many extensions (e.g., [Blei et al. 05], [Li and McCallum 06])

Conclusions

• Context analysis in text mining and search

• General methodology to model context in text

– A unified generative model for observations in the same context

– Different models for different contexts

– Similar models for similar contexts

– Generation, discrimination, smoothing

• Many applications


Discussion: Context in Search

• Not all contexts are useful

– E.g., personalized search vs. search by time of day

– How can we know which contexts are more useful?

• Many contexts are useful

– E.g., personalized search; task-based search; localized search;

– How can we combine them?

• Can we do better than market segmentations?

– Backoff to users who search like me – Collaborative Search

– But who searches like you?


References

• CPLSA: Q. Mei, C. Zhai. A Mixture Model for Contextual Text Mining. In Proceedings of KDD'06.

• NetPLSA: Q. Mei, D. Cai, D. Zhang, C. Zhai. Topic Modeling with Network Regularization. In Proceedings of WWW'08.

• Labeling: Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. In Proceedings of KDD'07.

• Personalization: Q. Mei, K. Church. Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? In Proceedings of WSDM'08.

• Applications:
– Q. Mei, C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In Proceedings of KDD'05.
– Q. Mei, C. Liu, H. Su, C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. In Proceedings of WWW'06.
– Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW'07.


The End

Thank You!

Experiments

• Bibliography data and coauthor networks
– DBLP: text = titles; network = coauthors
– Four conferences (expect 4 topics): SIGIR, KDD, NIPS, WWW

• Blog articles and geographic network
– Blogs from spaces.live.com containing topical words, e.g., “weather”
– Network: US states (adjacent states)


Coherent Topical Communities


Semantics of community: “Data Mining (KDD) ”

NetPLSA

mining 0.11

data 0.06

discovery 0.03

databases 0.02

rules 0.02

association 0.02

patterns 0.02

frequent 0.01

streams 0.01

PLSA

peer 0.02

patterns 0.01

mining 0.01

clusters 0.01

stream 0.01

frequent 0.01

e 0.01

page 0.01

gene 0.01

PLSA

visual 0.02

analog 0.02

neurons 0.02

vlsi 0.01

motion 0.01

chip 0.01

natural 0.01

cortex 0.01

spike 0.01

NetPLSA

neural 0.06

learning 0.02

networks 0.02

recognition 0.02

analog 0.01

vlsi 0.01

neurons 0.01

gaussian 0.01

network 0.01

Semantics of community: “machine learning (NIPS)”
