
2009 © Qiaozhu Mei University of Illinois at Urbana-Champaign

Towards Contextual Text Mining

Qiaozhu Mei

University of Illinois at Urbana-Champaign

Knowledge Discovery from Text

Text Mining System


Overload of Text Content

Content Type               Amount / day
Published content          3-4G
Professional web content   ~2G
User generated content     8-10G
Private text content       ~3T

- Ramakrishnan and Tomkins 2007

Challenge of Mining Text

[Figure: sizes and growth rates of text collections (source labels lost in extraction): ~750k/day, ~3M/day, ~150k/day; 1M, 10B, 6M, ~100B]

Where to Start? Where to Go?

Gold?

Context - “Situation of Text”

Context dimensions annotated on an example post:
Author
Author’s occupation: self designer, publisher, editor …
Time: 3:53 AM Jan 28th
Source: From Ping.fm
Location: Check Lap Kok, HK
Language
Social Network
Sentiment

Rich Context Information

102M blogs

100M users, > 1M groups

8M contributors, 100+ languages

73 years, ~400k authors, ~4k sources

~1B queries per hour? ~1B users

~3M msgs/day, ~5M users

5M users, 500M URLs

Text + Context = ?

Context = Guidance

I Have A Guide!

Query Log + User = Personalized Search

MSR

Modern System Research

Medical simulation

Montessori School of Raleigh

Mountain Safety Research

MSR Racing

Wikipedia definitions

Metropolis Street Racer

Molten salt reactor

Mars sample return

Magnetic Stripe Reader

How much can personalization help?

If you know me, you should give me Microsoft Research…


Common Themes   IBM                 APPLE                DELL
Battery Life    Long, 4-3 hrs       Medium, 3-2 hrs      Short, 2-1 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Speed           Slow, 100-200 MHz   Very Fast, 3-4 GHz   Moderate, 1-2 GHz

Inputs: IBM Laptop Reviews, APPLE Laptop Reviews, DELL Laptop Reviews

Customer Reviews + Brand = Comparative Product Summary

Can we compare Products?


Hot Topics in SIGMOD

Scientific Literature + Time = Topic Trends

What’s hot in literature?


One Week Later

Blogs + Time & Location = Spatiotemporal Topic Diffusion

How does discussion spread?


Tom Hanks, who is my favorite movie star act the leading role.

protesting... will lose your faith by watching the movie.

a good book to past time.

... so sick of people making such a big deal about a fiction book

The Da Vinci Code

Blogs + Sentiment = Faceted Opinion Summary

What is good and what is bad?


Information retrieval

Machine learning

Data mining

Coauthor Network

Publications + Social Network = Topical Community

Who works together on what?

Query log + User = Personalized Search
Scientific Literature + Time = Topic Trends
Review + Brand = Comparative Opinion
Blog + Time & Location = Spatiotemporal Topic Diffusion
Blog + Sentiment = Faceted Opinion Summary
Publications + Social Network = Topical Community
…

Text + Context = Contextual Text Mining

A General Solution for All?

Roadmap

• Generative Model of Text
• Integrating Contexts in Text Models
  – Modeling Simple Context
  – Modeling Implicit Context
  – Modeling Complex Context
• Applications of Contextual Text Mining

Generative Model of Text

$P(\text{word} \mid \text{Model})$

Generation: Model → text ("the .. movie .. harry .. potter is .. based .. on .. j..k..rowling")

Inference, Estimation: text → Model

Text as a Mixture of Topics

Topic (Theme) = the subject of a discourse

K topics, each a distribution over words:
search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, …
learning 0.18, model 0.14, training 0.10, kernel 0.09, inference 0.07, …
mining 0.21, data 0.13, pattern 0.10, clustering 0.05, network 0.04, …

Web Search

Example text: "Using machine learning for web search"

Probabilistic Topic Models (Hofmann ’99, Blei et al. ’03, …)

Topic 1 (Apple iPod): ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2 (Harry Potter): movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02

$P(w) = \sum_{i=1}^{K} P(z = i)\, P(w \mid \text{Topic}_i)$

Document: "I downloaded the music of the movie harry potter to my ipod nano"

Example word probabilities: ipod 0.15 (Topic 1), harry 0.09 (Topic 2)
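The mixture formula on this slide can be illustrated with a few lines of Python. This is only a sketch: the word probabilities are the ones listed on the slide, while the mixing weights `TOPIC_PRIOR` are hypothetical (the slide does not give P(z = i)), and words not listed for a topic are treated as having probability 0.

```python
# Topic word distributions copied from the slide (only the listed words).
TOPICS = {
    "Apple iPod": {"ipod": 0.15, "nano": 0.08, "music": 0.05,
                   "download": 0.02, "apple": 0.01},
    "Harry Potter": {"movie": 0.10, "harry": 0.09, "potter": 0.05,
                     "actress": 0.04, "music": 0.02},
}

# Hypothetical mixing weights P(z = i); not given on the slide.
TOPIC_PRIOR = {"Apple iPod": 0.5, "Harry Potter": 0.5}


def word_probability(word, topics=TOPICS, prior=TOPIC_PRIOR):
    """P(w) = sum_i P(z = i) * P(w | Topic_i); unlisted words count as 0."""
    return sum(prior[name] * dist.get(word, 0.0)
               for name, dist in topics.items())


if __name__ == "__main__":
    for w in ("ipod", "harry", "music"):
        print(w, round(word_probability(w), 4))
```

A word such as "music" that appears in both topics receives probability mass from each, which is what makes the model a mixture rather than a hard clustering.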

Parameter Estimation

• Maximum Likelihood Estimation (MLE): $\theta^* = \arg\max_\theta P(D \mid \theta)$
• Parameter Estimation using EM algorithm
  – Gibbs sampling, Variational inference, Expectation propagation

[Figure: EM on the document "I downloaded the music of the movie harry potter to my ipod nano". The topic word distributions (ipod, nano, music, download, apple; movie, harry, potter, actress, music) start unknown (?????). The algorithm alternates between "Guess the affiliation" of each word with a topic and "Estimate the params" from the resulting Pseudo-Counts.]
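The "guess the affiliation / estimate the params" loop can be sketched as EM for a PLSA-style model. This is a minimal pure-Python illustration, not the author's implementation; the function name `plsa_em`, the random initialization, and the toy corpus in the usage note are all assumptions made for the example.

```python
import math
import random


def plsa_em(docs, num_topics, iterations=30, seed=0):
    """Fit a PLSA-style topic model by EM on tokenized documents.

    Returns (p_z_d, p_w_z, history): p_z_d[d][z] = P(z | d),
    p_w_z[z][w] = P(w | z), and the log-likelihood before each update.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    counts = [{w: doc.count(w) for w in set(doc)} for doc in docs]

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Random positive initialization of both parameter tables.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in docs]
    p_w_z = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(num_topics)]

    history = []
    for _ in range(iterations):
        topic_word = [dict.fromkeys(vocab, 0.0) for _ in range(num_topics)]
        doc_topic = [[0.0] * num_topics for _ in docs]
        log_likelihood = 0.0
        for d, doc_counts in enumerate(counts):
            for w, c in doc_counts.items():
                # E-step ("guess the affiliation"): P(z | d, w).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(num_topics)]
                total = sum(joint)
                log_likelihood += c * math.log(total)
                for z in range(num_topics):
                    share = c * joint[z] / total  # fractional pseudo-count
                    topic_word[z][w] += share
                    doc_topic[d][z] += share
        history.append(log_likelihood)
        # M-step ("estimate the params"): renormalize the pseudo-counts.
        p_z_d = [normalize(row) for row in doc_topic]
        p_w_z = [dict(zip(vocab, normalize([tw[w] for w in vocab])))
                 for tw in topic_word]
    return p_z_d, p_w_z, history
```

For instance, `plsa_em([d.split() for d in ["ipod nano music download apple", "movie harry potter actress music"]], num_topics=2)` returns per-document topic weights and per-topic word distributions; the recorded log-likelihood should not decrease from one iteration to the next.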

How Context Affects Topics

• Topics in science literature: 16th century vs. 21st century

• When do a computer scientist and a gardener write about “tree, root, prune”?

• In Europe, “football” appears a lot in a soccer report. What about in the US?

Text is generated according to the context!

“Context of Situation” - B. Malinowski 1923

Existing Work

• PLSA (Hofmann ’99), LDA (Blei et al. ’03), CTM (Blei et al. ’06), PAM (Li and McCallum ’06)
  – Don’t incorporate contexts
• Author: Author-topic model (Steyvers et al. ’04)
• Time: Topic-over-time (Wang et al. ’06), Dynamic Topic model (Blei et al. ’06)

Can we capture the context in a general way?

Contextualized Models

$P(\text{word} \mid \text{Model}, \text{Context})$

Generation:
• How to select contexts?
• How to model context structure?

Inference:
• How to reveal contextual patterns?

Example contexts: Year = 1998, Year = 2008, Location = US, Location = China, Source = official, Sentiment = +

P(w|M, Year = 1998): book 0.15, harry 0.10, potter 0.08, rowling 0.05
P(w|M, Year = 2008): movie 0.18, harry 0.09, potter 0.08, director 0.04

Roadmap: Modeling Simple Context

Simple Contexts:
Author
Time
Source
Author’s occupation
Language
Location

Simple Contextual Topic Model (Mei and Zhai KDD’06)

Topic 1: Apple iPod
Topic 2: Harry Potter

Context 1: 2004
Context 2: 2007

$P(w) = \sum_{j=1}^{C} P(c = j) \sum_{i=1}^{K} P(z = i \mid c_j)\, P(w \mid \text{Topic}_i, c_j)$

Document: "I downloaded the music of the movie harry potter to my iphone"

Contextual Topic Patterns
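As a rough illustration of the contextual mixture on this slide, the sketch below evaluates P(w) by summing over contexts and topics. The contexts (2004, 2007) and topic names come from the slide; every probability value in the usage note, and the function name `contextual_word_probability`, are hypothetical.

```python
def contextual_word_probability(word, p_context, p_topic_given_context, p_word):
    """P(w) = sum_j P(c_j) * sum_i P(z = i | c_j) * P(w | Topic_i, c_j).

    p_context: {context: P(c)}
    p_topic_given_context: {context: {topic: P(z | c)}}
    p_word: {(topic, context): {word: P(w | topic, context)}}
    """
    total = 0.0
    for context, p_c in p_context.items():
        for topic, p_z in p_topic_given_context[context].items():
            total += p_c * p_z * p_word[(topic, context)].get(word, 0.0)
    return total


# Hypothetical parameters for two contexts and two topics.
P_CONTEXT = {2004: 0.4, 2007: 0.6}
P_TOPIC = {
    2004: {"Apple iPod": 0.7, "Harry Potter": 0.3},
    2007: {"Apple iPod": 0.5, "Harry Potter": 0.5},
}
P_WORD = {
    ("Apple iPod", 2004): {"ipod": 0.15, "nano": 0.08},
    ("Apple iPod", 2007): {"iphone": 0.12, "ipod": 0.06},
    ("Harry Potter", 2004): {"harry": 0.09, "book": 0.05},
    ("Harry Potter", 2007): {"harry": 0.09, "movie": 0.10},
}
```

With these made-up numbers, `contextual_word_probability("iphone", P_CONTEXT, P_TOPIC, P_WORD)` is nonzero only through the 2007 version of the Apple iPod topic, which is the kind of context-dependent word usage the model is meant to expose.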


Hot Topics in SIGMOD

Example: Topic Life Cycles (Mei and Zhai KDD’05)

Context = Time
Contextual Topic Pattern: P(z|time)

Example: Spatiotemporal Theme Pattern (Mei et al. WWW’06)

Topic: Government Response in Hurricane Katrina

[Figure: map panels for Hurricane Katrina and Hurricane Rita]

Context = Time & Location
Contextual Topic Pattern: P(z|time, location)


Example: Event Impact Analysis (Mei and Zhai KDD’06)

Topic: retrieval models (SIGIR)
term 0.16, relevance 0.08, weight 0.07, feedback 0.04, model 0.03, probabilistic 0.02, document 0.02, …

Events: Starting of TREC (1992); [Ponte and Croft 98] (1998)

Traditional Models: vector 0.05, concept 0.03, model 0.03, space 0.02, boolean 0.02, function 0.01, …
Evaluation & Applications: xml 0.07, email 0.02, model 0.02, collect 0.02, judgment 0.01, rank 0.01, …
Probabilistic Models: probabilist 0.08, model 0.04, logic 0.04, boolean 0.03, algebra 0.02, weight 0.01, …
Language Models: model 0.17, language 0.08, estimate 0.05, parameter 0.03, distribution 0.03, smooth 0.02, likelihood 0.01, …

Context = Event
Contextual Pattern: P(w|z, event)

Instantiation: Personalized Search (Mei and Church WSDM’08)


Personalization with Backoff

• Ambiguous query: MSR
  – Microsoft Research
  – Mountain Safety Research
• Disambiguate based on user’s prior clicks
• We don’t have enough data for everyone!
  – Backoff to classes of users
• Proof of Concept:
  – Context = Classes of Users defined by IP address

Personalized Search as Contextual Text Mining

Text: query (click) logs, as (IP, Query, URL)
Context: Users (IP), groups of users
Text Model: P(URL | Query)
Contextual Model: P(URL | Query, User)
Goal: Better estimate P(URL | Query, User)

User groups by IP prefix: 156.111.188.243, 156.111.188.*, 156.111.*.*, 156.*.*.*, *.*.*.*


Evaluation Metric: Entropy (H)

• Difficulty of encoding information (a distribution)
  – Size of search space; difficulty of a task
• Powerful tool for sizing challenges and opportunities
  – How hard is web search?
  – How much does personalization help?
• Predict the future: Cross Entropy H(Future | History)

$H(URL) = -\sum_{URL} p(URL) \log p(URL)$
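The entropy and cross-entropy measures named on this slide can be computed directly from click distributions. The sketch below assumes base-2 logarithms (bits), which the slide does not state, and uses made-up distributions in the usage note.

```python
import math


def entropy(probabilities):
    """H(p) = -sum_x p(x) * log2 p(x), in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def cross_entropy(future, history):
    """H(future | history) = -sum_x future(x) * log2 history(x), in bits.

    Both arguments are dicts mapping outcomes (e.g. URLs) to probabilities.
    Returns infinity if the history model gives zero probability to an
    outcome that occurs in the future.
    """
    total = 0.0
    for outcome, p in future.items():
        if p == 0:
            continue
        q = history.get(outcome, 0.0)
        if q == 0:
            return math.inf
        total -= p * math.log2(q)
    return total
```

For example, `entropy([0.25] * 4)` is 2 bits, while a peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` has much lower entropy. The infinite cross entropy for unseen outcomes is one way to see why a model estimated from history needs smoothing.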

Difficulty of Queries

• Easy queries (low H(URL|Q)):
  – google, yahoo, myspace, ebay, …
• Hard queries (high H(URL|Q)):
  – dictionary, yellow pages, movies, “what is may day?”

Hard Query: “MSR” – High Entropy
URLs: msrgear.com, msracing.com, research....com, msrwheels.com, msr.com, msr.org, msrdev.com, …
Click probabilities: 0.12, 0.10, 0.09, 0.08, 0.07, 0.07, 0.06, 0.05

Easy Query: “Google” – Low Entropy
URLs: google.com, google.cn, maps.google…, …
Click probabilities: 0.80, 0.10, 0.08, ~0, ~0, ~0, ~0, ~0


How Hard Is Search?

• Traditional Search
  – H(URL | Query)
  – 2.8 (= 23.9 – 21.1)
• Personalized Search
  – H(URL | Query, IP)
  – 1.2 (= 27.2 – 26.0)

Variables      Entropy (H)
Query          21.1
URL            22.1
IP             22.1
Query, URL     23.9
Query, IP      26.0
IP, URL        27.1
All Three      27.2

Personalization cuts H in Half!
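The two conditional entropies on this slide follow from the chain rule H(X | Y) = H(X, Y) - H(Y). The short sketch below redoes that arithmetic from the table; the dictionary layout and the function name `conditional_entropy` are illustrative choices, not part of the original talk.

```python
# Joint entropies from the slide's table, keyed by the set of variables.
JOINT_ENTROPY = {
    frozenset({"Query"}): 21.1,
    frozenset({"URL"}): 22.1,
    frozenset({"IP"}): 22.1,
    frozenset({"Query", "URL"}): 23.9,
    frozenset({"Query", "IP"}): 26.0,
    frozenset({"IP", "URL"}): 27.1,
    frozenset({"Query", "URL", "IP"}): 27.2,
}


def conditional_entropy(target, given, table=JOINT_ENTROPY):
    """H(target | given) = H(target, given) - H(given) (chain rule)."""
    given = frozenset(given)
    return table[given | frozenset(target)] - table[given]


if __name__ == "__main__":
    print(round(conditional_entropy({"URL"}, {"Query"}), 1))
    print(round(conditional_entropy({"URL"}, {"Query", "IP"}), 1))
```

The same table also gives H(URL | IP) = 27.1 - 22.1 = 5.0, so under these numbers knowing only the IP address leaves more uncertainty about the clicked URL than knowing only the query.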

Context = First k bytes of IP

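$P(URL \mid User, Q) = \lambda_0 P(URL \mid IP_0, Q) + \lambda_1 P(URL \mid IP_1, Q) + \lambda_2 P(URL \mid IP_2, Q) + \lambda_3 P(URL \mid IP_3, Q) + \lambda_4 P(URL \mid IP_4, Q)$

where $IP_k$ keeps the first k bytes of the user's IP address:
IP_4 = 156.111.188.243
IP_3 = 156.111.188.*
IP_2 = 156.111.*.*
IP_1 = 156.*.*.*
IP_0 = *.*.*.*

Full personalization: every user has a different model: sparse data!

No personalization: all users share the same model: Missed Opportunity

Personalization with backoff: smooth by similar users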
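Personalization with backoff can be sketched as a weighted sum of click models estimated at coarser and coarser IP prefixes. In the sketch below, the helper names, the example click models, and the interpolation weights are all hypothetical; the talk does not list the weights, and how they are fitted is outside this example.

```python
def ip_prefixes(ip):
    """Return [IP_0, ..., IP_4]: the address with only its first k bytes kept."""
    parts = ip.split(".")
    return [".".join(parts[:k] + ["*"] * (4 - k)) for k in range(5)]


def backoff_probability(url, query, ip, models, lambdas):
    """P(URL | User, Q) as sum_k lambda_k * P(URL | IP_k, Q).

    models: {(ip_prefix, query): {url: probability}}
    lambdas: five interpolation weights that should sum to 1.
    Missing (prefix, query) models contribute zero.
    """
    total = 0.0
    for lam, prefix in zip(lambdas, ip_prefixes(ip)):
        total += lam * models.get((prefix, query), {}).get(url, 0.0)
    return total


# Hypothetical click models and weights, for illustration only.
MODELS = {
    ("*.*.*.*", "msr"): {"msrgear.com": 0.12, "msr.org": 0.07},
    ("156.111.*.*", "msr"): {"msr.org": 0.5},
}
LAMBDAS = [0.1, 0.1, 0.2, 0.3, 0.3]
```

With these numbers, `backoff_probability("msr.org", "msr", "156.111.188.243", MODELS, LAMBDAS)` combines the global estimate with the estimate from users sharing the first two bytes, even though nothing is known about this exact address.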


Context Market Segmentation

• Can we do better than IP address?
• Potential Context Variables
  – ID, QueryType, Click, Intent, …
  – Demographics (Age, Gender, Income, …)
  – Time of day & Day of Week

Roadmap: Modeling Implicit Context

Sentiment


Implicit Contexts

Implicit Context of Text

Sentiments, Intents, Impact, Trust: ???

Need to infer these situations/conditions from the data (with prior knowledge)

Modeling Implicit Context

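Topic 1: Apple iPod
Topic 2: Harry Potter

Implicit contexts (unknown for each word: ???), with prior knowledge:
Positive: good 0.10, like 0.05, perfect 0.02
Negative: hate 0.21, awful 0.03, disgust 0.01

Document: "I like the song of the movie on my perfect ipod but hate the accent"

$\theta^* = \arg\max_\theta P(D \mid \theta)\, P(\theta)$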
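One common way to realize the MAP estimate on this slide is to treat prior knowledge as pseudo-counts added to the observed counts. The sketch below assumes a Dirichlet prior with parameters 1 + mu * prior[w], under which the MAP word distribution has a closed form; the function name `map_estimate`, the strength `mu`, and the observed counts in the usage note are hypothetical, and the talk's actual estimator may differ.

```python
def map_estimate(counts, prior, mu):
    """MAP word distribution under a Dirichlet(1 + mu * prior[w]) prior.

    Maximizing P(D | theta) * P(theta) for a multinomial gives
    theta[w] proportional to counts[w] + mu * prior[w], so the prior
    acts as mu * prior[w] pseudo-counts for each word.
    """
    vocab = set(counts) | set(prior)
    total = sum(counts.values()) + mu * sum(prior.values())
    return {w: (counts.get(w, 0) + mu * prior.get(w, 0.0)) / total
            for w in vocab}


# Prior word probabilities for the positive sentiment model, as listed on the slide.
POSITIVE_PRIOR = {"good": 0.10, "like": 0.05, "perfect": 0.02}
```

For example, `map_estimate({"like": 1, "perfect": 1, "great": 2}, POSITIVE_PRIOR, mu=10)` keeps "good" at a nonzero probability even though it was never observed, and setting `mu=0` falls back to the maximum likelihood estimate.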

Example: Faceted Opinion Summarization (Mei et al. WWW’07)

Tom Hanks, who is my favorite movie star act the leading role.

Protesting.. you will lose your faith by watching the movie.

a good book to past time.

... so sick of people making such a big deal about a fiction book

Context = Sentiment

Topic 1:Movie

Topic 2:Book

The Da Vinci Code

Roadmap: Modeling Complex Context

Social Network

Complex Contexts

Complex Context of Text

• Find novel contextual patterns;
• Regularize contextual models;
• Alleviate data sparseness;

Structures of contexts

Modeling Complex Context

Topic 1
Topic 2

Context Structure: contexts A and B

Intuitions: if Context A and B are closely related, Model(A) and Model(B) should be similar
• users in the same building issue similar queries
• collaborating researchers work on similar things
• topics in SIGMOD are like topics in VLDB

O(D) = Likelihood + Regularization

Graph-based Regularization

Structure of contexts: a graph

Intuition: Model(u) and Model(v) should be similar for connected nodes u and v

Regularized model = Smoothed surface(s) on top of the graph

[Figure: Model(u) and Model(v) plotted over neighboring nodes u and v of the graph, comparing the MLE with the smoothed surface; projection on a plane.]

O(D, G) = Likelihood + Regularization

Instantiation: Topical Community Extraction (Mei et al. WWW’08)

Social Network Analysis

Generation, evolution, e.g., [Leskovec 05]

Community extraction, e.g., [Kleinberg 00]

Diffusion [Gruhl 04]; [Backstrom 06]

Search, e.g., [Adamic 05]

Ranking, e.g., [Brin and Page 98]; [Kleinberg 98]

Usually don’t model topics in text

- Kleinberg and Backstrom 2006, New York Times
- Jeong et al. 2001 Nature 411

Topical Community Analysis

physicist, physics, scientist, theory, gravitation …

writer, novel, best-sell, book, language, film…

Topics in text help community extraction

Information Retrieval + Data Mining + Machine Learning, … = Computer Science Literature

Text + Network → topical communities

Topical Community Extraction as Contextual Text Mining

Text: Scientific publications
Context: Authors
Text Model: Topic Model
Contextual Model: Topic Model + Author
Context Structure: Social Network (coauthorship)

Goal: Assign authors into topical communities using P(z|author)
- Regularize using social network

Topic Modeling with Network Regularization

$O(D, G) = -(1 - \lambda) \cdot \Big( \sum_{c} \sum_{w} c(w, c) \log \sum_{j=1}^{k} p(\theta_j \mid c)\, p(w \mid \theta_j) \Big) + \lambda \cdot \Big( \frac{1}{2} \sum_{(u,v) \in E} w(u, v) \sum_{j=1}^{k} \big( p(\theta_j \mid u) - p(\theta_j \mid v) \big)^2 \Big)$

• First term: Data Likelihood (Text Model)
  – Intuition 1: Know my research topics from my publications
• Second term: Graph Harmonic Regularizer (a generalization of [Zhu ’03]) (Graph Regularization)
  – Smoothness of $p(\theta_j \mid \cdot)$ between neighbors
  – Intuition 2: I work on similar topics with my coauthors
• $\lambda$: tradeoff between MLE and smoothness
• Model parameters: $p(\theta_j \mid c)$ and $p(w \mid \theta_j)$

O(D, G) = Likelihood + Regularization
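To make the two terms of the network-regularized objective concrete, the sketch below evaluates a data log-likelihood and a graph harmonic regularizer for given parameters. It is only an illustration: the function names and data layout are hypothetical, and the sign convention (an objective to be minimized, combining -(1 - lambda) times the log-likelihood with lambda times the regularizer) follows the published NetPLSA formulation as an assumption, since the operators are not legible in the extracted slide.

```python
import math


def log_likelihood(counts, p_topic, p_word):
    """L = sum_c sum_w c(w, c) * log sum_j p(theta_j | c) * p(w | theta_j).

    counts: {context: {word: count}}
    p_topic: {context: [p(theta_j | context) for each topic j]}
    p_word: [{word: p(word | theta_j)} for each topic j]
    """
    total = 0.0
    for context, word_counts in counts.items():
        for word, count in word_counts.items():
            mix = sum(p_z * p_word[j][word]
                      for j, p_z in enumerate(p_topic[context]))
            total += count * math.log(mix)
    return total


def harmonic_regularizer(edges, p_topic):
    """R = 1/2 * sum_(u,v) w(u, v) * sum_j (p(theta_j | u) - p(theta_j | v))^2.

    edges: {(u, v): weight}, each undirected edge listed once.
    """
    total = 0.0
    for (u, v), weight in edges.items():
        total += weight * sum((a - b) ** 2
                              for a, b in zip(p_topic[u], p_topic[v]))
    return 0.5 * total


def objective(counts, edges, p_topic, p_word, lam):
    """Assumed form (to be minimized): -(1 - lam) * L + lam * R."""
    return (-(1 - lam) * log_likelihood(counts, p_topic, p_word)
            + lam * harmonic_regularizer(edges, p_topic))
```

The regularizer is zero when connected authors have identical topic distributions and grows with the edge weight and the squared difference, which is how coauthors are pulled toward similar topics.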

Topics & Communities without Network Regularization

Four Conferences: SIGIR, KDD, NIPS, WWW

Topic 1            Topic 2          Topic 3        Topic 4
term 0.02          peer 0.02        visual 0.02    interface 0.02
question 0.02      patterns 0.01    analog 0.02    towards 0.02
protein 0.01       mining 0.01      neurons 0.02   browsing 0.02
training 0.01      clusters 0.01    vlsi 0.01      xml 0.01
weighting 0.01     stream 0.01      motion 0.01    generation 0.01
multiple 0.01      frequent 0.01    chip 0.01      design 0.01
recognition 0.01   e 0.01           natural 0.01   engine 0.01
relations 0.01     page 0.01        cortex 0.01    service 0.01
library 0.01       gene 0.01        spike 0.01     social 0.01
?                  ?                ?              ?

Fuzzy Topics

Noisy community assignment

Topics & Communities with Network Regularization

Topic 1                 Topic 2            Topic 3            Topic 4
retrieval 0.13          mining 0.11        neural 0.06        web 0.05
information 0.05        data 0.06          learning 0.02      services 0.03
document 0.03           discovery 0.03     networks 0.02      semantic 0.03
query 0.03              databases 0.02     recognition 0.02   services 0.03
text 0.03               rules 0.02         analog 0.01        peer 0.02
search 0.03             association 0.02   vlsi 0.01          ontologies 0.02
evaluation 0.02         patterns 0.02      neurons 0.01       rdf 0.02
user 0.02               frequent 0.01      gaussian 0.01      management 0.01
relevance 0.02          streams 0.01       network 0.01       ontology 0.01
Information Retrieval   Data mining        Machine learning   Web

Clear Topics

Coherent community assignment

Topic Modeling and SNA Improve Each Other

Methods               Cut Edge Weights   Ratio Cut / Norm. Cut   Community 1   Community 2   Community 3   Community 4
PLSA (Text Only)      4831               2.14 / 1.25             2280          2178          2326          2257
NetPLSA               662                0.29 / 0.13             2636          1989          3069          1347
NCut (Network Only)   855                0.23 / 0.12             2699          6323          8             11

The Community columns give community sizes. For Cut Edge Weights and Ratio Cut / Norm. Cut: the smaller the better.

- NCut: spectral clustering with normalized cut (Shi et al. ’00)

Network Regularization helps extract coherent communities (ensures tight connection of authors)

Topic Modeling helps balance communities (text implicitly bridges authors)
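The cut-based measures in this table can be computed for any assignment of nodes to communities. The sketch below uses standard textbook definitions (total weight of edges crossing communities; ratio cut as boundary weight over community size; normalized cut as boundary weight over community volume). These definitions and the function name `cut_metrics` are assumptions, since the slide does not spell out the exact formulas or normalization behind the reported numbers.

```python
def cut_metrics(edges, assignment):
    """Return (cut weight, ratio cut, normalized cut) for a partition.

    edges: {(u, v): weight} for an undirected graph, each edge listed once.
    assignment: {node: community label}.
    """
    communities = set(assignment.values())
    size = dict.fromkeys(communities, 0)
    for label in assignment.values():
        size[label] += 1
    volume = dict.fromkeys(communities, 0.0)    # sum of weighted degrees
    boundary = dict.fromkeys(communities, 0.0)  # weight leaving each community
    cut_weight = 0.0
    for (u, v), weight in edges.items():
        cu, cv = assignment[u], assignment[v]
        volume[cu] += weight
        volume[cv] += weight
        if cu != cv:
            cut_weight += weight
            boundary[cu] += weight
            boundary[cv] += weight
    ratio_cut = sum(boundary[c] / size[c] for c in communities)
    normalized_cut = sum(boundary[c] / volume[c]
                         for c in communities if volume[c] > 0)
    return cut_weight, ratio_cut, normalized_cut
```

On a path graph a-b-c-d with unit weights split into {a, b} and {c, d}, exactly one edge is cut; a split that isolates a single node would score worse on ratio cut, which is why these measures reward balanced, tightly connected communities.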

Summary of My Talk

• Text + Context = Contextual Text Mining
  – A new paradigm of text mining
• General methodology for contextual text mining
  – Generative models of text (e.g., Topic Models)
  – Contextualized models with simple context, implicit context, complex context
• Applications of contextual text mining

Take Away Message

Text + Context = [image]

A Roadmap of My Work

Information Retrieval & Web Search

Text Mining

KDD 06a Annotating frequent patterns

KDD 05

KDD 06b

WWW 06

WWW 07

WWW 08

Contextual Topic Models

KDD 07 Labeling topic models

SIGIR 07

CIKM 08

ACL 08 Impact-based summarization

Query suggestion using hitting time

Poisson language models

PSB 06

IP&M 07

KDD 08

Application to Bioinfo.

Bio. literature mining

SIGIR 08

WSDM 08

Graph-based smoothing

A Roadmap to the Future

Text Information Management

Information Retrieval & Web Search

Text Mining

Theoretical Framework
• Computational challenge;
• Structure of contexts

Task Support Systems
• Web users
• Scientists
• Business users

Applications

Integrative analysis of heterogeneous data
• web 2.0 data
• Science data
• Information networks

Interdisciplinary
• Bioinformatics
• Health informatics
• Business informatics

Thanks!

Predict the Future

• IP in the future might not be seen in the history

[Figure: Cross Entropy H(future | history) plotted against k, where at least the first k bytes of the IP are seen in History (k = 4, 3, 2, 1, 0). Curves: Personalization with backoff, No personalization, Complete personalization. Annotations: "Knows at least two bytes"; "Knows every byte – enough data".]