
Page 1

Wordnet-Enhanced Topic Models

Hsin-Min Lu

盧信銘

Department of Information Management

National Taiwan University

1

Page 2

Outline

• Introduction

• Literature Review

• Wordnet-Enhanced Topic Model

• Experiments

2

Page 3

Introduction

• Leveraging unstructured data is a challenging yet rewarding task

• Topic modeling, a family of unsupervised learning models, is useful for discovering latent topic structures in free-text data

• Topic models assume that a document is a mixture of topic distributions

• Each topic is a distribution over the vocabulary
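In symbols (a standard formulation of the mixture assumption stated above, added for reference):

$p(w \mid d) = \sum_{j=1}^{T} p(w \mid z = j)\, p(z = j \mid d) = \sum_{j=1}^{T} \phi_{j,w}\, \theta_{d,j}$

where $\theta_d$ is document $d$'s distribution over topics and $\phi_j$ is topic $j$'s distribution over the vocabulary.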

3

Page 4

Introduction (Cont’d.)

[Figure: "Statistical Topic Models for Text Mining". Text collections feed into probabilistic topic modeling, which outputs topic models (multinomial distributions over words), e.g., "web 0.21, search 0.10, link 0.08, graph 0.05" and "term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03". These support subtopic discovery, opinion comparison, summarization, and topical pattern analysis. Representative models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], Topics over Time [Wang et al. 06].]

4

Page 5

Introduction (Cont’d.)

• An ongoing research stream incorporates meta-data variables into topic modeling
  – Richer models
  – Useful estimation results

• This study aims at incorporating Wordnet synset information into topic models
  – A topic may be a combination of Wordnet synsets, and/or
  – The hidden co-occurrence structure

5

Page 6

Introduction (Cont’d.)

• Wordnet-Enhanced Topic Model
  – Incorporates Wordnet synsets into topic models
  – Wordnet synsets affect the prior over topics
  – Multinomial-probit-like setting for the prior
  – Wordnet synsets influence topic inference at the token level
  – Document-level random effects capture document-wide topic tendency
  – Inference via Gibbs sampling

6

Page 7

Literature Review

• Wordnet

• Latent Dirichlet Allocation (LDA)

• LDA with Dirichlet Forest Prior

• Concept-Topic Model

• LDA with Wordnet

7

Page 8

Wordnet

• WordNet is a large lexical database of English
  – POS: nouns, verbs, adjectives, and adverbs

• Words are organized into synsets
  – A synset expresses a distinct concept
  – Synsets are interlinked by conceptual-semantic and lexical relations
  – Synsets form a network
  – Useful for computational linguistics and natural language processing

8

Page 9

Wordnet (Cont’d.)

• Important differences between WordNet and a thesaurus
  – WordNet interlinks not just word forms (strings of letters) but specific senses of words
  – WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity

9

Page 10

WordNet (Cont’d.)

• A lexical semantic network relating word forms and lexicalized concepts (i.e., concepts that speakers have adopted word forms to express)
• Main relations (see the lookup sketch below): hyponymy/troponymy (kind-of/way-to), meronymy (part-whole), synonymy, antonymy
• Predominantly hierarchical; few relations across grammatical class; glosses & example sentences do not participate in the network
• Nouns organized under 9 unique beginners
• Command-line interface & C library
• Prehistoric (but greppable!) db format
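As a quick illustration of these relations, the sketch below queries them through NLTK's WordNet interface (note that NLTK bundles a newer WordNet release than the 2.1 version used later in this deck, so sense numbering and counts may differ slightly):

```python
# Minimal sketch: querying WordNet relations with NLTK.
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

tree = wn.synset('tree.n.01')                          # one sense of "tree"
print(tree.hypernyms())                                # hypernymy/hyponymy (kind-of)
print(tree.part_meronyms())                            # meronymy (part-whole): trunk, crown, ...
print(wn.synset('good.a.01').lemmas()[0].antonyms())   # antonymy (a lemma-level relation)
```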

Page 11

Lexical Matrix

Page 12

Creation of Synsets

Three principles:
• Minimality
• Coverage
• Replaceability

Page 13

Synsets

{house} is ambiguous. {house, home} has the sense of a social unit living together; is this the minimal unit?

{family, house, home} makes the unit completely unambiguous.

For coverage: {family, household, house, home}, ordered according to frequency.

Replaceability of the most frequent words is a requirement.

Page 14

Synset Creation

From first principles:
– Pick all the senses from good standard dictionaries.
– Obtain synonyms for each sense.
– Requires long, hard hours of work.

Page 15

Wordnet Statistics (Version 2.1)

POS         Unique Strings   Synsets   Total Word-Sense Pairs
Noun        117,097          81,426    145,104
Verb        11,488           13,650    24,890
Adjective   22,141           18,877    31,302
Adverb      4,601            3,644     5,720
Totals      155,327          117,597   207,016

15

Page 16

Wordnet Example

• Fake (n) has three senses:
  – Something that is counterfeit; not what it seems to be (synonyms: sham, postiche)
  – A person who makes deceitful pretenses (synonyms: imposter, impostor, pretender, faker, …)
  – [Football] A deceptive move made by a football player (synonym: juke)
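These senses can also be inspected programmatically; the snippet below is a small illustration using NLTK's WordNet interface (NLTK ships WordNet 3.x rather than the 2.1 release used in this study, so glosses and synonym lists may differ slightly):

```python
# List the noun senses of "fake" together with their glosses and synonyms.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('fake', pos=wn.NOUN):
    print(synset.name())                       # e.g., fake.n.01
    print('  gloss   :', synset.definition())
    print('  synonyms:', synset.lemma_names())
```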

16

Page 17

Wordnet Example (Cont’d.)

[Figure: WordNet hypernym paths for the three noun senses of "fake". Sense 1 {sham, postiche}: entity > physical object > whole thing/unit > artifact > creation > representation > copy > imitation. Sense 2 {imposter, impostor, pretender, faker, fraud, shammer, role player, pseudo, pseud}: entity > causal agent > living thing > organism/being > person > bad person > wrongdoer > deceiver. Sense 3 {juke}: act/human action > action > choice/selection > decision > move > tactical maneuver > feint.]

Page 18

Unique beginner synsets

Page 19

Topic Models

• Latent variable models are useful for discovering hidden structures in text data
  – Latent Semantic Indexing using singular value decomposition (SVD) (Deerwester et al. 1990)
  – Probabilistic Latent Semantic Indexing (pLSI) (Hofmann 1999)
  – Latent Dirichlet allocation (LDA) (Blei et al. 2003)

19

Page 20

Topic Models (Cont’d.)

• LDA addresses the shortcomings of its predecessors
  – SVD may produce negative factor loadings, which makes the results hard to interpret
  – pLSI (aspect model): the number of parameters grows linearly with the number of documents
    • Leads to model overfitting
  – LDA outperforms pLSI in terms of held-out probability (perplexity)

20

Page 21

LDA Generative Process
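For reference, the standard LDA generative process with symmetric Dirichlet priors (Blei et al. 2003) is:

1. For each topic $j = 1, \dots, T$: draw a word distribution $\phi_j \sim \mathrm{Dirichlet}(\beta)$.
2. For each document $d$: draw a topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
3. For each token position $i$ in document $d$: draw a topic $z_{di} \sim \mathrm{Multinomial}(\theta_d)$, then a word $w_{di} \sim \mathrm{Multinomial}(\phi_{z_{di}})$.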

21

Page 22

LDA Inference Problem

22

Page 23

LDA Model

23

Page 24

LDA Model (Cont’d.)

24

Page 25

LDA Model (Cont’d.)

25

Page 26

LDA: Intractable Inference

26

Page 27

Model Estimation Methods

• Collapsed Gibbs Sampling (LDA): latent Z: sample; other latent variables: integrate out; other parameters: integrate out $\phi_j$.
• Stochastic EM (TOT): latent Z: sample; other latent variables: integrate out; other parameters: integrate out $\phi_j$, maximize w.r.t. the remaining parameters.
• Variational Bayes (LDA and DTM): latent Z: assume independent; other latent variables: assume independent (retaining the sequential structure in DTM); other parameters: maximize.
• Augmented Gibbs Sampling (WNTM, this study): latent Z: sample; other latent variables: sample; other parameters: integrate out $\phi_j$, sample the remaining parameters.

27

Page 28

Collapsed Gibbs Sampling

28

$P(W \mid Z, \eta) =$

$P(Z \mid \alpha) =$
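Under symmetric Dirichlet priors, these marginals take the standard closed forms (Griffiths and Steyvers 2004):

$P(W \mid Z, \eta) = \prod_{j=1}^{T} \frac{\Gamma(W\eta)}{\Gamma(\eta)^{W}} \cdot \frac{\prod_{w} \Gamma(n_{j}^{(w)} + \eta)}{\Gamma(n_{j}^{(\cdot)} + W\eta)}, \qquad P(Z \mid \alpha) = \prod_{d=1}^{D} \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \cdot \frac{\prod_{j} \Gamma(n_{d}^{(j)} + \alpha)}{\Gamma(n_{d}^{(\cdot)} + T\alpha)},$

where $W$ is the vocabulary size, $T$ the number of topics, $D$ the number of documents, $n_{j}^{(w)}$ the number of times word $w$ is assigned to topic $j$, and $n_{d}^{(j)}$ the number of tokens in document $d$ assigned to topic $j$ (a dot denotes summation over that index).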

Page 29

Marginalize $\theta_d$

29

Page 30

Marginalize $\theta_d$

30

Page 31

Joint Probability

31

Page 32

Posterior Probability

32

Page 33

Posterior Probability (Cont’d.)
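For reference, the resulting collapsed Gibbs conditional for standard LDA is (Griffiths and Steyvers 2004)

$p(z_{di} = j \mid z_{-di}, w) \;\propto\; \frac{n_{-di,j}^{(w_{di})} + \beta}{n_{-di,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-di,d}^{(j)} + \alpha}{n_{-di,d}^{(\cdot)} + T\alpha},$

where the $-di$ subscript excludes the current token. The WNTM update on Page 48 keeps the first (topic-word) factor and replaces the second (document-topic) factor with a probit-style prior term.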

33

Page 34

Limitations of the LDA Model

• Additional meta-data information cannot be included in the model
  – Partially addressed by the author-topic (AT) model (Rosen-Zvi et al. 2010) and Dirichlet-multinomial regression (DMR) (Mimno and McCallum, 2008)
  – The AT model delivers worse performance than the LDA model
    • Except when the test articles are very short
  – The AT model is not a general framework for including arbitrary document-level meta-data in the model

34

Page 35

LDA with Dirichlet Forest Prior

• The Dirichlet Forest prior can be used to incorporate prior knowledge
  – A mixture of Dirichlet tree distributions
  – Two basic types of knowledge:
    • Must-Links: two words should have similar probability within any topic
    • Cannot-Links: two words should not both have large probability within any topic

Andrzejewski, Zhu, and Craven, ICML 2009

35

Page 36

LDA with Dirichlet Forest Prior (Cont’d.)

– Additional types of knowledge:
  • Split: separate two or more sets of words from a single topic into different topics by placing must-links within the sets and cannot-links between them
  • Merge: combine two or more sets of words using must-links
  • Isolate: place must-links within the common set, and cannot-links between the common set and the other high-probability words from all topics

36

Page 37

Dirichlet Tree Distribution for Must-Links

• A Dirichlet tree distribution is a composition of Dirichlet distributions
• (a) A, B, and C are the vocabulary; start sampling from the root node to model Must-Link(A, B)
• (b) An instance with $\beta = 1$ and $\eta = 50$

37

Page 38

Dirichlet Tree Distribution

• The Dirichlet tree distribution can preserve specific correlation structures that cannot be accomplished by the standard Dirichlet distribution
• (c) A large set of samples from the Dirichlet tree in (b); note $p(A) \approx p(B)$
• (d) A Dirichlet distribution with parameters (50, 50, 1)

38

Page 39

Combining Dirichlet Trees for Cannot-Links

• (e) Cannot-Link(A, B) and Cannot-Link(B, C)
• (f) The complementary graph of (e)
• (g) The Dirichlet subtree for clique {A, C}
• (h) The Dirichlet subtree for clique {B}

39

Page 40

LDA with Dirichlet Forest Prior (Cont’d.)

• $q \sim \mathrm{DirichletForest}(\beta, \eta)$
• $\phi \sim \mathrm{DirichletTree}(q)$
• A Dirichlet Forest is a mixture of Dirichlet Trees

40

Page 41

LDA with Dirichlet Forest Prior (Cont’d.)

41

Page 42

Concept Topic Model

• Observed words are generated either from a set of hidden topics or from a set of fixed concepts

42

Steyvers, Smyth, and Chemudugunta, 2011

Page 43

LDA with Wordnet

• Words are generated by walking down the tree of Wordnet synsets

43

Boyd-Graber, Blei, and Zhu, 2007, EMNLP

Page 44

Research Gaps

• The Dirichlet forest prior can be used to “constrain” topic models
  – However, the model cannot “turn off” the constraints when they are inappropriate
• LDAWN provides a model-driven word-sense disambiguation mechanism
  – Not suitable for topic modeling, since LDAWN cannot handle words not in Wordnet
• CTM assumes that pre-existing concepts are “constant”
  – Different concepts may emerge in different contexts

44

Page 45

Developing the Wordnet-Enhanced Topic Model (WNTM)

• Need a more flexible framework to include Wordnet concepts in the latent topic model
• A topic in WNTM may be
  – A combination of several WN synsets
  – A new topic unrelated to existing synsets
  – A combination of the above two

45

Page 46

The WNTM

• $x_{di}$: the Wordnet concept vector for token $i$ of document $d$
• Token-level influence structure
• $q_{d,j}$: document-specific topic tendency
• $g_j$: slope for $x_{di}$

46

Page 47

The WNTM Model

• $H_{di,j} = q_{d,j} + x_{di}' g_j + e_{di,j}$
• $e_{di,j} \sim N(0, \Sigma_j)$
• $z_{di} = \begin{cases} 0, & \text{if } \max(H_{di}) < 0 \\ j, & \text{if } \max(H_{di}) = H_{di,j} \end{cases}$
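A minimal numerical sketch of this token-level assignment (variable names are illustrative, and a diagonal noise term is assumed in place of the general $\Sigma_j$):

```python
# Sketch of the WNTM token-level assignment: H = q + x'g + e, z = argmax (or 0 if all H < 0).
import numpy as np

rng = np.random.default_rng(0)
J, C = 5, 8                          # number of topics, number of Wordnet concepts (toy sizes)
q_d = rng.normal(size=J)             # document-specific topic tendencies q_{d,j}
g = rng.normal(size=(C, J))          # slopes g_j for the Wordnet concept vector
x_di = rng.integers(0, 2, size=C)    # Wordnet concept indicators for one token
e_di = rng.normal(size=J)            # simplified (diagonal) noise in place of N(0, Sigma_j)

H_di = q_d + x_di @ g + e_di                              # latent utilities H_{di,j}
z_di = 0 if H_di.max() < 0 else int(H_di.argmax()) + 1    # topic label (0 if no utility is positive)
print(H_di.round(2), z_di)
```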

47

Page 48

Inference: Gibbs Sampling

• Updating z:

  $p(z_{di} = j \mid z_{-di}, w_{di}, w_{-di}, X, q, g, \Sigma) \propto p(w_{di} \mid z_{di} = j, \cdot)\, p(z_{di} = j \mid q, g, \Sigma, X) = \dfrac{n_{-di,j}^{(w_{di})} + \beta}{n_{-di,j}^{(\cdot)} + W\beta}\; p(z_{di} = j \mid q_d, g, \Sigma, x_{di})$

48

Page 49

Inference: Augmented Gibbs Sampling

• Updating H:

  $H_{di,j} \mid H_{di,-j} \sim \text{Truncated Normal}(\cdot)$

  – McCulloch and Rossi (1994), Imai and van Dyk (2004)

49

Page 50

Inference: Augmented Gibbs Sampling

• Draw $a^{2*}$ from $\operatorname{trace}(\Sigma^{-1}_{\text{old}})/\chi^{2}_{(J-1)^{2}}$.
• Draw $\tilde{H}^{*}_{di,j}$ by first drawing $H^{*}_{di,j}$ conditional on $z$, $q^{\text{old}}$, $g^{\text{old}}$, $\Sigma^{\text{old}}$, $H^{\text{old}}$, and setting $\tilde{H}^{*}_{di,j} = a^{*} H^{*}_{di,j}$.
• Draw $q^{\text{new}}$ and $g^{\text{new}}$ by first drawing $q^{*}$, $g^{*}$, and $a^{2**}$ conditional on $\tilde{H}^{*}_{di,j}$, $a^{2*}$, and $\Sigma^{\text{old}}$, and setting $q^{\text{new}} = q^{*}/a^{**}$, $g^{\text{new}} = g^{*}/a^{**}$.
• Draw $\Sigma^{\text{new}}$ by first drawing $\Sigma^{*}$ conditional on $\tilde{H}^{*}_{di,j}$, $q^{*}$, $g^{*}$, and setting $\Sigma^{\text{new}} = \Sigma^{*}/\Sigma^{*}_{11}$ and $H^{\text{new}}_{di,j} = \tilde{H}^{*}_{di,j}/\Sigma^{*}_{11}$.

50

50

Page 51

Implementation

• C, C++, and OpenMP (core functions) + R (function interfaces) + Python (text pre-processing)

51

Page 52

Research Testbed

• Reuters-21578
  – 11,771 documents
  – 775,553 words
  – 26,898 unique words
• Wordnet 2.1 is used for concept construction

52

Page 53

Wordnet Concept Construction

• Select the Wordnet synsets most relevant to the given corpus
• Definition of a concept:
  – A group of words with similar meanings, constructed from Wordnet synsets
• Consider nouns only
  – Organized in a tree structure

53

Page 54

Wordnet Concept Construction (Cont’d.)

• For each word (see the sketch below):
  – Find the root form using the morphy tool
  – Identify the synsets for the word
  – For each synset:
    • Construct a concept by merging the words in this synset, its descendants, its parent, its siblings, and the descendants of its siblings
    • Delete a concept if it contains fewer than 5 distinct tokens
    • A concept is not useful if it contains too few words
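A sketch of this construction step using NLTK's WordNet interface (function and variable names are illustrative, and NLTK bundles a newer WordNet than the 2.1 release used in the study):

```python
# Build candidate concepts for one word: merge the synset, its descendants,
# its parent(s), its siblings, and the siblings' descendants; drop small concepts.
from nltk.corpus import wordnet as wn

def build_concepts(word, min_size=5):
    concepts = []
    root = wn.morphy(word, wn.NOUN) or word              # root form via the morphy tool
    for synset in wn.synsets(root, pos=wn.NOUN):         # nouns only
        related = {synset}
        related.update(synset.closure(lambda s: s.hyponyms()))           # descendants
        for parent in synset.hypernyms():                                # parent
            related.add(parent)
            for sibling in parent.hyponyms():                            # siblings
                related.add(sibling)
                related.update(sibling.closure(lambda s: s.hyponyms()))  # siblings' descendants
        words = {name.lower().replace('_', ' ')
                 for s in related for name in s.lemma_names()}
        if len(words) >= min_size:                       # delete concepts with < 5 distinct tokens
            concepts.append((synset.name(), words))
    return concepts

concepts = build_concepts('share')
print(len(concepts), concepts[0][0])
```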

54

Page 55

Wordnet Concept Construction (Cont’d.)

• For each concept:
  – Compute the average co-occurrence length
    • The number of unique concept tokens appearing in a document
    • Averaged over all positive values
  – Delete concepts with average co-occurrence length <= 1.15
• Sort the concepts in descending order by a relevance score (average co-occurrence length / number of unique tokens); see the sketch below
• Delete the concepts in the bottom 25th percentile
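A small sketch of this filtering step, assuming a binary document-by-token presence matrix per concept (all names and the toy data are illustrative):

```python
# Relevance filtering: average co-occurrence length per concept, then keep
# concepts above the 1.15 cutoff and above the bottom 25% by relevance score.
import numpy as np

def concept_stats(presence):
    """presence[d, t] = 1 if unique concept token t appears in document d."""
    per_doc = presence.sum(axis=1)               # unique concept tokens per document
    avg_len = per_doc[per_doc > 0].mean()        # average over positive values only
    return avg_len, avg_len / presence.shape[1]  # (avg co-occur. length, relevance score)

rng = np.random.default_rng(1)
concepts = [rng.integers(0, 2, size=(100, k)) for k in (5, 7, 9)]   # three toy concepts
lengths, scores = np.array([concept_stats(p) for p in concepts]).T
keep = (lengths > 1.15) & (scores >= np.percentile(scores, 25))
print(keep)
```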

55

Page 56

Wordnet Concepts

Concept (# Unique Words / Avg. Freq. / Avg. Co-occur. Len.): Words in the Concept (at most 10 listed)

• proportion.n.01 (6 / 2372.7 / 1.24): scale, percent, pct, content, rate, percentage.
• security.n.04 (7 / 1856.9 / 1.47): scrip, debenture, share, treasury, convertible, stock, bond.
• offer.n.02 (9 / 842.4 / 1.24): price, question, proposition, prospectus, tender, proposal, reward, bid, special.
• fossil fuel.n.01 (6 / 838.8 / 1.64): oil, jet, gas, petroleum, coal, crude.
• funds.n.01 (7 / 806.0 / 1.15): exchequer, pocket, till, trough, treasury, roll, bank.
• sum.n.01 (49 / 736.3 / 2.21): figure, revenue, pool, win, purse, sales, profits, rent, proceeds, payoff (list truncated).
• social science.n.01 (5 / 688.2 / 1.17): econometrics, politics, economics, finance, government.
• slope.n.01 (15 / 616.7 / 1.42): decline, upgrade, descent, waterside, rise, coast, uphill, steep, brae, fall (list truncated).
• gregorian calendar month.n.01 (20 / 612.3 / 1.51): february, feb, mar, march, august, aug, september, sept, december, dec (list truncated).

56

Page 57

Summary Statistics of Wordnet Concepts

# of Wordnet Concepts Per Word    0     1     2     3     4 or more
Proportion                        45%   27%   12%   8%    8%

57

Page 58

Perplexity at Different Sweeps

• # of topics = 25

58

Page 59

Estimated Topic “Statement”

Top Keywords: estimate statement bill account action order coupon intervention review case usair accounting suit pass transfer

Wordnet Concepts:
• commercial document.n.01 (5.43)*: estimate statement bill account order
• proceeding.n.01 (4.29)*: action intervention review case suit
• relationship.n.03 (0.86)*: account hold restraint trust confinement
• advantage.n.01 (0.69)*: account leverage profitability expediency privilege
• fact.n.01 (0.51)*: case observation score specific item

Matching LDA Topic, Top Keywords: ct net loss shr profit rev note oper avg shrs mths qtr sales exclude gain

*Estimated coefficients for Wordnet concepts.

59

Page 60

Estimated Topic “Earnings”

Top Keywords: mln ct net loss dlrs shr profit rev note year gain oper include avg shrs

Wordnet Concepts:
• advantage.n.01 (3.59)*: profit gain good leverage preference
• subject.n.01 (-0.02)*: puzzle head precedent case question
• push.n.01 (-0.03)*: pinch crunch nudge mill boost
• legislature.n.01 (-0.06)*: diet congress house senate parliament

Matching LDA Topic, Top Keywords: mln note net stg include profit tax extraordinary pretax operate full item making turnover income

60

Page 61

Estimated Topic “Market Update”

Top Keywords: week total end product period average amount demand supply line inflation term shipment number release

Wordnet Concepts:
• quantity.n.03 (5.33)*: total product average amount term
• part.n.09 (4.66)*: end period factor top beginning
• work time.n.01 (4.38)*: week turn hours shift turnaround
• economic process.n.01 (4.34)*: demand supply inflation consumption spiral
• merchandise.n.01 (4.26)*: line shipment number release inventory cargo

Matching LDA Topic, Top Keywords: union south area spokesman city ship strike port worker africa line week affect state southern

61

Page 62

Estimated Topic “Macroeconomics”

Top Keywords: dollar market currency west yen economic dealer central growth cut japan economy expect policy interest

Wordnet Concepts:
• semite.n.01 (-0.03)*: palestinian arab saudi omani arabian
• rational_number.n.01 (-0.11)*: thousandth fraction fourth eighth half
• seed.n.01 (-0.12)*: soybean coffee hazelnut nut cob
• fact.n.01 (-0.22)*: observation score specific item case

Matching LDA Topic, Top Keywords: dollar currency yen west exchange market rates japan dealer central german germany intervention finance paris

62

Page 63

The Effect of Wordnet Concepts

63

Page 64

The Effect of Topic Number

64

Page 65

Questions

65