40
An Unsupervised Topic Segmentation Model Incorporating Word Order Shoaib Jameel and Wai Lam The Chinese University of Hong Kong One line summary of the work We will see how maintaining the document structure such as paragraphs, sentences, and the word order helps improve the performance of a topic model. Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 1

An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

An Unsupervised Topic SegmentationModel Incorporating Word Order

Shoaib Jameel and Wai Lam

The Chinese University of Hong Kong

One line summary of the workWe will see how maintaining the document structure such asparagraphs, sentences, and the word order helps improve theperformance of a topic model.

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 1

Page 2: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Outline

MotivationRelated Work

I Probabilistic Unigram Topic Model (LDA)I Probabilistic N-gram Topic ModelsI Topic Segmentation Models

Overview of our modelI Our N-gram Topic Segmentation model (NTSeg)

Text Mining Experiments of NTSegI Word-Topic and Segment-Topic Correlation GraphI Topic Segmentation ExperimentI Document Classification ExperimentI Document Likelihood Experiment

Conclusions and Future Directions

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 2

Page 3: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

MotivationMany works in the topic modeling literature assumeexchangeability among the words.As a result we see many ambiguous words in topics.For example, consider few topics obtained from the NIPScollection using the Latent Dirichlet Allocation (LDA) model:

Example

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5architecture order connectionist potential priorrecurrent first role membrane bayesiannetwork second binding current datamodule analysis structures synaptic evidencemodules small distributed dendritic experts

The problem with the LDA modelWords in topics are not insightful.

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 3

Page 4: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

MotivationMany works in the topic modeling literature assumeexchangeability among the words.As a result we see many ambiguous words in topics.For example, consider few topics obtained from the NIPScollection using the Latent Dirichlet Allocation (LDA) model:

Example

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5architecture order connectionist potential priorrecurrent first role membrane bayesiannetwork second binding current datamodule analysis structures synaptic evidencemodules small distributed dendritic experts

The problem with the LDA modelWords in topics are not insightful.

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 3

Page 5: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Latent Dirichlet Allocation Model (LDA) (Bleiet al., JMLR-2003)

Generative Process1 Draw θ from Dirichlet(α),

where each θ(d) consists oftopic distribution for documentd

2 Draw φ from Dirichlet(β),where φ encompasses worddistribution for topic

3 For every word in thedocument d

1 Draw a topic z(d)i from

Multinomial (θ(d))2 Draw a word w (d)

i fromMultinomial (φz(d)

i)

Graphical Model in PlateDiagram

N

w

z

MZ

θ α

φ

β

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 4

Page 6: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Relaxing the Bag-of-Words Assumption in a Topic ModelCan the bag-of-words assumption be relaxed in a topic model? Thismakes more sense as this is how documents are written by humansand also read.

more

lot

a

here

I

don’tunderstand

dog

cat

noticed

things

Here things are lot moreorganized. I can understandthat it was actually the cat

which noticed a dog.

Figure : This is how the bag-of-words looks (left) - complete chaos. The oneon the right makes more sense to us.

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 5

Page 7: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Relaxing the Bag-of-Words AssumptionBigram Topic Model (BTM) (Wallach, ICML-2006)

Some Properties of the modelWord is generated by both thetopic and the previous wordInspired by the HierarchicalDirichlet Language ModelBetter empirical results thanthe LDA modelA limitation of the model

I Always generates bigrams ina topic

Graphical Model of BTM

w(d)i−1

z(d)i−1

M

θ

w(d)i w

(d)i+1

w(d)i+2

z(d)i z

(d)i+1 z

(d)i+2

α

σ

ZV

δ

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 6

Page 8: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Relaxing the Bag-of-Words AssumptionLDA-Collocation Model (LDACOL) (Griffiths et al., Psy. Rev-2007)

Some Properties of the modelWord is generated by thetopic, the previous word and abinary bigram status variableEach word has a topicassignment and a collocationassignmentCan form longer order phrasesCan generate both unigramsand bigram wordsA limitation of the model

I Only the first word in abigram has a topicassignment

Graphical Model of LDACOL

M

VZ

V

α

θ

z(d)i−1 z

(d)i z

(d)i+1 z

(d)i+2

x(d)i x

(d)i+1 x

(d)i+2

w(d)i−1 w

(d)i w

(d)i+1 w

(d)i+2

φβ

γ

ψ

σ δ

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 7

Page 9: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Relaxing the Bag-of-Words AssumptionTopical N-Gram Model (TNG) (Wang et al., ICDM-2007)

Some Properties of the modelExtends the LDACOL modelEach word has a topicassignment and a collocationassignmentCan form longer order phrasesCan generate both unigramsand bigram wordsA limitation of the model

I Words in a bigram may havedifferent topic assignments

Graphical Model of TNG

M

ZVZ

ZV

α

θ

z(d)i−1 z

(d)i z

(d)i+1 z

(d)i+2

x(d)i x

(d)i+1 x

(d)i+2

w(d)i−1 w

(d)i w

(d)i+1 w

(d)i+2

φβ

γ

ψ

σ δ

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 8

Page 10: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

What Topic N-gram models do - An Illustration

Abstract We give necessary and sufficient conditions for uniqueness of the support vector solution for the problems ofpattern recognition and regression estimation, for a general class of cost functions. We sho that if the solution is notunique, all support vectors are necessarily at bound, and we give some simple examples of non-unique solutions. Wenote that uniqueness of the primal (dual) silution does not necessarily imply uniqueness of the dual (primal) solution.We show how to compute the threshold b when the solution is unique, but when all support vectors are bound, in which ...

case the usual method for determining b does not work...Acknowledgements C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their

support. Reference [1] R. Fletcher, Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.

Para.

1P

ara.2

Topic 1 Topic 2

support vector

cost functions

acknowledgements

reference

Consider the document as a wholeFind topical n-grams in the document

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 9

Page 11: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Bag of Words in Topic SegmentationThese models maintain the document structure such asparagraphs or sentencesAssume that words within a paragraph or a sentence areexchangeableIntroduces the notion of segment-topics or super-topics andword-topics

Paragraph n in the document d

Paragraph n + 1 in the document d

thisis

paragraph

1

2

will

follow

next

follow

paragraph

3is

willnext

this

2

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 10

Page 12: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

A Topic Segmentation Model (LDSEG) (Shafiei et al.,Canadian AI-2008)

Model PropertiesPerforms topic segmentationCan work at paragraph andsentence levelc a binary variable gives thechange in topics segment-wiseSegments come from apredefined number ofsuper-topicsThe super-topics comprise ofa mixture of word-topics

Graphical Model in PlateDiagram

N

w

z

S

Z

θ α

φ

β

M

y

τ ρ

c

π

Ω

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 11

Page 13: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

A Topic Segmentation Model (LDSEG) (Shafiei et al.,Canadian AI-2008)

Model PropertiesThis region is similar to theLDA modelSegments exhibit multipletopicsWords are generated from apredefined number ofword-topics

Graphical Model in PlateDiagram

N

w

z

S

Z

θ α

φ

β

M

y

τ ρ

c

π

Ω

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 12

Page 14: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Topic Segmentation Illustration using theLDSEG model

Abstract We give necessary and sufficient conditions for uniqueness of the support vector solution for the problems ofpattern recognition and regression estimation, for a general class of cost functions. We sho that if the solution is notunique, all support vectors are necessarily at bound, and we give some simple examples of non-unique solutions. Wenote that uniqueness of the primal (dual) silution does not necessarily imply uniqueness of the dual (primal) solution.We show how to compute the threshold b when the solution is unique, but when all support vectors are bound, in which ...

case the usual method for determining b does not work...Acknowledgements C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their

support. Reference [1] R. Fletcher, Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.

Para.

1P

ara.2

Word-Topic 1 Word-Topic 2

acknowledgements

reference

support

vector

cost

Super-Topic 1

Super-Topic 4

Super-Topic 7

Word-Topic 1

Word-Topic 9

Word-Topic 12

Super-Topic 2

Word-Topic 5

Word-Topic 8

Word-Topic 1

Performs topic segmentationUnigram words are assigned to the word-topicsSegments are assigned to the document-topics or super-topics

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 13

Page 15: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Our Research Contributions

We propose a model called, NTSegOur proposed model maintains the document structure such asparagraphs and sentencesMaintains the order of the words in the documentDetects and coordinates two topic granularity levels

I Segment-TopicsI Word-Topics

Derivation of the posterior inference schemeConducted extensive text mining experiments

I Shown improvement over state-of-the-art models

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 14

Page 16: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Our Proposed Model (NTSeg)ρΩ

τ (d)π(d)

y(d)s−1

c(d)s

θ(d)s−1

z(d)s−1,n−1

x(d)s−1,n

w(d)s−1,n−1

φβ σ δ

Z ZVZV

γ

M

α

y(d)s

z(d)s−1,n

w(d)s−1,n

z(d)s−1,n+1

w(d)s−1,n+1

x(d)s−1,n+1

θ(d)s

z(d)s,n−1

w(d)s,n−1

x(d)s,n

z(d)s,n

w(d)s,n

z(d)s,n+1

x(d)s,n+1

w(d)s,n+1

ψ

c(d)s−1

K

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 15

Page 17: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Our Proposed Model (NTSeg)ρΩ

τ (d)π(d)

y(d)s−1

c(d)s

θ(d)s−1

z(d)s−1,n−1

x(d)s−1,n

w(d)s−1,n−1

φβ σ δ

Z ZVZV

γ

M

α

y(d)s

z(d)s−1,n

w(d)s−1,n

z(d)s−1,n+1

w(d)s−1,n+1

x(d)s−1,n+1

θ(d)s

z(d)s,n−1

w(d)s,n−1

x(d)s,n

z(d)s,n

w(d)s,n

z(d)s,n+1

x(d)s,n+1

w(d)s,n+1

ψ

c(d)s−1

K

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 16

Page 18: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Few Properties of NTSeg

Segments are assigned to the segment-topics

Assume a Markov property on the segment-topics y (d)s

c(d)s denotes the segment-topic change-points

Segments can be taken as a paragraphs or sentences

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 17

Page 19: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Our Proposed Model (NTSeg)ρΩ

τ (d)π(d)

y(d)s−1

c(d)s

θ(d)s−1

z(d)s−1,n−1

x(d)s−1,n

w(d)s−1,n−1

φβ σ δ

Z ZVZV

γ

M

α

y(d)s

z(d)s−1,n

w(d)s−1,n

z(d)s−1,n+1

w(d)s−1,n+1

x(d)s−1,n+1

θ(d)s

z(d)s,n−1

w(d)s,n−1

x(d)s,n

z(d)s,n

w(d)s,n

z(d)s,n+1

x(d)s,n+1

w(d)s,n+1

ψ

c(d)s−1

K

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 18

Page 20: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Some properties of NTSeg

Does not break the order of the wordsCan form unigrams, bigrams and higher order phrases (using x)variableThe phrases share the same topic

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 19

Page 21: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

θ

z(d)i−1 z

(d)i

z(d)i+1 z

(d)i+2

x(d)i x

(d)i+1 x

(d)i+2

w(d)i−1 w

(d)i w

(d)i+1 w

(d)i+2

b

b

b

bb

b

Irish

x = 1

cricket team

x = 1

Irish cricket team

WordTopic

7

WordTopic

7

WordTopic

7

Figure : This is how we form longer phrasesShoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 20

Page 22: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Technical Challenges for NTSeg

Sharing of the same word-topic among words in a phraseCoordination of segment-topics and word-topicsDerivation of the Posterior Inference scheme

I Gibbs sampling update equations

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 21

Page 23: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Posterior InferenceGibbs Sampling

Sampling word-topic assignments

P(z(d)si , x

(d)si |w, z

(d)¬si , x

(d)¬si ,y,c, α, β, γ, δ, ρ,Ω) ∝

(αy (d)

s z(d)si

+ h(d)

sz(d)si

− 1)× (γx (d)

si+ p

z(d)s,i−1w (d)

s,i−1x (d)si− 1)

×

βw(d)

si+n

z(d)si w(d)

si−1∑V

v=1

(βv +n

z(d)si v

)−1

if x (d)si = 0

δw(d)

si+m

w(d)si w(d)

s,i−1z(d)si−1∑V

v=1

(δv +m

w(d)s,i−1vz(d)

si

)−1

if x (d)si = 1 & z(d)

si = z(d)s,i−1

(1)

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 22

Page 24: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Posterior InferenceGibbs Sampling

Sampling segment-topic assignments

P(y (d)s , c(d)

s |z, y (d)¬s , c

(d)¬s ,w,x, α, β, γ, δ, ρ,Ω) ∝

(ρy (d)

s+ b(d)

y (d)s− 1)× (α

y (d)s z(d)

si+ h(d)

sz(d)si

− 1)×(κ

(d)cs,0

+Ω0∑1x=0 κ

(d)cs,x +Ω0+Ω1

)if c(d)

s = 0

(αy (d)

s z(d)si

+ h(d)

sz(d)si

− 1)×(

κ(d)cs,1

+Ω1∑1x=0 κ

(d)cs,x +Ω0+Ω1

)if c(d)

s = 1 & s > 1 & y (d)s = y (d)

(s−1)

(2)

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 23

Page 25: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

NTSeg Word-Topic and Segment-TopicIllustration

Abstract We give necessary and sufficient conditions for uniqueness of the support vector solution for the problems ofpattern recognition and regression estimation, for a general class of cost functions. We sho that if the solution is notunique, all support vectors are necessarily at bound, and we give some simple examples of non-unique solutions. Wenote that uniqueness of the primal (dual) silution does not necessarily imply uniqueness of the dual (primal) solution.We show how to compute the threshold b when the solution is unique, but when all support vectors are bound, in which ...

case the usual method for determining b does not work...Acknowledgements C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their

support. Reference [1] R. Fletcher, Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.

Para.

1P

ara.2

Word-Topic 1 Word-Topic 2

acknowledgements

reference

support vector

cost

Segment-Topic 1

Segment Topic 4

Segment Topic 7

Word-Topic 1

Word-Topic 9

Word-Topic 12

Segment-Topic 2

Word-Topic 5

Word-Topic 8

Word-Topic 1functions

Performs document segmentation based on topicN-gram words are assigned to the word-topicsSegments are assigned to the segment-topics

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 24

Page 26: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Word-topic and Segment-topic CorrelationGraph

Used a large dataset - OHSUMEDI OHSUMED consists of 348,566 medical abstracts

The idea is to show the discovery of n-gram words of topics viathe correlation graph

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 25

Page 27: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Word-Topic and Segment-Topic CorrelationGraphResult of NTSeg

methodresults obtainedsystemclinical laboratory

methods

hivhuman immunodeficiency virus

aidsinfectedinfection

vaccineprotectionantibody response

immunizationprotective

weeksfetal

umbilical arteryfetal heart

adult

childrenchild

sexual abuseadults

young children

infantsgestational age

minorpregnant woman

preterm infants

medicalhealth

family physiciansprimary carefamily pratice

womenconfidence interval

riskmen

risk factors

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 26

Page 28: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Topic Correlation GraphCorrelation Graph from GD-LDA (Caballero et al.,CIKM-2012)

studypatientpeople

universediseasemedical

internetinformationtechnology

servicepeoplebusy

computermakehand

systemtv

people

drugstateunitedtalknato

clinton

militarywar

nuclearpresident

politicchechnya

storyjournaltime

editorbudgetyork

bankproblemeconomysysteminvestorpercent

increase

price

work

program

power

american

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 27

Page 29: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Topic Segmentation ExperimentThe ordering of c(d)

s gives the topic change-points in the documentUsed two benchmark datasets - Books and Lectures datasets

I Books dataset - Medical text book, 140 sentences, 227 chaptersI Lectures dataset - Undergraduate lecture recording of Physics and

AI classes, 90 min lecture, 700 sentences, 8500 words

A segment here is a sentenceComparative method - TopicTiling Algorithm (Reidl et al,ACL-2012)Used two commonly used evaluation metrics

I Pk - Probability that the two segments drawn randomly from adocument are incorrectly identified as belonging to the same topic

I WinDiff - Moves a sliding window across the text and counts thenumber of times the hypothesized and referenced segmentboundaries are different from within the window

These two evaluation metrics give an error estimate, so the lower,the better

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 28

Page 30: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Topic SegmentationResults

Books dataset

20 40 60 80 1000.310

0.315

0.320

0.325

0.330

Word-Topics (Z)

Pk

10 Segment-Topics (K)

20 40 60 80 100

0.310

0.315

0.320

0.325

0.330

Word-Topics (Z)

Pk

30 Segment-Topics (K)

20 40 60 80 100

0.315

0.320

0.325

0.330

Word-Topics (Z)

Pk

50 Segment-Topics (K)

20 40 60 80 100

0.330

0.340

0.350

Word-Topics (Z)

Win

Diff

10 Segment-Topics (K)

20 40 60 80 1000.330

0.335

0.340

0.345

0.350

Word-Topics (Z)

Win

Diff

30 Segment-Topics (K)

20 40 60 80 100

0.330

0.335

0.340

0.345

0.350

Word-Topics (Z)W

inD

iff

50 Segment-Topics (K)

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 29

Page 31: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Topic SegmentationResults

Lectures dataset

20 40 60 80 100

0.3550.3600.3650.3700.3750.380

Word-Topics (Z)

Pk

10 Segment-Topics (K)

20 40 60 80 100

0.350

0.360

0.370

0.380

Word-Topics (Z)

Pk

30 Segment-Topics (K)

20 40 60 80 100

0.360

0.365

0.370

0.375

0.380

Word-Topics (Z)

Pk

50 Segment-Topics (K)

20 40 60 80 1000.440

0.445

0.450

0.455

0.460

Word-Topics (Z)

Win

Diff

10 Segment-Topics (K)

20 40 60 80 1000.4300.4350.4400.4450.4500.4550.460

Word-Topics (Z)

Win

Diff

30 Segment-Topics (K)

20 40 60 80 100

0.4430.4460.4490.4520.4550.4580.461

Word-Topics (Z)W

inD

iff

50 Segment-Topics (K)

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 30

Page 32: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Document Classification ExperimentDataset

Generate four datasets from 20 Newsgroups dataThe datasets are:

I ComputerI PoliticsI SportsI Science

Each dataset comprises of equal number of documents of severalclasses. For example, the Computer dataset consists of thefollowing classes:

I GraphicsI HardwareI X WindowsI MacI Microsoft Windows

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 31

Page 33: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Document Classification ExperimentExperimental Setup

Split each dataset into training and test set maintaining the classdistribution

I We used 75% training and 25% testing in our experiments

For each class, we generate a topic model using the training setDuring classification, compute the likelihood of each document inthe test set in each topic modelThe test document gets classified to that class where thelikelihood is maximumEvaluation Metrics

I Standard Precision, Recall and F-Measure for each classI Adopted Macro-Averaging scheme

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 32

Page 34: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Document Classification ExperimentComparative Methods

Latent Dirichlet Segmentation Method (Word-Topics andSuper-Topics) - LDSEG (Shafiei et al., Canadian AI-2008)Pachinko Allocation Model (Super-Topics and Word-Topics) - PAM(Li and McCallum, ICML-2006)LDA Collocation Model (N-gram Topic Model) - LDACOL (Griffithset al., Psy. Rev-2007)Topical N-gram Model (N-gram Topic Model) - TNG (Wang et al.,ICDM-2007)Phrase Discovery Topic Model based on Pitman-Yor Process -PDLDA (Lindsey et al., EMNLP-2012)

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 33

Page 35: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Document Classification ExperimentResults

Precision Recall F-Measure Precision Recall F-MeasureLDSEG 0.580 0.420 0.487 0.440 0.400 0.419PAM 0.550 0.450 0.495 0.500 0.330 0.398

LDACOL 0.400 0.300 0.343 0.420 0.370 0.393TNG 0.490 0.420 0.452 0.560 0.470 0.511

PDLDA 0.580 0.500 0.537 0.580 0.510 0.543NTSeg 0.640 0.520 0.574 0.620 0.560 0.588

Computer dataset Science datasetPrecision Recall F-Measure Precision Recall F-Measure

LDSEG 0.390 0.320 0.352 0.330 0.320 0.325PAM 0.540 0.490 0.514 0.368 0.360 0.363

LDACOL 0.550 0.410 0.470 0.200 0.180 0.189TNG 0.550 0.450 0.495 0.340 0.290 0.313

PDLDA 0.590 0.410 0.484 0.380 0.210 0.271NTSeg 0.620 0.570 0.594 0.420 0.380 0.399

Politics dataset Sports dataset

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 34

Page 36: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Document Modeling Experiment

NIPS dataset

50 100 150 200

−8.800

−8.700

−8.600

·106

Word-Topics (Z)

Log-

Like

lihoo

d

10 Segment-Topics (K)

50 100 150 200

−8.800

−8.700

−8.600

·106

Word-Topics (Z)Lo

g-Li

kelih

ood

50 Segment-Topics (K)

Figure : NTSeg ( ) LDSEG ( ), PAM ( ), LDACOL ( ), TNG ( ),and PDLDA ( ).

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 35

Page 37: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Document Modeling ExperimentResults

OHSUMED dataset (348,566 medical abstracts)

200 300 400 500−3.300

−3.250

−3.200

−3.150·107

Word-Topics (Z)

Log-

Like

lihoo

d

50 Segment-Topics (K)

200 300 400 500−3.300

−3.250

−3.200

−3.150

·107

Word-Topics (Z)

Log-

Like

lihoo

d

150 Segment-Topics (K)

Figure : NTSeg ( ) LDSEG ( ), PAM ( ), LDACOL ( ), TNG ( ),and PDLDA ( ).

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 36

Page 38: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Concluding Remarks

We have presented a topic segmentation model that:I Maintains the document structure such as paragraphs and

sentencesI Keeps the order of the words intact

We have applied our model in multitudes of text mining tasksI We have obtained good improvement over the state-of-the-art

models

Future Direction: Nonparametric topic segmentation modelWe wish to automatically find out the number of latent topics thatdescribes the collection instead of manually supplying and varying thenumber of latent topics

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 37

Page 39: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

Acknowledgments

Specially thank SIGIR (the organization) for the Student Travel Awardand also the local organizers for waiving my student registration fee(and for helping me keep busy for about 8 hours during the conferencetime). The experience has been rewarding.

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 38

Page 40: An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

References

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. JMLR, 3,993-1022.

Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. Proc of ICML (pp. 977-984).

Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. (2007). Topics in semanticrepresentation. Psychological review, 114(2), 211.

Wang, X., McCallum, A., and Wei, X. (2007). Topical n-grams: Phrase and topic discovery,with an application to information retrieval. In Proc. of ICDM, (pp. 697-702).

Caballero, K. L., Barajas, J., and Akella, R. (2012). The generalized dirichlet distribution inenhanced topic detection. In Proc. of CIKM, (pp. 773-782).

Riedl, M., and Biemann, C. (2012). Topictiling: a text segmentation algorithm based onLDA. In Proc. of ACL, (pp. 37-42).

Lindsey, R V., William P. H. , and Michael J. S. (2012) A phrase-discovering topic modelusing hierarchical Pitman-Yor processes. In Proc. of EMNLP, (pp. 214-222).

Li, W., and McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models oftopic correlations. In Proc. of ICML (pp. 577-584).

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 39