An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model

An Unsupervised Topic SegmentationModel Incorporating Word Order

Shoaib Jameel and Wai Lam

The Chinese University of Hong Kong

One line summary of the workWe will see how maintaining the document structure such asparagraphs, sentences, and the word order helps improve theperformance of a topic model.

Shoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 1

Outline

MotivationRelated Work

I Probabilistic Unigram Topic Model (LDA)I Probabilistic N-gram Topic ModelsI Topic Segmentation Models

Overview of our modelI Our N-gram Topic Segmentation model (NTSeg)

Text Mining Experiments of NTSegI Word-Topic and Segment-Topic Correlation GraphI Topic Segmentation ExperimentI Document Classification ExperimentI Document Likelihood Experiment

Conclusions and Future Directions


MotivationMany works in the topic modeling literature assumeexchangeability among the words.As a result we see many ambiguous words in topics.For example, consider few topics obtained from the NIPScollection using the Latent Dirichlet Allocation (LDA) model:

Example

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5architecture order connectionist potential priorrecurrent first role membrane bayesiannetwork second binding current datamodule analysis structures synaptic evidencemodules small distributed dendritic experts

The problem with the LDA modelWords in topics are not insightful.


MotivationMany works in the topic modeling literature assumeexchangeability among the words.As a result we see many ambiguous words in topics.For example, consider few topics obtained from the NIPScollection using the Latent Dirichlet Allocation (LDA) model:

Example

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5architecture order connectionist potential priorrecurrent first role membrane bayesiannetwork second binding current datamodule analysis structures synaptic evidencemodules small distributed dendritic experts

The problem with the LDA modelWords in topics are not insightful.


Latent Dirichlet Allocation Model (LDA) (Bleiet al., JMLR-2003)

Generative Process1 Draw θ from Dirichlet(α),

where each θ(d) consists oftopic distribution for documentd

2 Draw φ from Dirichlet(β),where φ encompasses worddistribution for topic

3 For every word in thedocument d

1 Draw a topic z(d)i from

Multinomial (θ(d))2 Draw a word w (d)

i fromMultinomial (φz(d)

i)

Graphical Model in PlateDiagram

N

w

z

MZ

θ α

φ

β


Relaxing the Bag-of-Words Assumption in a Topic ModelCan the bag-of-words assumption be relaxed in a topic model? Thismakes more sense as this is how documents are written by humansand also read.

more

lot

a

here

I

don’tunderstand

dog

cat

noticed

things

Here things are lot moreorganized. I can understandthat it was actually the cat

which noticed a dog.

Figure : This is how the bag-of-words looks (left) - complete chaos. The oneon the right makes more sense to us.


Relaxing the Bag-of-Words AssumptionBigram Topic Model (BTM) (Wallach, ICML-2006)

Some Properties of the modelWord is generated by both thetopic and the previous wordInspired by the HierarchicalDirichlet Language ModelBetter empirical results thanthe LDA modelA limitation of the model

I Always generates bigrams ina topic

Graphical Model of BTM

w(d)i−1

z(d)i−1

M

θ

w(d)i w

(d)i+1

w(d)i+2

z(d)i z

(d)i+1 z

(d)i+2

α

σ

ZV

δ


Relaxing the Bag-of-Words AssumptionLDA-Collocation Model (LDACOL) (Griffiths et al., Psy. Rev-2007)

Some Properties of the modelWord is generated by thetopic, the previous word and abinary bigram status variableEach word has a topicassignment and a collocationassignmentCan form longer order phrasesCan generate both unigramsand bigram wordsA limitation of the model

I Only the first word in abigram has a topicassignment

Graphical Model of LDACOL

M

VZ

V

α

θ

z(d)i−1 z

(d)i z

(d)i+1 z

(d)i+2

x(d)i x

(d)i+1 x

(d)i+2

w(d)i−1 w

(d)i w

(d)i+1 w

(d)i+2

φβ

γ

ψ

σ δ


Relaxing the Bag-of-Words AssumptionTopical N-Gram Model (TNG) (Wang et al., ICDM-2007)

Some Properties of the modelExtends the LDACOL modelEach word has a topicassignment and a collocationassignmentCan form longer order phrasesCan generate both unigramsand bigram wordsA limitation of the model

I Words in a bigram may havedifferent topic assignments

Graphical Model of TNG

M

ZVZ

ZV

α

θ

z(d)i−1 z

(d)i z

(d)i+1 z

(d)i+2

x(d)i x

(d)i+1 x

(d)i+2

w(d)i−1 w

(d)i w

(d)i+1 w

(d)i+2

φβ

γ

ψ

σ δ


What Topic N-gram models do - An Illustration

Abstract We give necessary and sufficient conditions for uniqueness of the support vector solution for the problems ofpattern recognition and regression estimation, for a general class of cost functions. We sho that if the solution is notunique, all support vectors are necessarily at bound, and we give some simple examples of non-unique solutions. Wenote that uniqueness of the primal (dual) silution does not necessarily imply uniqueness of the dual (primal) solution.We show how to compute the threshold b when the solution is unique, but when all support vectors are bound, in which ...

case the usual method for determining b does not work...Acknowledgements C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Technologies for their

support. Reference [1] R. Fletcher, Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.

Para.

1P

ara.2

Topic 1 Topic 2

support vector

cost functions

acknowledgements

reference

Consider the document as a wholeFind topical n-grams in the document


Bag of Words in Topic SegmentationThese models maintain the document structure such asparagraphs or sentencesAssume that words within a paragraph or a sentence areexchangeableIntroduces the notion of segment-topics or super-topics andword-topics

Paragraph n in the document d

Paragraph n + 1 in the document d

thisis

paragraph

1

2

will

follow

next

follow

paragraph

3is

willnext

this

2


A Topic Segmentation Model (LDSEG) (Shafiei et al.,Canadian AI-2008)

Model PropertiesPerforms topic segmentationCan work at paragraph andsentence levelc a binary variable gives thechange in topics segment-wiseSegments come from apredefined number ofsuper-topicsThe super-topics comprise ofa mixture of word-topics


N

w

z

S

Z

θ α

φ

β

M

y

τ ρ

c

π

Ω


A Topic Segmentation Model (LDSEG) (Shafiei et al.,Canadian AI-2008)

Model PropertiesThis region is similar to theLDA modelSegments exhibit multipletopicsWords are generated from apredefined number ofword-topics


N

w

z

S

Z

θ α

φ

β

M

y

τ ρ

c

π

Ω


Topic Segmentation Illustration using theLDSEG model




Para.

1P

ara.2

Word-Topic 1 Word-Topic 2

acknowledgements

reference

support

vector

cost

Super-Topic 1

Super-Topic 4

Super-Topic 7

Word-Topic 1

Word-Topic 9

Word-Topic 12

Super-Topic 2

Word-Topic 5

Word-Topic 8

Word-Topic 1

Performs topic segmentationUnigram words are assigned to the word-topicsSegments are assigned to the document-topics or super-topics


Our Research Contributions

We propose a model called, NTSegOur proposed model maintains the document structure such asparagraphs and sentencesMaintains the order of the words in the documentDetects and coordinates two topic granularity levels

I Segment-TopicsI Word-Topics

Derivation of the posterior inference schemeConducted extensive text mining experiments

I Shown improvement over state-of-the-art models


Our Proposed Model (NTSeg)ρΩ

τ (d)π(d)

y(d)s−1

c(d)s

θ(d)s−1

z(d)s−1,n−1

x(d)s−1,n

w(d)s−1,n−1

φβ σ δ

Z ZVZV

γ

M

α

y(d)s

z(d)s−1,n

w(d)s−1,n

z(d)s−1,n+1

w(d)s−1,n+1

x(d)s−1,n+1

θ(d)s

z(d)s,n−1

w(d)s,n−1

x(d)s,n

z(d)s,n

w(d)s,n

z(d)s,n+1

x(d)s,n+1

w(d)s,n+1

ψ

c(d)s−1

K



τ (d)π(d)

y(d)s−1

c(d)s

θ(d)s−1

z(d)s−1,n−1

x(d)s−1,n

w(d)s−1,n−1

φβ σ δ

Z ZVZV

γ

M

α

y(d)s

z(d)s−1,n

w(d)s−1,n

z(d)s−1,n+1

w(d)s−1,n+1

x(d)s−1,n+1

θ(d)s

z(d)s,n−1

w(d)s,n−1

x(d)s,n

z(d)s,n

w(d)s,n

z(d)s,n+1

x(d)s,n+1

w(d)s,n+1

ψ

c(d)s−1

K


Few Properties of NTSeg

Segments are assigned to the segment-topics

Assume a Markov property on the segment-topics y (d)s

c(d)s denotes the segment-topic change-points

Segments can be taken as a paragraphs or sentences



τ (d)π(d)

y(d)s−1

c(d)s

θ(d)s−1

z(d)s−1,n−1

x(d)s−1,n

w(d)s−1,n−1

φβ σ δ

Z ZVZV

γ

M

α

y(d)s

z(d)s−1,n

w(d)s−1,n

z(d)s−1,n+1

w(d)s−1,n+1

x(d)s−1,n+1

θ(d)s

z(d)s,n−1

w(d)s,n−1

x(d)s,n

z(d)s,n

w(d)s,n

z(d)s,n+1

x(d)s,n+1

w(d)s,n+1

ψ

c(d)s−1

K


Some properties of NTSeg

Does not break the order of the wordsCan form unigrams, bigrams and higher order phrases (using x)variableThe phrases share the same topic


θ

z(d)i−1 z

(d)i

z(d)i+1 z

(d)i+2

x(d)i x

(d)i+1 x

(d)i+2

w(d)i−1 w

(d)i w

(d)i+1 w

(d)i+2

b

b

b

bb

b

Irish

x = 1

cricket team

x = 1

Irish cricket team

WordTopic

7

WordTopic

7

WordTopic

7

Figure : This is how we form longer phrasesShoaib Jameel and Wai Lam SIGIR-2013, Dublin, Ireland 20

Technical Challenges for NTSeg

Sharing of the same word-topic among words in a phraseCoordination of segment-topics and word-topicsDerivation of the Posterior Inference scheme

I Gibbs sampling update equations


Posterior InferenceGibbs Sampling

Sampling word-topic assignments

P(z(d)si , x

(d)si |w, z

(d)¬si , x

(d)¬si ,y,c, α, β, γ, δ, ρ,Ω) ∝

(αy (d)

s z(d)si

+ h(d)

sz(d)si

− 1)× (γx (d)

si+ p

z(d)s,i−1w (d)

s,i−1x (d)si− 1)

×

βw(d)

si+n

z(d)si w(d)

si−1∑V

v=1

(βv +n

z(d)si v

)−1

if x (d)si = 0

δw(d)

si+m

w(d)si w(d)

s,i−1z(d)si−1∑V

v=1

(δv +m

w(d)s,i−1vz(d)

si

)−1

if x (d)si = 1 & z(d)

si = z(d)s,i−1

(1)


Posterior InferenceGibbs Sampling

Sampling segment-topic assignments

P(y (d)s , c(d)

s |z, y (d)¬s , c

(d)¬s ,w,x, α, β, γ, δ, ρ,Ω) ∝

(ρy (d)

s+ b(d)

y (d)s− 1)× (α

y (d)s z(d)

si+ h(d)

sz(d)si

− 1)×(κ

(d)cs,0

+Ω0∑1x=0 κ

(d)cs,x +Ω0+Ω1

)if c(d)

s = 0

(αy (d)

s z(d)si

+ h(d)

sz(d)si

− 1)×(

κ(d)cs,1

+Ω1∑1x=0 κ

(d)cs,x +Ω0+Ω1

)if c(d)

s = 1 & s > 1 & y (d)s = y (d)

(s−1)

(2)


NTSeg Word-Topic and Segment-TopicIllustration




Para.

1P

ara.2

Word-Topic 1 Word-Topic 2

acknowledgements

reference

support vector

cost

Segment-Topic 1

Segment Topic 4

Segment Topic 7

Word-Topic 1

Word-Topic 9

Word-Topic 12

Segment-Topic 2

Word-Topic 5

Word-Topic 8

Word-Topic 1functions

Performs document segmentation based on topicN-gram words are assigned to the word-topicsSegments are assigned to the segment-topics


Word-topic and Segment-topic CorrelationGraph

Used a large dataset - OHSUMEDI OHSUMED consists of 348,566 medical abstracts

The idea is to show the discovery of n-gram words of topics viathe correlation graph


Word-Topic and Segment-Topic CorrelationGraphResult of NTSeg

methodresults obtainedsystemclinical laboratory

methods

hivhuman immunodeficiency virus

aidsinfectedinfection

vaccineprotectionantibody response

immunizationprotective

weeksfetal

umbilical arteryfetal heart

adult

childrenchild

sexual abuseadults

young children

infantsgestational age

minorpregnant woman

preterm infants

medicalhealth

family physiciansprimary carefamily pratice

womenconfidence interval

riskmen

risk factors


Topic Correlation GraphCorrelation Graph from GD-LDA (Caballero et al.,CIKM-2012)

studypatientpeople

universediseasemedical

internetinformationtechnology

servicepeoplebusy

computermakehand

systemtv

people

drugstateunitedtalknato

clinton

militarywar

nuclearpresident

politicchechnya

storyjournaltime

editorbudgetyork

bankproblemeconomysysteminvestorpercent

increase

price

work

program

power

american


Topic Segmentation ExperimentThe ordering of c(d)

s gives the topic change-points in the documentUsed two benchmark datasets - Books and Lectures datasets

I Books dataset - Medical text book, 140 sentences, 227 chaptersI Lectures dataset - Undergraduate lecture recording of Physics and

AI classes, 90 min lecture, 700 sentences, 8500 words

A segment here is a sentenceComparative method - TopicTiling Algorithm (Reidl et al,ACL-2012)Used two commonly used evaluation metrics

I Pk - Probability that the two segments drawn randomly from adocument are incorrectly identified as belonging to the same topic

I WinDiff - Moves a sliding window across the text and counts thenumber of times the hypothesized and referenced segmentboundaries are different from within the window

These two evaluation metrics give an error estimate, so the lower,the better


Topic SegmentationResults

Books dataset

20 40 60 80 1000.310

0.315

0.320

0.325

0.330

Word-Topics (Z)

Pk

10 Segment-Topics (K)

20 40 60 80 100

0.310

0.315

0.320

0.325

0.330

Word-Topics (Z)

Pk


20 40 60 80 100

0.315

0.320

0.325

0.330

Word-Topics (Z)

Pk


20 40 60 80 100

0.330

0.340

0.350

Word-Topics (Z)

Win

Diff


20 40 60 80 1000.330

0.335

0.340

0.345

0.350

Word-Topics (Z)

Win

Diff


20 40 60 80 100

0.330

0.335

0.340

0.345

0.350

Word-Topics (Z)W

inD

iff



Topic SegmentationResults

Lectures dataset

20 40 60 80 100

0.3550.3600.3650.3700.3750.380

Word-Topics (Z)

Pk


20 40 60 80 100

0.350

0.360

0.370

0.380

Word-Topics (Z)

Pk


20 40 60 80 100

0.360

0.365

0.370

0.375

0.380

Word-Topics (Z)

Pk


20 40 60 80 1000.440

0.445

0.450

0.455

0.460

Word-Topics (Z)

Win

Diff


20 40 60 80 1000.4300.4350.4400.4450.4500.4550.460

Word-Topics (Z)

Win

Diff


20 40 60 80 100

0.4430.4460.4490.4520.4550.4580.461

Word-Topics (Z)W

inD

iff



Document Classification ExperimentDataset

Generate four datasets from 20 Newsgroups dataThe datasets are:

I ComputerI PoliticsI SportsI Science

Each dataset comprises of equal number of documents of severalclasses. For example, the Computer dataset consists of thefollowing classes:

I GraphicsI HardwareI X WindowsI MacI Microsoft Windows


Document Classification ExperimentExperimental Setup

Split each dataset into training and test set maintaining the classdistribution

I We used 75% training and 25% testing in our experiments

For each class, we generate a topic model using the training setDuring classification, compute the likelihood of each document inthe test set in each topic modelThe test document gets classified to that class where thelikelihood is maximumEvaluation Metrics

I Standard Precision, Recall and F-Measure for each classI Adopted Macro-Averaging scheme


Document Classification ExperimentComparative Methods

Latent Dirichlet Segmentation Method (Word-Topics andSuper-Topics) - LDSEG (Shafiei et al., Canadian AI-2008)Pachinko Allocation Model (Super-Topics and Word-Topics) - PAM(Li and McCallum, ICML-2006)LDA Collocation Model (N-gram Topic Model) - LDACOL (Griffithset al., Psy. Rev-2007)Topical N-gram Model (N-gram Topic Model) - TNG (Wang et al.,ICDM-2007)Phrase Discovery Topic Model based on Pitman-Yor Process -PDLDA (Lindsey et al., EMNLP-2012)


Document Classification ExperimentResults

Precision Recall F-Measure Precision Recall F-MeasureLDSEG 0.580 0.420 0.487 0.440 0.400 0.419PAM 0.550 0.450 0.495 0.500 0.330 0.398

LDACOL 0.400 0.300 0.343 0.420 0.370 0.393TNG 0.490 0.420 0.452 0.560 0.470 0.511

PDLDA 0.580 0.500 0.537 0.580 0.510 0.543NTSeg 0.640 0.520 0.574 0.620 0.560 0.588

Computer dataset Science datasetPrecision Recall F-Measure Precision Recall F-Measure

LDSEG 0.390 0.320 0.352 0.330 0.320 0.325PAM 0.540 0.490 0.514 0.368 0.360 0.363

LDACOL 0.550 0.410 0.470 0.200 0.180 0.189TNG 0.550 0.450 0.495 0.340 0.290 0.313

PDLDA 0.590 0.410 0.484 0.380 0.210 0.271NTSeg 0.620 0.570 0.594 0.420 0.380 0.399

Politics dataset Sports dataset


Document Modeling Experiment

NIPS dataset

50 100 150 200

−8.800

−8.700

−8.600

·106

Word-Topics (Z)

Log-

Like

lihoo

d


50 100 150 200

−8.800

−8.700

−8.600

·106

Word-Topics (Z)Lo

g-Li

kelih

ood


Figure : NTSeg ( ) LDSEG ( ), PAM ( ), LDACOL ( ), TNG ( ),and PDLDA ( ).


Document Modeling ExperimentResults

OHSUMED dataset (348,566 medical abstracts)

200 300 400 500−3.300

−3.250

−3.200

−3.150·107

Word-Topics (Z)

Log-

Like

lihoo

d


200 300 400 500−3.300

−3.250

−3.200

−3.150

·107

Word-Topics (Z)

Log-

Like

lihoo

d


Figure : NTSeg ( ) LDSEG ( ), PAM ( ), LDACOL ( ), TNG ( ),and PDLDA ( ).


Concluding Remarks

We have presented a topic segmentation model that:I Maintains the document structure such as paragraphs and

sentencesI Keeps the order of the words intact

We have applied our model in multitudes of text mining tasksI We have obtained good improvement over the state-of-the-art

models

Future Direction: Nonparametric topic segmentation modelWe wish to automatically find out the number of latent topics thatdescribes the collection instead of manually supplying and varying thenumber of latent topics


Acknowledgments

Specially thank SIGIR (the organization) for the Student Travel Awardand also the local organizers for waiving my student registration fee(and for helping me keep busy for about 8 hours during the conferencetime). The experience has been rewarding.


References

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. JMLR, 3,993-1022.

Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. Proc of ICML (pp. 977-984).

Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. (2007). Topics in semanticrepresentation. Psychological review, 114(2), 211.

Wang, X., McCallum, A., and Wei, X. (2007). Topical n-grams: Phrase and topic discovery,with an application to information retrieval. In Proc. of ICDM, (pp. 697-702).

Caballero, K. L., Barajas, J., and Akella, R. (2012). The generalized dirichlet distribution inenhanced topic detection. In Proc. of CIKM, (pp. 773-782).

Riedl, M., and Biemann, C. (2012). Topictiling: a text segmentation algorithm based onLDA. In Proc. of ACL, (pp. 37-42).

Lindsey, R V., William P. H. , and Michael J. S. (2012) A phrase-discovering topic modelusing hierarchical Pitman-Yor processes. In Proc. of EMNLP, (pp. 214-222).

Li, W., and McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models oftopic correlations. In Proc. of ICML (pp. 577-584).


Documents

An Unsupervised Topic Segmentation Model …msjameel/presentations/SIGIR-2013...Topical N-Gram Model (TNG)(Wang et al., ICDM-2007) Some Properties of the model Extends the LDACOL model