31
Modeling Syntactic Structures of Topics with a Nested HMMLDA Jing Jiang Singapore Management University Singapore Management University

of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Modeling Syntactic Structures of Topics with a Nested HMM‐LDA

Jing Jiang

Singapore Management UniversitySingapore Management University

Page 2: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Topic Models• A generative model for discovering hidden topics from 

documents

• Topics are represented as word distributions• Topics are represented as word distributions

This makes full synchronyof activated units thedefault condition in themodel cortex, as in Browns model [Brown ands model [Brown andCooke, 1996], so that the background activation ish d b dcoherent, and can be read

into high order corticallevels which synchronize 

12/7/2009 ICDM 2009 2

ywith it.

Page 3: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

How to Interpret Topics?• List the top‐K frequent words

– Not easy to interpret

– How are the top words related to each other?

• Our method: model the syntactic structures ofOur method: model the syntactic structures of topics using a combination of hidden Markov models (HMM) and topic models (LDA)models (HMM) and topic models (LDA)

• A preliminary solution towards meaningful representations of topicsrepresentations of topics

12/7/2009 ICDM 2009 3

Page 4: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Related Work on Syntactic LDA• Similar to / based on [Griffiths et al. 05]

– More general, with multiple “semantic classes”

• [Boyd‐Graber & Blei 09]– Combines parse trees with LDACombines parse trees with LDA

– Expensive to obtain parse trees for large text collections

• [Gruber et al. 07]– Combines HMM with LDA– Combines HMM with LDA

– Does not model syntax

12/7/2009 ICDM 2009 4

Page 5: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

HMM to Model Syntax • In natural language sentences, the syntactic class of a word occurrence (noun, verb, adjective, adverb, preposition, etc.) depends on its context

• Transitions between syntactic classes follow some structure

• HMMs can be used to model these transitions– HMM‐based part‐of‐speech tagger– HMM‐based part‐of‐speech tagger

12/7/2009 ICDM 2009 5

Page 6: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Overview of Our Model• Assumptions

– A topic is represented as an HMM

– C Content states: convey semantic meanings of topics (likely to be nouns, verbs, adjectives, etc.)

– F Functional states: serve linguistic functions (e.g. prepositions and articles)

• Word distributions of these functional states are shared among topics

Each document has a mixture of topics– Each document has a mixture of topics

– Each sentence is generated from a single topic

12/7/2009 ICDM 2009 6

Page 7: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Overview of Our Model

12/7/2009 ICDM 2009 7

Page 8: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Overview of Our ModelTopics

12/7/2009 ICDM 2009 8

Page 9: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Overview of Our ModelTopics States

12/7/2009 ICDM 2009 9

Page 10: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Overview of Our ModelTopics States

12/7/2009 ICDM 2009 10

Page 11: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

The n‐HMM‐LDA Model

12/7/2009 ICDM 2009 11

Page 12: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Document Generation Process

Sample topics and transition probabilities

12/7/2009 ICDM 2009 12

Page 13: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Document Generation Process

Sample a topic distribution for the document (same as in LDA)

12/7/2009 ICDM 2009 13

Page 14: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Document Generation Process

Sample a topic for a sentence

12/7/2009 ICDM 2009 14

Page 15: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Document Generation Process

Generate the words in the sentence using the HMMsentence using the HMM corresponding to this topic

12/7/2009 ICDM 2009 15

Page 16: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Variations• Transition probabilities between states can be either topic‐specific (left) or shared by all topics (right)

12/7/2009 ICDM 2009 16

Page 17: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Model Inference: Gibbs Sampling• Sample a topic for a sentence

• Sample a state for a word

12/7/2009 ICDM 2009 17

Page 18: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Experiments – Data Sets• NIPS publications (downloaded from http://nips.djvuzone.org/txt.html)

• Reuters‐21578

Date Sets

NIPS Publications* Reuters‐21578

Vocabulary 18,864 10,739

Words 5,305,230 1,460,666

documents for training 1314 8052documents for training 1314 8052

documents for testing 618 2665

12/7/2009 ICDM 2009 18

Page 19: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Quantitative Evaluation• Perplexity: a commonly used metric for the generalization power of language models

• For a test document, observe the first Ksentences and predict the remainingsentences and predict the remaining sentences

12/7/2009 ICDM 2009 19

Page 20: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

LDA vs. LDA‐s• LDA‐s: n‐HMM‐LDA with a single state for each HMM.– Same as standard LDA with each sentence having a single topic 

NIPS Reuters

One topic per sentence assumption helps

12/7/2009 ICDM 2009 20

One‐topic‐per‐sentence assumption helps.

Page 21: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

HMM• Achieves much lower perplexity, but cannot be used to discover topics

NIPS

12/7/2009 ICDM 2009 21

NIPS

Page 22: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Increase Number of Functional States• Fixing the number of content states to 1 and the number of topics to 40

NIPS ReutersNIPS Reuters

More functional states decreases perplexity

12/7/2009 ICDM 2009 22

More functional states decreases perplexity.

Page 23: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Qualitative Evaluation• Use the top frequent words to represent a topic/state

12/7/2009 ICDM 2009 23

Page 24: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Sample Topics/States from LDA/HMM

12/7/2009 ICDM 2009 24

NIPS

Page 25: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Sample States from n‐HMM‐LDA‐d

12/7/2009 ICDM 2009 25

NIPS

Page 26: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Different Content States

12/7/2009 ICDM 2009 26

Page 27: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Case Study (LDA)

This makes full synchronyof activated units the the

f

isthethatdefault condition in the

model cortex, as in Browns model [Brown and

ofasignaland

thatbearecan

oftheins model [ rown and

Cooke, 1996], so that the background activation iscoherent and can be read

andtoinfrequency

canoneitto

andcellsto

coherent, and can be readinto high order corticallevels which synchronize 

frequencyis*

for cellmodelaresponsewith it. response

12/7/2009 ICDM 2009 27

Page 28: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Case Study (n‐HMM‐LDA)

This makes full synchrony of activated units the  cells

ll

*receptivesynapticdefault condition in the 

model cortex, as in Brown s model [Brown and 

cell*neuronsfield

synapticinhibitoryheadexcitatorysmodel [ rown and

Cooke, 1996], so that the background activation is coherent and can be read

fieldinputresponsemodel

excitatorydirectioncellvisualcoherent, and can be read 

into high order corticallevels which synchronize 

modelactivitysynapses

pyramidal

with it.

12/7/2009 ICDM 2009 28

Page 29: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Case Study (Comparison)

This makes full synchrony of activated units the 

This makes full synchronyof activated units the

default condition in the model cortex, as in Brown s model [Brown and 

default condition in themodel cortex, as in Browns model [Brown and smodel [ rown and

Cooke, 1996], so that the background activation is coherent and can be read

s model [ rown andCooke, 1996], so that the background activation iscoherent and can be read coherent, and can be read 

into high order corticallevels which synchronize 

coherent, and can be readinto high order corticallevels which synchronize 

with it.with it.

LDA n‐HMM‐LDA

12/7/2009 ICDM 2009 29

LDA

Page 30: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Conclusion• We proposed a nested‐HMM‐LDA to model the syntactic structures of topics– Extension of [Griffiths et al. 05]

• Experiments on two data sets show thatp– The model achieves perplexity between LDA and HMM

– The model can provide more insights into the structures of topics than standard LDA

12/7/2009 ICDM 2009 30

Page 31: of Topics with a Nested HMM LDA09.pdf · 2010-06-30 · • A generative model for discovering hidden topics from documents • Topics areare representedrepresented asas wordword

Thank You!• Questions?

12/7/2009 ICDM 2009 31