
Page 1: topic models - socialdatascience.network

Dirk Hovy

[email protected]

@dirk_hovy

Natural Language Processing

Topic Models

Page 2: topic models - socialdatascience.network

• Understand what information topic models can and cannot provide

• Learn about the Latent Dirichlet Allocation (LDA) model

• Understand the parameters influencing the output

• Learn about the Structured Author Topic Model

• Learn about evaluation criteria

Goals for Today

2

Page 3: topic models - socialdatascience.network

What Gets Funded?

3

Talley et al., 2011

Page 4: topic models - socialdatascience.network

What do People Want in Smart Devices?

4

Price: price, buy, gift, christmas, worth
Integration: light, control, command, system, integration
Sound: speaker, sound, quality, small, music
Accuracy: question, answer, time, response, quick
Skills: music, weather, news, alarm, timer
Fun: fun, family, kid, useful, helpful
Ease of Use: easy, use, set, setup, simple
Music: music, play, song, playlist, favorite

Nguyen & Hovy (2019)

Page 5: topic models - socialdatascience.network

Topics are Word Lists

• "pasta, pizza, wine, sauce, spaghetti"

• "BLEU, Bert, encoder, decoder, transformer"

5

TOPIC OR NOT?

SOME DOMAIN KNOWLEDGE REQUIRED…

Page 6: topic models - socialdatascience.network

How to use Topic Models

6

CORPUS → MODEL → TOPICS → DESCRIPTORS

- CORPUS: preprocess
- MODEL: find best #topics, find best parameters
- TOPICS: check output, choose top 5 words
- DESCRIPTORS: name

[pasta, pizza, wine, sauce, spaghetti]

FOOD

Page 7: topic models - socialdatascience.network

Preprocessing

Page 8: topic models - socialdatascience.network

Preprocessing

• Be aggressive:

• lemmatize

• remove stopwords

• replace numbers/user names

• join collocations

• use TF-IDF

• use a minimum document frequency of 10, 20, 50, or even 100

• use a maximum document frequency of 50%, down to 10%

8
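The aggressive preprocessing above can be sketched in plain Python. This is a minimal toy with a hand-made stopword list; a real pipeline would use a library such as spaCy or gensim for lemmatization and collocations.

```python
import re

# Toy stopword list for the running example (a real list would be much longer).
STOPWORDS = {"i", "ve", "been", "in", "but", "didn", "t", "it",
             "have", "be", "do", "not"}

def preprocess(text):
    """Aggressive preprocessing: lowercase, mask digits, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d", "0", text)        # normalize numbers: 2011 -> 0000
    tokens = re.findall(r"[a-z0]+", text)  # crude tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("I've been in New York in 2011, but didn't like it."))
# -> ['new', 'york', '0000', 'like']
```

This reproduces the running "New York" example from the following slides, minus POS filtering and collocation joining.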

Page 9: topic models - socialdatascience.network

Pre-processing steps

9

<div id="text">I've been in New York in 2011, but didn’t like it. I preferred Los Angeles.</div>

GOAL: MINIMIZE VARIATION

Page 10: topic models - socialdatascience.network

Pre-processing steps

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

10

I've been in New York in 2011, but didn’t like it. I preferred Los Angeles.

Page 11: topic models - socialdatascience.network

Pre-processing steps

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

11

I've been in New York in 2011, but didn’t like it.

I preferred Los Angeles.

Page 12: topic models - socialdatascience.network

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

12

I ’ve been in New York in 2011 , but did n’t like it .

I preferred Los Angeles .

Pre-processing steps

Page 13: topic models - socialdatascience.network

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

13

i ‘ve been in new york in 0000 , but did n’t like it .

i preferred los angeles .

Pre-processing steps

Page 14: topic models - socialdatascience.network

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

14

i have be in new york in 0000 , but do not like it .

i prefer los angeles .

Pre-processing steps

Page 15: topic models - socialdatascience.network

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

15

i new york 0000 , like .

i prefer los angeles .

Pre-processing steps

Page 16: topic models - socialdatascience.network

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

16

new york 0000 like

prefer los angeles

CONTENT = (NOUN, VERB, NUM)

Pre-processing steps

Page 17: topic models - socialdatascience.network

• Remove formatting (e.g. HTML)

• Segment sentences

• Tokenize words

• Normalize words

• numbers

• lemmas vs. stems

• Remove unwanted words

• stopwords

• content words (use POS tagging!)

• join collocations

17

new_york 0000 like

prefer los_angeles

Pre-processing steps

Page 18: topic models - socialdatascience.network

18

new_york 0000 like

prefer los_angeles

<div id="text">I've been in New York in 2011, but didn’t like it. I preferred Los Angeles.</div>

"BAG OF WORDS"

MINIMAL VARIATION

Pre-processing steps

Page 19: topic models - socialdatascience.network

Representing Text

Page 20: topic models - socialdatascience.network

N-grams

20

"As Gregor Samsa awoke one morning from uneasy dreams, he found himself transformed in his bed into a gigantic insect-like creature."

As, Gregor, Samsa, awoke, one, morning, from, uneasy, dreams, ...

As_Gregor, Gregor_Samsa, Samsa_awoke, awoke_one, one_morning, ...

As_Gregor_Samsa, Gregor_Samsa_awoke, Samsa_awoke_one, awoke_one_morning, ...

As_Gregor_Samsa_awoke, Gregor_Samsa_awoke_one, Samsa_awoke_one_morning, ...

Bigrams

Trigrams

4-grams

Unigrams
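The four variants above differ only in the window size n, so one helper generates all of them (a sketch, not the deck's code):

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "As Gregor Samsa awoke one morning".split()
print(ngrams(words, 1))  # unigrams: ['As', 'Gregor', ...]
print(ngrams(words, 2))  # bigrams:  ['As_Gregor', 'Gregor_Samsa', ...]
print(ngrams(words, 3))  # trigrams: ['As_Gregor_Samsa', ...]
```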

Page 21: topic models - socialdatascience.network

Bags of words (BOW)

21

COUNT WORDS

{ 'shakespeare': 6, 'in': 20, 'love': 6, 'is': ... }

VECTORIZE FEATURES

      shakespeare   …   in   love   …   beer
X = [      6        …   20     0    …     0  ]
    [      0        …    6     0    …     0  ]
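Counting and vectorizing can be sketched with a Counter over a fixed vocabulary (the vocabulary here is invented for illustration; libraries such as scikit-learn's CountVectorizer do this at scale):

```python
from collections import Counter

def vectorize(tokens, vocab):
    """Turn a token list into a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["shakespeare", "in", "love", "beer"]
doc = ["shakespeare", "in", "love", "in"]
print(vectorize(doc, vocab))  # -> [1, 2, 1, 0]
```

Words outside the vocabulary are silently dropped, which is exactly what the min/max document-frequency cutoffs from the preprocessing slides produce.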

Page 22: topic models - socialdatascience.network

Counting Trouble

22

…AND A MAN NAMED ZIPF

[Figure: a handful of very frequent words make up 50% of all tokens; THE OTHER 50% is a long tail of rare words]

Page 23: topic models - socialdatascience.network

Finding Important Words: TF-IDF

Page 24: topic models - socialdatascience.network

Some Words are Just More Interesting…

24

[Figure: word cloud of many "the" tokens and just two "sustainable" tokens — the most frequent words are not the most informative ones]

Page 25: topic models - socialdatascience.network

Karen Spärck Jones

1935–2007

• Became a teacher before starting CS career at Cambridge

• Laid the foundation for modern NLP, Google Search, text classification

• Campaigned for more women in CS

• Namesake of prestigious CS prize

Page 26: topic models - socialdatascience.network

Problems with Term Frequency

26

[Figure: word frequency spectrum — very frequent words are BORING STUFF, very rare words are OBSCURE STUFF, and the JUST RIGHT words sit in between]

TOLD YA SO…

Page 27: topic models - socialdatascience.network

Document and Term Frequency

27

A feature word w appears 2, 5, 1, and 1 times in four documents, and not at all in a fifth:

TERM FREQUENCY (SUM): TF(w) = 2 + 5 + 1 + 1 = 9

DOCUMENT FREQUENCY (COUNT): df(w) = 4

IDF(w) = log(N / df(w)), with N the number of documents

Page 28: topic models - socialdatascience.network

Putting it Together

28

TFIDF(w) = TF(w) · log(N / df(w))

HOW OFTEN WE SAW THE WORD, ADJUSTED BY HOW MANY DOCUMENTS IT APPEARS IN
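The formula above, computed directly over tokenized documents (a minimal sketch; real implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization):

```python
import math

def tfidf(word, docs):
    """TFIDF(w) = TF(w) * log(N / df(w)) over a list of tokenized docs."""
    N = len(docs)
    tf = sum(doc.count(word) for doc in docs)   # total occurrences
    df = sum(1 for doc in docs if word in doc)  # documents containing word
    return tf * math.log(N / df)

docs = [["whale", "sea", "whale"], ["sea", "ship"], ["ship"]]
print(round(tfidf("whale", docs), 3))  # -> 2.197
```

"whale" occurs twice but only in one of three documents, so TF = 2 and IDF = log(3/1); "sea" occurs in two documents and gets a smaller IDF.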

Page 29: topic models - socialdatascience.network

Document and Term Frequency

29

Words in "Moby Dick" by tf-idf:

word      tf     idf       tfidf
ye        467    4.257380  148.497079
chapter   171    5.039475  147.504638
whale     1150   3.262357  139.755743
man       525    3.982412  106.932953
ahab      511    4.019453  103.357774

Page 30: topic models - socialdatascience.network

Latent Dirichlet Allocation

Page 31: topic models - socialdatascience.network

How to Generate Documents

• Draw a topic distribution 𝜃

• For i in N:

• Draw a topic z from 𝜃

• Sample a word from topic z's word distribution

31

𝜃 = [0.14, 0.29, 0.29, 0.14, 0.14]
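LDA's generative story in code (a toy sketch; the two topic-word lists are invented for illustration, echoing the food and NLP topics from earlier slides):

```python
import random

# Invented toy topics; a trained model would learn these word distributions.
TOPIC_WORDS = [
    ["pasta", "pizza", "wine", "sauce", "spaghetti"],  # topic 0: food
    ["encoder", "decoder", "transformer", "bert"],     # topic 1: NLP
]

def generate_doc(theta, n_words, rng=random):
    """For each word position: draw a topic z from theta, then a word from z."""
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # topic from 𝜃
        doc.append(rng.choice(TOPIC_WORDS[z]))                # word from topic z
    return doc

print(generate_doc([0.8, 0.2], 5))  # mostly food words, some NLP words
```

Training LDA runs this story in reverse: given only the documents, infer the 𝜃 and topic-word distributions that most plausibly generated them.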

Page 32: topic models - socialdatascience.network

Topics per Document

32

𝜃 = P(topic|document):

             Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
Document 1    0.04     0.65     0.13     0.13     0.04
Document 2    0.14     0.29     0.29     0.14     0.14
Document 3    0.17     0.33     0.17     0.17     0.17
Document 4    0.20     0.07     0.07     0.20     0.47
Document N    0.79     0.04     0.04     0.11     0.04

Page 33: topic models - socialdatascience.network

Words per Topic

33

[Figure: one word distribution per topic (Topics 1–5); the highest-probability words per topic form the TOPIC DESCRIPTORS]

z = P(word|topic)

Page 34: topic models - socialdatascience.network

Plate Notation

34

[Plate diagram: 𝛂 → 𝜃ᵢ → zᵢⱼ → wᵢⱼ ← 𝛃, with word variables repeated N times (#words) inside a plate repeated M times (#documents)]

HYPERPARAMETERS:

• 𝛂 — how many topics per document? (controls the topic distribution 𝜃)

• 𝛃 — how specific are words to topics? (controls the word distribution)

Page 35: topic models - socialdatascience.network

Dirichlet Distributions

35

"DISTRIBUTION GENERATOR"

PEAKED PARETO UNIFORM
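A Dirichlet really is a "distribution generator": every draw is itself a probability vector. It can be sampled with the standard gamma trick using only the standard library (numpy.random.dirichlet does the same thing):

```python
import random

def sample_dirichlet(alpha, k, rng=random):
    """One draw from a symmetric Dirichlet(alpha) over k categories:
    draw k Gamma(alpha, 1) variates and normalize them to sum to 1."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(1)
print(sample_dirichlet(0.01, 5))  # peaked: almost all mass on one entry
print(sample_dirichlet(1000, 5))  # near-uniform: every entry close to 0.2
```

Small alpha gives peaked draws, large alpha near-uniform ones, which is exactly what the 𝛂 and 𝛃 slides that follow illustrate.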

Page 36: topic models - socialdatascience.network

Parameters: 𝛂

36

𝛂 = 0.01 → MORE PEAKED: ONE DOMINANT TOPIC PER DOCUMENT, e.g. [0.79, 0.04, 0.04, 0.11, 0.04]

𝛂 = 1000 → MORE UNIFORM: EVERY TOPIC IN EVERY DOCUMENT, e.g. [0.19, 0.21, 0.20, 0.19, 0.21]

Page 37: topic models - socialdatascience.network

Parameters: 𝛃

37

𝛃 = 0.01 → WORDS ARE HIGHLY TOPIC-SPECIFIC

𝛃 = 1000 → ALL WORDS FOR ALL TOPICS

Page 38: topic models - socialdatascience.network

Training and Parameters

Page 39: topic models - socialdatascience.network

Evaluating LDA

39

MODEL-INHERENT: PERPLEXITY

• FIT the model, then PREDICT held-out documents P(T)

• PERPLEXITY = 2^(−Σₓ p(x) log₂ p(x))

CONTENT-BASED: WORD INTRUSION

• [apple, banana, pear, lime, orange]

• [apple, banana, foot, lime, orange]

WHICH ONE'S WRONG?
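The model-inherent score follows directly from the formula above, assuming base-2 logs so that a uniform distribution over k outcomes has perplexity exactly k:

```python
import math

def perplexity(probs):
    """2 raised to the entropy of a predictive distribution p(x)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0 (uniform over 4: maximal confusion)
print(perplexity([1.0, 0.0, 0.0, 0.0]))      # -> 1.0 (no uncertainty at all)
```

Lower perplexity on held-out documents means the model is less "surprised" by unseen text, but as the later slides stress, a lower-perplexity model does not guarantee more interpretable topics.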

Page 40: topic models - socialdatascience.network

Parameters: K

40

[Figure: EVALUATION CRITERION plotted against the number of topics, K = 5, 10, 15, 20]

PICK THE LOWEST K WITH THE BEST SCORE (here: K=8)

Page 41: topic models - socialdatascience.network

Coherence Scores

41

LOG PROB OF WORD CO-OCCURRENCES

NORMALIZED PMI AND COSINE SIMILARITY
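The normalized-PMI idea can be sketched for a single word pair, scoring how often a topic's top words co-occur in documents, normalized to [−1, 1]. This assumes simple document-level co-occurrence; gensim's CoherenceModel implements the full coherence measures over sliding windows.

```python
import math

def npmi(w1, w2, docs):
    """Normalized PMI of two words from document co-occurrence counts."""
    N = len(docs)
    p1 = sum(w1 in d for d in docs) / N
    p2 = sum(w2 in d for d in docs) / N
    p12 = sum((w1 in d) and (w2 in d) for d in docs) / N
    if p12 == 0:
        return -1.0  # the words never co-occur
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

docs = [{"pasta", "pizza"}, {"pasta", "pizza"}, {"encoder"}]
print(npmi("pasta", "pizza", docs))   # ≈ 1.0: always together
print(npmi("pasta", "encoder", docs)) # -1.0: never together
```

A topic's coherence score averages this over all pairs of its top descriptor words.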

Page 42: topic models - socialdatascience.network

Word and Topic Intrusion

42 Slide credit: Hanh Nguyen

WORD INTRUSION

TOPIC INTRUSION

Page 43: topic models - socialdatascience.network

Adding Constraints

• Maybe we know which words go with a topic

• Fix some probabilities/add smoothing

43

Page 44: topic models - socialdatascience.network

Author Topic Models

• Learn separate topic distributions for external factors

44

TOPICS BY COUNTRY

Page 45: topic models - socialdatascience.network

Plate Notation

45

[Plate diagram: standard LDA for comparison — 𝛂 → 𝜃ᵢ → zᵢⱼ → wᵢⱼ ← 𝛃; N words per document, M documents; 𝛂 controls how many topics per document]

Page 46: topic models - socialdatascience.network

Plate Notation

46

[Plate diagram: author-topic model — 𝛂 → per-author topic distributions 𝜃 (plate of T authors); each document i has an author variable aᵢ, each word position j picks an author xᵢⱼ, then a topic zᵢⱼ, then a word wᵢⱼ ← 𝛃; inner plate N (#words), outer plate M (#documents)]

Page 47: topic models - socialdatascience.network

Wrapping Up

Page 48: topic models - socialdatascience.network

How to use Topic Models

48

CORPUS → MODEL → TOPICS → DESCRIPTORS

- CORPUS: preprocess
- MODEL: find best #topics, find best parameters
- TOPICS: check output, choose top 5 words
- DESCRIPTORS: name

[pasta, pizza, wine, sauce, spaghetti]

FOOD

Page 49: topic models - socialdatascience.network

Caveats!

Topic models ALWAYS need manual assessment, because:

• Random initialization: no two models are the same!

• More likely models ≠ more interpretable topics

• "Interpretable" is subjective

• Topics are not stable from run to run

49

NEVER USE TOPICS AS INPUT TO REGRESSION!

Page 50: topic models - socialdatascience.network

• LDA is one architecture for topic models

• Model document generation conditioned on latent topics

• Topic models are stochastic: each run is different

• Preprocessing and parameters influence performance

• Results need to be interpreted!

• We can introduce constraints through priors or labels

Take-Home Points

50

Page 51: topic models - socialdatascience.network

To Neural and Beyond

51

https://github.com/MilaNLProc/contextualized-topic-models

• Based on neural networks: better coherence

• cross-lingual: train in one language, use in others

• add supervision: use document labels (similar to author topics)

Sentence → Topic

EN: Blackmore’s Night is a British/American traditional folk rock duo [...] → rock, band, bass, formed
IT: Blackmore’s Night sono la band fondatrice del renaissance rock [...] → rock, band, bass, formed
PT: Blackmore’s Night é uma banda de folk rock de estilo [...] → rock, band, bass, formed

EN: Langton’s ant is a two-dimensional Turing machine with [...] → math, theory, space, numbers
FR: On nomme fourmi de Langton un automate cellulaire [...] → math, theory, space, numbers
DE: Die Ameise ist eine Turingmaschine mit einem [...] → math, theory, space, numbers