
CSE217 INTRODUCTION TO DATA SCIENCE

Spring 2019, Marion Neumann

LECTURE 11: TOPIC MODELS

Contents in these slides may be subject to copyright. Some of these materials are derived from Michael Paul, Johns Hopkins University.

RECAP: TEXT DATA

• text does not come as numerical vectors
• requires feature extraction
• typically we want to analyze multiple text documents → corpus of documents

Example document: "Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."

Example word features: great, small, location, friends, …
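A minimal sketch of the feature extraction step, assuming a simple bag-of-words count over the four example words. The tokenizer and the function name `bag_of_words` are illustrative choices, not something the slides prescribe.

```python
import re
from collections import Counter

review = (
    "Same great flavor and friendly service as in the S 18th street location. "
    "This location is not as small but it's hard to talk to friends. "
    "Thankfully there is great outdoor seating to escape the noise."
)

# Assumed vocabulary: the four example words from the recap slide.
vocabulary = ["great", "small", "location", "friends"]

def bag_of_words(text, vocab):
    """Turn free-form text into a numerical vector of word counts."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude lowercase tokenizer
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

features = bag_of_words(review, vocabulary)
print(dict(zip(vocabulary, features)))
```

Applying the same function to every document in a corpus yields one count vector per document.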

MAKING SENSE OF TEXT

→ Suppose you want to learn something about a corpus that’s too big to read…

• What topics are trending today on Twitter?
• What research topics are most active?
• What issues are considered by Congress?
• Are certain topics discussed more in certain languages on Wikipedia?

TOPIC MODELS

• Topic models can help you automatically discover patterns in a corpus → unsupervised learning
• Topic models automatically...
  • group topically-related words in topics
  • associate terms and documents with those topics

What is a topic? → a grouping of words that are likely to appear in the same context


WHAT ARE TOPIC MODELS?

1) associate words (terms) with topics, then
2) associate topics with documents
→ document summarization!

TOPIC MODELS → THE MODEL

• What are the topics and associations?
→ we need to learn them from the corpus
• we need a model first!
→ let’s model documents as a set of words being generated from a set of topics
→ generative ML model
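The generative view can be sketched as follows, assuming each word is produced by first picking a topic and then picking a word from that topic. The two topics, their word probabilities, and the function name `generate_document` are made-up toy values for illustration.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "food": {"flavor": 0.5, "service": 0.3, "seating": 0.2},
    "place": {"location": 0.5, "noise": 0.3, "seating": 0.2},
}

def generate_document(topic_weights, n_words, rng):
    """Generate a document word by word: draw a topic, then draw a word from it."""
    topic_names = list(topic_weights)
    weights = [topic_weights[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=weights)[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"food": 0.8, "place": 0.2}, n_words=8, rng=rng)
print(doc)
```

A document that is mostly about "food" will mostly contain food words, but words from the other topic can still appear.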

TOPIC MODELS → THE MODEL

• probability of each possible word:
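One common way to write the probability of a word in a document is as a mixture over topics, p(w | d) = Σ_k θ_d[k] · β_k[w]. The sketch below assumes this form; the notation (θ_d for a document's topic distribution, β_k for a topic's word distribution) and all numbers are assumptions for illustration.

```python
# Assumed notation: theta_d[k] = P(topic k | document d),
#                   beta[k][w] = P(word w | topic k).
theta_d = {"food": 0.8, "place": 0.2}
beta = {
    "food": {"flavor": 0.5, "service": 0.3, "seating": 0.2},
    "place": {"location": 0.5, "noise": 0.3, "seating": 0.2},
}

def word_probability(word, theta_d, beta):
    """Mix the per-topic word probabilities using the document's topic weights."""
    return sum(theta_d[k] * beta[k].get(word, 0.0) for k in theta_d)

vocab = sorted({w for dist in beta.values() for w in dist})
probs = {w: word_probability(w, theta_d, beta) for w in vocab}
for word, p in probs.items():
    print(f"{word}: {p:.2f}")
```

Because θ_d and every β_k sum to one, the resulting word probabilities also sum to one.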

TOPIC MODELS → LEARNING TASK

• Given: observed words in a corpus
• Task: learn what topic model has generated the data (corpus)
• this means we have to infer the
  • probability distribution over words associated with each topic,
  • the distribution over topics for each document, and
  • the topic responsible for generating each word.

TOPIC MODELS → LEARNING TASK

• Task: learn what topic model has generated the data (corpus)
→ rephrased: how likely is it that our corpus was generated by topic model A (where topic model A is defined by the parameters θ’s and β’s)

this is likelihood maximization!

…we don’t even have the parameters…
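To make "how likely is it that our corpus was generated by topic model A" concrete, here is a sketch that scores two candidate parameter settings by their log-likelihood. It assumes the mixture form p(w | d) = Σ_k θ_d[k] · β_k[w]; the corpus, both models, and the function name are toy assumptions.

```python
import math

def log_likelihood(corpus, thetas, beta):
    """Sum of log p(word | document) over every word in every document."""
    total = 0.0
    for doc, theta_d in zip(corpus, thetas):
        for word in doc:
            p = sum(theta_d[k] * beta[k].get(word, 0.0) for k in theta_d)
            total += math.log(p)
    return total

# Toy corpus: one document about food, one about the place.
corpus = [["flavor", "service", "flavor"], ["location", "noise"]]
beta = {
    "food": {"flavor": 0.6, "service": 0.4},
    "place": {"location": 0.5, "noise": 0.5},
}

# Two hypothetical settings of the per-document topic distributions.
model_a = [{"food": 0.9, "place": 0.1}, {"food": 0.1, "place": 0.9}]
model_b = [{"food": 0.5, "place": 0.5}, {"food": 0.5, "place": 0.5}]

ll_a = log_likelihood(corpus, model_a, beta)
ll_b = log_likelihood(corpus, model_b, beta)
print(f"model A: {ll_a:.3f}, model B: {ll_b:.3f}")
```

Likelihood maximization means searching for the parameters with the highest such score, rather than comparing two hand-picked candidates.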

TOPIC MODELS → LEARNING TASK

• Task: learn what topic model has generated the data (corpus)
→ rephrased: how likely is it that our corpus was generated by topic model A (where topic model A is defined by the parameters θ’s and β’s)
• Solution:
  • Dirichlet prior → Latent Dirichlet Allocation
  • iterative algorithm

we need some more probability and statistics knowledge to derive and understand this
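The slides do not say which iterative algorithm is meant. One widely used option for Latent Dirichlet Allocation is collapsed Gibbs sampling, sketched below under that assumption. The function name `lda_gibbs` and the hyperparameter names and values (`alpha`, `eta`) are illustrative.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)

    # Count tables that summarize the current topic assignments.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start by assigning every word to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta)
                    / (topic_total[t] + n_vocab * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates of the topic and word distributions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    beta = [
        {w: (topic_word[k][w] + eta) / (topic_total[k] + n_vocab * eta) for w in vocab}
        for k in range(n_topics)
    ]
    return theta, beta, assignments

docs = [
    ["flavor", "service", "flavor", "seating"],
    ["location", "noise", "location"],
    ["flavor", "service", "service"],
]
theta, beta, assignments = lda_gibbs(docs, n_topics=2)
print(theta)
```

The three return values correspond to the three things the learning task asks for: the topic distribution of each document, the word distribution of each topic, and the topic assigned to each word.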

TOPIC MODELS – LINEAR ALGEBRA VIEW

• matrix factorization
• singular value decomposition of word-document co-occurrence matrix

we need some more matrix algebra (linear algebra) knowledge to derive and understand this
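As a rough illustration of the linear algebra view, the sketch below finds only the leading singular value and vectors of a tiny word-document count matrix using power iteration. This is a simplified stand-in for a full singular value decomposition, which in practice would come from a library routine such as `numpy.linalg.svd`. The matrix and the function name are toy assumptions.

```python
import math

def leading_singular_triple(A, n_iters=100):
    """Power iteration on A^T A: returns (u, sigma, v) for the largest singular value."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iters):
        u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # A v
        v = [sum(A[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

# Rows are words, columns are documents (toy counts).
words = ["flavor", "service", "location"]
A = [
    [2, 2, 0],  # flavor appears in documents 1 and 2
    [1, 1, 0],  # service appears in documents 1 and 2
    [0, 0, 1],  # location appears only in document 3
]

u, sigma, v = leading_singular_triple(A)
print("sigma:", round(sigma, 4))
print("word loadings:", dict(zip(words, (round(x, 3) for x in u))))
print("document loadings:", [round(x, 3) for x in v])
```

In this toy matrix the leading component puts weight on "flavor" and "service" and on the first two documents, which plays the role of a topic.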

SUMMARY & READING

• A lot of today’s data is free-form text data → huge corpora
• Topic models provide a way of summarizing and organizing text documents.
• Instead of hard-coding topics, we learn them from a given corpus. No labels → unsupervised learning

• [DSFS]
  • Ch9 Getting Data: Scraping the Web (p108-110)
  • Ch20 Natural Language Processing: Topic Modelling (p247-252)
• Your Easy Guide to LDA https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
• Topic Modeling and Latent Dirichlet Allocation (LDA) in Python https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
