CSE217 INTRODUCTION TO DATA SCIENCE
Spring 2019
Marion Neumann
LECTURE 11: TOPIC MODELS
Contents in these slides may be subject to copyright. Some of these materials are derived from Michael Paul, Johns Hopkins University.
RECAP: TEXT DATA
• text does not come as numerical vectors
• requires feature extraction
• typically we want to analyze multiple text documents → corpus of documents
Example document: "Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."
Example features (words): great, small, location, friends, …
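A minimal sketch of the feature extraction step, assuming a simple bag-of-words representation (one word count per vocabulary entry). The helper names `tokenize` and `bag_of_words` and the four-word vocabulary are illustrative assumptions, not part of the slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

review = (
    "Same great flavor and friendly service as in the S 18th street "
    "location. This location is not as small but it's hard to talk to "
    "friends. Thankfully there is great outdoor seating to escape the noise."
)

# Assumed toy vocabulary; a real one would be built from the whole corpus.
vocabulary = ["great", "small", "location", "friends"]
vector = bag_of_words(review, vocabulary)
print(dict(zip(vocabulary, vector)))
# {'great': 2, 'small': 1, 'location': 2, 'friends': 1}
```

Note that "friendly" is not counted as "friends": this sketch matches whole tokens only and does no stemming.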
MAKING SENSE OF TEXT
→ Suppose you want to learn something about a corpus that’s too big to read…
• What topics are trending today on Twitter?
• What research topics are most active?
• What issues are considered by Congress?
• Are certain topics discussed more in certain languages on Wikipedia?
TOPIC MODELS
• Topic models can help you automatically discover patterns in a corpus → unsupervised learning
• Topic models automatically...
  • group topically-related words in topics
  • associate terms and documents with those topics
What is a topic? → a grouping of words that are likely to appear in the same context
WHAT ARE TOPIC MODELS?
1) associate words (terms) with topics, then
2) associate topics with documents
→ document summarization!
TOPIC MODELS → THE MODEL
• What are the topics and associations?
  → we need to learn them from the corpus
• we need a model first!
  → let’s model documents as a set of words being generated from a set of topics
  → generative ML model
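The generative idea can be sketched as follows: for every word position, first pick a topic from the document's topic mixture, then pick a word from that topic's word distribution. The two toy topics, their probabilities, and the function names are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "food": {"flavor": 0.5, "service": 0.3, "great": 0.2},
    "place": {"location": 0.4, "seating": 0.3, "noise": 0.2, "great": 0.1},
}

def sample(distribution, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_document(topic_mixture, topics, length, rng):
    """Generate words by picking a topic per word, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = sample(topic_mixture, rng)        # which topic generates this word?
        words.append(sample(topics[topic], rng))  # which word does that topic emit?
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"food": 0.8, "place": 0.2}, topics, 10, rng)
print(doc)
```

Learning a topic model means running this story backwards: only the words are observed, and the topics and mixtures have to be inferred.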
• probability of each possible word:
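A standard way to write this probability for a mixture-of-topics model is shown below. The notation is an assumption made here for illustration: K is the number of topics, z is the topic that generates a word, θ with indices (d, k) is the probability of topic k in document d, and β with indices (k, w) is the probability of word w under topic k.

```latex
p(w \mid d)
  = \sum_{k=1}^{K} p(z = k \mid d)\, p(w \mid z = k)
  = \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,w}
```

Each term multiplies "how much document d is about topic k" by "how likely topic k is to produce word w", and the sum runs over all topics.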
TOPIC MODELS → LEARNING TASK
• Given: observed words in a corpus
• Task: learn what topic model has generated the data (corpus)
• this means we have to infer
  • the probability distribution over words associated with each topic,
  • the distribution over topics for each document, and
  • the topic responsible for generating each word.
• Task: learn what topic model has generated the data (corpus)
  → rephrased: how likely is it that our corpus was generated by topic model A (where topic model A is defined by the parameters θ’s and β’s)?
→ this is likelihood maximization!
… but we don’t even have the parameters …
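To make "how likely is it that our corpus was generated by topic model A" concrete, here is a small sketch that scores a corpus under two candidate parameter settings. It assumes the layout `theta[d][k]` (topic k's share in document d) and `beta[k][w]` (word w's probability under topic k); the function name and all numbers are hypothetical toy values.

```python
import math

def log_likelihood(corpus, theta, beta):
    """Sum of log p(word | document) over every word occurrence.

    corpus: list of documents, each a list of words
    theta:  theta[d][k] = probability of topic k in document d
    beta:   beta[k][w]  = probability of word w under topic k
    Assumes every word has non-zero probability under the model.
    """
    total = 0.0
    for d, document in enumerate(corpus):
        for word in document:
            p_word = sum(theta[d][k] * beta[k].get(word, 0.0)
                         for k in range(len(beta)))
            total += math.log(p_word)
    return total

corpus = [["great", "flavor", "great"], ["location", "noise"]]
beta = [
    {"great": 0.5, "flavor": 0.5},                   # topic 0: food
    {"location": 0.5, "noise": 0.4, "great": 0.1},   # topic 1: place
]
theta_a = [[0.9, 0.1], [0.1, 0.9]]  # model A: doc 0 is food, doc 1 is place
theta_b = [[0.1, 0.9], [0.9, 0.1]]  # model B: the other way round

print(log_likelihood(corpus, theta_a, beta))
print(log_likelihood(corpus, theta_b, beta))
```

Model A gives the higher log-likelihood because it matches each document to the topic that actually explains its words. Likelihood maximization searches for the parameters with the highest such score.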
• Solution:
  • Dirichlet prior → Latent Dirichlet Allocation
  • iterative algorithm
we need some more probability and statistics knowledge to derive and understand this
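The slides do not say which iterative algorithm is meant. One common choice for Latent Dirichlet Allocation is collapsed Gibbs sampling, sketched below in plain Python. The function name `lda_gibbs`, the prior values `alpha` and `eta`, and the toy documents are assumptions for illustration, not the course's implementation.

```python
import random

def lda_gibbs(documents, num_topics, iterations=200, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in documents for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in documents]                   # words in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                                      # total words assigned to topic k

    # Start from a random topic assignment for every word.
    assignments = []
    for d, doc in enumerate(documents):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(documents):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of each topic: (topic share in doc) * (word share in topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta) / (topic_total[t] + vocab_size * eta)
                    for t in range(num_topics)
                ]
                # Resample the topic and add the word back to the counts.
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, assignments

documents = [
    ["pizza", "pasta", "flavor", "pizza"],
    ["pasta", "flavor", "service"],
    ["congress", "vote", "bill", "vote"],
    ["bill", "congress", "senate"],
]
doc_topic, topic_word, assignments = lda_gibbs(documents, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top)
```

The returned counts play the role of the three quantities to infer: `topic_word` approximates each topic's word distribution, `doc_topic` each document's topic distribution, and `assignments` the topic responsible for each word.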
TOPIC MODELS – LINEAR ALGEBRA VIEW
• matrix factorization
• singular value decomposition of word-document co-occurrence matrix
we need some more matrix algebra (linear algebra) knowledge to derive and understand this
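As a rough illustration of the linear algebra view, the sketch below finds only the largest singular value of a tiny word-document count matrix, together with its word and document vectors, using power iteration. The matrix, the word list, and the function name are hypothetical. In practice a library routine such as `numpy.linalg.svd` would compute the full decomposition.

```python
import math

def leading_singular_triple(matrix, iterations=100):
    """Power iteration for the largest singular value and its two vectors.

    matrix is a list of rows (words) by columns (documents); assumed non-zero.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # start from a uniform document vector
    for _ in range(iterations):
        # u = A v (project documents onto words)
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # v = A^T u, normalised (project words back onto documents)
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

words = ["pizza", "pasta", "vote"]
# Rows are words, columns are three toy documents (assumed counts).
counts = [
    [2, 1, 0],
    [1, 2, 0],
    [0, 0, 1],
]
u, sigma, v = leading_singular_triple(counts)
print(round(sigma, 6))
print({w: round(x, 3) for w, x in zip(words, u)})
```

Here the leading word vector puts its weight on "pizza" and "pasta" and almost none on "vote", so the strongest factor behaves like a food topic shared by the first two documents.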
SUMMARY & READING
• A lot of today’s data is free-form text data → huge corpora
• Topic models provide a way of summarizing and organizing text documents.
• Instead of hard-coding topics, we learn them from a given corpus. No labels → unsupervised learning
• [DSFS]
  • Ch9 Getting Data: Scraping the Web (p108-110)
  • Ch20 Natural Language Processing: Topic Modelling (p247-252)
• Your Easy Guide to LDA https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
• Topic Modeling and Latent Dirichlet Allocation (LDA) in Python https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24