Term Frequency Analytics and Document Clustering
Thomas Jones DC NLP Meetup 10/09/2013
Who Am I and What Do I Do?
• Statistician at IDA/Science and Technology Policy Institute (STPI) since January
• Stats/econometrics (professionally) since early 2008
• Former enlisted infantry Marine – (but now I only shoot eigenvalues)
Who is This Talk For?
The Library of Babel
http://www.betaversion.org/~stefano/linotype/news/26/
Frequency Analysis in 3 Steps
1. Data Curation
   a. Remove stop words and other terms/symbols/numbers
   b. Count words/n-grams and re-weight
   c. Calculate distance/similarity between documents
2. Preliminary visualization
   a. Plot a nearest neighbor network
3. Cluster analysis
   a. Choose your favorite algorithm
   b. Find the most frequent terms in a cluster
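Step 1a might look something like the following minimal sketch. The stop word list, the `tokenize` and `ngrams` helpers, and the sample title are all made up for illustration; they are not from the talk.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented example title; the digit and the stop words are removed.
tokens = tokenize("An Act to authorize 2 appointments in the Navy and Marine Corps")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Removing numbers and symbols falls out of the regular expression here; whether that is appropriate depends on the corpus.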
The Document Term Matrix
[Figure: a document term matrix. Each row is an individual document, each column is a term, and each cell holds a raw count.]
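A document term matrix of raw counts can be built in a few lines. This is a sketch under simple assumptions: documents arrive already tokenized, the matrix is a dense list of lists, and the function name `document_term_matrix` and the toy documents are hypothetical.

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a raw-count matrix from a list of token lists.

    Returns (vocab, matrix), where matrix[i][j] is the number of times
    vocab[j] occurs in document i.
    """
    vocab = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Toy documents, invented for illustration.
docs = [
    ["hummus", "hummus", "pita"],
    ["cheeseburgers", "fries", "cheeseburgers"],
    ["hummus", "cheeseburgers"],
]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```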
Texts as Points in Space
[Figure: documents plotted as points in a two-term space. Axes: Hummus and Cheeseburgers, each running from 0 to 12.]
“Distance” Between Documents
[Figure: documents A, B, and C plotted as points in the same two-term space. Axes: Hummus and Cheeseburgers, each running from 0 to 12.]
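To make "distance" concrete, the sketch below compares Euclidean distance with cosine similarity for three points in the two-term space. The coordinates for A, B, and C are invented (the slide's actual values are not recoverable), and the order of the two axes is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical (hummus, cheeseburgers) counts for three documents.
A = (10, 1)    # long document, mostly about hummus
B = (2, 8)     # mostly about cheeseburgers
C = (5, 0.5)   # half the length of A, same mix of terms

for name, (u, v) in {"A-B": (A, B), "A-C": (A, C), "B-C": (B, C)}.items():
    print(name, round(euclidean_distance(u, v), 3), round(cosine_similarity(u, v), 3))
```

A and C point in the same direction, so their cosine similarity is essentially 1 even though they are far apart in Euclidean terms. That insensitivity to document length is one reason cosine similarity is a common choice for text.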
Which Words Contain Information? The TF-IDF Frequency Weights
[Figure: Inverse Document Frequency Weight (y-axis, 0 to 12) plotted against Number of Documents in Which a Term Appears (x-axis, 1 to 1293).]
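One common form of the weight is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The slide does not say which variant or log base it uses, so the natural log below is an assumption, as are the function names and the toy matrix.

```python
import math

def idf_weights(dtm):
    """Inverse document frequency for each column of a raw-count matrix."""
    n_docs = len(dtm)
    n_terms = len(dtm[0])
    weights = []
    for j in range(n_terms):
        # Document frequency: in how many documents does term j appear?
        df = sum(1 for row in dtm if row[j] > 0)
        weights.append(math.log(n_docs / df) if df else 0.0)
    return weights

def tf_idf(dtm):
    """Re-weight raw counts by multiplying each column by its IDF weight."""
    idf = idf_weights(dtm)
    return [[count * w for count, w in zip(row, idf)] for row in dtm]

# Toy matrix: term 0 appears in every document, terms 1 and 2 in one each.
dtm = [
    [2, 1, 0],
    [1, 0, 3],
    [4, 0, 0],
]
weights = idf_weights(dtm)
weighted = tf_idf(dtm)
print(weights)
print(weighted)
```

A term that appears in every document gets a weight of zero, which matches the shape of the curve: the more documents a term appears in, the less it helps tell them apart.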
VISUAL EXPLORATION
Sample Data
• Source: http://www.congressionalbills.org/
• Titles of 5,000 randomly sampled Congressional bills that were signed into law from the 80th to the 112th Congress
• Used for example visuals only, not a thorough analysis
Nearest Neighbor Networks
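A nearest neighbor network links each document to the document(s) most similar to it. The sketch below builds the edge list using cosine similarity; the function names, the `k` parameter, and the toy vectors are assumptions for illustration, and drawing the graph is left to whatever plotting tool is at hand.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbor_edges(vectors, k=1):
    """Return directed edges (i, j) from each document i to its k most similar documents j."""
    edges = []
    for i, u in enumerate(vectors):
        scored = [(cosine_similarity(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; break ties by the lower document index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        edges.extend((i, j) for _, j in scored[:k])
    return edges

# Toy weighted term vectors for four documents.
vectors = [
    [3, 0, 1],
    [2, 0, 1],
    [0, 4, 0],
    [0, 3, 1],
]
edges = nearest_neighbor_edges(vectors)
print(edges)
```

This compares every pair of documents, which is fine for a few thousand titles but grows quadratically with corpus size.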
CLUSTER ANALYSIS
Partitional Clustering
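The talk leaves the choice of algorithm open. As one familiar example of a partitional method, here is a minimal k-means sketch (Lloyd's algorithm) with explicitly supplied starting centers so the result is reproducible. The function name, the starting centers, and the toy points are assumptions; it uses Euclidean distance, so for a cosine-style notion of similarity one option is to scale each document vector to unit length first.

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means from given starting centers. Returns (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy two-dimensional points forming two obvious groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5)]
labels, centers = kmeans(points, centers=[points[0], points[2]])
print(labels)
print(centers)
```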
What Does a Cluster Represent?
• Clusters are groups of documents.
• Documents are grouped around a co-occurrence of terms (TF-IDF)
• Manual inspection of documents augments analyses.
Bills Pertaining to the Navy and Marine Corps
Term          Freq
navy          100
corps         86
medical       44
marine        37
appointments  36
officers      35
army          34
band          29
grade         28
nurse         26
duty          24
permanent     24
united        23
authorize     21
states        21
nurses        19
career        18
estates       18
norfolk       18
held          17
members       16
attendance    16
force         16
air           15
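A table like this one can be produced by summing term counts over the documents assigned to a cluster. The sketch below assumes tokenized documents and one cluster label per document; the function name `top_terms` and the toy data are hypothetical.

```python
from collections import Counter

def top_terms(docs, labels, cluster, n=3):
    """Return the n most frequent (term, count) pairs among documents in one cluster."""
    counts = Counter()
    for tokens, label in zip(docs, labels):
        if label == cluster:
            counts.update(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Toy tokenized titles and cluster assignments, invented for illustration.
docs = [
    ["navy", "corps", "medical"],
    ["navy", "marine", "corps"],
    ["highway", "funding"],
    ["navy", "band"],
]
labels = [0, 0, 1, 0]
print(top_terms(docs, labels, cluster=0, n=2))
```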
Nearest Neighbor Network of Clusters
Cluster Nearest Neighbor Network
TAKE AWAYS
If you remember nothing else…
• Corpus representation = Document Term Matrix
• Frequency measure = Term Frequency Inverse Document Frequency
• Distance/Similarity measure = Cosine similarity
Pro Tips
• Longer documents are more internally heterogeneous and can be more difficult to cluster meaningfully
• Context-specific dictionaries are helpful.
• Dimensionality (i.e. data size) requires thoughtful programming
• Get more clusters than you think you need and then aggregate them after inspection.
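On the dimensionality tip: most cells of a document term matrix are zero, so one approach is to store each document as a sparse mapping of only the terms it contains. This is a sketch of that idea with a hypothetical `sparse_cosine` helper and invented documents, not a description of how the talk's analysis was implemented.

```python
import math
from collections import Counter

def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)

# Each document stores only the terms it actually contains.
doc_a = Counter(["navy", "corps", "navy"])
doc_b = Counter(["navy", "band"])
doc_c = Counter(["highway", "funding"])

print(sparse_cosine(doc_a, doc_b))
print(sparse_cosine(doc_a, doc_c))
```

Memory and time then scale with the number of nonzero entries instead of documents times vocabulary size.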