Term Frequency Analytics and Document Clustering

Thomas Jones, DC NLP Meetup, 10/09/2013

Who Am I and What Do I Do?

• Statistician at IDA/Science and Technology Policy Institute (STPI) since January

• Stats/econometrics (professionally) since early 2008

• Former enlisted infantry Marine (but now I only shoot eigenvalues)

Who  is  This  Talk  For?  

The  Library  of  Babel  

http://www.betaversion.org/~stefano/linotype/news/26/

Frequency  Analysis  in  3  Steps  

1. Data curation
   a. Remove stop words and other terms/symbols/numbers
   b. Count words/n-grams and re-weight
   c. Calculate distance/similarity between documents

2. Preliminary visualization
   a. Plot a nearest neighbor network

3. Cluster analysis
   a. Choose your favorite algorithm
   b. Find the most frequent terms in a cluster
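Steps 1a and 1b might look like the following minimal sketch. The stop word list, the tokenizing rule, and the function names are illustrative assumptions, not the code behind the talk.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real project would use a longer, domain-aware one.
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops symbols and numbers)."""
    return re.findall(r"[a-z]+", text.lower())

def count_terms(text, stop_words=STOP_WORDS):
    """Count the tokens that survive stop word removal."""
    return Counter(tok for tok in tokenize(text) if tok not in stop_words)

# Made-up bill title, for illustration only.
counts = count_terms("An act to authorize 2 appointments of officers to the Navy and the Marine Corps")
print(counts.most_common(3))
```

Counting n-grams instead of single words would only change `tokenize`; the rest stays the same.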

The  Document  Term  Matrix  

Figure: a document term matrix. Each row is an individual document, each column is a term, and each cell holds a raw count.
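A document term matrix can be built in a few lines. This is a sketch under simple assumptions (whitespace tokenizing, a dense list-of-lists matrix, made-up documents); a corpus of realistic size would need a sparse representation.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the raw count of vocabulary[j] in docs[i]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Three made-up documents, one row each.
docs = ["navy medical corps", "marine corps band", "navy band navy"]
vocabulary, dtm = document_term_matrix(docs)
print(vocabulary)  # ['band', 'corps', 'marine', 'medical', 'navy']
for row in dtm:
    print(row)     # [0, 1, 0, 1, 1] then [1, 1, 1, 0, 0] then [1, 0, 0, 0, 2]
```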

Texts  as  Points  in  Space  

Figure: documents plotted as points on two term axes, Hummus and Cheeseburgers, each running from 0 to 12.

“Distance”  Between  Documents  

Figure: documents A, B, and C plotted as points on the Hummus and Cheeseburgers axes, each running from 0 to 12.
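Two common ways to measure how far apart documents are, sketched for documents A, B, and C. The coordinates below are hypothetical (the slide's actual values are not recoverable), and the function names are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical (hummus, cheeseburgers) counts for three documents.
A, B, C = (1, 2), (8, 2), (5, 10)
print(cosine_similarity(A, C), euclidean_distance(A, C))
print(cosine_similarity(A, B), euclidean_distance(A, B))
```

With these made-up points, A and C use the two words in the same proportion, so their cosine similarity is essentially 1 even though C is a much longer document and sits far from A in straight-line terms. That insensitivity to document length is one reason cosine similarity is a popular choice for text.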

Which Words Contain Information? The TF-IDF Frequency Weights

Figure: Inverse Document Frequency Weight (axis ticks 0 to 12) plotted against Number of Documents in Which a Term Appears (axis ticks 1 to 1293).
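A sketch of TF-IDF re-weighting on a raw-count document term matrix. It assumes the common variant idf = log(N / df), where N is the number of documents and df is the number of documents containing the term; the talk does not state which variant or log base it used.

```python
import math

def tf_idf(dtm):
    """Re-weight a raw-count matrix (list of rows) by log(N / df); returns (weighted, idf)."""
    n_docs = len(dtm)
    n_terms = len(dtm[0]) if dtm else 0
    # Document frequency: number of documents in which each term appears.
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]
    # Terms that never appear get weight 0 to avoid dividing by zero.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    weighted = [[row[j] * idf[j] for j in range(n_terms)] for row in dtm]
    return weighted, idf

# Made-up counts: the first term appears in every document, so its weight is 0.
weighted, idf = tf_idf([[2, 1, 0], [1, 0, 3], [1, 0, 0]])
print(idf)
```

A term that appears in every document carries no information for telling documents apart and is weighted to zero; rarer terms are weighted up.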

VISUAL EXPLORATION

Sample  Data  

• Source: http://www.congressionalbills.org/

•  Titles  of  5,000  randomly  sampled  Congressional  bills  that  were  signed  into  law  from  the  80th  to  the  112th  Congress  

•  Used  for  example  visuals  only,  not  a  thorough  analysis      

Nearest  Neighbor  Networks  
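One way to build the edge list for a nearest neighbor network is to link each document to its k most similar documents. The sketch below assumes cosine similarity and made-up vectors; the function names and the choice of k are illustrative, not taken from the talk. The resulting edges can be handed to any graph plotting tool.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbor_edges(vectors, k=1):
    """Return directed edges (i, j, similarity) from each row to its k most similar rows."""
    edges = []
    for i, u in enumerate(vectors):
        sims = [(cosine_similarity(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by lower row index.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        edges.extend((i, j, sim) for sim, j in sims[:k])
    return edges

# Made-up TF-IDF-like rows: documents 0 and 1 are alike, as are 2 and 3.
vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1], [0, 0.2, 0.9]]
for i, j, sim in nearest_neighbor_edges(vectors):
    print(i, "->", j, round(sim, 3))
```

This compares every pair of documents, which is fine for a few thousand titles but grows quadratically with corpus size.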

CLUSTER ANALYSIS

Partitional Clustering
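Partitional methods split the documents into a fixed number of non-overlapping groups. k-means is one well-known example; the talk does not say which algorithm it used, so the sketch below is only an illustration. It uses the first k points as starting centers, a simplification chosen to keep the example deterministic.

```python
def k_means(points, k, n_iter=20):
    """Lloyd's algorithm with the first k points as initial centers (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its closest center.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Made-up 2-D points forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (10, 9), (1, 0), (9, 10)]
labels, centers = k_means(points, k=2)
print(labels)
```

For text, rows are often scaled to unit length first so that Euclidean distance ranks neighbors the same way cosine similarity does.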

What  Does  a  Cluster  Represent?  

•  Clusters  are  groups  of  documents.  

• Documents are grouped around a co-occurrence of terms (TF-IDF)

• Manual inspection of documents augments analyses.

Bills Pertaining to the Navy and Marine Corps

Term            Freq
navy            100
corps           86
medical         44
marine          37
appointments    36
officers        35
army            34
band            29
grade           28
nurse           26
duty            24
permanent       24
united          23
authorize       21
states          21
nurses          19
career          18
estates         18
norfolk         18
held            17
members         16
attendance      16
force           16
air             15
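A table of top terms for a cluster can be produced by summing the document term matrix rows that belong to the cluster and ranking the totals. This is a sketch with made-up data; whether the frequencies above are raw counts or weighted values is not stated, and the function name is illustrative.

```python
from collections import Counter

def top_terms(dtm, vocabulary, labels, cluster, n=5):
    """Sum term counts over the documents assigned to `cluster`; return the n largest."""
    totals = Counter()
    for row, label in zip(dtm, labels):
        if label == cluster:
            for term, count in zip(vocabulary, row):
                if count:
                    totals[term] += count
    return totals.most_common(n)

# Made-up matrix: rows are documents, columns follow `vocabulary`.
vocabulary = ["band", "corps", "marine", "navy", "tax"]
dtm = [[0, 1, 0, 2, 0], [1, 1, 1, 0, 0], [0, 0, 0, 0, 3], [0, 0, 0, 1, 0]]
labels = [0, 0, 1, 0]  # cluster assignment for each document
print(top_terms(dtm, vocabulary, labels, cluster=0, n=2))  # [('navy', 3), ('corps', 2)]
```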

Nearest  Neighbor  Network  of  Clusters  

Cluster  Nearest  Neighbor  Network  

TAKE AWAYS

If  you  remember  nothing  else…  

• Corpus representation = Document Term Matrix

•  Frequency  measure  =  Term  Frequency  Inverse  Document  Frequency  

•  Distance/Similarity  measure  =  Cosine  similarity  

Pro  Tips  

•  Longer  documents  are  more  internally  heterogeneous  and  can  be  more  difficult  to  cluster  meaningfully  

• Context-specific dictionaries are helpful.

• Dimensionality (i.e. data size) requires thoughtful programming

• Get more clusters than you think you need and then aggregate them after inspection.

Questions?

Thomas Jones, Science and Technology Policy Institute

[email protected]