
Page 1

Topic Models

Dorota Glowacka
[email protected]

Page 2

Topic  Models  

•  We  want  to  find  themes  (or  topics)  in  documents,  which   is   useful   for,   e.g.   searching   or   visualising  themes  in  document  collecAons  

•  We  don’t  want   to   do  manual   annotaAon  –  Ame  consuming   and   relies   on   availability   of   human  annotators.  

•  Approach   should   automa,cally   tease   out   the  topics  

•  EssenAally   a   clustering   problem   where   both  words  and  documents  are  clustered.    

Page 3

Topic Models

Simple intuition: documents exhibit multiple topics.

Page 4

Latent Dirichlet Allocation (LDA)

• Each topic is a distribution over words.
• Each document is a mixture of corpus-wide topics.
• Each word is drawn from one of these topics.

Page 5

Posterior Distribution

• In reality, we only observe the documents.
• The other structures are hidden variables.

Page 6

Posterior Distribution

• Our goal is to infer the hidden variables,
• i.e. compute their distribution conditioned on the documents:

p(topics, proportions, assignments | documents)

Page 7

LDA: key assumptions

• Documents exhibit multiple topics (but typically not many).

• LDA is a probabilistic model with a corresponding generative process (each document is generated by this process).

• A topic is a distribution over a fixed vocabulary (topics are assumed to be generated first, before the documents).

• Only the number of topics is specified in advance.

Page 8

Document Generation

1. Choose the number of words N the document will have.
2. Choose a topic mixture for the document according to a Dirichlet distribution over a set of T topics.
3. Generate each word w_i in the document by:
   1. randomly choosing a topic from the distribution over topics;
   2. randomly choosing a word from the corresponding topic (a distribution over the vocabulary).
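A minimal sketch of this generative process in Python with NumPy; the vocabulary, the number of topics, and the hyperparameters α and β are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["broccoli", "cherries", "eating", "panda", "adorable"]
T = 2                    # number of topics, fixed in advance
V = len(vocab)           # vocabulary size
alpha, beta = 0.5, 0.1   # illustrative Dirichlet hyperparameters

# Each topic is a distribution over the vocabulary,
# drawn once, before any document is generated.
phi = rng.dirichlet(np.full(V, beta), size=T)    # shape (T, V)

def generate_document(N):
    # Step 2: topic mixture for this document ~ Dirichlet(alpha).
    theta = rng.dirichlet(np.full(T, alpha))
    words = []
    for _ in range(N):
        z = rng.choice(T, p=theta)       # step 3.1: choose a topic
        w = rng.choice(V, p=phi[z])      # step 3.2: choose a word from it
        words.append(vocab[w])
    return words

print(generate_document(5))   # step 1: a document of N = 5 words
```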

Page 9

Document Generation

1. Pick 5 to be the number of words in D.
2. Decide D will be ½ about food and ½ about cute animals.
3. Pick the first word to come from the food topic: broccoli.
4. Pick the second word to come from the cute animals topic: panda.
5. Pick the third word from the cute animals topic: adorable.
6. Pick the fourth word from the food topic: cherries.
7. Pick the fifth word from the food topic: eating.

What is the document generated under the LDA model? (Since LDA is a bag-of-words model, it is simply the word sequence "broccoli panda adorable cherries eating".)

Page 10

Learning

1. Go through each document and randomly assign each word in the document to one of T topics.
2. For each document d, go through each word w in d, and for each topic t compute:
   1. p(topic t | document d)
   2. p(word w | topic t)
3. Reassign w to a new topic, where we choose topic t with probability p(topic t | document d) × p(word w | topic t).
4. After repeating the previous step a large number of times, you will eventually reach a roughly steady state where your assignments are pretty good.
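A compact sketch of this learning loop in Python; the toy corpus and the smoothing constants α and β are invented for illustration (the smoothed conditional they produce is spelled out on the Gibbs sampling slide further below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids from a vocabulary of size V.
docs = [[0, 1, 2, 0], [3, 4, 3], [0, 2, 4, 1]]
V, T = 5, 2
alpha, beta = 0.5, 0.1   # smoothing constants (see the Dirichlet slide later)

# Step 1: random initial topic assignment for every word token.
z = [[rng.integers(T) for _ in doc] for doc in docs]

# Count matrices: C_dt[d, t] = tokens in document d assigned to topic t,
#                 C_wt[w, t] = tokens of word w assigned to topic t.
C_dt = np.zeros((len(docs), T))
C_wt = np.zeros((V, T))
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        C_dt[d, z[d][i]] += 1
        C_wt[w, z[d][i]] += 1

for _ in range(200):                       # step 4: repeat many times
    for d, doc in enumerate(docs):         # step 2: each document,
        for i, w in enumerate(doc):        # each word in it
            old = z[d][i]
            C_dt[d, old] -= 1              # remove this token's own count
            C_wt[w, old] -= 1
            # p(topic t | doc d) * p(word w | topic t), up to normalisation
            p = (C_dt[d] + alpha) * (C_wt[w] + beta) / (C_wt.sum(axis=0) + V * beta)
            z[d][i] = rng.choice(T, p=p / p.sum())   # step 3: reassign w
            C_dt[d, z[d][i]] += 1
            C_wt[w, z[d][i]] += 1

print(C_wt)   # word-topic counts after sampling
```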

Page 11

Generative Process

For j = 1 … T topics:
  choose $\phi^{(j)} \sim \mathrm{Dirichlet}(\beta)$
For d = 1 … D documents:
  choose $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$
  For i = 1 … $N_d$ words in doc d:
    choose $z_i \sim \mathrm{Multinomial}(\theta^{(d)})$
    choose $w_i \sim \mathrm{Multinomial}(\phi^{(z_i)})$

where:
• $\phi^{(j)} = \phi^{(j)}_1, \dots, \phi^{(j)}_V$: probability of each word in topic j
• $\theta^{(d)} = \theta^{(d)}_1, \dots, \theta^{(d)}_T$: probability of each topic in document d
• $z_i$: topic of word i
• $w_i$: identity of word i

Page 12

The Graphical Model

• Nodes are random variables; edges indicate dependence.
• Shaded nodes are observed; plates denote replicated variables.

Page 13

The Graphical Model

• Nodes are random variables; edges indicate dependence.
• Shaded nodes are observed; plates denote replicated variables.

[Plate diagram, annotated: proportions parameter → per-document topic proportions → per-word topic assignment → observed words ← topics ← topics parameter; plates mark the words and the documents.]

Page 14

The Graphical Model

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$

• $P(z_i = j)$ is the probability that the jth topic was sampled for the ith word token.
• $P(w_i \mid z_i = j)$ is the probability of word $w_i$ under topic j.
• $\phi^{(j)} = P(w \mid z = j)$
• $\theta^{(d)} = P(z)$
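As a quick worked example with invented numbers: suppose T = 2, $P(z_i{=}1) = 0.7$, $P(z_i{=}2) = 0.3$, $P(w_i \mid z_i{=}1) = 0.1$ and $P(w_i \mid z_i{=}2) = 0.5$; then

$$P(w_i) = 0.7 \times 0.1 + 0.3 \times 0.5 = 0.22$$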

Page 15

Dirichlet Distribution

$$p(\theta \mid \alpha) = \mathrm{Dir}(\alpha_1, \dots, \alpha_T) = \frac{\Gamma\!\left(\sum_j \alpha_j\right)}{\prod_j \Gamma(\alpha_j)} \prod_{j=1}^{T} \theta_j^{\alpha_j - 1}$$

• The Dirichlet distribution is an exponential family distribution over the simplex, i.e. positive vectors that sum to one.

• It is conjugate to the multinomial distribution: given a multinomial observation, the posterior distribution of θ is a Dirichlet.

• The parameter α smoothes the topic distribution in each document.
• The parameter β smoothes the word distribution in each topic.
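A tiny sketch of this conjugacy in Python; the prior and the observed counts are made-up numbers:

```python
import numpy as np

alpha = np.array([2.0, 1.0, 1.0])   # illustrative Dirichlet prior over 3 topics
counts = np.array([5, 0, 3])        # a multinomial observation (topic counts)

# Conjugacy: the posterior over theta is again a Dirichlet,
# obtained simply by adding the observed counts to the prior.
posterior = alpha + counts          # Dir(7, 1, 4)
print(posterior / posterior.sum())  # posterior mean of theta: [0.583 0.083 0.333]
```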

Page 22

Inference with Gibbs Sampling

• An iterative process.
• Start with a random topic assignment for each word.
  – Assume you know (from the previous iteration) the topic assignments of all other words.
  – Determine the probabilities of each topic assignment given the rest of the data.
  – Sample a new assignment from these probabilities.
• Iterate until convergence.

Page 23

Gibbs Sampling

• The collection of documents is represented as a set of words $w_i$ and document indices $d_i$, one for each word token i.
• Each topic assignment is resampled from the conditional distribution:

$$P(z_i = j \mid z_{-i}, w_i, d_i, \cdot) \propto \frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha} \cdot \frac{C^{WT}_{w_i j} + \beta}{\sum_{w=1}^{W} C^{WT}_{w j} + W\beta}$$

• From this conditional distribution a topic is sampled and stored as the new topic assignment for this word token.

Notation:
• $z_i = j$: topic assignment of token i to topic j
• $z_{-i}$: topic assignments of all other word tokens
• $C^{WT}$: matrix of counts with dimensions W × T (word-topic)
• $C^{DT}$: matrix of counts with dimensions D × T (document-topic)
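As a sketch, this conditional can be computed directly from the two count matrices (function and variable names are mine, not from the slides). Note that the document-side denominator $\sum_t C^{DT}_{d_i t} + T\alpha$ is the same for every topic j, so it cancels under normalisation, which is why the sketch after the Learning slide could omit it:

```python
import numpy as np

def conditional(w, d, C_wt, C_dt, alpha, beta):
    """P(z_i = j | z_-i, w_i = w, d_i = d) for all topics j, assuming
    the counts C_wt (W x T) and C_dt (D x T) already exclude token i."""
    W, T = C_wt.shape
    p_w_given_t = (C_wt[w] + beta) / (C_wt.sum(axis=0) + W * beta)
    p_t_given_d = (C_dt[d] + alpha) / (C_dt[d].sum() + T * alpha)
    p = p_t_given_d * p_w_given_t
    return p / p.sum()   # normalise over topics
```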

Page 24

Posterior Estimates of φ and θ

$$\hat\phi_{ij} = \frac{C^{WT}_{ij} + \beta}{\sum_{k=1}^{W} C^{WT}_{kj} + W\beta} \qquad\qquad \hat\theta_{dj} = \frac{C^{DT}_{dj} + \alpha}{\sum_{t=1}^{T} C^{DT}_{dt} + T\alpha}$$

• $\phi_{ij}$ normalises the word-topic count matrix (the probability of word type i under topic j).

• $\theta_{dj}$ normalises the counts in the document-topic matrix (the probability of topic j in document d).

• Technically, these are the posterior means of the parameters φ and θ.
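A sketch of these estimates computed from the final Gibbs count matrices (names are mine):

```python
import numpy as np

def posterior_means(C_wt, C_dt, alpha, beta):
    """Posterior mean estimates of phi (word distribution per topic) and
    theta (topic distribution per document) from the count matrices."""
    W, T = C_wt.shape
    phi = (C_wt + beta) / (C_wt.sum(axis=0) + W * beta)                     # shape (W, T)
    theta = (C_dt + alpha) / (C_dt.sum(axis=1, keepdims=True) + T * alpha)  # shape (D, T)
    return phi, theta
```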

Page 25

Example Inference

Data: the OCR'ed collection of Science from 1990-2000
- 17K documents
- 11M words
- 20K unique terms (stop words and rare words removed)

Model: 100-topic LDA model
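The slides do not say which implementation was used for this experiment; purely as an illustration, a model like this could be fitted with, e.g., the gensim library (the toy corpus below stands in for the real 17K-document collection):

```python
from gensim import corpora, models

# Toy stand-in for the tokenised Science corpus
# (stop words and rare words already removed).
texts = [["gene", "dna", "genetic"],
         ["brain", "neuron", "nerve"],
         ["gene", "brain", "expression"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# num_topics=100 on the real corpus; 2 keeps the toy example sensible.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
for topic in lda.print_topics(num_words=3):
    print(topic)
```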

Page 27

Why does LDA "work"?

• LDA trades off two goals:
  1. For each document, allocate its words to as few topics as possible.
  2. For each topic, assign high probability to as few terms as possible.

• These goals are at odds:
  – Putting a document in a single topic makes goal 2 hard: all of its words must have probability under that topic.
  – Putting very few words in each topic makes goal 1 hard: to cover a document's words, many topics must be assigned to it.

• Trading off these goals finds groups of tightly co-occurring words.

Page 28

LDA summary

• LDA is a probabilistic model of text. It casts the problem of discovering themes in large document collections as a posterior inference problem.

• It allows visualisation of the hidden thematic structure of large document collections.

• LDA is a building block of many applications.
• It is popular because organizing and finding patterns in data has become important in the sciences, humanities, industry, etc.

Page 29

Extending Topic Models

• LDA can be embedded in more complex models, embodying further intuitions about the structure of the text.

• E.g. used in models that account for syntax, authors, correlations, or hierarchies.

Page 30

Extending Topic Models

• The data generating distribution can be changed: we can apply mixed-membership assumptions to many kinds of data.

• E.g. we can build models of images, social networks, music, computer code, genetic data, etc.

Page 31

Extending Topic Models

• The posterior can be used in creative ways.
• E.g. we can use the inferences in information retrieval, recommendation, visualisation, summarization, etc.