41
Tengyu Ma Joint works with Sanjeev Arora, Yuanzhi Li, Yingyu Liang, and Andrej Risteski Princeton University

word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Tengyu Ma

Joint works with Sanjeev Arora, Yuanzhi Li, YingyuLiang, and Andrej Risteski

Princeton University

Page 2: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

𝑥 ∈ 𝒳

𝑣& ∈ ℝ(

complicatedspace Euclideanspacewithmeaningful innerproducts

Ø Kernelmethods)*+,-.*/01,23/03+4

Linearlyseparable

Ø Neuralnets 0.*3+1,+15.*2+106 Multi-classlinearclassifier

Page 3: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Vocabulary ={ 60k most frequentwords }

ℝ788

Goal:Embeddingcapturessemanticsinformation

(vialinearalgebraicoperations)

Ø inner products characterize similarityØ similarwordshavelargeinnerproducts

Ø differencescharacterizerelationshipØanalogouspairshavesimilardifferences

Ø more? picture:ChrisOlah’s blog

Page 4: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Meaningofawordisdeterminedbywordsitco-occurswith.

(Distributionalhypothesisofmeaning,[Harris’54],[Firth’57])

⋯⋮⋮ ⋱ ⋮

⋮⋯

word 𝑥 → 𝑣&

word 𝑦↓

Ø Pr 𝑥, 𝑦 ≜ prob.ofco-occurrencesof𝑥, 𝑦 inawindowofsize5

Ø 𝑣&, 𝑣C - agoodmeasureofsimilarityof(𝑥, 𝑦) [Lund-Burgess’96]

Ø 𝑣& =rowofentry-wise square-rootofco-occurrencematrix[Rohdeetal’05]

Ø 𝑣& =rowofPMI 𝑥, 𝑦 = log L.[&,C]L. & L.[C]

matrix[Church-Hanks’90]

Co-occurrencematrixPr ⋅,⋅

Page 5: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø “Linearstructure”inthefound𝑣&’s:

𝑣PQRST − 𝑣RST ≈ 𝑣WXYYT − 𝑣Z[T\ ≈ 𝑣XT]^Y − 𝑣SXT_ ≈ ⋯

aunt

king

uncleman

woman

queen

Algorithm [Levy-Goldberg]:(dimension-reductionversionof[Church-Hanks’90])

Ø ComputePMI 𝑥, 𝑦 = log L.[&,C]L. & L.[C]

Ø Takerank-300SVD(bestrank-300approximation)ofPMIØ ⇔ FitPMI 𝑥, 𝑦 ≈ ⟨𝑣&, 𝑣C⟩ (withsquaredloss),where𝑣& ∈ ℝ788

Page 6: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Questions: woman:manqueen:?

,aunt:?

Ø Answers: 𝑘𝑖𝑛𝑔 = argmink|| 𝑣WXYYT − 𝑣P − (𝑣PQRST−𝑣RST)||

𝑎𝑢𝑛𝑡 = argmink|| 𝑣XT]^Y − 𝑣P − (𝑣PQRST−𝑣RST)||

aunt

king

uncleman

woman

queen

Page 7: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ørecurrentneuralnetworkbasedmodel[Mikolov etal’12]

Øword2vec[Mikolov etal’13]:

Pr 𝑥[pq 𝑥[pr,… , 𝑥[pt ∝ exp⟨𝑣&yz{ ,15 𝑣&yz~ +⋯+ 𝑣&yz� ⟩

ØGloVe [Penningtonetal’14]:

log Pr[𝑥, 𝑦] ≈ 𝑣&, 𝑣C + 𝑠& + 𝑠C + 𝐶

Ø [Levy-Goldberg’14](Previousslide)

PMI 𝑥, 𝑦 = log L.[&,C]L. & L.[C] ≈ 𝑣&, 𝑣C + 𝐶

Logarithm(orexponential)seemstoexcludelinearalgebra!

Page 8: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Whyco-occurrencestatistics+logà linearstructure[Levy-Goldberg’13,Penningtonetal’14,rephrased]

Ø Formostofthewords𝜒:

Pr[𝜒 ∣ 𝑘𝑖𝑛𝑔]Pr[𝜒 ∣ 𝑞𝑢𝑒𝑒𝑛] ≈

Pr[𝜒 ∣ 𝑚𝑎𝑛]Pr 𝜒 𝑤𝑜𝑚𝑎𝑛]

§ For𝜒 unrelatedtogender:LHS,RHS≈ 1

§ for𝜒=dress,LHS,RHS≪ 1 ;for𝜒 =John,LHS,RHS≫ 1

ØItsuggests

=� PMI 𝜒, 𝑘𝑖𝑛𝑔 − PMI 𝜒, 𝑞𝑢𝑒𝑒𝑛 − PMI 𝜒, 𝑚𝑎𝑛 − PMI 𝜒, 𝑤𝑜𝑚𝑎𝑛�

≈ 0

Ø RowsofPMImatrixhas“linearstructure”

Ø Empiricallyonecanfind𝑣P’ss.t. PMI 𝜒, 𝑤 ≈ ⟨𝑣�, 𝑣P⟩

Ø Suggestion:𝑣P’salsohavelinearstructure

� logPr 𝜒 𝑘𝑖𝑛𝑔Pr 𝜒 𝑞𝑢𝑒𝑒𝑛 − log

Pr 𝜒 𝑚𝑎𝑛Pr 𝜒 𝑤𝑜𝑚𝑎𝑛]

≈ 0

Page 9: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

M1:Whydolow-dimvectorscaptureessenceofhugeco-occurrencestatistics?Thatis,whyisalow-dimfitofPMImatrixevenpossible?

PMI 𝑥, 𝑦 ≈ 𝑣&, 𝑣C (∗)

M2:Whylow-dimvectorssolvesanalogywhen(∗) isonlyroughlytrue?

Ø NB:solvinganalogytaskrequiresinnerproductsof6pairsofwordvectors,andthat“king”survivesagainstallotherwords– noiseispotentiallyanissue!

𝑘𝑖𝑛𝑔 = argmaxk|| 𝑣WXYYT − 𝑣P − (𝑣PQRST−𝑣RST)||�

Ø Fact:low-dimwordvectorshavemoreaccurate linearstructurethantherowsofPMI(thereforebetteranalogytaskperformance).

↑empiricalfithas17%error

Ø NB:PMImatrixisnotnecessarilyPSD.

Page 10: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

M1:Whydolow-dimvectorscaptureessenceofhugeco-occurrencestatistics?Thatis,whyisalow-dimfitofPMImatrixevenpossible?

PMI 𝑥, 𝑦 ≈ 𝑣&, 𝑣C (∗)

A1:Underagenerativemodel(namedRAND-WALK),(*)provably holds

M2:Whylow-dimvectorssolvesanalogywhen(∗) isonlyroughlytrue?

A2:(*)+isotropyofwordvectors⇒ low-dimfittingreducesnoise

(Quiteintuitive, though doesn’t followOccam’sbound forPAC-learning)

Page 11: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø HiddenMarkovModel:§ discoursevector𝑐_ ∈ ℝ( governsthediscourse/theme/contextoftime𝑡§ words𝑤_ (observable);embedding𝑣P� ∈ ℝ

( (parameterstolearn)§ log-linearobservationmodel

Pr[𝑤_ ∣ 𝑐_] ∝ exp⟨𝑣P�,𝑐_⟩

Ø Closelyrelatedto[Mnih-Hinton’07]

𝑐_ 𝑐_pr 𝑐_p� 𝑐_p7

𝑤_ 𝑤_pr 𝑤_p� 𝑤_p7 𝑤_p�

𝑐_p�

Page 12: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Ideally,𝑐_, 𝑣P ∈ ℝ( shouldcontainsemanticinformationinitscoordinates§ E.g.(0.5,-0.3,…)couldmean“0.5gender,-0.3age,..”

Ø But,thewholesystemisrotationalinvariant: 𝑐_, 𝑣P = ⟨𝑅𝑐_,𝑅𝑣P⟩

Ø Thereshouldexistarotationsothatthecoordinatesaremeaningful(backtothislater)

𝑐_ 𝑐_pr 𝑐_p� 𝑐_p7

𝑤_ 𝑤_pr 𝑤_p� 𝑤_p7 𝑤_p�

𝑐_p�

Page 13: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Assumptions:§ {𝑣P}consistsofvectorsdrawnfrom𝑠 ⋅ 𝒩(0, Id);𝑠 isboundedscalarr.v.§ 𝑐_ doesaslowrandomwalk(doesn’tchangemuchinawindowof5)§ log-linearobservationmodel:Pr[𝑤_ ∣ 𝑐_] ∝ exp⟨𝑣P�,𝑐_⟩

Ø MainTheorem:

(1) logPr 𝑤,𝑤′ = 𝑣P + 𝑣P� �/𝑑 − 2 log𝑍 ± 𝜖

(2)logPr 𝑤 = 𝑣P �/𝑑 − log𝑍 ± 𝜖

(3) PMI 𝑤,𝑤� = 𝑣P, 𝑣P¢ /𝑑 ± 𝜖

Ø Normdeterminesfrequency;spatialorientationdetermines“meaning”

𝑐_ 𝑐_pr 𝑐_p� 𝑐_p7

𝑤_ 𝑤_pr 𝑤_p� 𝑤_p7 𝑤_p�

𝑐_p�

Fact:(2)implies thatthewordshavepowerlawdist.

Page 14: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Øword2vec[Mikolov etal’13]:

Pr 𝑤[pq 𝑤[pr,… ,𝑤[pt ∝ exp⟨𝑣Pyz{ ,15 𝑣Pyz~ +⋯+ 𝑣Pyz� ⟩

ØGloVe [Penningtonetal’14]:

log Pr[𝑤,𝑤′] ≈ 𝑣P, 𝑣P¢ + 𝑠P + 𝑠P� + 𝐶

Eq.(1) logPr 𝑤,𝑤� = 𝑣P + 𝑣P¢ � /𝑑 − 2 log𝑍 ± 𝜖

Ø [Levy-Goldberg’14]

PMI 𝑤,𝑤� ≈ 𝑣P, 𝑣P¢ + 𝐶

Eq.(3)PMI 𝑤, 𝑤� = 𝑣P, 𝑣P¢ /𝑑 ± 𝜖

Page 15: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Øword2vec[Mikolov etal’13]:

Pr 𝑤[pq 𝑤[pr,… , 𝑤[pt ∝ exp⟨𝑣Pyz{ ,15 𝑣Pyz~ + ⋯+ 𝑣Pyz� ⟩

Ø Underourmodel,

§ Randomwalkisslow:𝑐[pr ≈ 𝑐[p� ≈ ⋯ ≈ 𝑐[pq ≈ 𝑐

§ Bestestimateforcurrentdiscourse𝑐[pq:

argmax],||]||£r

Pr 𝑐 𝑤[pr,… ,𝑤t] = 𝛼 𝑣Pyz~ + ⋯+ 𝑣Pyz�

§ Prob.distributionofnextwordgiventhebestguess𝑐:

Pr[𝑤[pq ∣ 𝑐[pq = 𝛼 𝑣Pyz~ + ⋯+ 𝑣Pyz� ] ∝ exp⟨𝑣Pyz{ ,𝛼 𝑣Pyz~ +⋯+ 𝑣Pyz� ⟩

↑max-likelihoodestimateof𝑐[pq

𝑐[p� 𝑐[pt

𝑤[p� 𝑤[pt 𝑤[pq

𝑐[pq

Page 16: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Pr[𝑤,𝑤�] = ¥ Pr 𝑤 𝑐] Pr 𝑤� 𝑐′] 𝑝 𝑐, 𝑐� 𝑑𝑐𝑑𝑐′

= ¥1

𝑍]𝑍]�⋅ exp 𝑣P, 𝑐 exp⟨𝑣P¢ , 𝑐�⟩ 𝑝 𝑐, 𝑐� 𝑑𝑐𝑑𝑐′

Ø Assume𝑐 = 𝑐′ withprobability1,

= ¥exp⟨𝑣P + 𝑣P¢, 𝑐⟩ 𝑝 𝑐 𝑑𝑐 = exp 𝑣P + 𝑣P¢ � /𝑑

??

Thistalk:windowofsize2

Pr[𝑤 ∣ 𝑐] ∝ exp⟨𝑣P, 𝑐⟩

𝑐 𝑐′

𝑤 𝑤′

Pr[𝑤′ ∣ 𝑐′] ∝ exp⟨𝑣P¢ , 𝑐′⟩Ø Pr[𝑤 ∣ 𝑐] = r§¨⋅ exp⟨𝑣P, 𝑐⟩

Ø 𝑍] = ∑ exp⟨𝑣P, 𝑐⟩P partitionfunction

Eq. (1) logPr 𝑤, 𝑤� = 𝑣P + 𝑣P¢ � /𝑑 − 2 log𝑍 ± 𝜖

sphericalGaussianvector𝑐Ø 𝔼 exp 𝑣, 𝑐 = exp 𝑣 �/𝑑

Page 17: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Thistalk:windowofsize2

Pr[𝑤 ∣ 𝑐] ∝ exp⟨𝑣P, 𝑐⟩

𝑐 𝑐′

𝑤 𝑤′

Pr[𝑤′ ∣ 𝑐′] ∝ exp⟨𝑣P¢ , 𝑐′⟩Ø Pr[𝑤 ∣ 𝑐] = r§¨⋅ exp⟨𝑣P, 𝑐⟩

Ø 𝑍] = ∑ exp⟨𝑣P, 𝑐⟩P partitionfunction

Lemma1:foralmostallc,almostall 𝑣P ,𝑍] = 1 + 𝑜 1 𝑍

Ø Proof(sketch):§ formost𝑐,𝑍] concentratesarounditsmean§ meanof𝑍] isdeterminedby||𝑐||,whichinturnconcentrates§ caveat:exp⟨𝑣, 𝑐⟩ for𝑣 ∼𝒩(0, Id) isnotsubgaussian,norsub-exponential.(𝛼-Orlicz normisnotboundedforany𝛼 > 0)

Eq. (1) logPr 𝑤, 𝑤� = 𝑣P + 𝑣P¢ � /𝑑 − 2 log𝑍 ± 𝜖

Page 18: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø ProofSketch:

Ø Fixing𝑐,toshowhighprobabilityoverchoicesof𝑣P’s

𝑍] =�exp⟨𝑣P, 𝑐⟩P

= 1 + 𝑜 1 𝔼[𝑍]]

Ø 𝑧P = ⟨𝑣P, 𝑐⟩ scalarGaussianrandomvariable

Ø ||𝑐|| governsthemeanandvarianceof𝑧P.Ø ||𝑐|| inturnsisconcentrated

Lemma1:foralmostallc,almostall 𝑣P ,𝑍] = 1 + 𝑜 1 𝑍

Page 19: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Question:𝑧r,… , 𝑧T ∼ 𝒩(0,1)

𝑍 = �exp(𝑧[)T

[£rØ Howis𝑍 concentrated?

Ø 𝔼 𝑍] = Θ(𝑛),and𝕍𝑎𝑟 𝑍] = O 𝑛Ø Thetailof𝑒𝑥𝑝(𝑧[) isbad!

Ø Pr exp𝑧[ > 𝑡 ≈ 𝑡² 2³4 _

Ø Claim:Pr[𝑍 > 𝔼𝑍 + 𝐶 𝑛 ⋅ log 𝑛] ≤ exp(− log� 𝑛)

Ø Trick:truncate𝑧[ atlog𝑛 anddealwiththetailbyunionbound

Ø (sub)-Gaussian tailPr 𝑋 > 𝑡 ≤ exp(−𝑡�/2)

Ø (sub)-exponential tailPr 𝑋 > 𝑡 ≤ exp(−𝑡/2)

Page 20: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø ProofSketch:

Ø Fixing𝑐,wehavewithhighprobabilityoverchoicesof𝑣P’s

𝑍] =�exp⟨𝑣P, 𝑐⟩P

= 1 + 𝑜 1 𝔼[𝑍]]

Ø 𝑧P = ⟨𝑣P, 𝑐⟩ scalarGaussianrandomvariable

Ø ||𝑐|| governsthemeanandvarianceof𝑧P.Ø ||𝑐|| inturnsisconcentrated

Lemma1:foralmostallc,almostall 𝑣P ,𝑍] = 1 + 𝑜 1 𝑍

Page 21: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Pr[𝑤,𝑤�] = ¥1

𝑍]𝑍]�⋅ exp 𝑣P + 𝑣P¢, 𝑐 𝑝 𝑐 𝑑𝑐

= 1 ± 𝑜 11𝑍� ¥ exp 𝑣P + 𝑣P¢, 𝑐 𝑝 𝑐 𝑑𝑐

= 1 ± 𝑜 11𝑍� exp(||𝑣P + 𝑣P¢ ||

�/𝑑)

Thistalk:windowofsize2

Pr[𝑤 ∣ 𝑐] ∝ exp⟨𝑣P, 𝑐⟩

𝑐 𝑐′

𝑤 𝑤′

Pr[𝑤′ ∣ 𝑐′] ∝ exp⟨𝑣P¢ , 𝑐′⟩

Eq. (1) logPr 𝑤, 𝑤� = 𝑣P + 𝑣P¢ � /𝑑 − 2 log𝑍 ± 𝜖

Ø Pr[𝑤 ∣ 𝑐] = r§¨⋅ exp⟨𝑣P, 𝑐⟩

Ø 𝑍] = ∑ exp⟨𝑣P, 𝑐⟩P partitionfunction

Lemma1:foralmostallc,almostall 𝑣P ,𝑍] = 1 + 𝑜 1 𝑍

Page 22: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Ourtheorypredicts

Eq.(1) logPr 𝑤, 𝑤� = 𝑣P + 𝑣P¢ � /𝑑 − 2 log𝑍 ± 𝜖

Ø (Approximate)maximumlikelihoodobjective(SN)

min{·¸},º

� Pr»[𝑤,𝑤�](logPr»[𝑤, 𝑤�] − 𝑣P + 𝑣P¢ �

P,P�

− 𝑌)�

Simplestwordembeddingmethodyet(fewest“knobs”toturn)Comparableperformanceonanalogytest

Page 23: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Ourtheorypredicts

Eq.(2) logPr 𝑤 = 𝑣P �/𝑑 − log𝑍 ± 𝜖

Page 24: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Ourtheorypredicts

𝑍] = 1± 𝑜 1 𝑍

Page 25: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

ØUndergenerativemodelRANK-WALK

Formostofthewords𝜒:

Pr[𝜒 ∣ 𝑎]Pr[𝜒 ∣ 𝑏] ≈

Pr[𝜒 ∣ 𝑐]Pr 𝜒 𝑑]

⟺ 𝑣S − 𝑣¿ ≈ 𝑣] − 𝑣(

↑semanticdef.ofanalogy

↑algebraicdef.ofanalogy

Ø Beyondonlysolvinganalogytask?

Ø Extractingmoreinformationfromanalogy/embeddings?

Page 26: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Extractingdifferentmeaningsfromwordembeddings(sameteam:Arora,Li,Liang,M.,Risteski)

Somerecentwork:

Page 27: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø “Tie”canmeanarticleofclothing,orphysicalact

ØTie representsunrelatedwordstie1,tie2,etc.

Quickexperiment:Taketworandom/unrelatedwordsw1,w2 wherew1 is~100timesmorefrequentthanw2 .Declarethesetobeasinglewordandcomputeitsembeddinginourmodel.

Result:closetosomethinglike0.8𝑣P~ + 0.2𝑣PÂ

Page 28: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Mathematicalexplanation

Ø Merge𝑤r,𝑤� as𝑤.Let𝑟 =L.[P~]L.[PÂ]

> 1

Ø Then𝑣P ≈ 𝛼𝑣P~ + 𝛽𝑣PÂ,where§ 𝛼 = 1 − 𝑐r log 1 + r

Ä ≈ 1§ 𝛽 = 1 − 𝑐� log 𝑟

Ø 𝛽 > .1 evenif𝑟 = 100 !

Ø Raremeaningisnotswamped,thankstothelog !

Page 29: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

whichcorrespondtodifferentrepresentative“discourses”

𝑣_[Y = 0.8𝑎r + 0.2𝑎� +noise↑

discoursefor𝑡𝑖𝑒r

↑discoursefor𝑡𝑖𝑒�

Ø “Tie”canmeanarticleofclothing,orphysicalact

ØTie representsunrelatedwordstie1,tie2,etc.

Ø Sparsecodingforextractingdifferentmeanings:§ Find𝑚 = 2000“discourses”𝑎r,𝑎�,… ∈ ℝ( suchthateachwordvectorexpressedasweightedsumofatmost5ofthem,plus“noisevector.”

𝑣P = 𝑥P,r𝑎r+ 𝑥P,�𝑎� +…+ 𝑛𝑜𝑖𝑠𝑒

𝑥P hasonly5non-zeros

Ø Trainingobjective:

minÆ£[S~,…,SÇ]ÈÉSÄÈY&¸¢ È

� 𝑣P − 𝐴𝑥P �

P

Ø localsearchalgo.[EAB’05],provablealgo.[SWW’12,AGM’14,AGMM’15..]

Page 30: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

other. Thus combining the bases while merging duplicates yielded a basis of about the same

size. Some atoms are were found to be semantically meaningless but are easily identified and

filtered out by checking if their closest words tend to have low pairwise inner products amongst

themselves (16).

The significant discourses represented by the basis vectors will henceforth be refered to as

atoms of discourse. The “meaning” of an atom can be discerned by looking at the set of words

whose embeddings are close to it. Table 1 contains some examples of the discourse atoms.

Atoms of discourse may be reminiscent of the results of other automated methods for ob-

taining a thematic understanding of text, such as topic modeling, described in the survey (17).

Indeed the model (1) used to compute the word embeddings is related to a log-linear topic model

from (18). However, the discourses here are computed via sparse coding on word embeddings,

which has no analog in topic modeling. There is also a long tradition of detecting coherent

clusters of word vectors using Brown clustering, or even sparse coding (19). The novelty in

the current paper is a clear interpretation of the basis –in terms of discourses— yielded by the

sparse coding, as well as its use to capture different senses of words.

Atom 1978 825 231 616 1638 149 330drowning instagram stakes membrane slapping orchestra conferencessuicides twitter thoroughbred mitochondria pulling philharmonic meetingsoverdose facebook guineas cytosol plucking philharmonia seminarsmurder tumblr preakness cytoplasm squeezing conductor workshopspoisoning vimeo filly membranes twisting symphony exhibitionscommits linkedin fillies organelles bowing orchestras organizesstabbing reddit epsom endoplasmic slamming toscanini concertsstrangulation myspace racecourse proteins tossing concertgebouw lecturesgunshot tweets sired vesicles grabbing solti presentations

Table 1: Some discourse atoms and their nearest 9 words. By (1) words most likely to appearin a discouse are those nearest to it.

6

Representativesubsetof2000discourses(representedusingtheirnearestwords)

↑closestwordsto𝑎�7r

Page 31: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

5atomsthatexpress𝑣_[Y

Page 32: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Atomsofdiscoursefoundarefairlyfine-grained

Ø Maybe𝑎¿[Q]ËYR[È_ÄC = 𝛼 ⋅ 𝑏¿[Q^Q\C + 𝛽 ⋅ 𝑏]ËYR[È_ÄC?

Ø Anotherlayer:

minÌ,ºÈÉSÄÈY

||𝐴 − 𝐵𝑌||�

Page 33: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor
Page 34: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø PartI:newgenerativemodelthatcapturessemantics.

Ø Provableguarantee:§ logofco-occurrencematrixhaslowrankstructure§ semanticanalogy⇔ linearalgebraicstructureforwordvectors

Ø Simplisticassumptions,butgoodfittoreality

Ø PartII:automaticwayofdetectwordmeanings§ Hierarchicalbasisintheembeddingspace

Ø Otherapplicationsofourmodel/method?

Page 35: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor
Page 36: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Eachordinateof𝑣P meanssomething:

𝑣ÎÏÆ = […… ,0, ……… ,1, ……… ,1, ……… ,0, …… ]

𝑣(Q^^SÄ = […… ,1, ……… ,0, ……… ,1, ……… ,0, …… ]

𝑣ÐË[TS = […… ,0, ……… ,1, ……… ,0, ……… ,1, …… ]

𝑣ÑÒÌ = […… ,1, ……… ,0, ……… ,0, ……… ,1, …… ]

currency↓

country↓

American↓

Chinese↓

𝑣ÎÏÆ − 𝑣(Q^^SÄ = […… ,−1, ……… ,1, ……… ,0 ……… ,0,…… ]

𝑣ÐË[TS − 𝑣ÑÒÌ = […… ,−1,……… , 1,……… , 0,……… , 0,… …]

Ø Onothercoordinates,thevaluesareeitherverysmallorthesupportsarenon-overlapping

Ø Problem:rotationalinvariance– rotationofwordvectorsdoesn’tchangethemodel.

Page 37: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

𝑣ÎÏÆ = […… ,0, ……… ,1, ……… ,1, ……… ,0, …… ]

𝑣(Q^^SÄ = […… ,1, ……… ,0, ……… ,1, ……… ,0, …… ]

𝑣ÐË[TS = […… ,0, ……… ,1, ……… ,0, ……… ,1, …… ]

𝑣ÑÒÌ = […… ,1, ……… ,0, ……… ,0, ……… ,1, …… ]

currency↓

country↓

American↓

Chinese↓

⋅ 𝑅

↑sparsecoefficients

↑basisvectors

ØWithsparsity,themodelisidentifiable;allowsovercomplete basis;istractableundermildassumptions.[SWW’12][AGM’13][AAJNT’13][AGMM’14]

Page 38: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

minÓ6Ô*.61,Ñ

||𝑉 − 𝑋 ⋅ 𝑅||Ö�

Ø 𝑉 containswordvectorsasrows(obtainedfromanyembeddingmethod)

Ø SparsityofrowsofXischosentobe5

Ø 𝑅contains2000basisvectors(asrows),eachofwhichis300-dim

Page 39: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

AssumingM1wasanswered,

PMI 𝑤,𝑤� = 𝑣P, 𝑣P¢ + 𝜉 (*)withlarge𝜉

M2:Whylow-dimvectorssolvesanalogywhen(*)isonlyroughlytrue?

A2:(*)+isotropyofwordvectors⇒ low-dimfittingreducesnoise

(Quiteintuitive, though doesn’t followOccam’sbound forPAC-learning)

Page 40: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

Ø Ourtheoryassumesthat𝑐_ doesaslowrandomwalk

Ø reddot:theestimatehiddenvariable𝑐_ attime𝑡

Ø sentenceattop:thewindowofsize10attime𝑡

Page 41: word embeddings theory · Ø ⇔ Fit PMI !,= ≈ 〈% &,% ... overdose facebook guineas cytosol plucking philharmonia seminars murder tumblr preakness cytoplasm squeezing conductor

AssumingM1wasanswered,

PMI 𝑤,𝑤� = 𝑣P, 𝑣P¢ + 𝜉 (*)withlarge𝜉

M2:Whylow-dimvectorssolvesanalogywhen(*)isonlyroughlytrue?

A2:(*)+isotropyofwordvectors⇒ low-dimfittingreducesnoise

(Quiteintuitive, though doesn’t followOccam’sbound forPAC-learning)