RELATION EXTRACTION, SYMBOLIC SEMANTICS, DISTRIBUTIONAL SEMANTICS
Heng Ji (jih@rpi.edu)
Oct 13, 2015
Acknowledgement: distributional semantics slides from Omer Levy, Yoav Goldberg and Ido Dagan

Page 1

RELATION EXTRACTION, SYMBOLIC SEMANTICS, DISTRIBUTIONAL SEMANTICS

Heng Ji
jih@rpi.edu
Oct 13, 2015

Acknowledgement: distributional semantics slides from Omer Levy, Yoav Goldberg and Ido Dagan

Page 2

Outline

• Task Definition
• Symbolic Semantics
  • Basic Features
  • World Knowledge
  • Learning Models
• Distributional Semantics

Page 3

relation: a semantic relationship between two entities

ACE relation type       Example
Agent-Artifact          Rubin Military Design, the makers of the Kursk
Discourse               each of whom
Employment/Membership   Mr. Smith, a senior programmer at Microsoft
Place-Affiliation       Salzburg Red Cross officials
Person-Social           relatives of the dead
Physical                a town some 50 miles south of Salzburg
Other-Affiliation       Republican senators

Relation Extraction: Task

Page 5

Test Sample

Train Sample: Employment

Train Sample: Physical

Train Sample: Employment

Train Sample: Employment

Train Sample: Physical

1. If the heads of the mentions don’t match: +8
2. If the entity types of the heads of the mentions don’t match: +20
3. If the intervening words don’t match: +10

the president of the United States

the previous president of the United States

the secretary of NIST

US forces in Bahrain Connecticut’s governor

his ranch in Texas

(figure: KNN distances from the test sample to the training samples, e.g. 26, 46, 360)

Relation Extraction with KNN
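The slide’s weighted-mismatch distance can be sketched directly; this is a minimal illustration with hypothetical field names (`heads`, `types`, `between`), using the weights +8/+20/+10 from the slide:

```python
# Sketch of the slide's KNN distance between two candidate relation mentions.
# Field names are assumptions; the mismatch weights come from the slide.
def knn_distance(a, b):
    """a, b: dicts with 'heads' (pair of head words), 'types' (pair of entity
    types of the heads), and 'between' (list of intervening words)."""
    d = 0
    if a["heads"] != b["heads"]:      # heads of the mentions don't match
        d += 8
    if a["types"] != b["types"]:      # entity types of the heads don't match
        d += 20
    if a["between"] != b["between"]:  # intervening words don't match
        d += 10
    return d

test_sample = {"heads": ("president", "United States"), "types": ("PER", "GPE"),
               "between": ["of", "the"]}
train_sample = {"heads": ("president", "United States"), "types": ("PER", "GPE"),
                "between": ["of", "the"]}
print(knn_distance(test_sample, train_sample))  # 0: identical pair
```

The test sample is then labeled with the relation type of its nearest training samples under this distance.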

Page 6

• Lexical: heads of the mentions and their context words, POS tags
• Entity: entity and mention type of the heads of the mentions; entity positional structure; entity context
• Syntactic: chunking (premodifier, possessive, preposition, formulaic); the sequence of the heads of the constituents/chunks between the two mentions; the syntactic relation path between the two mentions; dependent words of the mentions
• Semantic: gazetteers; synonyms in WordNet; name gazetteers; personal relative trigger word list
• Wikipedia: whether the head extent of a mention is found (via simple string matching) in the predicted Wikipedia article of another mention

References: Kambhatla, 2004; Zhou et al., 2005; Jiang and Zhai, 2007; Chan and Roth, 2010, 2011

Typical Relation Extraction Features
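The lexical and entity templates above reduce to simple string-valued indicator features; here is a minimal sketch (the field names `head`, `etype`, `between` and the feature-name prefixes are assumptions for illustration):

```python
# Hedged sketch of lexical/entity indicator features for a candidate mention pair.
def extract_features(pair):
    m1, m2 = pair["m1"], pair["m2"]
    feats = {
        "HEAD1_" + m1["head"]: 1,                       # head of first mention
        "HEAD2_" + m2["head"]: 1,                       # head of second mention
        "ET1_" + m1["etype"]: 1,                        # entity type of first head
        "ET2_" + m2["etype"]: 1,                        # entity type of second head
        "ET12_" + m1["etype"] + "_" + m2["etype"]: 1,   # conjoined entity types
    }
    for w in pair["between"]:                           # intervening-word features
        feats["BETWEEN_" + w] = 1
    return feats

pair = {"m1": {"head": "Smith", "etype": "PER"},
        "m2": {"head": "Microsoft", "etype": "ORG"},
        "between": ["a", "senior", "programmer", "at"]}
feats = extract_features(pair)
```

A real system would add POS tags, chunk paths, and the WordNet/gazetteer features listed above in the same indicator style.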

Page 7

Using Background Knowledge (Chan and Roth, 2010)

• Features employed are usually restricted to being defined on the various representations of the target sentences
• Humans rely on background knowledge to recognize relations
• Overall aim of this work:
  • Propose methods of using knowledge or resources that exist beyond the sentence
    • Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference
  • Used as additional features, or under the Constraint Conditional Model (CCM) framework with Integer Linear Programming (ILP)

Page 8

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

Page 9

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

Page 10

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

Page 11

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

Page 12

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

David Brian Cone (born January 2, 1963) is a former Major League Baseball pitcher. He compiled an 8–3 postseason record over 21 postseason starts and was a part of five World Series championship teams (1992 with the Toronto Blue Jays and 1996, 1998, 1999 & 2000 with the New York Yankees). He had a career postseason ERA of 3.80. He is the subject of the book A Pitcher's Story: Innings With David Cone by Roger Angell. Fans of David are known as "Cone-Heads." Cone lives in Stamford, Connecticut, and is formerly a color commentator for the Yankees on the YES Network.

Contents: 1 Early years, 2 Kansas City Royals, 3 New York Mets

Partly because of the resulting lack of leadership, after the 1994 season the Royals decided to reduce payroll by trading pitcher David Cone and outfielder Brian McRae, then continued their salary dump in the 1995 season. In fact, the team payroll, which was always among the league's highest, was sliced in half from $40.5 million in 1994 (fourth-highest in the major leagues) to $18.5 million in 1996 (second-lowest in the major leagues)

Page 13

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

fine-grained

Employment:Staff 0.20

Employment:Executive 0.15

Personal:Family 0.10

Personal:Business 0.10

Affiliation:Citizen 0.20

Affiliation:Based-in 0.25

Page 14

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

fine-grained                coarse-grained
Employment:Staff 0.20       Employment 0.35
Employment:Executive 0.15
Personal:Family 0.10        Personal 0.40
Personal:Business 0.10
Affiliation:Citizen 0.20    Affiliation 0.25
Affiliation:Based-in 0.25

Page 15

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

fine-grained                coarse-grained
Employment:Staff 0.20       Employment 0.35
Employment:Executive 0.15
Personal:Family 0.10        Personal 0.40
Personal:Business 0.10
Affiliation:Citizen 0.20    Affiliation 0.25
Affiliation:Based-in 0.25

Page 16

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Using Background Knowledge

fine-grained                coarse-grained
Employment:Staff 0.20       Employment 0.35
Employment:Executive 0.15
Personal:Family 0.10        Personal 0.40
Personal:Business 0.10
Affiliation:Citizen 0.20    Affiliation 0.25
Affiliation:Based-in 0.25

0.55 (= 0.20 + 0.35, the consistent pair Employment:Staff + Employment)

Page 17

Knowledge1: Wikipedia1 (as additional feature)

• We use a Wikifier system (Ratinov et al., 2010) which performs context-sensitive mapping of mentions to Wikipedia pages

• Introduce a new feature based on:

  w1(mi, mj) = 1 if A_mj(mi) or A_mi(mj), and 0 otherwise

  (A_mi(mj): the head of mj appears in the predicted Wikipedia article of mi)

• Introduce a new feature by combining the above with the coarse-grained entity types of mi, mj

(diagram: mention pair mi, mj with unknown relation r)

Page 18

Knowledge1: Wikipedia2 (as additional feature)

• Given mi,mj, we use a Parent-Child system (Do and Roth, 2010) to predict whether they have a parent-child relation

• Introduce a new feature based on:

  w2(mi, mj) = 1 if parent-child(mi, mj), and 0 otherwise

• Combine the above with the coarse-grained entity types of mi, mj

(diagram: do mi and mj have a parent-child relation?)
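Both Wikipedia features are binary indicators; a minimal sketch, assuming a simple dict of predicted article texts and a stand-in predicate for the parent-child predictor (both hypothetical):

```python
# w1 fires when one mention's head appears (simple string matching) in the other's
# predicted Wikipedia article; w2 wraps a parent-child predictor (here a stub).
def w1(mi, mj, article_of):
    """article_of: dict mapping a mention head to its predicted article text."""
    in_a = mi in article_of.get(mj, "")
    in_b = mj in article_of.get(mi, "")
    return 1 if (in_a or in_b) else 0

def w2(mi, mj, parent_child):
    """parent_child: callable standing in for the Do and Roth (2010) predictor."""
    return 1 if parent_child(mi, mj) else 0

articles = {"David Cone": "... signed by the Kansas City Royals ..."}
print(w1("Royals", "David Cone", articles))  # 1: "Royals" occurs in the article
```

In the paper's setup these bits are further conjoined with the coarse-grained entity types of the two mentions.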

Page 19

Knowledge2: Word Class Information (as additional feature)

• Supervised systems face an issue of data sparseness (of lexical features)
• Use class information of words to support better generalization: instantiated as word clusters in our work
  • Automatically generated from unlabeled texts using the algorithm of Brown et al. (1992)

(figure: Brown clustering binary tree — leaves apple, pear, Apple, IBM and bought, run, of, in, with 0/1 branch labels)

Page 20

Knowledge2: Word Class Information

• Supervised systems face an issue of data sparseness (of lexical features)
• Use class information of words to support better generalization: instantiated as word clusters in our work
  • Automatically generated from unlabeled texts using the algorithm of Brown et al. (1992)

(figure: the Brown clustering binary tree, highlighting the leaf IBM)

Page 21

Knowledge2: Word Class Information

• Supervised systems face an issue of data sparseness (of lexical features)
• Use class information of words to support better generalization: instantiated as word clusters in our work
  • Automatically generated from unlabeled texts using the algorithm of Brown et al. (1992)

(figure: the Brown clustering binary tree, showing IBM's bit string 011 read off the path from the root)

Page 22

Knowledge2: Word Class Information

• All lexical features consisting of single words will be duplicated with their corresponding bit-string representations

(figure: the Brown clustering binary tree with cluster bit strings 00, 01, 10, 11 over the leaves apple, pear, Apple, IBM, bought, run, of, in)
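The duplication step can be sketched as follows; the cluster table and feature-name conventions are assumptions (a real Brown clustering would be induced from unlabeled text):

```python
# Sketch: duplicate single-word lexical features with Brown-cluster bit strings.
# The cluster assignments below are hypothetical, mirroring the slide's toy tree.
brown_cluster = {"apple": "00", "pear": "00", "Apple": "01", "IBM": "01",
                 "bought": "10", "run": "10", "of": "11", "in": "11"}

def add_cluster_features(feats):
    out = dict(feats)
    for name in feats:
        prefix, _, word = name.rpartition("_")   # e.g. "HEAD1_IBM" -> ("HEAD1", "IBM")
        if word in brown_cluster:
            out[prefix + "_CLUSTER_" + brown_cluster[word]] = 1
    return out

feats = add_cluster_features({"HEAD1_IBM": 1, "BETWEEN_bought": 1})
# adds HEAD1_CLUSTER_01 and BETWEEN_CLUSTER_10 alongside the word features
```

Because rare words share bit-string prefixes with frequent ones, the classifier can generalize across lexical items it has never seen.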

Page 23

Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)

(figure: the CCM objective, annotated — weight vector for "local" models; collection of classifiers)

Page 24

Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)

(figure: the CCM objective, annotated — weight vector for "local" models; collection of classifiers; penalty for violating the constraint; how far y is from a "legal" assignment)

Page 25

Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)

• Wikipedia
• word clusters
• hierarchy of relations
• entity type constraints
• coreference

Page 26

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.

Constraint Conditional Models (CCMs)

fine-grained                coarse-grained
Employment:Staff 0.20       Employment 0.35
Employment:Executive 0.15
Personal:Family 0.10        Personal 0.40
Personal:Business 0.10
Affiliation:Citizen 0.20    Affiliation 0.25
Affiliation:Based-in 0.25

Page 27

• Key steps:
  • Write down a linear objective function
  • Write down constraints as linear inequalities
  • Solve using integer linear programming (ILP) packages

Constraint Conditional Models (CCMs)(Roth and Yih, 2007; Chang et al., 2008)

Page 28

Knowledge3: Relations between our target relations

(figure: relation hierarchy — coarse-grained "personal" with children family, biz; coarse-grained "employment" with children staff, executive)

Page 29

Knowledge3: Hierarchy of Relations

(figure: the relation hierarchy, with a coarse-grained classifier at the top level and a fine-grained classifier at the leaves)

Page 30

Knowledge3: Hierarchy of Relations

(figure: the relation hierarchy; for a mention pair mi, mj, predict both a coarse-grained and a fine-grained label)

Pages 31–35

Knowledge3: Hierarchy of Relations

(the relation hierarchy figure — personal: family, biz; employment: staff, executive — repeated as animation steps)

Page 36

Knowledge3: Hierarchy of Relations — Write down a linear objective function

  max Σ_R Σ_{rc ∈ Rc} p(rc) · x_{R,rc} + Σ_R Σ_{rf ∈ Rf} p(rf) · y_{R,rf}

(p(rc): coarse-grained prediction probabilities; p(rf): fine-grained prediction probabilities)

Page 37

Knowledge3: Hierarchy of Relations — Write down a linear objective function

  max Σ_R Σ_{rc ∈ Rc} p(rc) · x_{R,rc} + Σ_R Σ_{rf ∈ Rf} p(rf) · y_{R,rf}

(p(rc), p(rf): coarse-grained and fine-grained prediction probabilities; x_{R,rc}, y_{R,rf}: coarse-grained and fine-grained indicator variables; an indicator variable == a relation assignment)

Page 38

Knowledge3: Hierarchy of Relations Write down constraints

• If a relation R is assigned a coarse-grained label rc, then we must also assign to R a fine-grained relation rf which is a child of rc.

• (Capturing the inverse relationship) If we assign rf to R, then we must also assign to R the parent of rf, which is a corresponding coarse-grained label

  x_{R,rc} ≤ y_{R,rf1} + y_{R,rf2} + … + y_{R,rfn}

  y_{R,rf} ≤ x_{R,parent(rf)}
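For a single relation R the consistent-assignment inference can be illustrated by brute force instead of an ILP solver; the probabilities are the example values from the David Cone slides:

```python
# Brute-force sketch of joint coarse/fine inference: pick one coarse label rc and
# one fine label rf maximizing p(rc) + p(rf), subject to parent(rf) == rc.
from itertools import product

parent = {"Employment:Staff": "Employment", "Employment:Executive": "Employment",
          "Personal:Family": "Personal", "Personal:Business": "Personal",
          "Affiliation:Citizen": "Affiliation", "Affiliation:Based-in": "Affiliation"}
p_fine = {"Employment:Staff": 0.20, "Employment:Executive": 0.15,
          "Personal:Family": 0.10, "Personal:Business": 0.10,
          "Affiliation:Citizen": 0.20, "Affiliation:Based-in": 0.25}
p_coarse = {"Employment": 0.35, "Personal": 0.40, "Affiliation": 0.25}

best = max(((rc, rf) for rc, rf in product(p_coarse, p_fine) if parent[rf] == rc),
           key=lambda pair: p_coarse[pair[0]] + p_fine[pair[1]])
print(best)  # ('Employment', 'Employment:Staff'), with combined score 0.55
```

Note that Personal alone has the highest coarse probability (0.40), but the constraint-aware score favors Employment + Employment:Staff (0.35 + 0.20 = 0.55); an ILP solver performs the same search at scale.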

Page 39

Knowledge4: Entity Type Constraints (Roth and Yih, 2004, 2007)

• Entity types are useful for constraining the possible labels that a relation R can assume

(diagram: mention pair mi, mj with candidate labels Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in)

Page 40

• Entity types are useful for constraining the possible labels that a relation R can assume

(diagram: candidate labels with their allowed entity-type signatures — Employment:Staff (per, org); Employment:Executive (per, org); Personal:Family (per, per); Personal:Business (per, per); Affiliation:Citizen (per, gpe); Affiliation:Based-in (org, gpe))

Knowledge4: Entity Type Constraints(Roth and Yih, 2004, 2007)

Page 41

• We gather information on entity type constraints from the ACE-2004 documentation and impose them on the coarse-grained relations
• By improving the coarse-grained predictions and combining them with the hierarchical constraints defined earlier, the improvements propagate to the fine-grained predictions

(diagram: candidate labels with their allowed entity-type signatures — Employment:Staff (per, org); Employment:Executive (per, org); Personal:Family (per, per); Personal:Business (per, per); Affiliation:Citizen (per, gpe); Affiliation:Based-in (org, gpe))

Knowledge4: Entity Type Constraints (Roth and Yih, 2004, 2007)
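Imposing the type signatures amounts to filtering the candidate label set for each mention pair; a minimal sketch (the signature table follows the slide, the function name is hypothetical):

```python
# Sketch: rule out relation labels whose entity-type signature does not match
# the types of the two mentions.
allowed = {"Employment:Staff": ("per", "org"), "Employment:Executive": ("per", "org"),
           "Personal:Family": ("per", "per"), "Personal:Business": ("per", "per"),
           "Affiliation:Citizen": ("per", "gpe"), "Affiliation:Based-in": ("org", "gpe")}

def candidate_labels(type_i, type_j):
    """Return the relation labels compatible with the pair's entity types."""
    return [r for r, sig in allowed.items() if sig == (type_i, type_j)]

print(candidate_labels("per", "gpe"))  # ['Affiliation:Citizen']
```

In the CCM formulation this is expressed as linear constraints on the indicator variables rather than a hard pre-filter, but the effect on the label space is the same.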

Page 42

Knowledge5: Coreference

(diagram: mention pair mi, mj with candidate labels Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in)

Page 43

Knowledge5: Coreference

• In this work, we assume that we are given the coreference information, which is available from the ACE annotation.

(diagram: if mi and mj corefer, all relation labels are ruled out in favor of null)

Page 44

Experiment Results

F1 (%) improvement from using each knowledge source:

           All nwire   10% of nwire
BasicRE    50.5%       31.0%

Page 45

• Consider different levels of syntactic information
  • Deep processing of text produces structural but less reliable results
  • Simple surface information is less structural, but more reliable
• Generalization of feature-based solutions
  • A kernel (kernel function) defines a similarity metric Ψ(x, y) on objects
  • No need for enumeration of features
• Efficient extension of normal features into high-order spaces
  • Possible to solve a linearly non-separable problem in a higher-order space
• Nice combination properties
  • Closed under linear combination
  • Closed under polynomial extension
  • Closed under direct sum/product on different domains
• References: Zelenko et al., 2002, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Che et al., 2005; Zhang et al., 2006; Qian et al., 2007; Zhou et al., 2007; Khayyamian et al., 2009; Reichartz et al., 2009

Most Successful Learning Methods: Kernel-based

Page 46

Kernel Examples for Relation Extraction (Zhao and Grishman, 2005)

1) Argument kernel:

  K1(R1, R2) = Σ_{i=1,2} K_E(R1.arg_i, R2.arg_i)

  where K_E(E1, E2) = K_T(E1.tk, E2.tk) + I(E1.type = E2.type) + I(E1.subtype = E2.subtype) + I(E1.role = E2.role)

  and K_T is a token kernel defined as:

  K_T(T1, T2) = I(T1.word = T2.word) + I(T1.pos = T2.pos) + I(T1.base = T2.base)

2) Local dependency kernel:

  K2(R1, R2) = Σ_{i=1,2} K_D(R1.arg_i.dseq, R2.arg_i.dseq)

  where K_D(dseq, dseq′) = Σ_{0≤i<dseq.len} Σ_{0≤j<dseq′.len} I(arc_i.label = arc′_j.label) · K_T(arc_i.dw, arc′_j.dw)

3) Path kernel:

  K3(R1, R2) = K_path(R1.path, R2.path)

  where K_path(path, path′) = Σ_{0≤i<path.len} Σ_{0≤j<path′.len} I(arc_i.label = arc′_j.label) · K_T(arc_i.dw, arc′_j.dw)

Composite kernels, e.g.:

  Φ1(R1, R2) = (K1(R1, R2) + K2(R1, R2)) / 4
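The token and argument kernels above are just sums of identity indicators; a hedged sketch, assuming relations are plain dicts (the field names mirror the formulas, the example values are invented):

```python
# Sketch of the token kernel K_T, entity kernel K_E, and argument kernel K1.
def k_token(t1, t2):
    return (int(t1["word"] == t2["word"]) + int(t1["pos"] == t2["pos"])
            + int(t1["base"] == t2["base"]))

def k_entity(e1, e2):
    return (k_token(e1["tk"], e2["tk"]) + int(e1["type"] == e2["type"])
            + int(e1["subtype"] == e2["subtype"]) + int(e1["role"] == e2["role"]))

def k1(r1, r2):
    """Argument kernel: compare the two arguments position by position."""
    return sum(k_entity(r1["args"][i], r2["args"][i]) for i in range(2))

def tok(w, p, b):
    return {"word": w, "pos": p, "base": b}

ent1 = {"tk": tok("Smith", "NNP", "smith"), "type": "PER",
        "subtype": "Individual", "role": "arg1"}
ent2 = {"tk": tok("Microsoft", "NNP", "microsoft"), "type": "ORG",
        "subtype": "Commercial", "role": "arg2"}
r = {"args": [ent1, ent2]}
print(k1(r, r))  # 12: each argument contributes 3 (token) + 3 (entity) matches
```

The dependency and path kernels follow the same pattern, summing `k_token` over arcs whose labels match.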

Page 47

Occurrences of seed tuples:

  Computer servers at Microsoft’s headquarters in Redmond…
  In mid-afternoon trading, shares of Redmond-based Microsoft fell…
  The Armonk-based IBM introduced a new line…
  The combined company will operate from Boeing’s headquarters in Seattle.
  Intel, Santa Clara, cut prices of its Pentium processor.

  ORGANIZATION   LOCATION
  MICROSOFT      REDMOND
  IBM            ARMONK
  BOEING         SEATTLE
  INTEL          SANTA CLARA

(pipeline: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table)

Bootstrapping for Relation Extraction

Page 48

Learned patterns:

  • <STRING1>’s headquarters in <STRING2>
  • <STRING2>-based <STRING1>
  • <STRING1>, <STRING2>

(pipeline: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table)

Bootstrapping for Relation Extraction (Cont’)
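One pattern-matching step of the loop can be sketched with regular expressions over a toy corpus; the single-capitalized-word groups are a simplifying assumption (a real system would use NER spans):

```python
# Sketch of one bootstrapping iteration: apply learned patterns to text and
# harvest new (ORGANIZATION, LOCATION) seed tuples.
import re

# (compiled pattern, group index of ORG, group index of LOC)
patterns = [
    (re.compile(r"([A-Z]\w+)'s headquarters in ([A-Z]\w+)"), 1, 2),  # <ORG>'s headquarters in <LOC>
    (re.compile(r"([A-Z]\w+)-based ([A-Z]\w+)"), 2, 1),              # <LOC>-based <ORG>
]

corpus = ["Computer servers at Microsoft's headquarters in Redmond failed.",
          "The Armonk-based IBM introduced a new line."]

seeds = set()
for rx, org_g, loc_g in patterns:
    for text in corpus:
        for m in rx.finditer(text):
            seeds.add((m.group(org_g), m.group(loc_g)))
print(sorted(seeds))  # [('IBM', 'Armonk'), ('Microsoft', 'Redmond')]
```

The harvested tuples then seed the next iteration, which is also how the noisy entries on the next slide (e.g. JELLIES/APPLE) creep in.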

Page 49

(pipeline: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table; generate new seed tuples and start a new iteration)

  ORGANIZATION   LOCATION
  AG EDWARDS     ST LUIS
  157TH STREET   MANHATTAN
  7TH LEVEL      RICHARDSON
  3COM CORP      SANTA CLARA
  3DO            REDWOOD CITY
  JELLIES        APPLE
  MACWEEK        SAN FRANCISCO

Bootstrapping for Relation Extraction (Cont’)

Page 50

Outline

• Task Definition
• Symbolic Semantics
  • Basic Features
  • World Knowledge
  • Learning Models
• Distributional Semantics

Page 51

Word Similarity & Relatedness

• How similar is pizza to pasta?
• How related is pizza to Italy?

• Representing words as vectors allows easy computation of similarity
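With words as vectors, similarity is just the cosine of the angle between them; a tiny sketch with invented toy vectors:

```python
# Cosine similarity between word vectors (toy 3-dimensional examples).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec = {"pizza": [0.9, 0.1, 0.3], "pasta": [0.8, 0.2, 0.3], "Italy": [0.4, 0.7, 0.1]}
print(cosine(vec["pizza"], vec["pasta"]) > cosine(vec["pizza"], vec["Italy"]))  # True
```

Real vectors have hundreds of dimensions, but the similarity computation is identical.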

Page 52

Approaches for Representing Words

Distributional Semantics (Count)
• Used since the 90’s
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD

Word Embeddings (Predict)
• Inspired by deep learning
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)

Underlying Theory: The Distributional Hypothesis (Harris, ’54; Firth, ‘57)

“Similar words occur in similar contexts”
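The count-based side of the comparison starts from word-context co-occurrence counts turned into PPMI; a toy sketch with a ±1 window (the corpus and the symmetric treatment of word/context counts are simplifications):

```python
# Build word-context counts with a +/-1 window and compute PPMI cells.
import math
from collections import Counter

corpus = "similar words occur in similar contexts because similar words repeat".split()
pair_counts, word_counts = Counter(), Counter()
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            pair_counts[(w, corpus[j])] += 1
            word_counts[w] += 1

total = sum(pair_counts.values())

def ppmi(w, c):
    """Positive pointwise mutual information of a (word, context) cell."""
    p_wc = pair_counts[(w, c)] / total
    if p_wc == 0:
        return 0.0
    p_w = word_counts[w] / total
    p_c = word_counts[c] / total
    return max(0.0, math.log(p_wc / (p_w * p_c)))
```

In the classic pipeline this sparse PPMI matrix would then be decomposed with SVD to obtain dense word vectors.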

Page 53

Approaches for Representing Words

Both approaches:
• Rely on the same linguistic theory
• Use the same data
• Are mathematically related
  • “Neural Word Embedding as Implicit Matrix Factorization” (NIPS 2014)
• How come word embeddings are so much better?
  • “Don’t Count, Predict!” (Baroni et al., ACL 2014)
• More than meets the eye…

Page 54

What’s really improving performance?

The Contributions of Word Embeddings

Novel Algorithms (objective + training method)
• Skip-Grams + Negative Sampling
• CBOW + Hierarchical Softmax
• Noise Contrastive Estimation
• GloVe
• …

New Hyperparameters (preprocessing, smoothing, etc.)
• Subsampling
• Dynamic Context Windows
• Context Distribution Smoothing
• Adding Context Vectors
• …

Pages 55–57

(the same “Contributions of Word Embeddings” slide, repeated as animation steps)

Page 58

Our Contributions

1) Identifying the existence of new hyperparameters
   • Not always mentioned in papers
2) Adapting the hyperparameters across algorithms
   • Must understand the mathematical relation between algorithms

Page 59

Our Contributions

1) Identifying the existence of new hyperparameters
   • Not always mentioned in papers
2) Adapting the hyperparameters across algorithms
   • Must understand the mathematical relation between algorithms
3) Comparing algorithms across all hyperparameter settings
   • Over 5,000 experiments

Page 60

Background

Page 61

What is word2vec?

Page 62

What is word2vec?

How is it related to PMI?

Page 63

What is word2vec?

• word2vec is not a single algorithm
• It is a software package for representing words as vectors, containing:
  • Two distinct models
    • CBoW
    • Skip-Gram
  • Various training methods
    • Negative Sampling
    • Hierarchical Softmax
  • A rich preprocessing pipeline
    • Dynamic Context Windows
    • Subsampling
    • Deleting Rare Words

Page 64

What is word2vec?

• word2vec is not a single algorithm
• It is a software package for representing words as vectors, containing:
  • Two distinct models
    • CBoW
    • Skip-Gram (SG)
  • Various training methods
    • Negative Sampling (NS)
    • Hierarchical Softmax
  • A rich preprocessing pipeline
    • Dynamic Context Windows
    • Subsampling
    • Deleting Rare Words

Page 66

Skip-Grams with Negative Sampling (SGNS)

Marco saw a furry little wampimuk hiding in the tree.

(“word2vec Explained…”, Goldberg & Levy, arXiv 2014)

Page 67

Skip-Grams with Negative Sampling (SGNS)

Marco saw a furry little wampimuk hiding in the tree.

(“word2vec Explained…”, Goldberg & Levy, arXiv 2014)

Page 68

Skip-Grams with Negative Sampling (SGNS)

Marco saw a furry little wampimuk hiding in the tree.

words       contexts
wampimuk    furry
wampimuk    little
wampimuk    hiding
wampimuk    in
…           …

(“word2vec Explained…”, Goldberg & Levy, arXiv 2014)

(data)
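Extracting the (word, context) training pairs from the wampimuk sentence is a simple sliding window; a sketch with a symmetric window of 2:

```python
# Generate skip-gram (word, context) pairs with a symmetric window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

sent = "Marco saw a furry little wampimuk hiding in the tree".split()
pairs = skipgram_pairs(sent)
print([c for w, c in pairs if w == "wampimuk"])  # ['furry', 'little', 'hiding', 'in']
```

word2vec additionally samples the window size dynamically per token (the “Dynamic Context Windows” hyperparameter listed earlier); here the window is fixed for clarity.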

Page 69

Skip-Grams with Negative Sampling (SGNS)

• SGNS finds a vector w for each word in our vocabulary V_W
• Each such vector has d latent dimensions
• Effectively, it learns a matrix W (V_W × d) whose rows represent the words
• Key point: it also learns a similar auxiliary matrix C (V_C × d) of context vectors
• In fact, each word has two embeddings: a word vector (w:wampimuk) and a context vector (c:wampimuk)

(“word2vec Explained…”, Goldberg & Levy, arXiv 2014)

Page 70

Skip-Grams with Negative Sampling (SGNS)

(“word2vec Explained…”, Goldberg & Levy, arXiv 2014)

Page 71

Skip-Grams with Negative Sampling (SGNS)
• Maximize: σ(w · c), where context c was observed with word w

words → contexts:
wampimuk → furry
wampimuk → little
wampimuk → hiding
wampimuk → in

“word2vec Explained…” Goldberg & Levy, arXiv 2014

71

Page 72:

Skip-Grams with Negative Sampling (SGNS)
• Maximize: σ(w · c), where context c was observed with word w

words → contexts:
wampimuk → furry
wampimuk → little
wampimuk → hiding
wampimuk → in

• Minimize: σ(w · c′), where c′ was hallucinated with w

words → contexts:
wampimuk → Australia
wampimuk → cyber
wampimuk → the
wampimuk → 1985

“word2vec Explained…” Goldberg & Levy, arXiv 2014

72
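The objective on this slide — maximize σ(w · c) for observed pairs and minimize σ(w · c′) for hallucinated ones — can be written out for a single training example. The vectors below are invented toy values, and the helper names are my own.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_objective(w, c, negatives):
    """log sigma(w.c) for the observed pair, plus log sigma(-w.c')
    for each sampled negative context c'."""
    obj = math.log(sigmoid(dot(w, c)))
    for c_neg in negatives:
        obj += math.log(sigmoid(-dot(w, c_neg)))
    return obj

# Toy vectors, purely for illustration.
w = [0.5, -0.2, 0.1]
c = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.1, -0.2]]
score = sgns_objective(w, c, negatives)
```

The objective is a sum of log-probabilities, so it is always negative; training pushes w toward c and away from the negatives.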

Page 73:

Skip-Grams with Negative Sampling (SGNS)
• “Negative Sampling”

• SGNS samples contexts at random as negative examples
  • “Random” = unigram distribution

• Spoiler: Changing this distribution has a significant effect

73
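Sampling negatives from the unigram distribution is just frequency-weighted sampling. A toy sketch (corpus and helper names are invented for illustration):

```python
import random
from collections import Counter

# Toy corpus; negatives are drawn from the (unsmoothed) unigram
# distribution, i.e. proportionally to corpus frequency.
corpus = "the cat sat on the mat the dog sat on the log".split()
counts = Counter(corpus)
vocab = list(counts)
weights = [counts[w] for w in vocab]

def sample_negatives(k, rng):
    """Draw k negative contexts with probability proportional to frequency."""
    return rng.choices(vocab, weights=weights, k=k)

negatives = sample_negatives(5, random.Random(0))
```

Frequent words like “the” get sampled as negatives far more often than rare ones, which is exactly what the later smoothing hyperparameter modifies.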

Page 74:

What is SGNS learning?

74

Page 75:

What is SGNS learning?

• Take SGNS’s embedding matrices (W and C)

“Neural Word Embeddings as Implicit Matrix Factorization”

Levy & Goldberg, NIPS 2014

[Figure: W is a V_W × d matrix; C is a V_C × d matrix]

75

Page 76:

What is SGNS learning?

• Take SGNS’s embedding matrices (W and C)
• Multiply them
• What do you get?

[Figure: W (V_W × d) multiplied by Cᵀ (d × V_C)]

“Neural Word Embeddings as Implicit Matrix Factorization”

Levy & Goldberg, NIPS 2014

76

Page 77:

What is SGNS learning?

• A V_W × V_C matrix
• Each cell describes the relation between a specific word-context pair

[Figure: W (V_W × d) · Cᵀ (d × V_C) = ? (a V_W × V_C matrix)]

“Neural Word Embeddings as Implicit Matrix Factorization”
Levy & Goldberg, NIPS 2014

77

Page 78:

What is SGNS learning?

• We proved that for large enough d and enough iterations:

[Figure: W (V_W × d) · Cᵀ (d × V_C) = ? (a V_W × V_C matrix)]

“Neural Word Embeddings as Implicit Matrix Factorization”
Levy & Goldberg, NIPS 2014

78

Page 79:

What is SGNS learning?

• We proved that for large enough d and enough iterations

• We get the word-context PMI matrix:

W · Cᵀ = M^PMI

“Neural Word Embeddings as Implicit Matrix Factorization”
Levy & Goldberg, NIPS 2014

79

Page 80:

What is SGNS learning?

• We proved that for large enough d and enough iterations

• We get the word-context PMI matrix, shifted by a global constant:

W · Cᵀ = M^PMI − log k

“Neural Word Embeddings as Implicit Matrix Factorization”
Levy & Goldberg, NIPS 2014

80
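The target of this factorization, PMI(w, c) − log k, can be computed directly from co-occurrence counts. The counts below are invented toy values, purely to illustrate the quantity each cell of W · Cᵀ approximates.

```python
import math

# Toy co-occurrence counts (invented for illustration).
cooc = {("cat", "furry"): 8, ("cat", "meow"): 6,
        ("dog", "furry"): 7, ("dog", "bark"): 5}
total = sum(cooc.values())
w_count, c_count = {}, {}
for (w, c), n in cooc.items():
    w_count[w] = w_count.get(w, 0) + n
    c_count[c] = c_count.get(c, 0) + n

def pmi(w, c):
    """log [ P(w,c) / (P(w) P(c)) ], computed from raw counts."""
    n_wc = cooc.get((w, c), 0)
    if n_wc == 0:
        return float("-inf")
    return math.log(n_wc * total / (w_count[w] * c_count[c]))

def shifted_pmi(w, c, k):
    """The value a cell of W . C^T approximates, for k negative samples."""
    return pmi(w, c) - math.log(k)
```

Unobserved pairs have PMI of −∞, which is why count-based methods usually switch to PPMI (clipping at zero) before factorizing.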

Page 81:

What is SGNS learning?

• SGNS is doing something very similar to the older approaches

• SGNS is factorizing the traditional word-context PMI matrix

• So does SVD!

• GloVe factorizes a similar word-context matrix

81

Page 82:

But embeddings are still better, right?
• Plenty of evidence that embeddings outperform traditional methods
  • “Don’t Count, Predict!” (Baroni et al., ACL 2014)
  • GloVe (Pennington et al., EMNLP 2014)

• How does this fit with our story?

82

Page 83:

The Big Impact of “Small” Hyperparameters

83

Page 84:

The Big Impact of “Small” Hyperparameters
• word2vec & GloVe are more than just algorithms…

• They introduce new hyperparameters

• These may seem minor, but they make a big difference in practice

84

Page 85:

Identifying New Hyperparameters

85

Page 86:

New Hyperparameters

• Preprocessing (word2vec)
  • Dynamic Context Windows
  • Subsampling
  • Deleting Rare Words

• Postprocessing (GloVe)
  • Adding Context Vectors

• Association Metric (SGNS)
  • Shifted PMI
  • Context Distribution Smoothing

86
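One of the preprocessing hyperparameters above, subsampling, downweights very frequent words. A minimal sketch: a token with corpus frequency f is kept with probability sqrt(t / f) once f exceeds a threshold t; this keep-probability form and the t = 1e-5 default are a common simplification of word2vec's formula, stated here as an assumption.

```python
import math

def keep_probability(freq, t=1e-5):
    """Probability of keeping a token whose corpus frequency is `freq`
    (simplified word2vec-style subsampling; exact formula varies)."""
    if freq <= t:
        return 1.0
    return math.sqrt(t / freq)
```

The effect: rare words are always kept, while a stop-word-like frequency of 1e-3 is kept only about 10% of the time, shrinking its share of training pairs.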

Page 90:

Dynamic Context Windows

Marco saw a furry little wampimuk hiding in the tree.

90

Page 92:

Dynamic Context Windows

Marco saw a furry little wampimuk hiding in the tree.

word2vec: the window size is sampled uniformly from 1..L, so a context at distance d is weighted (L − d + 1) / L

GloVe: harmonic weighting, 1 / d

Aggressive: an even steeper decay with distance

The Word-Space Model (Sahlgren, 2006)

92
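The two main weighting schemes can be written as simple functions of the distance d and window size L; these forms follow the word2vec and GloVe descriptions above, and the function names are my own.

```python
def word2vec_weight(d, L):
    """Expected weight of a context at distance d when the window size
    is sampled uniformly from 1..L (word2vec's dynamic window)."""
    return (L - d + 1) / L

def glove_weight(d):
    """GloVe's harmonic weighting by distance."""
    return 1.0 / d

# Weights for a window of size 4, distances 1..4.
weights_l4 = [(d, word2vec_weight(d, 4), glove_weight(d)) for d in (1, 2, 3, 4)]
```

Both schemes give nearby contexts more influence; they differ in how quickly the weight falls off.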

Page 93:

Adding Context Vectors

• SGNS creates word vectors w
• SGNS also creates auxiliary context vectors c

• So do GloVe and SVD

93

Page 94:

Adding Context Vectors

• SGNS creates word vectors w
• SGNS also creates auxiliary context vectors c
  • So do GloVe and SVD

• Instead of just w, represent a word as: w + c

• Introduced by Pennington et al. (2014)
• Only applied to GloVe

94
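The postprocessing step above is just an elementwise sum of the two embeddings of each word. A toy sketch (vector values invented):

```python
def add_context_vectors(w_vec, c_vec):
    """Represent a word by w + c, the sum of its word vector and its
    auxiliary context vector."""
    return [wi + ci for wi, ci in zip(w_vec, c_vec)]

rep = add_context_vectors([0.2, -0.5, 1.0], [0.1, 0.5, -0.4])
```

Since both matrices factorize the same association matrix, the summed representation folds in second-order similarity information at no extra training cost.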

Page 95:

Adapting Hyperparameters across Algorithms

95

Page 96:

Context Distribution Smoothing
• SGNS samples contexts c′ to form negative examples

• Our analysis assumes the noise distribution is the unigram distribution P(c)

96

Page 97:

Context Distribution Smoothing
• SGNS samples contexts c′ to form negative examples

• Our analysis assumes the noise distribution is the unigram distribution P(c)

• In practice, it’s a smoothed unigram distribution, P(c)^0.75 (normalized)

• This little change makes a big difference

97

Page 98:

Context Distribution Smoothing
• We can adapt context distribution smoothing to PMI!

• Replace P(c) with P^0.75(c):

PMI^0.75(w, c) = log [ P(w, c) / (P(w) · P^0.75(c)) ]

• Consistently improves PMI on every task

• Always use Context Distribution Smoothing!

98
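The smoothed-PMI variant can be computed by raising context counts to the 0.75 power and renormalizing before forming P(c). The co-occurrence counts below are toy values.

```python
import math
from collections import Counter

# Toy counts: "the" is a very frequent context, "purr" a rare one.
cooc = Counter({("cat", "purr"): 4, ("cat", "the"): 20,
                ("dog", "the"): 24, ("dog", "bark"): 2})
total = sum(cooc.values())
w_count, c_count = Counter(), Counter()
for (w, c), n in cooc.items():
    w_count[w] += n
    c_count[c] += n

ALPHA = 0.75  # word2vec's smoothing exponent
z = sum(n ** ALPHA for n in c_count.values())

def smoothed_pmi(w, c):
    """PMI with P(c) replaced by the smoothed, renormalized P^0.75(c)."""
    p_wc = cooc[(w, c)] / total
    p_w = w_count[w] / total
    p_c_smoothed = c_count[c] ** ALPHA / z
    return math.log(p_wc / (p_w * p_c_smoothed))
```

Smoothing raises the relative probability of rare contexts, which lowers their PMI and counteracts PMI's well-known bias toward rare events.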

Page 99:

Comparing Algorithms

99

Page 100:

Controlled Experiments

• Prior art was unaware of these hyperparameters

• Essentially, comparing “apples to oranges”

• We allow every algorithm to use every hyperparameter

100

Page 101:

Controlled Experiments

• Prior art was unaware of these hyperparameters

• Essentially, comparing “apples to oranges”

• We allow every algorithm to use every hyperparameter*

* If transferable

101

Page 102:

Systematic Experiments

• 9 Hyperparameters
  • 6 New

• 4 Word Representation Algorithms
  • PPMI (Sparse & Explicit)
  • SVD(PPMI)
  • SGNS
  • GloVe

• 8 Benchmarks
  • 6 Word Similarity Tasks
  • 2 Analogy Tasks

• 5,632 experiments

102

Page 104:

Hyperparameter Settings

Classic Vanilla Setting
(commonly used for distributional baselines)

• Preprocessing
  • <None>

• Postprocessing
  • <None>

• Association Metric
  • Vanilla PMI/PPMI

104

Page 105:

Hyperparameter Settings

Classic Vanilla Setting
(commonly used for distributional baselines)

• Preprocessing
  • <None>

• Postprocessing
  • <None>

• Association Metric
  • Vanilla PMI/PPMI

Recommended word2vec Setting
(tuned for SGNS)

• Preprocessing
  • Dynamic Context Window
  • Subsampling

• Postprocessing
  • <None>

• Association Metric
  • Shifted PMI/PPMI
  • Context Distribution Smoothing

105

Page 106:

Experiments

[Bar chart: WordSim-353 Relatedness, Spearman’s correlation (y-axis 0.3–0.7), comparing PPMI (Sparse Vectors) and SGNS (Embeddings)]

106

Page 107:

Experiments: Prior Art

[Bar chart: WordSim-353 Relatedness, Spearman’s correlation (y-axis 0.3–0.7). PPMI (Sparse Vectors): Vanilla setting 0.54, word2vec setting 0.688. SGNS (Embeddings): Vanilla setting 0.587, word2vec setting 0.623]

107

Page 108:

Experiments: Hyperparameter Tuning

[Bar chart: WordSim-353 Relatedness, Spearman’s correlation (y-axis 0.3–0.7), different settings per algorithm. PPMI (Sparse Vectors): Vanilla setting 0.54, word2vec setting 0.688, Optimal setting 0.697. SGNS (Embeddings): Vanilla setting 0.587, word2vec setting 0.623, Optimal setting 0.681]

108

Page 109:

Overall Results

• Hyperparameters often have stronger effects than algorithms

• Hyperparameters often have stronger effects than more data

• Prior superiority claims were not accurate

109

Page 110:

Re-evaluating Prior Claims

110

Page 111:

Don’t Count, Predict! (Baroni et al., 2014)
• “word2vec is better than count-based methods”

• Hyperparameter settings account for most of the reported gaps

• Embeddings do not really outperform count-based methods

111

Page 112:

Don’t Count, Predict! (Baroni et al., 2014)
• “word2vec is better than count-based methods”

• Hyperparameter settings account for most of the reported gaps

• Embeddings do not really outperform count-based methods*

* Except for one task…

112

Page 113:

GloVe (Pennington et al., 2014)
• “GloVe is better than word2vec”

• Hyperparameter settings account for most of the reported gaps
  • Adding context vectors applied only to GloVe
  • Different preprocessing

• We observed the opposite: SGNS outperformed GloVe on every task

• Our largest corpus: 10 billion tokens
  • Perhaps larger corpora behave differently?

113

Page 115:

Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014)

• “PPMI vectors perform on par with SGNS on analogy tasks”

• Holds for semantic analogies• Does not hold for syntactic analogies (MSR dataset)

• Hyperparameter settings account for most of the reported gaps

• Different context type for PPMI vectors

• Syntactic Analogies: there is a real gap in favor of SGNS

115

Page 116:

Conclusions

116

Page 118:

Conclusions: Distributional Similarity

The Contributions of Word Embeddings:
• Novel Algorithms
• New Hyperparameters

What’s really improving performance?
• Hyperparameters (mostly)
• The algorithms are an improvement
• SGNS is robust & efficient

118

Page 119:

Conclusions: Methodology

• Look for hyperparameters

• Adapt hyperparameters across different algorithms

• For good results: tune hyperparameters

• For good science: tune baselines’ hyperparameters

Thank you :)

119