
Learning Word Subsumption Projections for the Russian Language

Dmitry Ustalov 1,2   Alexander Panchenko 3

1 Ural Federal University, Russia
2 Krasovskii Institute of Mathematics and Mechanics, Russia
3 Technische Universität Darmstadt, Germany

October 6, 2016


Introduction

Hyponymy is the asymmetric relationship between a generic term (hypernym) and an instance of this term (hyponym).

In biology, the same relationship between genus and species is called subsumption.
Examples: cat is-a feline, laptop is-a computer.

Such resources are extremely useful in various NLP applications, but are barely available for Russian.

Goals:
- Propose an approach for learning subsumptions for Russian.
- Develop software implementing it.
- Empirically evaluate the approach.


Related Work

- Traditionally, subsumptions were derived by expert lexicographers.
- Hearst (1992) proposed using lexico-syntactic patterns for extracting subsumptions automatically from a text corpus.
- Mikolov et al. (2013) developed word2vec, an efficient tool for inducing word embeddings.
- Fu et al. (2014) presented a projection learning setup for transforming hyponym embeddings into hypernym embeddings.
- Arefyev et al. (2015) trained a large word embedding model for Russian during the RUSSE competition.
- Shwartz et al. (2016) used RNNs for learning subsumptions for English.


Word Embeddings

Word embeddings are similar to an SVD of a document-term matrix: the vocabulary words are mapped into dense vectors.

Figure from Mikolov et al. (2013).

What about other linear transformations?


Approach: The Baseline

The baseline approach learns a projection matrix Φ* that transforms a hyponym vector x⃗ into its hypernym vector y⃗.

Φ* = argmin_Φ (1/N) ∑_{(x⃗, y⃗)} dist(x⃗Φ, y⃗)

This is achieved by numerically minimizing the Euclidean (L2) distance using linear regression.

Also, separating the initial vector space using k-means substantially increases the model capacity.
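The slides do not include code for this step; the following is a minimal illustrative sketch (not the authors' released projlearn code) of how the baseline could look with NumPy and scikit-learn, assuming one projection matrix is fitted per k-means cluster, in the spirit of Fu et al. (2014). The function names and interface are hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_baseline(X, Y, n_clusters=3):
        """Fit one projection matrix per k-means cluster of the hyponym vectors.
        X: (N, d) hyponym embeddings, Y: (N, d) hypernym embeddings."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        projections = {}
        for c in range(n_clusters):
            mask = km.labels_ == c
            # Closed-form least squares: minimizes the L2 distance ||X_c Phi - Y_c||.
            Phi, *_ = np.linalg.lstsq(X[mask], Y[mask], rcond=None)
            projections[c] = Phi
        return km, projections

    def predict_hypernym(x, km, projections):
        """Project a hyponym vector with the matrix of its cluster."""
        c = int(km.predict(x.reshape(1, -1))[0])
        return x @ projections[c]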


Example: Identity vs. Projection

Nearest neighbours of кот 'cat' in the original space (Identity) vs. after applying the baseline projection:

Identity                     Baseline Projection
кот              1.0000      кот              0.7012
котище           0.7766      животное         0.7643
кота             0.7688      зверь            0.7299
котяра           0.7663      хищник           0.7201
котенок          0.7462      намбат           0.7060
барсик           0.7272      cryptoprocta     0.6994
кот…             0.7124      сумчатость       0.6978
кот, —           0.7085      вуалехвостый     0.6940
котом            0.7070      виверровая       0.6888
мяукнул          0.6980      гепардообразной  0.6885

(Identity returns word forms and associations of кот 'cat'; the projection returns animal terms such as животное 'animal', зверь 'beast' and хищник 'predator'.)


Approach: Hyponymy Penalization

Applying the same transformation Φ to the projected hypernym vector x⃗Φ as was applied to the hyponym should not yield the initial hyponym vector x⃗.

Φ* = argmin_Φ (1/N) | (1 − α) ∑_{(x⃗, y⃗)} dist(x⃗Φ, y⃗) − α ∑_{x⃗} dist(x⃗ΦΦ, x⃗) |
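As a sketch only (the implementation slide later mentions TensorFlow; this is not the authors' code), the penalized objective could be written as follows, using squared L2 distance for dist and the α weight given later. The function name is hypothetical.

    import tensorflow as tf

    def hyponymy_penalized_loss(Phi, X, Y, alpha=0.01):
        """abs((1 - alpha) * mean dist(X Phi, Y) - alpha * mean dist(X Phi Phi, X))."""
        XPhi = tf.matmul(X, Phi)
        attract = tf.reduce_mean(tf.reduce_sum(tf.square(XPhi - Y), axis=1))
        # Re-projecting the predicted hypernym should NOT land back on the hyponym.
        repel = tf.reduce_mean(tf.reduce_sum(tf.square(tf.matmul(XPhi, Phi) - X), axis=1))
        return tf.abs((1.0 - alpha) * attract - alpha * repel)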


Approach: Synonymy Penalization

Exploit the negative sampling technique by explicitly providing examples of synonyms z⃗, penalizing the matrix for producing vectors similar to them.

Φ* = argmin_Φ (1/N) | (1 − α) ∑_{(x⃗, y⃗)} dist(x⃗Φ, y⃗) − α ∑_{(x⃗, z⃗)} dist(x⃗ΦΦ, z⃗) |
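Analogously, a hypothetical sketch of this loss, where Z stacks one negatively sampled synonym vector z⃗ per hyponym in X; again an illustration, not the released implementation.

    import tensorflow as tf

    def synonymy_penalized_loss(Phi, X, Y, Z, alpha=0.01):
        """Z: sampled synonym vectors acting as negative examples."""
        XPhi = tf.matmul(X, Phi)
        attract = tf.reduce_mean(tf.reduce_sum(tf.square(XPhi - Y), axis=1))
        # The re-projected vector should stay away from the sampled synonyms.
        repel = tf.reduce_mean(tf.reduce_sum(tf.square(tf.matmul(XPhi, Phi) - Z), axis=1))
        return tf.abs((1.0 - alpha) * attract - alpha * repel)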


Approach: Hypernymy Promotion

Instead of negative sampling, encourage the matrix to produce hypernyms not just for the initial hyponym, but also for its randomly sampled synonym z⃗.
In lexical ontologies, words are grouped into synsets (sets of synonyms), and subsumptions are established between such synsets.

Φ* = argmin_Φ (1/N) [ (1 − β) ∑_{(x⃗, y⃗)} dist(x⃗Φ, y⃗) + β ∑_{(y⃗, z⃗)} dist(z⃗Φ, y⃗) ]
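A corresponding sketch of the promotion objective (hypothetical names; Z holds sampled synonym vectors of the hyponyms, which should also project onto the hypernyms Y):

    import tensorflow as tf

    def hypernymy_promoted_loss(Phi, X, Y, Z, beta=0.3):
        attract = tf.reduce_mean(tf.reduce_sum(tf.square(tf.matmul(X, Phi) - Y), axis=1))
        # Projections of the synonyms should also land near the hypernym vectors.
        promote = tf.reduce_mean(tf.reduce_sum(tf.square(tf.matmul(Z, Phi) - Y), axis=1))
        return (1.0 - beta) * attract + beta * promote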


Implementation

- A single-layer perceptron instead of linear regression.
- TensorFlow for defining and executing the computation graph, scikit-learn for k-means clustering.
- The Adam stochastic optimization method for minimizing the loss function.
- Tried the cosine distance instead of L2, but without any luck (the details are in the paper).
- Parameters: α = 0.01, β = 0.3, 14 000 training epochs.
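The slide only names the components; as a rough sketch of how such a training loop might look today (TensorFlow 2 eager style with Adam rather than the original 2016 graph-based code; the plain D × D matrix and variable names are assumptions):

    import tensorflow as tf

    D = 500                                   # embedding dimensionality (see the setup slide)
    Phi = tf.Variable(tf.random.normal([D, D], stddev=0.01))
    optimizer = tf.keras.optimizers.Adam()

    def l2_loss(X, Y):
        # Mean squared Euclidean distance between projected hyponyms and hypernyms.
        return tf.reduce_mean(tf.reduce_sum(tf.square(tf.matmul(X, Phi) - Y), axis=1))

    def train(batches, epochs=14000):
        for epoch in range(epochs):
            for X, Y in batches:              # batches of hyponym/hypernym vector pairs
                with tf.GradientTape() as tape:
                    loss = l2_loss(X, Y)
                grads = tape.gradient(loss, [Phi])
                optimizer.apply_gradients(zip(grads, [Phi]))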


Experiments: The Setup

Language Resources:
- 500-dimensional word vectors for Russian trained using the skip-gram architecture (64 GB in RAM),
- 21 997 training items: Hearst patterns + Russian Wiktionary,
- 10 811 test items: Russian Wiktionary only.

To avoid lexical overfitting, the training and test sets have disjoint vocabularies.

Computational Resources:
- Intel Xeon E5-2620 v2 @ 2.10 GHz (32 GB of RAM),
- NVIDIA Tesla K20Xm, 2688 cores (6 GB of VRAM).

Some preliminary computations were done on another machine with a larger amount of available RAM.

Each experiment is run five times to evaluate statistical significance using a t-test.


Experiments: The Metric

No standard evaluation metric for this task is available yet. We measure the quality by analyzing the ten nearest neighbours.

A@10 = (1/N) ∑_{(x⃗, y⃗)} 1( NN₁₀(x⃗Φ*) ∋ y⃗ )

This is the probability that the correct hypernym appears among the ten nearest neighbours of the projected hyponym, where the hyponym was previously unseen by the model.
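Purely as an illustration of the metric (hypothetical names, cosine-similarity nearest-neighbour search over the whole vocabulary matrix):

    import numpy as np

    def a_at_10(X_test, gold_indices, Phi, vocab_vectors, k=10):
        """Fraction of test hyponyms whose gold hypernym (a row id into vocab_vectors)
        appears among the k nearest neighbours of the projected vector."""
        proj = X_test @ Phi
        proj_n = proj / np.linalg.norm(proj, axis=1, keepdims=True)
        vocab_n = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
        sims = proj_n @ vocab_n.T                  # (N, V) cosine similarities
        top_k = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest neighbours
        hits = [gold in row for gold, row in zip(gold_indices, top_k)]
        return float(np.mean(hits))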


Experiments: The Results

In (almost) every configuration, both hyponymy and synonymy penalizations significantly outperform the baseline.

[Figure: A@10 (roughly 0.2 to 0.3) versus the number of k-means clusters (1 to 10) for the Baseline, Pen. Hyponymy, Pen. Synonymy and Prom. Hypernymy models.]


Experiments: The Performance

Since the matrix is 501 × 500, using a GPU is infeasible due to the requirements on the batch size (which is 512 in our case).

[Figure: seconds per 1 000 training epochs versus batch size (512 to 8192) for the Baseline and Pen. Synonymy models, on CPU and GPU.]


Example: Baseline vs. Penalization

Nearest neighbours of the projected кот 'cat' under the baseline and the synonymy-penalized models:

Baseline Projection           Pen. Synonymy Projection
кот              0.7012       кот              0.6889
животное         0.7643       животное         0.7757
зверь            0.7299       зверь            0.7283
хищник           0.7201       хищник           0.7141
намбат           0.7060       намбат           0.6983
cryptoprocta     0.6994       сумчатость       0.6946
сумчатость       0.6978       cryptoprocta     0.6915
вуалехвостый     0.6940       ornithoryngue    0.6887
виверровая       0.6888       млекопитающее    0.6876
гепардообразной  0.6885       кволл            0.6837

(With synonymy penalization, млекопитающее 'mammal' and кволл 'quoll' enter the top neighbours, and животное 'animal' moves closer.)


Conclusion

- A negative sampling approach for synonymy penalization has been proposed and successfully evaluated.
- An open source projection learning toolkit has been developed using TensorFlow: https://github.com/dustalov/projlearn/.
- The released datasets, including the trained models, are also available to other researchers under a libre license.
- GPUs have a lot of potential in our task: multi-layer setups, CNNs, RNNs, etc.
- The primary obstacle right now is the availability of training subsumptions.


Thank You!

Dmitry Ustalov https://linkedin.com/in/ustalov [email protected]

The reported study was funded by RFBR according to the research project No. 16-37-00354 мол_a. We are grateful to Nikolay Arefyev, Andrey Kutuzov, Andrey Krizhanovsky, Benjamin Milde and Alexander Bersenev for the fruitful discussions on the present study. Dmitry Ustalov was partially supported by the Deutscher Akademischer Austauschdienst (DAAD) scholarship. Alexander Panchenko was supported by the Deutsche Forschungsgemeinschaft (DFG) foundation under the project “JOIN-T: Joining Ontologies and Semantics Induced from Text”.
