Learning Word Subsumption Projections for the Russian Language
Dmitry Ustalov¹,² Alexander Panchenko³
¹Ural Federal University, Russia ²Krasovskii Institute of Mathematics and Mechanics, Russia
³Technische Universität Darmstadt, Germany
October 6, 2016
Introduction
Hyponymy is the asymmetric relationship between a generic term (hypernym) and an instance of this term (hyponym).
In biology, the same relationship between genus and species is called subsumption. Examples: cat is-a feline, laptop is-a computer.
Extremely useful in various NLP applications, but barely available for Russian.
Goals:
- Propose an approach for learning subsumptions for Russian.
- Develop the software for that.
- Empirically evaluate the approach.
Related Work
Traditionally, subsumptions were derived by expert lexicographers.
- Hearst (1992) proposed using lexico-syntactic patterns for extracting subsumptions automatically from a text corpus.
- Mikolov et al. (2013) developed word2vec, an efficient tool for inducing word embeddings.
- Fu et al. (2014) presented a projection learning setup for transforming hyponym embeddings into hypernym embeddings.
- Arefyev et al. (2015) trained a large word embedding model for Russian during the RUSSE competition.
- Shwartz et al. (2016) used RNNs for learning subsumptions for English.
Word Embeddings
Word embeddings are similar to an SVD of a document-term matrix: the vocabulary words are mapped into dense vectors.
Figure from Mikolov et al. (2013).
What about other linear transformations?
Approach: The Baseline
The baseline approach learns a projection matrix Φ* that transforms a hyponym vector x⃗ into its hypernym vector y⃗:

$$\Phi^* = \arg\min_{\Phi} \frac{1}{N} \sum_{(\vec{x}, \vec{y})} \operatorname{dist}(\vec{x}\Phi, \vec{y})$$
This is achieved by numerically minimizing the Euclidean (L2) distance using linear regression.
Also, partitioning the initial vector space with k-means and training a separate projection for each cluster substantially increases the model capacity.
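As a rough, hypothetical illustration of this baseline (not the authors' released code), Φ can be fitted in one shot with ordinary least squares; the random matrices below merely stand in for real embedding pairs:

```python
import numpy as np

# Toy stand-ins for N (hyponym, hypernym) embedding pairs; the slides use
# 500-dimensional vectors, mirrored here.
rng = np.random.default_rng(0)
N, d = 1000, 500
X = rng.normal(size=(N, d))  # hyponym vectors x
Y = rng.normal(size=(N, d))  # hypernym vectors y

# Linear regression: Phi minimizing the summed squared L2 distance
# between x @ Phi and y over all training pairs.
Phi, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Project an unseen hyponym vector towards its predicted hypernym.
x_new = rng.normal(size=(1, d))
y_hat = x_new @ Phi
print(Phi.shape, y_hat.shape)  # (500, 500) (1, 500)
```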
Example: Identity vs. Projection
Nearest neighbours of the transformed vector for кот (cat); the first row shows the query word's own similarity.

Identity                          Baseline Projection
кот (cat)              1.0000     кот (cat)                       0.7012
котище (big tomcat)    0.7766     животное (animal)               0.7643
кота (cat, genitive)   0.7688     зверь (beast)                   0.7299
котяра (tomcat)        0.7663     хищник (predator)               0.7201
котенок (kitten)       0.7462     намбат (numbat)                 0.7060
барсик (Barsik)        0.7272     cryptoprocta                    0.6994
кот…                   0.7124     сумчатость (marsupiality)       0.6978
кот, —                 0.7085     вуалехвостый (veiltail)         0.6940
котом (cat, instr.)    0.7070     виверровая (viverrid)           0.6888
мяукнул (meowed)       0.6980     гепардообразной (cheetah-like)  0.6885
Approach: Hyponymy Penalization
Applying the same transformation to the obtained hypernym vector x⃗Φ should not yield the initial hyponym vector x⃗ back, since hyponymy is asymmetric:

$$\Phi^* = \arg\min_{\Phi} \frac{1}{N} \left| (1 - \alpha) \sum_{(\vec{x}, \vec{y})} \operatorname{dist}(\vec{x}\Phi, \vec{y}) - \alpha \sum_{\vec{x}} \operatorname{dist}(\vec{x}\Phi\Phi, \vec{x}) \right|$$
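A minimal numpy sketch of this objective, assuming dist is the Euclidean norm and α = 0.01 as quoted on the Implementation slide (the function and variable names are invented for the example):

```python
import numpy as np

def penalized_loss(Phi, X, Y, alpha=0.01):
    """Hyponymy-penalized objective:
    (1/N) * |(1 - alpha) * sum dist(x Phi, y) - alpha * sum dist(x Phi Phi, x)|.
    X: hyponym vectors, Y: hypernym vectors, one row per training pair."""
    N = X.shape[0]
    attract = np.linalg.norm(X @ Phi - Y, axis=1).sum()      # pull x Phi towards y
    repel = np.linalg.norm(X @ Phi @ Phi - X, axis=1).sum()  # keep x Phi Phi off x
    return abs((1 - alpha) * attract - alpha * repel) / N
```

The synonymy variant on the next slide is obtained by replacing X in the second term with sampled synonym vectors Z.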
Approach: Synonymy Penalization
Exploit the negative sampling technique: explicitly provide examples of synonyms z⃗ and penalize the matrix for producing vectors similar to them:

$$\Phi^* = \arg\min_{\Phi} \frac{1}{N} \left| (1 - \alpha) \sum_{(\vec{x}, \vec{y})} \operatorname{dist}(\vec{x}\Phi, \vec{y}) - \alpha \sum_{(\vec{x}, \vec{z})} \operatorname{dist}(\vec{x}\Phi\Phi, \vec{z}) \right|$$
Approach: Hypernymy Promotion
Instead of negative sampling, promote the matrix to produce hypernyms not just for the initial hyponym, but also for its randomly sampled synonym z⃗. In lexical ontologies, words are grouped into synsets (sets of synonyms), and subsumptions are established between such synsets:

$$\Phi^* = \arg\min_{\Phi} \frac{1}{N} \left( (1 - \beta) \sum_{(\vec{x}, \vec{y})} \operatorname{dist}(\vec{x}\Phi, \vec{y}) + \beta \sum_{(\vec{y}, \vec{z})} \operatorname{dist}(\vec{z}\Phi, \vec{y}) \right)$$
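A matching sketch for the promotion objective (same assumptions as the penalization sketch above; β = 0.3 as on the Implementation slide):

```python
import numpy as np

def promotion_loss(Phi, X, Y, Z, beta=0.3):
    """Hypernymy-promotion objective:
    (1/N) * ((1 - beta) * sum dist(x Phi, y) + beta * sum dist(z Phi, y)).
    Z holds a randomly sampled synonym for each hyponym in X; both the
    hyponym x and its synonym z should project onto the same hypernym y."""
    N = X.shape[0]
    pair_term = np.linalg.norm(X @ Phi - Y, axis=1).sum()  # hyponym -> hypernym
    syn_term = np.linalg.norm(Z @ Phi - Y, axis=1).sum()   # synonym -> same hypernym
    return ((1 - beta) * pair_term + beta * syn_term) / N
```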
Implementation
- A single-layer perceptron instead of linear regression.
- TensorFlow for defining and executing the computation graph (a sketch follows below), scikit-learn for k-means clustering.
- The Adam stochastic optimization method for minimizing the loss function.
- Tried the cosine distance instead of L2, but without any luck (the details are in the paper).
- Parameters: α = 0.01, β = 0.3, 14 000 training epochs.
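A condensed, hypothetical TensorFlow sketch of this setup (1.x-style API, contemporary with the slides; not the released projlearn code). The bias column folded into the input gives the 501 × 500 matrix mentioned on the performance slide; all data here is random noise:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x-style API

rng = np.random.RandomState(0)
N, d = 1000, 500
X = np.hstack([rng.randn(N, d), np.ones((N, 1))]).astype(np.float32)  # +bias -> 501
Y = rng.randn(N, d).astype(np.float32)

x = tf.placeholder(tf.float32, [None, d + 1])
y = tf.placeholder(tf.float32, [None, d])
Phi = tf.Variable(tf.random_normal([d + 1, d], stddev=0.01))  # 501 x 500

# Single linear layer; squared L2 distance between projection and hypernym.
loss = tf.reduce_mean(tf.reduce_sum(tf.square(tf.matmul(x, Phi) - y), axis=1))
train_op = tf.train.AdamOptimizer().minimize(loss)  # Adam, as above

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(14000):  # 14 000 training epochs, as above
        sess.run(train_op, feed_dict={x: X, y: Y})
```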
Experiments: The Setup
Language Resources:
- 500-dimensional word vectors for Russian trained using the skip-gram architecture (64 GB in RAM),
- 21 997 train items: Hearst patterns + Russian Wiktionary,
- 10 811 test items: Russian Wiktionary only.
To avoid lexical overfitting, each set contains a distinct vocabulary.
Computational Resources:
- Intel Xeon E5-2620 v2 @ 2.10GHz (32 GB of RAM),
- NVIDIA Tesla K20Xm, 2688 cores (6 GB of VRAM).
Some preliminary computations were done on another machine with a larger amount of available RAM.
Each experiment is run five times to evaluate statistical significance using a t-test.
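For illustration only (the scores below are made up, not results from the paper), such a five-run comparison could use scipy's unpaired t-test:

```python
from scipy.stats import ttest_ind

# Hypothetical per-run A@10 scores for two configurations.
baseline_runs  = [0.21, 0.22, 0.20, 0.21, 0.22]
penalized_runs = [0.25, 0.26, 0.24, 0.25, 0.26]

t, p = ttest_ind(penalized_runs, baseline_runs)
print(f"t = {t:.2f}, p = {p:.4f}")  # p < 0.05 suggests a significant difference
```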
Experiments: The Metric
No standard evaluation metric for such a task is available yet. We measure the quality by analyzing the ten nearest neighbours:

$$A@10 = \frac{1}{N} \sum_{(\vec{x}, \vec{y})} \mathbb{1}\left(\vec{y} \in \mathrm{NN}_{10}(\vec{x}\Phi^*)\right)$$

This is the probability of finding the correct hypernym among the ten nearest neighbours of the projected hyponym, which is previously unknown to the model.
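A sketch of how A@10 could be computed with cosine similarity over the whole vocabulary (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def a_at_10(Phi, X, gold, vocab_emb, vocab_words):
    """Fraction of test pairs whose gold hypernym word appears among the
    ten nearest neighbours of the projected hyponym vector.
    X: test hyponym vectors; gold: their hypernym words;
    vocab_emb / vocab_words: embeddings and words of the whole vocabulary."""
    proj = X @ Phi
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    emb = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    sims = proj @ emb.T                        # cosine similarities
    top10 = np.argsort(-sims, axis=1)[:, :10]  # indices of the 10 NNs
    hits = sum(g in {vocab_words[j] for j in row}
               for row, g in zip(top10, gold))
    return hits / len(gold)
```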
Experiments: The Results
On (almost) every configuration, both hyponymy and synonymy penalizations significantly outperform the baseline.
[Plot: A@10 (0.2–0.3) against the number of k-means clusters (1–10) for Baseline, Pen. Hyponymy, Pen. Synonymy, and Prom. Hypernymy.]
Experiments: The Performance
Since the matrix is 501 × 500, using a GPU is infeasible due to the requirements for the batch size (which is 512 in our case).
[Plot: seconds per 1 000 training epochs against batch size (512–8192) for Baseline and Pen. Synonymy, on CPU and GPU.]
Example: Baseline vs. Penalization
Nearest neighbours of the projected vector for кот (cat); the first row shows the query word's own similarity.

Baseline Projection                        Pen. Synonymy Projection
кот (cat)                       0.7012     кот (cat)                  0.6889
животное (animal)               0.7643     животное (animal)          0.7757
зверь (beast)                   0.7299     зверь (beast)              0.7283
хищник (predator)               0.7201     хищник (predator)          0.7141
намбат (numbat)                 0.7060     намбат (numbat)            0.6983
cryptoprocta                    0.6994     сумчатость (marsupiality)  0.6946
сумчатость (marsupiality)       0.6978     cryptoprocta               0.6915
вуалехвостый (veiltail)         0.6940     ornithoryngue              0.6887
виверровая (viverrid)           0.6888     млекопитающее (mammal)     0.6876
гепардообразной (cheetah-like)  0.6885     кволл (quoll)              0.6837
Conclusion
- A negative sampling approach for synonymy penalization has been proposed and successfully evaluated.
- An open source projection learning toolkit has been developed using TensorFlow: https://github.com/dustalov/projlearn/.
- The released datasets, including the trained models, are also available to other researchers under a libre license.
- GPUs have a lot of potential in our task: multi-layer setups, CNNs, RNNs, etc.
- The primary obstacle right now is the availability of training subsumptions.
Thank You!
Dmitry Ustalov https://linkedin.com/in/ustalov dmitry.ustalov@urfu.ru
The reported study was funded by RFBR according to the research project No. 16-37-00354 мол_a. We are grateful to Nikolay Arefyev, Andrey Kutuzov, Andrey Krizhanovsky, Benjamin Milde, and Alexander Bersenev for the fruitful discussions on the present study. Dmitry Ustalov was partially supported by the Deutscher Akademischer Austauschdienst (DAAD) scholarship. Alexander Panchenko was supported by the Deutsche Forschungsgemeinschaft (DFG) foundation under the project “JOIN-T: Joining Ontologies and Semantics Induced from Text”.