NSCaching: Simple and Efficient Negative Sampling for Knowledge Graph Embedding
Dr. Quanming YAO
Researcher @ 4Paradigm Inc.
Accepted by ICDE 2019: https://arxiv.org/pdf/1812.06410.pdf
Code: https://github.com/yzhangee/NSCaching
Email: [email protected]
About This Talk
• Knowledge Graph Embedding
• Negative Sampling
• NSCaching: faster and better negative sampling
• Experiments
Knowledge Graph Embedding
Knowledge Graph (KG)
Knowledge structure as a graph:
• Each node = an entity
• Each edge = a relation

Fact (triplet):
• (head, relation, tail)

Typical KGs:
• WordNet: linguistic KG
• Freebase, DBpedia, YAGO: world KGs

Applications:
• Structured search [Dong et al. KDD 2014]
• Question answering [Lukovnikov et al. WWW 2017]
• Recommendation [Zhang et al. KDD 2016]
(Michelle, hasChild, ?)
KG Embedding (what & why)
Encode the entities and relations of a KG into a low-dimensional vector space $\mathbb{R}^d$, while capturing the connection properties of nodes and edges.

Once triplets are encoded as vectors, they can be used for subsequent learning tasks.
KG Embedding (how)
A scoring function $f(h, r, t)$ captures the interaction (similarity) between two entities under a relation, based on their embeddings:

• TransE [Bordes et al. 2013]: $f(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_1$
• DistMult [Yang et al. 2015]: $f(h, r, t) = \mathbf{h}^\top \mathrm{diag}(\mathbf{r})\,\mathbf{t}$

where
• $\mathbf{h}$: embedding vector of the head entity
• $\mathbf{r}$: embedding vector of the relation
• $\mathbf{t}$: embedding vector of the tail entity
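As a concrete illustration, here is a minimal numpy sketch of the two scoring functions (illustrative code with random embeddings as stand-ins, not the authors' implementation):

```python
import numpy as np

def score_transe(h, r, t):
    # TransE: f(h, r, t) = -||h + r - t||_1; a true fact should score near 0.
    return -np.sum(np.abs(h + r - t))

def score_distmult(h, r, t):
    # DistMult: f(h, r, t) = h^T diag(r) t, an element-wise bilinear product.
    return np.sum(h * r * t)

d = 50                                        # embedding dimension
h, r, t = (np.random.randn(d) for _ in range(3))
print(score_transe(h, r, t), score_distmult(h, r, t))
```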
KG Embedding (how)
Target:
• maximize $f$ on the set of positive triplets $\mathcal{S} = \{(h, r, t)\}$
• minimize $f$ on a set of negative triplets $\bar{\mathcal{S}} = \{(\bar h, r, \bar t)\}$

Using TransE as an example:
• positive: $-\|\mathbf{Obama} + \mathbf{MarriedTo} - \mathbf{Michelle}\|_1$ (should be large)
• negative: $-\|\mathbf{Obama} + \mathbf{MarriedTo} - \mathbf{Trump}\|_1$ (should be small)

Objective: minimize, e.g.,
• $L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} [\gamma - f(h, r, t) + f(\bar h, r, \bar t)]_+$, with margin $\gamma > 0$
• $L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} \ell(+1, f(h, r, t)) + \ell(-1, f(\bar h, r, \bar t))$

where $\ell$ is a loss function for binary classification.
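A minimal sketch of both objectives for a single positive/negative pair, with $\ell$ instantiated as the logistic loss (illustrative, not the released code):

```python
import numpy as np

def margin_loss(f_pos, f_neg, gamma=1.0):
    # [gamma - f(h,r,t) + f(h_bar,r,t_bar)]_+ : zero once the positive
    # outscores the negative by at least the margin gamma.
    return max(0.0, gamma - f_pos + f_neg)

def logistic_loss(f_pos, f_neg):
    # ell(+1, f_pos) + ell(-1, f_neg), with ell(y, s) = log(1 + exp(-y * s)).
    return np.log1p(np.exp(-f_pos)) + np.log1p(np.exp(f_neg))

print(margin_loss(f_pos=-0.5, f_neg=-3.0))   # margin satisfied -> 0.0
print(margin_loss(f_pos=-2.0, f_neg=-1.5))   # margin violated -> 1.5
```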
Negative Sampling (why)
A KG only contains observed facts (positive triplets); non-observed triplets are assumed to be negative with high probability. For example:

Positive: (Obama, marriedTo, Michelle)
Negatives: (Obama, marriedTo, Sasha), (Sasha, marriedTo, Michelle), (Obama, bornOn, Michelle)

Positive: (Michelle, hasChild, Malia)
Negatives: (Michelle, hasChild, Obama), (Sasha, hasChild, Malia), (Michelle, bornOn, Malia)
Negative Sampling (why)
$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} \ell(+1, f(h, r, t)) + \ell(-1, f(\bar h, r, \bar t))$

• Performance: not all negative samples are equally good; bad ones can make performance worse.
Positive: (Steve Jobs, FounderOf, Apple Inc.)
Low-quality: (Baseball, FounderOf, Apple Inc.)
High-quality: (Bill Gates, FounderOf, Apple Inc.)
Negative Sampling (why)
$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} \ell(+1, f(h, r, t)) + \ell(-1, f(\bar h, r, \bar t))$

• Computation: the number of negative samples (unobserved triplets) is very large; considering all of them is computationally infeasible.

How to sample a few high-quality negative samples is thus important for both performance and efficiency.
Negative Sampling (why)
Negative sampling is not unique to KGs; it also appears in word2vec [Mikolov et al. 2013] and in click-through rate (CTR) prediction. The need for negative sampling is the same in these tasks.
Negative Sampling (how)
Given a positive triplet $(h, r, t)$, the set of candidate negative triplets is

$\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}.$

A few negative samples are drawn from $\bar{\mathcal{S}}_{(h,r,t)}$. Note that $\{(h, \bar r, t) \notin \mathcal{S} \mid \bar r \in \mathcal{R}\}$ is not included, since corrupting the relation is more likely to yield a false negative.
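For reference, a minimal sketch of the widely used uniform sampler over $\bar{\mathcal{S}}_{(h,r,t)}$ (illustrative names; the rejection loop filters out observed facts):

```python
import random

def uniform_negative(triplet, entities, known_facts, max_tries=100):
    h, r, t = triplet
    for _ in range(max_tries):
        e = random.choice(entities)
        # Corrupt head or tail with equal probability; the relation is
        # left untouched to avoid likely false negatives.
        cand = (e, r, t) if random.random() < 0.5 else (h, r, e)
        if cand not in known_facts:
            return cand
    return cand  # fall back to the last candidate

entities = ["Obama", "Michelle", "Sasha", "Malia", "Trump"]
facts = {("Obama", "marriedTo", "Michelle"), ("Michelle", "hasChild", "Malia")}
print(uniform_negative(("Obama", "marriedTo", "Michelle"), entities, facts))
```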
Negative Sampling (problems)
Given the negative triplet set $\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}$, uniform sampling from this set is widely used in the literature.
The quality of negative samples matters!
Negative Sampling (problems)
Low-quality negative samples gradually become less informative as training proceeds [Wang et al. AAAI 2018]:
• Positive: (Steve Jobs, FounderOf, Apple Inc.)
• Low-quality: (Baseball, FounderOf, Apple Inc.)
• High-quality: (Bill Gates, FounderOf, Apple Inc.)
Vanishing gradient [Wang et al. AAAI 2018]:

$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} [\gamma - f(h, r, t) + f(\bar h, r, \bar t)]_+$
We need to adaptively generate high-quality negative triplets as training goes on. High-quality negative triplets should have large scores; we need to capture their dynamic distribution and sample from it.
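To spell out the vanishing-gradient issue: for the margin-based loss, the (sub)gradient of one pair's term is

$$
\nabla \big[\gamma - f(h,r,t) + f(\bar h, r, \bar t)\big]_+ =
\begin{cases}
\nabla\!\left(f(\bar h, r, \bar t) - f(h,r,t)\right) & \text{if } \gamma - f(h,r,t) + f(\bar h, r, \bar t) > 0,\\
0 & \text{otherwise,}
\end{cases}
$$

so a low-quality negative with $f(\bar h, r, \bar t) \le f(h, r, t) - \gamma$ contributes exactly zero gradient.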
GAN-based Method (existing solutions)
Key idea:
• Use a generator to model the dynamic negative-triplet distribution
• High-quality negative triplets are sampled by the generator
• Jointly optimize (via reinforcement learning):
  • the discriminator (the target KG embedding model) is trained on negative triplets provided by the generator;
  • the generator obtains its reward from the discriminator.
IGAN [Wang et al. AAAI 2018]
KBGAN [Cai et al. NAACL 2018]
Self-paced NE [Gao et al. KDD 2018]
NSCaching: faster and better negative sampling
Key Observations

Recall that:

$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} [\gamma - f(h, r, t) + f(\bar h, r, \bar t)]_+$

High-quality means a large score evaluated by the scoring function.
Key Observations
The score distribution of negative triplets is highly skewed
Properties: the distribution is dynamic, high-score negatives are rare, and the distribution is complex.
Key Observations
Word2vec has similar observations on negative samples (on words)
- “While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality.”
Can we design a sampling scheme that fully exploits the above properties?
GAN-based Methods vs. NSCaching

GAN-based methods:
• Increased number of training parameters
• Sampling is not efficient
• Training suffers from instability and degeneracy

NSCaching:
• No extra parameters introduced
• Efficient sampling through the cache
• Stable without pre-training
NSCaching (overview)
Challenges:
• How to model the dynamic distribution of negative triplets?
• How to sample high-quality negative triplets efficiently?
Motivation:
the KG embedding model itself already contains information about triplet quality

• Use a small amount of extra memory that caches high-score negative samples for each triplet in $\mathcal{S}$ during training
• Sample negative triplets directly from the cache
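A possible cache layout, as a sketch (structure assumed from the description, not the authors' exact code): one head cache indexed by $(r, t)$ and one tail cache indexed by $(h, r)$, each holding $N_1$ entity ids:

```python
import numpy as np

N1 = 50  # cache size per indexed triplet

head_cache = {}  # (r, t) -> array of N1 candidate corrupted heads
tail_cache = {}  # (h, r) -> array of N1 candidate corrupted tails

def init_caches(facts, num_entities):
    # Initialize each cache uniformly at random; training refines it.
    for h, r, t in facts:
        head_cache.setdefault((r, t), np.random.randint(num_entities, size=N1))
        tail_cache.setdefault((h, r), np.random.randint(num_entities, size=N1))

init_caches({(0, 7, 1)}, num_entities=100)  # entity/relation ids are integers
```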
NSCaching (design issues)
Core idea: cache high-quality negative samples for each observed triplet
• How to construct & update the cache?
• How to sample from the cache?
Recall that the distribution of negative samples is dynamic during training, but high-quality ones are rare.
NSCaching (design issues)
Sampling from the cache:
• The negative triplets in the cache may not be accurate enough;
• there can be false negative triplets in the negative sample set.

Updating the cache:
• The cache needs to change dynamically over the iterations of the algorithm;
• it should be able to explore all possible high-quality negative samples;
• the update procedure should be efficient.
[Figure: for each positive triplet $(h, r, t)$, a tail $\bar t$ is sampled from the tail cache $\mathcal{T}_{(h,r)}$ and a head $\bar h$ from the head cache $\mathcal{H}_{(r,t)}$.]
Recall that all possible choices are: $\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}$
NSCaching (update & construct cache)
• Randomly sample candidates from all possible negative triplets
• Evaluate the score of each candidate
• Compare with the existing entries in the cache and keep the top ones

Recall that all possible choices are: $\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}$
Cache update

Possible choices:
• Compute the score over all $h' \in \mathcal{E}$ (resp. $t' \in \mathcal{E}$) and select among them.
• Sample a subset $\mathcal{R}_m$ from $\mathcal{E}$ and select among it.
• Sample a subset $\mathcal{R}_m$ from $\mathcal{E}$, concatenate it with the cache, and select among the combined set.

[Figure: candidates $\mathcal{R}_m$ drawn from the entity set $\mathcal{E}$ feed the caches $\mathcal{T}_{(h,r)}$ and $\mathcal{H}_{(r,t)}$.]
Design Requirements
• Capture the dynamic distribution
• Explore all possible candidates
• Efficient

[Figure: a random sample $\mathcal{R}_m$ from the entity set $\mathcal{E}$ is used to update the caches $\mathcal{T}_{(h,r)}$ and $\mathcal{H}_{(r,t)}$.]
Update scheme:
• top-k
• importance sampling

Connection to self-paced learning: as training goes on, easy samples gradually get small scores and are removed from the cache; thus, hard samples are gradually stored. A sketch of the importance-sampling update follows.
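A hedged sketch of one tail-cache update under the third choice above (sample $\mathcal{R}_m$, concatenate with the cache, keep $N_1$ entries by importance sampling on TransE scores; the softmax weighting and all names here are illustrative):

```python
import numpy as np

def update_tail_cache(cache, h_vec, r_vec, ent_emb, N1=50, N2=50):
    # R_m: uniformly sampled candidate tail entities.
    rm = np.random.randint(len(ent_emb), size=N2)
    pool = np.concatenate([cache, rm])
    # TransE scores f(h, r, t') for every candidate in the pool.
    scores = -np.abs(h_vec + r_vec - ent_emb[pool]).sum(axis=1)
    # Importance sampling: keep candidates with probability proportional
    # to exp(f), so large-score (high-quality) negatives dominate while
    # lower-score ones are still explored occasionally.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    keep = np.random.choice(len(pool), size=N1, replace=False, p=p)
    return pool[keep]

d, n_ent = 16, 1000
ent_emb = np.random.randn(n_ent, d)
cache = np.random.randint(n_ent, size=50)
cache = update_tail_cache(cache, np.random.randn(d), np.random.randn(d), ent_emb)
```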
NSCaching (sample from cache)
Since the negative samples in the cache are almost equally good, we sample uniformly from them.
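With the caches in place, drawing a negative triplet is then a constant-time uniform pick (a sketch following the cache layout assumed earlier):

```python
import random

def sample_negative(h, r, t, head_cache, tail_cache):
    # Uniformly pick a corrupted head or a corrupted tail from the caches.
    if random.random() < 0.5:
        h_bar = random.choice(list(head_cache[(r, t)]))
        return (h_bar, r, t)
    t_bar = random.choice(list(tail_cache[(h, r)]))
    return (h, r, t_bar)
```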
NSCaching (overview)

[Figure: overall workflow. A positive triplet $(h, r, t)$ is indexed by $(h, r)$ into the tail cache $\mathcal{T}$ and by $(r, t)$ into the head cache $\mathcal{H}$; negative triplets $(h, r, \bar t)$ and $(\bar h, r, t)$ are sampled from the caches, scored by $f$ together with the positive triplet to form the loss, and the caches are updated as the KG embedding trains.]
NSCaching (detailed design concerns)
There are other design possibilities for NSCaching, e.g.:
• sample the top-1 triplet in the cache
• keep only the top-scored triplets in the cache
• etc.

Please check our paper for a detailed discussion; the main principle is exploration and exploitation.
Experiments
Effectiveness
Measurements:
• Given a triplet $(h, r, t)$;
• compute the score of $(h', r, t)$ for all $h' \in \mathcal{E}$;
• get the rank of $h$ among all $h'$;
• do the same for $t$.
Metrics:
• MRR (mean reciprocal rank): $\frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \frac{1}{\mathrm{rank}_i}$
• MR (mean rank): $\frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \mathrm{rank}_i$
• Hit@10: $\frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \mathbb{I}(\mathrm{rank}_i \le 10)$
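A minimal sketch of the three metrics, given the ranks collected over the test set:

```python
import numpy as np

def metrics(ranks):
    # ranks: rank of the true entity for each test triplet (1 = best).
    ranks = np.asarray(ranks, dtype=float)
    return {
        "MRR": float(np.mean(1.0 / ranks)),
        "MR": float(np.mean(ranks)),
        "Hit@10": float(np.mean(ranks <= 10)),
    }

print(metrics([1, 3, 25, 2, 120]))  # MRR ~ 0.38, MR = 30.2, Hit@10 = 0.6
```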
Efficiency: we measure convergence by testing performance vs. training time.
Sampling and Updating schemes
• Sampling from the cache: uniform, importance sampling (IS), top-1
• Cache update: importance sampling (IS), top-k

[Figures: results under different sampling schemes and different updating schemes.]
Stability
We vary the cache size $N_1$ over $\{10, 30, 50, 70, 90\}$ while fixing the random subset size $N_2 = 50$, and vary $N_2$ over $\{10, 30, 50, 70, 90\}$ while fixing $N_1 = 50$.

[Figures: results under different $N_1$ and different $N_2$.]
Visualization
Given the positive triplet (manorama, profession, actor), we randomly select and visualize some entities in the tail cache $\mathcal{T}_{(manorama,\,profession)}$ during training.
Summary
A novel negative sampling method.
Why it works:
• It dynamically maintains high-quality negative samples;
• sampling is efficient and the extra memory is small;
• both the sampling and updating schemes are carefully designed to balance exploration and exploitation;
• the caching scheme has a connection to self-paced learning.
Future work:
• Use advanced index structures for the cache to further improve efficiency and reduce cache size;
• adapt to negative sampling in other tasks such as word embedding, network embedding, and PU learning;
• theoretical analysis of convergence;
• use AutoML to make NSCaching better adapt to other datasets.