NSCaching: Simple and Efficient Negative Sampling for Knowledge Graph Embedding
Dr. Quanming YAO
Researcher @ 4Paradigm Inc.
Accepted by ICDE 2019: https://arxiv.org/pdf/1812.06410.pdf
Code: https://github.com/yzhangee/NSCaching
Email: [email protected]
About This Talk
• Knowledge Graph Embedding
• Negative Sampling
• NSCaching: faster and better negative sampling
• Experiments
Knowledge Graph Embedding
Knowledge Graph (KG)
Knowledge structure as a graph:
• Each node = an entity
• Each edge = a relation

Fact (triplet):
• (head, relation, tail)

Typical KGs:
• WordNet: linguistic KG
• Freebase, DBpedia, YAGO: world KGs

Applications:
• Structured search [Dong et al. KDD 2014]
• Question answering [Lukovnikov et al. WWW 2017]
• Recommendation [Zhang et al. KDD 2016]
(Michelle, hasChild, ?)
KG Embedding (what & why)
Encode the entities and relations of a KG into a low-dimensional vector space $\mathbb{R}^d$, while capturing the connection properties of nodes and edges.

Once triplets are encoded as vectors, they can be used for subsequent learning tasks.
KG Embedding (how)
A scoring function $f(h, r, t)$ captures the interaction (similarity) between two entities under a relation, based on their embeddings:

• TransE [Bordes et al. 2013]: $f(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_1$
• DistMult [Yang et al. 2015]: $f(h, r, t) = \mathbf{h}^\top \mathrm{diag}(\mathbf{r})\,\mathbf{t}$

where
• $\mathbf{h}$: embedding vector of the head entity
• $\mathbf{r}$: embedding vector of the relation
• $\mathbf{t}$: embedding vector of the tail entity
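As a concrete illustration, here is a minimal numpy sketch of the two scoring functions (illustrative code with random embeddings as stand-ins, not the authors' implementation):

```python
import numpy as np

def score_transe(h, r, t):
    # TransE: f(h, r, t) = -||h + r - t||_1; a true fact should score near 0.
    return -np.sum(np.abs(h + r - t))

def score_distmult(h, r, t):
    # DistMult: f(h, r, t) = h^T diag(r) t, an element-wise bilinear product.
    return np.sum(h * r * t)

d = 50                                        # embedding dimension
h, r, t = (np.random.randn(d) for _ in range(3))
print(score_transe(h, r, t), score_distmult(h, r, t))
```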
KG Embedding (how)
Target:
• maximize $f$ on the set of positive triplets $\mathcal{S} = \{(h, r, t)\}$
• minimize $f$ on a set of negative triplets $\bar{\mathcal{S}} = \{(\bar h, r, \bar t)\}$

Using TransE as an example:
• positive: $-\|\mathbf{Obama} + \mathbf{MarriedTo} - \mathbf{Michelle}\|_1$ (should be large)
• negative: $-\|\mathbf{Obama} + \mathbf{MarriedTo} - \mathbf{Trump}\|_1$ (should be small)

Objective: minimize, e.g.,
• $L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} [\gamma - f(h, r, t) + f(\bar h, r, \bar t)]_+$, with margin $\gamma > 0$
• $L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} \ell(+1, f(h, r, t)) + \ell(-1, f(\bar h, r, \bar t))$

where $\ell$ is a loss function for binary classification.
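A minimal sketch of both objectives for a single positive/negative pair, with $\ell$ instantiated as the logistic loss (illustrative, not the released code):

```python
import numpy as np

def margin_loss(f_pos, f_neg, gamma=1.0):
    # [gamma - f(h,r,t) + f(h_bar,r,t_bar)]_+ : zero once the positive
    # outscores the negative by at least the margin gamma.
    return max(0.0, gamma - f_pos + f_neg)

def logistic_loss(f_pos, f_neg):
    # ell(+1, f_pos) + ell(-1, f_neg), with ell(y, s) = log(1 + exp(-y * s)).
    return np.log1p(np.exp(-f_pos)) + np.log1p(np.exp(f_neg))

print(margin_loss(f_pos=-0.5, f_neg=-3.0))   # margin satisfied -> 0.0
print(margin_loss(f_pos=-2.0, f_neg=-1.5))   # margin violated -> 1.5
```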
Negative Sampling (why)
A KG only contains observed facts (positive triplets); non-observed triplets are assumed to be negative with high probability. For example:

Positive: (Obama, marriedTo, Michelle)
Negatives: (Obama, marriedTo, Sasha), (Sasha, marriedTo, Michelle), (Obama, bornOn, Michelle)

Positive: (Michelle, hasChild, Malia)
Negatives: (Michelle, hasChild, Obama), (Sasha, hasChild, Malia), (Michelle, bornOn, Malia)
Negative Sampling (why)
$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} \ell(+1, f(h, r, t)) + \ell(-1, f(\bar h, r, \bar t))$

• Performance: not all negative samples are equally good; bad ones can make performance worse.
Positive: (Steve Jobs, FounderOf, Apple Inc.)
Low-quality: (Baseball, FounderOf, Apple Inc.)
High-quality: (Bill Gates, FounderOf, Apple Inc.)
Negative Sampling (why)
$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} \ell(+1, f(h, r, t)) + \ell(-1, f(\bar h, r, \bar t))$

• Computation: the number of negative samples (unobserved triplets) is very large; considering all of them is computationally infeasible.

How to sample a few high-quality negative samples is thus important for both performance and efficiency.
Negative Sampling (why)
Negative sampling is not unique to KGs; it also appears in word2vec [Mikolov et al. 2013] and in click-through rate (CTR) prediction. The need for negative sampling is the same in these tasks.
Negative Sampling (how)
Given a positive triplet $(h, r, t)$, the set of candidate negative triplets is

$\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}.$

A few negative samples are drawn from $\bar{\mathcal{S}}_{(h,r,t)}$. Note that $\{(h, \bar r, t) \notin \mathcal{S} \mid \bar r \in \mathcal{R}\}$ is not included, since corrupting the relation is more likely to yield a false negative.
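For reference, a minimal sketch of the widely used uniform sampler over $\bar{\mathcal{S}}_{(h,r,t)}$ (illustrative names; the rejection loop filters out observed facts):

```python
import random

def uniform_negative(triplet, entities, known_facts, max_tries=100):
    h, r, t = triplet
    for _ in range(max_tries):
        e = random.choice(entities)
        # Corrupt head or tail with equal probability; the relation is
        # left untouched to avoid likely false negatives.
        cand = (e, r, t) if random.random() < 0.5 else (h, r, e)
        if cand not in known_facts:
            return cand
    return cand  # fall back to the last candidate

entities = ["Obama", "Michelle", "Sasha", "Malia", "Trump"]
facts = {("Obama", "marriedTo", "Michelle"), ("Michelle", "hasChild", "Malia")}
print(uniform_negative(("Obama", "marriedTo", "Michelle"), entities, facts))
```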
Negative Sampling (problems)
Given the negative triplet set $\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}$, uniform sampling from this set is widely used in the literature.
The quality of negative samples matters!
Negative Sampling (problems)
Low-quality negative samples gradually become less informative as training proceeds [Wang et al. AAAI 2018]:
• Positive: (Steve Jobs, FounderOf, Apple Inc.)
• Low-quality: (Baseball, FounderOf, Apple Inc.)
• High-quality: (Bill Gates, FounderOf, Apple Inc.)
Vanishing gradient [Wang et al. AAAI 2018]:

$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} [\gamma - f(h, r, t) + f(\bar h, r, \bar t)]_+$
We need to adaptively generate high-quality negative triplets as training goes on. High-quality negative triplets should have large scores; we need to capture their dynamic distribution and sample from it.
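To spell out the vanishing-gradient issue: for the margin-based loss, the (sub)gradient of one pair's term is

$$
\nabla \big[\gamma - f(h,r,t) + f(\bar h, r, \bar t)\big]_+ =
\begin{cases}
\nabla\!\left(f(\bar h, r, \bar t) - f(h,r,t)\right) & \text{if } \gamma - f(h,r,t) + f(\bar h, r, \bar t) > 0,\\
0 & \text{otherwise,}
\end{cases}
$$

so a low-quality negative with $f(\bar h, r, \bar t) \le f(h, r, t) - \gamma$ contributes exactly zero gradient.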
GAN-based Method (existing solutions)
Key idea:
• Use a generator to model the dynamic negative-triplet distribution
• High-quality negative triplets are sampled by the generator
• Jointly optimize (via reinforcement learning):
  • the discriminator (the target KG embedding model) is trained on negative triplets provided by the generator;
  • the generator obtains its reward from the discriminator.
IGAN [Wang et al. AAAI 2018]
KBGAN [Cai et al. NAACL 2018]
Self-paced NE [Gao et al. KDD 2018]
NSCaching: faster and better negative sampling
Key Observations

Recall that:

$L(\mathcal{E}, \mathcal{R}) = \sum_{(h,r,t) \in \mathcal{S},\, (\bar h, r, \bar t) \in \bar{\mathcal{S}}} [\gamma - f(h, r, t) + f(\bar h, r, \bar t)]_+$

High-quality means a large score evaluated by the scoring function.
Key Observations
The score distribution of negative triplets is highly skewed
Properties: the distribution is dynamic, high-score negatives are rare, and the distribution is complex.
Key Observations
Word2vec has similar observations on negative samples (on words)
- “While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality.”
Can we design a sampling scheme that fully exploits the above properties?
GAN-based Methods vs. NSCaching

GAN-based methods:
• Increased number of training parameters
• Sampling is not efficient
• Training suffers from instability and degeneracy

NSCaching:
• No extra parameters introduced
• Efficient sampling through the cache
• Stable without pre-training
NSCaching (overview)
Challenges:
• How to model the dynamic distribution of negative triplets?
• How to sample high-quality negative triplets efficiently?
Motivation:
the KG embedding model itself already contains information about triplet quality

• Use a small amount of extra memory that caches high-score negative samples for each triplet in $\mathcal{S}$ during training
• Sample negative triplets directly from the cache
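A possible cache layout, as a sketch (structure assumed from the description, not the authors' exact code): one head cache indexed by $(r, t)$ and one tail cache indexed by $(h, r)$, each holding $N_1$ entity ids:

```python
import numpy as np

N1 = 50  # cache size per indexed triplet

head_cache = {}  # (r, t) -> array of N1 candidate corrupted heads
tail_cache = {}  # (h, r) -> array of N1 candidate corrupted tails

def init_caches(facts, num_entities):
    # Initialize each cache uniformly at random; training refines it.
    for h, r, t in facts:
        head_cache.setdefault((r, t), np.random.randint(num_entities, size=N1))
        tail_cache.setdefault((h, r), np.random.randint(num_entities, size=N1))

init_caches({(0, 7, 1)}, num_entities=100)  # entity/relation ids are integers
```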
NSCaching (design issues)
Core idea: cache high-quality negative samples for each observed triplet
• How to construct & update the cache?
• How to sample from the cache?
Recall that the distribution of negative samples is dynamic during training, but high-quality ones are rare.
NSCaching (design issues)
Sampling from the cache:
• The negative triplets in the cache may not be accurate enough;
• there can be false negative triplets in the negative sample set.

Updating the cache:
• The cache needs to change dynamically over the iterations of the algorithm;
• it should be able to explore all possible high-quality negative samples;
• the update procedure should be efficient.
[Figure: for each positive triplet $(h, r, t)$, a tail $\bar t$ is sampled from the tail cache $\mathcal{T}_{(h,r)}$ and a head $\bar h$ from the head cache $\mathcal{H}_{(r,t)}$.]
Recall that all possible choices are: $\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}$
NSCaching (update & construct cache)
• Randomly sample candidates from all possible negative triplets
• Evaluate the score of each candidate
• Compare with the existing entries in the cache and keep the top ones

Recall that all possible choices are: $\bar{\mathcal{S}}_{(h,r,t)} = \{(\bar h, r, t) \notin \mathcal{S} \mid \bar h \in \mathcal{E}\} \cup \{(h, r, \bar t) \notin \mathcal{S} \mid \bar t \in \mathcal{E}\}$
Cache update

Possible choices:
• Compute the score over all $h' \in \mathcal{E}$ (resp. $t' \in \mathcal{E}$) and select among them.
• Sample a subset $\mathcal{R}_m$ from $\mathcal{E}$ and select among it.
• Sample a subset $\mathcal{R}_m$ from $\mathcal{E}$, concatenate it with the cache, and select among the combined set.

[Figure: candidates $\mathcal{R}_m$ drawn from the entity set $\mathcal{E}$ feed the caches $\mathcal{T}_{(h,r)}$ and $\mathcal{H}_{(r,t)}$.]
Design Requirements
• Capture the dynamic distribution
• Explore all possible candidates
• Efficient

[Figure: a random sample $\mathcal{R}_m$ from the entity set $\mathcal{E}$ is used to update the caches $\mathcal{T}_{(h,r)}$ and $\mathcal{H}_{(r,t)}$.]
Update scheme:
• top-k
• importance sampling

Connection to self-paced learning: as training goes on, easy samples gradually get small scores and are removed from the cache; thus, hard samples are gradually stored. A sketch of the importance-sampling update follows.
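A hedged sketch of one tail-cache update under the third choice above (sample $\mathcal{R}_m$, concatenate with the cache, keep $N_1$ entries by importance sampling on TransE scores; the softmax weighting and all names here are illustrative):

```python
import numpy as np

def update_tail_cache(cache, h_vec, r_vec, ent_emb, N1=50, N2=50):
    # R_m: uniformly sampled candidate tail entities.
    rm = np.random.randint(len(ent_emb), size=N2)
    pool = np.concatenate([cache, rm])
    # TransE scores f(h, r, t') for every candidate in the pool.
    scores = -np.abs(h_vec + r_vec - ent_emb[pool]).sum(axis=1)
    # Importance sampling: keep candidates with probability proportional
    # to exp(f), so large-score (high-quality) negatives dominate while
    # lower-score ones are still explored occasionally.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    keep = np.random.choice(len(pool), size=N1, replace=False, p=p)
    return pool[keep]

d, n_ent = 16, 1000
ent_emb = np.random.randn(n_ent, d)
cache = np.random.randint(n_ent, size=50)
cache = update_tail_cache(cache, np.random.randn(d), np.random.randn(d), ent_emb)
```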
NSCaching (sample from cache)
Since the negative samples in the cache are almost equally good, we sample uniformly from them.
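With the caches in place, drawing a negative triplet is then a constant-time uniform pick (a sketch following the cache layout assumed earlier):

```python
import random

def sample_negative(h, r, t, head_cache, tail_cache):
    # Uniformly pick a corrupted head or a corrupted tail from the caches.
    if random.random() < 0.5:
        h_bar = random.choice(list(head_cache[(r, t)]))
        return (h_bar, r, t)
    t_bar = random.choice(list(tail_cache[(h, r)]))
    return (h, r, t_bar)
```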
NSCaching (overview)

[Figure: overall workflow. A positive triplet $(h, r, t)$ is indexed by $(h, r)$ into the tail cache $\mathcal{T}$ and by $(r, t)$ into the head cache $\mathcal{H}$; negative triplets $(h, r, \bar t)$ and $(\bar h, r, t)$ are sampled from the caches, scored by $f$ together with the positive triplet to form the loss, and the caches are updated as the KG embedding trains.]
NSCaching (detailed design concerns)
There are other design possibilities for NSCaching, e.g.:
• sample the top-1 triplet in the cache
• keep only the top-scored triplets in the cache
• etc.

Please check our paper for a detailed discussion; the main principle is exploration and exploitation.
Experiments
Effectiveness
Measurements:
• Given a triplet $(h, r, t)$;
• compute the score of $(h', r, t)$ for all $h' \in \mathcal{E}$;
• get the rank of $h$ among all $h'$;
• do the same for $t$.
Metrics:
• MRR (mean reciprocal rank): $\frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \frac{1}{\mathrm{rank}_i}$
• MR (mean rank): $\frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \mathrm{rank}_i$
• Hit@10: $\frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} \mathbb{I}(\mathrm{rank}_i \le 10)$
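A minimal sketch of the three metrics, given the ranks collected over the test set:

```python
import numpy as np

def metrics(ranks):
    # ranks: rank of the true entity for each test triplet (1 = best).
    ranks = np.asarray(ranks, dtype=float)
    return {
        "MRR": float(np.mean(1.0 / ranks)),
        "MR": float(np.mean(ranks)),
        "Hit@10": float(np.mean(ranks <= 10)),
    }

print(metrics([1, 3, 25, 2, 120]))  # MRR ~ 0.38, MR = 30.2, Hit@10 = 0.6
```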
Efficiency: we measure convergence by testing performance vs. training time.
Sampling and Updating schemes
• Sampling from the cache: uniform, importance sampling (IS), top-1
• Cache update: importance sampling (IS), top-k

[Figures: results under different sampling schemes and different updating schemes.]
Stability
We vary the cache size $N_1$ over $\{10, 30, 50, 70, 90\}$ while fixing the random subset size $N_2 = 50$, and vary $N_2$ over $\{10, 30, 50, 70, 90\}$ while fixing $N_1 = 50$.

[Figures: results under different $N_1$ and different $N_2$.]
Visualization
Given the positive triplet (manorama, profession, actor), we randomly select and visualize some entities in the tail cache $\mathcal{T}_{(manorama,\,profession)}$ during training.
Summary
A novel negative sampling method.
Why it works:
• It dynamically maintains high-quality negative samples;
• sampling is efficient and the extra memory is small;
• both the sampling and updating schemes are carefully designed to balance exploration and exploitation;
• the caching scheme has a connection to self-paced learning.
Future work:
• Use advanced index structures for the cache to further improve efficiency and reduce cache size;
• adapt to negative sampling in other tasks such as word embedding, network embedding, and PU learning;
• theoretical analysis of convergence;
• use AutoML to make NSCaching better adapt to other datasets.