

Low-Dimensional Hyperbolic Knowledge Graph Embeddings

Ines Chami1∗, Adva Wolf1, Da-Cheng Juan2, Frederic Sala1, Sujith Ravi3† and Christopher Ré1
1Stanford University  2Google Research  3Amazon Alexa

{chami,advaw,fredsala,chrismre}@cs.stanford.edu  [email protected]  [email protected]

    Abstract

Knowledge graph (KG) embeddings learn low-dimensional representations of entities and relations to predict missing facts. KGs often exhibit hierarchical and logical patterns which must be preserved in the embedding space. For hierarchical data, hyperbolic embedding methods have shown promise for high-fidelity and parsimonious representations. However, existing hyperbolic embedding methods do not account for the rich logical patterns in KGs. In this work, we introduce a class of hyperbolic KG embedding models that simultaneously capture hierarchical and logical patterns. Our approach combines hyperbolic reflections and rotations with attention to model complex relational patterns. Experimental results on standard KG benchmarks show that our method improves over previous Euclidean- and hyperbolic-based efforts by up to 6.1% in mean reciprocal rank (MRR) in low dimensions. Furthermore, we observe that different geometric transformations capture different types of relations while attention-based transformations generalize to multiple relations. In high dimensions, our approach yields new state-of-the-art MRRs of 49.6% on WN18RR and 57.7% on YAGO3-10.

    1 Introduction

Knowledge graphs (KGs), consisting of (head entity, relationship, tail entity) triples, are popular data structures for representing factual knowledge to be queried and used in downstream applications such as word sense disambiguation, question answering, and information extraction. Real-world KGs such as Yago (Suchanek et al., 2007) or Wordnet (Miller, 1995) are usually incomplete, so a common approach to predicting missing links in KGs is via embedding into vector spaces.

∗ Work partially done during an internship at Google.  † Work done while at Google AI.

Figure 1: A toy example showing how KGs can simultaneously exhibit hierarchies and logical patterns.

Embedding methods learn representations of entities and relationships that preserve the information found in the graph, and have achieved promising results for many tasks.

Relations found in KGs have differing properties: for example, (Michelle Obama, married to, Barack Obama) is symmetric, whereas hypernym relations like (cat, specific type of, feline) are not (Figure 1). These distinctions present a challenge to embedding methods: preserving each type of behavior requires producing a different geometric pattern in the embedding space. One popular approach is to use extremely high-dimensional embeddings, which offer more flexibility for such patterns. However, given the large number of entities found in KGs, doing so yields very high memory costs.

For hierarchical data, hyperbolic geometry offers an exciting approach to learn low-dimensional embeddings while preserving latent hierarchies. Hyperbolic space can embed trees with arbitrarily low distortion in just two dimensions. Recent research has proposed embedding hierarchical graphs into these spaces instead of conventional Euclidean space (Nickel and Kiela, 2017; Sala et al., 2018). However, these works focus on embedding simpler graphs (e.g., weighted trees) and cannot express the diverse and complex relationships in KGs.

We propose a new hyperbolic embedding approach that captures such patterns to achieve the best of both worlds. Our proposed approach produces the parsimonious representations offered by hyperbolic space, especially suitable for hierarchical relations, and is effective even with low-dimensional embeddings. It also uses rich transformations to encode logical patterns in KGs, previously only defined in Euclidean space. To accomplish this, we (1) train hyperbolic embeddings with relation-specific curvatures to preserve multiple hierarchies in KGs; (2) parameterize hyperbolic isometries (distance-preserving operations) and leverage their geometric properties to capture relations' logical patterns, such as symmetry or anti-symmetry; and (3) use a notion of hyperbolic attention to combine geometric operators and capture multiple logical patterns.

We evaluate the performance of our approach, ATTH, on the KG link prediction task using the standard WN18RR (Dettmers et al., 2018; Bordes et al., 2013), FB15k-237 (Toutanova and Chen, 2015) and YAGO3-10 (Mahdisoltani et al., 2013) benchmarks. (1) In low (32) dimensions, we improve over Euclidean-based models by up to 6.1% in the mean reciprocal rank (MRR) metric. In particular, we find that hierarchical relationships, such as WordNet's hypernym and member meronym, significantly benefit from hyperbolic space; we observe a 16% to 24% relative improvement versus Euclidean baselines. (2) We find that geometric properties of hyperbolic isometries directly map to logical properties of relationships. We study symmetric and anti-symmetric patterns and find that reflections capture symmetric relations while rotations capture anti-symmetry. (3) We show that attention-based transformations have the ability to generalize to multiple logical patterns. For instance, we observe that ATTH recovers reflections for symmetric relations and rotations for anti-symmetric ones.

In high (500) dimensions, we find that both hyperbolic and Euclidean embeddings achieve similar performance, and our approach achieves new state-of-the-art results (SotA), obtaining 49.6% MRR on WN18RR and 57.7% on YAGO3-10. Our experiments show that trainable curvature is critical to generalize hyperbolic embedding methods to high dimensions. Finally, we visualize embeddings learned in hyperbolic spaces and show that hyperbolic geometry effectively preserves hierarchies in KGs.

    2 Related Work

Previous methods for KG embeddings also rely on geometric properties. Improvements have been obtained by exploiting either more sophisticated spaces (e.g., going from Euclidean to complex or hyperbolic space) or more sophisticated operations (e.g., from translations to isometries, or to learning graph neural networks). In contrast, our approach takes a step forward in both directions.

Euclidean embeddings  In the past decade, there has been a rich literature on Euclidean embeddings for KG representation learning. These include translation approaches (Bordes et al., 2013; Ji et al., 2015; Wang et al., 2014; Lin et al., 2015) or tensor factorization methods such as RESCAL (Nickel et al., 2011) or DistMult (Yang et al., 2015). While these methods are fairly simple and have few parameters, they fail to encode important logical properties (e.g., translations can't encode symmetry).

Complex embeddings  Recently, there has been interest in learning embeddings in complex space, as in the ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019) models. RotatE learns rotations in complex space, which are very effective in capturing logical properties such as symmetry, anti-symmetry, composition or inversion. The recent QuatE model (Zhang et al., 2019) learns KG embeddings using quaternions. However, a downside is that these embeddings require very high-dimensional spaces, leading to high memory costs.

Deep neural networks  Another family of methods uses neural networks to produce KG embeddings. For instance, R-GCN (Schlichtkrull et al., 2018) extends graph neural networks to the multi-relational setting by adding a relation-specific aggregation step. ConvE and ConvKB (Dettmers et al., 2018; Nguyen et al., 2018) leverage the expressiveness of convolutional neural networks to learn entity embeddings and relation embeddings. More recently, the KBGAT (Nathani et al., 2019) and A2N (Bansal et al., 2019) models use graph attention networks for knowledge graph embeddings. A downside of these methods is that they are computationally expensive, as they usually require pre-trained KG embeddings as input for the neural network.

Hyperbolic embeddings  To the best of our knowledge, MuRP (Balažević et al., 2019) is the only method that learns KG embeddings in hyperbolic space in order to target hierarchical data. MuRP minimizes hyperbolic distances between a re-scaled version of the head entity embedding and a translation of the tail entity embedding. It achieves promising results using hyperbolic embeddings with fewer dimensions than its Euclidean analogues. However, MuRP is a translation model and fails to encode some logical properties of relationships. Furthermore, embeddings are learned in a hyperbolic space with fixed curvature, potentially leading to insufficient precision, and training relies on cumbersome Riemannian optimization. Instead, our proposed method leverages expressive hyperbolic isometries to simultaneously capture logical patterns and hierarchies. Furthermore, embeddings are learned using tangent space (i.e., Euclidean) optimization methods and trainable hyperbolic curvatures per relationship, avoiding precision errors that might arise when using a fixed curvature, and providing flexibility to encode multiple hierarchies.

    3 Problem Formulation and Background

We describe the KG embedding problem setting and give some necessary background on hyperbolic geometry.

    3.1 Knowledge graph embeddings

In the KG embedding problem, we are given a set of triples (h, r, t) ∈ E ⊆ V × R × V, where V and R are entity and relationship sets, respectively. The goal is to map entities v ∈ V to embeddings e_v ∈ U^{d_V} and relationships r ∈ R to embeddings r_r ∈ U^{d_R}, for some choice of space U (traditionally R), such that the KG structure is preserved.

Concretely, the data is split into E_Train and E_Test triples. Embeddings are learned by optimizing a scoring function s : V × R × V → R, which measures triples' likelihoods. s(·, ·, ·) is trained using triples in E_Train and the learned embeddings are then used to predict scores for triples in E_Test. The goal is to learn embeddings such that the scores of triples in E_Test are high compared to triples that are not present in E.

    3.2 Hyperbolic geometry

We briefly review key notions from hyperbolic geometry; a more in-depth treatment is available in standard texts (Robbin and Salamon). Hyperbolic geometry is a non-Euclidean geometry with constant negative curvature. In this work, we use the d-dimensional Poincaré ball model with negative curvature −c (c > 0): B^{d,c} = {x ∈ R^d : ||x||^2 < 1/c}, where ||·|| denotes the L2 norm. For each point x ∈ B^{d,c}, the tangent space T_x^c is a d-dimensional vector space containing all possible directions of paths in B^{d,c} leaving from x.

Figure 2: An illustration of the exponential map exp_x(v), which maps the tangent space T_x M at the point x to the hyperbolic manifold M.


The tangent space T_x^c maps to B^{d,c} via the exponential map (Figure 2), and conversely, the logarithmic map maps B^{d,c} to T_x^c. In particular, we have closed-form expressions for these maps at the origin:

\exp_0^c(v) = \tanh(\sqrt{c}\,\|v\|)\,\frac{v}{\sqrt{c}\,\|v\|}, \qquad (1)

\log_0^c(y) = \operatorname{arctanh}(\sqrt{c}\,\|y\|)\,\frac{y}{\sqrt{c}\,\|y\|}. \qquad (2)

Vector addition is not well-defined in hyperbolic space (adding two points in the Poincaré ball might result in a point outside the ball). Instead, Möbius addition ⊕^c (Ganea et al., 2018) provides an analogue to Euclidean addition for hyperbolic space. We give its closed-form expression in Appendix A.1. Finally, the hyperbolic distance on B^{d,c} has the explicit formula:

d^c(x, y) = \frac{2}{\sqrt{c}}\,\operatorname{arctanh}\big(\sqrt{c}\,\|{-x} \oplus^c y\|\big). \qquad (3)
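To make Equations (1)–(3) concrete, here is a minimal NumPy sketch of the Poincaré-ball primitives, using the Möbius addition formula given in Appendix A.1. The function names (expmap0, logmap0, mobius_add, hyp_distance) are ours, not the paper's implementation.

```python
import numpy as np

def expmap0(v, c):
    # Exponential map at the origin (Eq. 1): tangent vector -> point in the Poincare ball.
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def logmap0(y, c):
    # Logarithmic map at the origin (Eq. 2): ball point -> tangent vector at the origin.
    norm = np.linalg.norm(y)
    if norm == 0:
        return np.zeros_like(y)
    return np.arctanh(np.sqrt(c) * norm) * y / (np.sqrt(c) * norm)

def mobius_add(x, y, c):
    # Mobius addition (Appendix A.1), the hyperbolic analogue of vector addition.
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def hyp_distance(x, y, c):
    # Hyperbolic distance on the ball of curvature -c (Eq. 3).
    return 2.0 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))

# Points near the boundary are much farther apart hyperbolically than in Euclidean distance.
x = expmap0(np.array([2.0, 0.0]), c=1.0)
y = expmap0(np.array([0.0, 2.0]), c=1.0)
print(hyp_distance(x, y, c=1.0), np.linalg.norm(x - y))
```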

    4 Methodology

The goal of this work is to learn parsimonious hyperbolic embeddings that can encode complex logical patterns such as symmetry, anti-symmetry, or inversion while preserving latent hierarchies. Our model, ATTH, (1) learns KG embeddings in hyperbolic space in order to preserve hierarchies (Section 4.1), (2) uses a class of hyperbolic isometries parameterized by compositions of Givens transformations to encode logical patterns (Section 4.2), and (3) combines these isometries with hyperbolic attention (Section 4.3). We describe the full model in Section 4.4.

4.1 Hierarchies in hyperbolic space

As described, hyperbolic embeddings enable us to represent hierarchies even when we limit ourselves to low-dimensional spaces. In fact, two-dimensional hyperbolic space can represent any tree with arbitrarily small error (Sala et al., 2018).

It is important to set the curvature of the hyperbolic space correctly. This parameter provides flexibility to the model, as it determines whether to embed relations into a more curved hyperbolic space (more "tree-like"), or into a flatter, more "Euclidean-like" geometry. For each relation, we learn a relation-specific absolute curvature c_r, enabling us to represent a variety of hierarchies. As we show in Section 5.5, fixing, rather than learning, curvatures can lead to significant performance degradation.

    4.2 Hyperbolic isometries

Relationships often satisfy particular properties, such as symmetry: e.g., if (Michelle Obama, married to, Barack Obama) holds, then (Barack Obama, married to, Michelle Obama) does as well. These rules are not universal. For instance, (Barack Obama, born in, Hawaii) is not symmetric.

Creating and curating a set of deterministic rules is infeasible for large-scale KGs; instead, embedding methods represent relations as parameterized geometric operations that directly map to logical properties. We use two such operations in hyperbolic space: rotations, which effectively capture compositions or anti-symmetric patterns, and reflections, which naturally encode symmetric patterns.

Rotations  Rotations have been successfully used to encode compositions in complex space with the RotatE model (Sun et al., 2019); we lift these to hyperbolic space. Compared to translations or tensor factorization approaches, which can only infer some logical patterns, rotations can simultaneously model and infer inversion, composition, symmetric or anti-symmetric patterns.

Reflections  These isometries reflect along a fixed subspace. While some rotations can represent symmetric relations (more specifically π-rotations), any reflection can naturally represent symmetric relations, since their second power is the identity. They provide a way to fill in missing entries in symmetric triples, by applying the same operation to both the tail and the head entity. For instance, by modelling sibling of with a reflection, we can directly infer (Bob, sibling of, Alice) from (Alice, sibling of, Bob) and vice versa.

(a) Rotations. (b) Reflections.
Figure 3: Euclidean (left) and hyperbolic (right) isometries. In hyperbolic space, the distance between start and end points after applying rotations or reflections is much larger than the Euclidean distance; it approaches the sum of the distances between the points and the origin, giving more "room" to separate embeddings. This is similar to trees, where the shortest path between two points goes through their nearest common ancestor.


Parameterization  Unlike RotatE, which models rotations via unitary complex numbers, we learn relationship-specific isometries using Givens transformations, 2×2 matrices commonly used in numerical linear algebra. Let Θ_r := (θ_{r,i})_{i∈{1,...,d/2}} and Φ_r := (φ_{r,i})_{i∈{1,...,d/2}} denote relation-specific parameters. Using an even number of dimensions d, our model parameterizes rotations and reflections with block-diagonal matrices of the form:

\operatorname{Rot}(\Theta_r) = \operatorname{diag}\big(G^{+}(\theta_{r,1}), \ldots, G^{+}(\theta_{r,d/2})\big), \qquad (4)

\operatorname{Ref}(\Phi_r) = \operatorname{diag}\big(G^{-}(\phi_{r,1}), \ldots, G^{-}(\phi_{r,d/2})\big), \qquad (5)

\text{where } G^{\pm}(\theta) := \begin{bmatrix} \cos(\theta) & \mp\sin(\theta) \\ \sin(\theta) & \pm\cos(\theta) \end{bmatrix}. \qquad (6)

Rotations and reflections of this form are hyperbolic isometries (distance-preserving). We can therefore directly apply them to hyperbolic embeddings while preserving the underlying geometry. Additionally, these transformations are computationally efficient and can be computed in linear time in the dimension. We illustrate two-dimensional isometries in both Euclidean and hyperbolic spaces in Figure 3.
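As an illustration of the parameterization in Equations (4)–(6), the following sketch applies the block-diagonal Givens transform to a d-dimensional vector by pairing up coordinates, which is what makes the operation linear-time. The helper name givens_transform and the toy check are ours.

```python
import numpy as np

def givens_transform(x, angles, reflect=False):
    """Apply the block-diagonal map of Eq. (4) (rotation) or Eq. (5) (reflection) to x.

    Each 2x2 block is G+(theta) or G-(theta) from Eq. (6), so the cost is linear in the
    dimension instead of a dense d x d matrix multiply.
    """
    d = x.shape[0]
    assert d % 2 == 0 and angles.shape[0] == d // 2
    pairs = x.reshape(-1, 2)                    # pair up coordinates: shape (d/2, 2)
    cos, sin = np.cos(angles), np.sin(angles)
    if reflect:  # G-(theta) = [[cos, sin], [sin, -cos]]
        out = np.stack([cos * pairs[:, 0] + sin * pairs[:, 1],
                        sin * pairs[:, 0] - cos * pairs[:, 1]], axis=1)
    else:        # G+(theta) = [[cos, -sin], [sin, cos]]
        out = np.stack([cos * pairs[:, 0] - sin * pairs[:, 1],
                        sin * pairs[:, 0] + cos * pairs[:, 1]], axis=1)
    return out.reshape(-1)

# A reflection applied twice is the identity, which is why reflections suit symmetric relations.
x, phi = np.random.randn(4), np.random.randn(2)
print(np.allclose(givens_transform(givens_transform(x, phi, reflect=True), phi, reflect=True), x))
```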

4.3 Hyperbolic attention

Of our two classes of hyperbolic isometries, one or the other may better represent a particular relation. To handle this, we use an attention mechanism to learn the right isometry. Thus we can represent symmetric, anti-symmetric or mixed-behaviour relations (i.e., neither symmetric nor anti-symmetric) as a combination of rotations and reflections.

Let x^H and y^H be hyperbolic points (e.g., reflection and rotation embeddings), and let a be an attention vector. Our approach maps hyperbolic representations to tangent space representations, x^E = log_0^c(x^H) and y^E = log_0^c(y^H), and computes attention scores:

(\alpha_x, \alpha_y) = \operatorname{Softmax}(a^\top x^E, a^\top y^E).

We then compute a weighted average using the recently proposed tangent space average (Chami et al., 2019; Liu et al., 2019):

\operatorname{Att}(x^H, y^H; a) := \exp_0^c(\alpha_x x^E + \alpha_y y^E). \qquad (7)
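The attention mechanism of Equation (7) can be sketched as follows; the exponential and logarithmic maps of Equations (1)–(2) are restated so the snippet stands alone, and the names are ours.

```python
import numpy as np

def expmap0(v, c):
    n = np.linalg.norm(v)
    return v if n == 0 else np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def logmap0(y, c):
    n = np.linalg.norm(y)
    return y if n == 0 else np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)

def hyperbolic_attention(x_hyp, y_hyp, a, c):
    """Eq. (7): project both candidates to the tangent space at the origin,
    average them with softmax weights, and map the result back to the ball."""
    x_tan, y_tan = logmap0(x_hyp, c), logmap0(y_hyp, c)
    scores = np.array([a @ x_tan, a @ y_tan])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax over the two candidates
    return expmap0(alpha[0] * x_tan + alpha[1] * y_tan, c)
```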

    4.4 The ATTH model

We have all of the building blocks for ATTH, and can now describe the model architecture. Let (e_v^H)_{v∈V} and (r_r^H)_{r∈R} denote entity and relationship hyperbolic embeddings, respectively. For a triple (h, r, t) ∈ V × R × V, ATTH applies relation-specific rotations (Equation 4) and reflections (Equation 5) to the head embedding:

q^H_{\mathrm{Rot}} = \operatorname{Rot}(\Theta_r)\, e^H_h, \qquad q^H_{\mathrm{Ref}} = \operatorname{Ref}(\Phi_r)\, e^H_h. \qquad (8)

ATTH then combines the two representations using hyperbolic attention (Equation 7) and applies a hyperbolic translation:

Q(h, r) = \operatorname{Att}(q^H_{\mathrm{Rot}}, q^H_{\mathrm{Ref}}; a_r) \oplus^{c_r} r^H_r. \qquad (9)

Intuitively, rotations and reflections encode logical patterns while translations capture tree-like structures by moving between levels of the hierarchy. Finally, query embeddings are compared to target tail embeddings via the hyperbolic distance (Equation 3). The resulting scoring function is:

s(h, r, t) = -d^{c_r}(Q(h, r), e^H_t)^2 + b_h + b_t, \qquad (10)

where (b_v)_{v∈V} are entity biases which act as margins in the scoring function (Tifrea et al., 2019; Balažević et al., 2019).

The model parameters are then {(Θ_r, Φ_r, r_r^H, a_r, c_r)_{r∈R}, (e_v^H, b_v)_{v∈V}}. Note that the total number of parameters in ATTH is O(|V|d), similar to traditional models that do not use attention or geometric operations. The extra cost is proportional to the number of relations, which is usually much smaller than the number of entities.
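Putting Equations (8)–(10) together, a hypothetical scoring routine could look like the sketch below. It assumes the givens_transform, hyperbolic_attention, mobius_add and hyp_distance helpers from the earlier sketches are in scope, and is only meant to show how the pieces compose, not the paper's actual implementation.

```python
def atth_score(e_h, e_t, rel, c, b_h, b_t):
    """Hypothetical composition of Eqs. (8)-(10) for one triple (h, r, t).

    e_h, e_t: hyperbolic entity embeddings (points in the ball of curvature -c);
    rel: the relation's parameters: rotation angles "theta", reflection angles "phi",
         hyperbolic translation "r_hyp", and attention vector "a".
    Relies on givens_transform, hyperbolic_attention, mobius_add and hyp_distance
    from the sketches above.
    """
    q_rot = givens_transform(e_h, rel["theta"], reflect=False)  # Eq. (8), rotation branch
    q_ref = givens_transform(e_h, rel["phi"], reflect=True)     # Eq. (8), reflection branch
    attended = hyperbolic_attention(q_rot, q_ref, rel["a"], c)  # Eq. (7)
    query = mobius_add(attended, rel["r_hyp"], c)               # Eq. (9), hyperbolic translation
    return -hyp_distance(query, e_t, c) ** 2 + b_h + b_t        # Eq. (10)
```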

Dataset     #entities  #relations  #triples  ξ_G
WN18RR      41k        11          93k       -2.54
FB15k-237   15k        237         310k      -0.65
YAGO3-10    123k       37          1M        -0.54

Table 1: Datasets statistics. The lower the metric ξ_G is, the more tree-like the knowledge graph is.

    5 Experiments

In low dimensions, we hypothesize (1) that hyperbolic embedding methods obtain better representations and allow for improved downstream performance for hierarchical data (Section 5.2). (2) We expect the performance of relation-specific geometric operations to vary based on the relation's logical patterns (Section 5.3). (3) In cases where the relations are neither purely symmetric nor anti-symmetric, we anticipate that hyperbolic attention outperforms models based solely on reflections or rotations (Section 5.4). Finally, in high dimensions, we expect hyperbolic models with trainable curvature to learn the best geometry, and perform similarly to their Euclidean analogues (Section 5.5).

    5.1 Experimental setup

Datasets  We evaluate our approach on the link prediction task using three standard competition benchmarks, namely WN18RR (Bordes et al., 2013; Dettmers et al., 2018), FB15k-237 (Bordes et al., 2013; Toutanova and Chen, 2015) and YAGO3-10 (Mahdisoltani et al., 2013). WN18RR is a subset of WordNet containing 11 lexical relationships between 40,943 word senses, and has a natural hierarchical structure, e.g., (car, hypernym of, sedan). FB15k-237 is a subset of Freebase, a collaborative KB of general world knowledge. FB15k-237 has 14,541 entities and 237 relationships, some of which are non-hierarchical, such as born-in or nationality, while others have natural hierarchies, such as part-of (for organizations). YAGO3-10 is a subset of YAGO3, containing 123,182 entities and 37 relations, where most relations provide descriptions of people. Some relationships have a hierarchical structure such as playsFor or actedIn, while others induce logical patterns, like isMarriedTo.

For each KG, we follow the standard data augmentation protocol by adding inverse relations (Lacroix et al., 2018) to the datasets. Additionally, we estimate the global graph curvature ξ_G (Gu et al., 2019) (see Appendix A.2 for more details), which is a distance-based measure of how close a given graph is to being a tree. We summarize the datasets' statistics in Table 1.

U        Model        WN18RR                     FB15k-237                  YAGO3-10
                      MRR   H@1   H@3   H@10     MRR   H@1   H@3   H@10     MRR   H@1   H@3   H@10
R^d      RotatE       .387  .330  .417  .491     .290  .208  .316  .458     -     -     -     -
R^d      MuRE         .458  .421  .471  .525     .313  .226  .340  .489     .283  .187  .317  .478
C^d      ComplEx-N3   .420  .390  .420  .460     .294  .211  .322  .463     .336  .259  .367  .484
B^{d,1}  MuRP         .465  .420  .484  .544     .323  .235  .353  .501     .230  .150  .247  .392
R^d      REFE         .455  .419  .470  .521     .302  .216  .330  .474     .370  .289  .403  .527
R^d      ROTE         .463  .426  .477  .529     .307  .220  .337  .482     .381  .295  .417  .548
R^d      ATTE         .456  .419  .471  .526     .311  .223  .339  .488     .374  .290  .410  .537
B^{d,c}  REFH         .447  .408  .464  .518     .312  .224  .342  .489     .381  .302  .415  .530
B^{d,c}  ROTH         .472  .428  .490  .553     .314  .223  .346  .497     .393  .307  .435  .559
B^{d,c}  ATTH         .466  .419  .484  .551     .324  .236  .354  .501     .397  .310  .437  .566

Table 2: Link prediction results for low-dimensional embeddings (d = 32) in the filtered setting. Best score in bold and best published underlined. Hyperbolic isometries significantly outperform Euclidean baselines on WN18RR and YAGO3-10, both of which exhibit hierarchical structures.

Figure 4: WN18RR MRR vs. embedding dimension for d ∈ {10, 16, 20, 32, 50, 200, 500}, comparing MuRP, ComplEx-N3 and ROTH. Average and standard deviation computed over 10 runs for ROTH.


Baselines  We compare our method to SotA models, including MuRP (Balažević et al., 2019), MuRE (the Euclidean analogue of MuRP), RotatE (Sun et al., 2019), ComplEx-N3 (Lacroix et al., 2018) and TuckER (Balazevic et al., 2019). Baseline numbers in high dimensions (Table 5) are taken from the original papers, while baseline numbers in the low-dimensional setting (Table 2) are computed using open-source implementations of each model. In particular, we run hyper-parameter searches over the same parameters as the ones in the original papers to compute baseline numbers in the low-dimensional setting.

Ablations  To analyze the benefits of hyperbolic geometry, we evaluate the performance of ATTE, which is equivalent to ATTH with curvatures set to zero. Additionally, to better understand the role of attention, we report scores for variants of ATTE/H using only rotations (ROTE/H) or reflections (REFE/H).

Evaluation metrics  At test time, we use the scoring function in Equation 10 to rank the correct tail or head entity against all possible entities, and use inverse relations for head prediction (Lacroix et al., 2018). Similar to previous work, we compute two ranking-based metrics: (1) mean reciprocal rank (MRR), which measures the mean of inverse ranks assigned to correct entities, and (2) hits at K (H@K, K ∈ {1, 3, 10}), which measures the proportion of correct triples among the top K predicted triples. We follow the standard evaluation protocol in the filtered setting (Bordes et al., 2013): all true triples in the KG are filtered out during evaluation, since predicting a low rank for these triples should not be penalized.
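For concreteness, a minimal sketch of the two metrics, assuming the filtered 1-based ranks of the correct entities have already been computed; the function name and example ranks are ours.

```python
import numpy as np

def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Mean reciprocal rank and Hits@K from filtered 1-based ranks of the correct entities."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"H@{k}"] = float(np.mean(ranks <= k))
    return metrics

print(ranking_metrics([1, 2, 5, 40]))  # MRR ~= 0.431, H@1 = 0.25, H@3 = 0.5, H@10 = 0.75
```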

Training procedure and implementation  We train ATTH by minimizing the full cross-entropy loss with uniform negative sampling, where negative examples for a triple (h, r, t) are sampled uniformly from all possible triples obtained by perturbing the tail entity:

\mathcal{L} = \sum_{t' \sim U(\mathcal{V})} \log\big(1 + \exp(y_{t'}\, s(h, r, t'))\big), \qquad (11)

where y_{t'} = -1 if t' = t, and y_{t'} = 1 otherwise.
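A minimal sketch of Equation (11) for a single (h, r) query, here written over all candidate tails rather than a uniformly sampled subset; the function name and example scores are ours.

```python
import numpy as np

def triple_loss(scores, true_idx):
    """Eq. (11) for one (h, r) query: scores[i] = s(h, r, t_i) over candidate tails.

    The correct tail gets label y = -1; every other candidate gets y = +1.
    """
    y = np.ones_like(scores)
    y[true_idx] = -1.0
    return float(np.sum(np.log1p(np.exp(y * scores))))

print(triple_loss(np.array([5.0, -2.0, 0.3]), true_idx=0))
```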

Since optimization in hyperbolic space is practically challenging, we instead define all parameters in the tangent space at the origin, optimize embeddings using standard Euclidean techniques, and use the exponential map to recover the hyperbolic parameters (Chami et al., 2019). We provide more details on tangent space optimization in Appendix A.4. We conducted a grid search to select the learning rate, optimizer, negative sample size, and batch size, using the validation set to select the best hyperparameters, which are detailed in Appendix A.3. We conducted all our experiments on NVIDIA Tesla P100 GPUs and make our implementation publicly available∗.

Relation                     Khs_G  ξ_G    ROTE  ROTH  Improvement
member meronym               1.00   -2.90  .320  .399  24.7%
hypernym                     1.00   -2.46  .237  .276  16.5%
has part                     1.00   -1.43  .291  .346  18.9%
instance hypernym            1.00   -0.82  .488  .520  6.56%
member of domain region      1.00   -0.78  .385  .365  -5.19%
member of domain usage       1.00   -0.74  .458  .438  -4.37%
synset domain topic of       0.99   -0.69  .425  .447  5.17%
also see                     0.36   -2.09  .634  .705  11.2%
derivationally related form  0.07   -3.84  .960  .968  0.83%
similar to                   0.07   -1.00  1.00  1.00  0.00%
verb group                   0.07   -0.50  .974  .974  0.00%

Table 3: Comparison of H@10 for WN18RR relations. Higher Khs_G and lower ξ_G means more hierarchical.

    perparameters. Our best model hyperparametersare detailed in Appendix A.3. We conducted allour experiments on NVIDIA Tesla P100 GPUs andmake our implementation publicly available∗.

    5.2 Results in low dimensions

We first evaluate our approach in the low-dimensional setting for d = 32, which is approximately one order of magnitude smaller than SotA Euclidean methods. Table 2 compares the performance of ATTH to that of other baselines, including the recent hyperbolic (but not rotation-based) MuRP model. In low dimensions, hyperbolic embeddings offer much better representations for hierarchical relations, confirming our hypothesis. ATTH improves over previous Euclidean and hyperbolic methods by 0.7% and 6.1% points in MRR on WN18RR and YAGO3-10 respectively. Both datasets have multiple hierarchical relationships, suggesting that the hierarchical structure imposed by hyperbolic geometry leads to better embeddings. On FB15k-237, ATTH and MuRP achieve similar performance, both improving over Euclidean baselines. We conjecture that translations are sufficient to model relational patterns in FB15k-237.

To understand the role of dimensionality, we also conduct experiments on WN18RR against SotA methods under varied low-dimensional settings (Figure 4). We include error bars for our method with average MRR and standard deviation computed over 10 runs. Our approach consistently outperforms all baselines, suggesting that hyperbolic embeddings still attain high accuracy across a broad range of dimensions.

Additionally, we measure performance per relation on WN18RR in Table 3 to understand the benefits of hyperbolic geometry on hierarchical relations. We report the Krackhardt hierarchy score (Khs_G) (Balažević et al., 2019) and estimated curvature per relation (see Appendix A.2 for more details). We consider a relation to be hierarchical when its corresponding graph is close to tree-like (low curvature, high Khs_G). We observe that hyperbolic embeddings offer much better performance on hierarchical relations such as hypernym or has part, while Euclidean and hyperbolic embeddings have similar performance on non-hierarchical relations such as verb group. We also plot the learned curvature per relation versus the embedding dimension in Figure 5b. We note that the learned curvature in low dimensions directly correlates with the estimated graph curvature ξ_G in Table 3, suggesting that the model with learned curvatures learns more "curved" embedding spaces for tree-like relations.

    ∗Code available at https://github.com/tensorflow/neural-structured-learning/tree/master/research/kg_hyp_emb

Relation        Anti-symmetric  Symmetric  ROTH  REFH  ATTH
hasNeighbor     ✗               ✓          .750  1.00  1.00
isMarriedTo     ✗               ✓          .941  .941  1.00
actedIn         ✓               ✗          .145  .110  .150
hasMusicalRole  ✓               ✗          .431  .375  .458
directed        ✓               ✗          .500  .450  .567
graduatedFrom   ✓               ✗          .262  .167  .274
playsFor        ✓               ✗          .671  .642  .664
wroteMusicFor   ✓               ✗          .281  .188  .266
hasCapital      ✓               ✗          .692  .731  .731
dealsWith       ✗               ✗          .286  .286  .429
isLocatedIn     ✗               ✗          .404  .399  .420

Table 4: Comparison of geometric transformations on a subset of YAGO3-10 relations.


Finally, we observe that MuRP achieves lower performance than MuRE on YAGO3-10, while ATTH improves over ATTE by 2.3% in MRR. This suggests that trainable curvature is critical to learn embeddings with the right amount of curvature, while fixed curvature might degrade performance. We elaborate further on this point in Section 5.5.

    5.3 Hyperbolic rotations and reflections

In our experiments, we find that rotations work well on WN18RR, which contains multiple hierarchical and anti-symmetric relations, while reflections work better for YAGO3-10 (Table 5). To better understand the mechanisms behind these observations, we analyze two specific patterns: relation symmetry and anti-symmetry. We report performance per relation on a subset of YAGO3-10 relations in Table 4. We categorize relations into symmetric, anti-symmetric, or neither symmetric nor anti-symmetric categories using data statistics. More concretely, we consider a relation to satisfy a logical pattern when the logical condition is satisfied by most of the triples (e.g., a relation r is symmetric if for most KG triples (h, r, t), (t, r, h) is also in the KG). We observe that reflections encode symmetric relations particularly well, while rotations are well suited for anti-symmetric relations. This confirms our intuition—and the motivation for our approach—that particular geometric properties capture different kinds of logical properties.


(a) MRR for fixed and trainable curvatures on WN18RR (ROTE with zero curvature, ROTH with fixed curvature, ROTH with trainable curvatures). (b) Curvatures learned with ROTH on WN18RR, per relation, versus embedding dimension.
Figure 5: (a): ROTH offers improved performance in low dimensions; in high dimensions, fixed curvature degrades performance, while trainable curvature approximately recovers Euclidean space. (b): As the dimension increases, the learned curvature of hierarchical relationships tends to zero.


    5.4 Attention-based transformations

One advantage of using relation-specific transformations is that each relation can learn the right geometric operators based on the logical properties it has to satisfy. In particular, we observe that in both low- and high-dimensional settings, attention-based models can recover the performance of the best transformation on all datasets (Tables 2 and 5). Additionally, per-relationship results on YAGO3-10 in Table 4 suggest that ATTH indeed recovers the best geometric operation.

Furthermore, for relations that are neither symmetric nor anti-symmetric, we find that ATTH can outperform rotations and reflections, suggesting that combining multiple operators with attention can learn more expressive operators to model mixed logical patterns. In other words, attention-based transformations alleviate the need to conduct experiments with multiple geometric transformations by simply allowing the model to choose which one is best for a given relation.

    5.5 Results in high dimensions

In high dimensions (Table 5), we compare against a variety of other models and achieve new SotA results on WN18RR and YAGO3-10, and third-best results on FB15k-237. As we expected, when the embedding dimension is large, Euclidean and hyperbolic embedding methods perform similarly across all datasets. We explain this behavior by noting that when the dimension is sufficiently large, both Euclidean and hyperbolic spaces have enough capacity to represent complex hierarchies in KGs. This is further supported by Figure 5b, which shows the learned absolute curvature versus the dimension. We observe that curvatures are close to zero in high dimensions, confirming our expectation that ROTH with trainable curvatures learns a roughly Euclidean geometry in this setting.

In contrast, fixed curvature degrades performance in high dimensions (Figure 5a), confirming the importance of trainable curvatures and their impact on precision and capacity (previously studied by Sala et al. (2018)). Additionally, we show the embeddings' norm distribution in the Appendix (Figure 7). Fixed curvature results in embeddings being clustered near the boundary of the ball, while trainable curvature adjusts the embedding space to better distribute points throughout the ball. Precision issues that might arise with fixed curvature could also explain MuRP's low performance in high dimensions. Trainable curvatures allow ROTH to perform as well as or better than previous methods in both low and high dimensions.

    5.6 Visualizations

In Figure 6, we visualize the embeddings learned by ROTE versus ROTH for a sub-tree of the organism entity in WN18RR. To better visualize the hierarchy, we apply k inverse rotations for all nodes at level k in the tree.

By contrast to ROTE, ROTH preserves the tree structure in the embedding space. Furthermore, we note that ROTE cannot simultaneously preserve the tree structure and make non-neighboring nodes far from each other. For instance, virus should be far from male, but preserving the tree structure (by going one level down in the tree) while making these two nodes far from each other is difficult in Euclidean space. In hyperbolic space, however, we observe that going one level down in the tree is achieved by translating embeddings towards the left. This pattern essentially illustrates the translation component in ROTH, allowing the model to simultaneously preserve hierarchies while making non-neighbouring nodes far from each other.

U        Model        WN18RR                     FB15k-237                  YAGO3-10
                      MRR   H@1   H@3   H@10     MRR   H@1   H@3   H@10     MRR   H@1   H@3   H@10
R^d      DistMult     .430  .390  .440  .490     .241  .155  .263  .419     .340  .240  .380  .540
R^d      ConvE        .430  .400  .440  .520     .325  .237  .356  .501     .440  .350  .490  .620
R^d      TuckER       .470  .443  .482  .526     .358  .266  .394  .544     -     -     -     -
R^d      MuRE         .475  .436  .487  .554     .336  .245  .370  .521     .532  .444  .584  .694
C^d      ComplEx-N3   .480  .435  .495  .572     .357  .264  .392  .547     .569  .498  .609  .701
C^d      RotatE       .476  .428  .492  .571     .338  .241  .375  .533     .495  .402  .550  .670
H^d      Quaternion   .488  .438  .508  .582     .348  .248  .382  .550     -     -     -     -
B^{d,1}  MuRP         .481  .440  .495  .566     .335  .243  .367  .518     .354  .249  .400  .567
R^d      REFE         .473  .430  .485  .561     .351  .256  .390  .541     .577  .503  .621  .712
R^d      ROTE         .494  .446  .512  .585     .346  .251  .381  .538     .574  .498  .621  .711
R^d      ATTE         .490  .443  .508  .581     .351  .255  .386  .543     .575  .500  .621  .709
B^{d,c}  REFH         .461  .404  .485  .568     .346  .252  .383  .536     .576  .502  .619  .711
B^{d,c}  ROTH         .496  .449  .514  .586     .344  .246  .380  .535     .570  .495  .612  .706
B^{d,c}  ATTH         .486  .443  .499  .573     .348  .252  .384  .540     .568  .493  .612  .702

Table 5: Link prediction results for high-dimensional embeddings (best for d ∈ {200, 400, 500}) in the filtered setting. DistMult, ConvE and ComplEx results are taken from (Dettmers et al., 2018). Best score in bold and best published underlined. ATTE and ATTH have similar performance in the high-dimensional setting, performing competitively with or better than state-of-the-art methods on WN18RR, FB15k-237 and YAGO3-10.

(a) ROTE embeddings. (b) ROTH embeddings.
Figure 6: Visualizations of the embeddings learned by ROTE and ROTH on a sub-tree of WN18RR for the hypernym relation. In contrast to ROTE, ROTH preserves hierarchies by learning tree-like embeddings.


    6 Conclusion

We introduce ATTH, a hyperbolic KG embedding model that leverages the expressiveness of hyperbolic space and attention-based geometric transformations to learn improved KG representations in low dimensions. ATTH learns embeddings with trainable hyperbolic curvatures, allowing it to learn the right geometry for each relationship and generalize across multiple embedding dimensions. ATTH achieves new SotA results on WN18RR and YAGO3-10, real-world KGs which exhibit hierarchical structures. Future directions for this work include exploring other tasks that might benefit from hyperbolic geometry, such as hypernym detection. The proposed attention-based transformations can also be extended to other geometric operations.

    Acknowledgements

We thank Avner May for their helpful feedback and discussions. We gratefully acknowledge the support of DARPA under Nos. FA86501827865 (SDH) and FA86501827882 (ASED); NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, the HAI-AWS Cloud Credits for Research program, TOTAL, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.

References

Ivana Balažević, Carl Allen, and Timothy Hospedales. 2019. Multi-relational Poincaré graph embeddings. In Advances in Neural Information Processing Systems, pages 4465–4475.

Ivana Balazevic, Carl Allen, and Timothy Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5188–5197.

Trapit Bansal, Da-Cheng Juan, Sujith Ravi, and Andrew McCallum. 2019. A2N: Attending to neighbors for knowledge graph inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4387–4392.

Silvere Bonnabel. 2013. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. 2019. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pages 4869–4880.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. 2018. Hyperbolic neural networks. In Advances in Neural Information Processing Systems.

Albert Gu, Fred Sala, Beliz Gunel, and Christopher Ré. 2019. Learning mixed-curvature representations in product spaces. In International Conference on Learning Representations.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 687–696.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

David Krackhardt. 1994. Graph theoretical dimensions of informal organizations. In Computational Organization Theory, pages 107–130. Psychology Press.

Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. 2018. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Qi Liu, Maximilian Nickel, and Douwe Kiela. 2019. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems, pages 8228–8239.

Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. 2013. YAGO3: A knowledge base from multilingual Wikipedias.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. 2019. Learning attention-based embeddings for relation prediction in knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. 2018. A novel embedding model for knowledge base completion based on convolutional neural network. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 327–333.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning, pages 809–816. Omnipress.

Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347.

Joel W. Robbin and Dietmar A. Salamon. Introduction to Differential Geometry.

Frederic Sala, Chris De Sa, Albert Gu, and Christopher Ré. 2018. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pages 4457–4466.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations.

Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2019. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations.

Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Canran Xu and Ruijiang Li. 2019. Relation embedding with dihedral group in knowledge graph. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations.

Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. 2019. Quaternion knowledge graph embeddings. In Advances in Neural Information Processing Systems, pages 2731–2741.

A Appendix

Below, we provide additional details. We start by providing the formula for the hyperbolic analogue of addition that we use, along with additional hyperbolic geometry background. Next, we provide more information about the metrics that are used to determine how hierarchical a dataset is. Afterwards, we give additional experimental details, including the table of hyperparameters and further details on tangent space optimization. Lastly, we include an additional comparison against the Dihedral model (Xu and Li, 2019).

    A.1 Möbius addition

The Möbius addition operation (Ganea et al., 2018) has the closed-form expression:

x \oplus^c y = \frac{\alpha_{xy}\, x + \beta_{xy}\, y}{1 + 2c\, x^\top y + c^2 \|x\|^2 \|y\|^2}, \quad \text{where } \alpha_{xy} = 1 + 2c\, x^\top y + c\|y\|^2 \text{ and } \beta_{xy} = 1 - c\|x\|^2.

In contrast to Euclidean addition, it is neither commutative nor associative. However, it provides an analogue through the lens of parallel transport: given two points x, y and a vector v in T_x^c, there is a unique vector in T_y^c which creates the same angle as v with the direction of the geodesic (shortest path) connecting x to y. This map is the parallel transport P^c_{x→y}(·); Euclidean parallel transport is the standard Euclidean addition. Analogously, the Möbius addition satisfies (Ganea et al., 2018): x ⊕^c y = exp^c_x(P^c_{0→x}(log^c_0(y))).

    A.2 Hierarchy estimates

We use two metrics to estimate how hierarchical a relation is: the curvature estimate ξ_G and the Krackhardt hierarchy score Khs_G. While the curvature estimate captures global hierarchical behaviours (how much the graph is tree-like when zooming out), the Krackhardt score captures a more local behaviour (how many small loops the graph has). See Figure 8 for examples.

Figure 7: Histogram of embedding norms learned with fixed and trainable curvatures for the hypernym relation in WN18RR.

Curvature estimate  To estimate the curvature of a relation r, we restrict to the undirected graph G_r spanned by the edges labeled as r. Following (Gu et al., 2019), let ξ_{G_r}(a, b, c) be the curvature estimate of a triangle in G_r with vertices {a, b, c}, which is given by:

\xi_{G_r}(a, b, c) = \frac{1}{2\, d_{G_r}(a, m)}\left(d_{G_r}(a, m)^2 + \frac{d_{G_r}(b, c)^2}{4} - \frac{d_{G_r}(a, b)^2 + d_{G_r}(a, c)^2}{2}\right),

where m is the midpoint of the shortest path connecting b to c. This estimate is positive for triangles in circles, negative for triangles in trees, and zero for triangles in lines. Moreover, for a triangle in a Riemannian manifold M, ξ_M(a, b, c) estimates the sectional curvature of the plane on which the triangle lies (see (Gu et al., 2019) for more details). Let m_r be the total number of connected components in G_r. We sample 1000·w_{i,r} triangles from each connected component c_{i,r} of G_r, where

w_{i,r} = \frac{N_{i,r}^3}{\sum_{i=1}^{m_r} N_{i,r}^3}, and N_{i,r} is the number of nodes in the component c_{i,r}. ξ_{G_r} is the mean of the estimated curvatures of the sampled triangles. For the full graph, we take the weighted average of the relation curvatures ξ_{G_r} with respect to the weights \frac{\sum_{i=1}^{m_r} N_{i,r}^3}{\sum_r \sum_{i=1}^{m_r} N_{i,r}^3}.
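A small sketch of the per-triangle curvature estimate above, taking precomputed graph distances as inputs; the function name and the star-graph example are ours.

```python
def triangle_curvature(d_am, d_bc, d_ab, d_ac):
    """Curvature estimate xi(a, b, c) of one sampled triangle (Gu et al., 2019).

    d_am is the graph distance from a to the midpoint m of the shortest b-c path;
    d_bc, d_ab, d_ac are pairwise graph distances. Negative values indicate tree-like structure.
    """
    return (d_am ** 2 + d_bc ** 2 / 4.0 - (d_ab ** 2 + d_ac ** 2) / 2.0) / (2.0 * d_am)

# Star graph with center m and leaves a, b, c: the triangle {a, b, c} looks like a tree.
print(triangle_curvature(d_am=1, d_bc=2, d_ab=2, d_ac=2))  # -1.0
```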

Krackhardt hierarchy score  For the directed graph G_r spanned by the relation r, we let R be the adjacency matrix (R_{i,j} = 1 if there is an edge from node i to node j and 0 otherwise). Then:

\operatorname{Khs}_{G_r} = \frac{\sum_{i,j=1}^{n} R_{i,j}\,(1 - R_{j,i})}{\sum_{i,j=1}^{n} R_{i,j}}.

See (Krackhardt, 1994) for more details. We note that for fully observed symmetric relations (each edge is in a two-edge loop), Khs_{G_r} = 0, while for anti-symmetric relations (no small loops), Khs_{G_r} = 1.
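A minimal sketch of the Krackhardt hierarchy score from a binary adjacency matrix; names and the two toy relations are ours.

```python
import numpy as np

def krackhardt_score(R):
    """Krackhardt hierarchy score Khs of a directed graph given its binary adjacency matrix R."""
    return float(np.sum(R * (1 - R.T)) / np.sum(R))

symmetric = np.array([[0, 1], [1, 0]])      # every edge has a reciprocal edge -> Khs = 0
antisymmetric = np.array([[0, 1], [0, 0]])  # no reciprocal edges -> Khs = 1
print(krackhardt_score(symmetric), krackhardt_score(antisymmetric))
```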

Figure 8: The curvature estimate ξ_G and the Krackhardt hierarchy score Khs_G for several simple graphs, covering all combinations of ξ_G < 0, ξ_G = 0, ξ_G > 0 with Khs_G = 1 and Khs_G = 0. The top-left graph (ξ_G < 0, Khs_G = 1) is the most hierarchical, while the bottom-right graph (ξ_G > 0, Khs_G = 0) is the least hierarchical.

Model     WN18RR         FB15k-237      YAGO3-10
          MRR    H@10    MRR    H@10    MRR    H@10
Dihedral  .486   .557    .300   .496    .388   .573
ATTE      .490   .581    .351   .543    .575   .709

Table 6: Comparison of Dihedral and ATTE in high dimensions.

    A.3 Experimental details

For all our Euclidean and hyperbolic models, we conduct a hyperparameter search for the learning rate, optimizer (Adam (Kingma and Ba, 2015) or Adagrad (Duchi et al., 2011)), negative sample size and batch size. We train each model for 500 epochs and use early stopping after 100 epochs if the validation MRR stops increasing. We report the best hyperparameters for each dataset in Table 7.

    A.4 Tangent space optimization

Optimization in hyperbolic space normally requires Riemannian Stochastic Gradient Descent (RSGD) (Bonnabel, 2013), as was used in MuRP. RSGD is challenging in practice. Instead, we use tangent space optimization (Chami et al., 2019). We define all the ATTH parameters in the tangent space at the origin (our parameter space), optimize embeddings using standard Euclidean techniques, and use the exponential map to recover the hyperbolic parameters.

Note that tangent space optimization is an exact procedure, which does not incur losses in representational power. This is the case in hyperbolic space specifically because of a completeness property: there is always a global bijection between the tangent space and the manifold.

Concretely, ATTH optimizes the entity and relationship embeddings (e_v^E)_{v∈V} and (r_r^E)_{r∈R}, which are mapped to the Poincaré ball with:

e^H_v = \exp_0^{c_r}(e^E_v) \quad \text{and} \quad r^H_r = \exp_0^{c_r}(r^E_r). \qquad (12)

The trainable model parameters are then {(Θ_r, Φ_r, r_r^E, a_r, c_r)_{r∈R}, (e_v^E, b_v)_{v∈V}}, which are all Euclidean parameters that can be learned using standard Euclidean optimization techniques.

Dataset     Dim  Model  Learning rate  Optimizer  Batch size  Negative samples
WN18RR      32   REFE   0.001          Adam       100         250
WN18RR      32   ROTE   0.001          Adam       100         250
WN18RR      32   ATTE   0.001          Adam       100         250
WN18RR      32   REFH   0.0005         Adam       250         250
WN18RR      32   ROTH   0.0005         Adam       500         50
WN18RR      32   ATTH   0.0005         Adam       500         50
WN18RR      500  REFE   0.1            Adagrad    500         50
WN18RR      500  ROTE   0.001          Adam       100         500
WN18RR      500  ATTE   0.001          Adam       1000        50
WN18RR      500  REFH   0.05           Adagrad    500         50
WN18RR      500  ROTH   0.001          Adam       1000        50
WN18RR      500  ATTH   0.001          Adam       1000        50
FB15k-237   32   REFE   0.075          Adagrad    250         250
FB15k-237   32   ROTE   0.05           Adagrad    500         50
FB15k-237   32   ATTE   0.05           Adagrad    500         50
FB15k-237   32   REFH   0.05           Adagrad    500         250
FB15k-237   32   ROTH   0.1            Adagrad    100         50
FB15k-237   32   ATTH   0.05           Adagrad    500         100
FB15k-237   500  REFE   0.05           Adagrad    500         50
FB15k-237   500  ROTE   0.05           Adagrad    100         50
FB15k-237   500  ATTE   0.05           Adagrad    500         50
FB15k-237   500  REFH   0.05           Adagrad    500         50
FB15k-237   500  ROTH   0.05           Adagrad    1000        50
FB15k-237   500  ATTH   0.05           Adagrad    500         50
YAGO3-10    32   REFE   0.005          Adam       2000        NA
YAGO3-10    32   ROTE   0.005          Adam       2000        NA
YAGO3-10    32   ATTE   0.005          Adam       2000        NA
YAGO3-10    32   REFH   0.005          Adam       1000        NA
YAGO3-10    32   ROTH   0.001          Adam       1000        NA
YAGO3-10    32   ATTH   0.001          Adam       1000        NA
YAGO3-10    500  REFE   0.005          Adam       4000        NA
YAGO3-10    500  ROTE   0.005          Adam       4000        NA
YAGO3-10    500  ATTE   0.005          Adam       2000        NA
YAGO3-10    500  REFH   0.001          Adam       1000        NA
YAGO3-10    500  ROTH   0.0005         Adam       1000        NA
YAGO3-10    500  ATTH   0.0005         Adam       1000        NA

Table 7: Best hyperparameters in low- and high-dimensional settings. NA negative samples indicates that the full cross-entropy loss is used, without negative sampling.


A.5 Comparison to Dihedral

We compare the performance of Dihedral (Xu and Li, 2019) versus that of ATTE in Table 6. Both methods combine rotations and reflections, but our approach learns attention-based transformations, while Dihedral learns a single parameter to determine which transformation to use. ATTE significantly outperforms Dihedral on all datasets, suggesting that using attention-based representations is important in order to learn the right geometric transformation for each relation.