
Relatedness of Given Names

Mitzlaff, F
Knowledge and Data Engineering Group (KDE)
University of Kassel
Wilhelmshöher Allee 73, D-34121 Kassel, Germany
Email: [email protected]

Stumme, G
Knowledge and Data Engineering Group (KDE)
University of Kassel
Wilhelmshöher Allee 73, D-34121 Kassel, Germany
Email: [email protected]

ABSTRACT

As a result of the author’s need for help in finding a given name for the unborn baby, the Nameling1, a search engine for given names based on data from the “Social Web”, was born. Within less than six months, more than 35,000 users accessed Nameling with more than 300,000 search requests, underpinning the relevance of the underlying research questions.

The present work proposes a new approach for discovering relations among given names, based on co-occurrences within Wikipedia. In particular, the task of finding relevant names for a given search query is considered as a ranking task, and the performance of different measures of relatedness among given names is evaluated with respect to Nameling’s actual usage data. We will show that a modification of the PageRank algorithm overcomes limitations imposed by global network characteristics on preferential PageRank computations.

By publishing the considered usage data, we stimulate the research community to develop advanced recommendation systems and to analyze influencing factors for the choice of a given name.

I INTRODUCTION

Whoever has been in need of a given name knows how difficult it is to choose a suitable name which meets (in the best case) all constraints. The choice of a given name is not only influenced by personal taste. Many influencing factors have to be considered, such as cultural background, language-dependent pronunciation, current trends and, last but not least, public perception or even prejudices towards given names.

Though there is a constant demand for finding a suitable name, little aid exists besides alphabetically ordered lists of names and simple filtering techniques. From a scientist’s perspective, the task of ranking and recommending given names is challenging and can be tackled from many different disciplines.

In the present work, we consider the task where a user searches for suitable names based on a given set of query names. The user can thus, e. g., search for names which are related to his or her own name, to the names of all family members together, or just for names which are similar to a name he or she likes.

The proposed approach for discovering relations among given names is based on co-occurrences within Wikipedia,2 but can as well be applied to other data sources, such as microblogging systems, newspaper archives or collections of eBooks. Manual inspection of a first implementation of cosine similarity (cf. Section III) already showed promising results, but as there is a variety of metrics for calculating similarity in such co-occurrence graphs, the question arises which one gives rise to the most “meaningful” notion of relatedness.

For assessing the performance of the different similarity metrics, we firstly compare the obtained rankings with an external reference ranking. Secondly, we evaluate the ranking performance with respect to actual usage data from Nameling, which comprises more than 35,000 users with more than 300,000 search queries during the time period of evaluation. We will show that a modification of the PageRank algorithm which we adopted from our previous work on folksonomies [11] overcomes limitations imposed by global network characteristics on preferential PageRank computations. To promote other researchers’ efforts in developing new ranking and recommendation systems, all considered data is made publicly available.3

The rest of the work is structured as follows: Section II presents related work. Section III summarizes basic notions and concepts. Section IV briefly describes Nameling and Section V presents results on the semantics of the similarity metrics under consideration. In Section VI, the actual usage data of Nameling is introduced and analyzed, which is then used for comparing the performance of different similarity metrics in Section VII as well as comparing approaches for combining multiple query names in Section VIII. Finally, in Section IX, the present work’s contributions are summarized and forthcoming work is presented.

1 http://nameling.net
2 http://www.wikipedia.org
3 http://www.kde.cs.uni-kassel.de/nameling/dumps



II RELATED WORK

The present work tackles the new problem of discovering and assessing relatedness of given names based on data from the social web for building search and recommendation systems which aid (not only) future parents in finding and choosing a suitable name.

Major parts of the underlying research questions are closely related to work on link prediction in the context of social networks as well as distributional similarity, where, more generally, semantic relations among named entities are investigated. However, this work focuses on the evaluation of similarity metrics relative to actual user preferences as expressed by interactions with names within a live system. This is related to the evaluation of recommender systems, which exist for many application contexts, such as movies [9], tags [12] and products [19].

Distributional Similarity & Semantic Relatedness The field of distributional similarity and semantic relatedness has attracted a lot of attention in the literature during the past decades (see [6] for a review). Several statistical measures for assessing the similarity of words have been proposed, as for example in [15, 17, 22, 27]. Notably, first approaches for using Wikipedia as a source for discovering relatedness of concepts can be found in [3, 8, 25].

Vertex Similarity & Link Prediction In the context of social networks, the task of predicting (future) links is especially relevant for online social networks, where social interaction is significantly stimulated by suggesting people as contacts whom the user might know. From a methodological point of view, most approaches build on different similarity metrics on pairs of nodes within weighted or unweighted graphs [13, 16, 20, 21]. A good comparative evaluation of different similarity metrics is presented in [18].

Nevertheless, usage data of systems such as Nameling is, to the best of our knowledge, new and was not available before. The present work combines approaches from the link prediction and recommendation tasks with a focus on the performance of structural similarity metrics based on co-occurrence networks obtained from Wikipedia.

III PRELIMINARIES

In this section, we want to familiarize the reader with the basic concepts and notations used throughout this paper.

1 GRAPH & NETWORK BASICS

A graph G = (V, E) is an ordered pair, consisting of a finite set V of vertices or nodes, and a set E of edges, which are two-element subsets of V. A directed graph is defined accordingly: E denotes a subset of V × V. For simplicity, we write (u, v) ∈ E in both cases for an edge belonging to E and freely use the term network as a synonym for a graph. In a weighted graph, each edge l ∈ E is given an edge weight w(l) by some weighting function w : E → R. For a subset U ⊆ V we write G|U to denote the subgraph induced by U. The density of a graph denotes the fraction of realized links (where n denotes the number of nodes and m the number of edges), i. e., 2m/(n(n−1)) for undirected graphs and m/(n(n−1)) for directed graphs (excluding self loops). The neighborhood Γ of a node u ∈ V is the set {v ∈ V | (u, v) ∈ E} of adjacent nodes. The degree of a node in a network measures the number of connections it has to other nodes. For the adjacency matrix A ∈ R^(n×n) with n = |V| holds Aij = 1 (Aij = w(i, j)) iff (i, j) ∈ E for any nodes i, j in V (assuming some bijective mapping from 1, . . . , n to V). We represent a graph by its according adjacency matrix where appropriate.

A path v0 →G vn of length n in a graph G is a sequence v0, . . . , vn of nodes with n ≥ 0 and (vi, vi+1) ∈ E for i = 0, . . . , n−1. A shortest path between nodes u and v is a path u →G v of minimal length. The transitive closure of a graph G = (V, E) is given by G* = (V, E*) with (u, v) ∈ E* iff there exists a path u →G v. A strongly connected component (scc) of G is a subset U ⊆ V such that u →G* v exists for every u, v ∈ U. A (weakly) connected component (wcc) is defined accordingly, ignoring the direction of edges (u, v) ∈ E.

2 VERTEX SIMILARITIES

Similarity scores for pairs of vertices based only on the surrounding network structure have a broad range of applications, especially for the link prediction task [18]. In the following we present all considered similarity functions, following the presentation given in [7], which builds on the extensions of standard similarity functions for weighted networks from [23].

The Jaccard coefficient measures the fraction of common neighbors:

    JAC(x, y) := |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|

The Jaccard coefficient is broadly applicable and commonly used for various data mining tasks. For weighted networks the Jaccard coefficient becomes:

    J̃AC(x, y) := ( Σ_{z ∈ Γ(x) ∩ Γ(y)} (w(x, z) + w(y, z)) ) / ( Σ_{a ∈ Γ(x)} w(a, x) + Σ_{b ∈ Γ(y)} w(b, y) )

The cosine similarity measures the cosine of the angle between the corresponding rows of the adjacency matrix, which for an unweighted graph can be expressed as

    COS(x, y) := |Γ(x) ∩ Γ(y)| / ( √|Γ(x)| · √|Γ(y)| ),

and for a weighted graph is given by

    C̃OS(x, y) := ( Σ_{z ∈ Γ(x) ∩ Γ(y)} w(x, z) · w(y, z) ) / ( √(Σ_{a ∈ Γ(x)} w(x, a)²) · √(Σ_{b ∈ Γ(y)} w(y, b)²) ).
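To make these definitions concrete, the following minimal sketch (our own illustration, not code from the paper) computes the unweighted and weighted Jaccard and cosine similarities on a toy co-occurrence graph stored as a nested dict of edge weights:

```python
import math

# Toy undirected co-occurrence graph: graph[u][v] = number of sentences
# in which the names u and v co-occur (stored in both directions).
graph = {
    "Peter": {"Paul": 30565, "Mary": 120},
    "Paul":  {"Peter": 30565, "Mary": 80},
    "Mary":  {"Peter": 120, "Paul": 80},
}

def jac(x, y):
    """Unweighted Jaccard coefficient: fraction of common neighbors."""
    gx, gy = set(graph[x]), set(graph[y])
    return len(gx & gy) / len(gx | gy) if gx | gy else 0.0

def jac_weighted(x, y):
    """Weighted Jaccard coefficient as defined above."""
    common = set(graph[x]) & set(graph[y])
    num = sum(graph[x][z] + graph[y][z] for z in common)
    den = sum(graph[x].values()) + sum(graph[y].values())
    return num / den if den else 0.0

def cos(x, y):
    """Unweighted cosine similarity of the two neighborhoods."""
    gx, gy = set(graph[x]), set(graph[y])
    return len(gx & gy) / (math.sqrt(len(gx)) * math.sqrt(len(gy))) if gx and gy else 0.0

def cos_weighted(x, y):
    """Weighted cosine similarity of the corresponding adjacency matrix rows."""
    common = set(graph[x]) & set(graph[y])
    num = sum(graph[x][z] * graph[y][z] for z in common)
    den = (math.sqrt(sum(w * w for w in graph[x].values()))
           * math.sqrt(sum(w * w for w in graph[y].values())))
    return num / den if den else 0.0

print(jac("Peter", "Paul"), cos("Peter", "Paul"))  # 0.333..., 0.5
```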

The preferential PageRank similarity is based on the well-known PageRank™ [2] algorithm. For a column-stochastic adjacency matrix A and damping factor α, the global PageRank vector ~w with uniform preference vector ~p is given as the fixpoint of the following equation:

    ~w = α A ~w + (1 − α) ~p

In case of the preferential PageRank for a given node i, only the corresponding component of the preference vector is set. For vertices x, y we set accordingly

    PPR(x, y) := ~w(x)[y],

that is, we compute the preferential PageRank vector ~w(x) for node x and take its y’th component. We also calculate an adapted preferential PageRank score by adopting the idea presented in [11], where the global PageRank score PR is subtracted from the preferential PageRank score in order to reduce frequency effects, and set

    PPR+(x, y) := PPR(x, y) − PR(x, y).
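A compact power-iteration sketch of the preferential PageRank and the adapted score PPR+ (our own illustration, assuming the weighted co-occurrence `graph` dict from the previous sketch and that no node has zero out-strength):

```python
def pagerank(graph, preference=None, alpha=0.85, iters=50):
    """Fixpoint iteration for  w = alpha * A * w + (1 - alpha) * p  with a
    column-stochastic weighted adjacency matrix A. If `preference` is None,
    p is uniform (global PageRank); otherwise all preference mass is put on
    the single node `preference` (preferential PageRank)."""
    nodes = list(graph)
    if preference is None:
        p = {v: 1.0 / len(nodes) for v in nodes}
    else:
        p = {v: 1.0 if v == preference else 0.0 for v in nodes}
    w = dict(p)
    strength = {v: sum(graph[v].values()) for v in nodes}   # out-strength, assumed > 0
    for _ in range(iters):
        nxt = {v: (1 - alpha) * p[v] for v in nodes}
        for u in nodes:
            for v, weight in graph[u].items():
                nxt[v] += alpha * w[u] * weight / strength[u]
        w = nxt
    return w

def ppr(x, y):
    """Preferential PageRank score PPR(x, y)."""
    return pagerank(graph, preference=x)[y]

def ppr_plus(x, y, global_pr):
    """Adapted score PPR+(x, y): preferential minus global PageRank."""
    return ppr(x, y) - global_pr[y]

global_pr = pagerank(graph)
print(ppr("Peter", "Paul"), ppr_plus("Peter", "Paul", global_pr))
```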

3 EVALUATION METRICS

Several metrics for assessing the performance of recommendation systems exist. We apply the mean average precision for obtaining a single-value performance score for a set Q of ranked predicted recommendations Pi with relevant documents Ri:

    MAP(Q) := (1/|Q|) Σ_{i=1}^{|Q|} AveP(Pi, Ri)

where the average precision is given by

    AveP(Pi, Ri) := (1/|Ri|) Σ_{k=1}^{|Ri|} [ Prec(Pi, k) · δ(Pi(k), Ri) ]

and Prec(Pi, k) is the precision for all predicted elements up to rank k, and δ(Pi(k), Ri) = 1 iff the predicted element at rank k is a relevant document (Pi(k) ∈ Ri). Refer to [28] for more details.
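As a small illustration (our own sketch, using the common formulation in which the sum runs over the ranks of the prediction list), average precision and MAP can be computed as follows:

```python
def average_precision(predictions, relevant):
    """AveP: precision at every rank that holds a relevant item, normalized by |R|."""
    hits, score = 0, 0.0
    for k, item in enumerate(predictions, start=1):
        if item in relevant:
            hits += 1
            score += hits / k        # Prec(P, k) at a relevant position
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over an iterable of (ranked predictions, set of relevant items) pairs."""
    runs = list(runs)
    return sum(average_precision(p, r) for p, r in runs) / len(runs)

print(mean_average_precision([(["Emma", "Anna", "Paul"], {"Anna", "Paul"})]))  # 0.583...
```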

IV A SEARCH ENGINE FOR GIVEN NAMES

Nameling is designed as a search engine and recommendation system for given names. The basic principle is simple: The user enters a given name and gets a browsable list of “relevant” names, called “namelings”. Figure 1(a) exemplarily shows namelings for the classical masculine German given name “Oskar”.

Figure 1: The user queries for the classical German given name “Oskar”. (a) Namelings; (b) Co-occurring names.

The list of namelings in this example (“Rudolf”, “Hermann”, “Egon”, . . .) exclusively contains classical German masculine given names as well. Whenever an according article in Wikipedia exists, categories for the respective given name are displayed, as, e. g., “Masculine given names” and “Place names” for the given name “Egon”. Via hyperlinks, the user can browse for namelings of each listed name or get a list of all names linked to a certain category in Wikipedia. Further background information for the query name is summarized in a corresponding details view, where, among others, the popularity of the name in different language editions of Wikipedia as well as in Twitter is shown. As depicted in Fig. 1(b), the user may also explore the “neighborhood” of a given name, i. e., names which often co-occur with the query name.

From a user’s perspective, the Nameling is a tool for finding a suitable given name. Accordingly, names can easily be added to a personal list of favorite names. The list of favorite names is shown on every page in the Nameling and can be shared with a friend, for collaboratively finding a given name.


The Nameling is based on a comprehensive list of given names, which was initially collected manually, but then populated by user suggestions. It currently covers more than 35,000 names from a broad range of cultural contexts. For different use cases, three different data sources are respectively used, as depicted in Fig. 2.

Figure 2: Nameling determines similarities among given names based on co-occurrence networks from Wikipedia, popularity of given names via Twitter and the social context of the querying user via facebook. (a) Co-Occurrence; (b) Popularity; (c) Social Context.

Wikipedia As a basis for discovering relations among given names, a co-occurrence graph is generated for each language edition of Wikipedia separately. That is, for each language, a corresponding data set is downloaded from the Wikimedia Foundation. Afterwards, for any pair of given names, the number of sentences where they jointly occur is determined. Thus, for every language, an undirected graph is obtained, where two names are adjacent if they occur together in at least one sentence within any of the articles, and the edge’s weight is given by the number of such sentences.
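A simplified sketch of this construction (not the actual Nameling pipeline; it assumes the article text has already been split into sentences and that `known_names` holds the list of given names):

```python
import itertools
from collections import defaultdict

known_names = {"Peter", "Paul", "Mary"}   # placeholder for the full name list

def build_cooccurrence_graph(sentences):
    """For every pair of known names, count the sentences in which both occur
    and return an undirected weighted edge dict keyed by sorted name pairs."""
    weights = defaultdict(int)
    for sentence in sentences:
        names = sorted(known_names & set(sentence.split()))
        for a, b in itertools.combinations(names, 2):
            weights[(a, b)] += 1
    return weights

edges = build_cooccurrence_graph(["Peter met Paul .", "Paul and Mary and Peter ."])
print(dict(edges))
# {('Paul', 'Peter'): 2, ('Mary', 'Paul'): 1, ('Mary', 'Peter'): 1}
```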

Relations among given names are established by calculating a vertex similarity score between the corresponding nodes in the co-occurrence graph. Currently, namelings are calculated based on cosine similarity (cf. Section III).

Twitter For assessing the up-to-date popularity of given names, a random sample of tweets in Twitter is constantly processed via the Twitter streaming API4. For each name, the number of tweets mentioning it is counted.

facebook Optionally, a user may connect the Nameling with facebook5. If the user allows the Nameling to access his or her profile information, the given names of all contacts in facebook are collected anonymously. Thus, a “social context” for the user’s given name is recorded. Currently, the social context graph is too small for implementing features based on it, but it will be a valuable source for discovering and evaluating relations among given names.

V SEMANTICS OF SOCIAL CO-OCCURRENCES

The basic idea behind the Nameling was to find relations among given names based on user-generated content in the social web. The most basic relation among such entities can be observed when they occur together within a given atomic context. In case of Wikipedia, such co-occurrences were counted based on sentences.

Considering the German and English Wikipedia separately, undirected weighted graphs WikiDE and WikiEN are obtained, where name nodes u and v are connected and labeled with weight c, if u and v co-occurred in exactly c sentences. For example, the given names “Peter” and “Paul” co-occurred in 30,565 sentences within the English Wikipedia. Accordingly, there is an edge (Peter, Paul) in WikiEN with a corresponding edge weight. For the present analysis, the co-occurrence networks are derived from the official Wikipedia data dumps, which are freely available for download,6 whereby the English dump was dated 2012-01-05 and the German dump 2011-12-12. The following experiments focus on the English Wikipedia unless explicitly stated otherwise.

When Nameling was implemented, the choice of the applied similarity metric on those co-occurrence networks was inspired by previous work on emergent semantics of social tagging systems [22], but was only based on manual inspection of a (non-representative) sample. This section aims at grounding the obtained notions of relatedness among given names by comparing the different similarity metrics with an external reference similarity.

We built such a reference relation on the set of given names from the on-line dictionary Wiktionary7 by extracting all category assignments from pages corresponding to given names. Thus, for each of 10,938 given names a respective binary vector was built, where each component indicates whether the corresponding category was assigned to it (in total 7,923 different categories and 80,726 non-zero entries). These vectors were then used for assessing relatedness among given names, based on a categorization which was explicitly established by the authors of the corresponding entries in Wiktionary.

4 https://dev.twitter.com/docs/api/1/get/statuses/sample
5 http://www.facebook.com
6 http://dumps.wikimedia.org/backup-index.html
7 http://www.wiktionary.org

In detail: For any pair (u, v) of names in the co-occurrence network which have a category assignment, we calculated the cosine similarity COS(u, v) based on the respective category assignment vectors as well as each of the similarity metrics s(u, v) on the co-occurrence graph as described in Section III. As the number of data points (COS(u, v), s(u, v)) grows quadratically with the number of names, we grouped the co-occurrence-based similarity scores into 1,000 equidistant bins (in case of COS and JAC) or logarithmic bins (in case of PPR and PPR+). For each bin, the average cosine similarity based on category assignments was calculated, as shown in Figure 3.

Figure 3: Similarity based on name categories in Wiktionary vs. vertex similarity in the co-occurrence networks (weighted and unweighted), whereby the null model was obtained by shuffling the mapping of names to categories. (Both panels plot co-occurrence similarity against category similarity; the left panel shows COS and JAC, the right panel PPR and PPR+.)
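The binning procedure used for Figure 3 could be sketched roughly as follows (our own illustration; `cooc_sim` and `cat_sim` are assumed to be dictionaries mapping name pairs to the co-occurrence-based and the category-based similarity, respectively):

```python
import numpy as np

def binned_category_similarity(cooc_sim, cat_sim, bins=1000):
    """Group name pairs into equidistant bins of co-occurrence similarity and
    return, for each non-empty bin, its left edge and the mean category similarity."""
    pairs = [p for p in cooc_sim if p in cat_sim]
    x = np.array([cooc_sim[p] for p in pairs])
    y = np.array([cat_sim[p] for p in pairs])
    edges = np.linspace(x.min(), x.max(), bins + 1)
    idx = np.digitize(x, edges[1:-1])        # bin index in 0 .. bins-1 per pair
    return [(edges[b], y[idx == b].mean()) for b in range(bins) if np.any(idx == b)]
```

For PPR and PPR+ the bins would be logarithmic (e.g. built with np.logspace) rather than equidistant.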

Notably, all considered similarity metrics capture a positive correlation between similarity in the co-occurrence network and similarity between category assignments to names. But significant differences between the applied similarity functions can be observed. The weighted cosine similarity performs very well, firstly in showing a steep slope and secondly in exhibiting a stable monotonous curve progression. The unweighted Jaccard coefficient shows an even more pronounced linear progression, but is less stable for higher similarity scores, whereas the weighted Jaccard coefficient shows a higher correlation with the reference similarity for high similarity scores. It is worth noting that the cosine similarity is only marginally affected by the edge weights of the co-occurrence graph, whereas the weighted Jaccard coefficient significantly differs from its unweighted variant. In case of the PageRank-based similarity metrics, the weighted variants consistently outperform the corresponding unweighted variants, and the adapted preference PageRank function PPR+ outperforms the plain preference PageRank.

For further investigating the interplay of structural similarity in the co-occurrence graph and categories of names, we look in detail at the most prominent attribute of a given name which can be obtained from Wiktionary’s categorization, namely its gender. We ask which fraction of the top similar names are of the same gender as the query name. For this purpose, we selected for each gender the set of names with a corresponding distinct gender categorization in Wiktionary (in particular ignoring gender-neutral names), which resulted in sets of 2,725 male and 2,361 female names. For each of these names and every considered similarity metric, we calculated the 100 most similar names based on Wikipedia’s co-occurrence graph WikiEN. For each name u and each k = 1, . . . , 100 we then calculated the fraction of names up to position k which have the same gender as u. Figure 4 shows the obtained results, averaged over all names of the same gender. Most notably, both the weighted and unweighted variants of the cosine similarity as well as the unweighted Jaccard similarity show a precision score of around 80% both for male and female names. For all other considered similarity metrics the obtained precision score varies largely, depending on the variant and gender.

As for the weighted Jaccard similarity J̃AC and the preference PageRank similarity in its weighted and unweighted variants (P̃PR and PPR), a strong bias towards male names can be observed. This bias is due to the skewed distribution of male and female names within Wikipedia, where the considered male names occurred in 67,076,455 sentences, in contrast to 29,711,215 occurrences of the female names. This bias is even more pronounced in the co-occurrence graphs, as shown in Figure 5, where the distributions of node strengths (i. e., the sum of all adjacent edge weights) in WikiEN for male and female names are depicted separately.

Concerning the adapted preference PageRank, the unweighted variant PPR+ performs well for female names, but only slightly better than random for male names, conversely to the weighted variant P̃PR+. Nevertheless, the effect of considering the preference PageRank relative to the global PageRank (cf. Section III) can clearly be observed.

We conclude that all considered similarity metrics on the co-occurrence networks obtained from Wikipedia capture a notion of relatedness which correlates with an external, semantically motivated notion of relatedness among given names.


Figure 4: Average fraction of names having the same gender as the query name among the top k similar names (separate panels for female and male given names; curves for COS, JAC, PPR and PPR+ in their weighted and unweighted variants, plus the null model).

Figure 5: Distribution of node strengths in the co-occurrence graph WikiEN for male and female names separately.

VI USAGE DATA

The results presented in the previous section indicate correlations between semantic relatedness among given names and structural similarity within co-occurrence networks obtained from Wikipedia.

This section aims at assessing the performance of the different structural similarity metrics with respect to actual interactions of users within Nameling. For this purpose, we considered the Nameling’s activity log entries within the time range 2012-03-06 until 2012-08-10. In the following, we firstly describe the collected usage data, then analyze properties of emerging network structures among names and users, and finally compare interrelations between the different networks.

In total, 38,404 users issued 342,979 search requests. Subsequently, we differentiate between the following activities:

• “Enter”: A user manually entered a given name into the search mask.

• “Click”: A user followed a link to a name within a result list.

• “Favorite”: A user added a given name to his/her list of favorite names.

• “Nameling”: All search requests together

Table 1 summarizes high-level statistics for these activity classes, showing, e. g., that 35,684 users entered 16,498 different given names.

Table 1: Basic statistics for the different activities within the Nameling.

              #Users   #Names
    Enter     35,684   16,498
    Click     22,339   10,028
    Favorite   1,396    1,558

For analyzing how different users contribute to the Nameling’s activities, Figure 6 shows the distribution of activities over the set of users, separately for Enter, Click and Favorite requests. Clearly, all activities exhibit long-tailed distributions, that is, most users entered fewer than 20 names, but there are also users with more than 200 requests.

To get a glimpse of the actual usage patterns within the Nameling and the distribution of names within Wikipedia, Table 2 exemplarily shows the most popular names for the different activity classes as well as for the considered data sets derived from Wikipedia. Thereby we assess a name’s “popularity” by its frequency within the corresponding Wikipedia dump or the number of search queries for the name within Nameling. Firstly, it is worth noting that indeed in the German Wikipedia German given names are dominant, whereas in the English Wikipedia, accordingly, English given names are the most popular. Secondly, both language editions of Wikipedia are dominated by male given names. As for the Nameling, users (still) mostly originate in Germany, and accordingly the corresponding query logs are dominated by search requests for German given names, both male and female.

For a more formal analysis of the relationship between the popularity induced by search queries in the Nameling and corresponding frequencies within a Wikipedia corpus, we calculated Kendall’s τ rank correlation coefficient [14] pairwise on the common set of names for all considered activity classes and both language editions of Wikipedia, as shown in Table 3.


Figure 6: Number of users per query count for Enter, Click and Favorite activities.

Table 2: Most popular names separately for Enter, Click and Favorite activities, as well as global popularity in Nameling and popularity in Wikipedia.

    Enter          Click            Favorite    Nameling (global)  Wikipedia (EN)      Wikipedia (DE)
    Emma    1,433  Emma      1,461  Emma    62  Emma       3,073   John     2,474,408  Friedrich  356,636
    Anna    1,125  Jonas     1,010  Lina    49  Anna       2,178   David    1,130,766  Karl       348,879
    Paul    1,073  Emil        965  Ida     40  Paul       1,918   William  1,100,253  Hans       309,734
    Julia     942  Anna        947  Jakob   39  Jonas      1,856   James    1,051,219  Peter      304,516
    Greta     895  Alexander   851  Felix   37  Emil       1,816   George     973,210  Johann     298,575
    Michael   878  Daniel      819  Oskar   36  Michael    1,771   Robert     871,368  John       295,563


Table 3: Kendall’s τ rank correlation coefficient, pairwise for the popularity-induced rankings in the different systems.

                    Enter  Click  Favorite  Nameling  WikiDE  WikiEN
    Enter             −
    Click            0.62    −
    Favorite         0.54   0.60      −
    Nameling         0.86   0.79     0.59       −
    Wikipedia (DE)   0.35   0.35     0.33      0.40      −
    Wikipedia (EN)   0.28   0.26     0.25      0.33     0.69      −

Firstly, we note that the most pronounced correlations are observed for pairs of popularity rankings within one system. Nevertheless, rankings induced by search queries in the Nameling and those induced by frequency within Wikipedia are also positively correlated, which is slightly more pronounced for the German Wikipedia (e. g., τ = 0.40 for the global ranking over all activity classes in the Nameling and the German Wikipedia versus τ = 0.33 for the English Wikipedia).
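Such rank correlations can be computed directly from the per-system name frequencies, e.g. with SciPy (a sketch under the assumption that the two popularity counts are given as dictionaries over name strings):

```python
from scipy.stats import kendalltau

def popularity_correlation(counts_a, counts_b):
    """Kendall's tau between two popularity rankings, restricted to common names."""
    common = sorted(set(counts_a) & set(counts_b))
    tau, p_value = kendalltau([counts_a[n] for n in common],
                              [counts_b[n] for n in common])
    return tau, p_value

# toy usage with hypothetical counts
print(popularity_correlation({"Emma": 3073, "Anna": 2178, "Paul": 1918},
                             {"Emma": 120, "Anna": 300, "Paul": 80}))
```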

Another type of interdependence among the different activity classes can be revealed by applying association rule mining [1], which is commonly used, e. g., for market basket analysis. Table 4 exemplarily shows a few association rules obtained from all favorite name activities within the Nameling, whereby all requests of a user are aggregated into a single transaction. Many more rules with various support and confidence levels can be found and directly be used for implementing a recommendation system based on other users’ search patterns.

Table 4: Examples of association rules obtained from the Nameling’s activity log. The second example reads as follows: 85.71% of the 7 users which have added “Charlotte” and “Johanna” as favorite names also added “Emma” to the favorite name list.

    Rule                        Support  Confidence
    Emma ← Teresa                     6      83.33%
    Emma ← Charlotte, Johanna         7      85.71%
    Ida  ← Frieda, Lina               8      87.50%
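The support and confidence values in Table 4 can in principle be reproduced by a simple count over per-user transactions (a toy sketch; the paper relies on the Apriori-style algorithm from [1] to enumerate rules):

```python
def rule_stats(transactions, antecedent, consequent):
    """Support: number of transactions containing the antecedent (as read from
    the caption of Table 4); confidence: fraction of those that also contain
    the consequent."""
    ante = set(antecedent)
    with_ante = [t for t in transactions if ante <= set(t)]
    with_both = [t for t in with_ante if consequent in t]
    confidence = len(with_both) / len(with_ante) if with_ante else 0.0
    return len(with_ante), confidence

# toy usage: each transaction is one user's set of favorite names
favorites = [{"Charlotte", "Johanna", "Emma"}, {"Charlotte", "Johanna", "Lina"}]
print(rule_stats(favorites, {"Charlotte", "Johanna"}, "Emma"))   # (2, 0.5)
```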

For assessing the interdependence of name contexts established by search queries within the Nameling and co-occurrences within Wikipedia, we further constructed for each activity class C ∈ {Enter, Click, Favorite} and each node type T ∈ {User, Name} a corresponding weighted projection graph G^T_C, where an edge (u, v) exists with weight k, if users u and v have searched for k common names (or, respectively, names a and b are connected by an edge with label k, if k users searched for both names a and b). Table 5 summarizes various high-level network statistics for all considered projection graphs. It is worth noting that G^User_Enter is the largest graph, indicating that most users directly entered names, whereof 62% actively followed proposed names by clicking on corresponding links. All projection graphs encompass a giant connected component [24], giving rise to a notion of relatedness among users and names, respectively, based on the implicitly expressed interest in names.

Table 5: High level statistics for all considered projection graphs with the number of weakly connected components #wcc and largest weakly connected components lwcc.

                         |V|          |E|   #wcc    lwcc   D
    G^User_Enter      35,113   31,118,016     18  35,078   7
    G^Name_Enter      15,992      996,930     72  15,834   7
    G^User_Click      21,925    6,371,132     28  21,865   9
    G^Name_Click       9,665    2,633,220     79   9,465  10
    G^User_Favorite    1,205       42,770     25   1,155   7
    G^Name_Favorite    1,365       60,124     23   1,306   7
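A sketch of this projection onto names (our own illustration; `searched` maps each user to the set of names he or she searched for):

```python
import itertools
from collections import defaultdict

def name_projection(searched):
    """Connect two names by an edge of weight k if k users searched for both."""
    edge_weights = defaultdict(int)
    for names in searched.values():
        for a, b in itertools.combinations(sorted(names), 2):
            edge_weights[(a, b)] += 1
    return edge_weights

searched = {"u1": {"Emma", "Anna"}, "u2": {"Emma", "Anna", "Paul"}}
print(dict(name_projection(searched)))
# {('Anna', 'Emma'): 2, ('Anna', 'Paul'): 1, ('Emma', 'Paul'): 1}
```

The user projection G^User_C would be obtained analogously, by iterating over names and counting shared users.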

For analyzing the strength of interdependence among users and names induced by the set of searched names, Figure 7 shows the distribution of the number of common names among pairs of users and the number of common users among pairs of names, respectively. All distributions exhibit long-tailed characteristics. For example, there are around ten million pairs of users which have clicked on a single common name and only around 100,000 pairs which clicked on two common names. But there are also pairs of users which have clicked on more than 200 common names.

Please note that these usage graphs themselves can be used for calculating similarity among given names, e. g., by applying the same similarity metrics as discussed for the co-occurrence graphs. Table 6 exemplarily shows similar names for the given name “Kevin” as calculated with the adapted preferential PageRank on the co-occurrence graph WikiEN and on the shared search graph G^Name_Enter. While the former gives rise to a list of American male given names (as expected), the latter yields a list of American and French male as well as female given names. It is therefore natural to ask whether and to which extent the network structures obtained from Wikipedia and Nameling’s usage data interrelate.

Table 6: Similar names for the given name “Kevin”, calculated with the weighted adapted preferential PageRank on Wikipedia’s co-occurrence graph (left) and on the shared search graph (right).

    WikiEN    G^Name_Enter
    Tim       Chantal
    Danny     Justin
    Jason     Jaqueline
    Jeremy    Sky
    Nick      Chantalle

The quadratic assignment procedure (QAP) test is an approach for inter-network comparison which is common in the literature. It is based on the correlation of the adjacency matrices of the considered graphs [4, 5].

For given graphs G1 = (V1, E1) and G2 = (V2, E2) with U := V1 ∩ V2 ≠ ∅ and adjacency matrices Ai corresponding to Gi|U (Gi reduced to the common vertex set U, cf. Section III), the graph covariance is given by

    cov(G1, G2) := (1/(n² − 1)) Σ_{i=1}^{n} Σ_{j=1}^{n} (A1[i, j] − µ1)(A2[i, j] − µ2)

where µi denotes Ai’s mean (i = 1, 2). Then var(Gi) := cov(Gi, Gi), leading to the graph correlation

    ρ(G1, G2) := cov(G1, G2) / √( var(G1) · var(G2) ).

The QAP test compares the observed graph correlation to the distribution of correlation scores obtained from repeated random row/column permutations of A2. The fraction of permutations π with correlation ρπ ≥ ρo is used for assessing the significance of an observed correlation score ρo. Intuitively, the test determines (asymptotically) the fraction of all graphs with the same structure as G2|U having at least the same level of correlation with G1|U.
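A sketch of the QAP test on the two adjacency matrices restricted to the common vertex set (our own illustration using NumPy; the Pearson correlation of the flattened matrices equals the ρ defined above, since the 1/(n²−1) factors cancel):

```python
import numpy as np

def graph_correlation(a1, a2):
    """Correlation of the entries of two (equally indexed) adjacency matrices."""
    return np.corrcoef(a1.ravel(), a2.ravel())[0, 1]

def qap_test(a1, a2, permutations=1000, seed=0):
    """Observed correlation and fraction of random node relabelings of a2
    whose correlation with a1 is at least as large (smaller is more significant)."""
    rng = np.random.default_rng(seed)
    observed = graph_correlation(a1, a2)
    count = 0
    for _ in range(permutations):
        perm = rng.permutation(a2.shape[0])
        if graph_correlation(a1, a2[np.ix_(perm, perm)]) >= observed:
            count += 1
    return observed, count / permutations
```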

Table 7 shows the pairwise correlation scores for all considered name-based networks.

Table 7: Graph-level correlations for all pairs of considered networks. All observed correlations are significant according to the quadratic assignment procedure (QAP).

                     G^Name_Enter  G^Name_Click  G^Name_Favorite  WikiDE  WikiEN
    G^Name_Enter          −
    G^Name_Click        0.466           −
    G^Name_Favorite     0.370         0.364            −
    WikiDE              0.138         0.110          0.040           −
    WikiEN              0.054         0.056          0.032         0.406      −


Figure 7: Distribution of the number of common names between pairs of users and the number of common users between pairs of names in the different projection graphs corresponding to Enter, Click and Favorite activities.

Again, correlations are more pronounced within a system (e. g., 0.406 for the co-occurrence networks of the English and German Wikipedia). But the graph-level correlations of the networks obtained from the Nameling are significantly higher with the co-occurrence network obtained from the German Wikipedia, indicating that the dominance of German users has an impact on the emerging name contexts within the search-based networks, which are consequently more closely related to the corresponding language-dependent network structure within the co-occurrence network from Wikipedia.

VII PERFORMANCE OF STRUCTURAL SIMILARITY METRICS

Up to now, the performance of the various structural similarity metrics presented in Section III has been evaluated with respect to an external notion of semantic relatedness, as summarized in Section V. This section presents a comparative evaluation of the considered similarity metrics’ performance with respect to actual name preferences of users as indicated by the usage data presented in Section VI.

For this purpose, an according ranking scenario is formulated, where for a given (set of) name(s), a ranking of all names is obtained by calculating pairwise similarity with all other names based on the considered similarity metrics with respect to the co-occurrence networks obtained from Wikipedia. Then, an established evaluation metric is applied for assessing the relative quality of the different similarity functions.

Firstly, we need a “ground truth” as a reference for evaluating the performance of a given ranking on the set of given names. Considering the usage data presented in Section VI, a first natural choice would be to predict for each search request of a user the names which will be added as favorites. But thereby, the evaluation would be biased towards the similarity metric which was implemented in Nameling during the period of evaluation. Thus, this evaluation target is only valid in a live setting, where different ranking systems are comparatively presented to a user.

We therefore consider for each user u the corresponding

• set of names which were directly entered (Enter_u),
• the set of names the user clicked on (Click_u) and
• the set of u’s favorite names (Favorite_u).

We argue that by directly entering a name into the search mask, a user already expresses interest in the corresponding name. Furthermore, a name which a user entered into the search mask is not directly influenced by the names which were presented based on the similarity function implemented in Nameling during the period of evaluation.


We evaluate the performance of a structural similarity metric SIM by randomly selecting just one name i ∈ Enter_u, calculating the average precision score AveP(SIM(i, ∗), Enter_u) for every user u, and then calculating the mean average precision MAP (cf. Section III). For reference, we also computed as a baseline the MAP score for randomly ordered given names (labeled with “RND” in the corresponding figures). All results in this section are averaged over five repetitions of the corresponding random experiments.
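The per-user evaluation loop can be sketched as follows (our own illustration; `enter_sets` maps users to their Enter_u sets, `rank_names(query)` is assumed to return all known names ordered by the similarity metric under consideration, and `average_precision` is the function from the sketch in Section III):

```python
import random

def evaluate_metric(enter_sets, rank_names, seed=0):
    """Leave-one-out MAP: for each user, pick one entered name as the query
    and score the ranking against the user's remaining entered names."""
    random.seed(seed)
    ap_scores = []
    for user, names in enter_sets.items():
        if len(names) < 2:
            continue                       # nothing left to predict
        query = random.choice(sorted(names))
        relevant = names - {query}
        ap_scores.append(average_precision(rank_names(query), relevant))
    return sum(ap_scores) / len(ap_scores) if ap_scores else 0.0
```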

Figure 8 shows the obtained results for each considered similarity metric in its weighted and unweighted variant, separately evaluated on the Enter, Click and Favorite sets. First of all, we have to note that all MAP scores are very low. Partly, this is due to the uncleaned evaluation data set, which also contains search queries for names which are not contained in the list of known names and therefore could not be listed by the considered similarity metrics. Also, we refrained from requiring a minimum number of search queries per user (apart from the fact that there must be at least one to predict). This renders the task of predicting “missing” names even harder, as the profile of a user who searched for more than 20 names is expected to be more consistent than the profile of a user who just entered two search queries (depending on the similarity metric, the results improve by 40% to 150% if only users with more than 10 search queries are considered). We argue that the usage data which is taken as a reference should be applied without further adjustments, as other ranking algorithms may overcome some of the mentioned factors. Such improvements can then be assessed by applying the same evaluation to the new system and comparing the gain in MAP over the present approaches.

In any case, all considered similarity metrics show significantly better performance than the random baseline. We firstly consider the Enter sets. Notably, all but PPR+ are negatively affected by the weights in the co-occurrence graph. On the other hand, the performance of PPR+ drops to the level of the considered baseline in the unweighted case. The dependence of PPR+ on the edge weights is in line with the motivation of its design, as the global PageRank score is subtracted to reduce the impact of global frequencies, which are absent in the unweighted case. Please note that the influence of weights is discussed in the related field of link prediction in social networks (cf. [7, 21]). Altogether, the PageRank-based similarity metrics outperform the other metrics.

As for the Click set, the weighted cosine similarity COS significantly outperforms all other metrics. This is in line with the bias towards the implemented similarity metric within Nameling, as all result lists were ordered according to the weighted cosine similarity during the time period of evaluation. Considering the Favorite set, the unweighted Jaccard coefficient JAC performs surprisingly well, outperforming even the weighted cosine similarity.

Summing up, the results of the experiment described above indicate that the weighted variant of the adjusted PageRank similarity should be considered for ranking result lists within Nameling as an alternative to the weighted cosine similarity, as it consistently performs well on all considered test sets.

VIII RANKINGS FOR MULTI TERM QUERIES

A user might be interested in names which are most suitable for a given set of names (consider, e. g., the given names of both parents or possible siblings) rather than for just a single name. Therefore, we also applied a “take-k-in” evaluation paradigm, where, for each user u, we randomly selected k names from his or her search profile Enter_u. These names are then used for predicting the remaining names. The prediction accuracy is evaluated by calculating the mean average precision with respect to the user’s remaining search profile. As the size of the corresponding evaluation data varies with k and induces a bias towards lower k, we additionally select n − k names from Enter_u with n := |Enter_u|. These names are then ignored while calculating the corresponding average precision score.

For similarity metrics based on the preference PageRank, the combination of multiple query names is straightforward, by giving equal preference to each of the queried names. At least for the basic preference PageRank, the linearity of the preference PageRank [10] allows rankings for each query name to be calculated separately in advance. These pre-calculated rankings can then be combined by means of the applied database management system.

For vector-based similarity metrics such as COS and JAC (cf. Section III), it is straightforward to combine multiple query vectors just by calculating the corresponding centroid. But it is not obvious whether the centroid vector is still semantically grounded, that is, for example, whether the combination of two male given names is again most similar to other male given names. Alternatively, each name could be queried separately and the respective result sets combined afterwards. The latter would be of practical use, as it is computationally not feasible to calculate all pairs of cosine similarity online in the running system and, again, similarity scores are pre-calculated for each possible query term and combined by means of the database management system.


Figure 8: Mean average precision for all considered structural similarity metrics and activity classes (Enter, Click, Favorite) in the weighted (blue) and unweighted (red) case.

Therefore, the following four types of combining multiple query vectors ~u1, . . . , ~un or the respective result lists corresponding to query names u1, . . . , un were evaluated:

• Query relative to the centroid vector ~u:

    ~u(i) := (1/n) Σ_{j=1}^{n} ~uj(i)    (COMB)

• Average the resulting similarity scores:

    SIM({u1, . . . , un}, i) := (1/n) Σ_{j=1}^{n} SIM(uj, i)    (AVG)

• Take the minimum similarity score:

    SIM({u1, . . . , un}, i) := min_{j=1,...,n} SIM(uj, i)    (MIN)

• Take the maximum similarity score:

    SIM({u1, . . . , un}, i) := max_{j=1,...,n} SIM(uj, i)    (MAX)
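A sketch of the three score-aggregation schemes (our own illustration; `sim` stands for any of the pairwise similarity functions above, and the toy usage reuses the `cos_weighted` function from the earlier sketch; COMB instead queries with the centroid of the query vectors themselves):

```python
def combine_scores(sim, query_names, candidate, scheme="AVG"):
    """Aggregate pairwise similarities of a candidate name against all query names."""
    scores = [sim(q, candidate) for q in query_names]
    if scheme == "AVG":
        return sum(scores) / len(scores)
    if scheme == "MIN":
        return min(scores)
    if scheme == "MAX":
        return max(scores)
    raise ValueError("COMB operates on the query vectors, not on the scores")

# toy usage with the weighted cosine similarity from Section III
print(combine_scores(cos_weighted, ["Peter", "Paul"], "Mary", scheme="MAX"))
```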

Figure 9 shows the results obtained on the Enter set for the different similarity metrics in all considered variants.

As for the adapted preferential PageRank-based similarity metrics, we note a constant improvement in terms of MAP with an increasing number of queried names, although the weighted variant significantly outperforms the unweighted one. The plain preferential PageRank’s MAP scores are not affected by the varying number of “known” names. This is probably caused by the dominance of the global network structure, which is explicitly reduced by construction in case of the adapted preferential PageRank.

The results obtained for the unweighted and weighted Jaccard coefficient JAC show a progression in terms of MAP score with an increasing number of names for AVG, COMB and MAX. In its weighted variant, only MAX and COMB yield better results with increasing k. In both variants, the MIN result aggregation decreases MAP performance.

As for the cosine similarity, we first note that for the weighted cosine similarity C̃OS on the Enter set, the average query combination scheme yields the best results, even outperforming the computationally more involved centroid approach. Concerning the relative ranking of the different combination approaches, the unweighted and weighted cosine similarity show no consistent behavior.

Summing up, the results presented in this section indicate that the cosine similarity and the adjusted preference PageRank are candidates for recommending similar names based on co-occurrences of given names within Wikipedia. Altogether, the latter showed more stable and consistent results in the different settings. For recommendations based on a single query term it yielded better results than the weighted cosine similarity, though slightly worse than the plain preference PageRank on the unweighted co-occurrence graphs. But it showed consistently good performance scores even on the evaluation sets Click and Favorite, which are strongly biased towards the weighted cosine similarity. Also, the combination of multiple query terms is straightforward and showed stable and consistent improvements with an increasing number of query terms. Nevertheless, the much simpler weighted cosine similarity showed comparable results, or even better performance scores, in case of the combination of multiple query terms.

Of course, the usage data can be used for personalizing search results or implementing recommendation systems based on a user’s own name preferences and those of similar users. Even the very simple baseline recommender which just recommends the most popular names yields results which significantly outperform all static name rankings considered in this paper. Building recommendation systems which apply collaborative filtering techniques [26] is expected to yield results which outperform those presented above by an order of magnitude. A thorough discussion and evaluation of such recommendation systems is out of the scope of the present work and is subject to future research.


Figure 9: Mean average precision for all considered structural similarity metrics (weighted and unweighted) for varying number of known entered names k and different combinations of query vectors (COMB, AVG, MIN, MAX).


IX CONCLUSION & FUTURE WORK

The present work introduces the research task of discovering relations among given names. A new approach for ranking names based on co-occurrence graphs is proposed and evaluated. For this, usage data from the running system Nameling is firstly introduced and thoroughly analyzed and then used for setting up an experimental framework for evaluating the performance of name rankings. The results presented in Sections VII, VIII and V form a basis for deciding which similarity metric can be used for ranking names relative to a given set of query terms.

By making all considered usage data publicly available, other researchers are invited to build and evaluate new ranking and recommendation systems. The success of the Nameling indicates that there is a need for such recommendation systems, and the inherent interdisciplinary research questions render the task of discovering relations among given names fascinating and challenging to tackle.

For future work we plan to implement personalized recommendation and ranking systems, thereby incorporating further influencing factors, such as, e. g., the geographic distance among users. Additionally, we plan to implement an open and flexible recommendation framework which will allow other researchers to directly integrate and evaluate their approaches within the running system.

We conclude by observing that the field of recommending given names is lacking scientific contributions. Yet many approaches for discovering relations exist, ready to be adapted to this domain. We expect that further research will support future parents in finding and choosing a suitable given name.

Acknowledgements This work has been partially supported by the Commune project funded by the Hertie foundation.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), pages 478–499. Morgan Kaufmann, September 1994.

[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Computer Networks and ISDN Systems, pages 107–117, 1998.

[3] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proc. of EACL, volume 6, pages 9–16, 2006.


[4] C. Butts. Social network analysis: A methodological introduction. Asian J. of Soc. Psychology, 11(1):13–41, 2008.

[5] C. T. Butts and K. M. Carley. Some simple algorithms for structural comparison. Comput. Math. Organ. Theory, 11:291–305, December 2005.

[6] T. Cohen and D. Widdows. Empirical distributional semantics: methods and biomedical applications. J Biomed Inform, 42(2):390–405, Apr. 2009.

[7] H. de Sá and R. Prudencio. Supervised link prediction in weighted networks. In Neural Networks (IJCNN), The 2011 Int. Joint Conference on, pages 2281–2288. IEEE, 2011.

[8] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of the 20th Int. Joint Conference on Artificial Intelligence, volume 6, page 12. Morgan Kaufmann Publishers Inc., 2007.

[9] J. Golbeck and J. Hendler. FilmTrust: Movie recommendations using trust in web-based social networks. In Proceedings of the IEEE Consumer Communications and Networking Conference, volume 96. Citeseer, 2006.

[10] T. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. Knowledge and Data Engineering, IEEE Transactions on, 15(4):784–796, 2003.

[11] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. Information retrieval in folksonomies: Search and ranking. In Y. Sure and J. Domingue, editors, The Semantic Web: Research and Applications, volume 4011 of Lecture Notes in Computer Science, pages 411–426, Heidelberg, June 2006. Springer.

[12] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommendations in social bookmarking systems. AI Communications, 21(4):231–247, 2008.

[13] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Proc. of the eighth ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 538–543, New York, NY, USA, 2002. ACM.

[14] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[15] T. Landauer and S. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211, 1997.

[16] E. Leicht, P. Holme, and M. Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.

[17] M. Lesk. Word-word associations in document retrieval systems. American Documentation, 20(1):27–38, 1969.

[18] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. of the American Society for Inf. Science and Technology, 58(7):1019–1031, 2007.

[19] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. Internet Computing, IEEE, 7(1):76–80, 2003.

[20] L. Lü, C. Jin, and T. Zhou. Similarity index based on local paths for link prediction of complex networks. Physical Review E, 80(4):046122, 2009.

[21] L. Lü and T. Zhou. Link prediction in weighted networks: The role of weak ties. EPL (Europhysics Letters), 89:18001, 2010.

[22] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme. Evaluating similarity measures for emergent semantics of social tagging. In 18th Int'l WWW Conference, pages 641–641, 2009.

[23] T. Murata and S. Moriyasu. Link prediction of social networks based on weighted proximity measures. In Web Intelligence, IEEE/WIC/ACM Int. Conference on, pages 85–88. IEEE, 2007.

[24] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

[25] M. Strube and S. P. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proc. of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI’06, pages 1419–1424. AAAI Press, 2006.

[26] X. Su and T. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4, 2009.

[27] P. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proc. of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[28] E. Voorhees, D. Harman, and the National Institute of Standards and Technology (US). TREC: Experiment and Evaluation in Information Retrieval, volume 63. MIT Press, Cambridge, 2005.
