Search in Social and Navigation Networks Denis Helic TU Graz, KTI May 13, 2013 Helic (KTI) Search in Networks May 13, 2013 1 / 98


Page 1: Search in Social and Navigation Networks

Search in Social and Navigation Networks

Denis Helic

TU Graz, KTI

May 13, 2013

Helic (KTI) Search in Networks May 13, 2013 1 / 98

Page 2: Search in Social and Navigation Networks

Navigation in Social Networks

[Figure labels: Nebraska stockholders, Nebraska random, Boston random, Boston stockbroker, Target.]

Page 3: Search in Social and Navigation Networks

Navigation in social networks

Two experimental findings

There are short paths in social networks

People find those short paths efficiently

Local search

Watts and Kleinberg explained the phenomenon


Page 4: Search in Social and Navigation Networks

Network Navigability

Definition

A network is navigable if and only if there is a short path between all or almost all pairs of nodes in the network.

Formally:

1 There exists a giant component

2 The effective diameter is low, i.e. bounded by log(n), where n is the number of nodes in the network
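The two conditions above can be checked directly on a small graph. The following is a minimal sketch, not the lecture's own procedure. It assumes an undirected graph stored as an adjacency dict, treats a component covering at least 90% of the nodes as "giant", and takes the effective diameter to be the 90th-percentile shortest-path length. The names `bfs_distances`, `effective_diameter`, `is_navigable` and both thresholds are illustrative assumptions; the base-2 logarithm follows the examples on the slides.

```python
from collections import deque
from math import ceil, log2


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def effective_diameter(adj, percentile=0.9):
    """Smallest d such that `percentile` of connected ordered pairs lie within d hops.

    The percentile definition is an assumption; the slides do not define it.
    """
    lengths = sorted(
        d for s in adj for t, d in bfs_distances(adj, s).items() if t != s
    )
    if not lengths:
        return None
    return lengths[ceil(percentile * len(lengths)) - 1]


def is_navigable(adj, giant_fraction=0.9, percentile=0.9):
    """Check (1) a giant component and (2) effective diameter <= log2(n)."""
    n = len(adj)
    # Size of the largest connected component (undirected graph assumed).
    largest = max(len(bfs_distances(adj, s)) for s in adj)
    if largest < giant_fraction * n:
        return False
    eff = effective_diameter(adj, percentile)
    return eff is not None and eff <= log2(n)


# A path of 8 nodes is connected but too "long"; a star of 10 nodes is navigable.
path8 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
star10 = {0: list(range(1, 10)), **{i: [0] for i in range(1, 10)}}
print(is_navigable(path8), is_navigable(star10))
```

Running all-pairs breadth-first search costs O(n(n + m)), so this sketch only suits small graphs; larger networks would need sampled sources.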

Page 5: Search in Social and Navigation Networks

Network Navigability

Example 1:

Figure: The network is not navigable because there is no giant component, i.e. the network is not connected.

Page 6: Search in Social and Navigation Networks

Network Navigability

Example 2:

Figure: Now there is a giant component, i.e. the network is connected. However, the network is not navigable because eff.diam = 6, and 6 > log2(8) = 3.

Page 7: Search in Social and Navigation Networks

Network Navigability

Example 3:

Figure: The network is navigable because there is a giant component and eff.diam = 2, which is below the bound log2(10) ≈ 3.32.

Is this efficiently navigable? There are short paths between all nodes, but can an agent or algorithm find them with local knowledge only?

Page 8: Search in Social and Navigation Networks

Global Network Navigability

So far we have discussed global network navigability

It assumes that we have global knowledge of the network

With global knowledge it is easy to design efficient navigation procedures that find an arbitrary target node from an arbitrary start node

E.g., breadth-first search with time complexity O(n + m), where m is the number of links
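As an illustration of navigation with global knowledge, here is a minimal breadth-first search sketch that returns a shortest path. The adjacency-dict representation and the function name `bfs_shortest_path` are assumptions made for this example.

```python
from collections import deque


def bfs_shortest_path(adj, start, target):
    """Return one shortest path from start to target, or None if unreachable."""
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk the parent pointers back to the start.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


# Hypothetical toy network with start A and target D.
adj = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": ["A"], "Z": []}
print(bfs_shortest_path(adj, "A", "D"))
```

Each node and each link is handled at most once, which gives the O(n + m) bound. The procedure needs the whole adjacency structure, which is exactly the global knowledge that decentralized search lacks.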

Page 9: Search in Social and Navigation Networks

Local Network Navigability

What about local network navigability?

Here we have only local knowledge of the network

We know only the outgoing links of the current node

We do not know the links of the neighbors of the current node

Is it still easy to design efficient navigation procedures?

Such procedures are typically called decentralized search

Page 10: Search in Social and Navigation Networks

Local Network Navigability

How are we able to find other people efficiently?

What structure of a social network makes efficient decentralized search possible?

There are structural cues in the network that allow us to design sub-linear algorithms

What are these structural cues?

Page 11: Search in Social and Navigation Networks

Local Network Navigability

Example:

Efficiently navigable

A network is efficiently navigable iff there is an algorithm that can find a short path with only local knowledge, and the delivery time of the algorithm is bounded by a polynomial in log(n), i.e. by log^k(n) for some constant k.

Efficiently navigable, if the algorithm knows it needs to go through A, B, C

J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000. Also appears as Cornell Computer Science Technical Report 99-1776 (October 1999)

Figure: A is start node and D is target node.

Page 12: Search in Social and Navigation Networks

Local Network Navigability

Step 1:

Figure: There are two possible paths from A. Obviously, the optimal path leads to B. What is the structural property that can guide us in selecting B?

Degree

Page 14: Search in Social and Navigation Networks

Local Network Navigability: Power-law networks

Utilizing high-degree nodes in power-law networks (Search in power-law networks by Adamic, 2003)

Adapt the breadth-first search by first investigating nodes with higher degrees.

Reduces time complexity to the range O(n^{1/2}) to O(log(n)) for 2 ≤ α ≤ 3
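The degree-first adaptation can be sketched with a priority queue ordered by node degree. This is an illustrative sketch under stated assumptions, not the algorithm as specified in the cited paper. It assumes that an examined node can see its own neighbor list and the degrees of those neighbors; the name `degree_first_search` is hypothetical.

```python
import heapq


def degree_first_search(adj, start, target):
    """Examine nodes in order of decreasing degree; return the examination order.

    Assumption: an examined node knows its neighbors and their degrees.
    Returns None if the target is never discovered.
    """
    if start == target:
        return [start]
    visited = {start}
    # Heap entries: (-degree, insertion counter, node); the counter breaks ties.
    frontier = [(-len(adj[start]), 0, start)]
    pushes = 1
    examined = []
    while frontier:
        _, _, u = heapq.heappop(frontier)
        examined.append(u)
        if target in adj[u]:
            examined.append(target)
            return examined
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                heapq.heappush(frontier, (-len(adj[v]), pushes, v))
                pushes += 1
    return None


# Hypothetical toy network: the hub is examined before the low-degree leaf.
adj = {
    "s": ["leaf", "hub"],
    "leaf": ["s"],
    "hub": ["s", "a", "b", "t"],
    "a": ["hub"],
    "b": ["hub"],
    "t": ["hub"],
    "z": [],
}
print(degree_first_search(adj, "s", "t"))
```

The returned list is the order in which nodes were examined, not a path. Its length is a rough proxy for search cost in this toy setting.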

Page 15: Search in Social and Navigation Networks

Local Network Navigability

Step 2:

Figure: There are seven possible paths from B. Obviously, the optimal path leads to C. What is the structural property that can guide us in selecting C?

Local clustering

Page 17: Search in Social and Navigation Networks

Local Network Navigability

Summarizing, local network navigability requires:

1 Existence of network hubs that are connected to many nodes

2 Existence of network clusters where nodes are highly interlinked


Page 18: Search in Social and Navigation Networks

Local Network Navigability

We base our strategy on probability:

1 The first requirement increases the probability that a hub has a link to the target node or to the target node's cluster

2 The second requirement increases the probability that there is a link to the target node from within the cluster it belongs to

Page 19: Search in Social and Navigation Networks

Local Network Navigability

Formally:

1 Power-law degree distribution with exponent γ

2 High clustering coefficient C
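Both quantities can be measured on a concrete graph. The sketch below computes the degree histogram (the raw data from which an exponent γ would be fitted) and the average local clustering coefficient C. It assumes an undirected graph given as a symmetric adjacency dict without duplicate entries; the function names are illustrative, and fitting γ itself is left out.

```python
from collections import Counter


def degree_histogram(adj):
    """Map each degree k to the number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    nbrs = set(adj[node]) - {node}
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Every link between two neighbors is seen from both endpoints, hence / 2.
    links = sum(1 for u in nbrs for v in adj[u] if v in nbrs and v != u) / 2
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Hypothetical example: a triangle a-b-c with a pendant node d attached to a.
adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
print(degree_histogram(adj), average_clustering(adj))
```

Nodes with fewer than two neighbors are given a coefficient of 0 here, which is one common convention among several.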

Page 20: Search in Social and Navigation Networks

Local Network Navigability

[Figure panels: success probability (p_s) against network size (N, from 10^2 to 10^5) with curves for γ = 2.1 to 3.0; success probability (p_s) against degree exponent (γ) with curves for α = 1.1, 1.5, 2.0, 3.0, 5.0; degree exponent (γ) against clustering coefficient (C), showing the navigable and non-navigable regions and the labels Internet, Web of trust, Airports, Metabolic.]

FIG. 3: Success probability of greedy routing. Left plots: success probability p_s as a function of network size N for different values of γ with weak (top) and strong (bottom) clustering. The top-right plot shows p_s as a function of γ and α for networks of fixed size N ≈ 10^5. In the bottom-right plot, parameter α is mapped to clustering coefficient C [15] by computing C for each network with given γ and α. For each value of C, there is a critical value of γ = γ_c(C) such that the success ratio in networks with this C and γ > γ_c(C) decreases with the network size (p_s(N) → 0 as N → ∞), while p_s(N) reaches a constant value for large N in networks with γ < γ_c(C). The solid line in the plot shows these critical values γ_c(C), separating the low-γ, high-C navigable region, in which greedy routing remains efficient in the large-graph limit, from the high-γ, low-C non-navigable region, where the efficiency of greedy routing degrades for large networks. The plot labels measured values of γ and C for several real complex networks. Internet is the global Internet topology of autonomous systems as seen by the Border Gateway Protocol (BGP) [31]; Web of trust is the Pretty Good Privacy (PGP) social network of mutual trust relationships [32]; Metabolic is the network of metabolic reactions of E. coli [33]; and Airports is the network of the public air transportation system [34].

the number of successful paths once clustering is above a threshold, α ≥ 1.5. These observations mean that for a fixed clustering strength, there is a critical value of the exponent γ (Fig. 3 bottom-right) below which networks remain navigable as their size increases, but above which their navigability deteriorates with their size.

In summary, strong clustering improves both navigability metrics. We also find a delicate trade-off between values of γ close to 2 minimising path lengths, and higher values (not exceeding γ ≈ 2.6) maximising the percentage of successful paths. We explain these findings in the next section, but we note here that qualitatively, this navigable parameter region contains a majority of complex networks observed in reality [1, 2, 3], as confirmed in Fig. 3 (bottom-right), where we juxtapose a few paradigmatic examples of communication, social, biological, and transportation networks vs. the identified navigable region of clustering and degree distribution exponent. Interestingly, power grids, which propagate electricity rather than route information, are neither scale-free nor clustered [15, 35].

IV. AIR TRAVEL BY GREEDY ROUTING AS AN EXPLANATION

We illustrate the greedy routing function, and the structure of networks conducive to such routing, with an example of passenger air travel. Suppose we want to travel from Toksook Bay, Alaska, to Ibiza, Spain, by the public air transportation network. Nodes in this network are airports, and two airports are connected if there is at least one flight between them. We travel according to the greedy routing strategy using geography as the underlying metric space. At each airport we choose the next-hop airport geographically closest to the destination. Under these settings, our journey goes first to Bethel, then to Anchorage, to Detroit, over the Atlantic to Paris, then to Valencia and finally to Ibiza, see Fig. 4. The sequence and sizes of airport hops reveal the structure of our greedy-routing path. The path proceeds from a small airport to a local hub at a small distance, from there to a larger hub at a larger distance, and so on until we reach Paris. At that point, when the distance to the destination becomes sufficiently small, greedy routing leads us closer to our final destination by choosing not another hub, but a less connected neighbouring airport.

We observe that the navigation process has two, somewhat symmetric phases. The first phase is a coarse-grained search, travelling longer and longer distances per hop toward hubs, thus "zooming out" from the starting point. The second phase corresponds to a fine-grained search, "zooming in" onto the destination. The turning point between the two phases appears naturally: once we are in a hub near the destination, the probability that it is connected to a bigger hub closer to the destination sharply decreases, but at this point we do not need hubs anyway, and greedy routing directs us to smaller airports at shorter distances next to the destination.

This zoom out/zoom in mechanism works efficiently only if the coupling between the airport network topology and the underlying geography satisfies the following two conditions: the sufficient hubs condition and the sufficient clustering condition. The first condition ensures that a network has enough hub airports (high-degree nodes) to provide an increasing sequence during the zoom out phase. This condition is fulfilled by the real airport network and by other scale-free networks with small values of degree distribution exponent γ, because the smaller the γ, the larger the proportion of hubs in the network.

However, the presence of many hubs does not ensure that greedy routing will use them. Unlike humans, who can use their knowledge of airport size to selectively travel via hub airports, greedy routing uses only one con-

Figure: Navigable networks in γ, C space.

Page 21: Search in Social and Navigation Networks

Local Network Navigability

Revisiting Step 2:

Figure: There are seven possible paths from B. Obviously, the optimal path leads to C. What is an additional hint that can guide us in selecting C over E?

Node similarity

Page 23: Search in Social and Navigation Networks

Local Network Navigability

Similarity between nodes is external to the network

It is derived from additional information that we have about the network nodes

In Milgram's experiment people selected the next person according to their occupation or geography

E.g., a friend in Boston, or a friend working in the financial sector

Page 24: Search in Social and Navigation Networks

Local Network Navigability

All this information, i.e. degrees, clustering, and similarity, is our background knowledge about the network

We use this background knowledge to guide our search for a target node

At each step we consult the background knowledge

We ask which link leads to the given target node with the highest probability

Page 25: Search in Social and Navigation Networks

Decentralized search

Model of navigation with local knowledge only

Intuitions about the network nodes

Distance metric

Algorithmically people are greedy


Page 26: Search in Social and Navigation Networks

Decentralized Search

Background knowledge defines a notion of distance between nodes

It is a metric space that assigns unique coordinates to each node

A distance metric allows us to calculate the distance between nodes

Background knowledge is a black box executing a simple function: getDistance(node, target node)
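With the background knowledge reduced to a distance function, greedy decentralized search is a short loop: at each step, move to the neighbor that the black box reports as closest to the target. The sketch below is one possible reading, not the lecture's reference implementation. The Python name `get_distance` mirrors the slide's getDistance; the ring metric, the `max_hops` limit, and the rule of never revisiting a node are assumptions added for the example.

```python
def greedy_search(adj, get_distance, start, target, max_hops=100):
    """Follow, at each step, the unvisited neighbor closest to the target.

    Only the current node's links are inspected (local knowledge).
    Returns the path taken, or None on a dead end or when max_hops is exceeded.
    """
    path = [start]
    visited = {start}
    current = start
    while current != target and len(path) <= max_hops:
        candidates = [v for v in adj[current] if v not in visited]
        if not candidates:
            return None  # dead end: greedy search does not backtrack here
        current = min(candidates, key=lambda v: get_distance(v, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None


# Hypothetical example: 12 nodes on a ring plus one long-range link 0-5.
N = 12
ring = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}
ring[0].append(5)
ring[5].append(0)


def ring_distance(a, b):
    """Distance along the ring, used here as the background knowledge."""
    return min(abs(a - b), N - abs(a - b))


print(greedy_search(ring, ring_distance, 0, 6))
```

The search uses the long-range link because node 5 is the neighbor of node 0 that the metric places closest to node 6. No global view of the ring is consulted.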

Page 27: Search in Social and Navigation Networks

Decentralized search

[Figure, built up over several animation steps: nodes labelled 1, 11 to 15, 2, 21 to 25, 3, 31 to 33, annotated with distance values from 0 to 5.]

Figure: Informed greedy decentralized search.

Page 37: Search in Social and Navigation Networks

Background Knowledge

What is the nature of background knowledge?

It is a metric space, e.g. 1-D spaces, 2-D vector spaces, 3-D Euclidean spaces, hyperbolic spaces, ...

Example with a circle as a metric space by Boguna: Navigability of Complex Networks, 2007

Example with a hyperbolic space by Krioukov: Hyperbolic Geometry of Complex Networks, 2010

Page 38: Search in Social and Navigation Networks

Hierarchical Background Knowledge

A hierarchy of nodes is also a metric space

Kleinberg: The Small-World Phenomenon: An Algorithmic Perspective, 2000 and Small-World Phenomena and the Dynamics of Information, 2001

Watts: Identity and Search in Social Networks, 2002

Adamic: How to search a social network, 2005

Page 39: Search in Social and Navigation Networks

Hierarchy as a Metric Space

[Figure: a hierarchy with nodes labelled 1, 11 to 15, 2, 21 to 25, 3, 31 to 33, annotated with distance values from 1 to 5.]

Figure: Node distances in a hierarchy.

Page 40: Search in Social and Navigation Networks

Hierarchy as a Metric Space

Height of least common ancestor in the hierarchy

Parent and siblings are at d = 1

From there on recursively: the parent's siblings are at d = 2, the children of the parent's siblings are at d = 3, and so on

Page 41: Search in Social and Navigation Networks

Hierarchy as a Metric Space

If neither node is an ancestor of the other (parent, grandparent, etc.):

d(i, j) = h(i) + h(j) − 2h(lca(i, j)) − 1

Otherwise:

d(i, j) = abs(h(i) − h(j))
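The two cases can be written out as a small function. This sketch assumes that h(x) is the level of node x counted from the root (root at level 0) and that lca(i, j) is the least common ancestor. Under that reading the formula reproduces the values stated earlier: siblings and the parent at d = 1, the parent's siblings at d = 2, their children at d = 3. The tree below and the names `ancestors` and `hierarchy_distance` are hypothetical.

```python
def ancestors(parent, node):
    """Chain node, parent, grandparent, ..., root."""
    chain = [node]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def hierarchy_distance(parent, i, j):
    """Distance between i and j in a hierarchy given as a child -> parent map."""
    anc_i = ancestors(parent, i)
    anc_j = ancestors(parent, j)
    h_i, h_j = len(anc_i) - 1, len(anc_j) - 1  # levels, root = 0 (assumption)
    if i in anc_j or j in anc_i:
        # One node is an ancestor of the other (or both are the same node).
        return abs(h_i - h_j)
    on_i_chain = set(anc_i)
    lca = next(a for a in anc_j if a in on_i_chain)  # least common ancestor
    h_lca = len(ancestors(parent, lca)) - 1
    return h_i + h_j - 2 * h_lca - 1


# Hypothetical three-level hierarchy.
parent = {
    "root": None,
    "1": "root", "2": "root", "3": "root",
    "11": "1", "12": "1", "13": "1",
    "21": "2", "31": "3",
}
print(hierarchy_distance(parent, "13", "21"))
```

The example tree has only three levels, so its largest distance is 3; deeper hierarchies yield larger values.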

Page 42: Search in Social and Navigation Networks

Navigation in Information vs. Social Networks

Property                  Social Networks      Information Networks
Agents per search         multiple agents      single agent
Type of routing           decentralized        centralized
Knowledge of network      local                local
Local knowledge           rich                 limited
Searcher                  endogenous           exogenous
Routing decisions         social intuitions    topical intuitions
Candidate consultation    costly               cheap

Page 43: Search in Social and Navigation Networks

Human Navigation on the Web

Human Wayfinding in Information Networks

Robert West and Jure Leskovec

A study of a large collection of human click paths

Statistical analysis of the click paths and strategies humans apply when navigating information networks

Page 44: Search in Social and Navigation Networks

Wikispeedia Game

Users play a navigation game in Wikipedia

You get a randomly selected starting page and a randomly selected target page

In a short time period you have to reach the target page only by clicking on links

Similar to the Wikigame: http://thewikigame.com/

Page 45: Search in Social and Navigation Networks

Human Navigation

[Figure: click path through the Wikipedia articles DIK-DIK, WATER, ELECTRON, ATOM, QUANTUM MECHANICS and ALBERT EINSTEIN, with numbered edges and framed numbers on the nodes.]

Figure 1: A human example path between the concepts DIK-DIK and ALBERT EINSTEIN. Nodes represent Wikipedia articles and edges the hyperlinks clicked by the human. Edge labels indicate the order of clicks, the framed numbers the shortest-path length to the target.

which articles are likely to be interlinked. In this respect, the task humans are trying to solve at each visited article is that of guessing which of the outgoing links to follow in order to eventually reach the target article.

What makes our study unique is that we have been collecting detailed data on more than 30,000 instances of human wayfinding in an information network describing general human knowledge. This allows us to computationally analyze human wayfinding on a large scale. Even more importantly, for every instance we know the starting and the given target article the user is trying to reach. Hence, we do not have to infer or guess the information need of the information seeker, but can base our methods on the ground truth instead.

To illustrate the dynamics of the Wikispeedia game, as well as potential reasoning schemes and classes of strategies humans might use, Fig. 1 gives the example of a human path between the start article DIK-DIK and the target ALBERT EINSTEIN. (We call such a pair a mission.) Note that using the browser's back button is allowed. In the example, the information seeker clicked from ELECTRON to ATOM, but backed up after not finding the link to the target that he had expected there. We call the sequence including ATOM and the back-click the full path, while referring to 〈DIK-DIK, WATER, ELECTRON, QUANTUM MECHANICS, ALBERT EINSTEIN〉 as the effective path. The shortest-path length (SPL) from every article to the target is shown in squares in the picture. If a click decreases the SPL, we call it lucrative. Note that in the example not every click is lucrative; rather, the information seeker makes progress at first, but then orbits at a distance of 2 from ALBERT EINSTEIN, before finally gravitating towards it with the choice of QUANTUM MECHANICS. We also emphasize the special role the article on WATER plays in the example. It connects to many parts of the network (hence we call it a hub) and marks the transition between getting away from the animal kingdom and homing in on the realm of physics.

Despite the lack of global knowledge, humans are good at connecting the dots: the median human game path is only one click longer than the median optimal solution. We explain this effect by showing that certain properties of Wikipedia's hyperlink structure make it easily navigable. For instance, our Wikipedia graph (we use a version containing about 4,000 articles and 120,000 links [25]) has a skewed degree distribution (median/mean/max degree 19/26/294) and contains a few high-degree hubs that contribute to everything being connected to everything else by short chains (median/mean/max shortest-path length 3/3.2/9). This makes our network a typical 'small world'. Our analysis shows that people commonly find their way in it by first locating a hub and by constantly decreasing the conceptual distance to the target thereafter. While approaching the target through a series of conceptually very related articles is safer and often humans' preferred solution (cf. the example of Fig. 1), it is typically not the most efficient: we find that thinking 'out of the box' often allows information seekers to find shorter paths between concepts, at the risk of getting lost. A strategy that is both popular and often successful is to connect concepts in terms of their geographical commonalities. In the above example, 〈DIK-DIK, AFRICA, EUROPE, GERMANY, ALBERT EINSTEIN〉 would have been such a solution.

Following this analysis, we formulate a task that captures some of the key motivating issues discussed above. We show how information from a short prefix of the navigation path can be used to predict what the information seeker is looking for. We design a ranking-based machine learning model and an efficient parameter estimation algorithm. Our method is informed by the lessons learned in our analysis and is trained on real human paths. The experimental evaluation shows that it can predict humans' intended targets with high accuracy.

Overall, our results provide insights into how people navigate and solve the task of wayfinding in information networks. From the practical perspective, our findings can be applied in order to make better sense of observed human search paths. Our performance on the target prediction task suggests that features of the underlying path can provide useful information beyond simply predicting the next action of the user. We therefore think that results of our research can be incorporated into intelligent systems to facilitate human information browsing and navigation.

2. RELATED WORK

The work related to our explorations here can be separated into three parts: Web click-trail analysis, systems that aid users in Web navigation, and decentralized search in networks. Next we briefly review each of these three lines of related work.

Information retrieval has focused on analyzing Web-browsing click trails of millions of users mainly for the purpose of improving Web search results. Click trails can be used as endorsements to rank search results more effectively [3, 19], trail destination pages can themselves be used as search results [24], and the concept of teleportation can be used to navigate directly to the desired page [20]. Similarly, large-scale studies of Web-page revisitation patterns [1] focus on how often users revisit the same page while ignoring how people get there. In contrast, our work focuses on understanding how people reach information by navigating through networks. Another important difference is that in our case we know the exact target of human search and can thus quantitatively analyze the strategies people use when navigating information spaces, as well as their efficiency.

Observational and laboratory studies have conducted small-scale controlled experiments about users' thought processes during search by having them think aloud as they search [13] and about their interaction with Web information [17]. These studies spawned sophisticated descriptive models, like information scent [4] and information foraging [16], which uses the metaphor of how animals forage for food. Other analogies, such as orienteering [14] and berrypicking [2], have also been used to describe users' information-seeking strategies. Other systems like ScentTrails [15] and guided tours [21] have been proposed to create annotations to indicate where other users have navigated in the past, all with the goal of helping people find information faster. Our present work differs in two important ways. First, our goal is not to formulate an analogy for human wayfinding, but rather to analyze it computationally using a large-scale collection of real search traces. Second, we address the task of predicting the actual target of human search, not just the next

Figure: Sample game

Page 46: Search in Social and Navigation Networks

Human Navigation

Start page: DIK-DIK

Target page: Albert Einstein

Using the back button is allowed (electron-atom)

The full path is the path including back clicks

The effective path is the path without back clicks
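Deriving the effective path from a recorded full path is a small stack exercise. The sketch below assumes that a back click is recorded as the marker "<" in the click sequence; that encoding and the name `effective_path` are assumptions made for this example.

```python
BACK = "<"  # assumed marker for a click on the browser's back button


def effective_path(full_path):
    """Drop back clicks together with the pages they undo."""
    stack = []
    for item in full_path:
        if item == BACK:
            if len(stack) > 1:  # never pop the start page
                stack.pop()
        else:
            stack.append(item)
    return stack


# The sample game: the visit to ATOM is undone by one back click.
full = ["DIK-DIK", "WATER", "ELECTRON", "ATOM", BACK,
        "QUANTUM MECHANICS", "ALBERT EINSTEIN"]
print(effective_path(full))
```

For the sample game this yields the effective path DIK-DIK, WATER, ELECTRON, QUANTUM MECHANICS, ALBERT EINSTEIN.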

Page 47: Search in Social and Navigation Networks

Human Navigation

Shortest path to the target in squares

If a click decreases the shortest path it is called lucrative

In the example not every click is lucrative

At first there is progress and then a cycle

Water is important because it connects two parts of the network → itis a hub


Page 48: Search in Social and Navigation Networks

Human Navigation

Figure 2: Distribution of game length, according to different path-length metrics (log-log plot; x-axis: number of clicks, y-axis: percentage). Black circles: shortest possible paths. Blue X’s: effective human paths (i.e., ignoring back-clicks). Red dots: complete human paths (i.e., including back-clicks). Green plus signs: complete human paths, corrected for drop-out rates.

path-length metric           mode   median   mean
shortest possible paths      3      3        2.9
human, effective             4      4        4.9
human, incl. back-clicks     4      5        5.8
human, drop-out–corrected    4      6        8.9

Table 1: Summary statistics of the distributions of Fig. 2.

action [6].

The last line of related work can be traced back to Milgram’s small-world experiment [12] and the algorithmic problem of decentralized search in networks [8]. Decentralized search considers a scenario in which a starting node s is trying to send a message to a given target node t by forwarding the message to one of its neighbors, where the process continues in the same way until eventually t is reached. This process has been investigated both experimentally and through simulations [5, 9, 18]. In this respect, each game of Wikispeedia is a decentralized search in a network, where players try to navigate between given start and target pages using only the local information provided on the current page (i.e., players can only follow hyperlinks of the current page). In the small-world experiment, search is in a sense even more decentralized, since each node (i.e., human) on the path independently forwards the message and then forfeits control, whereas in Wikispeedia, although the information seeker also has only local knowledge about the unknown network, he or she stays in control all the way and can thus form more elaborate strategies than in the multi-person scenario. Moreover, while previous empirical studies of search behavior had very few completed paths (e.g., only 384 [5]), we work with more than 30,000 completed chains.
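Decentralized search as described above can be sketched with a greedy forwarding rule: the current node only looks at its own neighbors and hands the message to the one that seems closest to t. The greedy rule, the ring-shaped toy network, and the distance function are assumptions chosen for illustration; they are not claimed to be what Wikispeedia players actually do.

```python
def greedy_search(neighbors, distance, s, t, max_steps=100):
    """Forward from s towards t using only the current node's neighbors.

    `distance(v, t)` is the searcher's estimate of how far v is from t.
    The search stops when t is reached or after `max_steps` forwards.
    """
    path = [s]
    current = s
    while current != t and len(path) <= max_steps:
        current = min(neighbors[current], key=lambda v: distance(v, t))
        path.append(current)
    return path


N = 10  # toy network: a ring of 10 nodes plus one long-range link 0 -> 5
neighbors = {i: [(i + 1) % N, (i - 1) % N] for i in range(N)}
neighbors[0].append(5)


def ring_distance(u, v):
    d = abs(u - v)
    return min(d, N - d)


path = greedy_search(neighbors, ring_distance, 0, 6)
```

With the long-range link in place the message travels 0, 5, 6; without it, greedy search would need four hops around the ring.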

Our study is unique in several respects. We collected large-scale data about human navigation in a network of real-world concepts, where we know the precise target node humans are trying to reach. We focus on computationally investigating and modeling how humans navigate information networks and what strategies they use. This allows us to build accurate predictive models of where the users are trying to navigate.

3. EFFICIENCY OF HUMAN SEARCH

The Wikipedia graph is an example of a ‘small world’ in which most pairs of nodes are connected by short chains, with a mean/median/max shortest-path length (SPL) across all pairs of 3.2/3/9. A natural first question to ask is, How good are humans at finding such short chains?
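Mean, median, and maximum SPL across all pairs can be obtained by running a breadth-first search from every node. The following is a minimal sketch on a toy graph; it assumes a directed, unweighted graph stored as an adjacency dictionary and counts only ordered pairs that are actually connected.

```python
from collections import deque
from statistics import mean, median


def bfs_distances(graph, source):
    """Shortest-path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def spl_summary(graph):
    """Mean, median and max SPL over all connected ordered pairs."""
    lengths = []
    for s in graph:
        lengths.extend(d for v, d in bfs_distances(graph, s).items() if v != s)
    return mean(lengths), median(lengths), max(lengths)


# Toy graph (hypothetical): a directed cycle of four pages.
toy = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
summary = spl_summary(toy)
```

All-pairs BFS costs one traversal per node, which is feasible for a graph of a few thousand articles but would need sampling on much larger graphs.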

Figure 3: Distribution of game length for 4 specific missions with an optimal solution of 3 clicks (x-axis: number of clicks, y-axis: percentage). Legend: all games of SPL 3; Pyramid → Bean; Brain → Telephone; Asteroid → Viking; Theatre → Zebra. We recorded between 216 and 376 paths per mission. The gray curve shows the length distribution for all games with an optimal solution of 3 clicks.

Fig. 2 gives a good impression of how the paths found by humans compare to optimal solutions (summary statistics of the distributions in the figure are provided in Table 1). The red line shows the distribution of human path lengths (where back-clicks are counted as regular clicks), while effective paths were used for the blue line. For each human game we also computed an optimal solution, and the resulting path length distribution is represented by the black line. We make three observations:

1. The variance in search time is much larger for human than for optimal solutions. While the distribution of optimal path length is tight around 3 clicks, the human distribution exhibits a heavy tail.

2. Nonetheless, the effective paths found by humans (the blue line in Fig. 2) are typically not much longer than shortest paths. Both mode and median search time differ from optimal by just 1 click (3 vs. 4 clicks), mean search time by 2 (2.9 vs. 4.9 clicks). (See Table 1.)

3. When considering full path length with back-clicks (the red line in Fig. 2), the mode search time is still 4, and the median and mean search times are 1 click more than for effective paths (5 vs. 4 and 5.8 vs. 4.9 clicks). That is, humans click back on average once per game.

Two questions arise: First, what is the reason for the large variance in human search time? Second, why is human search still so efficient on average?

The first question permits two potential answers. Either some missions are inherently harder than others, or some information seekers are better than others. Some missions have longer optimal solutions than others, so necessarily some games are inherently harder. However, even when restricting ourselves to missions of a fixed SPL, the numbers stay virtually unchanged (e.g., for games with an SPL of 3 clicks, the mode/median/mean is 4/5/6.0, as opposed to 4/5/5.8 for all games). Of course, even among missions of a fixed SPL, some are harder for humans because the lucrative links might be less obvious. To control for such effects, we posted 4 missions, all of SPL 3, on the game website with increased frequency. This allows us to find out how different humans perform on the exact same task. The search time distributions for the 4 frequent missions are plotted in Fig. 3. We see that for each separate mission there is considerable search time variance, but also that some missions allow for shorter games on average than others. This leads us to conclude that both hardness of mission and individual skill play a role in explaining the large search time variance.

Regarding the second question, too (Why is human search so efficient on average?), several answers are conceivable. One might

Figure: Clicks histogram


Page 49: Search in Social and Navigation Networks

Human Navigation

Black circles show shortest paths → small-world network

Blue X’s show human effective paths

Red dots show human complete paths

Green +’s show complete human paths corrected for drop-outs


Page 50: Search in Social and Navigation Networks

Human Navigation


Figure: Click summaries


Page 51: Search in Social and Navigation Networks

Human Navigation

Effective human paths are typically not much longer than optimal shortest paths

They differ typically by one click

Full path lengths with back-clicks are one click more than effective path lengths

Typically, one back-click per game

Still very efficient

The variance in human paths is higher than for optimal solutions → heavy tail


Page 52: Search in Social and Navigation Networks

Human Navigation

Two questions

What is the reason for the large variance?

Why is human search still so efficient on average?

Two possible answers to the first question

Some missions are harder than the others

Some users are better than the others



Page 54: Search in Social and Navigation Networks

Human Navigation

Some missions have longer optimal paths

Necessarily, some of them are harder than the others

Even missions with the same optimal path length can differ in difficulty, e.g. because the lucrative links are more or less obvious

Four selected missions with a shortest path of 3 have been posted more often to users


Page 55: Search in Social and Navigation Networks

Human Navigation


Figure: Specific search missions


Page 56: Search in Social and Navigation Networks

Human Navigation

There is considerable search time variance

Some missions allow for lower average search time

Hardness of mission plays a role

Individual skill also plays a role


Page 57: Search in Social and Navigation Networks

Human Navigation

Why are humans able to navigate efficiently?

Is there a sampling bias?

There are many drop-outs, e.g. 54% of all games are canceled before reaching the target page

This might introduce a bias towards observing shorter chains than we would observe by forcing users to finish

Forcing users to finish might produce longer chains


Page 58: Search in Social and Navigation Networks

Human Navigation

Figure 4: Drop-out rate as a function of path position (with 95% confidence intervals; x-axis: path position incl. back-clicks, y-axis: drop-out rate). At each step, players give up with a probability of around 10%.

argue that the efficiency of observed games is caused by a sampling bias. In studies that collect data from human volunteers, one always faces the problem of participants dropping out before finishing the task assigned. In our case, this might result in a bias towards observing shorter chains than what we would observe by forcing participants to finish all tasks, since the longer the game takes, the more likely the subject is to give up at some point. For instance, 54% of all games in our data set were canceled before finishing. Fig. 4 shows that the drop-out rate R_i, i.e., the probability of giving up at the i-th step, is roughly constant at around 10%.

Using drop-out rates, we can correct for the aforementioned bias and compute an ideal search time histogram, for the hypothetical case that participants never give up [5]. The result is shown as the green line in Fig. 2. Although longer games are more frequent under the ideal than under the observed distribution, the distributions still look similar qualitatively, with mode 4 and a power-law–like tail. The median search time is only 1 click higher (6 vs. 5 clicks), and mean search time rises by 3 clicks (8.9 vs. 5.8 clicks). We conclude that the observed human efficiency in Wikispeedia play is not explicable by a sampling bias alone.
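One way to perform such a correction is to reweight each observed game length by the inverse of the probability that a player survives that many steps without giving up. The exact formula is not spelled out in the text, so the sketch below is an assumption: it uses a constant drop-out rate of 10% per step and a made-up histogram of finished games.

```python
def correct_for_dropout(observed, dropout_rate=0.1):
    """Reweight a histogram of finished games by 1 / P(no drop-out in L steps).

    `observed` maps game length L (in clicks) to the number of finished games.
    Returns the normalized 'ideal' distribution for players who never give up.
    Assumption: a game of length L is only observed if the player did not
    give up at any of its L steps, each with probability `dropout_rate`.
    """
    weighted = {L: n / (1.0 - dropout_rate) ** L for L, n in observed.items()}
    total = sum(weighted.values())
    return {L: w / total for L, w in weighted.items()}


# Made-up counts: 90 finished games of length 1, 81 of length 2.
observed = {1: 90, 2: 81}
ideal = correct_for_dropout(observed)
```

In this toy case the two lengths come out equally likely after correction, although the longer games were observed less often. This is the direction of the shift reported for the green line in Fig. 2.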

Instead, we conjecture that, even without knowing the set of all existing links, the Wikipedia graph is efficiently navigable for humans because they have an intuition about what links to expect. Clearly, the probability of 2 articles linking to each other is higher the more related they are. This can lead to efficient navigation even in the absence of global knowledge. In particular, Liben-Nowell et al. [10] have shown analytically that short search times (technically defined as polylogarithmic in the number of nodes) can be expected under their model of ‘rank-based friendship’, i.e., if the probability of a node linking to its r-th closest fellow node decays as 1/r. Intuitively, such a scenario is desirable because it constitutes an appropriate mix of many short- and a few long-range links. The latter are helpful for getting somewhat close to the target, while the former are necessary for fully reaching it.

We strive to investigate whether the Wikipedia graph satisfies rank-based friendship. Humans may tap into all their knowledge and reasoning skills during play, so it is hard to formalize their node distance measure. In the present analysis, we therefore coarsely approximate the human distance measure by a standard text-based one and define the similarity of 2 articles as the cosine of their TF-IDF vectors [11] (and distance as 1 minus similarity). Fig. 5 plots the link probability P(r) as a function of rank r. The black line was added to show an ideal slope of −1, as postulated by the rank-based friendship model. Note that, although P(r) does not fully follow a power law, the overall slope of the curve comes close to −1, which leads us to conclude that Wikipedia is conducive to efficient navigation because its links represent an appropriate mix of long- and short-range connections across concept space.
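The link probability P(r) can be estimated directly from article vectors and the link structure. The sketch below is an illustration under assumptions: articles are represented as small term-weight dictionaries standing in for TF-IDF vectors, and the three toy articles and their links are invented.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def link_probability_by_rank(vectors, links):
    """Return a list whose entry r-1 is P(r): the fraction of nodes u
    that link to their r-th closest node."""
    nodes = list(vectors)
    hits = [0] * (len(nodes) - 1)
    for u in nodes:
        others = sorted(
            (v for v in nodes if v != u),
            key=lambda v: cosine_distance(vectors[u], vectors[v]),
        )
        for index, v in enumerate(others):  # index 0 is the closest node
            if v in links.get(u, ()):
                hits[index] += 1
    return [h / len(nodes) for h in hits]


# Hypothetical term weights and links for three articles.
vectors = {"a": {"x": 1.0}, "b": {"x": 2.0, "y": 1.0}, "c": {"y": 1.0}}
links = {"a": {"b"}, "b": {"a"}, "c": {"a"}}
p = link_probability_by_rank(vectors, links)
```

Sorting all other nodes for every node is quadratic in the number of articles, which is acceptable for a sketch but would need an index structure on a large corpus.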

Figure 5: Link probability P(r) as a function of rank r (log-log plot; x-axis: rank r, y-axis: link probability P(r)). Given r, consider all node pairs (u, v) such that v is the node that is r-th closest to u among all nodes. Then P(r) is defined as the fraction of these nodes for which u links to v. Blue: P(r). Red: P(r) + ε, with ε = 0.005. Black: ideal slope of −1 (not a fit; only for orientation).

Also note the red, upper curve in Fig. 5: after adding a small constant ε = 0.005 to P(r), the plot looks considerably more like the required power law. We take this as an indication that there is slight underlinking in the Wikipedia graph: if every node linked to even its furthest fellow nodes with a small background probability ε, then Wikipedia could become even more easily navigable (at least under the TF-IDF distance measure).
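The effect of the constant ε on the log-log slope can be checked numerically. The sketch below uses synthetic values, not the Wikipedia data: it assumes an "underlinked" curve of the form 0.5/r − ε and measures the least-squares slope before and after adding ε back. The function name and the chosen ranks are hypothetical.

```python
import math


def loglog_slope(ranks, probs):
    """Least-squares slope of log10(P) against log10(r)."""
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(p) for p in probs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


EPS = 0.005
ranks = [1, 2, 5, 10, 20, 50]
underlinked = [0.5 / r - EPS for r in ranks]  # synthetic curve, falls off too fast
shifted = [p + EPS for p in underlinked]      # adding EPS restores 0.5 / r

slope_underlinked = loglog_slope(ranks, underlinked)
slope_shifted = loglog_slope(ranks, shifted)
```

On this synthetic curve the shifted values have a slope of −1, while the underlinked values decay faster than 1/r at high ranks, which is the qualitative picture the red and blue curves are described as showing.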

4. ELEMENTS OF HUMAN WAYFINDING

In the previous section we have argued that human search in the Wikipedia network is made possible by the statistical properties of its link structure. Next we turn our attention to a detailed analysis of how people actually exploit these properties.

4.1 Anatomy of typical paths

In our analysis, we investigate how some key quantities of articles and clicks change as games progress from the start towards the target article. To facilitate the analysis, we restrict ourselves to all games whose start and target articles are optimally connected by exactly 3 clicks and consider only effective paths.

Fig. 6 contains a graphical summary of the findings we are about to discuss. Each subfigure tracks one quantity along game paths; each curve is computed from all games of the same effective path length, the leftmost curve representing games of length 8, the next one games of length 7, etc. (to avoid clutter, we consider only games of a maximum length of 8 clicks). The x-axes show the human-path distance, i.e., the number of clicks to the target on the effective path (i.e., paths may be thought of as running from left to right), while the y-axes represent the mean of the respective quantity over all games, alongside 95% confidence intervals. The bold gray curves plot the given quantity for the average optimal solution. To compute it, we found an optimal solution for every human game instance and averaged. We refer to the figure in row r and column c as plot (r, c).

Making progress is easiest far from and close to the target. Plot (1,1) shows how the shortest-path length (SPL) to the target changes as a function of human-path distance. Necessarily, the shorter the game, the steeper the curve. Additionally, all curves share a typical anatomy: with the first click, the information seeker gets significantly closer to the target on average, then the curve flattens out and becomes steeper again towards the endgame. In short games the player blasts straight through to the target, making progress with nearly every step, while in long games the player goes through a phase of inefficient circling around the target before finally gravitating towards it. Another perspective on the same phenomenon is afforded by plot (2,3), which shows the fraction of

Figure: Drop-out rates


Page 59: Search in Social and Navigation Networks

Human Navigation

At each step players give up with a probability of about 10%

Using those drop-out rates we can correct for that bias

We compute an ideal histogram assuming that the users never give up

Although longer games are now more frequent, the distributions are qualitatively similar

E.g., the median is only 1 click higher


Page 60: Search in Social and Navigation Networks

Human Navigation


Figure: Clicks histogram

Human Navigation

Conjecture: The Wikipedia network is efficiently navigable (even without global knowledge)

Humans have an intuition of what to expect behind a link

Probability of two articles being linked is higher if they are similar

Human Navigation

Click path analysis

Games with shortest path 3

Only effective paths analyzed

Analysis of how some quantitative properties change along the paths as the game progresses

Curves combined from games having the same effective path length

Human Navigation

[Figure 6 panels, in order: SPL to target; outdegree; lucrative degree; lucrative ratio; rel. info gain; prob. of lucr. click; cat. tree dist. to target; TF-IDF dist. to target; TF-IDF dist. to next. The x-axis of each panel is marked 2, 4, 6, 8.]

Figure 6: The evolution of article properties along search paths, for games of optimal length 3. Only games of between 3 and 8 clicks are shown. Each colored line represents games of the same length. The x-axis shows the distance-to-go to the target, the y-axis the average value of the respective property (with 95% confidence intervals). The bold gray curve is computed based on optimal solutions for the considered human paths.

times humans picked a lucrative link, i.e., one that led them closer to the goal in terms of SPL. We observe a down–up pattern in the curves: information seekers are more likely to make progress with the first click than with the second. Later on, in the endgame, clicks become again ever more likely to be lucrative. In long games, the phases of progress in the opening and endgame are separated by a phase of stagnation where the probability of picking a good link stays roughly constant, a manifestation of the circling effect described above.¹

¹ The fact that the probability is not 100% even when humans achieve the optimal path length (the blue curve) is due to the fact that the data includes clicks from which the players later backed up by means of the browser's back button, such that they may take suboptimal links while still achieving the optimal effective path length.

Hubs are crucial in the opening. The initial progress with the first click is afforded by leaping to a 'hub' article, i.e., a high-degree node that is easily reachable from all over the graph and that has connections to many regions of it. This makes sense intuitively, since a good hub gives the information seeker more options to continue the search, and is demonstrated by plots (1,2), (1,3), and (2,1). While the start article has an average degree of only about 30 (cf. plot (1,2)), the first click leads to an article with an average degree of between 80 and 100. After the sudden degree increase with the first click, the quantity decreases slowly as the target is approached.

Note that the shorter the game, the higher the degree of the hub (and of any given position, for that matter). This could mean (1) that better information seekers pick better hubs, or (2) that some missions are easier because the start articles have links to better hubs. While the availability of good hubs certainly helps, Fig. 7 demonstrates that the first alternative plays a role as well. We plot the ratio $\deg(u_2)/\deg(u_2^*)$ of the degree of the second article and that of the highest-degree neighbor of the start article, averaged over all games of the respective length. The quantity decreases with increasing game length, implying that better information seekers tend to start games with relatively higher-degree hubs.

[Figure 7 plot: x-axis path length (4 to 10), y-axis hub quality (0.4 to 0.55).]

Figure 7: Hub quality as a function of search time (with 95% confidence intervals). Hub quality is defined as the degree of the second article, divided by the degree of the maximum-degree neighbor of the start article.

Let the term 'lucrative degree' stand for the number of outgoing links that decrease the SPL to the target. Plot (1,3) shows that, just like the plain degree, the lucrative degree, too, increases significantly with the first click: the hub article typically offers more lucrative options than the start article. Also, the mean lucrative degree then decreases as the games continue (necessarily, since there are more articles far from the target than close to it). We do not see a correlation between the hub's lucrative degree and game length. However, the start article itself has higher lucrative degree for very short games than for longer ones, an indicator that some games are inherently easier than others, even if the optimal number of clicks is held fixed. This certainly is a factor in the aforementioned negative correlation between search time and hub degree: if there are many good hubs it is easier to find one of them.

An interesting additional insight is afforded by looking at how the average of the ratio of lucrative degree and degree changes during games (cf. (2,1)). The resulting quantity, which we call 'lucrative ratio', corresponds to the probability of getting closer to the target when randomly choosing an outgoing link. While both degree and lucrative degree achieve their maximum with the second article, their ratio drops drastically between the first and second articles. From this we conclude that the second article is a true hub, in that it does not only have many outlinks leading closer to the target, but also has many that lead further away from it, i.e., that it has connections into many different regions of the graph.

Conceptual distance to the target decreases steadily. Plots (3,1) and (3,2) show that articles get ever more related textually to the target as the latter is approached (in other words, textual distance decreases). We verify this using 2 distinct measures of conceptual relatedness, (1) the cosine of the TF-IDF vectors of the 2 respective articles, as in Section 3, and (2) the number of edges that have to be traversed in the category tree that comes with our Wikipedia version, in order to reach one article from the other ('category tree distance'). The fact that the conceptual distance to the target decreases strictly along paths corroborates our conjecture from Section 3 that humans approximately perform a decentralized search using a distance measure between concepts. Also, note that the very intuition that the distance between concepts along the path and targets decreases was the original raison d'être of the game of Wikispeedia [22].

Big leaps first, followed by smaller steps. While plots (3,1) and (3,2) track the textual distance between the current article and the

Figure: Properties evolution along the click paths
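The quantities named in the excerpt above (lucrative degree, lucrative ratio, hub quality) can be sketched on a toy graph. This is only an illustration under stated assumptions: the graph, the node names and the helper functions are hypothetical, and the SPL is computed by a plain breadth-first search rather than by whatever method the original study used.

```python
from collections import deque

# Hypothetical toy link graph: node -> list of outgoing links.
graph = {
    "start": ["hub", "side"],
    "hub": ["a", "b", "side", "start"],
    "side": ["start"],
    "a": ["target"],
    "b": ["target"],
    "target": [],
}


def spl_to_target(graph, target):
    """Shortest-path length (SPL) from every node to the target (BFS on reversed links)."""
    reverse = {u: [] for u in graph}
    for u, outs in graph.items():
        for v in outs:
            reverse.setdefault(v, []).append(u)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for u in reverse.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def lucrative_degree(graph, dist, u):
    """Number of outgoing links of u that decrease the SPL to the target."""
    return sum(1 for v in graph[u] if v in dist and dist[v] < dist[u])


def lucrative_ratio(graph, dist, u):
    """Lucrative degree divided by degree (u must have at least one out-link)."""
    return lucrative_degree(graph, dist, u) / len(graph[u])


def hub_quality(graph, path):
    """Degree of the second article over the maximum degree among the start's neighbors."""
    start, second = path[0], path[1]
    best = max(len(graph[v]) for v in graph[start])
    return len(graph[second]) / best


dist = spl_to_target(graph, "target")
path = ["start", "hub", "a", "target"]
for node in path[:-1]:
    print(node, dist[node], lucrative_degree(graph, dist, node), lucrative_ratio(graph, dist, node))
print("hub quality:", hub_quality(graph, path))
```

In this toy graph the first click lands on the highest-degree neighbor of the start article, so the hub quality takes its maximum value.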

Human Navigation

Degree and similarity are the most important factors in human navigation

In the early phase degree is the most important

In the endgame similarity is the most important

To analyze the importance of features, train a logistic regression model

The feature weights allow us to infer the importance of a particular feature
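A minimal sketch of the idea of reading feature importance from logistic regression weights. Everything here is an assumption for illustration: the data is synthetic (not game data), the two standardized features are merely named after degree and similarity, and plain batch gradient descent stands in for whatever solver was actually used.

```python
import math
import random


def sigmoid(z):
    # Guard against overflow for very negative inputs.
    if z < -700:
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic toy data: label 1 stands for a "human" click, 0 for a "non-human" click.
# The generating weights (2.0 for degree, 0.5 for similarity) are made up.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    degree = rng.gauss(0.0, 1.0)
    similarity = rng.gauss(0.0, 1.0)
    p_human = sigmoid(2.0 * degree + 0.5 * similarity)
    X.append([degree, similarity])
    y.append(1 if rng.random() < p_human else 0)

weights, bias = fit_logistic(X, y)
print("degree weight:", round(weights[0], 2))
print("similarity weight:", round(weights[1], 2))
```

Because both features are on the same scale, the larger fitted weight indicates the feature that separates the two classes more strongly, which is the kind of comparison the slides describe.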

Human Navigation

[Figure 8 panels: one plot per human path length (length 3 to length 8); the x-axes show path positions, the y-axes weights from 0 to 4.]

Figure 8: Logistic regression weights for classifying human vs. non-human clicks (with standard errors). Green: textual similarity. Red: degree. There is one plot per human path length; the x-axes show path positions, the y-axes weights.

red curves are the degree and the green ones the similarity weights. The weight of the bias term was omitted from the plots, since it is not informative. Note that, for visibility's sake, we do not show the weights for the last click. There, similarity becomes a nearly perfect indicator for the target article, since the target has maximum similarity with itself, so the similarity feature gets a very large weight, and the interesting part of the plots would get squished and hard to read.

Interpreting the plots, our expectation is confirmed. Both features obtain positive weights everywhere, which means that both high degree and high similarity with the target are characteristics of the click choices made by humans. More interesting, as hypothesized, degree dominates in the beginning of games, but as games progress, similarity becomes ever more important, superseding degree starting with the second or third click. Furthermore, similarity starts dominating earlier in more efficient games.

We emphasize that the purpose of this experiment is an analysis of the fitted feature weights, not maximizing the accuracy of the classifiers. Still, to justify our conclusions, we need to show that the classifiers perform better than chance (50%) on a statistically significant level. Evaluating the classifiers on the held-out test set, we find that this is the case. Accuracy is similar for all game lengths. It drops from around 90% for the first path position to about 65% for the second and then stays in the regime of between 55% and 65%. When maximum accuracy is the goal, more powerful features, such as TF-IDF cosine, perform better, but as mentioned earlier, feature correlation does not permit us to use this feature in our analysis.

4.3 Endgame strategies

The main finding of the previous section is that navigating through hubs is common in the opening of games. Next we take a closer look at the strategies players adopt in endgames, in order to home in on the target.

In the present analysis we define an endgame as the last 3 articles (i.e., 2 clicks) of a path. To make sure the endgames we analyze do not contain artifacts from the game openings, we consider only games of a full length of at least 5 articles (i.e., 4 clicks). We also neglect all games above the length threshold of 20 articles. The endgame strategy corresponding to an endgame $\langle u_{n-2}, u_{n-1}, u_n \rangle$ is defined as $\langle C(u_{n-2}), C(u_{n-1}), C(u_n) \rangle$, where $C(u)$ is $u$'s top-level category in the hierarchy that comes with our Wikipedia version. For instance, the full category of DIK-DIK is SCIENCE/BIOLOGY/MAMMALS, and C(DIK-DIK) = SCIENCE. All 14 values C can take on are listed in the caption of Fig. 9.

[Figure 9 plot: bar chart with one group per target category; y-axis mean overhead (0 to 1.5).]

Figure 9: Overhead with respect to optimal solutions, for single-category (red) and most popular multi-category (blue) strategies, with one group per target category. The green bars show means over all games of the respective target category. From left to right: PEOPLE, MUSIC, IT, LANGUAGE AND LITERATURE, HISTORY, SCIENCE, RELIGION, DESIGN AND TECHNOLOGY, CITIZENSHIP, ART, BUSINESS STUDIES, MATHEMATICS, EVERYDAY LIFE, GEOGRAPHY.

We divide the set of all Wikispeedia games into subsets according to target categories, such that all games with target articles from the same category are placed in the same subset. For each target category we observe between 29 and 104 distinct strategies, out of the possible $14^2 = 196$. For all target categories, the distribution over strategies is highly non-uniform, with most games following one of only a few top strategies. As a consequence, for each target category, the top 10 strategies typically cover between 60% and 90% of all games. Furthermore, the distributions over articles within each category are also non-uniform; e.g., 14% of all instances of GEOGRAPHY are UNITED STATES and 6.1% UNITED KINGDOM.

In 12 out of the 14 target categories, the most popular strategy is the one that consists of the target category only, which we call the 'simple' strategy: people tend to approach the target through articles from the same category as the target. In the remaining 2 categories, the simple strategy has very high rank, too: it is second most frequent when the target is from DESIGN AND TECHNOLOGY, and fourth most frequent when it is from PEOPLE. In the former case, the more frequent strategy is ⟨GEOGRAPHY, GEOGRAPHY, DESIGN AND TECHNOLOGY⟩; in the latter case, the 3 more frequent strategies are ⟨GEOGRAPHY, GEOGRAPHY, PEOPLE⟩, ⟨GEOGRAPHY, CITIZENSHIP, PEOPLE⟩, and ⟨GEOGRAPHY, HISTORY, PEOPLE⟩.

In these examples, GEOGRAPHY seems to play a prominent role. And indeed this is a general property of human paths. To demonstrate this, we count for each category c how often articles from it appear in endgames in which the target is not also of category c. We find that GEOGRAPHY accounts for 20% of articles in endgames of which GEOGRAPHY is not the target. The next most common categories according to this metric are SCIENCE (7.5%) and HISTORY (5.1%). One might argue that certain categories are a priori more likely, since they contain more articles. We can correct for this bias by considering the ratio of the above-introduced frequency and the a-priori category frequency, i.e., the number of articles in the category, divided by the overall number of articles. In the resulting ranking, GEOGRAPHY is still top, now followed by CITIZENSHIP (mostly about politics and culture) and RELIGION. This finding might imply that humans organize their knowledge strongly according to geographical lines, and that they often associate concepts with their countries of origin.

Previously, we saw that simple single-category endgames are typically most popular with players. Next we investigate how efficient they are compared to other, more complex strategies. Let $l$ be the length (number of clicks) of a human game, and $l^*$ the

Figure: Feature weights

Human Navigation

Both features are positive everywhere

Both high degree and high similarity are characteristics of the humanclicks

Degree dominates in the early game

As the game progresses, similarity becomes increasingly important

Similarity starts dominating earlier in more efficient games

Modeling Human Navigation in Information Networks

Decentralized search as a model of navigation in informationnetworks?

Limited topical knowledge

Cheap candidate consultation

High variation in user navigation

We model uncertainty probabilistically

Action Selection

Mechanism | Definition | Agent

Greedy | $\min_j d(j, t)$ | Always follows intuitions

ε-greedy | $\min_j d(j, t)$ with $P = 1 - \varepsilon$; random with $P = \varepsilon$ | Follows intuitions with probability $1 - \varepsilon$; acts randomly with probability $\varepsilon$

Softmax | $p(j) \propto e^{c f(j)}$, $f(j) = 1 - \frac{d(j, t)}{\max_{k,l \in V} d(k, l)}$ | Follows intuitions to different extents (controlled by $c$)

Inv. distance | $p(j) \propto f(j)^{-c}$, $f(j) = \frac{d(j, t)}{\max_{k,l \in V} d(k, l)}$ | Follows intuitions to different extents (controlled by $c$)
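The four mechanisms in the table can be sketched as follows. This is an illustrative sketch, not the implementation behind the slides: the function names, the candidate labels and the distance values are hypothetical, `dist[j]` plays the role of d(j, t), and `max_dist` plays the role of the maximum distance over all node pairs.

```python
import math
import random


def greedy(candidates, dist):
    """Pick the candidate with the smallest distance d(j, t) to the target."""
    return min(candidates, key=lambda j: dist[j])


def epsilon_greedy(candidates, dist, eps, rng):
    """Greedy with probability 1 - eps, uniformly random with probability eps."""
    if rng.random() < eps:
        return rng.choice(candidates)
    return greedy(candidates, dist)


def softmax_probs(candidates, dist, max_dist, c):
    """p(j) proportional to exp(c * f(j)), with f(j) = 1 - d(j, t) / max_dist."""
    weights = [math.exp(c * (1.0 - dist[j] / max_dist)) for j in candidates]
    total = sum(weights)
    return [w / total for w in weights]


def inverse_distance_probs(candidates, dist, max_dist, c):
    """p(j) proportional to f(j) ** (-c), with f(j) = d(j, t) / max_dist.

    Assumes d(j, t) > 0 for every candidate.
    """
    weights = [(dist[j] / max_dist) ** (-c) for j in candidates]
    total = sum(weights)
    return [w / total for w in weights]


def sample(candidates, probs, rng):
    """Draw one candidate according to the given probabilities."""
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical example: three candidate links with made-up distances to the target.
candidates = ["a", "b", "c"]
dist = {"a": 1, "b": 2, "c": 4}
max_dist = 5
rng = random.Random(42)

print(greedy(candidates, dist))
print(epsilon_greedy(candidates, dist, eps=0.15, rng=rng))
print(softmax_probs(candidates, dist, max_dist, c=5))
print(inverse_distance_probs(candidates, dist, max_dist, c=2))
print(sample(candidates, softmax_probs(candidates, dist, max_dist, c=5), rng))
```

In both probabilistic variants, a larger c puts more probability mass on the closest candidate, so the agent behaves more like the greedy one.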

ε-greedy

Figure: Stochastic decentralized search.

Softmax

Figure: Stochastic decentralized search.

[Plots: softmax probabilities p for the candidate nodes (NodeId 11, 12, 13, 14, 21, 22, 23), one plot each for c = 1, 2, 5, 10, 25, and 50; y-axis p from 0 to 1.]

Figure: Softmax probabilities.
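How the parameter c shapes the softmax probabilities can be illustrated with a short sweep over the c values shown in the plots. The distances below are hypothetical (the actual distances of the example nodes are not given in the text), and `softmax_probs` is an illustrative helper, not code from the study.

```python
import math


def softmax_probs(distances, max_dist, c):
    """p(j) proportional to exp(c * f(j)), with f(j) = 1 - d(j, t) / max_dist."""
    scores = [c * (1.0 - d / max_dist) for d in distances]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical distances d(j, t) for seven candidate nodes; the first is closest.
distances = [2, 3, 3, 4, 3, 4, 4]
max_dist = 5

best_prob = {}
for c in (1, 2, 5, 10, 25, 50):
    probs = softmax_probs(distances, max_dist, c)
    best_prob[c] = probs[0]
    print(c, [round(p, 3) for p in probs])
```

For small c the distribution is close to uniform, and as c grows the probability of the closest candidate approaches 1, which is the greedy limit.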

Experimental Setup

The WikiGame

A randomly selected start page and a randomly selected target page

Log of 250,000 successfully completed games

Wikipedia network with 10 million nodes and 250 million links

Distance space: a hierarchy of Wikipedia pages

Results

[Plots: three panels, (a) ε-greedy (x-axis ε, from greedy to 0.5), (b) Softmax (x-axis c⁻¹, from greedy to 35⁻¹), (c) Inverse distance (x-axis c⁻¹, from greedy to 5⁻¹). The y-axis shows s and τ (0 to 6); each panel plots Success Rate and Stretch of the model against Human Success Rate and Human Stretch.]

Figure: Success rate and stretch of different navigation models.
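The two measures in the figure can be sketched as below. The slides do not define them, so the definitions here are assumptions: success rate is taken as the fraction of searches that reach the target, and stretch as the mean ratio of hops taken to the shortest path length over successful searches. The game records are made up.

```python
def success_rate_and_stretch(games):
    """games: list of (hops, shortest) pairs; hops is None for a failed search.

    Assumed definitions: success rate = finished / all games,
    stretch = mean of hops / shortest over finished games.
    """
    finished = [(h, s) for h, s in games if h is not None]
    success_rate = len(finished) / len(games)
    stretch = sum(h / s for h, s in finished) / len(finished)
    return success_rate, stretch


# Hypothetical records: three finished searches and one failed one,
# all with a shortest path of 3 hops.
games = [(4, 3), (3, 3), (None, 3), (6, 3)]
s, tau = success_rate_and_stretch(games)
print("success rate:", s, "stretch:", round(tau, 3))
```

A stretch of 1 would mean that every successful search followed a shortest path; larger values indicate detours.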

Discussion

Greedy strategy is too efficient

People are not always greedy and deterministic

Human navigation is stochastic in nature

Appropriately configured, any probabilistic model is good

For simplicity and intuitive interpretation we consider only ε-greedy

Discussion

ε-greedy: With P = 1 − ε humans are greedy and follow their intuitions, and with P = ε they navigate randomly

In our dataset ε = 0.15

Our interpretation

Humans are greedy when they exploit what they know

They act “randomly” when they explore what they do not know

Exploration: incomplete knowledge and cheap cost of consultation

Results

[Plots: three panels showing hop length distributions (x-axis Hops, 5 to 50; y-axis Frequency, 0 to 0.5) for the shortest path, humans, the greedy model, and the stochastic models: (a) ε-greedy for several values of ε up to 0.5 (greedy at ε = 0), (b) Softmax for c = 35 to 65 (greedy at c = 70), (c) Inverse distance for c = 5 to 15 (greedy at c = 20).]

Figure: Hop length distributions.

Results

[Plots: three panels with y-axis D_KL (0 to 2.6), each comparing "Models from Shortest Path" and "Models from Humans": (a) ε-greedy (x-axis ε, from greedy to 0.5), (b) Softmax (x-axis c⁻¹, from greedy to 35⁻¹), (c) Inverse distance (x-axis c⁻¹, from greedy to 5⁻¹).]

Figure: Similarity of hop length distributions between models & humans, and models & shortest path.
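Comparing two hop length distributions with the KL divergence can be sketched as follows. The hop counts are made up, the small additive smoothing is an assumption introduced here to avoid division by zero for empty bins, and the direction of the divergence (model relative to humans) is likewise an assumption, since the slides do not state it.

```python
import math
from collections import Counter


def hop_distribution(hops, max_hops):
    """Relative frequency of each hop count 1..max_hops (longer paths are ignored)."""
    counts = Counter(hops)
    return [counts.get(h, 0) / len(hops) for h in range(1, max_hops + 1)]


def kl_divergence(p, q, smoothing=1e-9):
    """D_KL(p || q) in nats, after additive smoothing and renormalization."""
    p_s = [x + smoothing for x in p]
    q_s = [x + smoothing for x in q]
    p_total, q_total = sum(p_s), sum(q_s)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p_s, q_s)
    )


# Hypothetical hop counts of finished searches.
human_hops = [4, 4, 5, 3, 6, 4, 7, 5]
model_hops = [3, 3, 4, 3, 5, 4, 3, 4]

p_human = hop_distribution(human_hops, max_hops=10)
p_model = hop_distribution(model_hops, max_hops=10)
print("D_KL(model || humans):", kl_divergence(p_model, p_human))
```

A value of 0 means the two distributions coincide; the closer a model's value is to 0, the better its hop length distribution matches the reference.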

Discussion

Mode of human distribution is 4

Mode of all simulators is 3

Simulators are more efficient in the early game phases

ε is not constant

It is a function of the navigation progress

Discussion

[Plot: "Exploration vs. Hop Length"; x-axis t (1 to 10), y-axis exploration probability (0.45 to 0.75).]

Figure: Rates of exploration vs. navigation progress.

Decaying ε-greedy

As humans make progress they orient themselves better in the network

Ratio of exploitation to exploration increases

$\varepsilon(t) = \varepsilon_0 \lambda^{-t}$ (1)
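Equation (1) can be evaluated directly to see how quickly exploration fades. The sketch below uses ε0 = 0.9 and λ = 2 only as example inputs; whether the first click corresponds to t = 0 or t = 1 is not stated in the slides, so starting at t = 0 is an assumption, as is the function name.

```python
def decayed_epsilon(t, eps0, lam):
    """Exploration probability at step t: eps0 * lam ** (-t)."""
    return eps0 * lam ** (-t)


# Example schedule for the first five steps.
schedule = [decayed_epsilon(t, eps0=0.9, lam=2) for t in range(5)]
for t, eps in enumerate(schedule):
    print(t, round(eps, 4))
```

With λ = 2 the exploration probability halves at every step, so the agent quickly shifts from exploring to following its intuitions; with λ = 1 the schedule reduces to plain ε-greedy with constant ε0.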

Decaying ε-greedy

[Plots: (a) Success rate and stretch (y-axis s, τ; x-axis ε0 from greedy to 1.0), with model Success Rate and Stretch against Human Success Rate and Human Stretch; (b) Hop length distributions (x-axis Hops, y-axis Frequency) for the shortest path, humans, greedy (ε0 = 0), and ε0 = 0.4 to 1.0; (c) KL divergence (y-axis D_KL; x-axis ε0 from greedy to 1.0) for "Models from Shortest Path" and "Models from Humans".]

Figure: Humans are matched with λ = 2, ε0 = 0.9.

Decaying ε-greedy

General approach adaptable to other navigational settings

E.g. a related starting page: pick a smaller ε0

E.g. users orient themselves faster: pick a greater λ

Fit the model instead of using it as a generative model

Use other background knowledge, e.g. a smaller hierarchy

Decaying ε-greedy

[Plots: (a) Success rate and stretch (y-axis s, τ; x-axis ε0 from greedy to 1.0), with model Success Rate and Stretch against Human Success Rate and Human Stretch; (b) Hop length distributions (x-axis Hops, y-axis Frequency) for the shortest path, humans, greedy (ε0 = 0), and ε0 = 0.4 to 1.0; (c) KL divergence (y-axis D_KL; x-axis ε0 from greedy to 1.0) for "Models from Shortest Path" and "Models from Humans".]

Figure: Humans are matched with λ = 2, ε0 = 0.7.

Conclusions

Social networks

Good knowledge about ego-networks

People exploit that knowledge

Information networks

Incomplete knowledge

Humans exploit but they also tend to explore

Conclusions

“Randomness” aspect of navigation

Ratio of exploitation and exploration is not constant

The rate of change and the initial ratio depend on navigation settings

Thank You!

Questions?
