Link Prediction Bachelor Thesis Project
By
Srihari Pratapa (10CS30032)
advised by
Dr. Pabitra Mitra
Certificate
This is to certify that the thesis titled Link Prediction submitted by Srihari
Pratapa (10CS30032) to the Department of Computer Science and Engineering is a
bona fide record of work carried out by him under my supervision and guidance. The
thesis has fulfilled all the requirements as per the regulations of the Institute and, in
my opinion, has reached the standard needed for submission.
Dr. Pabitra Mitra
Professor
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
May 2014
Contents
1 Introduction
2 Mathematical Definition of Problem
3 Section I – Link Prediction for Social Networks
  3.1 Social Networks
  3.2 Datasets
  3.3 Evaluation Method and Accuracy
  3.4 Methods for Link Prediction
  3.5 Results
4 Section II – Link Prediction for Recommendations
  4.1 Recommender Systems
  4.2 Recommenders: Types and Approaches
  4.3 Datasets and Evaluation Method
  4.4 Methods for Recommendations
  4.5 Results
5 Conclusions
6 References
1. Introduction
Complex networks are networks with non-trivial topological features, whose
patterns of connection between elements are irregular and, to a large extent,
random. The study of complex networks helps us understand and model the
interactions between their elements. Among the most important complex networks
are social network structures, whose nodes represent people or other entities in a
social context and whose edges represent interaction or collaboration between them.
Examples of complex networks include co-authorship networks, in which the
elements are scientists and an interaction connects each pair who have co-authored
an article; social networking sites, in which nodes are people and interactions are
friendships; and e-commerce sites, which model interactions between people and
the products they bought. Graphs are the natural model for complex networks, as it
is quite intuitive that the elements become nodes and the interactions become edges.
Complex networks are highly dynamic objects: they grow very fast over time
as new edges and nodes are added, which makes their study difficult. Studying the
evolution and dynamics of social networks is a complex problem because of the
large number of parameters involved. A more tractable problem is to understand the
association between two specific nodes. How are new interactions formed? What
factors are responsible for the formation of new associations? More specifically, the
problem addressed here is predicting the likelihood of an association forming
between two nodes in the near future. This is the link-prediction problem [1].
The link-prediction problem arises in many complex networks and is handled
differently in different systems. In social networks it is simply the prediction of
future links between users. In e-commerce networks (e.g. Amazon) or networks
linking users and items (e.g. video sites), a bipartite graph is formed between users
and items, and links are predicted between them. The links predicted in the latter
setting are offered as recommendations; this is usually treated as a recommender-
systems problem, which falls under the broad classification of link prediction [10].
The work is presented in two sections: the first contains the work done on link
prediction in social networks as a part of BTP-I, and the second the work done on
recommender systems as a part of BTP-II.
2. Mathematical Definition
A social network is a graph G = <V, E> in which each edge e = <u, v> ∊ E
represents an association between u and v that existed at a particular time t(e).
Multiple interactions between u and v can be treated as different edges, with
potentially different time-stamps. For two times t < t', let G[t, t'] denote the
sub-graph of G consisting of all edges with a time-stamp between t and t'. The
interval [t0, t'0] is the training interval and [t1, t'1] is the test interval [1].
The formal statement of the link-prediction problem is then: given four times
t0 < t'0 < t1 < t'1 and the graph G[t0, t'0] as input to the link-prediction
algorithm, output a list of the probable future interactions (edges) that are absent
from G[t0, t'0] but present in G[t1, t'1]. Social networks also grow in nodes, as
new persons or elements join the network in addition to the growth through new
associations. Since nothing can be predicted for nodes absent from the training
interval, future edges are predicted only for nodes present in the training interval.
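As a concrete sketch of this formulation, consider a toy list of timestamped edges; all names and intervals below are illustrative assumptions, not the data used in this thesis.

```python
# Toy sketch of G[t, t']: keep only edges whose time-stamp falls in [t, t'].
def subgraph_in_interval(edges, t_start, t_end):
    """Edges of G[t_start, t_end]: all (u, v, t) with t_start <= t <= t_end."""
    return [(u, v) for (u, v, t) in edges if t_start <= t <= t_end]

# Illustrative timestamped edge list (not real data).
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 7)]

training = subgraph_in_interval(edges, 0, 3)   # G[t0, t'0]
test = subgraph_in_interval(edges, 4, 8)       # G[t1, t'1]

# Only pairs whose endpoints appear in the training interval are predictable:
train_nodes = {n for e in training for n in e}
predictable = [e for e in test if set(e) <= train_nodes]
```

Here ("c", "d") is dropped from the predictable test edges because d never appears in the training interval, matching the restriction above.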
3. Section I – Link Prediction for Social Networks
3.1 Social Networks
Social networks are an elegant way to represent and model the links and
interactions among people in a group or community. Mathematically, they can be
represented as graphs where a vertex corresponds to a person or an object in a
community and an edge represents some kind of association between different
persons or objects. The interactions among a group are intrinsic to that particular kind
of social network. However, social networks are very dynamic, since new edges and
vertices are added to the graph over time. Studying the evolution and dynamics of
social networks is a complex problem because of the large number of parameters.
Even though understanding the whole of a social network is a complex
problem, a more addressable one is to understand the association between two
specific nodes. For example, some of the questions that can be asked are: How does
the graph evolve over time? How are new interactions formed? What factors are
responsible for the formation of new associations? More specifically, the problem
addressed here is predicting the likelihood of an association between two nodes in
the near future. This problem is called Link Prediction in Social Networks [2].
Formal Definition: Given a snapshot of a social network, can we find out
which new interactions among its members are likely to occur in the near future? We
generalize this problem as the Link Prediction Problem, and develop approaches to
link prediction based on measures for analysing the "proximity" of nodes in a
network.
Besides helping to analyse and predict future associations between nodes in
social networks, link-prediction methods serve several important tasks in analysing
other complex networks. A scientific problem relevant to network analysis in the
current context is Information Retrieval, which can be viewed as the prediction of
relations between words and documents. In many biological networks, such as food
webs, protein-protein interaction networks, and metabolic networks, they help
determine interactions between nodes. Link-prediction methods can also be used to
recover missing information and to identify fake or false interactions.
3.2 Datasets
One of the most interesting communities in a social network is the scientific
community, with its co-authorship or collaboration networks, in which the nodes are
authors and two authors share an edge if they have published a paper together.
For the experiments, two co-authorship networks G were obtained from the author
lists of papers in two sections of the physics e-print arXiv, www.arxiv.org.
The two datasets used are collaboration networks from the Stanford Large
Network Dataset Collection [6], in which nodes represent scientists and edges
represent collaborations (co-authoring a paper):
Arxiv GR-QC (General Relativity and Quantum Cosmology): a collaboration
network from the e-print arXiv covering scientific collaborations between
authors of papers submitted to the General Relativity and Quantum Cosmology
category.
Arxiv HEP-TH (High Energy Physics - Theory): a collaboration network from
the e-print arXiv covering scientific collaborations between authors of papers
submitted to the High Energy Physics - Theory category.
For both networks, if an author i co-authored a paper with author j, the graph
contains an undirected edge between i and j. A paper co-authored by k authors thus
generates a completely connected (sub)graph on k nodes.
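The clique construction just described can be sketched as follows; the paper author lists are made up for illustration.

```python
from itertools import combinations

# Each paper's author list yields a complete subgraph on its k authors.
papers = [["alice", "bob", "carol"], ["bob", "dave"]]

edges = set()
for authors in papers:
    for u, v in combinations(sorted(authors), 2):
        edges.add((u, v))   # undirected edge between every co-author pair
```

The three-author paper alone contributes the three edges of a triangle, i.e. a complete subgraph on k = 3 nodes.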
3.3 Evaluation Method and Accuracy
Take an undirected network G(V, E), where V is the set of nodes and E is the
set of links. Let U be the universal set containing all |V|(|V|−1)/2 possible links,
where |V| denotes the number of elements in V. Every method assigns a score(x, y)
to each edge <x, y> in the set U − E and then produces a list ranked in decreasing
order of score(x, y).
To test an algorithm's accuracy, a training network GTr is constructed by
randomly removing some edges from the network G. The training network is the
input to the algorithm, and from the top predicted links in the output we measure
accuracy as the percentage of predicted links present in the original network. To
make this measure more robust, K-fold cross-validation is used: the edges of the
network are randomly divided into K partitions, each partition is selected in turn as
the removal set while the union of the remaining K−1 partitions forms the training
network GTr, and the accuracy (A) is the average over the K runs. The network used
to check whether a predicted edge is present or not is called the test graph GTe.
To measure accuracy, we first choose how many of the top predicted edges to
consider. Let that number be N; then accuracy can be defined as [2]:
A = (number of the top N predicted edges present in GTe) / N
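The K-fold evaluation described above can be sketched as follows. Here `score` stands in for any of the similarity measures of the next section, and the function name and parameters are illustrative assumptions, not the thesis implementation.

```python
import random
from itertools import combinations

def evaluate(nodes, edges, score, n_top, k=5, seed=0):
    """K-fold edge holdout: average fraction of the top-n_top predictions
    that are among the removed (test) edges."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    folds = [edges[i::k] for i in range(k)]           # K random partitions
    accuracies = []
    for i in range(k):
        test = set(folds[i])                          # removal set
        train = {e for j, f in enumerate(folds) if j != i for e in f}
        # candidate links: node pairs not present in the training graph
        candidates = [p for p in combinations(sorted(nodes), 2)
                      if p not in train]
        ranked = sorted(candidates, key=lambda p: score(p, train),
                        reverse=True)
        hits = sum(1 for p in ranked[:n_top] if p in test)
        accuracies.append(hits / n_top)
    return sum(accuracies) / k
```

On a triangle with one edge held out per fold, the only candidate pair is the held-out edge, so any scoring function achieves accuracy 1.0.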
3.4 Methods for Link Prediction
Many methods exist, especially similarity-based measures that assign scores.
Some of the important and popular methods are:
Graph Distance: Perhaps the most basic and trivial method. The approach is
to rank pairs (x, y) by the shortest path between them: score(x, y) is defined as the
negative of the shortest-path length between x and y. The negation is used because
scores are arranged in decreasing order.
For a node x, let T(x) denote the set of neighbours of x in the training graph
GTr. A number of approaches are based on the idea that two nodes x and y are more
likely to form a link in the future if their neighbour sets T(x) and T(y) have large
overlap; this follows the natural intuition that such nodes x and y represent authors
with many colleagues in common, and hence are more likely to come into contact
themselves.
Common Neighbours: In this method, score(x, y) is defined directly as the
number of neighbours x and y have in common [1]:
score(x, y) = |T(x) ∩ T(y)|
Jaccard's coefficient: This is the most commonly used similarity metric in
information retrieval. It measures the probability that both x and y have a feature f,
for a randomly selected feature f that either x or y has. Taking "features" here to be
neighbours, the score is [1]:
score(x, y) = |T(x) ∩ T(y)| / |T(x) ∪ T(y)|
Adamic/Adar (AA): Adamic and Adar compute features of web pages and
define the similarity between two pages as a sum over shared features, weighted by
the inverse log-frequency of each feature [3]. This kind of measure gives rarer
features more priority; with neighbours as features, the score becomes:
score(x, y) = Σ_{z ∈ T(x) ∩ T(y)} 1 / log |T(z)|
Resource Allocation (RA): Consider a pair of nodes x and y that are not
directly connected. Node x can send some resource to y, with their common
neighbours playing the role of transmitters. In the simplest case, we assume that
each transmitter has one unit of resource and distributes it equally among its
neighbours. The similarity between x and y can then be defined as the amount of
resource y receives from x [4]:
score(x, y) = Σ_{z ∈ T(x) ∩ T(y)} 1 / |T(z)|
Preferential Attachment (PA): The basic idea of this model is that the
probability that a new edge involves node x is proportional to |T(x)|, the current
number of neighbours of x. The probability of co-authorship of x and y is thus
correlated with the product of the numbers of collaborators of x and y. This
corresponds to the measure [4]:
score(x, y) = |T(x)| · |T(y)|
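On a small adjacency-set representation, the neighbourhood-based scores above can be sketched as follows; the toy graph (with T[x] playing the role of T(x)) is an illustrative assumption.

```python
from math import log

# Toy training graph as neighbour sets.
T = {
    "x": {"a", "b", "c"},
    "y": {"b", "c", "d"},
    "a": {"x"}, "b": {"x", "y"}, "c": {"x", "y"}, "d": {"y"},
}

def common_neighbours(x, y):
    return len(T[x] & T[y])

def jaccard(x, y):
    return len(T[x] & T[y]) / len(T[x] | T[y])

def adamic_adar(x, y):
    # rarer (low-degree) common neighbours contribute more
    return sum(1 / log(len(T[z])) for z in T[x] & T[y])

def resource_allocation(x, y):
    # each common neighbour z forwards 1/|T(z)| of its unit of resource
    return sum(1 / len(T[z]) for z in T[x] & T[y])

def preferential_attachment(x, y):
    return len(T[x]) * len(T[y])
```

For the pair (x, y) above, the common neighbours are b and c, each of degree 2, so resource allocation yields 1/2 + 1/2 = 1.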
3.5 Results
All the above metrics were run on the two datasets. 5-fold cross-validation
was used on both, and accuracy was measured with N as the top 20% of the full
predicted list in decreasing order of score(x, y). The results are:
The next section describes the work done on link prediction for recommender
systems, carried out as part of BTP-2.
4. Section II – Link Prediction for Recommendations
4.1 Recommender Systems
Recommender systems are techniques that provide suggestions for items a
user might be interested in. The suggestions relate to various decision-making
processes, such as which items to buy, which music to listen to, which movie to
watch, or which online news to read. In earlier times people did not have a vast
range of options to choose from, so they could get an overview of everything
available and pick what they wanted. Today, users have virtually infinite options,
far too many to examine individually, and this is where recommender systems come
into the picture. Nowadays, for almost anything there are millions of options for
music and movies and thousands of options for any product, so users definitely need
suggestions [10].
These systems are widespread in many areas and not restricted to particular
domains. Some of the best-known recommender systems are those of Netflix and
Amazon, which recommend movies and products respectively. Whatever the area
and product recommended, the basic premise is the same for all kinds: recommend
a subset Ia of items out of all available items I to a user a from a set of users U.
Additionally, we have information about the ratings a user gave to products
he or she previously liked, and possibly some meta-data about the user or the
products. In general, the number of products suggested to a user is far smaller than
the total number of items; we generally speak of top-N recommendations. Apart
from suggesting products based on a user's interests, there is also context-based
search, in which a user searches for something entirely outside his interests and we
must recommend items similar to what he searched for rather than to his interest
profile. The main question is how to use the available data to produce the best
recommendations.
The main work done here is music recommendation to users based on their
recorded interests. Some of the implemented methods are basic methods from the
literature, and some are our own experiments based on link-prediction techniques
over user data; Last.fm data is used for music recommendation. Methods based on
context-based search recommendation are also implemented: as before, some basic
methods from the literature plus our own methods based on weight estimation
through learning. For context-based search recommendation the MovieLens dataset
was used.
4.2 Recommenders: Types and Approaches
There are three principal approaches to recommender systems: the
metadata-based approach, the content-based approach, and hybrid approaches.
Metadata-based approaches are sub-divided into further categories, the main one
being collaborative filtering [10].
Collaborative Filtering
This approach has already been a great success and is used in many
commercial applications. A collaborative filtering system holds user data in which
items are rated on a predefined scale, generally up to 5 or 10. Almost all current
music and movie recommendation systems use collaborative filtering. Apart from
explicit ratings, these systems also collect a lot of implicit user feedback, such as
which genres a user listens to; stopping or skipping a song might indicate dislike.
There are two variants of collaborative filtering: item-based and user-based.
In user-based filtering, similarity between users is estimated either from existing
user-user friendship data or from the ratings users have given; once user similarity
is known, items are recommended to a user from the products of similar users. In
item-based filtering, similarity between items is first estimated from item metadata,
and then products similar to the user's interests are recommended.
For music recommendation on the Last.fm dataset, user-based methods are
implemented. The Last.fm dataset provides user friendship data, which is further
processed to obtain refined friendship data used in the recommendation.
Content-based Recommendations
Content-based recommendation analyses the content of items and extracts
item features. The basic idea is to obtain meaningful features from the objects and
represent them mathematically. Content-based approaches belong to the category of
item-based recommendation, where item-item similarities are used to generate
recommendations, with similarities measured over the extracted data. There are two
basic steps in this kind of recommender system:
Get all the features and build a meaningful representation of them.
Define a similarity function between the feature representations that
corresponds to how people perceive object similarity, use it to obtain item-item
similarity, then make recommendations.
Some basic content-based recommendation methods are implemented on the
MovieLens dataset, in which features such as genre, average rating, and year of
release are provided.
Hybrid Recommendation Approaches
Hybrid recommendation methods mix two or more approaches, for example
a collaborative-filtering technique with a content-based method, to gain the
advantages of both. In a basic case, a metadata-based approach and a content-based
method are combined to improve recommendation quality and to address underlying
problems such as the cold-start problem: how to integrate new elements that are
added to the system.
A hybrid recommendation approach that learns weights for content features
according to their importance to users, described in [12], is implemented, along
with some improvements over the presented method that improved the results
significantly. The original method scores every movie pair by the number of users
who watched both movies. That score is not well justified: consider a person who
watched both movies but liked one and disliked the other; this information is not
captured by the method presented in the paper. So improved scoring methods are
implemented here, and the results improve significantly.
4.3 Datasets and Evaluation Method
For the collaborative methods, Last.fm music data is used to evaluate the
methods. This dataset contains information about 2,000 users and 17,000 artists
with their corresponding albums. Along with this, a binary user-user friendship
matrix, user-artist rating data, and artist tag data are given. This is the data mainly
used in the collaborative methods; the user-user friendship matrix is processed to
obtain more information about users [8].
For the hybrid methods, MovieLens movie data is used to evaluate the
methods. This dataset contains a total of 100K ratings from 1,000 users on 1,700
movies; each user has rated at least 20 movies [9]. The movies come with many
features, for which weights are learned. Context-based search methods are
implemented on this data to obtain item-item similarity.
For evaluation, the standard recall measure is used. A 10-fold cross-validation
is performed on the data, in the same way as discussed in Section 3.3.
4.4 Methods for Recommendation
This section discusses all the implemented methods in detail: first the
collaborative-filtering methods for music recommendation, then the hybrid methods
implemented for context-based movie search. Results from these experiments are
presented in the next section.
The collaborative methods mostly use the given binary user-user friendship
matrix. The three methods performed on the music data are:
Singular Value Decomposition (SVD): SVD is generally used to remove
noise; it is used even in information retrieval. Performing SVD on the user-user
friendship matrix yields singular values that represent the dimensions of friendship;
the weaker, less important dimensions are discarded, reducing the rank of the
friendship matrix to some lower value k. Then, for each user, the top 30 friends are
found, all the artists these friends have listened to are collected, and each artist is
given a score; the 25 highest-scoring artists are suggested to each user. The score of
an artist is the sum, over the user's friends, of the product of the user-friend
friendship value and that friend's score for the artist.
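A minimal numpy sketch of this pipeline follows, under heavy simplifying assumptions: tiny made-up matrices, k = 2, and no top-30 friend cutoff. This illustrates the rank reduction and scoring steps, not the thesis setup.

```python
import numpy as np

F = np.array([[0., 1., 1., 0.],     # binary user-user friendship matrix
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])
R = np.array([[5., 0.],             # user-artist listening scores
              [0., 3.],
              [4., 2.],
              [1., 0.]])

# Rank-k approximation of the friendship matrix via SVD.
U, s, Vt = np.linalg.svd(F)
k = 2
F_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Artist score: sum over friends of (friendship weight) x (friend's score).
artist_scores = F_k @ R
recommended = np.argsort(-artist_scores, axis=1)   # best artists first, per user
```

Keeping only the k strongest singular directions smooths the binary friendship matrix into graded friendship weights before the scores are aggregated.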
Adamic/Adar (AA): In this method, a new friendship matrix is created from
the given user-user matrix using the Adamic/Adar link-prediction score: each
user-user pair is assigned its AA score. Once the new friendship matrix is defined,
artists are scored as in the SVD method and suggested to the user. The results of the
AA method are, however, low compared to the SVD method.
Resource Allocation (RA): In this method, a new friendship matrix is created
from the given user-user matrix using the resource-allocation link-prediction score:
each user-user pair is assigned its RA score. Once the new friendship matrix is
defined, artists are scored as described in the SVD method and suggested to users.
RA performs very well in link prediction, and the same happened here: its results
are very good compared to the SVD and AA methods.
For context-based recommendation, some naïve methods were implemented
first, then one hybrid method from the literature, and then two improved methods
built on the learned-weights hybrid method; one of them gave a significant
improvement in recall and the other gave the same recall as the method described
in [12]. The aim here is to measure movie-movie similarity. The methods are:
Naïve method [10]: Let the user-movie ratings matrix be A. A basic
movie-movie similarity score is then ATA, in which the ij-th entry gives the
similarity between the i-th and j-th movies. Once movie-movie similarity is
measured, the method is evaluated by taking a user from the training set,
considering any movie he watched, taking the top similar movies according to the
score, and checking how many of the suggested movies he watched; the
ground-truth recall is measured.
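The naïve score can be sketched with a tiny binary ratings matrix; the data below is illustrative.

```python
import numpy as np

A = np.array([[1, 1, 0],    # user 0 watched movies 0 and 1
              [1, 0, 1],    # user 1 watched movies 0 and 2
              [0, 1, 1]])   # user 2 watched movies 1 and 2

S = A.T @ A                 # S[i, j]: co-occurrence of movies i and j
np.fill_diagonal(S, 0)      # a movie should not recommend itself

most_similar_to_movie0 = int(np.argmax(S[0]))
```

With binary entries, S[i, j] is exactly the number of users who watched both movies i and j, which is what makes ATA a sensible baseline similarity.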
Learning feature weights recommendation [12]: Here every movie is
represented by a feature vector of attributes such as average rating, year of release,
and genre. Most general methods use some distance measure between the feature
vectors to obtain item-item similarity, but human judgement of similarity is quite
different: it gives different weights to different attributes. For example, the average
rating of a movie may be given more weight than the year it was released, and its
language more weight than its rating. So the similarity between two movies Mi and
Mj can be defined as [12]:
S(Mi, Mj) = w1 g(f1i, f1j) + w2 g(f2i, f2j) + … + wn g(fni, fnj)
where wn is the weight given to the difference between the n-th features, computed
by a function g(fni, fnj); the weights are to be learned. Four feature attributes are
used: year of release, average rating, genre, and language. How the function g is
defined for each feature is given in Table 4.1. The weights are learned using linear
regression. For the target score S(Mi, Mj) there can be different measures; the one
used in [12] is the total number of people who watched both Mi and Mj. But this
has a basic problem: it does not account for people's tastes differing. Each common
user contributes a score of 1, even if that user liked movie Mi but disliked movie Mj.
Taking this into consideration, two methods with different scoring functions have
been implemented here, and the experimental results for both are better than the
common-viewers score.
Feature          Type             Difference Measure
Year of release                   (270 − |Y1 − Y2|) / 270
Rating           Integer (0-5)    (5 − |R1 − R2|) / 5
Genre            Boolean vector   common true values / total genres
Language         String           (L1 == L2) ? 1 : 0

Table 4.1
Four different features are used for weighting, and the feature-difference
measure g for each feature is presented in Table 4.1. Next, the different scoring
functions, which take further factors into account and hence give better results, are
discussed.
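The four difference measures of Table 4.1 and the weighted sum can be sketched as follows; the movie records and weight values are illustrative assumptions (in the actual method the weights are learned by linear regression).

```python
def g_year(y1, y2):
    return (270 - abs(y1 - y2)) / 270          # Table 4.1, year of release

def g_rating(r1, r2):
    return (5 - abs(r1 - r2)) / 5              # Table 4.1, rating in 0-5

def g_genre(g1, g2):
    # boolean vectors: common true values / total genres
    return sum(a and b for a, b in zip(g1, g2)) / len(g1)

def g_lang(l1, l2):
    return 1 if l1 == l2 else 0                # (L1 == L2) ? 1 : 0

def similarity(m1, m2, w):
    """Weighted feature similarity S(Mi, Mj) = sum_n w_n * g_n."""
    return (w[0] * g_year(m1["year"], m2["year"])
            + w[1] * g_rating(m1["rating"], m2["rating"])
            + w[2] * g_genre(m1["genre"], m2["genre"])
            + w[3] * g_lang(m1["lang"], m2["lang"]))

m1 = {"year": 1994, "rating": 4, "genre": [1, 0, 1], "lang": "en"}
m2 = {"year": 1999, "rating": 3, "genre": [1, 1, 0], "lang": "en"}
weights = [0.1, 0.3, 0.4, 0.2]                 # illustrative, not learned
```

Each g maps a feature difference into [0, 1], so the learned weights directly express how much each attribute matters to perceived similarity.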
Cosine Similarity Score: The score is the cosine similarity of the ratings given
by the common users of movies Mi and Mj. Two vectors Vi and Vj are formed, one
per movie, containing the ratings given by the common users who watched both;
the score is the cosine similarity between these two vectors:
S(Mi, Mj) = (Vi · Vj) / (|Vi| |Vj|)
Resource Allocation Score: As above, we have two vectors Vi and Vj holding
the ratings of the common users for movies Mi and Mj. The score is calculated as a
sum over the common users k, where Vi[k] is the rating of the k-th common viewer
for Mi; if Vi[k] and Vj[k] are equal, the value 2 is given:
Sk = 1 / |Vi[k] − Vj[k]|   if Vi[k] − Vj[k] ≠ 0
Sk = 2                     otherwise
Now the score is measured as:
S(Mi, Mj) = Σ_{k=1}^{n} Sk
It is quite intuitive that this method penalises a pair when a particular common user
likes one movie but not the other. Since this scoring relates more closely to how
humans perceive similarity, its ground-truth results are better than the
common-viewers score, and slightly better than the cosine-similarity results. The
weights are again learned using linear regression.
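Both scoring functions can be sketched on the rating vectors of the common viewers of a movie pair; the ratings below are illustrative.

```python
from math import sqrt

def cosine_score(vi, vj):
    """Cosine similarity of the two common-viewer rating vectors."""
    dot = sum(a * b for a, b in zip(vi, vj))
    return dot / (sqrt(sum(a * a for a in vi)) * sqrt(sum(b * b for b in vj)))

def ra_score(vi, vj):
    # agreement earns 2; disagreement is penalised as 1 / |rating difference|
    return sum(2 if a == b else 1 / abs(a - b) for a, b in zip(vi, vj))

vi = [5, 3, 4]      # ratings for movie Mi by the common viewers
vj = [5, 1, 4]      # ratings for movie Mj by the same viewers
```

Here two viewers agree exactly and one differs by 2, so the RA-style score is 2 + 1/2 + 2 = 4.5: the disagreeing viewer contributes far less than the agreeing ones.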
4.5 Results
The experimental results of collaborative filtering for music on the Last.fm
dataset are presented in Table 4.2, and the results of the hybrid methods for movies
on the MovieLens dataset are presented in Table 4.3. The measure used is the
ground-truth recall value, as explained in Section 4.3. As expected, the
resource-allocation method works well in both cases: in collaborative filtering it
performs better than SVD and AA, while among the hybrid learning-feature-weights
methods, RA and cosine similarity give similar results.
Method                      Recall
Binary Friendship Matrix    0.121
SVD                         0.253
AA                          0.281
RA                          0.322

Table 4.2: Collaborative filtering results. The RA method gives a much larger
recall than both the SVD and AA methods.

Similarity score method                Recall
ATA                                    0.173
Number of common people who watched    0.225
Cosine similarity                      0.310
RA score                               0.331

Table 4.3: Hybrid filtering methods. From the hypothesis, the RA score was
expected to perform much better than the other measures, but in the experiments
the performance of cosine similarity and the RA score is nearly identical. Both,
however, perform better than the basic number-of-common-viewers score and the
ATA method, as expected and as is intuitively reasonable.
5. Conclusions
Recommender systems are a useful way to extract additional value for a
business from its user databases. They help users find items they want to buy from
a large pool of options, benefiting users by enabling them to find items they like
and, conversely, benefiting the business by generating more sales. Recommender
systems are rapidly becoming a crucial tool in e-commerce on the Web. They are
already stressed by the huge volume of user data in existing corporate databases,
and will be stressed even more by the increasing volume of user data available on
the Web.
Almost all commercial recommender systems in the real world today are
metadata-based, since user data is continuously collected in databases, which makes
recommendation based on it straightforward. But context-based and hybrid systems
are gaining popularity and academic attention these days, because they address
problems that collaborative systems fail to solve, such as the cold-start and sparsity
problems.
6. References
[1] Liben-Nowell, D., Kleinberg, J.: The Link Prediction Problem for Social Networks. In:
CIKM (2003).
[2] Lü, L., Zhou, T.: Link Prediction in Complex Networks: A Survey. Physica A 390,
1150–1170 (2011).
[3] Adamic, L. A., Adar, E.: Friends and Neighbors on the Web. Social Networks 25,
211–230 (2003).
[4] Newman, M. E. J.: Clustering and Preferential Attachment in Growing Networks. Phys.
Rev. E 64 (2001).
[5] Soundarajan, S., Hopcroft, J.: Using Community Information to Improve the Precision
of Link Prediction Methods. In: WWW (2012).
[6] Leskovec, J., Stanford University: Stanford Large Network Dataset Collection.
http://snap.stanford.edu/data/index.html
[7] Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph Evolution: Densification and Shrinking
Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD) 1(1)
(2007).
[8] Last.fm website, http://www.lastfm.com
[9] Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An Algorithmic Framework for
Performing Collaborative Filtering. In: Proceedings of the 1999 Conference on Research
and Development in Information Retrieval (1999).
[10] Burke, R.: Hybrid Recommender Systems: Survey and Experiments. User Modelling
and User-Adapted Interaction 12, 331–370 (2002).
[11] Melville, P., Mooney, R. J., Nagarajan, R.: Content-Boosted Collaborative Filtering for
Improved Recommendations. In: Proceedings of the 18th National Conference on Artificial
Intelligence (AAAI-2002), Edmonton, Canada (2002).
[12] Debnath, S., Ganguly, N., Mitra, P.: Feature Weighting in Content Based
Recommendation System Using Social Network Analysis. In: WWW 2008, Beijing, China
(2008).