Link Prediction Bachelor Thesis Project
By
Srihari Pratapa (10CS30032)
advised by
Dr. Pabitra Mitra
Certificate
This is to certify that the thesis titled Link Prediction submitted by Srihari
Pratapa (10CS30032) to the Department of Computer Science and Engineering is a
bona fide record of work carried out by him under my supervision and guidance. The
thesis has fulfilled all the requirements as per the regulations of the Institute and, in
my opinion, has reached the standard needed for submission.
Dr. Pabitra Mitra
Professor
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
May 2014
Contents
1 Introduction
2 Mathematical Definition of Problem
3 Section I – Link Prediction for Social Networks
  3.1 Social Networks
  3.2 Datasets
  3.3 Evaluation Method and Accuracy
  3.4 Methods for Link Prediction
  3.5 Results
4 Section II – Link Prediction for Recommendations
  4.1 Recommender Systems
  4.2 Recommenders: Types and Approaches
  4.3 Datasets and Evaluation Method
  4.4 Methods for Recommendations
  4.5 Results
5 Conclusions
6 References
1. Introduction
Complex networks are networks with non-trivial topological features, whose
patterns of connection between elements are irregular and, to a large extent,
random. The study of complex networks helps us understand and model the
interactions between their elements. Among the most important complex networks
are social network structures, whose nodes represent people or other entities in a
social context and whose edges represent interaction or collaboration between them.
Examples of complex networks include co-authorship networks, in which the
elements are scientists and an interaction connects each pair who have co-authored
an article; social networking sites, in which nodes are people and interactions are
friendships; and e-commerce sites, which model interactions between people and
the products they bought. Graphs are the natural model for complex networks, as it
is quite intuitive that the elements become nodes and the interactions become edges.
Complex networks are highly dynamic objects: they grow very fast over time
as new edges and nodes are added, which makes their study difficult. Studying the
evolution and dynamics of social networks is a complex problem because of the
large number of parameters involved. A more tractable problem is to understand the
association between two specific nodes. How are new interactions formed? What
factors are responsible for the formation of new associations? More specifically, the
problem addressed here is predicting the likelihood of an association forming
between two nodes in the near future. This is the link-prediction problem [1].
The link-prediction problem arises in many complex networks and is handled
differently in different systems. In social networks it is simply the prediction of
future links between users. In e-commerce networks (e.g. Amazon) or networks
linking users and items (e.g. video sites), a bipartite graph is formed between users
and items, and links are predicted between them. The links predicted in the latter
setting are offered as recommendations; this is usually treated as a recommender-
systems problem, which falls under the broad classification of link prediction [10].
The work is presented in two sections: the first contains the work done on link
prediction in social networks as a part of BTP-I, and the second the work done on
recommender systems as a part of BTP-II.
2. Mathematical Definition
A social network is a graph G = <V, E> in which each edge e = <u, v> ∊ E
represents an association between u and v that existed at a particular time t(e).
Multiple interactions between u and v can be treated as different edges, with
potentially different time-stamps. For two times t < t', let G[t, t'] denote the
sub-graph of G consisting of all edges with a time-stamp between t and t'. The
interval [t0, t'0] is the training interval and [t1, t'1] is the test interval [1].
The formal statement of the link-prediction problem is then: given four times
t0 < t'0 < t1 < t'1 and the graph G[t0, t'0] as input to the link-prediction
algorithm, output a list of the probable future interactions (edges) that are absent
from G[t0, t'0] but present in G[t1, t'1]. Social networks also grow in nodes, as
new persons or elements join the network in addition to the growth through new
associations. Since nothing can be predicted for nodes absent from the training
interval, future edges are predicted only for nodes present in the training interval.
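As a concrete sketch of this formulation, consider a toy list of timestamped edges; all names and intervals below are illustrative assumptions, not the data used in this thesis.

```python
# Toy sketch of G[t, t']: keep only edges whose time-stamp falls in [t, t'].
def subgraph_in_interval(edges, t_start, t_end):
    """Edges of G[t_start, t_end]: all (u, v, t) with t_start <= t <= t_end."""
    return [(u, v) for (u, v, t) in edges if t_start <= t <= t_end]

# Illustrative timestamped edge list (not real data).
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 7)]

training = subgraph_in_interval(edges, 0, 3)   # G[t0, t'0]
test = subgraph_in_interval(edges, 4, 8)       # G[t1, t'1]

# Only pairs whose endpoints appear in the training interval are predictable:
train_nodes = {n for e in training for n in e}
predictable = [e for e in test if set(e) <= train_nodes]
```

Here ("c", "d") is dropped from the predictable test edges because d never appears in the training interval, matching the restriction above.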
3. Section I – Link Prediction for Social Networks
3.1 Social Networks
Social networks are an elegant way to represent and model the links and
interactions among people in a group or community. Mathematically, they can be
represented as graphs where a vertex corresponds to a person or an object in a
community and an edge represents some kind of association between different
persons or objects. The interactions among a group are intrinsic to that particular kind
of social network. However, social networks are very dynamic, since new edges and
vertices are added to the graph over time. Studying the evolution and dynamics of
social networks is a complex problem because of the large number of parameters.
Even though understanding the whole of a social network is a complex
problem, a more addressable one is to understand the association between two
specific nodes. For example, some of the questions that can be asked are: How does
the graph evolve over time? How are new interactions formed? What factors are
responsible for the formation of new associations? More specifically, the problem
addressed here is predicting the likelihood of an association between two nodes in
the near future. This problem is called Link Prediction in Social Networks [2].
Formal Definition: Given a snapshot of a social network, can we find out
which new interactions among its members are likely to occur in the near future? We
generalize this problem as the Link Prediction Problem, and develop approaches to
link prediction based on measures for analysing the "proximity" of nodes in a
network.
Besides helping to analyse and predict future associations between nodes in
social networks, link-prediction methods serve several important tasks in analysing
other complex networks. A scientific problem relevant to network analysis in the
current context is Information Retrieval, which can be viewed as the prediction of
relations between words and documents. In many biological networks, such as food
webs, protein-protein interaction networks, and metabolic networks, they help
determine interactions between nodes. Link-prediction methods can also be used to
recover missing information and to identify fake or false interactions.
3.2 Datasets
One of the most interesting communities in a social network is the scientific
community, with its co-authorship or collaboration networks, in which the nodes are
authors and two authors share an edge if they have published a paper together.
For the experiments, two co-authorship networks G were obtained from the author
lists of papers in two sections of the physics e-print arXiv, www.arxiv.org.
The two datasets used are collaboration networks from the Stanford Large
Network Dataset Collection [6], in which nodes represent scientists and edges
represent collaborations (co-authoring a paper):
Arxiv GR-QC (General Relativity and Quantum Cosmology): a collaboration
network from the e-print arXiv covering scientific collaborations between
authors of papers submitted to the General Relativity and Quantum Cosmology
category.
Arxiv HEP-TH (High Energy Physics - Theory): a collaboration network from
the e-print arXiv covering scientific collaborations between authors of papers
submitted to the High Energy Physics - Theory category.
For both networks, if an author i co-authored a paper with author j, the graph
contains an undirected edge between i and j. A paper co-authored by k authors thus
generates a completely connected (sub)graph on k nodes.
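The clique construction just described can be sketched as follows; the paper author lists are made up for illustration.

```python
from itertools import combinations

# Each paper's author list yields a complete subgraph on its k authors.
papers = [["alice", "bob", "carol"], ["bob", "dave"]]

edges = set()
for authors in papers:
    for u, v in combinations(sorted(authors), 2):
        edges.add((u, v))   # undirected edge between every co-author pair
```

The three-author paper alone contributes the three edges of a triangle, i.e. a complete subgraph on k = 3 nodes.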
3.3 Evaluation Method and Accuracy
Take an undirected network G(V, E), where V is the set of nodes and E is the
set of links. Let U be the universal set containing all |V|(|V|−1)/2 possible links,
where |V| denotes the number of elements in V. Every method assigns a score(x, y)
to each edge <x, y> in the set U − E and then produces a list ranked in decreasing
order of score(x, y).
To test an algorithm's accuracy, a training network GTr is constructed by
randomly removing some edges from the network G. The training network is the
input to the algorithm, and from the top predicted links in the output we measure
accuracy as the percentage of predicted links present in the original network. To
make this measure more robust, K-fold cross-validation is used: the edges of the
network are randomly divided into K partitions, each partition is selected in turn as
the removal set while the union of the remaining K−1 partitions forms the training
network GTr, and the accuracy (A) is the average over the K runs. The network used
to check whether a predicted edge is present or not is called the test graph GTe.
To measure accuracy, we first choose how many of the top predicted edges to
consider. Let that number be N; then accuracy can be defined as [2]:
A = (number of the top N predicted edges present in GTe) / N
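The K-fold evaluation described above can be sketched as follows. Here `score` stands in for any of the similarity measures of the next section, and the function name and parameters are illustrative assumptions, not the thesis implementation.

```python
import random
from itertools import combinations

def evaluate(nodes, edges, score, n_top, k=5, seed=0):
    """K-fold edge holdout: average fraction of the top-n_top predictions
    that are among the removed (test) edges."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    folds = [edges[i::k] for i in range(k)]           # K random partitions
    accuracies = []
    for i in range(k):
        test = set(folds[i])                          # removal set
        train = {e for j, f in enumerate(folds) if j != i for e in f}
        # candidate links: node pairs not present in the training graph
        candidates = [p for p in combinations(sorted(nodes), 2)
                      if p not in train]
        ranked = sorted(candidates, key=lambda p: score(p, train),
                        reverse=True)
        hits = sum(1 for p in ranked[:n_top] if p in test)
        accuracies.append(hits / n_top)
    return sum(accuracies) / k
```

On a triangle with one edge held out per fold, the only candidate pair is the held-out edge, so any scoring function achieves accuracy 1.0.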
3.4 Methods for Link Prediction
Many methods exist, especially similarity-based measures that assign scores.
Some of the important and popular methods are:
Graph Distance: Perhaps the most basic and trivial method. The approach is
to rank pairs (x, y) by the shortest path between them: score(x, y) is defined as the
negative of the shortest-path length between x and y. The negation is used because
scores are arranged in decreasing order.
For a node x, let T(x) denote the set of neighbours of x in the training graph
GTr. A number of approaches are based on the idea that two nodes x and y are more
likely to form a link in the future if their neighbour sets T(x) and T(y) have large
overlap; this follows the natural intuition that such nodes x and y represent authors
with many colleagues in common, and hence are more likely to come into contact
themselves.
Common Neighbours: In this method, score(x, y) is defined directly as the
number of neighbours x and y have in common [1]:
score(x, y) = |T(x) ∩ T(y)|
Jaccard's coefficient: This is the most commonly used similarity metric in
information retrieval. It measures the probability that both x and y have a feature f,
for a randomly selected feature f that either x or y has. Taking "features" here to be
neighbours, the score is [1]:
score(x, y) = |T(x) ∩ T(y)| / |T(x) ∪ T(y)|
Adamic/Adar (AA): Adamic and Adar compute features of web pages and
define the similarity between two pages as a sum over shared features, weighted by
the inverse log-frequency of each feature [3]. This kind of measure gives rarer
features more priority; with neighbours as features, the score becomes:
score(x, y) = Σ_{z ∈ T(x) ∩ T(y)} 1 / log |T(z)|
Resource Allocation (RA): Consider a pair of nodes x and y that are not
directly connected. Node x can send some resource to y, with their common
neighbours playing the role of transmitters. In the simplest case, we assume that
each transmitter has one unit of resource and distributes it equally among its
neighbours. The similarity between x and y can then be defined as the amount of
resource y receives from x [4]:
score(x, y) = Σ_{z ∈ T(x) ∩ T(y)} 1 / |T(z)|
Preferential Attachment (PA): The basic idea of this model is that the
probability that a new edge involves node x is proportional to |T(x)|, the current
number of neighbours of x. The probability of co-authorship of x and y is thus
correlated with the product of the numbers of collaborators of x and y. This
corresponds to the measure [4]:
score(x, y) = |T(x)| · |T(y)|
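On a small adjacency-set representation, the neighbourhood-based scores above can be sketched as follows; the toy graph (with T[x] playing the role of T(x)) is an illustrative assumption.

```python
from math import log

# Toy training graph as neighbour sets.
T = {
    "x": {"a", "b", "c"},
    "y": {"b", "c", "d"},
    "a": {"x"}, "b": {"x", "y"}, "c": {"x", "y"}, "d": {"y"},
}

def common_neighbours(x, y):
    return len(T[x] & T[y])

def jaccard(x, y):
    return len(T[x] & T[y]) / len(T[x] | T[y])

def adamic_adar(x, y):
    # rarer (low-degree) common neighbours contribute more
    return sum(1 / log(len(T[z])) for z in T[x] & T[y])

def resource_allocation(x, y):
    # each common neighbour z forwards 1/|T(z)| of its unit of resource
    return sum(1 / len(T[z]) for z in T[x] & T[y])

def preferential_attachment(x, y):
    return len(T[x]) * len(T[y])
```

For the pair (x, y) above, the common neighbours are b and c, each of degree 2, so resource allocation yields 1/2 + 1/2 = 1.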
3.5 Results
All the above metrics were run on the two datasets. 5-fold cross-validation
was used on both, and accuracy was measured with N as the top 20% of the full
predicted list in decreasing order of score(x, y). The results are:
The next section describes the work done on link prediction for recommender
systems, carried out as part of BTP-2.
4. Section II – Link Prediction for Recommendations
4.1 Recommender Systems
Recommender systems are techniques that provide suggestions for items a
user might be interested in. The suggestions relate to various decision-making
processes, such as which items to buy, which music to listen to, which movie to
watch, or which online news to read. In earlier times people did not have a vast
range of options to choose from, so they could get an overview of everything
available and pick what they wanted. Today, users have virtually infinite options,
far too many to examine individually, and this is where recommender systems come
into the picture. Nowadays, for almost anything there are millions of options for
music and movies and thousands of options for any product, so users definitely need
suggestions [10].
These systems are widespread in many areas and not restricted to particular
domains. Some of the best-known recommender systems are those of Netflix and
Amazon, which recommend movies and products respectively. Whatever the area
and product recommended, the basic premise is the same for all kinds: recommend
a subset Ia of items out of all available items I to a user a from a set of users U.
Additionally, we have information about the ratings a user gave to products
he or she previously liked, and possibly some meta-data about the user or the
products. In general, the number of products suggested to a user is far smaller than
the total number of items; we generally speak of top-N recommendations. Apart
from suggesting products based on a user's interests, there is also context-based
search, in which a user searches for something entirely outside his interests and we
must recommend items similar to what he searched for rather than to his interest
profile. The main question is how to use the available data to produce the best
recommendations.
The main work done here is music recommendation to users based on their
recorded interests. Some of the implemented methods are basic methods from the
literature, and some are our own experiments based on link-prediction techniques
over user data; Last.fm data is used for music recommendation. Methods based on
context-based search recommendation are also implemented: as before, some basic
methods from the literature plus our own methods based on weight estimation
through learning. For context-based search recommendation the MovieLens dataset
was used.
4.2 Recommenders: Types and Approaches
There are three principal approaches to recommender systems: the
metadata-based approach, the content-based approach, and hybrid approaches.
Metadata-based approaches are sub-divided into further categories, the main one
being collaborative filtering [10].
Collaborative Filtering
This approach has already been a great success and is used in many
commercial applications. A collaborative filtering system holds user data in which
items are rated on a predefined scale, generally up to 5 or 10. Almost all current
music and movie recommendation systems use collaborative filtering. Apart from
explicit ratings, these systems also collect a lot of implicit user feedback, such as
which genres a user listens to; stopping or skipping a song might indicate dislike.
There are two variants of collaborative filtering: item-based and user-based.
In user-based filtering, similarity between users is estimated either from existing
user-user friendship data or from the ratings users have given; once user similarity
is known, items are recommended to a user from the products of similar users. In
item-based filtering, similarity between items is first estimated from item metadata,
and then products similar to the user's interests are recommended.
For music recommendation on the Last.fm dataset, user-based methods are
implemented. The Last.fm dataset provides user friendship data, which is further
processed to obtain refined friendship data used in the recommendation.
Content-based Recommendations
Content-based recommendation analyses the content of items and extracts
item features. The basic idea is to obtain meaningful features from the objects and
represent them mathematically. Content-based approaches belong to the category of
item-based recommendation, where item-item similarities are used to generate
recommendations, with similarities measured over the extracted data. There are two
basic steps in this kind of recommender system:
Get all the features and build a meaningful representation of them.
Define a similarity function between the feature representations that
corresponds to how people perceive object similarity, use it to obtain item-item
similarity, then make recommendations.
Some basic content-based recommendation methods are implemented on the
MovieLens dataset, in which features such as genre, average rating, and year of
release are provided.
Hybrid Recommendation Approaches
Hybrid recommendation methods mix two or more approaches, for example
a collaborative-filtering technique with a content-based method, to gain the
advantages of both. In a basic case, a metadata-based approach and a content-based
method are combined to improve recommendation quality and to address underlying
problems such as the cold-start problem: how to integrate new elements that are
added to the system.
A hybrid recommendation approach that learns weights for content features
according to their importance to users, described in [12], is implemented, along
with some improvements over the presented method that improved the results
significantly. The original method scores every movie pair by the number of users
who watched both movies. That score is not well justified: consider a person who
watched both movies but liked one and disliked the other; this information is not
captured by the method presented in the paper. So improved scoring methods are
implemented here, and the results improve significantly.
4.3 Datasets and Evaluation Method
For the collaborative methods, Last.fm music data is used to evaluate the
methods. This dataset contains information about 2,000 users and 17,000 artists
with their corresponding albums. Along with this, a binary user-user friendship
matrix, user-artist rating data, and artist tag data are given. This is the data mainly
used in the collaborative methods; the user-user friendship matrix is processed to
obtain more information about users [8].
For the hybrid methods, MovieLens movie data is used to evaluate the
methods. This dataset contains a total of 100K ratings from 1,000 users on 1,700
movies; each user has rated at least 20 movies [9]. The movies come with many
features, for which weights are learned. Context-based search methods are
implemented on this data to obtain item-item similarity.
For evaluation, the standard recall measure is used. A 10-fold cross-validation
is performed on the data, in the same way as discussed in Section 3.3.
4.4 Methods for Recommendation
This section discusses all the implemented methods in detail: first the
collaborative-filtering methods for music recommendation, then the hybrid methods
implemented for context-based movie search. Results from these experiments are
presented in the next section.
The collaborative methods mostly use the given binary user-user friendship
matrix. The three methods performed on the music data are:
Singular Value Decomposition (SVD): SVD is generally used to remove
noise; it is used even in information retrieval. Performing SVD on the user-user
friendship matrix yields singular values that represent the dimensions of friendship;
the weaker, less important dimensions are discarded, reducing the rank of the
friendship matrix to some lower value k. Then, for each user, the top 30 friends are
found, all the artists these friends have listened to are collected, and each artist is
given a score; the 25 highest-scoring artists are suggested to each user. The score of
an artist is the sum, over the user's friends, of the product of the user-friend
friendship value and that friend's score for the artist.
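A minimal numpy sketch of this pipeline follows, under heavy simplifying assumptions: tiny made-up matrices, k = 2, and no top-30 friend cutoff. This illustrates the rank reduction and scoring steps, not the thesis setup.

```python
import numpy as np

F = np.array([[0., 1., 1., 0.],     # binary user-user friendship matrix
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])
R = np.array([[5., 0.],             # user-artist listening scores
              [0., 3.],
              [4., 2.],
              [1., 0.]])

# Rank-k approximation of the friendship matrix via SVD.
U, s, Vt = np.linalg.svd(F)
k = 2
F_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Artist score: sum over friends of (friendship weight) x (friend's score).
artist_scores = F_k @ R
recommended = np.argsort(-artist_scores, axis=1)   # best artists first, per user
```

Keeping only the k strongest singular directions smooths the binary friendship matrix into graded friendship weights before the scores are aggregated.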
Adamic/Adar (AA): In this method, a new friendship matrix is created from
the given user-user matrix using the Adamic/Adar link-prediction score: each
user-user pair is assigned its AA score. Once the new friendship matrix is defined,
artists are scored as in the SVD method and suggested to the user. The results of the
AA method are, however, low compared to the SVD method.
Resource Allocation (RA): In this method, a new friendship matrix is created
from the given user-user matrix using the resource-allocation link-prediction score:
each user-user pair is assigned its RA score. Once the new friendship matrix is
defined, artists are scored as described in the SVD method and suggested to users.
RA performs very well in link prediction, and the same happened here: its results
are very good compared to the SVD and AA methods.
For context-based recommendation, some naïve methods were implemented
first, then one hybrid method from the literature, and then two improved methods
built on the learned-weights hybrid method; one of them gave a significant
improvement in recall and the other gave the same recall as the method described
in [12]. The aim here is to measure movie-movie similarity. The methods are:
Naïve method [10]: Let the user-movie ratings matrix be A. A basic
movie-movie similarity score is then ATA, in which the ij-th entry gives the
similarity between the i-th and j-th movies. Once movie-movie similarity is
measured, the method is evaluated by taking a user from the training set,
considering any movie he watched, taking the top similar movies according to the
score, and checking how many of the suggested movies he watched; the
ground-truth recall is measured.
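The naïve score can be sketched with a tiny binary ratings matrix; the data below is illustrative.

```python
import numpy as np

A = np.array([[1, 1, 0],    # user 0 watched movies 0 and 1
              [1, 0, 1],    # user 1 watched movies 0 and 2
              [0, 1, 1]])   # user 2 watched movies 1 and 2

S = A.T @ A                 # S[i, j]: co-occurrence of movies i and j
np.fill_diagonal(S, 0)      # a movie should not recommend itself

most_similar_to_movie0 = int(np.argmax(S[0]))
```

With binary entries, S[i, j] is exactly the number of users who watched both movies i and j, which is what makes ATA a sensible baseline similarity.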
Learning feature weights recommendation [12]: Here every movie is
represented by a feature vector of attributes such as average rating, year of release,
and genre. Most general methods use some distance measure between the feature
vectors to obtain item-item similarity, but human judgement of similarity is quite
different: it gives different weights to different attributes. For example, the average
rating of a movie may be given more weight than the year it was released, and its
language more weight than its rating. So the similarity between two movies Mi and
Mj can be defined as [12]:
S(Mi, Mj) = w1 g(f1i, f1j) + w2 g(f2i, f2j) + … + wn g(fni, fnj)
where wn is the weight given to the difference between the n-th features, computed
by a function g(fni, fnj); the weights are to be learned. Four feature attributes are
used: year of release, average rating, genre, and language. How the function g is
defined for each feature is given in Table 4.1. The weights are learned using linear
regression. For the target score S(Mi, Mj) there can be different measures; the one
used in [12] is the total number of people who watched both Mi and Mj. But this
has a basic problem: it does not account for people's tastes differing. Each common
user contributes a score of 1, even if that user liked movie Mi but disliked movie Mj.
Taking this into consideration, two methods with different scoring functions have
been implemented here, and the experimental results for both are better than the
common-viewers score.
Feature          Type             Difference Measure
Year of release                   (270 − |Y1 − Y2|) / 270
Rating           Integer (0-5)    (5 − |R1 − R2|) / 5
Genre            Boolean vector   common true values / total genres
Language         String           (L1 == L2) ? 1 : 0

Table 4.1
Four different features are used for weighting, and the feature-difference
measure g for each feature is presented in Table 4.1. Next, the different scoring
functions, which take further factors into account and hence give better results, are
discussed.
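The four difference measures of Table 4.1 and the weighted sum can be sketched as follows; the movie records and weight values are illustrative assumptions (in the actual method the weights are learned by linear regression).

```python
def g_year(y1, y2):
    return (270 - abs(y1 - y2)) / 270          # Table 4.1, year of release

def g_rating(r1, r2):
    return (5 - abs(r1 - r2)) / 5              # Table 4.1, rating in 0-5

def g_genre(g1, g2):
    # boolean vectors: common true values / total genres
    return sum(a and b for a, b in zip(g1, g2)) / len(g1)

def g_lang(l1, l2):
    return 1 if l1 == l2 else 0                # (L1 == L2) ? 1 : 0

def similarity(m1, m2, w):
    """Weighted feature similarity S(Mi, Mj) = sum_n w_n * g_n."""
    return (w[0] * g_year(m1["year"], m2["year"])
            + w[1] * g_rating(m1["rating"], m2["rating"])
            + w[2] * g_genre(m1["genre"], m2["genre"])
            + w[3] * g_lang(m1["lang"], m2["lang"]))

m1 = {"year": 1994, "rating": 4, "genre": [1, 0, 1], "lang": "en"}
m2 = {"year": 1999, "rating": 3, "genre": [1, 1, 0], "lang": "en"}
weights = [0.1, 0.3, 0.4, 0.2]                 # illustrative, not learned
```

Each g maps a feature difference into [0, 1], so the learned weights directly express how much each attribute matters to perceived similarity.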
Cosine Similarity Score: The score is the cosine similarity of the ratings given
by the common users of movies Mi and Mj. Two vectors Vi and Vj are formed, one
per movie, containing the ratings given by the common users who watched both;
the score is the cosine similarity between these two vectors:
S(Mi, Mj) = (Vi · Vj) / (|Vi| |Vj|)
Resource Allocation Score: As above, we have two vectors Vi and Vj holding
the ratings of the common users for movies Mi and Mj. The score is calculated as a
sum over the common users k, where Vi[k] is the rating of the k-th common viewer
for Mi; if Vi[k] and Vj[k] are equal, the value 2 is given:
Sk = 1 / |Vi[k] − Vj[k]|   if Vi[k] − Vj[k] ≠ 0
Sk = 2                     otherwise
Now the score is measured as:
S(Mi, Mj) = Σ_{k=1}^{n} Sk
It is quite intuitive that this method penalises a pair when a particular common user
likes one movie but not the other. Since this scoring relates more closely to how
humans perceive similarity, its ground-truth results are better than the
common-viewers score, and slightly better than the cosine-similarity results. The
weights are again learned using linear regression.
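Both scoring functions can be sketched on the rating vectors of the common viewers of a movie pair; the ratings below are illustrative.

```python
from math import sqrt

def cosine_score(vi, vj):
    """Cosine similarity of the two common-viewer rating vectors."""
    dot = sum(a * b for a, b in zip(vi, vj))
    return dot / (sqrt(sum(a * a for a in vi)) * sqrt(sum(b * b for b in vj)))

def ra_score(vi, vj):
    # agreement earns 2; disagreement is penalised as 1 / |rating difference|
    return sum(2 if a == b else 1 / abs(a - b) for a, b in zip(vi, vj))

vi = [5, 3, 4]      # ratings for movie Mi by the common viewers
vj = [5, 1, 4]      # ratings for movie Mj by the same viewers
```

Here two viewers agree exactly and one differs by 2, so the RA-style score is 2 + 1/2 + 2 = 4.5: the disagreeing viewer contributes far less than the agreeing ones.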
4.5 Results
The experimental results of collaborative filtering for music on the Last.fm
dataset are presented in Table 4.2, and the results of the hybrid methods for movies
on the MovieLens dataset are presented in Table 4.3. The measure used is the
ground-truth recall value, as explained in Section 4.3. As expected, the
resource-allocation method works well in both cases: in collaborative filtering it
performs better than SVD and AA, while among the hybrid learning-feature-weights
methods, RA and cosine similarity give similar results.
Method                      Recall
Binary Friendship Matrix    0.121
SVD                         0.253
AA                          0.281
RA                          0.322

Table 4.2: Collaborative filtering results. The RA method gives a much larger
recall than both the SVD and AA methods.

Similarity score method                Recall
ATA                                    0.173
Number of common people who watched    0.225
Cosine similarity                      0.310
RA score                               0.331

Table 4.3: Hybrid filtering methods. From the hypothesis, the RA score was
expected to perform much better than the other measures, but in the experiments
the performance of cosine similarity and the RA score is nearly identical. Both,
however, perform better than the basic number-of-common-viewers score and the
ATA method, as expected and as is intuitively reasonable.
5. Conclusions
Recommender systems are a useful way to extract additional value for a
business from its user databases. They help users find items they want to buy from
a large pool of options, benefiting users by enabling them to find items they like
and, conversely, benefiting the business by generating more sales. Recommender
systems are rapidly becoming a crucial tool in e-commerce on the Web. They are
already stressed by the huge volume of user data in existing corporate databases,
and will be stressed even more by the increasing volume of user data available on
the Web.
Almost all commercial recommender systems in the real world today are
metadata-based, since user data is continuously collected in databases, which makes
recommendation based on it straightforward. But context-based and hybrid systems
are gaining popularity and academic attention these days, because they address
problems that collaborative systems fail to solve, such as the cold-start and sparsity
problems.
6. References
[1] Liben-Nowell, D., Kleinberg, J.: The Link Prediction Problem for Social Networks. In:
CIKM (2003).
[2] Lü, L., Zhou, T.: Link Prediction in Complex Networks: A Survey. Physica A 390,
1150–1170 (2011).
[3] Adamic, L. A., Adar, E.: Friends and Neighbors on the Web. Social Networks 25,
211–230 (2003).
[4] Newman, M. E. J.: Clustering and Preferential Attachment in Growing Networks. Phys.
Rev. E 64 (2001).
[5] Soundarajan, S., Hopcroft, J.: Using Community Information to Improve the Precision
of Link Prediction Methods. In: WWW (2012).
[6] Leskovec, J., Stanford University: Stanford Large Network Dataset Collection.
http://snap.stanford.edu/data/index.html
[7] Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph Evolution: Densification and Shrinking
Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD) 1(1)
(2007).
[8] Last.fm website, http://www.lastfm.com
[9] Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An Algorithmic Framework for
Performing Collaborative Filtering. In: Proceedings of the 1999 Conference on Research
and Development in Information Retrieval (1999).
[10] Burke, R.: Hybrid Recommender Systems: Survey and Experiments. User Modelling
and User-Adapted Interaction 12, 331–370 (2002).
[11] Melville, P., Mooney, R. J., Nagarajan, R.: Content-Boosted Collaborative Filtering for
Improved Recommendations. In: Proceedings of the 18th National Conference on Artificial
Intelligence (AAAI-2002), Edmonton, Canada (2002).
[12] Debnath, S., Ganguly, N., Mitra, P.: Feature Weighting in Content Based
Recommendation System Using Social Network Analysis. In: WWW 2008, Beijing, China
(2008).