Recommending whom to follow in Twitter
using Genetic Algorithm
By
Ajay Karri
Rajiv Neal
Harsheel Saraiya
Mentored by
Prof. Lixin Gao
[1] Introduction
A social network is composed of individuals and organizations that are connected by some common
interest. The importance of social networks is increasing with the popularity of websites like
Facebook, Twitter, and Orkut, and this is influencing human social behavior. There is considerable
interest in understanding the complexity of the graph representing a network and in predicting
the formation of links between nodes in that network.
[2] Motivation
We suggest the formation of new links between user nodes in the network. The motivation behind
this idea is that a Recommendation System might help a user reconnect with an old friend with
whom he or she has lost contact. The same idea applies to e-commerce websites, whose growing
popularity makes recommendations valuable: a recommendation system can suggest items that buyers
might consider, based on the products they already bought or browsed. This is likely to increase
the company's sales. The approach is popular among websites like Amazon, Wal-Mart, and Target.
How is our recommendation system different from Twitter’s recommendation system?
Twitter mainly recommends famous personalities and gives less importance to the people who
follow us, even when they have good connectivity and reputation. In our project we give equal
importance to the people who follow us and to the famous personalities, while also taking into
account the other factors discussed in Section 4.
[3] Project Overview
In our project we implement a friend recommendation system for Twitter, a popular social
networking website. In essence, we recommend nodes that a given user should follow. Various
algorithms provide this functionality. Most of them fall under the ‘k-Nearest Neighbor
approach’ or the ‘Collaborative Filtering approach’.
We use a Genetic Algorithm for the ‘whom to follow in Twitter’ Recommendation System. It is
based on the topology and structure of the network surrounding the central user. Unlike other
topology-based approaches, we use Clustering indexes and a new calibration technique, the
Fitness Function.
We developed an algorithm that analyses the sub-graph composed of a user and all the other
people connected to that user within three degrees of separation. However, only users at two
degrees of separation are candidates to be recommended to the user. The algorithm uses the
patterns defined by the connections between the nodes to find those users who behave similarly
to the root user.
In Section 4 we explain the parameters used in our recommendation system. In Section 5 we
explain the Genetic Algorithm, and in Section 6 we propose our recommendation system.
[4] Recommendation System Steps/ Mechanism
The Recommendation mechanism uses graph topology to filter and order a set of nodes that have
some important properties in relation to a given user node vi. The nodes of the resulting set
are recommendations of new edges that should be connected to node vi.
The process of Recommendation is broadly divided into two main steps: Filtering and Ordering.
[4.1] Filtering Step
The Filtering step selects the nodes that have a high probability of being recommended to the
given user node. This step is based on the concept of the Clustering Coefficient, which is
characteristic of small world networks.
Fig 4.1 A visual example of a sub-network showing the links between a single user, his
followers, and followers-of-followers in a social network.
As shown in Figure 4.1, region A represents the central user and the people whom he is
following. Region B also includes the followers of followers of the main user. We recommend
only those nodes that are two hops away from the central node. These nodes fall in the shaded
region between the circles defined by B and A.
[4.2] Ordering Step
The Ordering step is as important as the Filtering step. The procedure is based on the
calculation of different indexes and their normalization (Section 4.3).
Three independent indexes are considered. A Fitness Function (Section 4.3.4) then merges these
three indexes into one value. Based on this value we put the most relevant nodes at the top of
the resulting list that we obtain after each iteration of the Genetic Algorithm.
These indexes measure specific properties of the sub-graph composed of the nodes being
analyzed. Other indexes could be used instead. The indexes that we use are based on the
Friends-of-Friends approach. Motivated by the idea of Clustering coefficients, we define and
use a parameter called the Adjacent Density.
[4.2.1] Adjacent Density
Consider a network that has many nodes. Let C represent the set of all nodes in the given
network. Then the Adjacent Density D_C among the nodes in the given network is given by,
D_C = \frac{\sum_{i \in C} \sum_{j \in C} M_{ij}}{|C| \cdot (|C| - 1)}
M represents the Adjacency Matrix. M_{ij} is one if there is a link between nodes i and j, else
it is zero. The denominator is a normalizing factor, where |C| represents the number of nodes
in set C.
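The Adjacent Density can be sketched in Java as follows. This is a minimal illustration rather than the project's code: the class and method names are hypothetical, the normalizing factor is taken to be |C| * (|C| - 1) (the number of possible directed links), and a set with fewer than two nodes is assumed to have density 0.

```java
import java.util.List;

// Hypothetical helper, not the authors' code: Adjacent Density of a node set C.
class AdjacentDensity {

    // m[i][j] == 1 means there is a link from node i to node j.
    // nodes holds the matrix indices of the nodes in C.
    static double density(int[][] m, List<Integer> nodes) {
        int n = nodes.size();
        if (n < 2) {
            return 0.0; // assumption: too few nodes to contain any link
        }
        int links = 0;
        for (int i : nodes) {
            for (int j : nodes) {
                if (i != j) {
                    links += m[i][j]; // self-links are ignored
                }
            }
        }
        return (double) links / (n * (n - 1));
    }

    public static void main(String[] args) {
        int[][] m = {
            {0, 1, 1},
            {1, 0, 0},
            {0, 0, 0}
        };
        // 3 links among 3 nodes, out of 3 * 2 = 6 possible directed links
        System.out.println(density(m, List.of(0, 1, 2))); // prints 0.5
    }
}
```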
[4.3] Concept of Indexes
[4.3.1] First Index
The First Index measures the number of nodes common to the center node i and the node j that is
being ordered for the recommendation system. Writing A_i for the set of nodes followed by node
i, it is given by,
I_1(i, j) = |A_i \cap A_j|
The intuition is that it returns the number of nodes that are being “followed” in Twitter by
both the Central user and the node being recommended to the Central user.
[4.3.2] Second Index
The Second Index measures the cohesion level inside the group formed by the common nodes
followed by person i and person j. With D the Adjacent Density of Section 4.2.1, it is given by,
I_2(i, j) = D_{A_i \cap A_j}
The intuition behind this index is that we want to know how densely connected the nodes in the
common region are. This directly influences the Recommendation System. If the common region is
very dense, many entries of matrix M have the value ‘1’, resulting in a high second index
value. If the common region does not have many links between the nodes, the second index value
is low.
[4.3.3] Third Index
The Third Index is a variation of the second index. It measures the cohesion level in the
region formed by the group of nodes that are being followed by node i or node j. It is given by,
I_3(i, j) = D_{A_i \cup A_j}
The intuition here is that a high second index value does not necessarily lead to a high third
index value. The third index is independent of the second index.
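The three indexes can be computed directly from "who follows whom" sets. The sketch below is illustrative only: the class name, the map-based graph representation, and the sample ids are assumptions, and the density is normalized by n * (n - 1) for a set of n nodes.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the three indexes for a central node i and a candidate node j.
class FollowIndexes {

    // Adjacent Density of a node set, with links read from the "follows" map.
    static double density(Map<Integer, Set<Integer>> follows, Set<Integer> nodes) {
        int n = nodes.size();
        if (n < 2) {
            return 0.0; // assumption: too few nodes to contain any link
        }
        int links = 0;
        for (int a : nodes) {
            for (int b : follows.getOrDefault(a, Set.of())) {
                if (a != b && nodes.contains(b)) {
                    links++;
                }
            }
        }
        return (double) links / (n * (n - 1));
    }

    // Returns {first index, second index, third index}.
    static double[] indexes(Map<Integer, Set<Integer>> follows, int i, int j) {
        Set<Integer> fi = follows.getOrDefault(i, Set.of());
        Set<Integer> fj = follows.getOrDefault(j, Set.of());
        Set<Integer> common = new HashSet<>(fi);
        common.retainAll(fj); // followed by both i and j
        Set<Integer> either = new HashSet<>(fi);
        either.addAll(fj);    // followed by i or j
        return new double[] {common.size(), density(follows, common), density(follows, either)};
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> follows = Map.of(
            0, Set.of(1, 2, 3),  // central node
            9, Set.of(2, 3, 4),  // candidate node
            2, Set.of(3),
            3, Set.of(2, 4));
        double[] idx = indexes(follows, 0, 9);
        System.out.println(idx[0] + " " + idx[1] + " " + idx[2]); // prints 2.0 1.0 0.25
    }
}
```

In this toy graph the common region {2, 3} is fully linked (second index 1.0), while the union {1, 2, 3, 4} holds only 3 of 12 possible links (third index 0.25), which illustrates that the two densities vary independently.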
[4.3.4] Fitness Function
Our main goal is to select good nodes to be recommended from a pool of nodes based on a single
parameter. Hence, after getting the three index values of a particular node, we convert them
into a single value. This conversion is performed by the Fitness Function given below, where nc
is the central node and n is the candidate node,
M (n, w) = I1 (nc, n) * w1 + I2 (nc, n) * w2 + I3 (nc, n) * w3
In essence, the Fitness Function is the weighted sum of the three index values.
w1, w2, w3 are the weights associated with the three indexes for each node. At the start of the
algorithm, all these weights have the same value, which is ‘1’. The three weights are fed to
the Genetic Algorithm, which tries to optimize the weight values. This happens at every
iteration of the Genetic Algorithm. The Fitness Function values are sorted in decreasing order
and only the nodes with high values are kept for future iterations. Because this function
separates the fit nodes from the unfit ones, it is called the ‘Fitness’ function.
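A minimal Java sketch of the weighted sum and the ordering step is given below. The class, field, and method names and the sample candidates are hypothetical; only the formula I1 * w1 + I2 * w2 + I3 * w3 and the decreasing-order selection come from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: weighted-sum fitness and ranking of candidate nodes.
class FitnessRanking {

    static class Candidate {
        final long id;
        final double[] indexes; // {I1, I2, I3}
        final int[] weights;    // {w1, w2, w3}

        Candidate(long id, double[] indexes, int[] weights) {
            this.id = id;
            this.indexes = indexes;
            this.weights = weights;
        }

        // I1 * w1 + I2 * w2 + I3 * w3
        double fitness() {
            double sum = 0.0;
            for (int k = 0; k < 3; k++) {
                sum += indexes[k] * weights[k];
            }
            return sum;
        }
    }

    // Orders candidates by decreasing fitness and keeps the first `keep`.
    static List<Candidate> fittest(List<Candidate> pool, int keep) {
        List<Candidate> sorted = new ArrayList<>(pool);
        sorted.sort((a, b) -> Double.compare(b.fitness(), a.fitness()));
        return new ArrayList<>(sorted.subList(0, Math.min(keep, sorted.size())));
    }

    public static void main(String[] args) {
        int[] ones = {1, 1, 1}; // all weights start at 1
        List<Candidate> pool = List.of(
            new Candidate(101L, new double[] {1, 0.50, 0.10}, ones),
            new Candidate(102L, new double[] {3, 0.60, 0.04}, ones),
            new Candidate(103L, new double[] {2, 0.00, 0.02}, ones));
        for (Candidate c : fittest(pool, 2)) {
            System.out.println(c.id + " " + c.fitness());
        }
    }
}
```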
[5] Genetic Algorithm
The genetic algorithm is a probabilistic search algorithm that iteratively transforms a set
(called a population) of mathematical objects (binary strings), each with an associated fitness
value, into a new population of offspring objects. It uses Darwin’s principle of natural
selection and operations that are patterned after naturally occurring genetic operations, such
as crossover and mutation.
Figure 5.1: Genetic algorithm flow chart
The main procedure (Fig 5.1) returns the best weight combination for recommending friends to a
specific user:
• Generate an initial population with random weight values.
• Until the fitness function value of the best individual of the population does not improve
for five generations, do:
o Evaluate the fitness function for each individual in the population.
o Exclude the worst individuals in the population according to the fitness function value.
o Generate new individuals by applying crossover to the remaining individuals.
o Apply mutation operations to the children.
o Merge children and parents, eliminating duplicates.
• Return the weight combination of the best individual.
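The loop above can be sketched in Java as follows. This is an illustrative skeleton under several assumptions, not the project's implementation: an individual is packed into a 24-bit int (three 8-bit weights), the fitness of an individual is supplied by the caller, "the worst individuals" is taken to mean the lower half, and parents are paired in rank order rather than over a full Cartesian product.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

// Hypothetical sketch of the main loop; an individual is a 24-bit int (three 8-bit weights).
class GeneticSearch {
    static final int BITS = 24;

    // Assumes populationSize >= 2. Returns the best individual seen.
    static int run(IntToDoubleFunction fitness, int populationSize, double mutationRate, Random rng) {
        Set<Integer> population = new LinkedHashSet<>();
        while (population.size() < populationSize) {
            population.add(rng.nextInt(1 << BITS)); // random initial weights
        }
        int best = 0;
        double bestFitness = Double.NEGATIVE_INFINITY;
        int stale = 0; // generations without improvement
        while (stale < 5) {
            // Evaluate every individual and rank them, fittest first.
            List<Integer> ranked = new ArrayList<>(population);
            ranked.sort((a, b) -> Double.compare(fitness.applyAsDouble(b), fitness.applyAsDouble(a)));
            double top = fitness.applyAsDouble(ranked.get(0));
            if (top > bestFitness) {
                bestFitness = top;
                best = ranked.get(0);
                stale = 0;
            } else {
                stale++;
            }
            // Exclude the worst individuals (here: the lower half).
            List<Integer> parents = ranked.subList(0, Math.max(2, ranked.size() / 2));
            // Merge parents and children; the set eliminates duplicates.
            Set<Integer> next = new LinkedHashSet<>(parents);
            for (int k = 0; k + 1 < parents.size(); k += 2) {
                int a = parents.get(k);
                int b = parents.get(k + 1);
                int mask = (1 << (1 + rng.nextInt(BITS - 1))) - 1; // single crossover point
                next.add(mutate((a & ~mask) | (b & mask), mutationRate, rng));
                next.add(mutate((b & ~mask) | (a & mask), mutationRate, rng));
            }
            population = next;
        }
        return best;
    }

    // Flips each bit independently with probability `rate`.
    static int mutate(int individual, double rate, Random rng) {
        for (int bit = 0; bit < BITS; bit++) {
            if (rng.nextDouble() < rate) {
                individual ^= (1 << bit);
            }
        }
        return individual;
    }

    public static void main(String[] args) {
        // Toy fitness (an assumption for the demo): the number of 1 bits.
        int best = run(x -> Integer.bitCount(x), 8, 0.04, new Random(42));
        System.out.println(Integer.toBinaryString(best));
    }
}
```

Because the parents are always carried into the next generation, the best fitness never decreases, so the "five generations without improvement" stopping rule is guaranteed to trigger eventually.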
[5.1] Crossover
Crossover selects individuals from the population and produces offspring from them. Multiple
crossover techniques exist, such as single-point crossover, multipoint crossover, and
random-point crossover. We implemented the project using single-point crossover. In a
single-point crossover, a locus is chosen and the alleles after that point are swapped between
the two parents. In each new generation, the worst individuals are replaced by children of the
best individuals.
In the flowchart (Fig 5.2), our aim is to select a dog that barks loudly. Weights are assigned
to the dogs depending on how loudly they bark. The loudest dog is assigned a weight of 7 and
the quietest dog a weight of 2. The higher-weighted dogs are selected and prepared for
crossover. After crossover, the two worst-barking dogs are replaced by the children of the best
individuals.
Figure 5.2: Represents the processing levels inside genetic algorithm
The function that generates the children of two parents is summarized as:
• Crossover is applied to each pair from the Cartesian product of the individuals of the elite
population.
• Crossover is applied to chromosomes of the same type in the two parents. Two chromosomes
generate two new chromosomes, and by combining the three chromosomes from each parent, six
different new individuals can be generated. Crossover is performed at a single random cutoff
point.
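A single-point crossover applied to each 8-bit weight can be sketched as below. The class and method names and the sample parents are hypothetical; the sketch assumes the cutoff point counts bits from the most significant end of each byte, so that a point of 3 takes the first three bits from one parent and the remaining five from the other.

```java
// Hypothetical sketch: single-point crossover applied to each 8-bit gene.
class SinglePointCrossover {

    // The first `point` bits (most significant) of every gene come from `head`,
    // the remaining 8 - point bits come from `tail`. Expects 1 <= point <= 7.
    static int[] cross(int[] head, int[] tail, int point) {
        int lowMask = (1 << (8 - point)) - 1; // e.g. point 3 -> 00011111
        int highMask = 0xFF & ~lowMask;       // e.g. point 3 -> 11100000
        int[] child = new int[head.length];
        for (int g = 0; g < head.length; g++) {
            child[g] = (head[g] & highMask) | (tail[g] & lowMask);
        }
        return child;
    }

    // Formats a gene as an 8-character bit string.
    static String bits(int gene) {
        return String.format("%8s", Integer.toBinaryString(gene & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int[] parent1 = {0b11110000, 0b10101010, 0b00001111};
        int[] parent2 = {0b00001111, 0b01010101, 0b11110000};
        int[] child1 = cross(parent1, parent2, 3);
        int[] child2 = cross(parent2, parent1, 5);
        System.out.println(bits(child1[0]) + " " + bits(child1[1]) + " " + bits(child1[2]));
        // prints 11101111 10110101 00010000
        System.out.println(bits(child2[0]) + " " + bits(child2[1]) + " " + bits(child2[2]));
        // prints 00001000 01010010 11110111
    }
}
```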
[5.2] Mutation
Mutation ensures that the individuals are not all exactly the same. It gives a child its own
traits so that it is unique within the population. The probability of mutation is generally
between one and two tenths of a percent; in our experiments it is set to 4% (Section 7.1).
Figure 5.3: Represents mutation by flipping the bits
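Bit-flip mutation can be sketched in a few lines of Java. The names below are hypothetical, and the sketch assumes each bit of each 8-bit weight is flipped independently with the given probability.

```java
import java.util.Random;

// Hypothetical sketch: bit-flip mutation over 8-bit genes.
class BitFlipMutation {

    // Returns a mutated copy; each of the 8 bits of every gene is flipped
    // independently with probability `rate`. The input array is not modified.
    static int[] mutate(int[] genes, double rate, Random rng) {
        int[] out = genes.clone();
        for (int g = 0; g < out.length; g++) {
            for (int bit = 0; bit < 8; bit++) {
                if (rng.nextDouble() < rate) {
                    out[g] ^= (1 << bit);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] child = {220, 254, 107};
        int[] mutated = mutate(child, 0.04, new Random());
        System.out.println(mutated[0] + " " + mutated[1] + " " + mutated[2]);
    }
}
```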
[5.3] Chromosome:
In our project we used a binary genetic algorithm in which each chromosome is represented by
the weights w1, w2, and w3. Each individual in a generation is represented by 3 bytes, each
ranging from 0 to 255.
Chromosome = [w1 w2 w3]. Initially the weights are assigned at random, based on whether the
user to be recommended is connected to the central user, as described in Step 5 of Section 6.2.
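The 24-bit encoding can be illustrated with a small Java sketch. The class and method names are hypothetical, and the sketch assumes the bit string is read left to right as w1, w2, w3. The sample string is the chromosome listed for user 71496042 in List 7.1.

```java
// Hypothetical sketch: converting between a 24-bit chromosome string and its three weights.
class Chromosome {

    // Splits a 24-character bit string into {w1, w2, w3}, each in 0..255.
    static int[] decode(String bits) {
        if (bits.length() != 24) {
            throw new IllegalArgumentException("expected 24 bits, got " + bits.length());
        }
        int[] weights = new int[3];
        for (int g = 0; g < 3; g++) {
            weights[g] = Integer.parseInt(bits.substring(8 * g, 8 * g + 8), 2);
        }
        return weights;
    }

    // Joins the weights back into a zero-padded bit string.
    static String encode(int[] weights) {
        StringBuilder sb = new StringBuilder();
        for (int w : weights) {
            String b = Integer.toBinaryString(w & 0xFF);
            sb.append(String.format("%8s", b).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] w = decode("110111001111111001101011");
        System.out.println(w[0] + " " + w[1] + " " + w[2]); // prints 220 254 107
    }
}
```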
[6] Recommendation Procedure
The Recommendation procedure is represented in the figure below. Each step is explained in
detail later. It consists of 6 steps in total.
Figure 6.1: Recommendation Procedure (Get Data from Twitter → Form the network graph → Perform
Graph reduction → Calculate the three indexes → Apply Genetic Algorithm → Recommend whom to
follow)
[6.1] Recommendation Steps in detail
Step 1: Get Data from Twitter
We first chose the central node, i.e. the user who should receive recommendations on whom to
follow. We then collected information about all nodes up to the 4th level. We obtained all
information using Twitter4J, a Java library for the Twitter API. The 1st level node is the
Central User itself. The 2nd level nodes are all those being followed by the central user. The
3rd level nodes are nodes being followed by nodes in the 2nd level. Thus the nodes recommended
to the user belong to level 3. Level 4 nodes are nodes being followed by nodes in level 3 (a
superset of the nodes being recommended to the user). This is shown in Figure 6.2 below:
Figure 6.2: Network View
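Once the follow lists have been fetched, the level of each user can be assigned with a breadth-first traversal. The sketch below is an illustration only: it makes no Twitter API calls, the names are hypothetical, and the follow lists are assumed to be available in a map. Note one difference from the procedure described here: the traversal gives each user the smallest level at which he is reached, so a user followed both by the central user and by a level 2 user stays in level 2.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assigning levels 1..maxLevel around a central user.
class FollowLevels {

    // follows maps a user id to the ids that user follows (already fetched).
    static Map<Long, Integer> levels(long central, Map<Long, List<Long>> follows, int maxLevel) {
        Map<Long, Integer> level = new LinkedHashMap<>();
        Deque<Long> queue = new ArrayDeque<>();
        level.put(central, 1); // the central user is level 1
        queue.add(central);
        while (!queue.isEmpty()) {
            long user = queue.remove();
            int next = level.get(user) + 1;
            if (next > maxLevel) {
                continue; // do not expand beyond the last level
            }
            for (long followee : follows.getOrDefault(user, List.of())) {
                if (!level.containsKey(followee)) {
                    level.put(followee, next);
                    queue.add(followee);
                }
            }
        }
        return level;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> follows = Map.of(
            1L, List.of(2L, 3L),
            2L, List.of(3L, 4L),
            4L, List.of(5L));
        System.out.println(levels(1L, follows, 4)); // prints {1=1, 2=2, 3=2, 4=3, 5=4}
    }
}
```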
Step 2: Form the Network Graph
To calculate the index values, we first have to find the adjacency matrix. However, the
adjacency matrix can be obtained only from a graph. Hence, once we have the information of all
users, we form the network graph so that the index values can be found. This step is
particularly important for calculating the second index and the third index.
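As an illustration of this step, the adjacency matrix can be filled from a list of follower/followee pairs. The sketch uses plain arrays rather than a graph library, and the class name, method name, and sample ids are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: building the adjacency matrix M from follow edges.
class AdjacencyMatrixBuilder {

    // ids fixes the row/column order; each edge is {follower, followee}.
    // Edges that mention an unknown id are skipped.
    static int[][] build(List<Long> ids, List<long[]> edges) {
        Map<Long, Integer> index = new HashMap<>();
        for (int k = 0; k < ids.size(); k++) {
            index.put(ids.get(k), k);
        }
        int[][] m = new int[ids.size()][ids.size()];
        for (long[] e : edges) {
            Integer from = index.get(e[0]);
            Integer to = index.get(e[1]);
            if (from != null && to != null) {
                m[from][to] = 1; // follower -> followee
            }
        }
        return m;
    }

    public static void main(String[] args) {
        List<Long> ids = List.of(7L, 8L, 9L);
        List<long[]> edges = List.of(new long[] {7, 8}, new long[] {9, 7});
        int[][] m = build(ids, edges);
        System.out.println(m[0][1] + " " + m[2][0] + " " + m[1][0]); // prints 1 1 0
    }
}
```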
Step 3: Perform Graph Reduction
In this step we remove from level 3 all those nodes that already appear in level 2. Consider
nodes A and B at level 2. If there is a link between them, say A is following B, then from the
Central user’s perspective B also falls in level 3. Hence our system might ask the Central user
to follow B. But we know that B is already being followed by the Central user and should not be
present in the list of nodes recommended to the Central user. Hence we discard such nodes, and
this step is called the Graph Reduction step.
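Graph reduction amounts to a set difference, as the following sketch shows. The names and ids are hypothetical, and removing the central user from the candidates is an added assumption that the text does not state.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: graph reduction as a set difference.
class GraphReduction {

    // Returns the level 3 users that are not already followed by the central user.
    static Set<Long> reduce(long central, Set<Long> level2, Set<Long> level3) {
        Set<Long> candidates = new LinkedHashSet<>(level3);
        candidates.removeAll(level2); // already followed by the central user
        candidates.remove(central);   // assumption: never recommend the user to himself
        return candidates;
    }

    public static void main(String[] args) {
        Set<Long> level2 = Set.of(20L, 40L);
        Set<Long> level3 = Set.of(10L, 20L, 30L);
        System.out.println(reduce(1L, level2, level3).size()); // prints 2
    }
}
```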
Step 4: Calculate the three indexes
The concept and formulae of the three indexes were explained in Section 4.3. For each node we
find the three index values and then the Fitness Function value. Thus each node has a single
fitness function value.
Step 5: Apply Genetic Algorithm
As mentioned previously, the three weights are given as input to the Genetic Algorithm, which
tries to optimize them in every iteration. In our case each chromosome represents an individual
in a population. Each chromosome has three genes, represented by the weights w1, w2, w3
assigned to the indexes I1, I2, I3 respectively. Each gene is 8 bits long. If a level 3 user
directly follows the level 1 user, we assign that level 3 user higher initial weights, with
values greater than 127. Our goal is to optimize the weights such that the fittest of the
population are recommended to the user. Before starting each new iteration we keep only those
nodes with high fitness function values. We take the weights of these nodes and apply the
Crossover and Mutation techniques to get optimized weights.
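The initial weight assignment can be sketched as follows. The names are hypothetical, and the thresholds follow the pseudo code of Section 6.2 (a random weight of at least 127 for users who follow the central user, below 127 otherwise).

```java
import java.util.Random;

// Hypothetical sketch: initial weights for one level 3 user.
class InitialWeights {

    // Returns {w1, w2, w3}. Users who follow the central user get weights in 127..255,
    // all other users get weights in 0..126.
    static int[] initial(boolean followsCentral, Random rng) {
        int[] w = new int[3];
        for (int k = 0; k < 3; k++) {
            w[k] = followsCentral ? 127 + rng.nextInt(129) : rng.nextInt(127);
        }
        return w;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int[] connected = initial(true, rng);
        int[] unconnected = initial(false, rng);
        System.out.println(connected[0] + " " + unconnected[0]);
    }
}
```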
Step 6: Recommend whom to follow
Finally, after running the Genetic Algorithm for four iterations, we obtain a set of nodes with
their fitness function values. We arrange these nodes in decreasing order of fitness function
value and recommend the top nodes that the user should follow.
[6.2] Pseudo Code
We implemented the program in Java. The pseudo code for each step defined earlier is shown
below:
Step 3: Perform Graph Reduction
For each user in level-3
    if user equals a level-2 user
        then remove user from level-3
    end
end
Step 4: Calculate the three indexes
For each user in level-3
    Calculate First Index, Second Index, Third Index
    if level-3 user is connected to level-1 user
        then set flag to Y
    end
end
Step 5: Apply Genetic Algorithm
Initial step, performed only once for a Central user node: assign random weights to the nodes
and form the initial chromosomes.
For each user in level-3
    if user is directly connected to level-1 user
        then assign a random weight ≥127 to each index
        else assign a random weight <127 to each index
    end
end
GA Step 1: Create the population using the weights.
GA Step 2: Select two random people from the population at a time until the count equals the
population size.
GA Step 3: Perform the crossover operation to get offspring.
GA Step 4: Perform the mutation operation on the offspring.
GA Step 5: Calculate the Fitness Function for each person in the population.
GA Step 6: Sort the population in descending order of Fitness Function value.
GA Step 7: Discard the least fit half of the population.
GA Step 8: Repeat GA Step 2 to GA Step 7 for four iterations.
Step 6: Recommend whom to follow
Select the unique users out of the remaining population for recommendation. Give preference
according to their Fitness Function value.
[7] Results:
[7.1] Dataset:
For our testing we collected 11,456 user Ids related to a single user. We found that the number
of users in level 3 of the network graph was 59. After graph reduction the number of
recommendable users was reduced to 56.
The total size of the chromosome is 24 bits. The Genetic Algorithm is applied for 4 generations
so that the characteristics of the fittest parents are passed on to the rest of the population.
There is a tradeoff between the number of generations and the accuracy of the results: more
generations give better output but increase the number of computations. The mutation
probability is set to 4%.
Total number of users | Number of users in level 3 | Number of recommendable users
11,456 | 59 | 56
Table 7.1 Population
The accuracy of our recommendation system is limited as we were not able to get the private
user data. Hence, some of the users who had a better chance of being recommended were
neglected because we were not able to collect whom they were following.
[7.2] Network Graphs:
Connectivity Graph:
Figure 7.1: Connectivity Graph
Figure 7.1 represents the connectivity graph of the 11,456 users. It can be observed that the
network is quite dense, which agrees with the social network principle. Because the graph is so
dense, it is not possible to see the individual connections between users.
Common Users:
In Figure 7.2, user 73048275 is the central user for whom users are recommended, and user
45597677 belongs to level 3 and is a candidate for recommendation. The rest of the nodes are
the users common to the central user and the level 3 user. It can be seen from Figure 7.2 that
no link exists between users 73048275 and 45597677. After running all iterations, user 45597677
is recommended to user 73048275, since the central user is not yet following him. This can be
verified with Table 7.2: the value of the 1st index is 3, which agrees with Figure 7.2, which
shows three common users between 73048275 and 45597677. This connectivity graph is also used in
the calculation of the 2nd Index.
Figure 7.2: Common Users
Sub Graph:
Figure 7.3: Sub Graph
Figure 7.3 represents the connectivity graph of the common users and the non-common users
being followed by central user and level 3 users. This connectivity graph is used in the
calculation of 3rd Index.
Crossover:
We show here a glimpse of how Crossover and Mutation work. The result shown below is after the
first iteration of the Genetic Algorithm applied to the user with id 73048275. The chromosomes
of 2 nodes from level 3 are shown below:
The crossover point for child 1 is 3. This means that the child inherits the first three bits
from one Parent and the next five bits from the other Parent. This is true for each byte of the
chromosome. For example, the first three bits of the first byte in Child 1 come from Parent 1
and the remaining five bits come from Parent 2. This is shown by the Violet line.
Similarly, the crossover point for child 2 is 5. This means that the child inherits five bits
from one Parent and the next three bits from the other Parent. This is true for each byte of
the chromosome. For example, the first five bits of the second byte in Child 2 come from
Parent 2 while the remaining three bits come from Parent 1. This is shown by the Red line.
The next step is to perform Mutation on the chromosomes of the children nodes. The result is
shown below:
Since the probability of mutating a bit is 4%, only one bit gets flipped in the chromosome of
child 1. The chromosome of child 2 remains unaffected.
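A short calculation shows why roughly one flipped bit per child is the expected outcome, assuming each of the 24 bits is mutated independently with probability 0.04:

```latex
\begin{aligned}
\mathbb{E}[\text{flipped bits}] &= 24 \times 0.04 = 0.96 \\
P(\text{no bit flipped}) &= (1 - 0.04)^{24} = 0.96^{24} \approx 0.375
\end{aligned}
```

Under this assumption a child has on average about one flipped bit, and a little over a third of the children are left unchanged, which is consistent with one child being mutated in a single bit and the other remaining unaffected.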
Recommendation:
We have tested our Recommendation system on numerous users, and it mostly gave successful
results. Our recommendation system recommends users to follow from the user’s network graph.
When recommending a user, various factors are considered:
1) Is the level 1 user directly followed by the level 3 user?
2) How many common people are the level 3 user and the level 1 user following (1st and 2nd
Index)?
3) What is the spread of the level 3 user (3rd Index), i.e. the number of users he is
following?
How these factors influence the user recommendation process:
If a user at level 3 has more users in common with the level 1 user, he has a higher chance of
getting recommended. His chances are further enhanced if he is following the level 1 user.
If the level 3 user does not have many users in common with the level 1 user, the 3rd index
comes into play. The greater his spread, the greater his chance of getting recommended.
Recommendation List 1:
Recommendation for user with user Id: 73048275
The value within the brackets represents the fitness value of that user.
• UserId 71496042 user name ankit goel
chromosome (1017.978)110111001111111001101011
• UserId 75572867 user name Abhishek Agarwal
chromosome (644.8637)111100011110101101110000
• UserId 45597677 user name Arun Kartha
chromosome (892.2909)111110011110101101110000
• UserId 33821494 user name Ishaan Guliani
chromosome (434.57565)011010001100110000001101
• UserId 60321463 user name LateNightTales
chromosome (321.128)001000110010010101111101
• UserId 41877111 user name harman
chromosome (135.34)011000100111000001101011
List 7.1 Output 1
User Id | 1st Index | 2nd Index | 3rd Index | Level 1
71496042 | 4 | 0.533333333 | 0.023470661672908864 | y
75572867 | 2 | 0.576 | 0.0553306342780027 | y
45597677 | 3 | 0.6 | 0.03831168831168831 | n
33821494 | 3 | 0.6 | 0.013512304250559284 | n
41877111 | 2 | 0.6666666666666666 | 0.03333333333333333 | n
60321463 | 1 | 0.4321 | 0.02666666666666667 | n
Table 7.2 Indexes
Table 7.2 lists the individuals to be recommended to the actual user, together with each index
value. The character in the last column indicates whether the user is following the actual
user: y means the recommended user is following the actual user, n means he is not.
List 7.1 above is the actual recommended list for the user with Id 73048275. The priority of
the users being recommended is based on their fitness, which is calculated through the genetic
algorithm. Comparing the users in Table 7.2, we can observe that all three indexes matter.
However, being directly connected to the level 1 user plays a vital role in getting
recommended, as higher weights are assigned to users who are directly connected to the level 1
user, which agrees with the principle behind our recommendation system. Users 45597677 and
33821494 have the same 1st and 2nd index, but their network spread (3rd index) played an
important role, which ultimately gave preference to user 45597677.
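As a worked check of the Fitness Function, the fitness of two users in List 7.1 can be reproduced from the index values in Table 7.2. The calculation assumes that each 24-bit chromosome is read left to right as three 8-bit weights w1, w2, w3:

```latex
\begin{aligned}
\text{User 71496042: } & (w_1, w_2, w_3) = (220, 254, 107) \\
& 4 \cdot 220 + 0.5333 \cdot 254 + 0.02347 \cdot 107 \approx 880 + 135.467 + 2.511 = 1017.978 \\
\text{User 45597677: } & (w_1, w_2, w_3) = (249, 235, 112) \\
& 3 \cdot 249 + 0.6 \cdot 235 + 0.03831 \cdot 112 \approx 747 + 141 + 4.291 = 892.291
\end{aligned}
```

Both results agree with the bracketed fitness values listed for these two users in List 7.1 (1017.978 and 892.2909). The first index dominates the sum because it is a count, while the second and third indexes are densities between 0 and 1.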
Recommended List 2
Recommendation for user with user Id: 113943142
• UserId 56304605 user name Rajdeep Sardesai
chromosome (495.9645)100110110011001100100101
• UserId 145125358 user name Amitabh Bachhan
chromosome (428.36288)011000001101111011010111
• UserId 135421739 user name sachin tendulkar
chromosome (416.52875)011101010110001101010110
• UserId 113609977 user name Umal Muranjan
chromosome (401.42856)011000001101111011010111
• UserId 161304900 user name Polynamous
chromosome (315.66666)011000001101111011010111
• UserId 97865628 user name Farhan Akhtar
chromosome (311.89334)011101100111000100001110
• UserId 116135959 user name Viral Desai
chromosome (200.155)011011010000110000010111
• UserId 6509832 user name CNN-IBN News
chromosome (194.233)010000001101111011010111
The above list is an actual recommended list for a different user (user 2). Here most of the
users being recommended are celebrities. This is because none of the users in level 3 directly
follows the level 1 user.
[8] Conclusion:
We have proposed and implemented a new way to recommend whom a user can follow. This method
differs from the conventional Twitter recommendation system. Unlike Twitter, which gives more
importance to famous personalities, we gave equal importance to all nodes. Our approach is
based on the Friends-of-Friends approach. The effectiveness of our system is somewhat limited
because we were not able to retrieve private user information. The Genetic Algorithm we used is
complex; nevertheless, we obtain good results.
[9] Improvements
One major problem that we faced was that we could not access data from private users. Hence, we
had to ignore private users.
Another problem that we faced was rate limiting. To prevent abuse through many sequential
requests in a given period of time, Twitter blocks the requests from a machine with the same IP
address. As a result, we could not run the algorithm on users who follow a large number of
nodes.
This problem could be addressed using the Twitter Streaming API, which does not have the same
rate limiting problem. The drawback is that we get access only to those users who are currently
tweeting.
[10] References
[1] A Graph-Based Friend Recommendation System Using Genetic Algorithm, Nitai B.
Silva, Ing-Ren Tsang, George D.C. Cavalcanti, and Ing-Jyh Tsang.
[2] Introduction to social network methods, Robert A. Hanneman and Mark Riddle
[3] Practical Genetic Algorithms, Randy L. Haupt, S. E. Haupt
[4] twitter4j, http://twitter4j.org
[5] Jung, http://jung.sourceforge.net/