Recommending whom to follow in Twitter using Genetic Algorithm By Ajay Karri Rajiv Neal Harsheel Saraiya Mentored by Prof. Lixin Gao

Online Social Networks - report

Recommending whom to follow in Twitter

using Genetic Algorithm

By

Ajay Karri

Rajiv Neal

Harsheel Saraiya

Mentored by

Prof. Lixin Gao

[1] Introduction

A Social Network is composed of individuals and organizations that are connected by some common interest. The importance of social networks is increasing with the popularity of websites like Facebook, Twitter, and Orkut, and this is influencing human social behavior. There is considerable interest in understanding the structure of the graph that represents a network and in predicting the formation of links between its nodes.

[2] Motivation

We suggest the formation of new links between user nodes in the network. The motivation is that a recommendation system might help a user reconnect with an old friend with whom he or she has lost contact. The same idea applies to e-commerce websites, whose popularity keeps increasing: such a system can recommend items that a buyer might consider, based on the products already bought or browsed, which is likely to increase the company's sales. This approach is popular among websites like Amazon, Wal-Mart, and Target.

How is our recommendation system different from Twitter's recommendation system?

Twitter recommends mostly famous personalities and gives less importance to the people who follow us, even when they have good connectivity and reputation. In our project we plan to give the people who follow us the same importance as famous personalities, while also taking into account other factors that are discussed in Section 4.

[3] Project Overview

In our project we implement a friend recommendation system for Twitter, a popular social networking website. In essence, we recommend nodes that a given user should follow. Various algorithms provide this functionality. Most of them fall under the 'k-nearest neighbor' approach or the 'collaborative filtering' approach.

We use a Genetic Algorithm for the 'whom to follow in Twitter' recommendation system. It is based on the topology and structure of the network surrounding the central user. Unlike other topology-based approaches, we use clustering indexes and a new calibration technique, the Fitness Function.

We developed an algorithm that analyses the sub-graph composed of a user and all other people connected to him within three degrees of separation. However, only users at two degrees of separation are candidates to be recommended to the user. The algorithm uses the patterns defined by the connections between the nodes to find users whose behavior is similar to that of the root user.

In Section 4 we explain the parameters used in our recommendation system. In Section 5 we explain the genetic algorithm, and in Section 6 we present our recommendation procedure.

[4] Recommendation System Steps/ Mechanism

The recommendation mechanism uses graph topology to filter and order a set of nodes that have some important properties in relation to a given user node vi. The nodes of the resulting set are recommended as new edges to be connected to node vi.

The recommendation process is divided into two main steps: Filtering and Ordering.

[4.1] Filtering Step

The Filtering step selects the nodes that have a high probability of being recommended to the given user node. This step is based on the concept of the clustering coefficient, which is characteristic of small-world networks.

Fig 4.1 A visual example of a sub-network showing the links between a single user, his followers, and followers-of-followers in a social network.

As shown in Figure 4.1, region A contains the central user and the people he is following. Region B additionally contains the people followed by them. We recommend only nodes that are two hops away from the central node; these nodes fall in the shaded region between the circles defined by B and A.
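The two-hop filtering described above can be sketched as follows. This is a minimal illustration, not the report's implementation: the class `TwoHopFilter`, the method `candidates`, and the in-memory `follows` map (user id to the set of ids that user follows) are hypothetical names introduced here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TwoHopFilter {
    /**
     * Returns the users followed by someone the central user follows,
     * minus the central user and the users already followed.
     * The follows map is an assumed in-memory stand-in for crawled data.
     */
    static Set<Long> candidates(Map<Long, Set<Long>> follows, long central) {
        Set<Long> direct = follows.getOrDefault(central, Collections.emptySet());
        Set<Long> result = new HashSet<>();
        for (Long friend : direct) {
            result.addAll(follows.getOrDefault(friend, Collections.emptySet()));
        }
        result.removeAll(direct);   // already followed (region A)
        result.remove(central);     // the central user itself
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> follows = new HashMap<>();
        follows.put(1L, new HashSet<>(Arrays.asList(2L, 3L)));
        follows.put(2L, new HashSet<>(Arrays.asList(1L, 3L, 4L)));
        follows.put(3L, new HashSet<>(Arrays.asList(5L)));
        System.out.println(candidates(follows, 1L)); // the set containing users 4 and 5
    }
}
```

User 3 is reachable in two hops through user 2 but is dropped because the central user already follows it.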

[4.2] Ordering Step

The Ordering step is as important as the Filtering step. The procedure is based on the calculation of different indexes (Section 4.3) and their normalization.

Three independent indexes are considered. A Fitness Function (Section 4.3.4) then merges these three indexes into one value. Based on this value we put the most relevant nodes at the top of the list that we obtain after each iteration of the Genetic Algorithm.

Each index measures a specific property of the sub-graph composed of the nodes being analyzed. Other indexes could be used; the ones we use are based on the Friends-of-Friends approach. Motivated by the idea of clustering coefficients, we define and use a parameter called the Adjacent Density.

[4.2.1] Adjacent Density

Consider a network that has many nodes. Let C represent the set of all nodes in the given network. Then the Adjacent Density D_C among the nodes in the given network is given by

$$D_C = \frac{\sum_{i \in C} \sum_{j \in C} M_{ij}}{|C| \cdot (|C| - 1)}$$

M is the adjacency matrix: M_ij is one if there is a link between nodes i and j, and zero otherwise. The denominator is a normalizing factor, where |C| is the number of nodes in set C.
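As a hedged illustration of the Adjacent Density, the sketch below counts the links inside a node set and divides by a normalizing factor. It assumes the normalizing factor is |C| * (|C| - 1), the number of possible directed links without self-loops, and it returns 0 for sets with fewer than two nodes; both choices, and the names `AdjacentDensity` and `density`, are assumptions made for this example.

```java
final class AdjacentDensity {
    /**
     * Density of the sub-graph induced by the node indices in c,
     * where m is a 0/1 adjacency matrix (m[i][j] == 1 if i links to j).
     * Assumption: normalizer is |C| * (|C| - 1); self-loops are ignored.
     */
    static double density(int[][] m, int[] c) {
        if (c.length < 2) {
            return 0.0; // assumed convention for degenerate sets
        }
        int links = 0;
        for (int i : c) {
            for (int j : c) {
                if (i != j) {
                    links += m[i][j];
                }
            }
        }
        return (double) links / (c.length * (c.length - 1));
    }

    public static void main(String[] args) {
        int[][] m = {
            {0, 1, 0},
            {1, 0, 1},
            {0, 0, 0}
        };
        // 3 links among 3 nodes, out of 6 possible directed links
        System.out.println(density(m, new int[] {0, 1, 2}));
    }
}
```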

[4.3] Concept of Indexes

[4.3.1] First Index

The First Index measures the number of nodes common to the center node i and the node j that is being ordered for the recommendation system. Writing F_i for the set of nodes followed by node i, it is given by

$$I_1(i, j) = |F_i \cap F_j|$$

Intuitively, it returns the number of nodes that are "followed" in Twitter by both the central user and the node being recommended to the central user.

[4.3.2] Second Index

The Second Index measures the cohesion level inside the group formed by the common nodes followed by person i and person j. Using the Adjacent Density D, it is given by

$$I_2(i, j) = D_{F_i \cap F_j}$$

The intuition behind this index is that we want to know how densely connected the nodes in the common region are. This directly influences the recommendation. If the common region is very dense, many entries of matrix M have the value 1, resulting in a high second index value. If the common region does not have many links between its nodes, the second index value is low.

[4.3.3] Third Index

The Third Index is a variation of the second index. It measures the cohesion level in the region formed by the nodes that are followed by node i or node j. It is given by

$$I_3(i, j) = D_{F_i \cup F_j}$$

A high second index value does not necessarily lead to a high third index value. The third index is independent of the second index.
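The three indexes can be sketched together as below. This is an illustration under assumptions: the follow relation is held in a hypothetical in-memory map, the density is normalized by |C| * (|C| - 1), and a set with fewer than two nodes is given density 0. The names `Indexes`, `density`, and `compute` are introduced here and are not taken from the report's code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class Indexes {
    /** Adjacent density of node set c; assumed normalizer |C| * (|C| - 1). */
    static double density(Map<Long, Set<Long>> follows, Set<Long> c) {
        if (c.size() < 2) {
            return 0.0; // assumed convention for degenerate sets
        }
        long links = 0;
        for (Long a : c) {
            for (Long b : follows.getOrDefault(a, Collections.emptySet())) {
                if (!a.equals(b) && c.contains(b)) {
                    links++;
                }
            }
        }
        return (double) links / ((long) c.size() * (c.size() - 1));
    }

    /** Returns {I1, I2, I3} for central node i and candidate node j. */
    static double[] compute(Map<Long, Set<Long>> follows, long i, long j) {
        Set<Long> fi = follows.getOrDefault(i, Collections.emptySet());
        Set<Long> fj = follows.getOrDefault(j, Collections.emptySet());
        Set<Long> common = new HashSet<>(fi);
        common.retainAll(fj);               // nodes followed by both
        Set<Long> union = new HashSet<>(fi);
        union.addAll(fj);                   // nodes followed by either
        return new double[] {
            common.size(),
            density(follows, common),
            density(follows, union)
        };
    }
}
```

For example, if user 1 follows {10, 11, 12}, user 2 follows {10, 11, 13}, and the only links among those nodes are 10 to 11, 11 to 10, and 13 to 10, then `compute(follows, 1, 2)` gives a first index of 2, a second index of 1.0 (2 links out of 2 possible), and a third index of 0.25 (3 links out of 12 possible).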

[4.3.4] Fitness Function

Our main goal is to select good nodes to recommend from a pool of nodes based on a single parameter. Hence, after computing the three index values of a particular node, we convert them into a single value. This conversion is performed by the Fitness Function:

M(n, w) = I1(nc, n) * w1 + I2(nc, n) * w2 + I3(nc, n) * w3

Here nc is the central node and n is the candidate node. In essence, the Fitness Function is the weighted sum of the index values. w1, w2, w3 are the weights associated with each node. At the start of the algorithm, all these weights have the same value, 1. The three weights are fed to the Genetic Algorithm, which tries to optimize them at every iteration. The Fitness Function values are sorted in decreasing order, and only the nodes with high values are kept for future iterations. Since the function separates the fit nodes from the unfit ones, it is called the 'Fitness' function.
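The weighted sum and the ordering by decreasing fitness can be sketched as follows. The sketch assumes each candidate carries its own weight triple, as the text describes; `FitnessRanking`, `fitness`, and `rank` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class FitnessRanking {
    /** Weighted sum I1*w1 + I2*w2 + I3*w3. */
    static double fitness(double[] idx, double[] w) {
        return idx[0] * w[0] + idx[1] * w[1] + idx[2] * w[2];
    }

    /**
     * Returns candidate ids ordered by decreasing fitness.
     * indexes and weights map each candidate id to its {I1, I2, I3}
     * and {w1, w2, w3} arrays (an assumed data layout).
     */
    static List<Long> rank(Map<Long, double[]> indexes, Map<Long, double[]> weights) {
        List<Long> ids = new ArrayList<>(indexes.keySet());
        ids.sort(Comparator.comparingDouble(
                (Long id) -> fitness(indexes.get(id), weights.get(id))).reversed());
        return ids;
    }
}
```

With all weights equal to 1, a candidate with indexes {4, 0.5, 0.02} ranks above one with {2, 0.6, 0.05}; raising the second candidate's first weight to 3 reverses the order, which is the kind of change the Genetic Algorithm explores.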

[5] Genetic Algorithm

The genetic algorithm is a probabilistic search algorithm that iteratively transforms a set (called a population) of mathematical objects (binary strings), each with an associated fitness value, into a new population of offspring objects, using Darwin's principle of natural selection and operations patterned after naturally occurring genetic operations, such as crossover and mutation.

Figure 5.1: Genetic algorithm flow chart

The main procedure (Fig 5.1) returns the best weight combination for recommending friends to a specific user:

• Generate an initial population with random weight values.
• Until the fitness function value of the best individual of the population has not improved for five generations, do:
    o Evaluate the fitness function for each individual in the population.
    o Exclude the worst individuals in the population according to the fitness function value.
    o Generate new individuals by applying crossover to the remaining individuals.
    o Apply mutation operations to the children.
    o Merge children and parents, eliminating duplicates.
• Return the weight combination of the best individual.
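The procedure above can be sketched as a loop. This is a simplified sketch under several assumptions that are not taken from the report: chromosomes are 24-bit integers (three 8-bit genes), the fitness function is supplied by the caller, the worst half of the population is excluded each generation, children are bred from random pairs of survivors rather than from every pair, and the cut point is applied inside each 8-bit gene. The names `GeneticLoop` and `run` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

final class GeneticLoop {
    /** Runs until the best fitness has not improved for `patience` generations. */
    static int run(List<Integer> initial, ToDoubleFunction<Integer> fitness,
                   int patience, double mutationRate, Random rng) {
        List<Integer> pop = new ArrayList<>(new LinkedHashSet<>(initial));
        Comparator<Integer> bestFirst = Comparator.comparingDouble(fitness).reversed();
        double best = Double.NEGATIVE_INFINITY;
        int stale = 0;
        while (true) {
            pop.sort(bestFirst);                          // evaluate and order
            double top = fitness.applyAsDouble(pop.get(0));
            if (top > best) {
                best = top;
                stale = 0;
            } else if (++stale >= patience) {
                break;                                    // no recent improvement
            }
            int keep = Math.max(1, pop.size() / 2);       // exclude the worst half
            List<Integer> parents = new ArrayList<>(pop.subList(0, keep));
            LinkedHashSet<Integer> merged = new LinkedHashSet<>(parents);
            for (int k = 0; k < keep; k++) {
                int a = parents.get(rng.nextInt(keep));
                int b = parents.get(rng.nextInt(keep));
                int cut = 1 + rng.nextInt(7);             // cut point 1..7 in each gene
                int low = (1 << (8 - cut)) - 1;
                int tail = low | (low << 8) | (low << 16);
                int child = (a & ~tail & 0xFFFFFF) | (b & tail);
                for (int bit = 0; bit < 24; bit++) {      // bit-flip mutation
                    if (rng.nextDouble() < mutationRate) {
                        child ^= 1 << bit;
                    }
                }
                merged.add(child);                        // set drops duplicates
            }
            pop = new ArrayList<>(merged);
        }
        return pop.get(0);                                // best individual
    }
}
```

Because the surviving parents are always merged back into the population, the best fitness never decreases from one generation to the next in this sketch.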

[5.1] Crossover

Selecting individuals from the population and producing offspring from them is known as crossover. Multiple crossover techniques exist, such as single-point crossover, multipoint crossover, and random-point crossover. We implemented the project using single-point crossover: a locus is chosen, and the alleles after it are swapped between the parents. Each new generation replaces the worst individuals with children of the best individuals.

In the flowchart shown below (Fig 5.2), our aim is to select a dog that barks loudly. Weights are assigned to the dogs according to how loudly they bark: the loudest dog is assigned a weight of 7 and the quietest a weight of 2. The higher-weighted dogs are selected and prepared for crossover. After crossover, the two worst-barking dogs are replaced by the children of the best individuals.

Figure 5.2: Processing levels inside the genetic algorithm

The function that generates the population of children from two parents is summarized as:

• Crossover is applied to each pair from the Cartesian product of the individuals of the elite population.
• Crossover is applied to chromosomes of the same type from the two parents. Two chromosomes generate two new chromosomes, and by combining the three chromosomes of each parent, six different new individuals can be generated. Crossover is performed at a single random cutoff point.
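A minimal sketch of single-point crossover on a 24-bit chromosome follows. It assumes, in line with the example in the Results section, that the same cut point is applied within each 8-bit gene, so the child takes the leading `cut` bits of every byte from the first parent and the remaining bits from the second. `Crossover` and `cross` are hypothetical names.

```java
final class Crossover {
    /**
     * Builds a child from two 24-bit chromosomes.
     * cut (1..7) is the number of leading bits of each 8-bit gene
     * taken from parent a; the rest of each gene comes from parent b.
     */
    static int cross(int a, int b, int cut) {
        if (cut < 1 || cut > 7) {
            throw new IllegalArgumentException("cut must be in 1..7");
        }
        int low = (1 << (8 - cut)) - 1;               // trailing bits of one gene
        int tailMask = low | (low << 8) | (low << 16); // same bits in all three genes
        int headMask = 0xFFFFFF & ~tailMask;
        return (a & headMask) | (b & tailMask);
    }

    public static void main(String[] args) {
        int child1 = cross(0xFF00AA, 0x00FF55, 3);
        int child2 = cross(0x00FF55, 0xFF00AA, 5);
        System.out.println(Integer.toHexString(child1)); // e01fb5
        System.out.println(Integer.toHexString(child2)); // 7f852
    }
}
```

Swapping the parent order and choosing a different cut point yields the second child, which mirrors how the two children in the Results section use cut points 3 and 5.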

[5.2] Mutation

Mutation ensures that the individuals are not all exactly the same. It gives a child its own traits so that its standing in the population is unique. The probability of mutation is usually between 1 and 2 tenths of a percent (0.1% to 0.2%).

Figure 5.3: Mutation by flipping bits
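Bit-flip mutation can be sketched as below, assuming a 24-bit chromosome in which every bit is flipped independently with the given probability. `Mutation` and `mutate` are names introduced for this example.

```java
import java.util.Random;

final class Mutation {
    /** Flips each of the 24 bits independently with probability rate. */
    static int mutate(int chromosome, double rate, Random rng) {
        int result = chromosome & 0xFFFFFF;
        for (int bit = 0; bit < 24; bit++) {
            if (rng.nextDouble() < rate) {
                result ^= 1 << bit;
            }
        }
        return result;
    }
}
```

A call such as `Mutation.mutate(child, 0.04, new Random())` uses the 4% rate reported in the Results section; with 24 bits this flips about one bit per chromosome on average.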

[5.3] Chromosome:

In our project we used a binary genetic algorithm in which each chromosome holds the weights w1, w2, and w3. Each individual in a generation is represented by 3 bytes, one per weight, each ranging from 0 to 255.

Chromosome = [w1 w2 w3]. Initially the weights are assigned at random, based on the connectivity between the candidate user and the central user, as described in Step 5 of Section 6.2.
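The 3-byte layout can be sketched as follows, assuming w1 occupies the most significant byte and w3 the least significant byte of a 24-bit value. The class `Chromosome` and its methods are hypothetical helpers for illustration.

```java
final class Chromosome {
    /** Packs three weights (each 0..255) into one 24-bit value: [w1 w2 w3]. */
    static int encode(int w1, int w2, int w3) {
        for (int w : new int[] {w1, w2, w3}) {
            if (w < 0 || w > 255) {
                throw new IllegalArgumentException("weight out of range: " + w);
            }
        }
        return (w1 << 16) | (w2 << 8) | w3;
    }

    /** Unpacks a 24-bit chromosome into {w1, w2, w3}. */
    static int[] decode(int chromosome) {
        return new int[] {
            (chromosome >> 16) & 0xFF,
            (chromosome >> 8) & 0xFF,
            chromosome & 0xFF
        };
    }

    /** Renders the chromosome as a 24-character binary string. */
    static String toBits(int chromosome) {
        String s = Integer.toBinaryString(chromosome & 0xFFFFFF);
        return String.format("%24s", s).replace(' ', '0');
    }
}
```

For instance, `toBits(encode(220, 254, 107))` yields `110111001111111001101011`, the same 24-bit form in which chromosomes are printed in the Results section.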

[6] Recommendation Procedure

The recommendation procedure is represented in the figure below and consists of 6 steps in total. Each step is explained in detail later.

Figure 6.1: Recommendation Procedure (flowchart: Get data from Twitter → Form the network graph → Perform graph reduction → Calculate the three indexes → Apply Genetic Algorithm → Recommend whom to follow)

[6.1] Recommendation Steps in detail

Step 1: Get Data from Twitter

We first decide on the central node, the user who is to receive recommendations. We then collect information on all nodes up to the 4th level, using Twitter4J, a Java library for the Twitter API. The 1st level node is the central user itself. The 2nd level nodes are all those followed by the central user. The 3rd level nodes are the nodes followed by nodes in the 2nd level; the nodes recommended to the user therefore belong to level 3. Level 4 nodes are the nodes followed by nodes in level 3 (a superset of the nodes being recommended to the user). This is shown in Figure 6.2 below:

Figure 6.2: Network View
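The level structure of Step 1 can be sketched as a breadth-first traversal. The sketch works on a hypothetical in-memory `follows` map rather than on live Twitter4J calls, and `Levels` and `assign` are names introduced here. Each user receives the lowest level at which it is reached.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class Levels {
    /**
     * Assigns level 1 to the central user and level k + 1 to every user
     * first reached by following a link from a level k user, up to maxLevel.
     */
    static Map<Long, Integer> assign(Map<Long, Set<Long>> follows, long central, int maxLevel) {
        Map<Long, Integer> level = new HashMap<>();
        Deque<Long> queue = new ArrayDeque<>();
        level.put(central, 1);
        queue.add(central);
        while (!queue.isEmpty()) {
            Long u = queue.remove();
            int next = level.get(u) + 1;
            if (next > maxLevel) {
                continue; // do not expand beyond the deepest level
            }
            for (Long v : follows.getOrDefault(u, Collections.emptySet())) {
                if (!level.containsKey(v)) {
                    level.put(v, next);
                    queue.add(v);
                }
            }
        }
        return level;
    }
}
```

Calling `assign(follows, centralId, 4)` covers the four levels described above. Because a user keeps the lowest level found, someone followed both by the central user and by a level 2 user stays at level 2.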

Step 2: Form the Network Graph

To calculate the index values, we first need the adjacency matrix, which can only be obtained from a graph. Hence, once we have the information for all users, we form the network graph so that the index values can be computed. This step is particularly important for calculating the second and third indexes.

Step 3: Perform Graph Reduction

In this step we remove from level 3 every node that also appears in level 2. Consider nodes A and B at level 2. If there is a link between them, say A follows B, then from the central user's perspective B also falls in level 3. Our system might therefore ask the central user to follow B, although B is already followed by the central user and should not be in the list of recommended nodes. Hence we discard such nodes, and this step is called the Graph Reduction step.

Step 4: Calculate the three indexes

The three indexes and their formulae were explained in Section 4.3. For each node we compute the three index values and then the Fitness Function value, so each node has exactly one fitness function value.

Step 5: Apply Genetic Algorithm

As mentioned previously, the three weights are given as input to the Genetic Algorithm, which tries to optimize them in every iteration. In our case each chromosome represents an individual in the population. Each chromosome has three genes, the weights w1, w2, w3 assigned to indexes I1, I2, I3 respectively, and each gene is 8 bits long. If a level 3 user directly follows the level 1 user, we assign that level 3 user larger initial weights, with values of 127 or more. Our goal is to optimize the weights so that the fittest of the population are recommended to the user. Before starting each new iteration we keep only the nodes with high fitness function values, take the weights of these nodes, and apply crossover and mutation to obtain optimized weights.
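The initial weight assignment can be sketched as below. The ranges follow the pseudo code in Section 6.2 (127 to 255 for candidates that follow the central user, 0 to 126 otherwise); the class `InitialWeights` and the method `initial` are hypothetical names.

```java
import java.util.Random;

final class InitialWeights {
    /**
     * Draws three random 8-bit weights for one candidate.
     * followsCentral selects the upper range 127..255,
     * otherwise the lower range 0..126 is used.
     */
    static int[] initial(boolean followsCentral, Random rng) {
        int[] w = new int[3];
        for (int k = 0; k < 3; k++) {
            w[k] = followsCentral
                    ? 127 + rng.nextInt(129)  // 127..255 inclusive
                    : rng.nextInt(127);       // 0..126 inclusive
        }
        return w;
    }
}
```

Since the first index is usually the largest of the three values, starting a follower of the central user with a high w1 gives it a head start in the fitness ordering.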

Step 6: Recommend whom to follow

Finally, after running the Genetic Algorithm for four iterations, we obtain a set of nodes with their fitness function values. We arrange these nodes in decreasing order of fitness function value and recommend the top nodes to the user.

[6.2] Pseudo Code

We implemented the program in Java. The pseudo code for Steps 3 to 6 defined earlier is shown below:

Step 3: Perform Graph Reduction

For each user in level-3
    if user equals level-2 user
        then remove from level-3
    end
end

Step 4: Calculate the three indexes

For each user in level-3
    Calculate First Index, Second Index, Third Index
    Check if level-3 user is connected to level-1
    if so set flag to Y
    end
end

Step 5: Apply Genetic Algorithm

Initial step, performed only once for a central user node: assign random weights to the nodes and form the initial chromosomes.

For each user in level-3
    If user is directly connected to level-1 user
        then assign random weight ≥127 corresponding to each index
        Else assign random weight <127 corresponding to each index
    end
End

Step 1: Create the population using the weights.
Step 2: Select two random people from the population at a time until the count equals the population size.
Step 3: Perform the crossover operation to get offspring.
Step 4: Perform the mutation operation on the offspring.
Step 5: Calculate the Fitness Function for each person in the population.
Step 6: Sort the population in descending order of Fitness Function value.
Step 7: Discard the least fit half of the population.
Step 8: Repeat Step 2 to Step 7 for four iterations.

Step 6: Recommend whom to follow

Select the unique users out of the remaining population for recommendation. Give preference according to their Fitness Function value.

[7] Results:

[7.1] Dataset:

For our testing we collected 11,456 user IDs related to a single user. We found 59 users in level 3 of the network graph. After graph reduction, the number of recommendable users dropped to 56.

The total size of the chromosome is 24 bits. The Genetic Algorithm is applied for 4 generations so that the characteristics of the fittest parents are passed on to the rest of the population. There is a tradeoff between the number of generations and the accuracy of the results: more generations give better output but increase the number of computations. The mutation probability is set to 4%.

Total number of users    Number of users in level 3    Number of recommendable users
11,456                   59                            56

Table 7.1 Population

The accuracy of our recommendation system is limited because we were not able to get private user data. Hence, some users who had a better chance of being recommended were neglected because we could not collect the list of users they follow.

[7.2] Network Graphs:

Connectivity Graph:

Figure 7.1: Connectivity Graph

Figure 7.1 represents the connectivity graph of the 11,456 users. The network is quite dense, which agrees with what is expected of a social network. Because the graph is so dense, individual connections between users cannot be seen.

Common Users:

In Figure 7.2, user 73048275 is the central user for whom recommendations are made, and user 45597677 is a level 3 user who is a candidate for recommendation; the remaining nodes are the users common to the central user and the level 3 user. Figure 7.2 shows that no link exists between users 73048275 and 45597677. After running all iterations, user 45597677 is recommended to user 73048275, since the central user is not already following him. This can be verified against Table 7.2: the value of the 1st index is 3, which agrees with the three common users between 73048275 and 45597677 shown in Figure 7.2. This connectivity graph is also used in the calculation of the 2nd index.

Figure 7.2: Common Users

Sub Graph:

Figure 7.3: Sub Graph

Figure 7.3 represents the connectivity graph of both the common and the non-common users followed by the central user and the level 3 user. This connectivity graph is used in the calculation of the 3rd index.

Crossover:

We show here a glimpse of how crossover and mutation work. The result shown below is after the first iteration of the Genetic Algorithm applied to the user with ID 73048275. The chromosomes of 2 nodes from level 3 are shown below:

The crossover point for child 1 is 3. This means that the child inherits the first three bits from one parent and the next five bits from the other, and this holds for each byte of the chromosome. For example, the first three bits of the first byte in child 1 come from parent 1 and the remaining five bits come from parent 2 (shown by the violet line).

Similarly, the crossover point for child 2 is 5, so the child inherits the first five bits from one parent and the next three bits from the other, again for each byte of the chromosome. For example, the first five bits of the second byte in child 2 come from parent 2, while the remaining three bits come from parent 1 (shown by the red line).

The next step is to perform mutation on the chromosomes of the children. The result is shown below:

Since the probability of mutating a bit is 4%, only one bit gets flipped in the chromosome of child 1. The chromosome of child 2 remains unaffected.

Recommendation:

We tested our recommendation system on numerous users, and it mostly gave successful results. The system recommends users to follow from the central user's network graph. Various factors are considered when recommending a user:

1) Does the level 3 user directly follow the level 1 user?
2) How many people do the level 3 user and the level 1 user both follow (1st and 2nd Index)?
3) What is the spread of the level 3 user (3rd Index), i.e. the number of users he is following?

How these factors influence the user recommendation process:

If a level 3 user has many users in common with the level 1 user, he has a high chance of being recommended. His chances are further enhanced if he follows the level 1 user.

If the level 3 user does not have many users in common with the level 1 user, the 3rd index comes into play: the greater his spread, the greater his chance of being recommended.

Recommendation List 1:

Recommendation for user with user Id: 73048275

The value within the brackets represents the fitness value of that user.

• UserId 71496042 user name ankit goel

chromosome (1017.978)110111001111111001101011

• UserId 75572867 user name Abhishek Agarwal

chromosome (644.8637)111100011110101101110000

• UserId 45597677 user name Arun Kartha

chromosome (892.2909)111110011110101101110000

• UserId 33821494 user name Ishaan Guliani

chromosome (434.57565)011010001100110000001101

• UserId 60321463 user name LateNightTales

chromosome (321.128)001000110010010101111101

• UserId 41877111 user name harman

chromosome (135.34)011000100111000001101011

List 7.1 Output 1

User Id     1st Index   2nd Index             3rd Index               Follows level 1 user
71496042    4           0.533333333           0.023470661672908864    y
75572867    2           0.576                 0.0553306342780027      y
45597677    3           0.6                   0.03831168831168831     n
33821494    3           0.6                   0.013512304250559284    n
41877111    2           0.6666666666666666    0.03333333333333333     n
60321463    1           0.4321                0.02666666666666667     n

Table 7.2 Indexes

Table 7.2 lists the individuals to be recommended to the central user. Users are arranged taking into account each index value. The last column indicates whether the recommended user follows the central user: y means he does, n means he does not.

List 7.1 above is the actual recommendation list for the user with ID 73048275. The priority of each recommended user is based on his fitness, which is calculated through the genetic algorithm. Comparing the users in Table 7.2, we can observe that all three indexes are equally important. However, being directly connected to the level 1 user plays a vital role in getting recommended, because higher weights are assigned to users who are directly connected to the level 1 user, which agrees with the principle behind our recommendation system. Users 45597677 and 33821494 have the same 1st and 2nd index, but their network spread (3rd index) played an important role and ultimately gave preference to user 45597677.
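As a consistency check on the listed values, the sketch below decodes a chromosome string from List 7.1 into three 8-bit weights and applies the weighted sum to the indexes from Table 7.2. It assumes w1 is the leftmost byte of the printed chromosome; `FitnessCheck` is a hypothetical name. For users 71496042 and 45597677 the result agrees with the fitness values shown in brackets in List 7.1 to within rounding.

```java
final class FitnessCheck {
    /** Decodes a 24-bit chromosome string and returns I1*w1 + I2*w2 + I3*w3. */
    static double fitness(String chromosome, double i1, double i2, double i3) {
        int bits = Integer.parseInt(chromosome, 2);
        int w1 = (bits >> 16) & 0xFF;  // leftmost byte (assumed to be w1)
        int w2 = (bits >> 8) & 0xFF;
        int w3 = bits & 0xFF;
        return i1 * w1 + i2 * w2 + i3 * w3;
    }

    public static void main(String[] args) {
        // User 71496042: weights decode to 220, 254, 107
        System.out.println(fitness("110111001111111001101011",
                4, 0.533333333, 0.023470661672908864));
        // User 45597677: weights decode to 249, 235, 112
        System.out.println(fitness("111110011110101101110000",
                3, 0.6, 0.03831168831168831));
    }
}
```

The comparison between users 45597677 and 33821494 can be read the same way: the first index is multiplied by a weight of up to 255, so it contributes most of the total, while the third index changes the fitness only slightly.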

Recommended List 2

Recommendation for user with user Id: 113943142

• UserId 56304605 user name Rajdeep Sardesai

chromosome (495.9645)100110110011001100100101

• UserId 145125358 user name Amitabh Bachhan

chromosome (428.36288)011000001101111011010111

• UserId 135421739 user name sachin tendulkar

chromosome (416.52875)011101010110001101010110

• UserId 113609977 user name Umal Muranjan

chromosome (401.42856)011000001101111011010111

• UserId 161304900 user name Polynamous

chromosome (315.66666)011000001101111011010111

• UserId 97865628 user name Farhan Akhtar

chromosome (311.89334)011101100111000100001110

• UserId 116135959 user name Viral Desai

chromosome (200.155)011011010000110000010111

• UserId 6509832 user name CNN-IBN News

chromosome (194.233)010000001101111011010111

The above list is an actual recommendation list for a different user (user 2). Most of the recommended users are celebrities, because none of the users in level 3 directly follows the level 1 user.

[8] Conclusion:

We have proposed and implemented a new way to recommend whom a user can follow. This method differs from the conventional Twitter recommendation system: unlike Twitter, which gives more importance to famous personalities, we gave equal importance to all nodes. Our approach is based on the Friends-of-Friends approach. The effectiveness of our system is somewhat reduced because we were not able to retrieve private user information. Although the Genetic Algorithm is complex, we obtained good results.

[9] Improvements

One major problem we faced was that we could not access data from private users. Hence, we had to ignore private users.

The other problem we faced was rate limiting. To prevent abuse through many sequential requests in a given period of time, Twitter blocks requests from a machine with the same IP address. As a result, we could not run the algorithm on users who follow a large number of nodes.

This problem could be addressed using the Twitter Streaming API, which does not have the same rate-limiting problem. The drawback is that we would get access only to users who are currently tweeting.

[10] References

[1] A Graph-Based Friend Recommendation System Using Genetic Algorithm, Nitai B. Silva, Ing-Ren Tsang, George D.C. Cavalcanti, and Ing-Jyh Tsang

[2] Introduction to social network methods, Robert A. Hanneman and Mark Riddle

[3] Practical Genetic Algorithms, Randy L. Haupt, S. E. Haupt

[4] twitter4j, http://twitter4j.org

[5] Jung, http://jung.sourceforge.net/