http://www.iaeme.com/IJARET/index.asp 12 editor@iaeme.com
International Journal of Advanced Research in Engineering and Technology (IJARET) Volume 9, Issue 1, Jan - Feb 2018, pp. 12–25, Article ID: IJARET_09_01_002
Available online at http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=9&IType=1
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
© IAEME Publication
SOCIAL MEDIA HASHTAG CLUSTERING
USING GENETIC ALGORITHM
Nilesh Gambhava
Institute of Technology, Nirma University, Ahmedabad, India
Dr. Ketan Kotecha
Parul University, Baroda, India
ABSTRACT
Twitter is one of the most influential microblogging platforms of the social media
era. Tweets, the short messages users post to interact with the social world, are an
invaluable source of data for trend prediction, timeline generation, community
detection, etc. Extracting useful information from tweets is challenging for two
reasons: first, a tweet is short (140 characters), and second, users focus on the
meaning of a tweet rather than on grammar rules or correct spelling. People put the
hashtag symbol (#) before a keyword or phrase in a tweet to emphasize those words and
make the tweet easier to find in a search. Hashtag clustering is an important technique
for extracting knowledge by categorizing tweets into different clusters. It is a
challenging task for three major reasons: first, the number of clusters is not known
in advance; second, domain-related information is not available; and third, different
hashtags are created for the same topic (#deepnet, #deeplearning, #dl, etc.).
Genetic Algorithm is an adaptive heuristic search algorithm that mimics the
evolutionary process of natural selection and survival of the fittest. To the best of our
knowledge, this is the first attempt to cluster hashtags using Genetic Algorithm. We
have tested our algorithm on a large set of tweets downloaded from popular
Indian media Twitter accounts. The results obtained by our model are compared with
clusterings produced through crowdsourcing, as no other source is available to validate
the quality of the results. The best solution found by our model achieves a higher
fitness value than the best crowdsourced solution. The users' validation of the
generated clusters also supports the accuracy of the proposed model.
Key words: Social Media, Hashtag Clustering, Genetic Algorithm, Crowd Sourcing.
Cite this Article: Nilesh Gambhava and Dr. Ketan Kotecha, Social Media Hashtag
Clustering Using Genetic Algorithm. International Journal of Advanced Research in
Engineering and Technology, 9(1), 2018, pp 12–25.
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=9&IType=1
1. INTRODUCTION
Twitter [1] is the most widely used online social networking service where users share their
opinions and communicate with others using short messages known as tweets, limited to 140
characters. Twitter has become one of the largest repositories of news, opinion, and data.
Tweets posted by users all over the world represent thoughts and views on a broad
variety of topics. More than 100 million users post more than 500 million tweets daily
about what is happening in the world, expressing their views on thousands of different
topics. These tweets are a valuable source of data for identifying trends, generating
the timeline of an event, finding people with similar interests, etc.
Extracting information from tweets is extremely difficult because of their nature and
structure. First, tweets are limited to 140 characters, so many users abbreviate words
to shorten the message. Though the message is conveyed well to readers, such
abbreviations are not meaningful for automatic information extraction. For example, the word
Tomorrow was written as 2m, 2mar, 2mara, 2maro, 2marrow, 2mor, 2moro, 2morow, 2morr,
2morro, 2morrow, 2moz, 2mr, 2mro, 2mrrw, 2mrw, 2mw, tmmrw, tmo, tmoro, tmorrow,
tmoz, tmr, tmro, tmrow, tmrrow, tmrrw, tmrw, tmrww, tmw, tomaro, tomarow, tomarro,
tomarrow, tomm, tommarow, tommarrow, tommoro, tommorow, tommorrow, tommorw,
tommrow, tomo, tomolo, tomoro, tomorow, tomorro, tomorrw, tomoz, tomrw, tomz in
different tweets. Second, users don't follow any language structure; they focus on the
meaning of a tweet rather than on grammar rules or correct spelling. Third, more than
500 million tweets are posted daily, along with retweets, likes, etc., which calls for
real-time analytics on a huge amount of data. Fourth, because thousands of tweets stream
in live, searching for tweets is not as effective as searching with Google. At present,
Twitter lists the tweets containing the searched keywords in order of posting date,
irrespective of the importance or relevance of each tweet to those keywords.
Clustering [2-5] is the process of grouping objects in such a way that similar objects
reside in one cluster and dissimilar objects reside in different clusters. Clustering can be
considered the most important unsupervised learning problem because of its applicability
to a large set of problems. The clustering process identifies structure in a collection of
unlabeled data. Clustering algorithms can be classified into two categories: 1) those that
need the number of clusters at the initial step, and 2) those that do not need the number
of clusters in advance. K-means [6-7] and Fuzzy C-means [8] are well-known examples of
centroid-based clustering algorithms in which the number of clusters, k, is required at the
first step. These algorithms are not suited to hashtag clustering because we don't know how
many clusters or groups exist for a given set of hashtags; we cannot even assume an
approximate number. Hierarchical clustering algorithms [9-10] belong to the second category:
the number of clusters is not needed in advance but is required at the last step. They are
also not applicable to hashtag clustering because we don't know where to cut the dendrogram
or when to stop. Hashtag clustering using conventional clustering algorithms is therefore
impractical.
Kwak et al. [11] have studied the topological characteristics of Twitter and, by classifying
trending topics based on temporal behavior, shown that the majority of tweets are news in
nature. Java et al. [12] have studied microblogging usage and communities on Twitter. Feng
et al. [13] have proposed STREAMCUBE, a hierarchical spatio-temporal hashtag clustering
technique that identifies events based on a space and time hierarchy. Song et al. [14] have
improved text understanding by using a probabilistic knowledgebase and a Bayesian inference
mechanism to conceptualize words and short text. Sakaki et al. [15] have investigated
earthquakes on Twitter and proposed an algorithm that monitors tweets and detects a target
event using a tweet classifier based on features such as the keywords in a tweet, the number
of words, and their context. Stilo and Velardi [16] have proposed a temporal sense clustering
algorithm using Symbolic Aggregate ApproXimation. Muntean et al. [17] have clustered a large
set of hashtags using K-means on MapReduce in order to process data in a distributed manner.
Crockett et al. [18] have reviewed 13 unsupervised learning algorithms for analyzing Twitter
data streams and identifying hidden patterns in tweets, where the text is highly
unstructured. Tripathy et al. [19] have proposed using the Wikipedia topic taxonomy to
discover themes in tweets and using these themes along with a traditional word-based
similarity metric for clustering. The study of Adel et al. [20] focuses on clustering tweets
based on their textual content similarity using a cellular genetic algorithm (cGA).
Keshavarz and Abadeh [21] have combined corpus-based and lexicon-based approaches, generating
lexicons from text. Using these lexicons, a novel genetic algorithm is proposed to solve an
optimization problem and find lexicons to classify text. The authors have also classified
tweets into subjective and objective tweets [22]. They extract two meta-level features from
each tweet, the counts of its objective and subjective words, and classify the tweets using
these meta-features. A genetic algorithm is used to create subjectivity lexicons from
training datasets.
Genetic Algorithm (GA) [23-25] is an adaptive heuristic search algorithm that mimics the
evolutionary process of natural selection and survival of the fittest. It is a subset of the
much broader field of evolutionary computation. It represents an intelligent exploitation of
random search, using historical information to direct the search toward promising regions of
the solution space. It is widely used to find optimal or near-optimal solutions to NP-hard
problems in optimization, clustering, etc. in a reasonable time, where exact methods may
take much longer, in some cases years. Many researchers have explored the clustering
capability of GA for various kinds of problems [26-29]. GA can be one of the best methods
for solving the hashtag clustering problem because of features such as the following:
It does not require any derivative information.
It searches in parallel for global optima in a solution space.
It optimizes both continuous and discrete functions.
It provides a list of good solutions and not just a single solution.
It gives a solution in finite time, and the solution improves over time.
It is useful when the search space is very large and a large number of parameters are
involved.
The clustering capability of GA is explored in this research work to cluster a set of
hashtags. To the best of our knowledge, this is the first attempt at applying GA to hashtag
clustering without any prior processing of tweets or use of domain-related information. We
have verified our results using crowdsourcing because no other alternative is available.
A hashtag is a word or phrase preceded by a hash sign (#). It is used on social media
websites, especially Twitter, to tag messages on a specific theme or domain. People put the
hashtag symbol (#) before a word or phrase in their tweets to emphasize those words and
make the tweet easier to find in a search. Twitter also uses hashtags to index keywords.
Because of a hashtag, a tweet may be seen by hundreds or even millions of users who are not
following its author but have searched using that particular hashtag. Tweets without a
hashtag have a very short life. Users create different hashtags for the same topic, such as
#deeplearning, #dl, #deepnet, #deepneuralnet, etc. Moreover, many users combine closely
related hashtags such as #machinelearning #ai #neuralnets #deeplearning etc. Searching for
a single hashtag may not give the best result; hence grouping similar hashtags plays a vital
role in improving Twitter hashtag search results. Grouping similar hashtags is a typical
clustering problem. Hashtag clustering has various applications such as event timeline
generation, community detection, finding people with similar interests, etc.
The rest of the paper is organized as follows. Section 2 describes the working principle of
GA together with our proposed model and its implementation details. Section 3 presents and
discusses the experimental results. Section 4 concludes the paper with an insight into
future work.
2. THE PROPOSED MODEL
2.1. Introduction
Genetic Algorithm (GA) is an adaptive heuristic search algorithm that mimics the
evolutionary process of natural selection and survival of the fittest. GA begins by randomly
generating a set of possible solutions (chromosomes) known as the initial population. The
chromosomes carry the parameter values that make up a solution. GA iterates through several
generations, applying genetic operations to the population and searching for a better
solution in each generation.
The first step in each generation is to calculate the fitness value of each chromosome based
on the optimization function. Once the fitness values are calculated, three basic genetic
operators are applied to the chromosomes. The first is the selection operator: two parent
chromosomes are selected from the current population, favoring high fitness values, to
undergo the subsequent operations. The second is crossover: two child chromosomes are
created from the two parents by applying a crossover operator such as single point, uniform,
or heuristics based. The third is mutation, which is used to avoid getting stuck in local
optima; it changes a gene value randomly with very low probability. The newly produced child
chromosomes and some elite chromosomes are passed to the next generation's population. The
algorithm stops when any one of the stopping criteria is satisfied or the maximum number of
generations is reached. The basic algorithm for GA is as follows:
begin
    t = 0
    randomly initialize population P(t)
    while (t <= max_generation and termination criteria are not achieved)
        Calculate fitness of each individual in population
        Selection_Operation(P(t))
        Crossover_Operation(P(t))
        Mutation_Operation(P(t))
        t = t + 1
        Copy elite individuals of P(t-1) and children to P(t)
    end
end
Figure 1 Algorithm for Genetic algorithm
Implementation of a genetic algorithm varies from problem to problem. Fine-tuning of
parameters plays a crucial role in achieving an optimal solution to the problem. The
parameters have to be chosen very carefully; otherwise, GA may never give the optimal result.
To the best of our knowledge, this is the first research attempt 1) to cluster hashtags
without prior processing or domain knowledge and 2) to do so using GA. Because no prior
research on hashtag clustering using GA is available for reference, we experimented with
different operators and different parameter values in this research work to find the best
set of operators and parameters.
2.2. Chromosome Representation
The chromosome is defined as a string of positive integer values. If there are h different
hashtags, then the length of the chromosome is h, where each gene represents a hashtag. The
value of a gene, also known as an allele, is the number of the cluster to which that
particular hashtag belongs. For example, with 5 hashtags and the chromosome 12112, the
first, third and fourth hashtags belong to cluster number 1, and the second and fifth
hashtags belong to cluster number 2. Table 1 and Table 2 show an example with 10 hashtags,
where the gene index represents the hashtag number and the gene value represents the cluster
number to which that hashtag belongs.
Table 1 Sample list of hashtags
1 2 3 4 5 6 7 8 9 10
#news #AI #delhi #smog #cricket #dnn #food #taste #India #kashmir
Table 2 Randomly generated chromosome from hashtag list of Table 1
1 2 3 4 5 6 7 8 9 10
4 6 1 3 2 1 2 4 2 5
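The encoding in Tables 1 and 2 can be illustrated with a short Python sketch that turns a chromosome back into clusters. The function name `decode_chromosome` is an illustrative assumption and does not come from the paper; the data are the hashtags and genes of Tables 1 and 2.

```python
from collections import defaultdict


def decode_chromosome(chromosome, hashtags):
    """Group hashtags by the cluster number stored in each gene."""
    if len(chromosome) != len(hashtags):
        raise ValueError("chromosome length must equal the number of hashtags")
    clusters = defaultdict(list)
    # Gene k holds the cluster number of hashtag k.
    for hashtag, cluster_id in zip(hashtags, chromosome):
        clusters[cluster_id].append(hashtag)
    return dict(clusters)


# Hashtags from Table 1 and the random chromosome from Table 2.
hashtags = ["#news", "#AI", "#delhi", "#smog", "#cricket",
            "#dnn", "#food", "#taste", "#India", "#kashmir"]
chromosome = [4, 6, 1, 3, 2, 1, 2, 4, 2, 5]
clusters = decode_chromosome(chromosome, hashtags)
```

With this chromosome, cluster 2 holds #cricket, #food and #India, while #AI sits alone in cluster 6, which shows that a random chromosome usually groups unrelated hashtags.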
2.3. Initialization of Population
The population (P) is a set of n chromosomes where each chromosome represents one valid
solution. The initial population is randomly generated. Let n be the size of the population,
Ch denote a chromosome and c be the length of a chromosome. To initialize the r-th
chromosome, Ch_r, in the population (r = 1, 2, …, n), an integer is randomly selected from
the range [1, Clust_max] for each gene of that chromosome. We do not know the approximate
number of clusters in advance, so we first generate chromosomes with Clust_max clusters and
then halve the number of clusters after each step until all hashtags belong to a single
cluster. Fig. 2 shows the algorithm for the generation of the initial population.
Input: List of Hashtags
Output: Randomly Generated Initial Population
initial_chromosome_generation()
{
    P = {}                                        // empty population
    chromosome_tobe_generated = min_chromosomes   // e.g. 5
    min_cluster = 1
    cluster_count = number_of_hashtags
    while (cluster_count >= min_cluster)
    {
        i = 0
        while (i < chromosome_tobe_generated)
        {
            randomly generate a chromosome with cluster_count clusters and add it to P
            i = i + 1
        }
        chromosome_tobe_generated *= 2
        cluster_count /= 2                        // integer division
    }
    return P
}
Figure 2 The proposed method to generate initial random population
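The procedure of Figure 2 could be written in Python roughly as follows. This is a minimal sketch under assumptions: the function name, the `rng` argument, and the reading that the loop runs while at least one cluster remains (with integer halving) are interpretations of the figure, not details confirmed by the paper.

```python
import random


def initial_population(num_hashtags, min_chromosomes=5, rng=None):
    """Generate random chromosomes while halving the allowed cluster count."""
    rng = rng or random.Random()
    population = []
    to_generate = min_chromosomes      # chromosomes to create in this step
    cluster_count = num_hashtags       # start with one cluster per hashtag
    while cluster_count >= 1:
        for _ in range(to_generate):
            # Each gene is a random cluster number in [1, cluster_count].
            chromosome = [rng.randint(1, cluster_count)
                          for _ in range(num_hashtags)]
            population.append(chromosome)
        to_generate *= 2               # more chromosomes as clusters get fewer
        cluster_count //= 2            # integer halving: 10, 5, 2, 1, stop
    return population


population = initial_population(10, min_chromosomes=5, rng=random.Random(0))
```

For 10 hashtags and 5 starting chromosomes, the cluster count takes the values 10, 5, 2 and 1, and the steps contribute 5, 10, 20 and 40 chromosomes, so the population has 75 members. Under this reading, every chromosome in the final step assigns all hashtags to cluster 1.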
The fitness function is based on the co-occurrence frequency of hashtags. Two hashtags are
considered related to a topic if they appear together in at least t tweets. If two hashtags
belong to the same cluster in a solution and their co-occurrence frequency is high, the
solution is assigned a high fitness value; if their co-occurrence frequency is zero or low,
the solution is assigned a low or even negative value. We have tried two different fitness
functions, described in Table 3.
Table 3 Fitness Functions
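Sr. Fitness Function
1 [equation not recoverable], where the weight term is the co-occurrence weight of hashtag i and hashtag j if it is > 0, else -3 or -1 if it is = 0. The coefficient is derived from [expression not recoverable].
2 [equation not recoverable] if the co-occurrence weight is > 0, else -5 or 0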
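Because the exact expressions of Table 3 are not legible here, the following Python sketch only illustrates the idea stated in the text: pairs of hashtags in the same cluster add their co-occurrence weight when it is positive and a penalty otherwise. The function name, the dictionary layout of `cooccurrence`, and the choice of the raw weight with a penalty of -5 are assumptions made for illustration, not the paper's formulas.

```python
from itertools import combinations


def fitness(chromosome, cooccurrence, penalty=-5):
    """Score a clustering from pairwise hashtag co-occurrence weights.

    cooccurrence maps an index pair (i, j) with i < j to a count;
    missing pairs are treated as a co-occurrence of zero (assumption).
    """
    total = 0
    for i, j in combinations(range(len(chromosome)), 2):
        if chromosome[i] != chromosome[j]:
            continue                       # only same-cluster pairs count
        weight = cooccurrence.get((i, j), 0)
        total += weight if weight > 0 else penalty
    return total


# Toy data: hashtags 0 and 1 co-occur 40 times, hashtags 2 and 3 co-occur 35 times.
cooccurrence = {(0, 1): 40, (2, 3): 35}
good = fitness([1, 1, 2, 2], cooccurrence)      # both related pairs grouped
lumped = fitness([1, 1, 1, 1], cooccurrence)    # everything in one cluster
split = fitness([1, 2, 3, 4], cooccurrence)     # every hashtag alone
```

In this toy case the clustering that groups exactly the related pairs scores 75, lumping everything together scores 55 because four unrelated pairs are penalized, and putting every hashtag in its own cluster scores 0. The negative penalty is what discourages one giant cluster.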
2.4. Selection Operation
The widely used Roulette Wheel Selection (RWS) and Stochastic Universal Sampling (SUS)
operators are not applicable to our problem because fitness values can be negative. We have
experimented with two selection operators: Tournament Selection and Linear Rank Selection
(LRS).
2.4.1. Tournament Selection
Tournament selection operator (TSO) is one of the simplest selection operators. Two or more
chromosomes are chosen at random, and the fittest of them is selected for the crossover
operation. The least fit chromosome in the population therefore never gets selected. In our
case, the tournament size is 2.
Figure 3 Tournament Selection
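Tournament selection with a tournament size of 2 can be sketched in Python as below. The function signature and the choice to draw contenders without replacement are assumptions for illustration; the paper does not state how contenders are drawn.

```python
import random


def tournament_select(population, fitness_values, tournament_size=2, rng=None):
    """Pick random contenders and return the fittest one."""
    rng = rng or random.Random()
    # Draw distinct contender indices (sampling without replacement).
    contenders = rng.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda idx: fitness_values[idx])
    return population[winner]


rng = random.Random(1)
population = ["worst", "mid", "best"]
fitness_values = [-10, 5, 20]          # negative fitness is handled naturally
picks = [tournament_select(population, fitness_values, rng=rng)
         for _ in range(200)]
```

Only comparisons between fitness values are used, so negative values cause no difficulty. With distinct contenders, the least fit chromosome always loses its tournament, which matches the remark in the text.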
2.4.2. Linear Rank Selection
Sometimes GA converges prematurely to a local optimum because a few chromosomes have very
high fitness values. LRS tries to overcome this drawback. LRS is based on the rank of
individuals rather than on their fitness values. The chromosomes are ordered by fitness
value; the best chromosome is assigned rank n and the worst is assigned rank 1. Each
individual's probability of being selected is then assigned linearly according to its rank.
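The selection expression itself is not legible in this copy, so the Python sketch below assumes the simplest linear form, in which the probability of an individual equals its rank divided by the sum of all ranks, n(n+1)/2. Both the formula and the function names are assumptions for illustration and may differ from what the authors used.

```python
import random


def linear_rank_probabilities(fitness_values):
    """Return selection probabilities proportional to rank (worst = 1, best = n)."""
    n = len(fitness_values)
    order = sorted(range(n), key=lambda idx: fitness_values[idx])  # worst first
    rank_sum = n * (n + 1) / 2
    probabilities = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        probabilities[idx] = rank / rank_sum
    return probabilities


def linear_rank_select(population, fitness_values, rng=None):
    """Draw one chromosome according to the rank-based probabilities."""
    rng = rng or random.Random()
    weights = linear_rank_probabilities(fitness_values)
    return rng.choices(population, weights=weights, k=1)[0]


probabilities = linear_rank_probabilities([-3, 414, 10])
```

For the fitness values -3, 414 and 10 the ranks are 1, 3 and 2, giving probabilities 1/6, 3/6 and 2/6. The very large value 414 does not dominate the selection, which is the property the text attributes to LRS.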
2.5. Crossover Operation
We have experimented with two widely used crossover operators: single point crossover and
uniform crossover (gene to gene). Single point crossover selects a crossover point at random
and exchanges the genes after that point to produce two new chromosomes. Uniform crossover
is applied on a gene-by-gene basis: a coin is tossed for each gene to decide whether the
first child takes the gene from the first parent or from the second parent. Figs. 4(a) &
4(b) show examples of both types of crossover operation.
Figure 4(a) Single point Crossover
Figure 4(b) Uniform Crossover
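Both crossover operators can be sketched in Python as follows. The function names are illustrative, and the assumption that the second child receives the gene the first child did not take in uniform crossover is a common convention rather than something the paper states.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=None):
    """Swap the tails of two parents after a random cut point."""
    rng = rng or random.Random()
    point = rng.randint(1, len(parent_a) - 1)   # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def uniform_crossover(parent_a, parent_b, rng=None):
    """Toss a fair coin per gene to decide which parent feeds which child."""
    rng = rng or random.Random()
    child_a, child_b = [], []
    for gene_a, gene_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:
            child_a.append(gene_a)
            child_b.append(gene_b)
        else:
            child_a.append(gene_b)
            child_b.append(gene_a)
    return child_a, child_b


parent_a, parent_b = [1] * 6, [2] * 6
sp_a, sp_b = single_point_crossover(parent_a, parent_b, rng=random.Random(3))
un_a, un_b = uniform_crossover(parent_a, parent_b, rng=random.Random(3))
```

Single point crossover keeps contiguous blocks of genes together, whereas uniform crossover mixes the parents at every position independently.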
2.6. Mutation
Mutation is a small random tweak to the chromosome that lets the search move to a new
solution. It changes the value of a gene randomly with very low probability. We have used
this classic technique with a novel approach: we first decide, with a moderate probability,
whether to apply mutation to a chromosome at all, and then apply mutation to each gene of
that chromosome with very low probability. Fig. 5 represents the mutation operation.
Figure 5 Mutation Operation
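The two-level mutation described above can be sketched in Python as below. The function name and the default values are assumptions: 0.25 and 0.05 correspond to the mutation probability of 25 and mutation gene probability of 5 reported later in the paper, assuming those figures are percentages, and `max_cluster` is a hypothetical upper bound for the new cluster number.

```python
import random


def mutate(chromosome, max_cluster, chromosome_prob=0.25, gene_prob=0.05, rng=None):
    """Mutate a chromosome in two stages and return a new list."""
    rng = rng or random.Random()
    # Stage 1: decide whether this chromosome is mutated at all.
    if rng.random() >= chromosome_prob:
        return list(chromosome)
    # Stage 2: reassign each gene to a random cluster with low probability.
    return [rng.randint(1, max_cluster) if rng.random() < gene_prob else gene
            for gene in chromosome]


original = [4, 6, 1, 3, 2, 1, 2, 4, 2, 5]
untouched = mutate(original, max_cluster=6, chromosome_prob=0.0)
scrambled = mutate(original, max_cluster=6, chromosome_prob=1.0,
                   gene_prob=1.0, rng=random.Random(7))
```

With the defaults, a gene changes with an overall probability of about 0.25 × 0.05 = 0.0125, so most chromosomes pass through unchanged while a few receive a small number of reassigned hashtags.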
3. EXPERIMENTAL RESULTS
3.1. Data Collection
We have used the Twitter REST API [30] to download tweets. Using the REST API, we can fetch
the most recent 3200 tweets of a user. We have downloaded tweets from 76 popular Indian
media Twitter accounts, listed in Table 4, going backward from 24 September 2017. This
dataset contains 2,27,000+ tweets, 1,55,000+ hashtags, 48,500+ hashtag pairs and 24,500+
unique hashtags. We have removed hashtags with a co-occurrence frequency of less than 30 to
make the problem computationally feasible.
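The co-occurrence counting and filtering step could look like the following Python sketch. It works on plain tweet texts and does not call the Twitter API; the regular expression, the upper-casing of hashtags, and the reading that the threshold of 30 applies to each hashtag pair are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def cooccurrence_counts(tweets):
    """Count how many tweets contain each unordered pair of hashtags."""
    counts = Counter()
    for text in tweets:
        # Deduplicate and normalize case so #Mumbai and #mumbai match.
        tags = sorted({tag.upper() for tag in HASHTAG_PATTERN.findall(text)})
        counts.update(combinations(tags, 2))
    return counts


def frequent_pairs(counts, min_count=30):
    """Keep only pairs that co-occur in at least min_count tweets."""
    return {pair: n for pair, n in counts.items() if n >= min_count}


tweets = [
    "Rain in #Mumbai #MumbaiRains",
    "#mumbai floods #mumbairains again",
    "#GST news",
]
counts = cooccurrence_counts(tweets)
kept = frequent_pairs(counts, min_count=2)
```

A tweet with a single hashtag contributes no pair, so a hashtag such as GST in this toy input would drop out unless it co-occurs with another hashtag elsewhere.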
3.2. Parameter Selection
Proper selection of GA parameters is crucial because optimal results cannot be achieved if
even one parameter is set improperly. We developed an interactive, generalized experimental
model for measuring the effect of specific parameters and operators on GA's performance.
Table 5 gives the parameter values that we have evaluated using our model.
Table 4 Tweets downloaded from following Indian Media Twitter Accounts
@aajtak
@abpnewstv
@AmarUjalaNews
@Avinash_Mirror
@BDUTT
@BTVI
@CNBC_Awaaz
@CNBCTV18Live
@CNBCTV18News
@DainikBhaskar
@DDNewsLive
@DeccanChronicle
@DeccanHerald
@dibang
@EconomicTimes
@Eenadu_English
@FeminaIndia
@filmfare
@grihshobha
@gujratsamachar
@Haribhoomicom
@htTweets
@IBN7Media
@IndianExpress
@IndiaToday
@indiatvnews
@JagranNews
@KanchanGupta
@Live_Hindustan
@loksabhatv
@madhutrehan
@MayantiLanger_B
@mid_day
@MiniMenon
@mjakbar
@MumbaiMirror
@NavbharatTimes
@ndtv
@NDTVProfit
@NewIndianXpress
@News18Breaking
@News24
@NewsNationTV
@NewsWorldIN
@NewsX
@Nidhi
@Outlookindia
@prabhatkhabar
@PrabhuChawla
@PrannoyRoyNDTV
@punjabkesari
@rahulkanwal
@RajatSharmaLive
@ravishndtv
@readersdigest
@republic
@SachinKalbag
@sagarikaghose
@sardesairajdeep
@ShereenBhan
@ShomaChaudhury
@sportstarweb
@suchetadalal
@sudhirchaudhary
@SwetaSinghAT
@Telegraph
@thetribunechd
@THexplains
@TimesNow
@timesofindia
@totaltvguide
@vikramchandra
@WIONews
@WTOV9
@ZeeBusiness
@ZeeNews
Table 5 Parameters and Operators used in the proposed model
Parameter                    Values
Fitness Function             1) ln (Max Value 50, 25; Penalty -1, -3)
                             2) Weight (Penalty -5, -10)
Selection Method             1) Tournament Selection 2) Linear Rank Selection
Crossover Method             1) Single Point 2) Gene to Gene
Crossover Probability        1) 100 2) 90 3) 80 4) 70 5) 60
Mutation Probability         1) 10 2) 20 3) 25
Mutation Gene Probability    1) 5 2) 10 3) 20
Number of Generations        1000
Tweets Duration              24 Sep 2017 and earlier
3.3. Results and Discussion
The initial population is created from randomly generated solutions that assign hashtags to
different clusters. Chromosomes that place hashtags with higher co-occurrence frequency in
the same cluster are assigned higher fitness values. As generations pass, such chromosomes
have a better chance of moving forward through the evolutionary iterations. Fig. 6 shows the
evolution of the solution as the generations pass.
Figure 6 The evolution of solution with generation
As the algorithm converges, clusters form that group the most relevant hashtags together.
Table 6 shows the fittest chromosome achieved after 1000 generations, which has a fitness
value of 414. A total of 56 clusters have evolved from the chosen data set. Some clusters
contain only one hashtag because that hashtag is not related to any other hashtag.
Table 6 Clusters generated using the proposed method
Cluster Number    Hashtags included in the cluster
1. CNBCAWAAZ | HEADLINES | LIVE | MARKETCOUNTDOWN | MARKETKAPANCHNAMA | MORNINGCALL | Q1WITHAWAAZ | STOCK2020;
2. DERASACHASAUDA | HARYANA | HONEYPREET | RAMRAHIM | RAMRAHIMSINGH | RAMRAHIMVERDIC;
3. GURUGRAM | PRADYUMAN | PRADYUMANMURDERCASE | RYANINTERNATIONALSCHOOL;
4. FUELPRICES | MIDDAYMUMBAI | MIDDAYNEWS | MUMBAI | MUMBAINEWS | MUMBAIRAINS;
5. AUSTRALIA | CHINA | DOKLAM | INDIA | JAMMU | KASHMIR | PAKISTAN | UNGA;
6. BHASKARALERT | UPGOVT | UTTARPRADESH | YOGIADITYANATH;
7. INSPIRATIONALQUOTES | PKTHOUGHT | THOUGHTOFTHEDAY;
8. BEAUTY | FASHION | HAIRCARE | MAKEUP | SKINCARE;
9. APPLEEVENT | IPHONE8 | IPHONE8PLUS | IPHONEX;
10. FICTION | HINDISHORTSTORY | LITERATURE | STORY;
11. AMITSHAH | BJP | INFOSYS | VISHALSIKKA;
12. MARKETATCLOSE | MARKETS | NIFTY | SENSEX;
13. INCREDIBLEINDIA | TOURISM | TRAVEL;
14. BANGLADESH | MYANMAR | ROHINGYA;
15. BOLLYWOOD | ENTERTAINMENT | PKVIDEO;
16. KAREENAKAPOORKHAN | TAIMURALIKHAN;
17. BULLETTRAIN | PMMODI | SHINZOABE;
18. INDVSL | INDVSSL | SLVIND | TEAMINDIA;
19. DONALDTRUMP | JAPAN | NORTHKOREA;
20. AWAAZADDA | BIGDEBATE | POLL;
21. JANMAN | UPVASFIXING | जनमन;
22. BOLLYWOODPHOTOS | MIDDAYBOLLYWOOD;
23. FREEDOMATMIDNIGHT | MISSIONGST;
24. NOZOMIOKUHARA | PVSINDHU;
25. MUKESHAMBANI | RILAGM2017;
26. KANGANARANAUT | SIMRAN;
27. INDVENG | WWC17;
28. FOOD | INDIANFOOD;
29. LOVE | RELATIONSHIP;
30. JUSTIN | NDTVNEWS;
31. AIADMK | SASIKALA;
32. BIHAR | NITISHKUMAR;
33. CONGRESS | RAHULGANDHI;
34. DAWOODIBRAHIM | IQBALKASKAR;
35. ADIKHAMGUJARAT | SLVSIND;
36. CABINETREJIG | CABINETRESUFFLE;
37. SUPREMECOURT | TRIPLETALAQ;
38. FIIBROKERAGES | HOUSEVIEWS;
39. MODIINVARANASI | NARENDRAMODI;
40. BANKSEBACHAO | TWEETMORCHA;
41. NATIONALVOICE | SACHCHIBAAT;
42. INDVAUS | VAARTA;
43. NEWTON | OSCARS;
44. DIESEL | PETROL;
45. AMARUJALATV;
46. GST;
47. PANCHKULA;
48. MAHIRAKHAN;
49. SANSKRIT;
50. AUVIDEO;
51. l8IND;
52. EARTHQUAKE;
53. MEXICO;
54. RANBIRKAPOOR;
55. BOLLYWOODUPDATE;
56. SRILANKA;
We have evaluated the quality of our proposed model on a recently downloaded real-world
data set, as no benchmark results are available for comparison. We have used a
crowdsourcing model to validate the results and check the applicability of the proposed
model. A total of 100 distinct users were given the task of clustering the given set of
hashtags. Table 7 shows the fitness values of the solutions built from the clusters assigned
by the different users. The best fitness value found by GA is 414, whereas the best fitness
value found by crowdsourcing is 404; GA has thus produced a better result than the best
result found by crowdsourcing. The clusters generated using GA were also validated by the
users involved in our experiments. According to their common observations, two of the
clusters contain misplaced hashtags (INDVAUS | VAARTA; ADIKHAMGUJARAT | SLVSIND;). There is
also a cluster (AMITSHAH | BJP | INFOSYS | VISHALSIKKA;) that needs to be divided further
into sub-clusters for more focused results. In all other cases, the clusters are well formed
and contain the most relevant hashtags.
Table 7 Fitness value of solutions clustered by crowdsourcing users
User Fitness value User Fitness value User Fitness value User Fitness value
User 1 306 User 26 390 User 51 286 User 76 282
User 2 309 User 27 368 User 52 336 User 77 387
User 3 284 User 28 373 User 53 383 User 78 329
User 4 363 User 29 267 User 54 275 User 79 276
User 5 334 User 30 355 User 55 343 User 80 291
User 6 386 User 31 258 User 56 396 User 81 297
User 7 404 User 32 318 User 57 267 User 82 343
User 8 275 User 33 327 User 58 373 User 83 283
User 9 377 User 34 303 User 59 338 User 84 275
User 10 332 User 35 210 User 60 391 User 85 350
User 11 354 User 36 368 User 61 394 User 86 330
User 12 393 User 37 283 User 62 283 User 87 341
User 13 265 User 38 279 User 63 396 User 88 278
User 14 373 User 39 282 User 64 325 User 89 390
User 15 381 User 40 397 User 65 292 User 90 380
User 16 263 User 41 278 User 66 289 User 91 270
User 17 258 User 42 349 User 67 398 User 92 288
User 18 308 User 43 363 User 68 261 User 93 341
User 19 266 User 44 321 User 69 315 User 94 287
User 20 340 User 45 361 User 70 337 User 95 365
User 21 389 User 46 387 User 71 362 User 96 379
User 22 387 User 47 397 User 72 327 User 97 283
User 23 360 User 48 376 User 73 368 User 98 386
User 24 317 User 49 356 User 74 281 User 99 275
User 25 272 User 50 321 User 75 334 User 100 272
Table 8 shows the top 10 results achieved by our proposed model. Different solutions
represent different clusterings, and all of these results are comparatively good; hence any
of these solutions can be used, depending on the requirement.
Table 8 Top 10 best results achieved using the proposed model
SM Cp CM FF FF Max FF Pen Mp MGp Fitness Value
TSO 80 G2G Freq Freq -5 25 5 414
LRS 80 G2G ln 50 -1 25 5 407
TSO 100 G2G Freq Freq -5 25 5 403
LRS 100 G2G Freq Freq -5 25 5 403
LRS 80 G2G Freq Freq -5 25 5 393
LRS 90 G2G Freq Freq -5 25 5 390
TSO 100 G2G ln 50 -1 20 5 388
LRS 90 G2G Freq Freq -5 25 10 385
TSO 100 G2G Freq Freq -5 25 10 382
TSO 90 G2G Freq Freq -5 25 5 371
Tables 9(a) & 9(b) present the average fitness values found using different GA operators
and parameters. Both selection operators (TSO & LRS) can be used for this application, as
they produce similar results. Uniform (gene to gene) crossover produces substantially better
results than single point crossover. The crossover probability (Cp) should be high, in the
range of 90 to 100. The mutation probability (Mp) on a chromosome should be moderate, around
25, whereas the mutation gene probability (MGp) should be as low as 5.
Table 9(a) Comparison of different values for Crossover Operation
Crossover Probability    Avg. Fitness
100 325
90 318
80 307
70 292
60 260
Crossover Method    Avg. Fitness
Uniform (Gene to Gene) 316
Single Point 240
Table 9(b) Comparison of different values for Mutation Operation
Mutation Probability    Avg. Fitness
25 312
20 299
15 278
10 264
Mutation Gene Probability    Avg. Fitness
5 331
10 302
15 290
20 278
The huge volume of tweets, which record concrete observations of incidents happening in real
life worldwide, is a valuable source of information. In practice, running mining operations
on such a large data set every time is not feasible. There is a growing need for automatic
and intelligent tools and methods that produce useful outcomes. Such outcomes can be further
exploited in many important real-life areas such as decision making, optimization, etc.
4. CONCLUSIONS
The genetic algorithm has been used widely for clustering for many years. To the best of our
knowledge, this is the first effort to use the genetic algorithm for Twitter hashtag
clustering. It is also the first attempt to cluster hashtags without preprocessing or any
domain knowledge. We have tested our proposed model on a live dataset downloaded from
Twitter. No benchmark results are available for this dataset against which we could compare
the results of our model. Hence, using crowdsourcing, we have compared the results of our
model with the clustering results produced by various users. The results achieved using our
model are encouraging and better than the best result achieved using the crowdsourcing
model. The two most promising highlights of our work are: first, there is no need to
preprocess hashtags or use domain knowledge; second, the number of clusters is not required
at the initial step, as our model detects it automatically from the given data set. This
study shows that the major hurdles in hashtag clustering can be overcome using the genetic
algorithm. Automatic hashtag clustering using GA can support applications such as event
timeline generation, community finding, trend detection, etc. However, it is equally
important to select proper genetic parameters, which is crucial for good performance of the
algorithm. We have developed an efficient, generalized model and derived a set of parameters
for the best results.
ACKNOWLEDGEMENT
We are very thankful to Nirma University for providing resources and other
facilities to carry out this research work.
REFERENCES
[1] Twitter. It’s what’s happening. www.twitter.com
[2] Jain, A.K., and Dubes, R.C., 1988. Algorithms for clustering data. Prentice-Hall, Inc.
[3] Jain, A.K., Murty, M.N., and Flynn, P.J., 1999. Data clustering: a review. ACM computing
surveys (CSUR), 31(3), pp.264-323.
[4] Steinbach, M., Karypis, G., and Kumar, V., 2000, August. A comparison of document
clustering techniques. In KDD workshop on text mining, 400(1), pp. 525-526.
[5] Xu, R., and Wunsch, D., 2005. Survey of clustering algorithms. IEEE Transactions on
neural networks, 16(3), pp.645-678.
[6] Hartigan, J.A., and Wong, M.A., 1979. Algorithm AS 136: A k-means clustering
algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1),
pp.100-108.
[7] Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., and Wu, A.Y.,
2002. An efficient k-means clustering algorithm: Analysis and implementation. IEEE
transactions on pattern analysis and machine intelligence, 24(7), pp.881-892.
[8] Gath, I., and Geva, A.B., 1989. Unsupervised optimal fuzzy clustering. IEEE Transactions
on pattern analysis and machine intelligence, 11(7), pp.773-780.
[9] Sneath, P.H., and Sokal, R.R., 1973. Numerical taxonomy. The principles and practice of
numerical classification.
[10] Steinbach, M., Karypis, G., and Kumar, V., 2000. A comparison of document clustering
techniques. In KDD workshop on text mining, 400(1), pp. 525-526.
[11] Kwak, H., Lee, C., Park, H., and Moon, S., 2010, April. What is Twitter, a social network
or a news media? In Proceedings of the 19th international conference on World Wide
Web, pp. 591-600. ACM.
[12] Java, A., Song, X., Finin, T., and Tseng, B., 2007, August. Why we twitter: understanding
microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-
KDD 2007 workshop on Web mining and social network analysis, pp. 56-65. ACM.
[13] Feng, W., Zhang, C., Zhang, W., Han, J., Wang, J., Aggarwal, C., and Huang, J., 2015,
April. STREAMCUBE: hierarchical spatio-temporal hashtag clustering for event
exploration over the twitter stream. In Data Engineering (ICDE), 2015 IEEE 31st
International Conference, pp. 1561-1572. IEEE.
[14] Song, Y., Wang, H., Wang, Z., Li, H., and Chen, W., 2011, July. Short text
conceptualization using a probabilistic knowledgebase. In Proceedings of the Twenty-
Second international joint conference on Artificial Intelligence, 3, pp. 2330-2336. AAAI
Press.
[15] Sakaki, T., Okazaki, M., and Matsuo, Y., 2010, April. Earthquake shakes Twitter users:
real-time event detection by social sensors. In Proceedings of the 19th international
conference on World Wide Web, pp. 851-860. ACM.
[16] Stilo, G., and Velardi, P., 2017. Hashtag sense clustering based on temporal similarity.
Computational Linguistics.
[17] Muntean, C.I., Morar, G.A., and Moldovan, D., 2012. Exploring the meaning behind
twitter hashtags through clustering. In Business Information Systems Workshops, pp. 231-
242. Springer Berlin Heidelberg.
[18] Crockett, K.A., Mclean, D., Latham, A., and Alnajran, N., 2017. Cluster Analysis of
Twitter Data: A Review of Algorithms. In Proceedings of the 9th International Conference
on Agents and Artificial Intelligence, 2, pp. 239-249. Science and Technology
Publications (SCITEPRESS)/Springer Books.
[19] Tripathy, R.M., Sharma, S., Joshi, S., Mehta, S., and Bagchi, A., 2014. Theme based
clustering of tweets. In Proceedings of the 1st IKDD Conference on Data Sciences, pp. 1-
5. ACM.
[20] Adel, A., ElFakharany, E., and Badr, A., 2014. Clustering tweets using cellular genetic
algorithm. Journal of Computer Science, 10, pp. 1269-1280.
doi: 10.3844/jcssp.2014.1269.1280.
[21] Keshavarz, H., and Abadeh, M.S., 2017. ALGA: Adaptive lexicon learning using genetic
algorithm for sentiment analysis of microblogs. Knowledge-Based Systems, 122, pp.1-16.
[22] Keshavarz, H., and Abadeh, M.S., 2016, March. SubLex: Generating subjectivity lexicons
using genetic algorithm for subjectivity classification of big social data. In Swarm
Intelligence and Evolutionary Computation (CSIEC), 2016 1st Conference, pp. 136-141.
IEEE.
[23] Holland, J.H., 1992. Adaptation in natural and artificial systems: an introductory analysis
with applications to biology, control, and artificial intelligence. MIT press.
[24] Goldberg, D.E., 1989. Genetic algorithms in search, optimization, and machine learning.
Reading: Addison-Wesley.
[25] Mitchell, M., 1998. An introduction to genetic algorithms. MIT press.
[26] Maulik, U., and Bandyopadhyay, S., 2000. Genetic algorithm-based clustering technique.
Pattern Recognition, 33(9), pp.1455-1465.
[27] Bandyopadhyay, S., and Maulik, U., 2002. Genetic clustering for automatic evolution of
clusters and application to image classification. Pattern Recognition, 35(6), pp.1197-1208.
[28] Hruschka, E.R., Campello, R.J., and Freitas, A.A., 2009. A survey of evolutionary
algorithms for clustering. IEEE Transactions on Systems, Man, and Cybernetics, Part C
(Applications and Reviews), 39(2), pp.133-155.
[29] Rahman, M.A., and Islam, M.Z., 2014. A hybrid clustering technique combining a novel
genetic algorithm with K-Means. Knowledge-Based Systems, 71, pp.345-365.
[30] https://dev.twitter.com/twitterkit/android/access-rest-api