International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected]

Volume 6, Issue 3, May- June 2017 ISSN 2278-6856

Volume 6, Issue 3, May – June 2017 Page 395

Abstract
The success of Twitter and the opportunities it provides have led to new web-based social network applications and opened new frontiers. Discovering how social media sites are used can therefore inform decisions about the design and implementation of such applications, as well as of educational tools. In this study we extract the tweets posted by a user, analyze them, and predict the age group and gender of that user. The classification model is built from lexical features and learning algorithms. The ability to classify latent user attributes, including gender and age, exclusively from the informal language used by Twitter users has important applications in personalization, advertising, and recommendation. The work includes an investigation of classification algorithms over a rich set of features [1] applied to classifying these user attributes, together with an extensive analysis of the features and approaches that are effective for casual written genres, which differ from commonly spoken genres. Since Twitter provides only the name of its users, latent attributes are hidden elements that are not directly available on the site, so a prediction system is needed that infers them from a user’s tweets. The same data set is also analyzed with several of WEKA’s classification algorithms, and the results of the authors’ method and of WEKA’s classifiers are compared. The study concludes that the proposed method performs better than the other classifiers tested.

Keywords: Latent Attribute, SVM, WEKA

1. INTRODUCTION
A social network is an interactive network that connects many individuals through relationships. These relationships may appear simple at first, but closer scrutiny can reveal interesting structures. As social media becomes more integrated into our environment, it plays a bigger role in our daily lives, and it has become one of the major areas of study for researchers. The topic is even more worthwhile because economic and business opportunities have sprouted up around these sites. Social media sites use user information to automate tasks such as profile customization, advertisement targeting, and interest prediction. Nevertheless, the data that users provide is often fragmented and incomplete. Users may provide information such as their name and workplace but leave out other information such as interests, gender, and sometimes age. It is therefore important to predict the unavailable information from the subset that users do provide.

Predicting users’ preferences and demographic information has a long history. Gender may seem a basic piece of information, but finding it can open doors to many other applications of information prediction. Given a user’s actual name, we can simply evaluate whether it is a boy’s name or a girl’s name, but only if the name is unambiguous (names such as Taylor, Sam, and Kiran are not). Moreover, we sometimes do not have access to a user’s actual name and have to rely on other information, such as the screen name and followers (or friends), to predict gender. Since Twitter provides only the name and location of its users, we develop a classification system that predicts the latent attributes of a Twitter user from his or her tweets. The system is designed to predict the age group and gender of Twitter users of a particular region. The classification model is built from lexical features and learning algorithms. We thus propose a system for “Automated identification of user attributes from social media site: case-study Twitter”. Social media outlets such as Twitter have become an important forum for peer interaction, and the designed system is able to classify user attributes, including gender and age, solely from tweets.

2. FLOW MODEL OF THE PROPOSED SYSTEM
The flow model of the proposed system is shown in fig. 1. The steps followed in the flow are given below.

Figure 1 Data Flow Diagram of system

Automated Identification of Latent Attributes of Twitter users

Karuna C. Gull 1, Sudip Padhye 2, Dr. Subodh Jain 3

1 Department of Computer Science and Engineering, K.L.E. Institute of Technology, V.T.U., Hubli - 580030, India.

2 Business Intelligence, Digital Transformation Unit, KPIT Technologies Ltd., Navi Mumbai – 400710, India.

3 Department of Computer Science, SVN University, Sagar, MP, India.


1. The authors first carry out an authentication process to access the relevant data from Twitter [7].
2. The extracted data is preprocessed [8].
3. The preprocessed data serves as training data, or feature-selection information.
4. Information related to the screen name or the user’s actual name is collected and analyzed.
5. The analyzed data is fed as input to the Support Vector Machine (SVM) algorithm to predict user attributes such as gender and age.
6. The predicted output is sent to the display module, which displays the result.

5 ANALYSIS OF THE DESIGNED SYSTEM
The proposed method for identifying user attributes is elaborated in fig. 2. The steps followed are given below.
Step 1: The authentication process issues an access token and an access secret key [7], which our application uses to extract the tweets and the user information separately.
Step 2: The data extracted from Twitter is preprocessed, for example by removing stop words and counting the occurrences of words in the tweets, in order to train the designed system. The preprocessed data is then converted into the vector form accepted by the machine learning algorithm, the Support Vector Machine (SVM).
Step 3: The SVM machine learning algorithm is then used to identify the attributes of users, specifically age and gender, from the metadata collected.

Figure 2 Elaboration of Proposed System

6 SUPPORT VECTOR MACHINE
The support vector machine (SVM) is a sophisticated and widely used classification method, and its high classification accuracy has made it popular. The support vector machine [2] is classed as a non-probabilistic binary linear classifier. It works by plotting the training data in a multidimensional space and then trying to separate the classes with a hyperplane. If the classes are not immediately linearly separable in that space, the algorithm adds a new dimension in an attempt to separate them further, and it continues this process until it can separate the training data into its two classes with a hyperplane [6]. A basic representation of how it splits the data is shown in fig. 3.

Figure 3 SVM basic operation (Anon., 2011)

Working: In an n-dimensional plot, where n is the number of features, each point represents one data item, with its feature values as coordinates. The support vectors are the observations that lie closest to the boundary between the classes. The Support Vector Machine [3] finds the boundary (a hyperplane, or a line in two dimensions) that best segregates the two classes. For a given dataset there may be multiple hyperplanes, in some cases infinitely many, that separate the classes. The SVM algorithm [5] chooses the one that provides the maximum separation between the classes, that is, the maximal margin hyperplane, which minimizes the upper bound on the classification error. Classification is then done with this hyperplane, which differentiates the two classes efficiently.

Figure 4 Hyper-plane separating the two classes [9]
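To make the maximal-margin idea concrete, the following minimal Python sketch compares two candidate separating hyperplanes on a tiny made-up data set and keeps the one with the larger geometric margin. The points, labels, and candidate hyperplanes are illustrative assumptions, not data from this work, and a real SVM solver searches over all hyperplanes rather than a fixed list.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive result means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical two-feature data set with labels -1 and +1.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 3.0)]
labels = [-1, -1, +1, +1]

# Two hypothetical hyperplanes (w, b) that both separate the classes.
candidates = {
    "vertical": ((1.0, 0.0), -3.0),  # the line x1 = 3
    "diagonal": ((1.0, 1.0), -5.0),  # the line x1 + x2 = 5
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)  # the candidate with the widest margin
```

Both candidates classify every point correctly, so the margin alone decides between them, which is the selection rule described above.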


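The objective function for Linear SVM is:

$$\min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)}\, \mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}$$

where θ is the parameter vector, x is the feature vector, n is the number of features, m is the number of training examples, C is the regularization parameter, and $\mathrm{cost}_1(\theta^{T} x)$ and $\mathrm{cost}_0(\theta^{T} x)$ are the costs when y = 1 and y = 0 respectively.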
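A linear SVM objective of this kind, a C-weighted sum of per-example costs plus half the squared norm of θ, can be evaluated with the short Python sketch below. The text does not define the two cost functions, so the sketch assumes the common hinge forms cost_1(z) = max(0, 1 - z) and cost_0(z) = max(0, 1 + z); it also assumes there is no separate intercept term. The data values are made up.

```python
def cost1(z):
    """Assumed hinge cost for a positive example (y = 1)."""
    return max(0.0, 1.0 - z)


def cost0(z):
    """Assumed hinge cost for a negative example (y = 0)."""
    return max(0.0, 1.0 + z)


def svm_objective(theta, X, y, C):
    """C * sum_i [y_i*cost1(theta.x_i) + (1-y_i)*cost0(theta.x_i)] + 0.5*||theta||^2."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        data_term += y_i * cost1(z) + (1 - y_i) * cost0(z)
    regularizer = 0.5 * sum(t * t for t in theta)
    return C * data_term + regularizer


# Hypothetical data: two features, one positive and one negative example.
X = [[2.0, 0.0], [0.0, 2.0]]
y = [1, 0]

separating = svm_objective([1.0, -1.0], X, y, C=1.0)  # both examples outside the margin
all_zero = svm_objective([0.0, 0.0], X, y, C=1.0)     # both examples pay cost 1
```

A larger C puts more weight on the per-example costs relative to the regularization term.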

7 IMPLEMENTATION PROCESS
After the application is authenticated with the Twitter account, the Twitter database is queried by giving a screen name, and the status information (a minimum of 500 statuses) for that screen name is collected. The training process of the designed system is shown in fig. 5.

Figure 5 Process to collect the information needed for training of system

The status information, along with the screen name and tweets, is inserted into a table. The system database contains the tweet_info table, whose sample contents are shown in table 1.

Table 1: Status information in the tweet_info table

| tweet_id | screen_name | tweet | timing | re_tweet | favourite |
|---|---|---|---|---|---|
| 583640176238277000 | _AkshataShetty | Nashville Recap: Smash Landings: I’m so glad they started right after the slap-hug combo, aren’t you? ... http://t.co/wDRTOvXogh #music | 2015-04-03 08:40:36 | 0 | 0 |
| 583640176917754000 | _AkshataShetty | GRRM Posts a New Winds of Winter Preview Chapter, and It’s All About Sansa: Sansa Starks story line o... http://t.co/hnG4kQvisG #music | 2015-04-03 08:40:36 | 0 | 0 |
| 583640177593098000 | _AkshataShetty | True Story Author Michael Finkel on His Relationship With the Murderer Who Inspired the James Franco–Jo... http://t.co/9DUGbIzE0e #music | 2015-04-03 08:40:36 | 1 | 1 |
| : | : | : | : | : | : |
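The tweet_info table can be sketched in Python as follows. The source names only the columns; the use of SQLite, the in-memory database, and the column types are assumptions made for illustration, and the tweet text in the sample row is shortened.

```python
import sqlite3

# In-memory database for illustration; the storage engine is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tweet_info (
        tweet_id    INTEGER,
        screen_name TEXT NOT NULL,
        tweet       TEXT NOT NULL,
        timing      TEXT,     -- stored as 'YYYY-MM-DD HH:MM:SS'
        re_tweet    INTEGER,
        favourite   INTEGER
    )
    """
)

# One sample row taken from Table 1 (tweet text shortened here).
conn.execute(
    "INSERT INTO tweet_info VALUES (?, ?, ?, ?, ?, ?)",
    (
        583640176238277000,
        "_AkshataShetty",
        "Nashville Recap: Smash Landings",
        "2015-04-03 08:40:36",
        0,
        0,
    ),
)
conn.commit()

# Count the stored tweets per screen name.
rows = conn.execute(
    "SELECT screen_name, COUNT(*) FROM tweet_info GROUP BY screen_name"
).fetchall()
```

Parameterized inserts (the `?` placeholders) keep tweet text containing quotes from breaking the statement.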

User information such as gender and age range, together with a category, is entered to create the user_info table, which acts as training data for the designed system. Table 2 shows sample data.

Table 2: Creation of user_info table

| User_name | Gender | Age | Category |
|---|---|---|---|
| _AkshataShetty | Female | 21-40 | Test |
| AkanchaS | female | 41-60 | train |
| akrout81 | male | 21-40 | train |
| AmarRamesh | male | 21-40 | train |
| amitnimade | male | 41-60 | train |
| Amitvele | male | 21-40 | train |
| : | : | : | : |
| cadrsunilgupta | male | 41-60 | train |
| cartoonistpai | male | 21-40 | train |
| chakraberty | male | 41-60 | train |
| ChefAroraBhakti | female | 21-40 | train |
| crazydiode | male | 21-40 | train |
| : | : | : | : |

System training
1. Access the documents whose category is “train” and place them in Docs.
2. Access the tweets from the tweet_info table for every user and place them in a document d in Docs (d1 holds the tweets of user 1, d2 the tweets of user 2, and so on).


Figure 6 Process to predict the user attributes

3. Pre-process the data:
   Clean tweets: remove links, @, newlines, punctuation, numbers, etc.
   Create text: concatenate the tweets of a single user.
   Create tokens: tokenize the concatenated tweets.
4. Fill the document with all tokens.
5. Eliminate words of one or two characters (e.g. I, a, an, am) and keep all words that have more than two characters.

This includes creating a vocabulary and pre-processing the tweet_info data by cleaning, concatenating, and tokenizing it, which finally yields the processed_tweets. Table 3 shows sample pre-processed tweets.
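The cleaning, concatenation, tokenization, and short-word filtering steps can be sketched in Python as below. The regular expressions are assumptions about what "removal of link, @, punctuation, numbers" means in practice (for example, whole @mentions are dropped and only lowercase ASCII letters are kept), and the sample tweets and URL are made up.

```python
import re

LINK_RE = re.compile(r"https?://\S+")    # links
MENTION_RE = re.compile(r"@\w+")         # @mentions (assumed to be dropped whole)
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, other symbols


def clean_tweet(text):
    """Lowercase a tweet and strip links, mentions, punctuation, and numbers."""
    text = LINK_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub("", text.lower())
    return " ".join(text.split())  # also collapses newlines and extra spaces


def preprocess_user(tweets, min_length=3):
    """Concatenate one user's cleaned tweets, tokenize, and drop short words."""
    document = " ".join(clean_tweet(t) for t in tweets)
    tokens = document.split()
    return [tok for tok in tokens if len(tok) >= min_length]


# Hypothetical tweets for a single user.
sample = [
    "Great going @richa! Today's Delhi Times http://t.co/abc #MustWatch at 10 pm",
    "I am\nwatching now",
]
tokens = preprocess_user(sample)
```

Keeping the cleaning step separate from the length filter mirrors the numbered steps, where short words are removed only after the document has been tokenized.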

Table 3: Sample pre-processed tweets

| Tweet | Age | Gender |
|---|---|---|
| are you serious isupportmsg | 21-40 | Female |
| katju | 21-40 | Female |
| great going richa todays delhi times | 21-40 | Female |
| only the brave come here faces thirddegree at am watch now | 21-40 | Female |
| kejriwal ke saamne kiran pm halla bol delhiassemblyelections watch live at aajtakin | 21-40 | Female |
| hahahaha | 21-40 | Female |
| first exclusive interview of on right now at pm mustwatch | 21-40 | Female |
| first exclusive interview of former ips on right now at pm mustwatch kiranbedi | 21-40 | Female |
| former ips kiranbedi says she will make delhi a world class city watch full interview at pm watch now | 21-40 | Female |
| modi at centre bedi in delhi kiranbedi says its modi at centre make bjp govt in delhi to get centres support at pm | 21-40 | Female |
| : | : | : |

Using the tweet_info, user_info and processed_tweets data, the gender_vocabulary is created. Table 4 shows sample data of the gender vocabulary.

Table 4: Sample Gender Vocabulary table

| word | frequency | gender |
|---|---|---|
| aab | 2 | female |
| aadarshliberal | 1 | female |
| aadhar | 1 | female |
| aadiguru | 1 | female |
| aadmi | 1 | female |
| aagyani | 1 | female |
| aahuti | 1 | female |
| aaj | 5 | female |
| aajtak | 7 | female |
| aajtakdillikadil | 1 | female |
| aajtakin | 4 | female |
| aaka | 1 | female |
| aakar | 1 | female |
| : | : | : |
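A vocabulary table with word, frequency, and gender columns can be built by counting tokens per class. The sketch below is one way to do it in Python; the function name, the input shape (a list of token lists paired with a class label), and the sample tokens are assumptions.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return (word, frequency, label) rows from (tokens, label) pairs.

    Frequencies are counted separately for each label, then listed
    alphabetically within the label.
    """
    counts = {}
    for tokens, label in documents:
        counts.setdefault(label, Counter()).update(tokens)
    rows = []
    for label in sorted(counts):
        for word in sorted(counts[label]):
            rows.append((word, counts[label][word], label))
    return rows


# Hypothetical pre-processed documents, one per user.
documents = [
    (["aaj", "aajtak", "aaj"], "female"),
    (["aaj", "delhi"], "male"),
    (["aajtak"], "female"),
]
vocabulary = build_vocabulary(documents)
```

The same function could serve for an age vocabulary by passing the age group as the label instead of the gender.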

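Now, formulas 1 and 2 are used to find the TF and IDF values, from which the TF-IDF values are computed to provide a proper training set for constructing the SVM model file.

Term Frequency: $\mathrm{TF}(w) = \dfrac{\text{number of times } w \text{ appears in a document}}{\text{total number of words in the document}}$ -- (1)

Inverse Document Frequency: $\mathrm{IDF}(w) = \log\!\left(\dfrac{\text{total number of documents}}{\text{number of documents containing } w}\right)$ -- (2)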
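The TF, IDF, and TF-IDF computation can be sketched in Python as follows. The sketch uses the standard definitions (term count over document length, and the logarithm of total documents over documents containing the word). The base-10 logarithm is an assumption, chosen because it reproduces the sample values in Table 5 if the corpus has 73 documents: log10(73/1) ≈ 1.86332 and log10(73/2) ≈ 1.56229. The toy documents are made up.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF(w): occurrences of w divided by the number of tokens in the document."""
    if not tokens:
        return {}
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def inverse_document_frequency(documents):
    """IDF(w): log10(number of documents / number of documents containing w)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}


def tf_idf(documents):
    """One {word: TF * IDF} dictionary per document."""
    idf = inverse_document_frequency(documents)
    return [
        {word: tf * idf[word] for word, tf in term_frequency(tokens).items()}
        for tokens in documents
    ]


# Hypothetical corpus of two tokenized documents.
docs = [["delhi", "times", "delhi"], ["times", "watch"]]
weights = tf_idf(docs)
```

A word that appears in every document, such as "times" here, gets an IDF of zero and therefore contributes nothing to the feature vector.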

Table 5 shows the calculated IDF values for keywords.

Table 5: Calculated IDF values for keywords

| Index | Word | idf |
|---|---|---|
| 0 | sanjay | 0.82193 |
| 1 | champions | 1.26126 |
| 2 | nietzsche | 1.86332 |
| 3 | event | 0.5209 |
| 4 | liberalisation | 1.86332 |
| 5 | generation | 0.960233 |
| 6 | islamists | 1.86332 |
| 7 | told | 0.562293 |
| 8 | possibly | 1.56229 |
| 9 | meet | 0.21987 |
| 10 | somebody | 1.08517 |
| 11 | rejection | 1.86332 |
| 12 | biggest | 0.5209 |
| 13 | relief | 1.16435 |
| 14 | bangladesh | 0.82193 |
| 15 | times | 0.14732 |
| 16 | pity | 1.56229 |
| : | : | : |

The Age/Gender model file that acts as input to the SVM is shown in table 6.

Table 6: Age/Gender model file for SVM

Age model file (input to SVM):
+1 7:7.0E-4 13:0.0014 14:0.001 15:2.0E-4 19:6.0E-4 20:1.0E-4 25:3.0E-4 26:6.0E-4 35:1.0E-4 37:9.0E-4 39:8.0E-4 41:1.0E-4 45:3.0E-4 54:9.0E-4 57:1.0E-4 58:0.0013 60:1.0E-4 61:6.0E-4 62:2.0E-4 66:3.0E-4 67:1.0E-4 71:0.0 72:3.0E-4 76:4.0E-4 77:0.0011 78:1.0E-4 79:6.0E-4 80:0.0 82:3.0E-4 83:1.0E-4 87:3.0E-4 90:9.0E-4 91:2.0E-4 96:3.0E-4 98:0.0 99:1.0E-4 100:1.0E-4 101:5.0E-4 102:1.0E-4 103:0.0
:
-1 7:6.0E-4 9:2.0E-4 14:9.0E-4 18:4.0E-4 24:1.0E-4 25:3.0E-4 26:5.0E-4 29:4.0E-4 35:1.0E-4 37:8.0E-4 39:7.0E-4 41:1.0E-4 45:3.0E-4 46:7.0E-4 50:3.0E-4 51:6.0E-4 57:1.0E-4 60:0.0 61:5.0E-4 62:2.0E-4 64:7.0E-4 66:3.0E-4 67:1.0E-4 71:0.0 78:1.0E-4 80:0.0 83:1.0E-4 89:0.0011 91:2.0E-4
:

Gender model file (input to SVM):
+1 1:0.0013 2:0.0019 3:5.0E-4 4:0.0019 5:0.001 6:0.0019 7:6.0E-4 8:0.0016 9:2.0E-4 10:0.0011 11:0.0019 12:5.0E-4 13:0.0012 14:8.0E-4 15:1.0E-4 16:0.0016 17:0.0019 18:4.0E-4 19:5.0E-4 20:1.0E-4 21:0.0019 22:0.0019 23:0.0019 24:1.0E-4 25:3.0E-4 26:5.0E-4 27:0.0019 28:0.0012 29:4.0E-4 30:0.001 31:9.0E-4 32:0.0016 33:0.0019 34:0.0016
:
-1 1:0.0017 7:7.0E-4 9:3.0E-4 15:2.0E-4 20:2.0E-4 24:1.0E-4 25:3.0E-4 26:6.0E-4 29:5.0E-4 35:1.0E-4 39:9.0E-4 41:1.0E-4 45:3.0E-4 46:9.0E-4 48:0.0014 57:1.0E-4 60:1.0E-4 64:9.0E-4 67:1.0E-4 71:0.0 76:5.0E-4 78:1.0E-4 79:7.0E-4 80:0.0 83:1.0E-4 87:3.0E-4 91:3.0E-4 96:4.0E-4 98:0.0 99:1.0E-4
:
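Each line of the model files above has the shape `<label> <index>:<value> ...`, which resembles the sparse text format read by LIBSVM and SVMlight; whether either library was used is not stated in the source. The Python sketch below shows one way to produce such lines from per-word weights. The function names, the choice of which class is labelled +1, and the sample weights are assumptions.

```python
def build_index(vocabulary):
    """Map each word to a 1-based integer feature index (sorted for stability)."""
    return {word: i for i, word in enumerate(sorted(set(vocabulary)), start=1)}


def to_sparse_line(label, weights, index_of):
    """Format one example as '<label> <index>:<value> ...' with ascending indices."""
    pairs = sorted(
        (index_of[word], value)
        for word, value in weights.items()
        if word in index_of  # words outside the vocabulary are skipped
    )
    body = " ".join(f"{index}:{value:g}" for index, value in pairs)
    return f"{label:+d} {body}".rstrip()


# Hypothetical vocabulary and TF-IDF weights for one user.
index_of = build_index(["aaj", "aajtak", "delhi"])
line = to_sparse_line(+1, {"delhi": 0.0014, "aaj": 0.0007}, index_of)
```

Only non-missing features are written, which is why the index sequences in the model files have gaps.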

8 EXPERIMENTAL RESULTS
The steps of the work carried out and their results, with snapshots, are shown below.
Step 1: In the training part, tweets are fetched after connecting to Twitter with the secret key and consumer key generated from the Twitter developer site. Tweets normally contain repeated words, numbers, symbols, etc., so they are preprocessed: words such as “and”, “or”, “as” and symbols such as @, &, * are removed. Training the designed system requires at least hundreds of Twitter users. Training separates the tweets of male and female users, along with their age group, and stores them in the database as tweet_info, collected from the Twitter server for a given screen name. The tweet_info table contains Tweet_id, Screen name, Tweets, Timing and Re_tweets. The collected tweets and the status information for the user (screen name) are inserted into the tweet_info table (Table 1). User information such as gender and age range, with category, is entered to create the user_info table (Table 2), which acts as training data for the designed system. The tweet_info data is pre-processed by cleaning, concatenating and tokenizing to create processed_tweets (Table 3). Using the tweet_info, user_info and processed_tweets data, the gender vocabulary and age vocabulary tables are created (Table 4). The model files named age and gender are created in the form required by the SVM algorithm, and a linear SVM algorithm is then used to differentiate male from female users and to assign the age group. The SVM algorithm used here works only with numeric input, so words are stored as integer indices, as shown in Table 6. Entering the Twitter id BeingSalmanKhan extracts the tweets, as shown in fig. 7. The related details of the Twitter id are then entered to train the system, as shown in fig. 8, and the data is inserted into the database, as shown in fig. 9.

Figure 7 Screenshot to extract the tweets for given screen name


Figure 8 Screenshot to train the system by giving the options like gender and age for a screen name

Figure 9 Screenshot to insert the data into database for given screen name

Step 2: This step covers the testing part, which is similar to the training part. Tokens are generated and the total number of unique words in the document is found. The user’s Twitter id or screen name is then entered to find the age group and gender for that screen name. Test case 1 uses the Twitter id iamsrk; the result, shown in fig. 10 to fig. 12, is male with age group 41-60.

Figure 10 Screenshot to test the system by giving a screen name

Figure 11 Screenshot to extract the tweets for a given screen name under test

Figure 12 Screenshot to display the age and gender for a screen name under test

Test case 2 uses the Twitter id deepikapadukone; the result, shown in fig. 13 and fig. 14, is female with age group 21-40.

Figure 13 Screenshot to extract the tweets for another screen name under test


Figure 14 Screenshot to display the age and gender for another screen name under test

Test case 3 uses the Twitter id seemag; the result is shown in fig. 15. There were not enough tweets for the given Twitter id, so an error message is displayed.

Figure 15 Screenshot to display error message for a screen name which has too few tweets
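A guard of the kind that produces this error can be sketched in Python as below. The threshold of 500 reuses the minimum stated for collecting status information; whether the same threshold applies at test time is an assumption, as are the exception and function names.

```python
MIN_TWEETS = 500  # assumption: reuse the minimum stated for data collection


class InsufficientTweetsError(ValueError):
    """Raised when a screen name has too few tweets to classify."""


def require_enough_tweets(screen_name, tweets, minimum=MIN_TWEETS):
    """Return the tweets unchanged, or raise if there are fewer than `minimum`."""
    if len(tweets) < minimum:
        raise InsufficientTweetsError(
            f"{screen_name}: only {len(tweets)} tweets, at least {minimum} needed"
        )
    return tweets
```

Calling the guard before pre-processing lets the display module show a clear message instead of a prediction based on too little text.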

9 COMPARATIVE RESULTS
The data collected for our work is now given as input to the WEKA tool. The following data set is given as input to various WEKA algorithms for analysis.

Table 7: Data set for training and testing for both age and gender classification

| Classification | Training data | Testing data |
|---|---|---|
| Age | 42000 (for age group 21-40) + 20000 (for age group 40+) = 62000 tweets | 20 (for age group 21-40) + 20 (for age group 40+) = 40 tweets |
| Gender | 160000 (for male group) + 160000 (for female group) = 320000 tweets | 40 (for male group) + 40 (for female group) = 80 tweets |

The results of WEKA’s Naïve Bayes, Decision Table and J48 classification implementations for age on the given data set (table 7) are shown in fig. 16 to fig. 18.

Figure 16 Weka Tool’s Naïve Bayes Analysis for age

Figure 17 Weka Tool’s Decision table Analysis for age

Figure 18 Weka Tool’s J48 classification Analysis for age

Table 8 shows the True Positive, False Negative, False Positive and True Negative values for age obtained with WEKA’s algorithm implementations and with our SVM algorithm.


Table 8: TP, FN, FP and TN age values for WEKA’s algorithms and SVM algorithm

| | Naïve Bayes | J48 | Decision Tables | SVM |
|---|---|---|---|---|
| True Positives | 40 | 40 | 0 | 40 |
| False Negatives | 0 | 0 | 40 | 4 |
| False Positives | 40 | 40 | 0 | 6 |
| True Negatives | 0 | 0 | 40 | 30 |

Figure 19 Graph of True Positive, False Negative, False Positive and True Negative age values of WEKA’s various algorithms and SVM algorithm

The results of WEKA’s Naïve Bayes, Decision Table and J48 classification implementations for gender on the given data set (table 7) are shown in fig. 20 to fig. 22.

Figure 20 Weka Tool’s Naïve Bayes Analysis for gender

Figure 21 Weka Tool’s Decision table Analysis for gender

Figure 22 Weka Tool’s J48 classification Analysis for gender

Table 9 shows the True Positive, False Negative, False Positive and True Negative values for gender obtained with WEKA’s algorithm implementations and with our SVM algorithm.

Table 9: TP, FN, FP and TN gender values for WEKA’s algorithms and SVM algorithm

| | Naïve Bayes | J48 | Decision Tables | SVM |
|---|---|---|---|---|
| True Positives | 40 | 40 | 0 | 40 |
| False Negatives | 0 | 0 | 40 | 4 |
| False Positives | 40 | 40 | 0 | 6 |
| True Negatives | 0 | 0 | 40 | 30 |
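The four counts reported for each classifier can be turned into accuracy, precision, and recall with the standard definitions. The Python sketch below applies them to the Table 9 values; these summary metrics are not reported in the source and are shown only as one way to read the table.

```python
def evaluate(tp, fn, fp, tn):
    """Accuracy, precision, and recall from confusion-matrix counts.

    Precision or recall is None when its denominator is zero.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# (TP, FN, FP, TN) per classifier, copied from Table 9.
table9 = {
    "Naive Bayes": (40, 0, 40, 0),
    "J48": (40, 0, 40, 0),
    "Decision Tables": (0, 40, 0, 40),
    "SVM": (40, 4, 6, 30),
}
scores = {name: evaluate(*counts) for name, counts in table9.items()}
```

With these counts the SVM column gives an accuracy of 70/80 = 0.875, while the Naïve Bayes, J48 and Decision Tables columns each give 0.5 because every test item is assigned to a single class.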

Figure 23 Graph of True Positive, False Negative, False Positive and True Negative gender values of WEKA’s various algorithms and SVM algorithm

10 CONCLUSION
Social media outlets such as Twitter have become an important forum for peer interaction. The ability to classify latent user attributes, including gender, age and regional origin, exclusively from the informal language used by Twitter users therefore has important applications in personalization, advertising, and recommendation. The work includes an investigation of classification algorithms over a rich set of features [1] applied to classifying these user attributes. It also includes an extensive analysis of the features and approaches that are effective in classifying user attributes from casual written genres, which differ from commonly spoken genres. Since Twitter provides only the name of its users, latent attributes are hidden elements that are not directly available on the site. The classification system presented here helps predict the latent attributes of a Twitter user from his or her tweets. It finds the age (within the given range) and the gender for a user name or screen name simultaneously, which had not been done before; previous work could find either the age or the gender of a given user name. The same data set is also analyzed with several of WEKA’s classification algorithms, and the results of the authors’ method and of WEKA’s classifiers are compared. The study concludes that our method performs better than the other classifiers.

Sentiment analysis of the data begins by classifying the sentences into three classes: Positive, Negative and Neutral. The results are provided in the form of pie charts. The list of screen names whose sentiments have been classified by the sentiment analysis module is given as input to the predictive module. The output is a table showing the gender and age classification for each screen name, which is linked to the marketing module. The list of screen names with their sentiment, gender and age group is then given as input to the marketing module. Finally, messages are sent to the user groups selected by the user.

References
[1] Bo Pang and Lillian Lee, “Thumbs up?: sentiment classification using machine learning techniques”, EMNLP '02: Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Volume 10, pp. 79-86, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002.

[2] Miles N. Wernick, Robert M. Nishikawa, Nikolas P. Galatsanos, “A Support Vector Machine Approach for Detection of Micro calcifications”, IEEE Transactions on Medical Imaging, Vol. 21, No. 12, December 2002.

[3] Durgesh K. Srivastava, Lekha Bhambhu, “Data classification using support vector machine”, Journal of Theoretical and Applied Information Technology, © 2005 - 2009 JATIT. www.jatit.org.

[4] Christopher J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition”, Bell Laboratories, Lucent Technologies, Kluwer Academic Publishers, Boston.

[5] Mingmin Chi, Rui Feng, Lorenzo Bruzzone, “Classification of hyperspectral remote-sensing data with primal SVM for small-sized training dataset problem”, COSPAR, published by Elsevier Ltd., 2008. doi:10.1016/j.asr.2008.02.012

[6] Dee Shi and Xiaojun Yang, “Support Vector Machines for Landscape Mapping from Remote Sensor Imagery”, Proceedings - AutoCarto 2012, Columbus, Ohio, USA, September 16-18, 2012.

[7] Narashima S. Purohit, Meghana Bhat, Akshata B. Angadi, Karuna C. Gull, “Crawling through Web to Extract the Data from Social Networking Site-Twitter”, IEEE National Conference on Parallel Computing Technologies PARCOMPUTECH, India, 2015, pp. 1-6. doi:10.1109/PARCOMPTECH.2015.7084522, ISBN: 978-1-4799-6916-6. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7084522

[8] Karuna Gull, Sudip Padhye, Dr. Subodh Jain, “A Comparative Analysis of Lexical/NLP Method with WEKA’s Bayes Classifier”, International Journal on Recent and Innovation Trends in Computing and Communication (IJRITCC), Volume 5, Issue 2, February 2017, pp. 221-227, ISSN: 2321-8169. Available: http://www.ijritcc.org or http://www.researcherid.com/rid/A-9769-2016.

[9] http://dni-institute.in/blogs/building-predictive model -using-svm-and-r/

Author

Karuna C. Gull received the B.E. degree in Electronics and Communication from Karnataka University, India, in 1996 and the M.Tech degree in Computer Science and Engineering from Visvesvaraya Technological University, India, in 2008. She has been working in the area of data mining and social networking since 2013. She has published 10 papers in international journals, 6 in international and 6 in national conference proceedings on Data Mining and Image Processing. She has also attended many workshops and conferences on High Impact Teaching Skills, Embedded System Using Microcontroller, Information Storage and Management (ISM), Data Mining, and other topics. She worked as a Lecturer and Senior Lecturer for about 15 years and is currently an Assistant Professor at K.L.E.IT, Hubli, India.

Sudip S. Padhye received the Bachelor of Engineering (B.E.) degree in Computer Science & Engineering from Visvesvaraya Technological University (V.T.U.), India, in 2016. He is currently working as a Business Intelligence (BI) Developer at KPIT Technologies Ltd., with extensive experience in ETL and reporting tools such as Informatica Data Center and Oracle Business Intelligence Enterprise Edition (OBIEE) respectively. He has worked on several projects, including “Movies & Books Recommender system using Collaborative Filtering”, “Rainfall predictor using Regression techniques” and “Context-based Attitude Scrutiny using NLP”. He has also published 2 international papers in the field of data mining. He has a strong interest in R, Python and Java and holds several MOOC certifications from universities such as Stanford University, USA and Johns Hopkins University, USA.