Tweet Classification Mentor: Romil Bansal GROUP NO-37 Manish Jindal(201305578) Trilok Sharma(201206527) Yash Shah Guided by : Dr. Vasudeva Varma

IRE Project IIIT Hyderabad Tweet classification Group 37


DESCRIPTION

Classification of tweets into Wikipedia categories using different machine learning approaches. IIIT Hyderabad, Project no. 9, Group no. 37. Submitted by: Manish Jindal, Trilok Sharma, Yash Shah.


Page 1: IRE Project IIIT Hyderabad Tweet classification Group 37

Tweet Classification

Mentor: Romil Bansal

GROUP NO-37: Manish Jindal (201305578), Trilok Sharma (201206527), Yash Shah (201101127)

Guided by : Dr. Vasudeva Varma

Page 2: IRE Project IIIT Hyderabad Tweet classification Group 37

Problem Statement: To automatically classify tweets from Twitter into genres based on predefined Wikipedia categories.

Motivation:
o Twitter is a major social networking service with over 200 million tweets made every day.
o Twitter provides a list of Trending Topics in real time, but it is often hard to understand what these trending topics are about.
o It is important and necessary to classify these topics into general categories with high accuracy for better information retrieval.

Page 3: IRE Project IIIT Hyderabad Tweet classification Group 37

Data

Dataset:
o Input data is static or real-time data consisting of user tweets.
o Training dataset: fetched from Twitter with the twitter4j API.

Final Deliverable:
o It returns the list of all categories to which the input tweet belongs.
o It also reports the accuracy of the algorithm used for classifying tweets.

Page 4: IRE Project IIIT Hyderabad Tweet classification Group 37

Categories

We considered the following categories for classifying Twitter data:

1) Business
2) Education
3) Entertainment
4) Health
5) Law
6) Lifestyle
7) Nature
8) Places
9) Politics
10) Sports
11) Technology

Page 5: IRE Project IIIT Hyderabad Tweet classification Group 37

Concepts used for better performance

Outlier removal
To remove very low-frequency and very high-frequency words, using a bag-of-words approach.

Stop word removal
To remove the most common words, such as the, is, at, which, and on.

Keyword stemming
To reduce inflected words to their stem, base, or root form, using Porter stemming.

Cleaning crawled data
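The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the stop-word list, the cleaning rules (dropping URLs and @mentions), and the `min_df` / `max_df_ratio` thresholds are all assumptions. Stemming is left out because Porter stemming would normally come from a library such as NLTK.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); the slide names only a few examples.
STOP_WORDS = {"the", "is", "at", "which", "and", "on", "a", "for", "who"}

def clean_tweet(text):
    """Lowercase, drop URLs and @mentions, strip '#', keep alphabetic tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word itself
    return re.findall(r"[a-z]+", text)

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def remove_outliers(tokenized_tweets, min_df=2, max_df_ratio=0.5):
    """Drop words that occur in too few or too many tweets (hypothetical thresholds)."""
    n = len(tokenized_tweets)
    # Document frequency: in how many tweets does each word occur?
    df = Counter(word for tokens in tokenized_tweets for word in set(tokens))
    keep = {w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio}
    return [[t for t in tokens if t in keep] for tokens in tokenized_tweets]
```

Pruning by document frequency is one common reading of "low frequent and high frequent words"; pruning by raw term count would work the same way with a different counter.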

Page 6: IRE Project IIIT Hyderabad Tweet classification Group 37

Other concepts used

Spelling correction
To correct spellings using the edit distance method.

Named Entity Recognition
To rank the result categories and find the most appropriate one.

Synonym form
If a feature (word) of the test query is not found as a dimension in the feature space, then replace that word with its synonym. Done using WordNet.
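Edit-distance spelling correction could look something like the sketch below. It uses the standard Levenshtein distance; the `correct_spelling` helper, its `max_distance` cutoff, and the idea of correcting against the known feature vocabulary are assumptions for illustration, since the slides do not say how candidates were chosen.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]

def correct_spelling(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if not vocabulary or word in vocabulary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else word
```

A word that is still unknown after correction would then go to the synonym lookup described above.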

Page 7: IRE Project IIIT Hyderabad Tweet classification Group 37

Tweet Classification Algorithms

We used 3 algorithms for classification:
1) Naive Bayes based
2) SVM based (supervised)
3) Rule based

Page 8: IRE Project IIIT Hyderabad Tweet classification Group 37

[Figure: system flow diagram with a training path and a testing path]

Training:
o Crawl Twitter data (tweets)
o Cleaning, stop word removal
o Remove outliers
o Extract features (unique word list)
o Create feature vector for each tweet
o Create index file of feature vectors
o Create model files

Testing:
o Test query / tweet
o Tweets cleaning, stop word removal
o Edit distance, WordNet (synonyms)
o Create feature vector for test tweet
o Create index file of feature vector
o Apply model files
o Output category
o Apply Named Entity Recognition
o Rank result category
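The "extract features" and "create feature vector" boxes amount to a bag-of-words encoding. A minimal sketch, assuming raw term counts and an in-memory index (the on-disk index file format is not described in the slides, and both function names are hypothetical):

```python
def build_index(tokenized_tweets):
    """Map every unique word to a column index (sorted for reproducibility)."""
    vocabulary = sorted({w for tokens in tokenized_tweets for w in tokens})
    return {word: i for i, word in enumerate(vocabulary)}

def to_feature_vector(tokens, index):
    """Count occurrences of each indexed word; unknown words are ignored."""
    vector = [0] * len(index)
    for token in tokens:
        if token in index:
            vector[index[token]] += 1
    return vector
```

The same index built from the training tweets must be reused for test tweets, so that each dimension means the same word in both phases.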

Page 9: IRE Project IIIT Hyderabad Tweet classification Group 37

Main idea for Supervised Learning

Assumption: the training set consists of instances of different classes cj, described as conjunctions of attribute values.

Task: classify a new instance d, based on a tuple of attribute values, into one of the classes cj ∈ C.

Key idea: assign the most probable class using a supervised learning algorithm.

Page 10: IRE Project IIIT Hyderabad Tweet classification Group 37

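Method 1 : Bayes Classifier

Bayes rule states:

$$P(c_j \mid d) = \frac{P(d \mid c_j)\, P(c_j)}{P(d)}$$

where P(d | cj) is the likelihood, P(cj) is the prior, and P(d) is the normalization constant.

We used the "WEKA" library for machine learning in the Bayes classifier for our project.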
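To make the roles of likelihood and prior concrete, here is a small multinomial Naive Bayes sketch. It is not the WEKA implementation the project used: the class name, add-one smoothing, and the choice to skip unseen words are assumptions. The normalization constant is the same for every class, so it is dropped when comparing scores.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token in self.vocab:            # skip words never seen in training
                    count = self.word_counts[label][token]
                    # smoothed log likelihood of the word given the class
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        # most probable class; P(d) is constant across classes and omitted
        return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.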

Page 11: IRE Project IIIT Hyderabad Tweet classification Group 37

Method 2 : SVM Classifier (Support Vector Machine)

Given a new point x, we can score its projection onto the hyperplane normal, i.e., compute the score:

$$w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$$

Decide the class based on whether the score is < 0 or > 0.

We can also set a confidence threshold t:
o Score > t: yes
o Score < -t: no
o Else: don't know

[Figure: score axis marked at -1, 0, 1]
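The scoring rule and the confidence threshold can be sketched as below, assuming a model that has already been trained. The function names are hypothetical, and the weights, support vectors, and multipliers would come from whatever SVM trainer is used.

```python
def svm_score(w, b, x):
    """Primal form: w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_score_dual(alphas, ys, support_vectors, b, x):
    """Dual form: sum_i alpha_i * y_i * (x_i^T x) + b."""
    total = b
    for alpha, y, sv in zip(alphas, ys, support_vectors):
        total += alpha * y * sum(si * xi for si, xi in zip(sv, x))
    return total

def decide(score, t=0.0):
    """Three-way decision with confidence threshold t >= 0."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"
```

With t = 0 this reduces to the plain sign test. A larger t widens the band of scores for which the classifier abstains.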

Page 12: IRE Project IIIT Hyderabad Tweet classification Group 37


Multi-class SVM

Page 13: IRE Project IIIT Hyderabad Tweet classification Group 37


Multi-class SVM Approaches

1-against-all
Each of the SVMs separates a single class from all remaining classes (Cortes and Vapnik, 1995).

1-against-1
Pair-wise. k(k-1)/2 SVMs are trained, where k = |Y| is the number of classes. Each SVM separates a pair of classes (Friedman, 1996).
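The two strategies differ mainly in how the binary decisions are combined. In the sketch below the binary SVMs themselves are assumed to exist already; the helper names, and the use of the highest score and a majority vote to combine results, are common conventions rather than something stated in the slides. For the 11 categories used here, 1-against-1 would need 11 × 10 / 2 = 55 pairwise classifiers.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_pairs(classes):
    """All unordered class pairs: k(k-1)/2 of them for k classes."""
    return list(combinations(classes, 2))

def one_vs_all_predict(scores):
    """scores maps each class to the decision value of its own binary SVM."""
    return max(scores, key=scores.get)

def one_vs_one_predict(pair_winners):
    """pair_winners lists the winning class of each pairwise SVM; majority vote."""
    return Counter(pair_winners).most_common(1)[0][0]
```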

Page 14: IRE Project IIIT Hyderabad Tweet classification Group 37

Advantages of SVM

o High-dimensional input space
o Few irrelevant features (dense concept)
o Sparse document vectors (sparse instances)
o Most text categorization problems are linearly separable

For linearly inseparable data, we can use kernels to map the data into a higher-dimensional space, where it can become linearly separable by a hyperplane.

Page 15: IRE Project IIIT Hyderabad Tweet classification Group 37

Method 3 : Rule Based

We defined a set of rules to classify a tweet based on term frequency.
a. Extract the features of the tweet.
b. Look up the term frequency of each feature. The category of the feature with the maximum term frequency, across all the categories mentioned above, is our first classification.
c. Since this cannot be right every time, we also count how many of the tweet's features fall in each category. The category containing the most features is our next classification.

Page 16: IRE Project IIIT Hyderabad Tweet classification Group 37

Example

Tweet: sachin is a good player, who eats apple and banana which is good for health.
Features: sachin, player, eats, apple, health, banana
Stop words: is, a, good, he, was, for, which, and, who

Classification:

Feature   Category     Term frequency
sachin    sports       2000
player    sports       900
eating    health       500
apple     technology   1000
health    health       800
banana    health       700

Page 17: IRE Project IIIT Hyderabad Tweet classification Group 37

Max term frequency: sachin. So our category is sports.

2nd approximation: the largest number of features falls in health (3 features), so our second approximation is health.

If both of these give the same category, then we have only one category. For example, if the largest number of features had fallen in sports, the only result would be sports.
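The two rules can be written out as a short sketch using the numbers from the example. The table layout and the `classify` function are assumptions for illustration. The keys follow the feature-category table, which lists "eating" where the feature list says "eats".

```python
from collections import Counter

# feature -> (category, term frequency), taken from the example table
FEATURE_TABLE = {
    "sachin": ("sports", 2000),
    "player": ("sports", 900),
    "eating": ("health", 500),
    "apple": ("technology", 1000),
    "health": ("health", 800),
    "banana": ("health", 700),
}

def classify(features, table=FEATURE_TABLE):
    """Return one or two categories following rules (b) and (c)."""
    known = [f for f in features if f in table]
    if not known:
        return []
    # Rule b: category of the feature with the highest term frequency.
    top_feature = max(known, key=lambda f: table[f][1])
    first = table[top_feature][0]
    # Rule c: category that contains the largest number of the tweet's features.
    counts = Counter(table[f][0] for f in known)
    second = counts.most_common(1)[0][0]
    return [first] if first == second else [first, second]
```

For the example tweet this gives sports first (via sachin, 2000) and health second (three features), matching the slide.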

Page 18: IRE Project IIIT Hyderabad Tweet classification Group 37

Cross-validation (Accuracy)

Steps for k-fold cross-validation:
Step 1: Split the data into k subsets of equal size.
Step 2: Use each subset in turn for testing and the remainder for training.

Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
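The two steps and the final averaging can be sketched as follows. This is a plain shuffled split without stratification. The function names, the fixed seed, and the `evaluate` callback (which would train on one index list and return the error on the other) are assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k folds over n examples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n, k, evaluate):
    """Average the error returned by evaluate(train, test) over all folds."""
    errors = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(errors) / len(errors)
```

A stratified variant would build the folds per class so that each fold keeps roughly the same class proportions as the full dataset.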

Page 19: IRE Project IIIT Hyderabad Tweet classification Group 37

Accuracy Results (10 folds)

Accuracy of each algorithm in %:

Category        SVM     Naive Bayes   Rule based
Business        86.6    81.44         98.30
Education       85.71   76.07         81.8
Entertainment   86.8    79.1          87.49
Health          95.67   84.62         90.93
Law             81.17   73.38         75.25
Lifestyle       93.27   89.71         82.42
Nature          87.0    78.64         84.24
Places          81.01   75.35         80.73
Politics        81.91   81.88         76.31
Sports          87.11   83.57         81.87
Technology      83.64   82.44         77.05
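To compare the three methods at a glance, the per-category figures can be averaged. The snippet below copies the numbers from the table. The unweighted mean over categories is an assumption, since the slides do not report an overall figure or the number of tweets per category.

```python
# category -> (SVM, Naive Bayes, Rule based) accuracy in %, from the table above
ACCURACY = {
    "Business": (86.6, 81.44, 98.30),
    "Education": (85.71, 76.07, 81.8),
    "Entertainment": (86.8, 79.1, 87.49),
    "Health": (95.67, 84.62, 90.93),
    "Law": (81.17, 73.38, 75.25),
    "Lifestyle": (93.27, 89.71, 82.42),
    "Nature": (87.0, 78.64, 84.24),
    "Places": (81.01, 75.35, 80.73),
    "Politics": (81.91, 81.88, 76.31),
    "Sports": (87.11, 83.57, 81.87),
    "Technology": (83.64, 82.44, 77.05),
}
ALGORITHMS = ("SVM", "Naive Bayes", "Rule based")

def mean_accuracy(column):
    """Unweighted mean over the 11 categories for one algorithm column."""
    values = [row[column] for row in ACCURACY.values()]
    return sum(values) / len(values)

def best_algorithm(category):
    """Name of the algorithm with the highest accuracy for a category."""
    row = ACCURACY[category]
    return ALGORITHMS[row.index(max(row))]

if __name__ == "__main__":
    for i, name in enumerate(ALGORITHMS):
        print(f"{name}: {mean_accuracy(i):.2f}")
```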

Page 20: IRE Project IIIT Hyderabad Tweet classification Group 37

Unique features

o Worked on the latest crawled Twitter data, using the twitter4j API.
o Worked on eleven different categories.
o Applied three different methods of supervised learning to classify tweets into categories.
o Achieved fast classification, with per-category accuracy between about 73% and 98% across the three methods.
o Performed tweet cleaning, stemming, and stop word removal.
o Used edit distance for spelling correction.
o Used Named Entity Recognition for ranking.
o Used WordNet for query expansion and synonym finding.
o Validated using 10-fold cross-validation.

Page 21: IRE Project IIIT Hyderabad Tweet classification Group 37

Snapshot

Page 22: IRE Project IIIT Hyderabad Tweet classification Group 37

Result

Page 23: IRE Project IIIT Hyderabad Tweet classification Group 37

Accuracy

Page 24: IRE Project IIIT Hyderabad Tweet classification Group 37

Thank You!