
Peiti Li1, Shan Wu2, Xiaoli Chen1

1Computer Science Dept. 2Statistics Dept.

Columbia University, 116th Street and Broadway, New York, NY 10027, USA

Introducing

Movie Review

It is a fast and direct way for people to share their opinions on a topic

Why ?

Python

Twitter Search API + Stream API

Opinion Mining or Sentiment Analysis

Computational study of opinions, sentiments, subjectivity, attitudes

It is just like a text classification task, but different from topic-based text classification

In topic-based text classification (e.g., computer, sport, science), topic words are important.

But in sentiment classification, opinion/sentiment words are more important, e.g., awesome, great, excellent, horrible, bad, worst, etc.

Structure the unstructured: natural language text is often regarded as unstructured data. Besides data mining, we need NLP technologies

Why a HARD task?

I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive,…

Credits: Bing Liu for this example

Tell people whether to buy a movie ticket, using tweets

Classify the tweet as either positive or negative

Give a rating of the movie based on tweets

Accuracies of Different Machine Learning Approaches

Table from: Bo Pang et al. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of EMNLP, pp. 79-86. Association for Computational Linguistics

Our approach is Naïve Bayes

P(sentiment | sentence) = P(sentiment)P(sentence | sentiment) / P(sentence)

Smoothing:

P(token | sentiment) = (count(this token in class) + 1) / (count(all tokens in class) + |V|), where |V| is the vocabulary size (add-one / Laplace smoothing)

We didn’t use any third-party classifier; we coded our classifier all by ourselves. Reason: we wanted to explore what is under the hood, and to tune the algorithm structure according to the experiment results.
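As a sketch of what such a self-coded classifier can look like (class and method names here are ours, for illustration), a multinomial Naïve Bayes with add-one smoothing fits in a few dozen lines of Python:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def train(self, docs):
        # docs: list of (tokens, label) pairs, label in {"pos", "neg"}
        self.token_counts = {"pos": Counter(), "neg": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()
        for tokens, label in docs:
            self.doc_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in ("pos", "neg"):
            # log P(sentiment) + sum of log P(token | sentiment)
            score = math.log(self.doc_counts[label] / total_docs)
            total = sum(self.token_counts[label].values())
            for t in tokens:
                # add-one smoothing: (count + 1) / (tokens in class + |V|)
                score += math.log((self.token_counts[label][t] + 1)
                                  / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids underflow when multiplying many small probabilities; P(sentence) is the same for both classes, so it can be dropped.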

Getting Started


» Dev set: the movie review dataset provided by Bo Pang and Lillian Lee (Cornell University), sentence_polarity_dataset_v1.0: 5,331 positive and 5,331 negative sentences

» Real set: tweets about a specific movie; cannot tell the exact number. Twitter Search API (REST): last 6-7 days. Twitter Stream API: real-time timeline. (Drawbacks: the REST API has rate limiting; stream data takes time to collect.)

Dataset

Top 100 words including stopwords

Better and better but….

Baseline model is Naïve Bayes without any nontrivial text preprocessing; punctuation excluded, stopwords included

Tuned model is still Naïve Bayes, with a better feature extraction technique: eliminating low-information features. Best unigram model; best unigram-and-bigram model
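Eliminating low-information features can be done several ways; one simple sketch (the scoring function is our illustration, not necessarily the exact one used) scores each unigram by how skewed its pos/neg distribution is and keeps only the top n:

```python
import math
from collections import Counter

def best_features(pos_tokens, neg_tokens, n=10000):
    """Keep the n unigrams whose pos/neg distribution is most skewed;
    tokens split roughly 50/50 between classes carry little information."""
    pos, neg = Counter(pos_tokens), Counter(neg_tokens)
    scores = {}
    for token in set(pos) | set(neg):
        p, q = pos[token] + 1, neg[token] + 1  # add-one avoids log(0)
        scores[token] = abs(math.log(p / q))   # 0 if evenly split
    return set(sorted(scores, key=scores.get, reverse=True)[:n])
```

Training and classification then only look at tokens inside this feature set.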

Dev set result:

Trainset 5000, Testset 331                       Recall    Specificity   Accuracy
Baseline                                         76.13%    82.78%        79.46%
Baseline, stopwords removed                      75.83%    79.46%        77.64%
Best unigram, stopwords not removed              83.99%    85.20%        84.60%
Best unigram, stopwords removed                  82.78%    85.80%        84.29%
Best unigram and bigram, stopwords not removed   N/A       N/A           78.24%

Takes 1 hour! The Intel Core i5 laptop died in the middle because it was too hot for too long

Observation: definitely do not use bigrams, but we still don’t know whether we should remove the stopwords

Real set labeling: 150 tweets per movie; for each movie, 75 labeled by Xiaoli and 75 labeled by Shan.

One movie: 87 pos, 5 neg. The other: 32 pos, 76 neg.

Regular expression 1: (?:@\S*|#\S*|http(?=.*://)\S*)

Regular expression 2: (#[A-Za-z0-9]+)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+) (all punctuation removed)
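For illustration, regular expression 2 can be applied in Python like this (function name is ours). Note that its third alternative eats every punctuation character, so this version also destroys emoticons:

```python
import re

# Regular expression 2 from above: hashtags, @mentions, any other
# non-alphanumeric character, and URLs are all stripped out.
TWEET_NOISE = re.compile(
    r"(#[A-Za-z0-9]+)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
)

def clean_tweet(tweet):
    # Remove every noise match, then split the remainder into tokens.
    return TWEET_NOISE.sub("", tweet).split()
```

For example, "@bob Hugo is AWESOME!!! #movies http://t.co/abc" reduces to the tokens Hugo, is, AWESOME.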

Results on the 2 recent movies (Real set), Hugo + Muppets together:

                                 Recall    Specificity   Accuracy
Regex 1, stopwords removed       64.13%    64.81%        64.50%
Regex 1, stopwords included      63.04%    54.63%        58.50%
Regex 2, stopwords removed       70.65%    62.96%        66.50%
Regex 2, stopwords included      65.22%    53.70%        59.00%

Which regular expression should we choose based on this result? Hard to say…. :-(


We moved our attention to other similar products: lingPipe, Twendz, Twitter Sentiment (twittersentiment.appspot.com), tweetfeel (www.tweetfeel.com). They are new too.

Our classifier gets exactly the same results as theirs, but wait…

Two tweets made us frown :-(

Emoticons play a role!!!

:-)>:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^) >:D :-D :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 :P FTW

:'( ;*( :_( T.T T_T Y.Y Y_Y >:[ :-( :( :-c :c :-< :< :-[ :[ :{ >.> <.< >.< >:\ >:/ :-/ :-. :/ :\ =/ =\ :S

So we chose the regular expression that keeps emoticons

And we built a dictionary to eliminate all punctuation marks that appear alone

'`','~','!','@','#','$','%','^','&','*','(',')','-','_','+','=','{','}','[',']',';',':','"',"'",'<','>',',','.','?','|','\\','/'
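A sketch of that pruning step, assuming an emoticon whitelist built from the lists above (the function name and the exact emoticon subset here are ours):

```python
# Emoticon whitelist (a subset of the lists above, for illustration)
EMOTICONS = {":-)", ":)", ":]", "=)", ":D", "=D", "xD", "XD",
             ":'(", ":-(", ":(", ":-/", ":/", "T_T", ">:["}

# The punctuation dictionary from the slide
PUNCT = set("`~!@#$%^&*()-_+={}[];:\"'<>,.?|\\/")

def prune_punctuation(tokens):
    """Drop tokens made purely of punctuation, unless they are emoticons."""
    return [t for t in tokens
            if t in EMOTICONS or not all(c in PUNCT for c in t)]
```

A lone "!" or "..." is discarded as noise, while ":-)" survives as a sentiment feature.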

Finally, the python begins to catch the twittering bird……..

Demo

“Happy” Feet? So all tweets are positive?

We still need to do more semi-supervised learning.

1. Specific bigrams like “don’t love”

2. A finer classifier which can exclude objective (non-opinion) tweets

3. Detect and remove misleading movie names like “Happy Feet”

4. Give more weights to dominant words like “excellent”, “worst”

5. Our final task: Give ratings
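The rating task (item 5) could be as simple as mapping the positive fraction of the classified tweets onto a 0–10 scale; a hypothetical sketch, not something implemented yet:

```python
def rating_from_tweets(labels, scale=10):
    """Map the fraction of tweets classified "pos" to a 0..scale rating."""
    pos = sum(1 for label in labels if label == "pos")
    return round(scale * pos / len(labels), 1)
```

For example, 3 positive tweets out of 4 would give the movie 7.5 out of 10.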

Thank you all! Thank you STAT 4240! Thank you Columbia!
