[IEEE 2011 3rd Conference on Data Mining and Optimization (DMO) - Putrajaya, Malaysia (2011.06.28-2011.06.29)] 2011 3rd Conference on Data Mining and Optimization (DMO) - Bess or xbest:

2011 3rd Conference on Data Mining and Optimization (DMO) 28-29 June 2011, Selangor, Malaysia

978-1-61284-212-7/11/$26.00©2011 IEEE

bess or xbest: Mining the Malaysian Online Reviews

Norlela Samsudin / Mazidah Puteh

Faculty of Computer and Mathematical Science Universiti Teknologi MARA Terengganu Dungun, 23000, Terengganu, Malaysia

[email protected] [email protected]

Abdul Razak Hamdan Faculty of Information Science and Technology

Universiti Kebangsaan Malaysia Bangi, 43600, Selangor, Malaysia

[email protected]

Abstract—Advances in information technology, especially the Internet, have changed the way we communicate and express opinions or sentiments about the services and products that we consume. Opinion mining aims to automate the classification of opinions into positive or negative views. It benefits both customers and sellers in identifying the best product or service. Although researchers continue to explore new techniques for identifying sentiment polarity, little work has been done on mining opinions written by Malaysian reviewers. The same is true of micro-text. In this study, we therefore conduct exploratory research on opinion mining of online movie reviews collected from several forums and blogs written by Malaysians. The experimental data are tested using three machine learning classifiers: Support Vector Machine, Naïve Bayes and k-Nearest Neighbor. The results show that the performance of these techniques, without any preprocessing of the micro-text or feature selection, is quite low. Additional steps are therefore required in order to mine opinions from these data.

Keywords- Sentiment mining, opinion mining, Malaysian, movie reviews

I. INTRODUCTION

At the beginning of the century, researchers started to explore a new field known as opinion mining or sentiment analysis. Thanks to the advancement of Internet technologies, more and more people are expressing their likes and dislikes publicly in personal blogs or online forums. In addition, fast-growing services such as Facebook and Twitter have produced a huge collection of online reviews. These reviews contain rich and useful information for other customers, marketing officers or anyone who is interested in extracting and mining the views. Unfortunately, finding useful information in these online data is difficult without proper opinion mining techniques.

Opinion mining aims to mine reviews of products and services such as cameras, cars, books and movies by classifying them into positive or negative opinions. The terms opinion mining, sentiment analysis and sentiment classification have been used interchangeably to refer to the activity of finding sentiment in words, sentences or documents [13]. Opinion mining is also defined as the process of automatically identifying positive and negative reviews [3]. More definitions and explanations of sentiment may be found in [19].

With a mission “to realize Malaysia as global hub and preferred location for ICT and multimedia innovations, services and operations” [10], the Malaysian government has introduced new policies such as the advancement of ICT and telecommunication infrastructure. These initiatives have increased the use of the Internet by Malaysians. In June 2009, it was reported that 65.7% of the population used the Internet, compared to 37.9% in 2005 [24]. The strong support and encouragement from the government is likely to increase further the number of Internet users and the amount of online information that they create. Unfortunately, very few researchers have studied the use of the Malay language in this environment. In fact, very little work has been done on mining Malay texts [11]. Kamaruddin et al. [7] created an outlier detector using Malaysian records written in English. In addition, the Malay language is also used in Indonesia, Singapore and Brunei. This study tries to fill this gap. The performance of several text mining techniques in mining online opinions created by Malaysians is explored and reported.

The remainder of the paper is organized as follows. In Section II, related work on micro-text and opinion mining is introduced. The method and dataset are explained in Section III, where the performance of the opinion mining is also evaluated. Lastly, conclusions and future directions are presented in Section IV.

II. BACKGROUND

A. Opinion Mining

Evaluating human emotion, which is also known as a ‘private state’ or ‘internal state’ [19], has attracted the interest of many researchers in the domain of language processing. When the process involves a huge amount of data, it has to be automated. Opinion mining may be carried out using two approaches: the semantic oriented approach or the machine learning approach. The semantic oriented approach uses rules based on the structure of a particular language. Normally, several language processing steps such as tokenization, case conversion, part-of-speech tagging and stemming are performed to identify the semantics or the structure of the text. In addition, extra resources such as WordNet, SentiWordNet and dictionaries are used to identify the sentiment polarity of a sentence or a document. Work in [22], [23], [18] and [5] utilized this technique. Early work on sentiment analysis in [5], [6] and [18] evaluated the relationship between sentiment and the semantics of language, such as adjectives and adverbs. In [6], a regression model was used to predict whether conjoined adjectives were of the same or different sentiment. Turney [18] presented an approach for determining a document’s polarity by calculating the average semantic orientation (SO) of extracted phrases using pointwise mutual information (PMI). It measures the dependence between the extracted phrases and the reference words “excellent” and “poor” by using web search hit counts.
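The semantic orientation calculation described above can be sketched as follows. This is a minimal illustration, not the implementation used in [18]: the hit counts are hypothetical values rather than real search results, the function names are invented for this example, and the small smoothing constant is included only to avoid division by zero.

```python
import math

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"),
    expressed directly in terms of search hit counts."""
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)

def review_polarity(phrase_scores):
    """Average the phrase scores; a positive average means a positive review."""
    average = sum(phrase_scores) / len(phrase_scores)
    return ("positive" if average > 0 else "negative"), average

# Hypothetical hit counts, for illustration only.
so_good = semantic_orientation(2000, 500, 1_000_000, 1_000_000)
so_bad = semantic_orientation(100, 800, 1_000_000, 1_000_000)
label, avg = review_polarity([so_good, so_bad])
```

A phrase that co-occurs more often with “excellent” than with “poor” receives a positive score, and the review label follows the sign of the average over all extracted phrases.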

On the other hand, researchers such as those in [14], [15], [8] and [1] chose the machine learning approach. In this approach, a supervised learning model is constructed from a large training corpus using common text classification techniques such as Naïve Bayes (NB), Support Vector Machine (SVM) and k-Nearest Neighbors (kNN). The newly created model is then applied to a set of unclassified documents. The machine learning approach was used in this study. It was applied to online movie reviews written by Malaysians.

The movie review corpus has been used in several studies since it is easily available online. An English movie review corpus may be downloaded from [12]. In addition, users normally give rating scores (1-5 or 1-10) to the movies, which reflect their opinions of the product. Nevertheless, mining the corpus is very challenging since, other than sentiment words, it also contains facts about a particular movie. Another problem with this corpus is the use of micro-text language, which does not abide by any language rules or structure.

Pang et al. [14] used 700 positive and 700 negative reviews from the Internet Movie Database (IMDb) and compared the results of opinion mining using SVM, NB and Maximum Entropy with human-generated values. Although the results of the machine learning techniques were comparable with the human-generated values, they were not as good as those typically obtained in normal text classification. They concluded that opinion mining requires an extra approach in addition to the normal processes of text classification. In 2004, the experiment was repeated using only the subjective portion of each review [15]. The objective portion was discarded before applying the standard machine learning classifier. This technique reduced the number of features and led to better performance by certain classifiers such as Naïve Bayes.

Other than the English movie reviews, Ye et al. [22] evaluated the performance of the semantic approach on 550 Chinese movie reviews by adopting two-word phrase patterns from Turney’s study [18]. They concluded that the performance of the semantic approach on Chinese movie reviews is similar to that on English movie reviews. Zhuang et al. [23] classified and summarized movie reviews by extracting high-frequency feature keywords and high-frequency opinion keywords. Kang et al. [8] used 1680 positive and 1680 negative movie reviews from Korean blogs. Unlike [14], they compared the accuracies for each polarity instead of the overall accuracies. They found that some techniques, such as feature selection, seemed to perform better on either positive or negative reviews. For example, SVM with TF-IDF as the vector weight formula worked better on the positive orientation but performed very poorly on the negative orientation. On the other hand, other techniques such as SVM with Odds Ratio as feature selection performed well on both kinds of reviews.

Other than [8], Abbasi et al. [1] also examined how feature selection influences the performance of opinion mining on movie reviews. They introduced a new algorithm named Entropy Weighted Genetic Algorithm (EWGA) and concluded that it reduced the feature size, thereby improving the performance of the SVM classification technique.

All of these studies indicate that the normal text classification processes are not enough to achieve high performance result in opinion mining or sentiment analysis of movie reviews.

B. Micro text

Natural Language Processing (NLP) techniques assume that the input texts are ‘clean’ and grammatically correct. Unfortunately, this is not true of online entries from online forums, chat forums, blogs, online reviews, feedback, Facebook entries, Instant Messaging (IM) entries, Twitter or Short Message Service (SMS) entries. The texts generated from these sources are extremely ‘dirty’ and ‘noisy’. They contain spelling errors, invalid punctuation, ad hoc abbreviations, creative symbols, incorrect casing, out-of-dictionary vocabulary and inappropriate sentence structure, so an efficient mechanism is required to interpret them. For the same reason, mining text using the semantic scores approach fails to capture the sentiments in these noisy texts. Other terms for these texts are dirty text, micro-text [16] and texting language [2]. People use creative compression methods in order to limit the length of a message and speed up typing, especially when a small keyboard such as a mobile phone keyboard is used. The following table lists some of the methods used by Malaysians.

Table 1: Patterns of noisy text used by Malaysians.

| Method | Example of texts |
| --- | --- |
| Mix of English and Malay words | best ctenienjoy glertengok |
| Loss of “alpha-case” information, which makes it very difficult to identify sentence boundaries, proper nouns and acronyms | datgkmmg berbaloi2lah! bessstttsgtmse part lawakmmggelak r tp part ygsedih pun cm nknangismmg best rugikluxtgk.nktgkyg 3d tpptgsgtlak so tgkygbesepnye je |
| Use of slang | • chettttt.. alasanG • kitekurengsukednganparaplakonnye. • Yesssszeee |
| Flexible use of punctuation symbols, or no punctuation symbols at all | • bagiaku... best gak la cite nie.... lawak pun ada... haha.. • xbeslangsung~!!!!! |
| Use of phonetic spelling | • b4 (before) • cu (see you) • r (are) • 2u (to you) |
| Use of initial letters only | • hw (homework) • wrt (with respect to) • sc (stay cool) |
| Use of a new form of written representation to express emotion | • :) (smiling) • :0 (shocked) • :-\ (skeptical) |
| Dropping the vowels | • tp (tapi) • tgk (tengok) |
| Grammatical errors | • manakenakalangilakasyahyg FK lakonkanmmgtakbape gila2 sgttu. • viewplgcantikmasarapunzelngan u-jinlepak d perahutgk lampu2 berterbangan (x tau apanamabendatuhehe)... |

A process known as text cleaning, text filtering or text normalization is required before any knowledge can be extracted from these texts. There are three metaphors of text normalization: the ‘spell checking’ metaphor, the ‘translation’ metaphor and the ‘speech recognition’ metaphor. Wong et al. [21] introduced a mechanism known as Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). Choudhury et al. [2] used Hidden Markov Models (HMM) to translate SMS messages into correct English text. Kobus et al. [9] combined the common techniques of the translation metaphor and the speech recognition metaphor for cleaning French SMS messages.
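To make the idea of text normalization concrete, the sketch below applies a dictionary lookup and a repeated-character rule to a few of the patterns listed in Table 1. It is a deliberately naive illustration and is not the method of [21], [2] or [9]: the lexicon contains only a handful of entries taken from Table 1, and the function names and the `max_run` parameter are assumptions made for this example.

```python
import re

# Tiny illustrative lexicon built from entries in Table 1.
# A real normalizer would need a far larger, curated dictionary.
ABBREVIATIONS = {
    "tp": "tapi",
    "tgk": "tengok",
    "b4": "before",
    "cu": "see you",
    "2u": "to you",
    "hw": "homework",
}

def collapse_repeats(token, max_run=2):
    """Shorten runs of one repeated character to at most `max_run` copies."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, token)

def normalize(text):
    """Lower-case, collapse character runs, then expand known abbreviations."""
    normalized = []
    for token in text.lower().split():
        token = collapse_repeats(token)
        normalized.append(ABBREVIATIONS.get(token, token))
    return " ".join(normalized)

example = normalize("tp best tgk b4")
stretched = collapse_repeats("bessstttt")
```

A lookup of this kind cannot resolve ambiguous abbreviations or restore missing word boundaries, which is why the cited work relies on scoring, translation or speech recognition models instead.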

In an attempt to minimize keystrokes, people use creativity to deliver their opinions and feelings. Several researchers have incorporated the processing of noisy text in opinion mining research. Dey and Haque [4] started the normalization process by identifying sentence boundaries, correcting improper casing and correcting spelling mistakes. By doing so, they were able to use NLP processes in their opinion mining project. Rosa and Ellen [16] used the normal natural language processes except stemming, in order to avoid changes in meaning when abbreviations were used. Gamon [5] looked at the effect of feature reduction based on linguistic analysis of customer reviews.

Nevertheless, we have not found any work that processes the noisy texts that are created online by the Malaysian users.

III. METHODOLOGY

A. Experiment Data

There is no ready-to-use Malaysian movie review data set on the web. Data were therefore collected from several online forums and blogs on Malaysian websites. The main sources were the http://www.cari.net.my and http://www.mesra.net websites. Data were collected from various websites to ensure that they would represent online reviews by the Malaysian communities in general rather than one community from a particular online forum. Initially, about 1200 reviews were retrieved. Two annotators read them and manually labeled each as a negative or positive opinion. Only reviews on which both annotators agreed were kept. Finally, 500 positive reviews and 500 negative reviews were used in the experiment. We found that the reviews are very noisy compared to the English movie reviews used in [15], which were collected in 2002.

Examples of positive movie reviews from the Malaysian movie reviews corpus are:

• best!..vry straight 4ward pnyecter, rase jap jer dah hbis..k ler, bbaloi2 le p tgk;

• aku dah tgksmlm.....yummylicious!;

• bestsgttaditgk.

Examples of negative movie reviews from this corpus are:

• akusetujufilemini GAGAL. noktah.;

• expect yang lain, yang lain yang muncul. :|sayasangattaksukafilemini.! 1/10. XXXX dulu2 lagi best;

• bagiaku,citenimmgklaka..tapijln cite die xkuat..skrip pun xkuat..akupnxtau y mnsatuklimaksnye..endingpnxbest.

As the examples show, the data were unstructured and noisy. Opinion mining using the semantic oriented approach would therefore require the following extra steps.

1. Normalize the micro-text into its corresponding text in either Malay or English. Unfortunately, to the best of the researchers’ knowledge, there is no translator of micro-text for the Malay language; and

2. Translate all Malay text into the corresponding English words in order to simplify steps such as tokenization, part-of-speech tagging, stemming and lemmatizing.

Both steps require extra tools and effort. In addition, the meaning of a phrase may be changed during the translation process. Therefore, the machine learning approach was chosen for the experiments on these data.


B. Machine learning techniques

Motivated by the work on similar data sets in [15] and [22], three commonly known machine learning classifiers, Support Vector Machine (SVM), Naïve Bayes (NB) and k-Nearest Neighbors (kNN), were selected for this research. Detailed explanations of these techniques are available in [20] and [17].

Naïve Bayes is one of the simplest approaches to text classification. It is based on the assumption that the probability of one word occurring in a document does not affect the probability of another word occurring. The probability of a review R given a class C is calculated using the following formula:

$$P(R \mid C) = N! \prod_{i=1}^{k} \frac{P(w_i \mid C)^{n_i}}{n_i!} \qquad (1)$$

where n_i is the number of times word w_i occurs in the review, P(w_i | C) is the probability of word w_i in class C, and N = n_1 + n_2 + … + n_k is the number of words in the review.

SVM uses a different approach from NB. Rather than using probabilities, SVM attempts to find the hyperplane that divides the training documents with the largest margin, using the following formula:

$$f(x) = \sum_{i} a_i x_i + b \qquad (2)$$

where a and b are the parameters of the hyperplane and x_i is the weight of the i-th feature of the review. If the value is higher than zero, the class y for the review is set to 1; otherwise it is set to -1.

k-Nearest Neighbor is a widely used text classifier because of its simplicity and efficiency. It is also called a lazy learner since ‘it defers the decision on how to generalize beyond the training data until each new query instance is encountered’ [17]. To decide whether a review is positive or negative, its similarity to every document in the training set is determined. The k most similar neighbors are selected. The proportion of neighbors having the same class may be taken as an estimator of the probability of that class, and the class with the largest proportion is assigned to the review.
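As a rough illustration of the Naïve Bayes rule described above, the sketch below scores a tokenized review against each class in log space. It is not the RapidMiner implementation used in the experiments. The class prior and the add-one smoothing are assumptions added for this example, the factorial terms are dropped because they are the same for every class, and the toy tokens and labels are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate class priors and per-class word counts from tokenized reviews."""
    classes = sorted(set(labels))
    vocab = {word for doc in docs for word in doc}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    return classes, vocab, prior, counts

def log_score(review, c, vocab, prior, counts):
    """log P(C) + sum_i n_i * log P(w_i | C), with add-one smoothing."""
    total = sum(counts[c].values())
    score = math.log(prior[c])
    for word, n_i in Counter(review).items():
        if word not in vocab:
            continue  # ignore words never seen in training
        p_word = (counts[c][word] + 1) / (total + len(vocab))
        score += n_i * math.log(p_word)
    return score

def classify(review, model):
    """Return the class with the highest log score."""
    classes, vocab, prior, counts = model
    return max(classes, key=lambda c: log_score(review, c, vocab, prior, counts))

# Hypothetical toy training set, for illustration only.
train_docs = [["best", "gler"], ["best", "sgt"],
              ["xbest", "langsung"], ["gagal", "xbest"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_nb(train_docs, train_labels)
prediction = classify(["best", "best", "tgk"], model)
```

Because each word contributes independently to the score, a token such as "xbest" is treated as a single unrelated feature rather than as a negation of "best".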

The experiments were executed using RapidMiner 5, a comprehensive text mining tool. Since this was exploratory research, no preprocessing step was incorporated, as most preprocessing involves NLP methods. The data were divided 90/10 into training and test data. 10-fold cross validation was performed for each experiment, and the average values are reported in this paper. Instead of a binary weight vector, the tf-idf measure was used to weight the features. In order to compare the results with English movie reviews, the same experiment was carried out using data downloaded from [12]. The same data were used by [14] in their study.
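The tf-idf weighting mentioned above, combined with the k-Nearest Neighbor voting rule from the previous subsection, can be sketched as follows. This is a minimal illustration under stated assumptions, not the RapidMiner configuration used in the experiments: the tf-idf variant (raw term frequency times log(N/df)), the cosine similarity measure, the value of k and the toy reviews are all choices made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf * log(N / df); returns the vectors and the idf table."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / df[word]) for word in df}
    vectors = [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(query, vectors, labels, idf, k=3):
    """Assign the majority class among the k most similar training reviews."""
    q = {w: tf * idf[w] for w, tf in Counter(query).items() if w in idf}
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(q, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus, for illustration only.
docs = [["best", "gler"], ["best", "sgt", "best"], ["xbest", "langsung"],
        ["gagal", "xbest"], ["best", "lawak"]]
labels = ["pos", "pos", "neg", "neg", "pos"]
vectors, idf = tfidf_vectors(docs)
negative_guess = knn_classify(["xbest", "gagal"], vectors, labels, idf, k=3)
positive_guess = knn_classify(["best"], vectors, labels, idf, k=3)
```

Terms that appear in many reviews receive a low idf and therefore contribute little to the similarity, whereas rare tokens, including misspellings, dominate it.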

C. Evaluation method

Three values are commonly used to evaluate the performance of text categorization: accuracy, recall and precision. These values were calculated based on the following table.

Table 2: Evaluation of experiments

| | Actual positive | Actual negative |
| --- | --- | --- |
| Predict positive | a | b |
| Predict negative | c | d |

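Accuracy is the fraction of correctly classified reviews in relation to the total number of reviews, as stated in formulas (3) and (4).

$$\text{Accuracy} = \frac{\text{number of correctly classified reviews}}{\text{total number of reviews}} \times 100\% \qquad (3)$$

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} \times 100\% \qquad (4)$$

Recall is the fraction of relevant reviews that are classified correctly. The formulas for the positive recall value and the negative recall value are stated in (5) and (6) respectively.

$$\text{Recall}(p) = \frac{a}{a + c} \qquad (5)$$

$$\text{Recall}(n) = \frac{d}{b + d} \qquad (6)$$

Precision is the fraction of the reviews predicted to belong to a class that actually belong to that class. The formulas for the positive precision value and the negative precision value are stated in (7) and (8) respectively.

$$\text{Precision}(p) = \frac{a}{a + b} \qquad (7)$$

$$\text{Precision}(n) = \frac{d}{c + d} \qquad (8)$$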
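The measures defined above can be computed directly from the four counts in Table 2. The short sketch below does so; the function name is an assumption made for this example, and the counts passed to it are hypothetical values rather than results from the experiments.

```python
def evaluate(a, b, c, d):
    """Compute the evaluation measures, in percent, from the Table 2 counts.

    a: predicted positive, actually positive
    b: predicted positive, actually negative
    c: predicted negative, actually positive
    d: predicted negative, actually negative
    """
    total = a + b + c + d
    return {
        "accuracy": 100.0 * (a + d) / total,
        "recall_p": 100.0 * a / (a + c),
        "recall_n": 100.0 * d / (b + d),
        "precision_p": 100.0 * a / (a + b),
        "precision_n": 100.0 * d / (c + d),
    }

# Hypothetical counts, for illustration only.
scores = evaluate(a=40, b=10, c=20, d=30)
```

Reporting recall and precision separately for each class shows whether a classifier favors one polarity, which the overall accuracy alone can hide.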

Table 3 shows the results of applying the different machine learning techniques to the Malaysian movie reviews. Surprisingly, the accuracy of the NB technique is the best compared to that of SVM and kNN. In addition, identifying positive sentiment seems to be a problem for the SVM technique: its positive recall is only 27.2%. This also agrees with Kang’s experiment [8], where the NB and kNN techniques recognized positive and negative sentiment better than SVM.

Table 3: Performance of opinion mining on Malaysian movie reviews using SVM, NB and kNN (values in %)

| Measure | SVM | NB | kNN |
| --- | --- | --- | --- |
| Accuracy | 62.32 | 68.35 | 68.14 |
| Recall (p) | 27.2 | 69.8 | 68.02 |
| Recall (n) | 97.59 | 66.87 | 68.07 |
| Precision (p) | 91.87 | 67.9 | 68.02 |
| Precision (n) | 57.18 | 68.8 | 68.07 |

Another significant observation is that the overall accuracy of SVM is much lower than the result of the same experiment using the English movie reviews in [14], as shown in Table 4. This indicates that the margin separating the positive and negative samples is narrower in the Malaysian movie reviews data set, which makes it more difficult for the SVM classifier to distinguish between positive and negative sentiment. It also illustrates that noisy text may influence the performance of opinion mining on movie reviews written by Malaysians, especially when the SVM technique is used. The low accuracy of NB is expected, since NB treats all attributes as though they are completely independent, whereas in opinion mining an opinion is often stated by more than one word, such as ‘tidak best’ or ‘best giler’. Nevertheless, the results support the conclusion of Pang et al. [14] and other researchers that extra steps are required to perform opinion mining using machine learning approaches.

Table 4: Comparison of accuracy values (%) between Malaysian movie reviews and English movie reviews.

| Data set | SVM | NB |
| --- | --- | --- |
| Malaysian movie reviews | 62.32 | 68.35 |
| English movie reviews | 76.06 | 64.58 |

IV. CONCLUSION

In this study, movie reviews written by Malaysians in online forums and blogs were collected. We found that they are very ‘noisy’ in comparison to the English movie reviews. Without any feature selection or feature reduction technique, the performance of opinion mining using SVM, NB and kNN is quite low, with NB performing best at 68.35% accuracy, ahead of SVM and kNN. In fact, the accuracy of SVM is even lower than that obtained on the English movie review data.

This result indicates that a preprocessing step is vital in order to improve the performance of opinion mining on this data set. How to deal with noisy texts written by Malaysian writers is one of the areas that needs further investigation. In addition, choosing suitable feature selection and feature reduction techniques for Malaysian online reviews is another area for future research.

ACKNOWLEDGMENT

This work has been supported by UKM-OUP-FTSM-2011.

REFERENCES

[1] A. Abbasi, H. Chen, A. Salem, “Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums”, ACM Trans. Information Systems, vol 26, no 3, article 12, 2008.

[2] M. Choudhury, S. Rahuf, V. Jain, S. Sudeshma, and B. Anupam, “Investigation and modeling of the structure of texting language”, Proc. of the IJCAI Workshop on “Analytics for Noisy Unstructured Text Data”, Hyderabad, India, pp. 63-70, 2007.

[3] K. Dave, S. Lawrence, and D. M. Pennock, “Mining the peanut gallery: Opinion extraction and semantic classification of product reviews”, Proc. of WWW, pp. 519-528, 2003.

[4] L. Dey, M. Haque, “Opinion Mining from Noisy Text Data”, ACM, Singapore, pp. 83-90, July 2008.

[5] M. Gamon, “Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis”, Proc. of the 20th International Conference on Computational Linguistics, pp. 841-847, 2004.

[6] V. Hatzivassiloglou and K. McKeown, “Predicting the Semantic Orientation of Adjectives”, Proc. of the 35th ACL Conf., pp. 174-181, 1997.

[7] S. S. Kamaruddin, A. R. Hamdan, A. A. Bakar, F. M. Nor, “Dissimilarity algorithm on conceptual graphs to mine text outliers”, 2nd Conference on Data Mining and Optimization (DMO '09), pp. 46-52, 27-28 Oct. 2009.

[8] H. Kang, S. J. Yoo, D. Han, “Accessing Positive and Negative Online Opinion”, Universal Access in Human-Computer Interaction, Applications and Services, Lecture Notes in Computer Science, Volume 5616/2009, pp. 359-368, 2009.

[9] C. Kobus, F. Yvon, G. Damnati, “Normalizing SMS: are two metaphors better than one?”, Proc. of the 22nd Inter. Conference on Computational Linguistics (Coling 2008), Manchester, pp. 441-448, August 2008.

[10] M. A. Mustapha, “National ICT Policy and its Effect on ICT Development”, October 6-7, 2008, [Online], Available: http://www.sigma-orionis.com/eurosoutheastasia-ict.org/events/coop_forum/Session4/Mohamad_Aziph.pdf

[11] M. Z. A. Nazri, S. M. Shamsudin, A. Abu Bakar, T. Abd Ghani, “Using linguistic patterns in FCA-based approach for automatic acquisition of taxonomies from Malay text”, International Symposium on Information Technology (ITSim 2008), Kuala Lumpur, vol 2, pp. 1-7, 26-28 Aug. 2008.

[12] B. Pang, “Paper Using Our Movie Review Data”, [Online], Available: http://www.cs.cornell.edu/people/pabo/movie-review-data/otherexperiments.html

[13] B. Pang and L. Lee, “Opinion Mining and Sentiment Analysis”, Foundations and Trends in Information Retrieval, 2(1-2), pp. 1-135, 2008.

[14] B. Pang, L. Lee and S. Vaithyanathan, “Thumbs Up? Sentiment Classification using Machine Learning Techniques”, Proc. Empirical Methods in Natural Language Processing, pp. 79-86, 2002.

[15] B. Pang and L. Lee, “A Sentimental Education: Sentiment Analysis using Subjectivity Summarization Based on Minimum Cuts”, Proc. 42nd Ann. Meeting of the Assoc. for Computational Linguistics, pp. 271-278, 2004.

[16] K. D. Rosa and J. Ellen, “Text Classification Methodologies Applied to Micro-text in Military Chat”, Proc. of 2009 International Conference on Machine Learning and Applications, pp. 710-714, 2009.

[17] F. Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, vol 34, no 1, pp. 1-47, 2002.

[18] P. D. Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”,


[19] J. Wiebe, T. Wilson, R. Bruce, M. Bell and M. Martin, “Learning Subjective Language”, Computational Linguistics, vol 30, no 3, pp. 277-308, 2004.

[20] I. H. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, 2nd Edition, Morgan Kaufmann Publishers, 2005.

[21] W. Wong, W. Liu, M. Bennamoun, “Integrated Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text”, Proc. of the 5th Australasian Conf. on Data Mining and Analytics, vol 61, 2006.

[22] Q. Ye, W. Shi and Y. Li, “Sentiment Classification for Movie Reviews in Chinese by Improved Semantic Oriented Approach”, Proc. of the 39th Hawaii Int. Conf. on System Sciences, 2006.

[23] L. Zhuang, F. Jing, X.-Y. Zhu, and L. Zhang, “Movie review mining and summarization”, Proc. of the ACM Conf. on Information and Knowledge Management (CIKM), 2006.

[24] “Malaysia Internet Usage Stats and Marketing Report”, [Online], Available: http://www.internetworldstats.com/asia/my.htm
