
SENTIMENT ANALYSIS IN CHINESE SOCIAL MEDIA

Nan Xie
Supervisor: Lizhen Qu

A thesis submitted for the degree of Master of Computing

The Australian National University

Oct 2017


Acknowledgement

First of all, I would like to express my sincere gratitude to my supervisor, Lizhen Qu, for the opportunity of doing this project with you. With your wealth of knowledge of deep learning and natural language processing, I was able to complete this project, and with your patient teaching, I have gained a deeper understanding of deep learning and natural language processing. Your advice has had a significant influence on my future career.

I would also like to thank all members of the Data61 NLP group; I benefited a lot from the group meetings with you. Thank you for your insightful advice.

Finally, I would like to thank my family for supporting my study.


Contents

1 Introduction
  1.1 Contributions
  1.2 Outline
2 Background on Sentiment Analysis
  2.1 Background
  2.2 Lexicon-based approach
    2.2.1 Dictionary-based approach
    2.2.2 Corpus-based approach
  2.3 Statistical Model-Based Approaches
    2.3.1 Naïve Bayes
    2.3.2 Support Vector Machine (SVM)
  2.4 Deep Learning-Based Approaches
    2.4.1 Word Embedding
    2.4.2 RNN based Approaches
3 Characteristics of the Chinese Language
  3.1 Word Segmentation
  3.2 PinYin
    3.2.1 Origin of Pinyin
    3.2.2 Pinyin and Pronunciation
    3.2.3 Pinyin-based input method
    3.2.4 Popularity and features of Pinyin-based input methods
    3.2.5 Criticism of Pinyin-based input methods' "auto-correction"
4 PinYin Net for Analyzing Sentiments on Chinese Social Media
  4.1 PinYin Net
  4.2 Experimental Setup
  4.3 Results and Discussions
5 Conclusion
  5.1 Conclusion
  5.2 Future Work
  5.3 References
  5.4 Appendices
    5.4.1 Appendix 1: Project Description
    5.4.2 Appendix 2: Study Contract
    5.4.3 Appendix 3: Description of Software
    5.4.4 Appendix 4: README


List of Figures

2.1 Example of a statistical machine learning classification task
2.2 Learning and applying (testing) algorithm for NB
2.3 There are many possible hyperplanes (decision boundaries) able to separate the positive and negative samples; the goal is to find the one with the maximum margin (the bold one) [22]
2.4 An example of a feed-forward net, where x1, ..., xn are the inputs and y1, ..., yz are the outputs
2.5 Architectures for CBOW and Skip-gram [3]
2.6 Architecture of an RNN for a sentiment analysis task
2.7 LSTM memory cell [4]
4.1 Possible words associated with the lazy pinyin baozi (from the software Sogou Pinyin)
4.2 An example of standard Chinese (right) and cyberspeak (left)
4.3 Meaning in standard Chinese
4.4 Word representation in PinYin Net
4.5 Architecture of PinYin Net
4.6 Statistics of data over the 6 domains
4.7 Accuracy (test set) and loss (training) for the baseline vs PinYin Net (concatenated with lazy pinyin)
4.8 Accuracy (test set) and loss (training) for the baseline vs PinYin Net (concatenated with standard pinyin)
4.9 Accuracy (test set) and loss (training) for the baseline vs PinYin Net (lazy pinyin only)


List of Tables

4.1 Number of positive and negative reviews in each of the 6 domains


Abstract

Sentiment analysis is sometimes known as opinion mining or emotion AI. Due to the popularity of social media and e-commerce in China, a huge amount of text carrying emotion is available, like reviews of products, customer feedback, and forum or Weibo (Chinese Twitter) posts about particular events. Making good use of this information and understanding people's or customers' sentiments and opinions about various events and goods is crucial for business success.

However, research on Chinese sentiment analysis is rare. In this thesis, we propose a method designed for Chinese, called PinYin Net, which feeds pinyin as additional information into a deep-learning-based sentiment analysis model.


Chapter 1

Introduction

Due to the popularity of social media and e-commerce in China, an increasing number of netizens are habituated to expressing their opinions online. People like to express their opinions about particular events by writing comments on social media like blogs or Weibo (the Chinese Twitter). People also like to share their shopping experiences on e-commerce sites and write reviews about their preference for a particular product or seller; for example, on Taobao (the Chinese Amazon), customers are able to rate a product on a 5-star scale and write a review commenting on that product.

Hence, a huge amount of text with sentiment is available online, and this kind of information is able to tell us what people like and what people don't like. If we can make good use of those data, we can benefit a lot. With wise usage of e-commerce data (like reviews, comments or feedback), we are able to get rational information about what people like and dislike about various products. At an even finer grain, those data are able to tell us, for a given product, which particular parts people like and dislike, for example "I like this new iPhone's screen", "I hate this new iPhone's camera". This kind of information is significantly important for manufacturers, allowing them to improve their products in a targeted way in order to cater to customers (like spending effort and cost on the camera in the example). It also has reference value for retailers' marketing decisions, helping them to sell more of the products people like and fewer of the products people dislike. Hence, good usage of those data to predict people's emotions about various events and products is crucial for business success. A famous example is [1], which performed sentiment analysis of Twitter data to predict stock market movements.

However, most sentiment analysis research is based on English; research on sentiment analysis in the Chinese language is rare. Hence, in this thesis, we aim to propose a method designed for Chinese, called PinYin Net. Our hypothesis is that pinyin is helpful information for understanding the writer's intention in Chinese, so we feed pinyin as additional information into the sentiment analysis algorithm. More details about PinYin Net will be introduced in Chapter 4.


1.1 Contributions

The main contribution of this thesis is that we put forward a method of using information about pinyin in the Chinese sentiment analysis task. Moreover, with the information about pinyin, we are able to make the deep-learning-based approach easier to train and computationally cheaper.

1.2 Outline

In the following chapters, we first introduce some background and typical approaches for sentiment analysis (Chapter 2). In Chapter 3, we talk about some characteristics of the Chinese language, including the background knowledge for PinYin Net: pinyin. In Chapter 4, more details about PinYin Net are introduced, followed by some results and discussions.


Chapter 2

Background on Sentiment Analysis

In this chapter, we give a brief introduction to the background of sentiment analysis and some typical approaches to it.

2.1 Background

Although research on NLP (natural language processing) has a long history, limited research had been done on sentiment analysis until around the year 2000. Since Web 2.0, social networking, web applications and e-commerce have become more and more popular; a huge amount of data such as product reviews and comments has become available, and people have begun to realize the value of those data. Sentiment plays a key role in almost all human activities and is a significant influencer of people's behavior. For example, when we make decisions about purchasing, we always have emotional preferences influenced by experts' comments, other customers' reviews, etc. Under these circumstances, sentiment analysis has become a rapidly growing research area.

Sentiment analysis, sometimes also called opinion mining or emotion AI, is a significant field of study in natural language processing which draws on theories and methods from statistics, linguistics, psychology and artificial intelligence. Bing Liu [2] defines it as follows:

"Sentiment analysis, also called opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes"

— Bing Liu

Generally speaking, sentiment analysis is the task of extracting and categorizing opinions and emotions from a given text. In most cases, this task can be treated as a polarity classification task, since the polarity of sentiment is usually expressed simply in terms of positive and negative (binary classification), as in the following example:


This restaurant is so good! -> 'positive'

This restaurant is so dreadful! -> 'negative'

An ideal sentiment analysis system is able to recognize positive emotion in the first sentence and negative emotion in the second. Moreover, the multi-class classification case is also a research field not to be neglected, since human emotion is diversified; for example, there are various emotions among the positive emotions, like joy, gratitude and pride.

Sentiment analysis mainly has three different levels of analysis: document level, sentence level and aspect level.

Document level [3]: At this level, the task aims to recognize whether the whole document expresses a sentiment or opinion. Take building a sentiment analysis system for e-commerce as an example: in e-commerce, every product has its own page, so a document-level task would read all reviews on that page (about that particular product) and express an overall sentiment or opinion for that product.

Sentence level [4]: At this level, the task is finer grained and goes down to sentences, aiming to recognize the sentiment or opinion of a particular sentence. At this level, the analysis is closely related to subjectivity classification [5]. Going back to our e-commerce example, a sentence-level task would express the sentiment or opinion of a particular review (instead of the overall sentiment).

Aspect level [6]: The aspect level is sometimes also called the feature level. At this level, the task aims to infer the sentiment polarity of a specific aspect in a given sentence. In the e-commerce example, given a sentence from an iPhone review, "I like this phone's screen, but its camera is dreadful", an aspect-level task tries to express positive emotion for the aspect "screen" and negative emotion for the aspect "camera".

Methods for approaching the sentiment analysis task can be divided into three main families: lexicon-based approaches, statistical machine learning based approaches and deep learning based approaches. Details about these three families are explained in the following three sections.


2.2 Lexicon-based approach

From the very beginning, the sentiment analysis task was approached with lexicon-based approaches [7], which classify texts by rules. This approach uses a lexicon consisting of terms; a term can be a single word, a phrase, etc. Each term in the lexicon has a corresponding sentiment score, and the overall sentiment score for a document or sentence is calculated from the presence of the lexicon's terms in it.

For instance, assume we have a simple lexicon with two words, "good" and "bad": "good" carries positive sentiment with a +1 score, "bad" carries negative sentiment with a -1 score, and all other words score 0. The classification algorithm then simply checks whether the overall sentiment score is greater than 0. We can classify "This restaurant is good!" as positive sentiment, since the overall score is +1 (greater than 0), and "This restaurant is bad!" as negative sentiment, since the overall score is -1 (less than 0).
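
To make this concrete, the toy classifier above can be written in a few lines of Python. This is a minimal sketch using only the two-word toy lexicon from the example; a score of exactly 0 is treated as negative here, an arbitrary tie-breaking choice.

LEXICON = {"good": +1, "bad": -1}  # toy lexicon; every other word scores 0

def classify(sentence):
    # Strip simple punctuation from each token, then sum the term scores.
    tokens = [tok.strip("!?.,").lower() for tok in sentence.split()]
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    return "positive" if score > 0 else "negative"

print(classify("This restaurant is good!"))  # -> positive
print(classify("This restaurant is bad!"))   # -> negative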

With this approach the algorithm itself is simple; the key challenge is building a reliable lexicon. We will introduce two approaches for dealing with this challenge: the dictionary-based approach and the corpus-based approach.

2.2.1 Dictionary-based approach

This approach denotes that the sentiment analysis utilizes lexical databases extracted from dictionaries. Since dictionaries may list synonyms and antonyms under each entry word, sentiment word lists can be built by bootstrapping from related words. To begin with, a set of seed sentiment words (excellent, nice, dislike, dreadful, etc.) is collected, and their polarities (simply positive or negative, or a particular sentiment score) are labeled accordingly. The next step is to use an algorithm to enrich the set of sentiment words by searching thoroughly for synonyms and antonyms in e-dictionary websites such as WordNet. The words reached join the initial word set, and this process goes on until no more new words can be reached. Hu and Liu [8] implemented this approach; after the search, a manual inspection of the result is required. A similar means is demonstrated in the work of Valitutti [9]. Also, Kim and Hovy [10] tagged sentiment strength to every word in the final set in a probabilistic fashion, while Mohammad, Dunne, and Dorr [4] further expanded the list by adding affixes (as dis- in dislike).
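
As an illustration, the bootstrapping loop can be sketched with NLTK's WordNet interface (an assumed library choice; the thesis does not prescribe an implementation). The seed set below is the toy one named above, and a real run would end with the manual inspection step just described.

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def bootstrap_lexicon(seeds):
    # seeds: {word: +1/-1}; expand via WordNet synonyms and antonyms.
    lexicon = dict(seeds)
    frontier = list(seeds)
    while frontier:  # the loop stops when no new words can be reached
        word = frontier.pop()
        polarity = lexicon[word]
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                if lemma.name() not in lexicon:   # synonym: same polarity
                    lexicon[lemma.name()] = polarity
                    frontier.append(lemma.name())
                for ant in lemma.antonyms():      # antonym: opposite polarity
                    if ant.name() not in lexicon:
                        lexicon[ant.name()] = -polarity
                        frontier.append(ant.name())
    return lexicon

seeds = {"excellent": +1, "nice": +1, "dislike": -1, "dreadful": -1}
print(len(bootstrap_lexicon(seeds)))  # size of the expanded word set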

Hu and Liu [8] demonstrated providing a feature-based summary of customer reviews of products listed online, by mining customer reviews, polarizing the respective reviews and summarizing the results, based on data mining and natural language processing means. Several online product reviews were examined in that paper. An example of this approach could be a camera, where the picture quality feature had 253 positive reviews vs 6 negative reviews, and the size feature had 134 positive reviews vs 10 negative reviews. Tallying reviews, whether positive or negative, was done by tagging each review with a polarized value based on the adjective words in the sentences, where those adjective words are compared to the polarized words' synonyms or antonyms in WordNet. The results were promising, and the average accuracy was claimed to be 84%.

Kim and Hovy [10] analyzed the sentiments in a claim within a topic from a holder, where polarized or neutral sentiments are identified given a topic. Several classifiers were employed: each word's polarity was assessed within a sentence using the first classifier, while the polarity of opinion at the sentence level was given by the second classifier. As in Hu's approach [8], a seed set of words was preset, and the word set was enlarged by retrieving synonyms and antonyms from WordNet. However, as the authors put it, some synonyms and antonyms had contradicting or neutral sentiment, and thus could not be used.

In addition, Park [12] developed three dictionaries to build up the seed word set, rather than a single dictionary pool, since this was a necessary step for approaching tweet classification. However, problems remained, as the retrieval process was time-consuming and informal expressions prevail on Twitter.

Despite the simplicity of the dictionary-based approach, the manual check of the generated word list is time-consuming. Additionally, its main weak point is that it is domain- and context-independent, namely, it has difficulty handling sentiment words with domain- or context-dependent orientation [13]. For instance, a small screen may be a desirable feature in a handset whereas it is not favored in a TV set, and this attribute is quite common as a language phenomenon. The corpus-based approach can address this issue.

2.2.2 Corpus-based approach

There are two scenarios where the corpus-based approach can be applied [14]: (1) identifying sentiment words and their polarity from a domain corpus, based on a preset seed list of sentiment words; (2) transforming a general-purpose sentiment word list into a new one by adopting a domain corpus for sentiment analysis in that context. Nevertheless, the issue may persist even after retrieving context-dependent sentiment words, since a sentiment word may still have both positive and negative attributes within the very same context. Some studies targeting this issue are described below.

Hatzivassiloglou and McKeown [15] pioneered the study of the corpus-based approach, where a corpus and a set of seed adjective sentiment words are used to extend the set with more adjectives. Linguistic rules were adopted in their study, with more adjective sentiment words and their polarities being found from the corpus. One example is the conjunction "and", which indicates that the two adjectives combined have the same polarity, whereas "but" entails the opposite; other sibling words are "neither", "or", etc. Since those rules are not universal, further effort to check their validity is needed. Thus, they employed a log-linear regression model to check whether conjoined adjectives have the same orientation. A graph was generated with links of same or different orientation between the adjectives, followed by clustering to produce two sets of orientation words, which enabled them to achieve a precision of 90%.

As one sentiment word may have two opposite polarities at the same time, Ding et al. [16] used several means to resolve the ambiguity in the orientation of those context-dependent words. In their study, the methods employed included using word pairs, annotating a large number of idioms, defining a polarity score for each feature, adopting linguistic rules for different usages of conjunctions, and a holistic approach with three consistency techniques about connectivity, etc. They claimed that those approaches were effective compared to previous studies.

The corpus-based approach contributes to the field of context-dependent sentiment word analysis, but it should be employed together with other approaches to be effective, because the words in a corpus are not exhaustive.

However, due to the complexity of language, lexicon-based methods are still not sufficient for all scenarios, and more subtle cases do exist. Hence, in order to have a more precise approach, we want the "rules" to be learned automatically rather than manually defined.

2.3 Statistical Model-Based Approaches

Given the limitations of the lexicon-based approach, statistical machine learning based approaches were proposed to overcome them. In statistical learning based approaches, we want the "rule" of classification to be learned from experience (data) instead of manually defining the "rules" (defining a lexicon which claims "good" indicates positive sentiment, "bad" negative sentiment, etc.).

What is statistical machine learning, and how does it work? Generally speaking, statistical machine learning aims to run an algorithm that learns a model, and that model is able to express a function y(x) which takes any new input x and generates an output y.

Figure 2.1: Example of a statistical machine learning classification task

In this example the input x is a feature vector (x1, x2) and the output y is O or + (representing two different labels). The aim of statistical machine learning is to learn a model able to express the dashed line (the classification rule for classifying the two labels given a new data point x).

Back to our sentiment analysis task: the label y is the sentiment polarity (positive, negative), which can normally be quantified as 0/1 in the binary-class case or as a one-hot vector (0, 0, ..., 1, ..., 0) in the multi-class case. The input x is text (a document or sentence); usually the bag-of-words model [17] is used to represent a text as a 1×N feature vector (w1, w2, w3, ..., wN), where N is the size of the vocabulary and wi is the weight of the i-th word. To convert a text into this kind of feature vector, we first create a vocabulary of N unique words from the training dataset, then assign a weight to each word according to the given text. There are two main methods to assign weights:

One-hot: assign wi = 1 if the i-th word is present in the document/sentence, otherwise 0.

Term frequency: assign as weight the total number of times the i-th word occurs in the document/sentence.

For example, with the vocabulary ("bag", "of", "words", "nice"), the sentence "bag of bag of words." is converted to the vector (1, 1, 1, 0) by one-hot, and to (2, 2, 1, 0) by term frequency.
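
The two weighting schemes can be sketched directly from this example (toy code; a real system would build the vocabulary from the training data):

from collections import Counter

VOCAB = ["bag", "of", "words", "nice"]  # toy vocabulary from the example

def one_hot(text):
    present = set(text.lower().strip(" .").split())
    return [1 if w in present else 0 for w in VOCAB]

def term_frequency(text):
    counts = Counter(text.lower().strip(" .").split())
    return [counts[w] for w in VOCAB]

print(one_hot("bag of bag of words."))         # -> [1, 1, 1, 0]
print(term_frequency("bag of bag of words."))  # -> [2, 2, 1, 0]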

With documents and sentences converted into feature vectors, we are able to apply statistical learning methods. In the following two subsections, we discuss two commonly used statistical machine learning algorithms for text classification: Naïve Bayes [18] and SVM.

2.3.1 Naïve Bayes

The first statistical machine learning method we introduce is Naïve Bayes, which is the simplest and most commonly used method in text classification tasks. Naïve Bayes is based on Bayes' theorem and on the assumption that all features are iid (independent and identically distributed).

Naïve Bayes aims to calculate the posterior probability of a class (positive or negative) based on the conditional probability of a word occurring in a document or sentence of a class, the prior probability of a word, and the prior probability of a class.

For a document/sentence d and a class c:

\[ P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)} \]

where
P(c) = prior probability of the class,
P(d) = prior probability of the document/sentence,
P(d|c) = probability of the document/sentence given the class,
P(c|d) = probability of the class given the document/sentence.

In a text classification task, our goal is to find the most likely class for a given document/sentence, i.e. the class with the maximum a posteriori probability, the MAP hypothesis c_MAP:

\[ c_{\mathrm{MAP}} = \operatorname*{argmax}_{c \in C} P(c \mid d) = \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} = \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \]

The document d can be represented by its features (words) x_1, x_2, ..., x_n; then, by the iid assumption, we finally have:

\[ c_{\mathrm{MAP}} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c) = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in d} P(x \mid c) \tag{2.1} \]


Equation (2.1) is also called the Multinomial Naïve Bayes classifier. In practice, learning and testing can be expressed by the following algorithm [19]:

Figure 2.2: Learning and applying (testing) algorithm for NB

Naïve Bayes is easy to implement and training is fast. Even under the strong iid assumption, according to Gautam's experiments on Twitter [20] and Bütow's experiments on German news [21], Naïve Bayes is still able to give good results in some cases; but in real life most features are dependent, so Naïve Bayes may have problems in many real-life cases.
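
To illustrate, a minimal Multinomial Naïve Bayes text classifier can be sketched with scikit-learn (an assumed implementation choice, not the thesis's own code; the training sentences are toy examples):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["this restaurant is so good",
        "the food is nice and the service is excellent",
        "this restaurant is so dreadful",
        "bad food and bad service"]
labels = ["positive", "positive", "negative", "negative"]

vec = CountVectorizer()               # term-frequency bag-of-words features
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)  # learns the probabilities used in (2.1)

print(clf.predict(vec.transform(["the service is good"])))  # -> ['positive']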

2.3.2 Support Vector Machine (SVM)

The support vector machine (SVM) is another commonly used statistical machine learning algorithm for sentiment analysis and text classification tasks. SVM is a large-margin classifier. In a sentiment analysis task, suppose we have two classes: positive sentiment and negative sentiment. SVM is a vector space based method which aims to find a decision boundary between these two classes; the optimization goal is a decision boundary that separates the training samples of the positive class and the negative class with maximal margin.

Figure 2.3: There are many possible hyperplanes (decision boundaries) able to separate the positive and negative samples, so the goal is to find the one with the maximum margin (the bold one) [22].

To find this kind of hyperplane (decision boundary), first assume we have n labeled data points (x_1, y_1), ..., (x_n, y_n), with class labels quantized to ±1, i.e. y_i ∈ {+1, −1}. The hyperplane can then be represented by

\[ w^T x + b = 0 \tag{2.2} \]

where w is the weight vector and b is the bias; hence the hyperplane can be represented by the parameters (w, b).

In order to classify the class labels correctly, the hyperplane needs to satisfy the following two conditions: for any point (x_i, y_i), if y_i = +1 then w^T x_i + b > 0, and if y_i = −1 then w^T x_i + b < 0. After rescaling (w, b), we require

\[ w^T x_i + b \ge +1 \quad \text{if } y_i = +1, \qquad w^T x_i + b \le -1 \quad \text{if } y_i = -1 \tag{2.3} \]

When (2.3) holds, the sum of the distances from the two classes to the hyperplane is 2/\|w\|, which is called the margin.

Since we want to find the hyperplane with the maximum margin, our problem becomes:

\[ \max_{w,b} \; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \; i = 1, \ldots, n \tag{2.4} \]

which is equivalent to:

\[ \min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \; i = 1, \ldots, n \tag{2.5} \]

Equation (2.5) is the basic form of the SVM; learning it is a quadratic programming problem. Since we only want to give a brief idea of the SVM here, further topics such as the dual problem and the kernel method are not described.

SVM is commonly used in sentiment analysis tasks and does not rely on any strong assumption like Naïve Bayes. Chikersal and Poria [23] applied the SVM algorithm with a linear kernel and L1 regularization, combined with a rule-based method, and achieved very impressive results at the time. However, even though SVM can sometimes outperform other algorithms such as Naïve Bayes, it is still not suitable for large datasets because of its time complexity. Moreover, even with the kernel method, SVM still has limitations on non-linear problems.
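
Analogously, a minimal linear SVM text classifier can be sketched with scikit-learn's LinearSVC, which solves a soft-margin variant of (2.5) (again an assumed implementation choice, reusing the toy data from the Naïve Bayes sketch):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["this restaurant is so good",
        "the food is nice and the service is excellent",
        "this restaurant is so dreadful",
        "bad food and bad service"]
labels = [1, 1, -1, -1]  # +1 positive, -1 negative, as in (2.3)

vec = CountVectorizer()
X = vec.fit_transform(docs)
svm = LinearSVC(C=1.0).fit(X, labels)

print(svm.predict(vec.transform(["nice restaurant"])))  # -> [1]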

2.4 Deep Learning-Based Approaches

Due to the limitations of statistical machine learning in non-linear cases, people became interested in deep learning (neural networks), which is a universal approximator [24]. As with statistical machine learning, the goal of deep learning is to approximate some function. For example, in our sentiment analysis task, our goal is to approximate a function y = f(x) which is able to map the representation of a sentence x to some sentiment category y.

Feedforward networks are the basic form of deep learning. A feedforward network is a directed acyclic graph describing how functions are composed together, with no feedback connections feeding outputs of the model back into itself. For example, if we have a function f(x) which can be approximated by composing two other functions f1 and f2, we can express it as a chain f(x) = f2(f1(x)); these chain structures are the most commonly used feed-forward net structures. Training a neural network can often be approached by back-propagation [25].


Figure 2.4: An example of a feed-forward net, where x1, ..., xn are the inputs and y1, ..., yz are the outputs

In the following two subsections, we introduce a better representation of words based on the ideas of deep learning, and a commonly used architecture for sentiment analysis tasks.

2.4.1 Word Embedding

The bag-of-words (BOW) model is only able to tell whether a word is present in the document or sentence; it is unable to capture the similarity between words. Word embedding is designed to overcome this limitation: it aims to map words into vectors such that words with similar meanings are close to each other in the vector space. Word2vec [26] is the commonly used method for word embedding, with state-of-the-art performance.

The intuition behind word2vec is the language model. For instance, suppose we have two sentences, "A bird is flying in the sky." and "An eagle is flying in the sky." Assume we don't know the meaning of "eagle"; since we observe that it has a very similar context ("is", "flying", "in", "the", "sky") to "bird", we can deduce that "eagle" may have a similar meaning to "bird".

There are two models in word2vec: continuous bag of words (CBOW) and skip-gram. CBOW aims to predict a word given its context; skip-gram aims to predict the context given a word. For instance, given the sentence "A bird is flying in the sky", CBOW aims to predict "bird" given ("A", "is", "flying", "in", "the", "sky"), while skip-gram aims to predict ("A", "is", "flying", "in", "the", "sky") given "bird". In practice, we usually choose just one of these two models (CBOW or skip-gram) to train our word embeddings.

Figure 2.5: Architectures for CBOW and Skip-gram [3].

After training the deep learning model, we take the weights of the projection layer (hidden layer); those weights are used as the vector representation of the word w(t).
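
A minimal sketch of training word embeddings with gensim's Word2Vec follows (the same library used in Chapter 4; parameter names assume gensim 4.x, and the corpus here is a toy stand-in):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
corpus = [["a", "bird", "is", "flying", "in", "the", "sky"],
          ["an", "eagle", "is", "flying", "in", "the", "sky"]]

# sg=0 selects CBOW; sg=1 would select skip-gram.
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=0)

vec = model.wv["bird"]                       # 50-dimensional word vector
print(model.wv.similarity("bird", "eagle"))  # cosine similarity of the two words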

2.4.2 RNN based Approaches

The feed-forward neural network is a powerful model. However, it is not ideal for sentiment analysis tasks: a feed-forward net requires fixed-size input, but in sentiment analysis the input often has a variable number of words. Hence the RNN (Recurrent Neural Network) was proposed. An RNN can use its internal hidden layer to recursively pass information to the next cell, so it is able to process inputs of arbitrary length.


Figure 2.6: Architecture of an RNN for a sentiment analysis task. The sentence is tokenized into words (word1, ..., wordn). H1, ..., Hn−1 are hidden layers, each able to pass information to the next hidden layer, until the last hidden layer Hn. Hn holds the information about the whole sentence (a fixed-size vector) and is fully connected to an activation function (sigmoid). Finally, the output layer gives the probability distribution over classes, and we can choose the class with the highest probability as the predicted result.

However, the standard RNN suffers from the vanishing or exploding gradient problem during training. To solve this problem, the long short-term memory (LSTM) cell [27] was developed for the hidden unit.

Figure 2.7: LSTM memory cell[4]

An LSTM has one memory state and three gates: an input gate, a forget gate and an output gate. More formally, an LSTM cell can be expressed as follows:

\[ X = \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} \tag{2.6} \]
\[ f_t = \sigma(W_f \cdot X + b_f) \tag{2.7} \]
\[ i_t = \sigma(W_i \cdot X + b_i) \tag{2.8} \]
\[ o_t = \sigma(W_o \cdot X + b_o) \tag{2.9} \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot X + b_c) \tag{2.10} \]
\[ h_t = o_t \odot \tanh(c_t) \tag{2.11} \]

where t is the time step (moment of input), f is the forget gate, i is the input gate and o is the output gate; W_f, W_i, W_o, W_c are weight matrices and b_f, b_i, b_o, b_c are biases; σ is the sigmoid activation function and ⊙ denotes element-wise multiplication.
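
To make the equations concrete, a single LSTM step can be written directly from (2.6)-(2.11). This is a minimal NumPy sketch with toy dimensions, illustrative rather than efficient:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold the weight matrices W_f, W_i, W_o, W_c and their biases.
    X = np.concatenate([h_prev, x_t])                        # eq. (2.6)
    f_t = sigmoid(W["f"] @ X + b["f"])                       # eq. (2.7), forget gate
    i_t = sigmoid(W["i"] @ X + b["i"])                       # eq. (2.8), input gate
    o_t = sigmoid(W["o"] @ X + b["o"])                       # eq. (2.9), output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ X + b["c"])  # eq. (2.10)
    h_t = o_t * np.tanh(c_t)                                 # eq. (2.11)
    return h_t, c_t

d_in, d_h = 4, 3  # toy input and hidden dimensions
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_in + d_h)) for k in "fioc"}
b = {k: np.zeros(d_h) for k in "fioc"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)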

In the next chapter, we introduce some characteristics of the Chinese language, including the background knowledge for PinYin Net: pinyin.


Chapter 3

Characteristics of the Chinese Language

Since we want to design a method for the Chinese language, in this chapter we introduce some of its characteristics. In section 3.1, we introduce the word segmentation problem for Chinese, which is the key step in preprocessing Chinese text. In section 3.2, we introduce pinyin, the background knowledge for PinYin Net.

3.1 Word Segmentation

Unlike English, where a sentence can simply be tokenized into words by whitespace, Chinese words are composed of multiple characters with no blanks between words, and there is no standardized notion of a word in Chinese. Work on this task started by designing segmentation standards, aiming to output words that conform to the standard; the most famous work in this direction is the Penn Chinese Treebank Segmentation Standard [28]. However, as with the lexicon approach for sentiment analysis, we can't simply define a standard (a rule): given different contexts, Chinese word segmentation may differ.

Because of the limitations of segmentation standards, the Chinese word segmentation problem came to be formalized as a character-based sequence labeling task with character position tags. By setting a fixed-size local window as contextual information, the interaction between adjacent characters can be captured. This can be solved by many supervised learning methods, like the Hidden Markov Model or Conditional Random Fields [29]. However, as with most sequence labeling problems, good performance with those methods relies on good feature engineering.

Chinese word segmentation is still an open research question. Recently, due to the success of deep learning in many NLP tasks, people have started considering deep-learning-based approaches for segmentation, like Cai's work [30]; it remains a potential research topic.

In the next section, we introduce pinyin, which is the core of this chapter and the background knowledge for PinYin Net.


3.2 PinYin

Pinyin, or Hànyǔ Pīnyīn, is the romanization system, or phonetic notation [31], of the official language of China, which is called Standard Chinese, or Modern Standard Mandarin, or simply Mandarin.

3.2.1 Origin of Pinyin

As Feng puts it [32], historically there was no Chinese phonetic alphabet; rather, the pronunciation of one character was initially denoted by one or two other Chinese characters with the same or similar pronunciation. During the Tang dynasty, Wen Shou developed 36 phonetic symbols for transcribing pronunciation, but those symbols were still Chinese characters, until around the 1500s when "Xiao Jing" ("Small Scriptures"), a kind of Arabic script, was introduced and used by part of the Chinese Muslim minority. It had 36 characters and is believed to be the earliest stage of the usage of pinyin. During the late Ming dynasty (1368–1644), western missionaries introduced Latin letters to transcribe the phonetics of Chinese for their own learning, and compiled the earliest Chinese–English dictionary, in which the phonetics of Cantonese, one Chinese dialect, were denoted by the Roman alphabet. Other dialects were compiled too, mainly for missionary purposes, before Thomas Francis Wade, a British diplomat and sinologist, designed the spelling system used in his textbook of the Peking official language, known by his eponymous name. Other missionaries also designed their own spelling systems. All those Latin letter systems contributed to the modernization of pinyin.

According to Feng [32], China's own development of modern pinyin began after the Opium Wars (mid-19th century). Zhuangzhang Lu was the first designer of a modern pinyin letter system; his work, Easy Reading: Elementary, was published in 1892, followed by a 20-year boom of publications on pinyin system innovation. However, circulation was limited, and most of these systems employed a double-pinyin method, denoted by Chinese characters' strokes rather than Latin letters.

Feng [32] further noted that during that period of publication fever, the proposed spelling systems could be categorized into three fashions: Chinese character strokes, shorthand, and Latin letters. There was heated debate about which spelling system should be designated as the official phonetic system of Chinese characters, and there was even advocacy for the Latinization of Chinese, for reasons such as the efficiency of wiping out illiteracy. However, in 1958 the dust settled as the Pinyin scheme passed into legislation. Pinyin was inspired by all the proposed systems available in history, especially those which emerged after 1900; it chose to use only the 26 widely accepted Latin letters, and pursued simplicity without compromising coverage of the details of the phonetics. It is also worth noting that in 1913 the official phonetics of Chinese had been standardized by national delegations in a 3-month-long convention, as there were many dialects in China.


3.2.2 Pinyin and Pronunciation

According to Shi [33], pinyin is, to be exact, a phonetic prompt rather than a phonetic transcription: it helps one understand or learn the pronunciation of Chinese characters, rather than being the exact transcription of the pronunciation per se.

However, this account may be too exacting and lack practical meaning. Pinyin does have the exact function of transcribing Chinese characters phonetically: while Chinese characters have different pronunciations, each pinyin symbol or combination has a unique denotation for each phonetic unit. Whether for "ba" or "pá", each one has only one way of being spoken, although the pronunciation may map to different characters. This is no different from English, where [si:] can be paired with "see" or "sea", but there is only one way to pronounce [si:], and the only differences lie in accents. Therefore, we believe pinyin is the Chinese phonetic system [31], instead of an assisting system for phonetics.

The pinyin alphabet can be categorized into two groups [31], initials and vowels, like the consonants and vowels in English, with some overlapping in both languages. In addition, there are five tones [31] for each Chinese character, including the light tone, which is used in interjections like [hng] (meaning "humph" in English). In fact there is no single Chinese word representing "Pinyin": properly it should be written Pīnyīn, indicating two characters, each with the first tone.

Thus, each Chinese character has three elements in phonetics, namely an initial, a vowel and a tone. The exceptions are characters that seem to lack initials, like [an]; in such cases the initial is silent and is denoted by an apostrophe, as in [xi'an], as opposed to [xian], which would mean only one character.

3.2.3 Pinyin-based input method

According to Zang [34], because of the particularity of Chinese and the discrepancy between the 26-key keyboard and tens of thousands of Chinese characters, a Chinese input method has to go through a process of transcoding. Thus, the introduction of the Wangma 5-stroke input method in 1983 was groundbreaking. It has many advantages, like efficiency and a low overlapping code rate (the frequency with which one combination of keys represents multiple Chinese characters instead of a unique pair, because of homophones). However, the learning cost of stroke-based methods is formidable.

In 1993, a pinyin-based method named the Intelligent ABC pinyin input method was invented by Shoutao Zhu, a professor at Peking University. When it hit the market, it gained popularity swiftly, simply because it is user-friendly with almost zero learning cost. The pinyin method is straightforward, as pinyin uses the same 26 letters as English, which are universally printed on keyboards.

3.2.4 Popularity and features of Pinyin-based input methods

There are various pinyin-based input software packages on the market. Some brands may overtake others, but the majority use the Quan-Pinyin scheme, meaning spelling out all the letters of the pinyin. Double-Pinyin is gaining traction, but its market share is still small.

For all pinyin input schemes available to date, the conventional feature is that no tone is required. That is to say, although there are four tones for the most common characters (the fifth tone being silent), input is tone-less: one character's pinyin spelling represents all four tones. Therefore, this input is inherently designed with a high overlapping code rate and low efficiency. For example, if you type "pin", it may require turning several pages of the candidate character table to find the right character you are looking for. So after the first pinyin-based input method became available, pinyin-based input software prevailed only in non-professional applications; initially, computer literacy courses persisted in teaching only stroke-based input software.

Nevertheless, the second generation of pinyin-based input tools turned the tables. Sogou and other pinyin-based input schemes, such as those from Microsoft, Google, QQ, etc., handle the efficiency problem well. They have ever-expanding offline and online corpus banks, helping to choose the optimal words or terms, even sentences, depending on context. Because of that, even a great number of stroke-based input users converted to Sogou for non-intensive daily typing.

Furthermore, pinyin input tools are now mostly equipped with ambiguous input recognition, so users who are not good at spelling the right pinyin can still produce the right characters when they hit the wrong keys in spelling. For example, typing "pingying" can still easily yield the Chinese characters of "pinyin". Other features in Sogou include, for example, auto-correction and spelling an uncommon character via the spellings of two or more characters. Ambiguous input recognition is discussed further in the next section.

3.2.5 Criticism of Pinyin-based input methods’ “auto-correction”

Despite their popularity, pinyin input methods also attract criticism. As mentioned by Zang [34], their popularity has a negative impact on typos in handwritten Chinese, which is believed to be a concerning problem among higher-education students, with primary and secondary education presumed to be responsible [35]. Other contributing factors are believed to be dialects, informal online language, the language environment, and home education [34]. Some advocate stroke-based input to enhance written Chinese skills [36], but this voice is weak compared to the decades-long popularity of pinyin input methods.

In particular, dialects have the greatest influence on spelling pinyin correctly, and some say that input methods indulge wrong spellings [34]. For example, in some cities /z/ and /zh/ in pinyin are hard to differentiate, while other regions have problems pronouncing /f/ or /n/. Because of this, pinyin input is hard for people with heavy accents in the beginning. At the early stage of pinyin input software development, it relied heavily on the correct spelling of pinyin. However, like the ever-enlarging corpus bank feature, Sogou and other pinyin-based methods now have solutions to recognize ambiguous input: for instance, /z/, /c/ and /s/ can be read as /zh/, /ch/ and /sh/ respectively, and /n/ as /ng/. This simplification even helps those with no accent or no problems differentiating those phoneme pairs, because fewer keys are required when typing. A case in point would be typing only two keys, "py", instead of "pinyin", upon which the two Chinese characters of "pinyin" are printed on the screen.

Zang [34] outlines some negative impacts of pinyin input methods on pinyin literacy. Firstly, shorthand and ambiguity recognition further weaken or reinforce incorrect pronunciation of initials and consonants, although Sogou can sometimes also provide the correct spelling prompt when recognizing wrongly spelled input. Secondly, because the design of pinyin input methods excludes tone elements from the input process, this simplifying feature may weaken users' ability to recognize the correct tone if they do not already know it. Thirdly, pinyin input methods hinder the acquisition of polyphonic words, because of the tone-less and ambiguity-reading functions. Fourthly, the acquisition of low-frequency words is weakened, as some characters can be input by separating them into two other characters' spellings, without knowing the pronunciation of the uncommon character.

This account may not be entirely true, because the "ignorance" of polyphonic or uncommon words is ubiquitous: even the hosts of national broadcast programs or educators may misread those words from time to time. It is just like the development of language, where usage redefines prescription. Rather, Sogou's spelling-correction prompts actually contribute to reinstating the orthodox forms. Therefore, the input software may be acquitted.

Since typing Chinese characters nowadays relies heavily on pinyin-based input methods, a close look at pinyin and pinyin-based input methods may benefit the analysis of Chinese sentiment words. Further discussion of treating pinyin as additional information in a sentiment analysis algorithm follows in the next chapter: PinYin Net.


Chapter 4

PinYin Net for Analyzing Sentiments on Chinese Social Media

In this chapter, we propose a method designed for the Chinese language in the sentiment analysis task, called PinYin Net.

4.1 PinYin Net

As mentioned in Chapter 3, pinyin is the romanization system, or phonetic notation, of Chinese. Due to the popularity of pinyin input methods in China, typing Chinese characters relies heavily on pinyin-based input methods.

Pinyin has two styles: standard pinyin (or simply pinyin) and lazy pinyin. Standard pinyin is pinyin with tones; lazy pinyin is pinyin without tones. In a pinyin input method, the user types lazy pinyin (pinyin without tones), the pinyin input software then offers all possible words associated with that lazy pinyin, and finally the user selects, among those candidates, the word he/she wants to express.
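
The two styles can be reproduced with the pypinyin package (the same package used in section 4.2); a minimal sketch:

from pypinyin import pinyin, lazy_pinyin

word = "包子"  # "baozi" (steamed bun), the example discussed below

print(lazy_pinyin(word))  # expected: ['bao', 'zi']      (no tones)
print(pinyin(word))       # expected: [['bāo'], ['zi']]  (with tone marks)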

However, the candidate words' meanings can be quite different. For instance, suppose a user is writing a review for a restaurant and wants to say that this restaurant's steamed bun is delicious. The lazy pinyin for steamed bun is baozi; the user types baozi into the pinyin input method, and the pinyin input software offers the following candidate words for the user to select.

Figure 4.1: Possible words associated with the lazy pinyin baozi (from the software Sogou Pinyin)

In figure 4.1, the first word is the correct one, meaning steamed bun, but the others have very different meanings: the second means maintenance of value, the third means newspaper, etc. Sometimes users will select the wrong word, making a selection error (like a typing error in English), especially in informal circumstances like writing a tweet or a shopping review.


Unlike English, where a writer's typing error, such as a spelling error, in most cases still leaves the intention not hard to guess roughly, in Chinese this kind of selection error leads to a totally different meaning, like steamed bun vs newspaper. Hence, this kind of selection error increases the difficulty for a sentiment analysis algorithm of recognizing the writer's intention.

Fortunately, the pinyin behind those words is the same. If we can feed pinyin as additional information to our sentiment analysis algorithm, it may help the system recognize the writer's real intention.

Moreover, given the cyberculture of Chinese social media, sometimes this kind of "selection error" is made on purpose. Cyberspeak is an example: people like to replace a word with another word that has a different meaning but a similar pronunciation (similar pinyin). For example, the following two sentences have the same meaning, "This restaurant is dreadful."; the right one is standard Chinese and the left one is cyberspeak:

Figure 4.2: An example of standard Chinese (right) and cyberspeak (left)

Figure 4.3: Meaning in standard Chinese

The last word in this example (figure 4.2) is the cyberspeak for dreadful in Chinese cyberculture, and its meaning in standard Chinese (figure 4.3) is quite different, namely spicy chicken; but the pinyin is similar, especially the lazy pinyin. Hence, pinyin can also be valuable information for a sentiment analysis algorithm to recognize the writer's real intention in these cyberspeak cases.

Hence we came up with the idea of PinYin Net: we assume pinyin is helpful additional information for understanding the writer's intention.

To add information about pinyin, we can make pinyin part of the word representation by concatenating the word embedding with its pinyin embedding. Embedding is the technique introduced in Chapter 2, which aims to map a word into a vector representation such that words with similar meanings are close in vector space.

Figure 4.4: Word representation in PinYin Net

Hence, we will have word representation vectors in which words are close to each other not only if their meanings are similar but also if their pinyin is similar. We then feed those word representations into an LSTM (introduced in Chapter 2, section on deep learning based approaches).

Given a text sentence S = word_1, word_2, ..., word_n, where n is the number of words in sentence S, we first convert each word into its pinyin, then use a lookup layer to get the embedding of the word and the embedding of its pinyin. The concatenation of the word embedding and its pinyin embedding is used as the representation of the word, which is fed to an LSTM cell and recursively passed to the next LSTM cell until the last cell, LSTM_n. LSTM_n outputs a fixed-size vector which can be regarded as the representation of the whole sentence; it is fully connected to a layer with a sigmoid function (or softmax in the multi-class case). Finally, the sigmoid layer gives the probability distribution over classes (the classes being sentiment labels in our task, like positive and negative), and we choose the class with the highest probability as our predicted result.

Figure 4.5: Architecture of PinYin Net
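
A minimal sketch of this architecture in PyTorch follows (an assumed framework choice; layer sizes are placeholders, and word_ids/pinyin_ids are hypothetical index tensors produced by the two lookup vocabularies):

import torch
import torch.nn as nn

class PinYinNet(nn.Module):
    def __init__(self, word_vocab, pinyin_vocab, embed_dim=100,
                 hidden_dim=128, num_classes=2):
        super().__init__()
        self.word_embed = nn.Embedding(word_vocab, embed_dim)
        self.pinyin_embed = nn.Embedding(pinyin_vocab, embed_dim)
        # The LSTM input is the concatenation of word and pinyin embeddings.
        self.lstm = nn.LSTM(2 * embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_ids, pinyin_ids):
        x = torch.cat([self.word_embed(word_ids),
                       self.pinyin_embed(pinyin_ids)], dim=-1)
        _, (h_n, _) = self.lstm(x)  # h_n: representation of the whole sentence
        return self.fc(h_n[-1])     # logits over sentiment classes

model = PinYinNet(word_vocab=5000, pinyin_vocab=400)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # as in section 4.2
loss_fn = nn.CrossEntropyLoss()  # the cross-entropy of eq. (4.1) below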

PinYin Net’s weights are trained by minimize the cross-entropy of thepredicted result and the ground-truth label, formula for cross-entropy canbe expressed as following:

H(y, l) = −N∑i=1

C∑j=1

lji log(yji ) (4.1)

where N is the number of training samples, C is the number of classes.lji is the ground-truth label, yji is predicted result.

4.2 Experimental Setup

In this section, we describe how the experiments were set up. Our experimental data consists of e-commerce shopping reviews which are publicly available online [37]. This corpus (dataset) covers 6 domains: books, hotels, computers, milk, phones and heaters, with 10679 positive samples and 10428 negative samples. More details are available in the following table:


Domain     Positive Reviews   Negative Reviews
Book       4000               4000
Hotel      2000               2000
Computer   2000               2000
Milk       1005               1170
Phone      1159               1158
Heater     515                100

Table 4.1: Number of positive and negative reviews in each of the 6 domains.

Figure 4.6: Statistics of data over 6 domains.

Before the experiments, we need to do some pre-processing of our data. First, we need word segmentation: unlike English, where a sentence can simply be tokenized into words by whitespace, tokenizing a Chinese sentence into words is still an open research question. In this project, we use the package jieba [38] for the word segmentation task, which is a Hidden Markov Model [39] based approach. We then use the pypinyin [40] package to convert words into their pinyin. Word embeddings and pinyin embeddings are trained with the word2vec model from the gensim package [41] (using CBOW), taking the tokenized dataset as the corpus.
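A minimal sketch of this pipeline (with illustrative sentences and parameters, and gensim 3.x-style arguments, where sg=0 selects CBOW):

    # -*- coding: utf-8 -*-
    # Minimal preprocessing sketch: segment with jieba, convert to lazy pinyin
    # with pypinyin, then train word2vec (CBOW) embeddings with gensim.
    import jieba
    from pypinyin import lazy_pinyin
    from gensim.models import Word2Vec

    reviews = [u'这本书非常好', u'酒店服务太差了']  # illustrative raw reviews

    # 1. Word segmentation (jieba's default mode uses an HMM for unseen words).
    tokenized = [list(jieba.cut(r)) for r in reviews]

    # 2. Convert each segmented word into its lazy-pinyin string.
    pinyin_corpus = [[''.join(lazy_pinyin(w)) for w in sent] for sent in tokenized]

    # 3. Train the two embedding models on the tokenized corpora.
    word_model = Word2Vec(tokenized, size=100, sg=0, min_count=1)
    pinyin_model = Word2Vec(pinyin_corpus, size=100, sg=0, min_count=1)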

In our experiments, we split the dataset into two disjoint subsets, a training set and a testing set, with proportions 80% and 20%. As our baseline, we compare against the performance obtained using information about words only (word embeddings only). Training PinYin Net with the Adam optimizer [42] and a learning rate of 0.001 yields the results presented in Section 4.3 (Results and Discussions).
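Put together, the setup amounts to a short script like the following sketch; the placeholder data stands in for the padded index matrices, and `model` is the PinYin Net sketch from Section 4.1 (all names and sizes are hypothetical):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from keras.optimizers import Adam

    # Random placeholder data in place of the real padded index matrices.
    N, MAX_LEN = 1000, 50
    X_words = np.random.randint(0, 50000, size=(N, MAX_LEN))
    X_pinyin = np.random.randint(0, 5000, size=(N, MAX_LEN))
    y = np.random.randint(0, 2, size=(N,))

    # 80/20 split into disjoint training and testing sets.
    Xw_tr, Xw_te, Xp_tr, Xp_te, y_tr, y_te = train_test_split(
        X_words, X_pinyin, y, test_size=0.2)

    # Train with Adam at learning rate 0.001, as in the experiments.
    model.compile(optimizer=Adam(lr=0.001), loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.fit([Xw_tr, Xp_tr], y_tr, epochs=10,
              validation_data=([Xw_te, Xp_te], y_te))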

4.3 Results and Discussions

Training PinYin Net with the Adam optimizer and a learning rate of 0.001, we obtain the following results.


Figure 4.7: (a) Accuracy on the testing set for the baseline (word embedding only) vs PinYin Net (concatenated with lazy pinyin). (b) Loss during training for the baseline (word embedding only) vs PinYin Net (concatenated with lazy pinyin).

Unfortunately, PinYin Net does not give a significant improvement in accuracy. However, we can observe two phenomena in Figure 4.7: PinYin Net achieves good performance within very few epochs, and according to the plot of the loss during training, the loss of PinYin Net is smaller than the loss of the baseline. This suggests that the pinyin information may make the model (PinYin Net) easier to train.

If we use standard pinyin (pinyin with tones) as the additional information, we observe similar phenomena (Figure 4.8).

Moreover, if we use lazy pinyin only, PinYin Net is still able to give performance similar to the baseline (Figure 4.9). Since the mapping from lazy pinyin to words is one-to-many, the vocabulary of lazy pinyin is much smaller than the vocabulary of words. Hence, training PinYin Net with lazy pinyin only is computationally much cheaper (10 epochs trained within 15 min for PinYin Net using lazy pinyin only, versus 22 min for the baseline method, on an Intel i3 at 2.4 GHz).
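The vocabulary reduction is easy to see in a minimal sketch (the example strings below are our own illustrative near-homophones, not drawn from the corpus):

    # -*- coding: utf-8 -*-
    # Minimal sketch: several distinct character strings collapse onto one
    # lazy-pinyin key, so the lazy-pinyin vocabulary (and hence the embedding
    # lookup table) is much smaller than the word vocabulary.
    from pypinyin import lazy_pinyin

    words = [u'垃圾', u'辣鸡', u'拉机']  # three distinct word-vocabulary entries
    keys = set(''.join(lazy_pinyin(w)) for w in words)
    print(keys)  # a single lazy-pinyin entry: 'laji'

Scaled to the whole corpus, this shrinking of the lookup table is what makes training on lazy pinyin cheaper.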


Figure 4.8: (a) Accuracy on the testing set for the baseline (word embedding only) vs PinYin Net (concatenated with standard pinyin). (b) Loss during training for the baseline (word embedding only) vs PinYin Net (concatenated with standard pinyin).


Figure 4.9: (a) Accuracy on the testing set for the baseline (word embedding only) vs PinYin Net (lazy pinyin only). (b) Loss during training for the baseline (word embedding only) vs PinYin Net (lazy pinyin only).


Even we didn’t have a significant improvement in accuracy. However,we still have some meaningful achievement, since in real life, sometimewe may face the data which are really hard to train or huge amount ofdata like 1T. For some started-up company or some individual researcher,they may not have lots of computational resources, then the PinYin Netcan be one option they can consider about, which is easier to train byconcatenated with pinyin embedding or can give similar performance, butcomputational cheaper by using lazy pinyin only.


Chapter 5

Conclusion

5.1 Conclusion

In this thesis, we have proposed a method designed for sentiment analysis of the Chinese language, called PinYin Net. In PinYin Net, we take a word's pinyin as additional information in the representation of the word (by concatenating the word embedding with its pinyin embedding). After experiments on data mixing 6 domains of shopping reviews, we observe that PinYin Net is easier to train than the baseline method (which uses information about words only), and that using lazy pinyin (pinyin without tones) only, it is still able to give results similar to the baseline method at a much lower computational cost.

5.2 Future Work

Future work could investigate improvements to the pinyin embedding. In this project, we simply used the word2vec model to train the pinyin embeddings, treating each pinyin as a "word"; the internal characteristics of pinyin have not been taken into consideration yet. Further work could also address the limitations of deep learning-based approaches, which are constrained by the lack of manually labeled data. Research on using parts of the text as noisy labels to train the model could also be interesting future work, such as Felbo et al.'s work [43], which takes emoji as noisy labels for sentiment.


5.3 References

[1] V. S. Pagolu, K. N. R. Challa, G. Panda, and B. Majhi, "Sentiment Analysis of Twitter Data for Predicting Stock Market Movements," International Conference on Signal Processing, Communication, 2016.
[2] B. Liu, Sentiment Analysis and Opinion Mining, Morgan & Claypool Publishers, 2012.
[3] P. D. Turney, "Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 417-424.
[4] F. Bütow, F. Schultze, and L. Strauch, "Semantic Search: Sentiment Analysis with Machine Learning Algorithms on German News."
[5] E. Riloff and J. Wiebe, "Learning extraction patterns for subjective expressions," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2003), 2003.
[6] M. D. P. Salas-Zárate, J. Medina-Moreira, K. Lagos-Ortiz, H. Luna-Aveiga, M. Á. Rodríguez-García, and R. Valencia-García, "Sentiment Analysis on Tweets about Diabetes: An Aspect-Level Approach," Computational and Mathematical Methods in Medicine, 2017.
[7] F. M. Kundi, A. Khan, S. Ahmad, and M. Z. Asghar, "Lexicon-Based Sentiment Analysis in the Social Web," Journal of Basic and Applied Scientific Research, 4(6), pp. 238-248, 2014.
[8] M. Hu and B. Liu, "Mining and summarizing customer reviews," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168-177.
[9] C. Strapparava and A. Valitutti, "WordNet-Affect: an Affective Extension of WordNet," in LREC, 2004, pp. 1083-1086.
[10] S.-M. Kim and E. Hovy, "Determining the sentiment of opinions," in Proceedings of the 20th International Conference on Computational Linguistics, 2004, p. 1367.
[11] S. Mohammad, C. Dunne, and B. Dorr, "Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, 2009, pp. 599-608.
[12] S. Park and Y. Kim, "Building thesaurus lexicon using dictionary-based approach for sentiment classification," in 2016 IEEE 14th International Conference on Software Engineering Research, Management and Applications (SERA), 2016, pp. 39-44.
[13] W. Medhat, A. Hassan, and H. Korashy, "Sentiment analysis algorithms and applications: A survey," Ain Shams Engineering Journal, vol. 5, pp. 1093-1113, 2014.
[14] B. Liu, "Sentiment analysis and opinion mining," Synthesis Lectures on Human Language Technologies, vol. 5, pp. 1-167, 2012.
[15] V. Hatzivassiloglou and K. R. McKeown, "Predicting the semantic orientation of adjectives," in Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, 1997, pp. 174-181.
[16] X. Ding, B. Liu, and P. S. Yu, "A holistic lexicon-based approach to opinion mining," in Proceedings of the 2008 International Conference on Web Search and Data Mining, 2008, pp. 231-240.
[17] B. Tang, S. Kay, and H. He, "Toward optimal feature selection in naive Bayes for text categorization," IEEE Transactions on Knowledge and Data Engineering, 28(9), pp. 2508-2521, 2016.
[18] I. Rish, "An empirical study of the naive Bayes classifier," in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, pp. 41-46, 2001.
[19] "Naive Bayes text classification," Stanford NLP Group. [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/text-classification-and-naive-bayes-1.html.
[20] G. Gautam and D. Yadav, "Sentiment analysis of Twitter data using machine learning approaches and semantic analysis," in 2014 Seventh International Conference on Contemporary Computing (IC3), 2014, pp. 437-442.
[21] F. Bütow, F. Schultze, and L. Strauch, "Semantic Search: Sentiment Analysis with Machine Learning Algorithms on German News."
[22] Z.-H. Zhou, Machine Learning, Tsinghua University Press, 2016.
[23] P. Chikersal, S. Poria, and E. Cambria, "SeNTU: Sentiment analysis of tweets by combining a rule-based classifier with supervised learning," in Proceedings of the International Workshop on Semantic Evaluation (SemEval), 2015, pp. 647-651.
[24] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, 4(2), pp. 251-257, 1991.
[25] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient BackProp," Neural Networks: Tricks of the Trade, 1998.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 9(8), pp. 1735-1780, 1997.
[27] "LSTM Networks for Sentiment Analysis." [Online]. Available: http://deeplearning.net/tutorial/lstm.html.
[28] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "Building a large annotated Chinese corpus: the Penn Chinese Treebank," Journal of Natural Language Engineering, 11(2), 2005.
[29] H. Zhao, C.-N. Huang, and M. Li, "An improved Chinese word segmentation system with conditional random field," in Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, 2006.
[30] D. Cai and H. Zhao, "Neural Word Segmentation Learning for Chinese," 2016.
[31] W. Jun, "The Chinese Phonetic System (pinyin) is the Best Scheme, a Further Consideration," Applied Linguistics, vol. 2, 2003.
[32] F. Zhiwei, "Review of Romanization of Chinese Scripts," Terminology Standardization & Information Technology, vol. 4, 2004.
[33] F. Shi, "Actual Pronunciation of Mandarin Pinyin Symbols," Language and Characters Application, pp. 48-49, 2013.
[34] Y. Zang, "An Investigation on the Negative Effects of Intelligent Pinyin Input Software on the Application of Chinese Characters in College Students," Ph.D. thesis, Shenyang Normal University, 2012.
[35] W. Peng, "A Survey on Common Chinese Characters Acquisition of Tertiary Students," Journal of Guiyang University: Social Science Edition, pp. 54-59, 1999.
[36] W. Wang, "Strategies to Strengthen Handwriting Skills of Chinese Characters in Computer Age," Journal of Zhankou Teachers College, vol. 26, pp. 75-78, 2009.
[37] J. Su, "Spaces.Ac.cn," 2016. [Online]. Available: http://kexue.fm/archives/3331/.
[38] "jieba." [Online]. Available: https://github.com/fxsjy/jieba.
[39] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77(2), pp. 257-286, 1989.
[40] "pypinyin." [Online]. Available: https://pypi.python.org/pypi/pypinyin.
[41] "gensim." [Online]. Available: https://radimrehurek.com/gensim/index.html.
[42] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," 3rd International Conference on Learning Representations (ICLR), San Diego, 2015.
[43] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann, "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm," EMNLP, 2017.


5.4 Appendices

5.4.1 Appendix 1: Project Description

Due to the popularity of e-commerce in China, there is a huge number of reviews, forum posts, and Weibo comments on the quality of various products and services. Understanding users' feedback and opinions is crucial for business success. In this project, we aim to build deep learning models for sentiment analysis. The resulting models should also focus on the language-specific properties of Chinese.

5.4.2 Appendix 2: Study Contract


5.4.3 Appendix 3: Description of Software

The software was implemented in Python 2.7 under Ubuntu 16.04 on an Intel i3, 2.4 GHz machine. It contains the following scripts:

pinyin.py: Converts a Chinese document/sentence into its pinyin; implemented by myself with the open-source package pypinyin.

trainPinYin2vec.py: Trains the embedding for pinyin; takes a pinyin corpus as input and outputs a model file that maps each pinyin to its embedding. Implemented by myself with the open-source package gensim.

trainWord2Vec.py: Trains the embedding for words; takes a word corpus as input and outputs a model file that maps each word to its embedding. Implemented by myself with the open-source package gensim.

sentimentAnalysis_lstm.py: Implementation of the LSTM architecture for the sentiment analysis task, used as the baseline; takes the dataset as input and gives a trained model and a training-history file as output. The code is from a publicly available GitHub repository, modified by adding one more step (converting words to pinyin) before feeding the data into the architecture, in order to obtain results using pinyin embeddings only.
Referenced GitHub:
Author: DaoYu Lin
Date: 20 July 2016
Title of program: Shopping Reviews sentiment analysis
URL: https://github.com/BUPTLdy/Sentiment-Analysis

sentimentAnalysis_lstm_WordAndPinYin.py: Implementation of the PinYin Net architecture; takes the dataset as input and gives a trained model and a training-history file as output. The code is modified from sentimentAnalysis_lstm.py by adding a step to convert words to pinyin, adding one more lookup layer to get the pinyin embedding, and implementing a concatenation step inside the architecture to concatenate the word embedding and the pinyin embedding as the word representation.

plotRseult.py: Used for visualizing the results. Implemented by myself.

5.4.4 Appendix 4: README

The software contains the scripts pinyin.py, trainPinYin2vec.py, trainWord2Vec.py, sentimentAnalysis_lstm.py, sentimentAnalysis_lstm_WordAndPinYin.py, and plotRseult.py (described in Appendix 3).

First, we need to install TensorFlow as the deep learning backend: https://www.tensorflow.org/install/

Then the packages keras, sklearn, gensim, jieba, h5py, numpy, pandas, and pypinyin are required. We need to run the following command in the root directory to install the required packages:

    sudo pip install -r requirements.txt

Now we can run trainPinYin2vec.py and trainWord2Vec.py to obtain the embedding models for pinyin and words. Moreover, in trainPinYin2vec.py we can change the parameter method to "lazy" or "normal" to obtain the embedding model for lazy pinyin or (standard) pinyin.

With the embedding models, we are now able to run sentimentAnalysis_lstm.py or sentimentAnalysis_lstm_WordAndPinYin.py.

When running sentimentAnalysis_lstm.py, we can change the type of tok (word, lazy_pinyin, or normal_pinyin), together with the corresponding embedding model (word embedding model, lazy-pinyin embedding model, or (normal) pinyin embedding model), to obtain results for using words only, lazy pinyin only, or pinyin only.

When running sentimentAnalysis_lstm_WordAndPinYin.py, we can also change the type of pinyin (lazy_pinyin or normal_pinyin), together with the corresponding embedding model, to obtain results for words concatenated with lazy pinyin or with (normal) pinyin.

Finally, we will have some trained models and training histories; we can use plotRseult.py to visualize a training history by giving the name of the history file.
