[IEEE 2012 International Conference on Asian Language Processing (IALP) - Hanoi, Vietnam (2012.11.13-2012.11.15)] 2012 International Conference on Asian Language Processing - Linguistic

Linguistic Features for Subjectivity classification

Huong Nguyen Thi Xuan1,2, Anh Cuong Le2, Le Minh Nguyen3

1 Haiphong Private University, 36 Danlap, Duhangkenh, Lechan, Haiphong, Vietnam

2 University of Engineering and Technology, Vietnam National University, E3-144 Xuanthuy, Caugiay, Hanoi, Vietnam

3 School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan

huong [email protected], [email protected], [email protected]

Abstract—Opinions are subjective expressions that describe people's viewpoints, perspectives, or feelings about entities, events, and their properties. Detecting subjective expressions is the task of identifying whether a given text is subjective (i.e. an opinion) or objective (i.e. a reported fact). This task is considered the first problem in opinion mining and sentiment analysis, fields that are now attracting many researchers because of their wide applicability. Improvements in subjectivity classification will positively affect the performance of a sentiment analysis system. Features play the most important role in accurately identifying subjective sentences. In this paper, we enrich the features by using syntactic information from the text. Based on our observations of opinion evidence in texts, we propose syntax-based patterns for extracting rich linguistic features. Combining these new features with conventional features from previous studies, we obtain a high accuracy (about 92.1%) for detecting subjective sentences on the Movie Review data.

I. INTRODUCTION

Opinion mining and sentiment analysis aim to crawl, extract, and analyze people's opinions shared on the Internet, such as product reviews, public reactions to political events, scientific blogs, fashion and music trends, and so on. The results of this work are useful not only to other consumers but also to archivists, companies looking for feedback on both their own products and those of their rivals, governments, and social historians, among others. This problem has therefore recently become a hot topic, especially in the natural language processing community.

Sentiment classification is the problem of classifying an opinion text by its polarity (i.e. positive, negative, or neutral). The result of polarity classification directly affects users, so it is important and attracts most researchers working on the problem. However, many real opinion mining and sentiment analysis systems first have to crawl texts from websites and then determine whether a text contains opinion (i.e. subjective information) or not (i.e. objective information) before proceeding to polarity classification. This task is known as subjectivity classification, and it is the focus of this paper.

To detect viewpoints in documents or sentences, most previous studies have focused on lexical characteristics. Many methods have been introduced to find the words that express opinion features: single words [1]; word sequences generated from the text (n-grams) [2]; phrases extracted using patterns [3]; syntactic information [4], [5]; and dependency relations [6]. Most of these studies focus on adjectives, adverbs, and some extensions of the verb to determine features. Word features are the words users choose to express an opinion about an object, such as "excellent", "good", "bad", and so on. Recently, Zhang and Liu [7] introduced a method to identify single nouns and noun phrases as features that imply opinions. Sokolova and Lapalme [8] presented a method that uses non-affective adjectives and adverbs, supplemented by degree pronouns, mental verbs, and modal verbs, to determine the positive or negative opinion label. Long et al. [9] extracted aspects based on the frequency of dependent words (adjectives) on the Web, which are then used to select reviews. Taboada et al. [10] presented a word-based method for extracting sentiment from texts; they calculated the Semantic Orientation of parts of speech such as nouns, adverbs, and verbs, building on previous research on adjectives.

After the features are extracted from the data, many statistical machine learning classifiers have been applied to the sentiment classification task: Pang et al. used Support Vector Machines (SVM), Naive Bayes (NB), and Maximum Entropy (ME) [2]; Yu and Hatzivassiloglou used NB [11]; Riloff et al. used SVM [5]; and so on. These studies achieved fairly good results in different domains.

In this paper, we propose a new approach to lexical extraction based on syntactic information. We investigate four types of words (adjectives, adverbs, verbs and some extensions of the verb, and nouns) to extract linguistic features that may appear in both subjective and objective sentences. We not only find fixed-size groups of words, as in previous research [1], [2], [3], [4], but also extract variable-length groups of words, based on syntactic information, that may express opinion.

Take the following example:

"color, musical bounce and warm seas lapping on island shores; and just enough science to send you home thinking."

By applying syntax-based patterns, we collect the following four phrases:

1) ( NP ( JJ musical ) ( NN bounce ) / ( CC and ) / ( JJ warm ) ( NNS seas ) )

2012 International Conference on Asian Language Processing

978-0-7695-4886-9/12 $26.00 © 2012 IEEE

DOI 10.1109/IALP.2012.47

2) ( VP ( VBG lapping )/ ( PP ( IN on )

3) ( NP ( RB just ) ( JJ enough ) ( NN science ) )

4) ( VP ( VB send )/ ( NP ( NN home ) ( NN thinking ) ) )

Moreover, Riloff et al. [5] used a subsumption hierarchy to identify complex features using unigrams, bigrams, and extraction patterns. Their extraction patterns are generated from unigrams and bigrams. By combining these features, they could extract complex features that outperform simpler ones. In our research, we analyze how adjectives, adverbs, verbs, and nouns appear in syntactic trees in order to define syntax-based patterns for extracting complex features.

Sokolova and Lapalme [8] considered modal verbs, mental verbs, and other kinds of words to identify sentiment polarity. They created a seed set of effective words and added their synonyms from an electronic version of Roget's thesaurus. The disadvantage of their approach is that it depends on the selected seed words and on the quality of the thesaurus. In our research, we first identify various types of features based on rich linguistic information, and then use a Maximum Entropy Model (MEM) classifier to separate subjective from objective sentences. Our experiment is carried out on the movie review data used by Pang and Lee [12]. Our method reaches 92.1% accuracy for detecting subjective sentences.

In Section II we present our proposal for identifying features. The experiment and evaluation are presented in Section III. Finally, Section IV gives the conclusions.

II. FEATURE EXTRACTION

In our research, we investigate four types of words (adjectives, adverbs, verbs and some extensions of verbs, and nouns) to create syntax-based patterns. We use these patterns to extract linguistic features for distinguishing subjective sentences from objective sentences. We chose 22 syntax-based patterns based on an analysis of how the features appear in subjective and objective sentences. The two kinds of sentence are then separated by applying MEM classification.

A. Syntactic Information

To extract linguistic features from sentences, we use syntactic information generated by the Stanford Parser. The Stanford Parser is a natural language parser developed by the Stanford Natural Language Processing Group. It uses probabilistic methods to produce parse trees for texts.

Take the following example sentence:

"smart and alert, thirteen conversations about one thing is a small gem."

The Stanford Parser generates the following output:

(ROOT/(S/(NP/(NP/(ADJP (JJ smart)/(CC and)/(JJ alert) (, ,)(JJ thirteen))/ (NNS conversations))/(PP (IN about)/(NP (CD one) (NN thing))))/(VP (VBZ is)/(NP (DT a) (JJ small) (NN gem)))/(. .))
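To make the bracketed parser output easier to work with, the sketch below reads such a string into nested tuples and lists the words it covers. It is only an illustration: the helper names `parse_tree` and `leaves` are hypothetical, the "/" separators of the printed output are assumed to be layout marks and are left out, and only the balanced VP fragment of the example above is used.

```python
def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return read()


def leaves(node):
    """Return the words covered by a node, left to right."""
    words = []
    for child in node[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Balanced VP fragment taken from the example parse (without the "/" marks).
tree = parse_tree("(VP (VBZ is) (NP (DT a) (JJ small) (NN gem)))")
print(tree[0], leaves(tree))
```

A nested-tuple representation is enough for the pattern checks discussed in this section; a tree class from an NLP toolkit could be substituted if one is available.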

1) Syntax-based patterns including adjectives: Adjectives have been used in most studies related to sentiment analysis. In our method, we use four patterns that contain adjectives. They extract a single adjective, an adjective phrase, two adjective phrases connected by a conjunction, and a group of words from an adjective structure. In this way we capture the various kinds of adjective occurring in a sentence that may express the user's opinion. All syntax-based patterns that include adjectives are described in Table I. In the pattern tables, we use the symbol "/" to indicate that the tag following it is an alternative option.

As an example, we use the pattern [ADJP] [TO] [VB] to extract features from a subjective sentence:

"writer/director walter hill is in his hyper masculine element here, once again able to inject some real vitality and even art into a pulpy concept that, in many other hands would be completely forgettable."

The adjective phrase is extracted as follows:

(ADJP (JJ able)/(S/(VP (TO to)/(VP (VB inject)

Pattern | Description
[ADJP] [TO] [VB] | The adjective in this pattern expresses the user's ability or comment to do something.
[ADJP] [CC] [ADJP] or [ADJP contains CC JJ] | Two adjective phrases which are connected by a conjunction.
[VP contains VBZ/VBG] [ADJ] | The adjective describes information or a review of entities or objects.
[ADJP contains only JJ] | The adjective is an exclamation word in the sentence that expresses subjective substance.

TABLE I. SYNTAX-BASED PATTERNS INCLUDING ADJECTIVES
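The following sketch shows one possible reading of the [ADJP] [TO] [VB] pattern: inside an ADJP node, look for an adjective followed later by a TO tag that is immediately followed by a VB tag, and keep the words up to the verb. The matching rule, the helper names, and the closing brackets added to the truncated example fragment are all assumptions made for illustration, not the authors' implementation.

```python
def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ')'
        return (label, children)

    return read()


def subtrees(node):
    """Yield the node itself and every constituent below it."""
    yield node
    for child in node[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def tagged_words(node):
    """Yield (tag, word) pairs for the words under a node, left to right."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, children[0])
        return
    for child in children:
        if isinstance(child, tuple):
            yield from tagged_words(child)


def match_adjp_to_vb(tree):
    """Assumed reading of [ADJP] [TO] [VB]: adjective ... 'to' + base verb."""
    phrases = []
    for node in subtrees(tree):
        if node[0] != "ADJP":
            continue
        pairs = list(tagged_words(node))
        tags = [tag for tag, _ in pairs]
        for i in range(len(tags) - 1):
            if tags[i] == "TO" and tags[i + 1] == "VB" and "JJ" in tags[:i]:
                phrases.append(" ".join(word for _, word in pairs[: i + 2]))
                break
    return phrases


# The fragment from the text, closed off with brackets (an assumption,
# since the printed fragment is truncated after "inject").
fragment = "(ADJP (JJ able) (S (VP (TO to) (VP (VB inject)))))"
print(match_adjp_to_vb(parse_tree(fragment)))
```

The other patterns in the table could be handled with the same traversal helpers by changing only the tag test inside the loop.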

2) Syntax-based patterns including adverbs: In English texts, an adverb modifies a verb, an adjective, another adverb, a clause, or a sentence. Adverbs are placed right before or after the object or verb, and can also appear at the beginning, at the end, or in the middle of a sentence. Turney [3] extracted two-word phrases in which an adverb modifies a verb or an adjective; Taboada et al. [10] considered adverbs near a noun, verb, or adjective; Riloff et al. [5] used a subsumption hierarchy over unigrams, bigrams, and extraction patterns that contain some adverb collocations. To find linguistic features containing adverbs, we use seven syntax-based patterns, described in Table II.

Take the following subjective sentence as an example:

"provides the kind of 'laugh therapy' i need from movie comedies - offbeat humor, amusing characters, and a happy ending; after seeing 'analyze that, ' i feel better already."

The adverb phrase that follows and modifies the verb is extracted using the pattern [VP][ADVP][ADVP]...:

(VP (VBP feel)/(ADVP (RBR better))/(ADVP (RB already)))

3) Syntax-based patterns including verbs: Many previous studies have dealt with verbs and some extensions of verbs, which are used to find users' recommendations. Turney [3] calculated the semantic orientation of two-word phrases containing verbs or some extensions of verbs appearing with adverbs. Kim and Hovy [13] classified the sentiment of words, including adjectives and verbs, before classifying the sentiment of the sentence.

Pattern | Description
[VP contains [ADVP] [VB/VBN/VBG/VBZ/VBD] | The adverb modifies the verb.
[VP] [with [PRT] or not] [ADVP] [with [PP] or not] | The adverb phrase modifies a verb that has a prepositional phrase.
[VP][ADVP][ADJP] | The adverb modifies an adjective.
[ADJP][ADVP][JJ] | The adjective phrase contains adverb phrases that modify the adjective.
[ADVP][VP] | The adverb phrases modify an intransitive verb.
[PP contain RB] [NP] | The prepositional phrase includes an adverb before a noun phrase.
[ADVP] | The adverb modifies a verb and is placed at the end of a sentence.

TABLE II. SYNTAX-BASED PATTERNS INCLUDING ADVERBS

Sokolova and Lapalme [8] used modal verbs, mental verbs, and other kinds of words to classify texts as positive or negative. Taboada et al. [10] extracted sentiment from texts based on nouns, adverbs, verbs, and adjectives. In our method, we extract phrases that include verbs and that express the purpose of an action, describe events, or contain a preposition that reflects sentiment. The syntactic patterns are described in Table III. Take the following subjective sentence as an example:

Pattern | Description
[VP] [TO] [VP] | The verb phrases express a purpose.
[MD] [VP] [TO] [VP] | The modal verb is used to express ability or the subjunctive mood about entities or objects. It is often used in conditional sentences and so may express an opinion.
[VP] [VBN]/[VBG]/[NN] | The description of an entity or an event; it may express an opinion or a fact.
[VP] [PP/PRT] | The verb phrase is a type of word that describes an action or a state of entities or objects.

TABLE III. SYNTAX-BASED PATTERNS INCLUDING VERBS

"fans of behan's work and of irish movies in general will be rewarded by borstal boy."

We use the pattern [MD][VP][TO][VP] to extract the following phrase:

(VP (MD will)/ (VP (VB be)/(VP (VBN rewarded) /(PP (IN by)

4) Syntax-based patterns including nouns: In general, nouns and noun phrases are used to describe facts. Noun features have been investigated in some studies to determine subjective substance. Turney [3] extracted two-word phrases that include adjectives. Taboada et al. [10] calculated the Semantic Orientation of nouns that express a review. In our model, we extract noun phrases that may occur in both subjective and objective sentences. The patterns we collected are described in Table IV.

Take the following subjective sentence as an example:

"the story that emerges has elements of romance , tragedy and even silent-movie comedy."

Applying the pattern [NP contains JJ/VBG/VBN] [CC] [NP] or [NP] CC [NP contains JJ/VBG/VBN], the noun phrase is extracted as follows:

(NP/( NP ( NN tragedy ) ) / ( CC and ) /( NP ( RB even ) ( JJ silent-movie ) ( NN comedy ).

The noun "tragedy" has the same sentiment orientation as the phrase "even silent-movie comedy".

Pattern | Description
[NP contains JJ/JJR/JJS/RB/RBR/RBS/VBG/VBN] | The noun phrase includes parts of speech such as comparative adjectives and adverbs; present and past participles can be used as adjectives.
[NP contains JJ/VBN/VBG] [CC] [NP] or [NP] CC [NP contains JJ/VBN/VBG] | Two noun phrases which are connected by a conjunction. It is a description of an object.
[NP NN/NNS CC NN/NS] | Two noun phrases which are connected by a conjunction. It is a description of an object.
[NP] [POS] [NP] | Two noun phrases which are in the possessive case. This collocation may or may not express an opinion.
[NP] [IN] [NP] | Two noun phrases which are connected by a preposition or subordinating conjunction.
[NP contains [QP] [NN]] | The noun phrase includes a quantity phrase which may express an opinion.
[NP contains NN/NNS NN/NNS] | The noun phrase includes two nouns which may reflect the user's opinion.

TABLE IV. SYNTAX-BASED PATTERNS INCLUDING NOUNS

III. EXPERIMENT AND EVALUATION

A. Experiment

Our experiment is conducted on the Movie Review data introduced by Pang and Lee [12]. The data contains 5,000 subjective sentences and 5,000 objective sentences. First, in pre-processing, we removed the white space inside abbreviations (e.g. "m . r ." was changed to "m.r."). We also inserted punctuation at the end of each sentence. This ensures that sentences are separated correctly before the Stanford Parser is applied. We then applied the Stanford Parser to obtain syntactic information. Using the 22 proposed syntax-based patterns, we extract the linguistic features used to distinguish subjective from objective sentences. Table V gives the number of word groups extracted by the 22 syntax-based patterns (introduced in Section II) from the 5,000 quote (subjective) sentences and the 5,000 plot (objective) sentences. From this table, we can see that the four patterns that include adjectives, some patterns that include adverbs such as [VP][with [PRT] or not][ADVP] and [ADJP][ADVP][JJ], the modal-verb pattern [MD][VP][TO][VP], and the noun pattern [NP contains JJ/VBN/VBG][CC][NP] have higher frequencies in quote sentences than in plot sentences. In contrast, the verb pattern [VP][TO][VP] and the noun pattern [NP contains NN/NNS] have higher frequencies in plot sentences than in quote sentences. The remaining patterns have comparable frequencies in both types of sentence. Note that before applying the classification model, we remove all stop words from the extracted data. The stop words include "a", "an", "the", "their", "those", and so on.
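As a small illustration of the stop-word step, the sketch below turns an extracted word group into a single feature string. Only the five stop words named in the text are used (the paper's full list is not given), and joining the remaining words with an underscore is an assumed encoding, as is the helper name `phrase_to_feature`.

```python
# Only the stop words listed in the text; the full list is not given there.
STOP_WORDS = {"a", "an", "the", "their", "those"}


def phrase_to_feature(words):
    """Drop stop words and join the remaining words into one feature string.

    The underscore-joined encoding is an assumption made for this sketch.
    """
    kept = [w.lower() for w in words if w.lower() not in STOP_WORDS]
    return "_".join(kept)


# Word group taken from the earlier example "(NP (DT a) (JJ small) (NN gem))".
print(phrase_to_feature(["a", "small", "gem"]))
```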

We apply a Maximum Entropy Model to classify the data into two classes: subjective and objective. The data is divided into 10 folds; we used 8 folds as training data and 2 folds as testing data.
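The classification step can be sketched as follows. For two classes, a maximum entropy model is equivalent to logistic regression over binary features, so the sketch trains one weight per feature string with stochastic gradient ascent. The optimizer, learning rate, number of epochs, toy feature strings, and the way items are dealt into folds are all assumptions for illustration; the paper does not state which toolkit or settings were used.

```python
import math


def split_folds(items, n_folds=10, n_test=2):
    """Deal items into n_folds folds; the last n_test folds form the test set."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    train = [x for fold in folds[:-n_test] for x in fold]
    test = [x for fold in folds[-n_test:] for x in fold]
    return train, test


def train_maxent(samples, epochs=200, lr=0.5):
    """Binary maximum entropy (logistic regression) over sets of features.

    samples: list of (set_of_feature_strings, label), label 1 = subjective,
    label 0 = objective. Trained by plain stochastic gradient ascent.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in samples:
            score = bias + sum(weights.get(f, 0.0) for f in feats)
            prob = 1.0 / (1.0 + math.exp(-score))
            grad = label - prob          # gradient of the log-likelihood
            bias += lr * grad
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * grad
    return weights, bias


def predict(model, feats):
    """Return 1 (subjective) if the model score is positive, else 0."""
    weights, bias = model
    score = bias + sum(weights.get(f, 0.0) for f in feats)
    return 1 if score > 0 else 0


# Toy data with hypothetical feature strings, only to show the interface.
toy = [
    ({"small_gem", "able_to_inject"}, 1),
    ({"feel_better_already"}, 1),
    ({"police_officer", "goes_to_town"}, 0),
    ({"brother_and_sister"}, 0),
]
model = train_maxent(toy)
print([predict(model, feats) for feats, _ in toy])
```

With real data, `split_folds` would be applied to the list of labelled sentences first, and the model would be trained on the first part and scored on the second.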

No. | Pattern | Quote | Plot
1 | [ADJP][TO][VB] | 228 | 143
2 | [ADJP][CC][ADJP] | 514 | 185
3 | [VP contains VBZ/VBG][ADJ] | 849 | 422
4 | [ADJP contains only JJ] | 2260 | 1102
5 | [VP contains [ADVP][VB/VBN/...] | 25 | 25
6 | [VP] with[PRT] or not][ADVP]... | 1012 | 769
7 | [VP][ADVP][ADJP] | 11 | 14
8 | [ADJP][ADVP][JJ] | 114 | 14
9 | [ADVP][VP] | 701 | 611
10 | [PP contain RB][NP] | 42 | 34
11 | ADV | 1233 | 1073
12 | [VP][TO] [VP] | 415 | 828
13 | [MD][VP][TO][VP] | 671 | 472
14 | [VP][VBN]/[VBG]/[NN] | 2297 | 3387
15 | [VP][PP/PRT] | 7469 | 8471
16 | [NP contains JJ/JJR/JJS | 4092 | 4138
17 | [NP contains JJ/VBN/VBG][CC][NP]... | 428 | 343
18 | [NP][POS][NP] | 886 | 865
19 | [NP][IN][NP] | 2646 | 2987
20 | [NP contains [QP][NN]] | 22 | 27
21 | [NP contains NN/NNS] | 1282 | 2268
22 | [NP NN/NNS CC NN/NS] | 372 | 561

TABLE V. NUMBER OF WORD GROUPS EXTRACTED BY THE 22 PATTERNS
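One way to read Table V is to compute, for each pattern, the share of its extracted word groups that come from quote sentences. The sketch below does this for a few rows copied from the table. The 0.55 cut-off used to call a pattern skewed toward one class is an arbitrary choice made for this illustration, and the function names are hypothetical.

```python
# (quote count, plot count) for a few rows of Table V.
COUNTS = {
    "[ADJP][TO][VB]": (228, 143),
    "[ADJP][ADVP][JJ]": (114, 14),
    "[VP][TO] [VP]": (415, 828),
    "[MD][VP][TO][VP]": (671, 472),
    "[NP][POS][NP]": (886, 865),
    "[NP contains NN/NNS]": (1282, 2268),
}


def quote_share(quote, plot):
    """Fraction of a pattern's word groups that come from quote sentences."""
    return quote / (quote + plot)


def skew(pattern, threshold=0.55):
    """Label a pattern by the class it leans toward (threshold is assumed)."""
    share = quote_share(*COUNTS[pattern])
    if share >= threshold:
        return "quote"
    if share <= 1 - threshold:
        return "plot"
    return "balanced"


for name in COUNTS:
    print(f"{name}: {quote_share(*COUNTS[name]):.2f} -> {skew(name)}")
```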

B. Evaluation

We compare our results with baseline methods.

In [12], Pang and Lee explored extraction methods based on a minimum cut formulation. They retained only 60% of the words of the original reviews and used NB and SVM classifiers to separate subjective from objective sentences. Also using the Movie Review data, they performed 10-fold cross validation. Their best results are 86.4% and 86.15% accuracy with NB and SVM, respectively.

Riloff et al. [5] used a subsumption hierarchy to extract features and then used SVM to classify subjective data. Their result is 82.7% accuracy.

In our method, we use Maximum Entropy to classify the linguistic features extracted from all sentences with the syntax-based patterns. We retain 66% of the words of the original reviews, and our method achieves 92.1% accuracy (using 10-fold cross validation) for identifying subjective sentences. The comparison of our method with previous research is presented in Table VI.

Method | Accuracy
Our method | 92.1%
NB+Prox (Pang and Lee, 2004) | 86.4%
SVM+Prox (Pang and Lee, 2004) | 86.15%
Riloff06 | 82.7%

TABLE VI. COMPARISON OF OUR METHOD WITH BASELINE SUBJECTIVITY CLASSIFIERS

IV. CONCLUSIONS

This paper focuses on subjectivity classification, for which we have proposed a new approach to feature extraction based on syntactic information. We used linguistic patterns to extract word collocations involving four types of parts of speech: adjectives, adverbs, verbs and some extensions of verbs, and nouns. The MEM classifier was applied to determine whether a sentence belongs to the subjective or the objective class. Our experiment obtains a high accuracy (92.1%) on the Movie Review data, which suggests that the proposed features perform better than those of the previous studies we compared against. We believe that this approach to feature extraction can be used not only for English but also for other languages. In the future, we will investigate linguistic features in other languages such as Vietnamese.

ACKNOWLEDGEMENT

This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University of Hanoi. This work is also supported by the project "KC.01.TN04/11-15".

REFERENCES

[1] J. Wiebe, "Learning subjective adjectives from corpora," in AAAI/IAAI, 2000, pp. 735–740.

[2] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques," CoRR, vol. cs.CL/0205070, 2002.

[3] P. D. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," in ACL, 2002, pp. 417–424.

[4] J. Wiebe, T. Wilson, R. F. Bruce, M. Bell, and M. Martin, "Learning subjective language," Computational Linguistics, vol. 30, no. 3, pp. 277–308, 2004.

[5] E. Riloff, S. Patwardhan, and J. Wiebe, "Feature subsumption for opinion analysis," in EMNLP, 2006, pp. 440–448.

[6] G. Qiu, B. Liu, J. Bu, and C. Chen, "Opinion word expansion and target extraction through double propagation," Computational Linguistics, vol. 37, no. 1, pp. 9–27, 2011.

[7] L. Zhang and B. Liu, "Identifying noun product features that imply opinions," in ACL (Short Papers), 2011, pp. 575–580.

[8] M. Sokolova and G. Lapalme, "Opinion classification with non-affective adjectives and adverbs," in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'2009), Borovets, Bulgaria, Sep. 2009.

[9] C. Long, J. Zhang, and X. Zhu, "A review selection approach for accurate feature rating estimation," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ser. COLING '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 766–774.

[10] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, "Lexicon-based methods for sentiment analysis," Comput. Linguist., vol. 37, no. 2, pp. 267–307.

[11] H. Yu and V. Hatzivassiloglou, "Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences," in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 129–136.

[12] B. Pang and L. Lee, "A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ser. ACL '04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004.

[13] S.-M. Kim and E. Hovy, "Determining the sentiment of opinions," in Proceedings of the 20th International Conference on Computational Linguistics, ser. COLING '04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004. [Online]. Available: http://dx.doi.org/10.3115/1220355.1220555
