Linguistic Features for Subjectivity classification
Huong Nguyen Thi Xuan1,2, Anh Cuong Le2, Le Minh Nguyen3
1 Haiphong Private University, 36 Danlap, Duhangkenh, Lechan, Haiphong, Vietnam
2 University of Engineering and Technology, Vietnam National University, E3-144 Xuanthuy, Caugiay, Hanoi, Vietnam
3 School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan
huong [email protected], [email protected], [email protected]
Abstract—Opinions are subjective expressions that describe people's viewpoints, perspectives, or feelings about entities, events, and their properties. Subjectivity detection is the task of identifying whether a given text is subjective (i.e. an opinion) or objective (i.e. a report of fact). It is usually the first step in opinion mining and sentiment analysis, a field that is attracting many researchers because of its wide applicability. Improvements in subjectivity classification therefore have a positive impact on the performance of a sentiment analysis system. Features play the most important role in identifying subjective sentences accurately. In this paper, we enrich the feature set by using syntactic information from the text. Based on our observation of opinion evidence in texts, we propose syntax-based patterns for extracting rich linguistic features. Combining these new features with conventional features from previous studies, we obtain a high accuracy (about 92.1%) for detecting subjective sentences on the Movie Review data.
I. INTRODUCTION
Opinion mining and sentiment analysis aim to crawl, extract, and analyze people's opinions shared on the Internet, such as product reviews, public reactions to political events, scientific blogs, fashion and music trends, and so on. The results are useful not only to other consumers but also to archivists, companies looking for feedback on both their own products and those of their rivals, governments, and social historians, amongst others. This problem has therefore recently become a hot topic, especially in the natural language processing community.
Sentiment classification is the problem of classifying an opinion text by its polarity (i.e. positive, negative, or neutral). Because the result of polarity classification directly affects users, it is important and attracts most researchers working on the problem. However, many real opinion mining and sentiment analysis systems must first crawl texts from websites and then determine whether a text contains opinion (i.e. subjective information) or not (i.e. objective information) before proceeding to polarity classification. This task is known as subjectivity classification, and it is the focus of this paper.
To detect viewpoints in documents or sentences, most previous studies have focused on lexical characteristics. Many methods have been introduced for finding the words that express opinion features: single words [1], word sequences generated from the text (n-grams) [2], phrases extracted using patterns [3], syntactic information [4], [5], and dependency relations [6]. Most of these studies focus on adjectives, adverbs, and verbs and some of their extensions to determine features. Such word features are given by users to express an opinion about an object, for example "excellent", "good", "bad", ... Recently, Zhang and Liu [7] introduced a method for identifying single nouns and noun phrases as features that imply opinions. Sokolova and Lapalme [8] presented a method that uses non-affective adjectives and adverbs, supplemented by degree pronouns, mental verbs, and modal verbs, to determine the positive or negative opinion label. Long et al. [9] extracted aspects based on the frequency of dependent words (adjectives) on the Web, which are then used to select reviews. Taboada et al. [10] presented a word-based method for extracting sentiment from texts; they calculated the Semantic Orientation of parts of speech such as nouns, adverbs, and verbs, building on previous research on adjectives.
After the features are extracted from the data, many statistical machine learning classifiers have been applied to the sentiment classification task: Pang and Lee used Support Vector Machines (SVM), Naive Bayes (NB), and Maximum Entropy (ME) in [2]; Yu and Hatzivassiloglou used NB in [11]; Riloff et al. used SVM in [5]; and so on. These previous studies achieved fairly good results on different domains.
In this paper, we propose a new approach to lexical extraction that is based on syntactic information. We investigate four types of words (adjectives, adverbs, verbs and some extensions of verbs, and nouns) to extract linguistic features that may appear in both subjective and objective sentences. We not only find fixed-size groups of words, as in some previous research [1], [2], [3], [4], but also extract word groups of varying size, relying on syntactic information, that may express opinion.
Take the following example:
"color, musical bounce and warm seas lapping on island shores; and just enough science to send you home thinking."
By applying syntax-based patterns, we collect the following four phrases:
1) ( NP ( JJ musical ) ( NN bounce ) / ( CC and ) / ( JJ warm ) ( NNS seas ) )
2012 International Conference on Asian Language Processing
978-0-7695-4886-9/12 $26.00 © 2012 IEEE
DOI 10.1109/IALP.2012.47
2) ( VP ( VBG lapping ) / ( PP ( IN on )
3) ( NP ( RB just ) ( JJ enough ) ( NN science ) )
4) ( VP ( VB send ) / ( NP ( NN home ) ( NN thinking ) ) )
Moreover, in [5] Riloff et al. used a subsumption hierarchy to identify complex features from unigrams, bigrams, and extraction patterns. Their extraction patterns are generated from unigrams and bigrams. By combining these features, they could extract complex features that outperform simpler ones. In our research, we rely on analyzing the occurrence of adjectives, adverbs, verbs, and nouns in syntactic trees to derive syntax-based patterns for extracting complex features.
Sokolova and Lapalme [8] considered modal verbs, mental verbs, and other kinds of words to identify sentiment polarity. They created a seed list of effective words and added their synonyms from an electronic version of Roget's thesaurus. The disadvantage of their approach is that it depends on the selected seed words and on the quality of the thesaurus. In our research, we first identify various types of features based on rich linguistic information, and then use a Maximum Entropy Model (MEM) classifier to separate subjective from objective sentences. Our experiment is carried out on the movie review data used by Pang and Lee [12]. Our method reaches 92.1% accuracy in detecting subjective sentences.
In Section 2 we present our proposal for identifying features. The experiment and evaluation are presented in Section 3, and Section 4 concludes the paper.
II. FEATURE EXTRACTION
In our research, we investigate four types of words (adjectives, adverbs, verbs and some extensions of verbs, and nouns) to create syntax-based patterns. We extract linguistic features with these patterns to distinguish subjective sentences from objective sentences. We chose 22 syntax-based patterns by analyzing how often the features appear in subjective and objective sentences. The difference between the two classes is then learned by an MEM classifier.
A. Syntactic Information
To extract linguistic features from sentences, we use syntactic information generated by the Stanford Parser. The Stanford Parser is a natural language parser developed by the Stanford Natural Language Processing Group. It uses probabilistic methods to produce parse trees for texts.
Take the following example sentence:
"smart and alert, thirteen conversations about one thing is a small gem."
The Stanford Parser generates the following output:
(ROOT/(S/(NP/(NP/(ADJP (JJ smart)/(CC and)/(JJ alert) (, ,) (JJ thirteen))/(NNS conversations))/(PP (IN about)/(NP (CD one) (NN thing))))/(VP (VBZ is)/(NP (DT a) (JJ small) (NN gem)))/(. .)))
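To make the parser output easier to work with, the bracketing can be read into a nested structure. The following is a minimal sketch rather than part of the original system: it assumes the "/" characters in the printed output mark line breaks, and the helper names `parse_tree` and `leaves` are hypothetical.

```python
import re


def parse_tree(text):
    """Parse a Penn Treebank style bracketing into nested (label, children) tuples.

    A leaf such as (JJ smart) becomes ("JJ", ["smart"]).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of a subtree in sentence order."""
    words = []
    for child in node[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# The example parse from the text, with "/" replaced by line breaks (assumption).
PARSE = """
(ROOT
  (S
    (NP
      (NP
        (ADJP (JJ smart) (CC and) (JJ alert) (, ,) (JJ thirteen))
        (NNS conversations))
      (PP (IN about) (NP (CD one) (NN thing))))
    (VP (VBZ is) (NP (DT a) (JJ small) (NN gem)))
    (. .)))
"""

tree = parse_tree(PARSE)
print(tree[0])
print(" ".join(leaves(tree)))
```

Each node keeps its constituent label, so later steps can walk the tree and test for label sequences such as ADJP, VP, or NP.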
1) Syntax-based patterns that include adjectives: Adjectives have been used in most studies related to sentiment analysis. In our method, we use four patterns that contain adjectives; they yield a single adjective, an adjective phrase, two adjective phrases connected by a conjunction, or a group of words from an adjective structure. In this way we find the various types of adjectives occurring in a sentence that may express the user's opinion. All syntax-based patterns that include adjectives are described in Table I. In the pattern tables, the symbol "/" indicates that the tag following it is an alternative option.
As an example, we use the pattern [ADJP] [TO] [VB] to extract features from a subjective sentence:
"writer/director walter hill is in his hyper masculine element here, once again able to inject some real vitality and even art into a pulpy concept that, in many other hands would be completely forgettable."
The adjective phrase is extracted as follows:
(ADJP (JJ able)/(S/(VP (TO to)/(VP (VB inject)
Pattern                                       Description
[ADJP] [TO] [VB]                              The adjective in this pattern expresses the user's ability or comment to do something.
[ADJP] [CC] [ADJP] or [ADJP contains CC JJ]   Two adjective phrases that are connected by a conjunction.
[VP contains VBZ/VBG] [ADJ]                   The adjective describes information about, or a review of, entities or objects.
[ADJP contains only JJ]                       The adjective acts as an exclamation word in the sentence and expresses subjectivity.
TABLE I: SYNTAX-BASED PATTERNS THAT INCLUDE ADJECTIVES
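One way to apply a pattern such as [ADJP] [TO] [VB] is to walk the parse tree and test label sequences. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and `FRAGMENT` is a hand-built, simplified bracketing of "able to inject some real vitality" rather than actual Stanford Parser output.

```python
import re


def parse_tree(text):
    """Read a bracketed parse into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read_node()


def subtrees(node):
    """Yield a node and all of its descendant constituents."""
    yield node
    for child in node[1]:
        if not isinstance(child, str):
            yield from subtrees(child)


def child_with_label(node, labels):
    """Return the first direct child whose label is in `labels`, or None."""
    for child in node[1]:
        if not isinstance(child, str) and child[0] in labels:
            return child
    return None


def match_adjp_to_vb(tree):
    """Collect (adjective, "to", verb) triples for the [ADJP] [TO] [VB] pattern."""
    matches = []
    for adjp in subtrees(tree):
        if adjp[0] != "ADJP":
            continue
        adjective = child_with_label(adjp, {"JJ"})
        if adjective is None:
            continue
        for vp in subtrees(adjp):
            if vp[0] != "VP":
                continue
            to = child_with_label(vp, {"TO"})
            inner = child_with_label(vp, {"VP"})
            verb = child_with_label(inner, {"VB"}) if inner else None
            if to and verb:
                matches.append((adjective[1][0], to[1][0], verb[1][0]))
    return matches


# Hand-built illustrative fragment (assumption, not real parser output).
FRAGMENT = "(ADJP (JJ able) (S (VP (TO to) (VP (VB inject) (NP (DT some) (JJ real) (NN vitality))))))"

print(match_adjp_to_vb(parse_tree(FRAGMENT)))
```

The other patterns in Table I could be handled the same way, with one small matcher per pattern that checks the labels of a node and its children.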
2) Syntax-based patterns that include adverbs: In English, an adverb modifies a verb, an adjective, another adverb, a clause, or a sentence. Adverbs are placed directly before or after the object or verb, and can also appear at the beginning, in the middle, or at the end of a sentence. Turney [3] extracted two-word phrases in which an adverb modifies a verb or an adjective; Taboada et al. [10] considered adverbs that occur near a noun, verb, or adjective; Riloff et al. [5] used a subsumption hierarchy over unigrams, bigrams, and extraction patterns that contain some adverb collocations. To find the linguistic features that contain adverbs, we use seven syntax-based patterns, described in Table II.
Take the following subjective sentence as an example:
"provides the kind of 'laugh therapy' i need from movie comedies - offbeat humor, amusing characters, and a happy ending; after seeing 'analyze that, ' i feel better already."
The adverb phrase that follows and modifies the verb is extracted using the pattern [VP][ADVP] [ADVP] ...:
(VP (VBP feel)/(ADVP (RBR better))/(ADVP (RB already)))
3) Syntax-based patterns that include verbs: Much previous research deals with verbs and some extensions of verbs, which are used to find users' recommendations. Turney [3] calculated the semantic orientation of two-word phrases containing verbs or some extensions of verbs appearing with an adverb. Kim and Hovy [13] classified word sentiment, covering adjectives and verbs, before classifying sentence sentiment.
Pattern                                              Description
[VP contains [ADVP] [VB/VBN/VBG/VBZ/VBD]]            The adverb that modifies the verb.
[VP] [with [PRT] or not] [ADVP] [with [PP] or not]   The adverb phrase that modifies a verb that has a prepositional phrase.
[VP] [ADVP] [ADJP]                                   The adverb modifies an adjective.
[ADJP] [ADVP] [JJ]                                   The adjective phrase contains adverb phrases that modify the adjective.
[ADVP] [VP]                                          The adverb phrases modify an intransitive verb.
[PP contain RB] [NP]                                 The prepositional phrase includes an adverb before the noun phrase.
[ADVP]                                               The adverb modifies the verb and is placed at the end of the sentence.
TABLE II: SYNTAX-BASED PATTERNS THAT INCLUDE ADVERBS
Sokolova and Lapalme [8] used modal verbs, mental verbs, and other kinds of words to classify texts as positive or negative. Taboada et al. [10] extracted sentiment from texts based on nouns, adverbs, verbs, and adjectives. In our method, we extract phrases that include verbs and that express the purpose of an action, describe events, or contain a preposition that reflects sentiment. The syntactic patterns are described in Table III.
Pattern                  Description
[VP] [TO] [VP]           The verb phrases that express a purpose.
[MD] [VP] [TO] [VP]      The modal verb is used to express ability or the subjunctive mood about entities or objects. It is often used in conditional sentences, so it may express opinion.
[VP] [VBN]/[VBG]/[NN]    The description of an entity or an event; it may express an opinion or a fact.
[VP] [PP/PRT]            The verb phrase that describes an action or a state of entities or objects.
TABLE III: SYNTAX-BASED PATTERNS THAT INCLUDE VERBS
Take the following subjective sentence as an example:
"fans of behan's work and of irish movies in general will be rewarded by borstal boy."
We use the pattern [MD][VP][TO][VP] to extract the following phrase:
(VP (MD will)/(VP (VB be)/(VP (VBN rewarded)/(PP (IN by)
4) Syntax-based patterns that include nouns: In general, nouns and noun phrases are used to describe facts. Noun features have been investigated in some research to determine subjectivity. Turney [3] extracted two-word phrases that include adjectives. Taboada et al. [10] calculated the Semantic Orientation of nouns that express a review. In our model, we extract noun phrases that may occur in both subjective and objective sentences. The patterns we collected are described in Table IV.
Pattern                                                                      Description
[NP contains JJ/JJR/JJS/RB/RBR/RBS/VBG/VBN]                                  The noun phrase includes parts of speech such as comparative adjectives and adverbs; present and past participles can be used as adjectives.
[NP contains JJ/VBN/VBG] [CC] [NP] or [NP] CC [NP contains JJ/VBN/VBG]       Two noun phrases that are connected by a conjunction. It is a description of an object.
[NP NN/NNS CC NN/NNS]                                                        Two noun phrases that are connected by a conjunction. It is a description of an object.
[NP] [POS] [NP]                                                              Two noun phrases in the possessive case. This collocation may or may not express an opinion.
[NP] [IN] [NP]                                                               Two noun phrases that are connected by a preposition or subordinating conjunction.
[NP contains [QP] [NN]]                                                      The noun phrase includes a quantity phrase, which may express an opinion.
[NP contains NN/NNS NN/NNS]                                                  The noun phrase includes two nouns, which may reflect the user's opinion.
TABLE IV: SYNTAX-BASED PATTERNS THAT INCLUDE NOUNS
Take the following subjective sentence as an example:
"the story that emerges has elements of romance , tragedy and even silent-movie comedy."
We apply the pattern [NP contains JJ/VBG/VBN] [CC] [NP] or [NP] CC [NP contains JJ/VBG/VBN], and the noun phrase is extracted as follows:
(NP/( NP ( NN tragedy ) ) / ( CC and ) /( NP ( RB even ) (JJ silent-movie ) ( NN comedy )
The noun "tragedy" has the same sentiment orientation as the phrase "even silent-movie comedy".
III. EXPERIMENT AND EVALUATION
A. Experiment
Our experiment is conducted on the Movie Review data introduced by Pang and Lee [12]. This data set contains 5,000 subjective sentences and 5,000 objective sentences. First, to pre-process the data, we removed the white space inside abbreviations (e.g. "m . r ." was changed to "m.r."). We also inserted punctuation at the end of each sentence. These steps ensure that sentences are separated correctly before the Stanford Parser is applied. To obtain syntactic information, we applied the Stanford Parser. Using the 22 proposed syntax-based patterns, we extract the linguistic features used to distinguish subjective from objective sentences. Table V reports the number of word groups extracted by each of the 22 syntax-based patterns (introduced in Section 2) from the 5,000 quote (subjective) sentences and the 5,000 plot (objective) sentences. From this table, we can see that the four patterns that include adjectives, some patterns that include adverbs such as [VP][with [PRT] or not][ADVP] and [ADJP][ADVP][JJ], the modal-verb pattern [MD][VP][TO][VP], and the noun pattern [NP contains JJ/VBN/VBG][CC][NP] have a higher frequency in quote sentences than in plot sentences. In contrast, the verb pattern [VP][TO][VP] and the noun pattern [NP contains NN/NNS] have a higher frequency in plot sentences than in quote sentences. The remaining patterns have roughly the same frequency in both types of sentence. Note that before applying the classification model, we remove all stop words from the extracted data. The stop words include "a", "an", "the", "their", "those", ...
We apply a Maximum Entropy Model to classify the data into two classes: subjective and objective. The data is divided into 10 folds; we used 8 folds for training and 2 folds for testing.
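For two classes, a Maximum Entropy Model is equivalent to logistic regression over binary features. The sketch below illustrates that idea only and is not the classifier or toolkit used in the experiment: the training procedure (plain stochastic gradient ascent), the function names, and the toy data are all assumptions. The subjective toy phrases reuse words from the examples above, while the objective ones are made up for illustration.

```python
import math
from collections import defaultdict


def predict_proba(weights, feats):
    """Probability that a sentence with these features is subjective."""
    score = weights.get("__bias__", 0.0) + sum(weights.get(f, 0.0) for f in set(feats))
    score = max(-30.0, min(30.0, score))  # keep exp() well within range
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit a binary MaxEnt (logistic regression) model by stochastic gradient ascent.

    samples: list of (feature_list, label), label 1 = subjective, 0 = objective.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in samples:
            error = label - predict_proba(weights, feats)
            for f in set(feats) | {"__bias__"}:
                weights[f] += lr * error  # gradient of the log-likelihood
    return weights


# Toy data (hypothetical): words from extracted phrases after stop-word removal.
TRAIN = [
    (["musical", "bounce", "warm", "seas"], 1),
    (["just", "enough", "science"], 1),
    (["feel", "better", "already"], 1),
    (["small", "gem"], 1),
    (["moves", "to", "city"], 0),            # made-up objective phrases
    (["meets", "old", "friend"], 0),
    (["police", "officer", "investigates"], 0),
    (["returns", "home"], 0),
]

model = train_maxent(TRAIN)
print(predict_proba(model, ["small", "gem"]))
print(predict_proba(model, ["returns", "home"]))
```

In a real setting, the feature lists would come from the 22 syntax-based patterns, and the model would be trained and evaluated on separate folds of the data.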
No.  Pattern                                   Quote   Plot
1    [ADJP][TO][VB]                            228     143
2    [ADJP][CC][ADJP]                          514     185
3    [VP contains VBZ/VBG][ADJ]                849     422
4    [ADJP contains only JJ]                   2260    1102
5    [VP contains [ADVP][VB/VBN/...]           25      25
6    [VP][with [PRT] or not][ADVP]...          1012    769
7    [VP][ADVP][ADJP]                          11      14
8    [ADJP][ADVP][JJ]                          114     14
9    [ADVP][VP]                                701     611
10   [PP contain RB][NP]                       42      34
11   ADV                                       1233    1073
12   [VP][TO][VP]                              415     828
13   [MD][VP][TO][VP]                          671     472
14   [VP][VBN]/[VBG]/[NN]                      2297    3387
15   [VP][PP/PRT]                              7469    8471
16   [NP contains JJ/JJR/JJS...]               4092    4138
17   [NP contains JJ/VBN/VBG][CC][NP]...       428     343
18   [NP][POS][NP]                             886     865
19   [NP][IN][NP]                              2646    2987
20   [NP contains [QP][NN]]                    22      27
21   [NP contains NN/NNS]                      1282    2268
22   [NP NN/NNS CC NN/NNS]                     372     561
TABLE V: NUMBER OF WORD GROUPS EXTRACTED BY EACH OF THE 22 PATTERNS
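To see which patterns lean towards quote (subjective) or plot (objective) sentences, the counts in Table V can be turned into a quote share, quote / (quote + plot). The short sketch below only re-expresses the table's numbers; the names `TABLE_V` and `quote_share` are hypothetical, and the ranking is an illustrative reading of the table, not an analysis from the original experiment.

```python
# Counts from Table V, keyed by pattern number: (quote, plot).
TABLE_V = {
    1: (228, 143), 2: (514, 185), 3: (849, 422), 4: (2260, 1102),
    5: (25, 25), 6: (1012, 769), 7: (11, 14), 8: (114, 14),
    9: (701, 611), 10: (42, 34), 11: (1233, 1073), 12: (415, 828),
    13: (671, 472), 14: (2297, 3387), 15: (7469, 8471), 16: (4092, 4138),
    17: (428, 343), 18: (886, 865), 19: (2646, 2987), 20: (22, 27),
    21: (1282, 2268), 22: (372, 561),
}


def quote_share(quote, plot):
    """Fraction of a pattern's extractions that come from quote sentences."""
    return quote / (quote + plot)


# Rank the patterns from most quote-leaning to most plot-leaning.
ranked = sorted(TABLE_V, key=lambda n: quote_share(*TABLE_V[n]), reverse=True)

for number in ranked:
    quote, plot = TABLE_V[number]
    print(f"pattern {number:2d}: quote share = {quote_share(quote, plot):.3f}")
```

A share well above 0.5 marks a pattern that occurs mostly in quote sentences, and a share well below 0.5 marks one that occurs mostly in plot sentences.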
B. Evaluation
We compare our results with baseline methods.
In [12], Pang and Lee explored extraction methods based on a minimum cut formulation. They retained only 60% of the words from the original reviews and used NB and SVM classifiers to separate subjective from objective sentences. Also using the Movie Review Data, they performed 10-fold cross-validation. Their best accuracies are 86.4% with NB and 86.15% with SVM.
Riloff et al. [5] used a subsumption hierarchy to extract features and then used SVM to classify subjective data. Their result is 82.7% accuracy.
In our method, we use Maximum Entropy to classify the linguistic features extracted from all sentences with syntax-based patterns. We retain 66% of the words from the original reviews, and our method achieves 92.1% accuracy (using 10-fold cross-validation) in identifying subjective sentences. Table VI compares our method with previous research.
Method                            Accuracy
Our method                        92.1%
NB+Prox (Pang and Lee, 2004)      86.4%
SVM+Prox (Pang and Lee, 2004)     86.15%
Riloff06                          82.7%
TABLE VI: COMPARISON OF OUR METHOD WITH BASELINE SUBJECTIVITY CLASSIFIERS
IV. CONCLUSIONS
This paper focuses on subjectivity classification, for which we have proposed a new approach to feature extraction based on syntactic information. We used linguistic patterns to extract word collocations over four types of parts of speech: adjectives, adverbs, verbs and some extensions of verbs, and nouns. An MEM classifier was applied to determine whether a sentence belongs to the subjective or the objective class. Our experiment obtains a high accuracy (92.1%) on the Movie Review Data, which indicates that the proposed features perform better than those of the previous studies we compared against. We believe that this approach to feature extraction can be used not only for English but also for other languages. In the future, we will investigate linguistic features for other languages such as Vietnamese.
ACKNOWLEDGEMENT
This work is supported by the project "Studying Methods for Analyzing and Summarizing Opinions from Internet and Building an Application", which is funded by Vietnam National University of Hanoi. This work is also supported by the project "KC.01.TN04/11-15".
REFERENCES
[1] J. Wiebe, “Learning subjective adjectives from corpora,” in AAAI/IAAI, 2000, pp. 735–740.
[2] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? sentiment classification using machine learning techniques,” CoRR, vol. cs.CL/0205070, 2002.
[3] P. D. Turney, “Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews,” in ACL, 2002, pp. 417–424.
[4] J. Wiebe, T. Wilson, R. F. Bruce, M. Bell, and M. Martin, “Learning subjective language,” Computational Linguistics, vol. 30, no. 3, pp. 277–308, 2004.
[5] E. Riloff, S. Patwardhan, and J. Wiebe, “Feature subsumption for opinion analysis,” in EMNLP, 2006, pp. 440–448.
[6] G. Qiu, B. Liu, J. Bu, and C. Chen, “Opinion word expansion and target extraction through double propagation,” Computational Linguistics, vol. 37, no. 1, pp. 9–27, 2011.
[7] L. Zhang and B. Liu, “Identifying noun product features that imply opinions,” in ACL (Short Papers), 2011, pp. 575–580.
[8] M. Sokolova and G. Lapalme, “Opinion classification with non-affective adjectives and adverbs,” in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’2009), Borovets, Bulgaria, Sep. 2009.
[9] C. Long, J. Zhang, and X. Zhu, “A review selection approach for accurate feature rating estimation,” in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ser. COLING ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 766–774.
[10] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, “Lexicon-based methods for sentiment analysis,” Computational Linguistics, vol. 37, no. 2, pp. 267–307, 2011.
[11] H. Yu and V. Hatzivassiloglou, “Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences,” in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 129–136.
[12] B. Pang and L. Lee, “A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts,” in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ser. ACL ’04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004.
[13] S.-M. Kim and E. Hovy, “Determining the sentiment of opinions,” in Proceedings of the 20th International Conference on Computational Linguistics, ser. COLING ’04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004. [Online]. Available: http://dx.doi.org/10.3115/1220355.1220555