2012 International Conference on Asian Language Processing, 978-0-7695-4886-9/12 $26.00 © 2012 IEEE, DOI 10.1109/IALP.2012.49

[IEEE 2012 International Conference on Asian Language Processing (IALP) - Hanoi, Vietnam (2012.11.13-2012.11.15)] 2012 International Conference on Asian Language Processing - Mining


Mining Space-time Elements of Opinion

Tianfang Yao
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Shanghai, China
[email protected]

Jun Liu
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Shanghai, China
[email protected]

Wei Qiu
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Shanghai, China
[email protected]

Abstract—This paper introduces a new opinion model using space-time elements. It proposes the concept of an Opinion Importance Factor which is composed of a Time Importance Factor and a Source Importance Factor. Based on that, it analyzes the characteristics of the Time Importance Factor in different applications and divides the Source Importance Factor into two parts, the influence of source and the relatedness between source and domain. The experimental results show that space-time elements are important, and taking account of them can expand the scope of application of opinion mining and achieve more convincing results.

Keywords-text mining; opinion mining; space-time elements; time importance factor; source importance factor

I. INTRODUCTION

Along with the Web 2.0 technology revolution, there has been explosive growth in opinion texts [1] on the Internet. Opinion mining techniques have been developed for analyzing these texts. Kim & Hovy first proposed the concept of an opinion model [2]. Using that model, researchers have achieved fruitful results on identifying topics [3], sentimental orientation [4], etc. Pulse [5], WebFountain [6] and Opinion Observer [7] are typical opinion mining systems. However, the simple summarized information these systems produce does not meet the needs of many applications. For example, a product opinion mining system for decision-makers should be able to distinguish opinions by time and by authority level. In such new application requirements, the space-time elements of opinions also play a crucial role.

This paper first extends the model [2] by introducing space-time elements. Then it proposes the concept of Importance Factors for describing the degree of importance of opinions. It deduces formulae for computing an Importance Factor which is composed of space-time elements. Finally, the experimental results show that space-time elements are important, and taking account of them can expand the scope of application of opinion mining and achieve more convincing results.

II. AN EXTENDED OPINION MODEL

A. Topic Element

Topic includes product, people, organization, event, etc. There are two important parameters for a topic: one is the synonymous words or phrases for describing the topic; the other is the sub-topics of this topic. We define Topic as:

$$\mathit{Topic} = [\mathit{SynWordSet},\ \mathit{SubTopicSet}] \tag{1}$$

where SynWordSet is a set of synonymous words and SubTopicSet is a set of sub-topics.

B. Sentiment Element

There are two elements of sentiment: sentimental words / phrases and sentimental polarity. The values of sentimental polarity include positive, negative and neutral. We define Sentiment as:

$$\mathit{Sentiment} = [\mathit{SentiWord}] : \{\mathit{positive},\ \mathit{negative},\ \mathit{neutral}\} \tag{2}$$

C. A New Model of Opinion

We extend the original opinion model as follows:

$$\mathit{Opinion} = [\mathit{Holder},\ \mathit{Topic},\ \mathit{Sentiment},\ \mathit{Claim},\ \mathit{Time},\ \mathit{Source}] \tag{3}$$

The meaning of an opinion is that somebody (Holder) expresses a feeling (SentimentSet) for a particular theme (TopicSet) at a particular time (Time) and spot (Source) in a sentence (Claim). Namely,

$$\mathit{Opinion}: \mathit{Holder} \xrightarrow[\mathit{Time}/\mathit{Source}]{\mathit{Sentiment}} \mathit{TopicSet} \tag{4}$$

For any Topic ∈ TopicSet, Holder must a) use Word ∈ SynWordSet to describe Topic; b) express SentimentSet for this Topic.

III. OPINION IMPORTANCE FACTOR

In traditional opinion mining, all the opinions have the same importance. But in many applications such as opinion trend mining, the importance of two opinions is different even if they have the same content but are posted at different times or on different BBSs (Bulletin Board Systems). In order to quantify the importance of an opinion, we propose the Opinion Importance Factor (OIF). For any opinion o, OIF(o) is calculated by Formula (5):

$$OIF(o) = \begin{cases} \alpha T(o) + \beta S(o),\quad \alpha + \beta = 1, & T(o) \neq 0 \text{ and } S(o) \neq 0 \\ 0, & T(o) = 0 \text{ or } S(o) = 0 \end{cases} \tag{5}$$

The meaning of OIF(o) is described as follows:

• When Time Importance Factor T(o) and Source Importance Factor S(o) are not equal to zero, we can adjust the weights of T(o) and S(o) for OIF(o).

• If one of T(o) and S(o) is zero, it means opinion o is not important. For example, if we need to neglect opinion o from a particular time or source, then we can set T(o) or S(o) to 0.
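Formula (5) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `opinion_importance` and the default weights α = β = 0.5 are assumptions made here for the example.

```python
def opinion_importance(t: float, s: float,
                       alpha: float = 0.5, beta: float = 0.5) -> float:
    """Formula (5): weighted sum of T(o) and S(o), or 0 if either factor is 0.

    The default weights are an illustrative assumption; the formula only
    requires alpha + beta = 1.
    """
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    # An opinion with a zero time or source factor is neglected entirely.
    if t == 0 or s == 0:
        return 0.0
    return alpha * t + beta * s


# An opinion inside the time window (T = 1) from a source with S = 0.33:
print(opinion_importance(1.0, 0.33))
# The same opinion posted outside the time window is ignored:
print(opinion_importance(0.0, 0.33))
```

Setting either factor to zero acts as a filter, while the weights only matter for opinions that pass the filter.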

A. Time Importance Factor

Time is an attribute of objective existence in the world, and each opinion has its own posting time. Time can be used to distinguish the importance of opinions. For example, if you want to mine the most recent products from product reviews, then the latest reviews are more important than older reviews. However, this is only one possible relationship between the importance and timing of an opinion. The formula used to compute the Time Importance Factor differs between applications. Assume that for an opinion o the time is time(o) and the Time Importance Factor is T(o); Table I lists various Time Importance Factors for some applications.

TABLE I. VARIOUS TIME IMPORTANCE FACTORS FOR DIFFERENT APPLICATIONS

| Application | Relationship between Importance and Time | Time Importance Factor |
|---|---|---|
| Mining most recent products | The newer, the more important | $T(o) = \dfrac{1}{d(now,\ time(o))}$, where $d(now,\ time(o))$ is the number of days / months from $time(o)$ to now |
| Mining most recent products between time ξ and time ζ | Neglect all opinions whose times are not between ξ and ζ | $T(o) = \begin{cases} 1, & time(o) \in [\xi, \zeta] \\ 0, & time(o) \notin [\xi, \zeta] \end{cases}$ |
| Mining opinion trend for a particular product between time a and m | Classify the opinion by day / month | $T(o) = \begin{cases} \alpha, & time(o) \in [a, b) \\ \beta, & time(o) \in [b, c) \\ \ \vdots \\ \sigma, & time(o) \in [l, m] \\ 0, & time(o) \in \text{Other} \end{cases}$, where $\alpha, \beta, \ldots, \sigma \in (0, 1]$ |
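The three Time Importance Factors of Table I can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, d(now, time(o)) is measured in days, and clamping the distance to at least one day (to avoid division by zero for opinions posted today) is an assumption that the table does not state.

```python
from datetime import date


def t_recency(posted: date, now: date) -> float:
    """Row 1: T(o) = 1 / d(now, time(o)), with d measured in days."""
    # Assumption: clamp to 1 day so that same-day opinions do not divide by zero.
    days = max((now - posted).days, 1)
    return 1.0 / days


def t_window(posted: date, start: date, end: date) -> float:
    """Row 2: 1 inside the closed interval [start, end], 0 outside."""
    return 1.0 if start <= posted <= end else 0.0


def t_piecewise(posted: date, bounds: list[date], weights: list[float]) -> float:
    """Row 3: weights[k] on [bounds[k], bounds[k+1]); the last interval is closed."""
    if len(bounds) != len(weights) + 1:
        raise ValueError("need exactly one more bound than weights")
    for k, weight in enumerate(weights):
        low, high = bounds[k], bounds[k + 1]
        is_last = k == len(weights) - 1
        if low <= posted < high or (is_last and posted == high):
            return weight
    return 0.0  # time(o) falls outside every interval


print(t_recency(date(2009, 1, 1), now=date(2009, 1, 11)))
print(t_window(date(2008, 5, 1), date(2008, 1, 1), date(2009, 6, 30)))
print(t_piecewise(date(2008, 4, 1),
                  [date(2008, 1, 1), date(2008, 4, 1), date(2008, 7, 1)],
                  [0.5, 1.0]))
```

All three functions share the same shape (an opinion's date in, a factor out), so an application can swap one for another without touching the rest of the pipeline.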

B. Source Importance Factor

People often use phrases such as “top ten game websites” to express the importance of some sites in a particular domain. Unlike the time element, the Source Importance Factor is strongly related to the domain in which an opinion is expressed. For example, Amazon.com was first an online bookstore and then gradually expanded into online sales of electronic products. Among websites in China, Amazon.cn has a higher proportion of book reviews than of electronic product reviews.

Assume the reviews of a corpus come from n websites and the Importance Factor of website si for domain m is Ψ(si,m); then the Source Importance Factor S(o) of an opinion from website si is equal to Ψ(si,m). Following the research work [8], we divide the Importance Factor Ψ(si,m) of BBS si for domain m into two parts: the influence of the source and the relatedness of the source to domain m.

Influence of Source: This depicts a website's importance among similar websites. In this paper, the Influence of Source is characterized by three parameters: Daily Reach, Number of Link-in Pages, and Number of Pages indexed by search engines. Daily Reach is a good parameter for a website's popularity, since it indicates the number of users who visit the website. As in Google's PageRank, the Number of Link-in Pages indicates the importance of a website on the Internet. In addition, because search engines have become a necessary aid for netizens navigating the voluminous information on the Internet, the Number of Pages indexed by search engines indicates the potential access to a website.

Relatedness of Source and Domain: If people input a keyword K in a search engine and visit the website si by clicking on the links returned by the search engine, then we can determine that: a) the keyword K has some correlation with the website si, since otherwise the search engine would not return a link to website si for the search keyword K; b) the more often keyword K leads to the website si, the higher the relatedness of keyword K and website si. We can get the high-frequency search keywords of websites by mining search logs. For website si, if the high-frequency search keywords of si are KS, then we can say KS is a synonymous proxy for website si, because people visit website si frequently after searching for keywords in KS in the search engine. Therefore, the Relatedness of website si and Domain m is equal to the Relatedness of the high-frequency search keywords KS and Domain m.

C. Source Importance Factor Experiment

Experiment Domain: We choose two domains, “mobile phone” and “car”, and four Chinese BBSs for each domain as our subjects. These eight BBSs are subdomains of their portal websites and are shown in Table II. For example, “sjbbs.zol.com.cn” is a subdomain of “zol.com.cn”.

Influence of Source:

1) Daily Reach

alexa.com is an influential web information company. It provides Daily Reach, click stream, and high-frequency search keywords for first-level domain websites. alexa.com reports Daily Reach as a percentage: the percentage of users recorded by alexa.com who have visited the website.

Assume the number of Internet users recorded by alexa.com is Δ. If the Daily Reach of website si is di%, then the number of visitors of website si is Δ×di%. We define ψ11(si) as the proportion of website si's Daily Reach to the total Daily Reach of all websites, as shown in Equation (6).

$$\psi_{11}(s_i) = \frac{\Delta \times d_i\%}{\sum_{i=1}^{n} \Delta \times d_i\%} = \frac{d_i\%}{\sum_{i=1}^{n} d_i\%} \tag{6}$$

2) Link-in Pages and Pages Indexed by Search Engines

A search engine can provide data on a website's link-in pages and on the pages it has indexed. A search engine's advanced search syntax has two relevant keywords: link and site. Given the URL of website si, a link: query on that URL returns the number of link-in pages for website si, and a site: query on that URL returns the number of pages indexed by the search engine.

Assume the number of link-in pages for website si is li, and the number of pages indexed by search engines is pi. Then ψ12(si) is the ratio of website si's link-in pages to the link-in pages of all websites, and ψ13(si) is the ratio of website si's indexed pages to all pages indexed by search engines, as shown in Equation (7) and Equation (8) respectively.

$$\psi_{12}(s_i) = \frac{l_i}{\sum_{i=1}^{n} l_i} \tag{7}$$

$$\psi_{13}(s_i) = \frac{p_i}{\sum_{i=1}^{n} p_i} \tag{8}$$

TABLE II. RESULTS OF OPINION SOURCE IMPORTANCE FACTOR

| Websites | Daily Reach | Link-in Pages | Pages indexed by SE | ψ1 | ψ2 | ψ |
|---|---|---|---|---|---|---|
| bbs.imobile.com.cn | 0.0274 | 1037 | 6,280,000 | 0.28 | 0.35 | 0.315 |
| bbs.shouji.com.cn | 0.006 | 124 | 791,200 | 0.04 | 0.3 | 0.17 |
| sjbbs.zol.com.cn | 0.408 | 1055 | 1,002,800 | 0.27 | 0.23 | 0.25 |
| itbbs.pconline.com.cn | 0.465 | 1425 | 3,814,000 | 0.41 | 0.11 | 0.26 |
| bbs.pcauto.com.cn | 0.1112 | 1332 | 11,568,000 | 0.27 | 0.25 | 0.26 |
| bbs.chetx.com | 0.0308 | 3485 | 1,641,000 | 0.21 | 0.24 | 0.225 |
| club.xcar.com.cn | 0.0906 | 650 | 7,128 | 0.12 | 0.25 | 0.185 |
| club.autohome.com.cn | 0.1195 | 1442 | 23,870,000 | 0.4 | 0.26 | 0.33 |

3) Influence of Website

Using Equations (6), (7) and (8), we can calculate the Influence of a website. We get Equation (9):

$$\psi_1 = \alpha_{11}\psi_{11} + \alpha_{12}\psi_{12} + \alpha_{13}\psi_{13} \tag{9}$$

where $\alpha_{11} + \alpha_{12} + \alpha_{13} = 1$; here $\alpha_{11} = \alpha_{12} = \alpha_{13} = \frac{1}{3}$.

Relatedness of Website and Domain: We obtain the top 30 high-frequency search keywords for the eight websites from alexa.com; the relatedness of website si and domain m is then translated to the relatedness of the high-frequency search keywords KSi and domain m. We use an available relatedness formula for any two words, Relatedness(m,ki) [9], to get Equation (10) for calculating the relatedness of keywords KSi and the domain m:

$$R(m, KS_i) = \sum_{k_i \in KS_i} \frac{1}{rank(k_i)} \times Relatedness(m, k_i) \tag{10}$$

where rank(ki) is the frequency ranking of keyword ki and m is the domain “mobile” or “car”. Finally we can calculate the relatedness of website si and the domain m, as shown in Equation (11):

$$\psi_2(m, s_i) = \frac{R(m, KS_i)}{\sum_{i=1}^{n} R(m, KS_i)} \tag{11}$$
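Equations (10) and (11) can be sketched as below. The word-level Relatedness(m, k) measure of [9] is not reproduced here; the sketch takes it as a caller-supplied function, and `toy_relatedness`, the site names, and all scores are made-up values used purely for illustration.

```python
from typing import Callable, Dict, List

RelatednessFn = Callable[[str, str], float]


def keyword_relatedness(domain: str, ranked_keywords: List[str],
                        relatedness: RelatednessFn) -> float:
    """Equation (10): sum over keywords of (1 / rank) * Relatedness(domain, keyword).

    ranked_keywords is ordered by search frequency, so the first keyword has rank 1.
    """
    return sum(relatedness(domain, keyword) / rank
               for rank, keyword in enumerate(ranked_keywords, start=1))


def domain_relatedness_share(domain: str,
                             keywords_by_site: Dict[str, List[str]],
                             relatedness: RelatednessFn) -> Dict[str, float]:
    """Equation (11): normalize R(m, KS_i) over all sites so the shares sum to 1."""
    raw = {site: keyword_relatedness(domain, keywords, relatedness)
           for site, keywords in keywords_by_site.items()}
    total = sum(raw.values())
    return {site: (value / total if total else 0.0) for site, value in raw.items()}


# Hypothetical word-pair scores standing in for the measure of [9].
TOY_SCORES = {("mobile phone", "phone"): 0.9,
              ("mobile phone", "camera"): 0.4,
              ("mobile phone", "laptop"): 0.3}


def toy_relatedness(domain: str, keyword: str) -> float:
    return TOY_SCORES.get((domain, keyword), 0.0)


sites = {"site_a": ["phone", "camera"], "site_b": ["laptop", "phone"]}
print(domain_relatedness_share("mobile phone", sites, toy_relatedness))
```

Dividing by the rank means that a site whose most frequent search keywords are on-topic scores higher than one where the same keywords appear further down the list.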

Result: Using Equations (9) and (11), we can calculate the Source Importance Factor of an opinion, as shown in Equation (12):

$$\psi = \alpha_1\psi_1 + \alpha_2\psi_2 \tag{12}$$

where $\alpha_1 + \alpha_2 = 1$; here $\alpha_1 = \alpha_2 = \frac{1}{2}$.

Table II shows the experimental results for the Source Importance Factor. “Daily Reach” is a three-month daily reach; “Link-in Pages” and “Pages indexed by SE” are the total numbers from Baidu and Google. From these results we can draw the following conclusions:

1) alexa.com only provides information for first-level domains, so the data for Daily Reach and high-frequency search keywords are for the higher-level domain that hosts the BBS. The experimental results verify the rationality of this approach. For example, “itbbs.pconline.com.cn” is hosted at “pconline.com.cn”, which is a multi-field site. That BBS got the top score for Influence in the mobile phone domain, but the lowest score for relatedness to that domain. This reflects the fact that the host has heavy traffic but much of it is unrelated to this topic.

2) The Source Importance Factor of an opinion is a multi-factor parameter. Viewed from a different angle, the result will be slightly different. In this paper, we analyze the influence of a source and the relatedness of the source and the domain, and we use credible statistics to calculate the Source Importance Factor. From the results we can see that this method reflects the real situation of the websites.
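As a check on Equations (6) to (9) and (12), the following sketch recomputes the ψ1 and ψ columns for the four mobile phone BBSs from the raw columns of Table II. The ψ2 values are taken from the table as reported, because the underlying search keywords are not available; the function and variable names are illustrative, and normalizing within the four BBSs of one domain is an assumption that matches the reported ψ1 values.

```python
# Raw columns of Table II for the four mobile phone BBSs:
# (Daily Reach, link-in pages, pages indexed by search engines, psi_2 as reported)
MOBILE = {
    "bbs.imobile.com.cn":    (0.0274, 1037, 6_280_000, 0.35),
    "bbs.shouji.com.cn":     (0.006,   124,   791_200, 0.30),
    "sjbbs.zol.com.cn":      (0.408,  1055, 1_002_800, 0.23),
    "itbbs.pconline.com.cn": (0.465,  1425, 3_814_000, 0.11),
}


def share(values):
    """Equations (6)-(8): each value as a proportion of the column total."""
    total = sum(values)
    return [value / total for value in values]


def source_importance(rows, a1=0.5, a2=0.5):
    """Return {site: (psi_1, psi)} using equal weights as in the paper."""
    sites = list(rows)
    reach, links, pages, psi2 = zip(*(rows[site] for site in sites))
    psi11, psi12, psi13 = share(reach), share(links), share(pages)
    result = {}
    for i, site in enumerate(sites):
        psi1 = (psi11[i] + psi12[i] + psi13[i]) / 3   # Equation (9), weights 1/3
        psi = a1 * psi1 + a2 * psi2[i]                # Equation (12), weights 1/2
        result[site] = (psi1, psi)
    return result


for site, (psi1, psi) in source_importance(MOBILE).items():
    print(f"{site:24s} psi1={psi1:.2f} psi={psi:.3f}")
```

Rounded to two decimals, the computed ψ1 values agree with the ψ1 column of Table II (0.28, 0.04, 0.27, 0.41). The computed ψ values differ from the table only in the third decimal, which is consistent with the table's ψ having been derived from the already rounded ψ1.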

IV. OPINION TREND MINING

A. Mining Approach

People's opinions change over time; this is the key point in opinion trend mining. An opinion trend depicts the changes of opinions over time and records the development of events or things. It helps people grasp the real condition of things.

We define the opinion trend for product p as the line connecting its opinion scores along a time axis. We divide the time axis into t periods, and denote the opinion score of product p in period Ti by θi. Then the opinion trend for product p is the line through θi (i=1..t). The core issue of mining opinion trends is to calculate the opinion score θi of product p in period Ti. We use Equation (13) to calculate the “popularity” of product p in period Ti. For a single opinion o, the opinion score is strength(o)×OIF(o), where strength(o) is the sentimental strength [10] and OIF(o) is the Opinion Importance Factor. Suppose product p has n positive opinions and m negative opinions in period Ti:

$$popularity(p) = \frac{\sum_{i=1}^{n} strength(o_i) \times OIF(o_i)}{1 + \sum_{j=1}^{m} strength(o_j) \times OIF(o_j)} \tag{13}$$

where the numerator stands for positive opinions and the denominator stands for negative opinions.
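Equation (13) translates directly into code. In this minimal sketch each opinion is represented as a (strength, OIF) pair; that representation and the function name are assumptions made for the example.

```python
def popularity(positive, negative):
    """Equation (13) for one product in one period.

    positive and negative are iterables of (strength, oif) pairs for the
    positive and negative opinions respectively.
    """
    positive_score = sum(strength * oif for strength, oif in positive)
    negative_score = sum(strength * oif for strength, oif in negative)
    # The constant 1 keeps the ratio defined when there are no negative opinions.
    return positive_score / (1.0 + negative_score)


# Two positive opinions and one negative opinion, all from a source with OIF 0.5:
print(popularity([(1.0, 0.5), (2.0, 0.5)], [(1.0, 0.5)]))
```

Because the denominator starts at 1, a period with only positive opinions has a popularity equal to its weighted positive score, and a period with no opinions at all has popularity 0.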

B. Experiment

Experiment Subject and Corpus: We selected five kinds of mini-car as subjects: “Chery QQ3”, “BYD F0”, “Changhe Big Dipper”, “Geely Panda”, and “Changan Benben”, all made in China. For the experimental time, we define a quarter as a period and limit the scope to 2008 and the first half of 2009. The goal of this experiment is to mine the quarterly opinion trends for those cars over that period. The calculation of the Time Importance Factor is described by Equation (14), and the calculation of the Opinion Importance Factor by Equation (15). The Source Importance Factor has been calculated in Table II in Section III.

$$T(o) = \begin{cases} 1, & time(o) \in [\text{2008-01-01},\ \text{2009-06-30}] \\ 0, & time(o) \notin [\text{2008-01-01},\ \text{2009-06-30}] \end{cases} \tag{14}$$

$$OIF(o) = \begin{cases} S(o), & T(o) = 1 \\ 0, & T(o) = 0 \end{cases} \tag{15}$$

The corpus of this experiment comes from four BBSs of cars, including a total of 15819 opinions.
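The experimental setup, Equations (14) and (15) followed by a quarterly application of Equation (13), can be sketched as below. This is an illustration under stated assumptions rather than the system used in the experiment: the tuple layout of an opinion, the function names, and the sample opinions are hypothetical, and the end of the window is taken as 30 June 2009, the last day of the first half of 2009.

```python
from collections import defaultdict
from datetime import date

# Assumption: the experiment window is 2008 plus the first half of 2009.
START, END = date(2008, 1, 1), date(2009, 6, 30)


def time_factor(posted: date) -> float:
    """Equation (14): 1 inside the experiment window, 0 outside."""
    return 1.0 if START <= posted <= END else 0.0


def oif(posted: date, source_factor: float) -> float:
    """Equation (15): S(o) when T(o) = 1, otherwise 0."""
    return source_factor if time_factor(posted) == 1.0 else 0.0


def quarter(day: date) -> str:
    return f"{day.year}Q{(day.month - 1) // 3 + 1}"


def quarterly_trend(opinions):
    """opinions: iterable of (posted, polarity, strength, source_factor),
    where polarity is 'positive' or 'negative'. Returns {quarter: popularity}."""
    positive, negative = defaultdict(float), defaultdict(float)
    for posted, polarity, strength, source_factor in opinions:
        weight = oif(posted, source_factor)
        if weight == 0.0:
            continue  # opinion falls outside the window and is neglected
        bucket = positive if polarity == "positive" else negative
        if polarity in ("positive", "negative"):
            bucket[quarter(posted)] += strength * weight
    quarters = sorted(set(positive) | set(negative))
    # Equation (13), applied per quarter.
    return {q: positive[q] / (1.0 + negative[q]) for q in quarters}


sample = [
    (date(2008, 2, 1), "positive", 2.0, 0.33),
    (date(2008, 2, 15), "negative", 1.0, 0.26),
    (date(2009, 5, 1), "positive", 1.0, 0.33),
    (date(2010, 1, 1), "positive", 5.0, 0.33),  # outside the window
]
print(quarterly_trend(sample))
```

Plotting the returned values in quarter order for each car gives the kind of trend line that the experiment compares against a single gross popularity score.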

Results and Evaluation: We first use Equation (13) to calculate the quarterly “popularity” of the five mini-cars, and then draw their quarterly opinion trends. Figure 1 shows the quarterly trends for 2008 and the first half of 2009. Figure 2 shows the output of a traditional opinion mining system, which plots only the gross “popularity” of the mini-cars.

Comparing Figure 1 and Figure 2, we find that opinion trend mining has the following advantages:

1) It records the development of things.

In a traditional opinion mining system, as shown in Figure 2, “Chery QQ3” (the first car) is more critically acclaimed than “BYD F0” (the second car), and “Geely Panda” (the fourth car) is more recent than “Changhe Big Dipper” (the third car). However, from the opinion trend in Figure 1 we can clearly see that the gap between “BYD F0” and “Chery QQ3” is getting smaller and smaller, and “BYD F0” even outdid “Chery QQ3” in two quarters. Additionally, “Changhe Big Dipper” received poor opinions in early 2008 because of its high price, but after improvement it received more praise than “Geely Panda”.

Figure 1. Quarterly Opinion Trend on Mini-Cars.

Figure 2. Gross Popularity of Mini-Cars.

2) It is a guide for people.

As shown in Figure 1, “Chery QQ3” in 2008Q4 and “Geely Panda” in 2008Q3 show a significant drop in score, and this information can remind policy makers to pay attention to improvement.

V. CONCLUSION

This paper introduces space-time elements into an existing opinion model and mainly studies their relationship with the Opinion Importance Factor. Additionally, it explores the function of space-time elements in opinion trend mining applications. The experimental results show that space-time elements are important, and taking account of them can expand the scope of application of opinion mining and achieve more convincing results.

There are two future directions for this research work. First, this work only deals with the role of space-time elements in describing opinion importance; their roles extend well beyond that. Second, we only consider the impact of space-time elements when we propose the concept of the Opinion Importance Factor. In fact, other elements of an opinion can also change its importance. For example, if we have mined the user habits of a particular forum for the holder element before mining opinion trends on this forum, we can reduce the importance of opinions from people who tend to post useless comments, or even remove them. Additionally, we can add mining constraints such as age, gender, and area, which can also affect the importance of opinions.

REFERENCES

[1] Q. Liu, T. Yao, G. Huang, J. Liu, and H. Song, “Study on the Category Architecture of Chinese Opinioned-subjective Text,” Journal of Chinese Information Processing, vol. 22, no. 6, 2008, pp. 63–68. (in Chinese)

[2] S. M. Kim and E. Hovy, “Determining the Sentiment of Opinions,” Proceedings of COLING ’04, 2004, pp. 1367–es.

[3] M. Hu and B. Liu, “Mining Opinion Features in Customer Reviews,” Proceedings of Nineteenth National Conference on Artificial Intelligence (AAAI 2004), ACM Press. San Jose, 2004, pp. 755–760.

[4] P. D. Turney, “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews,” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 417–424.

[5] M. Gamon, A. S. Corston-Oliver and E. Ringger, “Pulse: Mining Customer Opinions from Free Text,” Lecture Notes in Computer Science, no. 3646, 2005, pp. 121–132.

[6] J. Yi and W. Niblack, “Sentiment Mining in WebFountain,” Proceedings of 21st International Conference on Data Engineering, 2005, pp. 1073–1083.

[7] M. Hu and B. Liu, “Mining and Summarizing Customer Reviews,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Seattle, ACM Press, 2004, pp. 168–177.

[8] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Proceedings of the Seventh International World Wide Web Conference, 1998, pp. 107–117.

[9] E. Gabrilovich and S. Markovitch, “Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis,” The 20th International Joint Conference on Artificial Intelligence, ACM Press, Hyderabad, 2007, pp. 1606–1611.

[10] T. Yao and D. Lou, “Research on Semantic Orientation Analysis for Topics in Chinese Sentences,” 7th International Conference on Chinese Computing, Wuhan, China, CIPS, 2007, pp. 221–225. (in Chinese)
