Additional2

+

MINING SIGNIFICANT WORDS FROM

CUSTOMER OPINIONS WRITTEN IN

DIFFERENT NATURAL LANGUAGES

Jan Žižka

František Dařena

Department

of

Informatics

Faculty of

Business

and

Economics

Mendel

University

in Brno

Czech

Republic

+ Introduction

Many companies collect opinions expressed

by their customers.

These opinions can hide valuable knowledge.

Discovering the knowledge by people can be

sometimes a very demanding task because

the opinion database can be very large,

the customers can use different languages,

the people can handle the opinions subjectively,

sometimes additional resources (like lists of positive

and negative words) might be needed.

+ Objective

For answering the question “What is

significant for including a certain

opinion into one of categories like

satisfied or dissatisfied customers?”

automatically extract words significant

for positive and negative customers'

opinions and to form not too large

dictionaries of these words.

+ Data description

Processed data included reviews of hotel clients collected from publicly available sources.

The reviews were labeled as positive and negative.

Reviews characteristics:

more than 5,000,000 reviews,

written in more than 25 natural languages,

written only by real customers, based on a real experience,

written relatively carefully but still containing errors that are typical for natural languages.

+ Review examples

Positive The breakfast and the very clean rooms stood out as the best

features of this hotel.

Clean and moden, the great loation near station. Friendly reception!

The rooms are new. The breakfast is also great. We had a really nice stay.

Good location - very quiet and good breakfast.

Negative High price charged for internet access which actual cost now

is extreamly low.

water in the shower did not flow away

The room was noisy and the room temperature was higher than normal.

The air conditioning wasn't working

+ Data preparation

Data collection, cleaning (removing tags, non-

letter characters), converting to upper-case.

Transforming into the bag-of-words

representation, term frequencies (TF) used as

attribute values.

Removing the words with global frequency < 2.

Stemming, stopwords removing, spell

checking, diacritics removal etc. were not

carried out.

+ Data characteristics

0

200000

400000

600000

800000

1000000

1200000

English French Spanish German Italian Russian Japan Czech

nu

mb

er

of

rev

iew

s

positive

negative

+ Data characteristics

0

50000

100000

150000

200000

250000

English German Japan French Spanish Italian Russian Czech

nu

mb

er

of

un

iqu

e w

ord

s

MinTF=1

MinTF=2

+ Finding the significant words

Thanks to having a large collection of labeled examples a classifier that separates positive and negative reviews could be created.

To reveal significant attributes (words) a decision tree was built using the tree-generating algorithm c5 (by R. Quinlan) based on entropy minimization.

The goal was not to achieve the best classification accuracy but to find relevant attributes that contribute to assigning a text to a given class.

The significant words appeared in the nodes of the decision tree.

+ An example of a decision tree

LOCATION > 0: :...POOR > 0: : :...GOOD > 0: _P (13) : : GOOD <= 0: : : :...EXCELLENT > 0: _P (3) : : EXCELLENT <= 0: : : :...GREAT > 0: _P (3) : : GREAT <= 0: : : :...CLEAN <= 0: _N (48/4) : : CLEAN > 0: _P (4/1) : POOR <= 0: : :...DIFFICULT > 0: : :...GOOD > 0: _P (6) : : GOOD <= 0: : : :...HELPFUL <= 0: _N (34/7) : : HELPFUL > 0: _P (5) ... ...

+ Finding the significant words

The classification accuracy which is proportional to the relevancy of words was between 83 – 93%.

The decision tree mostly asked if the frequency was > 0 or = 0 (binary representation).

The decision tree provides a list of about 200-300 words significant for classification from the sentiment perspective together with the significance (i.e. the frequency of using the words during classification) of the words.

Only 15 words for each language is presented together with their significance (column %).

+ Handling large collections

For languages with large amount of reviews the

datasets were randomly split into subsets

consisting of 50,000 reviews because of memory

requirements and a decision tree was created for

each such subset.

Each of the 50,000-sample subsets gave almost the

same list of words.

The relevancies of extracted words were averaged.

+ Results

+ Results

+ Results

+ Results

+ Conclusions

A procedure how to apply computers, machine learning, and natural language processing areas to automatically find significant words was presented.

From the total number of words (80,000–200,000) only about 200–300 were identified as significant.

The simple, unified procedure worked well for many languages.

Following research focuses on determining the strength of sentiment and on generating typical short phrases instead of only creating individual words.

The procedure might be used during the marketing research or marketing intelligence, for filtering reviews, generating lists of key-words etc.

Thank you for your attention

Vielen Dank für Ihre Aufmerksamkeit

Gracias por vuestra atención

Merci de votre attention

Grazie per la vostra attenzione

Спасибо за ваше внимание

ご静聴ありがとうございました

Děkuji za vaši pozornost

Education

Additional2