+ Clustering Very Large Textual Unstructured Customers' Reviews in a Natural Language
Jan Žižka, Karel Burda, František Dařena
Department of Informatics, Faculty of Business and Economics
Mendel University in Brno, Czech Republic
+ Introduction
Many companies collect opinions expressed by
their customers.
These opinions can hide valuable knowledge.
Discovering such knowledge manually can be a very demanding task because
the opinion database can be very large,
the customers can use different languages,
people may assess the opinions subjectively,
sometimes additional resources (such as lists of positive and negative words) may be needed.
+ Introduction
Our previous research focused on analyzing what was
significant for assigning a certain opinion to one of the
categories, such as satisfied or dissatisfied customers.
However, this requires having the reviews separated
into classes sharing a common opinion/sentiment.
Clustering, the most common form of unsupervised
learning, enables automatic grouping of unlabeled
documents into subsets called clusters.
+ Objective
The objective is to find out how well a computer can separate the classes expressing a certain opinion, without prior knowledge of the nature of such classes, and to find a clustering algorithm with its best set of parameters: similarity and clustering-criterion functions, word representation, and the role of stemming for the given specific data.
+ Data description
Processed data included reviews of hotel clients collected from publicly available sources
The reviews were labeled as positive and negative
Review characteristics:
more than 5,000,000 reviews
written in more than 25 natural languages
written only by real customers, based on a real experience
written relatively carefully but still containing errors that are typical for natural languages
+ Properties of data used for experiments
The subset used in our experiments contained almost two million opinions marked as written in English.

Review category         Positive       Negative
Number of reviews       1,190,949      741,092
Maximal review length   391 words      396 words
Average review length   21.67 words    25.73 words
Variance                403.34         618.47
+ Review examples (quoted verbatim, including spelling errors)
Positive
The breakfast and the very clean rooms stood out as the best features of this hotel.
Clean and moden, the great loation near station. Friendly reception!
The rooms are new. The breakfast is also great. We had a really nice stay.
Nothing, the hotel is very noisy, no sound insulation whatsoever. Room very small. Shower not nice with a curtain. This is a 2/3 star max.
Negative
High price charged for internet access which actual cost now is extreamly low.
water in the shower did not flow away
The room was noisy and the room temperature was higher than normal.
The train almost running through your room every 10 minutes, the old man at the restaurant was ironic beyond friendly, the food was ok but very German.
+ Data preparation
Data collection, cleaning (removing tags and non-letter characters), converting to upper case
Removing words shorter than 3 characters
Porter's stemming
Stopword removal, spell checking, diacritics removal, etc. were not carried out
Creating 14 smaller subsets containing positive and negative reviews with the following proportions: 131:144, 229:211, 987:1029, 1031:1085, 2096:2211, 4932:4757, 4832:4757, 7432:7399, 10023:8946, 10251:9352, 15469:14784, 24153:23956, 52146:49986, and 365921:313752
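As a rough illustration, the cleaning steps above might be sketched as follows (a minimal sketch; the `preprocess` helper name is ours, and Porter's stemming is only indicated by a comment rather than implemented):

```python
import re

def preprocess(review: str) -> list[str]:
    """Clean one review: strip tags and non-letters, upper-case,
    drop words shorter than 3 characters."""
    text = re.sub(r"<[^>]+>", " ", review)            # remove markup tags
    text = re.sub(r"[^A-Za-z]+", " ", text).upper()   # keep letters only
    tokens = [w for w in text.split() if len(w) >= 3]
    # Porter's stemming would be applied to `tokens` here (omitted);
    # stopword removal, spell checking, and diacritics removal are skipped.
    return tokens

print(preprocess("The rooms are <b>new</b>! Great breakfast :-)"))
# → ['THE', 'ROOMS', 'ARE', 'NEW', 'GREAT', 'BREAKFAST']
```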
+ Experimental steps
Random selection of the desired number of reviews
Transformation of the data into the vector representation
Loading the data into Cluto* and performing clustering
Evaluating the results
* Free software providing different clustering methods working with
several clustering criterion functions and similarity measures, suitable
for operating on very large datasets.
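The vector-representation step can be illustrated with a small sketch (the `to_vectors` helper is hypothetical, not part of Cluto; a plain term-frequency bag-of-words model is assumed):

```python
from collections import Counter

def to_vectors(docs):
    """Map each tokenized document to a term-frequency vector
    (vector-space model): one dimension per vocabulary term."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for term, freq in Counter(doc).items():
            v[index[term]] = freq
        vectors.append(v)
    return vocab, vectors

vocab, vecs = to_vectors([["CLEAN", "ROOM", "CLEAN"], ["NOISY", "ROOM"]])
print(vocab)  # ['CLEAN', 'NOISY', 'ROOM']
print(vecs)   # [[2, 0, 1], [0, 1, 1]]
```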
+ Clustering algorithm parameters
Clustering algorithm – describes how the objects to be
clustered are assigned to individual groups
Available algorithms
Cluto's k-means variation – the algorithm iteratively adapts the
positions of k randomly generated initial cluster centroids
Repeated bisection – a sequence of cluster bisections
Graph-based – partitioning a graph representing objects to be
clustered
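To make the k-means idea concrete, here is a toy stand-in (our sketch, not Cluto's actual implementation): k randomly chosen documents serve as initial centroids, and documents are iteratively reassigned to the most similar centroid by cosine similarity.

```python
import math
import random

def cos(a, b):
    """Cosine similarity between two term-weight vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=20, seed=0):
    """Toy k-means: (1) assign each document to the most similar
    centroid, (2) recompute each centroid as the mean of its members,
    and repeat."""
    rnd = random.Random(seed)
    centroids = rnd.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: cos(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels
```

With two well-separated groups of term vectors, the two resulting clusters recover the groups regardless of the random initialization.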
+ Clustering algorithm parameters
Similarity – an important measure affecting the results of
clustering because the objects within one cluster need to be
similar while objects from different clusters should be dissimilar
Available similarity/distance measures
Cosine similarity – measures the cosine of the angle between
couples of vectors representing the documents
Pearson's correlation coefficient – measures linear correlation
between values of two vectors
Euclidean distance – computes the distance between points
representing documents in the abstract space
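The three measures can be written out directly (function names are ours; vectors are assumed to be plain lists of term weights):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two document vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def pearson(a, b):
    """Linear correlation between the values of two vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def euclidean(a, b):
    """Straight-line distance between two points in term space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note that cosine similarity ignores vector length (two documents using the same words in the same proportions are maximally similar), which is one reason it suits documents of varying lengths.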
+ Clustering algorithm parameters
Criterion functions – a particular clustering criterion function defined over the entire clustering solution is optimized
Internal functions are defined over the documents that are part of each cluster and do not take into account documents assigned to different clusters
External criterion functions derive the clustering solution from the differences among individual clusters
Internal and external functions can be combined to define hybrid criterion functions that optimize individual criterion functions simultaneously
Available criterion functions
Internal – I1, I2
External – E1, E2
Hybrid – H1, H2
Graph based – G1
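As an illustration, our reading of Cluto's documentation is that the I2 internal criterion (maximized) sums, over all clusters, the square root of the pairwise similarities inside each cluster; for unit-length vectors under cosine similarity this equals the norm of the cluster's composite vector. A sketch under that assumption:

```python
import math

def i2(vectors, labels, k):
    """I2 internal criterion (higher is better), assuming every
    document vector has unit Euclidean length: the sum over clusters
    of the norm of the cluster's composite (summed) vector."""
    total = 0.0
    for c in range(k):
        members = [v for v, l in zip(vectors, labels) if l == c]
        if members:
            composite = [sum(col) for col in zip(*members)]
            total += math.sqrt(sum(x * x for x in composite))
    return total
```

Grouping identical documents together yields a higher I2 than splitting them across clusters, which is the behavior a criterion function to be maximized should have.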
+ Clustering algorithm parameters
Document representation – documents are represented using the vector-space model
Vector dimensions – document properties (terms; in our experiments, words)
Vector values
Term Presence (TP)
Term Frequency (TF)
Term Frequency × Inverse Document Frequency (TF-IDF)
Term Presence × Inverse Document Frequency (TP-IDF)
idf(t_i) = log(N / n(t_i)), where N is the number of documents and n(t_i) the number of documents containing term t_i
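A minimal TF-IDF weighting sketch under this definition of idf (the `tf_idf` helper name is ours):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights: tf(t, d) * idf(t) with idf(t) = log(N / n(t)),
    where N is the number of documents and n(t) the number of documents
    containing term t. Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(N / n) for t, n in df.items()}
    return [{t: f * idf[t] for t, f in Counter(doc).items()}
            for doc in docs]
```

A term occurring in every document (such as ROOM below) gets idf = log(1) = 0 and is effectively ignored; TP-IDF results from replacing the raw frequency f with a 0/1 presence indicator.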
+ Evaluation of cluster quality
Purity based measures – measure the extent to which each
cluster contains documents from primarily one class
Purity of cluster S_r of size n_r:
P(S_r) = (1 / n_r) · max_i n_r^i
Purity of the entire solution with k clusters:
Purity = Σ_{r=1..k} (n_r / n) · P(S_r)
A perfect clustering solution – clusters contain documents from
only a single class: Purity = 1
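The purity computation is small enough to state in code (a sketch; each cluster is given as the list of true class labels of its documents):

```python
from collections import Counter

def purity(clusters):
    """Purity of a whole solution: sum_r (n_r / n) * P(S_r), where
    P(S_r) = max_i n_r^i / n_r is the share of the dominant class in
    cluster S_r. `clusters` is a list of label lists."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters if c) / n
```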
+ Evaluation of cluster quality
Entropy based measures – measure how the various classes of documents are distributed within each cluster
Entropy of cluster S_r of size n_r:
E(S_r) = −(1 / log q) · Σ_{i=1..q} (n_r^i / n_r) · log(n_r^i / n_r),
where q is the number of classes and n_r^i the number of documents
of the i-th class that were assigned to the r-th cluster
Entropy of the entire solution with k clusters:
Entropy = Σ_{r=1..k} (n_r / n) · E(S_r)
A perfect clustering solution – clusters contain documents from only a single class: Entropy = 0
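A sketch of the weighted entropy (cluster contents given as the class labels of their documents; helper name ours):

```python
import math
from collections import Counter

def entropy(clusters, q):
    """Weighted entropy of a solution: sum_r (n_r / n) * E(S_r), with
    E(S_r) = -(1 / log q) * sum_i (n_r^i / n_r) * log(n_r^i / n_r).
    `clusters` is a list of label lists, q the number of classes."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        nr = len(c)
        e = -sum((cnt / nr) * math.log(cnt / nr)
                 for cnt in Counter(c).values()) / math.log(q)
        total += (nr / n) * e
    return total
```

Pure clusters give entropy 0; clusters mixing the classes in equal proportions give entropy 1, the worst possible value.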
+ Results
The best results were achieved by k-means, repeated bisection,
and cosine similarity, as demonstrated in the following tables
A boundary of around 10,000 documents was found beyond which
the entropy value oscillates and does not change much with an
increasing number of documents
IDF weighting had a considerable positive impact on clustering
results in comparison with simple TP/TF
TF-IDF document representation provided almost the same
results as TP-IDF
+ Results
Cosine similarity provided the best results, outperforming Euclidean distance and Pearson's correlation coefficient.
For example, for the set of documents containing 4,932 positive and 4,745 negative reviews, the entropy was 0.594 for cosine similarity, while Euclidean distance gave an entropy of 0.740 and Pearson's coefficient 0.838
The H2 and I2 criterion functions provided the best results.
For the I1 criterion function, the entropy of one cluster was very low (less than 0.2). On the other hand, the second cluster's entropy was extremely high.
Stemming applied during the preprocessing phase had no impact on the entropy at all.
+ Weighted entropy

               K-means                        Repeated bisection
               TF-IDF         TP-IDF         TF-IDF         TP-IDF
Ratio P:N      I2     H2      I2     H2      I2     H2      I2     H2
131:144        0.792  0.785   0.793  0.741   0.726  0.767   0.774  0.774
229:211        0.694  0.632   0.695  0.627   0.648  0.643   0.650  0.647
987:1029       0.624  0.610   0.618  0.605   0.624  0.609   0.618  0.611
4832:4757      0.601  0.581   0.599  0.579   0.600  0.584   0.598  0.580
7432:7399      0.605  0.596   0.599  0.587   0.605  0.595   0.594  0.586
15469:14784    0.604  0.583   0.598  0.579   0.604  0.582   0.598  0.579
24153:23956    0.597  0.580   0.589  0.572   0.597  0.580   0.589  0.572
52164:49986    0.596  0.582   0.600  0.573   0.604  0.582   0.598  0.574
201346:204716  0.599  0.583   0.592  0.575   0.597  0.583   0.593  0.576
365921:313752  0.602  0.586   0.598  0.584   0.599  0.581   0.598  0.580
+ Percentage ratios of documents in the clusters

Each cell shows the P:N percentage ratio within the given cluster.

               K-means                           Repeated bisection
               I2               H2               I2               H2
Ratio P:N      cl. 0   cl. 1    cl. 0   cl. 1    cl. 0   cl. 1    cl. 0   cl. 1
131:144        76:24   24:74    78:24   22:74    75:22   25:76    78:19   22:78
229:211        84:21   16:79    86:18   14:82    84:20   16:80    84:18   16:82
987:1029       80:12   19:87    85:16   14:83    79:11   20:88    85:15   15:84
4832:4757      83:13   17:87    87:15   13:85    83:12   17:87    86:14   14:86
7432:7399      82:12   17:87    85:14   14:85    82:12   17:86    86:14   14:85
15469:14784    80:11   19:89    85:13   15:86    81:10   19:89    85:13   15:87
24153:23956    81:11   19:89    85:13   14:86    81:10   18:89    86:13   14:87
52164:49986    18:89   81:11    15:87   85:13    19:89   80:10    15:87   85:12
201346:204716  82:11   18:88    85:13   15:86    82:11   18:89    15:87   85:12
365921:313752  19:89   80:10    16:88   83:12    80:10   20:90    16:87   84:12
+ Weighted entropy for different data set sizes
[chart not reproduced]
+ Conclusions
The goal was to automatically build clusters
representing positive and negative opinions and
to find a clustering algorithm with its best set of
parameters: similarity measure, clustering-criterion
function, word representation, and the role of
stemming.
The main focus was on clustering large real-world
data in a reasonable time, without applying
sophisticated methods that would increase the
computational complexity.
+ Conclusions
The best results were obtained with
k-means – performed better than the other algorithms and proved to be faster
binary vector representation with idf weighting (TP-IDF)
cosine similarity
the H2 criterion function
Stemming did not improve the results.
+ Future work
Clustering of reviews in other languages
Analysis of “incorrectly” categorized reviews
Clustering smaller units of reviews (e.g., sentences)