International Journal of Digital Content Technology and its Applications (JDCTA)
Volume 10, Number 4, Sep. 30, 2016 | ISSN: 1975-9339 (Print), 2233-9310 (Online)
A Publication of the Advanced Institute of Convergence Information Technology (AICIT)
90, Centum jungang-ro, Haeundae-gu, Busan, 612-050, Rep.of Korea | Website : http://www.aicit.org/jdcta | E-mail : [email protected]
Contents:
Relation between knowledge, attitude, and performance ability of infant CPR of pediatric ward nurses
In-Sook Park, Jae-Woo Oh
Technical feasibility analysis to integrate pilot test of desk as a Service on Computational Laboratory in Academic Environments
Carlos Enrique Montenegro Marin, Paulo Alonso Gaona-Garcia, Joaquin Fernando Sánchez Cifuentes
Customer Segmentation Using Particle Swarm Optimization and K-Means Algorithm
Dewa Ayu Agung Yunita Primandari, Ketut Gede Darma Putra, Made Sukarsa
Customer Segmentation Using Particle Swarm Optimization
and K-Means Algorithm
I Dewa Ayu Agung Yunita Primandari (1), I Ketut Gede Darma Putra (2), I Made Sukarsa (3)
(1,2,3) Department of Information Technology, Engineering Faculty, Udayana University,
Jimbaran, Badung, Bali, Indonesia, [email protected], [email protected],
Abstract
Customer segmentation is an application of clustering in the data mining process. It divides
customers into classes to help a company understand each customer. This paper analyzes 33,441 rows
of transaction data and transforms them into 914 rows of Recency, Frequency, and Monetary (RFM)
data to identify potential customers. The clustering method used is a combination of Particle Swarm
Optimization (PSO) and the K-Means algorithm. The combination aims to exploit the advantages of
both algorithms and offset their weaknesses. K-Means is very sensitive to the initialization of the
cluster centers because they are chosen randomly. PSO is used to optimize the cluster centers and
help K-Means cluster better. The clustering experiment uses several numbers of clusters. According
to the Davies-Bouldin Index (DBI), the best number of clusters for this experiment is two.
Keywords: Data Mining, Customer Segmentation, RFM Model, K-Means, Particle Swarm
Optimization, Davies-Bouldin Index
1. Introduction
Nowadays, a main goal of a company is to strengthen customer relationships so that it can gain a
competitive advantage in the market and increase its profit. This means that companies should
develop their skills in identifying customer expectations and then provide customers with greater
benefits and satisfaction [1]. Customer Relationship Management (CRM) can help a company manage its
relationships with customers. CRM is a strategy to build, manage, and strengthen loyal and
long-lasting customer relationships [2]. Many companies use data mining techniques together with
CRM concepts, which help to identify customers through customer segmentation [3].
Customer segmentation is the process of dividing customers into different groups, with members of
each group having similar needs, characteristics, or behaviors [4]. Customer segmentation analyzes a
company’s transaction data to find profitable customers. First, the company’s data must be converted
into Recency, Frequency, and Monetary (RFM) values. RFM is a technique for analyzing customer
behavior: how recently a customer has purchased (Recency), how often the customer purchases
(Frequency), and how much the customer spends (Monetary) [5]. The attributes of the RFM model can
be described by linguistic variables. For example, Recency is described by the terms ‘long ago’ and
‘very recent’, Frequency by the terms ‘rare’ and ‘frequent’, and Monetary by the terms ‘low value’
and ‘high value’ [6].
This research uses K-Means with the RFM model as its clustering method. A similar study showed that
K-Means with the RFM model can process large volumes of customer data [7], but the random selection
of the initial K-Means center points slowed the clustering process. That approach was later refined
by combining K-Means with the PSO algorithm [8]; the study concluded that the weakness of K-Means in
determining the cluster center points can be overcome by the combination, resulting in a
classification that is more accurate and efficient. This research analyzes the transaction data of a
company engaged in the distribution of pharmaceutical ingredients. The company's transaction data
are segmented using a combination of the K-Means and PSO methods with the RFM model to identify the
most promising customers for the company.
2. Methodology
The input of this customer segmentation system is monthly transaction data from May 2014 to
July 2015, consisting of 33,741 rows. The data are stored in a transaction table with the fields
no_invoice, transaction_date, sales_id, customer_id, and total_transaction.
Figure 1. Data Transformation Scheme
Figure 1 shows the data transformation scheme. The transaction table is simplified into a new table,
called the standard table, which contains the transaction_date, customer_id, and total_transaction
fields. transaction_date represents the recency attribute through the date of the customer's last
transaction. customer_id is used to calculate the frequency value from the number of times it
appears. total_transaction is used to calculate the monetary value for each customer_id. The data in
the standard table are converted into RFM values and stored in a table called the rfm table. The
standard table is thereby transformed into 914 rows of RFM data.
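The transformation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `to_rfm`, the tuple layout of the rows, the sample data, and the choice to count recency as days from a hypothetical reference date to the customer's last transaction (so that a larger value means a more recent purchase, in line with the ranges of Table 1) are all assumptions.

```python
from collections import defaultdict
from datetime import date


def to_rfm(rows, reference_date):
    """Collapse standard-table rows into one (recency, frequency, monetary) tuple per customer.

    rows: iterable of (transaction_date, customer_id, total_transaction).
    reference_date: hypothetical start date from which recency is counted in days.
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for transaction_date, customer_id, total_transaction in rows:
        # Recency is driven by the customer's most recent transaction date.
        if customer_id not in last_purchase or transaction_date > last_purchase[customer_id]:
            last_purchase[customer_id] = transaction_date
        frequency[customer_id] += 1                 # one more appearance of customer_id
        monetary[customer_id] += total_transaction  # running total spent
    return {
        cid: ((last_purchase[cid] - reference_date).days, frequency[cid], monetary[cid])
        for cid in last_purchase
    }


# Made-up sample rows, for illustration only.
sample_rows = [
    (date(2014, 5, 10), "C001", 250000.0),
    (date(2015, 7, 1), "C001", 100000.0),
    (date(2014, 6, 15), "C002", 500000.0),
]
rfm = to_rfm(sample_rows, reference_date=date(2014, 5, 1))
```

Each customer ends up as a single row, which is how many transaction rows can shrink to a much smaller RFM table.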
2.1 Data normalization
Data normalization is performed after the data transformation. Data normalization is the process of
scaling attribute values so that they fall within a specified range [9]. The attributes used for
clustering may span very different intervals of values, which distorts the distances between data
points, so the data need to be normalized to a common range before clustering. An attribute can be
normalized with the following min-max normalization method.
$$x' = \frac{x - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \tag{1}$$
min_A and max_A are the minimum and maximum values of attribute A, and new_min_A and new_max_A are
the bounds of the target range.
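As a small illustration of Equation (1), the sketch below rescales one attribute column. The function name `min_max_normalize`, the default target range of 0 to 1, and the handling of a constant column are assumptions made for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers into [new_min, new_max] following Equation (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Made-up frequency values, for illustration only.
normalized = min_max_normalize([10, 20, 30])
```

Applying the same function to the R, F, and M columns separately keeps a large-valued attribute such as Monetary from dominating the distance calculation.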
2.2 Clustering
This paper uses a combined algorithm based on PSO and K-Means to cluster the data. The purpose of
the combination is to exploit the advantages of both algorithms and offset their weaknesses. K-Means
is very sensitive to the initialization of the cluster centers because it is done randomly; it works
well when started from near-optimal cluster centers. PSO is a swarm intelligence method that can be
used to optimize the cluster centers [10]. PSO is modeled on the social interactions of animals.
Every particle tries to move toward a final solution by adjusting its path toward its own best
experience and the best experience of the swarm [11]. Each particle updates its velocity and
location with the following two equations.
$$V_i' = \omega V_i + c_1\,\mathrm{rand}_1\,(P_{best} - X_i) + c_2\,\mathrm{rand}_2\,(P_{gbest} - X_i) \tag{2}$$
$$X_i' = X_i + V_i' \tag{3}$$
rand1 and rand2 are random numbers between 0 and 1, c1 and c2 are the cognitive and social learning
rates, Pbest is the best location found by the particle, Pgbest is the global best location, ω is
the inertia, i is the dimension of the data, and t is the iteration number (the primed quantities are
the values for the next iteration).
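Equations (2) and (3) amount to a short per-dimension update. The sketch below shows one such step for a single particle; the function name `pso_step`, the default values of the inertia and learning rates, and drawing fresh random numbers for every dimension are assumptions, since the paper does not state its parameter settings.

```python
import random


def pso_step(position, velocity, personal_best, global_best,
             inertia=0.7, c1=1.5, c2=1.5, rng=random):
    """Apply Equations (2) and (3) once; returns (new_position, new_velocity)."""
    new_position, new_velocity = [], []
    for x, v, p_best, g_best in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # rand1 and rand2, each in [0, 1)
        v_next = inertia * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # Eq. (2)
        new_velocity.append(v_next)
        new_position.append(x + v_next)                                         # Eq. (3)
    return new_position, new_velocity


# A particle already sitting on both best positions is moved by inertia alone.
pos, vel = pso_step([1.0, 2.0], [0.5, -0.5], [1.0, 2.0], [1.0, 2.0], inertia=0.5)
```

When this scheme is used for clustering, one common encoding (also an assumption here) is to let each particle hold the coordinates of all candidate cluster centers concatenated into a single position vector.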
In this research, K-Means inherits the cluster centers from the PSO result and assigns each data
point to the closest cluster center using the Euclidean distance [7].
$$d_{euc} = \sqrt{\sum_{i=1}^{n} (z_i - X_i)^2} \tag{4}$$
The centroids are then recalculated: each centroid is updated to the average of the attribute values
of the data points assigned to its cluster. The assignment and update steps are repeated until
convergence is achieved. The clusters formed are then validated using the DBI method. DBI is based
on a cluster similarity measure (Rij), which is built from the dispersion measure of each cluster
(si) and the dissimilarity measure between clusters (dij) [12].
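The K-Means loop and the DBI validation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the initial centers are simply passed in (in the paper they come from PSO), si is taken as the mean distance of a cluster's points to its centroid, dij as the distance between centroids, Rij = (si + sj) / dij, and DBI as the average over clusters of the largest Rij, following the usual Davies-Bouldin definition. Every cluster is assumed to be non-empty when the index is computed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centers, max_iter=100):
    """Run K-Means from the given initial centers; returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        labels = [min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:       # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers


def davies_bouldin(points, labels, centers):
    """Davies-Bouldin Index; lower values indicate better-separated clusters."""
    k = len(centers)
    scatter = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        scatter.append(sum(euclidean(p, centers[i]) for p in members) / len(members))
    worst = [max((scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


# Made-up, already normalized 2-D points, for illustration only.
pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, centers = kmeans(pts, centers=[(1, 1), (9, 1)])
dbi = davies_bouldin(pts, labels, centers)
```

Running `davies_bouldin` for each candidate number of clusters and keeping the lowest value is the selection rule the validation step relies on.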
2.3 Data modeling
The clusters formed then go through the data modeling process. This process compares the average R,
F, and M values of each cluster with the ranges in the RFM domain values to determine the grade of
each cluster. Each of R, F, and M has three linguistic terms, each with its own domain value. The
linguistic variables and domain values are shown in Table 1.
Table 1. Linguistic Variable and Domain Value for Each of Average R, F, M
Attribute | Linguistic Variable | Domain Value
Recency | Long Time Ago | 0 ≤ r ≤ 200 days
Recency | A Little Longer | 201 days < r < 501 days
Recency | Recently | 501 days < r
Frequency | Seldom | 0 ≤ f < 50 transactions
Frequency | Rather Frequently | 51 transactions < f < 150 transactions
Frequency | Often | 150 transactions < f
Monetary | Low | 0 ≤ m < 1,000,000 rupiahs
Monetary | Medium | 1,000,001 rupiahs < m < 5,000,000 rupiahs
Monetary | High | 5,000,001 rupiahs < m
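A crisp reading of Table 1 can be written as a simple threshold lookup. The sketch below is a simplification offered for illustration: the paper evaluates clusters with a fuzzy membership function, so a crisp lookup may not reproduce every term it reports, and the treatment of values lying exactly on a boundary that Table 1 leaves uncovered is an arbitrary assumption of this example. The names `TERMS` and `linguistic_term` are hypothetical.

```python
from bisect import bisect_left

# Cutoffs and terms read off Table 1 (recency in days, frequency in transactions,
# monetary in rupiahs). Handling of exact boundary values is an assumption.
TERMS = {
    "recency": ((200, 500), ("Long Time Ago", "A Little Longer", "Recently")),
    "frequency": ((50, 150), ("Seldom", "Rather Frequently", "Often")),
    "monetary": ((1_000_000, 5_000_000), ("Low", "Medium", "High")),
}


def linguistic_term(attribute, value):
    """Return the Table 1 term whose range contains the given average value."""
    cutoffs, terms = TERMS[attribute]
    # bisect_left counts how many cutoffs lie strictly below the value.
    return terms[bisect_left(cutoffs, value)]


# Made-up cluster averages, for illustration only.
example = (
    linguistic_term("recency", 600),
    linguistic_term("frequency", 100),
    linguistic_term("monetary", 500_000),
)
```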
The data modeling process evaluates the average RFM values of each cluster against all classes of
the RFM model listed in Table 2, using the equation of Zumstein [6].
$$\mu_A(x) = \left(\prod_{i=1}^{m} \mu_i(x)\right)^{1-\gamma} \left(1 - \prod_{i=1}^{m} \bigl(1 - \mu_i(x)\bigr)\right)^{\gamma} \tag{5}$$
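Equation (5) has the form of a compensatory gamma operator: it blends the product of the individual memberships (an AND-like aggregation) with the complement of the product of their complements (an OR-like aggregation), weighted by an exponent between 0 and 1. The sketch below evaluates it for a list of membership degrees; the function name `gamma_operator`, the default exponent of 0.5, and the sample memberships are assumptions, as the paper does not report the value it used.

```python
import math


def gamma_operator(memberships, gamma=0.5):
    """Aggregate membership degrees in [0, 1] following Equation (5)."""
    and_part = math.prod(memberships)                        # product of mu_i(x)
    or_part = 1.0 - math.prod(1.0 - m for m in memberships)  # 1 - product of (1 - mu_i(x))
    return and_part ** (1.0 - gamma) * or_part ** gamma


# Made-up memberships of one cluster in the R, F and M terms of a single class.
degree = gamma_operator([0.8, 0.6, 0.9], gamma=0.5)
```

With the exponent at 0 the result reduces to the plain product, and with the exponent at 1 it reduces to the OR-like term, so intermediate values let a high membership in one attribute partly compensate for a low membership in another.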
Each class in the RFM model has a customer label that states the characteristics of that class of
customers. The class descriptions are given in Table 2.
Table 2. Description of Linguistic Variables and Customer Labels
Recency | Frequency | Monetary | Class
Recently | Seldom | Low | Dormant D
Recently | Seldom | Medium | Dormant A
Recently | Seldom | High | Occasional A
Recently | Quite Often | Low | Everyday D
Recently | Quite Often | Medium | Golden D
Recently | Quite Often | High | Superstar D
Recently | Often | Low | Everyday A
Recently | Often | Medium | Golden A
Recently | Often | High | Superstar A
Rather Long-Time | Seldom | Low | Dormant E
Rather Long-Time | Seldom | Medium | Dormant B
Rather Long-Time | Seldom | High | Occasional B
Rather Long-Time | Quite Often | Low | Everyday E
Rather Long-Time | Quite Often | Medium | Golden E
Rather Long-Time | Quite Often | High | Superstar E
Rather Long-Time | Often | Low | Everyday B
Rather Long-Time | Often | Medium | Golden B
Rather Long-Time | Often | High | Superstar B
Long-Time | Seldom | Low | Dormant F
Long-Time | Seldom | Medium | Dormant C
Long-Time | Seldom | High | Occasional C
Long-Time | Quite Often | Low | Everyday F
Long-Time | Quite Often | Medium | Golden F
Long-Time | Quite Often | High | Superstar F
Long-Time | Often | Low | Everyday C
Long-Time | Often | Medium | Golden C
Long-Time | Often | High | Superstar C
There are five classes of customers. Superstar customers are the most loyal customers with the
highest value. Golden customers have the second highest value. Occasional customers have the second
lowest value. Everyday customers show an increasing number of visits but a low transaction value,
and Dormant customers have the lowest value. Each class consists of three to six subclasses
(Table 2). For example, Golden customers consist of Golden A, Golden B, Golden C, Golden D,
Golden E, and Golden F.
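The labels of Table 2 follow a regular pattern: the frequency and monetary terms fix the class name and one of two letter groups, and the recency term picks the letter within that group. The sketch below encodes this reading of the table as a lookup; the names `CLASS_BY_FM`, `LETTER_BY_RECENCY`, and `customer_label` are hypothetical, and the pattern is an observation about the table rather than a rule stated by the authors.

```python
# Class name and letter group for each (frequency, monetary) pair, read off Table 2.
# "first" means subclass letters A-C, "second" means subclass letters D-F.
CLASS_BY_FM = {
    ("Seldom", "Low"): ("Dormant", "second"),
    ("Seldom", "Medium"): ("Dormant", "first"),
    ("Seldom", "High"): ("Occasional", "first"),
    ("Quite Often", "Low"): ("Everyday", "second"),
    ("Quite Often", "Medium"): ("Golden", "second"),
    ("Quite Often", "High"): ("Superstar", "second"),
    ("Often", "Low"): ("Everyday", "first"),
    ("Often", "Medium"): ("Golden", "first"),
    ("Often", "High"): ("Superstar", "first"),
}

# The recency term selects the letter inside the group.
LETTER_BY_RECENCY = {
    "Recently": {"first": "A", "second": "D"},
    "Rather Long-Time": {"first": "B", "second": "E"},
    "Long-Time": {"first": "C", "second": "F"},
}


def customer_label(recency, frequency, monetary):
    """Return the Table 2 class label for three linguistic terms."""
    name, group = CLASS_BY_FM[(frequency, monetary)]
    return f"{name} {LETTER_BY_RECENCY[recency][group]}"


label = customer_label("Recently", "Often", "High")
```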
3. Experiment and result
The experiment began with the selection of data from the enterprise transaction data. The
transaction data were converted into RFM form and normalized using the min-max normalization method,
which scaled the RFM values into the range 0 to 1. The normalized data were then segmented using two
methods: K-Means alone, and K-Means combined with PSO. The experiments used two to four clusters,
and PSO was run for 10 iterations to produce the optimized center points.
Figure 2. Center Points Optimization Using PSO: (a) 2 clusters, (b) 4 clusters
PSO produced optimized center points, which were then used to perform the K-Means clustering.
Table 3 lists the optimized center points obtained.
Table 3. The Value of Optimized Cluster Center Using 2 and 4 Clusters
Number of Clusters | Recency | Frequency | Monetary
2 clusters | -0.0919 | -9.5080e-05 | -2.3345e-05
2 clusters | -0.0352 | -4.1489e-04 | -1.7413e-04
4 clusters | 0.2442 | -0.0015 | -0.0014
4 clusters | 0.0783 | -1.0015e-04 | -2.5278e-05
4 clusters | -0.0139 | -3.4009e-04 | -1.3324e-04
4 clusters | -0.0105 | -1.1684e-04 | -2.9462e-05
Figure 3 shows the segmentation results using K-Means and PSO + K-Means. Segmentation with K-Means
used random cluster centers, while segmentation with PSO + K-Means used the optimized cluster
centers produced by PSO.
Figure 3. Comparison of the clustering process between K-Means and PSO+K-Means: (a) 2 clusters, (b) 4 clusters
The segmentation results using K-Means and PSO + K-Means are shown in Table 4.
Table 4. Segmentation Result Using 2 and 4 Clusters
Method | Cluster | Recency (days) | Frequency (transactions) | Monetary (rupiahs) | Label
K-Means using 2 clusters | 2 | 444 (Rather Long-Time) | 27 (Seldom) | 8171852 (Medium) | Dormant B
K-Means using 2 clusters | 1 | 688 (Recently) | 174 (Often) | 30573045 (High) | Superstar A
PSO+K-Means using 2 clusters | 1 | 688 (Recently) | 174 (Often) | 30573045 (High) | Superstar A
PSO+K-Means using 2 clusters | 2 | 444 (Rather Long-Time) | 27 (Seldom) | 8171852 (Medium) | Dormant B
K-Means using 4 clusters | 3 | 697 (Recently) | 106 (Quite Often) | 10371744 (High) | Superstar D
K-Means using 4 clusters | 1 | 381 (Rather Long-Time) | 25 (Seldom) | 10900387 (High) | Occasional B
K-Means using 4 clusters | 4 | 718 (Recently) | 468 (Often) | 105686107 (High) | Superstar A
K-Means using 4 clusters | 2 | 590 (Recently) | 42 (Seldom) | 5995413 (Medium) | Dormant A
PSO+K-Means using 4 clusters | 3 | 580 (Recently) | 40 (Seldom) | 6749484 (Medium) | Dormant A
PSO+K-Means using 4 clusters | 1 | 698 (Recently) | 114 (Quite Often) | 10900040 (High) | Superstar D
PSO+K-Means using 4 clusters | 4 | 719 (Recently) | 506 (Often) | 121826389 (High) | Superstar A
PSO+K-Means using 4 clusters | 2 | 364 (Rather Long-Time) | 27 (Seldom) | 10266023 (High) | Occasional B
The experiments above show that the segmentation results of K-Means and PSO + K-Means did not differ
significantly, judging by the customer labels produced. This suggests that optimizing the cluster
center points with PSO did not affect the outcome of the K-Means clustering. Figure 4 shows the
cluster validation using DBI. The optimum number of clusters according to DBI is the one with the
lowest index, and for both algorithms this is two clusters: with two clusters the DBI is 0.5 for
both K-Means and PSO+K-Means.
Figure 4. Cluster Validation of K-Means and PSO+K-Means Result
The ability of PSO to speed up K-Means is shown by comparing the execution time of K-Means
clustering with and without PSO. This experiment used the same amount of data and the same number of
iterations.
Table 5. Comparison of K-Means and PSO + K-Means
Number of Clusters | Execution time of K-Means | Execution time of PSO+K-Means
2 | 1.317107 s | 0.098473 s
4 | 1.472539 s | 0.128821 s
Table 5 shows that PSO + K-Means executes faster than K-Means alone: 0.098473 s versus 1.317107 s
with two clusters, and 0.128821 s versus 1.472539 s with four clusters. The effectiveness of the
combined method is therefore reflected in its execution time. Optimizing the center points with PSO
makes the clustering process faster than plain K-Means because the center points are no longer
selected randomly.
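The size of the gap in Table 5 can be checked with a few lines of arithmetic. The sketch below computes the seconds saved and the speedup factor for each cluster count; the dictionary layout and the function name `compare` are assumptions made for this example, and the timings are copied from Table 5.

```python
# Execution times in seconds, copied from Table 5.
times = {
    2: {"kmeans": 1.317107, "pso_kmeans": 0.098473},
    4: {"kmeans": 1.472539, "pso_kmeans": 0.128821},
}


def compare(kmeans_s, pso_kmeans_s):
    """Return (seconds saved, speedup factor) of PSO+K-Means over K-Means."""
    return kmeans_s - pso_kmeans_s, kmeans_s / pso_kmeans_s


summary = {k: compare(t["kmeans"], t["pso_kmeans"]) for k, t in times.items()}
for k, (saved, speedup) in summary.items():
    print(f"{k} clusters: {saved:.6f} s saved, about {speedup:.1f}x faster")
```

The differences come to 1.218634 s for two clusters and 1.343718 s for four clusters, which corresponds to a speedup of more than ten times in both cases.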
4. Conclusion
Both K-Means and PSO + K-Means segment the customers well, as seen from the variety of customer
classes produced: Dormant B, Superstar A, Superstar D, Occasional B, and Dormant A. The segmentation
results of K-Means and PSO + K-Means did not differ significantly, but the execution times indicate
that the combined PSO + K-Means method is faster than K-Means alone. With 2 clusters, PSO + K-Means
performed clustering 1.218634 seconds faster than K-Means alone, and with 4 clusters it was
1.343718 seconds faster.
5. Acknowledgment
Our thanks go to the Department of Information Technology, Udayana University, Bali, Indonesia,
which helped organize this research.
6. References
[1] Dorota Jelonek, “The Evolution of Customer Relationship Management System”, Recent
Advances in Computer Science, p. 29-33. 2015.
[2] Konstantinos Tsiptsis, Antonios Chorianopoulos, “Data Mining Techniques in CRM: Inside
Customer Segmentation”, John Wiley and Sons, Ltd., England, 2009.
[3] Hasan Ziafat, Majid Shakeri, “Using Data Mining Techniques in Customer Segmentation”,
Journal of Engineering Research and Applications, Vol. 4, pp. 70-79, 2014.
[4] Vahid Golmah, “A Case Study of Applying SOM in Market Segmentation of Automobile
Insurance Customers”, International Journal of Database Theory and Application, Vol. 7, No. 1,
pp. 25-36, 2014.
[5] Derya Birant, “Data Mining Using RFM Analysis”, InTech, Croatia, 2011.
[6] Darius Zumstein, “Customer Performance Measurement: Analysis of the Benefit of a Fuzzy
Classification Approach in Customer Relationship Management”, (Thesis), University of
Fribourg, 2007.
[7] Hasitha Indika Arumawadu, R. M. Kapila Tharanga Rathnayaka, S. K. Illangarathne, “Mining
Profitability of Telecommunication Customers Using K-Means Clustering”, Journal of Data
Analysis and Information Processing, Scientific Research Publishing, Vol. 3, pp. 63-71, 2015.
[8] C.Y. Chiu, Y.F. Chen, I.T. Kuo, He Chun Ku, “An Intelligent Market Segmentation System Using
K-Means and Particle Swarm Optimization”, Expert Systems with Applications: An
International Journal, Pergamon Press Inc, Vol. 36, pp. 4558-4565, 2009.
[9] Jiawei Han, Jian Pei, Micheline Kamber. “Data Mining, Southeast Asia Edition”. Morgan
Kaufmann Publishers, USA, 2006.
[10] Junyan Chen, “Hybrid Clustering Algorithm Based on PSO with the Multidimensional
Asynchronism and Stochastic Disturbance Method”. Journal of Theoretical and Applied
Information Technology, Vol. 46 No. 1, pp. 434-440.
[11] Satyobroto Talukder, “Mathematical Modelling and Application of Particle Swarm
Optimization”, (Master Thesis), Blekinge Institute of Technology, Sweden, 2011.
[12] D. L. Davies, D. W. Bouldin, “A Cluster Separation Measure”, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 1, No. 2, pp. 224-227, 1979.