[IEEE 2008 12th International Conference Information Visualisation (IV) - London, UK (2008.07.9-2008.07.11)] 2008 12th International Conference Information Visualisation - Effectiveness

Effectiveness of Machine Learning Techniques for Automated Identification of Calling Communities

Keivan Kianmehr, Reda Alhajj
Department of Computer Science, University of Calgary
Calgary, Alberta, Canada

Abstract

In this paper, we demonstrate how cluster analysis can be used to effectively identify communities using information derived from Call Detail Record (CDR) data. We use the information extracted from the cluster analysis to identify customer calling patterns. These calling patterns are then given to a classification algorithm to generate a classifier model for predicting the calling communities of a customer. We apply two different classification methods: a support vector machine and a fuzzy-genetic classifier. The latter method is used for possibly assigning a customer to different classes with different degrees of membership. The reported test results demonstrate the applicability and effectiveness of the proposed approach.

1 Introduction

Identifying social communities is an emerging research area that has already attracted the attention of several research groups, e.g., [2, 3, 4, 8, 9, 13]. Other researchers have concentrated on identifying terrorist groups, e.g., [12]. In this paper, we investigate customer relationships by analyzing Call Detail Record (CDR) data obtained from a telecommunication company. Such analysis is beneficial for [19]:

• Churn Prediction: The goal is to understand when and why a company's customers are likely to leave so that appropriate action can be planned. Data mining is applied to perform two major tasks:

1. predict whether a particular customer will churn and when this will happen;

2. understand why particular customers churn.

• Identifying Calling Communities: This helps in designing effective targeted marketing, which is significantly important for increasing profitability in the telecommunication industry.

The main focus of this study is to use an unsupervised machine learning technique, namely clustering, to classify customers of a mobile service provider into appropriate calling communities according to the statistics extracted from the CDR data. Once an acceptable clustering has been found using the similarities and dissimilarities in the training data set, the clustering is transformed into a classifier by using a classification technique. In this study, the agglomerative hierarchical clustering approach has been applied for the clustering task; it can produce an ordering of the objects.

In this study, two different techniques have been selected for the classification task. The support vector machine (SVM), a statistical learning approach, has been used to build the classifier model. A fuzzy-genetic algorithm has also been applied for the classification task. The SVM algorithm has been used for identifying crisp user communities, and the fuzzy-genetic algorithm has been adapted for assigning a particular customer to possibly more than one community with different degrees of membership. This way, we differentiate between strict membership and gradual membership. Fuzziness is attractive because it facilitates the possibility of having partial membership in a given group. Each fuzzy set has a corresponding membership function which is used to decide, for each customer, its degree of membership in the group.

The rest of this paper is organized as follows. Section 2 covers CDR data and calling neighbors. Section 3 presents the proposed model. Section 4 reports experimental results. Section 5 presents the conclusions.

2 CDR Data and Calling Neighbors

From the CDR data, it is possible to extract a customer's calling destination numbers, and the duration and frequency for each destination number for a particular period of time. A customer's calling neighbors can also be identified by using the links of who calls whom [19]. There are two types of calling neighbors:

1) Direct calling neighbor: A person who calls the customer or whom the customer calls. The majority of these neighbors of a customer may be outside of the service provider's own network, and no information about them is available. An example of direct calling neighbors is the members of a research community who call each other heavily.

2) Indirect calling neighbor: A person who calls the same number(s) as another customer does. For example, employees of a large organization such as a bank who work in different local branches may call their headquarters frequently, and each employee is possibly a calling neighbor of other employees. Even though employees of different branches may not know each other, they can be classified as colleagues.

1550-6037/08 $25.00 © 2008 IEEE. DOI 10.1109/IV.2008.68

There are two major challenges in using the CDR data for finding the calling communities: 1) The majority of the destination phone numbers are outside the service provider's network, so information about these customers, such as their calling records and the links between them, is not available. 2) The other phone numbers are those in the service provider's network, each of which corresponds to a customer of the service provider. The connectivity between these customers is very sparse because customers are more likely to call numbers outside their home network. After identifying the customers' communities, the information derived from the calling communities can be used for building a classifier model. This classifier model is able to assign a new customer to one or possibly more of the existing communities according to his/her general calling pattern extracted from the CDR data as well as the closeness of his/her direct and indirect calling neighbors.
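The two kinds of calling neighbors defined in this section can be derived from who-calls-whom pairs alone. The following is a minimal sketch, assuming each CDR row has already been reduced to a (caller, callee) pair; the function name and the toy data are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def calling_neighbors(calls):
    """calls: iterable of (caller, callee) pairs taken from CDR rows."""
    called = defaultdict(set)   # caller -> numbers the caller dialled
    direct = defaultdict(set)   # number -> direct calling neighbors
    for caller, callee in calls:
        called[caller].add(callee)
        direct[caller].add(callee)   # the customer calls this person
        direct[callee].add(caller)   # this person calls the customer

    # Group callers by the number they dialled.
    callers_of = defaultdict(set)
    for caller, numbers in called.items():
        for number in numbers:
            callers_of[number].add(caller)

    # Indirect neighbors: customers who share at least one called number.
    indirect = defaultdict(set)
    for callers in callers_of.values():
        for c in callers:
            indirect[c] |= callers - {c}
    return dict(direct), dict(indirect)


# Toy example: A and B both call the same headquarters number.
calls = [("A", "HQ"), ("B", "HQ"), ("A", "C")]
direct, indirect = calling_neighbors(calls)
```

In this toy data, A and B never call each other, yet they become indirect neighbors through the shared number HQ, which mirrors the bank-branch example above.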

3 Model Details

The classifier model consists of three major phases: data preprocessing, clustering, and classification.

Data Preprocessing: During the first step of the data preprocessing, all phone numbers (customers) within the service provider's network are identified. Since the CDR data includes information about customers, all the source phone numbers have been considered as subscribers of the service provider. The given data set consists of 55,000 calling records of 2,000 subscribers (distinct phone numbers within the service provider's network). Calls with very low duration (less than 5 seconds) are assumed to have no effect on identifying the subscriber's neighbors and are ignored.
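As an illustration of this filtering rule and of the per-destination duration and frequency statistics used in the rest of the preprocessing, here is a small sketch. It assumes each record is a (source, destination, duration in seconds) tuple; that layout and the function name are assumptions, not the format of the actual data set.

```python
from collections import defaultdict

MIN_DURATION_SECONDS = 5  # threshold stated in the text


def aggregate_calls(records):
    """records: iterable of (source, destination, duration_seconds)."""
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for source, destination, duration in records:
        if duration < MIN_DURATION_SECONDS:
            continue  # very short calls are ignored
        entry = stats[source][destination]
        entry[0] += duration  # total call duration to this destination
        entry[1] += 1         # call frequency to this destination
    # Convert to plain dicts of (total_duration, frequency) tuples.
    return {
        source: {dest: tuple(v) for dest, v in dests.items()}
        for source, dests in stats.items()
    }


records = [
    ("A", "X", 30), ("A", "X", 3), ("A", "X", 60),
    ("A", "Y", 10), ("B", "X", 4),
]
summary = aggregate_calls(records)
```

A subscriber whose only calls fall under the threshold (B in the toy data) does not appear in the summary at all.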

In the second step, for each subscriber, his/her own phone number, called destination numbers (within and outside of the home network), and each destination number's calling duration and frequency in that particular period of time are extracted from the CDR data. During preprocessing of the CDR data, inactive customers have been excluded from the data since these numbers greatly skewed the distance distribution. An inactive customer is a customer who barely makes a phone call within a particular time period.

Distance Measures: In order to identify the closeness of a particular customer to his/her direct calling neighbors, a similarity measure weighted by call duration or frequency between two phone numbers has been used. The weighted similarity measure is a first order distance [19] defined as follows:

$$D_1(i, j) = \frac{1}{2} \sum_{x \in N(i) \cap N(j)} \bigl[ w_i(k_i(x)) + w_j(k_j(x)) \bigr] \qquad (3)$$

where x is a common phone number which both customers i and j called during the specified time period, and $k_i(x)$ and $k_j(x)$ are the corresponding weights of the commonly called phone number x in $N_w(i)$ and $N_w(j)$, respectively. According to Eq. 3, two customers are considered very close when they call some common numbers frequently (or heavily), regardless of how many other numbers they regularly call. The distance between a pair of customers is 1 (maximum distance) if they do not call any common phone number.

Clustering Technique: Using the first and second order distance measures, a hierarchical clustering approach, which works based on links between customers, can be applied to identify communities. The following distance measure is defined to incorporate both the first and second order distance measures into the similarity measure of the clustering algorithm:

$$D(i, j) = (1 - \alpha)\, D_1(i, j) + \alpha\, D_2(i, j) \qquad (4)$$

The term α controls the relative weight of the first and second order distance measures and typically depends on the network characteristics of the mobile service provider. As α → 0, the similarity measure approaches the first order distance measure. Intuitively, α captures the influence of the customers' indirect calling neighbors.
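To make Eqs. 3 and 4 concrete, the sketch below computes the weighted sum over common numbers and the α-weighted combination. Several choices are assumptions rather than details given in the text: weights are taken as each number's share of a customer's total call duration, the sum in Eq. 3 is converted to a distance as one minus the sum so that customers with no common numbers end up at distance 1, and the second order distance is simply passed in as a number because its definition is not reproduced here.

```python
def normalised_weights(durations):
    """durations: {called_number: total seconds}; returns shares summing to 1.

    Assumption: the weight of a called number is its share of the
    customer's total call duration.
    """
    total = sum(durations.values())
    return {x: d / total for x, d in durations.items()}


def first_order_similarity(w_i, w_j):
    """Right-hand side of Eq. 3: half the summed weights of common numbers."""
    common = w_i.keys() & w_j.keys()
    return 0.5 * sum(w_i[x] + w_j[x] for x in common)


def first_order_distance(w_i, w_j):
    # Assumption: distance = 1 - similarity, so that two customers with
    # no common called numbers are at the maximum distance of 1.
    return 1.0 - first_order_similarity(w_i, w_j)


def combined_distance(d1, d2, alpha=0.75):
    """Eq. 4; d2 (the second order distance) is supplied by the caller."""
    return (1.0 - alpha) * d1 + alpha * d2


w_a = normalised_weights({"HQ": 60, "X": 40})
w_b = normalised_weights({"HQ": 30, "Y": 70})
d1 = first_order_distance(w_a, w_b)
d = combined_distance(d1, d2=0.2)
```

With alpha set to 0 the combined value reduces to the first order term, matching the limiting behaviour described above.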

Eq. 4 makes it possible for the clustering algorithm to merge the clusters with the largest number of links, which are defined as the common neighbors of the customers based on both direct and indirect calling patterns. For discovering calling communities in this study, the agglomerative hierarchical clustering algorithm has been used [21]. The MATLAB Statistics Toolbox has been used for conducting the hierarchical clustering.

Building the Classifier: After identifying user communities, the clusters are transformed into a classifier model which is able to predict the community to which a new customer belongs according to his/her calling patterns. We use the community of a customer as his/her class label; then we derive several features using the information extracted from calling links and clusters. Finally, we create a training set in which every column represents a distance feature and every row represents a certain customer's values for the features.

Distance Features: The following input features have been constructed for each customer from his/her calling neighbors based on the first order and second order distances:

1) Total number of the customer's direct calling neighbors.
2) Percentage of the customer's calls made to his/her closest direct calling neighbor.
3) Percentage of the customer's direct calling neighbors which are within the service provider's network.
4) Percentage of the customer's direct calling neighbors which are outside of the service provider's network.
5) The shortest distances of the customer to all existing classes.
6) Percentage of direct calls to neighbors (within the network) belonging to all existing classes.
7) Percentage of indirect calls to neighbors (within the network) belonging to all existing classes.

For every customer within the service provider's network, a feature vector based on the above feature definitions is built to represent relationships between this particular customer and all other customers within the existing communities. Then, a set of feature vectors, each of which corresponds to a specific customer, is used as the training set for the classification algorithm.
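The first four features can be computed from a customer's per-neighbor call counts alone. The sketch below is illustrative only: it assumes outgoing call counts stand in for the full set of direct neighbors, and it treats the most frequently called neighbor as the "closest" one, whereas the paper measures closeness with its distance measures. The function and argument names are hypothetical.

```python
def basic_features(call_counts, in_network):
    """Features 1-4 for one customer.

    call_counts: {neighbor_number: number of calls to that neighbor}
    in_network:  set of numbers belonging to the service provider
    """
    n = len(call_counts)
    total_calls = sum(call_counts.values())
    inside = sum(1 for number in call_counts if number in in_network)
    return [
        n,                                                # 1) neighbor count
        100.0 * max(call_counts.values()) / total_calls,  # 2) % calls to top neighbor
        100.0 * inside / n,                               # 3) % neighbors in network
        100.0 * (n - inside) / n,                         # 4) % neighbors outside
    ]


features = basic_features({"X": 6, "Y": 3, "Z": 1}, in_network={"X", "Q"})
```

Features 5 to 7 depend on the cluster assignments, so they would be appended once the communities have been identified.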

Classification Technique: For building a classifier model that satisfies the purpose of this paper, two different approaches have been used. The first approach is the Support Vector Machine (SVM) [16] from the family of statistical learning algorithms. Basically, SVM can be used to solve binary classification problems. However, in order to use SVM for real-world classification tasks, the idea has been extended to multi-class problems as well. The extension can be done either during the learning process or during the decision process. One-vs-rest and the adaptive code algorithm are two of the most well-known extensions of SVM to multi-class problems [1].
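The one-vs-rest idea mentioned above can be sketched independently of the binary learner: one scorer is trained per class against all others, and a new point goes to the class whose scorer is most confident. To stay within the standard library, the sketch below uses a simple centroid-based linear scorer as a stand-in for the binary learner. This is not an SVM and not what the paper runs; all names are hypothetical.

```python
def train_centroid_scorer(X, y_binary):
    """Stand-in binary learner (not an SVM): linear score from class centroids."""
    pos = [x for x, t in zip(X, y_binary) if t]
    neg = [x for x, t in zip(X, y_binary) if not t]

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    mp, mn = mean(pos), mean(neg)
    w = [a - b for a, b in zip(mp, mn)]          # direction from rest to class
    mid = [(a + b) / 2 for a, b in zip(mp, mn)]  # point where the score is 0
    return lambda x: sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))


def train_one_vs_rest(X, y, train_binary=train_centroid_scorer):
    """Train one binary scorer per class: that class versus all the rest."""
    return {c: train_binary(X, [label == c for label in y]) for c in sorted(set(y))}


def predict_one_vs_rest(scorers, x):
    """Assign x to the class whose scorer returns the largest value."""
    return max(scorers, key=lambda c: scorers[c](x))


X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = [0, 0, 1, 1, 2, 2]
scorers = train_one_vs_rest(X, y)
```

Swapping `train_binary` for a real SVM trainer would leave the one-vs-rest wrapper unchanged, which is the point of the decomposition.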

The second approach that has been applied for the classification task is a fuzzy genetic rule-based classification technique [7]. In contrast to SVM, this rule-based approach is more understandable by humans, but suffers from efficiency issues. In this work, the characteristics of the fuzzy rule-based approach have been used for possibly doing fuzzy classification.

For conducting the classification based on SVM, a MATLAB interface of LIBSVM [6] has been used. LIBSVM is a free library for SVM classification and regression. The fuzzy-genetic rule-based algorithm has been implemented in MATLAB. The MATLAB genetic algorithm built-in functions have been integrated into the implementation. In the rest of this section, we first present some basics of the genetic algorithm required to understand the rest of this paper, then we describe the fuzzy-genetic rule-based approach.

Figure 1. Membership functions of five linguistic values (S: small, MS: medium small, M: medium, ML: medium large, and L: large).

Figure 2. Accuracy using hierarchical clustering with SVM.

Fuzzy-Genetic Rule-Based Systems: In order to build a fuzzy rule-based system, the major task is to find an appropriate fuzzy rule set which represents the problem. Genetic algorithms have been shown to be a powerful tool for: 1) generation and optimization of the fuzzy rule base, and 2) generation and tuning of membership functions. In a fuzzy rule-based system, fuzzy if-then rules for an n-dimensional pattern classification problem are defined as follows:

$$R_j:\ \text{If } x_1 \text{ is } A_{j1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{jn} \text{ then Class } C_j \text{ with } CF_j \qquad (5)$$

where $R_j$ is the label of the j-th fuzzy if-then rule, j indexes the rules, $x = (x_1, x_2, \ldots, x_n)$ is an n-dimensional pattern vector, $A_{ji}$ is an antecedent fuzzy set with a linguistic label (i.e., a linguistic value such as small or large) on the i-th axis, $C_j$ is a consequent class, and $CF_j$ is a certainty grade. As the antecedent fuzzy sets $A_{ji}$, the five linguistic values shown in Figure 1 and "don't care" have been used. Therefore, the number of combinations of the antecedent fuzzy sets is $6^n$, which is very large in the case of high-dimensional problems.

As shown in Figure 1, the meaning of each linguistic value is specified by a triangular membership function on the unit interval [0, 1]. "Don't care" has been handled by a special linguistic value with the following membership function:

$$\mu_{\text{don't care}}(x) = \begin{cases} 1, & 0 \le x \le 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$
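The membership functions can be sketched as follows. Since the figure itself is not reproduced here, the exact shape is an assumption: five triangles with peaks evenly spaced at 0, 0.25, 0.5, 0.75 and 1, each falling to zero at the neighboring peaks, which is a common partition of the unit interval for this kind of classifier. The "don't care" function follows Eq. 6 directly.

```python
LABELS = ["S", "MS", "M", "ML", "L"]


def triangular(x, k, n_labels=5):
    """Membership of x in the k-th (0-based) of n_labels triangular sets on [0, 1].

    Assumption: peaks are evenly spaced and each triangle reaches zero
    at the peaks of its neighbors.
    """
    if not 0.0 <= x <= 1.0:
        return 0.0
    width = 1.0 / (n_labels - 1)
    peak = k * width
    return max(0.0, 1.0 - abs(x - peak) / width)


def dont_care(x):
    """Eq. 6: full membership everywhere on the unit interval."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def memberships(x):
    """Degree of membership of x in each of the five linguistic values."""
    return {label: triangular(x, k) for k, label in enumerate(LABELS)}


halfway = memberships(0.125)  # halfway between the peaks of S and MS
```

Under this partition a value lying between two peaks belongs partly to both neighboring sets, which is what allows gradual rather than crisp membership.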

In this study, a small number of fuzzy if-then rules have been randomly generated. When the antecedent fuzzy sets of a fuzzy if-then rule are specified, its consequent class and certainty grade are determined by applying a heuristic [21]. After generating a small number of initial fuzzy if-then rules, the genetic algorithm has been applied to optimize the initial rule set so that it will be able to classify the test set with a reasonable classification accuracy. This work does not involve the adjustment of membership functions or certainty grades.
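The text does not spell out the heuristic, so the sketch below uses one that is common for this family of fuzzy classifiers and should be read as an assumption: a rule's compatibility with a pattern is the product of its antecedent memberships, the consequent class is the class with the largest summed compatibility over the training patterns, and the certainty grade is that class's sum minus the average of the other classes' sums, divided by the total. Features are assumed to be scaled to [0, 1], and all names are hypothetical.

```python
import random
from math import prod

DONT_CARE = 5  # antecedent codes 0..4 stand for S, MS, M, ML, L


def membership(code, x):
    """Triangular sets with peaks at 0, 0.25, ..., 1 (assumed), plus don't care."""
    if code == DONT_CARE:
        return 1.0 if 0.0 <= x <= 1.0 else 0.0
    return max(0.0, 1.0 - abs(x - code * 0.25) / 0.25)


def compatibility(antecedents, pattern):
    """Product of the antecedent memberships over all features."""
    return prod(membership(a, x) for a, x in zip(antecedents, pattern))


def random_rule(n_features, rng):
    """Randomly pick one of the six antecedent values per feature."""
    return [rng.randrange(6) for _ in range(n_features)]


def consequent(antecedents, patterns, labels):
    """Return (class, certainty grade) for a rule, or (None, 0.0) if it fires nowhere."""
    classes = sorted(set(labels))
    beta = {c: 0.0 for c in classes}
    for pattern, label in zip(patterns, labels):
        beta[label] += compatibility(antecedents, pattern)
    total = sum(beta.values())
    if total == 0.0:
        return None, 0.0
    best = max(classes, key=lambda c: beta[c])
    others = (total - beta[best]) / max(len(classes) - 1, 1)
    return best, (beta[best] - others) / total


patterns = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.5]]
labels = ["a", "a", "b"]
rule = [0, DONT_CARE]  # "x1 is small, x2 is don't care"
rule_class, certainty = consequent(rule, patterns, labels)
```

In the toy data the rule only fires on patterns of class "a", so it receives that class with the highest possible certainty grade.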

Assume that the fuzzy if-then rule $R_j$ in Eq. 5 is denoted by its n antecedent fuzzy sets as $R_j = A_{j1} \ldots A_{jn}$. That is, $R_j$ is coded as a string (chromosome) of length n. Let S be a set of N fuzzy if-then rules (i.e., $S = \{R_1, \ldots, R_N\}$). S is denoted by a concatenated string of length n × N, where each substring of length n corresponds to a single fuzzy if-then rule. In other words, the rule set S is formulated as:

$$S = A_{11} \ldots A_{1n}\, A_{21} \ldots A_{2n} \ldots A_{N1} \ldots A_{Nn}. \qquad (7)$$

The fitness of the rule set S is measured as fitness(S) = NCP(S), where NCP(S) is the number of training patterns correctly classified by S.

The genetic algorithm has been set to use uniform crossover, where each substring is handled as a block. That is, some rules are exchanged between the two parents by the crossover. A mutation operation has been set to randomly replace an antecedent fuzzy set of a rule with another one.
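The encoding, fitness and genetic operators described above can be sketched in a few functions. The paper relies on MATLAB's built-in genetic algorithm functions, so this is only an illustration. Antecedents are coded as integers 0 to 5 (five linguistic values plus don't care); the per-block swap probability of 0.5, the per-gene application of the mutation rate, and the `classify` callback are assumptions.

```python
import random


def encode(rule_set):
    """Concatenate N rules of length n into one chromosome (Eq. 7)."""
    return [a for rule in rule_set for a in rule]


def decode(chromosome, n):
    """Split a chromosome back into rules of length n."""
    return [chromosome[k:k + n] for k in range(0, len(chromosome), n)]


def fitness(rule_set, patterns, labels, classify):
    """NCP(S): number of training patterns that classify(rule_set, x) gets right."""
    return sum(1 for x, y in zip(patterns, labels) if classify(rule_set, x) == y)


def uniform_crossover(parent_a, parent_b, n, rng):
    """Exchange whole rules (blocks of length n) between two parents."""
    child_a, child_b = [], []
    for rule_a, rule_b in zip(decode(parent_a, n), decode(parent_b, n)):
        if rng.random() < 0.5:  # assumed swap probability per block
            rule_a, rule_b = rule_b, rule_a
        child_a.extend(rule_a)
        child_b.extend(rule_b)
    return child_a, child_b


def mutate(chromosome, rng, rate=0.1, n_values=6):
    """With probability `rate` per gene, replace an antecedent by a different one."""
    out = list(chromosome)
    for k, gene in enumerate(out):
        if rng.random() < rate:
            out[k] = rng.choice([v for v in range(n_values) if v != gene])
    return out


rng = random.Random(1)
a = encode([[0, 1], [2, 3]])
b = encode([[4, 5], [0, 0]])
child_a, child_b = uniform_crossover(a, b, n=2, rng=rng)
```

Because crossover swaps whole blocks, every rule in a child is an intact rule from one of the parents; only mutation changes individual antecedents.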

4 Experimental Analysis

This section describes the evaluation criteria and the conducted experiments. We summarize the experimental results and highlight the performance and applicability of the system.

4.1 Evaluation Criteria

After linking the customers of the service provider network into a cluster tree, it has to be decided at which level to divide the tree to generate the clusters. For the community identification problem, the validity of the clusters produced by a clustering algorithm is an important consideration, because once the clustering is complete, each of the clusters must be labeled and then used in the classification task. Since the appropriate number of clusters is judged by the accuracy of the resulting classifier, the approach used to determine the validity of the cluster divisions is to compare the accuracy of the classifiers built on different cluster divisions. That is, we first divide the cluster tree at a level that generates two clusters. Then we label the clusters and build a classifier using the SVM algorithm. At each subsequent iteration, we increase the number of clusters by 2 and divide the cluster tree such that it generates that number of clusters. The stopping criterion is that an acceptable classification accuracy is obtained using the generated clusters.

To evaluate the accuracy of the classifier model, the cross validation method has been used. The data is randomly divided into 5 disjoint groups. The first group is set aside for testing and the other four are put together for model building. The model built on the 80% group is then used to predict the group that was set aside. This process is repeated a total of five times as each group in turn is set aside. Finally, a model is built using all the data. The mean of the five independent error rate predictions is used as the error rate for the final model.
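The evaluation procedure above has two parts: a 5-fold cross validation estimate of the error rate, and an outer loop that raises the number of clusters in steps of 2 until the accuracy is acceptable. A minimal sketch of both follows. The `train`, `predict` and `accuracy_for` callbacks, the random seed and the `max_clusters` bound are hypothetical placeholders, not part of the paper.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Randomly split item indices into k disjoint groups."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]


def cross_validated_error(X, y, train, predict, k=5, seed=0):
    """Mean of the k held-out error rates."""
    errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(1 for i in held_out if predict(model, X[i]) != y[i])
        errors.append(wrong / len(held_out))
    return sum(errors) / k


def choose_cluster_count(accuracy_for, acceptable, max_clusters, step=2):
    """Start at 2 clusters and add `step` until the accuracy is acceptable."""
    k = 2
    while k <= max_clusters:
        if accuracy_for(k) >= acceptable:
            return k
        k += step
    return None  # no division reached the acceptable accuracy


# Toy usage with a majority-class "model" standing in for the real classifier.
majority = lambda X, y: max(set(y), key=y.count)
error = cross_validated_error(list(range(10)), [1] * 10, majority, lambda m, x: m)
```

In a full pipeline, `accuracy_for(k)` would cut the cluster tree into k clusters, label the customers accordingly and return one minus the cross-validated error.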

4.2 Experimental Results

Using the CDR data, a cluster tree has been built. In all the experiments, α has been set to 0.75 in the distance measure formula. By choosing such a large value, we give more weight to the second order distance than to the first order distance. We believe that a customer's indirect calling patterns will provide more useful information than direct calling patterns, since a customer is more likely to call numbers which are not inside his/her home service provider's network.
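A cluster tree of this kind can be illustrated with a naive agglomerative procedure over a precomputed distance matrix. The paper uses the MATLAB Statistics Toolbox for this step; the pure-Python version below is only a sketch. The linkage criterion is not stated in the text, so average linkage is an assumption, and stopping the merges at a given number of clusters stands in for cutting the tree at that level.

```python
def agglomerative_clusters(dist, n_clusters):
    """dist: symmetric distance matrix (list of lists); returns one label per item."""
    clusters = [[i] for i in range(len(dist))]

    def average_link(a, b):
        # Assumed linkage: mean pairwise distance between two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, p, q = min(
            (average_link(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy matrix: customers 0 and 1 are close, as are customers 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
labels = agglomerative_clusters(dist, n_clusters=2)
```

In practice the entries of `dist` would be the combined distances of Eq. 4 computed with α = 0.75. This naive loop is far too slow for thousands of customers and is meant only to show the merging logic.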

The overall effectiveness of the clustering algorithm is calculated using the overall accuracy of the classifier model. This overall accuracy measurement determines how well the clustering algorithm is able to create communities that contain customers with similar behaviors. The number of correctly classified customers in a cluster is referred to as the True Positives (TP). Customers that are not correctly classified are considered False Positives (FP). The overall accuracy is thus calculated as follows:

$$\text{overall accuracy} = \frac{\sum \text{TP for all classes}}{\sum (\text{TP} + \text{FP}) \text{ for all classes}}$$
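As a small worked example of this measure, the sketch below counts true and false positives per predicted class and divides the total number of true positives by the number of classified customers (every customer counts as either a TP or an FP). The function name and toy labels are hypothetical.

```python
from collections import Counter


def overall_accuracy(true_labels, predicted_labels):
    """Sum of TP over all classes divided by the number of classified customers."""
    tp, fp = Counter(), Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            tp[predicted] += 1   # correctly placed in this class
        else:
            fp[predicted] += 1   # wrongly placed in this class
    total_tp = sum(tp.values())
    return total_tp / (total_tp + sum(fp.values()))


accuracy = overall_accuracy(
    ["a", "a", "b", "b", "c"],
    ["a", "b", "b", "b", "c"],
)
```

Here four of the five customers are assigned to their own community, so the overall accuracy is 0.8.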

Table 2. Overall accuracy

Algorithm       Average   Minimum   Maximum
SVM             99.75%    97.95%    100%
Fuzzy-Genetic   86.4%     75.25%    94%

The comparison of the overall accuracy of the SVM classifier and the fuzzy-genetic approach using 5-fold cross validation can be seen in Table 2. The LIBSVM default parameter settings have been used for running the SVM algorithm. Based on some initial test runs, the following settings have been applied for running the fuzzy-genetic classifier: 1) The number of fuzzy rules: 40, 60 or 80; 2) The number of rule sets: 20; 3) Crossover probability: 0.9; 4) Mutation probability: 0.1; 5) Stopping condition: 500 generations.

For the given CDR data, SVM has an average overall accuracy of 98.5%, whereas in comparison, the fuzzy-genetic classifier has an overall accuracy of 82.5%. Thus, we find that SVM outperforms the fuzzy-genetic classifier by almost 13%. This shows that using a genetic algorithm for tuning the fuzzy rules of the fuzzy-genetic classifier results in a reasonable accuracy, but not as high as the accuracy obtained by SVM. However, the rules in the fuzzy-genetic classifier are easily understandable and interpretable.

Table 3. Running time of fuzzy-genetic classification algorithm

Data set   Number of rules   Accuracy   CPU time
CDR Data   40                86.4%      448 min
CDR Data   60                80.80%     365 min
CDR Data   80                73.15%     378 min

The runtime of both approaches is an important consideration because the model building phase is computationally time consuming. For the analysis, all operations were performed on a Dell Optiplex 745 with an Intel Core2 Duo 6600 @ 2.4 GHz processor and 3 GB of RAM. The number of data objects in the training set is 2000. In general, the runtime for the SVM classifier was significantly less than that of the fuzzy-genetic algorithm when building the classification models. For example, with 2000 objects, running 5-fold classification with SVM took less than a second, whereas the fuzzy-genetic algorithm took much longer to build the classification model. The running time of the fuzzy-genetic approach while varying the number of fuzzy rules is shown in Table 3. Although the SVM classifier was faster, the size of the training set is ultimately limited by the amount of memory because both approaches must load the entire training set into memory before building the model.

5 Summary and Conclusions

In this paper, we demonstrated how cluster analysis can be used to effectively identify calling communities by using information derived from the CDR data. We used the information extracted from the cluster analysis to identify customer calling patterns. These calling patterns are then given to a classification algorithm to generate a classifier model for predicting the calling communities of a customer. This work is especially important for targeted marketing campaigns in the telecommunication industry since the CDR data is often the only primary data source available for the customers. Based on the assumption that customers in the same calling community might behave similarly, targeted efforts can be focused on certain communities. Further, the fuzzy classification proposed in this project provides more convenience for selecting customer communities and for measuring the efficiency and validity of the communities regarding the marketing campaign design. The flexibility of the fuzzy approach, provided by the membership functions, makes it possible to increase or decrease the homogeneity between the targeted customers depending on whether the proposed products are very specific or intended for a large community.

References

[1] E. Allwein, R. Schapire and Y. Singer: "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," AT&T Corp., 2000.

[2] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. Proc. of ACM KDD, 2006.

[3] J. Baumes, M. Goldberg, M. Magdon-Ismail, and W. Wallace. Discovering hidden groups in communication networks. Proc. of NSF/NIJ Symp. on Intelligence and Security Informatics, 2004.

[4] T. Y. Berger-Wolf and J. Saia. A framework for analysis of dynamic social networks. Proc. of ACM KDD, 523-528, 2006.

[5] L. B. Booker, D. E. Goldberg, and J. H. Holland: "Classifier systems and genetic algorithms," Artificial Intelligence, Vol.40, No.1-3, pp.235-282, September 1989.

[6] C.C. Chang and C.J. Lin: "LIBSVM: A Library for Support Vector Machines", URL: [http://www.csie.ntu.edu.tw/~cjlin/libsvm], 2001, Last Accessed on 16/7/2006.

[7] H. Ishibuchi, K. Nozaki and H. Tanaka: "Distributed representation of fuzzy rules and its application to pattern classification," Fuzzy Sets and Systems, Vol.52, No.1, pp.21-32, Nov. 1992.

[8] M. Kretzschmar and M. Morris. "Measures of concurrency in networks and the spread of infectious disease," Math. Biosci., 133:165-195, 1996.

[9] M. Magdon-Ismail, M. Goldberg, W. Wallace, and D. Siebecker. "Locating hidden groups in communication networks using hidden Markov models," Proc. of ISI, 2003.

[10] B. Malin. "Data and collocation surveillance through location access patterns," Proc. NAACSOS Conf., 2004.

[11] L. A. Meyers, M. Newman, and B. Pourbohloul. Predicting epidemics on directed contact networks. Journal of Theoretical Biology, 240:400-418, 2006.

[12] M. Nasrullah, H. L. Larsen: "Structural Analysis and Mathematical Methods for Destabilizing Terrorist Networks," Proc. of ADMA, pp.1037-1048, 2006.

[13] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev., 69, 2004.

[14] M. Newman, A.-L. Barabasi, and D. J. Watts, editors. The Structure and Dynamics of Networks. Princeton University Press, 2006.

[15] S. F. Smith: "A learning system based on genetic algorithms," Ph.D. Dissertation, University of Pittsburgh, Pittsburgh, PA, 1980.

[16] V. N. Vapnik: Statistical Learning Theory, John Wiley, NY, p.732, 1998.

[17] N. Werro, H. Stormer, and Andreas Meier: "A Hierarchical Fuzzy Classification of Online Customers", Proc. of IEEE International Conference on e-Business Engineering, Shanghai, 2006.

[18] D. Whitley: "A genetic algorithm tutorial," Statistics and Computing, (4):65-85, 1994.

[19] L. Yan, M. Fassino, and P. Baldasare: "Predicting customer behavior via calling links", Proc. of IEEE International Joint Conference on Neural Networks, Vol.4, pp.2555-2560, 2005.

[20] URL: [http://en.wikipedia.org/wiki/Call record], Last Accessed on 9/2/2008.

[21] URL: [http://www.mathworks.com/], Last Accessed on 9/2/2008.