




Expert Systems with Applications 37 (2010) 3657–3665


A new feature set with new window techniques for customer churn prediction in land-line telecommunications

B.Q. Huang a,*, T.-M. Kechadi a, B. Buckley b, G. Kiernan b, E. Keogh b, T. Rashid a

a School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
b Eircom Limited, 1 Heuston South Quarter, St. Johns Road, Dublin 8, Ireland

Article info

Keywords: Churn prediction; Window techniques; Neural networks; Support vector machines; Decision trees

0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2009.10.025

* Corresponding author. E-mail address: [email protected] (B.Q. Huang).

Abstract

In order to improve the prediction rates of churn prediction in the land-line telecommunication service field, this paper proposes a new set of features together with three new input window techniques. The new features are demographic profiles, account information, grant information, Henley segmentation, aggregated call details, line information, service orders, and bill and payment history. The basic idea of the three input window techniques is to make the positions of some monthly aggregated call-detail features from previous months in the combined feature set the same in the testing phase as in the training phase. To evaluate these new features and window techniques, the two most common modelling techniques (decision trees and multilayer perceptron neural networks) and one of the most promising approaches (support vector machines) are selected as predictors. The experimental results show that the new features with the new window techniques are efficient for churn prediction in the land-line telecommunication service field.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Service companies in the telecommunication industry in particular suffer from the loss of valuable customers to competitors; this is known as customer churn. In the last few years, there have been many changes in the telecommunications industry, such as new services, new technologies, and the liberalisation of the market. For the fixed-line telecommunication industry, customer churn causes such a heavy loss of fixed-line service that it is a very serious problem. Recently, data mining techniques have emerged to tackle this challenging problem (Au, Chan, & Yao, 2003; Bin, Peiji, & Juan, 2007; Coussement & den Poe, 2008; Hung, Yen, & Wang, 2006; John, Ashutosh, Rajkumar, & Dymitr, 2007; Lazarov & Capota, 2007; Wei & Chiu, 2002). As one of the important measures for retaining customers, churn prediction has been a concern in the telecommunication industry and in research (Bin et al., 2007). Over the last decade, many data mining techniques have emerged for churn prediction in the mobile and fixed-line service fields. As mentioned in Bin et al. (2007) and Zhang et al. (2006), in contrast to mobile services, fewer researchers have investigated churn prediction for the land-line telecommunication industry.

Generally, the information used to build models for churn prediction in the mobile telecoms industry includes customer demographics, contractual data, customer service logs, call details, complaint data, and bill and payment data (Bin et al., 2007; Hadden, Tiwari, Roy, & Ruta, 2006; Hung et al., 2006; John et al., 2007; Wei & Chiu, 2002). In contrast to mobile services, land-line service providers hold smaller amounts of qualified information (Bin et al., 2007; Zhang, Qi, Shu, & Li, 2006). The data of land-line communication services differs from that of mobile services, and some of it is missing, unreliable, or incomplete. For instance, customer ages, complaint data and fault reports are unavailable, and only the call details of a few months are available. Due to business confidentiality and privacy, there are no public datasets for churn prediction. In addition, different telecoms have different datasets. Therefore, it is difficult to compare input features for churn prediction. Recently, for churn prediction in the land-line telecommunication service field, Qi et al. (Bin et al., 2007; Zhang et al., 2006) presented a set of features/variables: duration of service use, payment type, and the amount and structure of monthly service fees (such as proportion variables, consumption level rate variables, the growth rate of the second three months, etc.).

Most of the literature (Hadden et al., 2006; Hung et al., 2006; John et al., 2007; Wei & Chiu, 2002) clearly shows how to select the aggregated call details for predictors. In the literature (Euler, 2005; Hadden et al., 2006; Hung et al., 2006; John et al., 2007; Wei & Chiu, 2002), churn prediction has been performed by utilising a combination of the following inputs: (i) time series features which contain a slice of aggregated call details from the previous and current months, and (ii) constant (non-time series) features which encode details specific to the customer, e.g. previous bill totals, customer address, and payment method.

Many predictive modelling techniques have been proposed for churn prediction, such as Naive Bayes, k-nearest-neighbour, logistic regression, decision trees (DT), and artificial neural networks (ANNs). These techniques are reviewed in the literature (Hadden et al., 2006; John et al., 2007; Wei & Chiu, 2002). DT and ANN are the most successful techniques for churn prediction (Hadden et al., 2006; John et al., 2007; Wei & Chiu, 2002). According to the literature (Hadden et al., 2006; John et al., 2007; Wei & Chiu, 2002), ANNs can obtain higher accuracy than DTs. However, Qi and Zhang (Zhang et al., 2006) reported that DTs can achieve higher prediction rates for land-line communication. In addition, one elegant approach, support vector machines (SVMs) (Boser, Guyon, & Vapnik, 1992; Vapnik, 1998), has been successfully applied to pattern recognition (Burges, 1998) (e.g. handwriting recognition (Boser et al., 1992), image processing (Tong & Chang, 2001), etc.). Recently, SVMs have emerged to predict customer churn for telecoms (Coussement & den Poe, 2008; Zhao, Li, Li, Liu, & Ren, 2005). This shows that SVMs are very promising for churn prediction.

However, there are still some limitations with the existing input features/variables and with the traditional window techniques that present features/variables to a predictor. The existing input features/variables are not robust enough to obtain higher prediction rates. With the traditional input window technique, the order in which the aggregated call details of each month are input to the predictive model in the training phase differs from the order used by the same model in the testing/predicting phase. If the input order of some monthly call details in the testing phase can be aligned with that of the training phase, the prediction rates might be improved. In addition, most authors (Bin et al., 2007; Zhang et al., 2006) did not provide the details of how features/variables are presented to a predictor, and did not compare the common modelling techniques and SVMs for churn prediction in land-line communication.

In order to improve churn prediction rates, this paper presents a new set of features/variables with three new input window techniques. The new features/variables are demographic profiles, account information, grant information, Henley segmentation, aggregated call details, line information, service orders, and bill and payment history, which are extracted from the existing limited information. In contrast to the traditional window technique, the new input window techniques attempt to make the positions of some aggregated monthly call details in the combined feature set the same in the testing phase as in the training phase. Decision trees, multilayer perceptron neural networks and support vector machines are selected as predictors to evaluate the new features and window techniques. In addition, a comparison of these three modelling techniques is presented. The experimental results show that the new features with the new window techniques are efficient for churn prediction in the land-line telecommunication service field.

The rest of this paper is organised as follows: Section 2 outlines the churn definition and the evaluation criteria of predictions. Section 3 describes our proposed prediction system and gives the details of the new features and the new window techniques. The modelling techniques, Decision Tree C4.5, multilayer perceptron neural networks and support vector machines, are introduced in Section 3.3. Section 4 provides the experimental results with discussion. The conclusions of this paper and future work are given in the last section.

Table 1. Confusion matrix.

                       Predicted
                       NON-CHU    CHU
Actual    NON-CHU      a11        a12
          CHU          a21        a22

2. Churn definition and evaluation criteria

Churn in telecoms is the term used to describe collectively the ceasing of customer subscriptions to a service. For a telecommunication service company, a customer is someone who has joined the company for at least a given period (e.g. six months); a churner is a customer who has left the company. Churn can be categorised into the following types: (1) package churn, customers moving between packages offered by the company (e.g. customers moving from base rates to a talktime package); (2) service churn, customers churning from a service (e.g. churning from broadband); and (3) operator (company) churn, customers churning to a competitor. At this stage, this paper only investigates operator churn. For the remainder of this paper, churn refers to operator churn.

Once a classification model is available, it is used to predict the future behaviour of customers. As one of the important steps to ensure that the model generalises well, the performance of the predictive churn model has to be evaluated. Table 1 shows a confusion matrix, where a11 is the number of correctly predicted nonchurners, a12 is the number of nonchurners incorrectly predicted as churners, a21 is the number of churners incorrectly predicted as nonchurners, and a22 is the number of correctly predicted churners. From the confusion matrix, the most common evaluation measures for a predictive model are defined as follows:

• The overall accuracy (OA) is the proportion of the total number of predictions that were correct, calculated by $\frac{a_{11}+a_{22}}{a_{11}+a_{12}+a_{21}+a_{22}}$.
• The accuracy of true churn (TC) is the proportion of churn cases that were correctly identified, written as $\frac{a_{22}}{a_{21}+a_{22}}$.
• The accuracy of true nonchurn (TN) is defined as the proportion of nonchurn cases that were classified correctly, calculated by $\frac{a_{11}}{a_{11}+a_{12}}$.
• The false nonchurn rate (FN) is the proportion of nonchurn cases that were incorrectly classified as churn, written as $\frac{a_{12}}{a_{11}+a_{12}}$.
• The false churn rate (FC) is the proportion of churn cases that were incorrectly classified as nonchurn, written as $\frac{a_{21}}{a_{21}+a_{22}}$.
• The precision rate (P) is the proportion of the predicted churn cases that were correct, written as $\frac{a_{22}}{a_{12}+a_{22}}$.
• The geometric means are defined by $\text{g-mean}_1 = \sqrt{TC \times P}$ and $\text{g-mean}_2 = \sqrt{TC \times TN}$.
• The average rate (AVG) is the average of the OA, TC and TN rates.

In this paper, we are more interested in overall accuracy, theaccuracy of true churn and accuracy of true nonchurn.
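The rates above follow directly from the four confusion-matrix counts. As a concrete illustration, the following minimal Python sketch (the function name and output layout are our own) computes all of them as fractions of 1; applied to the Decision Tree C4.5 counts of window 2(a) in Table 3, it reproduces the corresponding percentages of Table 5.

```python
import math

def churn_metrics(a11, a12, a21, a22):
    """Evaluation rates of Section 2 from a 2x2 confusion matrix:
    a11/a12 = nonchurners predicted as nonchurn/churn,
    a21/a22 = churners predicted as nonchurn/churn."""
    oa = (a11 + a22) / (a11 + a12 + a21 + a22)   # overall accuracy
    tc = a22 / (a21 + a22)                       # true churn rate
    tn = a11 / (a11 + a12)                       # true nonchurn rate
    fn = a12 / (a11 + a12)                       # false nonchurn rate
    fc = a21 / (a21 + a22)                       # false churn rate
    p = a22 / (a12 + a22)                        # precision
    return {
        "OA": oa, "TC": tc, "TN": tn, "FN": fn, "FC": fc, "P": p,
        "g-mean1": math.sqrt(tc * p),
        "g-mean2": math.sqrt(tc * tn),
        "AVG": (oa + tc + tn) / 3,
    }

# Example: DT with window 2(a) and the new features (Table 3);
# multiplying by 100 gives the Table 5 row (88.03, 66.60, ...).
print(churn_metrics(16232, 1964, 334, 666))
```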

3. Methodology

Our system for churn prediction consists of data sampling, preprocessing, and classification/prediction phases. Data sampling randomly selects a set of customers with their related information, according to the churn definition. Preprocessing (also called data preparation) removes noise and extracts, combines and normalises variables/features. For some modelling techniques (e.g. decision trees), the features/variables do not need to be normalised, so the normalisation step of the preprocessing phase can be skipped for such modelling techniques. The main task in the classification phase is to employ modelling techniques to identify the future behaviour of customers. Fig. 1 shows these phases of the churn predictive modelling system. This paper focuses mainly on the extraction and combination of variables/features.

Fig. 1. System overview.

3.1. Data sampling

In this paper, 47,391 customers were randomly selected from the telecom company of Ireland (Eircom, 2008), according to the churn definition described in Section 2. The training dataset contains 9999 churners and 18,196 nonchurners, a total of 28,195 customers. The testing dataset contains 19,196 customers, comprising 1000 churners and 18,196 nonchurners.

3.2. Preprocessing

3.2.1. Noise reduction
Noise is irrelevant information which would cause problems for the subsequent processing steps; therefore, noisy data should be removed. This irrelevant information includes misspelled words introduced by staff (e.g. incorrectly spelled customer county names), special symbols that data mining software cannot deal with (e.g. some special mathematical symbols), missing values often found in database tables (usually identified by the strings "NULL" or "null", blank spaces or some special symbols), and duplicated information (e.g. the same attributes with the same values appearing in different tables of a database). This noise can be removed by finding its locations and replacing it with correct values. For instance, correct county names are used to replace incorrect ones; missing values identified by "NULL" or blank spaces are replaced by neutral values; and one copy of any duplicated data is kept while the others are removed.

3.2.2. Feature/variable extraction
Feature/variable extraction plays the most important role, as it directly influences the performance of predictive models in terms of prediction rates. If a robust set of features/variables is extracted in this phase, a significant improvement will be yielded. However, it is not easy to obtain such a set of features/variables. As mentioned above, most of the existing features/variables (Bin et al., 2007; Zhang et al., 2006) are inefficient, producing unacceptable prediction rates for the land-line telecommunication service field. In order to improve prediction accuracy, based on the existing information provided by the telecoms of Ireland, this paper presents a new set of features/variables, described as follows:

• Demographic profiles: Demographic information is often used to describe a demographic grouping or a market segment. Usually, this information includes age bands, social class bands, and gender. However, the demographic information available for this research is gender and county; these are selected as new features/variables for churn prediction. The demographic information might indicate the potential behaviour of customers. For instance, wealthy customers may continue to use the service products for more years than the middle and lower classes and may be willing to pay more; customers from some particular counties are more selective and cease the land-line services more often; males may continue to use the services for a longer period than females. Therefore, this information is important for predicting the future behaviour of a customer.

• Information of grants: Some customers have obtained special grants and their bills are paid fully or partly by a third party. For example, if customers have a disability or are over 80 years old, their bills will be paid by the Department of Social and Family Affairs. Usually, such customers will not churn and will continue using their services. This information is extracted by locating the corresponding attributes.

• Account information: Account information includes the status of accounts, the date the account was created, the bill frequency, the service packages, the balance of accounts, payment types, and general call information, which comprises the summarised call duration, the number of calls and standard prices, the fees that the customers should pay and the fees that the customers have already paid. Because it records the current events on customer accounts, the account information is very useful for predicting the behaviour of customers in the next observation period. For example, the customers who obtained their accounts earlier might be happier with the telecommunication services and more willing to use them than the ones who opened their accounts later; the customers with a lower bill frequency might cease the services less often than the ones with a higher bill frequency. Therefore, this information is selected as new features/variables for the modelling techniques. Based on these new features, the averages of the call duration, the number of calls, the standard prices and the actual fees paid over 30 days (the duration being a number of minutes) are also considered as new features. Let "AVE30_DUR", "AVE30_CALL", "AVE30_AMNT" and "AVE30_TXN_AMNT" be the averages of the call duration, the number of calls, the standard prices and the fees paid over 30 days of the most recent bill, respectively. They are obtained by Eq. (1) (a code sketch of Eqs. (1)-(5) follows this feature list):

$$\text{AVE30\_CALL} = \frac{nCalls\_M}{nDays} \times 30, \qquad \text{AVE30\_DUR} = \frac{Duration\_M}{nDays} \times 30,$$
$$\text{AVE30\_AMNT} = \frac{Fees\_M}{nDays} \times 30, \qquad \text{AVE30\_TXN\_AMNT} = \frac{Fee\_C}{nDays} \times 30 \qquad (1)$$

where "nCalls_M" is the number of calls in the most recent bill, "Duration_M" is the duration of the most recent bill, "Fees_M" is the fees of the most recent bill, "Fee_C" is the fees paid by the customers, and "nDays" is the number of days covered by the bill, which can be obtained by Eq. (2):

$$nDays = endDate - startDate \qquad (2)$$

where "startDate" and "endDate" are the starting and ending dates of the bill. In addition, the ratio between the actual fees to be paid and the call duration of the current bill is extracted as a new feature, written as Eq. (3):

$$R\_AMNT\_DUR = \frac{Fees\_M}{Duration\_M} \qquad (3)$$

• Call details: If customers do not use the services, they might cease the services in the future; if the fees of the services suddenly increase or decrease, customers might cease the services sooner. Call details can reflect this information: how often customers have used the services, the related payments, and so on. The use of call details in churn prediction is reported in Wei and Chiu (2002) and Zhang et al. (2006). The call details contain the call duration, price and call type (e.g. international or local) of every call. It is difficult to store the call details of every call, every month, for every customer, so most telecommunication companies keep the call details of only a few months. The company provided only the call details of the most recent six months for this research. Based on these six-month call details, the aggregated number of calls, duration and fees are extracted as new features. The basic idea for extracting these features is to segment the call details into a number of defined periods, then to aggregate the duration, fees and number of calls of each period for every customer. In the literature (Wei & Chiu, 2002; Zhang et al., 2006), it is reported that call details aggregated over every 15 or 20 days are efficient. In this paper, the six-month call details are segmented into 15-day periods, and the number of calls, duration and fees of each 15-day segment are aggregated for each customer. For a segment i of a customer's call details, let the aggregated number of calls, duration and fees be "CALL_Ni", "DURi" and "COSTi", respectively. The changed number of calls, changed duration and changed cost between two consecutive segments of call details can be obtained by Eq. (4):

$$CH\_DUR_{i,i-1} = \frac{|DUR_i - DUR_{i-1}|}{\sum_{j=2}^{M'} |DUR_j - DUR_{j-1}|}, \qquad
CH\_N_{i,i-1} = \frac{|CALL\_N_i - CALL\_N_{i-1}|}{\sum_{j=2}^{M'} |CALL\_N_j - CALL\_N_{j-1}|}, \qquad
CH\_C_{i,i-1} = \frac{|COST_i - COST_{i-1}|}{\sum_{j=2}^{M'} |COST_j - COST_{j-1}|} \qquad (4)$$

where $M'$ is the number of call-detail segments, and i and j are the indexes of call-detail segments, with $2 \leq i, j \leq M'$. In addition, the increment rates of the number of calls, duration and fees are calculated by Eq. (5):

$$R\_DUR_{i,i-1} = \frac{CH\_DUR_{i,i-1}}{CH\_DUR_{i,i-1} + DUR_{i-1}}, \qquad
R\_N_{i,i-1} = \frac{CH\_N_{i,i-1}}{CH\_N_{i,i-1} + CALL\_N_{i-1}}, \qquad
R\_C_{i,i-1} = \frac{CH\_C_{i,i-1}}{CH\_C_{i,i-1} + COST_{i-1}} \qquad (5)$$

Thus, the new features include the number of calls, duration and fees, the changed number of calls, changed duration and changed fees, and the rates of increase of the number of calls, duration and fees.

• Service orders: These describe the services ordered by customers. The quantity of ordered services and the amount of rental are selected as new features.

• Henley segmentation: The Henley segmentation algorithm (Group, 2006) splits customers and potential customers into different groups or levels according to characteristics, needs, and commercial value. There are two types of Henley segments: individual and discriminant segments. The individual segments include the ambitious techno enthusiast (ATE) and comms intense families (CIF) Henley segments. The discriminant segments (DS) are mutually exclusive segments and can represent the loyalty of customers. The Henley segments ("DS", "ATE" and "CIF") of the two most recent six-month periods are selected as new input features. As before, missing Henley segment information is replaced by neutral data.

• Payment and bill information: This stores the history of customer bills for the most recent three years. Each bill includes the cost, prices, rental fees, call duration, paid-off fees and so on. From the churn definition, some customers have only a few monthly bills; however, the bills of the most recent six months are used for model development. The attributes monthly cost, rental fees, call duration and paid-off fees are extracted as new input features/variables, represented by "mnCost", "mnRent_fees", "mnDur" and "payedfee", respectively. From these new features/variables, the changed cost, changed call duration and changed rental fees are created as a new set of input features/variables. They can be obtained by Eq. (6):

$$Changed\_cost_{i,i-1} = \frac{|mnCost_i - mnCost_{i-1}|}{\sum_{j=2}^{T} |mnCost_j - mnCost_{j-1}|}, \qquad
Changed\_Duration_{i,i-1} = \frac{|mnDur_i - mnDur_{i-1}|}{\sum_{j=2}^{T} |mnDur_j - mnDur_{j-1}|},$$
$$Changed\_rental\_Fees_{i,i-1} = \frac{|mnRent\_fees_i - mnRent\_fees_{i-1}|}{\sum_{j=2}^{T} |mnRent\_fees_j - mnRent\_fees_{j-1}|} \qquad (6)$$

where $mnCost_i$, $mnDur_i$ and $mnRent\_fees_i$ are the cost, call duration and rental fees of the bill of the ith month.

• Telephone line information: This includes an indicator of whether a customer has a voice mail service on his phone, the number of lines, and so on. The customers who have more lines might value the services more and might be more willing to continue using them; the customers whose telephones have voice mail services might use the services more often and might be less likely to cease them. This information might be useful for a modelling technique. Therefore, the number of lines and the voice mail indicator are selected as new input features/variables.
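As noted above, here is a minimal Python sketch of the feature computations of Eqs. (1)-(5); the function names and input layout are our own, and the per-segment aggregates are assumed to have been computed already.

```python
import numpy as np
from datetime import date

def ave30_features(n_calls_m, duration_m, fees_m, fee_c, start_date, end_date):
    """Eqs. (1)-(3): 30-day averages of the most recent bill and the
    fee/duration ratio."""
    n_days = (end_date - start_date).days            # Eq. (2)
    return {
        "AVE30_CALL": n_calls_m / n_days * 30,
        "AVE30_DUR": duration_m / n_days * 30,
        "AVE30_AMNT": fees_m / n_days * 30,
        "AVE30_TXN_AMNT": fee_c / n_days * 30,
        "R_AMNT_DUR": fees_m / duration_m,           # Eq. (3)
    }

def change_and_rate_features(x):
    """Eqs. (4)-(5) for one aggregated series x = (x_1, ..., x_M') of
    15-day segments, e.g. the durations DUR_1..DUR_M'."""
    x = np.asarray(x, dtype=float)
    diffs = np.abs(np.diff(x))        # |x_i - x_{i-1}| for i = 2..M'
    ch = diffs / diffs.sum()          # Eq. (4), normalised changes
    rates = ch / (ch + x[:-1])        # Eq. (5), increment rates
    return ch, rates

# Example with hypothetical numbers
print(ave30_features(120, 300.0, 45.0, 45.0, date(2009, 1, 1), date(2009, 2, 1)))
print(change_and_rate_features([40, 42, 35, 50, 48, 44]))
```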

3.2.3. Feature/variable combination
One of the most important steps in churn prediction is how to present the input features to a modelling technique. Usually, input window techniques are used to solve this problem. As mentioned in Euler (2005), once a trained predictor is available, churn prediction has traditionally been attempted by utilising a combination of the following inputs: (i) time series features which contain a slice of aggregated call details from the previous and current months, and (ii) constant (non-time series) features which encode details specific to the customer (e.g. previous bill information, demographic profiles, payment method, etc.). Suppose the whole feature set consists of parts P1 and P2. Part P2 represents the time series features/variables, the aggregated call details of five months {M − 4, ..., M − 1, M} (where M denotes the call details of the Mth month). Part P1 represents the constant features/variables, which exclude the monthly call details. The traditional window technique is shown in Fig. 2a. For the testing or prediction task, feature set P1 and the call-detail features {M − 3, M − 2, M − 1, M} are input to a predictor together (see the left hand side of Fig. 2a). For the training phase, feature set P1 and the call-detail features {M − 4, M − 3, M − 2, M − 1} are input to the model together (see the right hand side of Fig. 2a). As a consequence, the positions of the call details of the same month in the combined features/variables differ between the testing and training phases (compare the two sides of Fig. 2a).

If the variables/features (call details) from the same months are placed at the same positions in the combined feature/variable set for the testing and training phases, or the positions of some call-detail features/variables from the same months in the testing phase are adjusted to be the same as in the training phase, the prediction rates might be improved. Fig. 2b shows the new window technique that adjusts the positions of some monthly call-detail features/variables from the same months to be the same for the testing and training phases: the order of the call details {M − 3, M − 2, M − 1} in the training and testing phases is the same. Fig. 2c and d show the other two new window techniques. In contrast to the traditional window technique of Fig. 2a, the technique of Fig. 2c adds the two-month call-detail features/variables {M − 1, M − 2}; in the testing and training phases, the positions of these added two-month call-detail features/variables are the same. The added monthly call details are identified by gray boxes in Fig. 2c, whose left and right hand sides show how the feature sets are combined for the testing/prediction and training phases, respectively. Fig. 2d shows the remaining new window technique, which invokes two predictors/models. The prediction procedure can be described in three phases: (1) the first predictor, which accepts features P1 and the call details {M − 3, M − 2, M − 1, M}, produces its output; (2) this output is combined with the call details {M − 1, M − 2} as new features for the second predictor; and (3) the second predictor produces the final results. Similarly, for this window technique, the models can be trained by: (1) training the first predictor on features P1 and the call-detail window; (2) combining the output of the first model with the call details {M − 1, M − 2} into a new feature set for the second predictor; and (3) training the second predictor. In the combined features input to the second predictor, the positions of the call details {M − 1, M − 2} in the testing phase are the same as in the training phase. In Fig. 2d, the call details {M − 1, M − 2} (the gray boxes) and the output of the first model/predictor are combined for the second model/predictor, and the positions of {M − 1, M − 2} in the combined features are the same for the testing and training phases.

Fig. 2. Input window techniques.
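The two-predictor scheme of Fig. 2d can be sketched as follows. This is only one possible reading of the procedure, with decision trees standing in for an arbitrary predictor, NumPy arrays for the feature blocks, and all names our own; the essential point is that the call-detail blocks {M − 1, M − 2} appended at the second stage occupy the same positions during training and prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def stage1_inputs(X_p1, cd, last):
    """P1 plus the four-month call-detail slice ending at month `last`;
    cd maps a month index to a 2D array of aggregated call details."""
    return np.hstack([X_p1] + [cd[m] for m in range(last - 3, last + 1)])

def fit_two_stage(X_p1, cd, y, M):
    # Stage 1 is trained on the window shifted one month back, {M-4,...,M-1}
    X1 = stage1_inputs(X_p1, cd, M - 1)
    first = DecisionTreeClassifier().fit(X1, y)
    # Stage 2 appends the fixed months {M-1, M-2} to the first model's output
    X2 = np.hstack([first.predict_proba(X1), cd[M - 1], cd[M - 2]])
    second = DecisionTreeClassifier().fit(X2, y)
    return first, second

def predict_two_stage(first, second, X_p1, cd, M):
    # Stage 1 predicts from the current window {M-3,...,M}; the appended
    # months {M-1, M-2} occupy the same positions as during training
    X1 = stage1_inputs(X_p1, cd, M)
    X2 = np.hstack([first.predict_proba(X1), cd[M - 1], cd[M - 2]])
    return second.predict(X2)
```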

3.2.4. Normalisation
Some predictors or classifiers (artificial neural networks, support vector machines) have difficulty accepting string values of features/variables, such as genders or county names. The values of such a feature/variable can be rewritten as binary strings.

In addition, the values of these features/variables (e.g. the number of lines, account balances, "AVE30_CALL", "CALL_N1", "DUR1", "COST1", "CALL_NM", "mnDur1", "payedfee1", etc.) lie in different dynamic ranges. Features with large values have a larger influence on the cost functions than those with smaller values, which does not mean that large values are more important in classifier design. To solve this problem, the values of these features can be normalised into a similar range by Eq. (7):

$$\bar{x}_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad j = 1, 2, \ldots, d$$
$$\sigma_j^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2$$
$$y = \frac{x_{ij} - \bar{x}_j}{r\,\sigma_j}, \qquad \tilde{x}_{ij} = \frac{1}{1 + e^{-y}} \qquad (7)$$

where $x_{ij}$ is the value of the jth feature of the ith instance, d is the number of features, N is the number of instances or patterns, and r is a constant parameter defined by the user. In this study, r is set to one.
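A minimal NumPy sketch of Eq. (7), assuming the numeric features have already been assembled into an N × d matrix (categorical strings would first be expanded into binary indicators as described above):

```python
import numpy as np

def normalise(X, r=1.0):
    """Eq. (7): standardise each feature, then squash into (0, 1)."""
    mean = X.mean(axis=0)              # per-feature means, x_bar_j
    std = X.std(axis=0, ddof=1)        # per-feature standard deviations, sigma_j
    y = (X - mean) / (r * std)
    return 1.0 / (1.0 + np.exp(-y))    # logistic squashing, x_tilde_ij

X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 260.0]])  # toy data
print(normalise(X))
```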

3.3. Classification or prediction

Many techniques have been proposed for churn prediction in telecommunications. In order to evaluate the new set of features/variables with the window techniques, two popular modelling techniques and one promising and innovative modelling technique are selected as predictors in the prediction system. The two common modelling techniques are multilayer perceptron neural networks (MLP) and Decision Tree C4.5 (Quinlan, 1993, 1996). The promising and innovative modelling technique is support vector machines (SVM) (Cortes & Vapnik, 1995; Schölkopf, Smola, Williamson, & Bartlett, 2000; Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001). These three modelling techniques are outlined as follows:

Decision trees: A method known as "divide and conquer" is applied to construct a binary tree. The method starts by searching for the attribute with the best information gain at the root node and divides the tree into sub-trees. Each sub-tree is then separated recursively following the same rule. The partitioning stops when a leaf node is reached or there is no further information gain. Once the tree is created, rules can be obtained by traversing each branch of the tree. The details of decision trees based on the C4.5 algorithm are given in the literature (Quinlan, 1993, 1996).

Artificial neural networks: An MLP is a supervised feed-forward neural network and usually consists of input, hidden and output layers. Normally, the activation function of an MLP is a sigmoid function. For an MLP with one hidden layer, the network outputs can be obtained by transforming the activations of the hidden units using a second layer of processing elements, written as follows:

$$Output_{net}(j) = f\left(\sum_{l=1}^{L} w_{jl}\, f\left(\sum_{i=1}^{D} w_{li} x_i\right)\right), \qquad j = 1, \ldots, J \qquad (8)$$

where D, L and J are the total numbers of units in the input, hidden and output layers respectively, and f is an activation function. The back-propagation (BP) or quick back-propagation learning algorithms can be used to train an MLP. More details on learning algorithms can be found in Rumelhart, Hinton, and Williams (1986).
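For concreteness, Eq. (8) amounts to the following forward pass (a sketch with sigmoid activations and without bias terms, matching the equation; the weight shapes are our own convention):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_output(x, W_hidden, W_output):
    """Eq. (8): x has D inputs, W_hidden is L x D, W_output is J x L."""
    hidden = sigmoid(W_hidden @ x)      # inner sums over the D inputs
    return sigmoid(W_output @ hidden)   # outer sums over the L hidden units

rng = np.random.default_rng(0)
print(mlp_output(rng.standard_normal(4), rng.standard_normal((3, 4)),
                 rng.standard_normal((2, 3))))  # two outputs: nonchurn/churn
```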

Support vector machines: Standard SVMs (M-SVM) and one-class SVMs (1-SVM) are the two most common types of support vector machines. A standard SVM classifier can be trained by finding a maximal-margin hyper-plane in terms of a linear combination of a subset (the support vectors) of the training set. If the input feature vectors are not linearly separable, the SVM first maps the data into a high (possibly infinite) dimensional feature space using the kernel trick (Boser et al., 1992), and then classifies the data by the maximal-margin hyper-plane as follows:

$$f(\vec{x}) = \mathrm{sgn}\left(\sum_{i}^{M} y_i \alpha_i\, \phi(\vec{x}_i, \vec{x}) + d\right) \qquad (9)$$

where M is the number of samples in the training set, $\vec{x}_i$ is a support vector with $\alpha_i > 0$, $\phi$ is a kernel function, $\vec{x}$ is an unknown sample feature vector, and d is a threshold.

The parameters $\{\alpha_i\}$ can be obtained by solving a convex quadratic programming problem subject to linear constraints (Burges, 1998). Polynomial kernels and Gaussian radial basis functions (RBF) are usually applied in practice as kernel functions. The threshold d can be obtained by taking into account the Karush–Kuhn–Tucker condition (Burges, 1998) and choosing any i for which $\alpha_i > 0$ (i.e. a support vector); however, it is safer in practice to take the average value of d over all support vectors.

Based on standard SVMs, Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson (2001) proposed one-class SVMs to estimate the support of a high-dimensional distribution for one-class classification problems. Their approach is to construct a hyper-plane that is maximally distant from the origin, with all data lying on the side opposite the origin. The details of one-class SVMs can be found in Schölkopf et al. (2001).
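Once training has produced the support vectors, multipliers and threshold, the decision rule of Eq. (9) with an RBF kernel can be written directly; the sketch below assumes those quantities are given and is not tied to any particular SVM library.

```python
import numpy as np

def rbf(xi, x, sigma2):
    """Gaussian RBF kernel with variance parameter sigma^2."""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma2))

def svm_predict(x, support_vectors, y, alpha, d, sigma2):
    """Eq. (9): sign of the kernel expansion over the support vectors."""
    s = sum(yi * ai * rbf(xi, x, sigma2)
            for xi, yi, ai in zip(support_vectors, y, alpha))
    return np.sign(s + d)
```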

4. Experimental results and discussion

Two sets of experiments were carried out. One set was performed with the existing features/variables (Bin et al., 2007; Zhang et al., 2006) and the other with our proposed features/variables. The four window techniques and four modelling techniques (MLPs, Decision Tree C4.5, standard SVMs and one-class SVMs) were used in each set, so each set consists of 16 experiments. Each experiment was performed using the same training and testing data described in Section 3.1. The features/variables for Decision Tree C4.5 were not normalised, while the input features/variables to the MLPs, standard SVMs and one-class SVMs were normalised to between 0 and 1 using the methods described in Section 3.2.4.

Based on the training data without normalisation, the decision trees (C4.5) were trained using the training dataset with 10-fold cross-validation in each experiment.

In each experiment, an MLP with one hidden layer was trained. The number of input neurons of an MLP network equals the dimension of the feature vector, and the number of output neurons equals the number of classes; thus, the number of output neurons is 2 in this application, one representing a nonchurner and the other a churner. If the numbers of input and output neurons are n and m, respectively, the number of hidden neurons of the MLP is (m + n)/2. The sigmoid function is selected as the activation function for all MLPs in the experiments. For each input window technique, 10-fold cross-validation and the BP learning algorithm with learning rate 0.2, maximum cycles 2500 and tolerated error 0.01 were used to train the MLPs on the training dataset. The number of training cycles yielding the highest accuracy is 1800 for the MLPs when input window techniques 2(a), 2(b) and 2(c) were used. When input window technique 2(d) was used, the best numbers of training cycles are 1800 and 500 for training the first and second MLPs, respectively.

Based on the extracted features and the different window techniques, multi-class SVMs (M-SVM) were trained to find the separating decision hyper-plane that maximises the margin of the classified training data. Two sets of values, the regularisation term C ∈ {2^0, 2^−1, ..., 2^−14} and σ² ∈ {2^−3, 2^−2, ..., 2^14} of the RBF, were tried to find the best parameters for churn prediction. Altogether, 255 combinations of C and σ² with 3-fold cross-validation were used for training the M-SVMs in each experiment. The optimal parameter sets (C, σ²) yielding the maximum classification accuracy of the multi-class SVMs were (2^−11, 2^8) and (2^−11, 2^10) for input window techniques 2(a)/2(b) and 2(c), respectively. When input window technique 2(d) was used, (2^−11, 2^8) and (2^−7, 2^3) were the optimal parameters for the first and second M-SVM models, respectively.

Based on the training dataset and the four input window techniques, one-class SVMs (1-SVMs) were trained. Three sets of values, the regularisation term C ∈ {2^0, 2^−1, ..., 2^−14}, σ² ∈ {2^−3, 2^−2, ..., 2^14} of the RBF, and ν ∈ {0.001, 0.002, ..., 0.512}, were tried to find the best parameters for churn prediction. Altogether, 255 × 10 combinations of C, σ² and ν with 3-fold cross-validation were used for training the 1-SVMs in each experiment. The optimal parameter sets (C, σ², ν) of the one-class SVMs were (2^−8, 2^5, 0.004) and (2^−8, 2^7, 0.004) when input window techniques 2(a)/2(b) and 2(c) were used, respectively. When input window technique 2(d) was used, (2^−7, 2^8, 0.002) and (2^2, 2^4, 0.1) were the optimal parameters for the first and second 1-SVMs, respectively.
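The parameter searches above can be reproduced in spirit with a standard grid search; a minimal scikit-learn sketch follows, in which the data are random placeholders and scikit-learn's RBF parameter gamma corresponds to 1/(2σ²).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train = np.random.rand(200, 10)            # placeholder normalised features
y_train = np.random.randint(0, 2, 200)       # placeholder churn labels

param_grid = {
    "C": [2.0 ** -k for k in range(15)],                       # 2^0 ... 2^-14
    "gamma": [1.0 / (2.0 * 2.0 ** k) for k in range(-3, 15)],  # sigma^2 = 2^-3 ... 2^14
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)     # 3-fold CV
search.fit(X_train, y_train)
print(search.best_params_)
```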

The computational cost of training the MLPs was the most expensive, while training the decision trees was the cheapest. After the models were trained, they were used to perform the prediction tasks on the testing dataset. The results generated by the decision trees, MLPs, M-SVMs and 1-SVMs are shown in Tables 2–5. Tables 2 and 3 show the confusion matrices for the different predictors and input window techniques. From Tables 2 and 3, the prediction rates OA, TC, TN, FN, P, g-mean1, g-mean2 and AVG (see Section 2) were calculated in Tables 4 and 5. The results in Tables 2 and 4 were obtained with the existing features/variables reported in Bin et al. (2007) and Zhang et al. (2006); the results in Tables 3 and 5 were obtained with our new features/variables.

From Tables 4 and 5, for the same modelling technique, the prediction rates obtained with our new features/variables are higher than those obtained with the existing features/variables. Therefore, the proposed feature/variable set is more efficient than the existing one, and the prediction rates increased considerably. For example, OA, TN, g-mean2 and AVG increased by at least 5%, and TC increased by about 3%.


Table 2. Prediction confusion matrices from the old features. Columns under each window technique give the predicted nonchurn (N.CHU) and churn (CHU) counts; rows give the actual class.

                               Wind. 2(a)      Wind. 2(b)      Wind. 2(c)      Wind. 2(d)
Predictors           Actual    N.CHU   CHU     N.CHU   CHU     N.CHU   CHU     N.CHU   CHU
Decision Tree C4.5   N.CHU     15140   3056    15242   2954    15175   3021    15175   3021
                     CHU         360    640      372    628      358    642      358    642
MLP                  N.CHU     16199   1997    16349   1847    16446   1750    15921   2275
                     CHU         421    579      430    570      425    575      392    608
M-SVM                N.CHU     16472   1724    16466   1730    16552   1644    16487   1709
                     CHU         435    565      439    561      435    565      440    560
1-SVM                N.CHU      6581   11615    6598   11598    8030   10166    6547   11649
                     CHU         418     582     416     584     478     522     457     543

Table 3. Prediction confusion matrices from the new features.

                               Wind. 2(a)      Wind. 2(b)      Wind. 2(c)      Wind. 2(d)
Predictors           Actual    N.CHU   CHU     N.CHU   CHU     N.CHU   CHU     N.CHU   CHU
Decision Tree C4.5   N.CHU     16232   1964    16254   1942    16224   1972    16187   2009
                     CHU         334    666      337    663      318    682      317    683
MLP                  N.CHU     17143   1053    17798    398    17171   1025    16387   1809
                     CHU         389    611      467    553      374    626      358    642
M-SVM                N.CHU     17648    548    17338    858    17722    474    17557    639
                     CHU         449    551      445    555      448    552      420    580
1-SVM                N.CHU      9289   8907     8031   10165    9303   8893    10319   7877
                     CHU         502    498      584     416     514    486      347    653

Table 4. Prediction rates (%) from the old features.

Predictors  Input Wind.  OA     TC     TN     FN     FC     P      g-mean1  g-mean2  AVG
DT          2(a)         82.20  64.00  83.21  16.79  36.00  17.32  33.29    72.97    76.47
            2(b)         82.67  62.80  83.77  16.23  37.20  17.53  33.18    72.53    76.41
            2(c)         82.40  64.20  83.40  16.60  35.80  17.53  33.54    73.17    76.66
            2(d)         82.40  64.20  83.40  16.60  35.80  17.53  33.54    73.17    76.66
MLP         2(a)         87.40  57.90  89.03  10.97  42.10  22.48  36.07    71.80    78.11
            2(b)         88.14  57.00  89.85  10.15  43.00  23.58  36.66    71.56    78.33
            2(c)         88.67  57.50  90.38   9.62  42.50  24.73  37.71    72.09    78.85
            2(d)         86.11  60.80  87.50  12.50  39.20  21.09  35.81    72.94    78.13
M-SVM       2(a)         88.75  56.50  90.53   9.47  43.50  24.68  37.34    71.52    78.59
            2(b)         88.70  56.10  90.49   9.51  43.90  24.49  37.06    71.25    78.43
            2(c)         89.17  56.50  90.97   9.03  43.50  25.58  38.01    71.69    78.88
            2(d)         88.80  56.00  90.61   9.39  44.00  24.68  37.18    71.23    78.47
1-SVM       2(a)         37.32  58.20  36.17  63.83  41.80   4.77  16.66    45.88    43.89
            2(b)         37.41  58.40  36.26  63.74  41.60   4.79  16.73    46.02    44.02
            2(c)         44.55  52.20  44.13  55.87  47.80   4.88  15.97    48.00    46.96
            2(d)         36.93  54.30  35.98  64.02  45.70   4.45  15.55    44.20    42.41


Tables 4 and 5 also show that, for the same input window technique, the higher true churn rates and g-mean2 rates come from Decision Tree C4.5, followed by MLP and SVM. When a Decision Tree C4.5, MLP or SVM was used for churn prediction with different window techniques, the overall prediction rates, precision rates and g-mean1 were highest when input window technique 2(d) was used. Input window technique 2(c) is more efficient than window technique 2(a) when the same modelling technique is applied. Therefore, for a given feature set, which window technique is more efficient depends on the evaluation criteria used. This paper is more interested in OA, TC and TN, and these prediction rates are expected to be as high as possible. From Tables 4 and 5, the new window technique of Fig. 2c obtained higher prediction rates and is more efficient.

Which modelling technique is more efficient for churn prediction also depends on which measurements of the prediction rates are expected. For example, if OA is selected as the evaluation criterion, the highest overall prediction rates and g-mean1 were provided by M-SVM, followed by MLP and Decision Tree C4.5, for the same input window technique; the worst overall rates came from 1-SVM. If TC is selected as the evaluation criterion, Decision Tree C4.5 is more suitable than MLP and M-SVM for churn prediction, because the highest TC was obtained from Decision Tree C4.5. As mentioned above, the rates OA, TC and TN are expected to be high here. If the weights (degrees) of the expected OA, TC and TN are the same, the prediction rate AVG is used to measure the churn prediction: the MLPs obtained the highest AVG rate, the M-SVMs a somewhat lower one, and the one-class SVMs the worst. In addition, in this case, from Tables 4 and 5, the highest AVG rate is


Table 5. Prediction rates (%) from the new features.

Predictors  Input Wind.  OA     TC     TN     FN     FC     P      g-mean1  g-mean2  AVG
DT          2(a)         88.03  66.60  89.21  10.79  33.40  25.32  41.07    77.08    81.28
            2(b)         88.12  66.30  89.33  10.67  33.70  25.45  41.08    76.96    81.25
            2(c)         88.07  68.20  89.16  10.84  31.80  25.70  41.86    77.98    81.81
            2(d)         87.88  68.30  88.96  11.04  31.70  25.37  41.63    77.95    81.71
MLP         2(a)         92.49  61.10  94.21   5.79  38.90  36.72  47.37    75.87    82.60
            2(b)         95.50  54.22  97.81   2.19  45.78  58.15  56.15    72.82    82.51
            2(c)         92.71  62.60  94.37   5.63  37.40  37.92  48.72    76.86    83.23
            2(d)         88.71  64.20  90.06   9.94  35.80  26.19  41.01    76.04    80.99
M-SVM       2(a)         94.81  55.10  96.99   3.01  44.90  50.14  52.56    73.10    82.30
            2(b)         93.21  55.50  95.28   4.72  44.50  39.28  46.69    72.72    81.33
            2(c)         95.20  55.20  97.40   2.60  44.80  53.80  54.50    73.32    82.60
            2(d)         94.48  58.00  96.49   3.51  42.00  47.58  52.53    74.81    82.99
1-SVM       2(a)         50.98  49.80  51.05  48.95  50.20   5.30  16.24    50.42    50.61
            2(b)         44.00  41.60  44.14  55.86  58.40   3.93  12.79    42.85    43.25
            2(c)         50.99  48.60  51.13  48.87  51.40   5.18  15.87    49.85    50.24
            2(d)         57.16  65.30  56.71  43.29  34.70   7.66  22.36    60.85    59.72

Fig. 3. Visualisation of features.


83.23%, obtained by the MLPs with input window technique 2(c) and our proposed new features/variables. Therefore, MLPs with input window technique 2(c) are more efficient than the other modelling techniques with the other input window techniques. In addition, among these modelling techniques, only the MLPs can provide the likelihood of a prediction.

From Tables 4 and 5, the prediction rates of the 1-SVMs are very low. One reason is that the churners, a small percentage of the customers, may not lie in a single zone but may be located in different places. In order to verify this assumption, it is necessary to visualise the dataset. In contrast to principal component analysis (linear PCA, kernel PCA, etc.), no pattern information is lost when a self-organising map (SOM) (Kohonen, 1995, 2001) is used to visualise the dataset; therefore, a SOM was used to visualise our dataset. The details of SOMs can be found in Kohonen (1995) and Kohonen (2001). We built a map of 39 by 70 nodes. A maximum of 9,000,000 training cycles and 3-fold cross-validation were used to train the SOM; training took about 24 h on one Dell OptiPlex GX620 PC. Fig. 3 shows the dataset mapped onto the SOM: the gray cells represent churners and the black cells represent nonchurners. The figure shows that the churners form a very small portion and are located in different places on the map. The kernels of the 1-SVMs might find it difficult to transform the input features into the same zone or onto the same side in the high-dimensional space; thus, a decision hyper-plane that distinguishes well between churners and nonchurners might not be obtained. Therefore, it is difficult to obtain an acceptable result with 1-SVMs for our application.
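For reference, a SOM visualisation of this kind can be sketched with the third-party minisom package (one option among several); the data here are random placeholders, the 39 × 70 grid follows the text, and the iteration count is reduced for illustration.

```python
import numpy as np
from minisom import MiniSom   # pip install minisom

X = np.random.rand(500, 20)               # placeholder feature matrix
y = np.random.randint(0, 2, 500)          # placeholder labels, 1 = churner

som = MiniSom(39, 70, X.shape[1], sigma=1.0, learning_rate=0.5)
som.train_random(X, 10000)                # far fewer cycles than in the text

# Cells onto which churners are mapped, to see how scattered they are
churn_cells = {som.winner(x) for x, label in zip(X, y) if label == 1}
print(len(churn_cells), "map cells contain churners")
```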

5. Conclusions

For the land-line telecommunication service field, this paper presented a churn prediction system, which consists of data sampling, preprocessing, and classification/prediction phases. This paper mainly presented a new set of input features/variables and three input window techniques to improve churn prediction rates. The new set of input features includes Henley segmentation, account information, customer personal information, monthly bills, payment details, monthly call details, and so on. The new window techniques were obtained, starting from the traditional window technique, by adding or adjusting the positions of some monthly call details in the feature window for the training and testing phases. In order to evaluate the efficiency of this feature set and of the input window techniques, Decision Tree C4.5, MLPs, multi-class SVMs and one-class SVMs were selected as predictors. The comparison of the features/variables, the comparison of the window techniques and the comparison of the three modelling techniques were made through the experiments. The experiments showed that: (1) the new feature/variable set is more efficient than the existing ones; (2) which window technique is more efficient depends on which evaluation criteria are selected (for our evaluation criteria, window technique 2(c) is more efficient); (3) which modelling technique is more efficient depends on which evaluation criteria are selected (for our evaluation criteria, MLP and SVM are more efficient); but (4) the one-class SVM modelling technique is not suitable for our application.

However, there are some limitations in our proposed techniques. In the future, other information (e.g. complaint information, contract information, more fault reports, etc.) should be added to the new feature set in order to improve the features. The dimensionality of the input features should also be reduced using PCA, SOMs or genetic algorithms. The window techniques should be improved as well, because they use a large set of features, which might cause dimensionality problems for some modelling techniques. For the modelling techniques, sampling the training data is very important: if the size of the training dataset is increased, the computational overhead of training a model increases considerably, especially for training ANNs; if the size of the training dataset is decreased, the prediction rates might decrease. In addition, in the future, we should develop a modelling technique that can improve the prediction rates with a low computational cost in the training phase.

Acknowledgement

This research was partly supported by Eircom of Ireland.

References

Au, W., Chan, C., & Yao, X. (2003). A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7, 532–545.

Bin, L., Peiji, S., & Juan, L. (2007). Customer churn prediction based on the decision tree in personal handyphone system service. In International conference on service systems and service management (pp. 1–5).

Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual ACM workshop on computational learning theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.

Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.

Coussement, K., & den Poe, D. V. (2008). Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Systems with Applications, 34, 313–327.

Eircom (2008). <http://www.eircom.ie/cgi-bin/bvsm/bveircom/mainPage.jsp>.

Euler, T. (2005). Churn prediction in telecommunications using MiningMart. In Proceedings of the workshop on data mining and business (DMBiz). <citeseer.ist.psu.edu/euler05churn.html>.

Group, H. R. (2006). <http://www.henleymc.ac.uk/henleyres03.nsf/pages/hccm_models>.

Hadden, J., Tiwari, A., Roy, R., & Ruta, D. (2006). Churn prediction: Does technology matter? International Journal of Intelligent Technology, 1(2).

Hung, S.-Y., Yen, D. C., & Wang, H.-Y. (2006). Applying data mining to telecom churn management. Expert Systems with Applications, 31, 515–524.

John, H., Ashutosh, T., Rajkumar, R., & Dymitr, R. (2007). Computer assisted customer churn management: State-of-the-art and future trends.

Kohonen, T. (1995). Self-organizing maps.

Kohonen, T. (2001). <http://www.cis.hut.fi/research/som-research/>.

Lazarov, V., & Capota, M. (2007). Churn prediction. TUM computer science. <http://home.in.tum.de/lazarov/files/research/papers/churn-prediction.pdf>.

Quinlan, J. R. (1993). C4.5: Programs for machine learning.

Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation (Vol. 1). MA: MIT Press.

Schölkopf, B., Platt, J. C., Shawe-Taylor, J. C., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.

Schölkopf, B., Smola, A., Williamson, R., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Tong, S., & Chang, E. (2001). Support vector machine active learning for image retrieval. In MULTIMEDIA '01: Proceedings of the ninth ACM international conference on multimedia (pp. 107–118). New York, NY, USA: ACM.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley. pp. 339–371.

Wei, C., & Chiu, I. (2002). Turning telecommunications call details to churn prediction: A data mining approach. Expert Systems with Applications, 23, 103–112.

Zhang, Y., Qi, J., Shu, H., & Li, Y. (2006). Case study on CRM: Detecting likely churners with limited information of fixed-line subscribers. In International conference on service systems and service management (Vol. 2, pp. 1495–1500).

Zhao, Y., Li, B., Li, X., Liu, W., & Ren, S. (2005). Customer churn prediction using improved one-class support vector machine (Vol. 3584, pp. 300–306).