Churn Prediction in the Mobile Telecommunications Industry

An application of Survival Analysis in Data Mining

Master Thesis

Author: L.J.S.M. Alberts, Bsc.

Supervisor: Dr. R.L. Westra

Graduation committee:

Dr. Ir. R.L.M. Peeters (Dep. of Mathematics, Maastricht University)
Prof. R. Braekers (Center for Statistics, Hasselt University)
C. Meijer (Vodafone Netherlands)

Maastricht University
Department of General Sciences

Maastricht, September 2006

Acknowledgements

For my master's degree in Knowledge Engineering & Computer Science (Operations Research) at Maastricht University I conducted research at the mobile telecommunication company Vodafone Netherlands in Maastricht. The main aim of this research was to gain more insight into survival analysis and its application as a predictive model. I found translating theoretical knowledge to a real-life context both challenging and very appealing. First, I would like to thank my supervisor dr. Ronald Westra and Carel Meijer of Vodafone for their guidance during my research and for giving me the opportunity to carry out this assignment. I would also like to thank professor Roel Braekers of Hasselt University for answering my questions about survival analysis. Further, I would like to thank Driek Maas and everyone at Information Management. Last but not least, I would like to thank my family and my girlfriend for their support, interest and understanding during this period.


Abstract

Recently, the mobile telecommunication market in the Netherlands has changed from a rapidly growing market into a state of saturation and fierce competition. The focus of telecommunication companies has therefore shifted from building a large customer base to keeping customers ‘in house’. Customers who switch to a competitor are so-called churned customers. Churn prevention, through churn prediction, is one way to keep customers ‘in house’. In this study we focus solely on prepaid customers. In contrast to postpaid customers, prepaid customers are not bound by a contract. The central problem concerning prepaid customers is that the actual churn date is in most cases difficult to assess. This is a direct consequence of the difficulty of providing an unequivocal definition of churning and of a lack of understanding of churn behaviour. To overcome this problem, a custom and flexible churn definition is proposed here.

The predictive churn model presented in this study is based on the theory of survival analysis. Survival analysis is predominantly used in medical sciences to examine the influence of variables on the length of survival of patients. In survival analysis, the time until the occurrence of a well-defined event is modelled. In the present case, the event of interest is churn. In this research the focus is on the extended Cox model, a variant of the original proportional hazards model, which is used here for churn modelling. Since survival models are not designed to act as predictive models, some adjustments had to be made.

To be able to compare the performance of the extended Cox model with that of the established predictive models, a decision tree is also considered.

Both models performed approximately equally well, with a sensitivity ranging from 93% to 99% and a specificity ranging from 92% to 97%, depending on the model and the churn definition. The extended Cox model can be considered a perfect alternative to the established predictive models and offers some unique qualities.


List of definitions

Call credit: prepaid customers pay their calls from this credit.

Churn: a term used by companies to denote the loss of customers.

Commitment date: the date a new customer is registered.

Postpaid customer: a customer who is bound by a contract and pays a monthly sum in exchange for free call minutes.

Prepaid customer: a customer who is not bound by a contract and who only pays for the calls he makes.

Recharge: the term used for raising the call credit.

Sim-card: a small electronic chip on which the mobile phone number is stored.


Contents

1 Introduction
1.1 Churn prediction modelling
1.2 Prepaid versus postpaid
1.3 Churning
1.4 Research questions
1.5 Outline

2 Data acquisition and preparation
2.1 Data acquisition
2.2 Data preparation
2.3 Derived variables

3 Operational churn definition

4 Theory of survival analysis
4.1 Survival and hazard functions
4.2 Recording survival data
4.3 Censoring

5 A survival model as predictive churn model
5.1 Cox model
5.2 Extended Cox model
5.3 Survival data
5.4 Discrete event times
5.5 Lagging
5.6 Principal component regression
5.7 Handling nonlinearity
5.8 Predictive score method

6 A decision tree as predictive churn model
6.1 Finding the optimal tree size
6.2 Oversampling

7 Tests and results
7.1 Tests
7.2 Results

8 Conclusions and recommendations
8.1 Conclusions
8.2 Recommendations for future research

Appendix A: Derived variables

Appendix B: Decision tree for churn definition 1

Appendix C: Decision tree for churn definition 2


Chapter 1

Introduction

Recently, the mobile telecommunication market in the Netherlands has changed from a rapidly growing market into a state of saturation and fierce competition. The focus of telecommunication companies has therefore shifted from building a large customer base to keeping customers in house. For that reason, it is valuable to know which customers are likely to switch to a competitor in the near future. Customers who switch are so-called churned customers. Since acquiring new customers is more expensive than retaining existing customers, churn prevention can be regarded as a popular way of reducing the company’s costs. With this objective, this research was carried out for Vodafone Netherlands. Vodafone is particularly interested in churn prediction for prepaid customers. Prepaid customers are, as opposed to postpaid customers, not bound by a contract. In this study, two different predictive churn models are considered. The first model, the so-called extended Cox model, is based on the theory of survival analysis, whereas the second model, a decision tree, is commonly used in data mining.

In the present study, the focus will be on the extended Cox model. However, in order to establish a direct comparison, we decided to consider the decision tree as well. Both models are ultimately tested on a selection of prepaid customers from the database provided by Vodafone.

1.1 Churn prediction modelling

Churn prediction is currently a relevant subject in data mining and has been applied in the fields of banking [5, 14], mobile telecommunication [10, 7], life insurance [13], and others. In fact, all companies that deal with long-term customers can take advantage of churn prediction methods.

Models such as neural networks, logistic regression and decision trees are common choices of data miners to tackle this churn prediction problem. These models are trained by offering snapshots of churned customers and non-churned customers. The goal is to distinguish churners from non-churners as well as possible. When new customers are offered, the model attempts to predict to which class each customer belongs. Although in general this approach gives satisfying results, the time aspect often involved in these problems is neglected. For instance, in the case of life insurance, customers are more likely to churn after a year than after a period of 5 years.

We propose an approach which makes it possible to incorporate this time aspect. In order to do so, a collection of statistical methods called survival analysis is used. Survival analysis is predominantly used in medical sciences to examine the influence of explanatory variables on the length of survival of patients, hence the name survival analysis.

Survival analysis has different synonyms depending on its application. For example, survival analysis is labeled ‘reliability analysis’ in engineering and is known under the term ‘duration analysis’ in economics. In addition, survival analysis applied in the field of data mining is also known as ‘survival data mining’ [15].

1.2 Prepaid versus postpaid

Although churn prediction is often associated with the mobile telecommunication market, the problem described in this thesis is not widely covered in scientific literature. This is because almost all articles concentrate on so-called postpaid customers, customers who are bound by a contract. In contrast to postpaid customers, prepaid customers are far more difficult to deal with when it comes to churn prediction.

Firstly, as opposed to postpaid customers, prepaid customers do not pay a monthly sum and do not receive any free calling minutes. In order to make outgoing calls, they have to buy call credit. When this call credit is used up, they have to recharge their credit to make outgoing calls again. Incoming calls, on the other hand, can always be received, regardless of the current call credit. These properties result in calling behaviour that is very hard to predict. Most prepaid customers are irregular users for whom months of zero usage frequently occur. This behaviour can easily be explained by the fact that they are not charged for not making any calls. Obviously, this is in contrast with postpaid customers, who pay a monthly sum.

Secondly, postpaid customers are obliged to provide personal information such as their name, gender and address. As a consequence, a postpaid phone number is linked to a single customer. Prepaid customers, on the other hand, do not necessarily need to provide this personal information and can therefore stay anonymous. Because of this, sim-cards, which contain the phone number of the customer, can easily be passed between individuals without having to inform the operator. Different calling behaviours are then registered on a single sim-card and thus on a single phone number. Such numbers are misleading if we assume that a phone number is linked to a single and unique customer.

Finally, in most cases postpaid contracts have a fixed length of one or two years. Since people are likely to churn when their contract expires, the expiration date can be regarded as a very reliable predictor of churning. Unfortunately, this predictor is unavailable for prepaid churn.

In the following section, the most important difference between postpaid and prepaid customers will be addressed.

1.3 Churning

Churn is usually divided into voluntary and involuntary churn. Voluntary churn refers to churn due to the customer’s choice, for example switching to a competitor or switching to a postpaid contract. Involuntary churn concerns customers who are disconnected by the operator, typically due to nonpayment or fraud. Since fraudulent customers are rare, they can be neglected in this study. Involuntary churn due to nonpayment is not applicable here, as we are dealing with prepaid customers only. However, involuntary churn is still possible when customers do not recharge for a long period of time.

Probably one of the most important issues with prepaid churn is the lack of a good definition for it. When considering postpaid churn, the deactivation date, i.e. the date that a customer is disconnected from the network, is equal to the churn date. After all, this is the actual date a customer stops using the operator’s services. In the case of prepaid churn, however, the deactivation date does not necessarily match the churn date. This can be made clearer by considering the different states a prepaid customer can be in. We can distinguish four states, namely

• Status 1: Normal use

• Status 2: No credit (call credit has expired or call credit is zero)

• Status 3: Recharge only

• Status 4: Deactivation


Figure 1.1: The different states prepaid customers can pass.

The different states are shown in Figure 1.1. After recharging, a customer in status 2 or 3 returns to status 1 again.

In general, it takes a long period before a prepaid customer is actually disconnected from the network. In many cases customers have churned long before they are disconnected from the network. This is exactly the reason why the deactivation date is not a suitable indicator for churn. Thus, we are interested in a churn definition which indicates when a customer has permanently stopped using his prepaid sim-card. Furthermore, this stop has to be indicated as early as possible and before the deactivation date is reached. In Chapter 3 such a churn definition is proposed. This definition is used to train the predictive models.

1.4 Research questions

The research question is stated as follows:

Is it possible to make a prepaid churn model based on the theory of survival analysis?

In order to address this question, the following three sub questions are formulated:

• What is a proper, practical and measurable prepaid churn definition?

• How well do survival models perform in comparison to the established predictive models?

• Do survival models have an added value compared to the established predictive models?

We are thus interested in a survival model that can be used to predict prepaid churn. This model is proposed in Chapter 5. The problem addressed in the first sub question has already been mentioned in the previous section. To overcome this problem, a churn definition is proposed in Chapter 3. In order to investigate sub questions 2 and 3, a second predictive model is required. This model is discussed in Chapter 6. Tests are performed in order to answer the second sub question. These are discussed in Chapter 7. In Chapter 8 we return to the research question and the sub questions and attempt to answer them.

1.5 Outline

This work is structured as follows. In Chapter 2 the data acquisition and preparation procedure is described. The data is acquired from a database provided by Vodafone Netherlands. A visualization tool is used to obtain insight into the data. Further, a number of derived variables are proposed which describe customer behaviour in a better way.

As a consequence of the unavailability of exact churn dates, a churn definition is required. This definition is proposed in Chapter 3 and is used to provide the dataset with labels indicating churn or non-churn. These labels are essential to train the predictive models.

In Chapter 4 a brief theoretical background of survival analysis is given. This is done in order to gain a better understanding of the terminology used in Chapter 5, where the survival model is introduced. Chapter 5 discusses the theoretical background of the model and its application as a predictive model. In Chapter 6 the second predictive model is described. This model is called a decision tree and is commonly used in churn prediction modelling. Therefore, the decision tree is used here as a benchmark for the performance of the survival model. Chapter 7 provides the tests and results of the comparison between the two models. The tests are based on a selection of prepaid customers from the database provided by Vodafone. Finally, in Chapter 8 the conclusions and recommendations are given.


Chapter 2

Data acquisition and preparation

The first step in predictive modelling is the acquisition and preparation of data. Having the correct data is as important as having the correct method [2].

2.1 Data acquisition

Vodafone has provided a database containing all prepaid and postpaid customers. The data stored in this database is used for predictive modelling. The data is already aggregated per month. For modelling purposes this is the desired level of aggregation. Daily or weekly aggregated data will not offer any advantages over monthly aggregated data, since prepaid customers are characterized by inconstant and irregular behaviour, as previously discussed in Section 1.2.

Not all fields of the database are suitable for modelling purposes. Fields with unique values, like addresses or personal unlock codes, are left out. These do not have predictive value as they uniquely identify each row [2]. Fields with only one value are also left out, as these represent a negligible part of the data [2]. Finally, fields with too many ‘null’ values are excluded as well. The latter applies to fields like ‘sex’ and ‘age’, since prepaid customers are not obliged to provide personal information. In addition, personal data that is available is not necessarily reliable. Geographical and personal data are therefore excluded. In sum, the selected data relates only to usage and billing information, for instance ‘number of outgoing calls per month’ or ‘number of recharges per month’.

The database is updated with the latest available data every month. However, if a customer does not show any activity in a month, nothing is added to the database. Although this may increase storage efficiency, it also makes it very difficult to detect a possible update failure. After all, an update failure can wrongly suggest total inactivity of a customer. This problem has nevertheless been disregarded in this study.

A selection of 20,000 customers with a commitment date between April and July 2005 was taken from the database. This corresponds to 12 to 15 months of history per customer. Longer histories were not possible due to the limited history of the recharge data.

2.2 Data preparation

The large flat data file obtained from the database is represented in an object-oriented manner. That is, each customer is stored as an object whose attributes are the selected fields from the database. These attributes will also be referred to as variables from now on. Furthermore, customer attributes can be time dependent or time independent. The attribute ‘customer id’, for instance, is a time-independent variable. On the other hand, the variable ‘number of outgoing calls per month’ differs from month to month and is therefore time dependent. This object-oriented representation gives us greater flexibility and a better overview of the data. In addition, the data is better organised and it becomes possible to examine a single customer very easily. Figure 2.1 depicts a diagram of the object-oriented representation of the data. A visualization tool has been built to further benefit from this object-oriented representation. It plots two variables of a single customer over time. The different variables and customers can be chosen from a list. Figure 2.2 is a screenshot of the user interface, which is built with Matlab 7. This tool makes it possible to visually examine the behaviour of hundreds of customers. It provides insight into potentially interesting and worthwhile variables. Furthermore, the visualization tool is also used to examine different types of churn definitions.

The set of customer objects needs to be turned into the right format before it can be used by the predictive models. The two models require different representations of the data. In the case of the decision tree model, all data must be in a single table where each row corresponds to a single month of a customer. Different rows can thus hold different customers and different months. In the case of the survival model, the dataset must contain the total history of a customer.
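The conversion from customer objects to the one-row-per-customer-month table can be sketched as follows. This is only an illustration in Python, not the Matlab implementation used in this study; the class name `Customer`, the field names and the function `to_customer_month_rows` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Customer:
    """Hypothetical customer object: one id plus monthly series per variable."""
    customer_id: str                                   # time-independent attribute
    monthly: Dict[str, List[float]] = field(default_factory=dict)  # time-dependent


def to_customer_month_rows(customers: List[Customer]) -> List[dict]:
    """Flatten customer objects into one row per customer per month."""
    rows = []
    for c in customers:
        # All series of one customer are assumed to have the same length.
        n_months = len(next(iter(c.monthly.values()), []))
        for m in range(n_months):
            row = {"customer_id": c.customer_id, "month": m + 1}
            for name, series in c.monthly.items():
                row[name] = series[m]
            rows.append(row)
    return rows


customers = [
    Customer("A", {"outgoing_calls": [5, 0], "recharges": [1, 0]}),
    Customer("B", {"outgoing_calls": [2], "recharges": [0]}),
]
rows = to_customer_month_rows(customers)
```

The survival model would instead keep the complete monthly series of each customer together, which is exactly what the object representation already provides.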


Figure 2.1: Object oriented representation of the data.

2.3 Derived variables

Derived variables are new variables based on original variables. The most effective derived variables are those that represent something in the real world, such as a description of some underlying customer behaviour [2]. In fact, all original variables are already derived variables, since these are all aggregated per month.

There are some general classes of derived variables, like total values, average values, and ratios. We have chosen to employ the average value over the last three months as a derived variable type. Also, the ratio between the average over the last three months and the average over all earlier months is used as a derived variable. In addition, a number of specific derived variables are used. Some examples are:

• The number of months since the last recharge.

• The number of months since the last voicemail call.

• The ratio of incoming and outgoing calls.

A list of all derived variables can be found in Appendix A. The derived variables explain customer behaviour in a better way than the original variables. For instance, knowing how many months have passed since a customer last called his voicemail is much more informative than knowing whether a customer called his voicemail this month.
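The three kinds of derived variables described above can be sketched in a few lines of Python. The function names and the example series are hypothetical, and the handling of undefined ratios (returning `None`) is an assumption, since the text does not state how such cases are treated.

```python
from typing import Optional, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def recent_average(series: Sequence[float], window: int = 3) -> float:
    """Average value over the last `window` months."""
    return mean(series[-window:])


def recent_to_history_ratio(series: Sequence[float], window: int = 3) -> Optional[float]:
    """Ratio of the recent average to the average over all earlier months."""
    earlier = series[:-window]
    if not earlier or mean(earlier) == 0:
        return None  # undefined ratio; this convention is an assumption
    return recent_average(series, window) / mean(earlier)


def months_since_last(series: Sequence[float]) -> Optional[int]:
    """Number of months since the last month with a positive value (0 = current month)."""
    for lag, value in enumerate(reversed(series)):
        if value > 0:
            return lag
    return None  # never happened within the observed history


# Hypothetical monthly series for one customer, oldest month first.
outgoing_calls = [30, 24, 18, 20, 6, 3, 0]
recharges = [2, 1, 0, 1, 0, 0, 0]

avg_last3 = recent_average(outgoing_calls)          # (6 + 3 + 0) / 3
ratio = recent_to_history_ratio(outgoing_calls)     # avg_last3 / mean of first four months
since_recharge = months_since_last(recharges)       # last recharge was three months ago
```

In this made-up series the recent average is far below the historical average and the last recharge lies three months back, which is the kind of pattern the derived variables are meant to expose.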


Figure 2.2: A screenshot of the data visualization tool.


Chapter 3

Operational churn definition

In Chapter 1 we stated that we prefer a churn definition that indicates as early as possible when a customer has permanently stopped using his sim-card. In this chapter such a churn definition is proposed. This is called the operational churn definition. An operational churn definition is necessary since the models proposed in this study are supervised models. Supervised models require a labeled dataset for training purposes.

The most obvious operational definition of prepaid churn is based on a number of successive months with zero usage. After all, this definition bears great similarity to our intended churn definition. For instance, a churned customer could be defined as a customer who has not had any incoming or outgoing calls in the past three months. Although this is an intuitive operational definition, we do have to take the following considerations into account.

First, the definition of zero usage has to be refined. Incoming calls are also registered when the user does not answer his phone. Besides, incoming duration is registered when a message is recorded on voicemail. This means that a prepaid customer who has already churned can still have incoming calls and incoming duration from uninformed outsiders. Therefore, defining zero usage as zero outgoing duration plus zero incoming duration cannot be considered optimal. With regard to incoming duration, we apply a threshold. This threshold is set to 30 seconds per month, which turned out to be a practical value. In sum, zero usage is defined as a total incoming duration of less than 30 seconds per month in combination with zero outgoing duration.

Secondly, the operational churn definition has to be refined. Most prepaid customers are irregular users that often display zero usage. In addition, several successive months of zero usage often occur as well. As a consequence, the above-mentioned churn definition will not hold for every customer. Low-usage customers could quickly be labeled as churned without actually having churned. To overcome this problem, a flexible churn definition is proposed. This definition allows for a larger range of customer behaviours. The definition consists of two parts, α and β, where α is a fixed value and β is equal to the maximum number of successive months with zero usage. α + β is then used as a threshold to distinguish churn behaviour from non-churn behaviour. An example is provided in Figure 3.1. In this example α is set to a value of 3 and is marked in grey. β is equal to 2 and is marked in black. At month 13 the customer is labeled as churned since there are (2 + 3 =) 5 successive months with zero usage.

Figure 3.1: An example of the operational churn definition.

The churn definition can be interpreted as follows. Although the customer of Figure 3.1 did not show any usage during the fifth and sixth month, at month 7 he ‘proved’ that he had not churned. This means that another two successive months of zero usage would not guarantee any churn behaviour. This is indicated with the variable β. The constant α is added to indicate the point at which the customer is considered to have churned. Two different values for α are used to label the training set, namely α = 2 and α = 3. From now on these will be referred to as churn definition 1 and 2 respectively. Customers with β ≥ 5 are excluded, as the churn definition becomes less accurate and intuitive after longer durations.
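The labeling rule can be sketched as follows. This is one possible reading of the definition, not the implementation used in this study: we assume that β is the longest run of zero-usage months that was followed by renewed usage, and that a customer is labeled as churned in the first month in which the current run reaches α + β. The function names and the example series are hypothetical; the series is constructed to match the description of Figure 3.1.

```python
from typing import List, Optional

INCOMING_THRESHOLD_SECONDS = 30  # threshold on incoming duration stated in the text


def is_zero_usage(outgoing_seconds: float, incoming_seconds: float) -> bool:
    """Zero usage: no outgoing duration and under 30 seconds of incoming duration."""
    return outgoing_seconds == 0 and incoming_seconds < INCOMING_THRESHOLD_SECONDS


def churn_month(zero_usage: List[bool], alpha: int) -> Optional[int]:
    """Return the 1-based month in which the customer is labeled churned, else None."""
    beta = 0  # longest zero-usage run that was followed by usage (assumed reading of beta)
    run = 0   # length of the current zero-usage run
    for month, zero in enumerate(zero_usage, start=1):
        if zero:
            run += 1
            if run >= alpha + beta:
                return month
        else:
            beta = max(beta, run)  # the customer 'proved' that this run was not churn
            run = 0
    return None


# Months 1-4 active, 5-6 zero usage, 7-8 active, 9-13 zero usage.
example = [False] * 4 + [True] * 2 + [False] * 2 + [True] * 5
label_month = churn_month(example, alpha=3)  # beta becomes 2, threshold 2 + 3 = 5
```

With α = 3 the two idle months 5 and 6 do not trigger a label, and the customer is labeled in month 13, after five successive months of zero usage. The exclusion of customers with β ≥ 5 would be applied as a separate filtering step.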


Chapter 4

Theory of survival analysis

Survival analysis is a collection of statistical methods which model time-to-event data. A thorough introduction can be found in [12]. Central is the occurrence of a well-defined ‘event’. The variable of interest is the time until this event occurs. This is in contrast with approaches like regression methods and neural networks, which model the probability of an event. Depending on the application, the event of interest can be the failure of a physical component or death. In the context of data mining the event of interest is typically churn or the next purchase [4].

4.1 Survival and hazard functions

Let T ≥ 0 be the random variable that denotes the time at which the event occurs and let it have density f(t) and distribution function F(t). The survival function is then defined as

$$S(t) = \Pr(T > t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx. \tag{4.1}$$

We note that S(t) is a monotone decreasing function with S(0) = 1. The survival at time t is the probability that a subject will survive to that point in time. The hazard rate function λ(t) is defined as

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T < t + \Delta t \mid T > t)}{\Delta t} = \frac{f(t)}{1 - F(t)}. \tag{4.2}$$

The hazard function is also known as the instantaneous failure rate, force of mortality, and age-specific failure rate. λ(t)Δt can be interpreted as the instantaneous probability of having an event at time t given that one has survived (i.e. not had an event) up to time t. The functions f(t), F(t), S(t) and λ(t) give mathematically equivalent specifications of the distribution of T.
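The equivalence of these specifications follows directly from Equations 4.1 and 4.2. As a short worked derivation (the symbol Λ(t) for the cumulative hazard is introduced here for illustration only): since f(t) = −dS(t)/dt, the hazard can be written in terms of S(t) alone, and integrating recovers S(t) from λ(t).

```latex
\lambda(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\,\log S(t),
\qquad
\Lambda(t) = \int_0^t \lambda(u)\,du,
\qquad
S(t) = \exp\bigl(-\Lambda(t)\bigr),
\qquad
f(t) = \lambda(t)\,S(t).
```

For example, a constant hazard λ(t) = λ gives S(t) = e^{−λt}, which is the exponential distribution.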

4.2 Recording survival data

Survival data is recorded in the following way. Subjects are observed for a certain period of time. During this period, the time of the event of interest is registered. However, it is possible that the event cannot be registered because it does not occur during the period of observation. This is called censoring and is discussed in the following section.

Survival data requires both a well-defined origin of time and a time scale. The origin of time is the moment from which the observation starts. This can be any particular event, for example birth, or in our case the commitment date of a customer. The time scale is the frequency with which a subject is checked for the occurrence of the event. Common scales are based on years or months, depending on the nature of the application. In this study we apply a time scale based on months, since the data we use is aggregated per month.

Let T_i denote the time at which the event occurs for the ith subject and let T_i^* denote the observed time when dealing with censoring. Let δ_i be a 0/1 indicator which is 1 if T_i is observed and 0 if the observation is censored. The pair (T_i^*, δ_i) is then used to describe a single observation.

4.3 Censoring

Censoring is an essential element of survival analysis. It occurs when observations are incomplete. Three types of censoring can be distinguished:

• Right censoring

• Left censoring

• Interval censoring

These are illustrated in Figure 4.1. The most common type of random censoring is right censoring. Right censoring occurs when the event time is larger than the period of observation. In that case, it is only known that the event time is larger than the last observed time. In the case of left censoring, the event occurs before the period of observation. Then, the event time is only known to be before a certain point in time. Finally, interval censoring occurs when the exact event times are not available and are only known to lie in a certain interval. In the present study we are dealing with right-censored data since not every customer churns during observation.

Figure 4.1: Different types of censoring.
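For right-censored churn data, the pair (T_i^*, δ_i) of Section 4.2 can be constructed as in the following sketch. The function name `observe` and the example values are hypothetical; time is measured in months since the commitment date.

```python
from typing import Optional, Tuple


def observe(churn_month: Optional[int], months_observed: int) -> Tuple[int, int]:
    """Return (T*, delta) for one customer under right censoring.

    churn_month     -- month of churn since commitment, or None if never seen
    months_observed -- length of the observation period in months
    """
    if churn_month is not None and churn_month <= months_observed:
        return churn_month, 1      # event observed: T* = T, delta = 1
    return months_observed, 0      # right censored: T* = end of observation, delta = 0


churned = observe(13, 15)      # churn seen in month 13 of 15 observed months
censored = observe(None, 12)   # no churn during 12 observed months
late = observe(20, 15)         # churn would fall outside the observation period
```

A censored customer thus still contributes the information that no churn occurred during the observed months.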


Chapter 5

A survival model as predictive churn model

In this chapter, a survival model for churn prediction is proposed. There are many different types of survival models. In this approach, we are particularly interested in survival models that incorporate a regression component, since these regression models can be used to examine the influence of explanatory variables on the event time. In this context, such explanatory variables are often called covariates. There are two commonly used classes of regression models, namely so-called accelerated failure time models and proportional hazard models. Accelerated failure time models are based on a survival distribution. Commonly employed distributions are the Weibull, exponential and log-logistic distributions. In accelerated failure time models the regression component affects survival time by rescaling the time axis. This is illustrated in the left panel of Figure 5.1. The proportional hazard model is also known as the Cox model. It is the most popular survival regression model available since, as opposed to accelerated failure time models, it does not make any assumptions about the survival function. It was introduced by David Cox in 1972. In the Cox model, the regression component affects the hazard curve through multiplication. This is illustrated in the right panel of Figure 5.1. Many improvements and adjustments have been made to the Cox model since its introduction. Because of these improvements, which will be discussed later on, the Cox model is a strong candidate for churn prediction.


Figure 5.1: Left: The accelerated failure time model. Right: The proportional hazard model.

5.1 Cox model

Let X_ij denote the jth covariate of the ith subject, where i = 1, 2, ..., n and j = 1, 2, ..., p. One can think of X as an n × p matrix where X_i denotes the covariate vector of subject i. The Cox model specifies the hazard rate as

$$\lambda_i(t) = \lambda_0(t)\, e^{X_i \beta}, \tag{5.1}$$

where λ_0 is called the baseline hazard and β is a p × 1 vector of regression coefficients. The latter is familiar from multivariate regression. The baseline hazard λ_0 is an unspecified nonnegative function over time which can be interpreted as the average hazard at time t. In contrast to accelerated failure time models, the baseline hazard does not have to follow a particular distribution. This is a significant advantage over parametric survival models, which are restricted by the shape of a particular distribution. The hazard rate at time t is thus the product of a scalar, e^{X_i β}, and the baseline hazard at time t. To put it differently, the covariates increase or decrease the hazard function by a constant factor relative to the baseline hazard function.

The Cox model is also referred to as the proportional hazard model since the hazard ratio for two subjects with covariate vectors X_i and X_j, given by

$$\frac{\lambda_i(t)}{\lambda_j(t)} = \frac{\lambda_0(t)\, e^{X_i \beta}}{\lambda_0(t)\, e^{X_j \beta}} = \frac{e^{X_i \beta}}{e^{X_j \beta}}, \tag{5.2}$$

is constant over time. This implies that the covariates must have the same effect on the hazard at any point in time.
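The proportionality property can be checked numerically with a small sketch. The coefficient values, covariate vectors and baseline hazard values below are made up for illustration; the function names are hypothetical.

```python
import math
from typing import Sequence


def linear_predictor(x: Sequence[float], beta: Sequence[float]) -> float:
    """Inner product X_i beta."""
    return sum(xk * bk for xk, bk in zip(x, beta))


def hazard(baseline: float, x: Sequence[float], beta: Sequence[float]) -> float:
    """Cox hazard (Equation 5.1) for a given baseline hazard value at some time t."""
    return baseline * math.exp(linear_predictor(x, beta))


def hazard_ratio(x_i: Sequence[float], x_j: Sequence[float], beta: Sequence[float]) -> float:
    """Hazard ratio of Equation 5.2; the baseline hazard cancels out."""
    return math.exp(linear_predictor(x_i, beta) - linear_predictor(x_j, beta))


beta = [0.5, -0.2]          # illustrative coefficients
x_i, x_j = [2.0, 1.0], [1.0, 1.0]

ratio = hazard_ratio(x_i, x_j, beta)   # equals exp(0.5)
# The same ratio is obtained for any baseline hazard value, i.e. at any time t.
ratio_early = hazard(0.01, x_i, beta) / hazard(0.01, x_j, beta)
ratio_late = hazard(0.30, x_i, beta) / hazard(0.30, x_j, beta)
```

Whatever value the baseline hazard takes at a given time, the ratio between the two subjects stays at e^{0.5}, which is what the proportional hazards assumption states.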


Estimation of the parameter $\beta$ is based on the partial likelihood function and is performed without specifying or even estimating the baseline hazard. The term partial likelihood is used because the likelihood formula considers probabilities only for those subjects who have an event, and does not explicitly consider probabilities for those subjects who are censored [9]. The likelihood can be written as a product of several likelihoods, one for each event time. The likelihood at event time $i$ denotes the likelihood of having an event at that time, given survival up to that time. The set of individuals at risk at event time $i$ is called the risk set and is denoted by $R(i)$. The partial likelihood is then given by

$$L(\beta) = \prod_{i} \frac{e^{X_i \beta}}{\sum_{j \in R(i)} e^{X_j \beta}}. \tag{5.3}$$

This equation holds as long as there are no tied event times, or ties in short. Ties are events that occur at exactly the same time. Adjustments to Equation 5.3 are necessary to deal with ties. These will not be discussed here, but an extensive explanation can be found in [12].
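To make Equation 5.3 concrete, the sketch below evaluates the logarithm of the partial likelihood for a small dataset without ties. It is a minimal Python illustration, not the estimation routine used in this study; the function name and the toy data are assumptions introduced here, and the risk set is taken to be all subjects whose observed time is at least the event time.

```python
import math

def cox_partial_log_likelihood(times, events, covariates, beta):
    """Logarithm of Equation 5.3, assuming there are no tied event times.

    times      -- observed time per subject (event time or censoring time)
    events     -- 1 if the subject had an event, 0 if the subject was censored
    covariates -- one covariate vector per subject
    beta       -- regression coefficients
    """
    scores = [sum(x * b for x, b in zip(row, beta)) for row in covariates]
    log_likelihood = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through the risk sets
        # Risk set R(i): every subject still under observation at time t_i.
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        log_denominator = math.log(sum(math.exp(scores[j]) for j in risk_set))
        log_likelihood += scores[i] - log_denominator
    return log_likelihood

# Toy data: four subjects, one covariate, the second subject is censored.
toy_times = [2, 3, 5, 7]
toy_events = [1, 0, 1, 1]
toy_covariates = [[1.0], [0.0], [1.0], [0.0]]
log_lik_at_zero = cox_partial_log_likelihood(toy_times, toy_events, toy_covariates, [0.0])
```

With all coefficients equal to zero every subject in a risk set is equally likely to have the event, so the three event times contribute factors 1/4, 1/2 and 1, and `log_lik_at_zero` equals the logarithm of 1/8. Maximizing this function over the coefficients yields the estimate of the regression coefficients; the baseline hazard never appears in it.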

5.2 Extended Cox model

The Cox model, as discussed in the previous section, is not suitable as a predictive model for prepaid churn. The main reason is that the explanatory variables, or covariates, are fixed over time. This implies that we can only use explanatory variables that are time-independent, such as address or sex. Since we do not have this kind of variable at our disposal (for reasons explained in Chapter 2) and, more importantly, because we want to use variables that capture the (current) behaviour of customers, the use of time-independent covariates is far from optimal.

Fortunately, several improvements and adjustments have been made to the Cox model in the last several years. With regard to the present work, we are particularly interested in the improvements made by Andersen and Gill [17]. This adapted model is often referred to as the extended Cox model. The extended Cox model includes the ability to accommodate left-censored data, time-varying covariates, and multiple events [17]. The ability to include time-varying covariates makes it possible to use covariates that capture the behaviour of customers much better. Another advantage is that the proportionality assumption does not have to hold in this case, since the covariates are dependent on time. Equation 5.1 is slightly adjusted to let $X_i$ depend on time. The extended Cox model can be defined as

$$\lambda_i(t) = \lambda_0(t) \exp\left( \sum_{i=1}^{p_1} \beta X_i + \sum_{j=1}^{p_2} \beta X_j(t) \right). \tag{5.4}$$


Notice that the covariates are now split into $p_1$ time-independent covariates and $p_2$ time-dependent covariates. In order to include time-varying covariates in the Cox model, a counting process formulation is required. A counting process is a stochastic process starting at 0 whose sample paths are right-continuous step functions with jumps of height 1 [17]. Recall that the pair $(T_i^*, \delta_i)$ is normally used to represent an observation. The counting process formulation replaces this with $(N_i(t), Y_i(t))$, where $N_i(t)$ is the number of observed events in $[0, t]$ for subject $i$, whereas $Y_i(t)$ is 1 if subject $i$ is under observation and at risk at time $t$ and 0 otherwise. Right-censored data is a special case of this formulation. Two examples are shown in Figure 5.2. These correspond to the $(T^*, \delta)$ pairs (3,0) and (4,1). The counting process formulation makes it immediately possible to include multiple event times and multiple at-risk intervals. The application of this new formulation is no longer restricted to survival analysis; it can also be used to model non-stationary Poisson processes, Markov processes and other processes [17].

Figure 5.2: The counting process formulation. Left: Right censoring. Right: No censoring.
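The two examples of Figure 5.2 can also be written out explicitly. The sketch below is a simplified Python illustration under the assumption of a single event per subject and an integer time grid; the function name and the horizon of five time units are hypothetical.

```python
def counting_process(observed_time, event_indicator, horizon):
    """Return the paths N(t) and Y(t) for t = 1, ..., horizon for one subject.

    N(t) counts the events observed in [0, t]; Y(t) is 1 while the subject is
    under observation and at risk at time t, and 0 otherwise.
    """
    n_path, y_path = [], []
    for t in range(1, horizon + 1):
        n_path.append(1 if event_indicator == 1 and t >= observed_time else 0)
        y_path.append(1 if t <= observed_time else 0)
    return n_path, y_path

# The pairs (T*, delta) = (3, 0) and (4, 1) of Figure 5.2.
censored_n, censored_y = counting_process(3, 0, horizon=5)
event_n, event_y = counting_process(4, 1, horizon=5)
```

For the censored subject `censored_n` stays at zero and `censored_y` drops to zero after time 3. For the subject with an event, `event_n` jumps to one at time 4 and `event_y` drops to zero afterwards.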

The extended Cox model is already implemented in the statistical modelling language R. R features many additional packages and a large on-line community. The ‘event history analysis’ package is used here for modelling.


5.3 Survival data

Since we are dealing with survival analysis, the data must be represented as survival data. As discussed in Section 4.2, there are several recording properties that have to be defined:

• The origin of time is chosen to be equal to the commitment date of the customer.

• The customers are followed for a maximum period of 15 months.

• The time scale is set to months.

The proportion of churned and censored customers in the training set is made equal, because this appears to provide the best results [14].

5.4 Discrete event times

Ties are events that occur at exactly the same time. In order to compute the partial likelihood, the event times themselves are not important; only their ordering matters. Due to the process of ordering, tied event times are only registered once. This results in a misrepresentation of the survival data. There are several ways to deal with this problem. The methods are only mentioned here; a detailed explanation can be found in [17]. According to the so-called exact method, time is considered to be continuous. Ties are then a result of imprecise measurement and will rarely occur. The discrete method, on the other hand, assumes that time is discrete, and that ties really occur at the same time. Since events are only measured at the end of each month, we are dealing with discrete event times. The discrete method provided in R is used to handle tied event times.

5.5 Lagging

An important assumption of the extended Cox model is that the effect of a time-dependent covariate $X_i(t)$ on the survival probability at time $t$ depends on the value of this covariate at that same time $t$, and not on a value at an earlier or later time [9]. Since the covariate values $X_i(t)$ are not available until time $t$, the model can only predict the hazard for time $t$. In our churn application this can be easily interpreted. The data of a customer is only completely available when a month has ended. The hazard is then predicted for that month. In fact, the prediction is never up to date. If the customer churned during that month, this could only be predicted afterwards, when it is already too late. Our goal is thus to forecast a future time $t + L$, where $L$ is typically 1 or 2. Notice that this would not be an issue if we did not incorporate time-dependent covariates. After all, Equation 5.1 shows that the time aspect is then only represented by the baseline hazard, and not by the covariates. To be able to forecast a future time using time-dependent covariates, the time-dependent covariates must be forecasted or lagged by $L$ units. We have chosen a lag time of 1. This means that the data of the previous month is used to predict the hazard of this month. The extended Cox model can be written to allow for a lag-time modification of any time-dependent covariate as follows,

$$\lambda_i(t) = \lambda_0(t) \exp\left( \sum_{i=1}^{p_1} \beta X_i + \sum_{j=1}^{p_2} \beta X_j(t - L_j) \right), \tag{5.5}$$

where $L_j$ denotes the lag time for time-dependent covariate $j$.
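In practice, lagging amounts to shifting each monthly covariate series by the lag time before it is offered to the model. The following Python sketch illustrates this for a single customer. The function name and the usage figures are hypothetical, and marking the first months as missing is an assumption of this sketch rather than a description of how the data in this study were prepared.

```python
def lag_covariate(monthly_values, lag):
    """Shift a monthly series so that position t holds the value of month t - lag.

    The first `lag` months have no earlier observation and are set to None.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return list(monthly_values)
    return [None] * min(lag, len(monthly_values)) + list(monthly_values[:-lag])

# Hypothetical outgoing call duration of one customer in months 1 to 5.
usage = [120, 95, 40, 0, 0]
lagged_usage = lag_covariate(usage, lag=1)
```

With a lag time of 1, the hazard of month 4 is predicted from the value 40 observed in month 3, which is already completely available when month 4 starts.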

5.6 Principal component regression

The main aim of principal component analysis (PCA) is to reduce the dimensionality of the dataset while retaining as much as possible of the variation present in the dataset. This is achieved by transforming the original variables into a new set of variables, called the principal components. Each principal component is a linear combination of the original variables. The principal components are uncorrelated and ordered, which means that the first principal components retain most of the variation present in the dataset [8]. The principal components are the eigenvectors of the correlation matrix of the mean-centered dataset. The corresponding eigenvalues indicate the importance of the principal components. A more in-depth explanation of principal component analysis can be found in [8]. Figure 5.3 shows a scree plot of the PCA performed on the variables of Appendix A. A scree plot is a plot of the eigenvalues of the ordered principal components.

In principal component regression the principal components are used as regressors instead of the original explanatory variables. There are two reasons to use the principal components as regressors rather than the explanatory variables. Firstly, the explanatory variables are often highly correlated, which may cause inaccurate estimates of the regression coefficients [3]. This can be avoided by using the principal components in place of the original variables, since the principal components are uncorrelated. Hierarchical variable clustering performed on the original variables and the first 20 principal components is used to visualize this. Variable clustering is a technique used for assessing collinearity and redundancy. The Spearman correlation measure is used as a distance measure. The results are shown in Figure 5.4. Notice that the most correlated cluster of the principal components has a value of 0.3, whereas the most correlated cluster of the original variables is near a value of 0.9. Collinearity is thus greatly reduced.

Figure 5.3: The principal components versus the explained variance.

Secondly, the dimensionality of the regressors is reduced by taking only a subset of principal components for prediction [3]. We used the first 20 components as regressors, since the contribution of the last 12 components is negligible (Figure 5.3). Although this can be considered a safe margin, it is known that principal components with the largest variances are not necessarily the best predictors [3]. In the extended Cox model based on churn definition 1, all principal components are significant. In the model based on churn definition 2 there are two non-significant principal components. These are removed from the model. There is no risk of overfitting since we utilize a large dataset. A rule of thumb for how many variables can be used without the risk of overfitting is given by $P < m/10$, where $P$ denotes the number of variables and $m$ the number of events in the dataset [6]. Since we have a dataset with thousands of events, this is not an issue.


Figure 5.4: Hierarchical variable clustering performed on the original variables and the principal components.

5.7 Handling nonlinearity

In this section the Cox model is tested for nonlinearity. Nonlinearity occurs when the regression part of the model has an incorrectly specified functional form. The linearity assumption is then violated. This disadvantage of the Cox model is normally associated with linear regression models. Nonlinearity can be detected by plotting the martingale residuals against the covariates. Martingale residuals measure ‘excess events’ in a subject. For example, a customer who churned earlier than expected by the model will have a positive residual. A customer who churned later will have a negative martingale residual. Figure 5.5 provides an example of a variable that meets the linearity assumption and one that does not. Smoothing of the plots is done by local linear regression. The nonlinear variables are not removed but are transformed using restricted cubic splines. An explanation of the application of splines can be found in [6].

5.8 Predictive score method

Survival models are not designed to be used for classification or prediction [1]. Therefore, a specific procedure is required. A predictive score method is used to classify churn events.


Figure 5.5: Left: Nonlinearity. Right: Linearity.

The lag-time extended Cox model as discussed above is used for predicting churn in the next month. The predictive score method is based on the hazard function $\lambda$. Recall that $\lambda(t)\Delta t$ is approximately the probability of having an event in the short interval of length $\Delta t$ following time $t$, given that one has survived (i.e. not had an event) up to time $t$. The hazard is thus a good candidate for measuring churn. A threshold is set to distinguish churn behaviour from non-churn behaviour. This threshold is empirically determined. If the hazard exceeds the threshold, the customer is classified as churned.
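The classification step of the predictive score method is simple enough to be sketched in a few lines. The Python fragment below is an illustration only; the hazard values and the threshold of 0.10 are hypothetical, whereas in this study the threshold was determined empirically.

```python
def classify_months(hazards_by_month, threshold):
    """Return 1 (churn) for every month whose predicted hazard exceeds the threshold, else 0."""
    return [1 if hazard > threshold else 0 for hazard in hazards_by_month]

def first_churn_month(hazards_by_month, threshold):
    """First month (1-based) classified as churn, or None if the threshold is never exceeded."""
    for month, hazard in enumerate(hazards_by_month, start=1):
        if hazard > threshold:
            return month
    return None

# Hypothetical predicted hazards of one customer for months 1 to 5.
predicted_hazards = [0.02, 0.03, 0.05, 0.12, 0.30]
monthly_labels = classify_months(predicted_hazards, threshold=0.10)
churn_month = first_churn_month(predicted_hazards, threshold=0.10)
```

A lower threshold flags more customers as churned and therefore trades specificity for sensitivity, which is why the threshold has to be tuned on data.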

Figure 5.6 shows the estimated survival curve and the estimated baseline hazard for churn definition 2. Notice that since we used $\alpha = 3$ for churn definition 2, every customer survives the first three months. Figure 5.7 shows an example of a customer who churns at month 11 and the corresponding hazard values including the threshold.


Figure 5.6: Left: The estimated survival curve. Right: The estimated baseline hazard.

Figure 5.7: Left: An illustration of customer behaviour. Right: The corresponding hazard values including the threshold.


Chapter 6

A decision tree as predictive churn model

So far we have discussed the extended Cox model. Since we are also interested in its performance compared to commonly used predictive models, a second predictive model is discussed, the so-called decision tree. Decision tree models are widely used in the field of data mining. In Chapter 7, the performance of the decision tree is compared to the performance of the survival model. This is done by applying both models to a new, unseen test set. If, for instance, the survival model turns out to perform much worse than the decision tree, we can conclude that the survival model is a poor predictive model for this specific problem. The decision tree is thus used to put the performance of the survival model in perspective.

Decision trees can be split into classification and regression trees. Classification trees are used to predict a categorical outcome, whereas regression trees are used in case of a continuous outcome. Since we are dealing with a binary outcome, i.e. churn, a classification tree is used. In a decision tree each interior node corresponds to a variable. An arc to a child represents a possible value of that variable. A leaf represents the outcome given the values of the variables represented by the path from the root.

One of the advantages of decision trees is that they can be very easily interpreted, since they produce a set of understandable rules. Neural networks, on the other hand, are so-called black boxes. A trained neural network contains several optimized parameters and weights which cannot be interpreted easily. It is therefore not possible to understand why a neural network gives a particular outcome.

A decision tree is a supervised model and thus requires a labeled training set. In this study, the training set contains all the variables listed in Appendix A. Each record, or observation, contains a snapshot of a single month of a customer. The outcome of an observation, ‘churn’ or ‘non-churn’, is indicated by a 1 or 0 respectively. As with the extended Cox model, the variables here are also lagged to be able to predict churn in the next month.

Decision trees are built through a process known as recursive partitioning. Recursive partitioning is an iterative process of splitting the data up into two (or more) partitions [11]. A high-level view of this procedure is provided by Figure 6.1.

Figure 6.1: An overview of the recursive partitioning procedure.

A splitting criterion is required to evaluate a split made by an explanatory variable. The splitting criterion used in this study is the Gini-index. The Gini-index is a measure of impurity of a split at a particular node. The Gini-index is defined as

$$1 - \sum_{k} p_{ik}^2, \tag{6.1}$$

where $k$ indexes the different classes and $p_{ik}$ denotes the relative frequency of class $k$ in node $i$. The candidate split with the lowest Gini-index is used for splitting the node’s observations.
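As an illustration of Equation 6.1, the Python sketch below computes the Gini-index of a node and compares two candidate splits. The size-weighted average over the two child nodes is the weighting commonly used in classification trees; it is an assumption of this sketch, since the text above does not spell out how the child nodes are combined. The labels are hypothetical.

```python
from collections import Counter

def gini_index(labels):
    """Gini-index 1 - sum_k p_k^2 of the class labels in one node (Equation 6.1)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Size-weighted average Gini-index of the two child nodes of a candidate split."""
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini_index(left_labels)
            + len(right_labels) * gini_index(right_labels)) / total

# Hypothetical node with two churn (1) and two non-churn (0) observations.
node_impurity = gini_index([1, 1, 0, 0])
pure_split = split_gini([1, 1], [0, 0])    # separates the classes completely
useless_split = split_gini([1, 0], [1, 0])  # leaves both children as mixed as the parent
```

The split with the lower value, here `pure_split`, would be preferred.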


6.1 Finding the optimal tree size

When decision trees do not incorporate a stopping criterion they are likely to overfit the training set. Overfitting occurs when the model tries to capture artefacts and noise present in the training set by adding more and more splits. Splits can be added until the whole training set is explained. The problem of overfitting is therefore more dependent on the number of splits than on the number of explanatory variables [11]. Since an overfitted tree has lost its ability to generalize, it will only have limited predictive power on new, unseen observations.

Overfitting can be avoided by means of two approaches, namely prepruning and postpruning. Prepruning is done by setting a maximum tree depth or by setting a minimum increase in fit per split. This prevents the tree from growing too large. Postpruning is used in this study and is performed by removing branches from a fully grown, and thus probably overfitted, tree. To decide which pruned tree performs best, 10-fold cross-validation is applied. 10-fold cross-validation works as follows. The training set is split into 10 subsets. Each of the 10 subsets is left out in turn. The tree is learned with the remaining data and the result is used to predict the outcome for the subset that has been left out. Each prediction is independent of the data to which it is applied. As a consequence, cross-validation gives an unbiased estimate of the predictive power [11]. Figure 6.2 provides plots of the relative error versus the tree size.
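The cross-validation procedure described above can be sketched as follows. This Python fragment is a generic illustration and not the routine used in R for this study: the function names are hypothetical, the observations are shuffled with a fixed seed before they are divided, and `fit` and `predict` stand for whatever learner is being evaluated.

```python
import random

def k_fold_indices(n_observations, k=10, seed=0):
    """Yield (train_indices, test_indices); every observation is held out exactly once."""
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test

def cross_validated_error(rows, labels, fit, predict, k=10, seed=0):
    """Misclassification rate estimated by k-fold cross-validation.

    fit(train_rows, train_labels) returns a model and predict(model, row)
    returns a label; both are placeholders for the learner under evaluation.
    """
    errors = 0
    for train, test in k_fold_indices(len(rows), k, seed):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        errors += sum(predict(model, rows[i]) != labels[i] for i in test)
    return errors / len(rows)
```

To choose a tree size, this estimate would be computed for each candidate pruned tree and the size with the lowest cross-validated error would be kept.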

Figure 6.2: Relative error versus tree size. Left: Churn definition 1. Right: Churn definition 2.


These plots show that the optimal tree size, in the case of churn definition 1, is equal to 18 splits and, in the case of churn definition 2, equal to 15 splits. The final trees are shown in Appendix B and C.

6.2 Oversampling

Oversampling is a technique that alters the proportion of the outcomes in the training set. More specifically, it increases the proportion of the less frequent outcome. This will make a model more sensitive to the outcome that is least represented [2]. Imagine that there are 100,000 observations of which 99,000 are labeled 0, and only 1,000 are labeled 1. Almost any model will classify a new observation as 0, as it will be right 99% of the time. This is exactly the case in our churn application. Churn outcomes are under-represented when compared to non-churn outcomes. This can be explained by the fact that churn only occurs in a single month, whereas non-churn occurs in many months. Oversampling is therefore used to increase the frequency of the churn outcomes. The proportions of churn and non-churn represented in the training set are 1/3 and 2/3 respectively. This is a typical split that works well [2]. The actual training set is built with all its churn observations and is filled up with twice as many randomly selected non-churn observations.
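The construction of the oversampled training set can be sketched as follows. The Python fragment is an illustration under stated assumptions: the function name, the `label_of` accessor and the toy observations are hypothetical, and the non-churn observations are drawn without replacement with a fixed seed.

```python
import random

def build_training_set(observations, label_of, ratio=2, seed=0):
    """Keep every churn observation and add `ratio` times as many random non-churn ones."""
    churn = [obs for obs in observations if label_of(obs) == 1]
    non_churn = [obs for obs in observations if label_of(obs) == 0]
    n_non_churn = min(len(non_churn), ratio * len(churn))
    sampled = random.Random(seed).sample(non_churn, n_non_churn)
    return churn + sampled

# Toy data: 1,000 (id, label) pairs of which only 10 are churn observations.
toy_observations = [(i, 1 if i < 10 else 0) for i in range(1000)]
training_set = build_training_set(toy_observations, label_of=lambda obs: obs[1])
```

The resulting set contains the 10 churn observations and 20 non-churn observations, which gives the proportions of 1/3 and 2/3 mentioned above.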


Chapter 7

Tests and results

The goal of the tests discussed in this chapter is to gain insight into the performance of the extended Cox model discussed in Chapter 5. This is achieved by offering the same test set to the extended Cox model and the decision tree. By doing so, a direct comparison is made possible. Other tests, like determining the optimal tree size, have been discussed earlier in the appropriate sections.

7.1 Tests

The tests are performed on an AMD Athlon 3000+ processor, running Windows XP and R 2.3.1. Both models are tested with churn definition 1 and 2 as explained in Chapter 3. This is done in order to see if the churn definition has a significant influence on the performance of the models.

The dataset of 20,000 customers has been split into a training set of 15,000 customers and a test set of 5,000 customers. The test set consists of all the months of history of the 5,000 customers and contains 1,313 churned customers, 3,403 non-churned customers and 284 outliers. The outliers are removed from the test set.

The optimization procedure of the extended Cox model with 10,000 customers takes approximately 18 seconds, whereas the training procedure of the decision tree with 15,000 observations takes approximately 17 seconds. Prediction on the test set takes approximately 12 seconds for both the decision tree and the extended Cox model. Thus, regarding running time, the models are roughly equal to each other.

Confusion matrices are commonly used to summarize the performance of predictive models. Tables 7.1 to 7.4 show the confusion matrices for both models and churn definitions. The upper left and the lower right values are also known as the sensitivity and specificity measures, respectively.
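The entries of such a row-normalized confusion matrix can be computed directly from the true and predicted labels. The sketch below is a minimal Python illustration with hypothetical labels; it assumes binary labels (1 for churn, 0 for non-churn) and that both classes occur in the test set.

```python
def confusion_rates(true_labels, predicted_labels):
    """Row-normalized confusion matrix for binary labels (1 = churn, 0 = non-churn)."""
    counts = {(t, p): 0 for t in (1, 0) for p in (1, 0)}
    for t, p in zip(true_labels, predicted_labels):
        counts[(t, p)] += 1
    n_churn = counts[(1, 1)] + counts[(1, 0)]
    n_non_churn = counts[(0, 1)] + counts[(0, 0)]
    return {
        "sensitivity": counts[(1, 1)] / n_churn,              # upper left value
        "false_negative_rate": counts[(1, 0)] / n_churn,      # upper right value
        "false_positive_rate": counts[(0, 1)] / n_non_churn,  # lower left value
        "specificity": counts[(0, 0)] / n_non_churn,          # lower right value
    }

# Hypothetical outcome for four churned and six non-churned customers.
true_classes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_classes = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
rates = confusion_rates(true_classes, predicted_classes)
```

Because each row is divided by the number of customers in the corresponding true class, the two values in a row sum to one.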

7.2 Results

                 Predicted class
True class       churn     non-churn
churn            0.99      0.01
non-churn        0.06      0.93

Table 7.1: Confusion matrix of the decision tree (churn definition 1).

                 Predicted class
True class       churn     non-churn
churn            0.93      0.07
non-churn        0.08      0.92

Table 7.2: Confusion matrix of the extended Cox model (churn definition 1).

                 Predicted class
True class       churn     non-churn
churn            0.99      0.01
non-churn        0.02      0.98

Table 7.3: Confusion matrix of the decision tree (churn definition 2).


                 Predicted class
True class       churn     non-churn
churn            0.96      0.04
non-churn        0.03      0.97

Table 7.4: Confusion matrix of the extended Cox model (churn definition 2).

We can state that the extended Cox model gives satisfying results with both a high sensitivity and a high specificity. However, the decision tree performs slightly better still. Evidently, the time aspect incorporated by the extended Cox model does not provide an advantage over the decision tree in this particular problem. Figure 5.6 gives a possible explanation for this. The survival curve is almost a linearly decreasing line after the fourth month, which indicates that the number of churned customers is approximately constant over time. Probably, the Cox model could have had an advantage over the decision tree if the problem had a less smooth survival curve.

At this point, it is necessary to put the results in perspective, since they depend on the churn definition used. This is already revealed in the difference between churn definition 1 and 2. Both models find it more difficult to predict churn as defined by churn definition 1, as Tables 7.1 and 7.2 show. After all, the longer a customer is allowed to have zero usage before he is considered as churned, the easier it is to predict the churn. Moreover, a new and different churn definition is likely to yield a different pattern of results. Notice however that if the obtained results could be ascribed to the simplicity of the problem, the optimal decision trees shown in Appendix B and C would be far less complicated.


Chapter 8

Conclusions and recommendations

The conclusions of the present study are based on the research question and sub questions stated in Chapter 1. By answering these questions we not only provide an overview of the work presented in this study, but also consider the extent to which we succeeded in answering them. Moreover, a number of recommendations for further research in this field are discussed.

8.1 Conclusions

The research question formulated in Chapter 1 is:


analysis?

The sub questions are formulated as:

• What is a proper, practical and measurable prepaid churn definition?

• How well do survival models perform in comparison to the established predictive models?

• Do survival models have an added value compared to the established predictive models?

First, these three questions will be discussed before the central research question is addressed.


What is a proper, practical and measurable prepaid churn definition?

Chapter 1 describes why the deactivation date stored in the database is not a proper definition for prepaid churn. To overcome this problem a churn definition is proposed. This churn definition should indicate as early as possible when a customer has permanently stopped using his prepaid sim-card. After extensively examining the behaviour of prepaid customers, an intuitive operational churn definition is proposed in Chapter 3. This definition allows for a large range of customer behaviours. The definition provides the data with churn labels in a consistent and understandable manner. However, for larger periods of zero usage the definition becomes less reliable.

How well do survival models perform in comparison to the established predictive models?

The answer to this question is based on the following situation. The survival model proposed in this study is an extended Cox model. This model is used to predict prepaid churn. In order to make a direct comparison, a decision tree is also considered. This model represents the ‘established predictive models’.

In terms of predictive power, the extended Cox model scores very well, with sensitivities ranging from 93% to 96% and specificities ranging from 92% to 97%. However, this is slightly less than the performance of the decision tree, which has in both cases a sensitivity of 99% and specificities ranging from 93% to 98%. The advantage of the extended Cox model over models like regression models, neural networks and decision trees is that it incorporates the time aspect by means of the baseline hazard. It is thus able to capture aspects of customer behaviour at specific points in time. Unfortunately, the extended Cox model cannot fully benefit from this feature in this problem. Probably for this reason, the extended Cox model does not outperform the decision tree.

Do survival models have an added value compared to the established predictive models?

Since survival analysis is originally designed to analyse data, it is well suited to that purpose. It gives insight into the behaviour of customers over time. This is a useful extension to exploratory data analysis. This type of data analysis is normally not considered by predictive modelling. Besides, after analysis a survival model can be applied to model the influence of customer behaviour on the event time. This is an elegant approach which works intuitively. Another advantage of survival models is the ability to capture aspects of customer behaviour at specific points in time. Common predictive models, on the other hand, use snapshots of single months to classify churn. They take no notice of the time aspect involved. Furthermore, survival models can handle censored data and stratification. In stratification a categorical variable is used to create separate baseline hazards. An advantage is that this time-independent covariate does not have to meet the proportional hazards assumption. Stratification can for instance be used to subdivide different customer groups. Finally, survival models can predict the probability of an event at a future point in time, given that only time-independent covariates are used.


analysis?

We have shown that it is possible to make a prepaid churn model using the extended Cox model. This survival model is able to predict churn in the next month. It yields very good results, also in comparison to the decision tree. It does not, however, outperform the decision tree on this prepaid churn problem.

8.2 Recommendations for future research

The proposed churn definition provides a plausible indication of churn. However, there are cases where this definition is not suitable. In particular, for customers with a very low level of consumption it is hard to indicate when they are going to churn. Therefore, some improvements could be made to the churn definition. In addition, mobile telecommunication companies should gain more insight into the (churn) behaviour of prepaid customers. This information could be obtained, for instance, by qualitative studies. Churn definitions will then become more reliable, and so will predictive churn models.

Another prepaid phenomenon which could be examined in the future is the switching of sim-cards between individuals. Sim-cards are often handed over to relatives or friends when they are not used anymore. Although these customers have actually churned, this cannot be observed. This causes a distorted view of churn behaviour and should be acknowledged.

In this study we have shown that survival analysis, and in particular the extended Cox model, provides very satisfying results. However, since the application of survival analysis in data mining is relatively new, there is still considerable room for improvement. Current research [16] shows that neural networks are also appropriate to model survival data. This could further improve the accuracy of survival models, as neural networks can handle nonlinear relations.

Furthermore, stratification could be applied to distinguish different customer profiles and so gain a better performance. Since no such profiles were available, this has not been carried out in this study.

Finally, more research could be performed into predictive scoring methods. In this work we used the hazard in combination with a threshold to identify churn. This is an efficient method which gave satisfying results. However, new score methods or combinations of existing methods could be examined in the future.


Bibliography

[1] S. Balcaen and H. Ooghe. Alternative methodologies in studies on business failure: do they produce better results than the classic statistical methods? Vlerick Leuven Gent Management School Working Paper Series, 2004.

[2] M. Berry and G. Linoff. Mastering Data Mining. John Wiley and Sons, New York, USA, 2000.

[3] P. Filzmoser. Robust principal component regression. In Proceedings of the Sixth International Conference on Computer Data Analysis and Modeling, pages 132–137, Minsk, Belarus, 2001.

[4] P. F. G. Bijwaard and R. Paap. Modeling purchases as repeated events. Econometric Institute Reports, 2003.

[5] M. Halling and E. Hayden. Bank failure prediction: A two-step survival time approach. http://ssrn.com/abstract=904255, 2006.

[6] F. Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, New York, USA, 2001.

[7] M. P. J. Ferreira, M. Vellasco and C. Barbosa. Data mining techniques on the evaluation of wireless churn. In ESANN, pages 483–488, 2004.

[8] I. Jolliffe. Principal Component Analysis. Springer, New York, USA, 2002.

[9] D. Kleinbaum and M. Klein. Survival Analysis: A Self-Learning Text. Springer, New York, USA, 2005.

[10] D. G. M. Mozer, R. Wolniewicz and H. Kaushansky. Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry. IEEE Transactions on Neural Networks, 11:690–696, 2000.

[11] J. Maindonald and J. Braun. Data Analysis and Graphics Using R. Cambridge University Press, Cambridge, UK, 2003.

[12] R. Miller. Survival Analysis. John Wiley and Sons, New York, USA, 1981.

[13] K. Morik and H. Köpcke. Analysing customer churn in insurance data: a case study. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 325–336, New York, USA, 2004.

[14] D. V. D. Poel and B. Larivière. Customer attrition analysis for financial services using proportional hazard models. Working Papers of Faculty of Economics and Business Administration, Ghent University, 2003.

[15] W. Potts. Survival data mining. http://www.data-miners.com, 2001.

[16] B. Ripley and R. Ripley. Neural networks as statistical methods in survival analysis. In Artificial Neural Networks: Prospects for Medicine. Landes Biosciences Publishers, 1998.

[17] T. Therneau and P. Grambsch. Modeling Survival Data: Extending the Cox Model. Springer, New York, USA, 2000.


Appendix A: Derived variables

average out dur per call = average duration of a single outgoing call
average in dur per call = average duration of a single incoming call
ratio dur per call = ratio between outgoing and incoming duration
total rev = sum of incoming revenue and outgoing revenue
non usage interval = the current number of successive non usage months
max non usage interval = the maximum number of successive non usage months
compare = a comparison between the non usage interval and max non usage interval
total rev sum = cumulative total rev
non usage sum = cumulative number of non usage months
total recharge num sum = cumulative number of recharges
total in dur avg = average duration of incoming calls over all past months
total out dur avg = average duration of outgoing calls over all past months
length last voicemail = the number of months since the last voicemail call
length last recharge = the number of months since the last recharge
last recharge val = the last recharge amount
average recharge val = the average recharge amount of all past months

3 months average = the average over the last three months

diff = the ratio between the average over the last three months and the average over all months before

sms point to point call = number of sms messages

dur = duration in seconds

call = number of calls

total out dur moving diff
total out dur 3 months average
total out call moving diff
total out call 3 months average
total in dur moving diff
total in dur 3 months average
total in call moving diff
total in call 3 months average
international dur moving diff
international dur 3 months average
international call moving diff
international call 3 months average
total rev moving diff
total rev 3 months average
sms point to point call moving diff
sms point to point call 3 months average
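The ‘3 months average’ and ‘moving diff’ variables listed above follow directly from the definitions given earlier in this appendix. The Python sketch below shows one possible way to derive them from a monthly series. The function names and the example series are hypothetical, the series is assumed to contain at least one month, and returning None when no earlier months exist or when their average is zero is an assumption of this sketch, not a description of how such cases were handled in this study.

```python
def three_months_average(monthly_values):
    """Average over the last three months of a non-empty monthly series."""
    last_three = monthly_values[-3:]
    return sum(last_three) / len(last_three)

def moving_diff(monthly_values):
    """Ratio between the average of the last three months and the average of all months before.

    Returns None when there are no earlier months or their average is zero.
    """
    earlier = monthly_values[:-3]
    if not earlier:
        return None
    earlier_average = sum(earlier) / len(earlier)
    if earlier_average == 0:
        return None
    return three_months_average(monthly_values) / earlier_average

# Hypothetical total outgoing duration of one customer over six months.
total_out_dur = [100, 100, 100, 50, 50, 50]
recent_average = three_months_average(total_out_dur)
recent_ratio = moving_diff(total_out_dur)
```

A ratio well below one, as in this example, indicates that recent usage has dropped compared to the customer's earlier behaviour.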


Appendix B: Decision tree for churn definition 1


Appendix C: Decision tree for churn definition 2
