
Published on 07-Sep-2015

DESCRIPTION

Insightful decision making of your data.


K14271 2013/2/15 9:12 ii


3 Data Mining in a Nutshell

The purpose of this chapter is to introduce the reader to the main concepts and data mining tools used in business analytics. Methods are described at a non-technical level, focusing on the idea behind the method, how it is used, advantages and limitations, and when the method is likely to be of value to business objectives.

The goal is to transform data into information, and information into insight.

- Carly Fiorina (1954- ), President of Hewlett Packard, 1999-2005

3.1 What Is Data Mining?

Data mining is a field that combines methods from artificial intelligence, machine learning[1], statistics, and database systems. Machine learning and statistical tools are used for the purpose of learning through experience, in order to improve future performance. In the context of business analytics, data mining is sometimes referred to as advanced analytics[2].

[1] Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases.

[2] We avoid referring to analytics as "simple" or "advanced" as these terms more appropriately describe the usage level of analytics.

We want machines that are able to learn for several reasons. From large amounts of data, hidden relationships and correlations can be extracted. Scenarios such as changing environments highlight the need for machines that can learn how to cope with modifying surroundings. Computer learning algorithms that are not produced by detailed human design but by automatic evolution can accommodate a constant stream of new data and information related to a task.

Data mining focuses on automatically recognizing complex patterns from data, to project likely outcomes. Learning is defined as the acquisition of knowledge or skill through experience. In data mining, we train computational methods to learn from data for the purpose of applying the knowledge to new cases.

42 getting started with business analytics

The main challenge of data mining techniques is the ability to learn from a finite set of samples (data) and be able to generalize and produce useful output on new cases (scoring).

Within data mining, algorithms are divided into two major types of learning approaches:

Supervised learning: we know the labels or outcomes for a sample of records, and we wish to predict the outcomes of new or future records. In this case, the algorithm is trained to detect patterns that relate inputs and the outcome. This relationship is then used to predict future or new records. An example is predicting the next move in a chess game.

The fundamental approach in supervised learning is based on training the model and then evaluating its performance. Data are therefore typically segmented into three portions:

Training data: the data used for training the data mining algorithm or model.

Validation data: used to tweak models and to compare performance across models. The rule of thumb is 80-20 for training and validation data.

Test data (or hold-out data): used to evaluate the final model's performance, based on its ability to perform on new, previously unseen data.

Unsupervised learning: we have a sample of records, each containing a set of measurements but without any particular outcome of interest. Here the goal is to detect patterns or associations that help find groups of records or relationships between measurements, to create insights about relationships between records or between measurements. An example is Amazon's recommendation system that recommends a set of products based on browsing and purchase information.
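The three-way segmentation used in supervised learning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_records`, the 20% hold-out share, and the fixed random seed are ours, not the chapter's. The 80-20 split of the remaining records into training and validation data follows the rule of thumb quoted above.

```python
import random

def split_records(records, test_frac=0.2, train_share=0.8, seed=0):
    """Shuffle records, hold out a test set, then split the rest 80-20.

    test_frac and seed are illustrative assumptions; train_share=0.8
    mirrors the 80-20 rule of thumb for training vs. validation data.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_frac)
    test = shuffled[:n_test]            # hold-out data, touched only at the end
    remainder = shuffled[n_test:]

    n_train = round(len(remainder) * train_share)
    train = remainder[:n_train]         # used to fit the model
    validation = remainder[n_train:]    # used to tweak and compare models
    return train, validation, test

train, validation, test = split_records(range(100))
print(len(train), len(validation), len(test))  # 64 16 20
```

Shuffling before cutting matters when the records are stored in a meaningful order (for example by date or by customer ID); otherwise the three portions may not resemble each other.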

3.2 Predictive Analytics

This set of tools includes a wide variety of methods and algorithms from statistics and machine learning. We cover a few of the most popular predictive analytics tools. The interested reader can obtain information about further methods or further technical details from more specialized books.

Figure 3.1: The difference between supervised and unsupervised problems. In the supervised learning task we try to classify the object as a chair (+1) or an apple (-1). In the unsupervised learning case, we try to measure the similarity of an item to other items.

Computers are useless. They can only give you answers.

- Pablo Picasso (1881-1973)

Supervised Learning

In supervised learning, for each record we have a set of input measurements as well as a known target or outcome measurement. For example, in a customer database of mobile phone users, where we are interested in modeling customer churn, a record is a customer. For each customer, input measurements can include demographic information as well as call and billing history. A possible outcome measurement is whether the customer stays with the company for at least a year.

The purpose of supervised learning methods is to find a relationship between the input measurements and the outcome measurement. In the mobile customer churn example, we are looking for a relationship between customer attributes and behavior and their attrition.

Another classic example of a supervised learning task is the prediction of spam (unsolicited email) for the purpose of spam filtering. Each record is an email message, for which we have multiple input measurements such as the sender address, the title, and text length. The outcome of interest is a label of "spam" or "non-spam."

In the above examples of customer churn and spam, the outcome measurement is categorical: whether a customer stays or not, or whether an email is spam or not. This type of outcome is called a class. Predicting the outcome is therefore called classification.

Supervised learning includes scenarios where the outcome measurement is either categorical or numerical. Some examples of a numerical outcome are predicting the duration of service calls at a call center based on input measurements that are available before the call is taken, or predicting the amount of cash withdrawn in each ATM transaction before the actual amount is keyed in by the customer. When the outcome is numerical, the supervised learning task is called prediction[3].

[3] In machine learning, the term used for predicting a numerical outcome is regression.

The following supervised learning techniques are used for classification and/or prediction. The various methods, each with strengths and weaknesses, approach the task of detecting potential relationships between the input and outcome measurements differently.

k-Nearest Neighbors (k-NN)

k-nearest neighbors (k-NN) algorithms are useful for both classification and prediction. They can be used to predict categorical and numerical outcomes. The algorithm identifies k records in the training set that are most similar to the record to be predicted, in terms of input measurements. These k neighbors are then used to generate a prediction of the outcome for the record of interest. If the outcome is categorical, we let the neighbors "vote" to determine the predicted class of the record of interest. If the outcome is numerical, we simply take an average of the neighbors' outcome measurements to obtain the prediction.
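The vote-or-average idea can be sketched as follows. This is a minimal illustration, not a production implementation: the function names and the toy records are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter

def k_nearest(train_inputs, new_input, k):
    """Return the indices of the k training records closest to new_input."""
    # Euclidean distance is an assumption here; other measures could be swapped in.
    ranked = sorted(range(len(train_inputs)),
                    key=lambda i: math.dist(train_inputs[i], new_input))
    return ranked[:k]

def knn_classify(train_inputs, train_labels, new_input, k):
    """Categorical outcome: the neighbors vote and the most common class wins."""
    neighbors = k_nearest(train_inputs, new_input, k)
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_predict(train_inputs, train_values, new_input, k):
    """Numerical outcome: average the neighbors' outcome measurements."""
    neighbors = k_nearest(train_inputs, new_input, k)
    return sum(train_values[i] for i in neighbors) / len(neighbors)

# Hypothetical toy records with two input measurements each.
inputs = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["stay", "stay", "stay", "churn", "churn", "churn"]
values = [10, 12, 11, 30, 32, 31]

print(knn_classify(inputs, labels, (2, 2), k=3))  # stay
print(knn_predict(inputs, values, (2, 2), k=3))   # 11.0
```

Note that no model is fitted in advance: all the work happens at prediction time, when the new record is compared with the training records.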

The nearest neighbors approach is what real estate agents tend to instinctively use when pricing a new property. They seek similar properties in terms of size, location and other features and then use these reference properties to price the new property.

Consider the mobile customer churn example for predicting how likely a new customer is to stay with the company for at least one year. The k-nearest-neighbors algorithm searches the customer database for a set of k customers similar to the to-be-predicted customer in terms of demographic, calling and billing profiles. The algorithm then considers the churn behavior of the k neighbors and uses the most popular class (churn/no churn) to predict the class of the new customer. If we are interested in a probability of churn, the algorithm can compute the percentage of neighbors who churned.

In the call-center call duration example, we want to predict the duration of an incoming call before it begins. The k-NN algorithm searches the historic database for k calls with similar features (information available on the caller, call time, etc.). The average call duration of these k similar calls is then the predicted duration for the new call.

To illustrate the k-NN algorithm graphically, consider the example of predicting whether an online auction will be competitive or not. A competitive auction is one that receives more than a single bid. Using a set of over 1,000 eBay auctions, we examine two input measurements in each auction: the seller rating (where higher ratings indicate more experience) and the opening price set by the seller.

The relationship between the auction competitiveness outcome and these two inputs is shown in Figure 3.2. Suppose that we want to predict the outcome for a new auction, given the seller rating and opening price. This new record is denoted by a question mark in the chart. The k-NN algorithm searches for the k nearest auctions. In this case k was chosen to be 7. Among the seven neighbors, five were competitive auctions; the predicted probability of this auction to be competitive is therefore 5/7. If we use a majority rule to generate a classification, then the five competitive auctions are the majority of the seven neighboring auctions, and k-NN classifies the new auction as being competitive.

Figure 3.2: Competitive auctions (black circles) and non-competitive auctions (gray squares) as a function of seller rating and opening price in eBay auctions. k-nearest neighbors classifies a new auction's competitiveness based on k auctions with similar seller ratings and opening prices.
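The arithmetic of the seven-neighbor auction example can be checked with a short sketch. Only the counts (five competitive, two non-competitive) come from the text; the variable names are ours.

```python
from collections import Counter

# Outcomes of the k = 7 nearest auctions: five competitive, two not.
neighbor_outcomes = ["competitive"] * 5 + ["non-competitive"] * 2

# Predicted probability = share of neighbors that were competitive.
prob_competitive = neighbor_outcomes.count("competitive") / len(neighbor_outcomes)

# Majority rule = the most common class among the neighbors.
predicted_class = Counter(neighbor_outcomes).most_common(1)[0][0]

print(round(prob_competitive, 3))  # 0.714
print(predicted_class)             # competitive
```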

A k-nearest neighbors algorithm requires determining two factors: the number of neighbors to use (k) and the definition of similarity between records. The number of neighbors should depend on the nature of the relationship between the input and outcome measurements in terms of its global versus local nature. In a global pattern, the same relationship holds across all input values, whereas in local patterns different relationships exist for different ranges of the input values.

In the mobile churn example, if churn decreases with age regardless of other demographics or billing features, then we can say that there is a global relationship between churn and age. However, if churn decreases with age only for heavy callers but increases for low-volume callers, then the relationship between churn and age is local. A small number of neighbors is better for capturing local relationships (only a small set of very close neighbors would be similar to the record of interest), whereas in global relationships a large number of neighbors leads to more precise predictions.

The choice of k is typically done automatically. The algorithm is run multiple times, each time varying the value of k (starting with k = 2) and evaluating the predictive accuracy on a validation set. The number of neighbors that produces the most accurate predictions on the validation set is chosen.
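The search over k described above might look like the sketch below. The helper names, the toy records, and the tie-breaking rule (keep the first, smallest k that reaches the best accuracy) are assumptions made for illustration; Euclidean distance is again assumed as the similarity measure.

```python
import math
from collections import Counter

def knn_classify(train, new_input, k):
    """train is a list of (inputs, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_input))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def choose_k(train, validation, candidate_ks):
    """Return (best_k, accuracy) for the k with the highest validation accuracy."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = sum(knn_classify(train, x, k) == y for x, y in validation)
        acc = correct / len(validation)
        if acc > best_acc:  # strict '>' keeps the smallest k on ties
            best_k, best_acc = k, acc
    return best_k, best_acc

# Hypothetical training and validation records.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B")]
validation = [((1.5, 1.5), "A"), ((8.5, 8.5), "B"), ((3, 3), "A"), ((7, 7), "B")]

best_k, best_acc = choose_k(train, validation, range(2, 8))
print(best_k, best_acc)
```

The validation records play no part in finding neighbors; they are only scored, which is what lets their accuracy stand in for performance on new cases.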

Similarity between records can be determined in many ways. Records are compared in terms of their input measurements. The similarity metric most commonly used in k-NN algorithms is Euclidean distance. To measure the distance between two records, we look at each input measurement separately and measure the squared difference between the two records. We then sum the squared differences across the various input measurements and take the square root of that sum. This is the Euclidean distance between two records. (Skipping the square root gives the squared Euclidean distance, which ranks neighbors in the same order.)
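Written as a formula, using notation of our own rather than the chapter's: for two records a and b, each described by p input measurements a_1, ..., a_p and b_1, ..., b_p, the standard Euclidean distance is the square root of the summed squared differences. Dropping the square root does not change which records are nearest.

```latex
d(a, b) = \sqrt{\sum_{j=1}^{p} \left( a_j - b_j \right)^2}
```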

For example, the Euclidean distance between two auctions is computed by summing the squared difference between the pair of seller ratings and the squared difference between the pair of opening prices, and then taking the square root. You may have noticed that computing a Euclidean distance in this way will produce a similarity measure that gives much more weight to input measurements with large scales (such as seller ratings, compared to opening prices). For this reason, it is essential to first normalize the input measurements before computing Euclidean distances. Normalizing can be done in different ways. Two common normalizing approaches are converting all scales to a [0,1] scale or subtracting the mean and dividing by the standard deviation. While similarity between records can be measured in different ways, Euclidean distance is appealing because of its computational efficiency.
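The effect of scale, and the two normalizing approaches mentioned above, can be seen in a small sketch. The auction values are made up for illustration, the function names are ours, and the use of the sample standard deviation is an assumption (the text does not say which version to use).

```python
import math
import statistics

def min_max_scale(values):
    """Convert values to a [0, 1] scale (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical auctions: seller ratings span thousands, opening prices span dollars.
seller_ratings = [50, 1200, 3400, 800]
opening_prices = [0.99, 9.99, 4.50, 19.99]

raw = list(zip(seller_ratings, opening_prices))
scaled = list(zip(min_max_scale(seller_ratings), min_max_scale(opening_prices)))
standardized = list(zip(z_score(seller_ratings), z_score(opening_prices)))

# Before normalizing, the distance is driven almost entirely by seller rating.
print(math.dist(raw[0], raw[1]))
# After normalizing, both inputs contribute on comparable scales.
print(math.dist(scaled[0], scaled[1]))
print(math.dist(standardized[0], standardized[1]))
```

Whichever approach is used, the same scaling (the minimum and maximum, or the mean and standard deviation, taken from the training data) should be applied to any new record before its distance to the training records is computed.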

In k-NN, computational efficiency is especially important because the algorithm computes the similarity between the to-be-predicted record and each and every record in the training data. Moreover, if we want to predict many new records (such as for a large set of potential customers), the computational task can be heavy.

The Verdict: Among supervised learning methods, k-NN is simple to explain and easy to automate. It can be used for both prediction and classification and is highly data-driven, i.e., there are no assumptions about the nature of the relationship between the outcome and inputs. While k-NN is simple to explain, it produces "black-box" predictions because it is not clear which inputs contribute to the prediction and to what degree. When transparency is needed, k-NN is not an appealing candidate.

One key requirement of k-NN algorithms is sufficient training data. k-NN must be able to find a sufficient number of close neighbors to produce accurate predictions. Unfortunately, the number of required records increases exponentially in the number of input measurements, a problem called "the curse of dimensionality". Another challenge that k-NN faces is computational: the time to find the nearest neighbors in a large training dataset can be prohibitive. While there are various tricks to try to address the curse of dimensionality and the computational burden, these two issues must be considered as inherent challenges within k-NN.

Classification and Regression Trees (CART)

Classification and regression trees are supervised learning algorithms that can be used for both classification (classification trees) and prediction (regression trees). Like k-NN, the idea is to define neighborhoods of similar records, and to use those neighborhoods to produce predictions or classifications for new records. However, the way that trees determine neighborhoods is very different from k-NN. In particular, trees create rules that split data into different zones based on input measurements, so that each zone is dominated by records with a similar outcome. In the eBay auctions example, we might have a rule that says IF the opening pric...