7 Clustering


  • 8/9/2019 7 Clustering


    7.0 CLUSTERING

    7.1 Introduction to Clustering

Clustering is one of the most important data mining methodologies used in marketing and customer-relationship management. It uses customer data to track customer behavior and create strategic business initiatives. Organizations can use this data to divide customers into segments based on variables such as demographics, customer behavior, customer profitability, measure of risk, and lifetime value of a customer or retention probability. Creating customer segments based on such variables highlights marketing opportunities.

Clustering is used to group similar data together to form a set of cohesive clusters. Let us take a very simple example: the clustering of balls. There are a total of 10 balls of three different colors. The objective is to cluster these balls into different groups.

The balls of the same color are clustered into a group as shown below:

Each cluster's profile highlights the prevalent characteristics of its members. The results from clustering can be used to summarize the contents of the database by considering the characteristics of each cluster rather than the characteristics of each record. You can use clustering for either forecasting or description. You may have a set of valuable customer data with a large number of records that you need to utilize to develop marketing campaigns or segments. It will not be very effective to launch a single marketing campaign for such a large group of customers. Instead, you can develop clusters of your customers based on all or some of the information you know about them (demographics, types of products purchased, personal preferences, etc.). With clustering, it is possible to determine which combinations of attributes frequently occur together and use this information to build clusters, that is to say, customer segments.

For example, the customer data for a fruit juice outlet contains attributes such as gender, age, income, region, occupation, and product bought most often. A customer segment, Cluster 1, could consist of male customers aged between 30 and 40, with high incomes, whose most frequent purchase is orange juice. Another customer segment, Cluster 2, could represent female customers aged between 20 and 40, without occupation, whose most frequent purchase is apple juice. A third customer segment, Cluster 3, could represent senior citizens, without occupation, with average incomes, whose most frequent purchase is pineapple juice. If you have a campaign proposed for an orange-flavored product, you can target it at Cluster 1.


With clustering, you can identify both customers that exhibit similar characteristics within a cluster and characteristics that differ across clusters. Based on this information, you can develop personalized marketing campaigns for each cluster.

    7.2 Sample Uses and Applications of Clustering

    Marketing

In marketing, you can conduct demographic clustering and segmentation within the behavioral segments to define tactical marketing campaigns and to select the appropriate marketing channel and advertising for each campaign. It is then possible to target those customers most likely to exhibit the desired behavior by creating predictive models.

We can distinguish between three types of segmentation:

    Demographic Segmentation:

This involves understanding the demographic profile of your customers (such as age, income, and geographic location).

    Behavioral Segmentation:

This segmentation is based on understanding the behavior of your customers. The data would be drawn, for example, from customer surveys or from purchasing history.

    Targeted Segmentation:

Understanding the profile of your customers along a specific dimension. For example, you could segment customers by their usage of a specific product or service.

Clustering analysis is also widely used in information, policy, and decision sciences. Its applications to documents include votes on political issues, surveys of markets, surveys of products, surveys of sales programs, and R&D. In the life sciences (biology, botany, zoology, entomology, cytology, microbiology), the objects of analysis are life forms such as plants, animals, and insects. The clustering analysis may range from developing complete taxonomies to classifying species into subspecies, which can in turn be subdivided further.

    7.3 Typical Inputs to Clustering

Clustering can be applied to any data, provided it is in the format shown in the examples of typical inputs below. Data can be of content type Discrete, Continuous, or Ordered. Following are some examples of typical inputs that can be used to obtain meaningful results from clustering.


    Figure xx: Typical input to clustering

In the above example, the Customer ID is an entity. Each of the columns Age, Sex, Status, Telephone Number, and Income is an attribute.

The actual data may have many other attributes, but we should consider only the ones that are relevant for clustering purposes. In the example above, the attributes Age, Status, and Income could be considered for clustering analysis.

Typically, attributes like Telephone Number, Social Security Number, or Customer ID should not be considered for clustering purposes, since they contain too many unique values and there is very little chance of common values occurring across the customer base. Likewise, attributes with too many missing entries that cannot be substituted by any value should not be considered, since they may not produce meaningful clusters.

It is recommended that clustering be done on a representative sample of the customer base, typically 20% to 30% of the data, rather than on the complete set of records. The sample size must also take into account the number of attributes and the number of possible values for each attribute.
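The sampling step above can be sketched in Python. This is a hypothetical helper, not part of the BW tooling; the 25% fraction, the seed, and the record layout are assumptions for illustration.

```python
import random

def sample_customer_base(records, fraction=0.25, seed=1):
    """Draw a representative sample of the customer base; the text
    recommends roughly 20% to 30% of the records (fraction and seed
    here are illustrative assumptions)."""
    rng = random.Random(seed)
    size = max(1, round(len(records) * fraction))
    # sample() draws without replacement, so no record appears twice.
    return rng.sample(records, size)

# Usage: 1000 dummy customer records, sampled down to 25%.
customers = [{"id": i, "age": 20 + i % 40} for i in range(1000)]
sample = sample_customer_base(customers, fraction=0.25)
```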

    7.4 Typical Outputs from Clustering

Following is an example of what typical output from clustering analysis looks like.


The clustering output contains:

• Number of clusters formed
• Share of each cluster in relation to the initial volume or data set
• Entities assigned to each cluster
• Distribution of attribute values within each cluster

For implementing clustering in BW, refer to the tutorial section.

    7.5 K-Means Clustering Algorithm

    Clustering works to group records together according to an algorithm or mathematical

    formula that attempts to find centroids, or centers, around which similar records gravitate.


This method initially takes a number of components of the population equal to the final required number of clusters. In this step, these initial points are chosen such that they are mutually farthest apart.

Next, it examines each component in the population and assigns it to one of the clusters depending on the minimum distance. The distance measure used in BW is the Euclidean distance, which is simply the geometric distance in multidimensional space. It is computed as:

Distance(x, y) = √( Σᵢ (xᵢ - yᵢ)² )
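As a minimal sketch (not the BW implementation), the Euclidean distance between two records can be computed like this:

```python
import math

def euclidean_distance(x, y):
    """Geometric distance in multidimensional space: the square
    root of the summed squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d = euclidean_distance((0, 0), (3, 4))  # the classic 3-4-5 triangle
```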

After every input record is assigned to some cluster, each centroid's position is recalculated based on the records assigned to it. With the new centroid means, the assignments are checked again, and this continues until the stopping conditions are reached (i.e., the maximum number of iterations is reached or cluster assignments do not change much between iterations).
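The assign-and-recompute loop described above can be sketched as follows. This is a simplified illustration using squared Euclidean distance and starting centroids supplied by the caller; BW's actual initialization, weighting, and stopping conditions are described later in this section.

```python
def k_means(points, centroids, max_iterations=100):
    """Minimal K-Means sketch: assign every point to the nearest
    centroid (by squared Euclidean distance), then recompute each
    centroid as the mean of its assigned points, stopping when the
    assignments no longer change or max_iterations is reached."""
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[j])))
            for point in points
        ]
        if new_assignments == assignments:
            break  # stopping condition: assignments unchanged
        assignments = new_assignments
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assignments, centroids

# Usage: two well-separated groups and two starting centroids.
assignments, centroids = k_means(
    [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],
    [[0.0, 0.0], [10.0, 10.0]])
```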

    Weighting Concept in Clustering

In the K-Means clustering algorithm, each field forms one or more dimensions. Each data record in the input dataset is represented as a point in this N-dimensional space. Continuous and ordered fields each have one dimension. For discrete fields, every value has one dimension.

Let us assume a model contains two continuous attributes A and B, one ordered attribute C, and two discrete attributes D (with 5 values) and E (with 2 values). The total number of dimensions is thus 2 + 1 + 5 + 2, i.e., 10 dimensions.

All continuous model field values are converted to weights between 0 and 100. The formula used for this is:

Weighted Value = Model Field Weight * (Actual Value - Minimum Value) * 100 / (Maximum Value - Minimum Value)


For example, for the continuous model field A, the maximum value is 2000 and the minimum value is 10. If the value of this field is 100 and its model field weight is 2, the weight assigned is:

Weighted Value = 2 * (100 - 10) * 100 / (2000 - 10) = 9.0452
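Written directly from the formula, the normalization above can be sketched as a one-line helper (the function name is illustrative):

```python
def weighted_value(actual, minimum, maximum, field_weight):
    """Normalize a continuous (or ordered) model field value to the
    0-100 range, then scale it by the model field weight."""
    return field_weight * (actual - minimum) * 100 / (maximum - minimum)

# Field A from the text: min 10, max 2000, value 100, field weight 2.
w = weighted_value(100, 10, 2000, field_weight=2)  # ≈ 9.0452
```

The same helper covers the ordered-field example that follows, since the text states the same formula is used.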

All weights assigned to ordered model field values are also normalized to between 0 and 100, using the same formula as for continuous model fields. For example, suppose the weights of the ordered model field C's values are given as follows:

EXCELLENT: 90
GOOD: 75
AVERAGE: 50

The weighted value of EXCELLENT = 1 * (90 - 50) * 100 / (90 - 50) = 100.

For a discrete model field, there is one dimension for every value of the field. For example, field E has two possible values, male and female. The default highest value for each of these dimensions is 100. The formula used for the weighted value calculation is (100 * Model Field Weight * Value Weight). The maximum weight a value can take is 9999. If by any calculation the weight exceeds 9999, the weight is reset to 9999 and a warning message to this effect is logged in the training log.

To understand the process of the K-Means clustering algorithm, let us apply it to the following example and obtain clusters from it.

In the specified example, the fields Age and Income are continuous attributes, the field Sex is a discrete attribute, and Status is an ordered attribute. For the ordered attribute Status, we assign the following weights for clustering purposes.


Let us assume that the set of records needs to be divided into two clusters. Other details regarding the weights assigned to each model field, and the values for the continuous and ordered attributes, are specified below.

Using the above information, we get the weighted values for each of the attributes, and the input data is transformed into a weighted values table.

The original numerical attribute and ordered attribute values are normalized using the following formulas. If a field value is missing, its value is set to 0.

Continuous attribute weight = 100 * Model field weight * (actual value - minimum value) / (maximum value - minimum value)


Ordered attribute weight = 100 * Model field weight * (actual specified weight - minimum specified weight) / (maximum specified weight - minimum specified weight)

Discrete attribute weight = 100 * Model field weight * W for the attribute

The discrete field values are substituted by enumerated value IDs. In the example above, for the attribute Sex, value IDs 1 and 2 replace the values M and F respectively. If a field value is missing, 0 replaces its value ID.

In the above example, we have 5 dimensions: one each for Age, Income, and Status, and two dimensions for Sex (one for Male and the other for Female). The input data can be represented in this 5-dimensional space as follows:

(Xage, Yincome, Zstatus, Amale, Afemale)

    For example, the first record corresponding to the above is represented as:

    (33.33, 83.33, 50, 100, 0)

Since the attribute Sex has Male as the value, we flatten it into two dimensions as (100, 0). If Sex had the value Female, the point would be represented as (0, 100).
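The flattening of the discrete field into indicator dimensions can be sketched as follows; the field names are those of the running example, and the helper itself is illustrative.

```python
def flatten_record(age_w, income_w, status_w, sex):
    """Represent one record as a point in the 5-dimensional space
    (Xage, Yincome, Zstatus, Amale, Afemale): the discrete field Sex
    becomes two indicator dimensions valued 0 or 100."""
    return (age_w, income_w, status_w,
            100 if sex == "M" else 0,
            100 if sex == "F" else 0)

# Record 1 of the example: a male customer.
point = flatten_record(33.33, 83.33, 50, "M")
```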

The cluster centroids are also represented by a similar coordinate structure.

    Following is a brief explanation of the algorithm used for Clustering.

    I. Get Initial means

In this step, we need to randomly find a number of centroid points equal to the number of clusters (in the example, the number of clusters is 2). Random sampling is done to get these two points from the given data. The complete input data is divided into K parts, and some random points are picked from each part. The mean of the Kth part is assigned as the cluster centroid of the Kth cluster.

For example, if 2 random cluster points are required, the data is logically divided into 2 parts, that is, record IDs 1 to 4 form the first part and the remaining records form the second part.

From the first part, 2 records are picked randomly (approximately 30% to 40%). Suppose records 1 and 3 are selected; the mean point is calculated as follows:

(X1age, Y1income, Z1status, A1male, A1female), where:

X1age = (33.33 + 81.48)/2 = 57.41
Y1income = (83.33 + 175)/2 = 129.17
Z1status = (50 + 100)/2 = 75


A1male = (100 + 100)/2 = 100
A1female = (0 + 0)/2 = 0

Similarly, from the second part, we get the mean for the second cluster. Suppose records 7 and 8 are picked randomly; we get the cluster mean as follows:

(X2age, Y2income, Z2status, A2male, A2female), where:

X2age = (100 + 25.93)/2 = 62.97
Y2income = (200 + 73.33)/2 = 136.67
Z2status = (0 + 100)/2 = 50
A2male = (0 + 100)/2 = 50
A2female = (100 + 0)/2 = 50
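The coordinate-wise means above can be checked with a short sketch. Records 1 and 3 are written as 5-dimensional weighted points, with record 3's values inferred from the means the text computes.

```python
def centroid_mean(records):
    """Coordinate-wise mean of the sampled records, giving one
    initial cluster centroid."""
    return [sum(dim) / len(records) for dim in zip(*records)]

# Records 1 and 3 of the example, as (age, income, status, male, female).
record_1 = (33.33, 83.33, 50, 100, 0)
record_3 = (81.48, 175, 100, 100, 0)
cluster_1 = centroid_mean([record_1, record_3])  # ≈ (57.41, 129.17, 75, 100, 0)
```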

    II. Find the cluster each record is assigned to

In this step, the system loops through the input data to find the cluster closest to each record and assigns that cluster to the record. To find the closest cluster for a record, we calculate the distance of the record from each of the cluster centroids and determine the smallest distance:

Distance = Σ (cluster centroid weight for each attribute - record weight for that attribute)²

For example, let us take input record 1, which is represented as [3.33, 83.33, 50, 100, 0].

The distance with respect to cluster 1 (represented as [57.41, 129.17, 75, 0, 100]) is calculated as follows:

(X1age - Xage)² + (Y1income - Yincome)² + (Z1status - Zstatus)² + (A1male - Amale)² + (A1female - Afemale)²

If we apply the formula to record 1, we get the distance to cluster 1 (D1):

D1 = (57.41 - 3.33)² + (129.17 - 83.33)² + (75 - 50)² + (0 - 0)² + (100 - 0)²

which is 15650.84.

Similarly, the distance with respect to cluster 2 (represented as [62.97, 136.67, 50, 50, 50]), D2, is 11402.08.

This implies that input record 1 is assigned to cluster 2, since D2 < D1. Similarly, the assignments of the other records are determined. At the end of this step, we know to which cluster each record has been assigned.
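The assignment rule can be sketched as follows. The second snippet reproduces the D2 value from the example; note that no square root is needed when distances are only being compared.

```python
def squared_distance(record, centroid):
    """Sum of squared per-dimension differences between a record
    and a cluster centroid."""
    return sum((r - c) ** 2 for r, c in zip(record, centroid))

def assign_cluster(record, centroids):
    """Index of the centroid closest to the record."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(record, centroids[j]))

# D2 from the example: record 1 against the cluster-2 centroid.
d2 = squared_distance([3.33, 83.33, 50, 100, 0],
                      [62.97, 136.67, 50, 50, 50])  # ≈ 11402.08
```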

    III. Recalculate Cluster Means

    Based on which records are assigned to which cluster, the centroid positions arerecalculated as shown in Step I.


    IV. Repeat from Step II until stopping conditions are reached.

    Step II and Step III are repeated until the stopping conditions are satisfied.

    V. Determine the distribution details of attributes in each cluster

You can do this by analyzing the records assigned to each cluster. For discrete attributes, the distribution information is the frequency distribution of the various values occurring within the cluster. For continuous attributes, the frequency distribution across different binning intervals is determined.
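Step V can be sketched as a frequency count per cluster; the dict-based record layout and helper name are assumptions for illustration.

```python
from collections import Counter

def discrete_profile(records, assignments, attribute):
    """Frequency distribution of a discrete attribute's values
    within each cluster."""
    profile = {}
    for record, cluster in zip(records, assignments):
        # One Counter per cluster, keyed by the attribute's values.
        profile.setdefault(cluster, Counter())[record[attribute]] += 1
    return profile

# Usage: three toy records assigned to two clusters.
records = [{"sex": "M"}, {"sex": "F"}, {"sex": "M"}]
profile = discrete_profile(records, [0, 0, 1], "sex")
```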

The final cluster centroids depend strongly on the initial cluster means: a different set of initial means yields a completely different set of cluster centroids. This is an inherent drawback of the K-Means clustering algorithm. To mitigate it, SAP forms the initial cluster points from the input data using intelligent sampling techniques.

    7.6 Interpreting Clustering Output

The last step in the clustering process is to interpret the meaning of each cluster and derive strategies from this knowledge. The purpose of profiling clusters is to assess the potential business value of each cluster quantitatively by profiling the aggregate values of the variables by cluster.

From the various value-distribution graphs provided in the output, one can easily find which attribute value dominates in which cluster by analyzing the share of this value vis-à-vis other values. As shown below in tabular format, we can summarize, for every cluster, which attribute value occurs most frequently. For continuous attributes, we can find which value range has the highest share in a given cluster.

Another example of how the clustering output can be used is shown below. By analyzing customer information like revenue, number of products purchased, and customer tenure across these clusters, we can arrive at interesting propositions. The table shows that cluster 5 is the most profitable cluster, representing about 35 percent of the revenue yet only 9 percent of the customers. The average revenue per customer is the highest for this cluster, and the average number of products bought by this group and their average lifetime are also high.


The profile of the clusters shows that there is a business opportunity in increasing the number of products purchased by customers. From this simple result, it is possible to derive some high-level business strategies. It is obvious that the best customers (considering only the data contained in the table) are in clusters 2, 5, and 7. These customers have higher revenue per person than the customers of other clusters, as indicated by the third column. Some possible strategies include:

• Retention strategy for the best customers (those in clusters 2, 5, and 7).

• Cross-sell strategy for clusters 2, 6, and 9, by contrasting them with clusters 5 and 7. Clusters 2, 6, and 9 have average numbers of products close to those of clusters 5 and 7, which have the highest number of products purchased. Because the clusters are close in the number of products purchased, it shouldn't be a big stretch to convert customers from clusters 2, 6, and 9 to clusters 5 and 7. By comparing the products bought by the best customers to those purchased by customers in clusters 2, 6, and 9, we can find products that are candidates for cross-selling.

• Similarly, you can cross-sell between clusters 3 and 4 and clusters 2, 6, and 9, because they are close in value.

• A strategy for cluster 1 would be to wait and see. It appears to be a group of new customers for whom we have not yet collected sufficient data to determine what behaviors they may exhibit.

• A strategy for cluster 8 may be to refrain from spending any significant marketing dollars on these customers. Cluster 8 appears to be the worst cluster, with a very low revenue percentage. These customers purchase very few products even though they have been with the company for quite some time.

    Prediction

Since we typically use a representative sample for clustering, we need to determine the cluster groups for the entire customer database. You can use prediction to do this. Prediction can be applied to a single customer or to a set of customers. This process involves calculating the distance of each customer from the already determined cluster


centroids and assigning the cluster where the distance is smallest. Prediction for a single customer record is useful in a call-center-based or web-based application to provide personalized service on the basis of the customer's predicted cluster group.