PPT on unsupervised learning methods in clustering and data mining
Unsupervised Learning
Clustering
- Unsupervised classification, that is, classification without the class attribute
- Want to discover the classes
Association Rule Discovery
- Discover correlations
Data Mining and Knowledge Discovery
The Clustering Process
- Pattern representation
- Definition of a pattern proximity measure
- Clustering
- Data abstraction
- Cluster validation
Pattern Representation
- Number of classes
- Number of available patterns
- Circles, ellipses, squares, etc.
- Feature selection: can we use wrappers and filters?
- Feature extraction: produce new features, e.g., principal component analysis (PCA)
Pattern Proximity
- Want clusters of instances that are similar to each other but dissimilar to instances in other clusters
- Need a similarity measure
- Continuous case:
  - Euclidean measure (compact, isolated clusters)
  - The squared Mahalanobis distance alleviates problems with correlated attributes
  - Many more measures exist
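The two continuous-case measures can be sketched in a few lines (a minimal illustration using NumPy; the function names are mine, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    """Straight-line distance; favours compact, isolated clusters."""
    return float(np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)))

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance: rescales differences by the data's
    covariance matrix, which alleviates problems with correlated attributes."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(diff @ np.linalg.inv(cov) @ diff)

d_e = euclidean([0, 0], [3, 4])
# With an identity covariance, Mahalanobis reduces to squared Euclidean
d_m = mahalanobis_sq([0, 0], [3, 4], np.eye(2))
```

With a non-identity covariance, the Mahalanobis distance shrinks differences along directions of high variance, which is exactly what makes it more robust to correlated attributes than the plain Euclidean measure.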
Pattern Proximity
- Nominal attributes
Clustering Techniques
- Hierarchical: Single Link, Complete Link
- Partitional: Square Error (K-means), Mixture Maximization (Expectation Maximization), CobWeb
Technique Characteristics
- Agglomerative vs divisive
  - Agglomerative: each instance starts as its own cluster and the algorithm merges clusters
  - Divisive: begins with all instances in one cluster and divides it up
- Hard vs fuzzy
  - Hard clustering assigns each instance to exactly one cluster, whereas fuzzy clustering assigns each instance a degree of membership in every cluster
More Characteristics
- Monothetic vs polythetic
  - Polythetic: all attributes are used simultaneously, e.g., to calculate distance (most algorithms)
  - Monothetic: attributes are considered one at a time
- Incremental vs non-incremental
  - With large data sets it may be necessary to consider only part of the data at a time (data mining)
  - Incremental clustering works instance by instance
Hierarchical Clustering
[Figure: dendrogram for instances A through G, with similarity on the vertical axis]
Hierarchical Algorithms
- Single-link
  - Distance between two clusters is the minimum of the distances between all pairs of instances
  - More versatile
  - Produces (sometimes too) elongated clusters
- Complete-link
  - Distance between two clusters is the maximum of the distances between pairs of instances in the clusters
  - Produces tightly bound, compact clusters
  - Often more useful in practice
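The two linkage definitions differ only in which pairwise distance they keep; a minimal sketch for clusters of 1-D points (the helper is hypothetical, not from the slides):

```python
def cluster_distance(c1, c2, link="single"):
    """Inter-cluster distance between two clusters of 1-D points.
    single   -> minimum over all pairs (versatile, but can chain clusters
                into elongated shapes)
    complete -> maximum over all pairs (favours tight, compact clusters)"""
    pairs = [abs(a - b) for a in c1 for b in c2]
    return min(pairs) if link == "single" else max(pairs)

a, b = [1.0, 2.0], [4.0, 8.0]
d_single = cluster_distance(a, b, "single")      # closest pair: |2 - 4|
d_complete = cluster_distance(a, b, "complete")  # farthest pair: |1 - 8|
```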
Example: Clusters Found
[Figure: the clusters found by single-link vs. complete-link on the same data]
Partitional Clustering
- Outputs a single partition of the data into clusters
- Good for large data sets
- Determining the number of clusters is a major challenge
K-Means
- Predetermined number of clusters
- Start with seed clusters of one element each
Assign Instances to Clusters
Find New Centroids
New Clusters
Discussion: k-Means
- Applicable to fairly large data sets
- Sensitive to the initial centers
  - Use other heuristics to find good initial centers
- Converges to a local optimum
- Specifying the number of centers is very subjective
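A minimal k-means sketch in pure Python (2-D points as tuples; the function name and seeding strategy are my choices). It makes the points above concrete: the outcome depends on the initial seed clusters, and the loop stops as soon as the centroids stop moving, i.e., at a local optimum:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Alternate (1) assigning each instance to its nearest centroid and
    (2) moving each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seed clusters of one element
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        new = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged (to a local optimum)
            break
        centroids = new
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
cents, clus = kmeans(pts, 2)
```

On these well-separated points any pair of seeds reaches the same answer; on harder data, different seeds reach different local optima, which is why heuristics for choosing good initial centers matter.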
Clustering in Weka
- Clustering algorithms in Weka:
  - K-Means
  - Expectation Maximization (EM)
  - Cobweb (hierarchical, incremental, and agglomerative)
CobWeb
- Algorithm (main) characteristics: hierarchical and incremental
- Uses category utility:
  CU(C_1, ..., C_k) = (1/k) * sum_l P(C_l) * sum_i sum_j [ P(a_i = v_ij | C_l)^2 - P(a_i = v_ij)^2 ]
  where l ranges over the k clusters and j ranges over all possible values v_ij of attribute a_i
- Why divide by k? See below.
Category Utility
- If each instance is in its own cluster, then P(a_i = v_ij | C_l) = 1 for every attribute value the instance has, and the category utility function reaches its maximum possible value for the data
- Without the division by k it would therefore always be best for each instance to have its own cluster: overfitting!
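A sketch of the category utility computation for nominal data (instances as tuples of attribute values; the function name is mine). The tiny example shows the contrast the formula rewards: splitting into pure clusters scores higher than lumping everything together, while the final division by k keeps the score from always favouring singleton clusters:

```python
from collections import Counter

def category_utility(clusters):
    """CU = (1/k) * sum_l P(C_l) * sum_{i,j} [P(a_i=v_ij|C_l)^2 - P(a_i=v_ij)^2]"""
    rows = [row for cl in clusters for row in cl]
    n, n_attr = len(rows), len(rows[0])
    # Baseline term: sum over attributes i of sum_j P(a_i = v_ij)^2
    base = sum(sum((c / n) ** 2 for c in Counter(r[i] for r in rows).values())
               for i in range(n_attr))
    cu = 0.0
    for cl in clusters:
        within = sum(sum((c / len(cl)) ** 2
                         for c in Counter(r[i] for r in cl).values())
                     for i in range(n_attr))
        cu += (len(cl) / n) * (within - base)
    return cu / len(clusters)  # the division by k

pure = category_utility([[("a",), ("a",)], [("b",), ("b",)]])
merged = category_utility([[("a",), ("a",), ("b",), ("b",)]])
```

Here `pure` (two homogeneous clusters) scores 0.25 while `merged` (one mixed cluster) scores 0, since within-cluster value probabilities equal the baseline probabilities.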
The Weather Problem

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No
Weather Data (without Play)
- Label the instances a, b, ..., n
- a: start by putting the first instance in its own cluster
- a, b: add another instance in its own cluster
Adding the Third Instance
- Evaluate the category utility of adding instance c to one of the two existing clusters versus adding it as its own cluster
- Choose the option with the highest utility
Adding Instance f
- First instance not to get its own cluster
- Look at the instances:
  E) Rainy Cool Normal FALSE
  F) Rainy Cool Normal TRUE
- Quite similar!
Add Instance g
- Look at the instances:
  E) Rainy Cool Normal FALSE
  F) Rainy Cool Normal TRUE
  G) Overcast Cool Normal TRUE
Add Instance h
- Look at the instances:
  A) Sunny Hot High FALSE
  D) Rainy Mild High FALSE
  H) Sunny Mild High FALSE
- Rearrange: the best matching node and the runner-up are merged into a single cluster before h is added
- (Splitting is also possible)
Final Hierarchy
- What next?
Dendrogram Clusters
- What do a, b, c, d, h, k, and l have in common?
Numerical Attributes
- Assume a normal distribution
- Problems with zero variance! The acuity parameter imposes a minimum variance
Hierarchy Size (Scalability)
- May create a very large hierarchy
- The cutoff parameter is used to suppress growth: if adding a node improves category utility by less than the cutoff, the node is cut off
Discussion
- Advantages
  - Incremental: scales to a large number of instances
  - Cutoff limits the size of the hierarchy
  - Handles mixed attributes
- Disadvantages
  - Incremental: sensitive to the order of the instances
  - Arbitrary choices of parameters: the division by k, the artificial minimum value for the variance of numeric attributes, the ad hoc cutoff value
Probabilistic Perspective
- Most likely set of clusters given the data
- Probability of each instance belonging to a cluster
- Assumption: instances are drawn from one of several distributions
- Goal: estimate the parameters of these distributions
- Usually: assume the distributions are normal
Mixture Resolution
- Mixture: set of k probability distributions
- The distributions represent the k clusters
- Probabilities that an instance takes certain attribute values given that it is in the cluster
- What is the probability that an instance belongs to a cluster (or a distribution)?
One Numeric Attribute
- Two-cluster mixture model: each instance is drawn from cluster A, with probability p_A, or from cluster B, with probability p_B = 1 - p_A, where each cluster is a normal distribution
- Given some data, how can you determine the parameters mu_A, sigma_A, mu_B, sigma_B, and p_A?
Problems
- If we knew which cluster each instance came from, we could estimate these values
- If we knew the parameters, we could calculate the probability that an instance belongs to each cluster
EM Algorithm
- Expectation Maximization (EM)
  - Start with initial values for the parameters
  - Calculate the cluster probabilities for each instance
  - Re-estimate the values for the parameters
  - Repeat
- General-purpose maximum likelihood estimation algorithm for missing data
- Can also be used to train Bayesian networks (later)
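The four steps above, specialised to a mixture of two 1-D Gaussians (a sketch; the initialisation at the data extremes and the variance floor are my choices, not from the slides):

```python
import math

def em_two_gaussians(xs, iters=50):
    mu = [min(xs), max(xs)]       # initial values for the parameters
    var = [1.0, 1.0]
    w = [0.5, 0.5]                # mixing weights

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: cluster probabilities (responsibilities) for each instance
        r = []
        for x in xs:
            p0 = w[0] * pdf(x, mu[0], var[0])
            p1 = w[1] * pdf(x, mu[1], var[1])
            r.append(p0 / (p0 + p1))
        # M-step: re-estimate the parameters from the weighted data
        for k, rk in ((0, r), (1, [1 - t for t in r])):
            s = sum(rk)
            mu[k] = sum(t * x for t, x in zip(rk, xs)) / s
            var[k] = max(sum(t * (x - mu[k]) ** 2
                             for t, x in zip(rk, xs)) / s, 1e-6)
            w[k] = s / len(xs)
    return mu, var, w

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mu, var, w = em_two_gaussians(data)
```

On this data the two estimated means converge near 1.0 and 5.0; the `1e-6` floor on the variance plays the same role as the minimum-variance parameters mentioned for CobWeb and Weka's EM.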
Beyond Normal Models
- More than one class: straightforward
- More than one numeric attribute
  - Easy if we assume the attributes are independent
  - If attributes are dependent, treat them jointly using the bivariate normal
- Nominal attributes: no more normal distribution!
EM using Weka
- Options:
  - numClusters: set the number of clusters; the default of -1 selects it automatically
  - maxIterations: maximum number of iterations
  - seed: random number seed
  - minStdDev: set the minimum allowable standard deviation
Other Clustering
- Artificial Neural Networks (ANN)
- Random search
  - Genetic Algorithms (GA): used, e.g., to find initial centroids for k-means
  - Simulated Annealing (SA)
  - Tabu Search (TS)
- Support Vector Machines (SVM)
- Will discuss GA and SVM later
Applications
- Image segmentation
- Object and character recognition
- Data mining:
  - Stand-alone, to gain insight into the data
  - Preprocessing before a classifier that operates on the detected clusters
DM Clustering Challenges
- Data mining deals with large databases
- Scalability with respect to the number of instances
  - Use a random sample (possible bias)
- Dealing with mixed data: many algorithms only make sense for numeric data
- High-dimensional problems
  - Can the algorithm handle many attributes?
  - How do we interpret a cluster in high dimensions?
Other (General) Challenges
- Shape of clusters
- Minimum domain knowledge (e.g., knowing the number of clusters)
- Noisy data
- Insensitivity to instance order
- Interpretability and usability
Clustering for DM
- The main issue is scalability to large databases
- Many algorithms have been developed for scalable clustering:
  - Partitional methods: CLARA, CLARANS
  - Hierarchical methods: AGNES, DIANA, BIRCH, CURE, Chameleon
Practical Partitional Clustering Algorithms
- Classic k-means (1967)
- Work from 1990 and later:
  - k-medoids: uses the medoid instead of the centroid
    - Less sensitive to outliers and noise
    - Computations are more costly
  - PAM (Partitioning Around Medoids) algorithm
Large-Scale Problems
- CLARA: Clustering LARge Applications
  - Select several random samples of instances
  - Apply PAM to each
  - Return the best clusters
- CLARANS: similar to CLARA
  - Draws samples randomly while searching
  - More effective than PAM and CLARA
Hierarchical Methods
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
  - Clustering feature: a triplet summarizing information about a subcluster
  - Clustering feature (CF) tree: a height-balanced tree that stores the clustering features
BIRCH Mechanism
- Phase I: scan the database to build an initial CF tree (a multilevel compression of the data)
- Phase II: apply a selected clustering algorithm to the leaf nodes of the CF tree
- Has been found to be very scalable
Conclusion
- The use of clustering in data mining practice seems to be somewhat limited, due to scalability problems
- A more commonly used form of unsupervised learning: Association Rule Discovery
Association Rule Discovery
- Aims to discover interesting correlations or other relationships in large databases
- Finds rules of the form: if A and B then C and D
- Which attributes will be included in the relation is unknown
Mining Association Rules
- Similar to classification rules
- Use the same procedure? Every attribute is the same, so we would apply it to every possible expression on the right-hand side
- Huge number of rules: infeasible!
- Only want rules with high coverage/support
Market Basket Analysis
- Basket data: items purchased on a per-transaction basis (not cumulative, etc.)
- How do you boost the sales of a given product?
- What other products does discontinuing a product impact?
- Which products should be shelved together?
- Terminology (market basket analysis):
  - Item: an attribute/value pair
  - Item set: a combination of items with minimum coverage
How Many k-Item Sets Have Minimum Coverage?
Item Sets

1-item sets (with coverage):
- Outlook=sunny (5)
- Outlook=overcast (4)
- Outlook=rainy (5)
- Temp=cool (4)
- Temp=mild (6)

2-item sets:
- Outlook=sunny Temp=mild (2)
- Outlook=sunny Temp=hot (2)
- Outlook=sunny Humidity=normal (2)
- Outlook=sunny Windy=true (2)
- Outlook=sunny Windy=false (3)

3-item sets:
- Outlook=sunny Temp=hot Humidity=high (2)
- Outlook=sunny Temp=hot Play=no (2)
- Outlook=sunny Humidity=normal Play=yes (2)
- Outlook=sunny Humidity=high Windy=false (2)
- Outlook=sunny Humidity=high Play=no (3)

4-item sets:
- Outlook=sunny Temp=hot Humidity=high Play=no (2)
- Outlook=sunny Humidity=high Windy=false Play=no (2)
- Outlook=overcast Temp=hot Windy=false Play=yes (2)
- Outlook=rainy Temp=mild Windy=false Play=yes (2)
- Outlook=rainy Humidity=normal Windy=false Play=yes (2)
From Sets to Rules
- 3-item set with coverage 4: Humidity=normal, Windy=false, Play=yes
- Association rules and their accuracy:
  - If humidity=normal and windy=false then play=yes (4/4)
  - If humidity=normal and play=yes then windy=false (4/6)
  - If windy=false and play=yes then humidity=normal (4/6)
  - If humidity=normal then windy=false and play=yes (4/7)
  - If windy=false then humidity=normal and play=yes (4/8)
  - If play=yes then humidity=normal and windy=false (4/9)
  - If (empty antecedent) then humidity=normal and windy=false and play=yes (4/14)
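These fractions can be reproduced directly from the weather table. A sketch (the helper names are mine; the windy values are those of the standard Weka weather data, with which the counts above are consistent):

```python
# Instances as (outlook, temp, humidity, windy, play)
weather = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
COL = {"outlook": 0, "temp": 1, "humidity": 2, "windy": 3, "play": 4}

def coverage(items):
    """Number of instances matching every (attribute, value) pair."""
    return sum(all(row[COL[a]] == v for a, v in items) for row in weather)

def accuracy(antecedent, consequent):
    """coverage(antecedent + consequent) / coverage(antecedent)."""
    return coverage(antecedent + consequent) / coverage(antecedent)
```

For example, `accuracy([("humidity", "normal"), ("windy", "false")], [("play", "yes")])` gives the 4/4 of the first rule.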
From Sets to Rules (continued)
- 4-item set with coverage 2: Temperature=cool, Humidity=normal, Windy=false, Play=yes
- Association rules and their accuracy:
  - If temperature=cool and windy=false then humidity=normal and play=yes (2/2)
  - If temperature=cool, humidity=normal and windy=false then play=yes (2/2)
  - If temperature=cool, windy=false and play=yes then humidity=normal (2/2)
Overall
- Minimum coverage (2): 12 1-item sets, 47 2-item sets, 39 3-item sets, 6 4-item sets
- Minimum accuracy (100%): 58 association rules
- Best rules (coverage = 4, accuracy = 100%):
  - If humidity=normal and windy=false then play=yes
  - If temperature=cool then humidity=normal
  - If outlook=overcast then play=yes
Association Rule Mining
- Step 1: find all item sets that meet minimum coverage
- Step 2: find all rules that meet minimum accuracy
- Step 3: prune
Generating Item Sets
- How do we generate minimum-coverage item sets in a scalable manner?
- The total number of item sets is huge: it grows exponentially in the number of attributes
- Need an efficient algorithm:
  - Start by generating minimum-coverage 1-item sets
  - Use those to generate 2-item sets, etc.
- Why do we only need to consider minimum-coverage 1-item sets?
Justification
- Item set 1: {Humidity=high}; coverage(1) = number of times humidity is high
- Item set 2: {Windy=false}; coverage(2) = number of times windy is false
- Item set 3: {Humidity=high, Windy=false}; coverage(3) = number of times humidity is high and windy is false
- Coverage(3) <= coverage(1) and coverage(3) <= coverage(2), so if item sets 1 and 2 do not both meet minimum coverage, item set 3 cannot either
Generating Item Sets
- Start with all 3-item sets that meet minimum coverage: (A B C), (A B D), (A C D), (A C E)
- Merge pairs to generate 4-item sets, considering only sets that start with the same two attributes
- Candidate 4-item sets with minimum coverage (must still be checked): (A B C D), (A C D E)
- There are only two 4-item sets that could possibly work
Algorithm for Generating Item Sets
- Build up from 1-item sets, so that we only consider item sets found by merging two minimum-coverage sets
- Only consider sets that have all but one item in common
- Computational efficiency is further improved by using hash tables
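The merge-and-check step can be sketched as follows, with item sets as sorted tuples and the "hash table" simply a Python set. The input extends the four 3-item sets of the earlier example with a hypothetical (B C D) so that one candidate survives the subset check:

```python
def merge_candidates(itemsets):
    """Apriori-style join: merge two k-item sets that agree on their first
    k-1 items into a (k+1)-item candidate, then prune any candidate with
    a k-item subset that is not itself frequent."""
    k = len(itemsets[0])
    frequent = set(itemsets)
    candidates = set()
    for a in itemsets:
        for b in itemsets:
            if a < b and a[:k - 1] == b[:k - 1]:
                candidates.add(a + b[-1:])
    return sorted(c for c in candidates
                  if all(c[:i] + c[i + 1:] in frequent for i in range(k + 1)))

threes = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
          ("A", "C", "E"), ("B", "C", "D")]
fours = merge_candidates(threes)
```

(A B C D) and (A C D E) are the only merges possible; (A C D E) is then pruned because its subset (C D E) is not frequent, leaving only (A B C D) to be checked against the data.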
Generating Rules
- If windy=false and play=no then outlook=sunny and humidity=high
- If the double-consequent rule meets minimum coverage and accuracy, then so do both single-consequent rules:
  - If windy=false and play=no then outlook=sunny
  - If windy=false and play=no then humidity=high
How Many Rules?
- Want to consider every possible subset of attributes as the consequent
- With 4 attributes:
  - Four single-consequent rules
  - Six double-consequent rules
  - Two triple-consequent rules
- Twelve possible rules for a single 4-item set!
- Exponential explosion in the number of possible rules
Must We Check All?
- If A and B then C and D
- If A, B and C then D
- The second rule holds whenever the first does: coverage(A,B,C) <= coverage(A,B), so its accuracy is at least as high
Efficiency Improvement
- A double-consequent rule can only be OK if both of its single-consequent rules are OK
- Procedure:
  - Start with single-consequent rules
  - Build up candidate double-consequent rules, etc., and check them for accuracy
- In practice: far fewer rules need to be checked
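A sketch of this levelwise procedure over a precomputed coverage table (item sets as frozensets; the coverage numbers are those of humidity=normal, windy=false, play=yes in the weather data):

```python
def gen_rules(itemset, cov, min_acc):
    """Grow consequents level by level: a (c+1)-item consequent is only
    formed by merging two accurate c-item consequents, so far fewer
    candidate rules are checked than the full exponential space."""
    full = frozenset(itemset)
    total = cov[full]
    level = {frozenset([x]) for x in itemset
             if total / cov[full - {x}] >= min_acc}
    rules = set(level)
    while level:
        nxt = set()
        for a in level:
            for b in level:
                merged = a | b
                # merged == full would leave an empty antecedent; skip it here
                if len(merged) == len(a) + 1 and merged != full:
                    if total / cov[full - merged] >= min_acc:
                        nxt.add(merged)
        rules |= nxt
        level = nxt
    return rules

h, w, p = "humidity=normal", "windy=false", "play=yes"
cov = {frozenset([h]): 7, frozenset([w]): 8, frozenset([p]): 9,
       frozenset([h, w]): 4, frozenset([h, p]): 6, frozenset([w, p]): 6,
       frozenset([h, w, p]): 4}
found = gen_rules((h, w, p), cov, min_acc=1.0)
```

Only the consequent {play=yes} survives at the first level (accuracy 4/4, versus 4/6 for the other two), so no double-consequent candidates are even generated.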
Apriori Algorithm
- This is a simplified description of the Apriori algorithm
- Developed in the early 90s; the most commonly used approach
- New developments focus on:
  - Generating item sets more efficiently
  - Generating rules from item sets more efficiently
Association Rule Discovery using Weka
- Parameters to be specified in Apriori:
  - upperBoundMinSupport: start with this value of minimum support
  - delta: in each step, decrease the required minimum support by this value
  - lowerBoundMinSupport: final minimum support
  - numRules: how many rules to generate
  - metricType: confidence, lift, leverage, or conviction
  - minMetric: smallest acceptable metric value for a rule
- Handles only nominal attributes
Difficulties
- The Apriori algorithm improves performance by using candidate item sets, but some problems remain:
  - Costly to generate large numbers of item sets: to generate a frequent pattern of size 100, we need more than 2^100 (about 10^30) candidates!
  - Requires repeated scans of the database to check candidates
  - Again, most problematic for long patterns
Solution?
- Can candidate generation be avoided?
- New approach:
  - Create a frequent pattern tree (FP-tree) that stores information on frequent patterns
  - Use the FP-tree for mining frequent patterns: partitioning-based, divide-and-conquer (as opposed to bottom-up generation)
Database -> FP-Tree (minimum support = 3)

TID   Items                     Frequent Items
100   f,a,c,d,g,i,m,p           f,c,a,m,p
200   a,b,c,f,l,m,o             f,c,a,b,m
300   b,f,h,j,o                 f,b
400   b,c,k,s,p                 c,b,p
500   a,f,c,e,l,p,m,n           f,c,a,m,p
Computational Effort
- Each node has three fields: item name, count, node link
- Also a header table with: item name, head of node link
- Need two scans of the database:
  1. Collect the set of frequent items
  2. Construct the FP-tree
Comments
- The FP-tree is a compact data structure
- The FP-tree contains all the information related to mining frequent patterns (given the support)
- The size of the tree is bounded by the occurrences of frequent items
- The height of the tree is bounded by the maximum number of items in a transaction
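A two-scan construction sketch on the transaction database above (node links and the header table are omitted for brevity; note that ties in the frequency ordering are broken alphabetically here, so c precedes f, whereas the slides' tree puts f first):

```python
from collections import defaultdict

class Node:
    """FP-tree node holding item name, count, and children (the node-link
    field threading same-item nodes together is omitted)."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # Scan 1: collect the set of frequent items
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(freq, key=lambda i: (-freq[i], i))  # descending frequency
    # Scan 2: insert each transaction's frequent items, in order, into the tree
    root = Node(None)
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root, freq

tx = [["f", "a", "c", "d", "g", "i", "m", "p"],
      ["a", "b", "c", "f", "l", "m", "o"],
      ["b", "f", "h", "j", "o"],
      ["b", "c", "k", "s", "p"],
      ["a", "f", "c", "e", "l", "p", "m", "n"]]
root, freq = build_fp_tree(tx, 3)
```

Because shared prefixes of reordered transactions are merged into single paths with counts, the tree is never larger than the database of frequent items, which is the compactness property noted above.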
Mining Patterns
- Mine the complete set of frequent patterns
- For any frequent item A, all possible patterns containing A can be obtained by following A's node links, starting from A's entry in the header table
Example
[Figure: FP-tree with a header table over the items f, c, a, b, m, p and node links; following p's node links from the header table yields the paths containing p, one occurring twice and one once, from which the frequent pattern (p:3) is mined]
Rule Generation
- Mining the complete set of association rules has some problems:
  - There may be a large number of frequent item sets
  - There may be a huge number of association rules
- One potential solution is to look at closed item sets only
Frequent Closed Item Sets
- An item set X is a closed item set if there is no item set X' such that X ⊂ X' and every transaction containing X also contains X'
- A rule X -> Y is an association rule on a frequent closed item set if
  - both X and X∪Y are frequent closed item sets, and
  - there does not exist a frequent closed item set Z such that X ⊂ Z ⊂ X∪Y
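A closed set equals the intersection of all transactions that contain it, which gives a brute-force way to list the frequent closed item sets (exponential in the number of items, fine for the tiny example database that follows; the function name is mine):

```python
from itertools import combinations

def closed_itemsets(transactions, min_support):
    """Return the closures of all frequent item sets; a set is closed iff
    it equals the intersection of the transactions containing it."""
    items = sorted({i for t in transactions for i in t})
    result = set()
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = set(cand)
            holders = [set(t) for t in transactions if s <= set(t)]
            if len(holders) >= min_support:
                result.add(frozenset(set.intersection(*holders)))
    return result

tx = [{"A", "C", "D", "E", "F"}, {"A", "B", "E"}, {"C", "E", "F"},
      {"A", "C", "D", "F"}, {"C", "E", "F"}]
closed = closed_itemsets(tx, 2)
```

On this database the closed frequent sets are A, E, AE, CF, CEF, and ACDF; for instance D is frequent but not closed, since both transactions containing D also contain all of ACDF.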
Example
- Frequent item sets (minimum support = 2): A (3), E (4), AE (2), ACDF (2), CF (4), CEF (3), D (2), AC (2), + 12 more
- Not closed! Why? For example, D and AC are not closed: every transaction containing them also contains all of ACDF
- All the closed sets: A, E, AE, CF, CEF, ACDF

ID   Items
10   A,C,D,E,F
20   A,B,E
30   C,E,F
40   A,C,D,F
50   C,E,F
Mining Frequent Closed Item Sets (CLOSET)
[Figure: CLOSET recursively projects the transaction database into conditional databases for the items taken in frequency order (C:4, E:4, F:4, A:3, D:2), and outputs the closed item sets with their supports, e.g., ACDF:2, A:3, CEF:3, E:4, AE:2]
Mining with Taxonomies
- Taxonomy:
  Clothes -> Outerwear, Shirts; Outerwear -> Jackets, Ski Pants
  Footwear -> Shoes, Hiking Boots
- Generalized association rule: X -> Y, where no item in Y is an ancestor of an item in X
Why Taxonomy?
- Classic association rule mining restricts rules to the leaf nodes of the taxonomy
- However:
  - Rules at lower levels may not have minimum support, so interesting associations may go undiscovered
  - Taxonomies can be used to prune uninteresting and redundant rules
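One standard way to mine generalized rules (the approach of Srikant and Agrawal; the helper names are mine) is to extend each transaction with the ancestors of its items, after which the ordinary item-set algorithm counts generalized sets for free:

```python
# Child -> parent edges for the taxonomy above
taxonomy = {"Jacket": "Outerwear", "Ski Pants": "Outerwear",
            "Shirt": "Clothes", "Outerwear": "Clothes",
            "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def extend(transaction):
    """Add every ancestor of every item, so generalized item sets such as
    {Outerwear, Hiking Boots} are counted by the ordinary algorithm."""
    out = set(transaction)
    for item in transaction:
        while item in taxonomy:
            item = taxonomy[item]
            out.add(item)
    return out

tx = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
      {"Shoes"}, {"Shoes"}, {"Jacket"}]
extended = [extend(t) for t in tx]
support = sum({"Outerwear", "Hiking Boots"} <= t for t in extended)
```

Here `support` comes out as 2, matching the {Outerwear, Hiking Boots} row of the example table.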
Example

ID   Items
10   Shirt
20   Jacket, Hiking Boots
30   Ski Pants, Hiking Boots
40   Shoes
50   Shoes
60   Jacket

Item Set                     Support
{Jacket}                     2
{Outerwear}                  3
{Clothes}                    4
{Shoes}                      2
{Hiking Boots}               2
{Footwear}                   4
{Outerwear, Hiking Boots}    2
{Clothes, Hiking Boots}      2
{Outerwear, Footwear}        2
{Clothes, Footwear}          2

Rule                        Support   Confidence
Outerwear -> Hiking Boots   2         2/3
Outerwear -> Footwear       2         2/3
Hiking Boots -> Outerwear   2         2/2
Hiking Boots -> Clothes     2         2/2
Interesting Rules
- Many ways in which the interestingness of a rule can be evaluated based on its ancestors
- For example:
  - A rule with no ancestors is interesting
  - A rule with ancestor(s) is interesting only if it has enough relative support
- Which rules are interesting?
Rule ID   Rule                    Support
1         Clothes -> Footwear     10
2         Outerwear -> Footwear   8
3         Jackets -> Footwear     4

Item        Support
Clothes     5
Outerwear   2
Jackets     1
Discussion
- Association rule mining finds expressions of the form X -> Y from large data sets
- One of the most popular data mining tasks
- Originates in market basket analysis
- Key measures of performance: support and confidence (or accuracy)
- Are support and confidence enough?
Type of Rules Discovered
- Classic association rule problem: all rules satisfying minimum thresholds of support and confidence
- Focus on a subset of rules, e.g.:
  - Optimized rules
  - Maximal frequent item sets
  - Closed item sets
- What makes for an interesting rule?
Algorithm Construction
- Determine the frequent item sets (all or part)
  - By far the most computational time
  - Variations focus on this part
- Generate rules from the frequent item sets
Generating Item Sets
[Figure: Apriori-like algorithms classified by the search space traversed (bottom-up vs. top-down) and by how support is determined (counting vs. intersecting): Apriori*, AprioriTID, DIC, Partition, Eclat, FP-Growth*]
- No algorithm dominates the others!
(* discussed above)
Applications
- Market basket analysis: the classic marketing application
- Applications to recommender systems
Recommender Systems
- Customized goods and services: recommend products
- Collaborative filtering
  - Finds similarities among users' tastes
  - Recommends based on other users
  - Many on-line systems
  - Simple algorithms
Classification Approach
- View recommendation as a classification problem: a product is either of interest or not
- Induce a model, e.g., a decision tree
- Classify a new product as either interesting or not interesting
- Difficulty with this approach?
Association Rule Approach
- Product associations: 90% of users who like product A and product B also like product C
  A and B -> C (90%)
- User associations: 90% of products liked by user A and user B are also liked by user C
- Use a combination of product and user associations
Advantages
- Classic collaborative filtering must identify users with similar tastes
- This approach uses the overlap of other users' tastes to match a given user's taste
- Can be applied to users whose tastes don't correlate strongly with those of other users
- Can take advantage of information from, say, user A for a recommendation to user B, even if their tastes do not correlate
What's Different Here?
- Is this really a classic association rule problem?
- We want to learn what products are liked by what users
- Semi-supervised, with a target item:
  - User (for user associations)
  - Product (for product associations)
Single-Consequent Rules
- Only a single (target) item in the consequent; go through all such items
- Classification: one single, fixed consequent
- Association rules: all possible item combinations as the consequent
- Associations for recommenders sit between the two