PPT on unsupervised learning methods in clustering and data mining
Unsupervised Learning
Clustering
- Unsupervised classification, that is, classification without the class attribute
- Want to discover the classes
Association Rule Discovery
- Discover correlations
Data Mining and Knowledge Discovery
The Clustering Process
- Pattern representation
- Definition of a pattern proximity measure
- Clustering
- Data abstraction
- Cluster validation
Pattern Representation
- Number of classes
- Number of available patterns
- Circles, ellipses, squares, etc.
- Feature selection: can we use wrappers and filters?
- Feature extraction: produce new features, e.g., principal component analysis (PCA)
Pattern Proximity
- Want clusters of instances that are similar to each other but dissimilar to instances in other clusters
- Need a similarity measure
- Continuous case:
  - Euclidean measure (compact, isolated clusters)
  - The squared Mahalanobis distance alleviates problems with correlated attributes
  - Many more measures exist
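The two continuous-case measures can be sketched in a few lines (a minimal illustration using NumPy; the function names are mine, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    """Straight-line distance; favours compact, isolated clusters."""
    return float(np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)))

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance: rescales differences by the data's
    covariance matrix, which alleviates problems with correlated attributes."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(diff @ np.linalg.inv(cov) @ diff)

d_e = euclidean([0, 0], [3, 4])
# With an identity covariance, Mahalanobis reduces to squared Euclidean
d_m = mahalanobis_sq([0, 0], [3, 4], np.eye(2))
```

With a non-identity covariance, the Mahalanobis distance shrinks differences along directions of high variance, which is exactly what makes it more robust to correlated attributes than the plain Euclidean measure.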
Pattern Proximity
- Nominal attributes
Clustering Techniques
- Hierarchical: Single Link, Complete Link
- Partitional: Square Error (K-means), Mixture Maximization (Expectation Maximization), CobWeb
Technique Characteristics
- Agglomerative vs divisive
  - Agglomerative: each instance starts as its own cluster and the algorithm merges clusters
  - Divisive: begins with all instances in one cluster and divides it up
- Hard vs fuzzy
  - Hard clustering assigns each instance to exactly one cluster, whereas fuzzy clustering assigns each instance a degree of membership in every cluster
More Characteristics
- Monothetic vs polythetic
  - Polythetic: all attributes are used simultaneously, e.g., to calculate distance (most algorithms)
  - Monothetic: attributes are considered one at a time
- Incremental vs non-incremental
  - With large data sets it may be necessary to consider only part of the data at a time (data mining)
  - Incremental clustering works instance by instance
Hierarchical Clustering
[Figure: dendrogram for instances A through G, with similarity on the vertical axis]
Hierarchical Algorithms
- Single-link
  - Distance between two clusters is the minimum of the distances between all pairs of instances
  - More versatile
  - Produces (sometimes too) elongated clusters
- Complete-link
  - Distance between two clusters is the maximum of the distances between pairs of instances in the clusters
  - Produces tightly bound, compact clusters
  - Often more useful in practice
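The two linkage definitions differ only in which pairwise distance they keep; a minimal sketch for clusters of 1-D points (the helper is hypothetical, not from the slides):

```python
def cluster_distance(c1, c2, link="single"):
    """Inter-cluster distance between two clusters of 1-D points.
    single   -> minimum over all pairs (versatile, but can chain clusters
                into elongated shapes)
    complete -> maximum over all pairs (favours tight, compact clusters)"""
    pairs = [abs(a - b) for a in c1 for b in c2]
    return min(pairs) if link == "single" else max(pairs)

a, b = [1.0, 2.0], [4.0, 8.0]
d_single = cluster_distance(a, b, "single")      # closest pair: |2 - 4|
d_complete = cluster_distance(a, b, "complete")  # farthest pair: |1 - 8|
```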
Example: Clusters Found
[Figure: the clusters found by single-link vs. complete-link on the same data]
Partitional Clustering
- Outputs a single partition of the data into clusters
- Good for large data sets
- Determining the number of clusters is a major challenge
K-Means
- Predetermined number of clusters
- Start with seed clusters of one element each
Assign Instances to Clusters
Find New Centroids
New Clusters
Discussion: k-Means
- Applicable to fairly large data sets
- Sensitive to the initial centers
  - Use other heuristics to find good initial centers
- Converges to a local optimum
- Specifying the number of centers is very subjective
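A minimal k-means sketch in pure Python (2-D points as tuples; the function name and seeding strategy are my choices). It makes the points above concrete: the outcome depends on the initial seed clusters, and the loop stops as soon as the centroids stop moving, i.e., at a local optimum:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Alternate (1) assigning each instance to its nearest centroid and
    (2) moving each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seed clusters of one element
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        new = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged (to a local optimum)
            break
        centroids = new
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
cents, clus = kmeans(pts, 2)
```

On these well-separated points any pair of seeds reaches the same answer; on harder data, different seeds reach different local optima, which is why heuristics for choosing good initial centers matter.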
Clustering in Weka
- Clustering algorithms in Weka:
  - K-Means
  - Expectation Maximization (EM)
  - Cobweb (hierarchical, incremental, and agglomerative)
CobWeb
- Algorithm (main) characteristics: hierarchical and incremental
- Uses category utility:
  CU(C_1, ..., C_k) = (1/k) * sum_l P(C_l) * sum_i sum_j [ P(a_i = v_ij | C_l)^2 - P(a_i = v_ij)^2 ]
  where l ranges over the k clusters and j ranges over all possible values v_ij of attribute a_i
- Why divide by k? See below.
Category Utility
- If each instance is in its own cluster, then P(a_i = v_ij | C_l) = 1 for every attribute value the instance has, and the category utility function reaches its maximum possible value for the data
- Without the division by k it would therefore always be best for each instance to have its own cluster: overfitting!
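A sketch of the category utility computation for nominal data (instances as tuples of attribute values; the function name is mine). The tiny example shows the contrast the formula rewards: splitting into pure clusters scores higher than lumping everything together, while the final division by k keeps the score from always favouring singleton clusters:

```python
from collections import Counter

def category_utility(clusters):
    """CU = (1/k) * sum_l P(C_l) * sum_{i,j} [P(a_i=v_ij|C_l)^2 - P(a_i=v_ij)^2]"""
    rows = [row for cl in clusters for row in cl]
    n, n_attr = len(rows), len(rows[0])
    # Baseline term: sum over attributes i of sum_j P(a_i = v_ij)^2
    base = sum(sum((c / n) ** 2 for c in Counter(r[i] for r in rows).values())
               for i in range(n_attr))
    cu = 0.0
    for cl in clusters:
        within = sum(sum((c / len(cl)) ** 2
                         for c in Counter(r[i] for r in cl).values())
                     for i in range(n_attr))
        cu += (len(cl) / n) * (within - base)
    return cu / len(clusters)  # the division by k

pure = category_utility([[("a",), ("a",)], [("b",), ("b",)]])
merged = category_utility([[("a",), ("a",), ("b",), ("b",)]])
```

Here `pure` (two homogeneous clusters) scores 0.25 while `merged` (one mixed cluster) scores 0, since within-cluster value probabilities equal the baseline probabilities.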
The Weather Problem

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No
Weather Data (without Play)
- Label the instances a, b, ..., n
- a: start by putting the first instance in its own cluster
- a, b: add another instance in its own cluster
Adding the Third Instance
- Evaluate the category utility of adding instance c to one of the two existing clusters versus adding it as its own cluster
- Choose the option with the highest utility
Adding Instance f
- First instance not to get its own cluster
- Look at the instances:
  E) Rainy Cool Normal FALSE
  F) Rainy Cool Normal TRUE
- Quite similar!
Add Instance g
- Look at the instances:
  E) Rainy Cool Normal FALSE
  F) Rainy Cool Normal TRUE
  G) Overcast Cool Normal TRUE
Add Instance h
- Look at the instances:
  A) Sunny Hot High FALSE
  D) Rainy Mild High FALSE
  H) Sunny Mild High FALSE
- Rearrange: the best matching node and the runner-up are merged into a single cluster before h is added
- (Splitting is also possible)
Final Hierarchy
- What next?
Dendrogram Clusters
- What do a, b, c, d, h, k, and l have in common?
Numerical Attributes
- Assume a normal distribution
- Problems with zero variance! The acuity parameter imposes a minimum variance
Hierarchy Size (Scalability)
- May create a very large hierarchy
- The cutoff parameter is used to suppress growth: if adding a node improves category utility by less than the cutoff, the node is cut off
Discussion
- Advantages
  - Incremental: scales to a large number of instances
  - Cutoff limits the size of the hierarchy
  - Handles mixed attributes
- Disadvantages
  - Incremental: sensitive to the order of the instances
  - Arbitrary choices of parameters: the division by k, the artificial minimum value for the variance of numeric attributes, the ad hoc cutoff value
Probabilistic Perspective
- Most likely set of clusters given the data
- Probability of each instance belonging to a cluster
- Assumption: instances are drawn from one of several distributions
- Goal: estimate the parameters of these distributions
- Usually: assume the distributions are normal
Mixture Resolution
- Mixture: set of k probability distributions
- The distributions represent the k clusters
- Probabilities that an instance takes certain attribute values given that it is in the cluster
- What is the probability that an instance belongs to a cluster (or a distribution)?
One Numeric Attribute
- Two-cluster mixture model: each instance is drawn from cluster A, with probability p_A, or from cluster B, with probability p_B = 1 - p_A, where each cluster is a normal distribution
- Given some data, how can you determine the parameters mu_A, sigma_A, mu_B, sigma_B, and p_A?
Problems
- If we knew which cluster each instance came from, we could estimate these values
- If we knew the parameters, we could calculate the probability that an instance belongs to each cluster
EM Algorithm
- Expectation Maximization (EM)
  - Start with initial values for the parameters
  - Calculate the cluster probabilities for each instance
  - Re-estimate the values for the parameters
  - Repeat
- General-purpose maximum likelihood estimation algorithm for missing data
- Can also be used to train Bayesian networks (later)
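The four steps above, specialised to a mixture of two 1-D Gaussians (a sketch; the initialisation at the data extremes and the variance floor are my choices, not from the slides):

```python
import math

def em_two_gaussians(xs, iters=50):
    mu = [min(xs), max(xs)]       # initial values for the parameters
    var = [1.0, 1.0]
    w = [0.5, 0.5]                # mixing weights

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: cluster probabilities (responsibilities) for each instance
        r = []
        for x in xs:
            p0 = w[0] * pdf(x, mu[0], var[0])
            p1 = w[1] * pdf(x, mu[1], var[1])
            r.append(p0 / (p0 + p1))
        # M-step: re-estimate the parameters from the weighted data
        for k, rk in ((0, r), (1, [1 - t for t in r])):
            s = sum(rk)
            mu[k] = sum(t * x for t, x in zip(rk, xs)) / s
            var[k] = max(sum(t * (x - mu[k]) ** 2
                             for t, x in zip(rk, xs)) / s, 1e-6)
            w[k] = s / len(xs)
    return mu, var, w

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mu, var, w = em_two_gaussians(data)
```

On this data the two estimated means converge near 1.0 and 5.0; the `1e-6` floor on the variance plays the same role as the minimum-variance parameters mentioned for CobWeb and Weka's EM.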
Beyond Normal Models
- More than one class: straightforward
- More than one numeric attribute
  - Easy if we assume the attributes are independent
  - If attributes are dependent, treat them jointly using the bivariate normal
- Nominal attributes: no more normal distribution!
EM using Weka
- Options:
  - numClusters: set the number of clusters; the default of -1 selects it automatically
  - maxIterations: maximum number of iterations
  - seed: random number seed
  - minStdDev: set the minimum allowable standard deviation
Other Clustering
- Artificial Neural Networks (ANN)
- Random search
  - Genetic Algorithms (GA): used, e.g., to find initial centroids for k-means
  - Simulated Annealing (SA)
  - Tabu Search (TS)
- Support Vector Machines (SVM)
- Will discuss GA and SVM later
Applications
- Image segmentation
- Object and character recognition
- Data mining:
  - Stand-alone, to gain insight into the data
  - Preprocessing before a classifier that operates on the detected clusters
DM Clustering Challenges
- Data mining deals with large databases
- Scalability with respect to the number of instances
  - Use a random sample (possible bias)
- Dealing with mixed data: many algorithms only make sense for numeric data
- High-dimensional problems
  - Can the algorithm handle many attributes?
  - How do we interpret a cluster in high dimensions?
Other (General) Challenges
- Shape of clusters
- Minimum domain knowledge (e.g., knowing the number of clusters)
- Noisy data
- Insensitivity to instance order
- Interpretability and usability
Clustering for DM
- The main issue is scalability to large databases
- Many algorithms have been developed for scalable clustering:
  - Partitional methods: CLARA, CLARANS
  - Hierarchical methods: AGNES, DIANA, BIRCH, CURE, Chameleon
Practical Partitional Clustering Algorithms
- Classic k-means (1967)
- Work from 1990 and later:
  - k-medoids: uses the medoid instead of the centroid
    - Less sensitive to outliers and noise
    - Computations are more costly
  - PAM (Partitioning Around Medoids) algorithm
Large-Scale Problems
- CLARA: Clustering LARge Applications
  - Select several random samples of instances
  - Apply PAM to each
  - Return the best clusters
- CLARANS: similar to CLARA
  - Draws samples randomly while searching
  - More effective than PAM and CLARA
Hierarchical Methods
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
  - Clustering feature: a triplet summarizing information about a subcluster
  - Clustering feature (CF) tree: a height-balanced tree that stores the clustering features
BIRCH Mechanism
- Phase I: scan the database to build an initial CF tree (a multilevel compression of the data)
- Phase II: apply a selected clustering algorithm to the leaf nodes of the CF tree
- Has been found to be very scalable
Conclusion
- The use of clustering in data mining practice seems to be somewhat limited, due to scalability problems
- A more commonly used form of unsupervised learning: Association Rule Discovery
Association Rule Discovery
- Aims to discover interesting correlations or other relationships in large databases
- Finds rules of the form: if A and B then C and D
- Which attributes will be included in the relation is unknown
Mining Association Rules
- Similar to classification rules
- Use the same procedure? Every attribute is the same, so we would apply it to every possible expression on the right-hand side
- Huge number of rules: infeasible!
- Only want rules with high coverage/support
Market Basket Analysis
- Basket data: items purchased on a per-transaction basis (not cumulative, etc.)
- How do you boost the sales of a given product?
- What other products does discontinuing a product impact?
- Which products should be shelved together?
- Terminology (market basket analysis):
  - Item: an attribute/value pair
  - Item set: a combination of items with minimum coverage
How Many k-Item Sets Have Minimum Coverage?
Item Sets

1-item sets (with coverage):
- Outlook=sunny (5)
- Outlook=overcast (4)
- Outlook=rainy (5)
- Temp=cool (4)
- Temp=mild (6)

2-item sets:
- Outlook=sunny Temp=mild (2)
- Outlook=sunny Temp=hot (2)
- Outlook=sunny Humidity=normal (2)
- Outlook=sunny Windy=true (2)
- Outlook=sunny Windy=false (3)

3-item sets:
- Outlook=sunny Temp=hot Humidity=high (2)
- Outlook=sunny Temp=hot Play=no (2)
- Outlook=sunny Humidity=normal Play=yes (2)
- Outlook=sunny Humidity=high Windy=false (2)
- Outlook=sunny Humidity=high Play=no (3)

4-item sets:
- Outlook=sunny Temp=hot Humidity=high Play=no (2)
- Outlook=sunny Humidity=high Windy=false Play=no (2)
- Outlook=overcast Temp=hot Windy=false Play=yes (2)
- Outlook=rainy Temp=mild Windy=false Play=yes (2)
- Outlook=rainy Humidity=normal Windy=false Play=yes (2)
From Sets to Rules
- 3-item set with coverage 4: Humidity=normal, Windy=false, Play=yes
- Association rules and their accuracy:
  - If humidity=normal and windy=false then play=yes (4/4)
  - If humidity=normal and play=yes then windy=false (4/6)
  - If windy=false and play=yes then humidity=normal (4/6)
  - If humidity=normal then windy=false and play=yes (4/7)
  - If windy=false then humidity=normal and play=yes (4/8)
  - If play=yes then humidity=normal and windy=false (4/9)
  - If (empty antecedent) then humidity=normal and windy=false and play=yes (4/14)
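These fractions can be reproduced directly from the weather table. A sketch (the helper names are mine; the windy values are those of the standard Weka weather data, with which the counts above are consistent):

```python
# Instances as (outlook, temp, humidity, windy, play)
weather = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
COL = {"outlook": 0, "temp": 1, "humidity": 2, "windy": 3, "play": 4}

def coverage(items):
    """Number of instances matching every (attribute, value) pair."""
    return sum(all(row[COL[a]] == v for a, v in items) for row in weather)

def accuracy(antecedent, consequent):
    """coverage(antecedent + consequent) / coverage(antecedent)."""
    return coverage(antecedent + consequent) / coverage(antecedent)
```

For example, `accuracy([("humidity", "normal"), ("windy", "false")], [("play", "yes")])` gives the 4/4 of the first rule.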
From Sets to Rules (continued)
- 4-item set with coverage 2: Temperature=cool, Humidity=normal, Windy=false, Play=yes
- Association rules and their accuracy:
  - If temperature=cool and windy=false then humidity=normal and play=yes (2/2)
  - If temperature=cool, humidity=normal and windy=false then play=yes (2/2)
  - If temperature=cool, windy=false and play=yes then humidity=normal (2/2)
Overall
- Minimum coverage (2): 12 1-item sets, 47 2-item sets, 39 3-item sets, 6 4-item sets
- Minimum accuracy (100%): 58 association rules
- Best rules (coverage = 4, accuracy = 100%):
  - If humidity=normal and windy=false then play=yes
  - If temperature=cool then humidity=normal
  - If outlook=overcast then play=yes
Association Rule Mining
- Step 1: find all item sets that meet minimum coverage
- Step 2: find all rules that meet minimum accuracy
- Step 3: prune
Generating Item Sets
- How do we generate minimum-coverage item sets in a scalable manner?
- The total number of item sets is huge: it grows exponentially in the number of attributes
- Need an efficient algorithm:
  - Start by generating minimum-coverage 1-item sets
  - Use those to generate 2-item sets, etc.
- Why do we only need to consider minimum-coverage 1-item sets?
Justification
- Item set 1: {Humidity=high}; coverage(1) = number of times humidity is high
- Item set 2: {Windy=false}; coverage(2) = number of times windy is false
- Item set 3: {Humidity=high, Windy=false}; coverage(3) = number of times humidity is high and windy is false
- Coverage(3) <= coverage(1) and coverage(3) <= coverage(2), so if item sets 1 and 2 do not both meet minimum coverage, item set 3 cannot either
Generating Item Sets
- Start with all 3-item sets that meet minimum coverage: (A B C), (A B D), (A C D), (A C E)
- Merge pairs to generate 4-item sets, considering only sets that start with the same two attributes
- Candidate 4-item sets with minimum coverage (must still be checked): (A B C D), (A C D E)
- There are only two 4-item sets that could possibly work
Algorithm for Generating Item Sets
- Build up from 1-item sets, so that we only consider item sets found by merging two minimum-coverage sets
- Only consider sets that have all but one item in common
- Computational efficiency is further improved by using hash tables
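The merge-and-check step can be sketched as follows, with item sets as sorted tuples and the "hash table" simply a Python set. The input extends the four 3-item sets of the earlier example with a hypothetical (B C D) so that one candidate survives the subset check:

```python
def merge_candidates(itemsets):
    """Apriori-style join: merge two k-item sets that agree on their first
    k-1 items into a (k+1)-item candidate, then prune any candidate with
    a k-item subset that is not itself frequent."""
    k = len(itemsets[0])
    frequent = set(itemsets)
    candidates = set()
    for a in itemsets:
        for b in itemsets:
            if a < b and a[:k - 1] == b[:k - 1]:
                candidates.add(a + b[-1:])
    return sorted(c for c in candidates
                  if all(c[:i] + c[i + 1:] in frequent for i in range(k + 1)))

threes = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
          ("A", "C", "E"), ("B", "C", "D")]
fours = merge_candidates(threes)
```

(A B C D) and (A C D E) are the only merges possible; (A C D E) is then pruned because its subset (C D E) is not frequent, leaving only (A B C D) to be checked against the data.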
Generating Rules
- If windy=false and play=no then outlook=sunny and humidity=high
- If the double-consequent rule meets minimum coverage and accuracy, then so do both single-consequent rules:
  - If windy=false and play=no then outlook=sunny
  - If windy=false and play=no then humidity=high
How Many Rules?
- Want to consider every possible subset of attributes as the consequent
- With 4 attributes:
  - Four single-consequent rules
  - Six double-consequent rules
  - Two triple-consequent rules
- Twelve possible rules for a single 4-item set!
- Exponential explosion in the number of possible rules
Must We Check All?
- If A and B then C and D
- If A, B and C then D
- The second rule holds whenever the first does: coverage(A,B,C) <= coverage(A,B), so its accuracy is at least as high
Efficiency Improvement
- A double-consequent rule can only be OK if both of its single-consequent rules are OK
- Procedure:
  - Start with single-consequent rules
  - Build up candidate double-consequent rules, etc., and check them for accuracy
- In practice: far fewer rules need to be checked
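A sketch of this levelwise procedure over a precomputed coverage table (item sets as frozensets; the coverage numbers are those of humidity=normal, windy=false, play=yes in the weather data):

```python
def gen_rules(itemset, cov, min_acc):
    """Grow consequents level by level: a (c+1)-item consequent is only
    formed by merging two accurate c-item consequents, so far fewer
    candidate rules are checked than the full exponential space."""
    full = frozenset(itemset)
    total = cov[full]
    level = {frozenset([x]) for x in itemset
             if total / cov[full - {x}] >= min_acc}
    rules = set(level)
    while level:
        nxt = set()
        for a in level:
            for b in level:
                merged = a | b
                # merged == full would leave an empty antecedent; skip it here
                if len(merged) == len(a) + 1 and merged != full:
                    if total / cov[full - merged] >= min_acc:
                        nxt.add(merged)
        rules |= nxt
        level = nxt
    return rules

h, w, p = "humidity=normal", "windy=false", "play=yes"
cov = {frozenset([h]): 7, frozenset([w]): 8, frozenset([p]): 9,
       frozenset([h, w]): 4, frozenset([h, p]): 6, frozenset([w, p]): 6,
       frozenset([h, w, p]): 4}
found = gen_rules((h, w, p), cov, min_acc=1.0)
```

Only the consequent {play=yes} survives at the first level (accuracy 4/4, versus 4/6 for the other two), so no double-consequent candidates are even generated.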
Apriori Algorithm
- This is a simplified description of the Apriori algorithm
- Developed in the early 90s; the most commonly used approach
- New developments focus on:
  - Generating item sets more efficiently
  - Generating rules from item sets more efficiently
Association Rule Discovery using Weka
- Parameters to be specified in Apriori:
  - upperBoundMinSupport: start with this value of minimum support
  - delta: in each step, decrease the required minimum support by this value
  - lowerBoundMinSupport: final minimum support
  - numRules: how many rules to generate
  - metricType: confidence, lift, leverage, or conviction
  - minMetric: smallest acceptable metric value for a rule
- Handles only nominal attributes
Difficulties
- The Apriori algorithm improves performance by using candidate item sets, but some problems remain:
  - Costly to generate large numbers of item sets: to generate a frequent pattern of size 100, we need more than 2^100 (about 10^30) candidates!
  - Requires repeated scans of the database to check candidates
  - Again, most problematic for long patterns
Solution?
- Can candidate generation be avoided?
- New approach:
  - Create a frequent pattern tree (FP-tree) that stores information on frequent patterns
  - Use the FP-tree for mining frequent patterns: partitioning-based, divide-and-conquer (as opposed to bottom-up generation)
Database -> FP-Tree (minimum support = 3)

TID   Items                     Frequent Items
100   f,a,c,d,g,i,m,p           f,c,a,m,p
200   a,b,c,f,l,m,o             f,c,a,b,m
300   b,f,h,j,o                 f,b
400   b,c,k,s,p                 c,b,p
500   a,f,c,e,l,p,m,n           f,c,a,m,p
Computational Effort
- Each node has three fields: item name, count, node link
- Also a header table with: item name, head of node link
- Need two scans of the database:
  1. Collect the set of frequent items
  2. Construct the FP-tree
Comments
- The FP-tree is a compact data structure
- The FP-tree contains all the information related to mining frequent patterns (given the support)
- The size of the tree is bounded by the occurrences of frequent items
- The height of the tree is bounded by the maximum number of items in a transaction
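A two-scan construction sketch on the transaction database above (node links and the header table are omitted for brevity; note that ties in the frequency ordering are broken alphabetically here, so c precedes f, whereas the slides' tree puts f first):

```python
from collections import defaultdict

class Node:
    """FP-tree node holding item name, count, and children (the node-link
    field threading same-item nodes together is omitted)."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # Scan 1: collect the set of frequent items
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(freq, key=lambda i: (-freq[i], i))  # descending frequency
    # Scan 2: insert each transaction's frequent items, in order, into the tree
    root = Node(None)
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root, freq

tx = [["f", "a", "c", "d", "g", "i", "m", "p"],
      ["a", "b", "c", "f", "l", "m", "o"],
      ["b", "f", "h", "j", "o"],
      ["b", "c", "k", "s", "p"],
      ["a", "f", "c", "e", "l", "p", "m", "n"]]
root, freq = build_fp_tree(tx, 3)
```

Because shared prefixes of reordered transactions are merged into single paths with counts, the tree is never larger than the database of frequent items, which is the compactness property noted above.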
Mining Patterns
- Mine the complete set of frequent patterns
- For any frequent item A, all possible patterns containing A can be obtained by following A's node links, starting from A's entry in the header table
Example
[Figure: FP-tree with a header table over the items f, c, a, b, m, p and node links; following p's node links from the header table yields the paths containing p, one occurring twice and one once, from which the frequent pattern (p:3) is mined]
Rule Generation
- Mining the complete set of association rules has some problems:
  - There may be a large number of frequent item sets
  - There may be a huge number of association rules
- One potential solution is to look at closed item sets only
Frequent Closed Item Sets
- An item set X is a closed item set if there is no item set X' such that X ⊂ X' and every transaction containing X also contains X'
- A rule X -> Y is an association rule on a frequent closed item set if
  - both X and X∪Y are frequent closed item sets, and
  - there does not exist a frequent closed item set Z such that X ⊂ Z ⊂ X∪Y
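A closed set equals the intersection of all transactions that contain it, which gives a brute-force way to list the frequent closed item sets (exponential in the number of items, fine for the tiny example database that follows; the function name is mine):

```python
from itertools import combinations

def closed_itemsets(transactions, min_support):
    """Return the closures of all frequent item sets; a set is closed iff
    it equals the intersection of the transactions containing it."""
    items = sorted({i for t in transactions for i in t})
    result = set()
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = set(cand)
            holders = [set(t) for t in transactions if s <= set(t)]
            if len(holders) >= min_support:
                result.add(frozenset(set.intersection(*holders)))
    return result

tx = [{"A", "C", "D", "E", "F"}, {"A", "B", "E"}, {"C", "E", "F"},
      {"A", "C", "D", "F"}, {"C", "E", "F"}]
closed = closed_itemsets(tx, 2)
```

On this database the closed frequent sets are A, E, AE, CF, CEF, and ACDF; for instance D is frequent but not closed, since both transactions containing D also contain all of ACDF.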
Example
- Frequent item sets (minimum support = 2): A (3), E (4), AE (2), ACDF (2), CF (4), CEF (3), D (2), AC (2), + 12 more
- Not closed! Why? For example, D and AC are not closed: every transaction containing them also contains all of ACDF
- All the closed sets: A, E, AE, CF, CEF, ACDF

ID   Items
10   A,C,D,E,F
20   A,B,E
30   C,E,F
40   A,C,D,F
50   C,E,F
Mining Frequent Closed Item Sets (CLOSET)
[Figure: CLOSET recursively projects the transaction database into conditional databases for the items taken in frequency order (C:4, E:4, F:4, A:3, D:2), and outputs the closed item sets with their supports, e.g., ACDF:2, A:3, CEF:3, E:4, AE:2]
Mining with Taxonomies
- Taxonomy:
  Clothes -> Outerwear, Shirts; Outerwear -> Jackets, Ski Pants
  Footwear -> Shoes, Hiking Boots
- Generalized association rule: X -> Y, where no item in Y is an ancestor of an item in X
Why Taxonomy?
- Classic association rule mining restricts rules to the leaf nodes of the taxonomy
- However:
  - Rules at lower levels may not have minimum support, so interesting associations may go undiscovered
  - Taxonomies can be used to prune uninteresting and redundant rules
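One standard way to mine generalized rules (the approach of Srikant and Agrawal; the helper names are mine) is to extend each transaction with the ancestors of its items, after which the ordinary item-set algorithm counts generalized sets for free:

```python
# Child -> parent edges for the taxonomy above
taxonomy = {"Jacket": "Outerwear", "Ski Pants": "Outerwear",
            "Shirt": "Clothes", "Outerwear": "Clothes",
            "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def extend(transaction):
    """Add every ancestor of every item, so generalized item sets such as
    {Outerwear, Hiking Boots} are counted by the ordinary algorithm."""
    out = set(transaction)
    for item in transaction:
        while item in taxonomy:
            item = taxonomy[item]
            out.add(item)
    return out

tx = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
      {"Shoes"}, {"Shoes"}, {"Jacket"}]
extended = [extend(t) for t in tx]
support = sum({"Outerwear", "Hiking Boots"} <= t for t in extended)
```

Here `support` comes out as 2, matching the {Outerwear, Hiking Boots} row of the example table.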
Example

ID   Items
10   Shirt
20   Jacket, Hiking Boots
30   Ski Pants, Hiking Boots
40   Shoes
50   Shoes
60   Jacket

Item Set                     Support
{Jacket}                     2
{Outerwear}                  3
{Clothes}                    4
{Shoes}                      2
{Hiking Boots}               2
{Footwear}                   4
{Outerwear, Hiking Boots}    2
{Clothes, Hiking Boots}      2
{Outerwear, Footwear}        2
{Clothes, Footwear}          2

Rule                        Support   Confidence
Outerwear -> Hiking Boots   2         2/3
Outerwear -> Footwear       2         2/3
Hiking Boots -> Outerwear   2         2/2
Hiking Boots -> Clothes     2         2/2
Interesting Rules
- Many ways in which the interestingness of a rule can be evaluated based on its ancestors
- For example:
  - A rule with no ancestors is interesting
  - A rule with ancestor(s) is interesting only if it has enough relative support
- Which rules are interesting?
Rule ID   Rule                    Support
1         Clothes -> Footwear     10
2         Outerwear -> Footwear   8
3         Jackets -> Footwear     4

Item        Support
Clothes     5
Outerwear   2
Jackets     1
Discussion
- Association rule mining finds expressions of the form X -> Y from large data sets
- One of the most popular data mining tasks
- Originates in market basket analysis
- Key measures of performance: support and confidence (or accuracy)
- Are support and confidence enough?
Type of Rules Discovered
- Classic association rule problem: all rules satisfying minimum thresholds of support and confidence
- Focus on a subset of rules, e.g.:
  - Optimized rules
  - Maximal frequent item sets
  - Closed item sets
- What makes for an interesting rule?
Algorithm Construction
- Determine the frequent item sets (all or part)
  - By far the most computational time
  - Variations focus on this part
- Generate rules from the frequent item sets
Generating Item Sets
[Figure: Apriori-like algorithms classified by the search space traversed (bottom-up vs. top-down) and by how support is determined (counting vs. intersecting): Apriori*, AprioriTID, DIC, Partition, Eclat, FP-Growth*]
- No algorithm dominates the others!
(* discussed above)
Applications
- Market basket analysis: the classic marketing application
- Applications to recommender systems
Recommender Systems
- Customized goods and services: recommend products
- Collaborative filtering
  - Finds similarities among users' tastes
  - Recommends based on other users
  - Many on-line systems
  - Simple algorithms
Classification Approach
- View recommendation as a classification problem: a product is either of interest or not
- Induce a model, e.g., a decision tree
- Classify a new product as either interesting or not interesting
- Difficulty with this approach?
Association Rule Approach
- Product associations: 90% of users who like product A and product B also like product C
  A and B -> C (90%)
- User associations: 90% of products liked by user A and user B are also liked by user C
- Use a combination of product and user associations
Advantages
- Classic collaborative filtering must identify users with similar tastes
- This approach uses the overlap of other users' tastes to match a given user's taste
- Can be applied to users whose tastes don't correlate strongly with those of other users
- Can take advantage of information from, say, user A for a recommendation to user B, even if their tastes do not correlate
What's Different Here?
- Is this really a classic association rule problem?
- We want to learn what products are liked by what users
- Semi-supervised, with a target item:
  - User (for user associations)
  - Product (for product associations)
Single-Consequent Rules
- Only a single (target) item in the consequent; go through all such items
- Classification: one single, fixed consequent
- Association rules: all possible item combinations as the consequent
- Associations for recommenders sit between the two