CS 483 Introduction to Data Mining
DATA MINING
Introductory
Dr. Mohammed Alhaddad
College of Information Technology
King AbdulAziz University
CS483

Data Mining Outline
PART I
– Introduction
– Related Concepts
– Data Mining Techniques
PART II
– Classification
– Clustering
– Association Rules
PART III
– Web Mining
– Spatial Mining
– Temporal Mining

Goal: Provide an overview of data mining
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues

Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING

Data Mining Definition
Finding hidden information in a database.
Fit data to a model
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning

What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

Data Mining Algorithm
Objective: Fit Data to a Model
– Descriptive
– Predictive
Preference – Technique to choose the best model
Search – Technique to search the data
– “Query”

DB Processing vs. Data Mining Processing
DB Processing
Query
– Well defined
– SQL
Data
– Operational data
Output
– Precise
– Subset of database
Data Mining Processing
Query
– Poorly defined
– No precise query language
Data
– Not operational data
Output
– Fuzzy
– Not a subset of database

Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
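
The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `baskets` data, the function names, and the `min_count` threshold are hypothetical and do not come from the slides. The first function answers a precise question with a subset of the data; the second summarizes co-occurrence, which is the flavor of an association-rule question.

```python
from collections import Counter

# Hypothetical market baskets: customer id -> set of purchased items.
baskets = {
    "c1": {"milk", "bread"},
    "c2": {"beer", "chips"},
    "c3": {"milk", "bread", "eggs"},
    "c4": {"milk", "cereal"},
}

def customers_who_bought(item):
    """Database-style query: an exact predicate that returns a subset of the data."""
    return sorted(c for c, items in baskets.items() if item in items)

def items_bought_with(item, min_count=2):
    """Mining-style question: items that co-occur with `item` at least min_count times."""
    counts = Counter()
    for items in baskets.values():
        if item in items:
            counts.update(items - {item})
    return {other: n for other, n in counts.items() if n >= min_count}

print(customers_who_bought("milk"))
print(items_bought_with("milk"))
```

The second result is not a subset of the stored rows; it is a derived pattern whose content depends on the chosen threshold.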

Data Mining Models and Tasks

Data Mining Tasks
Prediction Methods
– Use some variables to predict unknown or future values of other variables.
Description Methods
– Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
1. Classification [Predictive]
2. Clustering [Descriptive]
3. Association Rule Discovery [Descriptive]
4. Sequential Pattern Discovery [Descriptive]
5. Regression [Predictive]
6. Deviation Detection [Predictive]

Classification: Definition
Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
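
The split-then-validate procedure described above can be sketched as follows. The helper names, the 30% test fraction, the synthetic `records`, and the trivial majority-class "model" are all assumptions made for illustration; a real classifier would replace `fit_majority_class`.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split them into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(training):
    """A deliberately trivial model: always predict the most common training class."""
    labels = [label for _, label in training]
    majority = max(set(labels), key=labels.count)
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test if model(attributes) == label)
    return correct / len(test)

# Hypothetical records: (attributes, class label).
records = [({"x": i}, "yes" if i % 4 == 0 else "no") for i in range(20)]
training, test = train_test_split(records)
model = fit_majority_class(training)
print(len(training), len(test), accuracy(model, test))
```

Because the model never sees the test records while it is being built, the reported accuracy is an estimate of how it would behave on previously unseen records.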

Classification Example
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: the training set is used to learn a classifier (the model), which is then applied to the test set.]
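
As a rough illustration of what a learned model might look like for this example, the sketch below encodes the ten training records and the six unlabeled test records, then applies a small set of decision rules. The rules in `predict_cheat` (including the 80K income threshold) are hand-written assumptions chosen to be consistent with the training records; they are not the output of any particular learning algorithm and are not given in the slides.

```python
# Training records from the example: (refund, marital status, income in K, cheat).
training = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unlabeled test records from the example: (refund, marital status, income in K).
test = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def predict_cheat(refund, marital_status, income_k):
    """Hand-written decision rules (an assumption, not learned by an algorithm)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"

# Accuracy on the training records, then predictions for the unlabeled records.
training_accuracy = sum(
    predict_cheat(r, m, i) == c for r, m, i, c in training
) / len(training)
predictions = [predict_cheat(*record) for record in test]
print(training_accuracy)
print(predictions)
```

Agreement with the training records says nothing on its own about accuracy on unseen data; that is what a labeled test set is for.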

Classification: Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:
  Use the data for a similar product introduced before.
  We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
  Collect various demographic, lifestyle, and company-interaction related information about all such customers.
  – Type of business, where they stay, how much they earn, etc.
  Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
  Use credit card transactions and information on the account holder as attributes.
  – When does a customer buy, what does he buy, how often he pays on time, etc.
  Label past transactions as fraud or fair transactions. This forms the class attribute.
  Learn a model for the class of the transactions.
  Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3
Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
  Use detailed records of transactions with each of the past and present customers to find attributes.
  – How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  Label the customers as loyal or disloyal.
  Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4
Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
  – 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
  Segment the image.
  Measure image attributes (features) - 40 of them per object.
  Model the class based on these features.
  Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
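
For continuous attributes, the Euclidean distance mentioned above can be written directly from its definition. The sketch below is illustrative only: the function names and the sample points are assumptions, and a smaller distance is read as higher similarity.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared differences of the continuous attributes."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def most_similar(target, candidates):
    """Return the candidate closest to target, i.e. the most similar point."""
    return min(candidates, key=lambda c: euclidean_distance(target, c))

# Hypothetical 2-attribute data points.
points = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
print(euclidean_distance(points[0], points[1]))
print(most_similar((2.0, 3.0), points))
```

Attributes on very different scales would dominate this measure, which is one reason problem-specific measures are sometimes preferred.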

Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
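
One simple way to check the two criteria on a given clustering is to compare the average distance between points in the same cluster with the average distance between points in different clusters. The sketch below assumes hypothetical 3-D points that already carry a cluster label; it evaluates a clustering rather than computing one.

```python
import math
from itertools import combinations

# Hypothetical 3-D points with an already assigned cluster label.
labeled_points = [
    ((0.0, 0.0, 0.0), "a"),
    ((1.0, 0.0, 0.0), "a"),
    ((0.0, 1.0, 0.0), "a"),
    ((10.0, 10.0, 10.0), "b"),
    ((11.0, 10.0, 10.0), "b"),
]

def mean_pairwise_distances(labeled):
    """Return (mean intracluster distance, mean intercluster distance)."""
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(labeled, 2):
        # Same label: intracluster pair; different labels: intercluster pair.
        (intra if label_p == label_q else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

intra, inter = mean_pairwise_distances(labeled_points)
print(intra, inter)
```

A good clustering in the sense of this slide shows a small first value and a large second value.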

Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
  Collect different attributes of customers based on their geographical and lifestyle related information.
  Find clusters of similar customers.
  Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
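
A similarity measure based on term frequencies could look like the sketch below. The slides do not name a specific measure, so cosine similarity is used here purely as an assumption; the tokenization (lowercase, split on whitespace) and the sample documents are also hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 = no shared terms)."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
docs = [
    "stock market prices fall",
    "market prices rise as stock rally",
    "team wins the final game",
]
vectors = [term_frequencies(d) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

A clustering algorithm would then group documents whose pairwise similarity is high, such as the first two documents here.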

Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).
Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
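
The table is easier to read as fractions. The short sketch below copies the counts from the table and computes the share of correctly placed articles per category and overall; the variable names are illustrative.

```python
# (total articles, correctly placed) per category, copied from the table.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

# Fraction of articles placed in the correct cluster, per category.
fractions = {
    category: correct / total for category, (total, correct) in results.items()
}
total_articles = sum(total for total, _ in results.values())
total_correct = sum(correct for _, correct in results.values())
overall = total_correct / total_articles

for category, fraction in fractions.items():
    print(f"{category}: {fraction:.1%}")
print(f"Overall: {total_correct}/{total_articles} = {overall:.1%}")
```

The category totals add up to the 3204 articles stated on the slide, and the National category is the clear outlier with 36 of 273 articles correctly placed.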

Clustering of S&P 500 Stock Data
Discovered Clusters (Industry Group)
1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Observe Stock Movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
– We used association rules to quantify a similarity measure.

Association Rule Discovery: Definition
Given a set of records each of which contain some number of items from a given collection;
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
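
How strongly the data backs each rule can be measured with support and confidence, the two standard measures for association rules (the slide itself does not name them). The sketch below computes both for the five transactions in the table; the function names are illustrative.

```python
# The five transactions from the table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(set(antecedent) | set(consequent)) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs, rhs), "confidence", confidence(lhs, rhs))
```

For {Milk} --> {Coke}, three of the five transactions contain both items and three of the four Milk transactions contain Coke; for {Diaper, Milk} --> {Beer}, the corresponding counts are two of five and two of three.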

Association Rule Discovery: Application 1
Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

Association Rule Discovery: Application 2
Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently many customers.
– Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
– A classic rule:
  If a customer buys diapers and milk, then he is very likely to buy beer.
  So, don’t be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application 3
Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
– Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Regression
Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in the statistics and neural network fields.
Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
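
The simplest linear case, one input variable fitted by ordinary least squares, can be sketched as follows. The advertising and sales figures are made-up numbers in arbitrary units, and `fit_line` is an illustrative name; the slides do not prescribe a fitting method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and sales, arbitrary units.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)
```

The fitted line is then used to predict the continuous target for values of the input that were not in the data.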

Basic Data Mining Tasks
Classification maps data into predefined groups or classes
– Supervised learning
– Pattern recognition
– Prediction
Regression is used to map a data item to a real valued prediction variable.
Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning

Basic Data Mining Tasks (cont’d)
Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization
Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.

Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior

Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
Modified from [FPSS96C]

KDD Process Ex: Web Log
Selection:
– Select log data (dates and locations) to use
Preprocessing:
– Remove identifying URLs
– Remove error logs
Transformation:
– Sessionize (sort and group)
Data Mining:
– Identify and count patterns
– Construct data structure
Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
Potential User Applications:
– Cache prediction
– Personalization
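
The web log steps above can be sketched end to end on a toy log. Everything concrete here is an assumption for illustration: the log tuple layout, treating status codes of 400 and above as error entries, treating each user's requests as a single session, and counting consecutive page pairs as the "patterns". The step of removing identifying URLs is omitted.

```python
from collections import Counter, defaultdict

# Selection: hypothetical log entries (user, timestamp, url, HTTP status).
log = [
    ("u1", 1, "/home", 200),
    ("u1", 2, "/products", 200),
    ("u1", 3, "/cart", 200),
    ("u2", 1, "/home", 200),
    ("u2", 2, "/missing", 404),
    ("u2", 3, "/products", 200),
    ("u3", 1, "/home", 200),
    ("u3", 2, "/products", 200),
]

def preprocess(entries):
    """Preprocessing: drop error entries (assumed here to be status 400 and above)."""
    return [e for e in entries if e[3] < 400]

def sessionize(entries):
    """Transformation: sort by user and time, then group each user's pages in order."""
    sessions = defaultdict(list)
    for user, _, url, _ in sorted(entries, key=lambda e: (e[0], e[1])):
        sessions[user].append(url)
    return dict(sessions)

def count_page_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions.values():
        counts.update(zip(pages, pages[1:]))
    return counts

sessions = sessionize(preprocess(log))
pair_counts = count_page_pairs(sessions)

# Interpretation/Evaluation: display the most frequently accessed sequences.
for pair, n in pair_counts.most_common(2):
    print(" -> ".join(pair), n)
```

A cache-prediction application could use the most frequent pairs to prefetch the page that usually follows the current one.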

Data Mining Development
• Similarity Measures • Hierarchical Clustering • IR Systems • Imprecise Queries • Textual Data • Web Search Engines
• Bayes Theorem • Regression Analysis • EM Algorithm • K-Means Clustering • Time Series Analysis
• Neural Networks • Decision Tree Algorithms
• Algorithm Design Techniques • Algorithm Analysis • Data Structures
• Relational Data Model • SQL • Association Rule Algorithms • Data Warehousing • Scalability Techniques

KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality

KDD Issues (cont’d)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application

Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data