Data mining exercise with SPSS Clementine
Lab 1
Winnie Lam
Email: [email protected]
Website: http://www.comp.polyu.edu.hk/~cswinnie/
The Hong Kong Polytechnic University
Department of Computing
Last update: 09/03/2006
OVERVIEW
Clementine is a data mining tool that combines advanced modeling technology with ease of use. It helps you discover and predict interesting and valuable relationships within your data.
You can use Clementine for decision-support activities such as:
• Creating customer profiles and determining customer lifetime value.
• Detecting and predicting fraud in your organization.
• Determining and predicting valuable sequences in Web-site data.
• Predicting future trends in sales and growth.
• Profiling for direct mailing response and credit risk.
• Performing churn prediction, classification and segmentation.
KDD Process
[Diagram: the KDD process. Selection yields Target Data; Preprocessing yields Preprocessed Data; Transformation yields Transformed Data; Data Mining yields Patterns; Evaluation yields Knowledge.]
Simplified KDD process
• Data Understanding: define the target & discover useful data
• Data Preparation: obtain clean & useful data
• Modeling (Data Mining): discover patterns
[Screenshot: the Clementine workspace, showing the stream canvas, node palettes, object manager, and project window.]
Learning the Nodes
Sources Record Ops Field Ops Graphs Modeling Output
NODES
Source nodes
• Database: import data using ODBC
• Variable File: free-field ASCII data
• Fixed File: fixed-field ASCII data
• SPSS File: import SPSS files
• SAS File: import files in SAS format
• User Input: replace existing source nodes
NODES
Record Operations Nodes
- make changes to the data set at the record level
• Select
• Sample
• Balance
• Aggregate
• Sort
• Merge
• Append
• Distinct
NODES
Field Operation Nodes
- for data transformation and preparation
• Type
• Filter
• Derive
• Filler
• Reclassify
• Binning
• Partition
• Set to Flag
• History
• Field Reorder
NODES
Graph Nodes
- explore & check distributions and relationships in the data
• Plot
• Multiplot
• Distribution
• Histogram
• Collection
• Web
• Evaluation
NODES
Modeling Nodes
- the heart of the DM process (machine learning)
- each method has certain strengths and is best suited for particular types of problems
• Neural Net
• C5.0
• Classification and Regression (C&R) Trees
• QUEST
• CHAID
• Kohonen
• K-Means
• TwoStep Cluster
• Apriori
• Generalized Rule Induction (GRI)
• CARMA
• Sequence Detection
• PCA/Factor Analysis
• Linear Regression
• Logistic Regression
NODES
Output Nodes
- obtain information about your data and models
- export data in various formats
• Table
• Matrix
• Analysis
• Data Audit
• Statistics
• Quality
• Report
• Set Globals
• Publisher
• Database Output
• Flat File
• SPSS Export
• SAS Export
• Excel
• SPSS Procedure
Association Tools
• Apriori discovers association rules in the data.
For large problems, Apriori is generally faster to train than GRI. It has no arbitrary limit on the number of rules that can be retained and can handle rules with up to 32 preconditions.
• GRI, Generalized Rule Induction, extracts a set of rules from the data (similar to Apriori). GRI can handle numeric as well as symbolic input fields.
• CARMA uses an association rules discovery algorithm to discover association rules in the data. The CARMA node does not require In fields or Out fields; it is equivalent to building an Apriori model with all fields set to Both.
• Sequence discovers patterns in sequential or time-oriented data. A sequence is a list of item sets that tend to occur in a predictable order.
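Outside Clementine, the core Apriori idea (frequent itemsets plus rule confidence) can be sketched in a few lines of Python. The baskets below are made up; the product codes merely echo the kind of rule the lab produces later:

```python
from itertools import combinations

# Toy transactions (made-up baskets of product codes).
transactions = [
    {"P17", "P39", "P27"},
    {"P17", "P39"},
    {"P17", "P39", "P27"},
    {"P27", "P44"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# Apriori's key pruning idea: an itemset can only be frequent if all of
# its subsets are frequent, so candidates are grown level by level.
min_support = 0.5
items = sorted({i for b in transactions for i in b})
frequent_pairs = [set(c) for c in combinations(items, 2)
                  if support(set(c), transactions) >= min_support]

# Confidence of the rule "IF P17 AND P39 THEN P27":
confidence = (support({"P17", "P39", "P27"}, transactions)
              / support({"P17", "P39"}, transactions))
print(len(frequent_pairs), round(confidence, 2))  # 3 0.67
```

A real Apriori implementation repeats the candidate-growing step for triples, quadruples, and so on; the pruning rule is what keeps that search tractable.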
Classification Tools – Decision Trees
• C5.0. This method splits the sample based on the field that provides the
maximum information gain at each level to produce either a decision tree or a ruleset. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
• C&RT. The Classification and Regression Trees method is based on minimization of impurity measures. A node is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups).
• CHAID. Chi-squared Automatic Interaction Detector uses chi-squared statistics to identify optimal splits. Target and predictor fields can be range or categorical; nodes can be split into two or more subgroups at each level.
• QUEST. The Quick, Unbiased, Efficient Statistical Tree method is quick to compute and avoids other methods’ biases in favor of predictors with many categories. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary.
Clustering Tools
• K-means. An approach to clustering that defines k clusters and iteratively assigns records to clusters based on distances from the mean of each cluster until a stable solution is found.
• TwoStep. A clustering method that involves preclustering the records into a large number of subclusters and then applying a hierarchical clustering technique to those subclusters to define the final clusters.
• Kohonen Networks. A type of neural network used for clustering. Also known as a self-organizing map (SOM).
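The K-means loop described above (assign each record to the nearest centre, then move each centre to the mean of its records) is short enough to sketch in plain Python. The points and the naive initialization are illustrative only:

```python
def kmeans(points, k, iterations=10):
    """Bare-bones k-means: assign each point to its nearest centre, then
    move each centre to the mean of its points, and repeat."""
    centers = list(points[:k])  # naive initialization, good enough for a sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its cluster (keep it if empty).
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster
                   else centers[j] for j, cluster in enumerate(clusters)]
    return centers, clusters

# Two obviously separate groups of 2-D points (made-up coordinates).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Production implementations add smarter initialization and a convergence check, but the assign/recompute cycle is the whole algorithm.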
Classification
[Diagram: samples grouped into Class 1, Class 2 and Class 3; a new sample is assigned to one of them. The classes are predefined!]
Clustering
[Diagram: the same samples grouped by discovered similarity (cross, triangle, star) into three clusters. No class is defined previously!]
Practical Session
Data Understanding
Data Description:
• Total no. of records : ? (find out by yourself)
Data file: http://www.comp.polyu.edu.hk/~cswinnie/lab/2005-6_sem2_lab1/MyData_lab1.csv
Attributes:
TID       Transaction ID
dt        Date
Discount  Discount offered? Y/N
Group     Product Group
ref_no    Internal Ref no.
prod_cd   Product Code
Step 1: Import Data to Clementine
Data Understanding
Add Node: Var. File (in Sources Palette)
Double-click the node, then Browse to the data file.
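For intuition, this is essentially what the Var. File node does: parse a delimited text file into named fields. The rows below are made up in the same shape as MyData_lab1.csv; the real file is read the same way once downloaded:

```python
import csv
import io

# A few made-up rows mimicking the shape of MyData_lab1.csv.
sample = io.StringIO(
    "TID,dt,Discount,Group,ref_no,prod_cd\n"
    "1,2006-01-02 09:15,Y,G1,1001,P17\n"
    "2,2006-01-02 10:40,N,G2,1002,P39\n"
)
rows = list(csv.DictReader(sample))
print(len(rows), rows[0]["prod_cd"])  # 2 P17
```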
Data Understanding
Step 2: Analyze the data
Add Node: Table (in Output Palette)
Right-click the node and choose Execute.
Data Understanding
Step 2: Analyze the data
Add Node: Data Audit (in Output Palette)
Execute
Data Understanding
Step 2: Analyze the data
Add Node: Quality (in Output Palette)
Execute
Data Preparation
Data Preparation
Edit Node: Var. File (in Stream)
Goal: Define data type and value
1. Double-click the node to edit it.
2. Re-define the Type of Group and ref_no to "Set", then press "Read Values" again.
Data Preparation
Edit Node: Var. File (in Stream)
Goal: Define blanks
Add Node: Filler (in Field Ops Palette)
Goal: Replace all blanks with a specified value
Data Preparation
Result
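Outside Clementine, the Filler step amounts to overwriting blank values in one field with a chosen marker. A minimal Python sketch with made-up records ("" stands for a blank, "-1" is the replacement value assumed here):

```python
# Made-up records; "" stands for a blank Discount value.
records = [{"Discount": "Y"}, {"Discount": ""}, {"Discount": "N"}]

# What the Filler node does: replace blanks in one field with a set value.
for record in records:
    if record["Discount"] == "":
        record["Discount"] = "-1"

print([r["Discount"] for r in records])  # ['Y', '-1', 'N']
```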
Add Node: Type (in Field Ops Palette)
Goal: Remove records with blanks
Data Preparation
In the node's settings, choose "-1" and delete it.
Q: How many records are left?
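Conceptually, removing the flagged records is a simple filter. A Python sketch with made-up records, assuming "-1" is the blank marker from the previous step:

```python
# Made-up records in which "-1" marks a blank filled in earlier.
records = [{"Discount": "Y"}, {"Discount": "-1"},
           {"Discount": "N"}, {"Discount": "-1"}]

# Keep only records whose Discount field is not the blank marker.
kept = [r for r in records if r["Discount"] != "-1"]
print(len(records), "->", len(kept))  # 4 -> 2
```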
Add Node: Reclassify (in Field Ops Palette)
Goal: Replace invalid values
Data Preparation
Modify the invalid values to a common set of new values (Y/N).
Data Transformation
Derive New Fields
Useful Node: Derive (in Field Ops Palette)
Weekday: datetime_weekday(dt)
Hour: datetime_hour(dt)
Goal: Add new attributes “Weekday” and “Hour”
For weekday, 0 represents Sunday, 1 represents Monday, etc.
Q: How many fields in your data?
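The same two derived fields can be computed in Python. Note the convention difference: Python's datetime counts 0 = Monday, so a small shift reproduces Clementine's 0 = Sunday numbering (the timestamp below is an illustrative value):

```python
from datetime import datetime

dt = datetime(2006, 3, 5, 14, 30)  # an illustrative timestamp (a Sunday)

# isoweekday() gives Mon=1..Sun=7; "% 7" shifts this to the 0 = Sunday
# convention that datetime_weekday(dt) uses.
weekday = dt.isoweekday() % 7
hour = dt.hour  # same value datetime_hour(dt) would derive

print(weekday, hour)  # 0 14
```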
Discretization
Goal: Divide the Hour field into 4 intervals
Useful Node: Binning (in Field Ops Palette)
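Assuming fixed-width bins (one of the strategies a binning step can use), dividing the 0-23 Hour field into 4 intervals means 6-hour-wide bins. A Python sketch:

```python
def bin_hour(hour, n_bins=4):
    """Fixed-width binning of the 0-23 Hour field into n_bins intervals."""
    width = 24 / n_bins  # 6-hour-wide bins when n_bins is 4
    return min(int(hour // width), n_bins - 1)

print([bin_hour(h) for h in (0, 5, 6, 13, 23)])  # [0, 0, 1, 2, 3]
```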
Preprocessed Data
Data Mining
Data Mining
Add Node: Type (in Field Ops Palette)
Goal: Update the type and value of data
Association
Association
Add Nodes: SetToFlag (in Field Ops Palette)
Goal: Convert the transactional format to tabular format
Select all values
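The transactional-to-tabular conversion turns one row per purchased item into one row per transaction with a T/F flag per product. A Python sketch with made-up transaction and product IDs:

```python
# Transactional format: one (transaction, product) row per item bought.
transactions = [("T1", "P17"), ("T1", "P39"), ("T2", "P27")]

products = sorted({p for _, p in transactions})
tabular = {}
for tid, prod in transactions:
    # Start every transaction with all flags "F", then set "T" per item.
    row = tabular.setdefault(tid, {p: "F" for p in products})
    row[prod] = "T"

print(tabular["T1"])  # {'P17': 'T', 'P27': 'F', 'P39': 'T'}
```

The resulting flag columns are exactly what an Apriori node consumes.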
Association
Add Nodes: Apriori (in Modeling Palette)
Goal: Perform association with Apriori
Association
Goal: View the mining result
Association Rules:
For the 1st rule: IF P17 AND P39 THEN P27
Right-click and choose Browse.
Classification
Classification
Choose the Inputs and Target
Add Nodes: C5.0 (in Modeling Palette)
Classification
Goal: View the mining result
Classification Rules:
Right-click and choose Browse.
Classification
Goal: Find out the classification accuracy
Drag the classification result to the stream
Add Nodes: Classification result (in Model) and Analysis (in Output Palette)
Right-click and choose Execute.
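The accuracy figure the Analysis node reports is just the fraction of records where the predicted class matches the actual one. A Python sketch with hypothetical predictions:

```python
# Hypothetical actual vs. predicted class labels for five records.
actual    = ["Y", "N", "Y", "Y", "N"]
predicted = ["Y", "N", "N", "Y", "N"]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{correct}/{len(actual)} correct = {accuracy:.0%}")  # 4/5 correct = 80%
```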
Clustering
Clustering
Choose the Inputs
Add Nodes: K-means (in Modeling Palette)
Set k= 3
Clustering
Goal: View the mining result
Clustering result:
Right-click and choose Browse.
Q&A Session
SUMMARY
Today, you've learnt:
• the KDD process
• the differences between nodes
• how to build streams in Clementine
• how to do data preparation with Clementine
• Association modeling
• Classification modeling
• Clustering modeling