DM PROJECT

SUBMITTED BY:

ARUN KUMAR DASH

PARAS SHAH

DIVYA RAJASRI TADI

NIREESHA MANDALA

DATA MINING FINAL PROJECT

DATA MINING FINAL PROJECT FALL 2014

INTRODUCTION

Data mining aims to extract patterns and knowledge from a data set and to convert them into an easy-to-understand structure that can be reused later. The data is not only analyzed: it is pre-processed, its complexities are taken into consideration, and the discovered structures are post-processed. Data mining is carried out on large quantities of data in order to extract previously unknown patterns, such as groups of similar records (cluster analysis) or exceptional records (anomaly detection). These patterns can then be used in machine learning and predictive analysis. Steps such as data collection, data preparation, result interpretation, and reporting accompany data mining, but they belong to the overall knowledge discovery in databases (KDD) process.

Data mining is used in almost every field, including business, science and engineering, medicine, and the analysis of visual, music, sensor, temporal, and spatial data, to name a few. In some cases and contexts, the ways in which data mining is used raise questions about privacy, legality, and ethics. Data mining requires data preparation, which can uncover patterns that compromise confidentiality. This mainly happens during data aggregation, where data is combined for analysis in a way that may affect the privacy of individual records.

We used a census data set. The prediction task is to determine whether a person's income exceeds $50K per year.

Number of instances: 32561

Number of attributes: 15

age, fnlwgt, workclass, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, salary

Attribute Information:

Listing of attributes:

salary: >50K, <=50K.

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

DATA PRE-PROCESSING:

The first step in data mining is to choose a data set with a large number of records so that unknown patterns can be identified. At the same time, the data set should not be so large that finding a pattern takes an excessive amount of time, so selecting an appropriate data set is very important. Data sets are usually drawn from a data warehouse. The main aim of data pre-processing is to evaluate multivariate data sets before data mining begins. The data set is then cleaned to remove records that contain noise or missing data.
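The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are taken to be marked with "?" (a "?" value appears among the workclass categories in the Weka output later in this report), and the three sample records are hypothetical, not rows from the census file.

```python
import csv
import io

MISSING = "?"  # assumption: missing values are marked with "?"


def drop_incomplete(rows):
    """Keep only the records in which no field is marked as missing."""
    return [row for row in rows
            if MISSING not in (value.strip() for value in row)]


# Hypothetical records: age, workclass, education-num, hours-per-week, salary.
sample = io.StringIO(
    "39, State-gov, 13, 40, <=50K\n"
    "54, ?, 10, 60, >50K\n"
    "28, Private, 13, 40, <=50K\n"
)
rows = list(csv.reader(sample))
clean = drop_incomplete(rows)
print(len(rows), "records read,", len(clean), "kept")
```

Dropping records is the simplest policy. Whether it is appropriate depends on how many records would be lost; treating "?" as a category of its own is an alternative.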

DATA MINING:

Data mining is broadly divided into six categories of tasks:

Anomaly detection:

This task identifies unusual records, such as outliers, deviations, or errors. These records need further investigation.

Association rule learning:

This task searches for relationships among the variables. A very common example is the beer-and-diapers story, in which customers shopping on Fridays buy beer along with diapers. Based on this association, the store can place the two items together to increase sales. This is also known as market basket analysis or dependency modelling.

Clustering:

This task determines groups of records that are similar in some respect, without using previously known structures in the data.

Classification:

This task generalizes known structure so that it can be applied to new data. For example, an e-mail program might attempt to classify e-mails as spam or genuine.

Regression:

This task tries to find a function that models the data with the least error.

Summarization:

This task provides a more compact representation of the data set, including visualization and report generation.

Validation of results:

Data mining can also produce results that appear significant but do not predict future behavior and cannot be reproduced on a new sample of data. Such results often stem from improper statistical hypothesis testing. A simple version of this problem is known as overfitting, where a data mining algorithm finds patterns in the training set that are not present in the general data set.
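A common way to check for overfitting is to hold out part of the data and evaluate the model only on records it has not seen. The sketch below illustrates a 66% / 34% holdout split like the one used in the classifier runs in this report. The function name, the seed, and the rounding of the training size to the nearest integer are assumptions made for the illustration; that rounding does reproduce the 11071 test instances reported in the evaluation later on.

```python
import random


def holdout_split(records, train_fraction=0.66, seed=1):
    """Shuffle the records and split them into a training and a test part."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 32561 census records: only the count matters here.
records = list(range(32561))
train, test = holdout_split(records)
print(len(train), "training records,", len(test), "test records")
```

A pattern that holds on the training part but not on the test part is a sign that the model has fitted noise rather than structure.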

WEKA (WAIKATO ENVIRONMENT FOR KNOWLEDGE ANALYSIS)

Weka consists of a collection of visualization tools and algorithms for data analysis and predictive modeling, along with graphical user interfaces that give easy access to this functionality. It is used mainly for education and research. Weka is:

Freely available under the GNU General Public License.

Portable, because it is implemented entirely in Java and can run on almost any computing platform.

Equipped with an extensive collection of data preprocessing and modeling techniques.

Easy to use, thanks to its GUI.

Weka supports various data mining tasks, such as data preprocessing, clustering, classification, regression, visualization, and feature selection. Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, in which each data point is described by a fixed number of attributes. Weka also provides access to SQL databases through Java Database Connectivity (JDBC) and can process the result returned by a database query.

NAÏVE BAYES SIMPLE

All attributes: We ran Naïve Bayes Simple on all 15 attributes in our data set and obtained the following results.

The accuracy of the predictive model is 83.69%.

Few attributes: We then considered six attributes, namely age, workclass, education, education-num, hours-per-week, and salary, and obtained the following results.

The accuracy of the predictive model is 79.20%.

The following results were obtained after removing the education attribute from the data set.

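The accuracy of the predictive model is now 80.00%, compared with 79.20% when the education attribute was present. This suggests that education contributes little once education-num is included, likely because education-num already carries the same information in numeric form, and that removing education from the data set yields a slightly more accurate result.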

=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayesSimple

Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R5-weka.filters.unsupervised.attribute.Remove-R3

Instances: 32561

Attributes: 5

age

workclass

education-num

hours-per-week

salary

Test mode: split 66.0% train, remainder test

=== Classifier model (full training set) ===

Naive Bayes (simple)

Class <=50K: P(C) = 0.75917452

Attribute age

Mean: 36.78373786 Standard Deviation: 14.02008849

Attribute workclass

State-gov         0.03825468
Self-emp-not-inc  0.07351692
Private           0.71713373
Federal-gov       0.02385863
Local-gov         0.05972745
?                 0.06656153
Self-emp-inc      0.02001698
Without-pay       0.00060658
Never-worked      0.00032351

Attribute education-num

Mean: 9.59506472 Standard Deviation: 2.43614679

Attribute hours-per-week

Mean: 38.84021036 Standard Deviation: 12.31899464

Class >50K: P(C) = 0.24082548

Attribute age

Mean: 44.24984058 Standard Deviation: 10.51902772

Attribute workclass

State-gov         0.04509554
Self-emp-not-inc  0.09235669
Private           0.63235669
Federal-gov       0.04738854
Local-gov         0.07872611
?                 0.0244586
Self-emp-inc      0.07936306
Without-pay       0.00012739
Never-worked      0.00012739

Attribute education-num

Mean: 11.61165668 Standard Deviation: 2.38512863

Attribute hours-per-week

Mean: 45.4730264 Standard Deviation: 11.01297093

Time taken to build model: 0.06 seconds
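The model printed above is a set of class priors, a mean and standard deviation per numeric attribute, and a probability per workclass value. The sketch below shows how such parameters could be combined into a prediction. It is an illustration, not Weka's implementation: it assumes a normal density for each numeric attribute, copies only two of the nine workclass values to stay short, and scores two hypothetical people who are not taken from the data set.

```python
import math

# Parameters copied from the classifier model printed above.
# Numeric attributes are stored as (mean, standard deviation).
MODEL = {
    "<=50K": {
        "prior": 0.75917452,
        "numeric": {
            "age": (36.78373786, 14.02008849),
            "education-num": (9.59506472, 2.43614679),
            "hours-per-week": (38.84021036, 12.31899464),
        },
        # Only two workclass values are copied, to keep the sketch short.
        "workclass": {"Private": 0.71713373, "Self-emp-inc": 0.02001698},
    },
    ">50K": {
        "prior": 0.24082548,
        "numeric": {
            "age": (44.24984058, 10.51902772),
            "education-num": (11.61165668, 2.38512863),
            "hours-per-week": (45.4730264, 11.01297093),
        },
        "workclass": {"Private": 0.63235669, "Self-emp-inc": 0.07936306},
    },
}


def normal_density(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (std * math.sqrt(2 * math.pi)))


def posterior(record):
    """Return P(class | record), treating the attributes as independent."""
    scores = {}
    for label, params in MODEL.items():
        score = params["prior"] * params["workclass"][record["workclass"]]
        for name, (mean, std) in params["numeric"].items():
            score *= normal_density(record[name], mean, std)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Two hypothetical people, not taken from the data set.
older = {"age": 45, "workclass": "Private",
         "education-num": 13, "hours-per-week": 50}
younger = {"age": 22, "workclass": "Private",
           "education-num": 9, "hours-per-week": 30}
for person in (older, younger):
    print(person, posterior(person))
```

The predicted class is the one with the larger posterior. Because the attributes are multiplied as if they were independent, two attributes that carry the same information are effectively counted twice.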

=== Evaluation on test split ===

=== Summary ===

Correctly Classified Instances 8857 80.0018 %

Incorrectly Classified Instances 2214 19.9982 %

Kappa statistic 0.3666

Mean absolute error 0.2737

Root mean squared error 0.3704

Relative absolute error 75.1356 %

Root relative squared error 87.3416 %

Total Number of Instances 11071

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.923    0.601    0.833      0.923   0.876      0.82      <=50K
               0.399    0.077    0.615      0.399   0.484      0.82      >50K
Weighted Avg.  0.8      0.478    0.782      0.8     0.784      0.82

=== Confusion Matrix ===

a b <-- classified as

7820 650 | a = <=50K

1564 1037 | b = >50K
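The summary figures above can be recomputed from the confusion matrix alone. The sketch below does this for accuracy, the kappa statistic, and the per-class precision and recall. The function name is chosen for this illustration, and the kappa formula used is the standard Cohen's kappa, which agrees with the reported value.

```python
def evaluate(matrix):
    """matrix[i][j] = instances of actual class i classified as class j."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    accuracy = correct / total

    row_totals = [sum(row) for row in matrix]        # actual class counts
    col_totals = [sum(col) for col in zip(*matrix)]  # predicted class counts

    # Agreement expected by chance, from the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    recall = [matrix[i][i] / row_totals[i] for i in range(n_classes)]
    precision = [matrix[i][i] / col_totals[i] for i in range(n_classes)]
    return accuracy, kappa, precision, recall


# Rows: actual <=50K, actual >50K. Columns: classified as <=50K, as >50K.
matrix = [[7820, 650], [1564, 1037]]
accuracy, kappa, precision, recall = evaluate(matrix)
print(f"accuracy={accuracy:.4%} kappa={kappa:.4f}")
print("precision:", [round(p, 3) for p in precision])
print("recall:   ", [round(r, 3) for r in recall])
```

Although overall accuracy is about 80%, the recall for the >50K class is only about 0.399, so the model misses most of the high-income records. The modest kappa value reflects this imbalance.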

APRIORI:

Apriori is an algorithm for frequent item set mining and association rule

learning over transactional databases. Apriori proceeds by identifying the

frequent individual items in the database and extending them to larger and

larger item sets as long as those item sets appear sufficiently often in the

database.
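The level-wise procedure described above can be sketched as follows. This is a minimal illustration rather than Weka's implementation: the function name and the toy transactions are hypothetical, and no attempt is made at efficiency.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset that is frequent enough."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions, in the spirit of the beer and diapers example.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
]
result = apriori(baskets, min_support=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before they are counted.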

TOTAL NUMBER OF RULES

OUTPUT:

=== Run information ===

Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.1 -D 0.05 -U 1.0 -M 0.1 -S 1.0 -V -c -1

Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R5-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last

Instances: 32561

Attributes: 5

age

workclass

education-num

hours-per-week

salary

=== Associator model (full training set) ===

Apriori

=======

Minimum support: 0.1 (3256 instances)

Minimum metric <confidence>: 0.1

Significance level: 1

Number of cycles performed: 14

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Large Itemsets L(1):

age='(-inf-24.3]' 5570

age='(24.3-31.6]' 5890

age='(31.6-38.9]' 6048

age='(38.9-46.2]' 6163

age='(46.2-53.5]' 3967

workclass= Private 22696

education-num='(8.5-10]' 17792

education-num='(11.5-13]' 6422

hours-per-week='(30.4-40.2]' 17735

hours-per-week='(40.2-50]' 5938

salary= <=50K 24720

salary= >50K 7841

Size of set of large itemsets L(2): 25

Large Itemsets L(2):

age='(-inf-24.3]' workclass= Private 4404

age='(-inf-24.3]' education-num='(8.5-10]' 3637

age='(-inf-24.3]' salary= <=50K 5509

age='(24.3-31.6]' workclass= Private 4617

age='(24.3-31.6]' hours-per-week='(30.4-40.2]' 3517

age='(24.3-31.6]' salary= <=50K 5086

age='(31.6-38.9]' workclass= Private 4426

age='(31.6-38.9]' education-num='(8.5-10]' 3283

age='(31.6-38.9]' hours-per-week='(30.4-40.2]' 3371

age='(31.6-38.9]' salary= <=50K 4371

age='(38.9-46.2]' workclass= Private 4127

age='(38.9-46.2]' hours-per-week='(30.4-40.2]' 3481

age='(38.9-46.2]' salary= <=50K 3934

workclass= Private education-num='(8.5-10]' 12874

workclass= Private education-num='(11.5-13]' 4280

workclass= Private hours-per-week='(30.4-40.2]' 12849

workclass= Private hours-per-week='(40.2-50]' 4270

workclass= Private salary= <=50K 17733

workclass= Private salary= >50K 4963

education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177

education-num='(8.5-10]' salary= <=50K 14730

education-num='(11.5-13]' salary= <=50K 3936

hours-per-week='(30.4-40.2]' salary= <=50K 14103

hours-per-week='(30.4-40.2]' salary= >50K 3632

hours-per-week='(40.2-50]' salary= <=50K 3586

Size of set of large itemsets L(3): 7

Large Itemsets L(3):

age='(-inf-24.3]' workclass= Private salary= <=50K 4360

age='(-inf-24.3]' education-num='(8.5-10]' salary= <=50K 3605

age='(24.3-31.6]' workclass= Private salary= <=50K 4023

workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597

workclass= Private education-num='(8.5-10]' salary= <=50K 10832

workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524

education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574

Size of set of large itemsets L(4): 1

Large Itemsets L(4):

workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 6513

BEST RULES FOUND:

1. age='(-inf-24.3]' education-num='(8.5-10]' 3637 ==> salary= <=50K 3605 conf:(0.99)

lift:(1.31) lev:(0.03) [843] conv:(26.54)

2. age='(-inf-24.3]' workclass= Private 4404 ==> salary= <=50K 4360 conf:(0.99) lift:(1.3)

lev:(0.03) [1016] conv:(23.57)

3. age='(-inf-24.3]' 5570 ==> salary= <=50K 5509 conf:(0.99) lift:(1.3) lev:(0.04) [1280]

conv:(21.63)

4. age='(24.3-31.6]' workclass= Private 4617 ==> salary= <=50K 4023 conf:(0.87) lift:(1.15)

lev:(0.02) [517] conv:(1.87)

5. age='(24.3-31.6]' 5890 ==> salary= <=50K 5086 conf:(0.86) lift:(1.14) lev:(0.02) [614]

conv:(1.76)

6. workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597 ==>

salary= <=50K 6513 conf:(0.86) lift:(1.13) lev:(0.02) [745] conv:(1.69)

7. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==> salary= <=50K 8574

conf:(0.84) lift:(1.11) lev:(0.03) [847] conv:(1.53)

8. workclass= Private education-num='(8.5-10]' 12874 ==> salary= <=50K 10832 conf:(0.84)

lift:(1.11) lev:(0.03) [1058] conv:(1.52)

9. education-num='(8.5-10]' 17792 ==> salary= <=50K 14730 conf:(0.83) lift:(1.09) lev:(0.04)

[1222] conv:(1.4)

10. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> salary= <=50K 10524

conf:(0.82) lift:(1.08) lev:(0.02) [769] conv:(1.33)

11. hours-per-week='(30.4-40.2]' 17735 ==> salary= <=50K 14103 conf:(0.8) lift:(1.05)

lev:(0.02) [638] conv:(1.18)

12. age='(-inf-24.3]' salary= <=50K 5509 ==> workclass= Private 4360 conf:(0.79) lift:(1.14) lev:(0.02) [520] conv:(1.45)

13. age='(24.3-31.6]' salary= <=50K 5086 ==> workclass= Private 4023 conf:(0.79) lift:(1.13) lev:(0.01) [477] conv:(1.45)

14. age='(-inf-24.3]' 5570 ==>workclass= Private 4404 conf:(0.79) lift:(1.13) lev:(0.02) [521]

conv:(1.45)

15. age='(24.3-31.6]' 5890 ==>workclass= Private 4617 conf:(0.78) lift:(1.12) lev:(0.02) [511]

conv:(1.4)

16. age='(-inf-24.3]' 5570 ==>workclass= Private salary= <=50K 4360 conf:(0.78) lift:(1.44)

lev:(0.04) [1326] conv:(2.09)

17. workclass= Private 22696 ==> salary= <=50K 17733 conf:(0.78) lift:(1.03) lev:(0.02)

[502] conv:(1.1)

18. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574

==>workclass= Private 6513 conf:(0.76) lift:(1.09) lev:(0.02) [536] conv:(1.26)

19. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==>workclass= Private 7597

conf:(0.75) lift:(1.07) lev:(0.02) [503] conv:(1.19)

20. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==>workclass= Private 10524

conf:(0.75) lift:(1.07) lev:(0.02) [693] conv:(1.19)

21. education-num='(8.5-10]' salary= <=50K 14730 ==>workclass= Private 10832 conf:(0.74)

lift:(1.06) lev:(0.02) [564] conv:(1.14)

22. age='(31.6-38.9]' 6048 ==>workclass= Private 4426 conf:(0.73) lift:(1.05) lev:(0.01) [210]

conv:(1.13)

23. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private 12849 conf:(0.72) lift:(1.04)

lev:(0.01) [487] conv:(1.1)

24. education-num='(8.5-10]' 17792 ==>workclass= Private 12874 conf:(0.72) lift:(1.04)

lev:(0.01) [472] conv:(1.1)

25. age='(31.6-38.9]' 6048 ==> salary= <=50K 4371 conf:(0.72) lift:(0.95) lev:(-0.01) [-220]

conv:(0.87)

26. hours-per-week='(40.2-50]' 5938 ==>workclass= Private 4270 conf:(0.72) lift:(1.03)

lev:(0) [131] conv:(1.08)

27. salary= <=50K 24720 ==>workclass= Private 17733 conf:(0.72) lift:(1.03) lev:(0.02) [502]

conv:(1.07)

28. age='(24.3-31.6]' 5890 ==>workclass= Private salary= <=50K 4023 conf:(0.68) lift:(1.25)

lev:(0.03) [815] conv:(1.44)

29. age='(38.9-46.2]' 6163 ==> workclass= Private 4127 conf:(0.67) lift:(0.96) lev:(-0.01) [-168] conv:(0.92)

30. education-num='(11.5-13]' 6422 ==> workclass= Private 4280 conf:(0.67) lift:(0.96) lev:(-0.01) [-196] conv:(0.91)

31. age='(-inf-24.3]' salary= <=50K 5509 ==> education-num='(8.5-10]' 3605 conf:(0.65)

lift:(1.2) lev:(0.02) [594] conv:(1.31)

32. age='(-inf-24.3]' 5570 ==> education-num='(8.5-10]' 3637 conf:(0.65) lift:(1.19) lev:(0.02)

[593] conv:(1.31)

33. age='(-inf-24.3]' 5570 ==> education-num='(8.5-10]' salary= <=50K 3605 conf:(0.65)

lift:(1.43) lev:(0.03) [1085] conv:(1.55)

34. education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177 ==>workclass= Private

salary= <=50K 6513 conf:(0.64) lift:(1.18) lev:(0.03) [970] conv:(1.26)

35. age='(38.9-46.2]' 6163 ==> salary= <=50K 3934 conf:(0.64) lift:(0.84) lev:(-0.02) [-744]

conv:(0.67)

36. salary= >50K 7841 ==>workclass= Private 4963 conf:(0.63) lift:(0.91) lev:(-0.02) [-502]

conv:(0.83)

37. workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524 ==> education-num='(8.5-10]' 6513 conf:(0.62) lift:(1.13) lev:(0.02) [762] conv:(1.19)

38. education-num='(11.5-13]' 6422 ==> salary= <=50K 3936 conf:(0.61) lift:(0.81) lev:(-0.03)

[-939] conv:(0.62)

39. workclass= Private salary= <=50K 17733 ==> education-num='(8.5-10]' 10832

conf:(0.61) lift:(1.12) lev:(0.04) [1142] conv:(1.17)

40. education-num='(8.5-10]' 17792 ==>workclass= Private salary= <=50K 10832 conf:(0.61)

lift:(1.12) lev:(0.04) [1142] conv:(1.16)

41. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==> education-num='(8.5-10]' 8574

conf:(0.61) lift:(1.11) lev:(0.03) [867] conv:(1.16)

42. hours-per-week='(40.2-50]' 5938 ==> salary= <=50K 3586 conf:(0.6) lift:(0.8) lev:(-0.03)

[-922] conv:(0.61)

43. workclass= Private education-num='(8.5-10]' salary= <=50K 10832 ==> hours-per-week='(30.4-40.2]' 6513 conf:(0.6) lift:(1.1) lev:(0.02) [613] conv:(1.14)

44. age='(24.3-31.6]' 5890 ==> hours-per-week='(30.4-40.2]' 3517 conf:(0.6) lift:(1.1)

lev:(0.01) [308] conv:(1.13)

45. salary= <=50K 24720 ==> education-num='(8.5-10]' 14730 conf:(0.6) lift:(1.09) lev:(0.04)

[1222] conv:(1.12)

46. workclass= Private salary= <=50K 17733 ==> hours-per-week='(30.4-40.2]' 10524

conf:(0.59) lift:(1.09) lev:(0.03) [865] conv:(1.12)

47. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private salary= <=50K 10524

conf:(0.59) lift:(1.09) lev:(0.03) [865] conv:(1.12)

48. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> education-num='(8.5-10]'

7597 conf:(0.59) lift:(1.08) lev:(0.02) [576] conv:(1.11)

49. workclass= Private education-num='(8.5-10]' 12874 ==> hours-per-week='(30.4-40.2]'

7597 conf:(0.59) lift:(1.08) lev:(0.02) [584] conv:(1.11)

50. education-num='(8.5-10]' salary= <=50K 14730 ==> hours-per-week='(30.4-40.2]' 8574

conf:(0.58) lift:(1.07) lev:(0.02) [551] conv:(1.09)

51. hours-per-week='(30.4-40.2]' 17735 ==> education-num='(8.5-10]' 10177 conf:(0.57)

lift:(1.05) lev:(0.01) [486] conv:(1.06)

52. education-num='(8.5-10]' 17792 ==> hours-per-week='(30.4-40.2]' 10177 conf:(0.57)

lift:(1.05) lev:(0.01) [486] conv:(1.06)

53. salary= <=50K 24720 ==> hours-per-week='(30.4-40.2]' 14103 conf:(0.57) lift:(1.05)

lev:(0.02) [638] conv:(1.06)

54. workclass= Private 22696 ==> education-num='(8.5-10]' 12874 conf:(0.57) lift:(1.04)

lev:(0.01) [472] conv:(1.05)

55. workclass= Private 22696 ==> hours-per-week='(30.4-40.2]' 12849 conf:(0.57) lift:(1.04)

lev:(0.01) [487] conv:(1.05)

56. age='(38.9-46.2]' 6163 ==> hours-per-week='(30.4-40.2]' 3481 conf:(0.56) lift:(1.04)

lev:(0) [124] conv:(1.05)

57. age='(31.6-38.9]' 6048 ==> hours-per-week='(30.4-40.2]' 3371 conf:(0.56) lift:(1.02)

lev:(0) [76] conv:(1.03)

58. age='(31.6-38.9]' 6048 ==> education-num='(8.5-10]' 3283 conf:(0.54) lift:(0.99) lev:(0) [-21] conv:(0.99)

59. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> education-num='(8.5-10]'

salary= <=50K 6513 conf:(0.51) lift:(1.12) lev:(0.02) [700] conv:(1.11)

60. workclass= Private education-num='(8.5-10]' 12874 ==> hours-per-week='(30.4-40.2]'

salary= <=50K 6513 conf:(0.51) lift:(1.17) lev:(0.03) [936] conv:(1.15)

61. hours-per-week='(30.4-40.2]' 17735 ==> education-num='(8.5-10]' salary= <=50K 8574

conf:(0.48) lift:(1.07) lev:(0.02) [551] conv:(1.06)

62. education-num='(8.5-10]' 17792 ==> hours-per-week='(30.4-40.2]' salary= <=50K 8574

conf:(0.48) lift:(1.11) lev:(0.03) [867] conv:(1.09)

63. workclass= Private 22696 ==> education-num='(8.5-10]' salary= <=50K 10832

conf:(0.48) lift:(1.06) lev:(0.02) [564] conv:(1.05)

64. workclass= Private 22696 ==> hours-per-week='(30.4-40.2]' salary= <=50K 10524

conf:(0.46) lift:(1.07) lev:(0.02) [693] conv:(1.06)

65. salary= >50K 7841 ==> hours-per-week='(30.4-40.2]' 3632 conf:(0.46) lift:(0.85) lev:(-0.02) [-638] conv:(0.85)

66. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==> workclass= Private education-num='(8.5-10]' 6513 conf:(0.46) lift:(1.17) lev:(0.03) [936] conv:(1.12)

67. education-num='(8.5-10]' salary= <=50K 14730 ==> workclass= Private hours-per-week='(30.4-40.2]' 6513 conf:(0.44) lift:(1.12) lev:(0.02) [700] conv:(1.09)

68. salary= <=50K 24720 ==>workclass= Private education-num='(8.5-10]' 10832 conf:(0.44)

lift:(1.11) lev:(0.03) [1058] conv:(1.08)

69. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private education-num='(8.5-10]' 7597

conf:(0.43) lift:(1.08) lev:(0.02) [584] conv:(1.06)

70. education-num='(8.5-10]' 17792 ==>workclass= Private hours-per-week='(30.4-40.2]' 7597

conf:(0.43) lift:(1.08) lev:(0.02) [576] conv:(1.06)

71. salary= <=50K 24720 ==>workclass= Private hours-per-week='(30.4-40.2]' 10524

conf:(0.43) lift:(1.08) lev:(0.02) [769] conv:(1.05)

72. workclass= Private salary= <=50K 17733 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 6513 conf:(0.37) lift:(1.18) lev:(0.03) [970] conv:(1.09)

73. hours-per-week='(30.4-40.2]' 17735 ==>workclass= Private education-num='(8.5-10]'

salary= <=50K 6513 conf:(0.37) lift:(1.1) lev:(0.02) [613] conv:(1.05)

74. education-num='(8.5-10]' 17792 ==>workclass= Private hours-per-week='(30.4-40.2]'

salary= <=50K 6513 conf:(0.37) lift:(1.13) lev:(0.02) [762] conv:(1.07)

75. salary= <=50K 24720 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 8574

conf:(0.35) lift:(1.11) lev:(0.03) [847] conv:(1.05)

76. workclass= Private 22696 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]'

7597 conf:(0.33) lift:(1.07) lev:(0.02) [503] conv:(1.03)

77. workclass= Private 22696 ==> education-num='(8.5-10]' hours-per-week='(30.4-40.2]'

salary= <=50K 6513 conf:(0.29) lift:(1.09) lev:(0.02) [536] conv:(1.03)

78. salary= <=50K 24720 ==> workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 6513 conf:(0.26) lift:(1.13) lev:(0.02) [745] conv:(1.04)

79. workclass= Private salary= <=50K 17733 ==> age='(-inf-24.3]' 4360 conf:(0.25) lift:(1.44)

lev:(0.04) [1326] conv:(1.1)

80. education-num='(8.5-10]' salary= <=50K 14730 ==> age='(-inf-24.3]' 3605 conf:(0.24)

lift:(1.43) lev:(0.03) [1085] conv:(1.1)

81. workclass= Private salary= <=50K 17733 ==> age='(24.3-31.6]' 4023 conf:(0.23)

lift:(1.25) lev:(0.03) [815] conv:(1.06)

82. salary= <=50K 24720 ==> age='(-inf-24.3]' 5509 conf:(0.22) lift:(1.3) lev:(0.04) [1280]

conv:(1.07)

83. workclass= Private 22696 ==> salary= >50K 4963 conf:(0.22) lift:(0.91) lev:(-0.02) [-502]

conv:(0.97)

84. salary= <=50K 24720 ==> age='(24.3-31.6]' 5086 conf:(0.21) lift:(1.14) lev:(0.02) [614]

conv:(1.03)

85. hours-per-week='(30.4-40.2]' 17735 ==> salary= >50K 3632 conf:(0.2) lift:(0.85) lev:(-0.02) [-638] conv:(0.95)

86. education-num='(8.5-10]' 17792 ==> age='(-inf-24.3]' 3637 conf:(0.2) lift:(1.19) lev:(0.02)

[593] conv:(1.04)

87. workclass= Private 22696 ==> age='(24.3-31.6]' 4617 conf:(0.2) lift:(1.12) lev:(0.02) [511]

conv:(1.03)

88. education-num='(8.5-10]' 17792 ==> age='(-inf-24.3]' salary= <=50K 3605 conf:(0.2)

lift:(1.2) lev:(0.02) [594] conv:(1.04)

89. hours-per-week='(30.4-40.2]' 17735 ==> age='(24.3-31.6]' 3517 conf:(0.2) lift:(1.1)

lev:(0.01) [308] conv:(1.02)

90. hours-per-week='(30.4-40.2]' 17735 ==> age='(38.9-46.2]' 3481 conf:(0.2) lift:(1.04)

lev:(0) [124] conv:(1.01)

91. workclass= Private 22696 ==> age='(31.6-38.9]' 4426 conf:(0.2) lift:(1.05) lev:(0.01) [210]

conv:(1.01)

92. workclass= Private 22696 ==> age='(-inf-24.3]' 4404 conf:(0.19) lift:(1.13) lev:(0.02) [521]

conv:(1.03)

93. workclass= Private 22696 ==> age='(-inf-24.3]' salary= <=50K 4360 conf:(0.19) lift:(1.14)

lev:(0.02) [520] conv:(1.03)

94. hours-per-week='(30.4-40.2]' 17735 ==> age='(31.6-38.9]' 3371 conf:(0.19) lift:(1.02)

lev:(0) [76] conv:(1.01)

95. workclass= Private 22696 ==> education-num='(11.5-13]' 4280 conf:(0.19) lift:(0.96)

lev:(-0.01) [-196] conv:(0.99)

96. workclass= Private 22696 ==> hours-per-week='(40.2-50]' 4270 conf:(0.19) lift:(1.03)

lev:(0) [131] conv:(1.01)

97. education-num='(8.5-10]' 17792 ==> age='(31.6-38.9]' 3283 conf:(0.18) lift:(0.99) lev:(0) [-21] conv:(1)

98. workclass= Private 22696 ==> age='(38.9-46.2]' 4127 conf:(0.18) lift:(0.96) lev:(-0.01) [-168] conv:(0.99)

99. workclass= Private 22696 ==> age='(24.3-31.6]' salary= <=50K 4023 conf:(0.18)

lift:(1.13) lev:(0.01) [477] conv:(1.03)

100. salary= <=50K 24720 ==> age='(31.6-38.9]' 4371 conf:(0.18) lift:(0.95) lev:(-0.01) [-220]

conv:(0.99)

101. salary= <=50K 24720 ==> age='(-inf-24.3]' workclass= Private 4360 conf:(0.18) lift:(1.3)

lev:(0.03) [1016] conv:(1.05)

102. salary= <=50K 24720 ==> age='(24.3-31.6]' workclass= Private 4023 conf:(0.16)

lift:(1.15) lev:(0.02) [517] conv:(1.02)

103. salary= <=50K 24720 ==> education-num='(11.5-13]' 3936 conf:(0.16) lift:(0.81) lev:(-0.03) [-939] conv:(0.95)

104. salary= <=50K 24720 ==> age='(38.9-46.2]' 3934 conf:(0.16) lift:(0.84) lev:(-0.02) [-744]

conv:(0.96)

105. salary= <=50K 24720 ==> age='(-inf-24.3]' education-num='(8.5-10]' 3605 conf:(0.15)

lift:(1.31) lev:(0.03) [843] conv:(1.04)

106. salary= <=50K 24720 ==> hours-per-week='(40.2-50]' 3586 conf:(0.15) lift:(0.8) lev:(-0.03) [-922] conv:(0.96)
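The four metrics attached to each rule (conf, lift, lev, conv) can be recomputed from the itemset counts. The sketch below does so for rule 1, using the counts printed in the output: 32561 instances, 3637 matching the premise, 24720 matching the consequence, and 3605 matching both. The function name is hypothetical. The conviction formula adds 1 to the number of counter-examples; this is an assumption inferred from the printed values, since without it rule 1 would give about 27.37 instead of the reported 26.54.

```python
def rule_metrics(n_total, n_premise, n_consequence, n_both):
    """Confidence, lift, leverage and conviction of a rule premise ==> consequence."""
    confidence = n_both / n_premise
    # Lift: confidence relative to how common the consequence is overall.
    lift = confidence / (n_consequence / n_total)
    # Leverage: observed joint frequency minus the frequency expected
    # if premise and consequence were independent.
    leverage = (n_both / n_total
                - (n_premise / n_total) * (n_consequence / n_total))
    # Conviction, with 1 added to the counter-example count (assumption
    # inferred from the values printed in this report).
    conviction = (n_premise * (n_total - n_consequence)
                  / ((n_premise - n_both + 1) * n_total))
    return confidence, lift, leverage, conviction


# Rule 1: age='(-inf-24.3]' education-num='(8.5-10]' ==> salary= <=50K
N = 32561
conf, lift, lev, conv = rule_metrics(N, 3637, 24720, 3605)
print(f"conf:({conf:.2f}) lift:({lift:.2f}) lev:({lev:.2f}) "
      f"[{int(lev * N)}] conv:({conv:.2f})")
```

A lift close to 1, as in many of the rules above, means the premise tells little about the consequence beyond its base rate, even when the confidence looks high. Because about 76% of all records have salary <=50K, any rule predicting that class starts from a high confidence.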

Changed minMetric = 0.9 and lowerBoundMinSupport = 0.1

=== Run information ===

Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S 1.0 -V -c -1

Relation: income-weka.filters.unsupervised.attribute.Remove-R6,8-12,14-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R5-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last

Instances: 32561

Attributes: 5

age

workclass

education-num

hours-per-week

salary

=== Associator model (full training set) ===

Apriori

=======

Minimum support: 0.1 (3256 instances)

Minimum metric <confidence>: 0.9

Significance level: 1

Number of cycles performed: 14

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Large Itemsets L(1):

age='(-inf-24.3]' 5570

age='(24.3-31.6]' 5890

age='(31.6-38.9]' 6048

age='(38.9-46.2]' 6163

age='(46.2-53.5]' 3967

workclass= Private 22696

education-num='(8.5-10]' 17792

education-num='(11.5-13]' 6422

hours-per-week='(30.4-40.2]' 17735

hours-per-week='(40.2-50]' 5938

salary= <=50K 24720

salary= >50K 7841

Size of set of large itemsets L(2): 25

Large Itemsets L(2):

age='(-inf-24.3]' workclass= Private 4404

age='(-inf-24.3]' education-num='(8.5-10]' 3637

age='(-inf-24.3]' salary= <=50K 5509

age='(24.3-31.6]' workclass= Private 4617

age='(24.3-31.6]' hours-per-week='(30.4-40.2]' 3517

age='(24.3-31.6]' salary= <=50K 5086

age='(31.6-38.9]' workclass= Private 4426

age='(31.6-38.9]' education-num='(8.5-10]' 3283

age='(31.6-38.9]' hours-per-week='(30.4-40.2]' 3371

age='(31.6-38.9]' salary= <=50K 4371

age='(38.9-46.2]' workclass= Private 4127

age='(38.9-46.2]' hours-per-week='(30.4-40.2]' 3481

age='(38.9-46.2]' salary= <=50K 3934

workclass= Private education-num='(8.5-10]' 12874

workclass= Private education-num='(11.5-13]' 4280

workclass= Private hours-per-week='(30.4-40.2]' 12849

workclass= Private hours-per-week='(40.2-50]' 4270

workclass= Private salary= <=50K 17733

workclass= Private salary= >50K 4963

education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177

education-num='(8.5-10]' salary= <=50K 14730

education-num='(11.5-13]' salary= <=50K 3936

hours-per-week='(30.4-40.2]' salary= <=50K 14103

hours-per-week='(30.4-40.2]' salary= >50K 3632

hours-per-week='(40.2-50]' salary= <=50K 3586

Size of set of large itemsets L(3): 7

Large Itemsets L(3):

age='(-inf-24.3]' workclass= Private salary= <=50K 4360

age='(-inf-24.3]' education-num='(8.5-10]' salary= <=50K 3605

age='(24.3-31.6]' workclass= Private salary= <=50K 4023

workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 7597

workclass= Private education-num='(8.5-10]' salary= <=50K 10832

workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524

education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 8574

Size of set of large itemsets L(4): 1

Large Itemsets L(4):

workclass= Private education-num='(8.5-10]' hours-per-week='(30.4-40.2]' salary= <=50K 6513

Best rules found:

1. age='(-inf-24.3]' education-num='(8.5-10]' 3637 ==> salary= <=50K 3605 conf:(0.99)

lift:(1.31) lev:(0.03) [843] conv:(26.54)

2. age='(-inf-24.3]' workclass= Private 4404 ==> salary= <=50K 4360 conf:(0.99) lift:(1.3)

lev:(0.03) [1016] conv:(23.57)

3. age='(-inf-24.3]' 5570 ==> salary= <=50K 5509 conf:(0.99) lift:(1.3) lev:(0.04) [1280]

conv:(21.63)

Changed lowerBoundMinSupport = 0.3 and minMetric = 0.65

=== Run information ===

Scheme: weka.associations.Apriori -I -R -N 200 -T 0 -C 0.65 -D 0.05 -U 1.0 -M 0.3 -S 1.0 -V -c -1

Relation: income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-weka.filters.unsupervised.attribute.Remove-R3-4,6-12,14-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last

Instances: 32561

Attributes: 5

age

workclass

education-num

hours-per-week

salary

=== Associator model (full training set) ===

Apriori

=======

Minimum support: 0.3 (9768 instances)

Minimum metric <confidence>: 0.65

Significance level: 1

Number of cycles performed: 10

Generated sets of large itemsets:

Size of set of large itemsetsL(1): 4

Large ItemsetsL(1):

workclass= Private 22696

education-num='(8.5-10]' 17792

hours-per-week='(30.4-40.2]' 17735

salary= <=50K 24720

Size of set of large itemsetsL(2): 6

Large ItemsetsL(2):

workclass= Private education-num='(8.5-10]' 12874

workclass= Private hours-per-week='(30.4-40.2]' 12849

workclass= Private salary= <=50K 17733

education-num='(8.5-10]' hours-per-week='(30.4-40.2]' 10177

education-num='(8.5-10]' salary= <=50K 14730

hours-per-week='(30.4-40.2]' salary= <=50K 14103

Size of set of large itemsetsL(3): 2

Large ItemsetsL(3):

workclass= Private education-num='(8.5-10]' salary= <=50K 10832

workclass= Private hours-per-week='(30.4-40.2]' salary= <=50K 10524

Best rules found:

1. workclass= Private education-num='(8.5-10]' 12874 ==> salary= <=50K 10832 conf:(0.84) lift:(1.11) lev:(0.03) [1058] conv:(1.52)

2. education-num='(8.5-10]' 17792 ==> salary= <=50K 14730 conf:(0.83) lift:(1.09) lev:(0.04) [1222] conv:(1.4)

3. workclass= Private hours-per-week='(30.4-40.2]' 12849 ==> salary= <=50K 10524 conf:(0.82) lift:(1.08) lev:(0.02) [769] conv:(1.33)

4. hours-per-week='(30.4-40.2]' 17735 ==> salary= <=50K 14103 conf:(0.8) lift:(1.05) lev:(0.02) [638] conv:(1.18)

5. workclass= Private 22696 ==> salary= <=50K 17733 conf:(0.78) lift:(1.03) lev:(0.02) [502] conv:(1.1)

6. hours-per-week='(30.4-40.2]' salary= <=50K 14103 ==> workclass= Private 10524 conf:(0.75) lift:(1.07) lev:(0.02) [693] conv:(1.19)

7. education-num='(8.5-10]' salary= <=50K 14730 ==> workclass= Private 10832 conf:(0.74) lift:(1.06) lev:(0.02) [564] conv:(1.14)

8. hours-per-week='(30.4-40.2]' 17735 ==> workclass= Private 12849 conf:(0.72) lift:(1.04) lev:(0.01) [487] conv:(1.1)

9. education-num='(8.5-10]' 17792 ==> workclass= Private 12874 conf:(0.72) lift:(1.04) lev:(0.01) [472] conv:(1.1)

10. salary= <=50K 24720 ==> workclass= Private 17733 conf:(0.72) lift:(1.03) lev:(0.02) [502] conv:(1.07)
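The four metrics printed with each rule can be recomputed from the counts in the output. The sketch below does this for rule 1 of this run. The function name and argument names are our own, and the `+ 1` in the conviction denominator is an assumption inferred from the printed values rather than something the output states.

```python
def rule_metrics(n, premise, consequent, both):
    """Recompute rule metrics from raw counts.

    n          -- total number of instances
    premise    -- instances matching the left-hand side
    consequent -- instances matching the right-hand side
    both       -- instances matching both sides
    """
    confidence = both / premise
    lift = confidence / (consequent / n)
    leverage = both / n - (premise / n) * (consequent / n)
    # Assumption: the "+ 1" reproduces the conviction values printed above.
    conviction = (premise * (n - consequent) / n) / (premise - both + 1)
    return confidence, lift, leverage, conviction


# Rule 1: workclass= Private, education-num='(8.5-10]' (12874)
#         ==> salary= <=50K (10832); salary= <=50K alone covers 24720.
conf, lift, lev, conv = rule_metrics(32561, 12874, 24720, 10832)
print(round(conf, 2), round(lift, 2), round(lev, 2), round(conv, 2))
print(int(lev * 32561))  # the bracketed count printed after lev
```

A lift only slightly above 1 means the rule predicts salary <=50K little better than the base rate of 24720/32561, which is why these high-support rules are less informative than the age-based rules of the earlier run.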

J48:

J48 is Weka's implementation of the C4.5 decision-tree algorithm, and the trees it generates can be used for classification. At each node it picks an attribute and uses it to make a decision, splitting the data into smaller subsets.
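How an attribute is picked for a split can be illustrated with the gain-ratio criterion that C4.5 is known for. The sketch below is a minimal illustration on four made-up records, not J48's code; the function names and the toy values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(attribute, labels):
    """Information gain of splitting on `attribute`, divided by split info."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): four records with two candidate attributes.
salary = ["<=50K", "<=50K", ">50K", ">50K"]
relationship = ["Own-child", "Own-child", "Husband", "Husband"]
sex = ["Male", "Female", "Male", "Female"]

print(gain_ratio(relationship, salary))  # separates the classes perfectly
print(gain_ratio(sex, salary))           # carries no information here
```

On this toy data the tree would split on relationship first, because it yields the higher gain ratio.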

WITH ALL ATTRIBUTES

SELECTED ATTRIBUTES

K-MEANS CLUSTERING:

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.
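The assign-then-update loop behind this definition can be sketched as follows. This is a minimal illustration for numeric points only, under two assumptions of ours: the first k points are used as initial centroids (so the result is deterministic), and squared Euclidean distance is the measure. It is not the SimpleKMeans implementation.

```python
def kmeans(points, k, max_iter=100):
    """Minimal k-means sketch: returns (centroids, assignments)."""
    # Assumption: initialise with the first k points for reproducibility.
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The within-cluster sum of squared errors reported in the output below is the quantity this loop tries to reduce: the total squared distance from each point to its assigned centroid.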

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10

Relation: income-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last

Instances: 32561

Attributes: 15

age

workclass

fnlwgt

education

education-num

martial-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

salary

Test mode: split 66% train, remainder test

=== Model and evaluation on training set ===

kMeans

======

Number of iterations: 4

Within cluster sum of squared errors: 146940.0

Missing values globally replaced with mean/mode

Cluster centroids:

Cluster#

Attribute       Full Data           0                   1                   2                   3                   4
                (32561)             (7137)              (7920)              (9177)              (4102)              (4225)
==================================================================================================================================
age             '(38.9-46.2]'       '(46.2-53.5]'       '(-inf-24.3]'       '(38.9-46.2]'       '(24.3-31.6]'       '(24.3-31.6]'
workclass       Private             Private             Private             Private             Private             Private
fnlwgt          '(159527-306769]'   '(-inf-159527]'     '(-inf-159527]'     '(159527-306769]'   '(159527-306769]'   '(159527-306769]'
education       HS-grad             Bachelors           Some-college        HS-grad             Bachelors           HS-grad
education-num   '(8.5-10]'          '(11.5-13]'         '(8.5-10]'          '(8.5-10]'          '(11.5-13]'         '(8.5-10]'
martial-status  Married-civ-spouse  Married-civ-spouse  Never-married       Married-civ-spouse  Never-married       Never-married
occupation      Prof-specialty      Prof-specialty      Other-service       Craft-repair        Prof-specialty      Adm-clerical
relationship    Husband             Husband             Own-child           Husband             Not-in-family       Unmarried
race            White               White               White               White               White               White
sex             Male                Male                Male                Male                Female              Female
capital-gain    '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'
capital-loss    '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'
hours-per-week  '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'
native-country  United-States       United-States       United-States       United-States       United-States       United-States
salary          <=50K               >50K                <=50K               <=50K               <=50K               <=50K

Time taken to build model (full training data) : 3.13 seconds

=== Model and evaluation on test split ===

kMeans

======

Number of iterations: 4

Within cluster sum of squared errors: 103681.0

Missing values globally replaced with mean/mode

Cluster centroids:

Cluster#

Attribute       Full Data           0                   1                   2                   3                   4
                (21490)             (11082)             (3995)              (3024)              (2584)              (805)
==================================================================================================================================
age             '(38.9-46.2]'       '(31.6-38.9]'       '(-inf-24.3]'       '(24.3-31.6]'       '(24.3-31.6]'       '(-inf-24.3]'
workclass       Private             Private             Private             Private             Private             Private
fnlwgt          '(159527-306769]'   '(159527-306769]'   '(159527-306769]'   '(159527-306769]'   '(159527-306769]'   '(159527-306769]'
education       HS-grad             HS-grad             Some-college        HS-grad             Bachelors           Some-college
education-num   '(8.5-10]'          '(8.5-10]'          '(8.5-10]'          '(8.5-10]'          '(11.5-13]'         '(8.5-10]'
martial-status  Married-civ-spouse  Married-civ-spouse  Never-married       Never-married       Never-married       Never-married
occupation      Craft-repair        Craft-repair        Other-service       Adm-clerical        Prof-specialty      Other-service
relationship    Husband             Husband             Own-child           Not-in-family       Not-in-family       Not-in-family
race            White               White               White               White               White               White
sex             Male                Male                Female              Female              Female              Female
capital-gain    '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'     '(-inf-9999.9]'
capital-loss    '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'      '(-inf-435.6]'
hours-per-week  '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(30.4-40.2]'       '(20.6-30.4]'
native-country  United-States       United-States       United-States       United-States       United-States       United-States
salary          <=50K               <=50K               <=50K               <=50K               <=50K               <=50K

Time taken to build model (percentage split) : 1.99 seconds

Clustered Instances

0 5618 ( 51%)

1 2119 ( 19%)

2 1569 ( 14%)

3 1358 ( 12%)

4 407 ( 4%)
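As a quick consistency check, the clustered-instance counts should add up to the held-out part of the 66% split, and the percentages should follow from those counts. The short sketch below redoes that arithmetic with the numbers printed above; the variable names are our own.

```python
total = 32561                 # instances in the full data set
train = round(total * 0.66)   # size of the 66% training portion
test = total - train          # remainder used for evaluation

# Clustered Instances as printed for clusters 0 to 4.
sizes = [5618, 2119, 1569, 1358, 407]

# Share of the test instances that falls in each cluster, in percent.
percents = [round(100 * s / test) for s in sizes]

print(train, test, sum(sizes))
print(percents)
```

The training size matches the Full Data count of the test-split model, and the five cluster sizes sum to the number of test instances.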

SAS ENTERPRISE MINER

Cluster Analysis:

This analysis attempts to find natural groupings of observations in the data, based on a set of input variables. After grouping the observations into clusters, you can use the input variables to characterize each group. Once the clusters have been identified and interpreted, you can decide whether to treat each cluster independently. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify the data preprocessing and model parameters until the result achieves the desired properties.

We built clusters to group similar records in our dataset, changing the following properties of the cluster node:

Cluster variable role = Segment

Specification Method = User Specify

Maximum Number of Clusters = 5

From the pie chart of cluster segments, we can observe that cluster 1 contains a large share of the data set.

SAS DECISION TREES

Node Rules:

*------------------------------------------------------------*

Node = 11

*------------------------------------------------------------*

if Relationship IS ONE OF: NOT-IN-FAMILY, OWN-CHILD, UNMARRIED, OTHER-RELATIVE or MISSING

AND Hours-Per-Week >= 35.5 or MISSING

AND Capital-Gain >= 7073.59

then

Tree Node Identifier = 11

Number of Observations = 285

Predicted: Salary=>50K = 0.99

Predicted: Salary=<=50K = 0.01

*------------------------------------------------------------*

Node = 13

*------------------------------------------------------------*

if Relationship IS ONE OF: HUSBAND, WIFE

AND Education-Num < 12.5 or MISSING

AND Capital-Gain >= 5095.5

then

Tree Node Identifier = 13

Number of Observations = 522

Predicted: Salary=>50K = 0.98

Predicted: Salary=<=50K = 0.02

*------------------------------------------------------------*

Node = 15

*------------------------------------------------------------*

if Relationship IS ONE OF: HUSBAND, WIFE

AND Education-Num >= 12.5

AND Capital-Gain >= 5095.5

then

Tree Node Identifier = 15

Number of Observations = 678

Predicted: Salary=>50K = 1.00

Predicted: Salary=<=50K = 0.00

*------------------------------------------------------------*

Node = 161

*------------------------------------------------------------*

if Workclass IS ONE OF: STATE-GOV, SELF-EMP-NOT-INC

AND Relationship IS ONE OF: HUSBAND, WIFE

AND Occupation IS ONE OF: ADM-CLERICAL, EXEC-MANAGERIAL, PROF-SPECIALTY, SALES, TECH-SUPPORT,

PROTECTIVE-SERV

AND Hours-Per-Week >= 37.5 or MISSING

AND Education-Num < 12.5 AND Education-Num >= 9.5 or MISSING

AND Capital-Loss < 1512 or MISSING

AND Capital-Gain < 5095.5 or MISSING

AND Age >= 33.5 or MISSING

then

Tree Node Identifier = 161

Number of Observations = 157

Predicted: Salary=>50K = 0.39

Predicted: Salary=<=50K = 0.61
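Node rules of this kind translate directly into conditional code. The sketch below encodes only nodes 11, 13 and 15 from the listing above as a Python function; the function name, the dictionary keys and the convention of using `None` for a missing value are our own assumptions, and records that reach any other leaf simply return `None`.

```python
OTHER_RELATIONSHIPS = {"NOT-IN-FAMILY", "OWN-CHILD", "UNMARRIED", "OTHER-RELATIVE"}
SPOUSES = {"HUSBAND", "WIFE"}


def high_gain_leaf(record):
    """Return (node id, predicted P(Salary >50K)) for nodes 11, 13 and 15.

    `record` is a dict with hypothetical keys; None stands for MISSING.
    Returns None when none of the three encoded rules applies.
    """
    rel = record.get("relationship")
    rel = rel.upper() if rel is not None else None
    hours = record.get("hours_per_week")
    edu = record.get("education_num")
    gain = record.get("capital_gain")
    if gain is None:
        return None  # all three rules require a known capital gain

    # Node 11: not a spouse (or missing), hours >= 35.5 (or missing), gain >= 7073.59
    if ((rel is None or rel in OTHER_RELATIONSHIPS)
            and (hours is None or hours >= 35.5)
            and gain >= 7073.59):
        return 11, 0.99

    if rel in SPOUSES and gain >= 5095.5:
        # Node 13: education-num < 12.5 or missing; node 15: >= 12.5
        if edu is None or edu < 12.5:
            return 13, 0.98
        return 15, 1.00

    return None


print(high_gain_leaf({"relationship": "Husband", "education_num": 13, "capital_gain": 7688}))
print(high_gain_leaf({"relationship": "Own-child", "hours_per_week": 40, "capital_gain": 8000}))
```

All three leaves share a high capital gain, which is what drives the predicted probability of Salary >50K close to 1.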

Fit Statistics

Target=Salary Target Label=' '

Fit Statistics    Statistics Label               Train
_NOBS_            Sum of Frequencies          32561.00
_MISC_            Misclassification Rate          0.14
_MAX_             Maximum Absolute Error          1.00
_SSE_             Sum of Squared Errors        6368.80
_ASE_             Average Squared Error           0.10
_RASE_            Root Average Squared Error      0.31
_DIV_             Divisor for ASE             65122.00
_DFT_             Total Degrees of Freedom    32561.00
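Several of these statistics are tied together by simple arithmetic, which the sketch below checks using the printed values. The relations themselves (the divisor being twice the number of observations for a two-level target, ASE as SSE over the divisor, RASE as its square root) are read off from the numbers in the table and are an assumption on our part, not a statement of how SAS computes them internally.

```python
from math import sqrt

nobs = 32561.00      # _NOBS_
sse = 6368.80        # _SSE_
divisor = 65122.00   # _DIV_
misc = 0.14          # _MISC_

# Assumed relations, consistent with the printed values.
ase = sse / divisor          # average squared error
rase = sqrt(ase)             # root average squared error
accuracy = 1 - misc          # share of training records classified correctly

print(divisor == 2 * nobs)
print(round(ase, 2), round(rase, 2), round(accuracy, 2))
```

A misclassification rate of 0.14 corresponds to about 86% of the training records being classified correctly by the SAS decision tree.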

CONCLUSION

1. If a person is 24 or younger (age bin (-inf-24.3]) and education-num lies in (8.5-10], which corresponds to HS-grad or Some-college, the income is very likely to be <=50K (rule confidence 0.99).

2. If a person is 24 or younger and works for a private firm, the income is very likely to be <=50K (rule confidence 0.99).

3. If a person is 24 or younger, the income is very likely to be <=50K (rule confidence 0.99).

4. 90% of females have a salary of <=50K, whereas 60% of males have a salary of <=50K.

5. 95% of the population in the Other-service occupation category have a salary of <=50K.

6. 95% of the population in the 23 to 24 age group have a salary of <=50K.

We made a couple of runs of the J48 classifier and found the following five attributes to be important in predicting a person's income: education-num, age, workclass and hours-per-week, together with the class attribute salary. These columns provide an accuracy of 80% in predicting the income.