
Data preparation: Selection, Preprocessing, and Transformation


Page 1: Data preparation: Selection, Preprocessing, and Transformation

Literature: I.H. Witten and E. Frank, Data Mining, chapter 2 and chapter 7

Page 2: Fayyad’s KDD Methodology

The KDD process: data → target data → processed data → transformed data → patterns → knowledge, via the steps Selection → Preprocessing & cleaning → Transformation & feature selection → Data Mining → Interpretation / Evaluation

Page 3: Contents

Data Selection
Data Preprocessing
Data Transformation

Page 4: Data Selection

Goal: understanding the data
Explore the data:
possible attributes
their values
distribution, outliers

Page 5: Getting to know the data

Simple visualization tools are very useful for identifying problems:
Nominal attributes: histograms (distribution consistent with background knowledge?)
Numeric attributes: graphs (any obvious outliers?)
2-D and 3-D visualizations show dependencies
Domain experts need to be consulted
Too much data to inspect? Take a sample!
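A minimal sketch (not from the slides) of this kind of first look, using pandas and matplotlib; the file weather.csv and its columns outlook and temperature are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weather.csv")  # hypothetical file

# Nominal attribute: histogram of value counts.
df["outlook"].value_counts().plot(kind="bar", title="outlook")
plt.show()

# Numeric attribute: plot values against row index to spot obvious outliers.
plt.figure()
plt.scatter(range(len(df)), df["temperature"])
plt.title("temperature")
plt.show()

# Too much data to inspect? Take a random sample.
sample = df.sample(n=min(1000, len(df)), random_state=0)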

Page 6: Data preprocessing

Problem: different data sources (e.g. sales department, customer billing department, …)
Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
Data must be assembled, integrated, cleaned up
“Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Critical: type and level of data aggregation

Page 7: Data preprocessing

Choose data structure (table, tree, or set of tables)
Choose attributes with enough information
Decide on a first representation of the attributes (numeric or nominal)
Decide on missing values
Decide on inaccurate data (cleansing)

Page 8: Attribute types used in practice

Most schemes accommodate just two levels of measurement: nominal and ordinal
Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
But: “enumerated” and “discrete” imply order
Special case: dichotomy (“boolean” attribute)
Ordinal attributes are called “numeric” or “continuous”
But: “continuous” implies mathematical continuity

Page 9: The ARFF format

% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
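ARFF files can also be read outside WEKA; a minimal sketch using scipy's ARFF reader, assuming the listing above is saved as weather.arff (scipy covers simple ARFF files like this one):

from scipy.io import arff

# Parse the ARFF file: returns a structured numpy array plus attribute metadata.
data, meta = arff.loadarff("weather.arff")

print(meta)                 # attribute names and declared types
print(data["temperature"])  # numeric column
print(data["outlook"])      # nominal column (returned as byte strings)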

Page 10: Attribute types

ARFF supports numeric and nominal attributes
Interpretation depends on learning scheme
Numeric attributes are interpreted as:
ordinal scales if less-than and greater-than are used
ratio scales if distance calculations are performed (normalization/standardization may be required)
Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise); see the sketch below
Integers: nominal, ordinal, or ratio scale?
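A minimal sketch (my illustration, not the book's code) of such a mixed-attribute distance, assuming numeric attributes have already been normalized to [0, 1]:

def mixed_distance(x, y, nominal):
    """Sum per-attribute distances: 0/1 for nominal values,
    absolute difference for (normalized) numeric values."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if i in nominal:
            total += 0.0 if a == b else 1.0  # nominal: 0 if equal, 1 otherwise
        else:
            total += abs(a - b)              # numeric: assumes values scaled to [0, 1]
    return total

# Attributes (outlook, temperature, humidity, windy); indices 0 and 3 are nominal.
d = mixed_distance(("sunny", 0.85, 0.85, "false"),
                   ("overcast", 0.83, 0.86, "false"),
                   nominal={0, 3})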

Page 11: Nominal vs. ordinal

Attribute “age” nominal:
If age = young and astigmatic = no and tear production rate = normal
then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal
then recommendation = soft

Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”):
If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal
then recommendation = soft

Page 12: Missing values

Frequently indicated by out-of-range entries
Types: unknown, unrecorded, irrelevant
Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible
Missing value may have significance in itself (e.g. missing test in a medical examination)
Most schemes assume that is not the case: “missing” may need to be coded as additional value (sketched below)
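A minimal pandas sketch (my illustration) of coding missing values as an explicit extra value; the column name is hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({"test_result": ["high", np.nan, "low", np.nan]})

# Treat "missing" as an informative value of its own rather than a gap.
df["test_result"] = df["test_result"].fillna("missing")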

Page 13: Inaccurate values

Reason: data has not been collected for mining it
Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)
Typographical errors in nominal attributes: values need to be checked for consistency
Typographical and measurement errors in numeric attributes: outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data

Page 14: Transformation: Attribute selection

Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5’s performance
Problem: attribute selection based on smaller and smaller amounts of data
IBL is also very susceptible to irrelevant attributes: number of training instances required increases exponentially with number of irrelevant attributes
Naïve Bayes doesn’t have this problem
Relevant attributes can also be harmful
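A quick way to see the effect for yourself (a sketch using scikit-learn's decision tree as a stand-in for C4.5, on the iris data):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Baseline accuracy vs. accuracy with 20 purely random attributes appended.
base = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean()
X_noisy = np.hstack([X, rng.random((X.shape[0], 20))])
noisy = cross_val_score(DecisionTreeClassifier(random_state=0), X_noisy, y, cv=10).mean()
print(base, noisy)  # the noisy score is typically (not always) lower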

Page 15: Scheme-independent selection

Filter approach: assessment based on general characteristics of the data
One method: find subset of attributes that is enough to separate all the instances
Another method: use different learning scheme (e.g. C4.5, 1R) to select attributes
IBL-based attribute weighting techniques can also be used (but can’t find redundant attributes)
CFS: uses correlation-based evaluation of subsets
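A minimal filter-style sketch (my illustration, and not CFS itself): rank attributes by mutual information with the class, independently of the eventual learner, and keep the top-scoring ones:

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each attribute against the class labels; higher = more informative.
scores = mutual_info_classif(X, y, random_state=0)
ranked = sorted(enumerate(scores), key=lambda t: -t[1])
print(ranked)  # (attribute index, score) pairs; a filter keeps the top subset

Unlike CFS, this scores attributes one at a time, so it cannot detect redundancy between them.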

Page 16: Attribute subsets for weather data

Page 17: Searching the attribute space

Number of possible attribute subsets is exponential in the number of attributes
Common greedy approaches: forward selection and backward elimination (a skeleton is sketched below)
More sophisticated strategies:
Bidirectional search
Best-first search: can find the optimum solution
Beam search: approximation to best-first search
Genetic algorithms
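A minimal greedy forward-selection skeleton (my illustration); evaluate is any subset-quality measure, filter-style or wrapper-style, and must accept the empty set:

def forward_selection(attributes, evaluate):
    """Repeatedly add the single attribute that most improves
    evaluate(subset); stop when no addition helps."""
    selected = set()
    best = evaluate(selected)
    improved = True
    while improved:
        improved = False
        best_attr = None
        for a in attributes - selected:
            score = evaluate(selected | {a})
            if score > best:
                best, best_attr, improved = score, a, True
        if improved:
            selected.add(best_attr)
    return selected

Backward elimination is the mirror image: start from the full set and greedily drop the attribute whose removal helps most.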

Page 18: Scheme-specific selection

Wrapper approach: attribute selection implemented as wrapper around learning scheme
Evaluation criterion: cross-validation performance
Time consuming: adds factor k² even for greedy approaches with k attributes
Linearity in k requires prior ranking of attributes
Scheme-specific attribute selection essential for learning decision tables
Can be done efficiently for DTs and Naïve Bayes (see the sketch below)
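A wrapper sketch using scikit-learn (my illustration): forward selection where each candidate subset is scored by cross-validating the very scheme we intend to train, here Naïve Bayes:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Forward selection scored by 10-fold cross-validation of the target learner.
sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=2,
                                direction="forward", cv=10)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of selected attributes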

Page 19: Discretizing numeric attributes

Can be used to avoid making normality assumption in Naïve Bayes and clustering
Simple discretization scheme is used in 1R
C4.5 performs local discretization
Global discretization can be advantageous because it’s based on more data:
Learner can be applied to discretized attribute, or
it can be applied to binary attributes coding the cut points in the discretized attribute

Page 20: Unsupervised discretization

Unsupervised discretization generates intervals without looking at class labels
The only possible way when clustering
Two main strategies (sketched below):
Equal-interval binning
Equal-frequency binning (also called histogram equalization)
Inferior to supervised schemes in classification tasks
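A minimal numpy sketch of the two strategies, using the temperature values from the weather data:

import numpy as np

values = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
k = 3

# Equal-interval binning: split the value range into k equal-width bins.
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal-frequency binning: choose edges so each bin holds about the same count.
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))

# Bin index per value, using the interior edges only.
print(np.digitize(values, width_edges[1:-1]))
print(np.digitize(values, freq_edges[1:-1]))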

Page 21: Entropy-based discretization

Supervised method that builds a decision tree with pre-pruning on the attribute being discretized:
Entropy used as splitting criterion
MDLP used as stopping criterion
State-of-the-art discretization method
Application of MDLP:
“Theory” is the splitting point (log2(N−1) bits) plus class distribution in each subset
DL before/after adding splitting point is compared
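A minimal sketch (my illustration, not the book's code) of the inner step: finding the single cut point that minimizes the weighted class entropy of the two subsets. Recursing on both halves, with the MDLP test of the next slides as stopping criterion, yields the full method:

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_cut(values, labels):
    """Return (cut point, weighted entropy) of the best binary split."""
    order = np.argsort(values)
    v, c = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(v)
    best_point, best_e = None, float("inf")
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue  # candidate cuts lie between distinct values only
        e = (i * entropy(c[:i]) + (n - i) * entropy(c[i:])) / n
        if e < best_e:
            best_point, best_e = (v[i - 1] + v[i]) / 2, e
    return best_point, best_e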

Page 22: Example: temperature attribute

Page 23: Formula for MDLP

N instances, k classes, and entropy E in original set
k1 classes and entropy E1 in first subset
k2 classes and entropy E2 in second subset

The split is accepted when the information gain exceeds the MDL cost:

$$\mathrm{gain} > \frac{\log_2(N-1)}{N} + \frac{\log_2(3^k - 2) - kE + k_1 E_1 + k_2 E_2}{N}$$

Doesn’t result in any discretization intervals for the temperature attribute
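A one-function sketch (my illustration) of checking this criterion numerically:

import math

def mdlp_accepts(N, gain, k, E, k1, E1, k2, E2):
    """Accept the split iff the gain exceeds the MDL cost of encoding
    the split point plus the per-subset class information."""
    delta = math.log2(3**k - 2) - (k * E - k1 * E1 - k2 * E2)
    return gain > (math.log2(N - 1) + delta) / N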

Page 24: Other discretization methods

Top-down procedure can be replaced by bottom-up method
MDLP can be replaced by chi-squared test
Dynamic programming can be used to find optimum k-way split for given additive criterion:
Requires time quadratic in number of instances if entropy is used as criterion
Can be done in linear time if error rate is used as evaluation criterion

Page 25: Transformation

WEKA provides a lot of filters that can help you transform and select your attributes!
Use them to build a promising model for the caravan data!