Data preparation: Selection, Preprocessing, and Transformation

Literature: I.H. Witten and E. Frank, Data Mining, chapter 2 and chapter 7
Fayyad’s KDD Methodology

[Diagram: data → target data → processed data → transformed data → patterns → knowledge, via the steps Selection, Preprocessing & cleaning, Transformation & feature selection, Data Mining, and Interpretation/Evaluation]
Contents

- Data Selection
- Data Preprocessing
- Data Transformation
Data Selection

Goal: understanding the data. Explore the data:
- possible attributes
- their values
- distribution, outliers
Getting to know the data

- Simple visualization tools are very useful for identifying problems
  - Nominal attributes: histograms (distribution consistent with background knowledge?)
  - Numeric attributes: graphs (any obvious outliers?)
- 2-D and 3-D visualizations show dependencies
- Domain experts need to be consulted
- Too much data to inspect? Take a sample!
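As a minimal illustration of inspecting a nominal attribute’s distribution, the sketch below prints a text histogram using only Python’s standard library; the `outlook` values are made-up stand-ins for a real data column.

```python
from collections import Counter

# Hypothetical values of a nominal attribute (stand-in for a real column).
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy",
           "rainy", "overcast", "sunny", "sunny", "rainy"]

# A quick text histogram: compare the counts against background
# knowledge to spot suspicious distributions.
counts = Counter(outlook)
for value, n in counts.most_common():
    print(f"{value:10s}{'#' * n} ({n})")
```

Even this crude view answers the slide’s question: is the distribution consistent with what a domain expert would expect?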
Data preprocessing

- Problem: different data sources (e.g. sales department, customer billing department, …)
- Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
- Data must be assembled, integrated, and cleaned up
- “Data warehouse”: a consistent point of access
- External data may be required (“overlay data”)
- Critical: type and level of data aggregation
Data Preprocessing

- Choose a data structure (table, tree, or set of tables)
- Choose attributes with enough information
- Decide on a first representation of the attributes (numeric or nominal)
- Decide on missing values
- Decide on inaccurate data (cleansing)
Attribute types used in practice

- Most schemes accommodate just two levels of measurement: nominal and ordinal
- Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
  - But: “enumerated” and “discrete” imply order
  - Special case: dichotomy (“boolean” attribute)
- Ordinal attributes are called “numeric” or “continuous”
  - But: “continuous” implies mathematical continuity
The ARFF format

% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
Attribute types

- ARFF supports numeric and nominal attributes
- Interpretation depends on the learning scheme
- Numeric attributes are interpreted as
  - ordinal scales if less-than and greater-than comparisons are used
  - ratio scales if distance calculations are performed (normalization/standardization may be required)
- Instance-based schemes define a distance between nominal values (0 if the values are equal, 1 otherwise)
- Integers: nominal, ordinal, or ratio scale?
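A minimal sketch of such a mixed distance function, using the 0/1 rule for nominal values and squared differences for numeric values assumed to be already normalized to [0, 1]; the instances and the attribute layout are illustrative, not from the book.

```python
import math

def mixed_distance(x, y, nominal):
    """Euclidean-style distance over mixed attributes: nominal values
    contribute 0 if equal and 1 otherwise; numeric values (assumed
    normalized to [0, 1]) contribute their squared difference."""
    total = 0.0
    for xi, yi, is_nom in zip(x, y, nominal):
        if is_nom:
            total += 0.0 if xi == yi else 1.0
        else:
            total += (xi - yi) ** 2
    return math.sqrt(total)

# Two hypothetical weather instances: (outlook, temperature, humidity, windy),
# with the numeric attributes rescaled to [0, 1] for illustration.
a = ("sunny", 0.85, 0.85, "false")
b = ("sunny", 0.80, 0.90, "true")
d = mixed_distance(a, b, nominal=(True, False, False, True))
```

This is why normalization matters for ratio-scale interpretation: without it, a single wide-ranging numeric attribute would dominate every nominal 0/1 contribution.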
Nominal vs. ordinal

Attribute “age” nominal:

If age = young and astigmatic = no and tear production rate = normal
then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal
then recommendation = soft

Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”):

If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal
then recommendation = soft
Missing values

- Frequently indicated by out-of-range entries
- Types: unknown, unrecorded, irrelevant
- Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible
- A missing value may have significance in itself (e.g. a missing test in a medical examination)
  - Most schemes assume that is not the case
  - “missing” may need to be coded as an additional value
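Coding “missing” as an additional value can be sketched in a couple of lines; the `?` marker follows ARFF’s convention for missing values, while the column contents and the replacement label are illustrative.

```python
MISSING = "?"  # ARFF's marker for a missing value

def code_missing(values, missing_label="missing"):
    """Replace missing markers in a nominal column with an explicit
    extra category, so that 'missingness' itself becomes learnable."""
    return [missing_label if v == MISSING else v for v in values]

# Hypothetical column from a medical dataset: a test that was not run.
tests = ["positive", "?", "negative", "?", "positive"]
coded = code_missing(tests)
```

After this transformation a scheme can exploit the fact that the test was skipped, instead of treating the value as unknowable noise.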
Inaccurate values

- Reason: the data has not been collected for mining it
- Result: errors and omissions that don’t affect the original purpose of the data (e.g. age of customer)
- Typographical errors in nominal attributes: values need to be checked for consistency
- Typographical and measurement errors in numeric attributes: outliers need to be identified
- Errors may be deliberate (e.g. wrong zip codes)
- Other problems: duplicates, stale data
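One common first pass for identifying numeric outliers, sketched with Tukey’s interquartile-range fences; this is a generic technique, not the book’s specific procedure, and the age values are invented.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside Tukey's fences: more than k * IQR below the
    first quartile or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

ages = [34, 41, 28, 39, 45, 37, 230]  # 230 looks like a typing error
flagged = iqr_outliers(ages)          # -> [230]
```

Flagged values still need human judgment: an outlier may be a typo, a measurement error, or a genuine extreme case.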
Transformation: Attribute selection

- Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5’s performance
  - Problem: attribute selection is based on smaller and smaller amounts of data
- IBL is also very susceptible to irrelevant attributes
  - The number of training instances required increases exponentially with the number of irrelevant attributes
- Naïve Bayes doesn’t have this problem
- Relevant attributes can also be harmful
Scheme-independent selection

- Filter approach: assessment based on general characteristics of the data
- One method: find the subset of attributes that is enough to separate all the instances
- Another method: use a different learning scheme (e.g. C4.5, 1R) to select attributes
- IBL-based attribute weighting techniques can also be used (but can’t find redundant attributes)
- CFS: uses correlation-based evaluation of subsets
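The core ingredient of correlation-based evaluation — ranking attributes by how strongly they correlate with the class — can be sketched as follows. This is a deliberate simplification: CFS proper evaluates whole subsets and also penalizes correlation between attributes, and it uses an entropy-based measure rather than plain Pearson correlation. The toy columns are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Toy columns: x1 tracks the 0/1-coded class, x2 is essentially noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
y  = [0, 0, 0, 1, 1, 1]

# Rank attributes by absolute correlation with the class.
scores = {name: abs(pearson(col, y)) for name, col in [("x1", x1), ("x2", x2)]}
ranked = sorted(scores, key=scores.get, reverse=True)
```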
Attribute subsets for weather data
Searching the attribute space

- The number of possible attribute subsets is exponential in the number of attributes
- Common greedy approaches: forward selection and backward elimination
- More sophisticated strategies:
  - Bidirectional search
  - Best-first search: can find the optimum solution
  - Beam search: an approximation to best-first search
  - Genetic algorithms
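Greedy forward selection can be sketched as below. The scoring function here is a made-up stand-in: in a real wrapper it would be the cross-validated performance of a learning scheme, and the attribute names are only illustrative.

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection: start from the empty set and repeatedly
    add the attribute that improves the subset score the most, stopping
    when no single addition helps."""
    selected = frozenset()
    best = evaluate(selected)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        score, attr = max(((evaluate(selected | {a}), a) for a in candidates),
                          key=lambda t: t[0])
        if score <= best:
            break
        best, selected = score, selected | {attr}
    return selected, best

# Toy subset score standing in for cross-validated accuracy:
# 'outlook' and 'humidity' help, every other attribute slightly hurts.
useful = {"outlook": 0.3, "humidity": 0.2}
score = lambda s: sum(useful.get(a, -0.05) for a in s)
subset, best = forward_selection(
    ["outlook", "temperature", "humidity", "windy"], score)
```

Backward elimination is the mirror image: start from the full set and greedily drop the attribute whose removal improves the score the most.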
Scheme-specific selection

- Wrapper approach: attribute selection implemented as a wrapper around the learning scheme
  - Evaluation criterion: cross-validation performance
- Time consuming: adds a factor of k² even for greedy approaches with k attributes
  - Linearity in k requires prior ranking of the attributes
- Scheme-specific attribute selection is essential for learning decision tables
- Can be done efficiently for DTs and Naïve Bayes
Discretizing numeric attributes

- Can be used to avoid making the normality assumption in Naïve Bayes and clustering
- A simple discretization scheme is used in 1R
- C4.5 performs local discretization
- Global discretization can be advantageous because it is based on more data
  - The learner can be applied to the discretized attribute, or
  - it can be applied to binary attributes coding the cut points in the discretized attribute
Unsupervised discretization

- Unsupervised discretization generates intervals without looking at class labels
  - The only possible way when clustering
- Two main strategies:
  - Equal-interval binning
  - Equal-frequency binning (also called histogram equalization)
- Inferior to supervised schemes in classification tasks
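Both unsupervised strategies fit in a few lines of Python; the temperature values are illustrative.

```python
def equal_interval_bins(values, k):
    """Split the range [min, max] into k equal-width intervals and
    return each value's bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Sort the values and cut them into k bins of (nearly) equal size
    ("histogram equalization")."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
ew = equal_interval_bins(temps, 3)   # width = 7 degrees per bin
ef = equal_frequency_bins(temps, 3)  # four values per bin
```

Note that neither function ever sees a class label, which is exactly why both can underperform supervised discretization on classification tasks.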
Entropy-based discretization

- Supervised method that builds a decision tree with pre-pruning on the attribute being discretized
  - Entropy is used as the splitting criterion
  - MDLP is used as the stopping criterion
- State-of-the-art discretization method
- Application of MDLP:
  - The “theory” is the splitting point (log2(N−1) bits) plus the class distribution in each subset
  - The description length before and after adding the splitting point is compared
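The splitting half of the method — choosing the cut point that minimizes the weighted class entropy of the two subsets — can be sketched like this. The data is a toy, perfectly separable example, and the MDLP stopping check is deliberately omitted here.

```python
from math import log2

def entropy(labels):
    """Class entropy in bits."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * log2(p)
    return ent

def best_split(values, labels):
    """Find the cut point on a numeric attribute that minimizes the
    weighted entropy of the two resulting subsets."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if e < best_e:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_e = e
    return best_cut, best_e

# Perfectly separable toy data: the best cut falls midway between 8 and 12.
cut, e = best_split([4, 6, 8, 12, 14, 16],
                    ["no", "no", "no", "yes", "yes", "yes"])
```

Applied recursively to each subset, with MDLP deciding when to stop, this yields the discretization intervals.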
Example: temperature attribute
Formula for MDLP

Given N instances, with k classes and entropy E in the original set, k1 classes and entropy E1 in the first subset, and k2 classes and entropy E2 in the second subset, a split is accepted only if

gain > log2(N−1)/N + [log2(3^k − 2) − kE + k1E1 + k2E2] / N

This criterion doesn’t result in any discretization intervals for the temperature attribute.
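The formula above written as a predicate; the example numbers are a toy, perfectly separable split, not the book’s temperature data.

```python
from math import log2

def mdlp_accepts(gain, n, k, e, k1, e1, k2, e2):
    """MDLP stopping criterion: accept a split only if the information
    gain exceeds the cost of encoding the split point plus the class
    distributions in the two subsets."""
    threshold = (log2(n - 1) / n
                 + (log2(3 ** k - 2) - k * e + k1 * e1 + k2 * e2) / n)
    return gain > threshold

# A clean split of 6 instances (2 classes, entropy 1.0) into two pure
# subsets gives gain 1.0, which clears the threshold:
accepted = mdlp_accepts(gain=1.0, n=6, k=2, e=1.0,
                        k1=1, e1=0.0, k2=1, e2=0.0)
```

A split with a tiny gain and still-mixed subsets fails the same check, which is how the recursion terminates.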
Other discretization methods

- The top-down procedure can be replaced by a bottom-up method
- MDLP can be replaced by a chi-squared test
- Dynamic programming can be used to find the optimum k-way split for a given additive criterion
  - Requires time quadratic in the number of instances if entropy is used as the criterion
  - Can be done in linear time if error rate is used as the evaluation criterion
Transformation

- WEKA provides many filters that can help you transform and select your attributes!
- Use them to build a promising model for the caravan data!