8/18/2019 dm2part2
University of Florida CISE department Gator Engineering
Data Preprocessing
Dr. Sanjay Ranka, Professor
Computer and Information Science and Engineering
University of Florida, Gainesville
Data Mining Sanjay Ranka Spring 2011
Data Preprocessing
• What preprocessing steps can or should we apply to the data to make it more suitable for data mining?
– Aggregation
– Sampling
– Dimensionality Reduction
– Feature Subset Selection
– Feature Creation
– Discretization and Binarization
– Attribute Transformation
Aggregation
• Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object)
• For example, merging daily sales figures to obtain monthly sales figures
• Why aggregation?
– Data reduction
• Allows use of more expensive algorithms
– If done properly, aggregation can act as a change of scope or scale, providing a high-level view of the data instead of a low-level view
– Behavior of groups of objects is more stable than that of individual objects
• The aggregate quantities have less variability than the individual objects being aggregated
[Figure: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
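The daily-to-monthly example above can be sketched in a few lines of Python (the sales figures below are made-up values):

```python
from collections import defaultdict

# Hypothetical daily sales records: (month, day, amount)
daily_sales = [
    ("Jan", 1, 120.00), ("Jan", 2, 95.50), ("Jan", 3, 130.25),
    ("Feb", 1, 110.00), ("Feb", 2, 140.75),
]

# Aggregate the daily figures into monthly totals
monthly_sales = defaultdict(float)
for month, _day, amount in daily_sales:
    monthly_sales[month] += amount

print(dict(monthly_sales))  # → {'Jan': 345.75, 'Feb': 250.75}
```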
Sampling
• Sampling is the process of understanding characteristics of data or models based on a subset of the original data. It is used extensively in all aspects of data exploration and mining
• Why sample?
– Obtaining the entire set of “data of interest” is too expensive or time consuming
– Obtaining the entire set of data may not be necessary (and hence a waste of resources)
Representative Sample
• A sample is representative for a particular operation if it results in approximately the same outcome as if the entire data set were used
• A sample that is representative for one operation may not be representative for another
– For example, a sample may be representative for a histogram along one dimension but may not be good enough for the correlation between two dimensions
Sampling Approaches
• Simple Random Sampling
– There is an equal probability of selecting any particular item
– Sampling without replacement: Once an item is selected, it is removed from the population for obtaining future samples
– Sampling with replacement: A selected item is not removed from the population for obtaining future samples
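In Python's standard library the two schemes correspond directly to `random.sample` (without replacement) and `random.choices` (with replacement); a small sketch:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(100))

# Without replacement: a selected item cannot be drawn again,
# so the 10 sampled items are always distinct
without_repl = random.sample(population, k=10)

# With replacement: a selected item stays in the pool and may repeat
with_repl = random.choices(population, k=10)

print(len(set(without_repl)))  # → 10
```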
Sampling Approaches
• Stratified Sampling
– When subpopulations vary considerably, it is advantageous to sample each subpopulation (stratum) independently
– Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling
– The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded
– Then random sampling is applied within each stratum. This often improves the representativeness of the sample by reducing sampling error
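The steps above can be sketched with a hypothetical two-stratum population (the labels "A" and "B" and the 20% sampling fraction are illustrative choices):

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical population: 90 members in stratum "A", 10 in stratum "B"
population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]

# Group members into mutually exclusive, collectively exhaustive strata
strata = defaultdict(list)
for stratum, member in population:
    strata[stratum].append((stratum, member))

# Apply random sampling within each stratum at a fixed fraction,
# so the small stratum "B" is still represented in the sample
fraction = 0.2
sample = []
for stratum, members in strata.items():
    k = max(1, round(fraction * len(members)))
    sample.extend(random.sample(members, k))
```

A plain random sample of the same total size could easily miss stratum "B" entirely; sampling each stratum separately guarantees it is represented.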
Sample Size
• Even if the proper sampling technique is known, it is important to choose a proper sample size
• Larger sample sizes increase the probability that a sample will be representative, but also eliminate much of the advantage of sampling
• With a smaller sample size, patterns may be missed or erroneous patterns detected
Sample Size / Fraction
[Figure: example of the loss of structure with sampling, shown at 8000, 2000, and 500 points]
Sample Size
• Sample size required to obtain at least one sample from each group
[Figure: probability that a sample contains points from each of ten groups (“Ten Clusters”), as a function of sample size]
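The probability plotted in the figure can be computed exactly by inclusion-exclusion, assuming n points are drawn uniformly (with replacement) from ten equally likely groups:

```python
from math import comb

def prob_all_groups(n, g=10):
    """Probability that n uniform draws from g equally likely groups
    hit every group at least once (inclusion-exclusion)."""
    return sum((-1) ** i * comb(g, i) * ((g - i) / g) ** n
               for i in range(g + 1))

# A sample of 60 points already covers all ten groups most of the time
print(prob_all_groups(60) > 0.95)  # → True
```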
Dimensionality Reduction
• Curse of dimensionality: Data analysis becomes significantly harder as the dimensionality of the data increases
[Figure: curse of dimensionality, showing the decrease in the relative distance between points as dimensionality increases]
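The shrinking relative distance can be seen in a small simulation (the point count and the choice of dimensions are arbitrary):

```python
import math
import random

random.seed(1)

def relative_contrast(dim, n_points=200):
    """(max - min) / min distance to the origin for uniform random
    points in the unit hypercube; this contrast shrinks as dim grows."""
    points = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.sqrt(sum(x * x for x in p)) for p in points]
    return (max(dists) - min(dists)) / min(dists)

# In 2 dimensions nearest and farthest points differ greatly;
# in 100 dimensions all distances concentrate around the same value
print(relative_contrast(2) > relative_contrast(100))  # → True
```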
Dimensionality Reduction
• Determining dimensions (or combinations of dimensions) that are important for modeling
• Why dimensionality reduction?
– Many data mining algorithms work better if the dimensionality of the data (i.e., the number of attributes) is lower
– Allows the data to be more easily visualized
– If dimensionality reduction eliminates irrelevant features or reduces noise, then the quality of results may improve
– Can lead to a more understandable model
Dimensionality Reduction
• Redundant features duplicate much or all of the information contained in one or more other attributes
– The purchase price of a product and the sales tax paid contain the same information
• Irrelevant features contain no information that is useful for the data mining task at hand
– Student ID numbers would be irrelevant to the task of predicting students’ GPAs
Dimensionality Reduction using Principal Component Analysis
• Some of the most common approaches for dimensionality reduction, particularly for continuous data, use a linear or non-linear projection of the data from a high dimensional space to a lower dimensional space
• PCA is a linear algebra technique for continuous attributes that finds new attributes (principal components) that
– Are linear combinations of the original attributes
– Are orthogonal to each other
– Capture the maximum amount of variation in the data
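A minimal two-dimensional sketch, using made-up data with two strongly correlated attributes: the first principal component is the leading eigenvector of the 2x2 sample covariance matrix, for which a closed form exists.

```python
import math

# Made-up data: two positively correlated attributes
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]

# Center the data by subtracting the attribute means
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Leading eigenvalue of a symmetric 2x2 matrix (closed form) and its
# eigenvector: the direction capturing the maximum variance
lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
vx, vy = sxy, lam - sxx       # unnormalized eigenvector (valid when sxy != 0)
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)  # first principal component, unit length
```

For more than two attributes one would use a numerical eigensolver or the SVD instead of this closed form; the idea is the same.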
Dimensionality Reduction by Feature Subset Selection
• There are three standard approaches to feature selection:
– Embedded approaches: Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches: Features are selected before the data mining algorithm is run
– Wrapper approaches: Use the target data mining algorithm as a black box to find the best subset of attributes (typically without enumerating all subsets)
Architecture for Feature Subset Selection
[Flowchart of a feature subset selection process: starting from the full set of attributes, a search strategy proposes a subset of attributes, which is scored by an evaluation step; a stopping criterion decides whether to continue searching (“not done”) or finish (“done”), after which a validation procedure checks the selected attributes]
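As a sketch of the wrapper idea, greedy forward selection treats the scoring function as a black box. The toy `score` below, which rewards the hypothetical features "a" and "b" and penalizes everything else, stands in for actually running the data mining algorithm on each candidate subset:

```python
def forward_selection(features, score):
    """Greedy wrapper-style search: repeatedly add the feature that
    most improves a black-box score, stopping when nothing helps."""
    selected, remaining = [], list(features)
    best = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        cand_score = score(selected + [candidate])
        if cand_score <= best:  # stopping criterion: no improvement
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected

# Toy black-box evaluation: "a" and "b" are informative, the rest hurt
def score(subset):
    return len(set(subset) & {"a", "b"}) - 0.1 * len(set(subset) - {"a", "b"})

print(forward_selection(["a", "b", "c", "d"], score))  # → ['a', 'b']
```

Note that only a linear number of subsets is scored, rather than all 2^n of them, which is why wrappers typically avoid enumerating every subset.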
Feature Creation
• Sometimes, a small number of new attributes can capture the important information in a data set much more efficiently than the original attributes
• Also, the number of new attributes can often be smaller than the number of original attributes, so we get the benefits of dimensionality reduction as well
• Three general methodologies:
– Feature Extraction
– Mapping the Data to a New Space
– Feature Construction
Feature Extraction
• One approach to dimensionality reduction is feature extraction, which is the creation of a new, smaller set of features from the original set of features
• For example, consider a set of photographs, where each photograph is to be classified according to whether it contains a human face or not
• The raw data is a set of pixels, and as such is not suitable for many classification algorithms
• However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges or of areas correlated with the presence of human faces, then a broader set of classification techniques can be applied to the problem
Mapping the Data to a New Space
• Sometimes, a totally different view of the data can reveal important and interesting features
• Example: applying the Fourier transform to time series data to detect periodic patterns
[Figure: original time series, the time series with noise, and the resulting frequency plot]
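A tiny sketch of the idea using a naive discrete Fourier transform (a real application would use an FFT library routine; the 5-cycle test signal is an illustrative choice):

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive O(n^2) discrete Fourier transform; returns the magnitude
    of each frequency bin."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A sinusoid completing 5 cycles over 64 samples: in the new space
# (the frequency plot) the pattern shows up as a sharp peak at bin 5
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
mags = dft_magnitudes(signal)
print(max(range(n // 2), key=lambda k: mags[k]))  # → 5
```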
Feature Construction
• Sometimes features have the necessary information, but not in the form needed by the data mining algorithm. In this case, one or more new features constructed out of the original features may be useful
• Example: suppose there are two attributes that record the volume and mass of a set of objects
• Suppose also that there exists a classification model based on the material of which the objects are constructed
• Then a density feature constructed from the original two features would help classification
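The density example can be sketched directly (the mass and volume values below are made up):

```python
# Hypothetical objects described by mass (kg) and volume (m^3)
objects = [
    {"mass": 7.9,  "volume": 1.0},
    {"mass": 2.7,  "volume": 1.0},
    {"mass": 15.8, "volume": 2.0},
]

# Construct a density feature from the two original features;
# objects made of the same material now share the same value,
# even though their raw mass and volume attributes differ
for obj in objects:
    obj["density"] = obj["mass"] / obj["volume"]
```

The first and third objects have different masses and volumes but identical density, which is exactly the regularity a material classifier needs.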
Discretization and Binarization
• Discretization is the process of converting a continuous attribute to a discrete attribute
• A common example is rounding off real numbers to integers
• Some data mining algorithms require that the data be in the form of categorical or binary attributes. Thus, it is often necessary to convert continuous attributes into categorical and/or binary attributes
• It is fairly straightforward to convert categorical attributes into discrete or binary attributes
Discretization of Continuous Attributes
• Transformation of a continuous attribute to a categorical attribute involves
– Deciding how many categories to have
– Deciding how to map the values of the continuous attribute to these categories
• A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised)
Different Discretization Techniques
[Figure: the same data discretized by equal interval width, equal frequency, and K-means]
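The first two unsupervised techniques can be sketched directly (the data values and the choice of two bins are illustrative):

```python
# Illustrative 1-D data with a gap, sorted for convenience
data = sorted([0.1, 0.2, 0.25, 0.3, 0.9, 4.0, 4.5, 5.0, 9.7, 10.0])
k = 2  # number of categories

# Equal interval width: split the value range into k equal-width bins
lo, hi = min(data), max(data)
width = (hi - lo) / k
equal_width = [min(int((x - lo) / width), k - 1) for x in data]

# Equal frequency: put (roughly) the same number of points in each bin
size = len(data) // k
equal_freq = [min(i // size, k - 1) for i in range(len(data))]

print(equal_width)  # → [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(equal_freq)   # → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Note how the two schemes cut the same data differently: equal width is driven by the value range, while equal frequency is driven by the counts.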
Different Discretization Techniques
• Entropy based approach
[Figure: entropy-based discretization with 3 categories for both x and y, and with 5 categories for both x and y]
Attribute Transformation
• An attribute transformation refers to a transformation that is applied to all values of an attribute, i.e., for each object, the transformation is applied to the value of the attribute for that object
• There are two important types of attribute transformations:
– Simple function transformations
• Examples: x^k, log x, e^x, sqrt(x), etc.
– Standardization or normalization
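Standardization (the z-score) can be sketched with the standard library (the attribute values below are made up):

```python
from statistics import mean, stdev

# Rescale an attribute to mean 0 and standard deviation 1, applying
# the same transformation to the attribute value of every object
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m, s = mean(values), stdev(values)
standardized = [(x - m) / s for x in values]
```

After this transformation, attributes measured on very different scales (say, income in dollars and age in years) become directly comparable.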