
    University of Florida CISE department Gator Engineering

    Data Preprocessing

Dr. Sanjay Ranka, Professor

    Computer and Information Science and Engineering

    University of Florida, Gainesville

    [email protected]


Data Mining, Sanjay Ranka, Spring 2011

    Data Preprocessing

• What preprocessing steps can or should we apply to the data to make it more suitable for data mining?

    – Aggregation

    – Sampling

    – Dimensionality Reduction

    – Feature Subset Selection

    – Feature Creation

    – Discretization and Binarization

    – Attribute Transformation


    Aggregation

• Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object)

• For example, merging daily sales figures to obtain monthly sales figures

• Why aggregation?

    – Data reduction

        • Allows use of more expensive algorithms

    – If done properly, aggregation can act as a change of scope or scale, providing a high-level view of the data instead of a low-level view


    – The behavior of a group of objects is more stable than that of individual objects

        • The aggregate quantities have less variability than the individual objects being aggregated

    [Figure: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
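The daily-to-monthly sales example can be sketched as follows (the sales figures below are made up for illustration):

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical daily sales records: (date, amount) -- values made up.
daily_sales = [
    ("2011-01-03", 120.0), ("2011-01-17", 80.0), ("2011-01-28", 100.0),
    ("2011-02-04", 95.0),  ("2011-02-19", 110.0),
    ("2011-03-08", 130.0), ("2011-03-21", 70.0),  ("2011-03-30", 90.0),
]

# Aggregation: merge daily figures into monthly figures (data reduction).
by_month = defaultdict(list)
for date, amount in daily_sales:
    by_month[date[:7]].append(amount)        # key on "YYYY-MM"

monthly_totals = {m: sum(v) for m, v in by_month.items()}

# Aggregates are more stable: monthly averages vary less than daily values.
daily_values = [a for _, a in daily_sales]
monthly_averages = [mean(v) for v in by_month.values()]
```

The last two lines mirror the precipitation figure: the standard deviation of the monthly averages is smaller than that of the underlying daily values.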


    Sampling

• Sampling is the process of understanding characteristics of data or models based on a subset of the original data. It is used extensively in all aspects of data exploration and mining

• Why sample?

    – Obtaining the entire set of "data of interest" is too expensive or time consuming

    – Obtaining the entire set of data may not be necessary (and hence a waste of resources)


    Representative Sample

• A sample is representative for a particular operation if it results in approximately the same outcome as if the entire data set were used

• A sample that is representative for one operation may not be representative for another

    – For example, a sample may be representative for a histogram along one dimension but may not be good enough for the correlation between two dimensions


    Sampling Approaches

    • Simple Random Sampling

    – There is an equal probability of selecting any particular item

    – Sampling without replacement: once an item is selected, it is removed from the population for obtaining future samples

    – Sampling with replacement: a selected item is not removed from the population for obtaining future samples
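The two variants of simple random sampling map directly onto the standard library (a small sketch with a made-up population):

```python
import random

random.seed(42)
population = list(range(100))

# Without replacement: a selected item leaves the pool, so the sample
# can contain no duplicates.
sample_without = random.sample(population, 10)

# With replacement: every draw is from the full population, so the same
# item can be selected again on a later draw.
sample_with = [random.choice(population) for _ in range(10)]
```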


• Stratified Sampling

    – When subpopulations vary considerably, it is advantageous to sample each subpopulation (stratum) independently

    – Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling

    – The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded

    – Then random sampling is applied within each stratum. This often improves the representativeness of the sample by reducing sampling error
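Stratified sampling can be sketched as below, with a hypothetical population of 90 "A" members and 10 "B" members; a 10% simple random sample could easily miss stratum "B" entirely, while the stratified sample is guaranteed to include it:

```python
import random
from collections import defaultdict

random.seed(7)

# Hypothetical population: (member_id, stratum_label).
population = [(i, "A") for i in range(90)] + [(i, "B") for i in range(90, 100)]

# Stratification: group members into mutually exclusive, collectively
# exhaustive subgroups before sampling.
strata = defaultdict(list)
for member, label in population:
    strata[label].append(member)

# Random sampling within each stratum, proportional to its size (10% here).
sample = []
for label, members in strata.items():
    k = max(1, round(0.10 * len(members)))
    sample.extend(random.sample(members, k))
```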


    Sample Size

• Even if the proper sampling technique is known, it is important to choose the proper sample size

• Larger sample sizes increase the probability that a sample will be representative, but also eliminate much of the advantage of sampling

• With smaller sample sizes, patterns may be missed or erroneous patterns detected


    Sample Size / Fraction

    [Figure: the same data set sampled at 8000, 2000, and 500 points, an example of the loss of structure with sampling]


    Sample Size

• Sample size required to obtain at least one sample from each group

    [Figure: ten clusters, and the probability that a sample contains points from each of the ten groups, as a function of sample size]
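For equally sized groups, the probability that a uniform sample of size n covers all ten groups can be computed exactly by inclusion-exclusion over the set of missed groups (a sketch, not the slide's own computation):

```python
from math import comb

def prob_all_groups(n, groups=10):
    """Probability that n uniform draws (with replacement) include at
    least one point from each of `groups` equally sized groups,
    via inclusion-exclusion over which groups are missed entirely."""
    return sum(
        (-1) ** i * comb(groups, i) * ((groups - i) / groups) ** n
        for i in range(groups + 1)
    )

p_small = prob_all_groups(20)   # a modest sample often misses a group
p_large = prob_all_groups(60)   # a larger sample almost surely covers all
```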


    Dimensionality Reduction

• Curse of dimensionality: data analysis becomes significantly harder as the dimensionality of the data increases

    [Figure: curse of dimensionality, showing the decrease in the relative distance between points as dimensionality increases]
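The shrinking relative distance can be checked numerically: for random points in the unit hypercube, the gap between the farthest and nearest point, relative to the nearest, collapses as the dimension grows. A small sketch:

```python
import math
import random

random.seed(0)

def relative_contrast(dim, n_points=200):
    """(max - min) / min over the distances from the origin to random
    points in the unit hypercube of the given dimension."""
    dists = []
    for _ in range(n_points):
        point = [random.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in point)))
    return (max(dists) - min(dists)) / min(dists)

contrast_2d = relative_contrast(2)      # distances are spread out
contrast_100d = relative_contrast(100)  # distances concentrate sharply
```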


    Dimensionality Reduction

• Determining dimensions (or combinations of dimensions) that are important for modeling

• Why dimensionality reduction?

    – Many data mining algorithms work better if the dimensionality of the data (i.e., the number of attributes) is lower

    – Allows the data to be more easily visualized

    – If dimensionality reduction eliminates irrelevant features or reduces noise, the quality of results may improve

    – Can lead to a more understandable model


    Dimensionality Reduction

• Redundant features duplicate much or all of the information contained in one or more other attributes

    – The purchase price of a product and the sales tax paid contain the same information

• Irrelevant features contain no information that is useful for the data mining task at hand

    – Student ID numbers would be irrelevant to the task of predicting students' GPAs


Dimensionality Reduction Using Principal Component Analysis

• Some of the most common approaches for dimensionality reduction, particularly for continuous data, use a linear or non-linear projection of the data from a high-dimensional space to a lower-dimensional space

• PCA is a linear algebra technique for continuous attributes that finds new attributes (principal components) that

    – Are linear combinations of the original attributes

    – Are orthogonal to each other

    – Capture the maximum amount of variation in the data
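A minimal sketch of PCA via the eigendecomposition of the covariance matrix, on toy data (assumes NumPy is available; the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous data: 200 objects, 3 attributes built from 2 underlying
# factors, so the data is (nearly) rank 2.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5, 1.0],
                                          [0.0, 1.0, -0.5]])

# PCA: center the data, then take eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]        # sort by variance captured
components = eigvecs[:, order]           # orthogonal principal components

# Project onto the top 2 components: 3 attributes -> 2 new attributes.
X_reduced = Xc @ components[:, :2]
```

The eigenvectors are orthonormal by construction, and the first projected attribute captures at least as much variance as the second, matching the three properties listed above.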


Dimensionality Reduction by Feature Subset Selection

• There are three standard approaches to feature selection:

    – Embedded approaches: feature selection occurs naturally as part of the data mining algorithm

    – Filter approaches: features are selected before the data mining algorithm is run

    – Wrapper approaches: use the target data mining algorithm as a black box to find the best subset of attributes (typically without enumerating all subsets)
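A wrapper approach can be sketched as greedy forward selection around a black-box scoring function. Here the scorer is a toy stand-in (a real wrapper would train and validate the target data mining algorithm on each candidate subset); the feature names and the `RELEVANT` set are invented for illustration:

```python
# Hypothetical black box: relevant features add to the score,
# irrelevant ones incur a small penalty.
RELEVANT = {"income", "age"}

def score(subset):
    return sum(1.0 if f in RELEVANT else -0.1 for f in subset)

def forward_selection(features, score_fn):
    """Greedy wrapper search: repeatedly add the feature that most
    improves the black-box score; stop when no addition helps
    (the stopping criterion), without enumerating all subsets."""
    selected, best = [], score_fn([])
    while True:
        candidates = [(score_fn(selected + [f]), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_feat = max(candidates)
        if top_score <= best:
            break                      # no improvement: stop
        selected.append(top_feat)
        best = top_score
    return selected

chosen = forward_selection(["age", "zip", "income", "id"], score)
```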


    Architecture for Feature Subset Selection

    [Flowchart of a feature subset selection process: from the full set of attributes, a search strategy generates a subset of attributes, which is scored by an evaluation step; while the stopping criterion is not met ("not done"), the search continues; once done, a validation procedure checks the selected attributes]


    Feature Creation

• Sometimes, a small number of new attributes can capture the important information in a data set much more efficiently than the original attributes

• Also, the number of new attributes is often smaller than the number of original attributes, so we get the benefits of dimensionality reduction

• Three general methodologies:

    – Feature Extraction

    – Mapping the Data to a New Space

    – Feature Construction


    Feature Extraction

• One approach to dimensionality reduction is feature extraction: the creation of a new, smaller set of features from the original set of features

• For example, consider a set of photographs, where each photograph is to be classified as to whether it contains a human face or not

• The raw data is a set of pixels, and as such is not suitable for many classification algorithms

• However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges or areas correlated with the presence of human faces, then a broader set of classification techniques can be applied to the problem


    Mapping the Data to a New Space

• Sometimes, a totally different view of the data can reveal important and interesting features

• Example: applying the Fourier transform to data to detect time series patterns

    [Figure: original time series, the same time series with noise, and its frequency plot]
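The Fourier example can be sketched as follows (assumes NumPy; the signal parameters are made up): a 5 Hz periodic component that is hard to see in the noisy time series shows up as a sharp peak once the data is mapped to frequency space.

```python
import numpy as np

rng = np.random.default_rng(1)

# A time series with a hidden 5 Hz periodic component plus noise.
n, fs = 256, 64.0                      # number of samples, sampling rate (Hz)
t = np.arange(n) / fs
series = np.sin(2 * np.pi * 5.0 * t) + 0.5 * rng.normal(size=n)

# Map the data to a new space: the magnitude spectrum of the series.
spectrum = np.abs(np.fft.rfft(series))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
dominant = freqs[1:][np.argmax(spectrum[1:])]   # skip the DC component
```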


    Feature Construction

• Sometimes features have the necessary information, but not in the form necessary for the data mining algorithm. In this case, one or more new features constructed out of the original features may be useful

• Example: there are two attributes that record the volume and mass of a set of objects

• Suppose there exists a classification model based on the material of which the objects are constructed

• Then a density feature constructed from the original two features would help classification
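The density example in code (the measurements are made up; densities of roughly 7.87 and 2.70 g/cm^3 correspond to iron and aluminum):

```python
# Hypothetical measurements: (volume_cm3, mass_g) for each object.
objects = [(10.0, 78.7), (5.0, 13.5), (20.0, 157.4), (8.0, 21.6)]

# Constructed feature: density = mass / volume. It reflects the material
# directly, which neither original attribute does on its own.
densities = [mass / volume for volume, mass in objects]
materials = ["iron" if d > 5.0 else "aluminum" for d in densities]
```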


    Discretization and Binarization

• Discretization is the process of converting a continuous attribute to a discrete attribute

• A common example is rounding off real numbers to integers

• Some data mining algorithms require that the data be in the form of categorical or binary attributes. Thus, it is often necessary to convert continuous attributes into categorical attributes and/or binary attributes

• It is pretty straightforward to convert categorical attributes into discrete or binary attributes
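Converting a categorical attribute to binary attributes is indeed straightforward: introduce one binary (one-hot) attribute per category value. A small sketch with made-up data:

```python
# A categorical attribute with three possible values.
colors = ["red", "green", "blue", "green", "red"]

# Binarization: one binary attribute per category value; exactly one of
# them is 1 for each object.
categories = sorted(set(colors))
one_hot = [[1 if value == cat else 0 for cat in categories]
           for value in colors]
```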


    Discretization of Continuous Attributes

• Transformation of a continuous attribute to a categorical attribute involves

    – Deciding how many categories to have

    – Deciding how to map the values of the continuous attribute to the categorical attribute

• A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised)


    Different Discretization Techniques

    [Figure: the same data discretized by equal interval width, equal frequency, and K-means]
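Two of the unsupervised techniques from the figure can be sketched directly (the K-means variant, which places bin boundaries by clustering, is omitted here); the made-up data includes one outlier to show how equal-width bins degenerate while equal-frequency bins stay balanced:

```python
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]     # note the single outlier

def equal_width(values, k):
    """Split the attribute's range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    """Assign values to k bins holding (roughly) equal numbers of points."""
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(ranked):
        bins[i] = rank * k // len(values)
    return bins

ew = equal_width(data, 3)        # outlier crowds everything into bin 0
ef = equal_frequency(data, 3)    # three balanced bins
```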


    Different Discretization Techniques

• Entropy-based approach

    [Figure: entropy-based discretization with 3 categories for both x and y, and with 5 categories for both x and y]


    Attribute Transformation

• An attribute transformation is a transformation that is applied to all values of an attribute, i.e., for each object, the transformation is applied to the value of the attribute for that object

• There are two important types of attribute transformations

    – Simple function transformations

        • Examples: x^k, log x, e^x, sqrt(x), etc.

    – Standardization or normalization
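The two standardization/normalization variants can be sketched as follows (made-up values): z-score standardization gives the attribute mean 0 and standard deviation 1, while min-max normalization rescales it to [0, 1].

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Standardization (z-score): subtract the mean, divide by the
# (population) standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Min-max normalization: linearly rescale the attribute to [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
```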