
    University of Florida CISE department Gator Engineering

    Data Preprocessing

Dr. Sanjay Ranka, Professor

    Computer and Information Science and Engineering

    University of Florida, Gainesville

    [email protected]


Data Mining, Sanjay Ranka, Spring 2011

    Data Preprocessing

• What preprocessing steps can or should we apply to the data to make it more suitable for data mining?

    – Aggregation

    – Sampling

    – Dimensionality Reduction

    – Feature Subset Selection

    – Feature Creation

    – Discretization and Binarization

    – Attribute Transformation


    Aggregation

• Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object)

• For example, merging daily sales figures to obtain monthly sales figures

• Why aggregation?

    – Data reduction

        • Allows use of more expensive algorithms

    – If done properly, aggregation can act as a change of scope or scale, providing a high-level view of the data instead of a low-level view


    – The behavior of a group of objects is more stable than that of individual objects

        • The aggregate quantities have less variability than the individual objects being aggregated

    [Figure: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
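The daily-to-monthly sales example can be sketched as follows (the sales figures below are made up for illustration):

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical daily sales records: (date, amount) -- values made up.
daily_sales = [
    ("2011-01-03", 120.0), ("2011-01-17", 80.0), ("2011-01-28", 100.0),
    ("2011-02-04", 95.0),  ("2011-02-19", 110.0),
    ("2011-03-08", 130.0), ("2011-03-21", 70.0),  ("2011-03-30", 90.0),
]

# Aggregation: merge daily figures into monthly figures (data reduction).
by_month = defaultdict(list)
for date, amount in daily_sales:
    by_month[date[:7]].append(amount)        # key on "YYYY-MM"

monthly_totals = {m: sum(v) for m, v in by_month.items()}

# Aggregates are more stable: monthly averages vary less than daily values.
daily_values = [a for _, a in daily_sales]
monthly_averages = [mean(v) for v in by_month.values()]
```

The last two lines mirror the precipitation figure: the standard deviation of the monthly averages is smaller than that of the underlying daily values.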


    Sampling

• Sampling is the process of understanding characteristics of data or models based on a subset of the original data. It is used extensively in all aspects of data exploration and mining

• Why sample?

    – Obtaining the entire set of "data of interest" is too expensive or time consuming

    – Obtaining the entire set of data may not be necessary (and hence a waste of resources)


    Representative Sample

• A sample is representative for a particular operation if it results in approximately the same outcome as if the entire data set were used

• A sample that is representative for one operation may not be representative for another

    – For example, a sample may be representative for a histogram along one dimension but may not be good enough for the correlation between two dimensions


    Sampling Approaches

    • Simple Random Sampling

    – There is an equal probability of selecting any particular item

    – Sampling without replacement: once an item is selected, it is removed from the population for obtaining future samples

    – Sampling with replacement: a selected item is not removed from the population for obtaining future samples
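The two variants of simple random sampling map directly onto the standard library (a small sketch with a made-up population):

```python
import random

random.seed(42)
population = list(range(100))

# Without replacement: a selected item leaves the pool, so the sample
# can contain no duplicates.
sample_without = random.sample(population, 10)

# With replacement: every draw is from the full population, so the same
# item can be selected again on a later draw.
sample_with = [random.choice(population) for _ in range(10)]
```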


• Stratified Sampling

    – When subpopulations vary considerably, it is advantageous to sample each subpopulation (stratum) independently

    – Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling

    – The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded

    – Then random sampling is applied within each stratum. This often improves the representativeness of the sample by reducing sampling error
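Stratified sampling can be sketched as below, with a hypothetical population of 90 "A" members and 10 "B" members; a 10% simple random sample could easily miss stratum "B" entirely, while the stratified sample is guaranteed to include it:

```python
import random
from collections import defaultdict

random.seed(7)

# Hypothetical population: (member_id, stratum_label).
population = [(i, "A") for i in range(90)] + [(i, "B") for i in range(90, 100)]

# Stratification: group members into mutually exclusive, collectively
# exhaustive subgroups before sampling.
strata = defaultdict(list)
for member, label in population:
    strata[label].append(member)

# Random sampling within each stratum, proportional to its size (10% here).
sample = []
for label, members in strata.items():
    k = max(1, round(0.10 * len(members)))
    sample.extend(random.sample(members, k))
```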


    Sample Size

• Even if the proper sampling technique is known, it is important to choose the proper sample size

• Larger sample sizes increase the probability that a sample will be representative, but also eliminate much of the advantage of sampling

• With smaller sample sizes, patterns may be missed or erroneous patterns detected


    Sample Size / Fraction

    [Figure: the same data set sampled at 8000, 2000, and 500 points, an example of the loss of structure with sampling]


    Sample Size

• Sample size required to obtain at least one sample from each group

    [Figure: ten clusters, and the probability that a sample contains points from each of the ten groups, as a function of sample size]
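For equally sized groups, the probability that a uniform sample of size n covers all ten groups can be computed exactly by inclusion-exclusion over the set of missed groups (a sketch, not the slide's own computation):

```python
from math import comb

def prob_all_groups(n, groups=10):
    """Probability that n uniform draws (with replacement) include at
    least one point from each of `groups` equally sized groups,
    via inclusion-exclusion over which groups are missed entirely."""
    return sum(
        (-1) ** i * comb(groups, i) * ((groups - i) / groups) ** n
        for i in range(groups + 1)
    )

p_small = prob_all_groups(20)   # a modest sample often misses a group
p_large = prob_all_groups(60)   # a larger sample almost surely covers all
```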


    Dimensionality Reduction

• Curse of dimensionality: data analysis becomes significantly harder as the dimensionality of the data increases

    [Figure: curse of dimensionality, showing the decrease in the relative distance between points as dimensionality increases]
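The shrinking relative distance can be checked numerically: for random points in the unit hypercube, the gap between the farthest and nearest point, relative to the nearest, collapses as the dimension grows. A small sketch:

```python
import math
import random

random.seed(0)

def relative_contrast(dim, n_points=200):
    """(max - min) / min over the distances from the origin to random
    points in the unit hypercube of the given dimension."""
    dists = []
    for _ in range(n_points):
        point = [random.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in point)))
    return (max(dists) - min(dists)) / min(dists)

contrast_2d = relative_contrast(2)      # distances are spread out
contrast_100d = relative_contrast(100)  # distances concentrate sharply
```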


    Dimensionality Reduction

• Determining dimensions (or combinations of dimensions) that are important for modeling

• Why dimensionality reduction?

    – Many data mining algorithms work better if the dimensionality of the data (i.e., the number of attributes) is lower

    – Allows the data to be more easily visualized

    – If dimensionality reduction eliminates irrelevant features or reduces noise, the quality of results may improve

    – Can lead to a more understandable model


    Dimensionality Reduction

• Redundant features duplicate much or all of the information contained in one or more other attributes

    – The purchase price of a product and the sales tax paid contain the same information

• Irrelevant features contain no information that is useful for the data mining task at hand

    – Student ID numbers would be irrelevant to the task of predicting students' GPAs


Dimensionality Reduction Using Principal Component Analysis

• Some of the most common approaches for dimensionality reduction, particularly for continuous data, use a linear or non-linear projection of the data from a high-dimensional space to a lower-dimensional space

• PCA is a linear algebra technique for continuous attributes that finds new attributes (principal components) that

    – Are linear combinations of the original attributes

    – Are orthogonal to each other

    – Capture the maximum amount of variation in the data
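A minimal sketch of PCA via the eigendecomposition of the covariance matrix, on toy data (assumes NumPy is available; the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous data: 200 objects, 3 attributes built from 2 underlying
# factors, so the data is (nearly) rank 2.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5, 1.0],
                                          [0.0, 1.0, -0.5]])

# PCA: center the data, then take eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]        # sort by variance captured
components = eigvecs[:, order]           # orthogonal principal components

# Project onto the top 2 components: 3 attributes -> 2 new attributes.
X_reduced = Xc @ components[:, :2]
```

The eigenvectors are orthonormal by construction, and the first projected attribute captures at least as much variance as the second, matching the three properties listed above.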


Dimensionality Reduction by Feature Subset Selection

• There are three standard approaches to feature selection:

    – Embedded approaches: feature selection occurs naturally as part of the data mining algorithm

    – Filter approaches: features are selected before the data mining algorithm is run

    – Wrapper approaches: use the target data mining algorithm as a black box to find the best subset of attributes (typically without enumerating all subsets)
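A wrapper approach can be sketched as greedy forward selection around a black-box scoring function. Here the scorer is a toy stand-in (a real wrapper would train and validate the target data mining algorithm on each candidate subset); the feature names and the `RELEVANT` set are invented for illustration:

```python
# Hypothetical black box: relevant features add to the score,
# irrelevant ones incur a small penalty.
RELEVANT = {"income", "age"}

def score(subset):
    return sum(1.0 if f in RELEVANT else -0.1 for f in subset)

def forward_selection(features, score_fn):
    """Greedy wrapper search: repeatedly add the feature that most
    improves the black-box score; stop when no addition helps
    (the stopping criterion), without enumerating all subsets."""
    selected, best = [], score_fn([])
    while True:
        candidates = [(score_fn(selected + [f]), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_feat = max(candidates)
        if top_score <= best:
            break                      # no improvement: stop
        selected.append(top_feat)
        best = top_score
    return selected

chosen = forward_selection(["age", "zip", "income", "id"], score)
```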


    Architecture for Feature Subset Selection

    [Flowchart of a feature subset selection process: from the full set of attributes, a search strategy generates a subset of attributes, which is scored by an evaluation step; while the stopping criterion is not met ("not done"), the search continues; once done, a validation procedure checks the selected attributes]


    Feature Creation

• Sometimes, a small number of new attributes can capture the important information in a data set much more efficiently than the original attributes

• Also, the number of new attributes is often smaller than the number of original attributes, so we get the benefits of dimensionality reduction

• Three general methodologies:

    – Feature Extraction

    – Mapping the Data to a New Space

    – Feature Construction


    Feature Extraction

• One approach to dimensionality reduction is feature extraction: the creation of a new, smaller set of features from the original set of features

• For example, consider a set of photographs, where each photograph is to be classified as to whether it contains a human face or not

• The raw data is a set of pixels, and as such is not suitable for many classification algorithms

• However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges or areas correlated with the presence of human faces, then a broader set of classification techniques can be applied to the problem


    Mapping the Data to a New Space

• Sometimes, a totally different view of the data can reveal important and interesting features

• Example: applying the Fourier transform to data to detect time series patterns

    [Figure: original time series, the same time series with noise, and its frequency plot]
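The Fourier example can be sketched as follows (assumes NumPy; the signal parameters are made up): a 5 Hz periodic component that is hard to see in the noisy time series shows up as a sharp peak once the data is mapped to frequency space.

```python
import numpy as np

rng = np.random.default_rng(1)

# A time series with a hidden 5 Hz periodic component plus noise.
n, fs = 256, 64.0                      # number of samples, sampling rate (Hz)
t = np.arange(n) / fs
series = np.sin(2 * np.pi * 5.0 * t) + 0.5 * rng.normal(size=n)

# Map the data to a new space: the magnitude spectrum of the series.
spectrum = np.abs(np.fft.rfft(series))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
dominant = freqs[1:][np.argmax(spectrum[1:])]   # skip the DC component
```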


    Feature Construction

• Sometimes features have the necessary information, but not in the form necessary for the data mining algorithm. In this case, one or more new features constructed out of the original features may be useful

• Example: there are two attributes that record the volume and mass of a set of objects

• Suppose there exists a classification model based on the material of which the objects are constructed

• Then a density feature constructed from the original two features would help classification
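The density example in code (the measurements are made up; densities of roughly 7.87 and 2.70 g/cm^3 correspond to iron and aluminum):

```python
# Hypothetical measurements: (volume_cm3, mass_g) for each object.
objects = [(10.0, 78.7), (5.0, 13.5), (20.0, 157.4), (8.0, 21.6)]

# Constructed feature: density = mass / volume. It reflects the material
# directly, which neither original attribute does on its own.
densities = [mass / volume for volume, mass in objects]
materials = ["iron" if d > 5.0 else "aluminum" for d in densities]
```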


    Discretization and Binarization

• Discretization is the process of converting a continuous attribute to a discrete attribute

• A common example is rounding off real numbers to integers

• Some data mining algorithms require that the data be in the form of categorical or binary attributes. Thus, it is often necessary to convert continuous attributes into categorical attributes and/or binary attributes

• It is pretty straightforward to convert categorical attributes into discrete or binary attributes
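Converting a categorical attribute to binary attributes is indeed straightforward: introduce one binary (one-hot) attribute per category value. A small sketch with made-up data:

```python
# A categorical attribute with three possible values.
colors = ["red", "green", "blue", "green", "red"]

# Binarization: one binary attribute per category value; exactly one of
# them is 1 for each object.
categories = sorted(set(colors))
one_hot = [[1 if value == cat else 0 for cat in categories]
           for value in colors]
```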


    Discretization of Continuous Attributes

• Transformation of a continuous attribute to a categorical attribute involves

    – Deciding how many categories to have

    – Deciding how to map the values of the continuous attribute to the categorical attribute

• A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised)


    Different Discretization Techniques

    [Figure: the same data discretized by equal interval width, equal frequency, and K-means]
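Two of the unsupervised techniques from the figure can be sketched directly (the K-means variant, which places bin boundaries by clustering, is omitted here); the made-up data includes one outlier to show how equal-width bins degenerate while equal-frequency bins stay balanced:

```python
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]     # note the single outlier

def equal_width(values, k):
    """Split the attribute's range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    """Assign values to k bins holding (roughly) equal numbers of points."""
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(ranked):
        bins[i] = rank * k // len(values)
    return bins

ew = equal_width(data, 3)        # outlier crowds everything into bin 0
ef = equal_frequency(data, 3)    # three balanced bins
```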


    Different Discretization Techniques

• Entropy-based approach

    [Figure: entropy-based discretization with 3 categories for both x and y, and with 5 categories for both x and y]


    Attribute Transformation

• An attribute transformation is a transformation that is applied to all values of an attribute, i.e., for each object, the transformation is applied to the value of the attribute for that object

• There are two important types of attribute transformations

    – Simple function transformations

        • Examples: x^k, log x, e^x, sqrt(x), etc.

    – Standardization or normalization
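The two standardization/normalization variants can be sketched as follows (made-up values): z-score standardization gives the attribute mean 0 and standard deviation 1, while min-max normalization rescales it to [0, 1].

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Standardization (z-score): subtract the mean, divide by the
# (population) standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Min-max normalization: linearly rescale the attribute to [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
```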