Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation

Outline Introduction Descriptive Data Summarization Data Cleaning

Missing value Noise data

Data Integration Redundancy

Data Transformation

Data Cleaning

Importance “Data cleaning is one of the three

biggest problems in data warehousing”—Ralph Kimball

“Data cleaning is the number one problem in data warehousing”—DCI survey

Data Cleaning

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy

data

Missing Data

Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time

of entry not register history or changes of the data

It is important to note that, a missing value may not always imply an error. (for example, Null-allow attri. )

How to Handle Missing Data?

Ignore the tuple: usually done when class

label is missing (assuming the tasks in

classification—not effective when the

percentage of missing values per attribute

varies considerably.

Fill in the missing value manually: tedious

+ infeasible

How to Handle Missing Data?

Fill in it automatically with a global constant : e.g., “unknown”, a new

class?!

the attribute mean

the attribute mean for all samples belonging to

the same class: smarter

the most probable value: inference-based such

as Bayesian formula or decision tree




Data Transformation

Noisy Data

Noise: random error or variance in a measured variable

How to Handle Noisy Data? Binning Regression Clustering

Binning Binnig methods smooth a sorted data

value by consulting its “neighborhood” First of all, we sort all the values Then, the sorted values are distributed into

a number of “buckets”, or “bins” Then we smooth the values by

Means (bin value is replace by mean value), or Medium (bin value is replace by medium value), or Boundaries (bin value is replace by the closest

boundary value)

Simple Discretization Methods: Binning Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34

* Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34

Regression

x

y

y = x + 1

X1

Y1

Y1’

Cluster Analysis




Data Transformation

Data integration

Data integration: Combines data from multiple sources

into a coherent store

Data integration problems Schema integration:

e.g., A.cust-id B.cust-# Integrate metadata from different sources

Detecting and resolving data value conflicts For the same real world entity, attribute values

from different sources are different Possible reasons: different representations,

different scales, e.g., metric vs. British units

Redundant data

Redundant data occur often when integration of multiple databases Object identification: The same attribute or

object may have different names in different databases

Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant data

Redundant attributes may be able to be detected by correlation analysis

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Pearson’s product moment coefficient

Correlation coefficient (also called Pearson’s product moment coefficient)

where n is the number of tuples, and are the respective means of A and B, σA and σB are the

respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross-product.

BABA n

BAnAB

n

BBAAr BA )1(

)(

)1(

))((,

Pearson’s product moment coefficient The correlation coefficient is always

between -1 and +1. The closer the correlation is to +/-1, the closer to a perfect linear relationship. Here is how I tend to interpret correlations.

-1.0 to -0.7 strong negative association. -0.7 to -0.3 weak negative association. -0.3 to +0.3 little or no association. +0.3 to +0.7 weak positive association. +0.7 to +1.0 strong positive association.

Chi-Square Χ2 (chi-square) test

The larger the Χ2 value, the more likely the variables are related

Chi-Square Calculation: An Example Suppose a group of 1500 people was

surveyed.

The gender of each person was noted Male: 300 Female: 1200

We have two attributes: Gender Prefer-reading

Chi-Square Calculation: An Example

E11 = count (male)*count(fiction)/N = 300 * 450 / 1500 =90 E12 = count (male)*count(not_fiction)/N = 300 * 1050/ 1500

=90

93.507840

)8401000(

360

)360200(

210

)21050(

90

)90250( 22222

i

j

Male Female Sum (row)

Like science fiction

250(90)

200(360) 450

Not like science fiction

50(210)

1000(840) 1050

Sum(col.) 300 1200 1500

Chi-Square Calculation: An Example For this 2 by 2 table, the degree of freedom

are (2-1)(2-1)=1

For 1 degree of freedom, the Chi-Square value needed to reject the hypothesis at the 0.001 significance is 10.828

Since our value is above this, we can conclude that the gender and prefer_reading are (strongly) correlated for the given group of people




Data Transformation

Data Transformation

Data Transformation can involve the following: Smoothing: remove noise from the data,

including binning, regression and clustering

Aggregation Generalization Normalization Attribute construction

Normalization

Normalization

Normalization

Min-max normalization Z-score normalization Decimal normalization

Min-max normalization

Min-max normalization: to

[new_minA, new_maxA]

Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__('

716.00)00.1(000,12000,98

000,12600,73

Z-score normalization

Z-score normalization (μ: mean, σ: standard deviation):

Ex. Let μ = 54,000, σ = 16,000. Then

A

Avv

'

225.1000,16

000,54600,73

Decimal normalization

Normalization by decimal scaling

Suppose the recorded value of A range from -986 to 917, the max absolute value is 986, so j = 3

j

vv

10' Where j is the smallest integer such that Max(|ν’|) < 1

Documents

Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation