Upload
blake-york
View
223
Download
1
Tags:
Embed Size (px)
Citation preview
Outline Introduction Descriptive Data Summarization Data Cleaning
Missing value Noise data
Data Integration Redundancy
Data Transformation
Data Cleaning
Importance “Data cleaning is one of the three
biggest problems in data warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy
data
Missing Data
Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time
of entry not register history or changes of the data
It is important to note that, a missing value may not always imply an error. (for example, Null-allow attri. )
How to Handle Missing Data?
Ignore the tuple: usually done when class
label is missing (assuming the tasks in
classification—not effective when the
percentage of missing values per attribute
varies considerably.
Fill in the missing value manually: tedious
+ infeasible
How to Handle Missing Data?
Fill in it automatically with a global constant : e.g., “unknown”, a new
class?!
the attribute mean
the attribute mean for all samples belonging to
the same class: smarter
the most probable value: inference-based such
as Bayesian formula or decision tree
Outline Introduction Descriptive Data Summarization Data Cleaning
Missing value Noise data
Data Integration Redundancy
Data Transformation
Noisy Data
Noise: random error or variance in a measured variable
How to Handle Noisy Data? Binning Regression Clustering
Binning Binnig methods smooth a sorted data
value by consulting its “neighborhood” First of all, we sort all the values Then, the sorted values are distributed into
a number of “buckets”, or “bins” Then we smooth the values by
Means (bin value is replace by mean value), or Medium (bin value is replace by medium value), or Boundaries (bin value is replace by the closest
boundary value)
Simple Discretization Methods: Binning Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
* Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
Regression
x
y
y = x + 1
X1
Y1
Y1’
Cluster Analysis
Outline Introduction Descriptive Data Summarization Data Cleaning
Missing value Noise data
Data Integration Redundancy
Data Transformation
Data integration
Data integration: Combines data from multiple sources
into a coherent store
Data integration problems Schema integration:
e.g., A.cust-id B.cust-# Integrate metadata from different sources
Detecting and resolving data value conflicts For the same real world entity, attribute values
from different sources are different Possible reasons: different representations,
different scales, e.g., metric vs. British units
Redundant data
Redundant data occur often when integration of multiple databases Object identification: The same attribute or
object may have different names in different databases
Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
Redundant data
Redundant attributes may be able to be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Pearson’s product moment coefficient
Correlation coefficient (also called Pearson’s product moment coefficient)
where n is the number of tuples, and are the respective means of A and B, σA and σB are the
respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross-product.
BABA n
BAnAB
n
BBAAr BA )1(
)(
)1(
))((,
Pearson’s product moment coefficient The correlation coefficient is always
between -1 and +1. The closer the correlation is to +/-1, the closer to a perfect linear relationship. Here is how I tend to interpret correlations.
-1.0 to -0.7 strong negative association. -0.7 to -0.3 weak negative association. -0.3 to +0.3 little or no association. +0.3 to +0.7 weak positive association. +0.7 to +1.0 strong positive association.
Chi-Square Χ2 (chi-square) test
The larger the Χ2 value, the more likely the variables are related
Chi-Square Calculation: An Example Suppose a group of 1500 people was
surveyed.
The gender of each person was noted Male: 300 Female: 1200
We have two attributes: Gender Prefer-reading
Chi-Square Calculation: An Example
E11 = count (male)*count(fiction)/N = 300 * 450 / 1500 =90 E12 = count (male)*count(not_fiction)/N = 300 * 1050/ 1500
=90
93.507840
)8401000(
360
)360200(
210
)21050(
90
)90250( 22222
i
j
Male Female Sum (row)
Like science fiction
250(90)
200(360) 450
Not like science fiction
50(210)
1000(840) 1050
Sum(col.) 300 1200 1500
Chi-Square Calculation: An Example For this 2 by 2 table, the degree of freedom
are (2-1)(2-1)=1
For 1 degree of freedom, the Chi-Square value needed to reject the hypothesis at the 0.001 significance is 10.828
Since our value is above this, we can conclude that the gender and prefer_reading are (strongly) correlated for the given group of people
Outline Introduction Descriptive Data Summarization Data Cleaning
Missing value Noise data
Data Integration Redundancy
Data Transformation
Data Transformation
Data Transformation can involve the following: Smoothing: remove noise from the data,
including binning, regression and clustering
Aggregation Generalization Normalization Attribute construction
Normalization
Normalization
Normalization
Min-max normalization Z-score normalization Decimal normalization
Min-max normalization
Min-max normalization: to
[new_minA, new_maxA]
Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to
AAA
AA
A
minnewminnewmaxnewminmax
minvv _)__('
716.00)00.1(000,12000,98
000,12600,73
Z-score normalization
Z-score normalization (μ: mean, σ: standard deviation):
Ex. Let μ = 54,000, σ = 16,000. Then
A
Avv
'
225.1000,16
000,54600,73
Decimal normalization
Normalization by decimal scaling
Suppose the recorded value of A range from -986 to 917, the max absolute value is 986, so j = 3
j
vv
10' Where j is the smallest integer such that Max(|ν’|) < 1