Data Cleansing: Filling Missing Values in Data
Class Presentation, CIS 764
Instructor: Dr. William Hankley
Presented by: Gaurav Chauhan
CIS 764-Gaurav Chauhan
Overview
Problems Caused
Methods for retrieving missing values
Predicting values
    The average way
    The probabilistic way
    By leveraging the relational network structure
Conclusions
Problems Caused
Missing values in a data set cause problems in the following data analysis tasks:
Summarizing variables
Computing new variables
Comparing variables
Combining variables
Time series analysis
Methods for retrieving missing values
Taking the average of the available values as the prediction
Using a probabilistic approach for value prediction
Leveraging the relational network structure of the data to predict values
Predicting Values: the average way
Year   Rainfall (avg, cm)          Temperature (avg, °F)
1936   30                          60
1937   32                          66
1938   N.A. (predicted: 28.5)      62
1939   25                          64
1940   23                          69
1941   30                          59
1942   N.A. (predicted: 29.0)      60
1943   28                          59
1944   22                          65
To find the missing values for the years 1938 and 1942, we can calculate the rainfall for these two years as follows.
Taking the average of the rainfall of 1937 and 1939:
Rainfall in 1938 = (32 + 25)/2 cm = 28.5 cm
Taking the average of the rainfall of 1941 and 1943:
Rainfall in 1942 = (30 + 28)/2 cm = 29 cm
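The neighbor-averaging calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name fill_with_neighbor_average is hypothetical, missing entries are represented as None, and an entry is filled only when both of its immediate neighbors are known.

```python
from typing import List, Optional


def fill_with_neighbor_average(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing entry (None) with the mean of its two immediate neighbors.

    Entries that lack two known neighbors (for example at either end of the
    list) are left as None. The function name is hypothetical.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        if 0 < i < len(values) - 1:
            before, after = values[i - 1], values[i + 1]
            if before is not None and after is not None:
                filled[i] = (before + after) / 2
    return filled


# Rainfall (cm) for 1936 to 1944, with None marking 1938 and 1942.
rainfall = [30, 32, None, 25, 23, 30, None, 28, 22]
filled_rainfall = fill_with_neighbor_average(rainfall)
print(filled_rainfall[2], filled_rainfall[6])  # 28.5 29.0
```

Neighbors are read from the original list rather than the filled one, so a predicted value is never used to predict another.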
Predicting Values: the probabilistic way
Assume that we have n values and are required to predict the (n+1)th value.
For every i from 1 to n, the probability that a data instance has the value v_i is p(v_i).
Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data.
The value v_{n+1} is then picked at random such that
p(v_{n+1} = v_i) > p(v_{n+1} = v_j) if p(v_i) > p(v_j)
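One way to realize this rule is to draw the missing value at random with weights equal to the observed frequencies, so that more frequent values are more likely to be picked. The sketch below assumes exactly that reading; the function names and the sample data are hypothetical and not taken from the presentation.

```python
import random
from collections import Counter


def value_probabilities(observed):
    """Estimate p(v_i) as the relative frequency of each value in the data."""
    counts = Counter(observed)
    total = len(observed)
    return {value: count / total for value, count in counts.items()}


def predict_missing(observed, rng=None):
    """Draw the (n+1)th value at random, weighted by the observed frequencies.

    Assumes observed is non-empty. Pass a seeded random.Random for
    reproducible draws.
    """
    if rng is None:
        rng = random.Random()
    probabilities = value_probabilities(observed)
    values = list(probabilities)
    weights = [probabilities[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical observations: "sunny" occurs most often, so it is the
# most likely (but not the guaranteed) prediction.
observed = ["sunny", "sunny", "rainy", "sunny", "cloudy"]
probs = value_probabilities(observed)
prediction = predict_missing(observed, random.Random(0))
print(probs)
print(prediction)
```

Because the draw is random, repeated predictions can differ; only the long-run proportions follow p(v_i).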
Predicting Values by leveraging the relational network
This technique applies only to relational data.
The value of a missing instance is predicted as the mode of its peers that fit the relational network and have no missing values.
Example 1
Book A              Book C       Book B
Category A          Category C   Category B

Book A              Book C       Book B
? (Predicted = A)   Category C   Category B
Example 2
Teacher
Student 1   Student 2          Student 3   Student 4
Age(19)     ? (Predicted 19)   Age(18)     Age(19)
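The mode-of-peers prediction in Example 2 can be sketched as follows. This is a minimal illustration, assuming that the peers are the other students linked to the same teacher and that a missing age is stored as None; the function name predict_from_peers is hypothetical.

```python
from collections import Counter


def predict_from_peers(peer_values):
    """Return the mode of the peers' known values, or None if no peer has one.

    On a tie, Counter.most_common returns the value that was seen first.
    """
    known = [v for v in peer_values if v is not None]
    if not known:
        return None
    return Counter(known).most_common(1)[0][0]


# Students linked to the same teacher; Student 2's age is missing.
student_ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}
peers = [age for name, age in student_ages.items() if name != "Student 2"]
predicted_age = predict_from_peers(peers)
print(predicted_age)  # 19
```

Peers with missing values are skipped, in line with the requirement that only peers with no missing values contribute to the prediction.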
Conclusion
Missing values are harmful when the data is used for analysis, learning, or mining purposes.
Various techniques aim to predict missing values, but none has reached 100% accuracy.
An average prediction accuracy of 90% is still acceptable.
Questions, anyone?
I am shivering not because of nervousness but because the room is cold.
- one nervous student