Data Cleansing: Filling Missing Values in Data

Class Presentation, CIS 764

Instructor: Dr. William Hankley

Presented by: Gaurav Chauhan




CIS 764-Gaurav Chauhan

Overview

Problems Caused
Methods for retrieving missing values
Predicting values
    The average way
    The probabilistic way
    By leveraging the relational network structure
Conclusions


Problems Caused

Missing values in the data cause problems in the following data analysis tasks:

Summarizing variables
Computing new variables
Comparing variables
Combining variables
Time series analysis


Methods for retrieving missing values

Considering average of the available values for prediction

Using probabilistic approach for value prediction

Leveraging the relational network structure of the data to predict values


Predicting Values: the average way

Year   Rainfall (avg, cm)        Temperature (avg, °F)
1936   30                        60
1937   32                        66
1938   N.A. (predicted: 28.5)    62
1939   25                        64
1940   23                        69
1941   30                        59
1942   N.A. (predicted: 29.0)    60
1943   28                        59
1944   22                        65


To find the missing rainfall values for 1938 and 1942, we take the average of the rainfall in the two neighboring years.

Taking the average of the rainfall of 1937 and 1939:
Rainfall in 1938 = (32 + 25)/2 cm = 28.5 cm

Taking the average of the rainfall of 1941 and 1943:
Rainfall in 1942 = (30 + 28)/2 cm = 29 cm
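The averaging step above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `fill_with_neighbor_average` and the use of `None` to mark a missing value are assumptions made for this example, not part of the presentation. It leaves an entry unfilled when it has no known neighbor on both sides.

```python
from typing import List, Optional


def fill_with_neighbor_average(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing entry (None) with the mean of its two immediate neighbors.

    An entry stays None when it is at either end of the list or when
    one of its neighbors is itself missing.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        if 0 < i < len(values) - 1:
            left, right = values[i - 1], values[i + 1]
            if left is not None and right is not None:
                filled[i] = (left + right) / 2
    return filled


# Rainfall (cm) for 1936..1944 from the table; None marks 1938 and 1942.
years = list(range(1936, 1945))
rainfall = [30, 32, None, 25, 23, 30, None, 28, 22]

filled_rainfall = fill_with_neighbor_average(rainfall)
for year, original, filled in zip(years, rainfall, filled_rainfall):
    if original is None:
        print(year, filled)
```

Run on the rainfall column, the sketch fills 1938 with 28.5 and 1942 with 29.0, matching the hand calculation.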


Predicting Values: the probabilistic way

Assume that we have n values and are required to predict the (n+1)th value.

For every i from 1 to n, the probability that a data instance has the value v_i is p(v_i).

Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data.

v_(n+1) is then picked at random such that

p(v_(n+1) = v_i) > p(v_(n+1) = v_j) if p(v_i) > p(v_j)
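One way to satisfy this condition is to draw the missing value at random with weights equal to the observed relative frequencies, so that more frequent values are proportionally more likely to be picked. The sketch below assumes exactly that; the presentation does not specify the sampling scheme. The function names and the sample data are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def value_probabilities(observed: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Estimate p(v) for each distinct value as its relative frequency."""
    counts = Counter(observed)
    total = len(observed)
    return {value: count / total for value, count in counts.items()}


def predict_missing(observed: Sequence[Hashable],
                    rng: Optional[random.Random] = None) -> Hashable:
    """Draw one value at random, weighted by its observed frequency."""
    rng = rng or random.Random()
    probabilities = value_probabilities(observed)
    values = list(probabilities)
    weights = [probabilities[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical observations: "sunny" occurs three times, "rainy" once.
observed = ["sunny", "sunny", "rainy", "sunny"]
print(value_probabilities(observed))
print(predict_missing(observed, random.Random(0)))
```

Because the draw is random, repeated calls can return different values. Only the long-run frequencies follow p(v_i).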


Predicting Values by leveraging the relational network

This technique applies only to relational data.

A missing value is predicted as the mode of the corresponding values of its peers, that is, the instances that fit the same relational network and have no missing values.



Predicting Values by leveraging the relational network

Example 1

Book A              Book C        Book B
Category A          Category C    Category B

Book A              Book C        Book B
? (Predicted = A)   Category C    Category B


Predicting Values by leveraging the relational network

Example 2

Teacher

Student 1   Student 2          Student 3   Student 4
Age(19)     ? (Predicted 19)   Age(18)     Age(19)
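Example 2 can be reproduced with a short Python sketch that predicts a missing value as the mode of the known values of an instance's peers. The dictionary layout, the function name `predict_from_peers`, and the assumption that the other students of the same teacher are the peers of Student 2 are illustrative choices, not something the slides spell out.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional


def predict_from_peers(node: str,
                       peers: Dict[str, List[str]],
                       values: Dict[str, Optional[Hashable]]) -> Optional[Hashable]:
    """Return the mode of the known values among the peers of `node`.

    Peers whose own value is missing (None) are ignored. Returns None when
    no peer has a known value. Ties go to the value seen first among peers.
    """
    known = [values[p] for p in peers.get(node, []) if values.get(p) is not None]
    if not known:
        return None
    return Counter(known).most_common(1)[0][0]


# Students of the same teacher are treated as peers (assumption for this sketch).
ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}
peers = {"Student 2": ["Student 1", "Student 3", "Student 4"]}

predicted_age = predict_from_peers("Student 2", peers, ages)
print(predicted_age)
```

Here the peers' known ages are 19, 18 and 19, so the mode, 19, is the predicted age for Student 2.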


Conclusion

Missing values are harmful when the data is used for analysis, learning, or mining purposes.

Various techniques aim to predict missing values, but none has reached 100% accuracy.

An average accuracy of 90% in predicting these values is still acceptable.




Questions, anyone?

I am shivering not because of nervousness but because of cold room temperature

-one nervous student