
Data Preparation


DESCRIPTION

Data Preparation in Strategic Business Intelligence


August 10, 2015 Data Mining: Concepts and Techniques

Why Data Preprocessing?

Data in the real world is dirty:
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=" "
  noisy: containing errors or outliers
    e.g., Salary="-10"
  inconsistent: containing discrepancies in codes or names
    e.g., Age="42" Birthday="03/07/1997"
    e.g., Was rating "1,2,3", now rating "A, B, C"
    e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data may come from:
  "Not applicable" data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from:
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from:
  Different data sources
  Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

No quality data, no quality mining results
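
The kinds of dirty data named in these slides (incomplete, noisy, inconsistent, plus duplicate records) can each be flagged with a simple rule. The following is a minimal Python sketch, not a method given in the slides. The record layout, the field names, the helpers `age_on` and `find_issues`, the reading of 03/07/1997 as March 7, 1997, and the reference date used to compute ages are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records modeled on the slide's examples (field names are assumptions).
records = [
    # Blank occupation, negative salary, and an age that contradicts the birthday.
    {"id": 1, "occupation": "", "salary": -10, "age": 42,
     "birthday": date(1997, 3, 7)},   # assumes 03/07/1997 means March 7, 1997
    {"id": 2, "occupation": "engineer", "salary": 52000, "age": 30,
     "birthday": date(1985, 8, 10)},
    # Exact copy of the previous record.
    {"id": 2, "occupation": "engineer", "salary": 52000, "age": 30,
     "birthday": date(1985, 8, 10)},
]


def age_on(birthday, today):
    """Return the number of whole years between birthday and today."""
    years = today.year - birthday.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def find_issues(rows, today):
    """Return (row_index, kind, message) tuples for each problem found."""
    issues = []
    first_seen = {}  # maps a record's contents to the index where it first appeared
    for i, row in enumerate(rows):
        # Incomplete: a required attribute value is missing or blank.
        if not str(row["occupation"]).strip():
            issues.append((i, "incomplete", "occupation is blank"))
        # Noisy: the value lies outside any plausible range.
        if row["salary"] < 0:
            issues.append((i, "noisy", f"salary {row['salary']} is negative"))
        # Inconsistent: two attributes of the same record contradict each other.
        derived = age_on(row["birthday"], today)
        if derived != row["age"]:
            issues.append((i, "inconsistent",
                           f"age {row['age']} but birthday implies {derived}"))
        # Duplicate: every field matches an earlier record.
        key = tuple(sorted(row.items()))
        if key in first_seen:
            issues.append((i, "duplicate", f"same as row {first_seen[key]}"))
        else:
            first_seen[key] = i
    return issues


# Reference date is an assumption; any fixed "as of" date works.
issues = find_issues(records, today=date(2015, 8, 10))
for index, kind, message in issues:
    print(f"row {index}: {kind}: {message}")
```

Each check only reports a problem and leaves the records unchanged. Deciding whether to fill in, correct, or drop a flagged value is a separate cleaning step that depends on the data and the analysis.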