Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
CS639:DataManagementfor
DataScienceLecture23:DataCleaning
[basedonslidesbyJohnCanny]
TheodorosRekatsinas1
DirtyData• TheStatistics View:
• Thereisaprocessthatproducesdata• Wewanttomodelidealsamplesofthatprocess,butinpracticewehavenon-idealsamples:
• Distortion – somesamplesarecorruptedbyaprocess• SelectionBias- likelihoodofasampledependsonitsvalue• Leftandrightcensorship- userscomeandgofromourscrutiny• Dependence – samplesaresupposedtobeindependent,butarenot(e.g.socialnetworks)
• Youcanaddnewmodelsforeachtypeofimperfection,butyoucan’tmodeleverything.
• What’sthebesttrade-offbetweenaccuracyandsimplicity?
DirtyData
• TheDatabase View:• Igotmyhandsonthisdataset• Someofthevaluesaremissing,corrupted,wrong,duplicated
• Resultsareabsolute(relationalmodel)• Yougetabetteranswerbyimprovingthequalityofthevaluesinyourdataset
DirtyData
• TheDomainExpert’s View:• ThisDataDoesn’tlookright• ThisAnswerDoesn’tlookright• Whathappened?
• Domainexpertshaveanimplicitmodelofthedatathattheycantestagainst…
DirtyData
• TheDataScientist’s View:• SomeCombinationofalloftheabove
DataQualityProblems• (Source)Dataisdirtyonitsown.
• Transformationscorruptthedata(complexityofsoftwarepipelines).
• Datasetsarecleanbutintegration (i.e.,combiningthem)screwsthemup.
• “Rare”errorscanbecomefrequentaftertransformationorintegration.
• Datasetsarecleanbutsuffer“bitrot”• Olddatalosesitsvalue/accuracyovertime
• Anycombinationoftheabove
BigPicture:WherecanDirtyDataArise?
7
ExtractTransform
Load
IntegrateClean
NumericOutliers
AdaptedfromJoeHellerstein’s 2012CS194GuestLecture
DataCleaningMakesEverythingOkay?
Theappearanceofaholeintheearth'sozonelayeroverAntarctica,firstdetectedin1976,wassounexpectedthatscientistsdidn'tpayattentiontowhattheirinstrumentsweretellingthem;theythoughttheirinstrumentsweremalfunctioning.
NationalCenterforAtmosphericResearch Infact,thedatawere
rejectedasunreasonablebydataqualitycontrolalgorithms
DirtyDataProblems• FromStanfordDataIntegrationCourse:
1) parsingtextintofields(separatorissues)2) Namingconventions:ER:NYCvs NewYork3) Missingrequiredfield(e.g.keyfield)4) Differentrepresentations(2vs Two)5) Fieldstoolong(gettruncated)6) Primarykeyviolation(fromun- tostructuredor
duringintegration7) RedundantRecords(exactmatchorother)8) Formattingissues– especiallydates9) Licensingissues/Privacy/keepyoufromusingthe
dataasyouwouldlike?
ConventionalDefinitionofDataQuality• Accuracy
• Thedatawasrecordedcorrectly.
• Completeness• Allrelevantdatawasrecorded.
• Uniqueness• Entitiesarerecordedonce.
• Timeliness• Thedataiskeptuptodate.
• Specialproblemsinfederateddata:timeconsistency.
• Consistency• Thedataagreeswithitself.
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
Problems…• Unmeasurable
• Accuracyandcompletenessareextremelydifficult,perhapsimpossibletomeasure.
• Contextindependent• Noaccountingforwhatisimportant.E.g.,ifyouarecomputingaggregates,youcantoleratealotofinaccuracy.
• Incomplete• Whataboutinterpretability,accessibility,metadata,analysis,etc.
• Vague• Theconventionaldefinitionsprovidenoguidancetowardspracticalimprovementsofthedata.
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
Findingamoderndefinition
• Weneedadefinitionofdataqualitywhich• Reflectstheuse ofthedata• Leadstoimprovementsinprocesses• Ismeasurable (wecandefinemetrics)
• First,weneedabetterunderstandingofhowandwheredataqualityproblemsoccur
• Thedataqualitycontinuum
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
MeaningofDataQuality(2)
• Therearemanytypesofdata,whichhavedifferentusesandtypicalqualityproblems
• Federateddata• Highdimensionaldata• Descriptivedata• Longitudinaldata• Streamingdata• Web(scraped)data• Numericvs.categoricalvs.textdata
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
MeaningofDataQuality(2)• Therearemanyusesofdata
• Operations• Aggregateanalysis• Customerrelations…
• DataInterpretation:thedataisuselessifwedonʼtknowalloftherules behindthedata.
• DataSuitability:Canyougettheanswerfromtheavailabledata
• Useofproxydata• Relevantdataismissing
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
TheDataQualityContinuum• Dataandinformationisnotstatic,itflowsinadatacollectionandusageprocess
• Datagathering• Datadelivery• Datastorage• Dataintegration• Dataretrieval• Datamining/analysis
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataGathering• Howdoesthedataenterthesystem?• Sourcesofproblems:
• Manualentry• Nouniformstandardsforcontentandformats• Paralleldataentry(duplicates)• Approximations,surrogates– SW/HWconstraints• Measurementorsensorerrors.
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataGathering- Solutions• PotentialSolutions:
• Preemptive:• Processarchitecture(buildinintegritychecks)• Processmanagement(rewardaccuratedataentry,datasharing,datastewards)
• Retrospective:• Cleaningfocus(duplicateremoval,merge/purge,name&addressmatching,fieldvaluestandardization)
• Diagnosticfocus(automateddetectionofglitches).
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataDelivery• Destroyingormutilatinginformationbyinappropriatepre-processing
• Inappropriateaggregation• Nullsconvertedtodefaultvalues
• Lossofdata:• Bufferoverflows• Transmissionproblems• Nochecks
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataDelivery- Solutions• Buildreliabletransmissionprotocols
• Usearelayserver
• Verification• Checksums,verificationparser• Dotheuploadedfilesfitanexpectedpattern?
• Relationships• Aretheredependenciesbetweendatastreamsandprocessingsteps
• Interfaceagreements• Dataqualitycommitmentfromthedatastreamsupplier.
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataStorage
• Yougetadataset.Whatdoyoudowithit?• Problemsinphysicalstorage
• Canbeanissue,butterabytesarecheap.
• Problemsinlogicalstorage• Poormetadata.
• Datafeedsareoftenderivedfromapplicationprogramsorlegacydatasources.Whatdoesitmean?
• Inappropriatedatamodels.• Missingtimestamps,incorrectnormalization,etc.
• Ad-hocmodifications.• StructurethedatatofittheGUI.
• Hardware/softwareconstraints.• DatatransmissionviaExcelspreadsheets,Y2K
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataStorage- Solutions• Metadata
• Documentandpublishdataspecifications.
• Planning• Assumethateverythingbadwillhappen.• Canbeverydifficult.
• Dataexploration• Usedatabrowsinganddataminingtoolstoexaminethedata.
• Doesitmeetthespecificationsyouassumed?• Hassomethingchanged?
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataRetrieval
• Exporteddatasetsareoftenaviewoftheactualdata.Problemsoccurbecause:
• Sourcedatanotproperlyunderstood.• Needforderiveddatanotunderstood.• Justplainmistakes.
• Innerjoinvs.outerjoin• UnderstandingNULLvalues
• Computationalconstraints• E.g.,tooexpensivetogiveafullhistory,weʼll supplyasnapshot.
• Incompatibility• Ebcdic?Unicode?
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataMiningandAnalysis• Whatareyoudoingwithallthisdataanyway?• Problemsintheanalysis.
• Scaleandperformance• Confidencebounds?• Blackboxesanddartboards• Attachmenttomodels• Insufficientdomainexpertise• Casualempiricism
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
RetrievalandMining- Solutions• Dataexploration
• Determinewhichmodelsandtechniquesareappropriate,finddatabugs,developdomainexpertise.
• Continuousanalysis• Aretheresultsstable?Howdotheychange?
• Accountability• Maketheanalysispartofthefeedbackloop.
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataQualityConstraints
• Manydataqualityproblemscanbecapturedbystaticconstraintsbasedontheschema.
• Nullsnotallowed,fielddomains,foreignkeyconstraints,etc.
• Manyothersareduetoproblemsinworkflow,andcanbecapturedbydynamic constraints
• E.g.,ordersabove$200areprocessedbyBiller2
• Theconstraintsfollowan80-20rule• Afewconstraintscapturemostcases,thousandsofconstraintstocapturethelastfewcases.
• Constraintsaremeasurable.DataQualityMetrics?
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
DataQualityMetrics
• Wewantameasurablequantity• Indicateswhatiswrongandhowtoimprove• RealizethatDQisamessyproblem,nosetofnumberswillbeperfect
• Typesofmetrics• Staticvs.dynamicconstraints• Operationalvs.diagnostic
• Metricsshouldbedirectionallycorrect withanimprovementinuseofthedata.
• Averylargenumbermetricsarepossible• Choosethemostimportantones.
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
ExamplesofDataQualityMetrics• Conformancetoschema
• Evaluateconstraintsonasnapshot.
• Conformancetobusinessrules• Evaluateconstraintsonchangesinthedatabase.
• Accuracy• Performinventory(expensive),oruseproxy(trackcomplaints).Auditsamples?
• Accessibility• Interpretability• Glitchesinanalysis• Successfulcompletionofend-to-endprocess
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
TechnicalApproaches• Weneedamulti-disciplinaryapproachtoattackdataqualityproblems
• Nooneapproachsolvesallproblem
• Processmanagement• Ensureproperprocedures
• Statistics• Focusonanalysis:findandrepairanomaliesindata.
• Database• Focusonrelationships:ensureconsistency.
• Metadata/domainexpertise• Whatdoesitmean?Interpretation
AdaptedfromTedJohnson’sSIGMOD2003Tutorial
Data cleaning for structured dataDetect andrepair errorsinastructureddataset
UniversityofChicago,Cicago,IL
Data cleaning for structured dataDetect andrepair errorsinastructureddataset
UniversityofChicago,Cicago,IL1.Detect
UniversityofChicago,Cicago,IL
Data cleaning for structured dataDetect andrepair errorsinastructureddataset
UniversityofChicago,Cicago,IL1.Detect
UniversityofChicago,Cicago,IL
UniversityofChicago,Chicago,IL2.Repair
A simple exampleChicago’sfoodinspectiondataset
Detect andrepair errorsinastructureddataset
Constraints and minimalityFunctionaldependencies
Bohannonetal.,2005,2007;KolahiandLakshmanan,2005;Bertossietal.,2011;Chuetal.,2013;2015Faginetal.,2015
Constraints and minimalityFunctionaldependencies
Action:Fewererroneousthancorrectcells;performminimumnumberofchangestosatisfyallconstraints
Constraints and minimalityFunctionaldependencies
Error;correctzipcodeis60608
Doesnotfixerrorsandintroducesnewones.
External informationExternallistofaddressesMatchingdependencies
Fanetal.,2009;Bertossietal.,2010;Chuetal.,2015
External informationExternallistofaddressesMatchingdependencies
Action:Mapexternalinformationtoinputdatasetusingmatchingdependenciesandrepairdisagreements
External informationExternallistofaddressesMatchingdependencies
Externaldictionariesmayhavelimitedcoverageornotexistaltogether
Quantitative statisticsReasonaboutco-occurrenceofvaluesacrosscellsinatuple
Estimatethedistributiongoverningeachattribute
Hellerstein,2008;Mayfieldetal.,2010;Yakoutetal.,2013
Example:Chicagoco-occurswithIL
Quantitative statisticsReasonaboutco-occurrenceofvaluesacrosscellsinatuple
Estimatethedistributiongoverningeachattribute
Again,failstorepairthewrongzipcode
Let’s combine everything
Quantitativestatistics
Constraintsandminimality Externaldata
Differentsolutionssuggestdifferentrepairs
A probabilistic model for data repairs
Learning the model