CS639: Data Management for Data Science
Lecture 23: Data Cleaning
[based on slides by John Canny]
Theodoros Rekatsinas



Page 2

Dirty Data

• The Statistics View:
  • There is a process that produces data
  • We want to model ideal samples of that process, but in practice we have non-ideal samples:
    • Distortion: some samples are corrupted by a process
    • Selection bias: likelihood of a sample depends on its value
    • Left and right censorship: users come and go from our scrutiny
    • Dependence: samples are supposed to be independent, but are not (e.g. social networks)
  • You can add new models for each type of imperfection, but you can't model everything.
  • What's the best trade-off between accuracy and simplicity?
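A minimal sketch of how two of these imperfections shift a simple estimate. The population, the selection rule, and the cutoff are hypothetical values chosen for illustration, and the selection rule is deterministic only to keep the arithmetic easy to follow.

```python
from statistics import mean

# Hypothetical "ideal" samples of the process: the values 1..10.
population = list(range(1, 11))

# Selection bias: whether a sample is observed depends on its value.
# Here only values above 5 are ever recorded (an assumed, extreme rule).
selected = [v for v in population if v > 5]

# Right censorship: observation stops at an assumed cutoff of 7,
# so every larger value is recorded as 7.
CUTOFF = 7
censored = [min(v, CUTOFF) for v in population]

true_mean = mean(population)    # mean of the ideal samples
selected_mean = mean(selected)  # pulled upward by the selection rule
censored_mean = mean(censored)  # pulled downward by censoring
```

Neither distorted sample contains an obviously "wrong" value, which is why these problems are hard to catch by inspecting individual records.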

Page 3

Dirty Data

• The Database View:
  • I got my hands on this data set
  • Some of the values are missing, corrupted, wrong, duplicated
  • Results are absolute (relational model)
  • You get a better answer by improving the quality of the values in your dataset

Page 4

Dirty Data

• The Domain Expert's View:
  • This data doesn't look right
  • This answer doesn't look right
  • What happened?
• Domain experts have an implicit model of the data that they can test against…

Page 5

Dirty Data

• The Data Scientist's View:
  • Some combination of all of the above

Page 6

Data Quality Problems

• (Source) Data is dirty on its own.
• Transformations corrupt the data (complexity of software pipelines).
• Data sets are clean but integration (i.e., combining them) screws them up.
• "Rare" errors can become frequent after transformation or integration.
• Data sets are clean but suffer "bit rot"
  • Old data loses its value/accuracy over time
• Any combination of the above

Page 7

Big Picture: Where Can Dirty Data Arise?

[Figure: data pipeline with the stages Extract, Transform, Load, Integrate, Clean]

Page 8

Numeric Outliers

Adapted from Joe Hellerstein's 2012 CS194 Guest Lecture

Page 9

Data Cleaning Makes Everything Okay?

The appearance of a hole in the earth's ozone layer over Antarctica, first detected in 1976, was so unexpected that scientists didn't pay attention to what their instruments were telling them; they thought their instruments were malfunctioning.
National Center for Atmospheric Research

In fact, the data were rejected as unreasonable by data quality control algorithms.
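The anecdote argues for flagging suspicious values for review instead of silently discarding them. A minimal sketch, assuming a median/MAD rule (the modified z-score, a common robust choice; the slides do not say which rule the quality control algorithms used). The readings and the 3.5 threshold are hypothetical.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    Values are flagged, not removed, so a person can decide whether
    an extreme reading is an error or a real signal.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread in the bulk of the data; the score is undefined
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sensor readings; the last one is unusually low.
readings = [300, 310, 305, 295, 302, 180]
flagged = flag_outliers(readings)
```

The function returns positions only. Whether the flagged reading is a malfunction or an ozone hole is a question for a domain expert, not for the filter.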

Page 10

Dirty Data Problems

• From Stanford Data Integration Course:

1) Parsing text into fields (separator issues)
2) Naming conventions: ER: NYC vs New York
3) Missing required field (e.g. key field)
4) Different representations (2 vs Two)
5) Fields too long (get truncated)
6) Primary key violation (from unstructured to structured, or during integration)
7) Redundant records (exact match or other)
8) Formatting issues, especially dates
9) Licensing issues / privacy / keep you from using the data as you would like?
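Items 2, 4, and 8 are often handled by mapping every variant to one canonical form. A minimal sketch under stated assumptions: the alias table, the number words, and the list of accepted date formats are hypothetical and would come from domain knowledge in practice.

```python
from datetime import datetime

# Hypothetical lookup tables, chosen for illustration only.
CITY_ALIASES = {
    "nyc": "New York",
    "new york city": "New York",
    "new york": "New York",
}
NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_city(raw):
    """Map known aliases to one canonical name; leave unknown names unchanged."""
    key = " ".join(raw.lower().split())
    return CITY_ALIASES.get(key, raw.strip())


def normalize_count(raw):
    """Accept '2' or 'Two'; return None when the value cannot be interpreted."""
    text = raw.strip().lower()
    try:
        return int(text)
    except ValueError:
        return NUMBER_WORDS.get(text)


def normalize_date(raw):
    """Try each accepted format in turn; return an ISO 8601 date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` for values that cannot be interpreted keeps the failure visible. Note that a string such as `03/04/2019` is ambiguous between month-first and day-first conventions; the format list above simply assumes month-first.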

Page 11

Conventional Definition of Data Quality

• Accuracy
  • The data was recorded correctly.
• Completeness
  • All relevant data was recorded.
• Uniqueness
  • Entities are recorded once.
• Timeliness
  • The data is kept up to date.
  • Special problems in federated data: time consistency.
• Consistency
  • The data agrees with itself.

Adapted from Ted Johnson's SIGMOD 2003 Tutorial

Page 12

Problems…

• Unmeasurable
  • Accuracy and completeness are extremely difficult, perhaps impossible to measure.
• Context independent
  • No accounting for what is important. E.g., if you are computing aggregates, you can tolerate a lot of inaccuracy.
• Incomplete
  • What about interpretability, accessibility, metadata, analysis, etc.
• Vague
  • The conventional definitions provide no guidance towards practical improvements of the data.

Page 13

Finding a modern definition

• We need a definition of data quality which
  • Reflects the use of the data
  • Leads to improvements in processes
  • Is measurable (we can define metrics)
• First, we need a better understanding of how and where data quality problems occur
  • The data quality continuum

Page 14

Meaning of Data Quality (2)

• There are many types of data, which have different uses and typical quality problems
  • Federated data
  • High dimensional data
  • Descriptive data
  • Longitudinal data
  • Streaming data
  • Web (scraped) data
  • Numeric vs. categorical vs. text data

Page 15

Meaning of Data Quality (2)

• There are many uses of data
  • Operations
  • Aggregate analysis
  • Customer relations…
• Data Interpretation: the data is useless if we don't know all of the rules behind the data.
• Data Suitability: Can you get the answer from the available data?
  • Use of proxy data
  • Relevant data is missing

Page 16

The Data Quality Continuum

• Data and information are not static; they flow through a data collection and usage process
  • Data gathering
  • Data delivery
  • Data storage
  • Data integration
  • Data retrieval
  • Data mining/analysis

Page 17

Data Gathering

• How does the data enter the system?
• Sources of problems:
  • Manual entry
  • No uniform standards for content and formats
  • Parallel data entry (duplicates)
  • Approximations, surrogates: SW/HW constraints
  • Measurement or sensor errors.

Page 18

Data Gathering - Solutions

• Potential Solutions:
  • Preemptive:
    • Process architecture (build in integrity checks)
    • Process management (reward accurate data entry, data sharing, data stewards)
  • Retrospective:
    • Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization)
    • Diagnostic focus (automated detection of glitches).

Page 19

Data Delivery

• Destroying or mutilating information by inappropriate pre-processing
  • Inappropriate aggregation
  • Nulls converted to default values
• Loss of data:
  • Buffer overflows
  • Transmission problems
  • No checks
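A minimal sketch of why converting nulls to default values destroys information. The temperature readings and the default of 0.0 are hypothetical.

```python
from statistics import mean

# Hypothetical temperature feed; None marks a missing reading.
readings = [21.0, None, 23.0, None, 22.0]

# What an inappropriate pre-processing step might deliver: nulls become 0.0.
delivered = [0.0 if r is None else r for r in readings]

# Keeping the nulls lets the consumer decide how to handle them.
present = [r for r in readings if r is not None]

mean_with_defaults = mean(delivered)  # treats missing readings as real zeros
mean_of_present = mean(present)       # ignores missing readings
missing_fraction = readings.count(None) / len(readings)
```

After the conversion the consumer can no longer tell a real 0.0 from a missing value, so neither the corrected mean nor the missing fraction can be recovered downstream.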

Page 20

Data Delivery - Solutions

• Build reliable transmission protocols
  • Use a relay server
• Verification
  • Checksums, verification parser
  • Do the uploaded files fit an expected pattern?
• Relationships
  • Are there dependencies between data streams and processing steps?
• Interface agreements
  • Data quality commitment from the data stream supplier.
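The verification bullet can be sketched as two independent checks on a delivered file: a checksum comparison and a pattern check on every line. This is a minimal sketch; the row format (integer id, ISO date, amount) and the function names are assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical row format for the feed: integer id, ISO date, amount.
ROW_PATTERN = re.compile(r"\d+,\d{4}-\d{2}-\d{2},\d+(\.\d+)?")


def sha256_hex(payload: bytes) -> str:
    """Checksum the sender publishes alongside the file."""
    return hashlib.sha256(payload).hexdigest()


def verify_delivery(payload: bytes, expected_digest: str):
    """Return (checksum_ok, line numbers that do not fit the expected pattern)."""
    checksum_ok = sha256_hex(payload) == expected_digest
    lines = payload.decode("utf-8", errors="replace").splitlines()
    bad_lines = [
        number for number, line in enumerate(lines, start=1)
        if not ROW_PATTERN.fullmatch(line)
    ]
    return checksum_ok, bad_lines
```

The checksum catches transmission damage but says nothing about whether the sender produced sensible content, while the pattern check does the reverse, so the two are complementary.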

Page 21

Data Storage

• You get a data set. What do you do with it?
• Problems in physical storage
  • Can be an issue, but terabytes are cheap.
• Problems in logical storage
  • Poor metadata.
    • Data feeds are often derived from application programs or legacy data sources. What does it mean?
  • Inappropriate data models.
    • Missing timestamps, incorrect normalization, etc.
  • Ad-hoc modifications.
    • Structure the data to fit the GUI.
  • Hardware / software constraints.
    • Data transmission via Excel spreadsheets, Y2K

Page 22

Data Storage - Solutions

• Metadata
  • Document and publish data specifications.
• Planning
  • Assume that everything bad will happen.
  • Can be very difficult.
• Data exploration
  • Use data browsing and data mining tools to examine the data.
    • Does it meet the specifications you assumed?
    • Has something changed?

Page 23

Data Retrieval

• Exported data sets are often a view of the actual data. Problems occur because:
  • Source data not properly understood.
  • Need for derived data not understood.
  • Just plain mistakes.
    • Inner join vs. outer join
    • Understanding NULL values
• Computational constraints
  • E.g., too expensive to give a full history, we'll supply a snapshot.
• Incompatibility
  • EBCDIC? Unicode?
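The two "plain mistakes" are easy to reproduce. A minimal sketch using an in-memory SQLite database; the tables and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, NULL), (12, 2, 40.0);
""")

# Inner join: customers without orders silently disappear from the export.
inner_names = [row[0] for row in conn.execute(
    "SELECT DISTINCT c.name FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY c.name")]

# Left outer join: every customer is kept, with NULLs for missing orders.
outer_names = [row[0] for row in conn.execute(
    "SELECT DISTINCT c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.name")]

# COUNT(*) counts rows, while COUNT(amount) skips NULL amounts.
row_count, amount_count = conn.execute(
    "SELECT COUNT(*), COUNT(amount) FROM orders").fetchone()

# A comparison with NULL is unknown, so the NULL row fails `amount <> 25.0`.
not_25 = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount <> 25.0").fetchone()[0]
```

Here the inner join drops Cy, and the `<>` filter returns one row although two of the three orders do not have an amount of 25.0.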

Page 24

Data Mining and Analysis

• What are you doing with all this data anyway?
• Problems in the analysis.
  • Scale and performance
  • Confidence bounds?
  • Black boxes and dart boards
  • Attachment to models
  • Insufficient domain expertise
  • Casual empiricism

Page 25

Retrieval and Mining - Solutions

• Data exploration
  • Determine which models and techniques are appropriate, find data bugs, develop domain expertise.
• Continuous analysis
  • Are the results stable? How do they change?
• Accountability
  • Make the analysis part of the feedback loop.

Page 26

Data Quality Constraints

• Many data quality problems can be captured by static constraints based on the schema.
  • Nulls not allowed, field domains, foreign key constraints, etc.
• Many others are due to problems in workflow, and can be captured by dynamic constraints
  • E.g., orders above $200 are processed by Biller 2
• The constraints follow an 80-20 rule
  • A few constraints capture most cases, thousands of constraints to capture the last few cases.
• Constraints are measurable. Data Quality Metrics?
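Both kinds of constraint can be written as small checks on a record. A minimal sketch: the field names, the state domain, and the record layout are hypothetical; only the Biller 2 rule comes from the slide.

```python
VALID_STATES = {"IL", "IN", "WI"}  # hypothetical field domain


def static_violations(order):
    """Schema-level checks on one record: nulls and field domains."""
    problems = []
    if order.get("order_id") is None:
        problems.append("order_id is null")
    if order.get("state") not in VALID_STATES:
        problems.append("state outside domain")
    if order.get("amount") is None or order["amount"] < 0:
        problems.append("amount missing or negative")
    return problems


def dynamic_violations(order):
    """Workflow rule from the slide: orders above $200 are processed by Biller 2."""
    amount = order.get("amount")
    if amount is not None and amount > 200 and order.get("biller") != "Biller 2":
        return ["order above $200 not processed by Biller 2"]
    return []
```

Because each check returns a list of violations, counting them over a table gives a number that can be tracked over time, which is the step from constraints to metrics.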

Page 27

Data Quality Metrics

• We want a measurable quantity
  • Indicates what is wrong and how to improve
  • Realize that DQ is a messy problem, no set of numbers will be perfect
• Types of metrics
  • Static vs. dynamic constraints
  • Operational vs. diagnostic
• Metrics should be directionally correct with an improvement in use of the data.
• A very large number of metrics are possible
  • Choose the most important ones.

Page 28

Examples of Data Quality Metrics

• Conformance to schema
  • Evaluate constraints on a snapshot.
• Conformance to business rules
  • Evaluate constraints on changes in the database.
• Accuracy
  • Perform inventory (expensive), or use proxy (track complaints). Audit samples?
• Accessibility
• Interpretability
• Glitches in analysis
• Successful completion of end-to-end process
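Conformance to schema on a snapshot can be reported as the fraction of rows that pass each constraint. A minimal sketch; the function name, the constraint names, and the sample rows are hypothetical.

```python
def conformance_report(rows, constraints):
    """Return the fraction of rows passing each constraint and all of them together.

    `constraints` maps a constraint name to a predicate over one row.
    """
    if not rows:
        raise ValueError("cannot compute conformance on an empty snapshot")
    total = len(rows)
    report = {
        name: sum(1 for row in rows if check(row)) / total
        for name, check in constraints.items()
    }
    report["all_constraints"] = sum(
        1 for row in rows if all(check(row) for check in constraints.values())
    ) / total
    return report


# Hypothetical snapshot and constraints.
snapshot = [
    {"id": 1, "state": "IL"},
    {"id": None, "state": "IL"},
    {"id": 3, "state": "XX"},
    {"id": 4, "state": "WI"},
]
constraints = {
    "id_not_null": lambda row: row["id"] is not None,
    "state_in_domain": lambda row: row["state"] in {"IL", "IN", "WI"},
}
report = conformance_report(snapshot, constraints)
```

The per-constraint rates point at what is wrong, while the combined rate is the single number to track between snapshots.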

Page 29

Technical Approaches

• We need a multi-disciplinary approach to attack data quality problems
  • No one approach solves all problems
• Process management
  • Ensure proper procedures
• Statistics
  • Focus on analysis: find and repair anomalies in data.
• Database
  • Focus on relationships: ensure consistency.
• Metadata / domain expertise
  • What does it mean? Interpretation

Page 30

Data cleaning for structured data

Detect and repair errors in a structured dataset

University of Chicago, Cicago, IL

Page 31

Data cleaning for structured data

Detect and repair errors in a structured dataset

University of Chicago, Cicago, IL

1. Detect

University of Chicago, Cicago, IL

Page 32

Data cleaning for structured data

Detect and repair errors in a structured dataset

University of Chicago, Cicago, IL

1. Detect

University of Chicago, Cicago, IL

2. Repair

University of Chicago, Chicago, IL

Page 33

A simple example

Chicago's food inspection dataset

Detect and repair errors in a structured dataset

Page 34

Constraints and minimality

Functional dependencies

Bohannon et al., 2005, 2007; Kolahi and Lakshmanan, 2005; Bertossi et al., 2011; Chu et al., 2013, 2015; Fagin et al., 2015
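A functional dependency X → Y requires that rows agreeing on X also agree on Y, so detection amounts to grouping by X and looking for groups with more than one Y value. A minimal sketch; the function name, column names, and rows are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs value: sorted distinct rhs values} for groups violating lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Hypothetical rows checked against the dependency address -> zip.
rows = [
    {"address": "100 Example St", "zip": "60608"},
    {"address": "100 Example St", "zip": "60609"},
    {"address": "200 Sample Ave", "zip": "60614"},
]
violations = fd_violations(rows, "address", "zip")
```

Detection only says that the group is inconsistent. It does not say which of the conflicting values is correct.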

Page 35

Constraints and minimality

Functional dependencies

Action: Fewer erroneous than correct cells; perform minimum number of changes to satisfy all constraints

Page 36

Constraints and minimality

Functional dependencies

Error; correct zip code is 60608

Does not fix errors and introduces new ones.
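The failure is easy to reproduce with a majority-vote repair, one simple way to satisfy address → zip with the fewest cell changes. A minimal sketch under stated assumptions: only right-hand-side cells are changed, the address and the competing zip code 60609 are hypothetical, and only the correct value 60608 comes from the slide.

```python
from collections import Counter, defaultdict


def minimal_fd_repair(rows, lhs, rhs):
    """Satisfy lhs -> rhs by setting each group to its most common rhs value.

    Returns (repaired rows, number of changed cells). The input is not modified.
    """
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[lhs]][row[rhs]] += 1
    majority = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

    repaired, changes = [], 0
    for row in rows:
        target = majority[row[lhs]]
        if row[rhs] != target:
            changes += 1
        repaired.append({**row, rhs: target})
    return repaired, changes


# Hypothetical group: the correct zip is 60608, but most rows carry 60609.
rows = [
    {"address": "100 Example St", "zip": "60609"},
    {"address": "100 Example St", "zip": "60609"},
    {"address": "100 Example St", "zip": "60608"},
]
repaired, changes = minimal_fd_repair(rows, "address", "zip")
```

One change makes the data consistent, but it overwrites the only correct cell: the two wrong values stay wrong and a new error is introduced. Minimality assumes that errors are the minority, which does not hold here.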

Page 37

External information

External list of addresses

Matching dependencies

Fan et al., 2009; Bertossi et al., 2010; Chu et al., 2015

Page 38

External information

External list of addresses

Matching dependencies

Action: Map external information to input dataset using matching dependencies and repair disagreements

Page 39

External information

External list of addresses

Matching dependencies

External dictionaries may have limited coverage or not exist altogether
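A minimal sketch of repairing against an external list, assuming the simplest possible matching dependency: if two addresses are equal after normalization, their zip codes must agree, and the external list is trusted. Matching dependencies in the cited work are typically defined over similarity predicates; exact matching after normalization is a simplification, and all names and rows here are hypothetical.

```python
def normalize(text):
    """Lowercase, drop periods, and collapse whitespace before comparing."""
    return " ".join(text.lower().replace(".", "").split())


def repair_with_dictionary(rows, external):
    """Overwrite zip with the external value when the address matches.

    Returns (repaired rows, number of rows the external list does not cover).
    """
    lookup = {normalize(entry["address"]): entry["zip"] for entry in external}
    repaired, uncovered = [], 0
    for row in rows:
        trusted = lookup.get(normalize(row["address"]))
        if trusted is None:
            uncovered += 1
            repaired.append(dict(row))  # no evidence, leave the row unchanged
        else:
            repaired.append({**row, "zip": trusted})
    return repaired, uncovered


# Hypothetical input rows and external address list.
rows = [
    {"address": "100 Example St.", "zip": "60609"},
    {"address": "999 Unknown Rd", "zip": "60601"},
]
external = [{"address": "100 EXAMPLE ST", "zip": "60608"}]
repaired, uncovered = repair_with_dictionary(rows, external)
```

The `uncovered` count makes the coverage limitation explicit: rows without a match are neither confirmed nor repaired.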

Page 40

Quantitative statistics

Reason about co-occurrence of values across cells in a tuple

Estimate the distribution governing each attribute

Hellerstein, 2008; Mayfield et al., 2010; Yakout et al., 2013

Example: Chicago co-occurs with IL
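Co-occurrence can be estimated with plain counts. A minimal sketch that estimates P(city | state) from the data itself and proposes the most likely city for a state; the function names and the rows are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence(rows, given, target):
    """Estimate P(target | given) from empirical counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[given]][row[target]] += 1
    return {
        g: {t: n / sum(c.values()) for t, n in c.items()}
        for g, c in counts.items()
    }


def most_likely(probs, given_value):
    """Return the most probable target value, or None for an unseen given value."""
    dist = probs.get(given_value, {})
    return max(dist, key=dist.get) if dist else None


# Hypothetical rows: "Chicago" dominates the rows with state IL.
rows = (
    [{"city": "Chicago", "state": "IL"}] * 3
    + [{"city": "Cicago", "state": "IL"}]
    + [{"city": "Madison", "state": "WI"}]
)
probs = cooccurrence(rows, "state", "city")
suggestion = most_likely(probs, "IL")
```

Like the minimality principle, this favors whatever value is frequent, so it repairs a rare misspelling but cannot correct a value that is wrong in most rows.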

Page 41

Quantitative statistics

Reason about co-occurrence of values across cells in a tuple

Estimate the distribution governing each attribute

Again, fails to repair the wrong zip code

Page 42

Let's combine everything

• Quantitative statistics
• Constraints and minimality
• External data

Different solutions suggest different repairs

Page 43

A probabilistic model for data repairs
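A minimal sketch of one way to turn the three signals into a distribution over candidate repairs: each candidate value gets a weighted sum of per-signal scores, and a softmax normalizes the sums into probabilities. This is an illustrative log-linear model, not necessarily the model the lecture goes on to present. The feature names, feature values, and weights are all hypothetical; in practice the weights would be learned from data.

```python
import math


def repair_distribution(features, weights):
    """Map {candidate: {signal: score}} to {candidate: probability} via softmax."""
    scores = {
        candidate: sum(weights[name] * value for name, value in signals.items())
        for candidate, signals in features.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {candidate: math.exp(s - top) for candidate, s in scores.items()}
    total = sum(exps.values())
    return {candidate: e / total for candidate, e in exps.items()}


# Hypothetical signals for one zip cell with two candidate values.
features = {
    "60608": {"external_match": 1.0, "cooccurrence": 0.25, "minimality": 0.0},
    "60609": {"external_match": 0.0, "cooccurrence": 0.75, "minimality": 1.0},
}
# Hypothetical weights; here the external list is trusted the most.
weights = {"external_match": 3.0, "cooccurrence": 1.0, "minimality": 1.0}

distribution = repair_distribution(features, weights)
best = max(distribution, key=distribution.get)
```

With these weights the external evidence outvotes the two frequency-based signals, and the output is a probability per candidate instead of a single hard repair.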

Page 44

Learning the model