GIS Data Quality
Producing better data quality through robust business processes
Kim Ollivier
BrightStar TRAINING
Schedule Day One
Suggested breaks for the following times:
Start: 9:00
Session 1 (90 min)
Morning tea: 10:30 to 10:45
Session 2 (105 min)
Lunch: 12:30 to 1:30
Session 3 (90 min)
Afternoon tea: 3:00 to 3:15
Session 4 (105 min)
Finish: 5:00
Each session will have an exercise or interactive discussion
Today
• Introduction
• What causes poor quality
• Lunch
• Assessing Quality processes
• GIS upgrade project examples
Tomorrow
• Metadata
• Designing rules
• Lunch
• Data warehouse and ETL
• Feature maintenance
Overview
• Introduce yourself
• Your goals for this course?
• Build a data quality system
• Avoid the worst traps
• Be able to describe a project scope
    • Budget, timeline, priorities
Sections of course based on Data Quality Assessment by Arkady Maydanchik
With permission from the author
ISBN 978-0-09771400-2
What is Data Quality?
Data are of high quality:
• “If they are fit for their intended uses in operations, decision making and planning.”
• “If they correctly represent the real-world construct to which they refer.”
Spatial Accuracy
Statistical Accuracy
Completeness Score = Relevant / (Relevant + Missing)
Accuracy Score = (Relevant - Errors) / Relevant
Overall Score = (Relevant - Errors) / (Relevant + Missing)
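The three scores above can be worked through with a small Python sketch. The function name and the sample counts are illustrative assumptions, not figures from the course; only the formulas come from the slide.

```python
def quality_scores(relevant: int, missing: int, errors: int) -> dict:
    """Return the three slide scores as fractions between 0 and 1.

    relevant: records present in the dataset that should be there
    missing:  records that should be there but are absent
    errors:   relevant records that contain an error
    """
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    completeness = relevant / (relevant + missing)
    accuracy = (relevant - errors) / relevant
    overall = (relevant - errors) / (relevant + missing)
    return {"completeness": completeness, "accuracy": accuracy, "overall": overall}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
scores = quality_scores(relevant=900, missing=100, errors=45)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts the overall score equals completeness multiplied by accuracy, which follows directly from the three formulas.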
Completeness
LINZ Bulk Data Extract
metadata\meta.html
Data Profiling
• Find out what is there
• Assess the risks
• Understand data challenges early
• Have an enterprise view of all data
Profile Metrics
• Integrity
• Consistency
• Completeness, Density
• Validity
• Timeliness
• Accessibility
• Uniqueness
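As a rough illustration of three of these metrics (completeness, uniqueness and validity), the following Python sketch profiles a single attribute column. The function name, the definitions chosen for each metric, the sample values and the allowed domain are all assumptions made for the example.

```python
from collections import Counter


def profile_column(values, allowed=None):
    """Profile one attribute column.

    completeness: share of records with a non-blank value
    uniqueness:   distinct non-blank values divided by non-blank values
    validity:     share of non-blank values found in the allowed domain
    """
    total = len(values)
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    profile = {
        "records": total,
        "completeness": len(present) / total if total else 0.0,
        "uniqueness": len(counts) / len(present) if present else 0.0,
        "most_common": counts.most_common(3),
    }
    if allowed is not None:
        valid = sum(1 for v in present if v in allowed)
        profile["validity"] = valid / len(present) if present else 0.0
    return profile


# Hypothetical road surface attribute with blanks, a case variant and a stray code.
road_surface = ["sealed", "sealed", "metal", "", None, "SEALED", "sealed", "unknown"]
report = profile_column(road_surface, allowed={"sealed", "metal", "unsealed"})
print(report)
```

The case variant "SEALED" counts against validity here because the comparison is exact; whether that should be an error is a rule-design decision.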
Security
• Confidentiality
• Possession
• Integrity
• Authenticity
• Availability
• Utility
Consistency
• Discrepancies between attributes
• Exceptions in a cluster
• Spatial discrepancies
A GIS Data Quality System
[Diagram] Stages: Assess, Improve, Prevent, Recognise, Monitor
Components shown:
• Data Quality Assessment
• Data Profiling
• Data Cleaning
• Monitoring Data Integration Interfaces
• Ensuring Quality of Data Conversion and Consolidation
• Building Data Quality Metadata Warehouse
• Recurrent Data Quality Assessment
Course examples
• LINZ coordinate upgrade 1998-2003
• NSCC services upgrade 2008
• Valuation roll structure and matching
• ETL of utilities from SDE to AutoCAD
• Address location issues NAR, DRA
Documents and examples on memory stick
Exercise 1: Nominate your database
Select a representative example dataset for later discussion
• You may be responsible for it
• Or, you have to integrate it
• Or, you have to load it
• Or, you supply it to others
Morning Tea
Assessing Quality
1. Project steps
2. Required roles
3. Defining the objectives
4. Designing rules
5. Scorecard and Metadata
6. Frequency of assessment
7. Common mistakes
Processes Affecting Data Quality
[Diagram] Three groups of processes acting on the Database:
Processes bringing data from outside
• Initial Data Conversion
• System Consolidations
• Manual Data Entry
• Batch Feeds
• Real-Time Interfaces
Processes causing data decay
• Changes not captured
• System Upgrades
• New Data Uses
• Loss of Expertise
• Process Automation
Processes changing data from within
• Data processing
• Data cleaning
• Data purging
Outside: Initial Data Conversion
• Define data mapping
• Extract, Transform, Load (ETL)
• Drown in Data Problems
• Find Scapegoat
Outside: System Consolidation
• Often from mergers (Auckland?)
    • Unplanned, unreasonable timeframes
• Head-on two car wreck
• Square pegs into round holes
• Winner-loser merging (50% wrong)
Outside: Manual Data Entry
• High error rate
• Complex and poor entry forms
• Users find ways around checks
• Forcing non blanks does not work
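One reason forcing non-blank fields fails is that users type filler values to get past the check. A minimal Python sketch of a report that flags likely fillers is shown below; the list of suspect values, the 20% threshold and the sample data are assumptions to be tuned per dataset, and a flagged value is only a candidate for review, not a confirmed error.

```python
from collections import Counter

# Hypothetical filler values that users type to satisfy a mandatory field.
SUSPECT_VALUES = {"", ".", "-", "x", "xx", "xxx", "n/a", "na", "none",
                  "unknown", "tbc", "0", "999", "9999"}


def placeholder_report(values, min_share=0.2):
    """Return {value: count} for values that look like placeholders.

    A value is flagged if it is in SUSPECT_VALUES, or if it makes up at
    least min_share of all records (a dominant value may be a default).
    """
    cleaned = [str(v).strip().lower() for v in values]
    counts = Counter(cleaned)
    total = len(cleaned)
    suspects = {}
    for value, n in counts.items():
        if value in SUSPECT_VALUES or (total and n / total >= min_share):
            suspects[value] = n
    return suspects


# Hypothetical house number field from a form that rejects blanks.
house_numbers = ["12", "0", "0", "7a", "0", "999", "45", "0", ".", "3"]
suspects = placeholder_report(house_numbers)
print(suspects)
```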
Outside: Batch Feeds
• Large volumes mean lots of errors
• Source system subject to changes
• Errors accumulate
• Especially dangerous if triggers are activated
Outside: Real-Time Interfaces
• Data between databases in synchronisation
• Data in small packets out of context
• Too fast to validate
• Rejection loses the record, so it is accepted
Faster or better but not both!
Decay: Changes Not Captured
• Object changes are unnoticed by computers
• Retroactive changes may not be propagated
Decay: System Upgrades
• The data is assumed to comply with the new requirements
• Upgrades are tested against what the data is supposed to be, not what is actually there
• Once upgrades are implemented everything goes haywire
Decay: New Data Uses
• “Fitness to the purpose of use” may not apply
• Acceptable error rates may now be an issue
• Value granularity, map scale
• Data retention policy
Decay: Loss of Expertise
• The meaning of codes may change over time in ways that only “experts” know
• Experts know when data looks wrong
• Retirees rehired to work systems
• Auckland address points were entered on corners and the rest guessed, later used as exact.
Decay: Process Automation
• Web 2.0 bots automate form filling
• Transactions are generated without ever being checked by people
• Customers given automated access are more sensitive to errors in their own data
Within: Data Processing
• Changes in the programs
• Programs may not keep up with changes in data collection
• Processing may be done at the wrong time
Special GIS Data Issues
• Coordinate data not usually readable
• Data models CAD v GIS
• Fuzzy matching is not Boolean (near)
• Atomic objects harder to define
• Features have 2,3,4,5 dimensions
• Projection systems are not exact
• Topology requires special operators
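The point about fuzzy matching can be illustrated with a short Python sketch: an exact coordinate comparison fails after a tiny shift, while a "near" test with a tolerance still finds the match. The function name, the parcel coordinates and the one-metre tolerance are assumptions, and the sketch assumes projected coordinates in metres so that plain Euclidean distance is meaningful.

```python
import math


def nearest_within(point, candidates, tolerance):
    """Return (id, distance) of the nearest candidate within tolerance, else None.

    point:      (x, y) in projected units (assumed metres)
    candidates: {id: (x, y)}
    """
    best = None
    for cand_id, (x, y) in candidates.items():
        d = math.hypot(point[0] - x, point[1] - y)
        if d <= tolerance and (best is None or d < best[1]):
            best = (cand_id, d)
    return best


# Hypothetical parcel centroids and an address point shifted by about half a metre.
parcels = {"A": (1000.0, 2000.0), "B": (1010.0, 2000.0)}
shifted = (1000.3, 2000.4)

exact_match = shifted in parcels.values()           # Boolean test on coordinates
near_match = nearest_within(shifted, parcels, 1.0)  # tolerance-based test
too_strict = nearest_within(shifted, parcels, 0.1)  # tolerance smaller than the shift
print(exact_match, near_match, too_strict)
```

The choice of tolerance becomes a data quality rule in its own right: too small and true matches are missed, too large and neighbouring features are matched wrongly.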
Within: Data Purging
• Highly risky for data quality
• Relevant data may be purged
• Erroneous data may fit criteria
• It may not work the next year
Within: Data Cleaning
• En masse processes may add errors
• Cleaning processes may have bugs
• Incomplete information about data
Assessing Data Quality
• Data profiling
• Interview users
• Examine data model
• Data Gazing
Data Gazing
• Count the records
• Just open the sources and scroll
• Sort and look at the ends
• Run some simple frequency reports
• See if the field names make sense
• What is missing that should be there
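Several of the data gazing steps above (count the records, list the field names, sort and look at the ends, run a frequency report, count blanks) can be sketched with the Python standard library. The sample table and the function name are invented for illustration; a real source would be read from a file or database instead of an in-memory string.

```python
import csv
import io
from collections import Counter

# Hypothetical extract; in practice this would come from a file or table.
SAMPLE = """id,suburb,house_no
1,Albany,12
2,albany,7
3,,0
4,Takapuna,9999
5,Albany,45
"""


def gaze(text, field):
    """Return a quick first-look summary of one field in a CSV extract."""
    rows = list(csv.DictReader(io.StringIO(text)))
    values = [row[field] for row in rows]
    ordered = sorted(values)
    return {
        "record_count": len(rows),
        "field_names": list(rows[0].keys()) if rows else [],
        "lowest": ordered[:2],    # sort and look at one end
        "highest": ordered[-2:],  # and at the other end
        "frequencies": Counter(values).most_common(),
        "blank_count": sum(1 for v in values if not v.strip()),
    }


summary = gaze(SAMPLE, "suburb")
for key, value in summary.items():
    print(key, value)
```

Looking at both ends of the sorted values tends to surface blanks at one end and case variants such as "albany" at the other.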
Lunch
Data Cleaning
• There are always lots of errors
• It is too much to inspect all by hand
• Data experts are rare and too busy
• It does not fix process errors
• You may make it worse
Automated Cleaning
• The only practical method
• Needs sophisticated pattern analysis
• Allow for backtracking
• Data quality rules are interdependent
Common Mistakes
1. Inadequate Staffing of Data Quality Teams
2. Hoping That Data Will Get Better by Itself
3. Lack of Data Quality Assessment
4. Narrow Focus
5. Bad Metadata
6. Ignoring Data Quality During Data Conversions
7. Winner-Loser Approach in Data Consolidation
8. Inadequate Monitoring of Data Interfaces
9. Forgetting About Data Decay
10. Poor Organization of Data Quality Metadata
Metadata
• Data model
• Business rules, relations, state
• Subclasses (lookup tables)
• GIS Metadata (NZGLS or ISO) XML
• Readme.txt
Includes everything known about the data
Data Exchange
• Batch or interactive
• ETL (Extract Transform Load)
• Replication
• Time differences in data
GIS in Business Processes
• Integrates many different sources
• Spatial patterns are revealed
• Display thousands of records simultaneously with direct access
• Location now seen as important
Scorecard
[Diagram] Scorecard layers:
• DQ Score
• Score Summary
• Score Decompositions
• Intermediate Error Reports
• Atomic Level Data Quality Information
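One possible way to roll atomic-level errors up into the higher scorecard layers is sketched below in Python. The function name, the rule names and the scoring choices (a record counts as bad if it fails any rule; each rule score is the share of records passing that rule) are assumptions for illustration and are not taken from the course material.

```python
from collections import defaultdict


def build_scorecard(total_records, errors):
    """Aggregate atomic errors into rule-level and overall scores.

    errors: iterable of (record_id, rule_name) tuples, one per failed check.
    """
    failed_by_rule = defaultdict(set)
    for record_id, rule in errors:
        failed_by_rule[rule].add(record_id)

    # A record with several failures is counted once in the overall score.
    bad_records = set().union(*failed_by_rule.values()) if failed_by_rule else set()

    return {
        "dq_score": 1 - len(bad_records) / total_records,
        "by_rule": {rule: 1 - len(ids) / total_records
                    for rule, ids in failed_by_rule.items()},
        "error_counts": {rule: len(ids) for rule, ids in failed_by_rule.items()},
    }


# Hypothetical atomic-level errors from two rules over ten records.
errors = [(3, "suburb_not_blank"), (4, "house_no_in_range"), (3, "house_no_in_range")]
card = build_scorecard(total_records=10, errors=errors)
print(card)
```

Keeping the atomic (record, rule) pairs means any aggregate score can be traced back to the individual records that caused it.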
Case Study
• Outline a GIS data quality system
• Measles Chart
• Prioritise
• Interview
• Build up a scorecard
Afternoon Tea
Assessment Exercise
• Split into pairs
• Interview one person about their dataset
• Collect basic information
• Devise a strategy for a profile
• Rotate pair with another
• Interview other person
• Verbal reports to class
Major Upgrade Projects
• LINZ Coordinate upgrade
• NSCC Coordinate upgrade
References
• Data Quality Assessment – Arkady Maydanchik