Data Quality Journey at IRS RAS – The Importance of Metadata Robin Rappaport, CAP Senior Operations Research Analyst Data Quality and Metadata Team Leader IAIDQ Webinar Facilitator INFORMS Certified Analytics Professional (CAP) Exam Committee Member IRS | Research, Analysis, and Statistics (RAS) 2014 International Data Quality Summit

2014 International Data Quality Summit Data Quality ... · Quality Initiative for Research Databases at Internal Revenue Service (IRS). Work of team contributed to IRS being awarded


Data Quality Journey at IRS RAS – The Importance of Metadata

Robin Rappaport, CAP

Senior Operations Research Analyst
Data Quality and Metadata Team Leader
IAIDQ Webinar Facilitator
INFORMS Certified Analytics Professional (CAP) Exam Committee Member

IRS | Research, Analysis, and Statistics (RAS)

2014 International Data Quality Summit


About Our Speaker

• Data Quality and Metadata Team Leader responsible for delivery of the Data Quality Initiative for Research Databases at the Internal Revenue Service (IRS). The team's work contributed to the IRS being awarded The Data Warehousing Institute (TDWI) 2011 Best Practices Award for the Compliance Data Warehouse (CDW), a Computerworld Honor, and a Government Computer News (GCN) Gala Award.

• Over 25 years of experience as a Data Quality practitioner. Undergraduate degree in Economics with Computer Science. Graduate work in Operations Research with a concentration in Mathematical Modeling in Information Systems. Worked in both the private (6 years) and public sectors (since 1990). Positions include Computer Programmer, Systems Analyst, and Operations Research Analyst.

• International Association for Information & Data Quality (IAIDQ) Webinar Facilitator; Member, Institute for Operations Research and the Management Sciences (INFORMS), Certified Analytics Professional (CAP) Exam Committee; Chairman, Individual Membership for Washington, D.C. chapter from 1987 – 1990; elected Secretary and served from 1990 – 1991.

• Specialties: Data Quality, Metadata, SAS, Analytics, Sybase SQL.


Data Quality Journey

The Data Quality journey at the U.S. Internal Revenue Service (IRS) – Research, Analysis, and Statistics (RAS) began in 2005 for the Compliance Data Warehouse (CDW), which was started in 1997.

As the largest IRS database, CDW provides data, metadata, tools, training, and computing services to hundreds of research analysts working to improve tax administration.


Data Quality Initiative for IRS Research Databases

• January 2005, Temporary assignment to help with strategies and initiatives to improve the quality and delivery of data and information services to the Research community.

• March 2005, Discussion Paper by Director, Research Databases entitled: “Using the Compliance Data Warehouse (CDW) to Improve Data Quality for Research”.

• September 30, 2005, Permanent Position.


Proof of Data Quality Value

• Problem: Why do CDW numbers differ from another data source used by the Projections & Forecasting Group (PFG)?

Response from IRS to Congress should be authoritative. Determined differences between CDW and the PFG data source. Compared CDW numbers to the CDW data source.

• Actual Problem: CDW numbers did not match the CDW data source. Only certain states were affected.

• Resolution: Three missing tapes identified and loaded; permanent position obtained.
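The reconciliation described on this slide can be sketched as a comparison of row counts by state between the warehouse and its source extract. This is only an illustration under stated assumptions, not the procedure the IRS team used: the state codes and counts are made-up sample values, and `find_count_mismatches` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def find_count_mismatches(
    source_counts: Dict[str, int], warehouse_counts: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (state, source_count, warehouse_count) for every state whose counts differ."""
    mismatches = []
    # Use the union of keys so a state missing entirely from one side is still reported.
    for state in sorted(set(source_counts) | set(warehouse_counts)):
        src = source_counts.get(state, 0)
        wh = warehouse_counts.get(state, 0)
        if src != wh:
            mismatches.append((state, src, wh))
    return mismatches


# Illustrative numbers only (not IRS data).
source = {"MD": 1000, "VA": 1200, "PA": 900}
warehouse = {"MD": 1000, "VA": 800, "PA": 900}

for state, src, wh in find_count_mismatches(source, warehouse):
    print(f"{state}: source={src} warehouse={wh} missing={src - wh}")
```

A shortfall confined to a few states, as in this sample, points toward missing load units rather than a difference in definitions.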


Why a Data Quality Initiative for IRS Research Databases?

Data is fundamental to research.

Data Quality should be applied wherever data is analyzed.

Analysis requires data to be obtained.

Analysis requires assessment of quality of data.

Analysis requires knowledge of what the data represents.

Analysis requires knowledge of what fields should be selected.

Analysis requires knowledge of accuracy of the data.

Analysis requires knowledge of reliability of the data.


Why is Metadata Important for IRS Research Databases?

Data is fundamental to research.

Metadata is the key to turn data into information.

Research requires proper understanding of the data.

Research requires knowledge of what the data represents.

Research requires knowledge of what fields should be selected.


Findings and conclusions only as good as the data

[Figure: Data Analysis Ladder. Rungs: Raw Data, Static Reports, Ad hoc Reports, Descriptive Analysis, Predictive Modeling, Simulation & Optimization. Questions answered: "What happened?", "Why did it happen?", "What will happen?". Axis labels: Return on Investment; Data and computing requirements.]


Research only as good as understanding of the data

[Figure: Data Analysis Ladder. Rungs: Raw Data, Static Reports, Ad hoc Reports, Descriptive Analysis, Predictive Modeling, Simulation & Optimization. Questions answered: "What happened?", "Why did it happen?", "What will happen?". Axis labels: Return on Investment; Data and computing requirements.]


Good research requires quality data and metadata

IRS Strategic Foundations: Invest for High Performance. Use data and research across the organization to make informed decisions and allocate resources.

RAS Goal: Become our customer’s preferred source. Centralized source of timely, relevant, accurate, accessible, interpretable, and coherent Data; Metadata; Tools; System Infrastructure; and Training.

Ongoing Effort: Improve quality of research databases; increase number of users; and enhance online tools and web-enabled knowledge.

Customer Comments: “…should win … award for making such useful data available to the research community and working … to ensure data accuracy and consistency!”

Industry Recognition: The Data Warehousing Institute (TDWI) 2011 Best Practices, IAC Excellence.gov Award Finalist 2008, ComputerWorld Honor 2007, and Award from Government Computer News (GCN) 2007 for “IRS’ Update of Compliance Data Warehouse makes analysis less taxing”.

Government Recognition: IRS Enterprise Data Management Office, Canada Revenue Agency, Department of Treasury, Pennsylvania Department of Taxation, National Security Agency (NSA), and Federal Aviation Administration (FAA).


Lessons Learned 1

• No one dimension is more important than the others.

• Metadata is more important than people realize.

• What it takes to be successful writing metadata for researchers.


CDW Aims to Address Data Quality in Six Areas

Area: Goal

Timeliness
► Minimize amount of time to capture, move, and release data and metadata to users.
► Leverage new technologies or processes, where appropriate, to increase efficiencies throughout the data and metadata supply chain.

Relevance
► Ensure data and metadata gaps are filled.
► Make new investments in data and metadata that reflect expected future priorities.

Accuracy
► Create processes to routinely assess fitness of data and metadata.
► Report quality assessment findings.
► Publish statistical metadata through the CDW website.
► Cross-validate release data against source and other system data.

Accessibility
► Improve organization and delivery of metadata.
► Provide online knowledge base.
► Facilitate more efficient searching for metadata.
► Invest in third-party tools to enhance access and analysis of data/metadata.

Interpretability
► Standardize naming and typing conventions.
► Create clear, concise, easy to understand data definitions (metadata).

Coherence
► Develop common key structures to improve record matching across database tables.
► Ensure key fields have common names and data types (metadata).

Source of the six DQ dimensions: Federal Committee on Survey Methodology; Gordon Brackstone, “Managing Data Quality in a Statistical Agency”, Survey Methodology, December 1999.
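The Coherence goal above (key fields with common names and data types) lends itself to an automated check against column-level metadata. The sketch below is an assumption-laden illustration: the `(table, column, data_type)` layout, the sample table and column names, and the `inconsistent_types` helper are all invented, not taken from CDW.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (table, column, data_type) rows, as might be pulled from a metadata repository.
ColumnSpec = Tuple[str, str, str]


def inconsistent_types(columns: List[ColumnSpec]) -> Dict[str, Set[str]]:
    """Map each column name declared with more than one data type to those types."""
    types_by_name: Dict[str, Set[str]] = defaultdict(set)
    for _table, name, data_type in columns:
        types_by_name[name].add(data_type)
    return {name: types for name, types in types_by_name.items() if len(types) > 1}


# Hypothetical metadata; table and column names are invented for illustration.
specs = [
    ("returns", "tin", "char(9)"),
    ("payments", "tin", "integer"),
    ("returns", "tax_year", "smallint"),
    ("payments", "tax_year", "smallint"),
]

print(inconsistent_types(specs))
```

Here only `tin` is flagged, because it is typed differently in the two tables, which would complicate record matching across them.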


Become our Customer’s Preferred Source

Focus on Improvements

Frequency of Data Updates: Annual, Semi-Annual, Quarterly, Monthly, Weekly.

Data Augmentation: Derived fields, Pre-joining tables, Summary tables.

Data Profiling, Rule Development, Quality Assessment.

Website Redesign and Search Capabilities.

Metadata Development and Maintenance (Data Definitions, Lookup Tables, Column Profile).

Standard Key Structures, Naming Conventions, and Type Designations.


Researcher Quotes

• It is frustrating to use data without metadata.

• How do I select the fields for my analysis without knowing what they mean (Name, Format, No definition)?

• Please make sure to update dates after loading data (Latest Update Cycle, Month, Year).


CDW Fact Sheet

What is the Compliance Data Warehouse (CDW)?

What is CDW? It is an environment that provides data, tools, and computing services to the IRS Research community. It offers a high-performance analytical environment for conducting research studies.

Why is CDW important? It is the preferred environment for over 1,000 research analysts and other business users to perform analytical and high-volume data processing. Users include IRS, Treasury, GAO, and others.

Who can benefit from CDW? Those who require access to a comprehensive set of compliance data for research and analysis in a secure environment. Average daily database queries: 8,000.

The Compliance Data Warehouse (CDW) is an analytical data environment that is specifically designed to support research activities in the IRS. It is managed and administered by RAS. Key features of the CDW environment include:

Data: Taxpayer-level data from nearly 40 different legacy sources, including tax returns, customer accounts, information returns, case management systems, and third-party data. With nearly 2 petabytes of total data, CDW is the largest database in the IRS. Data sources are released on a weekly, monthly, quarterly, and annual basis.

Metadata: Web-based metadata and dynamic data profiling for all databases, including definitions, lookup tables, cross-references, and other artifacts for over 40,000 data elements and over 1 million searchable attributes. Average daily web-based queries: 1,500.

Tools: Software licenses for SQL clients, SAS, SAS Enterprise Miner/Text Miner, R, Hyperion Intelligence, ArcGIS, and support for any ODBC- or JDBC-compliant application.

Computing: Server-based computing environment for remote processing of complex and high-volume jobs; flexible storage management solutions for both temporary and permanent user files.

Training and Support: In-house training for SQL and Hyperion; group rate solutions for SAS and ArcGIS; and general support for data, tools, and account management services.

Security: FISMA Certification & Accreditation, Online 5081 Authorization, TIN Masking,Database Audits, System Logging, and other Security Controls.


CDW Data Road Map

What types of data are available in CDW?

[Diagram: major categories of CDW data: Customer Account, Tax Return, Information Reporting, Compliance and Case Management, Third-Party, Other, and Customer Service.]

Major Categories of Data in CDW

Customer Account: Payments, abatements, credits, adjustments, freeze conditions, additional assessments, reversals, bankruptcies, claims, penalties, offer in compromise, etc.

Tax Return: Federal tax returns filed by individuals, businesses, exempt organizations, and government entities. Includes Forms 1040, 6251, 3520, 1120, 1065, 1041, 990, and others.

Information Reporting: Information filed by a financial institution, employer, partnership, or other party on behalf of a taxpayer. Includes Forms W-2, 1098, 1099-B, 1099-MISC, Schedule K-1, and others.

Compliance/Case Management: Case management systems containing information on examinations, delinquent accounts, underreporter activity, enforcement revenue, or other compliance-based data.

Third-Party: Taxpayer data from federal-state sharing agreements, treaty partners, and other federal agencies, including Social Security Administration, Department of Justice, and State Department.

Other: Customer surveys, national statistical samples, credit bureau data, publicly available data, fee-based financial data, and other sources.


CDW: Infrastructure

[Diagram components: Database Server (Sybase IQ); Application Server (SAS, Hyperion); Shared Storage (2 petabytes); IRS Network; client access via SPSS, Access, SQL, Web, and SAS.]


CDW Metadata

How is CDW metadata created and published?

What is metadata? Metadata is sometimes defined as “data about data”. It is ideally maintained as a repository of searchable information about available data.

Why is metadata important? Metadata is the key to turn data into information. It can be used as a tool to select appropriate data for a study and then understand and interpret findings. It can also be used as a basis for developing business rules for data validation and augmentation.

Who can access CDW metadata? Anyone with access to the IRS intranet can view CDW metadata on the CDW website.

CDW maintains a web-based repository of metadata for over 40,000 columns of data. Metadata is available at the database, table, and column level, and is created and updated based on some of the following sources:

• Internal Revenue Manual (IRM)
• Document 6209 (IRS Processing Codes and Information)
• Functional Specification Packages (FSPs), Computer Programming Handbooks, Core Record Layouts (CRLs), Program Requirement Packages (PRPs)
• Tax Returns and Tax Return Instructions
• Other official documents and materials

Examples of CDW Metadata at the Database, Table, and Column Level

Level: Types of Metadata Attributes

Database: Name, Source, Description, First Year, Last Year, Last Updated

Table: Name, Database, Description, Frequency, Frequency Type, First Year, Last Year, Last Updated, Number of Columns

Column: Name, Legacy Name, Description, Table, First Year, Last Year, Last Updated, Data Type, Distribution Type, Range Type, Nulls Allowed, Has Lookup, Minimum Length, Maximum Length, Primary Key, Last Updated, Legacy Source, Legacy File Name
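The three metadata levels on this slide map naturally onto nested record types. The following is a minimal sketch that models a subset of the listed attributes; the class names, the `search` method, and the sample entries are assumptions made for illustration and do not describe the actual CDW repository schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnMetadata:
    name: str
    legacy_name: str
    description: str
    data_type: str
    nulls_allowed: bool = True
    has_lookup: bool = False
    primary_key: bool = False


@dataclass
class TableMetadata:
    name: str
    description: str
    frequency: str
    columns: List[ColumnMetadata] = field(default_factory=list)

    @property
    def number_of_columns(self) -> int:
        # Derived rather than stored, so it cannot drift from the column list.
        return len(self.columns)


@dataclass
class DatabaseMetadata:
    name: str
    source: str
    description: str
    first_year: int
    last_year: int
    tables: List[TableMetadata] = field(default_factory=list)

    def search(self, keyword: str) -> List[str]:
        """Return 'table.column' for every column whose description mentions keyword."""
        needle = keyword.lower()
        return [
            f"{t.name}.{c.name}"
            for t in self.tables
            for c in t.columns
            if needle in c.description.lower()
        ]


# Hypothetical example entries (names and descriptions are invented).
returns = TableMetadata(
    name="returns",
    description="One row per filed return.",
    frequency="Annual",
    columns=[
        ColumnMetadata("tax_year", "TAX-YR", "Year the return covers.", "smallint", primary_key=True),
        ColumnMetadata("filing_status", "FS-CD", "Filing status code.", "char(1)", has_lookup=True),
    ],
)
db = DatabaseMetadata("sample_db", "Sample source", "Illustrative database.", 2005, 2013, [returns])
print(db.search("status"))
```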


Standard Template for CDW Metadata Definition

Benefits

• Reference to source

• Informative

• Clear

• Concise

• Complete

• Easy to understand

• Consistent

• Easier to develop

• Easier to maintain

The <legacy name> ....

Choose all that apply:

was added in <extract cycle>. (It) has data through <extract cycle>. (It) is <short description: clear, concise, and easy to maintain>.

It is reported on <Form #, Line #>. (It is (transferred to OR included on) <Form #, Line #> (notated '<notation>').) (The format is (<word for number> character(s) OR numeric). OR It is reported in (positive, negative, or positive and negative) (whole dollars OR dollars and cents).) Valid values (if known) are .... Values are (if valid not known) ..... It is (zero, blank, null) (if not present OR if not applicable). (Values (other than valid) also appear.) (See <related fields>.)

Note: Basic template for form-related fields. Other variations exist for indicators, codes, and computer-generated fields.
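One way to see why a standard template makes definitions easier to develop and maintain is to treat it as a function of a few inputs. The sketch below covers only a handful of the template's optional clauses; `render_definition`, its parameters, and the sample field (including the placeholder form reference) are hypothetical and are not the team's actual tooling.

```python
from typing import Optional, Sequence


def render_definition(
    legacy_name: str,
    added_cycle: str,
    short_description: str,
    form_line: Optional[str] = None,
    valid_values: Sequence[str] = (),
) -> str:
    """Assemble a definition from template clauses, skipping clauses with no input."""
    sentences = [
        f"The {legacy_name} was added in {added_cycle}.",
        f"It is {short_description}.",
    ]
    if form_line:
        sentences.append(f"It is reported on {form_line}.")
    if valid_values:
        sentences.append("Valid values are " + ", ".join(valid_values) + ".")
    return " ".join(sentences)


# Invented example; the form and line reference is a placeholder, not a real one.
print(
    render_definition(
        legacy_name="FILING-STATUS-CD",
        added_cycle="200501",
        short_description="the filing status claimed on the return",
        form_line="Form 0000, Line 0",
        valid_values=["1", "2"],
    )
)
```

Because every definition is assembled from the same clauses in the same order, researchers can scan definitions quickly and writers only supply the variable parts.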


SMEs Needed for Metadata Development: Value Statement

If you provide Subject Matter Experts (SMEs) for metadata development, you can expect better understanding of data fields due to more complete definitions, which should result in improved and more reliable research.

If you do not, then you can expect continued difficulty in identifying fields for research, which could result in flawed research and misinformed decisions.

Developed at DGIQ 2010 Conference: Danette McGilvray’s tutorial.

* Presently looking for hard workers capable of synthesizing information from various sources into clear, concise definitions.

These volunteers often volunteer others.


Skills Required to Write Metadata

• Careful attention to detail.

• Innate curiosity.

• Understand data through exploratory statistics and Structured Query Language (SQL) programming or statistical packages, such as SAS.

• Understand the benefit of using a standard template (controlled vocabulary).

• Textual research skills (key word searches).

• Knowledge of Service documentation or Subject Matter Expertise (SME).

• Ability to write in a clear, concise manner.


Loading Metadata to the CDW Website

• SAS® enables efficient delivery of large-scale metadata to the IRS Research community as part of a broader data quality initiative.

• Column definitions and other attributes are imported from Excel spreadsheets, iteratively processed using SAS® macros, and exported to Microsoft SQL Server.

• Metadata in SQL Server is managed through DATA steps and procedures via the ODBC engine.

• Metadata published on the CDW website in a usable format helps IRS researchers quickly search for and understand the meaning of data available for analysis.
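The load path on this slide (spreadsheet in, cleaned definitions out to a relational store) can be illustrated with a small stand-in. Note the substitution: the slide describes SAS macros writing to Microsoft SQL Server over ODBC, whereas this sketch uses Python's standard `csv` and `sqlite3` modules purely so that it is self-contained. The sheet layout, the table name `column_metadata`, and `load_definitions` are assumptions.

```python
import csv
import io
import sqlite3

# Stand-in for a spreadsheet export; column names and rows are invented.
SHEET = (
    "table_name,column_name,definition\n"
    "returns,tax_year,  The year for which the return was filed.\n"
    "returns,filing_status,\n"
)


def load_definitions(sheet_text: str, conn: sqlite3.Connection) -> int:
    """Load non-empty column definitions into a metadata table; return rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS column_metadata ("
        "table_name TEXT, column_name TEXT, definition TEXT, "
        "PRIMARY KEY (table_name, column_name))"
    )
    rows = []
    for record in csv.DictReader(io.StringIO(sheet_text)):
        definition = (record["definition"] or "").strip()
        if definition:  # skip columns that have no definition yet
            rows.append((record["table_name"], record["column_name"], definition))
    # INSERT OR REPLACE lets a re-run update an existing definition in place.
    conn.executemany("INSERT OR REPLACE INTO column_metadata VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
print(load_definitions(SHEET, conn))
```

Keying the table on table and column name makes the load repeatable, which matters when definitions are revised and reloaded many times.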


CDW: Viewing Metadata

[Diagram components: Microsoft SQL Server; IRS Network; CDW Website views.]


CDW Metadata

What metadata is available on the CDW website?

Database and Table-Level Metadata: Names, descriptions, sources, availability, update status, and links to other internal websites for program or operational information.

Metadata Availability: Database-Level, Table-Level, Column-Level, Lookup Tables, Reviews


CDW Metadata

What metadata is available on the CDW website (cont’d)?

Column-Level Metadata: Definitions, legacy references, availability, release frequency, data types, primary key candidates, Nulls, distribution type, range type, and other attributes.


CDW Metadata

What metadata is available on the CDW website (cont’d)?

Lookup Tables and Column Reviews: Definitions for unique values of discrete (categorical) fields, and ability to view and submit comments about anomalies or other features.


Standardized Search Results on the CDW Website


CDW Data Profiling

Identifying patterns (and problems) in data with profiling

What is Data Profiling? Data profiling involves a standardized analysis of data to determine its completeness and suitability for use.

Why is profiling important? Analyzing basic patterns in data can reveal both insights and anomalies that are useful in describing the underlying structure and improving data quality.

Who can profile CDW data? As of March 2010, those with access to the IRS intranet can profile data via the CDW website. One does not need to be a CDW user to view the CDW website and perform basic profiling.

Data profiling is a process of analyzing data to gauge overall suitability for use. Common profiling tasks can help identify:

• Invalid values in fields (values out of range)
• Missing values or empty fields (fields containing no data at all)
• Inconsistent methods of representing the same value
• Data elements used for purposes other than expected
• Violation of business rules
• Unrealistic frequencies or percentages of specific values in a column
• Violations of referential integrity
• Misspelled text values

The Role of Metadata in Data Profiling

Metadata includes valid values, when known. The process of data profiling is more informative when valid values are known. The actual values identified as part of the data profiling can be compared to the valid values. This allows for Data Validation of a specific data field using the data in that one field. Metadata can also enable cross-validation to what should be the same field on other data tables. Metadata can also include more complex business rules that can be used for validation.
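Comparing the values actually observed in a field with the valid values recorded in its metadata can be sketched in a few lines. The sample values, the valid set, and the helper name `profile_against_valid_values` are invented for illustration; this is not the CDW website's implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def profile_against_valid_values(values: Iterable[str], valid: Set[str]) -> Dict[str, int]:
    """Count occurrences of each observed value that is not in the documented valid set."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if value not in valid}


# Made-up field contents; the empty string stands for a blank entry.
observed = ["1", "2", "2", "9", "", "1", "X"]
valid_values = {"1", "2", "3"}

invalid = profile_against_valid_values(observed, valid_values)
print(invalid)
```

Without the valid set from the metadata, the same frequency counts would still be available, but there would be no basis for calling `9`, the blank, or `X` an error.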


CDW Data Profiling

What data profiling features are available on the CDW website?

Table Statistics: Row counts for a given table by State, County, and Zip Code. Row counts represent original, unaltered data from the authoritative source.

Data Profiling Features: Table Statistics, Frequency Table, Column Statistics, Trend Analysis, Geographic Maps, Reviews


CDW Data Profiling

What data profiling features are available on the CDW website (cont’d)?

Frequency Table: Frequencies (row counts) for unique values of a discrete (categorical) field for a specific time period. Cumulative statistics are generated.
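A frequency table with cumulative statistics, as described for the Frequency Table feature, can be sketched as follows. The row layout and the `frequency_table` helper are assumptions for illustration, and the sample values are invented.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# (value, count, percent, cumulative count, cumulative percent)
FrequencyRow = Tuple[str, int, float, int, float]


def frequency_table(values: Sequence[str]) -> List[FrequencyRow]:
    """Build frequency rows for a categorical field, sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        rows.append(
            (value, counts[value], 100.0 * counts[value] / total,
             running, 100.0 * running / total)
        )
    return rows


# Invented categorical data.
for row in frequency_table(["A", "B", "A", "C", "A", "B", "A", "B"]):
    print(row)
```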


CDW Data Profiling

What data profiling features are available on the CDW website (cont’d)?

Column Statistics: Basic distributional statistics for a given time period and filter condition. Users can drill down on unique values of a discrete (categorical) field.
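Basic distributional statistics for a column under a filter condition, as in the Column Statistics feature, might look like the sketch below. The choice of statistics, the row format (a list of dictionaries), and the `column_statistics` helper are assumptions; the sample rows are invented.

```python
import statistics
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def column_statistics(
    rows: List[Row], column: str, predicate: Optional[Callable[[Row], bool]] = None
) -> Dict[str, float]:
    """Summarize a numeric column over the rows that satisfy the filter condition."""
    selected = [row for row in rows if predicate is None or predicate(row)]
    values = [row[column] for row in selected if row[column] is not None]
    summary: Dict[str, float] = {"rows": len(selected), "nulls": len(selected) - len(values)}
    if values:  # min/max/mean are undefined for an empty selection
        summary.update(
            min=min(values),
            max=max(values),
            mean=statistics.mean(values),
            median=statistics.median(values),
        )
    return summary


# Invented sample rows; None stands for a null amount.
data = [
    {"year": 2012, "amount": 100},
    {"year": 2012, "amount": 300},
    {"year": 2013, "amount": 50},
    {"year": 2012, "amount": None},
]
print(column_statistics(data, "amount", lambda row: row["year"] == 2012))
```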


CDW Data Profiling

What data profiling features are available on the CDW website (cont’d)?

Trend Analysis: Basic statistics over time and by unique values of a discrete (categorical) field. Up to five years of data can be displayed.


CDW Data Profiling

What data profiling features are available on the CDW website (cont’d)?

Geographic Maps: U.S., State, and County level


CDW Data Profiling

What data profiling features are available on the CDW website (cont’d)?

Data Reviews: Comments on specific data fields (columns) posted by users


CDW Data Alerts

What types of alerts are available on the CDW website?

Data Alert Types

• New Metadata

• Updated Metadata

• Data Augmentation

• Standardization

• Accuracy Issues

• Data Corrections


Lessons Learned 2

• The more metadata we provide, the greater the demand.

• Expectations of up-to-date metadata for all fields in CDW.

• Metadata development and maintenance are very time-consuming and require skilled resources.

• Better Metadata = Fewer Questions.


CDW: Data Analysis Tools

Database Greater Than    Data Warehouse Definition
10^6 (1 MB)              Tiny
10^9 (1 GB)              Small
10^10 (10 GB)            Big
10^11 (100 GB)           Large
10^12 (1 TB)             Very Large
10^13 (10 TB)            Huge
10^14 (100 TB)           Massive
10^15 (1 PB)             Ridiculous*

* Attributable to Ed Wegman, Center for Computational Statistics, George Mason University

[Callout on the table: "CDW is in this range".]

Large data volume means choosing the right tool for the right job

The larger the amount of data being analyzed, the greater the efficiencies from remote computing and using SQL or SQL-based products.


Tools for Data Quality and Analysis

Analysis Tool: Functional Characteristic

SQL
► Easy-to-use database language consisting of a few basic commands that provides fast and efficient retrieval, summarization, and processing of data
► Virtually all ad-hoc or descriptive queries performed by Research users are conducive to SQL

Hyperion
► Web-enabled, point-and-click tool for query, analysis, and report-writing activities
► Ability to interactively create pivots, charts, and user-defined table views
► Server-side implementation for maximum performance

SAS
► End-to-end data management and business intelligence software package supporting a wide range of functionality for data management, statistical analysis, econometrics and time series, operations research, GIS, data visualization, data mining, database connectivity, web and application development, and other business and analytical needs
► Server-side processing available to support analysis of very large databases
► Most widely used all-purpose analysis software among federal statistical agencies
► DataFlux product purchased to assist with Data Quality Assessment

SPSS
► Data analysis tool widely used in the Research community that can be used to retrieve data directly from CDW databases without any intermediate steps

Other
► Microsoft Access, Excel, and other third-party tools can be used for data retrieval, but are not suited for processing data on the server


Quality Assessment

Rule Failures Reported by Column

Column Name: YEAR

Rules                     # Errors    # Rows    Error Rate
Rule 1: (Valid Values)
Rule 2: (Sign Test)
Rule 3: (Range Test)
Rule 4: (Legislative)
Rule n: (Others)

* Rules Require Metadata to Write
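A per-column report of rule failures like the one outlined on this slide can be sketched by applying each rule to every value and dividing errors by rows. The rule bodies, the valid-value set, the range bounds, and the sample years below are all invented for illustration; in practice the rules would come from the column's metadata.

```python
from typing import Callable, List, Sequence, Tuple

Rule = Tuple[str, Callable[[int], bool]]  # (rule name, function returning True when a value passes)


def rule_failures(values: Sequence[int], rules: Sequence[Rule]) -> List[Tuple[str, int, int, float]]:
    """Return (rule name, # errors, # rows, error rate) for each rule."""
    n_rows = len(values)
    report = []
    for name, passes in rules:
        errors = sum(1 for value in values if not passes(value))
        rate = errors / n_rows if n_rows else 0.0
        report.append((name, errors, n_rows, rate))
    return report


# Hypothetical YEAR column and rules.
years = [2010, 2011, 1899, 2012, -2013, 2013, 2099, 2011]
rules = [
    ("Rule 1: (Valid Values)", lambda y: y in {2010, 2011, 2012, 2013}),
    ("Rule 2: (Sign Test)", lambda y: y > 0),
    ("Rule 3: (Range Test)", lambda y: 1900 <= y <= 2014),
]

for name, errors, n_rows, rate in rule_failures(years, rules):
    print(f"{name:<24} {errors:>8} {n_rows:>8} {rate:>10.3f}")
```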


Current Status and Future Plans

• Data Integration

• Data Profiling

• Rule Development

• Quality Assessment

• Continuous Monitoring

• Data Stewardship

Metadata is important for all of these activities


DQ Journey – Lessons Learned Along the Way

Without Metadata, Data Management to properly serve a customer community of researchers is impossible.


Acknowledgements

• Jeff Butler, Director, Research Databases, IRS RAS

• CDW Data Quality Team (Headquarters, Field, and Contractors)

• Lwanga Yonke, Information Quality Process Manager, Aera Energy LLC and Advisor to Board of Directors, IAIDQ


Data Quality Journey at IRS RAS – The Importance of Metadata

Robin Rappaport, CAP
Senior Operations Research Analyst
Internal Revenue Service (IRS)
Research, Analysis, and Statistics (RAS)
RAS:DM:RDD:TRD
1111 Constitution Avenue, N.W. K
Washington, D.C. 20001

[email protected]

2014 International Data Quality Summit