Page 1:

© Copyright 2008 Neils Michael Scofield, all rights reserved.

Managing the Data Acquisition & Exchange Relationship

By Michael Scofield

Manager, Data Asset Development, ESRI, Inc., Redlands, CA

Asst. Professor, Health Information Management, Loma Linda University

[email protected]

Vers. 32 MSP June 9, 2008 L-3

Page 2:

About Michael Scofield

Michael Scofield is Manager of Data Asset Development at ESRI in Redlands, California. He is a popular speaker on topics of data management, data quality, data warehouse design, as well as satellite imagery interpretation and emergency communications. His career has included education and private industry in areas of data warehousing and data management. His articles appear in DM Review, the B-Eye Newsletter, InformationWeek magazine, the IBI Systems Journal, and other professional journals.

He has spoken to over 120 professional audiences for groups such as Data Management Assn chapters, European Metadata Conferences, Information Quality Conferences, The Data Warehousing Institute, Oracle User Groups, Institute of Internal Auditors, Assn. of Government Accountants, Quality Assurance Association chapters, Assn. for Computing Machinery, and other professional and civic audiences.

Mr. Scofield is also Asst. Professor of Health Information Management at Loma Linda University.

Page 3:

Alternate titles:

“Managing the Data Acquisition Relationship”

“How Not to Mess Up When You Import Data”

“data acquisition”

…traditionally in science and engineering instrumentation.

[Diagram: “data” passes from Source to User.]

Page 4:

Topics & Areas of Concern

Spelling out the relationship

Difference between data and information

Understanding specific data and information needs

Asking for the right data and finding what you need

Data value and utility

Assessing the burden on potential data providers

Scope and completeness of data

Page 5:

Topics (cont.)

Versioning and timeliness

Media and physical format

Compatibility of logical data architectures

Data quality assessment

Updates and refresh issues

Data collection bias

Legal issues

Continuing data flow surveillance

Page 6:

How do you describe a dataset?

Architecture
  What subjects (things) are described by a record
  Facts/fields/attributes/columns
  Logical data model

Scope
  What records are included or excluded on dimensions
  Dimensions: time, geography, org.

Currency
  Compared to declared scope
  Table level, and column-specific

Quality
  Precision
  Completeness (by column)
  Accuracy

However, data acquisition is much, much more.
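The four descriptors on this slide (architecture, scope, currency, quality) can be written down as a small record that travels with a dataset. The sketch below is only one way to do that: the class name `DatasetDescription`, its field names, and the sample records are assumptions made for illustration, not part of the original talk. It also shows completeness measured per column, as the slide suggests.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescription:
    """Hypothetical container for the four descriptors on the slide."""
    subject: str               # architecture: what each record describes
    columns: list[str]         # architecture: facts/fields/attributes
    scope: dict[str, str]      # scope: dimension -> declared coverage
    as_of: str                 # currency: date the data reflects
    completeness: dict[str, float] = field(default_factory=dict)  # quality


def completeness_by_column(records: list[dict], columns: list[str]) -> dict[str, float]:
    """Share of records holding a non-empty value, computed per column."""
    total = len(records)
    result = {}
    for col in columns:
        filled = sum(1 for r in records if r.get(col) not in (None, ""))
        result[col] = filled / total if total else 0.0
    return result


# Illustrative records only; they do not come from any real source.
records = [
    {"name": "Lucy Davis", "address": "41 Main St.", "phone": ""},
    {"name": "Glenn Pratt", "address": "78 Mills Lane", "phone": None},
    {"name": "David Orr", "address": "", "phone": "555-0100"},
    {"name": "Adam Karr", "address": "487 Riverside Dr.", "phone": "555-0101"},
]
columns = ["name", "address", "phone"]

description = DatasetDescription(
    subject="customer",
    columns=columns,
    scope={"time": "2008", "geography": "U.S."},
    as_of="2008-06-09",
    completeness=completeness_by_column(records, columns),
)
print(description.completeness)
```

A per-column figure is more useful than a single table-level number because one sparsely filled column can hide behind an otherwise complete table.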

Page 7:

Introduction

Why talk about this?

Because…

…we want more and more data, and we don’t generate it all ourselves.

So… we acquire it somewhere else.

Page 8:

Never a simple flow of data!

[Diagram: “data” passes from Source to User (“target”); the two are joined by a Relationship.]

Expectations:
  subjects covered by data
  scope of data
  quality of data
  currency of data

Expectations:
  money
  how you use data
  burden others?

Often forgotten topics:
  Updates and refresh
  Corrections
  Documentation
  Other measures of quality

Terms:
  Usage rights

Page 9:

Kinds of data “flows”

Trigger events:

  Elapsed time (day, week, month, sub-day)
  Source business event (usually a transaction)
  Target business event (transaction makes request for limited data; e.g. balance check)
  Human decision (e.g. BI)
  Record growth trigger (e.g. every 5,000 records in a source transaction file)

“Push” vs. “pull”: When the trigger happens, which side does the heavy work?

Pull: Data requestor sends a query to the source database.

Push: Data host compiles a data file and sends it.

[Diagram: an application and an import app. in the target environment; the import app. sends a query to the source application database and receives results.]
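The difference between push and pull, and the record growth trigger in particular, can be sketched in a few lines. Everything here is hypothetical: the class names `SourceSystem` and `PushingSource`, the `deliver` callback, and the batch size of 3 (standing in for the slide's 5,000) are assumptions chosen to keep the example small.

```python
from typing import Callable


class SourceSystem:
    """Hypothetical source that holds transaction records."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def query(self, since_index: int) -> list[dict]:
        # Pull: the requestor sends a query and the source answers it.
        return self.records[since_index:]


class PushingSource(SourceSystem):
    """Source that pushes a batch whenever `batch_size` new records exist."""

    def __init__(self, batch_size: int, deliver: Callable[[list[dict]], None]) -> None:
        super().__init__()
        self.batch_size = batch_size
        self.deliver = deliver
        self._sent = 0  # number of records already pushed

    def add(self, record: dict) -> None:
        self.records.append(record)
        # Record growth trigger: enough unsent records have accumulated.
        if len(self.records) - self._sent >= self.batch_size:
            self.deliver(self.records[self._sent:])
            self._sent = len(self.records)


received: list[list[dict]] = []
source = PushingSource(batch_size=3, deliver=received.append)
for i in range(7):
    source.add({"txn": i})

# Two full batches were pushed; the seventh record is still waiting at the
# source, and the target can fetch it on demand with a pull.
pending = source.query(since_index=6)
print(len(received), pending)
```

In the push case the source compiles and sends the batch; in the pull case the source only answers what the target asks for, so the scheduling effort sits on the target side.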

Page 10:

Flows exist in many places

[Diagram labels: Enterprise; Data supplier; Appl. A; Appl. B; Acquired division; Outside data user; Appl. G; DW; Un-coordinated applications; Business Intelligence]

Page 11:

Each source has a data architecture

[Same diagram as the previous page, with each source annotated: Expectations; Constraints]

Page 12:

What is data architecture?

The logical and semantic structure of the business (or that part of the business) and the data which describes and supports it.

Described by a data model
  Subject entities
  Relationships
  Attributes
  Entity-relationship diagram

Is abstract (not understood by many)

Can be complex

Page 13:

Each FLOW has a data architecture

[Diagram labels: Enterprise; Data supplier; Appl. A; Appl. B; Acquired division; Outside data user; Appl. G; DW; Un-coordinated applications; B.I.; each flow annotated with Expectations]

Page 14:

Enterprise-captured data life cycle

[Diagram labels: Transaction-based data capture; Business application; Business database; Archive; DW; other in-house applications; Executive summary reports; export]

Data derivation & enhancement
  Association with own history
  Integration with other lateral data
  Computing derived data (ratios, aggregates, etc.)

Page 15:

Reasons to import data

Enhance an internal DW for support of improved executive decision-making.

Bolster operational data resources independent of the data exchange relationship.

Engage in new business processes involving a B2B partnership formed through data exchange.

E-discovery: litigation

[Diagram labels: DW; B.I.]

Page 16:

Reasons to import data

[Diagram labels: DW; B.I.]

Timing:

  Periodic big batch files: daily, weekly, monthly, etc.
  Transaction-driven: “micro” data flows (SOA)
  One-time

Page 17:

Spelling out the relationship

Introduction

Spelling out the Relationship

Data & information

Universe of knowledge

Asking for the right data

Potential data providers

Physical forms and media

Logical data architecture

Semantics & meaning

Documentation & metadata

Scope & completeness

Fund. of data quality

Update & refresh issues

Data collection bias

Ownership & legal

Confidentiality

Data flow surveillance

Conclusion

Page 18:

Key questions:

What are your expectations?

What are your uses of the data?

What motivates the source to give it to you?

What are the political-cultural barriers between you and the source?

What are your expectations of…
  quality, completeness, currency
  media
  updates and refresh

How can you strengthen the relationship?

Page 19:

Political & cultural barriers

Separate system (you / them)

Peer division or department (you / them)

Totally unrelated legal entity (you / them)

“Information is power!” People don’t want to give up power.

Page 20:

Typical risks and surprises

To save money, the source does not maintain previous quality in data capture and processing. Updates show lower quality.

To expand its market, the source alters the logical and physical data architecture without telling you.

In response to business morphing pressures, the source alters the coding scheme for one or more fields.

The source discovers some errors, but does not inform you of them, nor supply you with corrections or corrected records.

Page 21:

Mitigating strategies

Spell out all expectations about the data.

Develop language, words, & models to enhance precision of communication about data expectations.

Rigorous testing of data prior to purchase

Strengthen relationship through cooperative data testing strategies
  Offer to test their updates
  Provide non-threatening feedback on DQ
  Get source to seek you out as consultant on DQ (this will allow you to monitor their morphing pressures)
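Testing a source's updates can start with two simple checks that target the surprises listed earlier: has the set of columns changed, and have new code values appeared in coded fields? The sketch below assumes records arrive as dictionaries; the function name `compare_update`, the field names, and the sample values are hypothetical.

```python
def compare_update(baseline: list[dict], update: list[dict],
                   coded_fields: list[str]) -> list[str]:
    """Report column changes and previously unseen code values in an update."""
    findings = []

    # Architecture check: columns dropped or added since the baseline.
    base_cols = set().union(*(r.keys() for r in baseline))
    new_cols = set().union(*(r.keys() for r in update))
    for col in sorted(base_cols - new_cols):
        findings.append(f"column dropped: {col}")
    for col in sorted(new_cols - base_cols):
        findings.append(f"column added: {col}")

    # Coding scheme check: values never seen before in a coded field.
    for fld in coded_fields:
        known = {r[fld] for r in baseline if fld in r}
        seen = {r[fld] for r in update if fld in r}
        for code in sorted(seen - known):
            findings.append(f"new code in {fld}: {code!r}")
    return findings


# Illustrative data only.
baseline = [
    {"id": 1, "region": "NE", "status": "A"},
    {"id": 2, "region": "SW", "status": "I"},
]
update = [
    {"id": 3, "region": "NORTHEAST", "status": "A", "channel": "web"},
]

findings = compare_update(baseline, update, coded_fields=["region", "status"])
for line in findings:
    print(line)
```

Findings like these make non-threatening feedback easy to give: they describe what changed without claiming that the change is an error.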

Page 22:

Data & information

structured data and unstructured data

What makes data (information) useful?

Page 23:

Data vs. information

Data: simple (single) observation, fact, or declaration

Information: data (facts) with context to be more meaningful and useful

“Knowledge: valuable information from the human mind”

For many thinkers, there is a subtle, almost philosophical difference between data and information.

Page 24:

Initial definitions

Reality: Things and events.

Data: A single observation about reality, clearly defined.

Information: One or more items of data, with definition and context to make it meaningful.

Knowledge: Simultaneous awareness of much information, and ability to cognitively integrate it.

Wisdom: Knowing not to sleep through this lecture.

Page 25:

Structural elements of tabular “data”

Piece of data; “a fact” a.k.a. “cell”

Record

Table

Database

What are you seeking? A fact, a record, a table, or a database?

Page 26:

Acquiring data or information?

[Diagram: a range from Tabular through Semi-structured to Unstructured, with examples placed along it: Cartesian dataset, multi-table database, Web page, Raster, Text document, diary, memoirs]

The web is not a source!

It is a medium!

Page 27:

Data vs. meaning

Source A

Name            Address
Lucy Davis      41 Main St.
Franz Kraemer   532 Elm Ave. Apt G
Alex Karnov     563-A Pine Street
Glenn Pratt     78 Mills Lane
David Orr       587 New York Ave.
Peter Vines     798 Wisconsin Ave.
Sally Forth     21 Market St.
Adam Karr       487 Riverside Dr.

Source B

Name            Address
LUCY DAVIS      41 MAIN ST.
FRANZ KRAEMER   532 ELM AVE. APT G
ALEX KARNOV     563-A PINE STREET
GLENN PRATT     78 MILLS LANE
DAVID ORR       587 NEW YORK AVE.
PETER VINES     798 WISCONSIN AVE.
SALLY FORTH     21 MARKET STREET
ADAM KARR       487 RIVERSIDE DR.

Are these the same data?

Same meaning? Yes. But not the same data.

Mixed case is difficult to derive correctly from ALL CAPS.
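A short sketch makes the point concrete, using three of the records from the two sources on this slide. The helper name `same_ignoring_case` is an assumption, and the surname "McDonald" is a hypothetical example that is not in the slide data; it is included because it shows where automatic case restoration goes wrong.

```python
# Three records taken from the slide's Source A and Source B.
source_a = [
    ("Lucy Davis", "41 Main St."),
    ("Sally Forth", "21 Market St."),
    ("Alex Karnov", "563-A Pine Street"),
]
source_b = [
    ("LUCY DAVIS", "41 MAIN ST."),
    ("SALLY FORTH", "21 MARKET STREET"),
    ("ALEX KARNOV", "563-A PINE STREET"),
]


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after discarding case differences."""
    return a.casefold() == b.casefold()


# Exact comparison: no record in A equals its counterpart in B.
identical = [a == b for a, b in zip(source_a, source_b)]

# Case-insensitive comparison: the Sally Forth addresses still differ,
# because one source abbreviates "Street" and the other does not.
case_only = [
    all(same_ignoring_case(x, y) for x, y in zip(a, b))
    for a, b in zip(source_a, source_b)
]

# Going from ALL CAPS back to mixed case loses information.
# "McDonald" is a hypothetical name, not one of the slide's records.
restored = "McDonald".upper().title()

print(identical, case_only, restored)
```

Converting to upper case is a one-way operation: the naive restoration yields "Mcdonald", and nothing in the ALL CAPS value says where the second capital belonged.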

Page 28:

Universe of knowledge, information, & data

Page 29:

Structured vs. unstructured data

Structured data

  Most tabular databases: business, government, science & research
  Can fit into RDBMS

Unstructured data

  Personal letters
  Memoirs, diaries
  Literature (history, poetry, fiction)
  Most books
  Still images (paintings, photos, x-ray, ultrasound)
  Sounds (sound recordings, EKG, SOSUS)
  Moving images (cinema, TV, etc.)

Page 30:

Structured vs. unstructured data

[Diagram: Geospatial data (GIS data) placed between Structured data and Unstructured data]

  Raster: imagery, topos
  Vector data: streets, areas (“points, lines, polygons”)

Page 31:

Parsing and processing data

Tabular data: Computers are good at processing. SQL, relational model, etc.

Unstructured data: Humans are good at processing. Memory, free association.

Page 32:

Processing unstructured data

Unstructured data: Humans are good at processing. Memory, free association.

Examples:

  Hearing classical music, and correctly guessing the composer.
  Recognizing the signature style of an oil painting.
  Recognizing voices.
  Reading emotions on faces.
  Understanding incomplete sentences.
  Seeing humor (intended and not).

[Image label: Star Wars]

Page 33:

Asking for the right data

…or…

Asking for the right information

Page 34:

Who is the first user? …the final user?

Analytical support of macro-decisions
  Data warehouse and business intelligence
  Probably to be manipulated by analysts
  High-level decision-maker will use final output

Operational business system (micro-decisions)
  Geocoding customers
  CRM
  Oil exploration
  Agricultural field characteristics

Pure, undirected research

Discovery for litigation

Page 35:

What do decision-makers want?

Data or information?

Page 36:

“Yeah, we got data. Lots of data!”

[Slide background: a wall of binary digits (0s and 1s)]

Page 37:

Always strive to make information more useful to the recipient!

Basketball scores

Los Angeles   LXXIV
San Antonio   LXVII
Detroit       LXXXV
Boston        LXXIII
Seattle       LXXV
Phoenix       LXXIX
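The scores above are correct data expressed in a form few recipients can use at a glance. A minimal sketch of making them more useful is to convert the Roman numerals to ordinary integers; the function name `roman_to_int` is an assumption, and the conversion uses the standard subtractive rule.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to an integer using the subtractive rule."""
    values = [ROMAN_VALUES[ch] for ch in numeral.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller symbol placed before a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


# Scores exactly as they appear on the slide.
scores = {
    "Los Angeles": "LXXIV",
    "San Antonio": "LXVII",
    "Detroit": "LXXXV",
    "Boston": "LXXIII",
    "Seattle": "LXXV",
    "Phoenix": "LXXIX",
}

readable = {team: roman_to_int(score) for team, score in scores.items()}
for team, points in readable.items():
    print(f"{team:12} {points}")
```

The facts are unchanged by the conversion; only the expression changes, which is what makes the result easier for the recipient to read and compare.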

Page 38:

Data vs. expression

Executive may ask for this:

                  % sales to
sales division    minorities
----------------------------
NORTHEAST         12.3
SOUTHEAST         39.1
MIDWEST           21.3
SOUTHWEST         17.6
PACIFIC           14.9
----------------------------
TOTAL U.S.        20.8

Are you going to ask for just six records from your source?

No! Why?

This information (report) has a high probability of being inadequate. The executive will inevitably ask for more.

Page 39:

Supporting macro-decisions is iterative.

[Diagram labels: External sources; Internal sources; ETL; Data whse; Data mart(s); B.I.; Knowledge worker(s)]

Process: Filtering, Aggregation, Exploration, Correlation, Analysis

[Sample output: the “% sales to minorities” by sales division report from the previous page]

[Chart: Manufacturing as share of total employment, 1950 to 2010; labeled values 32.1 % and 11.7 %]

[Chart: Share of consumption by category, 1929 vs. 2001; categories: Motor vehicles, Furniture & household, Other durable, Food, Clothing & shoes, Gasoline, fuels, Other non-durable, Housing, Household operation, Transportation, Medical care, Recreation, Other services]

Page 40:

Raw data vs. derived data

You always want raw data, at the most granular level possible!

No ratios or averages: they can NOT be aggregated.

Derived data (Population Density)

Country        Pop Dens
Belgium        340.0
France         111.3
Germany        230.9
Italy          193.0
Netherlands    397.1
Spain           80.0
Switzerland    182.2

Avg = 219.2

Raw data

Country        Population     Sq Km        Pop Dens
Belgium         10,379,067       30,528    340.0
France          60,876,136      547,030    111.3
Germany         82,422,299      357,021    230.9
Italy           58,133,509      301,230    193.0
Netherlands     16,491,461       41,526    397.1
Spain           40,397,842      504,782     80.0
Switzerland      7,523,934       41,290    182.2

Total          276,224,248    1,823,407    151.5
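The two summary figures on this slide, 219.2 and 151.5, can be reproduced from the raw table. The sketch below uses only the population and area values shown above; the variable names are illustrative. It shows why a column of ratios cannot simply be averaged: the mean of the seven densities weights every country equally, while the true combined density has to be recomputed from total population and total area.

```python
# Population and area (sq km) exactly as listed in the raw data table.
countries = {
    "Belgium": (10_379_067, 30_528),
    "France": (60_876_136, 547_030),
    "Germany": (82_422_299, 357_021),
    "Italy": (58_133_509, 301_230),
    "Netherlands": (16_491_461, 41_526),
    "Spain": (40_397_842, 504_782),
    "Switzerland": (7_523_934, 41_290),
}

# Derived data: one density ratio per country.
densities = {name: pop / area for name, (pop, area) in countries.items()}

# Averaging the ratios treats Switzerland and France as equally weighted.
naive_average = sum(densities.values()) / len(densities)

# Aggregating the raw facts first gives the density of the combined area.
total_population = sum(pop for pop, _ in countries.values())
total_area = sum(area for _, area in countries.values())
combined_density = total_population / total_area

print(f"average of ratios: {naive_average:.1f}")
print(f"ratio of totals:   {combined_density:.1f}")
```

With only the derived column in hand, the second figure cannot be computed at all, which is the practical reason to ask the source for population and area rather than density.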

Page 41:

Anticipate the analysis and information delivery.

Have data analysis tools ready.

Output will be iterative.

Best output allows for graphic analysis.

Time series are valuable…

…but require history.

Don’t neglect history when asking for data.

Page 42:

Why trend graphs (a.k.a. “time series”)?

257 deaths per 100,000 persons due to heart disease in CY-2000

This statistic alone lacks meaning!

We must give it context!

[Chart: Deaths from heart disease in U.S., deaths per 100,000 persons, 1960 to 2010; y-axis 0 to 600; labeled values: 559 deaths per 100,000, and 257]

Page 43:

How do executives make decisions?
Cognitive vs. feelings

                 Week 1                        Week 2                        Week 3
Product Line      Mon   Tue   Wed   Thu   Fri   Mon   Tue   Wed   Thu   Fri   Mon   Tue   Wed
Peas             45.6  46.6  47.1  48.1  51.4  43.6  44.1  45.5  45.1  48.4  47.9  46.2  47.1
Carrots          20.4  20.8  21.1  21.5  22.9  19.5  19.7  20.3  20.1  21.6  21.4  20.6  21.1
Tomatoes         75.8  77.4  78.2  79.8  85.2  72.3  73.2  75.5  74.8  80.3  79.6  76.6  78.2
Cucumbers        21.4  21.8  22.1  22.5    24  20.4  20.6  21.3  21.1  22.6  22.4  21.6  22.1
Green beans      35.9  36.7  37.1  37.9  40.4  34.3  34.7  35.8  35.5  38.1  37.7  36.3  37.1
Corn             57.3  58.5  59.1  60.4  64.5  54.7  55.3  57.1  56.6  60.7  60.2  57.9  59.1
Asparagus        2.91  2.98  3.01  3.07  3.28  2.78  2.81   2.9  2.88  3.09  3.06  2.95  3.01
Broccoli         13.6  13.9    14  14.3  15.3    13  13.1  13.6  13.4  14.4  14.3  13.7    14
Oranges            69  70.4  71.2  72.6  77.6  65.8  66.6  68.7  68.1  73.1  72.4  69.7  71.2
Lemons           10.7  10.9    11  11.3    12  10.2  10.3  10.7  10.6  11.3  11.2  10.8    11
Pineapple        27.2  27.8  28.1  28.6  30.6    26  26.3  27.1  26.9  28.8  28.6  27.5  28.1
Lettuce          94.2  96.2  97.2  99.2   106  89.9    91  93.9  93.1  99.8    99  95.3  97.2
Garlic           1.94  1.98  2.01  2.05  2.19  1.85  1.88  1.94  1.92  2.06  2.04  1.96  2.01
Guava            0.97  0.99     1  1.02  1.09  0.93  0.94  0.97  0.96  1.03  1.02  0.98     1
Blackberries     2.91  2.98  3.01  3.07  3.28  2.78  2.81   2.9  2.88  3.09  3.06  2.95  3.01
Strawberries     7.77  7.93  8.02  8.18  8.74  7.42   7.5  7.75  7.68  8.23  8.16  7.86  8.02
Blueberries      27.2  27.8  28.1  28.6  30.6    26  26.3  27.1  26.9  28.8  28.6  27.5  28.1
Raspberries      11.7  11.9    12  12.3  13.1  11.1  11.3  11.6  11.5  12.3  12.2  11.8    12
Boysenberries    8.74  8.93  9.02  9.21  9.83  8.34  8.44  8.71  8.63  9.26  9.18  8.84  9.02

TOTAL             535   546   552   564   602   511   517   534   529   567   562   541   552

(Columns for Week 3 Thu through Week 4 Tue are empty on the slide; their totals show as 0.)

When executives ask for data or information, be sure they understand the total costs.

Page 44:

Tables (raw data) are hard to understand

U.S. Monthly unemployment statistics

Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1980   6.3   6.3   6.3   6.9   7.5   7.6   7.8   7.7   7.5   7.5   7.5   7.2
1981   7.5   7.4   7.4   7.2   7.5   7.5   7.2   7.4   7.6   7.9   8.3   8.5
1982   8.6   8.9     9   9.3   9.4   9.6   9.8   9.8  10.1  10.4  10.8  10.8
1983  10.4  10.4  10.3  10.2  10.1  10.1   9.4   9.5   9.2   8.8   8.5   8.3
1984     8   7.8   7.8   7.7   7.4   7.2   7.5   7.5   7.3   7.4   7.2   7.3
1985   7.3   7.2   7.2   7.3   7.2   7.4   7.4   7.1   7.1   7.1     7     7
1986   6.7   7.2   7.2   7.1   7.2   7.2     7   6.9     7     7   6.9   6.6
1987   6.6   6.6   6.6   6.3   6.3   6.2   6.1     6   5.9     6   5.8   5.7
1988   5.7   5.7   5.7   5.4   5.6   5.4   5.4   5.6   5.4   5.4   5.3   5.3
1989   5.4   5.2     5   5.2   5.2   5.3   5.2   5.2   5.3   5.3   5.4   5.4
1990   5.4   5.3   5.2   5.4   5.4   5.2   5.5   5.7   5.9   5.9   6.2   6.3
1991   6.4   6.6   6.8   6.7   6.9   6.9   6.8   6.9   6.9     7     7   7.3
1992   7.3   7.4   7.4   7.4   7.6   7.8   7.7   7.6   7.6   7.3   7.4   7.4
1993   7.3   7.1     7   7.1   7.1     7   6.9   6.8   6.7   6.8   6.6   6.5
1994   6.6   6.6   6.5   6.4   6.1   6.1   6.1     6   5.9   5.8   5.6   5.5
1995   5.6   5.4   5.4   5.8   5.6   5.6   5.7   5.7   5.6   5.5   5.6   5.6
1996   5.6   5.5   5.5   5.6   5.6   5.3   5.5   5.1   5.2   5.2   5.4   5.4
1997   5.3   5.2   5.2   5.1   4.9     5   4.9   4.8   4.9   4.7   4.6   4.7
1998   4.6   4.6   4.7   4.3   4.4   4.5   4.5   4.5   4.6   4.5   4.4   4.4
1999   4.3   4.4   4.2   4.3   4.2   4.3   4.3   4.2   4.2   4.1   4.1     4
2000     4   4.1     4   3.8     4     4     4   4.1   3.9   3.9   3.9   3.9
2001   4.2   4.2   4.3   4.4   4.3   4.5   4.6   4.9     5   5.3   5.5   5.7
2002   5.7   5.7   5.7   5.9   5.8   5.8   5.8   5.7   5.7   5.7   5.9     6
2003   5.8   5.9   5.9     6   6.1   6.3   6.2   6.1   6.1     6   5.8   5.7
2004   5.7   5.6   5.8   5.6   5.6   5.6   5.5   5.4   5.4   5.5   5.4   5.4
2005   5.2   5.4   5.2   5.1   5.1     5     5   4.9   5.1     5     5   4.8
2006   4.7   4.7   4.7   4.7   4.7   4.6   4.7   4.7   4.5   4.4   4.5   4.4
2007   4.6   4.5   4.4   4.5   4.5   4.6   4.7   4.7   4.7   4.8   4.7     5
2008   4.9   4.8   5.1     5   5.5

Page 45:

Unemployment

[Chart: U.S. unemployment rate, seasonally adj., Jan-92 to Jan-08; y-axis 0 to 9; the Clinton and Bush II terms are marked]

Source: Bureau of Labor Statistics web site

Page 46:

Placing data points into context yields information!

Surround your requested data points with context!

Time series

Peer data

Causal factors

Breakdown / drilldown

Graphical expression

All these require many more data points than the executive originally requested!

On nearly every dimension.

Page 47:

Choices in detail of data

Original or derivative

Granular or summary

Filtered or not

Translated or not

Data is always easier to aggregate than to disaggregate!

It is always easier to filter out unneeded data than to request more data later.

Page 48:

Converting data to information

Query and reporting tools are required

Needed functions:

Aggregation

Sorting and filtering

Association and joining

Clustering and stratification

Graphics

Page 49:

Converting data to information

Raw data: 200 deaths from TB in Baker County, CY-2004

Add context: avg. population of Baker County, CY-2004: 50,000

Compute ratio: 4 deaths per 1,000 pop., Baker Co., CY-2004

Add context:

County      TB deaths   Population   TB Rate
Adams       128         21,490       6.0
Baker       200         50,000       4.0
Carswell     87         17,215       5.1
Davis       189         41,200       4.6
Eaton       200         38,000       5.3

Conclusion: Baker County has the lowest TB rate of 5 peer counties.
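The steps on this slide (raw count, population as context, ratio, peer comparison) can be sketched in a few lines. The figures are the slide's own; the names `counties` and `rate_per_1000` are illustrative assumptions, not anything from the original deck.

```python
# (TB deaths, population) per county for CY-2004, copied from the slide's table.
counties = {
    "Adams": (128, 21_490),
    "Baker": (200, 50_000),
    "Carswell": (87, 17_215),
    "Davis": (189, 41_200),
    "Eaton": (200, 38_000),
}

# Compute ratio: deaths per 1,000 population.
rate_per_1000 = {
    name: deaths / population * 1000
    for name, (deaths, population) in counties.items()
}

# Add context: rank each county against its peers, lowest rate first.
for name, rate in sorted(rate_per_1000.items(), key=lambda item: item[1]):
    print(f"{name:<9}{rate:4.1f}")

lowest = min(rate_per_1000, key=rate_per_1000.get)
print("lowest rate:", lowest)  # Baker
```

Baker and Eaton both report 200 deaths, so the raw count alone cannot separate them; only the ratio and the peer comparison do.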

Page 50:

Converting data to information

[Flow labels: Raw data; Add context; Compute ratio; Add context; Useful information]

[Chart: TB rate by county; y-axis 0.0 to 7.0; Adams 6.0, Baker 4.0, Carswell 5.1, Davis 4.6, Eaton 5.3]

Page 51:

But time series is even better.

[Chart: TB rates by county, 1999 to 2005; y-axis 0 to 8; series: Adams, Baker, Carswell, Davis, Eaton; Baker County is called out]

Page 52:

Can precision be distracting?

Assets                                          U.S. Dollars
Current assets:
  Cash and cash equivalents                     12,568,197,382.24
  Marketable securities                          1,118,075,118.52
  Notes and accts receivable                     9,540,118,972.94
  Short-term financing receivables              13,750,181,442.41
  Other accounts receivable                      1,138,348,791.55
  Inventories                                    2,841,211,897.62
  Deferred taxes                                 1,765,108,773.94
  Prepaid expenses and other current assets      2,941,012,486.33
Total current assets                            45,662,254,865.55

Rounded to $ mil.

Assets                                          $ Mil
Current assets:
  Cash and cash equivalents                     12,568
  Marketable securities                          1,118
  Notes and accts receivable                     9,540
  Short-term financing receivables              13,750
  Other accounts receivable                      1,138
  Inventories                                    2,841
  Deferred taxes                                 1,765
  Prepaid expenses and other current assets      2,941
Total current assets                            45,662

Source: IBM Annual Report, 2005. Pennies contrived.
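As a hedged illustration of rounding for presentation, the sketch below rounds the slide's figures to whole millions with Python's `decimal` module. The helper name `to_millions` and the choice of half-up rounding are assumptions made for this example; the slide does not say how its rounding was done.

```python
from decimal import Decimal, ROUND_HALF_UP

# Line items in U.S. dollars, copied from the slide (its pennies are contrived).
items = {
    "Cash and cash equivalents": Decimal("12568197382.24"),
    "Marketable securities": Decimal("1118075118.52"),
    "Notes and accts receivable": Decimal("9540118972.94"),
    "Short-term financing receivables": Decimal("13750181442.41"),
    "Other accounts receivable": Decimal("1138348791.55"),
    "Inventories": Decimal("2841211897.62"),
    "Deferred taxes": Decimal("1765108773.94"),
    "Prepaid expenses and other current assets": Decimal("2941012486.33"),
}

MILLION = Decimal(1_000_000)


def to_millions(amount):
    """Round a dollar amount to whole millions (half-up) for presentation."""
    return int((amount / MILLION).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


total = sum(items.values())                       # precise total in dollars
rounded_total = to_millions(total)                # round once, from the precise total
sum_of_rounded = sum(to_millions(v) for v in items.values())

print(rounded_total)   # 45662
print(sum_of_rounded)  # 45661
```

The rounded line items add up to 45,661 while the rounded total is 45,662. Rounding is a presentation step; keeping the precise values underneath lets totals be computed first and rounded afterward.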

Page 53:

How much scope?

My fundamental bias:

Get as much as you can get for the same price.

Time

Organizational

Cost is mainly labor--creating the extract file.

Same labor for getting 4 years of history as 2 years.

Media and storage costs are trivial.

Page 54:

Why more data?

Testing

Continuity of definitions over time.

Reasonableness of row counts, etc.

Test predictive models on historical data.

Decision-makers will expand scope of query later.

Context! You can never have too much context!
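One way the "reasonableness of row counts" test might look, as a minimal sketch: with several periods of history on hand, a period whose row count is far from the typical count stands out. The function name, the 50% tolerance, and the monthly counts below are all hypothetical values chosen for illustration.

```python
from statistics import median


def unusual_periods(row_counts, tolerance=0.5):
    """Return periods whose row count differs from the median count
    by more than `tolerance`, expressed as a fraction of the median."""
    typical = median(row_counts.values())
    return [
        period
        for period, count in row_counts.items()
        if abs(count - typical) > tolerance * typical
    ]


# Hypothetical rows received per month in an extract file.
counts = {
    "2007-01": 10_250,
    "2007-02": 9_980,
    "2007-03": 10_410,
    "2007-04": 312,      # suspiciously small: a truncated extract?
    "2007-05": 10_120,
}

print(unusual_periods(counts))  # ['2007-04']
```

A check like this is only possible when more than one period was requested, which is the slide's point about asking for more history than the immediate question needs.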

Page 55:

Options in granularity

Raw data file: line item detail

Date        Customer  Product  Qty  Un Price  Ext Price
1/4/2007    4781      1004     60   37.81     2,268.60
1/5/2007    4780      1004     20   37.81       756.20
1/6/2007    4779      1005     37   13.98       517.26
1/7/2007    4778      1006     10   28.99       289.90
1/8/2007    4781      1004     15   37.81       567.15
1/9/2007    4780      1005     20   13.98       279.60
1/10/2007   4779      1006     15   28.99       434.85
1/11/2007   4778      1006     12   28.99       347.88
1/12/2007   4781      1005     18   13.98       251.64
1/13/2007   4780      1004     24   37.81       907.44
1/14/2007   4779      1006     30   28.99       869.70
1/15/2007   4778      1006     12   28.99       347.88
1/16/2007   4781      1007     10   32.18       321.80
1/17/2007   4780      1006     30   28.99       869.70
1/18/2007   4779      1007     18   32.18       579.24
1/19/2007   4778      1004     12   37.81       453.72
1/20/2007   4781      1004      6   37.81       226.86
1/21/2007   4780      1005     22   13.98       307.56
1/22/2007   4779      1006     18   28.99       521.82
1/23/2007   4778      1007     37   32.18     1,190.66
1/24/2007   4781      1005     12   13.98       167.76
1/25/2007   4780      1004     20   37.81       756.20
1/26/2007   4779      1006     15   28.99       434.85
1/27/2007   4778      1007     10   32.18       321.80

Derivative data files:

Product summary

Product  Qty sold
1004     157
1005     109
1006     142
1007      75

Customer value

Customer  Revenue
4778      2,951.84
4779      3,357.72
4780      3,876.70
4781      3,803.81

Prod./cust summary

Customer  Product  Revenue
4778      1004       453.72
4778      1006       985.66
4778      1007     1,512.46
4779      1005       517.26
4779      1006     2,261.22
4779      1007       579.24
4780      1004     2,419.84
4780      1005       587.16
4780      1006       869.70
4781      1004     3,062.61
4781      1005       419.40
4781      1007       321.80
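The three derivative files on this slide can each be produced from the raw line items, but not the other way around. The sketch below shows one way to do that; the line items and unit prices are the slide's (dates are omitted because none of the summaries use them), prices are held in cents to keep the sums exact, and the function and variable names are illustrative assumptions.

```python
from collections import defaultdict

# Unit prices in cents, and (customer, product, qty) line items,
# copied from the slide's raw data file.
UNIT_PRICE_CENTS = {1004: 3781, 1005: 1398, 1006: 2899, 1007: 3218}
LINE_ITEMS = [
    (4781, 1004, 60), (4780, 1004, 20), (4779, 1005, 37), (4778, 1006, 10),
    (4781, 1004, 15), (4780, 1005, 20), (4779, 1006, 15), (4778, 1006, 12),
    (4781, 1005, 18), (4780, 1004, 24), (4779, 1006, 30), (4778, 1006, 12),
    (4781, 1007, 10), (4780, 1006, 30), (4779, 1007, 18), (4778, 1004, 12),
    (4781, 1004, 6), (4780, 1005, 22), (4779, 1006, 18), (4778, 1007, 37),
    (4781, 1005, 12), (4780, 1004, 20), (4779, 1006, 15), (4778, 1007, 10),
]


def summarize(key_of):
    """Group line items by key_of(customer, product).

    Returns two dicts: total quantity and total revenue (in cents) per key.
    """
    qty, revenue = defaultdict(int), defaultdict(int)
    for customer, product, n in LINE_ITEMS:
        key = key_of(customer, product)
        qty[key] += n
        revenue[key] += n * UNIT_PRICE_CENTS[product]
    return dict(qty), dict(revenue)


product_qty, _ = summarize(lambda customer, product: product)
_, customer_revenue = summarize(lambda customer, product: customer)
_, cust_prod_revenue = summarize(lambda customer, product: (customer, product))

print(product_qty)                     # product summary: qty sold per product
print(customer_revenue[4778] / 100)    # customer value for customer 4778
print(cust_prod_revenue[(4779, 1006)] / 100)
```

Any further summary (by week, by price band, and so on) is another call on the same raw rows, whereas none of the three summaries could be turned back into the line items.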

Page 56:

Potential data providers: your impact upon them

Introduction

Spelling out the Relationship

Data & information

Universe of knowledge

Data coming from bureaucracies

Asking for the right data

Potential data providers

Physical forms and media

Logical data architecture

Semantics & meaning

Documentation & metadata

Scope & completeness

Fund. of data quality

Update & refresh issues

Data collection bias

Ownership & legal

Confidentiality

Data flow surveillance

Conclusion

Page 57:

Key questions:

Does the data come from their operations?

Do they log business transactions adequately?

Do they log changes to kernel-stable entities adequately?

What “enhancements” must they make to their application to extract the data you desire?

What cutoff policies do they have on transactions?

Page 58:

Kinds of data source organizations

Selling (providing) data is a sideline to their primary business:
Banks
Credit card issuers
Healthcare org’s
Insurance
Retailers
Airlines
Telephone

Selling data is a major source of revenue:
Credit bureaus (Equifax, Experian, TransUnion)
Marketing companies (D&B, DMA)
Suppliers of… maps, imagery
News org’s (UPI)
Knowledge sellers (LexisNexis)

Sharing data is a cultural value, not for revenue:
Government agencies
Academic research
NGO’s

Page 59:

Kinds of data source organizations

Selling (providing) data is a sideline to their primary business

Selling data is a major source of revenue

Sharing data is a cultural value, not for revenue

May sell you the data, but more guarded about the documentation.

Writing external data documentation is an annoyance!

Simpler datasets require less semantic documentation.

These people ought to get the documentation right and rich!

Ask to see it first.

Page 60:

What burden will you place on the data provider?

Depends upon…

how they store and manage their data

…and…

your needs of scope, architecture, timing, quality.

Page 61:

Two kinds of data generation

Data as byproduct of business processes

commercial sector:
banking
manufacturing
retail sales
customer service activities (utilities, communications, etc.)
hospital patient records & billing
insurance policy setup and claims
education: student enrollment, grades, etc.

governments:
social welfare and public assistance
tax collection
city services (trash, utilities)
voting
public libraries (patron activity)

Data gathered as non-business research

field surveys of land, topo, etc.

observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies

satellite & aerial imagery

Hybrid: strategic intelligence, police surveillance, mineral exploration, etc.

Page 62:

Two kinds of data generation

Data as byproduct of business processes

Captured through business applications

Complex logical data architectures

May not have complete logging

Data extract may be a logistical and programming burden

Generally must be done for DW.

Data gathered as non-business research

Captured through special studies

Generally simple logical data architectures

Not stored in application databases

Easier to extract and make copies

Page 63:

Research data

Data gathered as non-business research

field surveys of land, topo, etc.

observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies

satellite & aerial imagery

Simple logical data architecture

Page 64:

Research data

Data gathered as non-business research

Example: interview-based research -- census

[Diagram entities: Interviewee; Family; Residence; Employment]

Simple logical data architecture

Page 65:

Research data

Data gathered as non-business research

Example: cancer research study

[Diagram entities: Patient (kernel-stable); Family (kernel-stable); Examination & diagnosis (event); Hospital stay (episode); Treatment (event)]

Simple logical data architecture

Page 66:

Data created in business processes

[Diagram labels: users; business transactions; business application software; database reads and writes; application database]

Page 67:

Application characteristics

Business application software

Built to facilitate business operations.

Data captured to support ops.

Has a logical data architecture: you need to understand it.

Generally designed to meet on-line performance expectations.

Memory (versioning) often not important on many entities (particularly customers).

Business will not stop (“freeze”) for you to extract data.

Page 68:

Application database characteristics

Architecture supports application.

Hopefully well-normalized.

May or may not include business event logging.

DBMS: IMS, relational, network, flat

Application database

Page 69:

Application database characteristics

Application database:
master files (kernel-stable)
transactions (events)
change logs (business logs or DBMS logs)

Page 70:

Over-time life cycle of subject entities

long lives: “kernel-stable”

limited life: “episode”

point-in-time transactions: “events”

Page 71:

Kernel-stable entities

customers
parties
people
departments
products
services
facilities
vehicles
ships and aircraft
library holding
properties
cost centers
accounts (bank, credit card, G/L)
corporation
institution
groupings of above ***

Page 72:

Episode-like entities

hospital stay
subscription
illness
maintenance & support contract
employment period
project
library check-out
hotel room stay
prison sentence
unemployment benefit
conference registration
college enrollment
phone call (successful)
accounting period

Page 73:

Event or transaction entities

customer's order
shipment
invoice
G/L posting
phone call (failed)
sale of asset
treatment
test or observation
airline flight
inquiry
turnstile passage
application for college
graduation
credit card charge
paycheck

Page 74:

Kernel-stable entities

Examples: customers, parties, people, departments, products, services, facilities, vehicles, library holding, property, cost center, GL account, institution, groupings

Stable identity and existence.

2 kinds of changes:
change to existence or ID
change to non-key attribute

Changes to attributes occur rarely.

Are such changes logged in the application?

Are all changes logged?

Is versioning valued?

Page 75:

Episode-like entities

Examples: hospital stay, subscription, illness, contract for service, employment period, project, library check-out, hotel room stay, prison sentence, unemployment ben., conf. registration, college enrollment, phone call

Always exist over a finite period of time.

End-point not always known.

Often confused with the starting event. (may have same key)

May have many kinds of subordinate events.

May have subordinate episodes.

May or may not be mutually exclusive with peers.

Page 76:

Event or transaction entities

Examples: customer's order, shipment, invoice, G/L posting, phone call (failed), sale of asset, treatment, test or observation, airline flight, turnstile passage, application, graduation, cr. card charge, paycheck

Not designed to last a long time.

Generally only one key date/time.

Revisions may occur, but rare.

May be negated or reversed by subsequent transaction.

Mutually exclusive with peers.

May be subordinate to one or more episodes.

Almost always subordinate to other kernel entity(s).

Page 77:

Event or transaction entities

Examples: customer's order, shipment, invoice, G/L posting, phone call (failed), sale of asset, treatment, test or observation, airline flight, turnstile passage, application, graduation, cr. card charge, paycheck

BIG QUESTION!

Can records be changed (updated, corrected) after creation?

This has profound consequences for how you update your copy of the data over time.

Page 78:

Basic accounting entities

[Diagram labels: balance sheet, Jan. 1, 2005; accounting period (episode); events; balance sheet, Dec. 31, 2005]

All accounting data are either…

1. events (postings)

2. aggregates of events over a time period (episode), …or…

3. statement of condition at a point in time (balance sheet).
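The three kinds of accounting data relate to one another in a simple way: a closing balance is the opening balance plus the postings that fall within the period. The sketch below illustrates this for a single account; the postings, amounts, and opening balance are hypothetical, and amounts are held in cents.

```python
from datetime import date

# Hypothetical postings (events) to one account: (posting date, amount in cents).
postings = [
    (date(2005, 2, 14), 125_000),
    (date(2005, 6, 30), -40_000),
    (date(2005, 11, 3), 15_500),
    (date(2006, 1, 9), 9_900),    # falls outside the 2005 accounting period
]

# Hypothetical statement of condition at Jan. 1, 2005.
opening_balance = 500_000


def period_activity(postings, start, end):
    """Aggregate of events over an accounting period (an episode),
    including postings dated on the start and end dates."""
    return sum(amount for day, amount in postings if start <= day <= end)


activity_2005 = period_activity(postings, date(2005, 1, 1), date(2005, 12, 31))

# Statement of condition at Dec. 31, 2005.
closing_balance = opening_balance + activity_2005

print(activity_2005)    # 100500
print(closing_balance)  # 600500
```

If only period aggregates or balances are supplied, the individual postings cannot be recovered, so the events are the raw data here.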

Page 79:

Confusing episodes and events

[Diagram labels: book checked out (episode); check-out event (event); expected return date (event)]

Other ambiguities:

incarceration

phone call

hospital admission

airline flight

Page 80:

Episodes can contain events

Hotel room stay (episode) > Charge (event)

Hospital stay (episode) > Test, Medication (event)

Project (episode) > Tasks (episode) > Labor charges (event)

War (episode) > Campaign (episode) > Battle (episode) > Casualty (event)

Episodes may contain “sub-episodes”
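The containment described on these slides can be modeled in a few lines. This is a minimal sketch, not a schema from the original deck: the class and field names are assumptions, and the project data is hypothetical. An episode has a start and an end that may not be known yet, and may hold both events and sub-episodes.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    """Point-in-time transaction with a single key date/time."""
    name: str
    at: datetime


@dataclass
class Episode:
    """Entity that exists over a finite period; the end may not be known yet."""
    name: str
    start: datetime
    end: Optional[datetime] = None
    events: List[Event] = field(default_factory=list)
    sub_episodes: List["Episode"] = field(default_factory=list)

    def all_events(self) -> List[Event]:
        """Events of this episode and of every nested sub-episode."""
        found = list(self.events)
        for sub in self.sub_episodes:
            found.extend(sub.all_events())
        return found


# Hypothetical project containing one task (sub-episode) with two labor charges (events).
task = Episode(
    "Task: design",
    start=datetime(2008, 3, 3),
    end=datetime(2008, 3, 14),
    events=[
        Event("Labor charge", datetime(2008, 3, 7)),
        Event("Labor charge", datetime(2008, 3, 14)),
    ],
)
project = Episode("Project", start=datetime(2008, 3, 1), sub_episodes=[task])

print(len(project.all_events()))  # 2
print(project.end)                # None: the project is still open
```

Keeping `end` optional reflects the earlier point that an episode's end-point is not always known when the data is extracted.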

Page 81:

Data created in business processes

[Diagram labels: users; business transactions; business application software; database reads and writes; application database]

What data do you need from this environment?

What timing? Once, or continuous?

How do you expect it to be extracted?

Who is going to make that happen?

Architecture and application design are often barriers to sharing data in an organization.

Page 82:

Are your data needs 1-time, or continuous?

[Timeline: big first extract and load, followed by over-time “refresh” or update (Jan., Feb., Mar., Apr., May)]

What data do you want in “updates”?

Incremental or complete refresh?

How about corrections?

Page 83:

Full refresh vs. incremental updates

Full refresh (month-end copies: Jan., Feb., Mar., . . .):

Simple extract for source: delete and reload target.

Consumes more resources

You don’t know what changed, and what was deleted.

Incremental updates (initial load, then over-time “refresh” or update: Jan., Feb., Mar., Apr., May):

Complex processing both at source and at target

Resource (bandwidth) efficient.

You can measure rate of change easily.

Both may require coding and paradigm translation
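With full refreshes, the receiver is not told what changed or what was deleted. One common workaround, sketched here under the assumption that every record carries a stable key, is to compare two consecutive copies. The function name and the sample customer records are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two full-refresh copies, each a dict of key -> attributes.

    Returns (inserted, updated, deleted) as sorted lists of keys.
    """
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return inserted, updated, deleted


# Hypothetical month-end copies of a customer file.
january = {
    4778: {"city": "Redlands"},
    4779: {"city": "Loma Linda"},
    4780: {"city": "Riverside"},
}
february = {
    4778: {"city": "Redlands"},
    4779: {"city": "San Bernardino"},   # attribute changed
    4781: {"city": "Yucaipa"},          # new record; 4780 has disappeared
}

inserted, updated, deleted = diff_snapshots(january, february)
print(inserted, updated, deleted)  # [4781] [4779] [4780]
```

A comparison like this only sees the net difference between two copies. A record changed and changed back within the month leaves no trace, and the date of each change is unknown, which is why the source's own logging still matters.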

Page 84:

If logging takes place…where?

[Diagram labels: users; business transactions; business application software; database reads and writes; application database]

Business event logging

DBMS technical logging

Page 85:

How are backups taken?

[Diagram labels: users; business transactions; business application software; database reads and writes; application database]

DBMS backup for archive, recovery

Page 86

How are backups taken?

[Diagram labels: users; business transactions; business application software; database reads and writes; application database; DBMS backup for archive, recovery; extract business data from technical backup; business-readable full refresh.]

Page 87

Is there a data warehouse?

[Diagram labels: users; business transactions; business application software; database reads and writes; application database; periodic extracts; DW.]

Does this have the data you need?

Page 88

Dangers of DW extract data:

May not have the granularity you need. May already have been aggregated.

May not have desired fields.

May not have required scope (org, geo, etc.).

May not include corrections.

May not match your needs of time covered.

May have been transformed, cleansed, or filtered in some way.

[Diagram labels: DW; ETL; file.]

Page 89

Distinguish between update & correction

[Timeline figure: a big first extract and load, followed over time (Jan. through May) by incremental updates and by corrections.]

Page 90

Update vs. correction

Update:

  Address was 123 Main St.
  Is now 548 Elm St.
  He moved on April 4 (effective date).
  We learned about it May 25.
  We posted it on June 3 (record change date).

Correction:

  Record showed 519 Fern St.
  Should have been 984 Mills.
  He never lived at 519 Fern Street.
  It was an error.
  It was never true.
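One way to keep the two cases apart is to store both the effective date and the record change date, and to mark an erroneous row as voided rather than treating it as history. This is a minimal sketch under those assumptions; the class, the `voided_on` field, the year 2008, and the start date used for the earlier address are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AddressVersion:
    address: str
    effective_from: date               # when it became true in the real world
    recorded_on: date                  # when we posted it (record change date)
    voided_on: Optional[date] = None   # set when the row turns out to be an error


def address_as_of(history, when):
    """Return the address that was true on `when`, ignoring voided rows."""
    valid = [
        v for v in history
        if v.voided_on is None and v.effective_from <= when
    ]
    if not valid:
        return None
    return max(valid, key=lambda v: v.effective_from).address


# Update: both addresses were true, each in its own period.
updated = [
    AddressVersion("123 Main St.", date(2005, 1, 1), date(2005, 1, 1)),
    AddressVersion("548 Elm St.", date(2008, 4, 4), date(2008, 6, 3)),
]

# Correction: the first row was never true, so it is voided, not superseded.
corrected = [
    AddressVersion("519 Fern St.", date(2005, 1, 1), date(2005, 1, 1),
                   voided_on=date(2008, 6, 3)),
    AddressVersion("984 Mills", date(2005, 1, 1), date(2008, 6, 3)),
]

print(address_as_of(updated, date(2008, 3, 1)))
print(address_as_of(updated, date(2008, 5, 1)))
print(address_as_of(corrected, date(2006, 1, 1)))
```

With an update, a question about March still returns the old address. With a correction, the erroneous address is never returned for any date.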

Page 91

Logical and physical structure of…

extract (big bulk snapshot)

update (new, change, delete)

correction

[Diagram: each of the three has its own logical data architecture and data model, linked by the label “describes”. They are somewhat similar, but not the same.]

Will the physical file transfer format recognize nulls?

Page 92

Source burden for incremental update

Create a record when any major table experiences a…

  new record

  change in an existing record

  delete (or tag “delete”) of existing record

Change of kernel-stable records generally reflects a business event, and thus should be logged by application.

But is it? Or are all kernel entities so logged?

Page 93

Is it important for you to know what changed? Why?

Are the major changes to kernel-stable entities important to know?

Yes, they are, if they serve as dimensions.

Discontinuities of dimensions are problematic (an understatement)!

Page 94

Example of kernel-stable entity changes

[Diagram labels: Customer; Address in time; Change log.]

Is it important to know address of customer for past history?

Does application software maintain address history?

If not, do you need to track such changes (go forward)?

Are such changes being logged by application?

Who wants to know about street address history?

  marketing analysis
  epidemiology studies
  credit rating analysis
  security clearance research

Page 95

What are volatile fields (attributes)?

Volatile attributes:

  Street address
  Cell phone number
  e-mail address

Stable attributes:

  Spouse
  Children

Page 96

Documentation burden upon source

Nobody likes writing data documentation

(except, perhaps, some data bigots).

Especially so…

…when incidental to their primary duties.

Especially so…

…long after the system change was made.

Possible solution:

For a discount, offer to send back to them data behavior documentation.

Requires reverse data engineering

Page 97

Other issues

Be suspicious of tabularizing unstructured data.

Often requires coding taxonomies… are they sufficiently granular?

Page 98

Example: coding traffic fatalities

Roll over after skid

Hit center divider

Hit bridge abutment

Drove off a cliff

Drove into drainage ditch

Hit a deer

Tree fell on vehicle

Collision with parked trailer

Bicyclist hit tree

1 Auto-pedestrian

2 Auto-auto

3 Auto-fixed object

4 Auto-railroad
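The granularity problem can be made concrete by forcing the narratives above into the four codes. The sketch below is illustrative only: the code assignments are one hypothetical coder's guesses, not anything stated in the slides, and `None` marks a narrative that no code fits.

```python
from collections import defaultdict

CODES = {
    1: "Auto-pedestrian",
    2: "Auto-auto",
    3: "Auto-fixed object",
    4: "Auto-railroad",
}

# Hypothetical assignments; a different coder could reasonably disagree.
ASSIGNED = {
    "Roll over after skid": None,
    "Hit center divider": 3,
    "Hit bridge abutment": 3,
    "Drove off a cliff": None,
    "Drove into drainage ditch": None,
    "Hit a deer": None,
    "Tree fell on vehicle": None,
    "Collision with parked trailer": 3,
    "Bicyclist hit tree": None,
}


def narratives_by_code(assigned):
    """Group the free-text narratives under the code each one received."""
    groups = defaultdict(list)
    for narrative, code in assigned.items():
        groups[code].append(narrative)
    return dict(groups)


groups = narratives_by_code(ASSIGNED)
for code, narratives in groups.items():
    label = CODES.get(code, "no code fits")
    print(f"{label}: {narratives}")
```

Under these assumed assignments, three different events become indistinguishable once coded, and most of the narratives have no code at all, so they are either miscoded or lost.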

Page 99

Kinds of data source organizations

Selling (providing) data is a sideline to their primary business

  Banks
  Credit card issuers
  Healthcare org’s
  Insurance
  Retailers
  Airlines
  Telephone

Selling data is a major source of revenue

  Credit bureaus (Equifax, Experian, TransUnion)
  Marketing companies (D&B, DMA)
  Suppliers of… maps, imagery
  News org’s (UPI)
  Knowledge sellers (LexisNexis)

Sharing data is a cultural value, not for revenue

  Government agencies
  Academic research
  NGO’s

Page 100

Physical form and media

Introduction

Spelling out the Relationship

Data & information

Universe of knowledge

Data coming from bureaucracies

Asking for the right data

Potential data providers

Physical forms and media

Logical data architecture

Semantics & meaning

Documentation & metadata

Scope & completeness

Fund. of data quality

Update & refresh issues

Data collection bias

Ownership & legal

Confidentiality

Data flow surveillance

Conclusion

Page 101

Key questions:

Is the data being supplied on media which you can read with your technology?

Is a special program or database management system required to read it?

Is the documentation supplied in a manner which you can read and copy?

Is the data supplied in bulk, or incrementally, or even one transaction at a time?

Are there any compression techniques used on all or certain types of data in the file?

Page 102

Structured vs. unstructured

Structured

  Anything in a…
    spreadsheet
    DBMS
    file with defined fields

Unstructured but indexed

  Automatically managed…
    documents (document mgmt systems)
    medical records
    medical imaging
    satellite imagery
    sound and video
  Encyclopedia

Unstructured, NOT indexed

  Memoirs
  Personal letters
  Literature
  Meeting minutes
  Blogs
  Pictures of my vacation

Library books are catalogued as a whole, but not in part.

Page 103

Search engines and indexing

The internet is a medium, not a source and certainly not an “authoritative source”.

Each web site probably has an agenda and bias.

Search engines find text, not meaning.

Web sites can mask tabular data from search engines.

Search engines may not see some academic sources (peer-reviewed journals, etc.) because of cost of access.

Page 104

Physical media for structured data

Physically moved media:

  Punched cards
  Half-inch mag tape
  IBM tape cartridges
  8-inch floppy disk
  5-1/4 inch floppy disk
  3-1/2 inch floppy disk
  other cassettes or cartridges
  CD-ROM
  DVD
  …
  paper (yikes!)

Data moved virtually:

  Electronic files
  Messages (transactions)

Physical formats:

  Full database (req. DBMS)
  Flat file (positional)
    single-format
    multiple format w/ rec type
  Char-delimited file
  MS/Excel (or MS/Word)
  XML
  zip file
  other

SOM chaos

Page 105

Details in flat files

Two record types!

Each record starts with a business key and a record type code (rec type A or rec type B):

  1 A
  1 B
  2 A
  2 B
  2 B
  2 B
  3 A
  3 B
  4 A
  5 A
  5 B
  6 A
  6 B
  6 B
  6 B
  6 B

Page 106

Details in flat files

Two record types!

Each record starts with a business key and a record type code:

  1 A
  2 A
  3 A
  4 A
  5 A
  1 B
  2 B
  2 B
  2 B
  3 B
  3 B
  4 B
  5 B
  5 B
  5 B

Note: variable-length records.

Note: children not grouped with parents.

Page 107

Worse scenario of flat file

Two record types!

Each record starts with a record type code. There is a business key, but it is found only in rec. type A:

  A 1
  B
  A 2
  B
  B
  B
  A 3
  B
  A 4
  A 5
  B
  A 6
  B
  B
  B
  B

Important! Record sequence has vital significance!

Bad technique!
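Reading such a file means carrying the key forward from the most recent type-A record, which is exactly why the sequence matters. The following is a minimal sketch, assuming one record per line with the type code first and a space before the rest; the function name and the detail values are hypothetical.

```python
def attach_children(lines):
    """Group type-B records under the most recent type-A record.

    Type-B records carry no business key, so their parent is known
    only from their position in the file.
    """
    parents = {}
    current_key = None
    for line_no, line in enumerate(lines, start=1):
        rec_type, _, rest = line.partition(" ")
        if rec_type == "A":
            current_key = rest.strip()
            parents[current_key] = []
        elif rec_type == "B":
            if current_key is None:
                raise ValueError(f"line {line_no}: B record before any A record")
            parents[current_key].append(rest.strip())
        else:
            raise ValueError(f"line {line_no}: unknown record type {rec_type!r}")
    return parents


# Same shape as the slide: keys 1..6 with 1, 3, 1, 0, 1 and 4 children.
lines = [
    "A 1", "B x1",
    "A 2", "B x2", "B x3", "B x4",
    "A 3", "B x5",
    "A 4",
    "A 5", "B x6",
    "A 6", "B x7", "B x8", "B x9", "B x10",
]

in_order = attach_children(lines)
print({key: len(children) for key, children in in_order.items()})

# Sorting the file (or any re-sequencing in transit) silently re-parents
# every child: all B records now follow the last A record.
re_sorted = attach_children(sorted(lines))
print({key: len(children) for key, children in re_sorted.items()})
```

Nothing in the re-sorted file looks wrong, and no error is raised, yet every child now belongs to the wrong parent.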

Page 108

Mixed media

[Diagram: the whole package you are provided contains data in an RDBMS and metadata in XML.]

Page 109

Typical elements of a Geodatabase

relationship class

domain

Table 1 Table 2

Feature class

Topology(rules)

Raster dataset(s)

Metadata in XML

Geometric network

Page 110

XML?

XML labels data items.

“Self-documenting” means labeling, but not full, rich documentation of business meaning.

It does not describe attributes or entities (fields, or tables) from a business perspective.

XML takes more space, often much more. (The opposite of data compression?)
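The space cost is easy to see by writing the same rows both ways. This is a small sketch using only the Python standard library; the element names (`customers`, `customer`) and the sample rows are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["name", "address"]
ROWS = [
    {"name": "Charles Shepard", "address": "563-A Pine Street"},
    {"name": "Susan Elkart", "address": "78 Mills Ln, Apt. C"},
    {"name": "Evelyn Barnard", "address": "587 Canal St."},
    {"name": "Franklin Turing", "address": "798 Wisconsin Ave."},
]


def to_csv(rows, fields):
    """Write the rows as a comma-delimited file with one header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows, fields):
    """Write the rows as XML, labeling every value with its field name."""
    root = ET.Element("customers")
    for row in rows:
        record = ET.SubElement(root, "customer")
        for field in fields:
            ET.SubElement(record, field).text = row[field]
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(ROWS, FIELDS)
xml_text = to_xml(ROWS, FIELDS)
print(len(csv_text), "characters as delimited text")
print(len(xml_text), "characters as XML")
```

The delimited file names each field once in the header, while the XML repeats every field name twice per value. Neither form says what "address" means to the business.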

Page 111

The opposite of XML: Data compression!

What compression techniques, if any, might the source use when sending you the data?

Can you read it or unpack it?

Page 112

Logical data architecture


Page 113

Key questions:

What kind of things in the real world are described by the dataset?

How many kinds of tables or records are contained?

What are the cardinality rules between them?

Are the described instances in the real world mutually exclusive?

Are there format standards (industry or discipline) for this kind of data?

Does this data conform to those format standards?

Page 114

Key questions: (cont.)

What is the meaning of each record? What “thing” in reality does it represent?

What is the business meaning of each field?

Are any fields employed for more than one purpose?

Is the value or meaning of any field contingent upon the value in another?

What coding conventions are employed?

How are names and addresses structured?

Are you going to be integrating this source with other data?

Page 115

Ambiguity of terms!

[Diagram: “System” A and “System” B, joined by the terms “interface”, “bridge”, “connect”, “data access”, “migrate data”, “flow”, “link”.]

Page 116

Inherently ambiguous terms about “link”

interface

bridge

“integrate with”

support

connect connector interconnect

exchange data

migrate data

publish

provide access

exchange data

All have in common:

data movement

What kind of data? What fields? What architecture? What causes data to move?

Page 117

Linking organizations together: “Ha!”

[Diagram: Agency “A” and Agency “B” are each drawn as a stack: business; data (application database); application software; op sys; infrastructure. Labels between the two stacks: physical communication; protocol compatibility; semantic compatibility; landline, WiFi, mobile, etc.; XML, etc.; logical data arch. The two businesses are marked “architecturally different”.]

A business has an architecture!

Page 118

Linking organizations together: “Ha!”

[Diagram: Agency “A” and Agency “B”, each with a business and data (application database) layer. Labels between them: semantic compatibility; logical data arch. The two businesses are marked “architecturally different”.]

Semantic compatibility:

  Presence of data elements
  Field format compatibility
  Definitional consistency
  Keys don’t clash (homonyms, non-reuse, etc.)
  Subject entities have similar life cycles

These are subtle, abstract concepts. Not understood by executives or hardware people.

Page 119

Two levels of architecture matching

[Diagram: Agency “A” and Agency “B”, each with a business and data (application database) layer. Labels between them: semantics & meaning; structural architecture; logical data arch. The two businesses are marked “architecturally different”.]

Page 120

Semantics and meaning (field level)

Two fields (in two environments) can have the same name and the same format, but a different domain.

Source-A    Source-B
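A quick profiling step can expose this before any integration work starts: collect the distinct values of the field in each source and compare the sets. The sketch below assumes a hypothetical `status` field whose codes are invented for illustration.

```python
def compare_domains(values_a, values_b):
    """Compare the observed value sets of one field in two sources."""
    a, b = set(values_a), set(values_b)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Hypothetical: both sources have a one-character field named "status".
source_a_status = ["A", "I", "A", "P", "A", "I"]
source_b_status = ["A", "C", "X", "A", "C"]

report = compare_domains(source_a_status, source_b_status)
print(report)
```

A shared code such as "A" is still not proof of a shared meaning; the set comparison only shows where the domains visibly differ, and the business definitions have to be checked with the source.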

Page 121

Semantics and meaning (table level)

Two tables (in two environments) can have the same name and the same format (and column list), but a different scope or entity meaning.

[Diagram: Source-A and Source-B each have a “Customer orders” table in the same format. One is labeled “unfilled”, the other “month total”: different scope.]

Page 122

What do we mean by “link”?

Replicate data instantly (at time of transaction)

Reposit data into an ODS (at time of transaction)

Reposit data into a data warehouse (periodic, in batch)

[Diagrams: in each case Appl. 1 and Appl. 2 sit on application databases 1 and 2; the second diagram adds an ODS and the third a data warehouse.]

Page 123

Instant, transactional, replication

1 2

Appl. 1 Appl. 2

API API

Exchange services

Are the architectures compatible?

Probably not!

Page 124

Semantic integration means bringing the data together so it makes sense.

Total logical data architecture level

  Presence or absence of entities / tables
  Cardinalities

Table (subject entity) level

  Definitions are the same
  Field lists are the same

Column (field) level

  Formats are the same
  Business definitions are the same
  Domains & meanings are the same

[Diagram: at each level, A = B; at the table level, Cust-A = Cust-B.]

Page 125

Data integration involves matching “things” from multiple sources

Instance level:

Person

Store

Address

Vehicle

Neighborhood

Event (or episode)

Dimension level:

Time period

Brand or product

Market

Category (“type”, “class”)

Geography

Other grouping

Benefits from “singular” characteristic of entity

Problematic matching between sources

Page 126

New York media market

Northern New Jersey sales zone

Long Island Sales Zone

Central NJ sales zone

Metro NY sales zone

Page 127

Name & address formats

Are you going to do name and/or address matching?

Many causes of non-matches.

“They will have the name and addresses in the record.”

“Oh, that’s fine.”

Page 128

Address formats: parsing

Source format… to be matched to… target format.

Combined fields:

  Customer Name     Address 1
  Charles Shepard   563-A Pine Street
  Susan Elkart      78 Mills Ln, Apt. C
  Evelyn Barnard    587 Canal St.
  Franklin Turing   798 Wisconsin Ave.

  05 CUST_NAME    PIC X(30).
  05 ADDRESS_1    PIC X(40).

Parsed fields:

  First Name  Last name  M.I.  Number  Street          Apt
  Charles     Shepard    A     563-A   Pine Street
  Susan       Elkart     G     78      Mills Lane      C
  Evelyn      Barnard    R     587     Canal St.
  Frankling   Turing     S     798     Wisconsin Ave.

  05 FIRST_NAME   PIC X(20).
  05 LAST_NAME    PIC X(25).
  05 MIDDLE_INIT  PIC X(01).
  05 STR_NUMBER   PIC X(10).
  05 STREET_NAME  PIC X(30).
  05 APT_NO       PIC X(10).

Page 129

Address formats: parsing (2)

Source format:

  Customer Name     Address 1
  Charles Shepard   563-A Pine Street
  Susan Elkart      78 Mills Ln, Apt. C
  Evelyn Barnard    587 Canal St.
  Franklin Turing   798 Wisconsin Ave.

  05 CUST_NAME    PIC X(30).
  05 ADDRESS_1    PIC X(40).

…to be matched to…

Target format:

  Customer Name     Address 1
  Shepard, Charles  563-A Pine Street
  Elkart, Susan     78 Mills Ln, Apt. C
  Barnard, Evelyn   587 Canal St.
  Turing, Franklin  798 Wisconsin Ave.

  05 CUST_NAME    PIC X(30).
  05 ADDRESS_1    PIC X(40).

Are these going to match?

Page 130

Address formats: parsing (3)

Source format:

  Customer Name     Address 1
  Charles Shepard   563-A Pine Street
  Susan Elkart      78 Mills Ln, Apt. C
  Evelyn Barnard    587 Canal St.
  Franklin Turing   798 Wisconsin Ave.

  05 CUST_NAME    PIC X(30).
  05 ADDRESS_1    PIC X(40).

…to be matched to…

Target format:

  05 CUST_NAME    PIC X(30).
  05 ADDRESS_1    PIC X(40).

Are these going to match?

Ironically…

The meaning is the same, but the data is different!

Page 131

Address formats: parsing (3)

Which of these will match in native SQL?

  Source                 Target
  563 A Pine St.         563-A Pine St.
  587 Canal Street       587 Canal St.
  798 Wisconsin Ave.     798 Wisconsin
  781 Mills Lane         781 Mills Ln.
  418 Elm St. Apt. C     418 Elm, Apt. C
  21 Valley Forge Ave.   21 ValleyForge Ave.

None.
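The same result can be reproduced outside a database, since a plain equality join compares the strings character by character. The sketch below checks the six pairs exactly and then after a deliberately simple normalization; the abbreviation table and the `normalize` function are assumptions for illustration, not a real address-standardization product.

```python
import re

PAIRS = [
    ("563 A Pine St.", "563-A Pine St."),
    ("587 Canal Street", "587 Canal St."),
    ("798 Wisconsin Ave.", "798 Wisconsin"),
    ("781 Mills Lane", "781 Mills Ln."),
    ("418 Elm St. Apt. C", "418 Elm, Apt. C"),
    ("21 Valley Forge Ave.", "21 ValleyForge Ave."),
]

# Hypothetical, far from complete abbreviation table.
ABBREVIATIONS = {"st": "street", "ln": "lane", "ave": "avenue", "apt": "apartment"}


def normalize(address):
    """Lowercase, drop punctuation, and expand a few known abbreviations."""
    tokens = re.sub(r"[^a-z0-9]+", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


exact = [source == target for source, target in PAIRS]
normalized = [normalize(source) == normalize(target) for source, target in PAIRS]

print("exact matches:     ", exact)
print("normalized matches:", normalized)
```

Exact comparison matches none of the pairs. The simple normalization recovers the pairs that differ only in punctuation or abbreviation, but the missing street type, the missing "St." and the run-together "ValleyForge" still fail, which is why real address matching needs more than string cleanup.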

Page 132

Conclusion on data architecture:

Even if you have an exact physical format match, source to target…

  Field names
  Field format

The contents may not match.

And the meaning may not match.

Page 133

Semantics and meaning


Page 134

Key questions:

Are the languages of text fields, and the character set, appropriate to your needs?

Are numeric fields in units-of-measure which you expect?

How is the “null” condition symbolized in each field?

Is it clear what the business meaning of the null condition is?

What fields need to be translated into your desired coding domain?

Does the meaning of any field (or elements of its domain) change over time or over any other scope dimension?

Page 135

Potential coding variations

  State                 FIPS  Abbr
  Alabama               01    AL
  Alaska                02    AK
  Arizona               04    AZ
  Arkansas              05    AR
  California            06    CA
  Colorado              08    CO
  Connecticut           09    CT
  Delaware              10    DE
  District of Columbia  11    DC
  Florida               12    FL
  Georgia               13    GA
  Hawaii                15    HI
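When a source codes states one way and the target expects another, the translation is a lookup table plus a decision about what to do with values that are not in it. The sketch below is built only from the twelve rows above; the function name and the choice to raise an error on unknown codes are assumptions.

```python
# The twelve rows shown on the slide; a real table would cover every state.
STATE_CODES = [
    ("Alabama", "01", "AL"), ("Alaska", "02", "AK"), ("Arizona", "04", "AZ"),
    ("Arkansas", "05", "AR"), ("California", "06", "CA"), ("Colorado", "08", "CO"),
    ("Connecticut", "09", "CT"), ("Delaware", "10", "DE"),
    ("District of Columbia", "11", "DC"), ("Florida", "12", "FL"),
    ("Georgia", "13", "GA"), ("Hawaii", "15", "HI"),
]

FIPS_TO_ABBR = {fips: abbr for _name, fips, abbr in STATE_CODES}


def fips_to_abbr(code):
    """Translate a state FIPS code to its postal abbreviation.

    Accepts "06", "6" or 6, because a numeric column often loses
    the leading zero somewhere between source and target.
    """
    key = str(code).strip().zfill(2)
    try:
        return FIPS_TO_ABBR[key]
    except KeyError:
        raise ValueError(f"unknown state FIPS code: {code!r}") from None


print(fips_to_abbr("06"))
print(fips_to_abbr(6))
```

Note that the FIPS sequence has gaps (there is no 03, 07 or 14), so a code being numeric and in range does not make it valid.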

Page 136

Thin documentation can be misleading

“Address”

Current address?

Current address for mailing purposes, but not for billing purposes.

Current address for delivery purposes, but not for mailing, or billing.

“Null”

Though the “null” value may be stored in the original database, …

…will it be transferred effectively through the ETL process?

There is also the question: “Why is it null?”

That answer can be another kind of metadata.

1. Not applicable

2. Declined to state

3. Will be supplied later
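One way to carry the reason for a null alongside the value is sketched below. The three reasons come from the list above; the sentinel strings ("N/A", "REFUSED", "PENDING") and the names `NullReason`, `FieldValue`, and `decode` are hypothetical, since each source symbolizes null in its own way.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NullReason(Enum):
    NOT_APPLICABLE = 1
    DECLINED_TO_STATE = 2
    WILL_BE_SUPPLIED_LATER = 3


@dataclass(frozen=True)
class FieldValue:
    value: Optional[str]
    null_reason: Optional[NullReason] = None

    def __post_init__(self):
        # A populated field must not also claim a reason for being null.
        if self.value is not None and self.null_reason is not None:
            raise ValueError("a populated field cannot carry a null reason")


# Hypothetical sentinels; a real source would document its own.
SENTINELS = {
    "N/A": NullReason.NOT_APPLICABLE,
    "REFUSED": NullReason.DECLINED_TO_STATE,
    "PENDING": NullReason.WILL_BE_SUPPLIED_LATER,
}


def decode(raw: str) -> FieldValue:
    """Turn a raw field into a value plus, when null, the reason if known."""
    token = raw.strip()
    if token in SENTINELS:
        return FieldValue(None, SENTINELS[token])
    if token == "":
        return FieldValue(None, None)  # null, reason unknown
    return FieldValue(token)
```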

Testing for semantic discontinuities

Fields may change meaning over time (or other dimensions)

Codes may change meaning over time

Every code is potentially volatile over time.

Invoice type
Account type
Customer number
Sales division

Stable codes tend to be OUTSIDE the organization…

…e.g. standard govt codes.

Are Domains Stable Over Time?
Customer File: Invoice Type Code

INVOICE_TYPE_CODE XTAB AGAINST MONTH

MONTH    01   02   03   04   05   06   07
------------------------------------------
AA       87   91   96   78   88   92   97
BB      142  148  153  162  149  167  173
CC      197  204  211  225    0    0    0
DD       45   48   51   47   46   48   49
EE       77   76   81   79   84   82   79
F1        4    3    8    5    9    7   11
F2        9    8    4    7   12    9    8
------------------------------------------

Type “CC” not consistently used over time.
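The cross-tab test can be automated. The sketch below loads the counts shown on this slide and reports any code that has records in some months and none in others; the function name `codes_with_gaps` is an assumption made for illustration.

```python
from collections import Counter

MONTHS = ["01", "02", "03", "04", "05", "06", "07"]

# Counts copied from the cross-tab on the slide.
XTAB = {
    "AA": [87, 91, 96, 78, 88, 92, 97],
    "BB": [142, 148, 153, 162, 149, 167, 173],
    "CC": [197, 204, 211, 225, 0, 0, 0],
    "DD": [45, 48, 51, 47, 46, 48, 49],
    "EE": [77, 76, 81, 79, 84, 82, 79],
    "F1": [4, 3, 8, 5, 9, 7, 11],
    "F2": [9, 8, 4, 7, 12, 9, 8],
}
counts = Counter({(code, month): n
                  for code, row in XTAB.items()
                  for month, n in zip(MONTHS, row)})


def codes_with_gaps(counts, months):
    """Return {code: [months with zero records]} for codes that have gaps."""
    gaps = {}
    for code in sorted({code for code, _ in counts}):
        missing = [m for m in months if counts[(code, m)] == 0]
        if missing:
            gaps[code] = missing
    return gaps
```

On the slide's data this flags only type CC, for months 05 through 07.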

Are the codes consistent over time?

[Diagram: individual customers scattered across an area: Cust. 3, 5, 6, 7, 8, 11, 16, 19, 21, 24, 28, 29, and 41.]

Customers are grouped into regions.

[Diagram: the same customers, grouped into Region 1, Region 2, and Region 3.]

Regions get “redefined” - “realigned”.

[Diagram: the same customers, with Region 1, Region 2, and Region 3 realigned.]

If this happens…

…will the source tell you?

Can you detect it on your own?
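If you keep the previous delivery, you can detect a realignment yourself by comparing each customer's region between two extracts. This is a minimal sketch; the function name `region_changes` and the sample assignments in the usage note are hypothetical.

```python
def region_changes(before, after):
    """Compare two {customer: region} snapshots.

    Returns {customer: (old_region, new_region)} for every customer
    present in both snapshots whose region differs.
    """
    return {
        cust: (before[cust], after[cust])
        for cust in before.keys() & after.keys()
        if before[cust] != after[cust]
    }
```

For example, `region_changes({41: 1, 8: 1, 21: 2}, {41: 1, 8: 2, 21: 3})` reports customers 8 and 21 as moved. Many customers changing region in the same delivery suggests a redefinition rather than individual relocations.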

Documentation & metadata


Key questions:

Has format and meaning documentation been provided prior to your decision to acquire the data?

Is the documentation current?

Can you get sample data to test against?

Is the documentation thorough and in sufficient detail?

Does the documentation include data quality standards?

Documentation topics

Format and structure

Meaning of fields and segments

Language & units of measure

Entity life cycle and extract filters

Scope

Vintage (date ranges)

Projections (GIS)

Reference (GIS)

Function of program code

Function of job parameters

Traditional “normative” data documentation covers only this.

Format alone does not describe data

Batch job step INV04G

Format-A    Format-B

Same scope and timing, but different format.

Batch job step INV21K

California    Arizona    Utah    N.M.

Same format, but different scopes.

Data documentation must be more than format … much more.

Format(s)

Content and meaning

tangible data file

Metadata:
“data about data”
“information about data”
“information about information”

Many kinds of metadata!

Industry and cultural contexts.

The word “metadata” is inherently ambiguous.

Technical vs. business metadata

Technical metadata: well-structured, machine-readable.

01 CUSTOMER_MASTER.
   05 CUST_NUM   PIC X(08).
   05 CUST_NAME  PIC X(30).
   05 ADDRESS_1  PIC X(30).
   05 ADDRESS_2  PIC X(30).
   05 CITY       PIC X(25).
   05 STATE      PIC X(02).

Business metadata: unstructured, meaningful only to a human.

“Customers in this file include…

current active customers
prospective customers
dormant customers
recipients of samples

Other subtypes include:

industrial vs. retail
domestic vs. international
broker vs. direct
platinum vs. regular”

Normative vs. dynamic metadata

If the file is being updated, then source-ID and quality are NOT characteristics of the entire table.

Source-A

Source-B

Observed in 1985

Observed in 2003

Low quality

High quality

This has nothing to do with structural metadata.

Record-level metadata

all non-key data acquired as a single unit

source of all info in this record

when record created or updated

ID      Name         Street Addr     City / St     Source   Updt Dt
489735  John Smith   971 Pine Drive  Portland, ME  CA DMV   8/2/1997
489735  Mary Allard  6174 Huron St.  Albany, NY    NY DMV   4/13/2003
489735  Ty Kobb      572 Ottawa      Boston, MA    US Army  4/14/2003

Source and update date for the whole record (all fields)

Imbedded metadata – cell level

Credit bureau record on person

Person ID  Name         SSN          SSN src   SSN updt   DOB        DOB src  DOB updt
489735     John Smith   587-98-1473  US Army   4/15/2001  4/3/1952   CA DMV   8/2/1997
489735     Mary Allard  589-88-8891  CitiBank  2/2/1997   3/9/1972   NY DMV   4/13/2003
489735     Ty Kobb      433-52-8743  Chase 57  6/2/2004   4/15/1978  US Army  4/14/2003

fact

where we got the fact

when we got the fact

Some facts are acquired individually, unrelated to peer cells in a record.

These 3 data elements belong together.
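The idea that a fact, its source, and its date belong together can be expressed as a small data structure. This is a sketch only: the class names `Fact` and `Person` and the `newer` rule (prefer the more recent observation) are assumptions for illustration; the sample values come from the first row of the credit bureau table.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    value: str     # the fact
    source: str    # where we got the fact
    updated: date  # when we got the fact


@dataclass
class Person:
    person_id: int
    name: str
    ssn: Fact
    dob: Fact


john = Person(
    489735, "John Smith",
    ssn=Fact("587-98-1473", "US Army", date(2001, 4, 15)),
    dob=Fact("4/3/1952", "CA DMV", date(1997, 8, 2)),
)


def newer(current: Fact, candidate: Fact) -> Fact:
    """Hypothetical policy: keep whichever observation is more recent."""
    return candidate if candidate.updated > current.updated else current
```

Because each cell carries its own source and date, one field can be refreshed from a new provider without disturbing the provenance of its peers.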

Where is metadata stored?

Central Metadata Repository

Complex dataset

Metadata in XML

Metadata in 3-ring binder

Classic data management problem:

Two copies of knowledge, no rigorous enforcement of refresh and update.

scattered

copy

copy

Documentation standards

Easy to establish, but people are sometimes reluctant to fulfill them.

Letter but not the spirit of documentation.

Nobody wants to write documentation.

INVOICE_AMT DECIMAL (11.2) Def. Total amount of the invoice.

Tautological: The use of redundant language

"If you don't get any better, you'll never improve" --Yogi Berra

INVOICE_AMT DECIMAL (11.2) Def. This data element contains the total invoice amount.

Documentation standards (cont.)

INVOICE_AMT DECIMAL (11.2)

Def. The total amount to be paid on a regular invoice to the customer; equals the sum of all extended costs of line items net of discounts. Also includes special charges unrelated to specific products. Always in U.S. dollars.

On invoice reversals, this field is normally negative. On credit memos, this field is normally negative.

Good metadata discusses the anomalies!

Scope & completeness


Key questions: Scope

Are you getting all the attributes (fields, columns, data elements) which you expect?

Are you getting other attributes you didn’t ask for?

Are you getting all the records you expected?

Are you getting any records outside of your scope of request or interest?

For each field, is the column populated as completely as is appropriate?

Are you getting the data you expect to get?

Scope

Geography

Time

Range of customers by name, account, etc.

Is there any way your source might have truncated your input data?

Kinds of scope

Scope in time

Scope in geography

Organizational scope

Types or subtypes of major entities

Entity life cycle and duplication

Time scope

Tally data over time

YEAR    1998  1999  2000  2001
MONTH  ------------------------
01       492   742   711   841
02       512   701   782   812
03       588   689   733   845
04       522     0   746   829
05       581   618   697   792
06       566   682   709   841
07       599   623   728   823
08       492   593   692   784
09       509   608   717   824
10       527   631   729   781
11       488   597   744   807
12       611   845   714   892

TABLE FILE GG1
SUM RECCNT
ACROSS MONTH
BY YEAR
END
-RUN
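The same tally can be produced outside a report writer. The sketch below counts records by year and month and lists any month with no records between the first and last populated month, such as April 1999 in the table above. The function names are assumptions, and the input is taken to be one date per record.

```python
from collections import Counter
from datetime import date


def tally_by_month(dates):
    """Count records per (year, month)."""
    return Counter((d.year, d.month) for d in dates)


def empty_months(tally):
    """List (year, month) pairs with no records between the first and
    last populated month."""
    if not tally:
        return []
    (year, month), last = min(tally), max(tally)
    missing = []
    while (year, month) <= last:
        if tally[(year, month)] == 0:
            missing.append((year, month))
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return missing
```

An empty month is not proof of an error, but it is a question worth asking the source about.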

Time scope (cont.)

[Chart: records by month, Jan-06 through Dec-07; vertical axis from 0 to 600.]

Probably a discontinuity in definition, inclusion criteria, or scope.

Cropping

Count records by month

[Chart: orders by month from the Order History Table, January through December and on through July of the following year; vertical axis from 0 to 1,200.]

A purge process exists, but some records had to remain (still outstanding dispute).

You may get more records than you expect!

Kinds of scope: Geography

Adair County

Baker County

Evans County

Girard County

Caswell County

Duke County

Johnson County

DATASET COVERAGE

Kinds of scope: Organizational

All the divisions, or just some?

All the sales, or just sales by employee sales reps (thus excluding broker-negotiated sales)?

Domestic activity only or including international?

[Organization chart: Gargantuan Industries, Inc., with divisions Mining & minerals, Toys, Defense & weapons, and Health & beauty aids.]

Kinds of scope: Types and subtypes

Vehicle file includes…

Owned vehicles but not leased vehicles

Cars but not utility trucks

Dataset of employees

Full-time but not part-time

Current but not former employees

Volunteers?

Subtypes of a hospital employee entity

Employee [Emp Num]

Subtypes: Candidates, Active empl., Former empl., Doctors, Contract empl.

Are subtypes mutually exclusive?

Are some data fields present for some, but not all subtypes?

Subtypes of a hospital employee entity (cont.)

Employee [Emp Num]

Subtypes: Candidates, Active empl., Former empl., Doctors, Contract empl.

Further subtypes: Perm. Full-time, Temporary

Subtypes can have subtypes

Kinds of scope: Entity life cycle & duplication

Big issue: mutual exclusivity of records, vs. duplication

Can the same instance be represented by multiple records…

…possibly in multiple stages of its life cycle?

Are all records logical peers to each other?

Students at a university

Student [ID Num]

Subtypes: Freshman, Sophomore, Junior, Senior, Graduate

In reality (business policy), mutually exclusive?

In a file of students, are you getting only one record per student? …

Or, one record per student-year?

Students at a university (cont.)

Student [ID Num]

Subtypes: Freshman, Sophomore, Junior, Senior, Graduate

This gets us back to architecture.

What distinct subject entity does a record represent?

Name change between academic years?

Other fragments of scope

MOST COMMON VALUES OF LAST_NAME
-------------------------------
BROWN      12,943
DAVIS       9,542
ANDERSON    7,227
CLARK       5,344
ALLEN       4,715
CAMPBELL    4,014
ADAMS       3,800
BAKER       3,635
EVANS       3,271
COLLINS     3,180
CARTER      3,143
EDWARDS     3,129
COOK        2,772
COOPER      2,646

What’s wrong with this picture?

File of prospective customers from outside source

Look at distribution of first character of text fields!

Name

Address

City

Comments

You need a query tool which can create new variables (fields) based on a mask.

Distribution of first character, Last Name field, purchased input file.

A   8,240    N  86
B  31,210    O  47
C  17,221    P  24
D  10,929    Q  13
E   4,507    R  14
F   8,081    S   4
G      77    T  23
H      71    U  21
I       8    V  13
J      63    W   2
K      36    X   1
L      94    Y   5
M      82    Z   7
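Where no query tool with masking is at hand, a few lines of general-purpose code can produce the same distribution. This is a sketch; the function names are assumptions, and the input is taken to be a plain list of field values.

```python
from collections import Counter


def first_char_counts(values):
    """Tally the first character (upper-cased) of each non-blank value."""
    return Counter(v.strip()[0].upper() for v in values if v and v.strip())


def share_in_range(counts, lo, hi):
    """Fraction of tallied values whose first character lies in [lo, hi]."""
    total = sum(counts.values())
    inside = sum(n for ch, n in counts.items() if lo <= ch <= hi)
    return inside / total if total else 0.0
```

Applied to the counts on this slide, `share_in_range(counts, "A", "F")` exceeds 0.99, which points to a file cut off partway through the alphabet.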

Reasonable surname (1st character) distribution in American society.

A   8,240    N   4,486
B  31,210    O   3,347
C  17,221    P  11,724
D  10,929    Q     513
E   4,507    R  12,864
F   8,081    S  23,604
G  11,977    T   8,623
H  17,171    U     571
I   1,008    V   3,453
J   7,163    W  14,302
K   8,636    X      36
L  11,094    Y   1,435
M  21,682    Z   1,187

Accidental truncation of data

What ways can your source truncate your data?

Name

Organizational (e.g. forgot broker sales)

Life cycle (e.g. forgot former employees)

Time (clipped in creation date range, but not ship date)

Are you getting MORE data than you wanted?

Test records

Beyond original scope

Detecting duplicate data

Domain each key (if it is a truly unique key)

SELECT CUST_KEY, REC_COUNT
FROM (SELECT CUST_KEY, COUNT(*) AS REC_COUNT
      FROM CUST_MAST
      GROUP BY CUST_KEY) AS KEY_COUNTS
ORDER BY REC_COUNT DESC;

Desired results:

CUST_KEY  REC_COUNT
-------------------
004001    1
004002    1
004003    1
004004    1
004005    1

Duplicate keys:

CUST_KEY  REC_COUNT
-------------------
004127    5
004039    4
004113    4
004834    3
004225    3

Detecting duplicate data (cont.)

Test for duplicate data, with non-dup keys

Entire record duplicated:

Cust.No.  Cust name       Addr          City       State  ZIP    DOB
4127      Tony Martinez   77 River St.  Phoenix    AZ     87114  8/4/1952
4127      Tony Martinez   77 River St.  Phoenix    AZ     87114  8/4/1952

Different key, same person:

Cust.No.  Cust name       Addr          City       State  ZIP    DOB
5793      Angela Connors  29 High St    Flagstaff  AZ     87114  8/4/1952
6778      Angela Connors  29 High St    Flagstaff  AZ     87114  8/4/1952
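Counting keys will not catch the same person stored under two keys. A sketch of the complementary test is to group records on everything except the key. The function name `same_person_groups` is an assumption, and the exact-match comparison (after trimming and upper-casing) is a deliberate simplification; real name and address matching usually needs fuzzier rules.

```python
from collections import defaultdict

# Rows copied from the slide.
ROWS = [
    {"cust_no": 4127, "name": "Tony Martinez", "addr": "77 River St.",
     "city": "Phoenix", "state": "AZ", "zip": "87114", "dob": "8/4/1952"},
    {"cust_no": 4127, "name": "Tony Martinez", "addr": "77 River St.",
     "city": "Phoenix", "state": "AZ", "zip": "87114", "dob": "8/4/1952"},
    {"cust_no": 5793, "name": "Angela Connors", "addr": "29 High St",
     "city": "Flagstaff", "state": "AZ", "zip": "87114", "dob": "8/4/1952"},
    {"cust_no": 6778, "name": "Angela Connors", "addr": "29 High St",
     "city": "Flagstaff", "state": "AZ", "zip": "87114", "dob": "8/4/1952"},
]


def same_person_groups(rows, key="cust_no"):
    """Group rows whose non-key attributes all match.

    Returns the list of keys for each group holding more than one row.
    """
    groups = defaultdict(list)
    for row in rows:
        fingerprint = tuple(sorted(
            (k, str(v).strip().upper()) for k, v in row.items() if k != key))
        groups[fingerprint].append(row[key])
    return [keys for keys in groups.values() if len(keys) > 1]
```

A group with a repeated key is a duplicated record; a group with different keys is the same person entered twice.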

Incremental updates

Customer Order Table

January activity: 4 records

Order #  Cust #  Order Dt   Delivery Dt  Total chg  Update dt
10001    1234    1/5/2006   1/15/2006    1489.14    1/5/2006
10002    1343    1/8/2006   1/18/2006     874.82    1/8/2006
10003    1344    1/15/2006  1/25/2006    1378.25    1/15/2006
10004    1580    1/28/2006  2/8/2006     1184.82    1/28/2006

This is typical for extract to a data warehouse.

[Diagram: Customer Order source application, extract, update file, translate & load, data warehouse.]

Incremental updates (cont.)

Customer Order Table

January activity: 4 records

Order #  Cust #  Order Dt   Delivery Dt  Total chg  Update dt
10001    1234    1/5/2006   1/15/2006    1489.14    1/5/2006
10002    1343    1/8/2006   1/18/2006     874.82    1/8/2006
10003    1344    1/15/2006  1/25/2006    1378.25    1/15/2006
10004    1580    1/28/2006  2/8/2006     1287.01    2/4/2006

February activity: 4 records

Order #  Cust #  Order Dt   Delivery Dt  Total chg  Update dt
10005    1344    2/4/2006   2/13/2006    1489.14    2/4/2006
10006    1580    2/7/2006   2/12/2006     874.82    2/7/2006
10007    1234    2/16/2006  2/26/2006    1378.25    2/16/2006
10008    1343    2/27/2006  3/7/2006     1184.82    2/27/2006

A change was made to order 10004. It was posted Feb. 4, AFTER the incremental extract for January data.

Incremental updates (cont.)

Customer Order Table

February change file: how it should look.

   Order #  Cust #  Order Dt   Delivery Dt  Total chg  Update dt
U  10004    1580    1/28/2006  2/8/2006     1287.01    2/4/2006
N  10005    1344    2/4/2006   2/13/2006    1489.14    2/4/2006
N  10006    1580    2/7/2006   2/12/2006     874.82    2/7/2006
N  10007    1234    2/16/2006  2/26/2006    1378.25    2/16/2006
N  10008    1343    2/27/2006  3/7/2006     1184.82    2/27/2006

“U” Update
“N” New

Dilemma: How far back into history must you look to be sure you have all the changes posted in February?

This places a burden on the source system!
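If the source maintains a trustworthy update date on every record, the change file can be selected on that date instead of on the order date. The sketch below assumes exactly that; the function name `change_file` and the rule for assigning “N” versus “U” are illustrative assumptions, and the sample rows are a subset of the Customer Order Table.

```python
from datetime import date

ORDERS = [
    {"order_no": 10003, "order_dt": date(2006, 1, 15), "update_dt": date(2006, 1, 15)},
    {"order_no": 10004, "order_dt": date(2006, 1, 28), "update_dt": date(2006, 2, 4)},
    {"order_no": 10005, "order_dt": date(2006, 2, 4), "update_dt": date(2006, 2, 4)},
    {"order_no": 10008, "order_dt": date(2006, 2, 27), "update_dt": date(2006, 2, 27)},
]


def change_file(orders, start, end):
    """Select orders whose update date falls in [start, end].

    Flag "N" (new) if the order was also created in the window,
    otherwise "U" (update to an order created earlier).
    """
    out = []
    for o in orders:
        if start <= o["update_dt"] <= end:
            flag = "N" if start <= o["order_dt"] <= end else "U"
            out.append((flag, o["order_no"]))
    return out
```

Selecting on update date removes the need to guess how far back to look, but only if the source application stamps every change reliably.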

Overlap of date ranges of update files

[Diagram: timeline from Jan. through May. The update files for Jan., February, March, April, and May each reach 14 days back into the previous month.]
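The overlapping date ranges can be computed mechanically. In this sketch the 14-day lookback comes from the slide, while the function names and the choice to load by upsert (so that a row delivered in two overlapping files is stored once) are assumptions.

```python
from datetime import date, timedelta


def extract_window(year, month, lookback_days=14):
    """Return (start, end) for a monthly update file.

    start reaches lookback_days into the previous month;
    end is the last day of the requested month.
    """
    first = date(year, month, 1)
    next_first = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return first - timedelta(days=lookback_days), next_first - timedelta(days=1)


def apply_update(warehouse, update_rows):
    """Upsert rows keyed by order number; the latest delivery wins."""
    for row in update_rows:
        warehouse[row["order_no"]] = row
    return warehouse
```

For February 2006 the window starts on January 18, so an order dated January 28 is delivered again with its changed amount.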

Scope: Are all records peers to each other?

Can you have detail and summary records co-existing?

Census  Block
Tract   Group  Block  Tot Pop  White  Black  Latino
471     15     ALL      1,804  1,593    176      35    (aggregate record)
471     15     1           16     11      3       2    (detail records)
471     15     2           10      8      1       1
471     15     3          421    377     29      15
471     15     4          381    329     47       5
471     15     5          557    519     38       0
471     15     6          419    349     58      12
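When aggregate and detail records share a file, it helps to separate them before totaling and to confirm that the details add up to the aggregate. The sketch below uses the rows from this slide; treating "ALL" in the block column as the aggregate marker is read from the table, and the function names are assumptions.

```python
# (tract, block group, block, tot pop, white, black, latino), from the slide.
ROWS = [
    ("471", "15", "ALL", 1804, 1593, 176, 35),
    ("471", "15", "1", 16, 11, 3, 2),
    ("471", "15", "2", 10, 8, 1, 1),
    ("471", "15", "3", 421, 377, 29, 15),
    ("471", "15", "4", 381, 329, 47, 5),
    ("471", "15", "5", 557, 519, 38, 0),
    ("471", "15", "6", 419, 349, 58, 12),
]


def split_levels(rows):
    """Separate aggregate rows (block == "ALL") from detail rows."""
    aggregates = [r for r in rows if r[2] == "ALL"]
    details = [r for r in rows if r[2] != "ALL"]
    return aggregates, details


def totals_match(aggregate, details):
    """True when every numeric column of the details sums to the aggregate."""
    sums = tuple(sum(col) for col in zip(*(d[3:] for d in details)))
    return sums == aggregate[3:]
```

Summing the population column over all seven rows without this split gives 3,608, double the true total of 1,804.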

Beware of masking data for political or confidentiality reasons.

Census  Block
Tract   Group  Block  Tot Pop  White  Black    Latino
471     15     ALL      1,804  1,593  176      35
471     15     1           16     11  blocked  blocked
471     15     2           10      8  blocked  blocked
471     15     3          421    377  29       15
471     15     4          381    329  47       5
471     15     5          557    519  38       0
471     15     6          419    349  58       12

Cells with data masked because total figure is too low.

World economic statistics

Source: CIA web site

Rank  Country          Exports
 1    World            $10,330,000,000,000
 2    European Union   $1,318,000,000,000
 3    Germany          $1,016,000,000,000
 4    United States    $927,500,000,000
 5    China            $752,200,000,000
 6    Japan            $550,500,000,000
 7    France           $443,400,000,000
 8    United Kingdom   $372,700,000,000
 9    Italy            $371,900,000,000
10    Netherlands      $365,100,000,000
11    Canada           $364,800,000,000
12    Korea, South     $288,200,000,000
13    Hong Kong        $286,300,000,000
14    Belgium          $269,600,000,000
15    Russia           $245,000,000,000
16    Mexico           $213,700,000,000


Reference data & foreign keys

Codes need interpretations!

Two places to do it:

Documentation

Active reference tables.

master files (kernel-stable)

transactions (events)

reference tables (validation)
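A minimal sketch of using an active reference table for validation: every code in a transaction file should have an interpretation in the reference table. The table contents, field names, and the function `find_orphan_codes` are hypothetical, chosen only for illustration.

```python
def find_orphan_codes(records, field, reference):
    """Return the set of code values in `field` that have no entry
    in the reference table."""
    return {rec[field] for rec in records if rec[field] not in reference}


# Hypothetical reference table: invoice-type codes and their interpretations.
INVOICE_TYPE = {"S": "Standard", "C": "Credit memo", "P": "Pro forma"}

# Hypothetical transaction records carrying the code as a foreign key.
transactions = [
    {"invoice_no": 1001, "invoice_type": "S"},
    {"invoice_no": 1002, "invoice_type": "C"},
    {"invoice_no": 1003, "invoice_type": "X"},  # code with no interpretation
]

orphans = find_orphan_codes(transactions, "invoice_type", INVOICE_TYPE)
```

Any value returned here is a code the recipient cannot interpret, which is a question to take back to the source.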


Reference data & foreign keys

[Diagram: reference data arranged by domain size (small domain to large domain) and volatility (low volatility to volatile). Items shown: gender, U.S. states, countries, customer cd, vendor cd, product cd, facility, invoice type, ICD-9, DRG, transaction type, employee.]


Inconsistent coverage

“We did door-to-door interviews in the towns, but we are only estimating the rural areas of the county.”

Sampling and projection.

Examples of sampling and projection:

Exit polls

Radio & TV audience

More reliable statistics:

Journal & newspaper subs

Web ad viewing


Detecting estimates

Look at frequently-occurring values

MOST COMMON VALUES OF PLACE POPULATION

POPULATION  RECORDS
----------  -------
        25    9,542
       100    7,227
       200    5,344
        50    4,715
       150    4,014
       300    3,635
       250    3,180
       400    3,143
       125    3,129
       120    2,772
        40    2,573

Domain study, most frequent observed values.
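A minimal sketch of such a domain study, assuming the population values are already loaded as integers. The sample data and both function names are hypothetical; the choice of 25 as the "round number" step is an arbitrary assumption, not something the slide prescribes.

```python
from collections import Counter


def most_common_values(values, top=10):
    """Domain study: the `top` most frequent values with their record counts."""
    return Counter(values).most_common(top)


def share_of_round_values(values, multiple=25):
    """Fraction of records whose value is an exact multiple of `multiple`."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v % multiple == 0) / len(values)


# Hypothetical sample: many suspiciously round populations, a few counted ones.
populations = [25] * 5 + [100] * 4 + [200] * 3 + [17, 43, 61]

top_two = most_common_values(populations, top=2)
round_share = share_of_round_values(populations)
```

A high share of round values among the most frequent ones is a hint, not proof, that the figures were estimated rather than counted.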


Detecting estimates

Look at low end of value range.

Population  Records
         1        4
         2        3
         3        8
         4       12
         5       16
         6       21
         7       28
         8       32
         9       38
        10      185
        11       45
        12       47
        13       52
        14       55
        15       85
        16       61
        17       66
        18       68
        19       72
        20      198

[Chart: Records for each population statistic (x-axis: population value, 1 to 32; y-axis: record count, 0 to 300)]

Spikes suggest estimates
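A minimal sketch of spotting the spikes programmatically, using the record counts listed on this slide. The rule used here (a count more than 1.4 times the mean of its immediate neighbours) is an assumption picked for illustration; `find_spikes` is a hypothetical name.

```python
# Record counts by population value, as listed on the slide.
COUNTS = {1: 4, 2: 3, 3: 8, 4: 12, 5: 16, 6: 21, 7: 28, 8: 32, 9: 38,
          10: 185, 11: 45, 12: 47, 13: 52, 14: 55, 15: 85, 16: 61,
          17: 66, 18: 68, 19: 72, 20: 198}


def find_spikes(counts, ratio=1.4):
    """Return values whose record count exceeds `ratio` times the mean
    count of their immediate neighbours (value - 1 and value + 1)."""
    spikes = []
    for value in sorted(counts):
        neighbours = [counts[v] for v in (value - 1, value + 1) if v in counts]
        if neighbours and counts[value] > ratio * (sum(neighbours) / len(neighbours)):
            spikes.append(value)
    return spikes


spikes = find_spikes(COUNTS)
```

With these counts the flagged values are 10, 15, and 20, the round numbers a person is most likely to write down when estimating.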


Detecting estimates (cont.)

Day of month in dates:

Date of birth

Filing date

Posting date

Analysis requires a query tool which will extract day of month

[Chart: Record count by day of month (x-axis: day of month, 1 to 31; y-axis: record count, 0 to 700)]
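A minimal sketch of the day-of-month extraction, assuming the dates have already been parsed into `datetime.date` values. The sample dates and the function name `count_by_day_of_month` are hypothetical.

```python
from collections import Counter
from datetime import date


def count_by_day_of_month(dates):
    """Return a dict mapping each day of month (1 to 31) to its record count."""
    counts = Counter(d.day for d in dates)
    return {day: counts[day] for day in range(1, 32)}


# Hypothetical birth dates; an excess on day 1 may indicate defaulted values.
birth_dates = [
    date(1961, 1, 1),
    date(1958, 7, 1),
    date(1970, 3, 15),
    date(1949, 1, 1),
    date(1982, 11, 23),
]

by_day = count_by_day_of_month(birth_dates)
```

Days 29 through 31 are naturally less frequent because not every month has them, so a dip there is expected rather than suspicious.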


Detecting estimates (cont.)

[Chart: Record count by day of month (x-axis: day of month, 1 to 31; y-axis: record count, 0 to 1,600)]

[Chart: Record count by day of month (x-axis: day of month, 1 to 31; y-axis: record count, 0 to 2,000)]


Key questions: Best available?

How does this dataset compare in quality with alternate sources?

Quality

Currency

Granularity

Price


Ask questions!

Some are obvious, but sound insulting.

How many employees do you have?

How many records are in the Employee File you are sending us?

Are there any duplicate records in your file?

How are they duplicate? Why?

In a business sense, what does that mean?


Fundamentals of data quality


Key questions:

Are the values observed in each column valid?

For codes, do they conform to a consistent domain?

For quantities, do they conform to a reasonable range?

For quantities, are there any significant outliers?

Are the values observed in each column reasonable (given context)?

Are the values in each column accurate?

Is the definition for each field (and the data contained therein) consistent over the entire dataset?

What is the precision of each numeric field?

Is that precision consistent over the entire dataset?

How does this dataset compare in quality with alternate sources?
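A minimal sketch of the first two validity questions: a domain test for codes and a range test for quantities. The sample columns, the agreed domain, and the range limits are all hypothetical; accuracy cannot be tested this way, since a value can pass both tests and still be wrong.

```python
def values_outside_domain(values, domain):
    """Return the observed code values that are not in the agreed domain."""
    return sorted(set(values) - set(domain))


def values_outside_range(values, low, high):
    """Return the quantities that fall outside the inclusive range [low, high]."""
    return [v for v in values if not low <= v <= high]


# Hypothetical sample columns from an incoming file.
gender_codes = ["M", "F", "F", "U", "m", "F"]
ages = [34, 51, -2, 47, 212, 29]

bad_codes = values_outside_domain(gender_codes, {"M", "F", "U"})
bad_ages = values_outside_range(ages, 0, 120)
```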


Again: Defining data quality.

“High quality data accurately describes reality, according to its complete definition.”

--Michael Scofield


Components of data quality

Instance (row) present? (issue of scope of entire file)

Cell populated? (need to recognize null condition)

Is value in cell valid? (compare against rules)

Is value in cell reasonable? (requires context)

Is value in cell accurate? (requires definition)

How precise is the data in the cell?

Is value in cell current? (time dimension of definition)

Is the definition consistent over all dimensions?


“Completeness” of data (first definition)

Complete table (all the rows)

Incomplete table (some of the rows)

70% complete


“Completeness” of data (2nd definition)

Complete table (all the fields)

Incomplete table (some of the fields)
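A minimal sketch of measuring both kinds of completeness. It reads the first definition as rows received versus rows expected, and the second as populated cells versus all cells; that reading of the second definition is an assumption, as are the function names and the sample rows.

```python
def row_completeness(rows_received, rows_expected):
    """First definition: share of the expected rows that are present."""
    return rows_received / rows_expected


def cell_completeness(rows, fields):
    """Second definition (as interpreted here): share of cells that are populated.
    A cell counts as unpopulated when the field is missing or None."""
    total = len(rows) * len(fields)
    populated = sum(1 for row in rows for f in fields if row.get(f) is not None)
    return populated / total if total else 0.0


# Hypothetical rows: the second one has no phone number.
rows = [
    {"name": "Alvarez", "phone": "555-0100"},
    {"name": "Baker", "phone": None},
]

by_rows = row_completeness(rows_received=7, rows_expected=10)
by_cells = cell_completeness(rows, ["name", "phone"])
```

The first measure needs an independent statement of how many rows should exist, which usually has to come from the source.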


Don’t confuse validity with accuracy!

Validity of data means it conforms to rules.

It is not necessarily reasonable

. . . or accurate.


Reasonability may be evaluated in context.

City          Temperature F
------------  -------------
PITTSBURGH    49
ERIE          95
CLEVELAND     96
HARRISBURG    89
PHILADELPHIA  88
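A minimal sketch of a context-based reasonability test using the readings on this slide. Each value is compared with the median of the batch; the 20-degree tolerance and the function name `unreasonable_readings` are assumptions for illustration.

```python
from statistics import median

# Readings from the slide, in degrees Fahrenheit.
READINGS = {
    "PITTSBURGH": 49,
    "ERIE": 95,
    "CLEVELAND": 96,
    "HARRISBURG": 89,
    "PHILADELPHIA": 88,
}


def unreasonable_readings(readings, tolerance=20):
    """Flag readings that differ from the batch median by more than `tolerance`."""
    centre = median(readings.values())
    return {city: temp for city, temp in readings.items()
            if abs(temp - centre) > tolerance}


suspects = unreasonable_readings(READINGS)
```

A reading of 49 is a perfectly valid temperature; it only looks wrong next to neighbouring cities reporting 88 to 96 in the same batch.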


Kinds of reports for data analysis

Goal: Give visibility to data behavior.

1. Domain studies

High value format

Low value format

2. Inter-field dependency tests

3. Referential integrity tests

4. Formatted dumps

5. Other reasonability tests


Where do you look at the data?

Staging database.

Exact replica of source data.

[Diagram labels: External data source; Simple ETL; Replica; Complex ETL; Target database; Query tool; Data architectures (“describes”, “same”).]


Update and refresh issues


Key questions:

Is this the only dataset you are going to acquire over time?

Or, are you going to get new versions or updates?

Will any updates you get be incremental (just the changes) or a complete refresh?

Are you going to get corrections as soon as the source knows about them?

How will the source differentiate updates from corrections?

Would they be found in the same dataset?


Update: Incremental vs. full refresh

Can you distinguish between legitimate changes vs. error corrections?

Can you detect changes or corrections? Do you need to know about them? (downstream propagation)

Saves time! Simple processing.
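A minimal sketch of detecting changes when the source sends a full refresh: the prior and new snapshots are compared by key. The keys, the row layout, and the function name `diff_snapshots` are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two full snapshots keyed by record identifier.
    Returns (added, removed, changed); `changed` maps key -> (old_row, new_row)."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed


# Hypothetical snapshots: key -> (name, state)
last_month = {1: ("Smith", "CA"), 2: ("Jones", "NY"), 3: ("Lee", "TX")}
this_month = {1: ("Smith", "CA"), 2: ("Jones", "NJ"), 4: ("Patel", "WA")}

added, removed, changed = diff_snapshots(last_month, this_month)
```

The comparison shows that a value changed, but not whether it was a legitimate change or an error correction; that distinction has to come from the source.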


Data collection bias


Key questions:

What was the original purpose of the data collection efforts?

What is your purpose or goal in acquiring the dataset?

What is the business value of the data to you?

What values and goals entered into the collection, organizing, and other preparation of the data?

What purpose does the source/provider have in making the data available (profit, persuasion, altruism)?


Two kinds of data generation

Data as byproduct of business processes

commercial sector

banking
manufacturing
retail sales
customer service activities (utilities, communications, etc.)
hospital patient records & billing
insurance policy setup and claims
education: student enrollment, grades, etc.

governments

social welfare and public assistance
tax collection
city services (trash, utilities)
voting
public libraries (patron activity)

Data gathered as non-business research

field surveys of land, topo, etc.

observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies

satellite & aerial imagery

Hybrid: strategic intelligence, police surveillance, mineral exploration, etc.

Remember this?


Bias towards “certainty”

Coded fields on data-entry screens discourage ambiguity…

…and encourage illusion of precision.

Premature entry of coded data results in non-nulls,

…and illusion of completeness.

Tabular structures do not let you say “about” next to a piece of data.

Tabular data implies / expects precision

data ambiguity lite


“About” …

Textual and non-tabular expression allow imprecision.

Q. “When does your plane leave?”

A. “About 3 PM.”

Q. “How old is the suspect?”

A. “In his mid-forties.”

You cannot tabularize that!

or, as they say in Canada…

data ambiguity lite


Do you understand the data gathering process?

What policies / methods / procedures / screens introduce data errors?

What are sources of bias?


Ownership, usage, and liability


Key questions: ownership & usage

Are you unlimited in the usage of the data?

Can you resell the data?

Can you share the data with other organizations?

What restrictions are placed on you in using the data?


Key questions: ownership & usage (2)

Does the source (person or organization) take any responsibility for the quality, completeness, or accuracy of the data?

Are there any implied limits to the “suitable” usage of the data?

Are you planning on using the data for something it was not originally designed for?


Confidentiality


Key questions: confidentiality

Does the dataset contain any data which (other than its proprietary nature) should be considered confidential?

By what criteria?

Why might it be confidential or sensitive?

Who are the interested parties?


Strategic and competitive

Even the fact that you simply have the data may be secret.

Bougainville

Guadalcanal

Henderson Field

Ballale airfield

To cover up the fact that the Allies were reading Japanese code, American news agencies were told that civilian coast-watchers in the Solomons saw Yamamoto boarding a bomber in the area.

Adm. Isoroku Yamamoto


Data flow surveillance


Key questions:

Can you count on your source to notify you about any changes in…

logical data architecture

scope

quality

units of measure

precision

bias


Designing for the Future

Designing an on-going Surveillance Program for monitoring the stability of source data behavior, quality, and meaning, and the appropriateness of your mapping.

In other words…

Preventing nasty surprises!


Designing on-going data surveillance to protect yourself in the future.

First time analysis is tedious.

Lots of exploration of the data.

Lots of decisions.

Do you want to do it with every incremental update of the database?

No, but you can’t assume that next month’s data will behave the same as the first tape.

Expect the unexpected.


Goal of testing imported data

Protect yourself against injury because of errors or inconsistencies in imported data.

Be aware of changes to meaning of incoming data.

Be aware of changes of scope of incoming data.

Catch the problems as soon as possible…

…not during the database update process.

Hence, make the loading process as fast and smooth as possible.

Semper vigilans

Always vigilant


The challenge of imported data

Each piece of data is an observation about reality, far from where you sit.

You cannot go out there and verify each piece of data which you import.

Even sampling is very difficult.

You can only test the data against. . .

Absolute rules about behavior

Reasonability tests to spot problems.


“What can possibly go wrong?”

On updates, your data supplier can…

Stop populating a field

Filter out records for some reason

Redefine a code used by them internally

Re-use a field for a new meaning

Give you new data you didn’t expect

Change their source (and quality) of a given field


Watch for . . .

Unexpected changes in data architecture of source

New record types or segments
Changes in cardinality between logical entities
New fields
Change in field length or usage

Unexpected changes in a field or column

Changes in domain of valid values
Changes in numeric behavior (e.g. going negative)
Changes in null or “missing value” behavior

Semper vigilans
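A minimal sketch of watching one field for two of the changes listed on this slide: new values entering the domain and a jump in null behavior. It compares the prior delivery with the current one; the 10-point null-rate threshold, the sample values, and the function names are assumptions.

```python
def profile(values):
    """Summarise one column: its null rate and its set of observed values."""
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "domain": set(non_null),
    }


def drift_warnings(previous, current, max_null_increase=0.10):
    """Compare the same field across two deliveries and list what changed."""
    before, after = profile(previous), profile(current)
    warnings = []
    new_values = after["domain"] - before["domain"]
    if new_values:
        warnings.append(f"new values: {sorted(new_values)}")
    if after["null_rate"] - before["null_rate"] > max_null_increase:
        warnings.append("null rate increased")
    return warnings


# Hypothetical status codes from last month's and this month's delivery.
last_delivery = ["A", "B", "A", "B"]
this_delivery = ["A", None, "C", None]

warnings = drift_warnings(last_delivery, this_delivery)
```

Keeping the profile of each delivery makes the comparison cheap to repeat on every update.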


Kinds of tests of imported data

• Conformance to absolute rules (e.g. valid value tests, etc.)

Rules relative to…
1. Expected values
2. Own record
3. Other records

• Reasonability testing

Detecting anomalous behavior based on context (e.g. “This doesn’t seem right!”)

Contexts and scope:
1. Own record
2. Own tape
3. Prior tapes from this source
4. Whole database


Flow of data through tests (ideal vision)

[Diagram. Labels: New data; Data rules; Absolute checks; Reasonability tests; Database context; temp; Scrub. Outcomes: Reject (suspense) whole tape; Reject (suspense) this record; Sound warning and hold tape (record) for further tests or explanation; OK (so far).]
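
The flow can be approximated as a small routing function. This is a hedged sketch of one possible reading of the diagram: the names `route_delivery`, `Outcome` and `max_reject_share`, and the idea of rejecting the whole tape once too many records fail, are assumptions rather than anything stated on the slide.

```python
from enum import Enum

class Outcome(Enum):
    REJECT_TAPE = "reject (suspense) whole tape"
    HOLD = "sound warning and hold for further tests or explanation"
    OK = "OK (so far)"

def route_delivery(records, absolute_check, reasonability_check,
                   max_reject_share=0.10):
    """Run absolute checks per record, then a reasonability test on the rest.

    Returns (outcome, accepted_records, suspended_records).
    """
    accepted, suspended = [], []
    for rec in records:
        (accepted if absolute_check(rec) else suspended).append(rec)

    # Assumed policy: too many failing records means the whole tape is suspect.
    if records and len(suspended) / len(records) > max_reject_share:
        return Outcome.REJECT_TAPE, [], list(records)

    # Surviving records still have to look plausible in context.
    if not reasonability_check(accepted):
        return Outcome.HOLD, accepted, suspended

    return Outcome.OK, accepted, suspended
```

Here `absolute_check` takes one record and `reasonability_check` takes the list of accepted records; both are supplied by the caller.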


Test incoming data A.S.A.P.

Test the data as soon as you have the source available, … not when you are updating the database or DW.

Test the data before you do any scrubbing.

If you do scrub, apply original tests and other tests again after scrubbing.
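
A minimal sketch of testing before and after scrubbing, assuming records are dictionaries with a `state` field. The scrub step, the two tests and every name below are hypothetical illustrations. Running the same tests on both sides of the scrub shows what the source actually delivered and what the scrub did or did not repair.

```python
def count_failures(records, tests):
    """Return {test name: number of records failing that test}."""
    return {name: sum(1 for r in records if not check(r))
            for name, check in tests.items()}

def scrub(record):
    """Hypothetical scrub step: trim whitespace and upper-case the state code."""
    cleaned = dict(record)  # copy, so the original delivery stays untouched
    cleaned["state"] = cleaned["state"].strip().upper()
    return cleaned

def profile_around_scrub(records, tests, scrub_fn):
    """Apply the same tests to the raw data and again to the scrubbed data."""
    before = count_failures(records, tests)
    scrubbed = [scrub_fn(r) for r in records]
    after = count_failures(scrubbed, tests)
    return before, after, scrubbed

STATE_TESTS = {
    "state is two letters": lambda r: len(r["state"]) == 2 and r["state"].isalpha(),
    "state is upper case": lambda r: r["state"] == r["state"].upper(),
}
```

Keeping the "before" counts matters: they describe the quality of the source, which the scrubbed data can no longer reveal.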


Data ambiguity

Tabular structures seduce us into thinking all cells have equal reliability.

Not necessarily so!



Summary & conclusion

• Understand your users and their data needs.
• Understand the politics of your source.
    Do they have reason to be guarded?
• Understand the burden your request for data places upon your source.
• Decide if you need one-time or repeated updates.

[Chart: Manufacturing as share of total employment, 1950 to 2010; labeled values 32.1 % and 11.7 %.]

[Chart: Share of consumption by category, 1929 and 2001. Categories: Motor vehicles; Furniture & household; Other durable; Food; Clothing & shoes; Gasoline, fuels; Other non-durable; Housing; Household operation; Transportation; Medical care; Recreation; Other services.]


Summary & conclusion (cont.)

Before signing the agreement…

• Review the data documentation
• Understand the data architecture and paradigm
• Thoroughly test some sample data
• Be sure it conforms to your expectations…
    format (case, etc.)
    scope
    unit of measure
    quality and precision

Try to break it!


The End
…unless we keep going

Michael Scofield

[email protected]@aol.com

“No vegetables were harmed in the making of this presentation.”