
DwB – Data without Boundaries

Additional Workshop – Metadata

Standards

07/12/2011


DDI and the GSBPM

Data without Boundaries

Thematic Workshop on Metadata

EDDI11 - SND, Gothenburg, Sweden, Dec 7 2011

Joachim Wackerow

GESIS - Leibniz Institute for the Social Sciences

Goals of Data without Boundaries

• The DwB project aims at an integrated model for accessing official data

– a model where the best solutions for access are available irrespective of national boundaries and

– flexible enough to fit national arrangements.

Description of Workshop

• DwB is an FP7 program aiming at developing an integrated model for accessing official data, irrespective of national boundaries.

– In particular, the project proposes to build agreements on standards between different stakeholders such as the Statistical Institutes, the Data Archives and the researchers, who are the final users.

• The thematic meeting will focus on metadata standards relevant for DwB: SDMX, DDI, and the GSBPM.

– Each standard will be described and related to the others, and the ongoing work directed at their articulation will be presented.


Purpose of GSBPM

• The original intention was for the GSBPM to provide a basis for statistical organizations to agree on standard terminology to aid their discussions on developing statistical metadata systems and processes.

• The GSBPM should therefore be seen as a flexible tool to describe and define the set of business processes needed to produce official statistics.

What can DDI do here?

• DDI is a combined informational/data and process model

• Probably more related to the GSIM

– But GSIM doesn't exist in detail yet

• Now we can try to look at relationships between process phases of GSBPM and DDI modules/parts

Quality Management / Metadata Management

• 1 Specify Needs: 1.1 Determine needs for information; 1.2 Consult & confirm needs; 1.3 Establish output objectives; 1.4 Identify concepts; 1.5 Check data availability; 1.6 Prepare business case

• 2 Design: 2.1 Design outputs; 2.2 Design variable descriptions; 2.3 Design data collection methodology; 2.4 Design frame & sample methodology; 2.5 Design statistical processing methodology; 2.6 Design production systems & workflow

• 3 Build: 3.1 Build data collection instrument; 3.2 Build or enhance process components; 3.3 Configure workflows; 3.4 Test production system; 3.5 Test statistical business process; 3.6 Finalize production system

• 4 Collect: 4.1 Select sample; 4.2 Set up collection; 4.3 Run collection; 4.4 Finalize collection

• 5 Process: 5.1 Integrate data; 5.2 Classify & code; 5.3 Review, validate & edit; 5.4 Impute; 5.5 Derive new variables & statistical units; 5.6 Calculate weights; 5.7 Calculate aggregates; 5.8 Finalize data files

• 6 Analyse: 6.1 Prepare draft outputs; 6.2 Validate outputs; 6.3 Scrutinize & explain; 6.4 Apply disclosure control; 6.5 Finalize outputs

• 7 Disseminate: 7.1 Update output systems; 7.2 Produce dissemination products; 7.3 Manage release of dissemination products; 7.4 Promote dissemination products; 7.5 Manage user support

• 8 Archive: 8.1 Define archive rules; 8.2 Manage archive repository; 8.3 Preserve data and associated metadata; 8.4 Dispose of data & associated metadata

• 9 Evaluate: 9.1 Gather evaluation inputs; 9.2 Conduct evaluation; 9.3 Agree action plan


Combining standards?

Main Differences between GSBPM and DDI

• The GSBPM places data archiving at the end of the process, after the analysis phase. It may also form the end of processing within a specific organization in the DDI model, but a key difference is that the DDI model is not necessarily limited to processes within one organization. Steps such as “Data analysis” and “Repurposing” may be carried out by different organizations to the one that collected the data.

• The DDI model replaces the dissemination phase with “Data Distribution” which takes place before the analysis phase. This reflects a difference in focus between the research and official statistics communities, with the latter putting a stronger emphasis on disseminating data, rather than research based on data disseminated by others.

• The DDI model contains the process of “Repurposing”, defined as the secondary use of a data set, or the creation of a real or virtual harmonized data set. This generally refers to some re-use of a data-set that was not originally foreseen in the design and collect phases. This is covered in the GSBPM phase 1 (Specify Needs), where there is a sub-process to check the availability of existing data, and use them wherever possible. It is also reflected in the data integration sub-process within phase 5 (Process).

• The DDI model has separate phases for data discovery and data analysis, whereas these functions are combined within phase 6 (Analyse) in the GSBPM. In some cases, elements of the GSBPM analysis phase may also be covered in the DDI “Data Processing” phase, depending on the extent of analytical work prior to the “Data distribution” phase.

Main Differences between GSBPM and DDI

GSBPM

• Data archiving at the end of the process, after the analysis phase

• Stronger emphasis on dissemination

• Availability of existing data in Specify Needs (1), data integration in Process (5)

• Data discovery and data analysis combined in Analyse (6)

DDI

• Similar, but not necessarily limited to processes within one organization

• Data distribution and research based on data disseminated by others

• Repurposing - re-use of a data-set not foreseen in the design and collect phases

• Separate phases for data discovery and data analysis, and for data processing


GSBPM / DDI Top Level Relationships

GSBPM phase: DDI Life Cycle Model

• 1 Specify Needs: Study Concept; Repurposing (part)

• 2 Design

• 3 Build

• 4 Collect: Data Collection

• 5 Process: Data Processing (mostly); Repurposing (part)

• 6 Analyse: Data Discovery; Data Analysis; Data Processing (part)

• 7 Disseminate: Data Distribution

• 8 Archive: Data Archiving

• 9 Evaluate

• Quality Management: Nothing particular for quality indicators, but structured metadata at a detailed level is a good basis.

• Metadata Management: Unique identifiers per agency and support for maintainable containers support metadata management.
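The top-level relationships between GSBPM phases and DDI life cycle stages can be written down as a simple lookup table. The following is a minimal sketch in Python; the names `GSBPM_TO_DDI` and `gsbpm_phases_for` are illustrative assumptions, while the phase and stage labels are taken from the slide. Phases for which no separate DDI counterpart is recoverable from the slide map to an empty list.

```python
# GSBPM phase -> DDI Life Cycle Model stages, as listed on the slide.
# Qualifiers such as "(part)" and "(mostly)" are kept as plain text.
GSBPM_TO_DDI = {
    "1 Specify Needs": ["Study Concept", "Repurposing (part)"],
    "2 Design": [],   # no separate entry recoverable from the slide
    "3 Build": [],    # no separate entry recoverable from the slide
    "4 Collect": ["Data Collection"],
    "5 Process": ["Data Processing (mostly)", "Repurposing (part)"],
    "6 Analyse": ["Data Discovery", "Data Analysis", "Data Processing (part)"],
    "7 Disseminate": ["Data Distribution"],
    "8 Archive": ["Data Archiving"],
    "9 Evaluate": [],
}


def gsbpm_phases_for(ddi_stage):
    """Return the GSBPM phases whose mapping mentions the given DDI stage."""
    return [
        phase
        for phase, stages in GSBPM_TO_DDI.items()
        if any(stage.startswith(ddi_stage) for stage in stages)
    ]


print(gsbpm_phases_for("Repurposing"))
print(gsbpm_phases_for("Data Processing"))
```

The reverse lookup shows that a single DDI stage such as Repurposing is spread over more than one GSBPM phase.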

GSBPM: 1 Specify Needs

• Study Concept - Repurposing (part)

• 1 Specify Needs: 1.1 Determine needs for information; 1.2 Consult & confirm needs; 1.3 Establish output objectives; 1.4 Identify concepts; 1.5 Check data availability; 1.6 Prepare business case

GSBPM: 2 Design

• 2 Design: 2.1 Design outputs; 2.2 Design variable descriptions; 2.3 Design data collection methodology; 2.4 Design frame & sample methodology; 2.5 Design statistical processing methodology; 2.6 Design production systems & workflow


What Is DDI? (1)

• An international specification for structured metadata describing social, behavioral, and economic data

• A standardized framework to maintain and exchange documentation/metadata

• DDI metadata accompanies and enables data conceptualization, collection, processing, distribution, discovery, analysis, repurposing, and archiving.

• A basis on which to build software tools

• Currently expressed in XML – eXtensible Markup Language

What Is DDI? (2)

• Structure for the documentation of data

• Data model for the metadata

[Figure labels: Reality; Data Documentation; Structured Metadata in DDI]

DDI Development Lines

• DDI Codebook (DDI 2 branch)

– Reflects components of social science codebooks

– Includes descriptions at the study, file, and variable level

• DDI Lifecycle (DDI 3 branch)

– Reflects research data lifecycle

– Optimized for reuse of metadata


DDI Lifecycle Features

• Machine-actionable

• Modular and extensible

• Multi-lingual

• Aligned with other metadata standards

• Can carry data in-line

• Focused on metadata reuse

DDI Lifecycle Features (2)

• Support for CAI instruments

• Support for longitudinal surveys

• Focus on comparison, both by design and after-the-fact (harmonization)

• Robust record and file linkages for complex data files

• Support for geographic content (shape and boundary files)

• Capability for registries and question banks

Metadata Driven Approach

Metadata

Survey instruments

Paper questionnaires

Statistical source code

Paper documentation

Web documentation


Basic Types of Metadata

• Concepts (“terms”)

• Studies (“surveys”, “collections”, “data sets”, “samples”, “censuses”, “trials”, “experiments”, etc.)

• Survey instruments (“questionnaire”, “form”)

• Questions (“observations”)

• Responses

Basic Types of Metadata (2)

• Variables (“data elements”, “columns”)

• Codes & categories (“classifications”, “codelists”)

• Universes (“populations”, “samples”)

• Data files (“data sets”, “databases”)

Some Details of DDI

• Ca. 20 Modules for various purposes (data collection, logical data structure, physical representation)

• 14 containers for concepts, questions, variables, etc.

• 120 identifiable items can be referenced for internal and external reuse

– Globally unique identifier, example: urn:ddi:de.gesis:VariableScheme:vs1786:4.2.3:Variable:age:1

• Over 800 items

• Realized in XML Schema
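To make the identifier example concrete, the URN can be split into its colon-separated fields. This is a minimal sketch, assuming the nine fields read as: `urn`, `ddi`, agency, type/ID/version of the maintainable container, then type/ID/version of the referenced item. The slide shows only the example string, so the field names below (`DdiUrn`, `parse_ddi_urn`, and the attribute names) are an illustrative interpretation, not definitions from the DDI specification.

```python
from typing import NamedTuple


class DdiUrn(NamedTuple):
    """Assumed reading of the fields in the example URN from the slide."""
    agency: str
    maintainable_type: str
    maintainable_id: str
    maintainable_version: str
    item_type: str
    item_id: str
    item_version: str


def parse_ddi_urn(urn: str) -> DdiUrn:
    """Split a URN shaped like the slide's example into named fields."""
    parts = urn.split(":")
    if len(parts) != 9 or parts[0] != "urn" or parts[1] != "ddi":
        raise ValueError(f"not a nine-field DDI URN: {urn!r}")
    return DdiUrn(*parts[2:])


example = parse_ddi_urn(
    "urn:ddi:de.gesis:VariableScheme:vs1786:4.2.3:Variable:age:1"
)
print(example.agency, example.maintainable_id, example.item_id)
```

Read this way, the example points at version 1 of the variable `age`, held in version 4.2.3 of the variable scheme `vs1786` maintained by the agency `de.gesis`.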


Research Data Life Cycle

• Initial concepts

• Questions and answers

• Grant info

• Questionnaire

• Coded instrument

• CAI metadata

• Paradata

• Data specs

• Recodes

• Summary descriptive info

• Terms of use

• Citation

• Packaging info

• Catalog record

• Indexing

• Related publications

• Replication code

• Publications

• Post-hoc harmonization

• Data transformations

• Preservation metadata

• Confidentiality

• Add’l processing

DDI as backbone for structured metadata

[Figure labels, stages: Concept; Collection; Processing; Distribution; Discovery; Analysis; Repurposing; Archive]

[Figure labels, tools and sources: CAI Tools; MQDS etc.; Information extracted from SPSS etc.; Custom Tools (e.g. Forms-based); Statistical packages; Online Analysis; Search engines; Distribution Packages; Web information system; Data / Documents outside of DDI]

DDI as Metadata Backbone

Reuse Across the Lifecycle

• This basic metadata is reused across the lifecycle

– Responses may use the same categories and codes which the variables use

– Multiple waves of a study may re-use concepts, questions, responses, variables, categories, codes, survey instruments, etc. from earlier waves


Reuse by Reference

• When a piece of metadata is re-used, a reference can be made to the original

• In order to reference the original, you must be able to identify it

• You also must be able to publish it, so it is visible (and can be referenced)

– It is published to the user community – those users who are allowed access

Change over Time

• Metadata items change over time, as they move through the data lifecycle

– This is especially true of longitudinal/repeat cross-sectional studies

• This produces different versions of the metadata

• The metadata versions have to be maintained as they change over time

– If you reference an item, it should not change: you reference a specific version of the metadata item
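The idea that a reference points at one specific, unchanging version can be illustrated with a small in-memory store. This is a sketch under stated assumptions: `ItemVersion`, `VersionStore`, `add_version`, and the sample question texts are hypothetical names and data invented for illustration, not part of DDI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemVersion:
    """One immutable version of a metadata item."""
    item_id: str
    version: int
    label: str


class VersionStore:
    """Keeps every version of every item; nothing is overwritten."""

    def __init__(self):
        self._versions = {}  # (item_id, version) -> ItemVersion

    def add_version(self, item_id, label):
        """Store a new version and return a reference to exactly that version."""
        latest = max((v for (i, v) in self._versions if i == item_id), default=0)
        reference = (item_id, latest + 1)
        self._versions[reference] = ItemVersion(item_id, latest + 1, label)
        return reference

    def resolve(self, reference):
        """Look up the item version a reference points at."""
        return self._versions[reference]


store = VersionStore()
ref_wave1 = store.add_version("q_income", "What is your monthly income?")
ref_wave2 = store.add_version("q_income", "What is your monthly net income?")

# The first-wave reference still resolves to the original wording.
print(store.resolve(ref_wave1).label)
```

Because a change creates a new version instead of editing the old one, anything that referenced the first wave's question keeps seeing the wording that was actually used.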

DDI Support for Metadata Reuse

• DDI allows for metadata items to be identifiable

– They have unique IDs

– They can be re-used by referencing those IDs

• DDI allows for metadata items to be published

– The items are published in resource packages

• Metadata items are maintainable

– They live in “schemes” (lists of items of a single type) or in “modules” (metadata for a specific purpose or stage of the lifecycle)

– All maintainable metadata has a known owner or agency

• Maintainable metadata can be versionable

– This reflects changes over time

– The versionable metadata has a version number


DDI Support for Comparison

• For data which is completely the same, DDI provides a way of showing comparability: Grouping

– These things are comparable “by design”

– This typically includes longitudinal/repeat cross-sectional studies

• For data which may be comparable, DDI allows for a statement of what the comparable metadata items are: the Comparison module

– The Comparison module provides the mappings between similar items (“ad-hoc” comparison)

– Mappings are always context-dependent (e.g., they are sufficient for the purposes of particular research, and are only assertions about the equivalence of the metadata items)
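The context dependence of such mappings can be sketched as data: each mapping records not only the two items but also the purpose for which their equivalence is asserted. Everything below (`ItemMap`, `comparable_items`, the item names, and the contexts) is a hypothetical illustration and does not reproduce the element names of the DDI Comparison module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemMap:
    """An assertion that two items are comparable for a given purpose."""
    source: str
    target: str
    context: str  # the research purpose for which the mapping is sufficient


MAPPINGS = [
    ItemMap("studyA:age_years", "studyB:age_group", "cohort comparison"),
    ItemMap("studyA:income", "studyB:net_income", "poverty analysis"),
]


def comparable_items(source, context):
    """Return the targets asserted to be comparable to source in this context."""
    return [
        m.target
        for m in MAPPINGS
        if m.source == source and m.context == context
    ]


print(comparable_items("studyA:age_years", "cohort comparison"))
print(comparable_items("studyA:age_years", "poverty analysis"))
```

The second lookup returns nothing: a mapping asserted for one purpose says nothing about comparability for another.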

DDI 3 Lifecycle Model and Related Modules

• StudyUnit

• Data Collection

• LogicalProduct

• PhysicalDataProduct

• PhysicalInstance

• Archive

Groups and Resource Packages are a means of publishing any portion or combination of sections of the life cycle

Local Holding Package

XML Schemas, DDI Modules, and DDI Schemes

[Figure: XML Schemas (<file>.xsd) may correspond to DDI Modules; DDI Modules may contain DDI Schemes and correspond to a stage in the lifecycle]


DDI Instance

• Citation

• Coverage

• Other Material / Notes

• Translation Information

• Study Unit

• Group

• Resource Package

• 3.1 Local Holding Package

Study Unit

• Citation / Series Statement

• Abstract / Purpose

• Coverage / Universe / Analysis Unit / Kind of Data

• Other Material / Notes

• Funding Information / Embargo

• Conceptual Components

• DataCollection

• LogicalProduct

• PhysicalDataProduct

• Physical Instance

• Archive

• DDI Profile

Group

• Citation / Series Statement

• Abstract / Purpose

• Coverage / Universe

• Other Material / Notes

• Funding Information / Embargo

• Conceptual Components

• DataCollection

• LogicalProduct

• PhysicalDataProduct

• Sub Group

• Study Unit

• Comparison

• Archive

• DDI Profile


Resource Package

• Citation / Series Statement

• Abstract / Purpose

• Coverage / Universe

• Other Material / Notes

• Funding Information / Embargo

• Any module EXCEPT Study Unit or Group

• Any Scheme: Organization, Concept, Universe, Geographic Structure, Geographic Location, Question, Interviewer Instruction, Control Construct, Category, Code, Variable, NCube, Physical Structure, Record Layout

3.1 Local Holding Package

• Citation / Series Statement

• Abstract / Purpose

• Coverage / Universe

• Other Material / Notes

• Funding Information / Embargo

• Depository Study Unit OR Group Reference: [A reference to the stored version of the deposited study unit.]

• Local Added Content: [This contains all content available in a Study Unit whose source is the local archive.]

DDI Schemes: Purpose

• A maintainable structure that contains a list of versionable things

• Supports registries of information such as concept, question and variable banks that are reused by multiple studies or are used by search systems to locate information across a collection of studies

• Supports a structured means of versioning the list

• May be published within Resource Packages or within DDI modules

• Serve as component parts in capturing reusable metadata within the life-cycle of the data



Building from Component Parts

• UniverseScheme

• ConceptScheme

• CategoryScheme

• CodeScheme

• QuestionScheme

• Instrument

• VariableScheme

• NCubeScheme

• ControlConstructScheme

• LogicalRecord

• RecordLayoutScheme [Physical Location]

• PhysicalInstance

Versioning and Maintenance

• There are three classes of objects:

– Identifiable (has ID)

– Versionable (has version and ID)

– Maintainable (has agency, version, and ID)

• Very often, identifiable items such as Codes and Variables are maintained in parent schemes
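The three classes of objects nest naturally, which a small class hierarchy can show. This is an illustrative sketch only: the Python classes and field names below are assumptions made for this example and are not the DDI XML types themselves; the agency string is borrowed from the URN example earlier in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Identifiable:
    id: str


@dataclass
class Versionable(Identifiable):
    version: str


@dataclass
class Maintainable(Versionable):
    agency: str


@dataclass
class Code(Identifiable):
    """An identifiable item: it has only an ID of its own."""
    value: str


@dataclass
class CodeScheme(Maintainable):
    """A maintainable parent scheme that carries agency, version, and ID."""
    codes: list = field(default_factory=list)


scheme = CodeScheme(
    id="cs1",
    version="1.0.0",
    agency="de.gesis",
    codes=[Code(id="c1", value="1"), Code(id="c2", value="2")],
)
print(scheme.agency, scheme.version, [c.id for c in scheme.codes])
```

The codes carry no agency or version themselves; they inherit that context from the scheme that maintains them.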


Maintenance Rules

• A maintenance agency is identified by a reserved code based on its domain name (similar to its website and e-mail)

– There is a register of DDI agency identifiers which we will look at later in the course

• Maintenance agencies own the objects they maintain

– Only they are allowed to change or version the objects

• Other organizations may reference external items in their own schemes, but may not change those items

– You can make a copy which you change and maintain, but once you do that, you own it!
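The ownership rules can be sketched as two operations: versioning, which only the owning agency may perform, and copying, which creates a new object owned by whoever made the copy. The names `Scheme`, `new_version`, `copy_as_own`, and the agency `org.example` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scheme:
    agency: str
    id: str
    version: int
    items: tuple


def new_version(scheme, acting_agency, items):
    """Only the owning agency may change or version a scheme."""
    if acting_agency != scheme.agency:
        raise PermissionError(
            f"{acting_agency} does not maintain {scheme.agency}:{scheme.id}"
        )
    return replace(scheme, version=scheme.version + 1, items=tuple(items))


def copy_as_own(scheme, new_agency, new_id):
    """Copy a scheme; the copy belongs to the new agency from then on."""
    return Scheme(agency=new_agency, id=new_id, version=1, items=scheme.items)


original = Scheme("de.gesis", "cs1", 1, ("yes", "no"))

# Another agency cannot version the original ...
try:
    new_version(original, "org.example", ["yes", "no", "refused"])
except PermissionError as err:
    print("rejected:", err)

# ... but it can copy it and maintain the copy itself.
mine = copy_as_own(original, "org.example", "cs1_local")
mine_v2 = new_version(mine, "org.example", ["yes", "no", "refused"])
print(mine_v2.agency, mine_v2.version, mine_v2.items)
```

The original scheme stays untouched throughout, so references to it from other studies remain valid.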



Publication in DDI

• There is a concept of “publication” in DDI which is important for maintenance, versioning, and re-use

• Metadata is “published” when it is exposed outside the agency which produced it, for potential re-use by other organizations or individuals

– Once published, agencies must follow the versioning rules

– Internally, organizations can do whatever they want before publication

• Note that an “agency” can be an organization, a department, a project, or even an individual for DDI purposes

– It must be described in an Organization Scheme, however!

• There is an attribute on maintainable objects called “isPublished” which must be set to “true” when an object is published (it defaults to “false”)
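The effect of publication on editing can be sketched with a flag that mirrors “isPublished”. This is a simplified illustration, assuming that "following the versioning rules" means every change after publication produces a new version number; the class `MaintainableObject` and its methods are hypothetical names for this example.

```python
class MaintainableObject:
    """Toy model of a maintainable object with a publication flag."""

    def __init__(self, agency, object_id, content):
        self.agency = agency
        self.object_id = object_id
        self.content = content
        self.version = 1
        self.is_published = False  # mirrors the default of "isPublished"

    def publish(self):
        """Expose the object outside the agency."""
        self.is_published = True

    def update(self, content):
        """Before publication edit freely; afterwards bump the version."""
        if self.is_published:
            self.version += 1  # assumed simplification of the versioning rules
        self.content = content


scheme = MaintainableObject("de.gesis", "vs1", "draft variables")
scheme.update("revised draft")      # internal change, still version 1
scheme.publish()
scheme.update("corrected labels")   # published change, becomes version 2
print(scheme.version, scheme.is_published)
```

In a fuller model the earlier published version would also be retained, so that existing references keep resolving to it.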


A study is born

Dagstuhl event 11382, Sept. 2011, Hoyle and Wackerow, 9/19/2011

Checksum of Study Design Document Could be Archived

Multiple Collection Processes Begin

Processing – (e.g. Data Cleaning, Restructuring, Recoding)

Initial Data are Archived

Initial Distribution

Initial Distribution – Possibly From Archive

Initial Data Discovery

Initial Data Analysis

Initial Data Analysis and Data Archived

Publications – Reference and Referenced by Archive

SECOND WAVE – Revised Concept

SECOND WAVE – Data Collection

SECOND WAVE – Data Processing

SECOND WAVE – Processing Uses Feedback from Stage 1

• Here something learned in the initial distribution affects future processing. This should be recorded.

• These metadata flows may happen between many stages, e.g. from processing to later collection.

SECOND WAVE – Distribution

SECOND WAVE – Discovery

Final Analysis Archived

A Kansan's Cyclone View


Gantt View – Initial Design

Much of this movement of data between stages is planned from the beginning of the project

Gantt with Data Flow (Blue)

Gantt With Planned Data and Metadata Flow

Metadata are generated as data move through the project, as well as before any data are gathered.


Gantt – Collection Changes Project Concept

Some metadata are unanticipated. Here something learned during the first collection phase causes a reconceptualization

Gantt – Discovery Changes Future Collection

Here something learned during discovery changes future collection

Representing Longitudinal Data in DDI (Extract)

Columns: Level; Dimension; Description; DDI Tag(s)

• Level: Project/Study (highest level)

– Dimension: Management

– Description: Executive control, scientific leadership, funding, etc.

– DDI Tag(s): Group; Citation; Purpose; Abstract; FundingInformation; Archive Module; Organization; Individual; Role (research, management, funding, etc.); Location; Email; Telephone

• Level: Project/Study (highest level)

– Dimension: Access

– Description: How to obtain data and any restrictions on access

– DDI Tag(s): Group/Subgroup/StudyUnit; Archive Module; AccessConditions; AccessPermissions; ConfidentialityStatement; Restrictions; LifecycleInformation

• Level: Longitudinal Survey

– Dimension: Sample Design and Procedures

– Description: Universe: Population being sampled; Refreshment strategy; Replacement strategy; Potential errors

– DDI Tag(s): Group; ConceptualComponents; Universe; Concept; DataCollection; Methodology; SamplingProcedure; DeviationFromSampleDesign; ActionToMinimizeLosses


Reuse

[Figure labels: Old study; My study]

Generic Longitudinal Business Process Model

Combined Process and Analyse Phase


Additional Phase Research/Publish

Circle View

Acknowledgements

• Arofan Gregory and Wendy Thomas

– Core collection of DDI slides

• Larry Hoyle

– Managing Metadata for Longitudinal Data (2011)

• Steven Vale

– Generic Statistical Business Process Model (METIS 2009)

– Exploring the relationship between DDI, SDMX and the Generic Statistical Business Process Model (EDDI 2010)

• Dagstuhl Workshop on Longitudinal Data 2011

– Working Group on GLBPM, Ingo Barkow, Jay Greenfield, Arofan Gregory, Marcel Hebing, Larry Hoyle, Wolfgang Zenk-Möltgen

• DDI Alliance Working Paper Series. Best Practices for Longitudinal Data http://www.ddialliance.org/resources/publications/working/BestPractices/LongitudinalData