
in partnership with

Title:

Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse

WP: 1 Deliverable: 1.2

Version: 0.3 Date: 3-01-2013

Author: Collin Bowler, Michel Lindelauf, Jos Dressen NSI: ONS, CBS

ESSNET

ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse

1. Introduction

Quality of metadata (and data) can sometimes be difficult to define in an unambiguous manner, and in the context of a Statistical Data Warehouse (SDWH) this is no different. In this document, we are specifically interested in the quality of metadata used in the SDWH.

So what is the definition of ‘Quality’?

A general definition which can be used is ‘fitness for use, or purpose’.

ISO 9000:2005 defines quality as the ‘degree to which a set of inherent characteristics fulfils requirements’.

‘Fitness for use’ is a relative definition, allowing for various perspectives on what constitutes quality, depending on the intended uses of the metadata (and indeed the intended uses of the data to which the metadata refers).

Also, the degree of quality indicates that there will be a set of acceptable quality levels associated with the characteristics, or dimensions, which the metadata must satisfy in order to be fit for use.

2. Quality measure or quality indicator?

Quality measures are defined as items that directly measure a particular aspect of quality. For example, the time lag from the reference date to the release of the output is a direct measure.

However, in practice many quality measures can be difficult or costly to calculate. Instead, the use of quality indicators can give an insight into quality.

Quality indicators usually consist of information that is a by-product of the statistical process. They do not measure quality directly but can provide enough information to give an insight into quality. (ONS, 2007)
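To make the distinction concrete, a direct measure such as the time lag mentioned above can be computed as soon as the reference date and the release date are both recorded. The following is a minimal Python sketch; the function name and dates are illustrative, not taken from the ONS guidelines.

from datetime import date

def timeliness_lag_days(reference_date: date, release_date: date) -> int:
    # Direct quality measure: days elapsed from the end of the
    # reference period to the release of the output.
    return (release_date - reference_date).days

# Hypothetical example: an output for a reference period ending
# 31 March 2012, released on 15 May 2012.
lag = timeliness_lag_days(date(2012, 3, 31), date(2012, 5, 15))
print(lag)  # 45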

3. Types of Statistical Metadata

Lars-Goran Lundell (2012) defines three main metadata categories in use in the SDWH, and also states that any item of metadata will normally fit into each of these categories:

Active / Passive

Formalised / Free-form

Structural / Reference

Active metadata enables operational use, driving the processes within the S-DWH (e.g. scripts/triggers to carry out activities on the data/metadata), whereas Passive metadata does not act upon the data/metadata within the system, e.g. quality reports, documentation etc.

Formalised metadata would have some form of structure, e.g. classifications/code lists, whereas free-form metadata might contain descriptive information, as in quality reports for example.

Structural metadata is generally thought of (especially in the statistical data world) as metadata which defines data, and generally helps the user ‘find, identify, access and utilise the data’ – for example, classification codes. Reference metadata, by contrast, describes the content and quality of the data, and is most usually associated with quality reports.

All of these categories of metadata could be subject to quality measurement, except perhaps the quality report reference metadata which is itself a report on quality measurement.
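Since any metadata item can be positioned along each of the three axes, one simple way to record the classification is as three boolean attributes. A minimal Python sketch; the class name is hypothetical and the example classifications follow the examples in the text:

from dataclasses import dataclass

@dataclass
class MetadataItem:
    # A metadata item classified along the three axes described above.
    name: str
    active: bool      # True = active (drives processes), False = passive
    formalised: bool  # True = formalised (structured), False = free-form
    structural: bool  # True = structural (defines data), False = reference

# Illustrative classifications drawn from the examples in the text:
classification_codes = MetadataItem("Classification codes",
                                    active=False, formalised=True,
                                    structural=True)
quality_report = MetadataItem("Quality report",
                              active=False, formalised=False,
                              structural=False)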

4. International Standards for Metadata

There are some international standards and statistical models which apply to, or are concerned with metadata, and quality characteristics are mentioned in some of them. Appendix B and Appendix C provide more detail of specific standards available.

The ISO 11179 standard pertains to Metadata Registries (MDR). It has the data element as its fundamental concept, and is concerned with the semantics around metadata definitions.

The Wikipedia definition of metadata registry is: “a central location in an organization where metadata definitions are stored and maintained in a controlled method.”


ISO 11179 states that the main purposes of monitoring metadata quality are:

Monitoring adherence to rules for providing metadata for each attribute

Monitoring adherence to conventions for forming definitions, creating names, and performing classification

Determining whether an administered item still has relevance

Determining the similarity of related administered items and harmonizing their differences

Determining whether it is possible to ever get higher quality metadata for some administered items

In a metadata registry, metadata quality is monitored through the use of a registration status. The status records the level of quality for each administered item (i.e. an administered item’s level of conformance to the required standard), and the levels run, in increasing quality, from Candidate through Recorded, Qualified and Standard to Preferred Standard.

This is a rigorous evaluation process, and could be applied to different elements of metadata used for the evaluation of specific quality dimensions, as appropriate to the scenario or use-case (see below).
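The ordered registration levels lend themselves to a simple enumeration. A minimal sketch; the comparison function is a hypothetical illustration of how a use-case might demand a minimum level, not something ISO 11179 itself specifies:

from enum import IntEnum

class RegistrationStatus(IntEnum):
    # The ISO 11179 registration levels, in increasing order of quality.
    CANDIDATE = 1
    RECORDED = 2
    QUALIFIED = 3
    STANDARD = 4
    PREFERRED_STANDARD = 5

def meets_required_level(status: RegistrationStatus,
                         required: RegistrationStatus) -> bool:
    # Hypothetical check: does an administered item's registration
    # status reach the level a given use-case requires?
    return status >= required

print(meets_required_level(RegistrationStatus.QUALIFIED,
                           RegistrationStatus.STANDARD))  # False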

5. Dimensions, or Characteristics, of Metadata Quality

There are many dimensions available to choose from when assessing quality in the context of statistical processing.

The European Statistical System (ESS) specifies dimensions relating to:

Relevance

Accuracy

Timeliness and Punctuality

Accessibility and Clarity

Comparability

Coherence

Johanis (2002) suggests a similar set of dimensions, originating from Statistics Canada’s Quality Assurance Framework (QAF) (2002):

Relevance

Accuracy

Timeliness

Accessibility

Interpretability

Coherence

Whilst these are seen as ‘static’ quality dimensions, the QAF also defines some complementary quality aspects which are seen as ‘dynamic’:

Non-Response

Coverage

Sampling

Bruce & Hillman (2004), discussing metadata quality within the digital library context, suggest seven similar dimensions, including an additional one covering ‘Provenance’:

Completeness

Accuracy

Provenance

Conformance to expectations

Logical consistency and coherence

Timeliness

Accessibility

From Daas & van Nederpelt (2010), the dimensions thought appropriate to metadata in the context of ‘secondary data sources’ (i.e. mainly non-survey sources) are:

Clarity (encompassing coherence)

Comparability (encompassing linkability, replaceability and uniqueness)

Completeness (encompassing coverage, detailedness, availability, relevance, selectivity and size)

Confidentiality

Correctness (encompassing accuracy, authenticity, and reliability)

Stability

Timeliness (encompassing punctuality)

Daas & Ossen (2011) propose that when evaluating the metadata quality of secondary data sources, the use of ‘hyperdimensions’ is appropriate: several metadata quality characteristics are grouped together to give an overall quality assessment for a data source.
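As a rough illustration of the idea, grouped characteristics can be combined into a single score per hyperdimension. A minimal Python sketch; the groupings, characteristic names, scores and simple averaging rule below are invented for illustration and are not Daas & Ossen’s actual method:

# Hypothetical hyperdimensions grouping quality characteristics.
hyperdimensions = {
    "Metadata": ["clarity", "comparability", "completeness"],
    "Data": ["correctness", "timeliness"],
}

scores = {  # per-characteristic scores on an invented 0-1 scale
    "clarity": 0.8, "comparability": 0.6, "completeness": 0.5,
    "correctness": 0.9, "timeliness": 0.7,
}

for name, characteristics in hyperdimensions.items():
    overall = sum(scores[c] for c in characteristics) / len(characteristics)
    print(f"{name}: {overall:.2f}")  # Metadata: 0.63 / Data: 0.80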

So which set of dimensions do we use when assessing metadata quality throughout the SDWH?

There does not seem to be any conclusive guidance around this specific issue.

Most sets of dimensions quoted in statistical quality frameworks appear to be aimed specifically at statistical outputs from a data perspective, rather than at metadata. When we examine the detail of the dimensions, we can see that some do involve metadata: when considering Timeliness, for example, which compares the period to which the data pertain with the period for which data are required, the period information itself would be considered metadata. However, the measurement of the Timeliness aspect is still a quality attribute which relates to the data itself rather than to the metadata.

Perhaps the best approach is to use whatever dimensions are appropriate for a particular scenario or use-case.

6. Application of Quality Characteristics to Metadata in the Layers

The importance of the various quality characteristics when assessing the quality of different metadata will vary depending upon a set of criteria which includes (but is not necessarily limited to):

(1) the layer of the SDWH in which the evaluation needs to take place;

(2) the source of the metadata (e.g. it may accompany the data provided/collected, or may be entered separately); and

(3) the use to which the data associated with the metadata is to be put.

Examples of how quality dimensions may be applied in the layers

Source Layer

The assessment of the quality of data concepts, definitions and classifications of the administrative populations and variables will determine the relevance of this data for use within an output. Whereas a statistical institution can adjust the concepts, definitions and classifications used in its own surveys to meet user needs, the institution usually has little or no influence over those used by administrative sources. Hence the presence of metadata containing sufficiently accurate descriptions of the concepts can assist the decision as to whether the source data meets the users’ needs.

All metadata made available by an administrative data supplier should be described, along with those metadata items which are missing. A description should include how the missing metadata affect the ability to assess the fitness for purpose of the administrative data. The completeness of the information would be used to determine whether the users can make appropriate use of the data. Links to appropriate metadata ensure that this information is accessible.

For example, if data for a variable from an external data source has an accompanying description of simply ‘Sales’, then the metadata would fail the quality requirements of an output which might require aggregation of variable values with a more specific description, such as ‘Sales - excluding VAT’. In this instance, because of the quality of the metadata, this piece of data will be overlooked for this particular output, even though the variable might have actually represented ‘Sales excluding VAT’ but did not expressly say so in the description.
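A check of this kind could be automated as a comparison between the description an output requires and the description supplied with the variable. A minimal sketch, with an invented containment rule standing in for a real semantic comparison:

def description_meets_requirement(supplied: str, required: str) -> bool:
    # Hypothetical metadata quality check: the supplied variable
    # description passes only if it is at least as specific as the
    # description the output requires (approximated here by simple
    # case-insensitive containment).
    return required.lower() in supplied.lower()

print(description_meets_requirement("Sales", "Sales - excluding VAT"))
# False: the vague description fails, even though the underlying
# data may in fact have excluded VAT.
print(description_meets_requirement("Sales - excluding VAT",
                                    "Sales - excluding VAT"))  # True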

In another scenario, the measure of a provenance characteristic might be important when assessing the usefulness of a particular metadata item in relation to its source. This could be used as part of the quality assessment as to whether a particular piece of micro data could (or should) be used to contribute to an output. For example, if a piece of data arrives at the SDWH from an administrative data source which is known to have previously supplied unreliable or inaccurate measure data associated with a variable, this can be used as a quality evaluation when carrying out a selection of the data which will be used to contribute to a particular output.

Integration layer

Many of the issues relating to the quality of metadata in the source layer are relevant to the integration layer also. This is where an examination of the quality and status of the metadata relating to prospective data for inclusion in the integration process takes place. In addition, in this layer we would expect processes such as editing, imputation and classification/coding to take place, often carried out by automated scripts. The assessment of the quality of these scripts (which are actually Active metadata) would be particularly important.

Interpretation and Analysis layer

When generating or prototyping a potential new output, the user will need to check whether data exists for the statistical concept(s) that they are measuring. This would include a quality check of descriptions of the statistical measure, the population, variables, statistical unit types, domains and time reference. Quality checking this metadata would give users an understanding of the relevance of the input data to their needs (for example, whether the output covers their required population or time period).

Access layer

In the scenario of carrying out a search for valid datasets via some form of data explorer, the entry into the search engine of valid search criteria is obviously very important if the appropriate datasets are to be found. This means that any metadata entered as part of the search criteria must have an acceptable level of quality in order for the search to be successful. For example, if the user enters a value of ‘201203’ as the reference period of the data they require, but the metadata is held in the SDWH in the form of ‘2012Q1’, then the search will fail. Metadata quality checks therefore need to be carried out on the correctness (or accuracy) of the metadata entered by the user.
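One way to reduce such failures is to normalise user-entered reference periods to the form held in the SDWH before the search runs. A minimal sketch, assuming (purely for illustration) that periods are stored quarterly as ‘YYYYQn’:

import re

def normalise_reference_period(entered: str) -> str:
    # Hypothetical normalisation: map a user-entered 'YYYYMM' period
    # onto the 'YYYYQn' quarterly form assumed to be held in the SDWH.
    match = re.fullmatch(r"(\d{4})(\d{2})", entered)
    if match and 1 <= int(match.group(2)) <= 12:
        quarter = (int(match.group(2)) - 1) // 3 + 1
        return f"{match.group(1)}Q{quarter}"
    return entered  # already in the stored form, or unrecognised

print(normalise_reference_period("201203"))  # 2012Q1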

7. Acceptable Quality Levels

These are threshold values which indicate the acceptability of the metadata following the application of quality measurements, for each of the appropriate quality dimensions.

These levels could conceivably change depending upon the quality requirements of particular outputs or processes.
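In an implementation, such thresholds might be held per dimension and per output, and compared against the measured values. A minimal sketch; the dimension names, threshold values and 0-1 scale are invented for illustration:

# Hypothetical acceptable quality levels for one output, and the
# measured metadata quality of a candidate source.
thresholds = {"completeness": 0.95, "correctness": 0.90, "timeliness": 0.80}
measured = {"completeness": 0.97, "correctness": 0.85, "timeliness": 0.90}

failures = {dim: (measured[dim], level)
            for dim, level in thresholds.items()
            if measured[dim] < level}
print("acceptable" if not failures else f"below threshold: {failures}")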

8. Metadata Quality Management

Should we be concerned about the management of metadata quality?

Some aspects of the Quality Management principles (ISO 9000) should be applied to metadata quality management. In particular, the following principles seem relevant to the SDWH environment:

Customer focus - Organizations depend on their customers and therefore should understand current and future customer needs, should meet customer requirements and strive to exceed customer expectations;

Process approach - A desired result is achieved more efficiently when activities and related resources are managed as a process;

System approach to management - Identifying, understanding and managing interrelated processes as a system contributes to the organization's effectiveness and efficiency in achieving its objectives.

This indicates that the processes surrounding the SDWH should encompass quality management processes. For example, a customer such as an expert user carrying out a detailed analysis process would be expected to have access to a system which provides all the information they require relating to the metadata, including some form of mechanism for feeding back any information on metadata quality which might come to light as a result of the process being carried out.

9. References

Lars-Goran Lundell (2012) – Metadata Framework for Statistical Data Warehousing (ESSnet project on Statistical Data Warehouse)

International Standard ISO 9000:2005 – Quality Management Systems – Fundamentals and vocabulary

International Standard ISO/IEC 11179 – Information Technology – Metadata Registries (Parts 1 – 6)

Office for National Statistics (2007) – Guidelines for Measuring Statistical Quality – Published by Her Majesty’s Stationery Office (HMSO) – now ‘The Stationery Office’ - for the Office for National Statistics

Paul Johanis (2002) - Assessing the Quality of Metadata. Statistics Canada presentation at the work session on METIS, 6-8 March 2002, Luxembourg

Statistics Canada - Statistics Canada’s Quality Assurance Framework (2002)

Thomas R. Bruce & Diane I. Hillman (2004) – The Continuum of Metadata Quality: Defining, Expressing, Exploiting. In Metadata in Practice (pp. 238-256). Chicago: ALA

Piet J.H. Daas and Peter W.M. van Nederpelt (2010) - Application of the object oriented quality management model to secondary data sources – Statistics Netherlands

Piet J.H. Daas and Saskia J.L. Ossen (2011) – Metadata Quality Evaluation of Secondary Data Sources - Statistics Netherlands. Presented at the 5th International Quality Conference, May 20th 2011

Appendix A – Quality Dimension Definitions

Quality Assurance Framework – Statistics Canada

Relevance: The relevance of statistical information reflects the degree to which it meets the real needs of clients. It is concerned with whether the available information sheds light on the issues of most importance to users. Assessing relevance is a subjective matter dependent upon the varying needs of users. The Agency’s challenge is to weigh and balance the conflicting needs of current and potential users to produce a program that goes as far as possible in satisfying the most important needs within given resource constraints.

Accuracy: The accuracy of statistical information is the degree to which the information correctly describes the phenomena it was designed to measure. It is usually characterized in terms of error in statistical estimates and is traditionally decomposed into bias (systematic error) and variance (random error) components. It may also be described in terms of the major sources of error that potentially cause inaccuracy (e.g., coverage, sampling, non-response, response).

Timeliness: The timeliness of statistical information refers to the delay between the reference point (or the end of the reference period) to which the information pertains, and the date on which the information becomes available. It is typically involved in a trade-off against accuracy. The timeliness of information will influence its relevance.

Accessibility: The accessibility of statistical information refers to the ease with which it can be obtained from the Agency. This includes the ease with which the existence of information can be ascertained, as well as the suitability of the form or medium through which the information can be accessed. The cost of the information may also be an aspect of accessibility for some users.

Interpretability: The interpretability of statistical information reflects the availability of the supplementary information and metadata necessary to interpret and utilize it appropriately. This information normally covers the underlying concepts, variables and classifications used, the methodology of data collection and processing, and indications of the accuracy of the statistical information.

Coherence: The coherence of statistical information reflects the degree to which it can be successfully brought together with other statistical information within a broad analytic framework and over time. The use of standard concepts, classifications and target populations promotes coherence, as does the use of common methodology across surveys. Coherence does not necessarily imply full numerical consistency.

ONS Guidelines for Measuring Statistical Quality

Relevance - The degree to which the statistical product meets user needs for both coverage and content.

Accuracy - The closeness between an estimated result and the (unknown) true value.

Timeliness and Punctuality - Timeliness refers to the lapse of time between publication and the period to which the data refer. Punctuality refers to the time lag between the actual and planned dates of publication.

Accessibility and Clarity - Accessibility is the ease with which users are able to access the data. It also relates to the format(s) in which the data are available and the availability of supporting information. Clarity refers to the quality and sufficiency of the metadata, illustrations and accompanying advice.

Comparability - The degree to which data can be compared over time and domain.

Coherence - The degree to which data that are derived from different sources or methods, but which refer to the same phenomenon, are similar.

Appendix B – International Standards relevant to Metadata

ISO/IEC TR 20943 – Achieving Metadata Registry Content Consistency

http://metadata-stds.org/20943/index.html

First conclusion and summary:

This standard consists of six parts, some of which are still under development or on hold, but in our opinion it can provide the reader with useful information on the subject of metadata within a SDWH; we discussed several of its items briefly during the WP1 meeting in The Hague.

The purpose of ISO/IEC TR 20943-1:2003 is to describe a set of procedures for the consistent registration of data elements and their attributes in a registry. ISO/IEC TR 20943-1:2003 is not a data entry manual, but a user’s guide for conceptualizing a data element and its associated metadata items for the purpose of consistently establishing good quality data elements. An organization may adapt and/or add to these procedures as necessary. The scope of ISO/IEC TR 20943-1:2003 is limited to the associated items of a data element: the data element identifier, names and definitions in particular contexts, and examples; data element concept; conceptual domain with its value meanings; and value domain with its permissible values.

The purpose of ISO/IEC 20943-2 is to describe ways of representing XML structured data in an ISO/IEC 11179-3 metadata registry (hereinafter referred to as "a 11179 MDR" or simply "an MDR"). XML structures may be mapped to, and represented by, one or more constructs in an MDR. ISO/IEC 11179-3:2003 does not explicitly specify how to represent XML structures, and practitioners have found more than one way to represent similar structures using the constructs defined by ISO/IEC 11179-3:2003. This part describes some possible representations of various XML structures, with some pros and cons of each, and techniques for mapping from one to another.

ISO/IEC TR 9789:1994 - Guidelines for the organisation and representation of data elements for data interchange – Coding methods and principles

http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=17651

First conclusion and summary:

We are not sure if this ISO standard can provide the reader with useful extra information, as it is not a free standard: downloading the PDF document costs 98 CHF.

The ISO 9789 standard provides general guidance on the manner in which data can be expressed by codes. It describes the objectives of coding; the characteristics, advantages and disadvantages of different coding methods; and the features of codes, and it gives guidelines for the design of codes.

ISO/IEC TR 14957:2010 - Representation of data element values: Notation of the format

http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=55652

First conclusion and summary:

We are not sure if this ISO standard can provide the reader with useful extra information, as it is not a free standard: downloading the PDF document costs 50 CHF.

ISO/IEC 14957:2010 specifies the notation to be used for stating the format, i.e. the character classes, used in the representation of data elements and the length of these representations. It also specifies additional notations relative to the representation of numerical figures. For example, this formatting technique might be used as part of the metadata for data elements. The scope of ISO/IEC 14957:2010 is limited to graphic characters, such as digits, letters and special characters. The scope is limited to the basic datatypes of characters, character strings, integers, reals, and pointers.

ISO/IEC TR 15452:2000 – Specification of Data Value Domain

http://webstore.iec.ch/p-preview/info_isoiec15452%7Bed1.0%7Den.pdf

First conclusion and summary:

According to the above hyperlink this document has been withdrawn and is therefore no longer valid.

ISO/IEC 19763 - Framework for Metamodel interoperability

http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=38637

First conclusion and summary:

According to the above hyperlink this document has been withdrawn and is therefore no longer valid.

ISO/IEC 24706 - Metadata for technical standards and specifications documents

http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=56081

First conclusion and summary:

According to the above hyperlink this document is still under development, and therefore no summary is available yet. We think it is worth checking this standard again in the near future, but for now it is not useful for the project.

ISO/IEC 19773 – Metadata Registries (MDR) modules

http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=41769

First conclusion and summary:

We are not sure if this ISO standard can provide the reader with useful extra information, as it is not a free standard: downloading the PDF document costs 196 CHF.

ISO/IEC 19773:2011 specifies small modules of data that can be used or reused in applications. These modules have been extracted from ISO/IEC 11179-3, ISO/IEC 19763, and OASIS EBXML, and have been refined further. These modules are intended to harmonize with current and future versions of the ISO/IEC 11179 series and the ISO/IEC 19763 series. These modules include: reference-or-literal (reflit) for on-demand choices of pointers or data; multitext, multistring, etc. for recording internationalized and localized data within the same structure; slots and slot arrays for standardized extensible data structures; internationalized contact data, including UPU postal addresses, ITU-T E.164 phone numbers, internet E-mail addresses, etc.; generalized model for context data based upon who-what-where-when-why-how (W5H); data structures for reified relationships and entity-person-groups. Conformity can be selected on a per-module basis.

ISO/IEC 20944 – Metadata Registry Interoperability & Bindings (MDR-IB)

http://metadata-stds.org/20944/index.html

First conclusion and summary:

This standard consists of five parts, some of which are still under development, but in our opinion it can provide the reader with useful information on the subject of metadata within a SDWH; we discussed several of its items briefly during the WP1 meeting in The Hague. Further research is necessary to judge this, however.

The ISO/IEC 20944 family of standards is being developed to provide interoperability among metadata registries (11179-3), such as reading/writing attributes from/to a metadata registry. However, the ISO/IEC 20944 series may be used generically, such as for applications that are unrelated to 11179-3 metadata registries, or applications that extend 11179-3 metadata registry attributes (attributes outside of the 11179-3 specification).

Appendix C - Summary of ISO/IEC 11179

Introduction

International standards apply to metadata. Much work is being accomplished in the national and international standards communities, especially ANSI (American National Standards Institute) and ISO (International Organization for Standardization) to reach consensus on standardizing metadata and registries.

The core standard is ISO/IEC 11179-1 and subsequent parts. All registrations published so far according to this standard cover just the definition of metadata; they do not address the structuring of metadata storage or retrieval, nor any administrative standardisation. It is important to note that this standard refers to metadata as data about containers of data, and not to metadata (metacontent) as data about data contents. It should also be noted that this standard originally described itself as a "data element" registry, describing disembodied data elements, and explicitly disavowed the capability of containing complex structures. Thus the original term "data element" is more applicable than the later applied buzzword "metadata".

Intended purpose

Today, organizations often want to exchange data quickly and precisely between computer systems using enterprise application integration technologies. Completed transactions are also often transferred to separate data warehouse and business rules systems with structures designed to support data for analysis. The industry de facto standard model for data integration platforms is the Common Warehouse Metamodel (CWM). Data integration is often also solved as a data, rather than a metadata, problem, with the use of so-called master data. ISO/IEC 11179 claims that it is a standard for metadata-driven exchange of data in a heterogeneous environment, based on exact definitions of data.

Structure of an ISO/IEC 11179 metadata registry

The ISO/IEC 11179 model is a result of two principles of semantic theory, combined with basic principles of data modelling.

The first principle from semantic theory is the thesaurus type relation between wider and more narrow (or specific) concepts, e.g. the wide concept "income" has a relation to the more narrow concept "net income".

The second principle from semantic theory is the relation between a concept and its representation, i.e. "buy" and "purchase" are the same concept even if different terms are used.

The basic principle of data modelling is the combination of an object class and a characteristic. For example, "Person - hair color".

When applied to data modelling, ISO/IEC 11179 combines a wide "concept" with an "object class" to form a more specific "data element concept". For example, the high-level concept "income" is combined with the object class "person" to form the data element concept "net income of person". Note that "net income" is more specific than "income".

The different possible representations of a data element concept are then described with the use of one or more data elements. Differences in representation may be a result of the use of synonyms or different value domains in different data sets in a data holding. A value domain is the permitted range of values for a characteristic of an object class. An example of a value domain for "gender of person" is "M = Male, F = Female, U = Unknown". The letters M, F and U are then the permitted values of gender of person in a particular dataset.

The data element concept "monthly net income of person" may thus have one data element called "monthly net income of individual by 100 dollar groupings" and one called "monthly net income of person range 0-1000 dollars", etc., depending on the heterogeneity of representation that exists within the data holdings covered by one ISO/IEC 11179 registry. Note that these two examples have different terms for the object class (person/individual) and different value sets (a 0-1000 dollar range as opposed to 100 dollar groupings).
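The constructs described above can be sketched as plain data structures: a data element concept pairs a concept with an object class, and each data element gives one representation of it together with a value domain. The class and field names below are illustrative, not the 11179-3 metamodel itself:

from dataclasses import dataclass

@dataclass
class DataElementConcept:
    # A wider concept combined with an object class, e.g.
    # "monthly net income" + "person".
    concept: str
    object_class: str

@dataclass
class DataElement:
    # One representation of a data element concept, with its
    # value domain of permissible values and their meanings.
    concept: DataElementConcept
    representation: str
    value_domain: dict

income = DataElementConcept("monthly net income", "person")
by_groups = DataElement(income, "100 dollar groupings", {})
gender = DataElement(DataElementConcept("gender", "person"),
                     "single-letter code",
                     {"M": "Male", "F": "Female", "U": "Unknown"})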

The result of this is a catalogue of sorts, in which related data element concepts are grouped by a high-level concept and an object class, and data elements grouped by a shared data element concept. Strictly speaking, this is not a hierarchy, even if it resembles one.

It is worth noting that ISO/IEC 11179 proper does not describe data as it is actually stored. There is no part of the model that caters to the description of physical files, tables and columns. All the ISO/IEC 11179 constructs are "semantic" as opposed to "physical" or "technical".

Since the standard has two main purposes (definition and exchange) the core object is the data element concept, since it defines a concept and, ideally, describes data independent of its representation in any one system, table, column or organisation.

The data element is the foundational concept in an ISO/IEC 11179 metadata registry. The purpose of the registry is to maintain a semantically precise structure of data elements.

Each data element in an ISO/IEC 11179 metadata registry:

should be registered according to the Registration guidelines (11179-6)

will be uniquely identified within the register (11179-5)

should be named according to Naming and Identification Principles (11179-5) – see data element name

should be defined by the Formulation of Data Definitions rules (11179-4) – see data element definition

and may be classified in a Classification Scheme (11179-2) – see classification scheme

Data elements that store "Codes" or enumerated values must also specify the semantics of each of the code values with precise definitions.

Structure of the ISO/IEC 11179 standard

The standard consists of six parts:

Part 1 - Framework

Part 2 - Classification

Part 3 - Registry metamodel and basic attributes

Part 4 - Formulation of data definitions

Part 5 - Naming and identification principles

Part 6 - Registration

Part 1 explains the purpose of each part. Part 3 specifies the metamodel that defines the registry. The other parts specify various aspects of the use of the registry.

11179-1: Framework

This part of ISO/IEC 11179 introduces and discusses fundamental ideas of data elements, value domains, data element concepts, conceptual domains, and classification schemes essential to the understanding of this set of standards and provides the context for associating the individual parts of ISO/IEC 11179.

11179-2: Classification

This part of ISO/IEC 11179 provides a conceptual model for managing classification schemes. There are many structures used to organize classification schemes and there are many subject matter areas that classification schemes describe. So, this Part also provides a two-faceted classification for classification schemes themselves.

11179-3: Registry metamodel and basic attributes

This part of ISO/IEC 11179 specifies a conceptual model for a metadata registry, and a set of basic attributes for metadata for use when a full registry solution is not needed.

11179-4: Formulation of data definition

This part of ISO/IEC 11179 provides guidance on how to develop unambiguous data definitions. A number of specific rules and guidelines are presented in ISO/IEC 11179-4 that specify exactly how a data definition should be formed. A precise, well-formed definition is one of the most critical requirements for shared understanding of an administered item; well-formed definitions are imperative for the exchange of information. Only if every user has a common and exact understanding of the data item can it be exchanged trouble-free.

11179-5: Naming and identification principles

This part of ISO/IEC 11179 provides guidance for the identification of administered items. Identification is a broad term for designating, or identifying, a particular data item. Identification can be accomplished in various ways, depending upon the use of the identifier. Identification includes the assignment of numerical identifiers that have no inherent meanings to humans; icons (graphic symbols to which meaning has been assigned); and names with embedded meaning, usually for human understanding, that are associated with the data item's definition and value domain.

11179-6: Registration

This part of ISO/IEC 11179 provides instruction on how a registration applicant may register a data item with a central Registration Authority and the allocation of unique identifiers for each data item. Maintenance of administered items already registered is also specified in this document.

Additional information

Classification scheme: 11179-2 (Wikipedia)

In metadata, a classification scheme is a hierarchical arrangement of kinds of things (classes) or groups of kinds of things. Typically it is accompanied by descriptive information about the classes or groups. A classification scheme is intended to be used for an arrangement or division of individual objects into the classes or groups. The classes or groups are based on characteristics which the objects (members) have in common. In linguistics, the subordinate concept is called a hyponym of its superordinate; typically a hyponym is 'a kind of' its superordinate (Keith Allan, Natural Language Semantics).

The ISO/IEC 11179 metadata registry standard uses classification schemes as a way to classify administered items, such as data elements, in a metadata registry.
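A hierarchical scheme of this kind can be represented minimally as a mapping from each class to its superordinate, with the hyponym ('a kind of') relation recovered by walking up the chain. A sketch with invented class names:

# Hypothetical classification scheme: each class maps to its superordinate.
superordinate = {
    "net income": "income",
    "gross income": "income",
    "income": "economic variable",
}

def is_hyponym_of(subclass: str, superclass: str) -> bool:
    # Walk up the hierarchy to test whether one class is
    # 'a kind of' another.
    current = subclass
    while current in superordinate:
        current = superordinate[current]
        if current == superclass:
            return True
    return False

print(is_hyponym_of("net income", "economic variable"))  # True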

Some quality criteria for classification schemes are:

Whether different kinds are grouped together; in other words, whether it is a grouping system or a pure classification system. In the case of grouping, a subset (subgroup) does not have (inherit) all the characteristics of the superset, which means that the knowledge and requirements about the superset are not applicable to the members of the subset.

Whether the classes have overlaps.

Whether subordinates (may) have multiple superordinates. Some classification schemes allow a kind of thing to have more than one superordinate, others don't. Multiple supertypes for one subtype implies that the subordinate has the combined characteristics of all its superordinates. This is called multiple inheritance (of characteristics from multiple superordinates to their subordinates).

Whether the criteria for belonging to a class or group are well defined.

Whether the kinds of relations between the concepts are made explicit and well defined.

Whether subtype-supertype relations are distinguished from composition relations (part-whole relations) and from object-role relations.

Benefits of using classification schemes

Using one or more classification schemes for the classification of a collection of objects has many benefits. Some of these include:

It allows a user to find an individual object quickly on the basis of its kind or group.

It makes it easier to detect duplicate objects.

It conveys the semantics (meaning) of an object from the definition of its kind; this meaning is not conveyed by the name of the individual object or its spelling.

Knowledge and requirements about a kind of thing can be applied to the members of the kind.

Examples of kinds of classification schemes

The following are examples of different kinds of classification schemes. This list is in approximate order from informal to more formal:

thesaurus - a collection of categorized concepts, denoted by words or phrases, that are related to each other by narrower term, wider term and related term relations.

taxonomy - a formal list of concepts, denoted by controlled words or phrases, arranged from abstract to specific, related by subtype-supertype relations or by superset-subset relations.

data model - an arrangement of concepts (entity types), denoted by words or phrases, that have various kinds of relationships. Typically, but not necessarily, representing requirements and capabilities for a specific scope (application area).

network (mathematics) - an arrangement of objects in a random graph.

ontology - an arrangement of concepts that are related by various well defined kinds of relations. The arrangement can be visualized in a directed acyclic graph.

One example of a classification scheme for data elements is a representation term.

Data element definition 11179-4 (Wikipedia)

In metadata, a data element definition is a human readable phrase or sentence associated with a data element within a data dictionary that describes the meaning or semantics of a data element.

Data element definitions are critical for external users of any data system. Good definitions can dramatically ease the process of mapping one set of data into another set of data. This is a core feature of distributed computing and intelligent agent development.

There are several guidelines that should be followed when creating high-quality data element definitions.

Properties of clear definitions

A good definition is:

Precise - The definition should use words that have a precise meaning. Try to avoid words that have multiple meanings or multiple word senses.

Concise - The definition should use the shortest description possible that is still clear.

Non-circular - The definition should not use the term you are trying to define in the definition itself. This is known as a circular definition.

Distinct - The definition should differentiate a data element from other data elements. This process is called disambiguation.

Unencumbered - The definition should be free of embedded rationale, functional usage, domain information, or procedural information.

A data element definition is a required property when adding data elements to a metadata registry.

Definitions should not refer to terms or concepts that might be misinterpreted by others or that have different meanings based on the context of a situation. Definitions should not contain acronyms that are not clearly defined or linked to other precise definitions.
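Some of these guidelines can be checked mechanically. For example, the non-circularity rule above can be approximated by testing whether the term being defined reappears in its own definition. A rough sketch (a real ISO/IEC 11179-4 review would of course go much further):

import re

def is_circular(term: str, definition: str) -> bool:
    # Crude check against the non-circularity guideline: flag a
    # definition that reuses the term being defined.
    return re.search(rf"\b{re.escape(term)}\b", definition,
                     re.IGNORECASE) is not None

print(is_circular("person", "A person."))                   # True
print(is_circular("person", "An individual human being."))  # False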

If you are creating a large number of data elements, all the definitions should be consistent with related concepts.

Critical Data Element -- Not all data elements are of equal importance or value to an organization. A key metadata property of an element is categorizing the data as a Critical Data Element (CDE). This categorization provides focus for data governance and data quality. An organization often has various sub-categories of CDEs, based on the use of the data, e.g.:

Security Coverage – data elements that are categorized as personal health information or PHI warrant particular attention for security and access

Marketing Department Usage – the Marketing department could have a particular set of CDEs for identifying a Unique Customer or for Campaign Management

Finance Department Usage – the Finance department could have a different set of CDEs from Marketing. They are focused on data elements which provide measures and metrics for fiscal reporting

Standards such as the ISO/IEC 11179 Metadata Registry specification give guidelines for creating precise data element definitions. Specifically, Part 4 of the ISO/IEC 11179 metadata registry standard covers data element definition quality standards.

Using precise words

Common words such as play or run frequently have many meanings. For example, the WordNet database documents over 57 distinct meanings for the word "play", but only a single definition for the term "dramatic play". Fewer definitions in a chosen word's dictionary entry is preferable, as this minimizes misinterpretation related to a reader's context and background. The process of finding a good meaning of a word is called word sense disambiguation.

Examples of definitions that could be improved

Here is the definition of the "person" data element as defined in the www.w3c.org Friend of a Friend specification:

Person: A person.

Although most people do have an intuitive understanding of what a person is, the definition has much room for improvement. The first problem is that the definition is circular. Note that this definition really does not help most readers and needs to be clarified.

Here is the definition of the "Person" Data Element in the Global Justice XML Data Model 3.0:

person: Describes inherent and frequently associated characteristics of a person.

Note that once again the definition is still circular. Person should not reference itself. The definition should use terms other than person to describe what a person is.

Here is a more precise but shorter definition of a person:

Person: An individual human being.

Note that it uses the word individual to state that this is an instance of a class of things called human being. Technically you might use "homo sapiens" in your definition, but more people are familiar with the term "human being" than "homo sapiens," so commonly used terms, if they are still precise, are always preferred.

Sometimes your system may have cultural norms and assumptions in the definitions. For example if your "Person" data element tracked characters in a science fiction series that included aliens you may need a more general term other than human being.

Person: An individual of a sentient species.

Data element name 11179-5 (Wikipedia)

A data element name is a name given to a data element in, for example, a data dictionary or metadata registry. In a formal data dictionary, there is often a requirement that no two data elements may have the same name, to allow the data element name to become an identifier, though some data dictionaries may provide ways to qualify the name in some way, for example by the application system or other context in which it occurs.

In a database driven data dictionary, the fully qualified data element name may become the primary key, or an alternate key, of a Data Elements table of the data dictionary.

The data element name typically conforms to ISO/IEC 11179 metadata registry naming conventions and has at least three parts:

Object, Property and Representation term. Many standards require the use of upper camel case to differentiate the components of a data element name. This is the standard used by ebXML, GJXDM and NIEM.

Example of ISO/IEC 11179 naming in relational databases

ISO/IEC 11179 is applicable when naming tables and columns within a relational database.

Tables are Collections of Entities, and follow Collection naming guidelines. Ideally, a collective name is used: e.g., Personnel. Plural is also correct: Employees. Incorrect names include: Employee, tblEmployee, and EmployeeTable.

Columns are Properties of the Entity and are named in a multi-part format:

[Object] [Qualifier] Property RepresentationTerm

The Object part may be omitted from a name when the property is within its object's context. The Qualifier is used when it is necessary to uniquely identify an element. For example, columns on the WorkOrders table would be expressed as:

WorkOrder_Number
Requirements_Text
Requesting_Employee_Number
Approving_Employee_Number

For Requirements_Text, the full name (i.e., the name that goes in the registry, or data dictionary) is WorkOrder_Requirements_Text.

Object is WorkOrder in full name. Property is Requirements in full name. RepresentationTerm is Text in full name.

The Requesting_Employee_Number and Approving_Employee_Number columns have Qualifiers to ensure that the data element names are unique and descriptive. The Object part of the element name is also omitted because it is declared within the object context.

Note that for the examples provided, an underscore was used as a separator. A separator is not mandated by ISO/IEC 11179 but is recommended.
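Composing names from their parts is mechanical enough to sketch directly. The helper below is hypothetical; it simply follows the WorkOrders example above, using the recommended underscore separator:

def data_element_name(prop, representation, obj="", qualifier="", sep="_"):
    # Compose a multi-part name in the form
    # [Object] [Qualifier] Property RepresentationTerm,
    # omitting any empty parts.
    parts = [p for p in (obj, qualifier, prop, representation) if p]
    return sep.join(parts)

print(data_element_name("Requirements", "Text"))
# Requirements_Text
print(data_element_name("Requirements", "Text", obj="WorkOrder"))
# WorkOrder_Requirements_Text
print(data_element_name("Employee", "Number", qualifier="Requesting"))
# Requesting_Employee_Number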

Example of ISO/IEC 11179 name in XML

Users frequently encounter ISO/IEC 11179 when they are exposed to XML Data Element names that have a multi-part Camel Case format:

Object [Qualifier] Property RepresentationTerm

The specification also includes normative documentation in appendices.

For example the XML element for a person's given (first) name would be expressed as:

<PersonGivenName>John</PersonGivenName>

Here Object = Person, Property = Given and RepresentationTerm = Name. In this case the optional qualifier is not used.