SDMX Global Conference 2013, Paris “Global SDMX Implementation: Modernising Official Statistics”
SDMX STATISTICAL CAPACITY BUILDING
GUIDELINES FOR DESIGNING DATA STRUCTURE DEFINITIONS (DSDs)
Overview
• Design principles
• Exchange context
• Design process
• Data structuring approaches
• DSD analysis: STES as example
Design Principles
Structural
• Parsimony
• Simplicity
• Purity
• Unambiguousness
• Exhaustiveness
• Orthogonality
Other
• Re-use of existing artefacts
• Flexibility and future needs
• Fitness for use throughout the statistical business process
• User-friendliness
Data Exchange Context
• Single- or multi-domain
• Single- or multi-purpose
• Type of data
• Human or machine as recipient
• Level of data exchange
• Role in data exchange
• Process pattern
• Phase of statistical process
Design Process
1. Specify context
2. Identify relevant existing DSDs (available / not available)
3. Check DSD suitability (partly suitable / suitable / not suitable)
4.1. Define modified DSDs (existing DSDs partly suitable)
4.2. Use suitable DSDs (existing DSDs suitable)
4.3. Define new DSDs (existing DSDs not suitable or not available)
5. Define supporting artefacts
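The branching in this process (first availability, then suitability) can be sketched as a small function. This is only an illustration: the function name `next_step` is hypothetical, and the mapping of outcomes to steps is our reading of the flowchart labels, not something the slides state explicitly.

```python
def next_step(existing_dsd_available, suitability=None):
    """Return the step that follows the availability and suitability checks.

    suitability is "suitable", "partly suitable" or "not suitable";
    it is ignored when no relevant existing DSD is available.
    """
    if not existing_dsd_available:
        # Nothing to reuse (assumed reading of the "not available" branch).
        return "4.3. Define new DSDs"
    steps = {
        "partly suitable": "4.1. Define modified DSDs",
        "suitable": "4.2. Use suitable DSDs",
        "not suitable": "4.3. Define new DSDs",
    }
    if suitability not in steps:
        raise ValueError(f"unknown suitability: {suitability!r}")
    return steps[suitability]


print(next_step(True, "partly suitable"))  # 4.1. Define modified DSDs
print(next_step(False))                    # 4.3. Define new DSDs
```

Whatever branch is taken, step 5 (define supporting artefacts) follows.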
Design Process
Define new DSDs
4.3.1. Specify concepts
4.3.2. Specify code lists
4.3.3. Specify data formats
4.3.4. Assemble DSDs
Design Process
Specify concepts
4.3.1.1. Decide structuring approach
4.3.1.2. Identify relevant existing concepts (available / not available)
4.3.1.3. Check concept suitability (suitable / not suitable)
4.3.1.4.1. Use suitable concepts
4.3.1.4.2. Define new concepts
4.3.1.5. Define concept roles
4.3.1.6. Define groups
4.3.1.7. Define attribute attachment levels
The flowchart includes two “revise” loops.
Design Process
Specify code lists
4.3.2.1. Identify relevant existing code lists (available / not available)
4.3.2.2. Check code list suitability (suitable / partly suitable / not suitable)
4.3.2.3.1. Use suitable code lists
4.3.2.3.2. Define modified code lists
4.3.2.3.3. Define new code lists
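The slides do not say how code list suitability is judged. As one possible and deliberately simple criterion, the sketch below assumes suitability is decided by how many of the required codes an existing code list already covers. The function name `classify_code_list` and the example codes are hypothetical.

```python
def classify_code_list(required_codes, existing_codes):
    """Classify an existing code list by its coverage of the required codes.

    Assumed criterion (not from the slides): full coverage is "suitable",
    partial coverage is "partly suitable", no coverage is "not suitable".
    """
    required = set(required_codes)
    covered = required & set(existing_codes)
    if covered == required:
        return "suitable"         # use the existing code list
    if covered:
        return "partly suitable"  # define a modified code list
    return "not suitable"         # define a new code list


existing = ["A", "Q", "M"]  # hypothetical frequency codes
print(classify_code_list(["Q", "M"], existing))  # suitable
print(classify_code_list(["M", "W"], existing))  # partly suitable
print(classify_code_list(["X"], existing))       # not suitable
```

In practice the check would also look at code meanings and maintenance arrangements, which a set comparison cannot capture.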
Design Process
Iterative
4.3.1. Specify concepts
4.3.2. Specify code lists
4.3.3. Specify data formats
4.3.4. Assemble DSDs
Data structuring approaches
• Number and content of dimensions
• Number of DSDs
Not completely independent: fewer concepts and dimensions in the key mean a larger number of DSDs.
Number and content of dimensions
Data characteristics (C1, C2, C3, C4, …): Sex, Age, Sector, Employment status, …
• Pure concepts: 1 characteristic = 1 concept, e.g. Sex; Age; Sector; …
• Composite concepts: several characteristics = 1 concept, e.g. Sex and Age
Wider use of composite concepts means a lower number of dimensions.
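The difference between pure and composite concepts can be made concrete with a small sketch. The code lists and codes below are hypothetical and only serve to show the effect: merging Sex and Age into one composite concept removes a dimension but multiplies the length of the code list.

```python
from itertools import product

# Hypothetical code lists for two pure concepts (codes are illustrative only).
SEX = ["F", "M", "T"]            # female, male, total
AGE = ["Y15-24", "Y25-64", "T"]  # two age bands and a total

# Pure approach: one dimension per characteristic, two short code lists.
pure_dimensions = {"SEX": SEX, "AGE": AGE}

# Composite approach: one dimension whose codes combine both characteristics.
composite_dimensions = {
    "SEX_AGE": [f"{sex}_{age}" for sex, age in product(SEX, AGE)]
}


def summarise(dimensions):
    """Return (number of dimensions, length of the longest code list)."""
    return len(dimensions), max(len(codes) for codes in dimensions.values())


print(summarise(pure_dimensions))       # (2, 3)
print(summarise(composite_dimensions))  # (1, 9)
```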
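Pure vs. composite concepts
[Figure: the key Dim1.Dim2.….DimN is built from N code lists (Codelist1 … CodelistN) holding K1 … KN codes. Horizontal complexity grows with the number of dimensions in the key; vertical complexity grows with the length of the code lists. Many pure concepts correspond to high horizontal complexity; few mixed concepts correspond to high vertical complexity.]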
Many pure concepts
● clean data structure
● flexible in terms of mapping to other data structures: may be mapped to any mixed representation
● flexible in terms of defining queries (for a skilled user)
● short and simple code lists
● long observation keys
● difficult for end users to handle (long codes; many dimensions), though more flexible for skilled users
● special values (not applicable; total) widely used
● creates sparseness
● needs many constraints (due to sparseness)
Some of the critical points may be overcome through a different strategy in choosing the number of DSDs. More DSDs reduce sparseness and the need for constraints, and would result in shorter keys.
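The sparseness mentioned above can be illustrated by comparing the keys a set of pure dimensions makes possible with the keys for which data are actually reported. Everything in this sketch is hypothetical (dimension names, codes, the use of "_Z" for "not applicable", and the reported keys); it only shows how quickly the share of empty cells grows.

```python
from itertools import product

# Hypothetical pure dimensions; "_Z" stands in for "not applicable".
code_lists = {
    "SEX": ["F", "M", "T"],
    "AGE": ["Y15-24", "Y25-64", "T"],
    "SECTOR": ["A", "B", "_Z"],
}

# Hypothetical keys for which data exist. A constraint attached to the
# DSD would have to list (or otherwise describe) exactly these keys.
reported = {
    ("T", "T", "A"),
    ("T", "T", "B"),
    ("F", "T", "_Z"),
    ("M", "T", "_Z"),
    ("T", "Y15-24", "_Z"),
    ("T", "Y25-64", "_Z"),
}


def sparseness(code_lists, reported):
    """Share of possible keys for which no data are reported."""
    possible = set(product(*code_lists.values()))
    unknown = reported - possible
    if unknown:
        raise ValueError(f"keys not allowed by the code lists: {unknown}")
    return 1 - len(reported) / len(possible)


print(round(sparseness(code_lists, reported), 2))  # 0.78
```

Here 6 of 27 possible keys carry data. Splitting the data over several DSDs, each with only the dimensions it needs, would shrink the set of possible keys.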
Strategies
All pure concepts: too many?
• Composite concepts
• Many different DSDs
Trade-off
Composite concepts
SDMX technical notes, annex 6 (343): “Avoid composite dimensions”
But in particular contexts they may be useful, e.g. to disseminate a few key economic indicators (multi-domain).
Number of DSDs
ONE DSD or MANY DSDs?
A possible approach: master and satellite artefacts (derived via constraints)
Master and satellite DSDs
Multiple satellite DSDs
Master DSD matrix (rows: data exchange scenarios; columns: concepts)
Scenario  SC1  SC2  SC3  ……  SCm
# 1       X    X    X    X   X
# 2       X    O    X    X   X
# 3       O    X    O    O   O
……        …    …    …    …   …
# n       X    O    O    X   X
The master DSD is turned, through constraints, into DSD1, DSD2, …, DSDn: multiple satellite DSDs (unique key structures).
Master and satellite DSDs
One DSD, multiple data flows
The master DSD stays ONE DSD; constraints define Dataflow 1, Dataflow 2, …, Dataflow n.
Master and satellite DSDs
Multiple satellite DSDs
A slightly different approach: the master DSD is turned into DSD1, DSD2, …, DSDn by dropping the dimensions that are not relevant. The result is multiple satellite DSDs (multiple key structures).
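The two ways of deriving satellites from a master DSD matrix can be sketched as follows. Several things here are assumptions rather than content of the slides: "X" is read as "concept used in the scenario" and "O" as "not used", the matrix has five hypothetical concepts, and a constraint is modelled as fixing each unused dimension to a single "not applicable" code ("_Z").

```python
MASTER_CONCEPTS = ["SC1", "SC2", "SC3", "SC4", "SC5"]

# Rows are data exchange scenarios; True is the assumed meaning of "X".
MATRIX = {
    "scenario_1": [True, True, True, True, True],
    "scenario_2": [True, False, True, True, True],
    "scenario_3": [False, True, False, False, False],
}


def satellite_by_constraint(scenario, not_applicable="_Z"):
    """Keep the master key structure and fix unused dimensions to one code."""
    fixed = {
        concept: not_applicable
        for concept, used in zip(MASTER_CONCEPTS, MATRIX[scenario])
        if not used
    }
    return MASTER_CONCEPTS, fixed


def satellite_by_dropping(scenario):
    """Build a shorter key structure containing only the relevant dimensions."""
    return [
        concept
        for concept, used in zip(MASTER_CONCEPTS, MATRIX[scenario])
        if used
    ]


print(satellite_by_constraint("scenario_2"))
# (['SC1', 'SC2', 'SC3', 'SC4', 'SC5'], {'SC2': '_Z'})
print(satellite_by_dropping("scenario_2"))
# ['SC1', 'SC3', 'SC4', 'SC5']
```

With constraints every satellite shares the master key structure; with dropping, each satellite gets its own, shorter key.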
Example: Short-term Economic Statistics
DSD: Dimensions and attributes

CONCEPT         DESCRIPTION                                          CODE LIST ID
SUBJECT         Subject matter                                       CL_SUBJECT
MEASURE         Quantitative variable value                          CL_MEASURE
FREQ            Periodicity                                          CL_FREQ
REFERENCE_AREA  “Reference area” and/or “Counterpart area”           CL_AREA
ADJUSTMENT      Seasonal adjustment                                  CL_ADJUSTMENT
UNIT            Generic list with code values                        CL_UNIT
TIME_PERIOD     Defines the observation period

CONCEPT         ATTRIBUTES                                           CODE LIST ID
UNIT_MULT       Indicating the magnitude in the units of measurement CL_UNIT_MULT
OBS_STATUS      The observation status                               CL_OBS_STATUS
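As an illustration, the STES structure above can be written down as plain data and used to assemble a dot-separated series key of the form Dim1.Dim2.….DimN. The class and function names are hypothetical, the dimension order simply follows the table (the slides do not state the key order), and all code values in the usage example except IXNB and 2005100 are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    concept: str
    code_list: Optional[str]  # None when no code list is given in the table


# Order follows the table above (assumed to be the key order).
DIMENSIONS = [
    Component("SUBJECT", "CL_SUBJECT"),
    Component("MEASURE", "CL_MEASURE"),
    Component("FREQ", "CL_FREQ"),
    Component("REFERENCE_AREA", "CL_AREA"),
    Component("ADJUSTMENT", "CL_ADJUSTMENT"),
    Component("UNIT", "CL_UNIT"),
]
TIME_DIMENSION = Component("TIME_PERIOD", None)
ATTRIBUTES = [
    Component("UNIT_MULT", "CL_UNIT_MULT"),
    Component("OBS_STATUS", "CL_OBS_STATUS"),
]


def series_key(values):
    """Join one code per dimension into a dot-separated series key."""
    missing = [d.concept for d in DIMENSIONS if d.concept not in values]
    if missing:
        raise KeyError(f"no code given for: {missing}")
    return ".".join(values[d.concept] for d in DIMENSIONS)


print(series_key({
    "SUBJECT": "PROD",        # hypothetical code
    "MEASURE": "IXNB",
    "FREQ": "M",              # hypothetical code
    "REFERENCE_AREA": "FR",   # hypothetical code
    "ADJUSTMENT": "SA",       # hypothetical code
    "UNIT": "2005100",
}))  # PROD.IXNB.M.FR.SA.2005100
```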
DSD analysis
Design principles
• Reuse of existing code lists and future needs: adjustment, frequency, reference area, subject matter.
• Parsimony, simplicity, density: the DSD is not redundant and has a small number of dimensions. The DSD provides data for most of the cells.
• Purity: in this case the code list CL_UNIT is not pure, but it adds to simplicity.
DSD analysis
Design principles
• Unambiguousness and orthogonality: the code list CL_MEASURE seems to be ambiguous, and CL_UNIT and CL_MEASURE overlap.
• Exhaustiveness: it is possible to identify all data in the flow.
DSD analysis
Dimensions
• The DSD includes the dimension MEASURE to differentiate the indicators expressed as an index number from the rest.
• This item was added to the DSD as an independent dimension although, by its nature, it could be incorporated into the UNIT dimension (code list CL_UNIT).
• The code list of the UNIT dimension includes codes of different natures:
  Physical unit measures
  Monetary units
  Several base periods for index numbers
DSD analysis
Code lists

CL_MEASURE
Description: A summary (means, mode, total, index, etc.) of the individual quantitative variable values for the statistical units in a specific group (study domains).
Code  Description
ST    Number, rate, value
IXNB  Index

CL_UNIT
Description: Generic list with code values (including currency, base period, measures)
Code     Description
1995100  1995=100
2000100  2000=100
2003100  2003=100
2005100  2005=100
2008100  2008=100
2010100  2010=100
AUD      Australian Dollar
BPA      Barrels per day
BPM      Barrels per month
BRL      Brazilian Real
CAD      Canadian Dollar
CHF      Swiss Franc
CLP      Chilean Peso
CNY      Yuan Renminbi
CZK      Czech Koruna
DKK      Danish Krone
DW       Dwellings
EUR      Euro
GBP      Pound Sterling
GWH      Gigawatt hour
HUF      Forint
IDR      Rupiah
ILS      New Israeli Sheqel
INR      Indian Rupee
ISK      Iceland Krona
JB       Jobs
DSD analysis
Suggestions
• Eliminate the MEASURE dimension.
• Add the code IXNB = Index number to CL_UNIT, so that indicators expressed as indices can be identified.
• Eliminate the codes for base period from CL_UNIT.
• Create a new concept to specify the base period, with its own code list / format.
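Applied to a single key, these suggestions could look like the sketch below. It is a minimal illustration under stated assumptions: the new concept is given the hypothetical name BASE_PER, its format is assumed to be the four-digit base year, "_Z" is an assumed "not applicable" value, and base-period codes are recognised by the pattern "year followed by 100" seen in CL_UNIT (for example 2005100).

```python
import re

# Base-period codes in CL_UNIT look like "2005100", meaning 2005=100.
BASE_PERIOD_CODE = re.compile(r"^(\d{4})100$")


def restructure(key):
    """Return a copy of the key that follows the suggested structure."""
    # Suggestion 1: eliminate the MEASURE dimension.
    new_key = {concept: code for concept, code in key.items() if concept != "MEASURE"}
    match = BASE_PERIOD_CODE.match(key["UNIT"])
    if match:
        # Suggestions 2 to 4: the unit becomes IXNB and the base year
        # moves to its own concept (hypothetical name BASE_PER).
        new_key["UNIT"] = "IXNB"
        new_key["BASE_PER"] = match.group(1)
    else:
        new_key["BASE_PER"] = "_Z"  # assumed "not applicable" value
    return new_key


print(restructure({"SUBJECT": "PROD", "MEASURE": "IXNB", "FREQ": "M", "UNIT": "2005100"}))
# {'SUBJECT': 'PROD', 'FREQ': 'M', 'UNIT': 'IXNB', 'BASE_PER': '2005'}
print(restructure({"SUBJECT": "PROD", "MEASURE": "ST", "FREQ": "M", "UNIT": "EUR"}))
# {'SUBJECT': 'PROD', 'FREQ': 'M', 'UNIT': 'EUR', 'BASE_PER': '_Z'}
```

The SUBJECT and FREQ codes in the usage lines are made up for the example.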