Integrated Data Infrastructure and prototype


Crown copyright ©

This work is licensed under the Creative Commons Attribution 3.0 New Zealand licence. You are free to copy, distribute, and adapt the work, as long as you attribute the work to Statistics NZ and abide by the other licence terms. Please note you may not use any departmental or governmental emblem, logo, or coat of arms in any way that infringes any provision of the Flags, Emblems, and Names Protection Act 1981. Use the wording 'Statistics New Zealand' in your attribution, not the Statistics NZ logo.

Liability

While all care and diligence has been used in processing, analysing, and extracting data and information in this publication, Statistics New Zealand gives no warranty it is error free and will not be liable for any loss or damage suffered by the use directly, or indirectly, of the information in this publication.

Citation

Statistics New Zealand (2012). Integrated Data Infrastructure and prototype. Available from www.stats.govt.nz. ISBN 978-0-478-37754-5 (online)

Published in February 2012 by Statistics New Zealand Tatauranga Aotearoa, Wellington, New Zealand

Contact

Statistics New Zealand Information Centre: [email protected] Phone toll-free 0508 525 525 Phone international +64 4 931 4610 www.stats.govt.nz


Contents

Purpose and summary
  Purpose
  Summary

Background to the Integrated Data Infrastructure
  Purpose of new infrastructure and prototype
  Integrated datasets
  Usability limitations with the current integrated datasets
  Funding for the Integrated Data Infrastructure

Benefits of the new infrastructure
  Potential for new official statistics
  Providing new opportunities for research
  Meeting strategic priorities

Linking of datasets in the prototype
  Data sources used in the Integrated Data Infrastructure
  Linking process
  Link results for the four stages of linking

Privacy issues for the prototype and Integrated Data Infrastructure
  Unique identifiers for linking
  Data storage
  Use of the data
  Data retention
  Confidentiality

Next steps: building the Integrated Data Infrastructure

References

Figure 1 Integrated Data Infrastructure


Purpose and summary

Purpose

The purpose of this paper, Integrated Data Infrastructure and prototype, is to:

• highlight the successful completion of the Integrated Data Infrastructure prototype (IDI prototype)

• announce that Statistics New Zealand has started developing the Integrated Data Infrastructure (IDI) which, when completed, will replace and enhance the prototype

• outline potential benefits of the new infrastructure.

The sections of the paper are:

• Background to the Integrated Data Infrastructure

• Benefits of the new infrastructure

• Linking of datasets in the prototype

• Privacy issues for the prototype and Integrated Data Infrastructure

• Next steps: building the Integrated Data Infrastructure.

Summary

Statistics NZ has previously undertaken a number of projects that integrate data supplied by different government agencies. However, these datasets currently exist independently of each other, which limits the use of the data.

The IDI and prototype address the limitations of current integrated datasets. Statistics NZ developed the IDI prototype using Migrant Levy funding received as a result of a Department of Labour (DoL) proposal to Cabinet. The IDI prototype consolidates current integrated datasets managed by Statistics NZ, and links Department of Labour migration and international movements data with a link through to Statistics NZ’s Longitudinal Business Database (LBD).

Further funding as part of the programme of work known as Statistics 2020 Te Kāpehu Whetū (Stats 2020) has allowed Statistics NZ to allocate resources to build the IDI, which is due to be completed in June 2015.

This new infrastructure will enable research across domains to produce more powerful statistics not currently possible from isolated datasets. It will also allow for the investigation of currently unanswerable questions and will allow us to integrate additional datasets in future in response to user demands for more or new information.

Developing the IDI helps Statistics NZ meet its Stats 2020 strategic priorities as well as provide new opportunities for research.

Results from the linking process used for the IDI prototype indicate that the match rates are sufficiently high to enable research using the IDI prototype, even before the IDI is fully developed. Further work is continuing to enhance the linking process as part of the IDI development.

Privacy issues for the prototype and the fully developed infrastructure have been assessed; see Privacy Impact Assessment for the Integrated Data Infrastructure.

The fully developed IDI will provide a more systematic approach to longitudinal data linkage across the Official Statistics System.


Background to the Integrated Data Infrastructure

This section outlines the purpose of the Integrated Data Infrastructure and background to developing the IDI prototype.

Purpose of new infrastructure and prototype

The Integrated Data Infrastructure will create an integrated data environment with longitudinal microdata about individuals, households, and firms that researchers can access to answer research, policy, and evaluation questions to support informed decision making. The IDI prototype is being used by Statistics NZ to test, analyse, and develop the IDI, and its processes and outputs. The prototype is also being used by researchers to undertake research a number of years before the IDI is completed.

Integrated datasets

A 1997 Cabinet directive stated that “where datasets are integrated across agencies from information collected for unrelated purposes, Statistics NZ should be custodian of these datasets in order to ensure public confidence in the protection of individual records” (Cabinet minutes, 1997).

Since then, Statistics NZ has undertaken a number of projects that integrate datasets supplied by different government agencies. These projects include:

• Student Loans and Allowances integrated dataset (SLA) – source data supplied by Inland Revenue, Ministry of Social Development (MSD) (StudyLink), and Ministry of Education.

• Linked Employer–Employee Data (LEED) – source data supplied by Inland Revenue and linked to the Statistics NZ Business Frame.

• LEED–MSD benefits dynamic data (BDD) – benefit dynamics data supplied by MSD and linked to LEED.

• Longitudinal Business Database prototype (LBD) – based on a longitudinal business frame, with data from Statistics NZ business surveys or other outputs and external administrative data from other government agencies.

• Employment Outcomes of Tertiary Education (EOTE) feasibility study – secondary and tertiary education data supplied by Ministry of Education and linked to LEED.

• LEED–Household Labour Force Survey (HLFS) feasibility study – source data is HLFS data collected by Statistics NZ linked to LEED.

Usability limitations with the current integrated datasets

The SLA, EOTE, LEED–MSD, LEED–HLFS, and LBD integrated datasets currently exist in independent environments within Statistics NZ. These environments have limitations which affect the usability of the data. The environments:

• are inflexible

• make data integration difficult because of their separate and independent approaches to storing data.

The integrated datasets within these environments are also unable to efficiently handle frequent changes in administrative data in response to policy or real world changes.


Funding for the Integrated Data Infrastructure

In March 2011, Cabinet agreed to a Department of Labour proposal for Migrant Levy funding for Statistics NZ to integrate the Department of Labour’s migration datasets with Statistics NZ’s integrated datasets.

As a result, Statistics NZ created the IDI prototype that consolidates current integrated datasets managed by Statistics NZ, and links Department of Labour migration and international movements data with a link through to Statistics NZ’s Longitudinal Business Database (LBD). The combined components form the IDI prototype.

Statistics NZ also received funding in Budget 2011 to invest in the organisation-wide programme of change, known as Statistics 2020 Te Kāpehu Whetū (Stats 2020). This funding enabled Statistics NZ to allocate resources to build the Integrated Data Infrastructure, due for completion in June 2015.


Benefits of the new infrastructure

This section outlines the benefits of having the new infrastructure:

• Potential for new official statistics

• Providing new opportunities for research

• Meeting strategic priorities
  o Relevant, trustworthy, accessible information
  o More value from official statistics
  o Transform how statistics are delivered
  o Customer focus.

Potential for new official statistics

The completed IDI will offer the potential for new and/or improved official statistics. These possibilities will only be considered after the IDI is completed in June 2015.

Providing new opportunities for research

The IDI will enable a wide range of research opportunities to answer research, policy, and evaluation questions to support informed decision making.

Here are examples that the Department of Labour and Ministry of Education are considering.

Department of Labour

The Department of Labour is considering the possibility of using the IDI to:

• monitor employment and earnings outcomes, and benefit uptake, of all immigrants over the last 10 years and into the future

• track the labour market pathways of immigrants

• identify characteristics associated with different economic outcomes for immigrants

• determine whether there are links between the employment of immigrants and improved workforce productivity and international trade

• determine the educational and demographic characteristics of emigrants.

The information would be used to:

• inform immigration and broader labour market policy advice

• contribute to business improvements to improve client services

• develop new official statistics

• inform government accountability and other required reports.

The Department of Labour also sees the IDI as having a number of broader benefits. It will enhance understanding of net migration flows, as the employment, earnings, and education status of New Zealanders can now be examined. The IDI will also help monitor the outcomes the Government is seeking through its economic growth agenda and its strategic direction for immigration. These outcomes include boosting productivity, filling skill gaps, increasing innovation, and increasing the numbers of New Zealand firms participating in the global economy.


Ministry of Education

The Ministry of Education is interested in analysing what happens to students who leave school but do not take up tertiary education. Using data from the IDI would help answer questions such as “What are the characteristics of these students?” and “How do their situations in future (eg financial, lifestyle) compare with those who had tertiary education?”

The Ministry sees the potential of using data from the IDI to explore the factors associated with higher earnings in order to assess the value of tertiary education.

The Ministry is also interested in analysing post-study employment and earnings of early childhood education (ECE) teaching graduates. The Ministry could use data from the IDI to determine how many teaching graduates go on to work in the ECE sector, how many remain in the sector, and how their earnings compare with teaching graduates who are not in ECE. Such information would help monitor the quality of the ECE sector by providing insights into the make-up and attractiveness of the sector.

Meeting strategic priorities

Further benefits of the IDI are grouped by Statistics NZ’s four strategic priorities established under the Stats 2020 programme of work.

Relevant, trustworthy, accessible information

(Strategic priority 1: Lead the official statistics system so that it efficiently meets the country’s needs for relevant, trustworthy, accessible information.)

• The IDI will create a statistical infrastructure that links administrative data from several government departments and Statistics NZ survey data. It will allow linking of individual and business-level data and provide a more systematic approach to longitudinal data linkage across the Official Statistics System.

• Having better access to this information will allow users to tell more detailed stories about people and businesses in New Zealand and investigate currently unanswerable questions. For example, it will enable users to analyse the transitions and outcomes of people through the secondary and tertiary education system, the labour market, and the benefit system, with additional links to external migration and business data.

More value from official statistics

(Strategic priority 2: Obtain more value from official statistics.)

• Integrating data across domains will meet user demands for more powerful statistics than is currently possible from isolated datasets. This integration will occur without creating any extra burden on respondents.

• The IDI will enhance the research opportunities available for developing and costing social and economic policy as well as evaluating government programmes. For example, data sourced from the IDI will improve our knowledge about the transitions and outcomes of people from the tertiary education system and the labour market, and inform government decisions on investment in tertiary education.

• Creating a linked statistical infrastructure will improve access to existing information sources, encouraging wider use and reuse of data.


Transform how statistics are delivered

(Strategic priority 3: Transform the way Statistics NZ delivers statistics.)

• The IDI will be a standardised but flexible data integration environment for longitudinal data linkage. It will provide an environment for integrating further datasets in future in response to user demands for more or new information.

• The infrastructure will maximise the use of existing administrative and survey data, and will:
  o reduce the time and cost of delivering new statistics
  o be able to respond to changes in existing data sources more efficiently.

Customer focus

(Strategic priority 4: Create a responsive, customer-focused, influential, and sustainable organisation.)

The IDI will provide Statistics NZ with a new capability for linking, research, and analysis. It will:

• improve our ability to respond to changes in the world around us

• help us lead and work with other government departments to provide statistical services

• provide a flexible, common platform for our analysts that will enable them to develop and maintain skills and expertise in data integration and linked data analysis.


Linking of datasets in the prototype

This section covers:

• Data sources used in the Integrated Data Infrastructure

• Linking process

• Link results for the four stages of linking.

While the overall finding was that the link rates are sufficiently high to enable researchers to use data from the IDI prototype, further work is continuing to improve the linking process as part of the IDI development. A detailed report about the linking process will be available on the Statistics NZ website in mid-2012.

Data sources used in the Integrated Data Infrastructure

The chief executives of Inland Revenue, Statistics NZ, Ministry of Education, Department of Labour, and Ministry of Social Development signed a business case that proposed the creation of the IDI. This authorised the use of their organisations’ administrative data for the infrastructure.

Statistics NZ will work with source agencies to create data supply agreements that secure ongoing data for the IDI.

The following components form the IDI and IDI prototype (see figure 1):

• iLEED (integrated Longitudinal Employment and Education Data)

• migration and movements data

• LBD data.

Figure 1: Integrated Data Infrastructure

[Diagram not reproduced. Labelled components: business data (person to business link); education, secondary and tertiary (Ministry of Education); EMS, self-employed (Inland Revenue); student loans and allowances (Inland Revenue and Ministry of Social Development (StudyLink)); SLAM; HLFS/NZIS survey; benefits (Ministry of Social Development) (BDD); migration data (Department of Labour); LBD; LEED; Central Linking Concordance (CLC). Labelled outputs: relevant releases, dynamic datasets, cutting edge cubes, powerful research.]

Abbreviations:

• iLEED: Integrated Longitudinal Employment and Education Data

• LBD: Longitudinal Business Database

• SLAM: Student Loans Account Manager

• EMS: Employer Monthly Schedule

• HLFS: Household Labour Force Survey

• NZIS: New Zealand Income Survey

• BDD: Benefits Dynamic Dataset


The iLEED component of the IDI consolidates data from the following existing integrated datasets:

• Linked Employer–Employee Data (LEED)

• LEED–Ministry of Social Development benefits dynamic data

• LEED–Statistics NZ’s Household Labour Force Survey data

• Student Loans and Allowances integrated dataset

• Employment Outcomes of Tertiary Education data.

Linking process

The processes used to link data in the IDI prototype were based on Statistics NZ’s previous experience of linking datasets, which included linking individual datasets to create the SLA, LEED–MSD, EOTE, and LEED–HLFS datasets.

During the development of the IDI prototype, we also undertook conceptual work to establish a process for linking the datasets.

Preliminary investigation of the Department of Labour migration data indicated that the linking information was of sufficient quality for the data to be integrated with Statistics NZ’s iLEED datasets.

The linking process describes the way in which Statistics NZ combines information about the same individual across data sources. Due to the number of data sources, the linking process was split into four distinct stages. These were:

• Linking Department of Labour migration data with Inland Revenue tax system data (DoL–IR)

• Linking Inland Revenue, Ministry of Social Development, and Student Loans Account Manager (SLAM) student loans and allowances data with Ministry of Education (MoE) data (SLA–MoE)

• Linking Ministry of Education data with Inland Revenue tax data (MoE–IR)

• Linking Statistics NZ Household Labour Force Survey data with Inland Revenue tax data (HLFS–IR).

Statistics NZ used probabilistic linking for the linking process. Probabilistic linking is the process of determining the likelihood that two records from different files belong to the same person. Records were linked using demographic variables including name, date of birth, and sex. In some cases, data sources have extra identifiers that allow for a near exact link, such as IRD numbers, passport codes, and provider code/student id code combinations.
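As a rough illustration of probabilistic linking, the sketch below scores a candidate pair of records by summing agreement weights over name, date of birth, and sex, then accepts the pair as a link if the score clears a threshold. The log-likelihood-ratio weighting (in the style of Fellegi and Sunter), the m and u probabilities, the threshold, the field names, and the sample records are all assumptions made for illustration. They are not the method or parameters Statistics NZ used, which this paper does not describe.

```python
import math

# Hypothetical agreement probabilities per field (illustrative values only).
# m = P(field agrees | records belong to the same person)
# u = P(field agrees | records belong to different people)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "sex": (0.99, 0.5),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum log2 likelihood ratios over fields.

    Agreement on a field adds a positive weight; disagreement adds a
    negative weight.
    """
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def is_link(rec_a, rec_b, threshold=8.0):
    """Accept the pair as a link when its weight reaches the (assumed) threshold."""
    return match_weight(rec_a, rec_b) >= threshold


# Made-up records for demonstration only.
migrant = {"name": "ana pereira", "date_of_birth": "1980-02-14", "sex": "F"}
taxpayer = {"name": "ana pereira", "date_of_birth": "1980-02-14", "sex": "F"}
stranger = {"name": "ben smith", "date_of_birth": "1975-07-01", "sex": "F"}

print(is_link(migrant, taxpayer))
print(is_link(migrant, stranger))
```

In a scheme like this, a discriminating field such as date of birth contributes far more to the score than sex, which agrees by chance about half the time. A shared unique identifier, where one exists, would carry a very large weight, which is one way to picture the near exact links mentioned above.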

Link results for the four stages of linking

Statistics NZ accepts that for each of the four linking stages, the overall link rates will be less than 100 percent. This is because there will be an unknown proportion of people in a population who are not in the corresponding linking population and vice versa. For example, the Department of Labour population will include short-term visitors and Australian citizens who are unlikely to have registered with Inland Revenue. Therefore, we do not expect everyone in the Department of Labour population to link with the Inland Revenue population.


Types of possible linking errors

Any linking exercise can have two types of errors that can reduce the quality of the links established.

Firstly, there is a possibility of records being linked that in reality are not the same. This type of error is referred to as a ‘false positive’. The degree of false positives was assessed by inspecting a sample of matches. Results from this clerical false positive check are detailed below for each of the four stages of the linking process.

Secondly, there is a possibility of two records for the same person not being linked. This type is referred to as a ‘false negative’. Due to the nature of false negatives it is difficult to quantify them.
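A clerical check of this kind can be summarised as a simple proportion: the number of sampled links judged incorrect divided by the number of links inspected. The sketch below computes that estimate and adds a Wilson score interval to show the sampling uncertainty. The sample size, the count of incorrect links, and the choice of interval are assumptions for illustration. The paper does not report how large the clerical samples were or whether intervals were calculated.

```python
import math


def false_positive_estimate(sample_size, incorrect_links, z=1.96):
    """Estimate the false positive rate from a clerically reviewed sample.

    Returns (rate, lower, upper), where lower and upper are the bounds of
    a Wilson score interval (z=1.96 gives roughly 95 percent coverage).
    """
    if sample_size <= 0 or not 0 <= incorrect_links <= sample_size:
        raise ValueError("need 0 <= incorrect_links <= sample_size and sample_size > 0")
    p = incorrect_links / sample_size
    denom = 1 + z ** 2 / sample_size
    centre = (p + z ** 2 / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2))
        / denom
    )
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical review: 500 sampled links, 2 judged to join different people.
rate, low, high = false_positive_estimate(500, 2)
print(f"estimated false positive rate: {rate:.1%} (interval {low:.1%} to {high:.1%})")
```

The same calculation does not work for false negatives, because missed links never appear in the set of links that can be sampled.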

Example linking results

Here are some link rates for the four distinct stages of the linking process for the IDI prototype. Further technical detail on the linking process and more detailed linking results will be available in mid-2012.

DoL–IR link:

• 77 percent of the DoL population of New Zealand citizens and residents linked to the Inland Revenue population

• 95 percent of people who arrived in New Zealand as the principal applicant on an approved work-skilled visa and stayed in NZ for one year linked to the Inland Revenue population

• the false positive rate for the total number of links created is estimated to be 0.3 percent.

SLA–MoE link:

• 94 percent of the SLA population (a merged population of Inland Revenue, MSD and SLAM data) linked to the Ministry of Education population

• the false positive rate for the total number of links created is estimated to be 0.1 percent.

MoE–IR link:

• 73 percent of the Ministry of Education population linked to the Inland Revenue population

• the false positive rate for the total number of links created is estimated to be 1.1 percent.

HLFS–IR link:

• name information is required in order to create a link between Statistics NZ HLFS data and data from the Inland Revenue tax system

• approximately 78 percent of people with name information linked to the Inland Revenue population

• the false positive rate for the total number of links created is estimated to be around 9.4 percent.

Overall, the link rates for the four stages of linking are sufficiently high to enable researchers to use data from the IDI prototype. Development of the IDI will include work to further improve its linking process.


Privacy issues for the prototype and Integrated Data Infrastructure

This section summarises the privacy aspects we considered as part of developing the IDI prototype. They are also relevant for the IDI.

The development of the IDI has followed appropriate security and access protocols to ensure that IDI data is protected and public confidence is maintained. Only individuals with a bona-fide statistical research proposal will be able to access IDI data.

The privacy risks associated with the data integration and the processes for managing these risks are discussed in detail in the Privacy Impact Assessment for the Integrated Data Infrastructure, available from www.stats.govt.nz.

Statistics NZ has discussed the privacy impact assessment with the Office of the Privacy Commissioner and addressed its feedback.

Unique identifiers for linking

All data integration requires some form of identifying information to ensure the data is effectively linked. While a variety of identifying information can be used for linking (eg name, sex, and date of birth), the use of unique identifiers generally gives the most accurate link.

This means that particular care is needed to comply with information privacy principle 12 of the Privacy Act 1993, which prevents an agency from assigning a unique identifier used by another agency. Statistics NZ has worked in conjunction with the Office of the Privacy Commissioner to ensure that its data integration work complies with principle 12.

Several new unique identifiers were created in the IDI prototype. It is these transformed unique identifiers that will be available for analysis and research. All original unique identifiers supplied by agencies have been removed.
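One way to picture this step is a concordance that maps each agency-supplied identifier to a newly generated one, after which the original identifier is dropped from the research data. The sketch below is an assumption-laden illustration only: the paper does not describe how the new identifiers were generated, and the random-token approach, the function names, the field names (`agency_id`, `research_id`), and the sample rows are all hypothetical.

```python
import secrets


def build_concordance(source_ids):
    """Map each distinct source identifier to a new random identifier.

    The new identifier is random, so it cannot be reversed to recover the
    original without access to the concordance itself.
    """
    concordance = {}
    used = set()
    for source_id in source_ids:
        if source_id in concordance:
            continue
        new_id = secrets.token_hex(8)
        while new_id in used:  # regenerate on the (very unlikely) collision
            new_id = secrets.token_hex(8)
        used.add(new_id)
        concordance[source_id] = new_id
    return concordance


def replace_identifiers(records, id_field, concordance, new_field="research_id"):
    """Return copies of the records with the source identifier removed
    and the new identifier added."""
    cleaned = []
    for record in records:
        row = {key: value for key, value in record.items() if key != id_field}
        row[new_field] = concordance[record[id_field]]
        cleaned.append(row)
    return cleaned


# Made-up rows for demonstration only.
records = [
    {"agency_id": "A-001", "year": 2010, "earnings": 41000},
    {"agency_id": "A-001", "year": 2011, "earnings": 43500},
    {"agency_id": "A-002", "year": 2011, "earnings": 38000},
]
concordance = build_concordance(r["agency_id"] for r in records)
research_rows = replace_identifiers(records, "agency_id", concordance)
```

Rows for the same person keep a common identifier, so longitudinal analysis remains possible, while the agency's own identifier no longer appears in the data made available for research. In a design like this, the concordance would itself be sensitive and would need to be held under tighter access than the research data.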

Data storage

The Privacy Act 1993 requires that all reasonable steps be taken to ensure that personal information held by an agency is protected against:

• loss

• unauthorised access, use, modification, or disclosure

• other misuse.

Statistics NZ’s standard security measures and protocols govern the management of the data used in the IDI. Statistics NZ is required to comply with the confidentiality provisions of the Statistics Act 1975 and also protocols for security in the government sector (Department of Prime Minister and Cabinet, 2002).

The IDI also complies with the Inland Revenue requirements for the use of tax data.


Statistics NZ has well-established policies, procedures and systems in place to ensure adequate measures of physical and electronic security. Other security measures for the IDI include the following arrangements:

• All data collections and associated electronic workspaces will be secured (access will only be authorised for project personnel who need to access data for specific tasks, and to selected IT administrators who are required to maintain the IT system).

• Datasets will not be available to third parties.

• Regular audits of individuals able to access the dataset will be made.

Use of the data

The infrastructure will:

• provide data to be used only for statistical purposes, which will include research undertaken by Statistics NZ employees, secondees, and bona-fide research undertaken in the Statistics NZ Data Lab

• not provide operational data for administrative purposes, though the data could be used in an operational way.

Any person viewing data with raw unique identifiers or unidentifiable individual data will do so for Statistics NZ purposes. They will be required to read, sign, and comply with Statistics NZ and Inland Revenue declarations of secrecy, which are binding for life.

These declarations effectively place the same obligations, responsibilities, and potential sanctions on external agency staff who are seconded to Statistics NZ, or provided with access through the Statistics NZ Data Lab, as apply to Statistics NZ and Inland Revenue employees.

Statistics NZ has addressed potential concerns from individuals regarding the use of their information by ensuring that:

• data is anonymised as early as possible during processing

• access to the data is subject to Statistics NZ and Inland Revenue protocols

• data is used only for statistical purposes

• no identifiable information is released

• access is provided in accordance with Statistics NZ’s security standards, as required under the government-directed security framework.

Data retention

The IDI prototype will provide a valuable research environment for statistical analysis and research. Statistics NZ will retain the prototype until the Integrated Data Infrastructure has been fully developed, when it will replace and enhance the prototype.

If, during the development of the infrastructure, the benefits derived from the IDI data no longer outweigh privacy concerns and risks, then the prototype and the (partly) developed IDI will be archived or destroyed.

Confidentiality

Confidentiality refers to the legal obligation Statistics NZ has to protect information provided by individuals and businesses. Statistics NZ has a strong culture around ensuring the confidentiality of information entrusted to us.

Statistics NZ has developed rules for the IDI that apply to potential research outputs to protect the confidentiality of information providers. These rules also take into consideration the existing confidentiality rules for each contributing dataset.


Next steps: building the Integrated Data Infrastructure

Statistics NZ has successfully created the IDI prototype. Upcoming work as part of Statistics 2020 Te Kāpehu Whetū will focus on building the Integrated Data Infrastructure. This will involve:

• using the IDI prototype to inform the creation of the IDI

• continuing to improve the linking methodology

• redeveloping the LEED and SLA systems.

The IDI will maximise the use of existing administrative and survey data and will be able to respond more efficiently to changes in existing data sources than is currently possible. It will also allow the integration of further datasets in future in response to user demands for more or new information.


References

Cabinet meeting minutes (1997). CAB (97) M 31/4. [Electronic copy not available].

Department of Prime Minister and Cabinet (2002). Security in the government sector. Available from www.nzsis.govt.nz.

Privacy Act 1993. Available from www.legislation.govt.nz.

Statistics Act 1975. Available from www.legislation.govt.nz.

Statistics NZ (2012). Privacy Impact Assessment for the Integrated Data Infrastructure. Available from www.stats.govt.nz.