

Making Data and Information Available for the Long Term

The International Polar Year

Ruth Duerr, Mark A. Parsons, Ron Weaver, Jane Beitler
http://nsidc.org/

The Case of the Missing 2007/8 IPY Data
A Cautionary Tale from the Future…

The date: April 22, 2057
The place: World Climate Change Management Bureau (CLIMB), Geneva, Switzerland

The 3-vid ran on for several minutes, shockingly. The Yangon food riots of '55 were enough to sober anyone, even this hardened group of administrators. As the story moved on to the causes of the crisis and off the faces of the people in the streets, there was a palpable easing of tension in the group - for all except Charles M'Bwayo Rupert, the world's chief climatologist. He alone knew that the region was far from being out of the woods. As the video drew to a close, he rose to speak. He was far from having an answer.

“Madam Chairman, we have determined that the problem started with the IPY data collected in 2007-2009. If you remember, this was the first of the Polar Years where the collected data were born-digital. Unfortunately, we have not been able to recover most of these data.”

“That’s ridiculous!” exclaimed Dr. Marjory Hunter, the Chair of the 14th IPCC Group 1 commission. “How can that be?”

“Well,” sighed Rupert, “the first problem we encountered was that the IPY Data and Information System (IPY-DIS) catalog is no longer available. Apparently, sometime in the mid-twenties, funding for the IPY-DIS was cut in favor of supporting the later high-resolution sensors. While the parent organization kept the system running as long as it could, eventually that became technically infeasible. They simply couldn't keep pace with the technology. In those days, it took only a few years for even the most advanced data system to become totally obsolete.”

“And this led to the GCM model issues?” asked Hunter.

“Yes; the current trends in Asian and Tibetan snowpack, and their influence on Arctic sea ice and the monsoons, began just before the IPY era - but the huge collections that occurred as a part of that program give us a view of the climate/precip patterns that wasn't equaled until three full decades later.”

“We need that data to understand the current pattern - if we're headed for another central Asian drought, we've got to get every arable acre on Earth planted in the next two months. New Delhi and Shanghai just barely scraped by last time…”

“We were able to find references to nearly a dozen data sets in the literature; however, most of the journal articles written about these data failed to reference the actual data set itself,” sighed Rupert. “I just don't understand why scientists back then were so adamantly opposed to citing data sets! Didn't they realize that the scientific method requires that the data be available so that the work can be verified?”

After visibly calming himself, Dr. Rupert continued, “In any case, of the dozen data sets, we were able to track down only six. Of those, one had been held by the PI. He had made it available for years, but no one knows what happened to the data after he died. We tried his Department and the University's library, but if they ever had it they must have purged it - they don't have any record of it now.”

“We do know what happened to the second data set. Up until the Civil War, it had been archived by the Tirzmolian Institute. However, these data were apparently backed up nowhere else in the world, so when the rebels destroyed all sites of high technology in that country, all copies were destroyed.

“We were able to obtain the original tapes for the third data set, but we haven’t been able to locate a machine capable of reading them. The tapes themselves look pretty good. We think some of them are probably still readable. It’s possible we could build a machine to extract the bit stream off the tapes. The tape technology used was fairly standard back then, and we have been able to find the specification for writing a bit stream to the tape. However, it is not at all clear how the data themselves were organized on the tapes. We have reason to believe that the tapes were once part of one of the large automated tape libraries so common back then. One problem with those libraries was that, as commercial products, they typically used proprietary formats to organize the data when writing to tape. Most of those vendors went out of business when holographic storage became popular. We have had very little luck in finding formatting specifications for these types of systems.

“We were also able to obtain the fourth data set. Fortunately, these data were stored on one of the few media that we can still read today, but the data files were written in eMeSS Data Version 2.0 format. The last application capable of reading that went out of production nearly 20 years ago. And again, it was a proprietary format, so we haven’t been able to find a specification with which to build a new reader application.

“We didn't have that problem with the fifth data set. Amazingly enough there are still applications that can read these data. Unfortunately the representational metadata that would have allowed us to interpret the data are missing. As a result, we are not sure quite what these data mean.” At that point, Dr. Rupert hesitated.

“Well!” said Madam Chairman. “Go on. You said you found six data sets. What happened to the last one?”

“Well… we have the sixth data set, and we can read it,” said Dr. Rupert somewhat hesitantly. “We even have the representational metadata, so we know what the data mean; but we aren’t sure these data can be trusted. We don't quite know what to think. We haven't been able to track down a clear record of custody for the data and, well, the data just look wrong. We are wondering if they have been altered over time or even whether they were ever accurate.”

The Chairman sat, thought for a minute, then asked, “So, are you telling me that we can't recover any of the IPY data from 2007 and 8?”

Dr. Rupert frowned. “Well, yes, I think that is so.”

“What was the matter with these people,” exploded the Chairman. “Had they no pride? Where is their legacy? Didn't they understand their obligation to future users?”

“Uh,” replied Dr. Rupert, “I don't know if that was the problem. Even today, data management is often an afterthought, something to be funded after everything else - if there is enough funding left. Finding funding for the long term is still a problem. So how can we blame them?”


1 Evolving Technology: Change is the Only Constant!

NSF and the Library of Congress said it best in their report It’s About Time: Research Challenges in Digital Archiving and Long-Term Preservation (2003):

“digital objects require constant and perpetual maintenance, and they depend on elaborate systems of hardware, software, data and information models, and standards that are upgraded or replaced every few years”

The IPY needs to carefully consider how to prevent our cautionary tale from becoming reality. This requires a long-term commitment. Data and documentation need to be durable enough to withstand constant technological change.

2 Data Citations: The Evidence Whodunit

What is a Data Citation?

• A mechanism to properly credit the creator of a data set
• A mechanism to credit the publisher of the data set
• A mechanism to allow your readers to find the data you used in your paper

What do they look like?

• Like a book or paper reference (see examples below)
• Hall, D.K., G.A. Riggs, and V.V. Salomonson. 2000, updated daily. MODIS/Terra Snow Cover 5-Min L2 Swath 500m V004, September - December 2003. Boulder, CO, USA: National Snow and Ice Data Center. Digital media.
• Armstrong, R., J. Francis, J. Key, J. Maslanik, T. Scambos, and A. Schweiger. 1998. Polar Pathfinder sampler: Combined AVHRR, SMMR-SSM/I, and TOVS time series and full-resolution samples. Compiled by S. Khalsa. Boulder, CO, U.S.A.: National Snow and Ice Data Center. CD-ROM.
• If Digital Object Identifiers (DOIs) are available for the data, they should be included in the citation.

Data centers should be able to provide the proper citation for the data.
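As a rough illustration of how a data center might generate such citations, the sketch below assembles a citation string from the same parts that appear in the two examples above. The class name, field names, and the "doi:" suffix style are assumptions made for illustration; they are not a format prescribed by NSIDC or the IPY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Parts of a data set citation, modeled on the examples above (hypothetical class)."""
    authors: str                # e.g. "Hall, D.K., G.A. Riggs, and V.V. Salomonson"
    year: str                   # may carry a note such as "2000, updated daily"
    title: str
    place: str
    publisher: str
    medium: str                 # e.g. "Digital media" or "CD-ROM"
    doi: Optional[str] = None   # assumption: appended only when a DOI exists

    def format(self) -> str:
        # Each part ends with a period, as in a book or paper reference.
        parts = [
            f"{self.authors}.",
            f"{self.year}.",
            f"{self.title}.",
            f"{self.place}: {self.publisher}.",
            f"{self.medium}.",
        ]
        if self.doi:
            parts.append(f"doi:{self.doi}.")
        return " ".join(parts)


# Rebuild the first example citation from its parts.
hall = DataCitation(
    authors="Hall, D.K., G.A. Riggs, and V.V. Salomonson",
    year="2000, updated daily",
    title="MODIS/Terra Snow Cover 5-Min L2 Swath 500m V004, September - December 2003",
    place="Boulder, CO, USA",
    publisher="National Snow and Ice Data Center",
    medium="Digital media",
)
print(hall.format())
```

Keeping the parts as separate fields, rather than storing only the finished string, would let a data center reissue citations in whatever style a journal requires.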

3 Distributed Data Management: Sharing the Load

• Scientists rarely have the data management expertise needed to ensure that their data will be useful and available for the long term

• Scientists rarely have the resources required - for example to ensure the security and integrity of the data, to deal with off-site backups, media migration, and technical issues

4 Data Backups: Low-Cost Life Insurance

One of the lessons learned from the 9/11 disaster is that it is not enough to simply back up your data off-site; you also need to carefully consider the location of that off-site backup. After 9/11, several businesses went under because their backups were stored in the neighboring tower. All copies of their data were lost when both towers collapsed.

One of the many data management issues that the IPY-DIS needs to consider is backups. Are international backups necessary? What are the implications of the answer for the IPY data system?
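One way to make the "location of the off-site backup" question concrete is to measure how far apart the copies actually are. The sketch below is only an illustration: the class, the function names, and the 100 km default threshold are assumptions, not IPY policy. It reports the closest pair of copies and how many countries hold one.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class BackupCopy:
    site: str      # hypothetical site label
    country: str
    lat: float     # degrees north
    lon: float     # degrees east


def distance_km(a: BackupCopy, b: BackupCopy) -> float:
    """Great-circle distance between two sites (haversine formula)."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dp = p2 - p1
    dl = math.radians(b.lon - a.lon)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def backup_report(copies, min_km=100.0):
    """Summarize how well separated a set of copies is.

    min_km is an assumed threshold; a single copy is never "well separated".
    """
    pairs = list(combinations(copies, 2))
    closest = min((distance_km(a, b) for a, b in pairs), default=0.0)
    return {
        "copies": len(copies),
        "countries": len({c.country for c in copies}),
        "closest_pair_km": closest,
        "well_separated": len(copies) >= 2 and closest >= min_km,
    }
```

Two copies held in neighboring buildings would fail this check no matter how carefully each was written, which is the point of the 9/11 example.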

“Preservation without access is pointless; Access without preservation is impossible!”

- heard in the halls of NSIDC, 2004

5 Data and Metadata Structure and Content: Clues to the Content

[Figure: OAIS information package. Labels: Content Information; Preservation Description Information; Packaging Information; Package 1; Descriptive Information About Package 1.]

6 Data Integrity: A Question of Scientific Character

There are three main components to ensuring the integrity of data:

• The data must demonstrate scientific integrity - peer review of data and the data citation process are needed (see section 2).
• The data repository must be trustworthy.
• The data must not have been altered since creation (or any alterations must have been well described) - adequate attention to source authentication and data fixity is needed (see section 5).
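Data fixity, the third component above, is commonly checked by recording a checksum when a file enters the archive and recomputing it later. The following is a minimal sketch of that idea using SHA-256 from the Python standard library; the function names are hypothetical, and the choice of SHA-256 is an assumption rather than anything the IPY specifies.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hex string, reading in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unaltered(path: Path, recorded_checksum: str) -> bool:
    """True if the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == recorded_checksum.lower()
```

A matching checksum shows only that the bits have not changed since the checksum was recorded. It says nothing about whether the data were accurate to begin with, which is why peer review and a trustworthy repository are listed separately.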

IPY should adopt the OAIS reference model. IPY should discourage proprietary data formats and encourage coordination of formats.

“It’s a capital mistake to theorize without data”

- Sherlock Holmes (Sir Arthur Conan Doyle)

IPY should mandate the use of data citations and strongly encourage the use of DOIs.

The IPY-DIS should help define and promote best practices in data management, and should coordinate their implementation by all elements of the IPY program.

In the USA, investigators frequently manage their own data. While this can be OK for the short term, it rarely works for the long term. The two main issues, listed under “Distributed Data Management” above, are a lack of data management expertise and a lack of resources.

Due to the international, distributed nature of the IPY, the data and information collected will necessarily be archived and made accessible via distributed mechanisms, which poses additional data management challenges.

The Open Archival Information System (OAIS) Reference Model is an international standard for archives. It does not specify how to make data and information available for the long term; rather, it defines the information that is needed and the decisions that must be made.

The OAIS reference model defines four types of information needed in order to make data usable into the future (see the figure under section 5): Content Information, Preservation Description Information, Packaging Information, and Descriptive Information.
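The four information types can be pictured as the fields of a single record that travels with the data. The sketch below is an assumption-laden illustration, not part of the OAIS standard: the class, field, and function names are invented here, and each field is reduced to free text. It shows how an archive might flag a package that arrives with one of the four types missing.

```python
from dataclasses import dataclass, fields


@dataclass
class InformationPackage:
    """Hypothetical record holding the four OAIS information types as text."""
    # The data plus what is needed to interpret them.
    content_information: str = ""
    # Provenance, context, reference, and fixity for the content.
    preservation_description_information: str = ""
    # How the pieces of the package are bound together on the medium.
    packaging_information: str = ""
    # What a catalog needs so that users can find the package.
    descriptive_information: str = ""


def missing_information(package: InformationPackage) -> list:
    """Names of the information types that are empty in this package."""
    return [f.name for f in fields(package) if not getattr(package, f.name).strip()]


# A package like the tale's sixth data set: readable data, no record of custody.
sixth = InformationPackage(
    content_information="gridded snowpack values with variable definitions",
    packaging_information="single archive file",
    descriptive_information="catalog entry with title and coverage",
)
print(missing_information(sixth))
```

Run on the example, the check reports that the preservation description information is missing, which is the gap that left Dr. Rupert unable to trust the sixth data set.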

Raymond (2004) describes four attributes of a good data format:

• Interoperability
• Transparency
• Extensibility
• Storage economy

This indicates that a textual format is often best. Even when a binary format is necessary, it is useful to have a representative textual sample of some of the data.
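The suggestion to keep a textual sample alongside binary data can be sketched as follows. The binary layout used here (little-endian 64-bit floats with no header) and the function names are assumptions chosen for illustration. The idea is that a future reader who has only the text sample can work out the layout and check a newly written decoder against it.

```python
import struct


def pack_values(values):
    """Store values as little-endian 64-bit floats (an assumed binary layout)."""
    return struct.pack(f"<{len(values)}d", *values)


def text_sample(blob, count=3):
    """Render the first `count` values of the binary blob as plain text lines."""
    n = min(count, len(blob) // 8)  # each float64 occupies 8 bytes
    values = struct.unpack(f"<{n}d", blob[: n * 8])
    header = "# sample of first values; layout: little-endian float64"
    return "\n".join([header] + [repr(v) for v in values])


blob = pack_values([1.5, -2.25, 1.0, 4.0])
print(text_sample(blob))
```

The sample costs a few lines of storage, so it gives up little of the storage economy of the binary format while restoring some of the transparency of a textual one.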

IPY needs to establish mechanisms to ensure data integrity.
