elijah-mckinney
Why data management?
  Selfish reasons
    Work more efficiently
    Avoid data corruption and loss
  Altruistic reasons
    Facilitates data exchange
    Avoid data loss
  Altruistic = selfish in the long run
    Treat others as you want to be treated
Why conserve data?
  Moral obligation
    Price of data collection
    Uniqueness of observations
      You can’t measure a 2003 temperature in 2009
  Allow peer review and audit of results
    Cf. molecular genetics: requirement to deposit sequences in international databases (GenBank)
Tools of the trade
  Principles and attitude are more important than hardware
    In principle, dissociated from computer use
      Cf. the gigantic card indices of some libraries
    In practice, involves the use of computerised databases
      Often an RDBMS
      Not always! (GenBank, World Ocean Database)
E2EDM
  Data management starts from day α
    A data management plan should be part of any ‘project’ description
  Data management ends on day ω
    Last activity of the project: submitting the final data set to a ‘deep archive’
  ‘End-to-End Data Management’
Data?
  Results of measurements or observations
    Monitoring vs scientific
    ‘Operational’ vs delayed-mode
    Supporting data (e.g. ‘underway data’ collected automatically by research vessels)
    Not necessarily numerical (e.g. species identifications)
  Measurement scales: nominal, rank, interval, ratio
  Representation: string, boolean, integer, real
Information?
  Widely different meanings (supporting data)
    Interpreted data
    Metadata: data about data
    Data about the science rather than about the scientific subject
      E.g. bibliographies, directories
Different aspects
  Documentation and inventories
  Recording and logging procedures
  Quality control
  Exchange, redistribution
  Back up
  Archive
Documentation
  Creating information about the dataset: metadata
    What, where, objectives, limitations…
  Make available as widely as possible
    Avoid duplication
    Attract partners (scientific!)
  Store metadata together with data
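One way to keep metadata and data together is to write both into a single file. The sketch below uses JSON for this; the functions `bundle` and `unbundle`, the field names, and all example values are hypothetical and only illustrate the idea.

```python
import json


def bundle(metadata, records):
    """Combine descriptive metadata and data records into one JSON document."""
    return json.dumps({"metadata": metadata, "data": records}, indent=2)


def unbundle(text):
    """Recover (metadata, records) from a bundled JSON document."""
    document = json.loads(text)
    return document["metadata"], document["data"]


# Hypothetical example values, invented for illustration only.
example_metadata = {
    "what": "sea surface temperature",
    "where": "example station",
    "objectives": "illustration",
    "limitations": "invented values, not real observations",
}
example_records = [{"time": "2003-06-01T12:00:00Z", "temperature_c": 14.2}]

text = bundle(example_metadata, example_records)
```

Because the description travels inside the same file, a recipient cannot receive the numbers without also receiving the what, where, objectives and limitations.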
Documentation
  Different types of metadata
    Discovery
    Documentation
    Technical
  Serve different purposes, often in different systems
  Ideally ‘harvested’ from data
Inventorising
  Metadata database
    Discovery-type information
  Document not only what has been measured, but also planned campaigns
  Make inventory searchable
    Facilitate exchange of data and information
    Avoid duplication
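A searchable inventory can be very small. The sketch below keeps discovery-type metadata in an in-memory SQLite table and records both measured and planned campaigns; the table layout, the function names and the example entries are all assumptions made up for illustration.

```python
import sqlite3


def build_inventory(records):
    """Create an in-memory inventory of (title, parameter, region, status)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory "
        "(title TEXT, parameter TEXT, region TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO inventory VALUES (?, ?, ?, ?)", records)
    return conn


def search(conn, keyword):
    """Return (title, status) for entries whose text fields contain keyword."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT title, status FROM inventory "
        "WHERE title LIKE ? OR parameter LIKE ? OR region LIKE ? "
        "ORDER BY title",
        (pattern, pattern, pattern),
    )
    return rows.fetchall()


# Hypothetical entries: one planned campaign sits next to completed ones.
inventory = build_inventory([
    ("Example cruise 1", "temperature", "North Sea", "measured"),
    ("Example cruise 2", "temperature", "North Sea", "planned"),
    ("Example survey 3", "species counts", "Baltic Sea", "measured"),
])
```

Searching for a region returns planned as well as completed campaigns, which is what lets potential partners spot duplication before the work is done.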
Existing systems
  Global Change Master Directory (GCMD)
    Gcmd.nasa.gov
  IODE Marine Environmental Directory of Information (MEDI)
Recording
  Often in systems other than the final data management system
    Paper forms
      Reminder of what information should be recorded
    Spreadsheet
      Makes quality control possible during first steps
  Needs a system to control data flow
Quality control
  Automated
    Range check (impossible values)
    Statistical (improbable values)
      Danger of excluding unexpected phenomena (e.g. hole in the ozone layer, El Niño)
  Expert
    ‘Manually’: anything that requires knowledge of the subject area
    Often involves creating graphs
  Flag, don’t delete
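The automated checks and the 'flag, don't delete' rule can be sketched in a few lines. The flag codes, the function name `flag_values` and the default threshold below are assumptions for illustration, not an official flagging scheme.

```python
from statistics import mean, stdev

# Hypothetical flag codes for this sketch (not an official flag scheme).
GOOD, IMPROBABLE, IMPOSSIBLE = 1, 3, 4


def flag_values(values, valid_min, valid_max, n_sigma=3.0):
    """Return a list of (value, flag) pairs; no value is ever removed."""
    # Statistics are computed only from values that pass the range check,
    # so impossible values do not distort the mean and standard deviation.
    in_range = [v for v in values if valid_min <= v <= valid_max]
    mu = mean(in_range) if in_range else 0.0
    sigma = stdev(in_range) if len(in_range) > 1 else 0.0

    flagged = []
    for v in values:
        if not (valid_min <= v <= valid_max):
            flag = IMPOSSIBLE   # range check: outside physical limits
        elif sigma > 0 and abs(v - mu) > n_sigma * sigma:
            flag = IMPROBABLE   # statistical check: far from the mean
        else:
            flag = GOOD
        flagged.append((v, flag))
    return flagged
```

Because improbable values are only flagged, an expert can still inspect them later; an unexpected but real phenomenon stays in the data set.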
Backing up
  Needs rigorous procedures
  Keeping a separate copy of working data sets
    Disaster recovery
      Needs copy to be kept in a separate location
    Wrong manipulation
  On larger systems: on specialised hardware (tape drives…), necessitated by large volume
    But the principle is more important!!
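The principle needs no special hardware. A minimal sketch, assuming the backup location is simply another directory (ideally on a different machine or site): copy the working file and verify the copy with a checksum. The function names `sha256_of` and `back_up` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dir):
    """Copy source into backup_dir and verify that the copy is identical."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} is corrupt")
    return target
```

A call such as `back_up("working_data.csv", "/mnt/offsite/backups")` (both paths invented here) would be run as part of a fixed routine, since the procedure matters more than the medium.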
Exchanging
  Communicating data to others
    To systems – distributed data systems
    To people
  Requires data exchange protocols
    Agree on the formats for exchange
  Requires data exchange policy
    Agree on what can be done with data by the ‘recipient’
Archiving
  Important to ensure long-term integrity of the data
    On time scales that are typically much longer than a project…
  Often will involve specialised organisations
    Data repositories – data centres
  Needs careful thinking about storage medium
    Magnetic media are not ideal, certainly not in tropical countries
    Documentation, viewer software
Role of data centres
  Data management tasks
    Inventorising and documenting
    Archiving
  Specific tasks
    Redistribution
    Integration
    Support
Redistribution
  Preferably on line
    Fast and efficient
    No marginal costs
  Inventory
    Metadatabase as a tool
  Data rescue
    Recovering data that are in danger of being lost
  Respecting rights of data providers
    Data policy
    Proper use statement
Integration
  Over different disciplines
    CTD cast/Niskin bottles
  Over different institutions
    Implies ‘trust’
    Needs formal arrangements
      Data policy
  Creates possibility of extra quality control
    Checks on consistency
New technologies
  Technological developments make new types of applications possible
    Internet, bandwidth
    Standard protocols
      DiGIR, XML
    Distributed databases
  Data centres are forced to rethink their role
    No longer passive archive, but active service centre
Data policy
  Formal agreement between partners exchanging data
  Describes rights and duties of data provider and data user
  Considerations
    Data are public property
    Rights of data collector
Prisoner’s dilemma
  Thought experiment in ESS research
  Fate of prisoners depends on their behaviour:
    If they collaborate, they have a reasonable chance of escaping
    A traitor is released, his companion stays
  Evolutionarily stable strategy (ESS): works against collaboration
Data sharers’ dilemma
  A’s data, B’s data, C’s data
  A and B play it fair
  C cheats
  The cheat wins, since (s)he has access to his or her own data, *and* to the data of his or her naïve colleagues
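The dilemma can be made concrete by counting which data sets each scientist can use. In this sketch everyone always has their own data, plus whatever the others choose to share; the function name `accessible_datasets` and the scoring by simple counting are assumptions for illustration.

```python
def accessible_datasets(shares):
    """Map each scientist to the set of data sets they can use.

    shares maps a scientist's name to True (shares data) or False (withholds).
    """
    access = {}
    for person in shares:
        # Data made available by everyone else who plays fair.
        shared_by_others = {
            other for other, is_sharing in shares.items()
            if is_sharing and other != person
        }
        # Everyone can always use their own data.
        access[person] = {person} | shared_by_others
    return access


# A and B play it fair, C cheats.
outcome = accessible_datasets({"A": True, "B": True, "C": False})
```

Here `outcome["C"]` contains all three data sets while A and B each have only two, which is why a data policy with agreed rights and duties is needed to make sharing the stable choice.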