Presentation at the Deep Carbon Observatory Summer School 2014, Big Sky, MT, USA.
Text of Why data science matters and what we can do with it
Xiaogang (Marshall) Ma and DCO-Data Science Team
Tetherless World Constellation, Rensselaer Polytechnic Institute
Why Data Science Matters and what we can do with it
• Data Management and Publication
• Interoperability of Data
• Provenance of Research
• Era of Science 2.0
Data Management and Publication
Why Manage and Publish Data
• Meet grant requirements
– Many funding agencies now require researchers to formally state how they will manage and preserve datasets generated from a research project.
• Increase your research efficiency
– Have you ever had a hard time understanding the data that you or your colleagues have collected?
Image courtesy of British Geological Survey
Nice, now I have my DATA well managed, and next…
• Increase the visibility of your research
– Making your data available to other researchers through widely searched repositories can increase your prominence and demonstrate continued use of the data and relevance of your research.
• Facilitate new discoveries
– Enabling other researchers to use your data reinforces open scientific inquiry and can lead to new and unanticipated discoveries. It also prevents duplication of effort, because others can use your data rather than gathering the same data themselves.
Data Management Plan: What and How
• What is a Data Management Plan?
– A data management plan is a formal document that outlines what you will do with your data during and after you complete your research.
• What is involved in developing one?
– Developing a data management plan can be time-consuming, tedious, and daunting, but it is a very important step in ensuring that your research data is safe and sound for the present and future.
– With the right process and framework it does not take too long and can pay off enormously in the long run.
• Topics in a data management plan include
– Introduction and context
– Data types, formats, standards and capture methods
– Short-term storage and data management
– Deposit and long-term preservation
– Data sharing, access and re-use
– Resourcing
– Adherence and review
• Resources and tools that help create DMPs:
– DCC Data Management Plans: http://www.dcc.ac.uk/resources/data-management-plans
– MIT Data Management and Publishing: http://libraries.mit.edu/data-management/
– NSF Data Management Plan Requirements: http://www.nsf.gov/eng/general/dmp.jsp
– DMPTool: https://dmptool.org
– IEDA Data Management Plan Tool: http://
Interoperable: “Data should be discoverable, accessible, decodable, understandable and usable, and data sharing should be legal and ethical for all participants.”
• Interoperability does not mean that all data should be mediated or standardized.
• However, it is important that data archives are accompanied by detailed documentation, clarifying data provenance, data model, vocabularies used, etc.
(Ma et al., 2011)
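As a rough illustration of the documentation point above, a dataset could be accompanied by a small machine-readable record, and a script could flag which parts of the documentation are still missing. This is only a sketch: the field names and the example record are hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical field names chosen for illustration; not a formal metadata standard.
REQUIRED_FIELDS = ("title", "provenance", "data_model", "vocabularies", "license")


def missing_documentation(record: dict) -> list:
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Made-up record describing an imaginary dataset.
example_record = {
    "title": "Example rock sample measurements",
    "provenance": "Collected during a hypothetical field campaign",
    "data_model": "One row per sample; columns described in a README",
    "vocabularies": [],  # left empty on purpose to show the check
    "license": "CC-BY-4.0",
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print("Missing:", missing_documentation(example_record))
```

A check like this does not standardize the data itself; it only confirms that the accompanying documentation exists, which is in line with the point that interoperability does not require all data to be standardized.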
Provenance of Research
• Documenting provenance
– Linking a range of observations and model outputs, research activities, people and organizations involved in the production of scientific findings with the supporting data sets and methods used to generate them.
Well-curated provenance information makes scientific workflows transparent and improves the credibility and trustworthiness of their outputs. It also facilitates informed and rational policy and decision-making.
Image from nature.com
(Ma et al., 2014)
“Figure 1.2: Sea Level Rise: Past, Present, and Future” in the Third National Climate Assessment report draft of USA (NCA3)
What is the provenance of this figure?
• Detailed caption of that figure:
– Estimated, observed and possible amounts of global sea level rise from 1800 to 2100. Proxy estimates (Kemp et al. 2012) (for example, based on sediment records) are shown in red (pink band shows uncertainty), tide gauge data in blue (Church and White 2011a), and satellite observations are shown in green (Nerem et al. 2010). The future scenarios range from 0.66 feet to 6.6 feet in 2100 (Parris et al. 2012). Higher or lower amounts of sea level rise are considered implausible, as represented by the gray shading. The orange line at right shows the currently projected range of sea level rise of 1 to 4 feet by 2100, which falls within the larger risk-based scenario range. The large projected range reflects uncertainty about how glaciers and ice sheets will react to the warming ocean, the warming atmosphere, and changing winds and currents. As seen in the observations, there are year-to-year variations in the trend. (Figure source: Josh Willis, NASA Jet Propulsion Laboratory)
As a case study, let’s trace the provenance of this paper.
Provenance tracing of NASA contributions to Figure 1.2 in draft NCA3
Here only the details of the Topex-Poseidon mission are shown
Here only the details of one paper (i.e., “paper/103”) cited by that figure are shown
(a) Instances of calibration, model and software underpinning “paper/103”
(b) Instances of sensor, instrument and platform underpinning that paper
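The kind of tracing described above can be sketched as a walk over a small graph that links a figure to the papers, data, software, instruments and platforms behind it. This is a minimal sketch, not the method used for NCA3: only "paper/103" comes from the slides, while every other identifier and all relation names are hypothetical placeholders.

```python
from collections import deque

# Edges point from a result to the things it depends on.
# "paper/103" appears in the slides; every other identifier is a made-up placeholder,
# and the relation names are illustrative only.
PROVENANCE = {
    "figure/1.2": [("wasDerivedFrom", "paper/103")],
    "paper/103": [
        ("used", "dataset/example"),
        ("used", "software/example"),
        ("used", "model/example"),
    ],
    "dataset/example": [("wasGeneratedBy", "instrument/example")],
    "instrument/example": [("hostedOn", "platform/example")],
}


def trace(start, graph=PROVENANCE):
    """Breadth-first walk returning (source, relation, target) triples reachable from start."""
    seen = {start}
    queue = deque([start])
    triples = []
    while queue:
        node = queue.popleft()
        for relation, target in graph.get(node, []):
            triples.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return triples


if __name__ == "__main__":
    for source, relation, target in trace("figure/1.2"):
        print(f"{source} --{relation}--> {target}")
```

Calling trace("figure/1.2") lists every link from the figure down to the placeholder platform, which is the same question the slides ask: what does this figure ultimately rest on?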