Page 1: Data provenance in astronomy Bob Mann Wide-Field Astronomy Unit University of Edinburgh (rgm@roe.ac.uk)

Data provenance in astronomy

Bob Mann

Wide-Field Astronomy Unit
University of Edinburgh

(rgm@roe.ac.uk)

Outline

- Data and databases in astronomy
- Case Study: UK Infrared Deep Sky Survey
- Conclusions

Astronomers observe across the whole electromagnetic spectrum

Galaxy images look different across the spectrum, due to:

- Inherent angular resolution of the telescope
- Different emission processes

Astronomical data: original form

Different detector technologies are used across the spectrum, yielding different types of data, e.g.:

- Ultraviolet/optical/infrared
  - Image: array of pixel values
- X-ray
  - Event list: positions, arrival times and energies of all detected photons
- Radio
  - Interferometric visibilities: sparse Fourier transform of a region of the sky

Astronomical data: final form

- Most research is done using catalogue data
  - i.e. tables of attributes of detected sources, mainly discrete sources (stars, galaxies, etc.)
- Data compression
  - Catalogue is a few % of the image data volume
- Amenable to representation in a relational DB
  - Natural indexing by location in sky
- …but original data products (images, spectra, event lists) are sometimes needed
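As a rough illustration of why catalogue data suit a relational database, the sketch below stores a few sources in an in-memory SQLite table indexed on sky position and selects those inside a small box on the sky. The table name, column names and magnitudes are hypothetical, chosen only for this example; they are not taken from any real survey archive, and the box query ignores the wrap-around at RA = 0.

```python
import sqlite3

# Hypothetical catalogue schema: one row per detected source.
SCHEMA = """
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    ra_deg    REAL NOT NULL,   -- right ascension in degrees
    dec_deg   REAL NOT NULL,   -- declination in degrees
    k_mag     REAL             -- K-band magnitude
);
-- Index on position, so that sky-region queries need not scan the whole table
CREATE INDEX source_position ON source (dec_deg, ra_deg);
"""

def build_catalogue(rows):
    """Create an in-memory catalogue and load (id, ra, dec, k_mag) rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO source VALUES (?, ?, ?, ?)", rows)
    return db

def sources_in_box(db, ra_min, ra_max, dec_min, dec_max):
    """Return ids of sources inside an RA/Dec box (no wrap-around handling)."""
    cursor = db.execute(
        "SELECT source_id FROM source "
        "WHERE dec_deg BETWEEN ? AND ? AND ra_deg BETWEEN ? AND ? "
        "ORDER BY source_id",
        (dec_min, dec_max, ra_min, ra_max),
    )
    return [row[0] for row in cursor]

# Made-up sources: two close together, one elsewhere on the sky
rows = [(1, 150.10, 2.20, 17.3), (2, 150.12, 2.21, 18.9), (3, 210.50, -1.00, 16.2)]
db = build_catalogue(rows)
print(sources_in_box(db, 150.0, 150.2, 2.0, 2.5))
```

Real archives typically use a dedicated spherical indexing scheme rather than a plain two-column index; the point here is only that position is a natural key for catalogue rows.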

Astronomical databases

- Telescope archives
  - Heterogeneous collections of raw data files from all observations taken
  - Download data for reduction and analysis
- Sky survey archives
  - Homogeneous data and pipeline reduction
  - “Science Archive”: do science on the DB
- Bibliographic archives: scans of journals

Astronomical data processing

- Data reduction
  - Remove instrumental signatures from raw data and produce “science-ready” data
  - Software packages written for specific instruments
- Data analysis
  - Derive scientific results from science-ready data products, e.g. statistical analyses
  - Some astro-specific packages/environments, e.g. IRAF
  - Some use of programming languages: Fortran, C/C++, Python, Java
  - Some use of commercial packages, e.g. Interactive Data Language (IDL)

Outline

- Data and databases in astronomy
- Case Study: UKIDSS
  - Introduction to UKIDSS
  - Data life-cycle in UKIDSS
  - Provenance in UKIDSS
- Conclusions

UK Infrared Deep Sky Survey

- Set of five infrared sky surveys
  - Covering ~1/6 of the sky
  - From large/shallow to very small/very deep
  - See www.ukidss.org
- Observations: 2005-2012, using the Wide Field Camera (WFCAM) on the UK Infrared Telescope (UKIRT) in Hawaii

UKIDSS data life-cycle (1)

- Summit of Mauna Kea
  - Data acquired from 4 WFCAM detectors
  - Summit pipeline: instrument health
  - Data written to LTO tape in NDF format
  - Tapes couriered to Cambridge weekly
- Cambridge
  - Raw data converted from NDF to FITS
  - Data reduction pipeline run on a nightly basis: ~100Gb/night
    - Remove instrumental signatures, combine images, detect and classify objects, calibrate positions & fluxes

UKIDSS data life-cycle (2)

- Edinburgh
  - Ingest data from Cambridge: catalogues into RDBMS; image metadata into RDBMS; images on disk
  - Combine data from multiple nights: generate new catalogues from stacked images
  - Prepare release databases for WFCAM Science Archive (WSA): see http://surveys.roe.ac.uk/wsa
- Users worldwide
  - Extract raw images from Cambridge
  - Extract images and catalogues in FITS files from Edinburgh
  - Run queries on catalogues & image metadata in WSA

Provenance in UKIDSS

- Why is provenance important in UKIDSS?
- What provenance information is recorded?
- How will this be used?
  - …and by whom?
  - …and is this adequate?

Importance of provenance

- Much UKIDSS science is rare object search

[Figure: plot with axes "Ratio of fluxes in H & K bands" and "Ratio of fluxes in J & H bands"]

Objects with these colours would be very unusual, and possibly very interesting.

Are they real?

Need the ability to trace back to the reduced image within which the object was detected, and maybe back to the raw image.
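A minimal sketch of such a rare-object search is given below: it derives two colours from band flux ratios, flags sources outside an assumed "usual" range, and keeps the identifier of the image each source came from so that a candidate can be checked against its image. The field names, fluxes, image identifiers and colour limits are all made up for illustration, and the colour formula assumes the fluxes in different bands share a common zero point.

```python
import math

def colour(flux_a, flux_b):
    """Colour index (magnitude difference) implied by the flux ratio of two bands."""
    return -2.5 * math.log10(flux_a / flux_b)

def unusual_sources(catalogue, jh_usual, hk_usual):
    """Return (source_id, image_id) for sources outside the usual colour ranges."""
    flagged = []
    for row in catalogue:
        j_h = colour(row["flux_j"], row["flux_h"])
        h_k = colour(row["flux_h"], row["flux_k"])
        inside = (jh_usual[0] <= j_h <= jh_usual[1]
                  and hk_usual[0] <= h_k <= hk_usual[1])
        if not inside:
            # Keep the image identifier: it is the link back to the pixels
            flagged.append((row["source_id"], row["image_id"]))
    return flagged

# Hypothetical catalogue rows; source 2 is far fainter in J than in H
catalogue = [
    {"source_id": 1, "image_id": "img_a",
     "flux_j": 100.0, "flux_h": 100.0, "flux_k": 100.0},
    {"source_id": 2, "image_id": "img_b",
     "flux_j": 1.0, "flux_h": 100.0, "flux_k": 100.0},
]
candidates = unusual_sources(catalogue, jh_usual=(-0.5, 1.5), hk_usual=(-0.5, 1.5))
print(candidates)
```

Without the `image_id` column, the flagged list would say which objects look unusual but give no route to deciding whether they are real.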

Structure of a FITS file

- Primary Header
- Primary Data Array
- Extensions
  - Header
  - Data
  - Header
  - Data

Header: composed of 80-character ASCII records

Data units can be images or tables
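The layout above can be sketched in a few lines of Python that walk a FITS byte stream, reading 80-character header records up to the END record and then skipping the data unit that the header describes. This is a simplified illustration rather than a full reader: it builds its own tiny synthetic file, handles only plain image data (it ignores PCOUNT and GCOUNT), and relies on the FITS convention that headers and data are padded to 2880-byte blocks.

```python
BLOCK = 2880   # headers and data units are padded to multiples of 2880 bytes
CARD = 80      # each header record is 80 ASCII characters

def round_up(nbytes):
    """Round a byte count up to the next block boundary."""
    return -(-nbytes // BLOCK) * BLOCK

def make_header(cards):
    """Build a header unit: 80-character records, an END record, block padding."""
    text = "".join(card.ljust(CARD) for card in cards + ["END"])
    return text.ljust(round_up(len(text))).encode("ascii")

def read_cards(buf, offset):
    """Read header records from offset up to END; return (cards, next_offset)."""
    cards = []
    while True:
        card = buf[offset:offset + CARD].decode("ascii")
        if not card:
            raise ValueError("ran out of data before END record")
        offset += CARD
        if card[:8].rstrip() == "END":
            return cards, round_up(offset)
        cards.append(card)

def data_size(cards):
    """Bytes of image data described by a header (ignores PCOUNT/GCOUNT)."""
    values = {}
    for card in cards:
        if card[8:10] == "= ":
            values[card[:8].strip()] = card[10:].split("/")[0].strip()
    naxis = int(values["NAXIS"])
    if naxis == 0:
        return 0
    npix = 1
    for axis in range(1, naxis + 1):
        npix *= int(values[f"NAXIS{axis}"])
    return abs(int(values["BITPIX"])) // 8 * npix

def list_hdus(buf):
    """Return (number_of_header_records, data_bytes) for each header/data pair."""
    hdus, offset = [], 0
    while offset < len(buf):
        cards, offset = read_cards(buf, offset)
        nbytes = data_size(cards)
        hdus.append((len(cards), nbytes))
        offset += round_up(nbytes)
    return hdus

# Synthetic file: empty primary data array plus one 2x2 8-bit image extension
primary = make_header([
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                    8",
    "NAXIS   =                    0 / empty primary data array",
    "EXTEND  =                    T",
])
extension = make_header([
    "XTENSION= 'IMAGE   '           / image extension",
    "BITPIX  =                    8",
    "NAXIS   =                    2",
    "NAXIS1  =                    2",
    "NAXIS2  =                    2",
    "PCOUNT  =                    0",
    "GCOUNT  =                    1",
])
pixels = bytes([1, 2, 3, 4]).ljust(BLOCK, b"\0")
fits_bytes = primary + extension + pixels
print(list_hdus(fits_bytes))
```

In practice one would use an established FITS library rather than hand-written parsing; the sketch is only meant to make the header/data pairing concrete.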

FITS header records

- Almost all records are of the form
    KEYWORD = 'value' / COMMENT
- Some standard keywords are defined, but there is considerable freedom to define new ones
  - Relevant metadata for particular instruments
- Amongst the standard set is HISTORY
  - Format: HISTORY free text
  - Provenance information can be stored in a series of HISTORY records
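To make the record format concrete, here is a small sketch that splits one header record into keyword, value and comment, and treats HISTORY (and COMMENT) records as free text. It is a simplification under stated assumptions: it does not handle doubled quotes inside string values or continued strings, and the example records other than the HISTORY text are invented for illustration.

```python
def parse_card(card):
    """Split one header record into (keyword, value, comment).

    For commentary records such as HISTORY the value is None and the
    free text is returned in the comment position.
    """
    keyword = card[:8].strip()
    rest = card[8:]
    if keyword in ("HISTORY", "COMMENT") or not rest.startswith("= "):
        return keyword, None, rest.strip()
    body = rest[2:].strip()
    if body.startswith("'"):
        # Quoted string value; simplification: ignores '' escapes
        end = body.index("'", 1)
        value, tail = body[1:end].rstrip(), body[end + 1:]
    else:
        value, slash, text = body.partition("/")
        value, tail = value.strip(), slash + text
    comment = tail.partition("/")[2].strip()
    return keyword, value, comment

def history_text(cards):
    """Collect the free text of all HISTORY records, in file order."""
    parsed = (parse_card(card) for card in cards)
    return [text for keyword, _, text in parsed if keyword == "HISTORY"]

# Illustrative records (the keyword values here are examples, not real headers)
cards = [
    "TELESCOP= 'UKIRT   '           / telescope name",
    "NAXIS   =                    2 / number of axes",
    "HISTORY 20060615 17:30:02",
]
for card in cards:
    print(parse_card(card))
```

Because HISTORY carries no structured value, anything stored there has to be recovered by parsing the free text, which is part of why it is handled differently from ordinary keywords.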

UKIDSS FITS files (1)

- Raw image files
  - Primary header: telescope/instrument set-up, observing conditions, target, observational parameters
  - Primary data array: empty
  - Extensions: (header, data) pairs for each of four detectors: header has detector-specific metadata; data is compressed image
- Header keywords defined in Interface Control Document between Hawaii & Cambridge

UKIDSS FITS files (2)

- Reduced image files
  - Primary header & data array: metadata propagated from raw data file
  - Headers of extensions include HISTORY records for data reduction steps run at Cambridge, e.g.

    HISTORY 20060615 17:30:02
    HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
    HISTORY 20060615 17:31:04
    HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $
    HISTORY 20060615 17:32:36
    HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $
    HISTORY 20060615 20:01:58
    HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $

(Slide annotations on these records: What, When, Who)
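The HISTORY records on this slide alternate between a timestamp and a revision-control `$Id$` string, so they can be turned into structured "what, when, who" entries with a little parsing. The sketch below is one possible reading, not the pipeline's own tooling: it assumes each timestamp belongs to the `$Id$` record that follows it, and it takes "who" to be the author field of the `$Id$` string, which identifies the person who committed that revision of the module.

```python
import re
from datetime import datetime

# $Id$ layout: file name, revision, commit date, commit time, author, state
ID_PATTERN = re.compile(
    r"\$Id: (?P<module>\S+),v (?P<revision>\S+) "
    r"(?P<date>\S+) (?P<time>\S+) (?P<author>\S+) (?P<state>\S+) \$"
)

def reduction_steps(history):
    """Pair each timestamp record with the $Id$ record that follows it."""
    steps = []
    run_at = None
    for text in history:
        text = text.strip()
        match = ID_PATTERN.search(text)
        if match:
            steps.append({
                "what": match["module"],
                "revision": match["revision"],
                "when": run_at,          # None if no timestamp preceded it
                "who": match["author"],
            })
            run_at = None
        else:
            try:
                run_at = datetime.strptime(text, "%Y%m%d %H:%M:%S")
            except ValueError:
                run_at = None            # unrecognised free text: skip it

    return steps

# Free text of the HISTORY records shown on the slide
history = [
    "20060615 17:30:02",
    "$Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $",
    "20060615 17:31:04",
    "$Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $",
    "20060615 17:32:36",
    "$Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $",
    "20060615 20:01:58",
    "$Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $",
]
steps = reduction_steps(history)
for step in steps:
    print(step["when"], step["what"], step["revision"], step["who"])
```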

UKIDSS FITS files (3)

- Catalogue files
  - Primary header: metadata propagated from raw image
  - Primary data array: empty
  - Headers of extensions include metadata for the catalogue generation process: invocations of software modules in HISTORY records, with parameter values in separate records
- Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge & Edinburgh

User access to provenance info

- All header records from all FITS files are ingested into WSA, except HISTORY records
- So users can track provenance through queries against WSA, and can get HISTORY records by downloading files

Hopefully this is enough to determine whether an unusual object is real, but is it good enough?
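As an illustration of tracking provenance through database queries, the sketch below traces a catalogue source back to the reduced image it was detected in and on to the raw image. The two-table schema, file names and identifiers are entirely hypothetical and much simpler than a real archive such as the WSA; the returned file names stand in for the files a user would then download to read the HISTORY records.

```python
import sqlite3

# Hypothetical schema: frames form a chain (reduced -> raw) via parent_id
SCHEMA = """
CREATE TABLE frame (
    frame_id   INTEGER PRIMARY KEY,
    file_name  TEXT NOT NULL,
    frame_type TEXT NOT NULL,            -- 'raw' or 'reduced'
    parent_id  INTEGER REFERENCES frame  -- frame this one was derived from
);
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    frame_id  INTEGER NOT NULL REFERENCES frame  -- frame of detection
);
"""

def trace_source(db, source_id):
    """Return (reduced_file, raw_file) for a source, or None if unknown."""
    return db.execute(
        "SELECT reduced_frame.file_name, raw_frame.file_name "
        "FROM source "
        "JOIN frame AS reduced_frame "
        "  ON reduced_frame.frame_id = source.frame_id "
        "LEFT JOIN frame AS raw_frame "
        "  ON raw_frame.frame_id = reduced_frame.parent_id "
        "WHERE source.source_id = ?",
        (source_id,),
    ).fetchone()

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
# Made-up rows: one raw frame, the reduced frame derived from it, one source
db.execute("INSERT INTO frame VALUES (1, 'raw_0001.fit', 'raw', NULL)")
db.execute("INSERT INTO frame VALUES (2, 'reduced_0001.fit', 'reduced', 1)")
db.execute("INSERT INTO source VALUES (42, 2)")

print(trace_source(db, 42))
```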

Recap: Astronomical data processing

- Data reduction
  - Remove instrumental signatures from raw data and produce “science-ready” data
  - Software packages written for specific instruments
- Data analysis
  - Derive scientific results from science-ready data products, e.g. statistical analyses
  - Some astro-specific packages/environments, e.g. IRAF
  - Some use of programming languages: Fortran, C/C++, Python, Java
  - Some use of commercial packages, e.g. Interactive Data Language (IDL)?

Provenance in data analysis: Two main problems

- Less controlled software environment
  - Little bits of code written for a specific analysis, not tried and tested pipeline modules
- Use of data from many sources
  - UKIDSS/WSA is state-of-the-art for provenance
  - Many (esp. older) data resources not so good
  - Provenance of combined dataset only as good as provenance of worst constituent dataset?

Does this matter?

- Provenance information for data analysis is recorded in the journal paper (sort of)
  - Improving links between online literature and data sources
- Increasing importance of large sky surveys with well controlled environments
  - Moving more of the data analysis from the user’s desktop to the data centre

Conclusions

- Modern sky survey systems record & publish extensive provenance for data reduction
- Very little provenance is recorded from data analysis, except the description in the journal paper
  - More could surely be done, but would researchers support the overhead of doing so?
  - Improvements as more analysis moves to the data centre
- Could/should we be doing more?