Data provenance in astronomy
Bob Mann
Wide-Field Astronomy Unit, University of Edinburgh
2/24
Outline
  Data and databases in astronomy
  Case Study: UK Infrared Deep Sky Survey
  Conclusions
4/24
Astronomers observe across the whole electromagnetic spectrum
  Galaxy images look different across the spectrum, due to:
    Inherent angular resolution of the telescope
    Different emission processes
5/24
Astronomical data: original form
  Different detector technologies used across the spectrum, yielding different types of data, e.g.
    Ultraviolet/optical/infrared
      Image: array of pixel values
    X-ray
      Event list: positions, arrival times, energies of all detected photons
    Radio
      Interferometric visibilities: sparse Fourier transform of a region of the sky
6/24
Astronomical data: final form
  Most research done using catalogue data
    i.e. tables of attributes of detected sources, mainly discrete sources (stars, galaxies, etc.)
  Data compression
    Catalogue is a few % of image data volume
  Amenable to representation in relational DB
    Natural indexing by location in sky
  ...but original data products (images, spectra, event lists) sometimes needed
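A minimal sketch of the "catalogue in a relational DB, indexed by sky position" idea, using Python's built-in sqlite3 module. The table name, column names and sample rows are hypothetical and chosen only for illustration. The simple box search ignores RA wrap-around at 0/360 degrees, and a production archive would normally use a dedicated spatial indexing scheme instead of a plain two-column index.

```python
import sqlite3


def build_catalogue(rows):
    """Load (source_id, ra_deg, dec_deg, k_mag) tuples into an in-memory table.

    The schema is hypothetical; it is not the schema of any real archive.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source ("
        " source_id INTEGER PRIMARY KEY,"
        " ra_deg REAL, dec_deg REAL, k_mag REAL)"
    )
    # Index on position so that box searches need not scan the whole table.
    conn.execute("CREATE INDEX idx_source_pos ON source (dec_deg, ra_deg)")
    conn.executemany("INSERT INTO source VALUES (?, ?, ?, ?)", rows)
    return conn


def box_search(conn, ra_min, ra_max, dec_min, dec_max):
    """Return the IDs of sources inside a rectangular RA/Dec box (degrees)."""
    cur = conn.execute(
        "SELECT source_id FROM source"
        " WHERE dec_deg BETWEEN ? AND ? AND ra_deg BETWEEN ? AND ?"
        " ORDER BY source_id",
        (dec_min, dec_max, ra_min, ra_max),
    )
    return [row[0] for row in cur]
```

Usage: build the table once from catalogue rows, then call box_search with the limits of the sky region of interest.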
7/24
Astronomical databases
  Telescope archives
    Heterogeneous collections of raw data files from all observations taken
    Download data for reduction and analysis
  Sky survey archives
    Homogeneous data and pipeline reduction
    “Science Archive”: do science on DB
  Bibliographic archives
    Scans of journals
8/24
Astronomical data processing
  Data reduction
    Remove instrumental signatures from raw data and produce “science-ready” data
    Software packages written for specific instruments
  Data analysis
    Derive scientific results from science-ready data products, e.g. statistical analyses
    Some astro-specific packages/environments, e.g. IRAF
    Some use of programming languages
      Fortran, C/C++, Python, Java
    Some use of commercial packages
      e.g. Interactive Data Language (IDL)
9/24
Outline
  Data and databases in astronomy
  Case Study: UKIDSS
    Introduction to UKIDSS
    Data life-cycle in UKIDSS
    Provenance in UKIDSS
  Conclusions
10/24
UK Infrared Deep Sky Survey
  Set of five infrared sky surveys
    Covering ~1/6 of the sky
    From large/shallow to very small/very deep
    See www.ukidss.org
  Observations: 2005-2012 using Wide Field Camera (WFCAM) on UK Infrared Telescope (UKIRT) in Hawaii
11/24
UKIDSS data life-cycle (1)
  Summit of Mauna Kea
    Data acquired from 4 WFCAM detectors
    Summit pipeline: instrument health
    Data written to LTO tape in NDF format
    Tapes couriered to Cambridge weekly
  Cambridge
    Raw data converted from NDF to FITS
    Data reduction pipeline run on nightly basis: ~100Gb/night
      Remove instrumental signatures, combine images, detect and classify objects, calibrate positions & fluxes
12/24
UKIDSS data life-cycle (2)
  Edinburgh
    Ingest data from Cambridge: catalogues into RDBMS; image metadata into RDBMS; images on disk
    Combine data from multiple nights: generate new catalogues from stacked images
    Prepare release databases for WFCAM Science Archive (WSA): see http://surveys.roe.ac.uk/wsa
  Users worldwide
    Extract raw images from Cambridge
    Extract images and catalogues in FITS files from Edinburgh
    Run queries on catalogues & image metadata in WSA
13/24
Provenance in UKIDSS
  Why is provenance important in UKIDSS?
  What provenance information is recorded?
  How will this be used? ...and by whom?
  ...and is this adequate?
14/24
Importance of provenance
  Much UKIDSS science is rare object search
  [Figure: plot with axes "Ratio of fluxes in H & K bands" and "Ratio of fluxes in J & H bands"]
  Objects with these colours would be very unusual, and possibly very interesting.
  Are they real?
  Need ability to trace back to reduced image within which object was detected, maybe back to raw image.
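A hedged sketch of how a colour-based rare object selection might look. A colour is the magnitude difference between two bands, which follows from the flux ratio as -2.5 log10(F_a / F_b) plus any zero-point offset. The function names and the selection limits are hypothetical, not taken from UKIDSS, and the zero-point offset defaults to 0 as a simplifying assumption.

```python
import math


def colour(flux_a, flux_b, zero_point_offset=0.0):
    """Colour (magnitude difference m_a - m_b) from the flux ratio of two bands."""
    if flux_a <= 0 or flux_b <= 0:
        raise ValueError("fluxes must be positive")
    return -2.5 * math.log10(flux_a / flux_b) + zero_point_offset


def is_candidate(j_flux, h_flux, k_flux, jh_limits, hk_limits):
    """True if both colours fall inside a (hypothetical) selection box.

    jh_limits and hk_limits are (min, max) pairs in magnitudes.
    """
    jh = colour(j_flux, h_flux)
    hk = colour(h_flux, k_flux)
    return (jh_limits[0] <= jh <= jh_limits[1]
            and hk_limits[0] <= hk <= hk_limits[1])
```

Any object passing such a cut still has to be checked against the image it was detected in, which is where provenance comes in.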
15/24
Structure of a FITS file
  Primary Header
  Primary Data Array
  Extensions: a sequence of (Header, Data) pairs
  Header: composed of 80-character ASCII records
  Data units can be images or tables
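A minimal sketch of reading the header records from the start of a FITS file using only the Python standard library. It relies on two features of the FITS format: headers are stored in 2880-byte blocks of 80-character records, and an END record closes the header. The helper names are hypothetical, and the demo header is a toy with an empty data array; for real work a FITS library is the better choice.

```python
import io

BLOCK = 2880  # FITS files are organised in 2880-byte blocks
CARD = 80     # each header record is 80 ASCII characters


def read_header_cards(stream):
    """Read 80-character records from a binary stream up to the END record."""
    cards = []
    while True:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated header: no END record found")
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            if card[:8].rstrip() == "END":
                return cards
            cards.append(card)


def demo_header():
    """Build a toy primary header (empty data array) as an in-memory stream."""
    lines = [
        "SIMPLE  =                    T",
        "BITPIX  =                    8",
        "NAXIS   =                    0",
        "END",
    ]
    raw = "".join(line.ljust(CARD) for line in lines).ljust(BLOCK)
    return io.BytesIO(raw.encode("ascii"))
```

Calling read_header_cards(demo_header()) returns the three records before END, each padded to 80 characters.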
16/24
FITS header records
  Almost all records of the form
    KEYWORD = 'value' / COMMENT
  Some standard keywords defined, but considerable freedom to define new ones
    Relevant metadata for particular instruments
  Amongst standard set is HISTORY
    Format: HISTORY free text
    Provenance information can be stored in a series of HISTORY records
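The record layout above can be illustrated with a simplified parser. This is a sketch, not a complete implementation of the FITS standard: it assumes the keyword occupies the first 8 characters, that "= " in columns 9-10 marks a value, and it does not handle doubled quotes inside strings or continued long strings. The function name parse_card is hypothetical.

```python
def parse_card(card):
    """Split one header record into (keyword, value, comment) strings."""
    card = card.ljust(80)[:80]
    keyword = card[:8].strip()
    if keyword in ("HISTORY", "COMMENT"):
        # Commentary records: free text from column 9 onwards.
        return keyword, card[8:].strip(), ""
    if card[8:10] != "= ":
        # No value indicator: treat the rest of the record as a comment.
        return keyword, "", card[8:].strip()
    body = card[10:]
    if body.lstrip().startswith("'"):
        # String value: take the text between the first pair of quotes.
        start = body.index("'")
        end = body.index("'", start + 1)
        value = body[start + 1:end].rstrip()
        comment = body[end + 1:].partition("/")[2].strip()
    else:
        # Numeric or logical value: everything before the first slash.
        value, _, comment = body.partition("/")
        value, comment = value.strip(), comment.strip()
    return keyword, value, comment
```

For example, a HISTORY record comes back as ("HISTORY", free text, ""), so a series of them can be collected in file order to recover the stored provenance.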
17/24
UKIDSS FITS files (1)
  Raw image files
    Primary header: telescope/instrument set-up, observing conditions, target, observational parameters
    Primary data array: empty
    Extensions: (header, data) pairs for each of four detectors: header has detector-specific metadata; data is compressed image
    Header keywords defined in Interface Control Document between Hawaii & Cambridge
18/24
UKIDSS FITS files (2)
  Reduced image files
    Primary header & data array: metadata propagated from raw data file
    Headers of extensions include HISTORY records for data reduction steps run at Cambridge, e.g.
      HISTORY 20060615 17:30:02
      HISTORY $Id: cir_stage1.c,v 1.11 2005/12/15 14:44:04 jim Exp $
      HISTORY 20060615 17:31:04
      HISTORY $Id: cir_qblkmed.c,v 1.9 2005/08/12 14:35:19 jim Exp $
      HISTORY 20060615 17:32:36
      HISTORY $Id: cir_xtalk.c,v 1.5 2005/10/17 14:58:50 jim Exp $
      HISTORY 20060615 20:01:58
      HISTORY $Id: cir_arith.c,v 1.8 2005/02/25 10:14:55 jim Exp $
    [Annotations on the example: What, When, Who]
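A sketch of turning HISTORY records like those in the example into structured provenance steps. It assumes the records come in pairs, a timestamp followed by a version-control $Id$ string of the form "file,v revision date time author state", as the example suggests. Reading the first record of each pair as the time the module was run is an interpretation, and the function and field names are hypothetical.

```python
import re
from datetime import datetime

# Layout of a version-control $Id$ keyword string, as seen in the example.
ID_RE = re.compile(
    r"\$Id: (?P<module>\S+),v (?P<revision>\S+) "
    r"(?P<date>\S+) (?P<time>\S+) (?P<author>\S+) (?P<state>\S+) \$"
)


def parse_history(records):
    """Pair each timestamp record with the $Id$ record that follows it."""
    steps = []
    run_time = None
    for rec in records:
        text = rec[len("HISTORY"):] if rec.startswith("HISTORY") else rec
        text = text.strip()
        match = ID_RE.search(text)
        if match:
            steps.append({
                "run_time": run_time,            # when (assumed run time)
                "module": match["module"],       # what
                "revision": match["revision"],
                "author": match["author"],       # who
            })
            run_time = None
        else:
            try:
                run_time = datetime.strptime(text, "%Y%m%d %H:%M:%S")
            except ValueError:
                pass  # unrecognised free text is skipped
    return steps
```

Applied to the eight records above, this would give four steps, one per pipeline module, in the order they were recorded.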
19/24
UKIDSS FITS files (3)
  Catalogue files
    Primary header: metadata propagated from raw image
    Primary data array: empty
    Headers of extensions include metadata for catalogue generation process: invocations of software modules in HISTORY records, with parameter values in separate records
  Header keywords for both reduced images and catalogues are defined in an Interface Control Document between Cambridge & Edinburgh
20/24
User access to provenance info
  All header records from all FITS files ingested into WSA, except HISTORY records
  So, users can track provenance through queries against WSA, and can get HISTORY records by downloading files
  Hopefully enough to determine whether an unusual object is real, but is this good enough?
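A sketch of what tracking provenance through queries could look like, using sqlite3 as a stand-in for the archive RDBMS. The tables (frame, detection), their columns and the file names are all hypothetical and are not the WSA schema. The idea shown is only the chain: catalogue object, then the reduced frame it was detected in, then the raw frame that frame was derived from.

```python
import sqlite3


def make_demo_archive():
    """Create a toy archive with one raw frame, one reduced frame, one object."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE frame (
            frame_id INTEGER PRIMARY KEY,
            file_name TEXT,
            frame_type TEXT,
            parent_frame_id INTEGER
        );
        CREATE TABLE detection (
            object_id INTEGER PRIMARY KEY,
            frame_id INTEGER REFERENCES frame(frame_id)
        );
        INSERT INTO frame VALUES (1, 'raw_0001.fit', 'raw', NULL);
        INSERT INTO frame VALUES (2, 'reduced_0001.fit', 'reduced', 1);
        INSERT INTO detection VALUES (1001, 2);
    """)
    return conn


def trace_object(conn, object_id):
    """Return (frame_type, file_name) pairs from detection frame back to raw."""
    row = conn.execute(
        "SELECT frame_id FROM detection WHERE object_id = ?", (object_id,)
    ).fetchone()
    frame_id = row[0] if row else None
    chain = []
    while frame_id is not None:
        # Step to the parent frame on each pass until there is none left.
        file_name, frame_type, frame_id = conn.execute(
            "SELECT file_name, frame_type, parent_frame_id"
            " FROM frame WHERE frame_id = ?",
            (frame_id,),
        ).fetchone()
        chain.append((frame_type, file_name))
    return chain
```

With the file names in hand, a user could then download the files themselves to inspect the HISTORY records that are not held in the database.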
21/24
Recap: Astronomical data processing
  Data reduction
    Remove instrumental signatures from raw data and produce “science-ready” data
    Software packages written for specific instruments
  Data analysis
    Derive scientific results from science-ready data products, e.g. statistical analyses
    Some astro-specific packages/environments, e.g. IRAF
    Some use of programming languages
      Fortran, C/C++, Python, Java
    Some use of commercial packages
      e.g. Interactive Data Language (IDL)?
22/24
Provenance in data analysis: two main problems
  Less controlled software environment
    Little bits of code written for a specific analysis, not tried and tested pipeline modules
  Use of data from many sources
    UKIDSS/WSA is state-of-the-art for provenance
    Many (esp. older) data resources not so good
    Provenance of combined dataset only as good as provenance of worst constituent dataset?
23/24
Does this matter?
  Provenance information for data analysis is recorded in the journal paper (sort of)
    Improving links between online literature and data sources
  Increasing importance of large sky surveys with well controlled environments
    Moving more of the data analysis from the user’s desktop to the data centre
24/24
Conclusions
  Modern sky survey systems record & publish extensive provenance for data reduction
  Very little provenance recorded from data analysis, except description in journal paper
    More could surely be done, but would researchers support overhead of doing so?
    Improvements as more analysis in data centre
  Could/should we be doing more?