VO as a Data Grid, NeSC ‘03 WFCAM Science Archive Nigel Hambly Wide Field Astronomy Unit Institute for Astronomy, University of Edinburgh

Page 1

VO as a Data Grid, NeSC ‘03

WFCAM Science Archive

Nigel Hambly

Wide Field Astronomy Unit

Institute for Astronomy, University of Edinburgh

Page 2

Background & context

• Wide Field Astronomy:
  - large-scale public surveys
  - multi-colour, multi-epoch imaging data sets

• Developments over recent decades:
  - whole-sky Schmidt telescope surveys (eg. SuperCOSMOS)
  - current generation optical/IR, eg. SDSS, WFCAM
  - next generation, eg. VISTA

Prime examples of key datasets that will be the cornerstone of the VO datagrid

Page 3

SuperCOSMOS scans photographic media:

• 10 Gbyte/day
• 3 colours: B, R & I
• 1 colour (R) at 2 epochs
• 0.7 arcsec/pixel
• 2 byte/pixel
• whole sky
• total data volume (pix): ~15 Tbyte
• S hemisphere completed 2002 (N hemisphere by end 2005)

Page 4

WFCAM will image the sky directly using IR sensitive detectors; deployment on a 4m telescope (UKIRT):

• 100 Gbyte/night
• 5 colours: ZYJHK; some multi-epoch imaging
• 0.4 arcsec/pixel
• 4 byte/pixel
• ~10% sky coverage in selected areas (various depths)
• total data volume (pix): ~100 Tbyte
• observations start in 2004; 7 yr programme planned

Page 5

VISTA (also 4m) will have 4x as many IR detectors as WFCAM:

• 500 Gbyte/night
• 4 colours: zJHK
• targeted surveys (various depths & areas)
• 0.34 arcsec/pixel
• total data volume (pix): ~0.5 Pbyte
• observations start at the end of 2006

Page 6

Characteristics of astronomy DBs (I)

• pixel images processed into lists of parameterised detections known as “catalogues” (parameterised data typically <10% of pixel data volume)

• detection association within survey data yielding multi-colour, multi-epoch source record

Page 7

Characteristics of astronomy DBs (II)

• detailed (but relatively small) amount of descriptive data with images and catalogues

• required to track descriptive data and images along with catalogue data

• for current/future generation surveys, processing and ingest are dictated by observing patterns

• but users require well defined, stable catalogue products on which to do their science

hence require periodic release of stable, well-defined, read-only catalogues

Page 8

Typical usages (I)

• increasingly involve jointly querying different survey datasets in different databases

- example shows stellar population discrimination using SDSS colours and SSA proper motions

(Digby et al., astro-ph/0304056, MNRAS in print)

Page 9

Typical usages (II)

• position & proximity searches are very common
  - spatial indexing (2D, spherical geometry) required

• statistical studies: ensemble characteristics of different species of source

• one-in-a-million searches for peculiar sources with highly detailed, specific properties
  - whole table scans

• …?

=> enable flexible interrogation to inspire new, innovative usage and promote new science
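The position and proximity searches listed above reduce to a great-circle separation test. A minimal Python sketch, assuming a toy in-memory catalogue with hypothetical `id`, `ra` and `dec` fields (degrees); it scans every row, which is the cost that spatial indexing is meant to avoid:

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(sources, ra0, dec0, radius_deg):
    """Return the sources within radius_deg of (ra0, dec0) by brute-force scan."""
    return [s for s in sources
            if angular_separation_deg(ra0, dec0, s["ra"], s["dec"]) <= radius_deg]

# Hypothetical toy catalogue, not real survey data.
sources = [
    {"id": 1, "ra": 180.000, "dec": 0.000},
    {"id": 2, "ra": 180.001, "dec": 0.001},
    {"id": 3, "ra": 181.000, "dec": 0.000},
]

# 10 arcsec cone around (180, 0).
nearby = cone_search(sources, 180.0, 0.0, 10.0 / 3600.0)
```

The spherical geometry matters because a fixed offset in RA corresponds to a smaller angle on the sky at high declination; a flat Cartesian distance would over-select near the poles.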

Page 10

Science archive development at WFAU:

• SSA: a few Tbytes

• WSA = 10x SSA

• VSA = 5x WSA

approach is to set up a prototype archive system now (SSA), expand and implement WSA to coincide with WFCAM ops, then scale to VSA.

Page 11

Database design: key requirements (I)

Flexibility:

• ingested data are rich in structure

• daily ingest; daily/weekly/monthly curation

• many varied usage modes

• protect proprietorial rights

• allow for changes/enhancements in design

Page 12

Database design: key requirements (II)

Scalability:

• ~2 Tbytes of new data per year

• operating lifetime > 5 years

• maintain performance for increasing data volumes

Portability:

• V1.0/V2.0 phased approach to hardware/OS/DBMS

Page 13

Database design: fundamentals (I)

• RDBMS, not OODBMS

• WSA V1.0: Windows/SQL Server (“SkyServer”)
  - V2.0 may be the same, DB2, or Oracle

• Image data stored as external flat files, not BLOBs
  - but image metadata stored in DBMS

• All attributes “not null”, ie. mandatory values

• Archive curation information stored in DBMS

Page 14

Database design: fundamentals (II)

• Calibration coefficients stored for astrometry & photometry

- instrumental quantities stored (XY in pix; flux in ADU)

- calibrated quantities stored based on current calibration

- all previous coefficients and versioning stored
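One way to picture the scheme above is a store that keeps the instrumental flux untouched, applies whichever coefficients are current, and archives superseded ones. The Python sketch below is illustrative only: the class names are hypothetical, and it assumes the textbook relation m = zp - 2.5 log10(flux), with exposure-time and extinction terms omitted. It is not the WSA schema.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ZeroPoint:
    version: int
    zp_mag: float  # photometric zero point in magnitudes

class PhotometricCalibration:
    """Holds the current zero point and every superseded one."""

    def __init__(self):
        self.current = None
        self.previous = []

    def recalibrate(self, zp_mag):
        # Archive the outgoing coefficients before installing new ones.
        if self.current is not None:
            self.previous.append(self.current)
        self.current = ZeroPoint(len(self.previous) + 1, zp_mag)
        return self.current

    def calibrated_mag(self, flux_adu):
        # Instrumental flux (ADU) is never modified; only the zero point changes.
        return self.current.zp_mag - 2.5 * math.log10(flux_adu)

cal = PhotometricCalibration()
cal.recalibrate(25.00)
mag_v1 = cal.calibrated_mag(1000.0)
cal.recalibrate(25.05)  # a later recalibration shifts the zero point
mag_v2 = cal.calibrated_mag(1000.0)
```

Because the instrumental quantity is retained, a recalibration only needs new coefficients; the calibrated value can always be regenerated, and any earlier value can be reproduced from `previous`.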

Page 15

Database design: fundamentals (III)

• Reruns: reprocessed image data
  - same observations yield new source attribute values
  - re-ingest, but retain old parameterisation

• Repeats: better measurements of the same source
  - eg. stacked image detections
  - again, retain old parameterisation

• Duplicates: same source & filter but different observation
  - eg. overlap regions
  - store all data, and flag “best”
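The "store all data, and flag best" rule for duplicates can be sketched as follows. The field names and the selection criterion (smallest magnitude error) are assumptions made for illustration; the slides do not say how "best" is chosen.

```python
from collections import defaultdict

def flag_best(detections):
    """Keep every detection; mark exactly one per (source, filter) as best."""
    groups = defaultdict(list)
    for det in detections:
        groups[(det["source_id"], det["filter"])].append(det)
    for group in groups.values():
        # Hypothetical criterion: the smallest quoted magnitude error wins.
        winner = min(group, key=lambda d: d["mag_err"])
        for det in group:
            det["best"] = det is winner
    return detections

# Source 7 falls in the overlap of frames A and B in the K filter.
detections = [
    {"source_id": 7, "filter": "K", "frame": "A", "mag_err": 0.05},
    {"source_id": 7, "filter": "K", "frame": "B", "mag_err": 0.02},
    {"source_id": 7, "filter": "J", "frame": "A", "mag_err": 0.04},
]
flagged = flag_best(detections)
```

No row is deleted, so a user who disagrees with the flag can still reach the alternative measurement.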

Page 16

Hardware design (I)

• separate servers for
  - pixels
  - catalogue curation
  - catalogue public access
  - web services

• different hardware solutions
  - mass storage on IDE with HW RAID5
  - high bandwidth catalogue servers using SCSI and SW RAID

Page 17

Hardware design (II)

• mass storage of pixels using low-cost IDE

Page 18

Hardware design (III)

• dual P4 Xeon server

• independent PCI-X buses for maximum b/w

• dual channel Ultra320 SCSI adapters

High bandwidth catalogue server

Page 19

Hardware design (IV)

• individual Seagate 146 Gbyte disks sustain > 50 Mbyte/s sequential read

• Ultra320 saturates at 200 Mbyte/s in one channel

• 4 disks per channel

• SW RAID striping across disks

(following SkyServer design of Gray, Szalay & colleagues)
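The figures above explain the choice of 4 disks per channel: four disks at 50 Mbyte/s each just fill a 200 Mbyte/s channel. A small Python sketch of that arithmetic follows; the number of channels in use and the decimal conversion 1 Tbyte = 10^6 Mbyte are assumptions, since the slides do not state them.

```python
DISK_MB_S = 50.0            # sustained sequential read per disk (slide figure)
CHANNEL_LIMIT_MB_S = 200.0  # single-channel saturation (slide figure)
DISKS_PER_CHANNEL = 4

def channel_throughput(disks, disk_mb_s=DISK_MB_S, limit=CHANNEL_LIMIT_MB_S):
    """Aggregate striped read rate on one channel, capped by the bus."""
    return min(disks * disk_mb_s, limit)

def scan_time_hours(table_tbyte, channels):
    """Idealised time for a sequential whole-table scan across several channels."""
    rate_mb_s = channels * channel_throughput(DISKS_PER_CHANNEL)
    return table_tbyte * 1e6 / rate_mb_s / 3600.0

# Example: a 0.83 Tbyte table striped over two channels (assumed configuration).
hours = scan_time_hours(0.83, channels=2)
```

Adding a fifth disk to a channel would add capacity but no bandwidth under these figures, which is why extra throughput comes from extra channels.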

Page 20

The SuperCOSMOS Science Archive (SSA)

• WFCAM Science Archive prototype

• Existing ad hoc flat file archive (inflexible, restricted access) re-implemented in an RDBMS

• Catalogue data only (no image pixel data)

• 1.3 Tbytes of catalogue data

• Implement a working service for users & developers to exercise prior to arrival of Tbytes of WFCAM data

Page 21

SSA has several similarities to WSA:

• spatial indexing is required over celestial sphere

• many source attributes in common, eg. position, brightness, colour, shape, …

• multi-colour, multi-epoch detection information results from multiple measurements of the same source

Page 22

Development method: “20 queries approach”

• a set of real-world astronomical queries, expressed in SQL

• includes joint queries between the SSA and SDSS

Example:

/* Q14: Provide a list of stars with multiple epoch measurements, which have light variations >0.5 mag. */

select objid into results
from Source
where (classR1=1 and classR2=1 and qualR1<128 and qualR2<128)
  and abs(bestmagR1-bestmagR2) > 0.5
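To try the logic of Q14 without a SQL Server installation, the query can be run against a toy table. The Python sketch below uses SQLite as a stand-in, so `select ... into results` becomes `CREATE TABLE results AS SELECT ...`; the four rows are invented, and only the column names come from the query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Source (
        objid INTEGER PRIMARY KEY,
        classR1 INTEGER NOT NULL, classR2 INTEGER NOT NULL,
        qualR1 INTEGER NOT NULL, qualR2 INTEGER NOT NULL,
        bestmagR1 REAL NOT NULL, bestmagR2 REAL NOT NULL
    )
""")

# Invented rows: (objid, classR1, classR2, qualR1, qualR2, bestmagR1, bestmagR2)
rows = [
    (1, 1, 1, 0, 0, 15.0, 15.9),    # class 1, clean, differs by 0.9 mag: selected
    (2, 1, 1, 0, 0, 16.0, 16.1),    # class 1, clean, differs by 0.1 mag: rejected
    (3, 2, 2, 0, 0, 17.0, 18.0),    # not class 1 at either epoch: rejected
    (4, 1, 1, 0, 256, 14.0, 15.0),  # quality flag too high at epoch 2: rejected
]
conn.executemany("INSERT INTO Source VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# SQLite equivalent of the T-SQL "select ... into results".
conn.execute("""
    CREATE TABLE results AS
    SELECT objid FROM Source
    WHERE (classR1 = 1 AND classR2 = 1 AND qualR1 < 128 AND qualR2 < 128)
      AND abs(bestmagR1 - bestmagR2) > 0.5
""")
variable_stars = [r[0] for r in conn.execute("SELECT objid FROM results ORDER BY objid")]
```

Note that the query has no positional constraint, so on the full archive it is one of the whole-table scans rather than an indexed lookup.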

Page 23

SSA relational model:

• relatively simple

• catalogues have ~256 byte records with mainly 4-byte attributes, ie. 50 to 60 per record

• so 2 tables dominate the DB
  - Detection: 0.83 Tbyte
  - Source: 0.44 Tbyte

Page 24

SSA has been implemented & data are being ingested:

Page 25

WSA has significant differences, however:

• catalogue and pixel data;

• science-driven, nested survey programmes (as opposed to SSA “atlas” maps of whole sky) result in complex data structure;

• curation & update within DBMS (whereas SSA is a finished data product ingested once into the DBMS).

Page 26

WFCAM Science Archive : relational design

Page 27

WFCAM Science Archive

Schematic picture of the WSA:

• Pixels:
  - one flat-file image store; access layer restricts public access
  - filenames and all metadata are tracked in DBMS tables with unrestricted access

• Catalogues:
  - WFAU incremental (no public access)
  - Public, released DBs
  - external survey datasets also held

Page 28

Image metadata relational model

• Programme & Field => vital

• library calibration frames stored & related

• primary/extension HDU keys logically stored & related

• this will work for VISTA

Page 29

Astrometric and photometric calibration data:

• require to store calibration information

• recalibration is required – esp. photometric

• old calibration coefficients must be stored

• time-dependence (versioning) complicates the relational model

Calibration data are related to images; source detections are related to images and hence their relevant calibration data

Page 30

Image calibration data:

• “set-ups” define nightly detector & filter combinations:
  - extinctions have nightly values
  - zero points (zps) have detector & nightly values

• coefficients split into current & previous entities

• versioning & timing recorded

• highly non-linear systematics are allowed for via 2D maps

Page 31

Catalogue data: general model

• related back through progenitor image to calibration data

• detection list for each programme (or set of sub-surveys)

• merged source entity is maintained

• merge events recorded

• list re-measurements derived

Page 32

Non-WFCAM data: general model

• each non-WFCAM survey has a stored catalogue (currently locally stored).

• cross-neighbour table:
  - records nearby sources between any two surveys
  - yields associated (“nearest”) source
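A cross-neighbour table of this kind can be pictured as every pair of sources from two surveys that lie within some pairing radius, with the closest pair giving the associated source. The Python sketch below is a brute-force illustration on invented coordinates; the 5 arcsec radius, the identifiers and the function names are all assumptions, and a production build would rely on the spatial index rather than an all-pairs loop.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector for a position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

def separation_arcsec(a, b):
    """Angular separation from the chord length between two unit vectors."""
    chord = math.dist(unit_vector(*a), unit_vector(*b))
    return math.degrees(2.0 * math.asin(min(1.0, chord / 2.0))) * 3600.0

def cross_neighbours(primary, secondary, radius_arcsec):
    """Rows (primary_id, secondary_id, separation) for all pairs within the radius."""
    rows = []
    for pid, ppos in primary.items():
        for sid, spos in secondary.items():
            sep = separation_arcsec(ppos, spos)
            if sep <= radius_arcsec:
                rows.append((pid, sid, sep))
    return sorted(rows, key=lambda r: (r[0], r[2]))

def nearest(rows):
    """Reduce the neighbour rows to the single closest match per primary source."""
    best = {}
    for pid, sid, sep in rows:
        if pid not in best or sep < best[pid][1]:
            best[pid] = (sid, sep)
    return {pid: sid for pid, (sid, _) in best.items()}

# Invented positions (RA, Dec in degrees) for two hypothetical surveys.
primary = {101: (150.0, 2.0), 102: (150.1, 2.0)}
secondary = {"a": (150.0, 2.0003), "b": (150.0005, 2.0), "c": (151.0, 2.0)}

rows = cross_neighbours(primary, secondary, radius_arcsec=5.0)
associated = nearest(rows)
```

Storing all neighbours, not just the nearest, lets a user apply a tighter radius or a different matching rule later without rebuilding the table.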

Page 33

Example: UKIDSS LAS & relationship to SDSS

• UKIDSS LAS overlaps with SDSS

• list measurements:
  - at positions defined by IR source, but in optical image data;
  - do not currently envisage implementing this the other way (ie. optical source positions placed in IR image data)

Page 34

Curation:

A set of entities to track in-DBMS processing:

• archived programmes have:
  - required filter set
  - required join(s)
  - required list-driven measurement product(s)
  - release date(s)
  - final curation task
  - one or more curation timestamps

• a set of curation procedures is defined for the archive

Page 35

WFCAM Science Archive: V1.0 schema implementation

Page 36

Implementation: unique identifiers (UIDs)

• meaningful UIDs, not arbitrary DBMS-assigned sequence no.

• following relational model, compound UIDs from appropriate attributes, eg.
  - detection UID is a combination of sequence no. on detector and detector UID
  - detector UID is a combination of extension no. of detector and multiframe UID

• but: top-level UIDs compounded into new attribute to avoid copying many columns down the relational hierarchy, eg.
  - meaningful multiframe UID is made up from UKIRT run no., and observation and ingest dates.
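The key hierarchy above can be sketched with nested keys: the three top-level attributes are folded into one multiframe UID, and each lower level adds a single column to its parent's key. The string encoding and the class names below are hypothetical; the slides say which attributes are combined, not how.

```python
from dataclasses import dataclass
from datetime import date

def multiframe_uid(run_no, obs_date, ingest_date):
    """Fold run no., observation date and ingest date into one attribute.

    Hypothetical encoding: a single value that children can carry as one column.
    """
    return f"{obs_date:%Y%m%d}_{run_no:05d}_{ingest_date:%Y%m%d}"

@dataclass(frozen=True)
class DetectorKey:
    multiframe_uid: str
    extension_no: int   # extension no. of the detector within the multiframe

@dataclass(frozen=True)
class DetectionKey:
    detector: DetectorKey
    seq_no: int         # sequence no. of the detection on that detector

mf = multiframe_uid(123, date(2004, 5, 1), date(2004, 5, 3))
det = DetectionKey(DetectorKey(mf, extension_no=2), seq_no=17)
```

Without the compounding step a detection key would need five columns (run no., two dates, extension, sequence); with it, three suffice, and the identifier still records where the data came from.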

Page 37

Critical Design Review, April 2003

Implementation: SQL Server database picture (I)

• Multiframe & nearest neighbour tables

Page 38

Implementation: SQL Server database picture (II)

• UKIDSS LAS & nearest neighbour tables

Page 39

Implementation: spatial index attributes

• Hierarchical Triangular Mesh algorithm (courtesy of P. Kunszt, A. Szalay & colleagues)

• HTM attribute HTMID for each occurrence of RA & Dec

• SkyServer functions & stored procedures: - spHTM_Lookup, spHTM_Cover, spHTM_To_String, fHTM_Cover etc.

Page 40

Implementation: table indexing

• standard RDBMS practice: index tables on commonly used fields

• one “clustered” index per table based on primary key (default)
  - results in re-ordering of data on disk

• further non-clustered indices:
  - when indexing on more than one field, put in order of decreasing selectivity
  - HTM index attribute is included as most selective in at least one non-clustered index on appropriate tables
  - index files stored on different disk volumes to tables to help minimise disk “thrashing”

=> experimentation required with real astronomical data and queries: SSA prototype
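The kind of experiment called for here can be rehearsed on a small scale. The Python sketch below uses SQLite as a stand-in for SQL Server, so the details differ: SQLite stores rows in `INTEGER PRIMARY KEY` order, which only loosely mirrors a clustered index, and it has no notion of placing indices on separate volumes. The table contents, the `htmid` values and the index name are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Source (
        objid INTEGER PRIMARY KEY,   -- rows are stored in objid order
        htmid INTEGER NOT NULL,      -- stand-in for the HTM spatial index attribute
        classR1 INTEGER NOT NULL,
        bestmagR1 REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO Source VALUES (?, ?, ?, ?)",
    [(i, 1000 + i % 500, 1 + i % 2, 12.0 + (i % 80) / 10.0) for i in range(5000)],
)

# Secondary index with the most selective field (htmid) first.
conn.execute("CREATE INDEX idx_source_htm_class ON Source (htmid, classR1)")

query = ("SELECT objid, bestmagR1 FROM Source "
         "WHERE htmid BETWEEN ? AND ? AND classR1 = ?")
params = (1100, 1110, 1)

# Ask the planner how it would execute the query; the last column is the detail text.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[-1] for row in plan)
matches = conn.execute(query, params).fetchall()
```

Inspecting `plan_text` shows whether the range on `htmid` is served by the index or by a full scan; repeating the check with different column orders is a cheap way to see why selectivity ordering matters before trying it on Tbyte-scale tables.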

Page 41

User interface & Grid context (I)

• “traditional” interfaces (ftp/http), eg. existing implementations:

WWW form interface

Access via CDS Aladin tool

Page 42

User interface & Grid context (II)

• SQL form interfaces:

Page 43

User interface & Grid context (III)

• web services under development (XML/SOAP/VOtable)

• other data (eg. SDSS, 2MASS, …) mirrored locally initially

• but aspiration is to enable usages employing distributed resources (both data and CPU) ultimately

recast web services as Grid services to integrate WSA into the VO Data Grid
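As an indication of what a VOTable-returning service has to produce, the Python sketch below serialises a small query result into the basic VOTABLE / RESOURCE / TABLE / FIELD / DATA / TABLEDATA element structure. It is deliberately minimal and is an assumption-laden illustration: it omits the version attribute, namespace and metadata that a conforming document needs, and the function name and sample rows are invented.

```python
import xml.etree.ElementTree as ET

def to_votable(fields, rows):
    """Serialise rows into a bare-bones VOTable-style XML string.

    fields: list of (name, datatype, unit) tuples; unit may be None.
    """
    votable = ET.Element("VOTABLE")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE")
    for name, datatype, unit in fields:
        attrs = {"name": name, "datatype": datatype}
        if unit:
            attrs["unit"] = unit
        ET.SubElement(table, "FIELD", attrs)
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return ET.tostring(votable, encoding="unicode")

# Invented result set: object id plus position in degrees.
fields = [("objid", "long", None), ("ra", "double", "deg"), ("dec", "double", "deg")]
rows = [(1, 180.0, 0.0), (2, 180.001, 0.001)]
xml_text = to_votable(fields, rows)
```

A self-describing table format of this kind is what lets a client combine results from archives it has never seen before, which is the point of moving from form interfaces to web and Grid services.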