
VO as a Data Grid, NeSC ‘03

WFCAM Science Archive

Nigel Hambly
Wide Field Astronomy Unit

Institute for Astronomy, University of Edinburgh


Background & context

• Wide Field Astronomy:
  - large-scale public surveys
  - multi-colour, multi-epoch imaging data sets

• Developments over recent decades:
  - whole-sky Schmidt telescope surveys (eg. SuperCOSMOS)
  - current generation optical/IR, eg. SDSS, WFCAM
  - next generation, eg. VISTA

Prime examples of key datasets that will be the cornerstone of the VO data grid


SuperCOSMOS scans photographic media:

• 10 Gbyte/day
• 3 colours: B, R & I
• 1 colour (R) at 2 epochs
• 0.7 arcsec/pixel
• 2 byte/pixel
• whole sky
• total data volume (pix): ~15 Tbyte
• S hemisphere completed 2002 (N hemisphere by end 2005)


WFCAM will image the sky directly using IR-sensitive detectors, deployed on a 4m telescope (UKIRT):

• 100 Gbyte/night
• 5 colours: ZYJHK; some multi-epoch imaging
• 0.4 arcsec/pixel
• 4 byte/pixel
• ~10% sky coverage in selected areas (various depths)
• total data volume (pix): ~100 Tbyte
• observations start in 2004; 7 yr programme planned


VISTA (also 4m) will have 4x as many IR detectors as WFCAM:

• 500 Gbyte/night
• 4 colours: zJHK
• targeted surveys (various depths & areas)
• 0.34 arcsec/pixel
• total data volume (pix): ~0.5 Pbyte
• observations start at the end of 2006

Characteristics of astronomy DBs (I)

• pixel images processed into lists of parameterised detections known as “catalogues” (parameterised data typically <10% of pixel data volume)

• detection association within survey data yielding multi-colour, multi-epoch source record


Characteristics of astronomy DBs (II)

• a detailed (but relatively small) amount of descriptive data accompanies images and catalogues

• the archive is required to track descriptive data and images along with catalogue data

• for current/future generation surveys, processing and ingest are dictated by observing patterns
• but users require well-defined, stable catalogue products on which to do their science

hence the requirement for periodic releases of stable, well-defined, read-only catalogues


Typical usages (I)

• increasingly involve jointly querying different survey datasets in different databases

  - example shows stellar population discrimination using SDSS colours and SSA proper motions

(Digby et al., astro-ph/0304056, MNRAS in press)


Typical usages (II)

• position & proximity searches are very common
  - spatial indexing (2D, spherical geometry) required

• statistical studies: ensemble characteristics of different species of source

• one-in-a-million searches for peculiar sources with highly detailed, specific properties
  - require whole-table scans

• …?

=> enable flexible interrogation to inspire new, innovative usage and promote new science
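
As a concrete flavour of the statistical usage above, a query might bin star-like sources by colour. This is a hedged sketch only: it reuses the Source attributes from the Q14 example later in this talk, and bestmagB, classB and qualB are assumed analogues of the R-band columns:

select round(bestmagB - bestmagR1, 1) as colourBR,
       count(*) as nSources
from Source
where classB = 1 and classR1 = 1      -- star-like in both bands
  and qualB < 128 and qualR1 < 128    -- good-quality detections
group by round(bestmagB - bestmagR1, 1)
order by colourBR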


Science archive development at WFAU:

• SSA: a few Tbytes

• WSA = 10x SSA

• VSA = 5x WSA

The approach is to set up a prototype archive system now (SSA), expand and implement the WSA to coincide with WFCAM operations, then scale to the VSA.

Database design: key requirements (I)

Flexibility:

• ingested data are rich in structure

• daily ingest; daily/weekly/monthly curation

• many varied usage modes

• protect proprietorial rights

• allow for changes/enhancements in design


Database design: key requirements (II)

Scalability:

• ~2 Tbytes of new data per year

• operating lifetime > 5 years

• maintain performance for increasing data volumes

Portability:

• V1.0/V2.0 phased approach to hardware/OS/DBMS


Database design: fundamentals (I)

• RDBMS, not OODBMS

• WSA V1.0: Windows/SQL Server (“SkyServer”)
  - V2.0 may be the same, DB2, or Oracle

• Image data stored as external flat files, not BLOBs
  - but image metadata stored in DBMS

• All attributes “not null”, ie. mandatory values

• Archive curation information stored in DBMS
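
For example, the metadata for an externally stored image might be tracked like this: a minimal sketch with illustrative column names, not the actual V1.0 schema. Note that every attribute is declared not null, per the rule above:

create table Multiframe (
    multiframeID bigint       not null,  -- meaningful compound UID (see later)
    fileName     varchar(256) not null,  -- external flat file holding the pixels
    obsDate      datetime     not null,  -- observation date
    filterID     int          not null,  -- filter identifier
    primary key (multiframeID)
)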


Database design: fundamentals (II)

• Calibration coefficients stored for astrometry & photometry

- instrumental quantities stored (XY in pix; flux in ADU)

- calibrated quantities stored based on current calibration

- all previous coefficients and versioning stored
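
This means a calibrated magnitude can always be recomputed on the fly from the stored instrumental flux and the current coefficients, roughly mag = zp - 2.5*log10(flux/exptime) - k*(airmass - 1). A hedged sketch only: Detection and Multiframe follow the table names used later in this talk, while PhotCalib and all column names are illustrative:

select d.seqNum,
       c.zeroPoint
         - 2.5 * log10(d.flux / f.expTime)   -- instrumental flux in ADU
         - c.extinction * (f.airmass - 1.0)  -- nightly extinction term
       as calMag
from Detection d
join Multiframe f on f.multiframeID = d.multiframeID
join PhotCalib  c on c.multiframeID = f.multiframeID  -- current calibration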


Database design: fundamentals (III)

• Reruns: reprocessed image data
  - same observations yield new source attribute values
  - re-ingest, but retain old parameterisation

• Repeats: better measurements of the same source
  - eg. stacked image detections
  - again, retain old parameterisation

• Duplicates: same source & filter but different observation
  - eg. overlap regions
  - store all data, and flag “best”
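
With the “best” flag in place, a seamless catalogue is recovered with a trivial predicate, eg. (a sketch; the flag and other column names are illustrative):

select objID, ra, dec
from Detection
where isBest = 1  -- keep only the flagged "best" of each duplicate set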


Hardware design (I)

• separate servers for
  - pixels
  - catalogue curation
  - catalogue public access
  - web services

• different hardware solutions
  - mass storage on IDE with HW RAID5
  - high bandwidth catalogue servers using SCSI and SW RAID


Hardware design (II)

• mass storage of pixels using low-cost IDE


Hardware design (III)

• dual P4 Xeon server

• independent PCI-X buses for maximum b/w

• dual channel Ultra320 SCSI adapters

High bandwidth catalogue server


Hardware design (IV)

• individual Seagate 146 Gbyte disks sustain > 50 Mbyte/s sequential read

• Ultra320 saturates at 200 Mbyte/s in one channel

• 4 disks per channel

• SW RAID striping across disks

(following SkyServer design of Gray, Szalay & colleagues)

The SuperCOSMOS Science Archive (SSA)

• WFCAM Science Archive prototype

• Existing ad hoc flat file archive (inflexible, restricted access) re-implemented in an RDBMS

• Catalogue data only (no image pixel data)

• 1.3 Tbytes of catalogue data

• Implement a working service for users & developers to exercise prior to arrival of Tbytes of WFCAM data


SSA has several similarities to WSA:

• spatial indexing is required over celestial sphere

• many source attributes in common, eg. position, brightness, colour, shape, …

• multi-colour, multi-epoch detection information results from multiple measurements of the same source


Development method: “20 queries approach”

• a set of real-world astronomical queries, expressed in SQL

• includes joint queries between the SSA and SDSS

Example:

/* Q14: Provide a list of stars with multiple epoch measurements, which have light variations >0.5 mag. */

select objid
into results
from Source
where (classR1 = 1 and classR2 = 1 and qualR1 < 128 and qualR2 < 128)
  and abs(bestmagR1 - bestmagR2) > 0.5
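
A joint SSA/SDSS query in the same spirit might select a colour plus proper-motion sample like that of the Digby et al. example. A hedged sketch only: the cross-match table, proper motion columns and units are illustrative, and SDSS is assumed to be mirrored in a local database (PhotoObj is the SDSS photometric table):

select s.objid, p.objid as sdssObjID
from Source s
join CrossNeighboursSDSS x on x.ssaObjID = s.objid
join SDSS..PhotoObj p on p.objid = x.sdssObjID
where sqrt(s.muAcosD * s.muAcosD + s.muD * s.muD) > 100.0  -- proper motion cut (mas/yr assumed)
  and p.g - p.r between 0.3 and 0.7                        -- example colour cut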


SSA relational model:

• relatively simple

• catalogues have ~256 byte records with mainly 4-byte attributes, ie. 50 to 60 per record

• so 2 tables dominate the DB:
  - Detection: 0.83 Tbyte
  - Source: 0.44 Tbyte


SSA has been implemented & data are being ingested:


WSA has significant differences, however:

• catalogue and pixel data;

• science-driven, nested survey programmes (as opposed to SSA “atlas” maps of the whole sky) result in a complex data structure;

• curation & update within DBMS (whereas SSA is a finished data product ingested once into the DBMS).


WFCAM Science Archive: relational design


WFCAM Science Archive

Schematic picture of the WSA:

• Pixels:
  - one flat-file image store; access layer restricts public access
  - filenames and all metadata are tracked in DBMS tables with unrestricted access

• Catalogues:
  - WFAU incremental (no public access)
  - public, released DBs
  - external survey datasets also held


Image metadata relational model

• Programme & Field entities => vital

• library calibration frames stored & related

• primary/extension HDU keys logically stored & related

• this will work for VISTA


Astrometric and photometric calibration data:

• calibration information must be stored

• recalibration is required – esp. photometric

• old calibration coefficients must be stored

• time-dependence (versioning) complicates the relational model

Calibration data are related to images; source detections are related to images and hence their relevant calibration data


Image calibration data:

• “set-ups” define nightly detector & filter combinations:
  - extinctions have nightly values
  - zps have detector & nightly values

• coefficients split into current & previous entities
• versioning & timing recorded
• highly non-linear systematics are allowed for via 2D maps
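
In outline, the current/previous split might be modelled like this (a sketch only; entity and attribute names are illustrative):

create table CurrentPhotCalib (
    setupID    bigint   not null,  -- nightly detector & filter set-up
    zeroPoint  real     not null,  -- zp: per detector, per night
    extinction real     not null,  -- extinction: per night
    version    int      not null,  -- incremented at each recalibration
    startTime  datetime not null,  -- when these coefficients took effect
    primary key (setupID)
)
-- PreviousPhotCalib holds the same attributes plus an endTime, keyed on
-- (setupID, version), so any earlier calibration can be reconstructed.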

Catalogue data: general model

• related back through progenitor image to calibration data

• detection list for each programme (or set of sub-surveys)

• merged source entity is maintained

• merge events recorded

• list re-measurements derived


Non-WFCAM data: general model

• each non-WFCAM survey has a stored catalogue (currently locally stored)
• cross-neighbour table:
  - records nearby sources between any two surveys
  - yields associated (“nearest”) source
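
Querying the cross-neighbour table for the associated nearest source might then look like this (a sketch; table and column names are illustrative):

select s.sourceID, x.slaveObjID as matchedID, x.distanceMins
from Source s
join SourceXSDSS x on x.masterObjID = s.sourceID
where x.distanceMins = (
      select min(x2.distanceMins)        -- keep only the nearest neighbour
      from SourceXSDSS x2
      where x2.masterObjID = s.sourceID)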


Example: UKIDSS LAS & relationship to SDSS

• UKIDSS LAS overlaps with SDSS

• list measurements:
  - at positions defined by IR source, but in optical image data
  - do not currently envisage implementing this the other way (ie. optical source positions placed in IR image data)


Curation:

• set of entities to track in-DBMS processing:

• archived programmes have:
  - required filter set
  - required join(s)
  - required list-driven measurement product(s)
  - release date(s)
  - final curation task
  - one or more curation timestamps

• a set of curation procedures is defined for the archive
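
The timestamps above might be tracked with a simple log entity (a sketch; names are illustrative):

create table CurationLog (
    programmeID  int         not null,  -- which archived programme
    curationTask varchar(64) not null,  -- eg. ingest, source merge, release
    timeStamp    datetime    not null,  -- when the task last completed
    primary key (programmeID, curationTask, timeStamp)
)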



WFCAM Science Archive: V1.0 schema implementation


Implementation: unique identifiers (UIDs)

• meaningful UIDs, not arbitrary DBMS-assigned sequence nos.

• following the relational model, compound UIDs from appropriate attributes, eg.
  - detection UID is a combination of sequence no. on detector and detector UID
  - detector UID is a combination of extension no. of detector and multiframe UID

• but: top-level UIDs compounded into new attribute to avoid copying many columns down the relational hierarchy, eg.
  - meaningful multiframe UID is made up from UKIRT run no., and observation and ingest dates
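
For instance, the multiframe UID could be packed arithmetically so that each component is recoverable by integer division (a sketch: epoch, field widths and values are illustrative):

-- encode observation & ingest dates as days since 2000-01-01 (5 digits each),
-- leaving 6 digits for the UKIRT run no.
declare @obsDays bigint, @ingDays bigint, @runNo bigint
set @obsDays = datediff(day, '20000101', '20040430')
set @ingDays = datediff(day, '20000101', '20040501')
set @runNo   = 123
select (@ingDays * 100000 + @obsDays) * 1000000 + @runNo as multiframeID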


Critical Design Review, April 2003

Implementation: SQL Server database picture (I)

• Multiframe & nearest neighbour tables


Implementation: SQL Server database picture (II)

• UKIDSS LAS & nearest neighbour tables


Implementation: spatial index attributes

• Hierarchical Triangular Mesh algorithm (courtesy of P. Kunszt, A. Szalay & colleagues)

• HTM attribute HTMID for each occurrence of RA & Dec

• SkyServer functions & stored procedures:
  - spHTM_Lookup, spHTM_Cover, spHTM_To_String, fHTM_Cover etc.
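
A cone search then follows the SkyServer pattern: cover the circle with a set of HTM trixel ranges, then apply an exact distance cut. A hedged sketch, assuming fHTM_Cover returns HTMIDstart/HTMIDend range pairs and that a SkyServer-style distance function is available:

select s.sourceID, s.ra, s.dec
from dbo.fHTM_Cover('CIRCLE J2000 180.0 -1.2 2.0') c  -- ra, dec, radius (arcmin)
join Source s
  on s.HTMID between c.HTMIDstart and c.HTMIDend      -- coarse HTM range scan
where dbo.fDistanceArcMinEq(180.0, -1.2, s.ra, s.dec) < 2.0  -- exact cut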


Implementation: table indexing

• standard RDBMS practice: index tables on commonly used fields

• one “clustered” index per table based on primary key (default)
  - results in re-ordering of data on disk

• further non-clustered indices:
  - when indexing on more than one field, put in order of decreasing selectivity
  - HTM index attribute is included as most selective in at least one non-clustered index on appropriate tables
  - index files stored on different disk volumes to tables to help minimise disk “thrashing”
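
In T-SQL these conventions translate into statements like the following (a sketch; the index names, column choices and filegroup are illustrative):

-- clustered index on the primary key: re-orders the table data on disk
create clustered index pk_Source on Source (sourceID)

-- composite non-clustered index led by the most selective attribute (HTMID),
-- placed on a filegroup bound to a separate disk volume
create nonclustered index ix_Source_htm
    on Source (HTMID, classR1, bestmagR1)
    on IndexFileGroup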

=> experimentation required with real astronomical data and queries: SSA prototype


User interface & Grid context (I)

• “traditional” interfaces (ftp/http), eg. existing implementations:

WWW form interface

Access via CDS Aladin tool


User interface & Grid context (II)

• SQL form interfaces:


User interface & Grid context (III)

• web services under development (XML/SOAP/VOTable)

• other data (eg. SDSS, 2MASS, …) mirrored locally initially

• but the aspiration is ultimately to enable usages employing distributed resources (both data and CPU)

=> recast web services as Grid services to integrate the WSA into the VO Data Grid
