VO as a Data Grid, NeSC ‘03
WFCAM Science Archive
Nigel Hambly
Wide Field Astronomy Unit
Institute for Astronomy, University of Edinburgh
Background & context
• Wide Field Astronomy:
- large-scale public surveys
- multi-colour, multi-epoch imaging data sets
• Developments over recent decades:
- whole-sky Schmidt telescope surveys (eg. SuperCOSMOS)
- current generation optical/IR, eg. SDSS, WFCAM
- next generation, eg. VISTA
Prime examples of key datasets that will be the cornerstone of the VO datagrid
SuperCOSMOS scans photographic media:
• 10 Gbyte/day
• 3 colours: B, R & I
• 1 colour (R) at 2 epochs
• 0.7"/pixel
• 2 byte/pixel
• whole sky
• total data volume (pix): ~15 Tbyte
• S hemisphere completed 2002 (N hemisphere by end 2005)
WFCAM will image the sky directly using IR sensitive detectors; deployment on a 4m telescope (UKIRT):
• 100 Gbyte/night
• 5 colours: ZYJHK; some multi-epoch imaging
• 0.4"/pixel
• 4 byte/pixel
• ~10% sky coverage in selected areas (various depths)
• total data volume (pix): ~100 Tbyte
• observations start in 2004; 7 yr programme planned
VISTA (also 4m) will have 4x as many IR detectors as WFCAM:
• 500 Gbyte/night
• 4 colours: zJHK
• targeted surveys (various depths & areas)
• 0.34"/pixel
• total data volume (pix): ~0.5 Pbyte
• observations start at the end of 2006
Characteristics of astronomy DBs (I)
• pixel images processed into lists of parameterised detections known as “catalogues” (parameterised data typically <10% of pixel data volume)
• detection association within survey data yielding multi-colour, multi-epoch source record
Characteristics of astronomy DBs (II)
• a detailed (but relatively small) amount of descriptive data accompanies the images and catalogues
• required to track descriptive data and images along with catalogue data
• for current/future generation surveys, processing and ingest dictated by observing patterns
• but users require well-defined, stable catalogue products on which to do their science
hence the requirement for periodic releases of stable, well-defined, read-only catalogues
Typical usages (I)
• increasingly involve jointly querying different survey datasets in different databases
- example shows stellar population discrimination using SDSS colours and SSA proper motions
(Digby et al., astro-ph/0304056, MNRAS in print)
Typical usages (II)
• position & proximity searches v. common - spatial indexing (2d, spherical geom.) required
• statistical studies: ensemble characteristics of different species of source (see the sketch after this list)
• one-in-a-million searches for peculiar sources with highly detailed, specific properties - whole table scans
• …?
=> enable flexible interrogation to inspire new, innovative usage and promote new science
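For illustration, a hedged SQL sketch of an “ensemble” query of the kind described above; the table and column names are assumptions in the style of the example query given later, not an actual schema:
/* hypothetical ensemble query: R-band magnitude distribution of stellar sources */
select round(bestmagR1, 1) as magBin, count(*) as n
from Source
where classR1 = 1 and qualR1 < 128
group by round(bestmagR1, 1)
order by magBin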
Science archive development at WFAU:
• SSA: a few Tbytes
• WSA = 10x SSA
• VSA = 5x WSA
The approach is to set up a prototype archive system now (SSA), expand and implement the WSA to coincide with WFCAM operations, then scale to the VSA.
Database design: key requirements (I)
Flexibility:
• ingested data are rich in structure
• daily ingest; daily/weekly/monthly curation
• many varied usage modes
• protect proprietorial rights
• allow for changes/enhancements in design
Database design: key requirements (II)
Scalability:
• ~2 Tbytes of new data per year
• operating lifetime > 5 years
• maintain performance for increasing data volumes
Portability:
• V1.0/V2.0 phased approach to hardware/OS/DBMS
Database design: fundamentals (I)
• RDBMS, not OODBMS
• WSA V1.0: Windows/SQL Server (“SkyServer”) - V2.0 may be the same, DB2, or Oracle
• Image data stored as external flat files, not BLOBs - but image metadata stored in DBMS
• All attributes “not null”, ie. mandatory values (see the sketch below)
• Archive curation information stored in DBMS
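A minimal sketch of what the “not null” convention looks like at the schema level; the table, columns and sentinel defaults are illustrative assumptions, not the actual WSA schema:
/* every attribute is mandatory; missing measurements take agreed default (sentinel) values rather than NULL */
create table Detection (
    detectionID  bigint not null,                    -- meaningful compound UID
    multiframeID bigint not null,                    -- progenitor image frame
    ra           float  not null,
    [dec]        float  not null,
    fluxADU      real   not null default -999999.0,  -- sentinel for "no measurement"
    errBits      int    not null default 0,
    constraint pk_Detection primary key (detectionID)
)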
Database design: fundamentals (II)
• Calibration coefficients stored for astrometry & photometry
- instrumental quantities stored (XY in pix; flux in ADU)
- calibrated quantities stored based on current calibration
- all previous coefficients and versioning stored
Database design: fundamentals (III)
• Reruns: reprocessed image data
- same observations yield new source attribute values
- re-ingest, but retain old parameterisation
• Repeats: better measurements of the same source
- eg. stacked image detections
- again, retain old parameterisation
• Duplicates: same source & filter but different observation
- eg. overlap regions
- store all data, and flag “best”
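A hedged sketch of how the “best” flag might be used in queries; the flag and column names are illustrative assumptions, not the actual schema:
/* keep only the detection flagged "best" within overlap regions */
select objid, bestmagR1
from Source
where bestFlag = 1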
Hardware design (I)
• separate servers for
- pixels
- catalogue curation
- catalogue public access
- web services
• different hardware solutions
- mass storage on IDE with HW RAID5
- high bandwidth catalogue servers using SCSI and SW RAID
Hardware design (II)
• mass storage of pixels using low-cost IDE
Hardware design (III)
High bandwidth catalogue server:
• dual P4 Xeon server
• independent PCI-X buses for maximum b/w
• dual channel Ultra320 SCSI adapters
Hardware design (IV)
• individual Seagate 146 Gbyte disks sustain > 50 Mbyte/s sequential read
• Ultra320 saturates at 200 Mbyte/s in one channel
• 4 disks per channel
• SW RAID striping across disks
(following SkyServer design of Gray, Szalay & colleagues)
The SuperCOSMOS Science Archive (SSA)
• WFCAM Science Archive prototype
• Existing ad hoc flat file archive (inflexible, restricted access) re-implemented in an RDBMS
• Catalogue data only (no image pixel data)
• 1.3 Tbytes of catalogue data
• Implement a working service for users & developers to exercise prior to arrival of Tbytes of WFCAM data
SSA has several similarities to WSA:
• spatial indexing is required over celestial sphere
• many source attributes in common, eg. position, brightness, colour, shape, …
• multi-colour, multi-epoch detection information results from multiple measurements of the same source
Development method: “20 queries approach”
• a set of real-world astronomical queries, expressed in SQL
• includes joint queries between the SSA and SDSS
Example:
/* Q14: Provide a list of stars with multiple epoch measurements, which have light variations >0.5 mag. */
select objid into results
from Source
where (classR1=1 and classR2=1 and qualR1<128 and qualR2<128)
  and abs(bestmagR1-bestmagR2) > 0.5
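A second, hypothetical example in the joint SSA–SDSS spirit mentioned above; the cross-match and SDSS mirror table/column names are illustrative assumptions only:
/* red, high proper motion stars: SSA astrometry joined to mirrored SDSS photometry */
select s.objid, s.bestmagR1, p.r, p.i
into results2
from Source as s
  join CrossMatchSDSS as x on x.ssaObjid = s.objid     -- pre-computed nearest match
  join PhotoObj       as p on p.objID = x.sdssObjid    -- locally mirrored SDSS table
where s.classR1 = 1
  and sqrt(s.muAcosD*s.muAcosD + s.muD*s.muD) > 0.1    -- proper motion cut (assumed units: arcsec/yr)
  and (p.r - p.i) > 1.5                                -- red in SDSS colours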
SSA relational model:
• relatively simple
• catalogues have ~256 byte records with mainly 4-byte attributes, ie. 50 to 60 per record
• so 2 tables dominate the DB
- Detection: 0.83 Tbyte
- Source: 0.44 Tbyte
SSA has been implemented & data are being ingested:
WSA has significant differences, however:
• catalogue and pixel data;
• science-driven, nested survey programmes (as opposed to SSA “atlas” maps of the whole sky) result in a complex data structure;
• curation & update within DBMS (whereas SSA is a finished data product ingested once into the DBMS).
WFCAM Science Archive: relational design
WFCAM Science Archive
Schematic picture of the WSA:
• Pixels:
- one flat-file image store; access layer restricts public access
- filenames and all metadata are tracked in DBMS tables with unrestricted access
• Catalogues:
- WFAU incremental (no public access)
- public, released DBs
- external survey datasets also held
Image metadata relational model
• Programme & Field entities are vital
• library calibration frames stored & related
• primary/extension HDU keys logically stored & related
• this will work for VISTA
Astrometric and photometric calibration data:
• calibration information must be stored
• recalibration is required – esp. photometric
• old calibration coefficients must be stored
• time-dependence (versioning) complicates the relational model
Calibration data are related to images; source detections are related to images and hence to their relevant calibration data
Image calibration data:
• “set-ups” define nightly detector & filter combinations:
- extinctions have nightly values
- zero points (ZPs) have detector & nightly values
• coefficients split into current & previous entities
• versioning & timing recorded
• highly non-linear systematics are allowed for via 2D maps
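A hedged sketch of how stored instrumental quantities and the current nightly coefficients combine into calibrated magnitudes; the table and column names are assumptions, not the WSA schema:
/* standard form: mag = ZP - 2.5*log10(counts/expTime) - extinction*airmass */
select d.detectionID,
       c.zeroPoint - 2.5*log10(d.fluxADU / f.expTime) - c.extinction*f.airmass as calMag
from Detection as d
  join Multiframe     as f on f.multiframeID = d.multiframeID
  join PhotCalCurrent as c on c.detectorID = d.detectorID and c.nightID = f.nightID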
Catalogue data: general model
• related back through progenitor image to calibration data
• detection list for each programme (or set of sub-surveys)
• merged source entity is maintained
• merge events recorded
• list re-measurements derived
Non-WFCAM data: general model
• each non-WFCAM survey has a stored catalogue (currently locally stored)
• cross-neighbour table:
- records nearby sources between any two surveys
- yields associated (“nearest”) source
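A hedged sketch of how a cross-neighbour table yields the associated (“nearest”) source; the names are illustrative assumptions:
/* keep only the pair with the minimum separation for each master object */
select n.masterObjID, n.slaveObjID, n.distanceMins
from CrossNeighbours as n
where n.distanceMins = (select min(n2.distanceMins)
                        from CrossNeighbours as n2
                        where n2.masterObjID = n.masterObjID)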
Example: UKIDSS LAS & relationship to SDSS
• UKIDSS LAS overlaps with SDSS
• list measurements:
- at positions defined by IR source, but in optical image data
- do not currently envisage implementing this the other way (ie. optical source positions placed in IR image data)
Curation:
– set of entities to track in-DBMS processing:
• archived programmes have:
- required filter set
- required join(s)
- required list-driven measurement product(s)
- release date(s)
- final curation task
- one or more curation timestamps
• a set of curation procedures is defined for the archive
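An illustrative sketch of a curation-tracking entity of the kind described above (an assumption, not the actual WSA schema):
/* one row per curation task run for a given archived programme */
create table ProgrammeCurationHistory (
    programmeID  int         not null,   -- archived programme
    curationTask varchar(64) not null,   -- eg. 'source merging', 'list re-measurement'
    timeStamp    datetime    not null,   -- when the task was run
    releaseDate  datetime    not null,   -- public release this run feeds into
    constraint pk_CurHist primary key (programmeID, curationTask, timeStamp)
)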
WFCAM Science Archive: V1.0 schema implementation
Implementation: unique identifiers (UIDs)
• meaningful UIDs, not arbitrary DBMS-assigned sequence no.
• following relational model, compound UIDs from appropriate attributes, eg.
- detection UID is a combination of sequence no. on detector and detector UID
- detector UID is a combination of extension no. of detector and multiframe UID
• but: top-level UIDs compounded into new attribute to avoid copying many columns down the relational hierarchy, eg.
- meaningful multiframe UID is made up from UKIRT run no., and observation and ingest dates.
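An illustrative sketch only (NOT the actual WSA encoding) of how compound UIDs can stay meaningful by packing each level of the hierarchy into a fixed decimal range of a bigint:
-- assumed multiframe UID encoding: observation date + UKIRT run no., eg. 20040115001
-- detector UID  = multiframe UID * 100     + FITS extension no.
-- detection UID = detector UID   * 1000000 + sequence no. on that detector
select (cast(20040115001 as bigint) * 100 + 3) * 1000000 + 12345 as detectionID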
Implementation: SQL Server database picture (I)
• Multiframe & nearest neighbour tables
Implementation: SQL Server database picture (II)
• UKIDSS LAS & nearest neighbour tables
Implementation: spatial index attributes
• Hierarchical Triangular Mesh algorithm (courtesy of P. Kunszt, A. Szalay & colleagues)
• HTM attribute HTMID for each occurrence of RA & Dec
• SkyServer functions & stored procedures: - spHTM_Lookup, spHTM_Cover, spHTM_To_String, fHTM_Cover etc.
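A hedged sketch of a cone search using the HTM attribute; it assumes fHTM_Cover returns a table of (HTMIDstart, HTMIDend) ranges covering a circular region, and the exact signatures of the SkyServer routines (and the column names) may differ:
select s.objid, s.ra, s.[dec]
from Source as s
  join dbo.fHTM_Cover('CIRCLE J2000 180.0 -1.25 2.0') as c
    on s.htmID between c.HTMIDstart and c.HTMIDend
-- a final exact great-circle distance cut is still needed, since the HTM
-- ranges only approximately cover the requested region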
Implementation: table indexing
• standard RDBMS practice: index tables on commonly used fields
• one “clustered” index per table based on primary key (default)
- results in re-ordering of data on disk
• further non-clustered indices:
- when indexing on more than one field, put in order of decreasing selectivity
- HTM index attribute is included as most selective in at least one non-clustered index on appropriate tables
- index files stored on different disk volumes to tables to help minimise disk “thrashing”
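A sketch of the indexing pattern just described, with illustrative table, column and filegroup names (assumptions, not the actual schema):
-- one clustered index per table, on the primary key: re-orders the data on disk
create clustered index cix_Source on Source (objid)
-- a further non-clustered index led by the HTM attribute (most selective first),
-- placed on a different filegroup (disk volume) to the table to limit thrashing
create nonclustered index ix_Source_htm on Source (htmID, classR1) on IndexFilegroup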
=> experimentation required with real astronomical data and queries: SSA prototype
User interface & Grid context (I)
• “traditional” interfaces (ftp/http), eg. existing implementations:
WWW form interface
Access via CDS Aladin tool
User interface & Grid context (II)
• SQL form interfaces:
User interface & Grid context (III)
• web services under development (XML/SOAP/VOtable)
• other data (eg. SDSS, 2MASS, …) mirrored locally initially
• but the aspiration is ultimately to enable usages employing distributed resources (both data and CPU)
recast web services as Grid services to integrate WSA into the VO Data Grid