Introduction to Sky Survey Problems

Bob Mann

Introduction to sky survey database problems

Astronomical data

Astronomical databases
– The Virtual Observatory – concept & status
– Large sky survey databases

Spatial indexing in astronomical databases

Case Study: SDSS & SkyServer

Observational Astronomy

Electromagnetic spectrum

[Figure: sky images across the electromagnetic spectrum. Panel labels: IRAS 25 μm, 2MASS 2 μm, DSS Optical, IRAS 100 μm, NVSS 20cm, GB 6cm, ROSAT ~keV, WENSS 92cm]

Astronomical data – in original form

Optical
– Image: array of pixel values

X-ray
– Event list: positions, arrival times, energies of all detected photons

Radio
– Interferometric visibilities: sparse Fourier transform of a region of the sky

Very different types of data

Astronomical data – in final form

Most research done using catalogue data
– i.e. tables of attributes of detected sources – mainly discrete sources (stars, galaxies, etc)
– Data compression
    Catalogue is a few % of image data volume
– Amenable to representation in relational DB
    Natural indexing by location in sky

Astronomical Databases

Sky survey archives
– Homogeneous data, standard reduction pipeline
– “Science Archive” – do science on DB

Telescope archives
– Semi-indexed collections of raw data files from all observations taken – heterogeneous
– Download data for reduction and analysis

Specialist data centres – collections of catalogues

Bibliographic databases – scans of major journals

The Virtual Observatory

Concept:
– Interoperable federation of all the world’s significant astronomical databases
– Facilitate multi-wavelength astronomy

Status:
– Several projects underway – AstroGrid in UK
– 5+ years’ work to create a fully working VO

The VO sets the context for the design of new sky survey databases

AstroGrid: www.astrogrid.org

Consortium:
– Edinburgh, Leicester, Cambridge, RAL, MSSL, Jodrell Bank, Queen’s Belfast

3 year (~£4M) project:
– 1 yr Phase A Study – finished end of 2002
– 2 yr Phase B Implementation – to end 2004

Web (later Grid) service framework; in Java

Currently building web services, portals, etc, and researching OGSA and OGSA-DAI

Large sky survey databases

Major science driver for AstroGrid – and VO
– New science – mining multi-wavelength data

Largest are optical/near-infrared sky surveys

Largest of these hosted in Edinburgh:
– current: SuperCOSMOS, SDSS (mirror)
– future: WFCAM, VISTA
– Each yields 1-10TB of catalogue data in RDBMS

Spatial queries in astronomy

Two important types:
– Select entries (with predicate) in area of sky
– Match entries (esp. between two tables)

Second is special case of first
– i.e. both boil down to “point-within-distance-of-point”
– but distances in two cases can be very different

Advantage in using a hierarchical spatial indexing scheme
– Perform spatial query at appropriate granularity
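To make the “point-within-distance-of-point” remark concrete, here is a minimal brute-force sketch in Python. It is only an illustration: the row layout (dictionaries with `id`, `ra`, `dec` keys) and the function names are assumptions, and no spatial index is used, so every row is tested.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Convert (Right Ascension, Declination) in degrees to a 3-D unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def within(p, q, radius_deg):
    """True if sky positions p and q, given as (ra, dec) in degrees,
    are separated by at most radius_deg."""
    dot = sum(a * b for a, b in zip(unit_vector(*p), unit_vector(*q)))
    # Two unit vectors are within an angle r of each other
    # exactly when their dot product is at least cos(r).
    return dot >= math.cos(math.radians(radius_deg))

def cone_search(table, centre, radius_deg, predicate=lambda row: True):
    """Query type 1: rows satisfying a predicate inside a circular sky area."""
    return [row for row in table
            if predicate(row)
            and within((row["ra"], row["dec"]), centre, radius_deg)]

def cross_match(table_a, table_b, radius_deg):
    """Query type 2: pairs of ids from two tables that lie within radius_deg.
    It is a cone search around every row of table_a, with a much smaller
    radius than a typical area selection."""
    return [(a["id"], b["id"])
            for a in table_a
            for b in cone_search(table_b, (a["ra"], a["dec"]), radius_deg)]
```

An area selection might use a radius of a degree or so, while a cross-match between catalogues would typically use a few arcseconds (for example `2 / 3600` degrees). That difference in scale is why a hierarchical index, searchable at several granularities, is attractive.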

Spatial Indexing in Astronomy

The Celestial Sphere

Many coordinate systems

Most common is the equatorial system, with Right Ascension and Declination as analogues of Longitude & Latitude

Spatial indexing in astronomical databases

Basic DBMS indexes are 1-D – e.g. B-trees

Some DBMSs support general 2-D indexing
– Usually using R-trees (or variants) – rectangles: astronomical experiments not too successful: [Clive]

Some DBMSs have native spatial indexing
– Little knowledge of this in astronomy - want to know more

But the Celestial Sphere is a sphere(!)
– Many geographical spatial DBs use planar projections

So, astronomers have felt the need to develop spatial indexing prescriptions of their own

Hierarchical Triangular Mesh - HTM

Developed by Sloan survey archive team at JHU

Start with projection of octahedron on sphere and subdivide triangles at their midpoints

Generate unique pixel ID code based on position in the sky and level in hierarchy – can index that with B-tree
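The subdivision described above can be sketched in a few lines of Python. This is a simplified illustration, not the JHU HTM package: the vertex ordering, the numbering of the eight base triangles (8 to 15) and the child numbering (0 to 3) are assumptions chosen so that each level appends two bits to the parent ID.

```python
import math

def normalise(v):
    """Scale a 3-vector to unit length (project it onto the sphere)."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def midpoint(a, b):
    """Midpoint of the great-circle arc between unit vectors a and b."""
    return normalise(tuple(x + y for x, y in zip(a, b)))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contains(tri, p, eps=1e-12):
    """True if p lies inside the spherical triangle tri, whose corners are
    ordered counter-clockwise as seen from outside the sphere."""
    a, b, c = tri
    return (dot(cross(a, b), p) >= -eps and
            dot(cross(b, c), p) >= -eps and
            dot(cross(c, a), p) >= -eps)

# Octahedron vertices and the eight base triangles (assumed ids 8..15).
V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
BASE = [(1, 5, 2), (2, 5, 3), (3, 5, 4), (4, 5, 1),
        (1, 0, 4), (4, 0, 3), (3, 0, 2), (2, 0, 1)]

def htm_id(ra_deg, dec_deg, depth):
    """Integer ID of the triangle containing (ra, dec) after `depth` subdivisions."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    p = (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))
    for i, corners in enumerate(BASE):
        tri = tuple(V[j] for j in corners)
        if contains(tri, p):
            pixel = 8 + i
            break
    else:
        raise ValueError("point not found in any base triangle")
    for _ in range(depth):
        a, b, c = tri
        w0, w1, w2 = midpoint(b, c), midpoint(a, c), midpoint(a, b)
        children = [(a, w2, w1), (b, w0, w2), (c, w1, w0), (w0, w1, w2)]
        for k, child in enumerate(children):
            if contains(child, p):
                pixel, tri = pixel * 4 + k, child
                break
        else:
            raise ValueError("point not found in any child triangle")
    return pixel
```

Because the result is a plain integer, a column holding `htm_id(ra, dec, depth)` for a fixed depth can be indexed with an ordinary B-tree. In this sketch the ID of the parent triangle is simply the child ID with its last two bits dropped.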

Hierarchical Equal Area Iso-Latitude Pixelisation (HEALPix)

Developed by Kris Gorski (now JPL/Caltech)

Start with division of sphere into twelve equal area curvilinear quadrilaterals, then divide each into four

Like HTM, produces a pixel code on which a B-tree index can be made

(Ian – HEALPix in Oracle?)
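Since each of the twelve base pixels is split into four at every level and all pixels have equal area, the pixel count and pixel size at a given level follow from simple arithmetic. A small Python sketch (the function names are illustrative, not part of any HEALPix library):

```python
import math

def healpix_pixel_count(level):
    """Number of pixels after `level` subdivisions of the 12 base pixels."""
    nside = 2 ** level          # pixels along one side of a base pixel
    return 12 * nside * nside   # equivalently 12 * 4**level

def healpix_pixel_area_sq_deg(level):
    """Area of one pixel in square degrees (all pixels have equal area)."""
    full_sky_sq_deg = 4 * math.pi * (180 / math.pi) ** 2
    return full_sky_sq_deg / healpix_pixel_count(level)
```

Choosing the level amounts to choosing the granularity of the index: a coarse level suits area selections, a fine level suits cross-matching at arcsecond scales.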

Sky survey DB case study: SkyServer for SDSS

Sloan Digital Sky Survey (SDSS):
– first of new generation of sky surveys

US-led team, dedicated telescope & camera

Image half of northern sky in 5 optical bands

Then obtain optical spectra for 1,000,000 galaxies

Estimated ~1TB of catalogue data

SDSS Archive

First of new generation of sky survey archives
– Represents the state-of-the-art in sky survey databases

Developed by Alex Szalay’s team at Johns Hopkins

Project started in earnest in about 1996
– OODBMSs seen as the coming thing
– SDSS chose Objectivity/DB for their archive:

~15 staff-years of effort later, they’d rewritten much of the DBMS themselves…and then jumped ship and started using MS SQL Server! The result is SkyServer (in collaboration with Jim Gray, MS Research)

SkyServer design considerations

Power & flexibility to pose arbitrary queries

Simple – astronomers ignorant of SQL!

Hide messy spherical trigonometry
– Distance on sphere between (a1,d1) and (a2,d2), in radians, is given in SQL by
    2.0*asin(sqrt(square(sin(0.5*(radians(d1-d2)))) +
    cos(radians(d1))*cos(radians(d2))*
    square(sin(0.5*(radians(a1-a2))))))
– Don’t want users typing this
– Don’t really want DBMS to evaluate expressions like this often
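For readers who want to check the expression outside the database, the same haversine formula can be transcribed into Python. This is a direct transcription rather than SkyServer code; the function name is an assumption, and the result is converted to degrees for convenience.

```python
import math

def angular_separation_deg(a1, d1, a2, d2):
    """Great-circle distance in degrees between (a1, d1) and (a2, d2),
    where a is Right Ascension and d is Declination, both in degrees."""
    sin_half_dd = math.sin(0.5 * math.radians(d1 - d2))
    sin_half_da = math.sin(0.5 * math.radians(a1 - a2))
    h = (sin_half_dd ** 2 +
         math.cos(math.radians(d1)) * math.cos(math.radians(d2)) * sin_half_da ** 2)
    # Clamp to 1.0 so rounding error cannot push asin outside its domain.
    return math.degrees(2.0 * math.asin(math.sqrt(min(1.0, h))))
```

Each call costs several trigonometric evaluations plus a square root, which is why evaluating it for every candidate row in a large table is unattractive.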

SkyServer spatial queries

Simple table-valued functions exposed to user:
– E.g. select count(*) from fGetNearbyObjEq(a,d,radius)
    where (a,d)=(Right Ascension, Declination)

Functions call SQL Server Extended Stored Procedure
– HTM index manipulation routines, implemented in a Dynamically Linked Library (DLL)
– DLL generated from HTM package in C++

Lessons from HTM implementation in SkyServer

SQL is not great for spherical trigonometry
– Messy to write, slow to compute

Have to define stored procedures/functions
– Expose a clean interface to users
– Let them pose queries the way they want to

Replace trig operations by integer arithmetic
– Library of HTM index operations underneath

Precompute tables of neighbouring objects
– Far fewer spatial match operations at query time
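One way to picture “replace trig operations by integer arithmetic” is the following Python sketch. It assumes a hierarchical ID scheme in which every subdivision appends two bits to the parent ID (child = 4 × parent + k), so that all descendants of a triangle form one contiguous integer range. The function names are hypothetical, and a sorted list searched with `bisect` stands in for the B-tree index.

```python
from bisect import bisect_left, bisect_right

def descendant_range(trixel_id, trixel_depth, index_depth):
    """Inclusive range of index_depth IDs lying inside the given coarser triangle."""
    shift = 2 * (index_depth - trixel_depth)   # two bits per subdivision level
    low = trixel_id << shift
    high = ((trixel_id + 1) << shift) - 1
    return low, high

def objects_in_trixel(sorted_ids, trixel_id, trixel_depth, index_depth):
    """Range scan over a sorted ID column: integer comparisons only, no trig."""
    low, high = descendant_range(trixel_id, trixel_depth, index_depth)
    start = bisect_left(sorted_ids, low)
    stop = bisect_right(sorted_ids, high)
    return sorted_ids[start:stop]
```

A search region would first be covered by a handful of triangles at suitable depths; each then becomes one range scan, and the exact distance formula is only needed for candidates in triangles that straddle the region boundary.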

Problems with this approach

How easy to develop stored procedures, etc?
– Needs detailed knowledge of DBMS
– Extended Stored Procedure calls slow

How well will query optimiser use HTM?
– …less well than built-in spatial index?…
– …but that might be poorly suited to astronomical applications…

How easy to implement all this in DBMSs other than SQL Server?

But this works reasonably well in practice!