34
2003 June 30 1 Spatial Indexing Clive Page

Spatial Indexing

  • Upload
    huela

  • View
    44

  • Download
    0

Embed Size (px)

DESCRIPTION

Spatial Indexing . Clive Page. Formats of Raw Data. Radio: Complex visibility for each polarisation at set of points sampling the complex ( u,v ) plane. Infra-red, Optical, Ultra-violet: Images from 1k ×1k to 18k×20k, collected at intervals of a few seconds to several minutes. - PowerPoint PPT Presentation

Citation preview

Page 1: Spatial Indexing

2003 June 30 1

Spatial Indexing

Clive Page

Page 2: Spatial Indexing

2003 June 30 2

Page 3: Spatial Indexing

2003 June 30 3

Formats of Raw Data

• Radio:– Complex visibility for each polarisation at set of points

sampling the complex (u,v) plane.• Infra-red, Optical, Ultra-violet:

– Images from 1k×1k to 18k×20k, collected at intervals of a few seconds to several minutes.

• X-ray, Gamma-ray: – Photon-even lists: get properties (x, y, time, energy) for

each detected photon. May get a list of millions of these over an integration period of a few hours.

Page 4: Spatial Indexing

2003 June 30 4

Formats of Reduced Data

• Images• Time-series• Spectra• Catalogues of astronomical objects:

– Vital to cross-identify objects from different wavebands, basis for many subsequent data mining investigations.

– Problem: tables can be long or wide:

Catalogue Rows ColumnsOptical: USNO-B 1,045,913,669 30 Infra-red: 2MASS 470,992,970 60 X-ray: 1XMM 56,711 379

Page 5: Spatial Indexing

2003 June 30 5

Main Functionality

• SELECT objects in a given small patch of sky– In a rectangle – to cover same region as an image– In a circle – to cover a radius around point of interest

(also known as search in a cone).– In a polygon – e.g. around extended object.

• Spatial JOIN: Cross-match objects from, for example, two wavebands or two epochs. – Principal matching criterion is overlap of error-circles.– Often important to find objects in one table which are

NOT matched in the other: need a left outer join.

Page 6: Spatial Indexing

2003 June 30 6

Functionality (continued)

• Self JOIN useful in some applications, for example:– Search for clusters of galaxies– Search for double stars– Find spatial distribution or spatial correlation functions

for particular classes of objects– Those observing with adaptive optics need a suitable

reference star in or near each field, e.g. might want to find all AGNs near to bright stars to plan an observing campaign.

Page 7: Spatial Indexing

2003 June 30 7

Current on-line services

• Search in a cone – many on-line services• Cross-matching (spatial JOIN) – few current examples:

– Astrobrowse (at GSFC) has pre-computed joins for a limited number of important catalogues.

– Vizier service (CDS) allows cross-match of user’s own table of positions with list of catalogues.

• But results of list of N sources and M catalogues are presented as unmerged list of M*N separate tables.

– Skyserver – joins between SDSS and other tables. • Self-join: do not know of any on-line services at present.

Page 8: Spatial Indexing

2003 June 30 8

Problems handling object catalogues• Positions use spherical-polar coordinates (RA, Dec)

– Right Ascension corresponds to geographic longitude– Declination corresponds to geographic latitude

• There are singularities at the poles and distortions in the scales everywhere except at the equator.

• RA wraps from 24 hours (360 degrees) to zero.• All object positions are imprecise positions have an

error radius.• Distances between points must use a great-circle distance

function not cartesian distance.• Two-dimensional indexing is really needed

Page 9: Spatial Indexing

2003 June 30 9

Indexing Possibilities

1. Home-brew slicing of the sky

2. Use B-tree on one spatial axis only

3. Use 1-d to 2-d mapping function then simple B-tree

4. Use true spatial index such as R-tree.

Page 10: Spatial Indexing

2003 June 30 10

Home-brew slicing of sky

• Both USNO-B and 2MASS issued as separate files, each covering 0.1 degree in declination, sorted in RA within each strip.

• Software from Harvard-Smithsonian Center for Astrophysics (WCSTOOLS) allows fairly efficient access to data stored in these strips.

• WFCSTOOLS is used by many astronomical archives around the world, (including LEDAS www.ledas.ac.uk).

• Ok for cone-search services, not suitable for spatial joins.

Page 11: Spatial Indexing

2003 June 30 11

Index one spatial axis only• Widely used, including Vizier collection of ~3000

astronomical tables. • Index on declination (avoids RA wrap-around problem)• Poor (but acceptable) performance on cone-search:

– E.g. consider USNO-B: a table of a billion rows– Typical search/join uses a radius of say 3 arc-seconds.– 17% chance of finding a match at a random position.– Index on declination effectively searches a strip 360° x

6 arc-seconds: get around 10,000 rows matching. Need to check all these to find the (0 or 1) true matches.

• Conclusion: might gain five orders of magnitude in efficiency by using a true 2-d index.

Page 12: Spatial Indexing

2003 June 30 12

One-dimensional index: spatial join problem

• Very inefficient, and efficiency is needed when joining one large table with another

• Join criterion is, typically:– SELECT * FROM cat1,cat2 WHERE

ABS(dec1 - dec2) < combinedError AND …

– I have not yet found any DBMS with an optimiser which uses an index when confronted with a join criterion of this form (or indeed any other expression for joining on an overlap of error ranges).

Page 13: Spatial Indexing

2003 June 30 13

Mapping functions (2-d to 1-d)

• Cover the space with cells (pixels) and number them.• Create conventional B-tree on resulting set of integers.• Each point in the sky maps to a single integer. • An area maps to a set of integers.• A seductive idea:

– If a small spatial area maps to a smallish range of integers, then a search over a spatial area might be done by a B-tree search over a small integer range.

– Various space-filling curves have been used in the hope that this works in practice, e.g. Z-order index, Hilbert, Peano curves, and many others.

Page 14: Spatial Indexing

2003 June 30 14

Hilbert Curve

Page 15: Spatial Indexing

2003 June 30 15

Z-order (bit-interleaved) Mapping Function

Page 16: Spatial Indexing

2003 June 30 16

Space-filling Curves as Mapping Functions

• Excellent performance when searching for single points.• For area searches, the median performance is adequate, but

all curves suffer the same drawback:– There exist cases in which nearby points in space are

very far apart on the curve (in the Z-order index this corresponds to a high-order bit flipping).

– Performance tests confirm this defect: the worst-case performance is so abysmal that the average performance is very poor.

• Another problem: simple cartesian grids also unsuited to spherical-polar coordinate searches as there are too many tiny distorted pixels near the poles.

Page 17: Spatial Indexing

2003 June 30 17

Better Mapping Functions

Aim: cover sky uniformly with pixels - integer pixel-codes. • HEALPix - Hierarchical Equal Area iso-Latitude

Pixelisation – invented at ESO for COBE. – Pseudo-square pixels, 2-d arrays of pixels suitable for

analysis of large-scale spatial structures.• HTM - Hierarchical Triangular Mesh – invented at Johns

Hopkins University – now used by several projects.– Triangular pixels. – Hierarchical numbering: high-order bits are a valid

HTM index for coarser grid.• Both algorithms perfectly good when searching for points.

Page 18: Spatial Indexing

2003 June 30 18

Hierarchical Equal Area iso-Latitude Pixelisation (HEALPix)

Page 19: Spatial Indexing

2003 June 30 19

Hierarchical Triangular Mesh (HTM)

Page 20: Spatial Indexing

2003 June 30 20

Spatial Join using Pixel-code MethodGiven table CAT1 with columns:• ID1 – primary key• RA• DEC• POSERR• MAGNITUDE• etc

Given table CAT2 with columns:• ID2 – primary key• RA• DEC• POSERR• FLUX• etc

Page 21: Spatial Indexing

2003 June 30 21

Create tables P1, P2 with pixel-code columnGiven table CAT1 with columns:• ID1 – primary key• RA• DEC• POSERR• MAGNITUDE• etcCreate table P1:• ID1 – foreign key• PCODE1 – primary key

Note: P1 and P2 have extra rows where error-circle overlaps two or more pixels.

Given table CAT2 with columns:• ID2 – primary key• RA• DEC• POSERR• FLUX• etcCreate table P2:• ID2 – foreign key• PCODE2 – primary key

Page 22: Spatial Indexing

2003 June 30 22

Join P1 and P2 on pixel-codes creating PJOINGiven table CAT1 with columns:• ID1 – primary key• RA• DEC• POSERR• MAGNITUDE• etcCreate table P1:• ID1 – foreign key• PCODE1 – primary keyCreate table PJOIN by joining P1

and P2 on PCODE1=PCODE2 • ID1• ID2

Given table CAT2 with columns:• ID2 – primary key• RA• DEC• POSERR• FLUX• etcCreate table P2:• ID2 – foreign key• PCODE2 – primary key

Page 23: Spatial Indexing

2003 June 30 23

PJOIN table

• Join using SELECT DISTINCT to remove any duplicates (two error-circles may each overlap two or more pixels).

• Table PJOIN identifies all pixels where error-circles potentially overlap– Circles may or may not actually overlap, may just be

nearby in the same pixels. • Next step: create B-tree index on PJOIN(ID1)

Page 24: Spatial Indexing

2003 June 30 24

Use PJOIN table to match catalogue rows

• Three-way join then produces required results, e.g.SELECT cols FROM CAT1, PJOIN, CAT2 WHERE CAT1.ID1=PJOIN.ID1 AND PJOIN.ID2=CAT2.ID2AND (2 * asin(sqrt(pow(sin((cat1.dec- cat2.dec)/2),2) + cos(cat1.dec) * cos(cat2.dec) * pow(sin((cat1.ra- cat2.ra)/2),2)))

<= cat1.poserr+cat2.poserr) ;• This works, and speed appears good, but more testing is

needed on a wider range of datasets.

Page 25: Spatial Indexing

2003 June 30 25

True Spatial Indexing

• Hot topic of research in computer science departments for more than 20 years

• Very many algorithms have been proposed:– BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, hB-

tree, kd-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q0-tree, Quadtree, R-tree, SKD-tree, SR-tree, SS-tree, TV-tree, UB-tree.

– So many alternatives, but none of them has properties as good as the B-tree in one dimension (e.g. compact and efficient, with fairly good worst-case performance).

• R-tree one of the earliest structures, one of the best.– R-trees are built into several modern DBMS.

Page 26: Spatial Indexing

2003 June 30 26

Spatial Options in current DBMSDBMS Spatial Index

DB2 Spatial Extender: multi-level grid file

Informix R-tree indexing included

Ingres

Interbase/Firebird

Microsoft SQL Server

MySQL R-trees on arbitrary polygons

Microsoft SQL Server

Oracle Spatial option: R-trees

Postgres R-trees on rectangular boxes

Sybase Spatial Option (Boeing SQS): R-trees

Page 27: Spatial Indexing

2003 June 30 27

Using R-trees

• Must draw a rectangular box outside each error circle• Boxes get rather extended (along RA axis) near poles• Need a subsequent filter to remove spurious matches where

rectangles overlap but error-circles do not.• If the error criterion alters (e.g. user wants 99% probability

circle rather than 90%) need to recreate column of boxes, and then recreate the R-tree index.– Solution: always use a box as large as anyone could

possible want, subsequent filtering on error-circles is still quite cheap.

Page 28: Spatial Indexing

2003 June 30 28

R-tree Performance

• Postgres: R-tree indexing works as advertised– R-trees are large, creation is slow, e.g. 2 hours for table

of 3.5 million rows. • MySQL: latest version allows R-trees on any polygon (but

anything above 4 sides is wasteful).– Very verbose external data format– Works as advertised, not yet measured performance.

• Informix: Kalpakis et al. (ADASS conference report) loaded part of USNO-A2 and found data load and R-tree creation would have taken 39 days for the entire table of 500 million rows.

Page 29: Spatial Indexing

2003 June 30 29

Comparison of Pixel-code and R-tree Methods

• Advantages– Pcode join seems to be faster (but not yet benchmarked

with identical systems).– Takes up less disc space in total.– Can use any DBMS, not just those with an R-tree or

other spatial data option.• Disadvantages

– Additional tables and indices have to be created.– More complex set of joins. – Needs external code as neither HTM or HEALPix can

be expressed as an SQL-callable function (because they return a variable-length array of integers).

Page 30: Spatial Indexing

2003 June 30 30

Indexing: summary

• Indexing on just one spatial axis is simply too inefficient for large tables and cannot support joins.

• R-trees are powerful and easy to use, but index creation times are a serious cause for concern.

• 2d1d mapping functions such as HTM or HEALPix are more complicated to use but may be faster in some cases.

• AstroGrid will continue work in this area.

Page 31: Spatial Indexing

2003 June 30 31

Idea: HTM as Universal Position Locator?

Can choose HTM (or HEALPix) resolution as high as needed• 32-bit words: pixels around 24 arcsec on a side.

– More pixels than objects in largest current catalogue• 64-bit words: pixels around 0.4 milliarcsec on a side.

– Ample precision for some time to comeMaybe every object catalogue should list the UPL of each

celestial object, allowing fast and easy searches and joins?• Fine for points in sky, but real objects have finite extent,

and some will correspond to a more than one UPL • Some objects move – should pixel-code change with time?• Reference frame also moves

Page 32: Spatial Indexing

2003 June 30 32

UPLs for extended objects: possible work-arounds

• Choose a pixel size large enough to encompass the error region in each case– In some cases this would be very large indeed– Different objects would have different pixel size

• Have extra rows in tables for objects crossing pixel boundaries.– Need to alter original tables, not always acceptable.

• Add variable-length vector to table to contain list of UPLs– Non-normalised table; not allowed by most DBMS

• Introduce separate table to map object IDs to UPLs.– Adds significant complexity to search/join operations

Page 33: Spatial Indexing

2003 June 30 33

Speeding up Spatial Data Access

Use Parallel Hardware such as Beowulf Clusters• We may be able to spread a large table over the disc drives

of several nodes in a cluster for faster searches.• Some support for this is built into DBMS such as Oracle,

DB2 and SQL Server. There is currently nothing to support it in Open Source products such as MySQL and Postgres.

• Possible do-it-yourself approach: split table into separate DBMS instances on separate nodes. Farm out query to all instances, then combine the results.

Column-oriented storage• Have tested Sybase-IQ: performance good, but expensive.

No multi-dimensional indexing.

Page 34: Spatial Indexing

2003 June 30 34

Can we do Distributed Searches and Joins?

• Object catalogues of interest are located on data archive servers all over the world. Is it feasible to search or join without copying entire tables?– Cone search services are becoming widespread. A few

servers allow a single search to be farmed out to many servers, no attempt is made to merge the results, they are just concatenated. Room for improvement here?

– An outer join requires every row in table#1 to be present in the output, so there is little point in attempting a join over the network, when it is simpler to copy table#1 to the host holding table#2.

• Perhaps need to set up astronomical data warehouse with services to import tables and join them.