Information Discovery in Spatio-Temporal Environmental Data (Extended Abstract)

Joseph JaJa, Steve Kelley, and Dave Rafkind

Institute for Advanced Computer Studies University of Maryland, College Park

Abstract

Substantial amounts of spatio-temporal data have been collected over the past few decades and are being generated at an ever-increasing rate. It is currently extremely difficult to extract and make use of the information embedded in these data collections. In this paper, we describe an information discovery system, called GIDE (Grid Information Discovery Engine), based on new spatial OLAP techniques and an XML infrastructure for periodically harvesting information from distributed sites holding large collections of environmental data. In particular, our spatial OLAP uses an efficient indexing structure based on an R-tree, coupled with a spatial measure represented by a linear quadtree, which allows a near-optimal covering of the spatial objects and data granule counts for any given geographic region. Users can query and correlate results by spatial and temporal extents, site and project, and scientific category.

1. Introduction

Substantial amounts of environmental data are already available at many sites around the world, and a much larger variety is currently being generated at a very high rate. The remotely sensed data from the NASA sensors alone is expected to exceed several terabytes a day within the next few years. These large volumes of spatio-temporal environmental data offer unprecedented opportunities for characterizing spatial phenomena, discovering interesting spatial and temporal trends, and studying new environmental and ecological models. However, the task of exploring the available data and performing analysis to discover associations between spatial patterns and environmental phenomena is currently quite difficult due to a number of factors.

First is the typical problem of having large amounts of data stored at different locations, in heterogeneous formats and systems. The heterogeneity here includes data of widely different resolutions registered using one of a number of projections of the Earth into two dimensions, raster versus vector data representations, and different types of multispectral raw data and derived higher-level products. Hence the first major challenge is to set up the infrastructure that allows information extraction from the different sources to create up-to-date summary information with sufficient detail, upon which data exploration and information discovery can be carried out. The infrastructure should be flexible enough to allow new types of spatial data at any repository to be accounted for in a timely fashion, as well as to allow the incorporation of additional data sources. Addressing this issue is not the main focus of this paper. However, we had to deal with this problem to build our prototype system since, in particular, we want to experiment with real and heterogeneous data collections and study the cost of regular bulk incremental updates. A very brief outline of the related infrastructure is included in this paper.


Second, the design of a spatial OLAP that provides up-to-date summary information and that is scalable to large amounts of spatio-temporal data (such as remotely sensed data and in-situ data collected by ground stations) is significantly more challenging than the design of a non-spatial OLAP. This is in part due to the difficulty in defining and handling spatial dimensions and measures. For example, the user typically specifies an arbitrary region on a global map, types of data at a high level of abstraction (say, Landsat Thematic Mapper TM imagery for land cover studies, and a particular species of birds), an arbitrary time interval (say, between January 1998 and February 2001), and desired spatio-temporal resolutions. The OLAP should quickly generate a set of maps over the specified region indicating the extent of the available TM scenes with an overlay of the bird information over the specified time period, such that each map corresponds to a temporal snapshot as specified by the query's temporal specification. The task of data exploration requires that the maps be generated very quickly, and hence some aggregate maps over predefined regions and predefined time intervals have to be precomputed.

Third, we will be dealing with different levels of abstraction, including multi-level semantic hierarchies in which the finest scale data (granule type) is represented at the leaves of each hierarchy. Some subsets of the nodes of a hierarchy may be accessed much more frequently than others, and hence serious consideration should be given to which types of aggregate information should be materialized. This is especially difficult since we do not have any a priori information about the query regions, which could be of any size and anywhere on the globe.

In this paper, we focus on the design and implementation of a spatial OLAP, paying particular attention to indexing and map representation details and focusing on I/O-efficient schemes. We also outline a materialization scheme that is tightly coupled with our indexing structure. This OLAP is the core component of a new prototype system developed by our group and called GIDE (Grid Information Discovery Engine). The main components of GIDE are: (1) an XML infrastructure for accessing, on a regular basis, collections of heterogeneous spatio-temporal environmental data that currently reside in four different sites; (2) an OLAP engine based on novel design ideas for spatial dimensions and measures, which includes a Hilbert-packed R-tree at its foundation, coupled with local region quadtrees representing the summary maps over predefined cells over the globe; and (3) a visualization interface that enables easy interaction, exploration, correlation, and discovery of various types of information using different levels of abstraction. Before proceeding further, we give a couple of examples of the types of interactive exploration possible with GIDE.

Query 1. A graduate student is asked to create time series maps for assessing forest cover loss near an area where logging roads are being built. The query is for a study area in western Canada (the region is selected as a polygon by the user) and for two time periods specified using the years 1986 and 1996. The maps should use the same spatial resolution and should have an additional overlay of other land cover types with pertinent transportation features.


Query 2. A land cover scientist wishes to examine the interaction of El Niño events with vegetation changes for a specific study area, to determine how much area in vegetation cover is lost or gained. The query is to produce and display two NDVI-change maps for the Southwestern United States (the region is specified as 5 SW states) that correspond to the beginning and end of an El Niño event.

A substantial amount of work has been done regarding data warehouses; see for example [1,2,5]. However, much less research has been devoted to spatio-temporal OLAPs, and to the authors' best knowledge, none has addressed the problem in its full generality as in this paper. For example, Stefanovic et al. [16] focus on the development of materialization methods for spatial data cubes by extending the work of [9]. Their setup is much more restricted, as they deal with a fixed partition of a given contiguous region. Other somewhat related efforts in spatial data mining appear in [3,8].

2. Overall Design

GIDE is designed from the ground up to handle heterogeneous spatial data collections, using semantic hierarchies to group subcollections into browsable entities. Semantic classification of data sources and related metadata was incorporated into our system using domain expertise. This classification can easily be extended to further domains; automatic semantic classification would require that the context of the information be extracted from the data sources. While the current prototype includes several hierarchies, we focus here on the category hierarchy, based on a preliminary partial classification of environmental data. An example of a small subtree of this hierarchy is shown below, level by level:

Land Cover
    Land Use    Carbon Cycle    Urbanization    Forest Cover
        Global    Regional    AVHRR Land Cover    MODIS    ETM Maps

The prototype system takes an object-oriented view of the data sources, using XML metadata import and update. We have designed DTDs for local and remote metadata import and export primitives. We assume that data sources have well-defined access protocols such as FTP, HTTP, or SQL gateways. The prototype assumes only the minimum capabilities necessary to access stored data and extract spatio-temporal attributes. Each source has a metadata description stored locally, specifying various access parameters such as the URI of the source and the protocol for accessing the related data; a minimal sketch of such a source description is given after the list below. The data sources included in our study are the following:

• The University of Maryland Global Land Cover Facility (GLCF) that offers access to around 10TB of primarily remotely sensed data for land cover studies. The major data types include the following:

o Landsat Thematic Mapper TM scenes, in raster form, each approximately 400MB in size and covering a region of approximately 1Km x 1Km. The GLCF has around 5,000 such scenes covering North America, South America, Africa, and a few regions in Asia, collected at different times.

o Derived GIS and raster data from the Central African Regional Program for the Environment (CARPE), augmented by human or animal population data.

o US Coastal Marsh data, in GIS format, derived from TM data and the USDA National Wetlands Inventory data. Each such data set is between 20MB and 40MB in size.

o An up-to-date collection of 250m resolution MODIS red and near infra-red surface reflectance data, and derived NDVI (Normalized Difference Vegetation Index) products.

• Long Term Ecological Research data, available through a site at the University of New Mexico, which includes the Hyperspectral AVIRIS Flyover data from Sevilleta. The latter data is of very high resolution, generated by an airborne imaging spectrometer along long, thin strips over the Sevilleta. Each raster image is about 500MB.

• The University of Kansas Museum data, which includes biological data collections stored at various universities and accessed using the Z39.50 protocol. Each data item is very small (less than 1KB), but with a wide variety of data types.
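As mentioned above the list, each data source is described by a locally stored metadata record giving its access parameters. The following is a minimal sketch of such a record; the field names are illustrative only and are not the DTD actually used by the prototype.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceDescription:
    """Locally stored description of one remote data source (illustrative fields only)."""
    name: str                  # e.g. "GLCF" or "LTER"
    uri: str                   # base URI of the source
    protocol: str              # access protocol: "FTP", "HTTP", "Z39.50", or an SQL gateway
    categories: List[str] = field(default_factory=list)  # category-hierarchy nodes covered

# Example record for the Global Land Cover Facility (URI and values are placeholders).
glcf = SourceDescription(name="GLCF", uri="ftp://glcf.example.org/",
                         protocol="FTP", categories=["Land Cover"])
```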

3. Spatial OLAP Design

The core of our system is a spatial OLAP that includes new methods for defining spatial dimensions and measures, and a novel implementation based on a combination of R-trees and region-based quadtrees. We first review some general concepts and then focus on our design. In general, an OLAP provides selected summary information, collected from multiple sources and modeled as a multidimensional structure. Each dimension corresponds to a "business perspective," and each cell contains the value(s) of some aggregate measure. Typically, there is a dimension table associated with each dimension. The data cube refers to the computation of the aggregate functions over all combinations of dimensions (all group-bys), and hence its size is exponential in the number of dimensions. The data cube can be used to answer OLAP queries, which require aggregation on different combinations of attributes.
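As a small illustration of why the full data cube quickly becomes impractical, the sketch below enumerates all group-bys over the four dimensions that our fact table will use later in this section; with d dimensions there are 2^d group-bys.

```python
from itertools import combinations

# The four fact-table dimensions introduced later in this section.
dims = ("longitude", "latitude", "time", "data_type")

# Every subset of the dimensions defines one group-by of the data cube.
groupbys = [combo for r in range(len(dims) + 1) for combo in combinations(dims, r)]

print(len(groupbys))  # 16 = 2**4, exponential in the number of dimensions
```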


The most commonly used modeling approach is based on the star schema, which consists of a large central table, called the fact table, coupled with several dimension tables. Each dimension table contains information specific to the dimension, while the fact table correlates the various dimensions using a number of measures. In our case, a dimension or a measure can be spatial, that is, it can represent information pertaining to a spatial region in either raster or GIS format. An OLAP server can be implemented as a relational OLAP (ROLAP), a multidimensional OLAP (MOLAP), or a hybrid of these approaches. The notion of concept hierarchies plays a very important role in OLAP design and is essential in our design. Such a hierarchy allows different levels of summarization and abstraction, and can be represented in general as a directed acyclic graph. For example, a temporal hierarchy can be defined as: day → week → month → season → year. An example of a spatial hierarchy consists of the following sequence: zipcode → counties → regions → states. Among the typical OLAP operations are: drill-down, roll-up, dicing, slicing, and pivoting. For example, a roll-up operation along a dimension involves the generation of summary information bottom-up along the corresponding hierarchy, and hence allows users to view data at different degrees of abstraction. For more details about OLAP operations, see [1,2,5].

It is clear that it is impractical to precompute and materialize all the group-bys of the data cube. An essential component of designing an OLAP is to determine the views that will be materialized, for which several algorithms exist.

Overview of Spatial OLAP Model

In order to focus our attention on the core issues, we abstract the static version of our problem as follows. We are given a large number of spatial objects, each characterized by spatial coordinates (say, longitude and latitude) and spatial extents, a time stamp, and an elementary data type. Note that, while the temporal and data type attributes can be enumerated using integers, the spatial coordinates and extents are real numbers that can take on an unbounded number of values (within fixed ranges). An elementary data type is a data type that appears at the finest scale and the highest resolution in the category hierarchy. Before defining the measures, let us look at the types of queries we want to handle. A typical query is specified by: (i) an arbitrary polygonal region on a map; (ii) some time interval; (iii) an environmental data class specified by a node of the category hierarchy; and (iv) spatio-temporal resolutions. The OLAP is expected to return the maps, each map clearly delineating the boundaries of the polygons where each data type of the specified class exists, together with a numerical count of each elementary data type (other numerical measures are useful, but we focus here on just the count measure). An output to a query requesting a map of the regions covered by the Landsat TM scenes available under a node in the category hierarchy is shown in Figure 1. We use different colors to reflect the ranges of the numerical counts of the TM scenes at different locations, but these were combined for the purpose of this example. The small global map shown in the lower right corner gives the relative location of the selected region on the globe.


Figure 1: Result of a query specified over the region indicated

In our design, we model the fact table with only four dimensions: two spatial dimensions, a temporal dimension, and the elementary data types. The temporal dimension is defined in a straightforward way, with an enumeration of time intervals (in our current prototype, monthly, starting January 1970 until the present). The elementary data types are enumerated in a left-to-right traversal of the leaves of the category hierarchy. This ensures that the data types corresponding to any node in the hierarchy tree appear contiguously in this ordering, a fact that we will exploit later. We now turn our attention to the spatial dimensions. Given the almost random coverage of a possibly very large number of patches of the globe for each data type, a coverage that is dynamically changing, and the fact that the user can select an arbitrary region over the globe, it is not clear over which spatial boundaries our information should be aggregated and precomputed, even if we fix the data type and time period. We organize our basic spatial granularity as follows. We impose a spherical grid of latitude and longitude lines over the whole globe, as shown in the following figure.
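The contiguity property of the left-to-right leaf enumeration can be made concrete with a small sketch: a left-to-right walk of the category hierarchy assigns consecutive ids to the leaves and records, for every internal node, the contiguous id range of the leaves below it. The tiny example tree is illustrative only, not the full hierarchy.

```python
def enumerate_leaves(node, next_id=0, ranges=None):
    """Left-to-right traversal: leaves get consecutive ids; every internal node
    is mapped to the contiguous [first, last] id range of its descendant leaves."""
    if ranges is None:
        ranges = {}
    children = node.get("children", [])
    if not children:                       # leaf = elementary data type
        ranges[node["name"]] = (next_id, next_id)
        return next_id + 1, ranges
    first = next_id
    for child in children:
        next_id, ranges = enumerate_leaves(child, next_id, ranges)
    ranges[node["name"]] = (first, next_id - 1)
    return next_id, ranges

# Illustrative fragment of a category hierarchy (structure assumed for the example).
tree = {"name": "Land Cover", "children": [
    {"name": "Forest Cover", "children": [{"name": "AVHRR Land Cover"},
                                          {"name": "MODIS"}, {"name": "ETM Maps"}]},
    {"name": "Land Use"}]}
_, ranges = enumerate_leaves(tree)
print(ranges["Forest Cover"])  # (0, 2): its leaves occupy a contiguous id range
```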


Figure 2: Spherical Gridding of the Globe

In our implementation, the latitude and longitude lines are each separated by one degree. This seems to work very well based on our preliminary experimental results. Hence we obtain a grid system of size 360x180, consisting of irregular cells. Note that each degree of latitude is approximately 111Km (69 miles), and each degree of longitude at the equator is also approximately 111Km. However, each degree of longitude at 60 degrees from the equator is approximately 56Km, and this decreases steadily as we approach either one of the poles. Each grid cell can be specified by a pair of values (longitude, latitude) defining the lower left corner of the cell. Hence we incorporate these two spatial dimensions in our fact table. Assuming a few thousand elementary data types and a few hundred data classes, the fact table can clearly be quite large, depending on data coverage. For land studies, all grid cells over water can be ignored, which substantially reduces the size of the problem. Before discussing the implementation, the related indexing issues, and higher-level materialization schemes, we describe the measures we are after. Consider a fixed grid cell C, say near the equator, which will roughly be of size 111Km x 111Km in our implementation. For example, a query may require a map over C that delineates the subregions determined by the available Landsat TM scenes of resolution 30m. Note that each TM scene covers roughly a slightly distorted rectangular region of size 1Km x 1Km, whose sides are not necessarily aligned with those of C. Hence the possibly overlapping TM scenes that fall within C can form an arbitrary polygonal region. How should these regions in C be represented? Note that numerical measures can give rise to the same issue. Suppose, for example, we want to compute the average temperature and precipitation in C over some period of time at a certain spatial resolution. The temperature and precipitation can be mapped into categorical data, and hence the average temperatures and precipitations decompose C into different types of regions (illustrated by different colors in our prototype), just as in the previous case.
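A minimal sketch of the gridding just described: a point is mapped to its one-degree cell, identified by the lower-left corner, and the east-west extent of a cell shrinks with latitude as cos(latitude). The longitude/latitude range conventions below are assumptions made for the example.

```python
import math

def grid_cell(lon, lat):
    """Return the lower-left corner (in whole degrees) of the 1-degree cell
    containing the point; assumes lon in [-180, 180) and lat in [-90, 90)."""
    return (int(math.floor(lon)), int(math.floor(lat)))

def lon_degree_km(lat_deg, km_per_degree=111.0):
    """Approximate east-west length of one degree of longitude at a given latitude."""
    return km_per_degree * math.cos(math.radians(lat_deg))

print(grid_cell(-76.94, 38.99))   # (-77, 38): cell containing College Park, MD
print(round(lon_degree_km(60)))   # 56 km, matching the figure quoted above
```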


To solve this problem, we do the following. We rasterize all our elementary data types to the finest desired resolution (100m resolution in our implementation), and then partition each cell C into a 2^k x 2^k mesh, where k is chosen to correspond to the finest desired resolution (and hence depends on the size of C). For our hypothetical cell C, k = 10. We now use the region quadtree [14] to represent any combination of points and regions within C. The region quadtree is based on the successive division of C into four equal-sized quadrants, such that each quadrant is subdivided further if it is not completely included in one of the regions defined in C. The next figure illustrates the region quadtree generated to answer the query whose result is shown in Figure 1.

Figure 3: The region quadtree generated for the query whose result is shown in Figure 1

The region quadtree is characterized as a variable resolution data structure in that it can be used to represent the regions in C at any resolution up to the highest resolution defined by the mesh on C. In particular, the nodes at level i correspond to a partition of C into a 2^i x 2^i mesh. It is well known that the quadtree representation can efficiently handle range search type queries within a cell. Moreover, the merging of the quadtrees corresponding to different temporal snapshots within the same cell, or of the quadtrees corresponding to adjacent cells, can be performed very quickly, in time proportional to the perimeters of the boundaries of the regions. As we will see later, these two types of operations are critical for the performance of our query answering algorithm. Details of the merging operations will be given in the full paper. See also [13,14].
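The following is a minimal sketch of a region quadtree over a 2^k x 2^k cell, together with the merge (union) of two such trees, e.g. two temporal snapshots of the same cell. It uses a simple pointer representation rather than the linear quadtree encoding used in the actual system, and is illustrative only.

```python
class Quad:
    """Region quadtree node: either a uniform leaf (value True = covered,
    False = empty) or an internal node with four equal-sized quadrants."""
    def __init__(self, value=None, children=None):
        self.value = value          # None for internal nodes
        self.children = children    # None for leaves

def build(mask, x0, y0, size):
    """Build the quadtree of a size x size boolean raster block (size = 2^k)."""
    cells = [bool(mask[y][x]) for y in range(y0, y0 + size)
                              for x in range(x0, x0 + size)]
    if all(cells) or not any(cells):            # uniform block -> leaf
        return Quad(value=cells[0])
    h = size // 2                               # otherwise split into 4 quadrants
    kids = [build(mask, x0, y0, h), build(mask, x0 + h, y0, h),
            build(mask, x0, y0 + h, h), build(mask, x0 + h, y0 + h, h)]
    return Quad(children=kids)

def merge(a, b):
    """Union of the regions represented by two quadtrees over the same cell."""
    if a.value is True or b.value is True:
        return Quad(value=True)
    if a.value is False:
        return b
    if b.value is False:
        return a
    kids = [merge(ca, cb) for ca, cb in zip(a.children, b.children)]
    if all(k.value is True for k in kids):       # collapse a fully covered node
        return Quad(value=True)
    return Quad(children=kids)

# Example: a 4x4 cell with two snapshots covering different quadrants.
m1 = [[1,1,0,0],[1,1,0,0],[0,0,0,0],[0,0,0,0]]
m2 = [[0,0,0,0],[0,0,0,0],[0,0,1,1],[0,0,1,1]]
merged = merge(build(m1, 0, 0, 4), build(m2, 0, 0, 4))
```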

Indexing Structure and Materialization Scheme

Thus far, each row of the fact table corresponds to a grid cell, a time interval, and an elementary data type, such that the grid cell is non-empty; otherwise, the cell does not appear in the table. We will soon address which additional spatio-temporal aggregates need to be materialized. Given the expected size of such a table, it has to reside in external memory, and hence it is critical to choose an implementation scheme that will minimize the I/O time. We use an indexing scheme based on a four-dimensional R-tree such that any query can be evaluated through a single range query on the R-tree.


Consider a four-dimensional space, say (x, y, z, u), where x and y correspond respectively to the longitude and latitude coordinates (x varying between 1 and 360 and y varying between 1 and 180 in integer values), z represents an enumeration of the time intervals in increasing order (in our implementation beginning with January 1970 up to the present), and u corresponds to an enumeration of the elementary data types ordered according to a left-to-right traversal of the leaves of the category hierarchy, as stated earlier. Hence our four-dimensional coordinate space has been restricted to an integer lattice, a fact that will make our implementation quite efficient. Each row of the fact table corresponds to a point in our (x, y, z, u)-space, for which we have a numerical measure and a region quadtree map. The numerical measure corresponds, within the specified time period, to the number of objects of the specified type belonging to the grid cell, and the region quadtree map defines the regions covered by the selected data type. With each such point, we associate a pointer to a record that contains the numerical measure and the linear quadtree representation.
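A minimal sketch of how one fact-table row can be viewed as a point on this integer lattice, with its associated measure record; the field names are illustrative and not the actual storage layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactEntry:
    """One fact-table row as a point (x, y, z, u) on the integer lattice."""
    x: int                # longitude cell index, 1..360
    y: int                # latitude cell index, 1..180
    z: int                # time-interval index (months since January 1970)
    u: int                # elementary data type id (left-to-right leaf order)
    count: int            # numerical measure: number of granules in this cell/period
    quadtree_offset: int  # disk offset of the linear-quadtree map for this entry

    def point(self):
        return (self.x, self.y, self.z, self.u)
```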

We now use a packed Hilbert R-tree [11] to represent the resulting collection of points. The Hilbert R-tree packs as many children into a parent node as possible while trying to make sure that children of the same parent are spatially close, by using the Hilbert space-filling curve. Specifically, the tree is constructed bottom-up as follows. The points are first sorted in increasing order of their Hilbert values. The first B points are removed and grouped under the same leaf node. The quadtrees corresponding to these leaves are stored contiguously on the disk. The value B is related to the size of a disk page. The next B points are again chosen from the remaining sorted list and grouped under the next leaf node, and so on. After all the leaf nodes are created, they are grouped similarly into internal nodes, using the order in which they were created. Tree nodes are created level by level until there is only one node, which becomes the root of the R-tree. Packed Hilbert R-trees have been shown to be the most efficient R-trees in general [4,6,10], and they can be dynamically updated quite efficiently as well. In our implementation, we chose B = 62, in which case a tree of height 4 corresponds to about 2^24 rows in the fact table.
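A compact sketch of this bottom-up packing, assuming the points are (x, y, z, u) tuples and that the Hilbert value of a 4-D lattice point is supplied by a hilbert_key function (computing d-dimensional Hilbert indices is assumed to be available and is outside the scope of the sketch); node MBRs are kept as (mins, maxs) corner pairs.

```python
def mbr_of_points(points):
    """Minimum bounding rectangle of 4-D lattice points: (mins, maxs)."""
    mins = tuple(min(p[d] for p in points) for d in range(4))
    maxs = tuple(max(p[d] for p in points) for d in range(4))
    return (mins, maxs)

def mbr_union(rects):
    mins = tuple(min(r[0][d] for r in rects) for d in range(4))
    maxs = tuple(max(r[1][d] for r in rects) for d in range(4))
    return (mins, maxs)

class Node:
    """R-tree node: leaves hold fact-table points, internal nodes hold child nodes."""
    def __init__(self, mbr, children=None, entries=None):
        self.mbr = mbr
        self.children = children
        self.entries = entries

def pack_hilbert_rtree(points, hilbert_key, B=62):
    """Bulk-load a packed Hilbert R-tree bottom-up: sort the points by Hilbert
    value, pack them B at a time into leaves, then pack the resulting nodes
    B at a time, level by level, until a single root remains."""
    pts = sorted(points, key=hilbert_key)
    level = [Node(mbr_of_points(pts[i:i + B]), entries=pts[i:i + B])
             for i in range(0, len(pts), B)]
    while len(level) > 1:
        level = [Node(mbr_union([n.mbr for n in level[i:i + B]]),
                      children=level[i:i + B])
                 for i in range(0, len(level), B)]
    return level[0]
```

During this same bottom-up pass, the per-node aggregate maps and counts described below can be accumulated as each parent node is created.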

Consider now the handling of a query specified by a certain rectangle S, a time interval [t1, t2], and a data type. More general queries defined over nodes of the category hierarchy can be handled similarly. Let T be the union of the global grid cells that have a non-empty intersection with S. Note that T could consist of only a single grid cell that properly contains S. Answering the query with T instead of S reduces the problem to the handling of a single range query on the R-tree. Given the very small depth of the R-tree and the spatial characteristics of the Hilbert space-filling curve, this operation can be performed very quickly. The outcome is a set of leaves. Three cases arise. The first is when S is properly contained in one of the global grid cells, in which case we have to perform a quadtree merge operation over the leaves corresponding to different temporal snapshots, followed by a range search on the resulting quadtree to extract the subregions corresponding to S. The second case is when S = T, which can be handled by merging the quadtrees and combining the counts of the various leaves identified during the range search on the R-tree. As stated earlier, we can merge quadtrees quite efficiently while updating the counts (note that the counts of two adjacent cells cannot simply be added together, even for a single data type, as a spatial object may extend over the two cells). The third case is when the boundary cells of T partially overlap with S, which is reduced to a combination of the first two cases. In practice, since we are only interested in summary information, it may be sufficient to either slightly enlarge or reduce S (say, by including a boundary grid cell if a significant portion of it overlaps with S).

However, in either case 2 or case 3, the number of leaves involved in answering a query can be quite high, resulting in a significant execution time. To improve performance, we materialize higher-level aggregate information as follows. We use a scheme similar to the one reported in [15]. For each internal node in the R-tree, we precompute the quadtree map obtained by merging the maps of its children, together with the combined counts of each data type stored in its children. This information can be computed during the process of creating the R-tree, in a bottom-up fashion. Hence the number of aggregations doubles. Given a range query window w, an allocation node of the R-tree is a node whose MBR (Minimum Bounding Rectangle) is covered entirely by w and whose parent is not an allocation node. Answering a range query consists of identifying the set of allocation nodes, followed by merging the information of these nodes. As shown in [15], the number of allocation nodes is in general substantially smaller than the corresponding number of leaves, which implies a much faster query algorithm. In fact, one can show that the maximum number of allocation nodes is O(B log_B n), where n is the total number of leaves.
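A short sketch of the allocation-node search over the Node structure built above: nodes whose MBR lies entirely inside the query window contribute their precomputed aggregate directly, nodes that only partially overlap are descended into, and partially overlapping leaves are returned separately for clipping against their quadtree maps. This is an illustration of the idea, not the actual GIDE query code.

```python
def covers(window, mbr):
    """True if the query window (mins, maxs) fully contains the node MBR."""
    return all(window[0][d] <= mbr[0][d] and mbr[1][d] <= window[1][d]
               for d in range(4))

def intersects(window, mbr):
    return all(window[0][d] <= mbr[1][d] and mbr[0][d] <= window[1][d]
               for d in range(4))

def allocation_nodes(node, window, alloc=None, partial_leaves=None):
    """Collect allocation nodes (MBR fully covered by the window, parent not
    covered) plus the partially covered leaves that need quadtree-level work."""
    if alloc is None:
        alloc, partial_leaves = [], []
    if not intersects(window, node.mbr):
        return alloc, partial_leaves
    if covers(window, node.mbr):
        alloc.append(node)               # use its precomputed map and counts
    elif node.children:
        for child in node.children:
            allocation_nodes(child, window, alloc, partial_leaves)
    else:
        partial_leaves.append(node)      # boundary leaf: clip against its quadtree
    return alloc, partial_leaves
```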

In addition to this scheme, we can significantly speed up frequent queries to any particular node F in the category hierarchy as follows. We build a 3-D Hilbert packed R-tree corresponding to the spatial and temporal dimensions, where each quadtree map includes all the data types pertaining to node F. We also materialize internal nodes as described above. This leads to very efficient spatio-temporal queries over the data category specified by the node F. We are currently exploring a dynamic scheme in which reduced dimension R-trees are built for those highly accessed nodes in the category tree.

4. Interface Design and Visualization

Since our main goal is to enable efficient exploration and scientific knowledge discovery over environmental data, the interface should be carefully designed to enable experimentation with the rich set of possible queries and an intuitive visualization of the query output. Other important aspects of the design include ease of use, speed, and a small client footprint. Our interface has three main parts (see Figure 4): a map browser and time sliders in the upper right section, detailed query selection according to one of the concept hierarchies in the upper left, as selected by the user (part of the opened-up category hierarchy is shown in the figure), and detailed query results, as desired, in the lower section.


Figure 4: Screen shot of the interface for the TM query

We have used the University of Minnesota MapServer to provide map browsing capability. Given latitude and longitude extents and a desired image size, the map server returns a GIF image of the requested map data. Our map browser applet sends latitude and longitude coordinates to the map server and retrieves map images for display to the user. The large main map is for direct browsing, while the smaller map in the lower right corner displays the portion of the world seen in the large map. Intuitive ways to pan and zoom are provided. For example, zooming involves drawing a box on the map with the left mouse button over the area to be zoomed in on. If the area of the zoom is too small, a secondary small map appears to show a finer degree of magnification. The time selection applet lets the user choose the start and end times that specify the temporal extents of the query. The query selection applet presents the user with the possibility of specifying a node in one of the concept hierarchies. The user chooses one of the hierarchies with a radio button at the top of the applet, and can open up and close branches of this hierarchy as desired. Opening a branch automatically updates the query results, using the spatial and temporal extents already specified on the right side of the interface. The query results section enables the user to further explore the results of the query in more detail. Results of a query can be visualized from different perspectives and at different resolutions as desired. In particular, a 3-D visualization of the query results is generated over the map region specified by the user.


5. References

1. S. Agarwal, R. Agarwal, P.M. Deshpande, G. Gupta, J.F. Naughton, R. Ramakrishnan, and S. Sarawagi, "On the computation of multidimensional aggregates," VLDB, pp. 506-521, 1996.
2. S. Chaudhuri and U. Dayal, "An overview of data warehousing and OLAP technology," ACM SIGMOD Record, 26, pp. 65-74, 1997.
3. M. Ester, H.P. Kriegel, and J. Sander, "Spatial data mining: a database approach," SSD'97, pp. 47-66, July 1997.
4. C. Faloutsos, Searching Multimedia Databases by Content, Kluwer Academic, 1996.
5. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
6. V. Gaede and O. Gunther, "Multidimensional access methods," ACM Computing Surveys, 30(2), pp. 170-231, 1998.
7. I. Gargantini, "An effective way to represent quadtrees," CACM, 25(12), pp. 905-910, 1982.
8. J. Han, K. Koperski, and N. Stefanovic, "GeoMiner: a system prototype for spatial data mining," ACM SIGMOD, pp. 553-556, 1997.
9. V. Harinarayan, R. Rajaraman, and J.D. Ullman, "Implementing data cubes efficiently," ACM SIGMOD, pp. 205-216, 1996.
10. I. Kamel and C. Faloutsos, "On packing R-trees," Int'l Conf. on Information and Knowledge Management, 1993.
11. I. Kamel and C. Faloutsos, "Hilbert R-tree: an improved R-tree using fractals," VLDB, pp. 500-509, Santiago, Chile, 1994.
12. Y. Kotidis and N. Roussopoulos, "An alternative storage organization for ROLAP aggregate views based on cubetrees," ACM SIGMOD, pp. 249-258, 1998.
13. J. Orenstein, "A comparison of spatial query processing techniques for native and parameter spaces," ACM SIGMOD, pp. 343-352, 1990.
14. H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, 1990.
15. Q. Shi and J. JaJa, "Techniques for handling spatio-temporal range value search queries on large scale raster data," to appear in SSDB'02.
16. N. Stefanovic, J. Han, and K. Koperski, "Object-based selective materialization for efficient implementation of spatial data cubes," IEEE Transactions on Knowledge and Data Engineering, 12(6), pp. 1-21, 2000.
