Walid Aref
Panasonic Information
and Networking Technology Laboratory
Panasonic Technologies,
Inc.
Two Research Way
Princeton, NJ 08540
aref@research.panasonic.com
1. Introduction
Spatial databases are naturally distributed due to the data collection process, the distances between the geographical entities of interest, and the physical geographical boundaries (e.g., between countries, etc.). This calls for ways of interoperation among the distributed spatial databases.
In order for the interoperability among distributed spatial databases to be possible, several issues have to be addressed. Some of these are: resolving the heterogeneity among the spatial databases, e.g., in their data models and metadata, providing open standards and architectures to facilitate the interoperation and the scalability of the interoperability solutions, efficient distributed query processing and optimization for user queries that involve more than one spatial database.
In this paper, I focus on the latter issue; namely providing efficient algorithms for spatial query processing in distributed spatial databases.
In a distributed environment, the execution cost of a query involves the message (or communication) cost, in addition to the CPU and I/O costs.
Many research papers and prototype distributed database systems address the issue of distributed query processing and optimization. Although most of the techniques prescribed in the literature may apply to distributed spatial databases, there are optimizations that are of spatial nature that can help further reduce the overall cost of a distributed spatial query. I illustrate some of these optimizations in the paper for several spatial queries that are of common use in spatial databases.
The queries of interest here are the distributed spatial proximity and spatial join queries. In the following sections, I address some of the issues that are unique to distributed spatial query processing, explain briefly the distributed version of the spatial queries of interest and address issues related to answering each of them efficiently.
I assume that there are m distributed spatial databases g1, g2, ... gm). Each database has multiple collections of objects (e.g., tables), where each collection represents objects of a certain spatial data type. For example, in g1 , one might have a table containing point objects representing the fire stations in the geographical area covered by the spatial database g1 , a second table containing line segments representing the road network of the same area, and a third table containing polygons that represent the land use map of the area covered by g1 , etc.
2. Distributed Spatial Query Processing
A distributed query is any data manipulation statement that references databases at sites other than the query site; the site that initiated the query request. The processing sites are the sites where the actual processing of the query takes place, e.g., a join site is where one of the joins in the query takes place.
Several strategies for evaluating distributed queries exist in the literature, e.g., the ship whole and fetch matches strategies, joins using dynamically creating indexes, semijoins, and joins using hashing filters.
To illustrate the differences in distributed spatial query processing, consider the ship whole and the fetch matches strategy. In the ship whole strategy, the distributed query processor ships the entire tables to the processing site and saves it in a temporary table, performs the required operation (e.g., a join), and ships the result of the operation back to the query site. On the other hand, in the fetch matches strategy, when performing a join, the outer table of the join is scanned sequentially, and the distinct key values of the join column are shipped to the site where the inner table exists to perform the join.
We observe the following differences in distributed spatial databases that affect significantly the strategies for distributed spatial query processing.
One example of a spatial proximity query is the following:
Given the m databases of objects (g1, g2, ... gm) , find the m-tuples of objects (o1, o2, ..., om) such that each object oi E gi is within r units from each of the other objects of the same m-tuple.This variant of the spatial proximity query has many practical applications. One example of a query that has this form is the following query: find the shopping centers, libraries, and schools that are within "r" miles from each other. This reflects a proximity query that is of a two-dimensional nature, assuming that the geographical locations of the shopping centers, libraries, and schools are described using two-dimensional coordinates. Other examples exist in the one-, three-, and multi-dimensional space (e.g., in spatiotemporal queries).
A straightforward query plan for evaluating this query is by a query processing pipeline. There are drawbacks, that will be explained in the full paper, for using this technique to answer proximity queries both in the centralized and distributed cases. In the centralized case, one can find more efficient ways of answering this query.
In the full paper, I present a new technique for answering proximity queries in a distributed spatial database environment that uses the spatial characteristics of the objects to reduce the overall message cost of the query. This is based on techniques borrowed from computational geometry.
4. Distributed Spatial Join
In addition to the distribution of the filter and refinement steps over different sites, the full paper presents special issues related to the distributed spatial join operation. To illustrate a more complete example, the full paper focuses on spatial join over a distributed collection of polygonal databases.
5. Concluding Remarks
New issues and cost
parameters arise when processing distributed queries over distributed spatial
databases. The paper illustrates some of these issues and gives two concrete
examples of distributed algorithms that handle the distributed spatial
proximity and join queries.