Characterization
of Data, Queries, and Index Performance of Geographic Information Systems
with Applications to Informix Geodetic DataBlade Module
Kumar
Ramaiyer
Informix
Software, Inc.
Abstract
This paper discusses
various types of geometry data on which GIS applications build index, and
also presents a collection of queries that tests for various characteristics
of the index. The proposed merits include the size of the index and the
number of I/O's performed during the query. The goal is to come up with
a standard collection of data and queries that will help us to characterize
the performance of various systems. The paper proposes to show that it
is important to implement various indexing schemes and also allow for various
tunable parameters to optimize the performance for given data.
As a case study, we
apply these standard data and queries to evaluate and characterize the
performance of the Informix Geodetic DataBlade, a module that allows for
storage and retrieval of geographic data collected on or near the surface
of our Earth.
Introduction
Geographic Information
Systems (GIS) are useful tools for storing, retrieving, and visualizing
geographic data. They are widely used in various applications areas like,
utilities management, real-estate management, weather modeling, monitoring
of natural habitats like, wet lands, bird migration, rain forest, etc.,
navigation, map generation, and scheduling routing for various services
like fire personnel, security, postal, etc. The requirements of GIS include
support for:
Data Modeling: This
is the process of representing the information content of the applications,
like roads, pipelines, housing development, lakes, forest, etc., in some
internal form, which store the connectivity as well as the geometry of
the geographic data or objects.
Data Storage and Spatial
Index: The geographic data is typically stored in some database. Database
provides an integrated environment for storage and retrieval of information.
Database also allows for indexing of the underlying geographic data. Several
strategies for indexing geographic or spatial data have been developed
over the years.
Data Retrieval: Applications
retrieve the information in several ways, but they can be abstracted into
a finite collection of queries. These queries make use of the underlying
spatial index built on the data to retrieve the data in an efficient manner.
To compare the performance
of the various GIS, we need a common set of data and queries that represents
the needs of various applications that use GIS.
Motivation for Collection
of Data and Queries
The performance of
a GIS depends on the efficiency of the underlying database to store, index,
and retrieve the geographic data. But the indexing strategy is the key
factor in deciding the performance. There are several indexing strategies
for the spatial or geographic data, but there are two fundamental approaches:
-
Space Partition: This
approach involves partitioning the underlying domain space occupied by
the spatial data into different cells. Spatial indexing, then involves
associating the data with different cells and vice versa. During queries,
one first finds the cells occupied by the query region, and then finds
all the data associated with these resulting cells. The efficiency is then
proportional to the space required to store the data in different cells
and the number of data-cells searched during the query.
-
Quad-tree indexing
method and its several variations use this approach. This approach uses
a partition technique which is independent of the data and also subdivides
the underlying space at fixed positions. Therefore this allows for faster-direct
indexing, but the space and query performance varies a lot with data distributions.
-
Data Partition: This
approach involves partitioning the underlying data into different groups
based on their geometric properties. Typically this approach recursively
partitions the data into groups till the size of the group reduces to a
small constant. The groups are then organized in a hierarchical fashion
in multiple levels, and in each level the information of union of space
occupied by all the data below that level is stored. The efficiency is
then proportional to the number of levels in the hierarchy, the partitioning
strategy, and the number of levels and groups in each level to be search
during the query.
k-d-tree, R-Tree,
and their variations use this approach. Since this partition technique
is data dependent, the space requirement is proportional to size of the
data and is better than other approach. But the query performance again
varies with the data characteristics, and direct indexing is not possible
here.
The different characteristics
of these two approaches, their several variations, as well as the subtleties
and anomalies of the various implementations stress the need for a common
data set and queries, which will enable us to characterize the behavior
of a GIS. The characterization will then enable us to choose appropriate
indexing scheme for a give type of data.
The goal of this study
is therefore to first come with a data collection that reflect different
data distribution in the underlying domain, different types of connectivity
and geometry of the data, and different sizes of the individual data--which
are either artificially generated or reflect some natural data in our universe.
Once the data is collected, which is a separate process common to all the
GIS applications, various indexes can be built and evaluated for their
size. To test the spatial selectivity of the index, one needs to generate
a collection of queries that reflect the needs of the various applications
of GIS.
In the following sections,
we address these issues in data collection and queries, which we propose
to use as benchmark for comparing the performance of the GIS, and for characterization
of the efficiency of the different indexing schemes.
Data Collection
In this section we
describe the goals of the data collection. Each application deals with
a unique collection of objects that vary in their connectivity, geometry,
distribution, size, etc. The benchmark that tests the efficacy of a GIS
should include data to represent all these parameters.
Distribution Properties
The distribution properties
of the objects we would like to consider for the benchmark are:
-
Uniform distribution:
The objects are spread uniformly over the domain (see Fig.~\ref{fig:uniform}).
The uniform distribution should enable the indexing strategies to achieve
good space utilization as well a balanced indexing structure, thereby achieving
good performance. This type of data is therefore fundamental for any type
of benchmark data and it tests the quality of the implementation.
-
Non-Uniform distribution:
The objects are spread non-uniformly over the domain, and hence there are
regions where the concentration is more than that of others (see Fig.~\ref{fig:nonuniform}).
This type of data is very common in GIS applications. It significantly
affects the space utilization as well as query performance of most indexing
strategies.
Coupling Property
We define the coupling
as a property that determines the extent to which one object in the domain
interacts with the other objects i.e., whether it can intersect with any
of the other objects or whether it can be inside any of the other objects
and so on. The data collection should support the following coupling properties
between the objects:
-
Disjoint: All the objects
in the domain have no intersection and hence are disjoint (see Fig.~\ref{fig:disjoint}).
The disjointness property of the data should allow for good spatial partitioning
when building index structures. As a result queries should have better
selectivity and as a result perform better on data with this property.
-
Overlap: There are
objects in the domain that overlap or intersect with each other (see Fig.~\ref{fig:overlap}).
The inherent errors associated with the data collection process sometimes
leads to overlapping geometric data. Also during aerial photographic scanning
overlapping is allowed so as not to miss the details. The overlap causes
problems in terms of reduced spatial selectivity because it requires larger
bounding boxes, and also complicates the index selectivity by either forcing
the objects to be referenced multiple times or causes splitting of the
objects into smaller disjoint parts as in quad-tree partitioning. A good
performance of an index on data with this property is very critical for
GIS applications.
-
Nested: There are objects
in the domain that are nested within each other (see Fig.~\ref{fig:nested}).
The nested objects arise naturally in GIS applications as objects with
forbidden regions. For example, lakhs and oceans with islands, road ways,
plan of a campus without details of buildings, etc. The forbidden region
poses challenges in terms of selectivity during index scan, as they are
also included in the bounding-box representation of the object typically
used by the index. This causes serious problem when the area/volume of
the forbidden region is very large compared to that of the object itself.
A good performance of an index on data with this property is thus very
important and desirable for GIS applications.
-
Random: There are no
defined coupling properties among the objects. They may or may not have
any of the above three properties (see Fig.~\ref{fig:random}). A good performance
of the index on data with this property shows the robustness of the selectivity
of the index. The randomness in coupling as well as distribution will test
the weakness in the spatial selectivity of the index, if any assumptions
were made about the properties of the data in terms of coupling or distribution.
Size Property
We define the size
of a object as the storage requirement imposed on the GIS. Some of the
objects require compact representation and hence require lesser storage
than others. The size properties of the objects we would like to include
in the benchmark are:
-
Uniform or Variation
is bounded by a constant: All the objects are either of equal size or their
sizes do not vary much (see Fig.~\ref{fig:uniformsize}). Some applications
deal with the objects that are either simplicial like square, triangle,
quadrilateral, etc., or complex and still require only as many as vertices
as other objects in the domain to describe. The size variation generally
causes problems with data storage as not all pages of the secondary storage
of the index will contain equal number of objects and also may increase
the overall size of the index as some of the large objects may require
new pages. Algorithmically the size variation adds additional constraint
as the index-building algorithm needs to not only achieve good spatial
selectivity, but also need to fit the objects in as few pages as possible
so as not to increase the overall size of the index. Therefore, the almost-constant-size
property of the data should help GIS build good index, and hence a good
data to be part of the benchmark.
-
Variable: There are
no constraints on the individual sizes of the objects (see Fig.~\ref{fig:varsize}).
The Fig.~\ref{fig:varsize} shows two objects, one requiring four vertices
and other requiring many vertices (about 20) for description. The data
set consists of large objects that may require special handling since the
data base server generally puts a limit on the individual size of objects.
The special handling may affect the performance of the index, and hence
the data set with this property is very important to be part of the benchmark.
Area/Volume Property
The objects stored
in a GIS come in various aspect ratios and various shapes. Accordingly
they occupy different area/volume in the underlying Universe. Therefore
the volume or area properties of the objects we would like to include in
the benchmark are:
-
Zero Area/Volume Objects:
The point data have no area or volume, and there are number of applications
that deal with point data to abstract information like position of landmarks,
data collection sensors, etc.
-
Variable: There are
no constraints on the area or volume occupied by the individual objects
(see Fig.~\ref{fig:varshape}). The relatively larger objects or ``whole
Earth objects'' cause serious problems during index generation and also
during scan, as their bounding boxes occupy large portion of the underlying
domain, thereby reducing the spatial selectivity. The smaller objects will
fall inside the region covered by the bounding-box of the larger objects,
thereby rendering them ineffective during the index-scan. Therefore approaches
other than traditional bounding-box generation or some variation of it
is called when the domain contains larger and variable shaped objects.
-
Uniform or variation
is bounded by a constant: The area/volume occupied by the different objects
are all either same or the difference is bounded by a constant (see Fig.~\ref{fig:uniformshape}).
This similar-shaped data property will help index to achieve better spatial
selectivity as the bounding box of the individual objects tend not to interfere
with each other.
These various parameters
and their combinations reflect the properties of the data used in most
of the GIS applications, and hence are good candidates to be part of a
benchmark.
Data Formats and
Different Standards
The major hurdle in
coming up with a standard benchmark is the various data formats used by
the different vendors of GIS. There are efforts to unify the data representation
and to come up with a common data model. But there are multiple unifying
efforts and thus we have several standards available like OGIS, SDTS, etc.
As a result, there is a proliferation of software that converts the data
from one format to another.
Queries
Each application that
uses a GIS to retrieve geographic data has a unique set of queries. Also
the application differ in the manner in which they perform the accesses.
Some of the queries access same or adjacent regions in the underlying domain
repeatedly. Other queries access different regions that are not proximal,
but the number of different accesses performed over a given interval of
time is fixed.
Though the queries
and their patterns look vastly different they can be abstracted into a
finite collection of queries as follows:
-
Range Searching: This
is the most general form of queries used in the application, and it involves
searching the underlying domain to retrieve the objects that fit the search
criteria specified by the query range. The Fig.~\ref{fig:intersect} illustrates
the information retrieval through range searching. This query can be further
classified as:
-
Range Intersection:
Retrieve all the objects in the underlying domain that intersect or overlap
the region specified by the query range. In Fig.~\ref{fig:intersect}, range
intersection query retrieves the objects A through F.
-
Range Containment:
Retrieve all the objects in the underlying domain that are completely inside
or contained within the region specified by the query range. In Fig.~\ref{fig:intersect},
the range containment query retrieves the object D, which is the only object
inside the query range.
-
Range Outside: Retrieve
all the objects in the underlying domain that are completely outside the
region specified by the query range. In Fig.~\ref{fig:intersect}, this
query retrieves all the objects in the domain, except the objects A through
F.
-
Range Searching with
Tolerance: Retrieve all the objects in the underlying domain that either
intersect the query region or lies within certain distance from the boundary
of the query region. The tolerance is specified in terms of distance from
the query range. In Fig.~\ref{fig:tolerance}, this query retrieves all
the objects in the domain that either intersect the query region or lies
within the distance specified (shown as thick lines surrounding the query
region).
The shape of the
ranges also vary from the typical recti-linear ranges to other simplicial
objects like triangle, quadrangle, etc., to other objects like circle,
sphere, convex objects, simple polygons, complex polygons like polygons
with holes, etc.
-
Proximity Searching:
This form of query involves computing distances between objects, and hence
is applicable only to those domains, which contain objects that allow for
distance metric to be applied on them. This query can be further classified
as:
-
Within a Given Distance:
Retrieve all the objects that are within certain distance, specified as
parameter, from the query object (also specified as parameter). In Fig.~\ref{fig:within},
the query retrieves all the points within the circle, with radius equals
to the distance parameter.
-
Beyond a Given Distance:
Retrieve all the objects that are beyond the distance, specified as parameter,
from the query object (also specified as parameter). In Fig.~\ref{fig:within},
the query retrieves all the points that are outside the circle, with radius
equals to the distance parameter.
-
Post-Office Problem
or Nearest-Neighbor Queries: This classical query involves retrieving the
object from the domain which is at the closest distance to the query object
specified as a parameter (see Fig.~\ref{fig:postoffice}).
Other variations of
proximity searching include ray shooting i.e., finding closest object to
a query object along a given direction, closest pair i.e., finding the
pair of objects in the domain that are closest to each other, etc.
Apart from the above
abstract queries, the query patterns are unique for GIS applications. The
queries used in GIS applications do not uniformly sample the entire domain,
but are rather non-uniform and adaptive in nature. Typical GIS queries
have spatial coherence and temporal coherence properties:
-
Spatial Coherence:
This is the property that same regions are accessed more frequently than
other regions. For example, service-dispatch GIS query high-population
regions more often than low-population regions.
-
Temporal Coherence:
This is the property that between any two consecutive queries to the same
region, the set of queries, which we refer to as the working set, is fairly
small. For example, in a data collection GIS, sensors are finite in number,
widely distributed, and are tracked periodically to collect the data.
It is easy to assume
that virtual-memory properties of the underlying system would automatically
achieve good performance for queries with above properties. But there is
no performance guarantee, and it is important that the indexing structures
themselves adapt their organization to guarantee performance.
Informix Geodetic
DataBlade Module
In this section, we
discuss the application of ideas discussed in the previous to characterize
the performance of the Informix Geodetic DataBlade module. The Geodetic
DataBlade module extends the data types supported by the Object-Relational
DBMS Informix Universal Server to represent geometric objects such as points,
polygon, circle, ellipse, lines, line-segments, etc., on the earth's ellipsoidal
surface. The module supports true geodetic coordinate reference system
i.e., geodetic latitude and geodetic longitude to describe the coordinates
on the surface of the Earth. For each object, the DataBlade supports storing
an altitude, which is measured in meters above or below the surface of
the Earth, and also supports time-range, which helps to associate time
with the data. The Geodetic DataBlade module measures the distance between
two points along a geodesic, which is the shortest path between two points
on the surface of an ellipsoid.
The Geodetic DataBlade
module also provides:
-
SQL support for defining
columns in a tablecorresponding to the geodetic types.
-
SQL support for inserting
data into these columns.
-
Number of useful SQL-level
data manipulation functions like find the coordinates of a spatial-object
column, find the center of a circle-column, find the major and minor axis
of an ellipse-column, and setting and updating values of a spatial-object
column.
-
Number of SQL-level
boolean operators for performing queries:
-
Intersect: Test whether
two geodetic objects intersect.
-
Outside: Test whether
two geodetic objects do not intersect.
-
Inside: Test whether
one geodetic object is inside another.
-
Within: Test whether
two geodetic objects are within certain distance.
-
Beyond: Test whether
two geodetic objects are not within certain distance.
-
Support for building
spatial R-Tree index on a column of any geodetic type. It also supports
different heuristics for organizing the information in the internal pages
of the R-Tree.
The GIS vendors use
Geodetic DataBlade to build applications on top of SQL support layer, to
store and retrieve global geographic data.
Performance of the
Geodetic Blade
The performance of
the Geodetic DataBlade on the data collection and queries discussed in
previous sections will be presented in the final version of the paper.
Conclusions
The indexing methods
available in the literature have different characteristics, have different
variations, allow for several tunable parameters, and their implementations
invariably have certain limitations. The geographic data have various properties
in terms of their distribution, size, coupling, and volume. The queries
used by GIS applications are of various types.
Therefore to optimize
the performance of a Geographic Information System for a particular application,
one needs to first understand the behavior of the system for the data used
by the application. For example, for an application dealing with data with
several overlap, R-Tree indexing will be better, and for an application
dealing with data containing ``whole Earth objects'', bounding-box based
methods will not work properly, and so on.
Therefore it is
very important for a GIS to have support for multiple indexing schemes,
and allow for tunable parameters.
Acknowledgements
I would like to sincerely
thank all the members of Geodetic DataBlade team, Robert Uleman, Toni Guttman,
David Ashkenas, Jim Vaudagna, and Ed Katibah, for all the interesting discussions
about the topics in this paper.