Characterization of Data, Queries, and Index Performance of Geographic Information Systems with Applications to Informix Geodetic DataBlade Module

Kumar Ramaiyer
Informix Software, Inc.

Abstract

This paper discusses various types of geometry data on which GIS applications build index, and also presents a collection of queries that tests for various characteristics of the index. The proposed merits include the size of the index and the number of I/O's performed during the query. The goal is to come up with a standard collection of data and queries that will help us to characterize the performance of various systems. The paper proposes to show that it is important to implement various indexing schemes and also allow for various tunable parameters to optimize the performance for given data.
As a case study, we apply these standard data and queries to evaluate and characterize the performance of the Informix Geodetic DataBlade, a module that allows for storage and retrieval of geographic data collected on or near the surface of our Earth.

Introduction

Geographic Information Systems (GIS) are useful tools for storing, retrieving, and visualizing geographic data. They are widely used in various applications areas like, utilities management, real-estate management, weather modeling, monitoring of natural habitats like, wet lands, bird migration, rain forest, etc., navigation, map generation, and scheduling routing for various services like fire personnel, security, postal, etc. The requirements of GIS include support for:
Data Modeling: This is the process of representing the information content of the applications, like roads, pipelines, housing development, lakes, forest, etc., in some internal form, which store the connectivity as well as the geometry of the geographic data or objects.
Data Storage and Spatial Index: The geographic data is typically stored in some database. Database provides an integrated environment for storage and retrieval of information. Database also allows for indexing of the underlying geographic data. Several strategies for indexing geographic or spatial data have been developed over the years.
Data Retrieval: Applications retrieve the information in several ways, but they can be abstracted into a finite collection of queries. These queries make use of the underlying spatial index built on the data to retrieve the data in an efficient manner.
To compare the performance of the various GIS, we need a common set of data and queries that represents the needs of various applications that use GIS.

Motivation for Collection of Data and Queries

The performance of a GIS depends on the efficiency of the underlying database to store, index, and retrieve the geographic data. But the indexing strategy is the key factor in deciding the performance. There are several indexing strategies for the spatial or geographic data, but there are two fundamental approaches:
The different characteristics of these two approaches, their several variations, as well as the subtleties and anomalies of the various implementations stress the need for a common data set and queries, which will enable us to characterize the behavior of a GIS. The characterization will then enable us to choose appropriate indexing scheme for a give type of data.
The goal of this study is therefore to first come with a data collection that reflect different data distribution in the underlying domain, different types of connectivity and geometry of the data, and different sizes of the individual data--which are either artificially generated or reflect some natural data in our universe. Once the data is collected, which is a separate process common to all the GIS applications, various indexes can be built and evaluated for their size. To test the spatial selectivity of the index, one needs to generate a collection of queries that reflect the needs of the various applications of GIS.
In the following sections, we address these issues in data collection and queries, which we propose to use as benchmark for comparing the performance of the GIS, and for characterization of the efficiency of the different indexing schemes.

Data Collection

In this section we describe the goals of the data collection. Each application deals with a unique collection of objects that vary in their connectivity, geometry, distribution, size, etc. The benchmark that tests the efficacy of a GIS should include data to represent all these parameters.

Distribution Properties

The distribution properties of the objects we would like to consider for the benchmark are:
Coupling Property
We define the coupling as a property that determines the extent to which one object in the domain interacts with the other objects i.e., whether it can intersect with any of the other objects or whether it can be inside any of the other objects and so on. The data collection should support the following coupling properties between the objects:
Size Property
We define the size of a object as the storage requirement imposed on the GIS. Some of the objects require compact representation and hence require lesser storage than others. The size properties of the objects we would like to include in the benchmark are:
Area/Volume Property
The objects stored in a GIS come in various aspect ratios and various shapes. Accordingly they occupy different area/volume in the underlying Universe. Therefore the volume or area properties of the objects we would like to include in the benchmark are:
These various parameters and their combinations reflect the properties of the data used in most of the GIS applications, and hence are good candidates to be part of a benchmark.

Data Formats and Different Standards

The major hurdle in coming up with a standard benchmark is the various data formats used by the different vendors of GIS. There are efforts to unify the data representation and to come up with a common data model. But there are multiple unifying efforts and thus we have several standards available like OGIS, SDTS, etc. As a result, there is a proliferation of software that converts the data from one format to another.

Queries

Each application that uses a GIS to retrieve geographic data has a unique set of queries. Also the application differ in the manner in which they perform the accesses. Some of the queries access same or adjacent regions in the underlying domain repeatedly. Other queries access different regions that are not proximal, but the number of different accesses performed over a given interval of time is fixed.
Though the queries and their patterns look vastly different they can be abstracted into a finite collection of queries as follows:
Other variations of proximity searching include ray shooting i.e., finding closest object to a query object along a given direction, closest pair i.e., finding the pair of objects in the domain that are closest to each other, etc.
Apart from the above abstract queries, the query patterns are unique for GIS applications. The queries used in GIS applications do not uniformly sample the entire domain, but are rather non-uniform and adaptive in nature. Typical GIS queries have spatial coherence and temporal coherence properties:
It is easy to assume that virtual-memory properties of the underlying system would automatically achieve good performance for queries with above properties. But there is no performance guarantee, and it is important that the indexing structures themselves adapt their organization to guarantee performance.

Informix Geodetic DataBlade Module

In this section, we discuss the application of ideas discussed in the previous to characterize the performance of the Informix Geodetic DataBlade module. The Geodetic DataBlade module extends the data types supported by the Object-Relational DBMS Informix Universal Server to represent geometric objects such as points, polygon, circle, ellipse, lines, line-segments, etc., on the earth's ellipsoidal surface. The module supports true geodetic coordinate reference system i.e., geodetic latitude and geodetic longitude to describe the coordinates on the surface of the Earth. For each object, the DataBlade supports storing an altitude, which is measured in meters above or below the surface of the Earth, and also supports time-range, which helps to associate time with the data. The Geodetic DataBlade module measures the distance between two points along a geodesic, which is the shortest path between two points on the surface of an ellipsoid.

The Geodetic DataBlade module also provides:

The GIS vendors use Geodetic DataBlade to build applications on top of SQL support layer, to store and retrieve global geographic data.

Performance of the Geodetic Blade

The performance of the Geodetic DataBlade on the data collection and queries discussed in previous sections will be presented in the final version of the paper.

Conclusions

The indexing methods available in the literature have different characteristics, have different variations, allow for several tunable parameters, and their implementations invariably have certain limitations. The geographic data have various properties in terms of their distribution, size, coupling, and volume. The queries used by GIS applications are of various types.
Therefore to optimize the performance of a Geographic Information System for a particular application, one needs to first understand the behavior of the system for the data used by the application. For example, for an application dealing with data with several overlap, R-Tree indexing will be better, and for an application dealing with data containing ``whole Earth objects'', bounding-box based methods will not work properly, and so on.

Therefore it is very important for a GIS to have support for multiple indexing schemes, and allow for tunable parameters.

Acknowledgements

I would like to sincerely thank all the members of Geodetic DataBlade team, Robert Uleman, Toni Guttman, David Ashkenas, Jim Vaudagna, and Ed Katibah, for all the interesting discussions about the topics in this paper.