Virtual Data Sets - Smart Data for Environmental Applications

Andrej Vckovski and Felix Bucher

Spatial Data Handling
Department of Geography
University of Zurich
Winterthurerstr. 190
CH-8057 Zürich
Switzerland
vckovski@geo.unizh.ch / bucher@geo.unizh.ch

Abstract

Continuous fields on a spatio-temporal support are one of the basic types of data used in environmental modeling, e.g., air temperature and pressure, wind fields, precipitation, soil types etc. The sampling of a continuous field necessarily involves discretization in both the spatial and the temporal domain. Data describing (random) continuous fields therefore consist of a series of samples at fixed spatial and temporal locations, e.g., as point values or as aggregations over certain areas and time intervals. For many applications, the representation available in such data sets does not meet the requirements of the application. It is therefore often necessary to model the field under consideration based on the available data values and then to use the model to predict values at unsampled locations, i.e., to generate a new representation of the field.

The process of creating new representations, e.g., by resampling or interpolation, is often time-consuming and strongly affects the quality of the generated representation. Virtual Data Sets (VDS) are an approach to address these problems and to improve the reliability and re-usability of field representations. The concept is based on extending a data set with methods that implement a model of the field under consideration. That is, a VDS itself contains the methods needed to generate new representations, or to present itself as a new representation.

The structure of VDS leads to an implementation design based on a distributed computing environment and object-oriented technology. VDS are therefore implemented as distributed objects or services which can be queried by applications needing data. This design is discussed in the context of the approach taken by the Open Geodata Interoperability Specification (OGIS). The last part of this contribution discusses a possible implementation strategy based on Java. It is shown that Java can be used to realize a distributed geoprocessing environment and Virtual Data Sets.

Introduction

Environmental information is increasingly required for policy making in diverse fields such as natural risk assessment, crop yield estimation and resource management. Such decision support information is provided by reliably analyzed environmental data, i.e., value-added information. Over the past few years, Geographical Information Systems (GIS) have been evaluated as (and extended towards) an appropriate technology for environmental data management and analysis. This is mainly because GIS can integrate diverse information from various sources, which is a major requirement in environmental data analysis. Environmental data analysts, however, still face a number of problems when using GIS. To overcome these problems, the GIS industry is now adopting new trends and technologies from fields such as computer science and data engineering. More specifically, GIS are moving towards truly distributed, component-based geoprocessing across large networks as well as geodata interoperability. Standardization efforts based on object-oriented geodata models are increasingly accepted in the design of modern GIS.

This paper presents the use of Virtual Data Sets (VDS) for environmental data management and analysis. More specifically, VDS provides a promising means for the storage and usage of data describing continuous fields as is often required in environmental applications. The VDS approach is based on "smart" data sets and views data sets as active objects in a distributed software system.

Interoperability Issues in Environmental Applications

The following discussion focuses on (continuous) fields. Fields are the major information type used in most environmental applications. A field is basically a function relating the elements of a definition domain (support) to a value domain. The support is in most cases a subset of (physical) space and time, e.g., a time interval and/or a region on the earth's surface. The values of a field are either discrete or continuous, and either scalar or vector entities (Vckovski, 1995; Bucher & Vckovski, 1995). If the support is dense and compact then a field is called continuous.

The sampling of a field necessarily implies a discretization of the support, i.e., the field values are measured at a specific set of sites (sample sites). These sites are in most cases points (e.g., meteorological measurement devices) or simple polygons (e.g., squares on a satellite image). A representation of a field therefore consists of a finite set of sites and corresponding field values. Data sets are typically collected with a specific application in mind and with a corresponding sampling strategy. Due to the high costs of sampling, the acquired data needs to be re-used as much as possible. The original field representation, however, often does not meet the data requirements of future applications, i.e., the representations are semantically incompatible. Therefore, new representations have to be derived from the original field representation. Consider a digital elevation model (DEM) that was originally sampled at irregularly distributed points. In order to meet the requirements of some applications, the DEM - describing the terrain height field - might be transformed into other representations, such as:

These transformations necessarily need a model of the field based on the sampled values. The model is a simplified view of the real-world behaviour of the underlying phenomenon. The transformation generates a new representation through the model. This is shown in the following figure:

The sampling (A) generates the original field representation. This representation then is analyzed to form a model which serves for the generation of another representation.
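
The chain from original representation through a model to a new representation can be illustrated with a small sketch. It assumes inverse distance weighting as the model, chosen here only for brevity and not prescribed by this paper; the names IdwModel, predict and toGrid are hypothetical.

```java
// Sketch: original representation (irregular point samples) -> model
// (inverse distance weighting, an assumed choice) -> new representation
// (regular grid). All names are illustrative.
final class IdwModel {
    private final double[] xs, ys, values;  // the original field representation
    private final double power;             // IDW exponent (assumed parameter)

    IdwModel(double[] xs, double[] ys, double[] values, double power) {
        if (xs.length == 0 || xs.length != ys.length || xs.length != values.length) {
            throw new IllegalArgumentException("need equally sized, non-empty sample arrays");
        }
        this.xs = xs.clone();
        this.ys = ys.clone();
        this.values = values.clone();
        this.power = power;
    }

    // Predict the field value at an arbitrary (possibly unsampled) location.
    double predict(double x, double y) {
        double weightSum = 0.0;
        double weightedValues = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double d = Math.hypot(x - xs[i], y - ys[i]);
            if (d == 0.0) {
                return values[i];           // exactly on a sample site
            }
            double w = 1.0 / Math.pow(d, power);
            weightSum += w;
            weightedValues += w * values[i];
        }
        return weightedValues / weightSum;
    }

    // Generate a new representation: a regular grid in which grid[r][c]
    // holds the prediction at (x0 + c * dx, y0 + r * dy).
    double[][] toGrid(double x0, double y0, double dx, double dy, int rows, int cols) {
        double[][] grid = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                grid[r][c] = predict(x0 + c * dx, y0 + r * dy);
            }
        }
        return grid;
    }
}
```

The choice and parameterization of the model (here the exponent) directly determine the values of the derived grid, which is why the transformation step is critical for the quality of the new representation.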

The transformation process is a critical task and involves many difficulties:

The concepts of Virtual Data Sets (VDS) and the Open Geodata Interoperability Specification (OGIS) discussed in the following section try to minimize the necessity and drawbacks of subsequent transformations.

Virtual Data Sets and OGIS

The basic idea of VDS (Stephan et al., 1993; Vckovski, 1995) is to enhance a field representation with a set of methods which implement a model of the field. With these methods, a VDS is able to produce various views of the field depending on the data requirements of potential applications. A VDS thus consists of multiple, dynamically created representations of a field. The methods are a persistent component of a VDS, whereas the values of a particular representation are virtual in the sense that they are derived upon an application's request. A VDS encapsulates the field's behaviour and the (original) field representation and therefore represents an object in the sense of object-oriented design (Booch, 1991). With its ability to adapt to various data requirements (multiple virtual representations), a VDS can serve many applications without the need for conversions and transformations. These applications are, however, required to access their views or representations through the interface offered by the VDS.
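
As a minimal sketch of this idea, consider a field with a purely temporal support. The class below keeps only the original samples and the methods that model the field; every representation it returns is computed on request. The piecewise linear model, the midpoint-rule aggregation and all names (VirtualTimeSeries, valueAt, pointValues, meanOver) are assumptions made for illustration, not part of the VDS definition.

```java
// Persistent state: original samples plus the methods modelling the field.
// Virtual: every representation returned below is derived on request.
final class VirtualTimeSeries {
    private final double[] times;   // sample times, strictly increasing
    private final double[] values;  // sampled field values

    VirtualTimeSeries(double[] times, double[] values) {
        if (times.length != values.length || times.length < 2) {
            throw new IllegalArgumentException("need at least two (time, value) samples");
        }
        for (int i = 1; i < times.length; i++) {
            if (!(times[i] > times[i - 1])) {
                throw new IllegalArgumentException("sample times must be strictly increasing");
            }
        }
        this.times = times.clone();
        this.values = values.clone();
    }

    // Model of the field: piecewise linear between samples (an assumption).
    double valueAt(double t) {
        if (!(t >= times[0] && t <= times[times.length - 1])) {
            throw new IllegalArgumentException("t lies outside the sampled support");
        }
        int hi = 1;
        while (times[hi] < t) {
            hi++;                   // first sample at or after t
        }
        int lo = hi - 1;
        double f = (t - times[lo]) / (times[hi] - times[lo]);
        return values[lo] + f * (values[hi] - values[lo]);
    }

    // View 1: point values at n application-chosen, equally spaced times.
    double[] pointValues(double start, double step, int n) {
        double[] out = new double[n];
        for (int k = 0; k < n; k++) {
            out[k] = valueAt(start + k * step);
        }
        return out;
    }

    // View 2: aggregation over [from, to], approximated by the midpoint rule.
    double meanOver(double from, double to, int steps) {
        double h = (to - from) / steps;
        double sum = 0.0;
        for (int k = 0; k < steps; k++) {
            sum += valueAt(from + (k + 0.5) * h);
        }
        return sum / steps;
    }
}
```

An application that needs hourly point values and one that needs daily means would both query the same object, each through the method that matches its data requirements.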

The OGIS Data Model (OGM) (Buehler, 1994) is an approach to define and specify a set of basic data types and their aggregates as building blocks for interoperable geoprocessing. The basic approach taken is to define interfaces to these data types (or classes). The OGIS approach is based on similar ideas but goes one step beyond VDS by providing a comprehensive specification. The specification consists basically of interface definitions written in CORBA's IDL (Interface Definition Language) and a set of assertions to restrict and define the semantics of implementations. Other approaches for the specification of OGM based on functional languages are discussed in (Frank & Kuhn, 1995).

The OGIS and VDS approaches envisage GIS technology based on (possibly distributed) interoperable objects where the classical gap between data structures and algorithms is bridged. Data sets are exchangeable components in the same way as other sets of services in a modular system. In combination with software technologies such as CORBA (CORBA, 1992), OLE, OSF/DCE or Java, this leads to a truly distributed geoprocessing environment.

In the domain of environmental data management and analysis, distributed processing has particular benefits, as both data and specific processing methods are frequently exchanged and used within various organisations. More specifically, data users are often not in the same organisation as the data producer. The concepts of VDS and OGM allow data producers to create 'smart' data sets and to distribute them including the encapsulated methods. It can be expected that the quality of the dynamically created field representations is in general improved if data producers provide the corresponding methods. Data producers usually have most of the knowledge needed for the selection and parameterization of the methods.

The following section shows how Java can be used as a base for distributed geoprocessing and for the definition and implementation of VDS.

Java as an Environment for Interoperable Geoprocessing

VDS and OGM provide a conceptual framework for component-based and truly distributed geoprocessing. However, the implementation of distributed object systems is not a simple task. CORBA, OSF/DCE and OLE have become the major carriers in the software industry for such projects. Recently, Sun Microsystems presented the new programming language Java to the public. Java has a wide variety of promising features, some of which give powerful support for the implementation of distributed object systems.

Java is an object-oriented programming language and is syntactically based on C++. Many of the redundant and annoying features of C++ were removed based on the experience gained within the software industry in the past years (e.g., Java has no pointers). Java's popularity is particularly due to its usage within the World-Wide Web, as a programming language for the development of so-called applets. The features of Java enabling its application within the World-Wide Web are the same ones that make Java a promising environment for distributed object systems. These features are:

A Java-based open and interoperable GIS environment would consist of a set of Java classes containing both the (virtual) data sets and other geoprocessing services such as display, analysis and database management. For continuous fields, the corresponding Java classes contain both the (original) field representation as a static data member and the transformation methods (e.g., for local interpolation, or calculation of derivatives) as class methods. Each VDS describing a field implements a fixed Java interface. Any application using the data needs only the field interface declaration to be able to use the data set. All low-level details, e.g., how the physical storage is organized, are hidden from the application.

A Java Interface for Continuous Fields

This example illustrates some of the concepts discussed above. It consists of three parts. The code presented here is by no means complete: implementation bodies of most methods were left out for the sake of brevity, and the class hierarchy was reduced to a minimum to clarify the basic ideas.

The first part is a generic declaration of the values of a field. Since all values are measurements, uncertainty and error information needs to be included. This is done by an interface called ValueType. It declares methods to retrieve some information on an "uncertain" value such as its mean, variance, etc. A more realistic example would be based on a set of different interfaces for various types of uncertain information, e.g., random variables, fuzzy sets, intervals etc. (see also Vckovski, 1995). NormalValue implements ValueType and represents a Gaussian-distributed (random) variable.

public interface ValueType {

  float Mean();                 // Mean value of a measurement

  float Variance();             // Variance of a measurement

  float Min();                  // Minimum value

  float Max();                  // Maximum value

  float Median();               // Median value 

  float Realize();              // Realization for stochastic simulations 
                                // (if applicable)

}

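// A normal (Gaussian-distributed) value
public class NormalValue implements ValueType {

  float mu, sigma;              // state: the 2 parameters of a gaussian
                                // distribution (sigma is taken here to be
                                // the standard deviation)

  // constructor
  NormalValue(float amu, float asigma) {
    mu = amu;
    sigma = asigma;
  }

  // interface methods are implicitly public, so the implementations
  // have to be declared public as well
  public float Mean() {
    return mu;
  }

  public float Variance() {
    return sigma * sigma;
  }

  public float Min() {
    return Float.NEGATIVE_INFINITY;
  }

  public float Max() {
    return Float.POSITIVE_INFINITY;
  }

  public float Median() {
    return mu;
  }

  public float Realize() {
    // Util.NormalRandomValue is a helper (not shown) that draws a
    // random number from the given normal distribution
    return Util.NormalRandomValue(mu, sigma);
  }
}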
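
Intervals were mentioned above as another kind of uncertain information. As a hedged sketch of how a second implementation of ValueType might look, the class below describes a value known only to lie within an interval. It assumes, purely for illustration, that the value is uniformly distributed over the interval; the class name IntervalValue and the injected random number generator are hypothetical. The interface is repeated so that the sketch compiles on its own.

```java
import java.util.Random;

// Repeated from the listing above so that this sketch is self-contained.
interface ValueType {
    float Mean();
    float Variance();
    float Min();
    float Max();
    float Median();
    float Realize();
}

// A value known to lie in [lo, hi]; a uniform distribution over the
// interval is an assumption made here, not something the paper specifies.
final class IntervalValue implements ValueType {
    private final float lo, hi;
    private final Random rng;   // source of randomness for Realize()

    IntervalValue(float lo, float hi, Random rng) {
        if (!(lo <= hi)) {
            throw new IllegalArgumentException("lo must not exceed hi");
        }
        this.lo = lo;
        this.hi = hi;
        this.rng = rng;
    }

    public float Mean()     { return (lo + hi) / 2; }

    // Variance of a uniform distribution: (hi - lo)^2 / 12
    public float Variance() { float w = hi - lo; return w * w / 12; }

    public float Min()      { return lo; }

    public float Max()      { return hi; }

    // For a symmetric distribution the median equals the mean.
    public float Median()   { return Mean(); }

    // One realization drawn uniformly from the interval.
    public float Realize()  { return lo + rng.nextFloat() * (hi - lo); }
}
```

Because applications only see ValueType, a field could return NormalValue objects for one data set and IntervalValue objects for another without any change to the calling code.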

The following declaration defines a generic reference to a location on the earth's surface. It is simplified in that the ellipsoid used and other details are not referenced. A more comprehensive approach would certainly also use additional representation schemes.

// General geo-reference. Here simplified (e.g., 2-dimensional,
// no ellipsoid etc)
interface Georeference {

  float Longitude();

  float Latitude();

  void Set(float along, float alat);

  void Set(Georeference point);

}

The following interfaces declare generic scalar- and vector-valued fields. The distinction between scalar and vector fields is made only for ease of use. A scalar field is identical to a vector field with a value-domain dimension of 1, but it is simpler to use when declared as scalar since no subscripts (array indices) need to be given.

// scalar field (= vector field with dim(Value-domain)=1)
interface ScalarField {

  ValueType Value(Georeference point);

  // in addition to this: functions to retrieve a description of the
  // support of the field
}

// vector field
interface VectorField {

  ValueType[] Value(Georeference point);

  // in addition to this: functions to retrieve a description of the
  // support of the field
}
The class ScalarFieldOnGrid is an example of an implementation of ScalarField for the case when the underlying representation is a regular grid. This class is abstract, i.e., it needs to be subclassed by an actual implementation.
public abstract class ScalarFieldOnGrid implements ScalarField {

  Georeference origin;          // "lower-left" corner of grid (pt at 0,0)
  Georeference row1st;          //  pt at (0,1)
  Georeference column1st;       //  pt at (1,0)
  int rows, cols;

  public abstract ValueType Value(Georeference point);

  // and here there would be a series of helper functions for the
  // management of data on regular grids, e.g., interpolation functions
  // etc.
}
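
One of the helper functions hinted at in the comment above could be an interpolation between grid nodes. The following sketch shows bilinear interpolation in grid coordinates (fractional row and column indices). Bilinear interpolation is only one possible choice and is an assumption here, as are the names GridInterpolation and bilinear; mapping a Georeference to grid coordinates is left out.

```java
// Sketch of a grid helper: bilinear interpolation between the four
// grid nodes surrounding a location given in fractional grid coordinates.
final class GridInterpolation {
    private GridInterpolation() {}

    // grid[r][c] holds the value at row r, column c; requires at least
    // a 2 x 2 grid and 0 <= row <= rows - 1, 0 <= col <= cols - 1.
    static double bilinear(double[][] grid, double row, double col) {
        int rows = grid.length;
        int cols = rows > 0 ? grid[0].length : 0;
        if (rows < 2 || cols < 2) {
            throw new IllegalArgumentException("grid must be at least 2 x 2");
        }
        if (!(row >= 0 && row <= rows - 1 && col >= 0 && col <= cols - 1)) {
            throw new IllegalArgumentException("location lies outside the grid");
        }
        // Lower-left node of the enclosing cell; clamped so that locations
        // on the last row or column still use a valid cell.
        int r0 = (int) Math.min(Math.floor(row), rows - 2);
        int c0 = (int) Math.min(Math.floor(col), cols - 2);
        double fr = row - r0;   // fractional offset within the cell, in [0, 1]
        double fc = col - c0;

        double lower = (1 - fc) * grid[r0][c0] + fc * grid[r0][c0 + 1];
        double upper = (1 - fc) * grid[r0 + 1][c0] + fc * grid[r0 + 1][c0 + 1];
        return (1 - fr) * lower + fr * upper;
    }
}
```

A subclass of ScalarFieldOnGrid could call such a helper from its Value method after converting the requested point to grid coordinates.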

The last part is a sample implementation of a Virtual Data Set describing the surface air temperature over a specific area. It first contains a class defining the spatial references of this data set in a custom coordinate system (class MyPoint). The second class is the actual wrapper for the temperature data (class Temperature). Note that the data source can be an external data file, a database query (or view) or initialized static variables within the class. In that sense a Virtual Data Set can also be seen as middleware providing standardized access to environmental data.
// A custom georeference for this data set
public class MyPoint implements Georeference {

  float x, y;

  // construct via 'my' coordinates
  MyPoint(float ax, float ay) {
    x = ax;
    y = ay;
  }

  // construct via long/lat
  MyPoint(Georeference point) {
    Set(point);
  }

  public float Longitude() {
    // de-project x,y (body omitted)
    throw new UnsupportedOperationException("de-projection not shown");
  }

  public float Latitude() {
    // de-project x,y (body omitted)
    throw new UnsupportedOperationException("de-projection not shown");
  }

  public void Set(float along, float alat) {
    // project along,alat to x and y
  }

  public void Set(Georeference point) {
    Set(point.Longitude(), point.Latitude());
  }
}

// My temperature data set
public class Temperature extends ScalarFieldOnGrid {
  String datasource = "http://myhost/mypath/tempdata";
  // or for example String datasource = "sql://mydbmshost/select tm,tstd,x,y from temp;";
  // or static NormalValue[][] values = { .... };
  NormalValue[][] values;

  Temperature() {
    origin = new MyPoint(42,42);
    row1st = new MyPoint(43,42);
    column1st = new MyPoint(42,44);
    // get data from 'datasource', fill up rows, cols, values[][];
  }

  public NormalValue Value(Georeference point) {
    MyPoint pt = new MyPoint(point);
    // and now find out where pt is in the grid and
    // perform interpolation, or fetch additional data if necessary
    // here is where the real work is done
    return (new NormalValue( /* */ ));
  }
}

The usage of such a VDS is sketched below. It can be used either from within a Java-enabled generic GIS or from any other tool for scientific computing.

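  import ANETDataset.Temperature;
// ...
  Temperature T = new Temperature();
  // Value() expects a Georeference, so the location is wrapped first
  MyPoint zurich = new MyPoint(0, 0);
  zurich.Set(8.5f, 47.4f);                  // longitude, latitude: this is Zurich ..
// ...
  float avalue = T.Value(zurich).Mean();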

Conclusion

The objective of this paper is twofold. On the one hand, the discussion of some properties of environmental data has shown that modern technologies for object-oriented and component-based systems, as they are now being adopted in the GIS industry, can be of great use in the management and analysis of environmental data. In particular, the representation of continuous fields can be significantly enhanced by embedding methods that model the behaviour of the natural phenomenon described by the sample data. On the other hand, this paper promotes Java as a carrier for distributed object systems in general and Virtual Data Sets in particular. Java's strengths allow the implementation of interoperable and distributed geoprocessing in real, production-quality environments. However, the current lack of experience with Java in the GIS industry and the uncertainty about Java's future under market forces call into question its maturity for industry-strength software development.

Future work will focus on implementing the VDS concept as a subset of OGIS's OGM in Java. The experience gained hereby will help in the refinement of specifications of interoperable geoprocessing environments such as OGIS.

References

Albrecht, Jochen. 1995.
Semantic Net of Universal Elementary GIS Functions. Pages 235-244 of: Proceedings of the AUTOCARTO 12 Conference.
Booch, Gary. 1991.
Object Oriented Design with Applications. The Benjamin / Cummings Publishing Company, Inc.
Bucher, Felix, & Vckovski, Andrej. 1995.
Improving the Selection of Appropriate Spatial Interpolation Methods. Pages 351-364 of: Frank, Andrew U. & Kuhn, Werner (eds.) Spatial Information Theory: A Theoretical Basis for GIS. Lecture Notes in Computer Science 988. Berlin: Springer Verlag.
Bucher, Felix, Stephan, Eva-Maria, & Vckovski, Andrej. 1994.
Integrated Analysis and Standardization in GIS. In: Proceedings of the EGIS'94 Conference.
Buehler, K. A. 1994.
The Open Geodata Interoperability Specification: Draft Base Document. Tech. rept. OGIS Project Document 94-025. OGIS, Ltd.
CORBA. 1992.
The Common Object Request Broker: Architecture and Specification. Object Management Group. OMG Document Number 91.12.1.
Frank, Andrew U., & Kuhn, Werner. 1995.
Specifying Open GIS with Functional Languages. Pages 184-195 of: Egenhofer, Max, & Herring, John R. (eds), Advances in Spatial Databases. Lecture Notes in Computer Science. Berlin: Springer Verlag.
Stephan, Eva-Maria, Vckovski, Andrej, & Bucher, Felix. 1993.
Virtual Data Set: An Approach for the Integration of Incompatible Data. Pages 93-102 of: Proceedings of the AUTOCARTO 11 Conference.
Vckovski, Andrej. 1995.
Representation of Continuous Fields. Pages 127-136 of: Proceedings of the AUTOCARTO 12 Conference.

WWW References

About Java
http://java.sun.com/about.html
OSF Distributed Computing Environment Home Page
http://www.osf.org/dce/index.html
Microsoft OLE Strategic overview
http://www.microsoft.com/devonly/strategy/ole/ole.htm