Brian Lees

IMPROVING THE SPATIAL EXTENSION OF POINT DATA BY CHANGING THE DATA MODEL

Abstract

The most common form of delivery, and use, of vegetation and soils information remains the mapped thematic, or choropleth, form. The pre-processing of data to suit this data structure perpetuates the use of an inappropriate data model and places an upper limit on the accuracy of spatial extension of point data by most predictive modelling techniques. In many cases a continuum of change is being represented as a series of overlapping gaussians. This leads inevitably to errors of omission and commission, which are artefacts of the choice of an inappropriate data model. This is not a new insight (Sinton dealt with it in 1977), but it remains a perennial one. The extensive work on the Kioloa data set to develop and test new predictive modelling tools was constrained to use standard (forest) industry data as input, and forest types as output. This resulted in an upper limit to predictive accuracy, for all the models tested, of about 65%. Those themes which were more appropriately represented by this data structure (land/sea discrimination) could be predicted with up to 99% accuracy. Using the same techniques, data points and variables, but changing the data model, it is possible to achieve a considerable increase in predictive accuracy. Expressing species distribution as a fuzzy membership of the tallest stratum (in a forest), the midstratum or understorey, and the lower stratum or ground layer allows the prediction of a series of data layers which represent surfaces of spatially varying fuzzy memberships. Further, the use of a simple neural net configuration enables both fuzzy membership and the probability of this membership to be estimated. This makes it possible to track error through subsequent uses of the modelled estimates. Methods of comparing the two approaches are still under investigation, but it is clear that the change in data model results in significant gains in accuracy.

Introduction

Current approaches to the analysis of spatial data tend to deal with the spatial characteristics of the data as another attribute in an n-dimensional analytical space. The most common analysis is some form of cluster analysis. This is often an illogical framework for analysis, as has been discussed elsewhere (Lees, 1994, 1995; Aspinall and Lees, 1994). Topological relationships which exist in geographic space are lost and illogical relationships are created. Similarly, topological relationships which exist in, say, environmental data space, and form the basis for environmental domain analysis, are disrupted. This suggests that the practice of creating a single, n-dimensional space to facilitate the analysis of spatial data destroys some of the most important characteristics of the data prior to analysis. In the following section I wish to review the previous discussion of this view of data and then move on to consider its implications for predictive accuracy.

Operational Data Spaces and Appropriate Data Models

Spatial data exists in a number of discrete domains (Lees, 1994, 1995; Aspinall and Lees, 1994). In each of these there exist topological relationships, but these relationships vary from domain to domain. We are most familiar with spatial data existing in a geographic space defined by latitude, longitude and elevation. Movement from point to point in this space is a vector. It is not possible to move from one point to another without transiting intermediate points. Each point is unique.

In the other, conceptual, domains or data spaces topological relationships are different. These data spaces can be spectral space, environmental data space, even socio-economic data space. The fundamental, and shared, characteristic of these spaces is that movement through the space has a logical meaning. Spectral space, for example, forms the basis for most analysis of remotely sensed data. Proximity suggests similar colour. Trajectories of reflectance values for developing crops on different soils form the basis for the common Kauth-Thomas, or Tasseled Cap, transformation (Kauth and Thomas, 1976). Trajectories in spectral space form the basis for sub-pixel modelling of vegetation structure (Jupp et al., 1986). In these analyses vectors represent changes in the reflectance at a point, through time. No motion in geographic space is envisioned. A large number of points in geographic space can occupy a single location in spectral space. The converse is not true.

In environmental data space, the basis for environmental domain analysis (Mackey et al., 1988), topological relationships are linked directly to environmental gradients. Vectors in this space drive the continuum of change in vegetation composition observed in nature. The parameters typically used in environmental domain analyses tend to act at the species and not the community level. The conflict in the ecological literature between those who favour a community view of vegetation and those who view it as a continuum rests squarely on the fact that community is a spatial concept in geographic space, whilst the continuum is a spatial concept in environmental data space (Austin and Smith, 1989). These are fundamentally different in the way they can be analysed. In geographic space one can move from one point to another along a vector. This same motion in environmental data space may result in no motion, if the environments along this vector in geographic space are the same, or a jump from point to point if, say, a soil boundary is crossed. As before, a large number of points in geographic space can occupy a single location in environmental data space and, once again, the converse is not true.
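The many-to-one relationship between geographic space and environmental data space can be illustrated with a small sketch. The plot coordinates and environmental attributes below are invented for illustration only, and the function name is hypothetical; nothing here is taken from the Kioloa data.

```python
from collections import defaultdict

# Hypothetical sample plots along one transect:
# (easting, northing) -> (soil, slope class, aspect class).
# All values are invented for illustration.
plots = {
    (100, 200): ("sandstone", "steep", "north"),
    (110, 200): ("sandstone", "steep", "north"),
    (120, 200): ("shale", "gentle", "north"),
    (130, 200): ("shale", "gentle", "north"),
}


def group_by_environment(plots):
    """Invert the geographic -> environmental mapping.

    Each environmental location collects every geographic point
    that occupies it.
    """
    groups = defaultdict(list)
    for location, environment in plots.items():
        groups[environment].append(location)
    return dict(groups)


groups = group_by_environment(plots)
for environment, locations in groups.items():
    print(environment, "<-", locations)
```

Walking east along this transect, the first step produces no motion in environmental data space, while the second step, which crosses the assumed soil boundary, produces a jump. Each geographic point maps to exactly one environmental location, but each environmental location holds several geographic points.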

This particular dichotomy, between representation of vegetation distribution in geographic space and environmental data space, is a dichotomy between data models. The `mapping' school reduces observations of vegetation to a series of vegetation classes, even forest types. In many ecosystems the class boundaries are cultural (statistical) artefacts. Slight changes in contribution to the canopy can lead to a change in class. In some cases there is more variation within a class than between classes. Nevertheless, the fundamental structure of choropleth mapping requires this reduction of variance to permit the mapping of polygons. This mismatch between data and data structure, excusable in the days when choropleth mapping was the only means of representation, has been carried forward to the present. The often uncritical use of an incorrect model to represent a particular spatial distribution is a persistent source of error in geography (James, 1967).

If we take, as an example, vegetation observations from a typical, undisturbed, eucalypt forest in hilly terrain these can be plotted in geographic space, environmental data space and spectral space. Examining these field observations in environmental data space shows very clearly that much of the variance in the data forms a continuum. Mapping this continuum across to geographic space `shatters' this continuum into facets. These do look like polygons until one examines their internal behaviour. Each facet retains some of the continuous variation which formed the continuum in environmental data space.

`Facets' are not a recognised data structure. Because they appear similar to polygons in geographic space, it is commonly assumed that the polygon is an appropriate data model for vegetation, and this assumption is a major source of error in vegetation and soil mapping. Because choropleth mapping requires the observed variability of data to be reduced, pre-classification is required. The mapping errors are compounded because this pre-classification of the vegetation into communities is carried out in taxonomic data space. The implied distribution is now a series of overlapping gaussians. Attempting to represent a continuum as a series of overlapping gaussians means that the predictive accuracy of the analysis drops markedly because of decision errors of omission and commission (Sinton, 1977).
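The effect of overlap on decision errors can be illustrated with a simple calculation. The sketch below is not part of the original analysis; it assumes two equally likely classes with equal-variance gaussian distributions along a single environmental gradient, split at the midpoint between their means. The function names and the example separations are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def two_class_error(mean_a, mean_b, sd):
    """Expected misclassification rate for two equally likely,
    equal-variance gaussian classes split at the midpoint.

    Every observation omitted from its own class is, by the same
    decision, committed to the other class.
    """
    low, high = sorted((mean_a, mean_b))
    boundary = (low + high) / 2.0
    omitted_from_low = 1.0 - normal_cdf(boundary, low, sd)
    omitted_from_high = normal_cdf(boundary, high, sd)
    return 0.5 * (omitted_from_low + omitted_from_high)


# Class means one and four standard deviations apart (illustrative values).
for separation in (1.0, 4.0):
    print(separation, round(two_class_error(0.0, separation, 1.0), 4))
```

Under these assumptions, classes whose means lie one standard deviation apart are confused about 31% of the time, whereas classes four standard deviations apart are confused about 2% of the time. The closer the imposed classes sit along a continuum, the lower the ceiling on predictive accuracy, whatever the modelling technique.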

This reduction in accuracy is directly related to the use of an inappropriate data model. It would be more appropriate to model this phenomenon in environmental data space or without pre-classification of the observations (Payne, Stockwell and Davey, 1994).

Mapping Vegetation Attributes Without Pre-classification

We used the Kioloa data set (Lees and Ritman, 1991), which covers an area of complex terrain with elevations ranging from sea level to 285m. Land cover ranges from rainforest through highly disturbed forest and heath to cleared grassland. Extensive analyses of this dataset using the ground data pre-classed into forest types typically give predictions of `Sea' at better than 95%, `Paddock' at better than 80%, and the remaining classes, the forest types, at accuracies between 45% and 65%. Treating the continuum of change within the forest as a set of overlapping gaussians, or classes, means that better results are impossible using classed data. However, if one deals with the point observations of vegetation without pre-classification, then another form of mapping must be used. Because we are now dealing with digital information, we are no longer constrained to produce a single `map'. We can now store and retrieve a considerable amount of information about any point, whether in geographic, environmental or spectral space. If we retain geographic space as the most convenient operational data space for users of predictive modelling, we are also selecting the domain with least database complexity, where each point exists in only one location in each of the other domains. We can then model the spatial extension of each relevant attribute of each entity observed in the ground truth plots. It is also possible to add an estimate of the resulting error at each location.

Using the Lees and Ritman (1991) dataset the original observations can be recoded as DBH, stem densities, biomass, or canopy contribution. For this example, the last was chosen. As this is a genuinely fuzzy phenomenon, it was coded as fuzzy membership of the canopy, species by species. A simple Back Propagation neural net was set up for each species. That developed for Eucalyptus maculata, spotted gum, is described. In order to provide an ongoing comparison of methods, the input layer was the same as that described in Fitzgerald and Lees (1993, 1994), and used the same datasets as Lees and Ritman (1991). A 9/10/10 structure was used with one hidden layer of ten nodes and an output layer of ten nodes. No spatial or temporal context was used.
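The shape of such a network can be sketched as follows. This is a minimal illustration of a 9/10/10 feed-forward pass, not the original implementation: the random initial weights, the bias terms, the input values and all function names are assumptions. Figure 1 reports some output values slightly outside the 0-1 range, so the original output layer may have been scaled or linear; the sketch applies the sigmoid at both layers for simplicity.

```python
import math
import random


def sigmoid(x):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_inputs, n_nodes, rng):
    """One row of weights per node; the last entry in each row is a bias
    (the bias term is an assumption, not stated in the text)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_nodes)]


def forward_layer(layer, inputs):
    """Weighted sum plus bias, passed through the sigmoid, for each node."""
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in layer]


def forward(network, inputs):
    """Propagate one input vector through every layer in turn."""
    activations = inputs
    for layer in network:
        activations = forward_layer(layer, activations)
    return activations


rng = random.Random(0)  # fixed seed so the sketch is repeatable
network = [make_layer(9, 10, rng),    # 9 inputs -> 10 hidden nodes
           make_layer(10, 10, rng)]   # 10 hidden nodes -> 10 output nodes
outputs = forward(network, [0.5] * 9)  # placeholder input values
print(len(outputs))
```

One such network would be trained per species, with the ten output nodes read as membership ranges rather than as class labels.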

The network used the Delta rule with a sigmoid transfer function. Learning rates were set at 0.9 initially (0-5000), then reduced steadily (0.3 for 5000-10,000; 0.2 for 10,000 - 150,000). Initial runs suggested that the momentum term needed to be set rather higher than usual (0.6 for 0-20,000; 0.05 for 20,000-150,000).
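The training schedule can be written down as two step functions. The values are those quoted above; how the boundaries between ranges were handled is not stated, so the half-open intervals here are an assumption, as are the function names. The weight update shown is the standard delta rule with a momentum term, given as a generic sketch and not as the original code.

```python
def learning_rate(iteration):
    """Learning rate for a given training iteration (0 to 150,000)."""
    if iteration < 5000:
        return 0.9
    if iteration < 10000:
        return 0.3
    return 0.2


def momentum(iteration):
    """Momentum term for a given training iteration (0 to 150,000)."""
    return 0.6 if iteration < 20000 else 0.05


def weight_change(iteration, error_signal, input_value, previous_change):
    """Delta rule with momentum for a single weight.

    error_signal is the back-propagated error at the receiving node,
    input_value the activation feeding the weight, and previous_change
    the change applied to this weight on the previous iteration.
    """
    return (learning_rate(iteration) * error_signal * input_value
            + momentum(iteration) * previous_change)


print(weight_change(0, 0.1, 1.0, 0.0))
```

Keeping the schedule in functions of the iteration count makes it easy to rerun the same training regime when the inputs or the data model are changed.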

Each output node represented a range of fuzzy memberships (0-0.1; 0.1-0.2 and so on) (fig 1). The highest number in the output range was taken to indicate the membership of that cell and the whole output range for each cell was treated as a distribution and the probability of the membership was calculated. So, for each species, it is possible to estimate its fuzzy membership of the canopy (figs 2 & 3) and the probability of that estimate (fig 4).
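The decoding step can be sketched using the values reported for cells 0 and 3 in Figure 1. The text does not say how the probability was calculated from the output distribution; the sketch assumes one simple possibility, namely clipping negative outputs to zero, normalising the remainder to sum to one, and taking the share held by the winning node. The function name is hypothetical and the row labels are copied from Figure 1.

```python
# Row labels and node outputs as reported in Figure 1 (row 0).
labels = ["0-0.15", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
cell_0 = [1.086, 0.086, 0.013, -0.072, -0.001, -0.013, 0.006, -0.052, 0.0]
cell_3 = [0.429, 0.268, 0.009, 0.087, 0.058, -0.009, 0.151, 0.009, 0.0]


def decode_cell(outputs, labels):
    """Return (membership label, probability) for one cell.

    The node with the highest output gives the membership. The
    probability is that node's share of the clipped, normalised
    outputs (an assumed method, not confirmed by the source).
    """
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    clipped = [max(0.0, value) for value in outputs]
    total = sum(clipped)
    probability = clipped[winner] / total if total > 0 else 0.0
    return labels[winner], probability


membership_0, probability_0 = decode_cell(cell_0, labels)
membership_3, probability_3 = decode_cell(cell_3, labels)
print(membership_0, round(probability_0, 3))
print(membership_3, round(probability_3, 3))
```

Under this assumption both cells fall in the lowest membership range, but cell 0 does so with a probability of about 0.91 and cell 3 with a probability of only about 0.42, because the output for cell 3 is spread across several nodes.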

fuzzy        row 0
membership   cell 0     cell 1     cell 2     cell 3     cell 4     cell 5     cell 6

0-0.15       1.086      0.908      0.749      0.429      0.414      0.472      0.495
0.2          0.086      0.152      0.2        0.268      0.287      0.274      0.271
0.3          0.013      0.008      0.01       0.009      0.008      0.01       0.01
0.4          -0.072     -0.004     0.034      0.087      0.078      0.115      0.108
0.5          -0.001     0.011      0.029      0.058      0.058      0.051      0.05
0.6          -0.013     -0.013     -0.01      -0.009     -0.015     -0.014     -0.014
0.7          0.006      0.035      0.073      0.151      0.162      0.106      0.104
0.8          -0.052     -0.044     -0.025     0.009      0          -0.011     -0.014
0.9          0          0          0          0          0          0          0

Figure 1: Cell by cell output from the NN for row 0, cells 0-6.

Figure 2: Fuzzy membership of the canopy for Eucalyptus maculata. Range is 0-0.83.

Figure 3: Distribution of predicted fuzzy memberships plotted against actual memberships. The broad pattern and structure are still being investigated.

Figure 4: The probability of the predicted fuzzy memberships shown graphically. Absence has the highest probability and is the easiest to predict. The low values (blue) tend to occur in those areas of forest where gradients of change are flattest, and the high values (red) where gradients are steepest.

The structure of the output is so different to our previous modelling exercises that comparative statistics are almost meaningless. The form of the model output is similar to that produced by Payne (Payne, Stockwell and Davey, 1994) using genetic algorithms on the Lees and Ritman (1991) dataset, but as that study only attempted to predict the probability of presence/absence for a species, it is not directly comparable. The information provided by this approach is much richer in content than a traditional choropleth map and would probably reside in a database rather than be made explicit as a map. For the species chosen as an example, and using only the fuzzy memberships, the RMS error is calculated to be 0.2324. Adding spatial and temporal context should reduce this significantly.
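For reference, the RMS error quoted above is the usual root mean square of the differences between predicted and observed memberships. A minimal sketch, with a hypothetical function name and invented example values, is:

```python
import math


def rms_error(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Invented memberships for four cells, for illustration only.
print(rms_error([0.1, 0.3, 0.0, 0.8], [0.3, 0.1, 0.0, 0.8]))
```

Because fuzzy memberships lie between 0 and 1, the RMS error is expressed in the same units and can be read directly against the membership range of the species concerned.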

Conclusions

It is almost too simplistic to state that many of the compromises used by cartographers and geographers over the centuries need to be rethought in the light of modern technologies. The use of choropleth maps as the main data storage form of natural systems information has led to the perpetuation of an inappropriate data model for things such as vegetation, soil and geochemical mapping. It is easy to demonstrate that this is a major source of error but there is, as yet, no agreement on the most appropriate replacement. In this exercise we have examined one possible model which shows promise. Moving away from a `mapping' to a geographic information system mindset makes it clear how important it is to have an appropriate data structure, and how important it is to examine the data model to select the most appropriate data space for analysis.

References

Aspinall, R. & Lees, B.G. 1994. `Sampling and analysis of spatial environmental data.' in Waugh, T.C. & Healey, R.G. (eds) Advances in GIS Research, Taylor and Francis, Southampton, 1086-1099. ISBN 0-7484-0315-9.

Austin, M.P. & Smith, T.M. 1989. `A new model for the continuum concept.' Vegetatio, 83: 35-47.

Fitzgerald, R.W. & Lees, B.G., 1993. `Assessing the classification accuracy of multisource remote sensing data.' Remote Sensing of Environment, 47: 1-25.

James, P.E. 1967. `On the origin and persistence of error in Geography'. Annals of the Assoc. of American Geographers, 57:1-25.

Jupp, D.L.B., Walker, J. and Penridge, L.K., 1986. Interpretation of vegetation structure in Landsat MSS imagery: A case study in disturbed semi-arid Eucalypt woodlands. Part 2, Model-based analysis. J. Environmental Management, 23: 35-57.

Kauth, R.J. and Thomas, G.S., 1976. The Tasseled Cap. Proc. LARS 1976 Symp. on Machine Process. Remotely Sensed Data, Purdue University.

Lees, B.G. and Ritman, K. 1991. Decision tree and rule induction approach to integration of remotely sensed and GIS data in mapping vegetation in disturbed or hilly environments. Environmental Management, 15: 823-831.

Lees, B.G. 1994. `Decision trees, artificial Neural Networks and Genetic Algorithms for classification of remotely sensed and ancillary data.' Proceedings 7th Australasian Remote Sensing Conference, Remote Sensing and Photogrammetry Association, Australia Ltd, Floreat, W.A. v1: 51-60.

Lees, B.G., 1995. `Sampling strategies for machine learning using GIS' in GIS and Environmental Modelling: Progress and Research Issues, Goodchild, M.F., Steyaert, L., Parks, B., Crane, M., Johnston, C., Maidment, D., and Glendinning, S. (eds), GIS World Inc, Fort Collins, Co. ISBN 1-882610-17-2.

Payne, K., Stockwell, D., and Davey, S. 1994. A methodology for improving the accuracy of vegetation mapping using GIS, remote sensing and genetic algorithms. Proc. of the regional conference of the International Union of Geographers: `Environment and the quality of life in Central Europe: problems of transition.' 22-26 August, Prague, Czech Republic.

Sinton, D.F. 1977. `The inherent structure of information as a constraint to analysis: mapped thematic data as a case study' International Advanced Study Symposium on Topological Data Structures for Geographic Information Systems, Dedham, Mass. in Dutton, G. (ed), Harvard Papers on geographic information systems, Harvard University, Camb. Mass.

Brian G. Lees

Department of Geography

Australian National University

ACT 0200, Australia

e-mail Brian.Lees@anu.edu.au

Fax 06 249 3770

Telephone 06 249 3795