Towards a Methodology for Selecting a "Characteristic" Sample from an Existing Database: An Evolutionary Approach

Jarvis, C.H.*, Stuart, N.*, Kelsey, J., Baker, R.H.A.


ABSTRACT

The performance of an environmental model depends highly on the 'representativeness' of the sample data with which it has been developed in relation to the overall data set, together with the adequacy of the sample with regard to the modelling technique itself. While advances have been made towards achieving both goals individually, in few cases are these combined. Additionally, the situation with regard to pre-existing data at fixed points remains 'ad hoc'. Given that increasing volumes of pre-sampled point data now exist in electronic form, these omissions are an issue of currency within the wide range of topics embraced by the term 'environmental modelling'. That this increase in data richness has frequently been matched by the rising cost of data means the ability to select appropriate data is paramount not only in terms model quality but also in the economic feasibility of the overall project itself. A new sampling methodology is presented to deal with these issues, framed within the context of the choice of meteorological station data for the development of interpolated weather surfaces but of wider applicability.

Fundemental to the new strategy is the representation of sampling requirements by means of a multi-criteria function. Criteria are developed to deal with each individual variable, for example a desired height function/range or the maximisation of record length. Data selection may also be tailored to the model types for which the data is to be used, for example by ensuring that an adequate number of nearby points are selected for interpolation by kriging. This combination of possibly contradictory criteria are then optimised together using a genetic algorithm in conjunction with the geographical information system in which the data are stored. While the solution of multi-criteria problems can be achieved using a variety of more traditional techniques, the GA approach was chosen both for its ability to manage large volumes of data and for its management of compromise. As a working progress report, discussion focuses on the reasons for the choice of an evolutionary approach and the representation of the sampling problem within this framework.

INTRODUCTION

However sophisticated a model or procedure, the results will only be as good as the data/knowledge underpinning its conception. This statement in itself is not new or radical and a number of sampling methodologies are very familiar in a wide range of disciplines. Such techniques include for example the random, stratified or systematic families of sampling. More recently these familiar methods have been augmented by those such as gradsect sampling (Austin & Heyligers, 1989) which deliberately sample across steep gradients of environmental change. Such moves highlight an increased awareness of the need to maximise the possible environmental range within a sample. Additionally, work by biologists Margules & Stein (1989) stressing the limitation of sampling within one dimension only has been applied within a geographical framework (Lees 1994, Aspinall & Lees 1994). The latter argue that not only should sampling be carried out with respect to full range of environmental criteria, but that each environmental space should be sampled individually. Further dimensions to the sampling problem that are difficult to incorporate within the context of traditional sampling methodologies include the purpose for which the data is to be used (e.g. Pettitt & McBratney, 1993), together with strategies which formally take into account the cost of sampling within the overall sampling objective (e.g. Bras and Rodriguez-Iturbe, 1976). What ma y be summarised from the current literature therefore are the concepts of representative data, the need to look at multidimensionality and the possibility of tailoring our sample according to future analytical requirements.

In the case of choosing the most characteristic pre-located site data from a wider set, as distinct from a free choice of sample sites over a particular landscape, the means by which all of the principles discussed above may be applied within the sampling process is somewhat ad hoc. Such a situation is however frequently to be found with the increasing volumes of pre-sampled data available in electronic form, the task of choosing relevant data becoming commensurably awesome. Even when a restricted and relevant data exists, its full cost may deem a project economically not viable and therefore a means of partial selection critical. Applying the definition of Eastman (1995) which classifies problems with a single objective (making an 'appropriate' sample choice) subject to a number of possibly conflicting criteria (representative data, multidimensionality, analytical requirements, cost) as multi-criteria evaluations, techniques for solving such problems within the broader literature are used as a base for further methodological discussion. It should be noted that some ambiguity in the use of such terminology arises even within a GIS environment, Jankowski (1995) referring to both multi-criteria and multi-objective techniques as 'multiple criteria decision making methods'.

A wide variety of traditional optimisation and search techniques exist which have been drawn upon in the development of 'multiple criteria decision making methods' (Table 1). As Schwefel (op cit, p165) notes, 'the question of which is the best strategy is itself a kind of optimisation problem' ! Just as multi-objective analysis has tended to be viewed as a 'natural extension of mathematical programming' (Jankowski 1995), so have the overlay and hierarchical methods of multi-criteria analysis found favour within the context of a geographical information systems (e.g. Carver 1991, Eastman et al 1995).

Gradient analysis Enumerative Random
Table 1, CONVENTIONAL SEARCH AND OPTIMISATION METHODS (After Goldberg, 1989, p3)

While a straightforward solution to the multi-criteria sampling problem was sought, a number of disadvantages in using these conventional methods were identified. These are summarised in Table 2 below.

Gradient analysis:

Enumerative methods: Random search:

Table 2, RESTRICTIONS OF CONVENTIONAL ALGORITHMS

In addition to conventional search and optimisation algorithms, a newer class of methods known as evolutionary algorithms (incl. genetic algorithms) has also been used in multi-criteria optimisation (Goldberg, op cit, p197). Because of the major obstacles emerging should traditional methods be used in the case of the sampling problem (search space size, the objective as a collective of suitable sites, and conflicting criteria selection) these more recent techniques are evaluated for their potential use in the sampling application. The results of this analysis are shown in Table 3, from which the decision to develop a genetic algorithm approach to the extraction of a characteristic sub-sample of fixed data points from an established database was taken.

Table 3, ADVANTAGES OF EVOLUTIONARY COMPUTING APPROACH

METHODOLOGY

In developing a new sampling methodology, the first question to be asked is 'what is required of the sample data?'. As established within the literature review, the two main goals should be the 'representativeness' of the data with regard to the overall problem space, and the tailoring of data to suit the requirements of further analytical techniques. Specific criteria relevant to the search may then be introduced within the framework of more general evolutionary code. The local context of the sampling problem therefore requires close analysis. In this case the use of the sample is in the interpolation of point type meteorological data for England & Wales. The problem task is to choose a total of 200 sites from a possible 985, the number of sites restricted for economic reasons. The available meta-data for each site, derived data and its source are detailed below (Table 4) and in the first instance exclude partial or subjective data such as quality statements.

MET. OFFICE DERIVED

GIS

Location Location
Start of recording (year) Record length
End of recording (year) Age of information
Currency of data
Table 4, SOURCES OF META DATA REGARDING EACH SITE

Derivations have been developed within the GIS GRASS or by spreadsheet, and move the initial data closer to a form useful in meeting the sampling criteria which are outlined below

A. Goal 1: 'Representativeness' of sample data relative to full data set B. Goal 2: Sampling requirements related to interpolation tasks
Table 5, DEVELOPMENT OF SAMPLING CRITERIA

Straightforward formulation of the criteria into 'fitness' functions does not however necessarily imply ease of optimisation, as hinted at by the ease in which more traditional approaches may be discounted in this problem case. A number of methods o f multi-criterion optimisation using evolutionary techniques exist, and this subject in itself forms an area of active research. Indeed, Fonseca & Fleming (1995) suggest that for the related multi-optimisation problem that it is time that an experimental approach be taken to real-world problems. Choice of method incorporates the two main issues involved within the evolutionary approach, those of fitness assignment (or suitability of the site) and search strategies (the probabilistic search through disc rete space). In the main, attention has been given to the former (e.g. Jankowski 1995) and thus this is where the main alternatives lie. Given the apparent lack of comparative material and a desire to maintain simplicity where possible, the most straightforward of approaches is chosen firstly: the use of the popular aggregation (weighted sum) approach to fitness familiar from other techniques together with standard (Michalewicz, 1992) search operations.

The basic implementation of the evolutionary approach is summarised below (After Davis, 1991):

RESULTS AND DISCUSSION

In order for the evolutionary approach to sampling to be evaluated , its advantages and improvements compared to older strategies must be displayed. A major problem facing the study in this regard is that the performance of the evolutionary algorithm is a vector. Because of conflicting criteria, such as the requirements for sites to be spread throughout England and Wales together with the reward of nearby sites in order to be able produce an adequate variogram, a number of non dominated solutions are likely to exist owing to compromise. How therefore can a set of runs be evaluated? Perhaps only by the relative performance of the sample data in producing an interpolated data surface. However, because this experiment is set within the context of a real requirement for meteorological data, only meta-data is as yet available.

Instead, average initial results of the overall fitness of the solution were compared to that from a simple deterministic methodology in which sites were 'weeded' in a sequential fashion according to each criterion in turn, ranked according to perceived importance. While using this comparative technique selections could be made relatively straightforwardly according to second derivative of slope, age and continuance however, problems arose in incorporating the full range of environmental parameters such as height and in balancing a good spread of sites with adequate nearest neighbours in geographically varied areas. The overall fitness value achieved by the ad hoc technique compared well with the 'best' value of the initial random populations of the evolutionary algorithm but unsurprisingly was not well balanced within the scores for individual criteria.

One of the obstacles to the wider application of evolutionary algorithms is the need to experiment using a number of parameters such as the mutation rate, the probability of crossover between populations, population size used and the form of the operators themselves. The success of the evolutionary procedure may be shown by reference to the average convergence curve of fitness per generation over several runs using identical parameters from different random start points and additionally the use of different strategies from similar start points. Work in progress currently shows a steady increase in the average fitness of the total population per generation, with slower gains in 'best' fitness. This steady increase in average population value indicates a performance better than purely random and that the methodology holds promise. Nevertheless, it is the case that further tuning of the crossover and mutation rates is required, and alternative evolutionary operators such as expected crossover and mutation and ranked crossover are currently being explored. Also, the weighted sum method which aggregates all individual fitnesses within one overall, compensatory function is acknowledged as being sub-optimal in comparison to more sophisticated algorithms such as Shaffer's (1985, in Michalawicz 1992) VEGA system which rank populations according to pareto dominance (the ranked position according to each individual criterion, not accumulated fitness). the current formulation does not allow for the generation of two sets of sites for both testing and training. but this could be achieved in future work by the use of a double objective function.

CONCLUSIONS AND FURTHER DIRECTIONS

While simplistic in evolutionary computing terms, in comparison with traditional sampling methodologies the technique outlined appears rather complex. However, the approach outlined does allow for the sampling of fixed point sites of data both characteristic of the underlying landscape spaces and tailored for analysis. To date, only ad hoc alternatives exist. Additionally, with the tailoring of the criteria and therefore the fitness function, the methodology developed is equally applicable to the selection of other geographically referenced point data, such as for example soil survey results and indeed to very different geographical problems.

In order to facilitate the exploration of a wider variety of genetic operators and parameters, work is undergoing to visualise the problem search space according to the defined function as a means of guidance. The generation of equally representative training and testing site combinations in order to avoid over reliance on cross-validation techniques is also underway.

REFERENCES

Aspinall, R.J., Lees B.G. (1994) Sampling and analysis of spatial environmental data, In Waugh, T.C., Healey (Eds) (1994) Advances in GIS Research, Proceedings of Sixth International Symposium on Spatial Data Handling, Waugh: Edinburgh, p1086-1098.

Austin, M.P. , Heyligers, P.C.(1989) Vegetation survey design for conservation: Gradsect sampling of forests in north-eastern New South Wales, Biological Conservation, 50, 13-32.

Bras, R.L., Rodriguez-Iturbe, I. (1976) Network design for the estimation of areal mean of rainfall events, Water Resources Research, (12)6, 1185-1195.

Carver, S. J. (1991) Integrating multicriteria evaluation with GIS, International Journal of Geographical Information Systems, 5, 321-339.

Davis, L. (1991) Handbook of Genetic Algorithms, Van Nostrand Reinhold: New York.

Eastman, J., Jin, W., Kyem, P.A.K., Toledano, J. (1995) Raster procedures for multi-criteria / multi-objective decisions, Photogrammetric Engineering & Remote Sensing, 61(5), 539-547.

Fonseca, C.M., Fleming, P.J. (1995) An overview of evolutionary algorithms in multiobjective optimization, Evolutionary Computing, 3(1), 1-16.

Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley: Reading, Mass.

Jankowski, P. (1995) Integrating geographical information systems and multiple criteria decision-making methods, International Journal of Geographical Information Systems, 9(3), 251-273.

Janssen, R., Rietveld, P. (1990) Multicriteria analysis and GIS: an appliction to agricultural land use in the Netherlands, In , Geographical Information Systems for Urban and Regional Planning, Eds. Scholten, H.J., Stillwell, J.C.H., Kluwer : Dordrecht.

Lees B. (1993) Sampling strategies for machine learning using GIS, In Proceedings of the Second International Conference on Environmental Modelling, Breckenridge, NCGIA: CA.

Margules, C.R., Stein, J.L. (1989) Pattern in the distributions of species and the selection of nature reserves: an example from Eucalyptus forests in south-eastern New South Wales, Biological Conservation, 50(4), 219-38.

Michalewicz, M. (1992) Genetic Algorithms + Data Structures = Evolution Programs, Springer- Verlag: Berlin.

Pereira, A.G., Peckham, R.J., Antunus, M.P. (1993) GENET: A method to generate alternatives for facilities siting using genetic algorithms, In European Conference on Geographical Information Systems, 1993, 973-982.

Pettitt, A.N., McBratney, A.B. (1993) Sampling designs for estimating spatial variance components, Applied Statistics, 42(1), 185-209.

Ritzel, B.J., Eheart, J.W., Ranjithan, S. (1994) Using genetic algorithms to solve a multiple objective groundwater pollution containment problem, Water Resources Research, 30(5), 1589-1603.

Schwefel, H.-P., (1995) Evolution and optimum seeking, Wiley: New York.

Shaffer, J.D., Grefenstette, J.J. (1985) Multi-objective learning via genetic algorithms, In Joshi, A.(Ed), Proceedings of the Ninth International Joint Conference on Artificial Intelligence, Morgan Kaufmann: San Mateo, CA, p593-595.

AFFILIATIONS

Ordnance Survey height data Height
(1: 50,000 raster) Second derivative of slope
Aspect
* University of Edinburgh Central Science Laboratory
Department of GeographyMinistry for Agriculture, Food & Fisheries
Drummond Street Hatching Green
EDINBURGH Harpenden
Herts.