The performance of an environmental model depends heavily on the 'representativeness' of the sample data with which it has been developed in relation to the overall data set, together with the adequacy of the sample with regard to the modelling technique itself. While advances have been made towards achieving both goals individually, in few cases are these combined. Additionally, the situation with regard to pre-existing data at fixed points remains 'ad hoc'. Given that increasing volumes of pre-sampled point data now exist in electronic form, these omissions are an issue of currency within the wide range of topics embraced by the term 'environmental modelling'. That this increase in data richness has frequently been matched by the rising cost of data means the ability to select appropriate data is paramount not only in terms of model quality but also in terms of the economic feasibility of the overall project itself. A new sampling methodology is presented to deal with these issues, framed within the context of the choice of meteorological station data for the development of interpolated weather surfaces but of wider applicability.
Fundamental to the new strategy is the representation of sampling requirements by means of a multi-criteria function. Criteria are developed to deal with each individual variable, for example a desired height function/range or the maximisation of record length. Data selection may also be tailored to the model types for which the data are to be used, for example by ensuring that an adequate number of nearby points are selected for interpolation by kriging. This combination of possibly contradictory criteria is then optimised using a genetic algorithm (GA) in conjunction with the geographical information system in which the data are stored. While the solution of multi-criteria problems can be achieved using a variety of more traditional techniques, the GA approach was chosen both for its ability to manage large volumes of data and for its management of compromise. As a report on work in progress, discussion focuses on the reasons for the choice of an evolutionary approach and the representation of the sampling problem within this framework.
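As a minimal sketch of how such a multi-criteria function might be expressed, the Python below scores a candidate subset of stations on three illustrative criteria: coverage of the height range, mean record length, and the proportion of selected stations with enough near neighbours for kriging. The `Station` fields, the weights, the radius and neighbour thresholds, and the weighted-sum form itself are all assumptions made for illustration; none of them is taken from the methodology described here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Station:
    # Hypothetical station record; field names and units are illustrative only.
    x: float             # easting (km)
    y: float             # northing (km)
    height: float        # metres
    record_length: int   # years


def height_coverage(subset, all_stations, bins=10):
    """Fraction of the height bands occupied in the full set that the subset also covers."""
    lo = min(s.height for s in all_stations)
    hi = max(s.height for s in all_stations)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all heights are equal

    def band(s):
        return min(int((s.height - lo) / width), bins - 1)

    occupied = {band(s) for s in all_stations}
    covered = {band(s) for s in subset}
    return len(covered & occupied) / len(occupied)


def mean_record_length(subset, all_stations):
    """Mean record length of the subset, scaled by the longest record available."""
    longest = max(s.record_length for s in all_stations)
    return sum(s.record_length for s in subset) / (len(subset) * longest)


def neighbour_adequacy(subset, radius=50.0, min_neighbours=2):
    """Fraction of selected stations with at least min_neighbours others within radius."""
    ok = 0
    for s in subset:
        near = sum(
            1 for t in subset
            if t is not s and math.hypot(s.x - t.x, s.y - t.y) <= radius
        )
        ok += near >= min_neighbours
    return ok / len(subset)


def fitness(subset, all_stations, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three criteria, each scaled to [0, 1]; higher is better.

    Assumes a non-empty subset and at least one positive record length.
    """
    scores = (
        height_coverage(subset, all_stations),
        mean_record_length(subset, all_stations),
        neighbour_adequacy(subset),
    )
    return sum(w * s for w, s in zip(weights, scores))
```

A weighted sum is only one way of combining conflicting criteria; it is used here because it is the simplest form that makes the compromise between criteria explicit through the weights.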
However sophisticated a model or procedure, the results will only be as good as the data/knowledge underpinning its conception. This statement in itself is not new or radical, and a number of sampling methodologies are very familiar in a wide range of disciplines. Such techniques include, for example, the random, stratified or systematic families of sampling. More recently these familiar methods have been augmented by others such as gradsect sampling (Austin & Heyligers, 1989), which deliberately samples across steep gradients of environmental change. Such moves highlight an increased awareness of the need to maximise the possible environmental range within a sample. Additionally, work by biologists Margules & Stein (1989) stressing the limitation of sampling within one dimension only has been applied within a geographical framework (Lees 1994, Aspinall & Lees 1994). The latter argue that not only should sampling be carried out with respect to the full range of environmental criteria, but that each environmental space should be sampled individually. Further dimensions to the sampling problem that are difficult to incorporate within the context of traditional sampling methodologies include the purpose for which the data are to be used (e.g. Pettitt & McBratney, 1993), together with strategies which formally take into account the cost of sampling within the overall sampling objective (e.g. Bras and Rodriguez-Iturbe, 1976). What may be summarised from the current literature, therefore, are the concepts of representative data, the need to look at multidimensionality and the possibility of tailoring the sample according to future analytical requirements.
In the case of choosing the most characteristic pre-located site data from a wider set, as distinct from a free choice of sample sites over a particular landscape, the means by which all of the principles discussed above may be applied within the sampling process is somewhat ad hoc. Such a situation is, however, frequently to be found: with the increasing volumes of pre-sampled data available in electronic form, the task of choosing relevant data becomes commensurately daunting. Even when a restricted and relevant data set exists, its full cost may render a project economically unviable, making a means of partial selection critical. Eastman (1995) classifies problems with a single objective (making an 'appropriate' sample choice) subject to a number of possibly conflicting criteria (representative data, multidimensionality, analytical requirements, cost) as multi-criteria evaluations. Applying this definition, techniques for solving such problems within the broader literature are used as a base for further methodological discussion. It should be noted that some ambiguity in the use of such terminology arises even within a GIS environment, Jankowski (1995) referring to both multi-criteria and multi-objective techniques as 'multiple criteria decision making methods'.
A wide variety of traditional optimisation and search techniques exist which have been drawn upon in the development of 'multiple criteria decision making methods' (Table 1). As Schwefel (op cit, p165) notes, 'the question of which is the best strategy is itself a kind of optimisation problem'! Just as multi-objective analysis has tended to be viewed as a 'natural extension of mathematical programming' (Jankowski 1995), so have the overlay and hierarchical methods of multi-criteria analysis found favour within the context of geographical information systems (e.g. Carver 1991, Eastman et al 1995).
While a straightforward solution to the multi-criteria sampling problem was sought, a number of disadvantages in using these conventional methods were identified. These are summarised in Table 2 below.
In addition to conventional search and optimisation algorithms, a newer class of methods known as evolutionary algorithms (including genetic algorithms) has also been used in multi-criteria optimisation (Goldberg, op cit, p197). Because of the major obstacles that emerge when traditional methods are applied to the sampling problem (search space size, the objective as a collective of suitable sites, and the selection of conflicting criteria), these more recent techniques are evaluated for their potential use in the sampling application. The results of this analysis are shown in Table 3, on the basis of which the decision was taken to develop a genetic algorithm approach to the extraction of a characteristic sub-sample of fixed data points from an established database.
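To give a sense of the search space size referred to above, the short calculation below counts the distinct ways of choosing 200 sites from 985, the figures used for the meteorological case in this paper. It is an illustrative aside rather than part of the original analysis, and the variable names are arbitrary.

```python
import math

n_sites, n_chosen = 985, 200

# Number of distinct unordered subsets of 200 sites drawn from 985.
n_subsets = math.comb(n_sites, n_chosen)

# Report only the order of magnitude; the exact integer has hundreds of digits.
print(f"about 10^{len(str(n_subsets)) - 1} candidate samples")
```

A space of this size rules out exhaustive enumeration, which is one reason a heuristic search is attractive here.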
In developing a new sampling methodology, the first question to be asked is 'what is required of the sample data?'. As established within the literature review, the two main goals should be the 'representativeness' of the data with regard to the overall problem space, and the tailoring of data to suit the requirements of further analytical techniques. Specific criteria relevant to the search may then be introduced within the framework of more general evolutionary code. The local context of the sampling problem therefore requires close analysis. In this case the sample is to be used in the interpolation of point-type meteorological data for England & Wales. The task is to choose a total of 200 sites from a possible 985, the number of sites being restricted for economic reasons. The available meta-data for each site, the derived data and their sources are detailed below (Table 4); in the first instance they exclude partial or subjective data such as quality statements.
| MET. OFFICE | DERIVED |
| Location | Location |
| Start of recording (year) | Record length |
| End of recording (year) | Age of information |
| | Currency of data |
| Ordnance Survey height data (1:50,000 raster) | Height |
| | Second derivative of slope |
| | Aspect |
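The representation of the sampling problem within an evolutionary framework can be sketched as follows. In this minimal Python example a candidate solution is a fixed-size set of site indices, crossover preserves the sites shared by both parents, and mutation swaps one selected site for an unselected one, so every candidate always contains exactly the required number of sites. The operators, parameter values and function names are assumptions chosen for illustration and are not the implementation described in this paper; the `fitness` argument stands for whatever multi-criteria function is in use.

```python
import random


def random_subset(n_sites, k, rng):
    """A candidate sample: k distinct site indices out of n_sites."""
    return frozenset(rng.sample(range(n_sites), k))


def crossover(a, b, k, rng):
    """Child keeps the sites common to both parents, then fills up from the rest of their union."""
    common = a & b
    rest = sorted((a | b) - common)
    return frozenset(common | set(rng.sample(rest, k - len(common))))


def mutate(subset, n_sites, rate, rng):
    """With probability `rate`, swap one selected site for an unselected one."""
    if rng.random() >= rate:
        return subset
    candidates = [i for i in range(n_sites) if i not in subset]
    if not candidates:
        return subset
    out = rng.choice(sorted(subset))
    return (subset - {out}) | {rng.choice(candidates)}


def evolve(fitness, n_sites, k, pop_size=30, generations=50,
           mutation_rate=0.3, seed=0):
    """Simple generational GA with binary tournaments and elitism.

    Returns the best subset found; `fitness` maps a frozenset of indices to a score.
    """
    rng = random.Random(seed)
    population = [random_subset(n_sites, k, rng) for _ in range(pop_size)]

    def tournament():
        # Pick two candidates at random and keep the fitter one.
        return max(rng.sample(population, 2), key=fitness)

    for _ in range(generations):
        best = max(population, key=fitness)
        children = [best]  # elitism: carry the best candidate forward unchanged
        while len(children) < pop_size:
            child = crossover(tournament(), tournament(), k, rng)
            children.append(mutate(child, n_sites, mutation_rate, rng))
        population = children
    return max(population, key=fitness)
```

For the case described above, the call would take the form `evolve(fitness, n_sites=985, k=200)`, with `fitness` evaluating the chosen criteria against the site meta-data and derived data.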
* University of Edinburgh, Department of Geography, Drummond Street, Edinburgh
Central Science Laboratory, Ministry for Agriculture, Food & Fisheries, Hatching Green, Harpenden, Herts.