Interest in sharing data, providing network access, and documenting data quality has fostered much of the recent interest in metadata and the development of metadata standards. Metadata should provide a full description of data such that the data can be discovered by potential users, assessed for usefulness, transferred, and used or analyzed in an appropriate context. In an environment of electronic data sharing, metadata is essential, and it is important that it be as accurate as possible. The compilation of metadata requires as much care as the compilation of the original data, if not more.
Metadata as it is currently compiled has two limitations: 1) it is largely compiled manually after the data has been collected, and 2) it is treated as separate from the data. The former problem is to be expected, since the concept of metadata has only recently emerged. The consequence is that retroactive compilation of metadata is an error-prone and sometimes even impossible task. With respect to the latter problem, typical formats for metadata are separate text files or separate relational tables. This paper proposes a structure for organizing metadata collection that moves it toward greater automation and integration with the data it describes. This work is described in relation to metadata for marine data and associated models.
Metadata fundamentally describes data. In this paper it is defined as a formal description of data that allows them to be exchanged and used by people other than those who originally collected them. This paper also addresses metadata for models. In the context of digital libraries, metadata for both data and models becomes essential. The questions that arise are where this all-important metadata comes from and how it is most effectively organized. To develop a formal structure for metadata, this paper considers the following questions: 1) what constitutes metadata, 2) who is responsible for generating metadata, 3) when is metadata best generated, and 4) how can it be generated most efficiently?
The question of what constitutes metadata has been the subject of much debate and effort. Some content lists are long and others short. The Content Standard for Geospatial Metadata (FGDC 1994) represents one outcome of efforts focused specifically on spatial data. The Dublin Core (Weibel et al. 1995), developed by the library community, represents another. Bretherton (1994), as part of an IEEE workshop, provides an interesting white paper on a metadata reference model. Smith (1995) describes metadata as consisting of context, content, and structural information. All of these metadata efforts show substantial agreement in terms of basic metadata content.
The question of what constitutes metadata is restructured here on the basis of four functional roles: 1) search, 2) retrieval, 3) transfer, and 4) evaluation. These functions provide a useful structure for organizing metadata content. They are not intended as exhaustive or mutually exclusive classes. In particular, this structure can help to separate the metadata that forms an independent catalogue entry from the metadata that must be integrally linked with the data and travel with it when it is retrieved.
With respect to the search function, metadata should provide sufficient information to discover whether the data of interest exists within the collection of available data, or exists at all. With respect to the retrieval function, metadata should provide the information users need to acquire the information of interest. The library analogy for this is the procedure for checking out a book. The retrieval component of metadata may be as simple as a URL identifying the location of an electronic data set, or as complex as covering security issues or arranging a financial transaction for access to the information. It may include such information as the off-line location of the data, the contact person, media formats for data distribution, any restrictions on access to the data such as licensing agreements, and information on costs. Metadata to support transfer should provide the information users need to make use of retrieved files on their own machines. This component would include information on the size of the data set (and its metadata) and the logical and physical structure of the data and metadata. The evaluation function of metadata is perhaps the most complex. Metadata to support evaluation can consist of any information that assists users in determining whether data will be useful for an application. In many ways it is a refinement or expansion of the other functions. Some evaluation should be possible prior to retrieval, particularly if there are fees for the data. In addition, sufficient metadata should travel with retrieved data such that evaluation can occur or continue after retrieval. The next sections focus on the search and evaluation functions, as these are the most pertinent to scientific data analysis and modeling.
Metadata to support the search function includes any information that would help a user discover an information resource. We think of author, title, and subject indices as the traditional metadata elements for a library search, but for spatial data these are typically not the optimal search indices. To set the scene we should stipulate the type of archive we plan to search. For the purposes of this paper, a digital library of scientific data collections and models is assumed. These collections could include imagery, maps, in situ collections, models, and related bibliographic materials in a heterogeneous distributed archive maintained by several investigators. Heterogeneous is understood to describe data sets that vary in their quality, area of spatial coverage, range of temporal coverage, spatial and temporal resolution, generating source, database schema, and other characteristics. Given this scenario, we want to determine the metadata required to search such an archive effectively.
The search indices to such a collection should be multidimensional and include various spatial, temporal, and thematic indices. A spatial index can take several different forms, which might include an index for two-dimensional space, three-dimensional space, topological relations, and metric relations. The temporal index might include calendar and clock times as well as process time (tides) and temporal relationships. Thematic indexes should allow searches on any type of data collection, data collector, or thematic variable, as well as on measures of variable similarity. Table 1 provides a potential list of metadata elements for search.
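As a minimal sketch of how spatial, temporal, and thematic indices might be combined in one search, the following Python fragment filters a small catalogue by bounding rectangle, collection period, and thematic variable. The class and function names, the catalogue entries, and the choice of a simple rectangle-overlap test are illustrative assumptions and are not drawn from any of the standards cited here.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Tuple

BBox = Tuple[float, float, float, float]   # (west, south, east, north), degrees
Period = Tuple[date, date]                 # (start, end), inclusive


@dataclass(frozen=True)
class CatalogueEntry:
    """Hypothetical search metadata for one data set in the archive."""
    title: str
    bbox: BBox
    period: Period
    themes: FrozenSet[str]


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    # Planar rectangle test; regions crossing the antimeridian are not handled.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def period_overlaps(a: Period, b: Period) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def search(entries, bbox=None, period=None, theme=None):
    """Return the entries that satisfy every criterion that was supplied."""
    hits = []
    for entry in entries:
        if bbox is not None and not bbox_overlaps(entry.bbox, bbox):
            continue
        if period is not None and not period_overlaps(entry.period, period):
            continue
        if theme is not None and theme not in entry.themes:
            continue
        hits.append(entry)
    return hits


# Invented catalogue entries, used only to exercise the sketch.
catalogue = [
    CatalogueEntry("Cruise A CTD casts", (-70.0, 40.0, -66.0, 43.0),
                   (date(1994, 5, 1), date(1994, 5, 20)),
                   frozenset({"temperature", "salinity"})),
    CatalogueEntry("Mooring B current meter", (-65.0, 41.0, -64.0, 42.0),
                   (date(1993, 1, 1), date(1995, 12, 31)),
                   frozenset({"current velocity"})),
]

matches = search(catalogue, bbox=(-69.0, 41.0, -67.0, 42.0),
                 period=(date(1994, 5, 10), date(1994, 6, 1)), theme="salinity")
print([entry.title for entry in matches])
```

Topological relations, process time, and variable similarity would need richer indices than this overlap test; the sketch only shows the three dimensions being applied together.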
Search should theoretically be possible on any metadata element, so all metadata may be considered relevant to search. Strictly, search metadata includes the information required to find a data set that meets a set of criteria, but it can overlap with evaluation. In some senses a data set is evaluated by its ability to satisfy certain criteria, so in meeting a set of search criteria it may have met a minimum set of evaluation criteria as well.
For the question "Is the data likely to be of use to me?", Bretherton (1994) suggests that metadata should include the following pieces of information:
(2) A summary description of scientific context, including discussion of
primary scientific objectives for the data, the variables, instrument
systems, processing algorithms and quality control procedures
(3) The spatial, temporal coverage and sampling design
(4) The scientific credentials of this data, including evidence for its
credibility, references to scientific publications which used it or
commented on its quality or deficiencies.
For the question "Is it really what I want?", Bretherton (1994) suggests metadata in the form of the following browse products: (1) a typical sample, (2) diagrams, (3) graphs, and (4) derived products. Bretherton (1994) envisaged summaries of this information, prepared as electronic documents with linkages to various layers of information containing more detail if desired.
Other examples of metadata for evaluation are provided by the US Global Change Research Program (USGCRP), US GLOBEC, and related programs. The guiding principle in their data management policy is, "as soon as data might be useful to other researchers the data should be released along with documentation which can be used by other researchers to judge data quality and potential usefulness."
The USGCRP data policy requires that all principal investigators submit 1) pre-data collection plans that describe collection and analysis methodologies in detail, 2) a detailed inventory of all measurements actually made, along with documentation of the measurement techniques used to produce the data, 3) an estimate of the accuracy and precision of each measurement, along with the procedures used to correct errors, remove noise, or otherwise modify the collected data, and any analyses of the data, 4) documentation of the physical setting of the ecosystem, and 5) corrections and improvements made to the data after its submission to the data management office. Table 2 lists a synthesis of metadata elements for evaluation.
Historically, librarians played the dominant role in generating metadata. They performed the valuable function of abstracting, indexing, and cross-referencing information such that it could be efficiently discovered. In the case of scientific data exchanges, data collectors or principal investigators were more often the primary generators of metadata. Early (pre-digital and even early digital) exchanges were generally infrequent. The two parties to an exchange could negotiate the transaction one on one and relay the appropriate information about the data given the context of the exchange. As data exchanges become more frequent and eventually routine, they will become too burdensome for data collectors to negotiate individually. Procedures for exchange will need to be formalized such that information can be transferred without the need for person-to-person interaction in every case. Such exchanges will require sufficient information to serve the four functions discussed above. The possible metadata generators or compilers include some of the same historical players, the librarian and the data collector or domain expert, but may expand to include new players such as computer scientists, electrical engineers, or others involved in knowledge extraction and data mining. Librarians have an important role as the traditional experts in abstracting and cataloguing information. The data collector and domain specialist are familiar with the process of data collection and have expertise in the characteristics of a particular type of data. Within this process there will likely be complementary or collaborative roles for each type of player. The non-human extensions to this include smart data loggers and linked instrumentation, as well as smarter systems that are capable of logging metadata as data are processed.
Metadata need not be collected all at once, and there are in fact distinct points in the history of a data set at which metadata is logically generated. The times at which metadata collection may occur can be grouped into three broad stages: pre-data collection, concurrent with data collection, and post-data collection. Characteristics of the data and metadata will determine the optimal stage. This three-stage process is now being recommended by the USGCRP. The individual stages are discussed in more detail below.
Pre-data collection. Pre-data collection of metadata can take two forms: that which is not specific to the data to be collected and that which is. Non-specific metadata compiled prior to data collection could include such things as thesauri, gazetteers, and instrument descriptions. Metadata specific to a particular data set that could logically be compiled prior to data collection includes the data sampling design methodology, specification of purpose, planned geographic coverage, planned depth or elevation ranges, planned collection period, and instrumentation. As an example from marine data collections, the US GLOBEC pre-collection reports require inclusion of study objectives, principal investigators, sampling plan, and identification of data types and instrumentation. Typical descriptions include navigation, timekeeping, sensor make and model, net opening, mesh size, rate of retrieval, mooring configuration, and other information particular to a data collection device.
Concurrent data collection. Concurrent metadata collection occurs simultaneously with data collection. Concurrent metadata would be supported by linked instrumentation in which, at the same time a sample is recorded, other variables are recorded such as horizontal position, vertical position or depth, time, salinity, temperature, wind speed, and instrument variable settings. The U.S. GLOBEC Steering Committee strongly recommends the use of logging systems that record the underway data, including navigation, meteorology, near-surface temperature and salinity, and any other data collected automatically.
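A minimal sketch of concurrent metadata logging is given below, assuming that position and depth can be read from linked instruments at the moment a sample is taken. The function name, the record fields, and the two reader callables are hypothetical stand-ins and do not describe any particular logging system.

```python
from datetime import datetime, timezone


def log_sample(variable, value, read_position, read_depth, settings=None, now=None):
    """Build one record that stores a sample together with its concurrent metadata.

    read_position and read_depth are hypothetical callables standing in for
    linked instruments (for example a GPS receiver and a pressure sensor).
    """
    lat, lon = read_position()
    return {
        "variable": variable,
        "value": value,
        "time_utc": (now or datetime.now(timezone.utc)).isoformat(),
        "latitude": lat,
        "longitude": lon,
        "depth_m": read_depth(),
        "instrument_settings": dict(settings or {}),
    }


# Stub instruments with invented readings, used only to exercise the sketch.
record = log_sample(
    "salinity", 34.7,
    read_position=lambda: (41.5, -68.25),
    read_depth=lambda: 12.0,
    settings={"sample_rate_hz": 1},
    now=datetime(1994, 5, 10, 14, 30, tzinfo=timezone.utc),
)
print(record)
```

Because the position, depth, and time are captured in the same call as the observation, no later manual step is needed to reattach them to the sample.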
Post-data collection. Post-data collection of metadata covers compilation that could only logically take place after the data has been collected. These elements would include the actual processing history, the history of use of the data, quality assessment, generation of browse files, and computation of additional indexing, such as computation of topological relationships and identification and indexing of image content.
It is not efficient to describe each element in the table, but a few examples will illustrate the intent of the structure. The backslash that separates values in the table indicates that the terms following the backslash are related. For example, in the case of the first element the backslash indicates that the compilation methods key in and look up apply to the pre-collection stage, and the compilation methods computed and inferred apply to the post-collection stage. Referring to the first metadata element, "Geographic region", the possible compilation stages are pre- or post-collection. Most data collection campaigns will have defined the general geographic region prior to data collection, so this element could be keyed in by the data collector. To expedite metadata collection for similar pre-data collection elements, interactive pre-collection forms could be designed for easy key in by the data collector. Geographic region could take two forms: a named geographic region or a specification by coordinates. If either one has been keyed in, the other can be generated through a look-up table or gazetteer. Librarians are included in the list of compilers for their potential role in generating a look-up gazetteer. In the case of post-data compilation, a coordinate-defined geographic region may be computed from the set of measured coordinates. For example, if a cruise visits fifty stations and a position is measured for each station, the bounding rectangle or convex hull can be computed for this point set to generate a value for the geographic region. In this case the compiler is the system. If this metadata element is to be compiled retroactively and no geographic measurements exist for a data set, it may still be possible to infer a geographic region from characteristics of the data set (Futch, Chin, McGranaghan, and Lay 1992). The compiler could be a system, but a highly specialized knowledge extraction mechanism would be required in this case. The depth range and the temporal collection period metadata elements have similar structures.
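The post-collection computation of a geographic region from measured station positions can be sketched as follows. This is an illustration under stated assumptions: positions are treated as planar (longitude, latitude) pairs, regions crossing the antimeridian are not handled, and the convex hull uses Andrew's monotone chain algorithm, which is one common choice and is not prescribed by this paper. The function names and station coordinates are invented for the example.

```python
def bounding_rectangle(positions):
    """Return (west, south, east, north) for a list of (lon, lat) pairs."""
    if not positions:
        raise ValueError("at least one position is required")
    lons = [p[0] for p in positions]
    lats = [p[1] for p in positions]
    return (min(lons), min(lats), max(lons), max(lats))


def convex_hull(positions):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(positions))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when the turn o -> a -> b is counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Five invented station positions (lon, lat); the fifth lies inside the others.
stations = [(-70.0, 41.0), (-68.0, 41.0), (-68.0, 43.0), (-70.0, 43.0), (-69.0, 42.0)]
print(bounding_rectangle(stations))
print(convex_hull(stations))
```

The same pattern, taking the minimum and maximum of the measured values, would yield the depth range and the temporal collection period.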
As additional examples, the horizontal and vertical position elements are indicated as being collected concurrently with the data. The logical approach would be through linked instruments in which, for example, horizontal and vertical GPS coordinates are generated at the same time a variable is observed. The compiler is thus an instrument, and the metadata element is directly measured. The calendar/clock time element behaves similarly. Given these examples, the structure of the other elements in the table should be self-explanatory.
Table 4 illustrates the same structure for metadata elements for the evaluation function. The sampling plan element is one that should be documented prior to data collection. It is most logically compiled by the data collector, by key in on some type of interactive pre-collection form. If the sampling plan has a spatial configuration, a sample plan diagram can be included as a browse file. This element would also be compiled prior to data collection by the data collector or domain specialist. Key in may be the only current method for compiling this element, but the possibility exists for computing a spatial sampling plan as well. Most of the other browse files would logically be compiled after data collection. However, some could be compiled concurrently with data collection. Examples would be underwater video images, or ship tracks that could be plotted as the ship steams from station to station in the process of collecting data.
Tables 5 and 6 present the metadata structure for models and model outputs. Elements that are independent of the model run include such items as the name, type of model and type of process being modeled, parameters, boundary conditions, authors, citations of the model, and model software. Information that could be recorded concurrently with a model run includes actual parameter values, boundary condition values, spatial and temporal units, and pointers to the observational data used to force the model. If the intent is to evaluate a model outcome, an investigator is likely to want to evaluate the data used to force the model. A useful component of model metadata will therefore be a link to the metadata of the observational data used to force the model. Post-model run metadata could include evaluation methods and results, references, and browse graphics of the model output.
Another interesting side benefit of this structure is that it points to a need for metadata about the metadata. In an evaluation context it may be important for a user to know the compilation method of a metadata element. For example, if a geographic region or time period was inferred for a data set, this provides important information about the reliability of that element. This complexity quickly indicates that traditional relational database models will not be adequate for metadata representation.
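One way to carry this metadata about the metadata is to store the compilation stage, compiler, and compilation method alongside each element's value. The sketch below is illustrative only: the class names and field layout are assumptions, while the stage, compiler, and method vocabularies are taken from the discussion above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Stage(Enum):
    PRE = "pre-data collection"
    CONCURRENT = "concurrent with data collection"
    POST = "post-data collection"


class Method(Enum):
    KEY_IN = "key in"
    LOOK_UP = "look up"
    MEASURED = "measured"
    COMPUTED = "computed"
    INFERRED = "inferred"


@dataclass
class MetadataElement:
    """A metadata value together with a record of how it was compiled."""
    name: str
    value: Any
    stage: Stage
    compiler: str      # e.g. "data collector", "librarian", "instrument", "system"
    method: Method


def inferred_elements(elements):
    """Names of elements whose values were inferred rather than observed."""
    return [e.name for e in elements if e.method is Method.INFERRED]


# Invented values, used only to exercise the sketch.
elements = [
    MetadataElement("Geographic region", (-70.0, 41.0, -68.0, 43.0),
                    Stage.POST, "system", Method.INFERRED),
    MetadataElement("Horizontal position", (41.5, -68.25),
                    Stage.CONCURRENT, "instrument", Method.MEASURED),
    MetadataElement("Sampling plan", "grid of fixed stations",
                    Stage.PRE, "data collector", Method.KEY_IN),
]
print(inferred_elements(elements))
```

A user evaluating the data set could then treat the inferred geographic region with more caution than the directly measured position.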
Bretherton, F. 1994. Reference Model for Metadata : a Strawman. http://www.llnl.gov/liv_comp/metadata/papers/whitepaper/bretherton.ps
FGDC 1994. Content Standards for Digital Geospatial Metadata. June 8. Federal Geographic Data Committee. Washington DC.
Frawley, W. J., Piatetsky-Shapiro, G., Matheus, J. 1991. Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases, AAAI Press, Menlo Park. 1-27.
Futch, S., Chin, D., McGranaghan, M., and J-G. Lay. 1992. Spatio-Linguistic Reasoning in LEI (Locality and Elevation Interpreter). In A. Frank, I. Campari, and U. Formentini (Eds.). Theories and Models of Spatio-Temporal Reasoning in Geographic Space. Pisa, Italy. 318-327.
Smith, T. R. 1995. Paper presented at DL Metadata Workshop. Santa Barbara.
U.S. GLOBEC. 1994. Ocean Ecosystems Dynamics. U.S. GLOBEC Data Policy Report Number 10. February 1994. http://www.ccpo.odu.edu/globec/globec_rn_10_feb_1994.html
Weibel, S., Godby, J., Miller, E. and R. Daniel. 1995. OCLC/NCSA Metadata Workshop Report. Office of Research Online Computer Library Center, Inc. http//www