Natalia V. Andrienko
German National Research Center for Information Technology, Sankt-Augustin, Germany

Position Statement

Knowledge Extraction from Spatially Referenced Databases: a Project of an Integrated Environment

Current state

Knowledge Discovery in Databases (KDD) denotes the work of revealing significant relationships and regularities in data by means of algorithms collectively called “data mining”. The KDD process consists of iterating over the following steps [5]:

1. Data selection and preprocessing, such as checking for errors, removing outliers, handling missing values, and transformation of formats.
2. Data transformations, for example, discretization of variables or production of derived variables.
3. Selection of a data mining method and adjustment of its parameters.
4. Data mining, i.e. application of the selected method.
5. Interpretation and evaluation of the results.

In this process the data mining phase takes no more than 20% of the total workload. However, this phase is much better supported, both methodologically and by software, than all the others [6]. This is not surprising, because performing the other steps is a matter of art rather than a routine that lends itself to automation [7]. Lately some efforts in the KDD field have been directed towards intelligent support of the data mining process, in particular assistance in selecting an analysis method depending on data characteristics [2,3].

A particular case of KDD is knowledge extraction from spatially referenced data, i.e. data referring to geographic objects, locations, or parts of a territorial division. In analyzing such data it is very important to account for the spatial component (relative positions, adjacency, distances, directions, etc.). However, information about spatial relationships is very difficult to represent in the discrete, symbolic form required by data mining methods. There is work on spatial clustering [4] and on the use of spatial predicates [8], but these approaches are characterized by complex data descriptions and high computational cost.

Our suggestion

For the analysis of spatially referenced data we propose to integrate traditional data mining instruments with automated cartographic visualization and tools for interactive manipulation of graphical displays. The essence of the idea is that an analyst can view both the source data and the results of data mining in the form of maps, which convey spatial information to a human in a natural way. This offers at least a partial solution to the challenges posed by spatially referenced data: the analyst can easily see spatial relationships and patterns that are inaccessible to a computer, at least at the present stage of development. In addition, such an integration can give significant support to various KDD steps.

The most evident use of cartographic visualization is in the evaluation and interpretation of data mining results. However, maps can also be helpful in other activities. For example, visual analysis of the spatial distributions of different data components can help in selecting representative variables for data mining and, possibly, suggest which derived variables would be useful to produce. At the data preprocessing stage a map presentation can expose “strange” values that may be errors in the data or outliers. Discretization, i.e. transformation of a continuous numeric variable into one with a limited number of values by means of classification, can be aptly supported by a dynamic map display showing the spatial distribution of the classes. With such support the analyst can adjust the number of classes and the class boundaries so that interpretable spatial patterns arise.
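As an illustration of the interactive discretization described above, the following minimal Python sketch shows how class membership could be recomputed each time the analyst moves a class boundary. The function names, the district names, and the attribute values are hypothetical and are not taken from Descartes; the map redraw is represented only by the returned class assignment.

```python
from bisect import bisect_right

def classify(values, breaks):
    """Assign each value a class index given sorted class boundaries.

    With breaks [b1, ..., bk] there are k + 1 classes: class 0 holds
    values below b1, class k holds values at or above bk.
    """
    return [bisect_right(breaks, v) for v in values]

# Hypothetical values of one numeric attribute for five districts.
unemployment = {"A": 3.1, "B": 7.4, "C": 5.0, "D": 12.8, "E": 9.9}

def classes_for(breaks):
    """Recompute the class of every district for the given boundaries."""
    names = list(unemployment)
    return dict(zip(names, classify(unemployment.values(), breaks)))

# The analyst first tries two classes, then three with other boundaries.
two_classes = classes_for([6.0])
three_classes = classes_for([5.0, 10.0])
```

In an interactive setting each call to `classes_for` would be followed by repainting the map, so that the analyst sees at once whether the new boundaries yield an interpretable spatial pattern.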

More specifically, we propose to build an integrated KDD environment on the basis of two existing systems: Kepler [9] for data mining and Descartes [1] for interactive visual analysis of spatially referenced data. Kepler includes a number of data mining methods and, very importantly, provides a universal plug-in interface for adding new methods. In addition, the system contains some tools for transforming data and formats and can present some kinds of data mining results graphically (trees, rules, and groups). Descartes automates the generation of maps presenting user-selected data and supports various interactive manipulations of map displays that can help to reveal visually important features of the spatial distribution of the data. Descartes also supports some data transformations that are productive for visual analysis. It is essential that both systems are designed to serve the same goal: helping the user to gain knowledge about data. They offer different instruments that can complement each other and together produce a synergistic effect.

In its present state, Kepler contains the following data mining methods:

1. Methods fw and kNN estimate the importance of different variables in relation to the values of a selected variable.
2. Methods C4.5 and C5.0 derive classification trees.
3. Methods C4.5, FOIL, and BNGE generate classification or prediction rules.
4. Methods SIDOS and MIDOS find statistically interesting subgroups of objects with regard to the distribution of values of a variable.
5. Method AutoClass performs clustering.

Most of the methods (groups 1-4) require the selection of a target variable, which typically should be discrete, and are intended to reveal relationships between the target variable and the other variables selected for the analysis. Descartes can be used effectively to produce “promising” discrete variables that include, implicitly or explicitly, a spatial component. The following ways of doing this are available:

1. Classification by segmentation of the value range of a numeric variable into subintervals.
2. Cross-classification of a pair of numeric attributes. In both cases the process of classification is highly interactive and is supported by a map presentation of the spatial distribution of the classes that reflects in real time all changes in the definition of the classes.
3. Spatial aggregation of objects performed by the user through the map interface. The results of such an aggregation can be represented by a discrete variable. For example, the user can divide city districts into “center” and “periphery” or encircle several regions, and the system will generate a variable indicating to which aggregate each object belongs.
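The second and third ways listed above can be sketched as follows. This is only a minimal Python illustration under stated assumptions: the function names, the coordinates, and the use of object centroids and a circular user-drawn region are hypothetical simplifications, not a description of how Descartes implements these operations.

```python
from bisect import bisect_right
from math import hypot

def cross_classify(x, y, x_breaks, y_breaks):
    """Combine the interval classes of two numeric attributes into one label."""
    return (bisect_right(x_breaks, x), bisect_right(y_breaks, y))

def aggregate_by_circle(centroids, center, radius,
                        inside="center", outside="periphery"):
    """Label each object by whether its centroid lies in a user-drawn circle."""
    cx, cy = center
    return {
        name: inside if hypot(px - cx, py - cy) <= radius else outside
        for name, (px, py) in centroids.items()
    }

# Hypothetical district centroids in map coordinates.
centroids = {"A": (0.0, 0.0), "B": (1.0, 1.0), "C": (5.0, 5.0)}

# Discrete variable produced by encircling the city center on the map.
zone = aggregate_by_circle(centroids, center=(0.0, 0.0), radius=2.0)

# Cross-class of one object: first attribute below 5.0, second in [30, 60).
label = cross_classify(3.0, 40.0, x_breaks=[5.0], y_breaks=[30.0, 60.0])
```

Either result is a discrete variable in the sense used above and could serve as a target variable for the methods of groups 1-4.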

The results of most data mining methods can be presented naturally on maps. The most evident case is the presentation of subgroups or clusters: the membership of a geographical object in a subgroup or a cluster can be shown by color or by an icon. The same technique can be applied to tree nodes and rules: the visual features of an object indicate whether it is included in the class corresponding to a selected tree node, or whether a given rule applies to the object and, if so, whether the object is correctly classified.
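The three states that a map symbol has to distinguish for a rule can be sketched in Python as follows. The function name, the district data, and the example rule are hypothetical and serve only to illustrate the idea; they are not output of Kepler.

```python
def rule_status(obj, condition, predicted_class, target):
    """Return how a rule relates to one object.

    'not covered'   - the rule's condition does not hold for the object
    'correct'       - the rule applies and predicts the object's actual class
    'misclassified' - the rule applies but predicts a different class
    """
    if not condition(obj):
        return "not covered"
    return "correct" if obj[target] == predicted_class else "misclassified"

# Hypothetical districts with one numeric attribute and a discrete target.
districts = {
    "A": {"income": 62, "growth": "high"},
    "B": {"income": 35, "growth": "low"},
    "C": {"income": 58, "growth": "low"},
}

# Hypothetical rule: income > 50 -> growth = "high".
status = {
    name: rule_status(d, lambda o: o["income"] > 50, "high", "growth")
    for name, d in districts.items()
}
```

Each status value would then be mapped to a distinct color or icon on the map.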

Since Kepler contains its own facilities for (non-geographical) presentation of data mining results, it would be productive to link Kepler’s and Descartes’ displays dynamically. This means that, when the cursor is positioned on an icon symbolizing a subgroup, a tree node, or a rule in a Kepler display, the corresponding objects are highlighted in a Descartes map. Conversely, selecting a geographical object or a group of objects in a map highlights the subgroups or tree nodes to which the selection belongs and the rules applicable to it.
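The two-way highlighting described above amounts to a lookup in both directions between result elements and geographical objects. A minimal Python sketch, assuming that each subgroup, tree node, or rule is known by an identifier together with the set of objects it covers (the class name `LinkedSelection` and the sample data are hypothetical):

```python
from collections import defaultdict

class LinkedSelection:
    """Two-way index between result elements and the objects they cover."""

    def __init__(self, groups):
        # element id -> set of covered objects
        self.groups = {g: set(members) for g, members in groups.items()}
        # object -> set of element ids that cover it
        self.groups_of = defaultdict(set)
        for g, members in self.groups.items():
            for obj in members:
                self.groups_of[obj].add(g)

    def objects_for(self, group):
        """Objects to highlight on the map when an element is pointed at."""
        return set(self.groups.get(group, ()))

    def groups_for(self, objects):
        """Elements to highlight when objects are selected on the map."""
        result = set()
        for obj in objects:
            result |= self.groups_of.get(obj, set())
        return result

link = LinkedSelection({"subgroup 1": ["A", "B"], "subgroup 2": ["B", "C"]})
on_map = link.objects_for("subgroup 1")   # from data mining display to map
in_display = link.groups_for(["B"])       # from map to data mining display
```

Keeping both directions in one index means that either display can answer a highlighting request without scanning all results.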

Besides their main capabilities (data mining in Kepler; data visualization and analysis-supporting display manipulation in Descartes), the systems contain additional useful functions and components to be included in the integrated environment. Kepler contains the tool DataZoom [10], which supports the analysis of data tables through a highly interactive dynamic interface for sorting, focusing, and querying. Kepler can also perform a number of necessary routine operations on datasets: transformation of formats, access to databases, querying, etc. Descartes has a convenient graphical interface for outlier removal and an easy-to-use tool for generating derived variables by means of arithmetic operations on existing variables.

The considerations above can be summarized as three kinds of links between data mining and cartographic visualization:

From geography to mathematics: using dynamic maps, the user arrives at some geographically interpretable results or hypotheses and then tries to find an explanation of the results or checks the hypotheses by means of data mining methods.

From mathematics to geography: data mining methods produce results that are then visually analyzed after being presented on maps.

Linked displays: graphics representing results of data mining in the usual (non-cartographic) form are viewed in parallel with maps, and dynamic highlighting visually connects corresponding elements in both types of displays.

Software implementation

The feasibility of implementing the project in software is supported by the fact that both systems have a client-server architecture and use the TCP/IP protocol for client-server communication. The client components of both systems are implemented in Java.

For coupling the two systems, it is necessary to organize their shared use of the same data and to create a mechanism to distribute and transfer control between the systems. For this purpose a communication protocol should be designed and implemented.
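The protocol itself remains to be designed. Purely as an illustration of what a message exchanged between the two systems might look like, the following Python sketch encodes a selection event as one line of text that could be sent over a TCP connection. The message type `select_objects`, the field names, and the choice of JSON as the encoding are all assumptions made for this example and are not part of Kepler or Descartes.

```python
import json

def encode(msg_type, payload):
    """Serialize one message as a single newline-terminated UTF-8 line."""
    text = json.dumps({"type": msg_type, "payload": payload}, sort_keys=True)
    return (text + "\n").encode("utf-8")

def decode(line):
    """Parse one received line back into (message type, payload)."""
    msg = json.loads(line.decode("utf-8"))
    return msg["type"], msg["payload"]

# Hypothetical message: the map reports which objects the user selected.
wire = encode("select_objects", {"dataset": "districts", "ids": ["A", "B"]})
msg_type, payload = decode(wire)
```

A line-oriented format of this kind keeps message boundaries explicit on a stream connection; passing control between the systems could be expressed through further message types.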

References

1. Andrienko, G. and Andrienko, N. (1998) Intelligent Visualization and Dynamic Manipulation: Two Complementary Instruments to Support Data Exploration with GIS. In Proceedings of AVI'98: Advanced Visual Interfaces Int. Working Conference (L'Aquila, Italy, May 24-27, 1998), ACM Press, pp.66-75.

2. Brodley, C. (1993) Addressing the Selective Superiority Problem: Automatic Algorithm / Model Class Selection. In Machine Learning: Proceedings of the 10th International Conference, University of Massachusetts, Amherst, June 27-29, 1993. San Mateo, Calif.: Morgan Kaufmann, pp.17-24.

3. Gama, J. and Brazdil, P. (1995) Characterization of Classification Algorithms. In Progress in Artificial Intelligence, LNAI 990, Berlin: Springer-Verlag, pp.189-200.

4. Gebhardt, F. (1997) Finding Spatial Clusters. In Principles of Data Mining and Knowledge Discovery PKDD’97, LNCS 1263, Berlin: Springer-Verlag, pp.277-287.

5. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996) The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39 (11), 27-34.

6. John, G.H. (1997) Enhancements to the Data Mining Process. PhD dissertation, Stanford University. Available at the URL http://robotics.stanford.edu/~gjohn/

7. Kodratoff, Y. (1997) From the art of KDD to the science of KDD. Research report 1096, Université de Paris-Sud.

8. Koperski, K., Han, J., and Stefanovic, N. (1998) An Efficient Two-Step Method for Classification of Spatial Data. In Proceedings SDH’98, Vancouver, Canada: International Geographical Union, pp.45-54.

9. Wrobel, S., Wettschereck, D., Sommer, E., and Emde, W. (1996) Extensibility in Data Mining Systems. In Proceedings of KDD’96 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press, pp.214-219.

10. Spenke, M., Beilken, Chr., and Berlage, Th. (1996) The Interactive Table for Product Comparison and Selection. In Proceedings of the UIST 96 9th Annual Symposium on User Interface Software and Technology, Seattle, November 6-8, 1996. ACM Press, pp.41-50.


Curriculum Vitae

Education:

M.A., Computer Sci, Kiev State University, 1985
Ph.D., Computer Sci, Moscow State University, 1993
 

Experience:

Post-graduate, Research scientist at the Institute of Mathematics, Kishinev (1988-1993)
Senior research scientist, Assistant professor at the Pushchino State University (1993-1997)
Researcher at the GMD (1997-present)

Research interests:

Intelligent computer graphics and data visualization.
Geographic information systems.
Expert systems, knowledge engineering.
Knowledge Discovery in Databases (KDD).
Exploratory Data Analysis.
Intelligent information retrieval, knowledge-based hypertext design.

Additional information:

Member of the Russian Artificial Intelligence (AI) Association.
Author of about 60 published papers in the areas of AI, IR, and GIS.
Participated in several international conferences, including the International GIS conference (Luxembourg, December 1996), the ACM CHI conference (Atlanta, USA, March 1997), the ERCIM Workshop “User Interfaces for ALL” (Strasbourg, France, November 1997), and the ACM Advanced Visual Interfaces conference (Italy, May 1998).
One of the authors of the WWW version of the IRIS (later Descartes) system, accessible from http://allanon.gmd.de/and/java/iris/. The system was included in the list of the top 1% of Java applets on the Web (September 1996) by the Java Applet Rating Service.
One of the key persons in the CommonGIS project, accepted for funding by ESPRIT DGIII, 1998-2001.

Selected publications:

Automated tools for building bases of procedural knowledge / PhD thesis, Moscow State University, 1993.

AFORIZM Approach: Creating Situations to Facilitate Expertise Transfer. In A future for knowledge acquisition (Lecture Notes in Artificial Intelligence, v. 867), pp.244-261. Springer-Verlag, 1994.

Multimedia information retrieval and presentation: knowledge based approach. Published in Russian in Programmirovaniye 1996 N.2, pp.17-29, and in English in Programming and Computer Software 1996 22(1), pp.45-52.
 
Research issues in intelligent data visualization for exploration and communication. In Proceedings of ACM CHI’97 Conference, ACM Press, 1997
 
Knowledge-based support for visual exploration of spatial data. In Proceedings of ACM CHI’97 Conference, ACM Press, 1997
 
Knowledge-based cartographical visualization to support data exploration in IRIS system. Published in Russian in Programmirovaniye 1997 N.5, pp.49-68, and in English in Programming and Computer Software 1997 23(5), pp.268-282.

IRIS: a Tool to Support Data Analysis with Maps. Proceedings of Interop’97: International Conference on Interoperating Geographical Information Systems (Santa-Barbara, CA, December 3-4, 1997), NCGIA, pp.215-226 (to be published in M.Goodchild, M.Egenhofer, R.Fegeas, and C.Kottman (eds.) Interoperating Geographic Information Systems, Kluwer, 1998)

Intelligent Visualization and Dynamic Manipulation: Two Complementary Instruments to Support Data Exploration with GIS. In Proceedings of AVI'98: Advanced Visual Interfaces Int. Working Conference (L'Aquila, Italy, May 24-27, 1998), ACM Press, pp.66-75

Visual Data Exploration by Dynamic Manipulation of Maps. T.Poiker and N.Chrisman (eds.) 8th International Symposium on Spatial Data Handling, SDH'98, July 11-15, 1998, Vancouver, Canada, pp.533-542
 


Address

Natalia V. Andrienko
GMD - German National Research Center for Information Technology
SET.KI - Research Division Artificial Intelligence
Schloss Birlinghoven, Sankt-Augustin, D-53754 Germany
Telephone +49-2241-142329
Email: gennady.andrienko@gmd.de
http://allanon.gmd.de/and/

