1.Data selection and preprocessing, such as checking for errors, removing
outliers, handling missing values, and transformation of formats.
2.Data transformations, for example, discretization of variables or
production of derived variables.
3.Selection of a data mining method and adjustment of its parameters.
4.Data mining, i.e. application of the selected method.
5.Interpretation and evaluation of the results.
In this process, the data mining phase takes no more than 20% of the total workload. However, this phase is much better supported, both methodologically and by software, than all the others [6]. This is not surprising, because performing the other steps is a matter of art rather than a routine that lends itself to automation [7]. Lately, some efforts in the KDD field have been directed towards intelligent support of the data mining process, in particular, assistance in selecting an analysis method depending on data characteristics [2,3].
A particular case of KDD is knowledge extraction from spatially referenced data, i.e. data referring to geographic objects, locations, or parts of a territory division. In the analysis of such data it is very important to account for the spatial component (relative positions, adjacency, distances, directions, etc.). However, information about spatial relationships is very difficult to represent in the discrete, symbolic form required by data mining methods. There are works on spatial clustering [4] and on the use of spatial predicates [8], but they are characterized by complex data descriptions and high computational cost.
The most evident use of cartographic visualization is in the evaluation and interpretation of data mining results. However, maps can also be helpful in other activities. For example, visual analysis of the spatial distributions of different data components can help in selecting representative variables for data mining and, possibly, suggest which derived variables would be useful to produce. At the stage of data preprocessing, a map presentation can expose “strange” values that may be errors in the data or outliers. Discretization, i.e. transformation of a continuous numeric variable into one with a limited number of values by means of classification, can be aptly supported by a dynamic map display showing the spatial distribution of the classes. With such support the analyst can adjust the number of classes and the class boundaries so that interpretable spatial patterns arise.
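The discretization step described above can be illustrated with a minimal sketch. It is not the Descartes implementation; the function name `classify`, the district identifiers, and the rate values are all hypothetical, and the map repainting is only indicated in a comment.

```python
from bisect import bisect_right


def classify(values, breaks):
    """Return, for each value, the index of the class interval it falls into.

    `breaks` is an ascending list of class boundaries. A value equal to a
    boundary is assigned to the upper class.
    """
    return [bisect_right(breaks, v) for v in values]


# Hypothetical data: unemployment rate (%) per district.
rates = {"A": 3.1, "B": 7.4, "C": 12.0, "D": 5.0}

# Three classes: below 5, from 5 to below 10, and 10 or more.
breaks = [5.0, 10.0]
classes = dict(zip(rates, classify(rates.values(), breaks)))

# The analyst moves the lower boundary; in an interactive tool the map
# would be repainted from the recomputed classes.
breaks = [6.0, 10.0]
reclassed = dict(zip(rates, classify(rates.values(), breaks)))
```

Recomputing the classes is cheap, which is what makes it plausible to redraw the map on every boundary change.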
More specifically, we propose to build an integrated KDD environment on the basis of two existing systems: Kepler [9] for data mining and Descartes [1] for interactive visual analysis of spatially referenced data. Kepler includes a number of data mining methods and, very importantly, provides a universal plug-in interface for adding new methods. In addition, the system contains some tools for data and format transformation and can present some kinds of data mining results graphically (trees, rules, and groups). Descartes automates the generation of maps presenting user-selected data and supports various interactive manipulations of map displays that can help to visually reveal important features of the spatial distribution of the data. Descartes also supports some data transformations that are productive for visual analysis. It is essential that both systems are designed to serve the same goal: to help the user gain knowledge about data. They offer different instruments that can complement each other and together produce a synergistic effect.
In its present state, Kepler contains the following data mining methods:
1.Methods fw and kNN estimate importance of different
variables in relation to values of a selected variable.
2.Methods C4.5 and C5.0 derive classification trees.
3.Methods C4.5, FOIL, and BNGE generate classification
or prediction rules.
4.Methods SIDOS and MIDOS find statistically interesting
subgroups of objects with regard to distribution of values of a variable.
5.Method AutoClass performs clustering.
Most of the methods (groups 1-4) require selection of a target variable, which typically should be discrete, and are intended for revealing relationships between the target variable and the other variables selected for the analysis. Descartes can be effectively used for producing “promising” discrete variables that include, implicitly or explicitly, a spatial component. The following ways of doing this are available:
1.Classification by segmentation of a value range of a numeric variable
into subintervals.
2.Cross-classification of a pair of numeric attributes. In both of these
cases (1 and 2) the process of classification is highly interactive and is
supported by a map presentation of the spatial distribution of the classes
that reflects all changes in the class definitions in real time.
3.Spatial aggregation of objects performed by the user through the
map interface. Results of such an aggregation can be represented by a discrete
variable. For example, the user can divide city districts into “center”
and “periphery” or encircle several regions, and the system will generate
a variable indicating to which aggregate each object belongs.
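As a rough illustration of ways 2 and 3, the sketch below derives a discrete variable from a pair of numeric attributes and from user-defined groups of objects. The function names, the class boundaries, and the district identifiers are assumptions made for the example and are not taken from Descartes.

```python
from bisect import bisect_right


def cross_classify(x, y, x_breaks, y_breaks):
    """Combine the class indices of two numeric attributes into one label."""
    return (bisect_right(x_breaks, x), bisect_right(y_breaks, y))


def aggregate_labels(objects, groups, default="other"):
    """Turn user-defined groups (name -> set of object ids) into a variable.

    Objects that belong to no group receive the `default` label.
    """
    label = {obj: default for obj in objects}
    for name, members in groups.items():
        for obj in members:
            label[obj] = name
    return label


# Hypothetical districts, grouped by the user through the map interface.
districts = ["d1", "d2", "d3", "d4"]
groups = {"center": {"d1", "d2"}, "periphery": {"d3"}}
zone = aggregate_labels(districts, groups)

# One object with two attribute values, e.g. a rate and a density.
cell = cross_classify(7.4, 120.0, x_breaks=[5.0, 10.0], y_breaks=[100.0])
```

Either result can then serve as the discrete target variable required by the methods of groups 1-4.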
Results of most of the data mining methods are naturally presentable on maps. The most evident case is the presentation of subgroups or clusters: membership of a geographical object in a subgroup or a cluster can be shown by colouring or by an icon. The same technique can be applied to tree nodes and rules: visual features of an object indicate whether it is included in the class corresponding to a selected tree node, or whether a given rule applies to the object and, if so, whether the rule classifies it correctly.
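The three states an object can have with respect to a rule (not covered, covered and correctly classified, covered and misclassified) can be sketched as follows. The rule, the attribute names, and the colour assignment are hypothetical and serve only to illustrate the idea.

```python
def rule_status(obj, condition, predicted, target):
    """Return 'not covered', 'correct' or 'incorrect' for one object.

    `condition` is the rule body (a predicate over the object's attributes),
    `predicted` is the value the rule assigns to the `target` attribute.
    """
    if not condition(obj):
        return "not covered"
    return "correct" if obj[target] == predicted else "incorrect"


# Hypothetical rule: IF income_class = low THEN unemployment = high.
def low_income(obj):
    return obj["income_class"] == "low"


objects = {
    "d1": {"income_class": "low", "unemployment": "high"},
    "d2": {"income_class": "low", "unemployment": "low"},
    "d3": {"income_class": "high", "unemployment": "low"},
}

# Assumed colour scheme for painting the objects on the map.
colors = {"correct": "green", "incorrect": "red", "not covered": "grey"}
paint = {
    name: colors[rule_status(obj, low_income, "high", "unemployment")]
    for name, obj in objects.items()
}
```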
Since Kepler contains its own facilities for (non-geographical) presentation of data mining results, it would be productive to establish a dynamic link between the Kepler and Descartes displays. This means that, when the cursor is positioned on an icon symbolizing a subgroup, a tree node, or a rule in a Kepler display, the corresponding objects are highlighted in a Descartes map. Conversely, selecting a geographical object or a group of objects in a map highlights the subgroups or tree nodes to which the selected objects belong and the rules applicable to them.
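One possible way to support such two-way highlighting is a pair of lookup tables built from the membership of objects in result items. The class below is only a sketch under that assumption; `LinkIndex` and its method names are hypothetical and do not describe an interface of Kepler or Descartes.

```python
from collections import defaultdict


class LinkIndex:
    """Bidirectional index between result items and geographical objects."""

    def __init__(self, membership):
        # membership: result item id (subgroup, tree node, rule) -> object ids
        self.objects_of = {item: set(objs) for item, objs in membership.items()}
        # Inverted index: object id -> result items that contain it.
        self.items_of = defaultdict(set)
        for item, objs in self.objects_of.items():
            for obj in objs:
                self.items_of[obj].add(item)

    def highlight_on_map(self, item):
        """Objects to highlight when the cursor points at a result item."""
        return self.objects_of.get(item, set())

    def highlight_in_results(self, selected_objects):
        """Result items to highlight when objects are selected in the map."""
        items = set()
        for obj in selected_objects:
            items |= self.items_of.get(obj, set())
        return items


index = LinkIndex({"subgroup1": {"d1", "d2"}, "node7": {"d2", "d3"}})
on_map = index.highlight_on_map("subgroup1")
in_results = index.highlight_in_results(["d2"])
```

Building the inverted index once keeps both lookups fast enough to follow cursor movement.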
Besides their main capabilities (data mining in Kepler and data
visualization plus analysis-supporting display manipulation in Descartes),
the systems contain additional useful functions and components to be included
in the integrated environment. Thus, Kepler contains a tool DataZoom
[10] supporting analysis of tables with data by a highly interactive dynamic
interface for sorting, focusing, and querying. Kepler can also perform
a number of necessary routine operations over datasets: transformations
of formats, access to databases, querying etc. Descartes has a convenient
graphical interface for outlier
removal and an easy-to-use tool for generation of derived variables
by means of arithmetic operations over existing variables.
The considerations above can be summarized in the form of three kinds of links between data mining and cartographic visualization:
1.From geography to mathematics: using dynamic maps, the user arrives at some geographically interpretable results or hypotheses and then tries to explain the results or check the hypotheses by means of data mining methods.
2.From mathematics to geography: data mining methods produce results that are then presented on maps and analyzed visually.
3.Linked displays: graphics representing results of data mining in the usual (non-cartographic) form are viewed in parallel with maps, and dynamic highlighting visually connects corresponding elements in both types of displays.
For coupling the two systems, it is necessary to organize their shared use of the same data and to create a mechanism to distribute and transfer control between the systems. For this purpose a communication protocol should be designed and implemented.
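The text leaves the protocol unspecified. Purely as an illustration of what a message exchanged between the two systems might carry, the sketch below encodes a request as JSON. The field names (`sender`, `action`, `table`, `payload`), the action name, and the choice of JSON are all assumptions, not part of either system.

```python
import json

# Assumed set of fields every message must carry.
REQUIRED_FIELDS = {"sender", "action", "table", "payload"}


def make_message(sender, action, table, payload):
    """Serialize one request or notification as a JSON string."""
    return json.dumps(
        {"sender": sender, "action": action, "table": table, "payload": payload},
        sort_keys=True,
    )


def parse_message(text):
    """Decode a message and check that all required fields are present."""
    message = json.loads(text)
    missing = REQUIRED_FIELDS - message.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return message


# Hypothetical example: the map display asks the data mining side to
# highlight the result items that contain two selected districts.
wire = make_message("Descartes", "highlight", "districts", {"ids": ["d1", "d2"]})
received = parse_message(wire)
```

Naming the shared table in every message reflects the requirement that both systems work on the same data.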
Brodley, C. (1993) Addressing the Selective Superiority Problem: Automatic Algorithm / Model Class Selection. In Machine Learning: Proceedings of the 10th International Conference, University of Massachusetts, Amherst, June 27-29, 1993. San Mateo, Calif. : Morgan Kaufmann, pp.17-24.
Gama, J. and Brazdil, P. (1995) Characterization of Classification Algorithms. In Progress in Artificial Intelligence, LNAI 990, Berlin: Springer-Verlag, pp.189-200.
Gebhardt, F. (1997) Finding Spatial Clusters. In Principles of Data Mining and Knowledge Discovery PKDD’97, LNCS 1263, Berlin: Springer-Verlag, pp.277-287.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996) The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39 (11), 27-34.
John, G.H. (1997) Enhancements to the Data Mining Process. PhD dissertation, Stanford University. Available at the URL http://robotics.stanford.edu/~gjohn/
Kodratoff, Y. (1997) From the art of KDD to the science of KDD. Research report 1096, Universite de Paris-sud.
Koperski, K., Han, J., and Stefanovic, N. (1998) An Efficient Two-Step Method for Classification of Spatial Data. In Proceedings SDH’98, Vancouver, Canada: International Geographical Union, pp.45-54.
Wrobel, S., Wettschereck, D., Sommer, E., and Emde, W. (1996) Extensibility in Data Mining Systems. In Proceedings of KDD’96 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press, pp.214-219.
Spenke, M., Beilken, Chr., and Berlage, Th. (1996). The Interactive Table for Product Comparison and Selection. In Proceedings of the UIST 96 9th Annual Symposium on User Interface Software and Technology, Seattle, November 6 - 8, 1996. ACM Press, pp.41-50.
AFORIZM Approach:Creating Situations to Facilitate Expertise Transfer. In A future for knowledge acquisition (Lecture Notes in Artificial Intelligence, v. 867), pp.244-261. Springer-Verlag, 1994.
Multimedia information retrieval and presentation: knowledge based approach.
Paper was published in Russian in Programmirovaniye 1996 N.2 pp.17-29 and
in English in Programming and Computer Software 1996 22(1), pp.45-52
Research issues in intelligent data visualization for exploration and
communication. In Proceedings of ACM CHI’97 Conference, ACM Press, 1997
Knowledge-based support for visual exploration of spatial data. In
Proceedings of ACM CHI’97 Conference, ACM Press, 1997
Knowledge-based cartographical visualization to support data exploration
in IRIS system. Paper was published in Russian in Programmirovaniye 1997
N.5 pp.49-68 and in English in Programming and Computer Software 1997 23(5),
pp.268-282
IRIS: a Tool to Support Data Analysis with Maps. Proceedings of Interop’97: International Conference on Interoperating Geographical Information Systems (Santa-Barbara, CA, December 3-4, 1997), NCGIA, pp.215-226 (to be published in M.Goodchild, M.Egenhofer, R.Fegeas, and C.Kottman (eds.) Interoperating Geographic Information Systems, Kluwer, 1998)
Intelligent Visualization and Dynamic Manipulation: Two Complementary Instruments to Support Data Exploration with GIS. In Proceedings of AVI'98: Advanced Visual Interfaces Int. Working Conference (L'Aquila – Italy, May 24-27, 1998), ACM Press, pp.66-75
Visual Data Exploration by Dynamic Manipulation of Maps. T.Poiker and
N.Chrisman (eds.) 8th International Symposium on Spatial Data Handling,
SDH'98, July 11-15, 1998, Vancouver, Canada, pp.533-542