Data Classification

If you are developing a choropleth map of ordered data, one of the first decisions to be made deals with classification: which values should be associated with each color. In other words, which units should be in the lowest class, which units should be in the highest class, and how the rest of the units should be distributed among the remaining classes.

Among the many choices made by analysts and designers, data classification decisions might be among the most important, but also the most difficult to understand. A GIS specialist must choose not only how many classes the data should be divided into, but also what the value range of each class should be. A slight adjustment of the "breaks" between the value ranges of ordered data, for example, might alter the map significantly and reveal trends that were not detected previously (or are not in fact there).

In this section, two of the most common "default" methods of classifying data are presented. These are two of many choices available in recent versions of ArcView, and a designer should be aware of the differences among all of the methods. Each has advantages and disadvantages.

Quantiles. This method classifies data into a certain number of categories with an equal number of units in each category.

Equal Intervals. This method sets the value ranges in each category equal in size. The entire range of data values (max - min) is divided equally into however many categories have been chosen.
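As a rough illustration of the two definitions, the Python sketch below assigns each unit to a class under both methods. The function names quantile_classes and equal_interval_classes and the sample values are made up for this example; they are not ArcView functions, and a GIS may handle ties and class boundaries differently.

```python
def quantile_classes(values, k):
    """Return a class index (0..k-1) per value, with near-equal counts per class."""
    n = len(values)
    # Rank the units from lowest to highest value (ties keep their input order).
    order = sorted(range(n), key=lambda i: values[i])
    classes = [0] * n
    for rank, i in enumerate(order):
        classes[i] = rank * k // n
    return classes


def equal_interval_classes(values, k):
    """Return a class index (0..k-1) per value, using k ranges of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # (max - min) divided by the number of classes
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in class k, so clamp it into the top class.
    return [min(int((v - lo) / width), k - 1) for v in values]


sample = [2, 4, 6, 8, 10, 12, 14, 100]    # made-up values with one outlier
print(quantile_classes(sample, 4))        # [0, 0, 1, 1, 2, 2, 3, 3]
print(equal_interval_classes(sample, 4))  # [0, 0, 0, 0, 0, 0, 0, 3]
```

With the single outlier in the sample, the quantile method still fills every class with two units, while the equal interval method places seven of the eight units in the lowest class.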

These methods are illustrated using the following data:


The data above are classified in Quantile and Equal Interval schemes in the table below. The first letter of each county is indicated in the "Counties" column, and the corresponding data values (the heights of the bars in the bar graph) are summarized in the "Ranges" column as the value range of each of the four classes.

 

            Quantiles              Equal Intervals
            Counties    Ranges     Counties     Ranges
Class 1     ABCD        1-3        ABCDEFGHI    1-13
Class 2     EFGH        3-10       JKLMN        14-26
Class 3     IJKL        10-25      -            27-39
Class 4     MNOP        25-52      OP           40-52

Notice right away the possible pitfalls of these methods. With a four-category quantile classification, there is an equal number of counties in each class, but note that Durst and Evans Counties, though they have identical attribute values, are placed in different classes. In addition, Manto and Niles Counties are placed with the extreme Orton and Percy Counties, rather than with Lewis County, to which they are much more similar. In this case, the quantile method produces a misleading visualization.

Switching to a four-category equal interval method, the most obvious problem is that only three of the four classes actually contain data points. The range of each class (13%) is the same, but because this data is skewed (it has a few data points that are very different from the rest), no county's attribute value actually falls into the third class. This weakens the visualization by effectively eliminating one fill color.
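Both pitfalls can be reproduced with a short Python sketch. The exact values behind the bar graph are not listed in the text, so the sixteen numbers below are hypothetical, chosen only to be consistent with the counties and ranges in the table. The equal interval classes use the integer ranges printed in the table (1-13, 14-26, 27-39, 40-52).

```python
from collections import Counter

counties = "ABCDEFGHIJKLMNOP"
# HYPOTHETICAL values (percent), sorted from county A to county P.
# They are assumptions chosen to match the table, not the original data.
values = [1, 2, 2, 3, 3, 5, 7, 9, 10, 15, 18, 22, 25, 26, 45, 52]
k = 4

# Quantiles: rank the counties and put the same number in each class.
quantile = {c: rank * k // len(values) + 1 for rank, c in enumerate(counties)}

# Equal intervals: the table's ranges are each 13 units wide, starting at 1.
equal = {c: (v - 1) // 13 + 1 for c, v in zip(counties, values)}

# Durst (D) and Evans (E) share a value but fall in different quantile classes.
print(values[3] == values[4], quantile["D"], quantile["E"])  # True 1 2

# Count the counties in each equal interval class; class 3 is empty.
counts = Counter(equal.values())
print([counts[c] for c in range(1, k + 1)])  # [9, 5, 0, 2]
```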

Look how different these maps -- of identical data -- appear, depending on the classification scheme used.

More effective methods for visualizing this data exist.

Why would anyone use them, if they are so limited?

The two classification schemes above are the most easily computed, and one or the other is usually the default classification in most GIS. These methods are adequate to display data that varies linearly, that is, data with no outliers that tend to skew the mean of the data far from the median. A chart of such data, drawn like the one above, would contain columns that increase in height by equal steps from left to right; the "steps" between the columns would all be the same.
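A quick way to see whether outliers pull the mean away from the median is to compute both. The sketch below uses Python's standard statistics module; the two data sets are hypothetical, made up for this illustration, and how large a gap counts as "far" is a judgment left to the analyst.

```python
from statistics import mean, median

# HYPOTHETICAL data sets, made up for this illustration.
linear = [5, 10, 15, 20, 25, 30, 35, 40]   # equal steps between values
skewed = [1, 2, 2, 3, 3, 5, 7, 9, 45, 52]  # two extreme values at the top

for name, data in (("linear", linear), ("skewed", skewed)):
    # For evenly stepped data the mean and median coincide;
    # for skewed data the outliers drag the mean upward.
    print(name, "mean:", mean(data), "median:", median(data))
```

For the evenly stepped values the mean and median are equal, while for the skewed values the mean is more than three times the median.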

All ordered data should be examined to determine whether or not altering the classification method has a significant impact on the display. If so, a method other than the two discussed in this section should be used for the most accurate and informative visual portrayal of the information.
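One rough way to carry out this check is to classify the same values with both methods and count how many units change class. The sketch below is only an assumption about how such a check might look: the function names and sample values are hypothetical, and the share that counts as "significant" is left to the analyst.

```python
def quantile_classes(values, k):
    """Class index (0..k-1) per value, with near-equal counts per class."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    classes = [0] * n
    for rank, i in enumerate(order):
        classes[i] = rank * k // n
    return classes


def equal_interval_classes(values, k):
    """Class index (0..k-1) per value, using k ranges of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    return [min(int((v - lo) / width), k - 1) for v in values]


def share_reclassified(values, k):
    """Fraction of units whose class differs between the two methods."""
    q = quantile_classes(values, k)
    e = equal_interval_classes(values, k)
    return sum(a != b for a, b in zip(q, e)) / len(values)


# HYPOTHETICAL values: evenly stepped data versus skewed data.
linear = list(range(5, 85, 5))
skewed = [1, 2, 2, 3, 3, 5, 7, 9, 10, 15, 18, 22, 25, 26, 45, 52]

print(share_reclassified(linear, 4))  # 0.0
print(share_reclassified(skewed, 4))  # 0.625
```

A share near zero suggests that either default method portrays the data in much the same way; a large share, as with the skewed values here, suggests that the choice of method is shaping the map.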