People use GIS to answer questions about the world. Typically,
these answers involve different types of generalizations. These generalizations may relate either
to questions about the physical or human reality around us. For example:
GIS technicians are often called on to answer these sorts of questions. Answers to these types of
require an understanding of basic statistics. This module describes the fundamental concepts of this statistics,
in a light-hearted manner that attempts to express the key ideas of this critical intellectual building block
without being overly technical or boring.
You are a GIS technician for an environmental consulting firm, Worldwide Environmental Development Organization
(WeDo), and your firm has been assigned the task of evaluating the effectiveness of a wetlands restoration project.
Wetlands are vital ecological regimes. They serve as
breeding grounds for many fish and bird species, and help filter pollutants out of water. Wetlands also tend to be
highly sought after as potential locations for development, typically being on the coasts. A current debate in ecology centers around
whether or not valuable wetlands in highly urbanized areas can be developed, in return for 'wetland restoration'
projects in less urbanized areas. Your firm (WeDo) has been hired by the EPA to evaluate the effectiveness of one such restoration
project. Two one-hundred acre test sites (call them A and B), have been set up. Site A is in a wetland, and Site B
is in a nearby 'restored' wetland. Each site has been divided into 1 acre pieces, and a catalog of the existence of
key species of plant and animal species has been made. Your mission, if you should accept it, is to determine how
similar the two sites are.
Students should be familiar with the basic statistical measures of location,
as well as the idea of a sample, histogram and distribution. The meaning and method
of calculating the mean, median and mode are covered in this lesson.
Students will learn how to calculate the variance and standard deviation
of a sample. These idea is expressed in sigma notation, and linked to the
ideas of accuracy and precision.
Students will apply the ideas of the mean, median, mode, variance and standard
deviation to a set of water samples taking from a polluted river.
Recommended:
Complementary:
Courtier: "Ahem, yer kingship, please excuse this interuption, but
I have vital news from the provence of Leon."
Marie: "Off with his head."
Louis: (Who was saved from immanent defeat by the courtier's
interuption) "Now Dear, let him speak. Then we'll cut his head off."
Marie: "Off with your head!"
Louis: "Now dear, we've been over this before. I'm Louis
the XVI, the semidevine sun king of France, and we can't go around smiting
off my head. The craze might catch on, and we'd end up with some dingy form
of republic or even worse one of those ..."
Courtier: "Sire. The royal granaries at Leon have been
engulfed in flame. The people are starving - they have no more bread!"
Marie: "Let them eat cake!"
Courtier: "But they have no cake!"
Louis: (Picking up a new glass of Champagne) Hmmm, yes,
no cake. I wonder. Guards, behead the courtier. Then despatch yourself
with all haste to the nearest hamlet, and there enquire within nine houses
as to whether or not the peasants there ensconced possess of themselves
some certain number of cakes. By this means we shall ascertain the number
of cakes possessed by the peasantry.
The guards behead the courtier, go to the village, and
return with following counts:
Taken together, the number of cakes in the table above
represents a sample of the number of cakes possessed by peasants.
The sample provides us a basis for estimating the number
of peasant possessed pastries, based on incomplete information.
Consider the number line below:
Where on this line is the 'location' of peasant possessed
pastries? Like any good question, there are actually a few different
ways to answer this quandry. One tool that statisticians frequently
use to help them understand variables is the HISTOGRAM plot. The
histogram has a number line along the x axis and the number of values found
along the Y axis. Here's an example, based on the counts reported
by the king's guards.
The histogram above depicts what statisticians call
a 'distribution'. A distribution associates a given probability with
each possible value of a variable. There are three types of location statistics
associated with any distribution: the MODE, the MEAN, and the MEDIAN.
The MODE
If we decide that the location is best represented by the number of
cakes that appears most often in a distribution, then we could say, 'the
peasants have about 3 cakes'. The type of location measurement
is called the MODE. The mode of a distribution is similar to the
key of a piece of music. The key of a piece of music is generally
the note that appears most often, and the mode of a distribution is the
value that shows up most frequently. So, for example the mode of
the distribution pictured above is '3'. Here is how you can find
the mode of a given set of numbers:
The MEDIAN
The median of a set of numbers is the 'middlemost' value. We can
find the middle-most value by arranging the numbers so that they go from
low to high (ascending order), and then looking halfway down the list to
find the median value. For example, our sorted list of peasant pastry
products would look like:
The median may always be calculated in the following manner:
The MEAN
In statistics the term mean does not denote 'an ill-tempered person
who would rather staple you ears to the side of your head than buy you
an icecream cone'. Not at all. The statistical MEAN of a set
of numbers is just what we would typically refer to as a good old fashioned
average. It really is just that simple. The average of a set
of numbers is calculated by adding up all the numbers, and then dividing
them by the number of numbers. The mean of the distribution of cakes
would be calculated as:
or, for our set of numbers:
We can solve this equation in two steps: by first finding the
sum of the values and then dividing this sum by 9, the number of houses
visited. The sum of the values is 3 + 4 + 0 + 0 + 1 + 2 + 3 + 3 +
2 = 18. 18 divided by 9 is 2. Therefore the mean number of
cakes found was 2. Voila! Isn't that simple?
But wait. Let's muddy the water just a teensy weensy bit, by re-expressing
our formula more mathematically. We could rewrite this equation as:
Measures of Dispersion
Most of you will be familar with the difference between accuracy and
precision. Accuracy is how well some set of numbers center around
a certain target, and precision is how close a set of numbers are to each
other. We will use these simple concepts as a bridge to understanding
our next major topic: statistical measures of dispersion.
High Accuracy - Low Precision         Low Accuracy - High Precision
Have you ever had a friend who is perfectly consistent in their capacity
to really mess something up? I mean, they could be driving through
the desert, and they would run into a tree every time. These
individuals are precise in their behavior. We all also know people who
are dispersed - running hither or thither with no clear central tendency.
In statisticese, w say that a set of numbers (a distribution) is highly
dispersed, if the numbers vary greatly from their mean.
Pay attention to that word vary, we'll be revisiting it shortly.
Sets of numbers that are tightly clustered around the mean are not dispersed.
The table below has some examples:
As descriptions like this become more complicated, mathematical mumbo
jumbo becomes more complex. The verbal description from above would
look something like:
In english, we could express this equation as the variance equals the
sum of the squared deviations from the mean, divided by the number of points
minus one. So the variance is a sort of average squared deviation.
Astute readers may be wondering why the variance is divided by n-1, instead
of n - the number of points. This is what's known as a 'bias correction'.
A bias is a systematic under or over-prediction of a value. Basically,
since large values occur very seldomly, these values tend to be missed
when calculating the variance based on just a sample. So dividing
by N-1, as opposed to N, takes this into account and makes the estimate
of the variance unbiased.
Let's calculate the variance of our cake counts. Remember that
the mean number of the cakes was 2. We can begin by calculating the
squared deviations from the mean.
Now we can total the squared deviations:
         Total Squared Deviations = 4 + 4 + 1 + 0 + 0 + 1 + 1 + 1 + 4 = 16
Contrast this value with the sum of deviations:
         Total Deviation = -2 - 2 - 1 + 0 + 0 + 1 + 1 + 1 + 2 = 0
The total squared deviations is sometimes refered to as the 'variation'
of a data set. To get the average 'variance' we divide the variation
of a sample by n-1.
         VARIANCE = 16 / (n-1) = 16 / (9-1) = 16 / 8 = 2 cakes2
So the variance of our cake sample is 2 cakes squared. What!
Two cakes squared? Yes - that's how the units work out in the above
equations. Thinking in terms of cakes squared makes a lot of people
(including me) nervous. For this reason, statisticians very frequently
use a measure of dispersion called the STANDARD DEVIATION. The standard
deviation is just the square root of the variance. Or in mathematical
parlance:
Together, the variance and mean of a set of numbers tell us a great
deal: both the location and level of dispersion of a data set.
For example, examine this linked image: china.jpg.
This image shows a set of weather stations in Asia. The weather stations are stars, and the color of
each station represents the temperature of china. Can you see how
it gets colder as you go farther north? Below the map you'll find
a histogram, expressed in degrees celcius, so zero is freezing.
The mean of this data set is below freezing (-2.5 degrees C) - so we'd
want to bring our parkas - and the standard deviation is 10 degrees,
so it is a lot colder in the north than in the south.
Now, imagine that you are Dr. Doolittle, a renowned scientist, called
in to assess the dangers posed by a lead pipe factory, Romullus and Remus
Incorporated. R&R inc is a new factory, located in nothern California along a small
tributary of the Trinity river. An ecological group has has recently reported
increased levels of lead in the tributary, which would be a bad thing, since
lead is a poison. The Environmental Protection Agency sends you out as leader
of a crack team of researchers to study this matter. On site, you discover that
the ecologist group has been taking water samples for the last seven years.
Their data looks like this:
When you have your answers, click here to see if you're correct.
Surfstat
This is a great website full of java applets and great descriptions. Highly recommended.
Solveit
This page more example of mean, median and mode examples.
Sampling
This page has some good discussion and example of statistical samples.
SFU
This page from Simon Frazer University provides a good summary of the topics we've discussed.
UNIT 39: PERFORMING STATISTICAL ANALYSES
Written by Chris Funk, Geography Department, University of
California Santa Barbara
Context
Example Application
Learning Outcomes
The following list describes the expected skills which students should
master for each level of training, i.e. Awareness/Competency/Mastery.
Preparatory Units
Awareness
Think
about the word statistics for a moment. The root of the word is state,
and derives from the French word statistique. This is not surprising,
since the discipline of statistics arose along with the nation states of
the 19th century. This was an exciting time in Europe, if you happened
to be a king. In many countries, such as France and England, power was
coalescing into single unified goverments. Large state-run industries,
such as glass works and clothing mills, funded the armies which were making the
kings of European nations into ever-more dominant political forces. People began to be
concerned with the state of the state, and began building to develop the intelectual
tools to tackle this problem. Policy makers needed information
concerning the amount of resources in a country, things that we take for granted
knowing today, such as the number of people, or the amount of wheat. Clearly, too
much or too little of these resources could be a bad thing for a country in
the 18th century. But how can one really tell how much of something exists when
we only have incomplete knowledge? This is where statistics comes in. Statistics
is a set of mathematical tools we can use to reason effectively in situations
where we have incomplete knowledge. Most of the time we have incomplete
knowledge. Therefore, understanding a little statistics is vital to understanding
the world around us. That is the goal of this chapter.
Statistical Measures of Location
Statistics makes heavy use of spatial metaphors. We have
already hinted at this by mentioning the state of a state. Another idea
used in statistics is that of location. We need to understand, however,
that the location's we are discussing are along a number line - value
space, not real space. As an example, consider the following hypothetical
situation:
House Cakes
1 3
2 4
3 0
4 0
5 1
6 2
7 3
8 3
9 2
House Cakes
3 0
4 0
5 1
6 2
9 2 - This is the median value
1 3
7 3
8 3
2 4
Counting half way down this list, to the fifth number, gives us a house
with 2 cakes, therefore we would say that the median number of cakes is 2.
Right. You got it. If the number of points is even, then
there won't be a middlemost most value. In this case, it is traditional
to add the two middlemost values together, and divide them by two.
This new number is then used as the median.
mean = Sum of the Values
-----------------
The Number of Values
mean = 3 + 4 + 0 + 0 + 1 + 2 + 3 + 3 + 2 = 2
---------------------------------
9
Avg = Sum( X )
--------
N
Where Sum(X) is all the X's added up, and N is the number of X's
Right. The average of a set of numbers is the total of all the numbers added together,
and divided by number of numbers.
Competency
Dispersed
Not Dispersed
Temperature in Siberia
(Hot Summer -Cold Winter)Temperature in Borneo
(Along the Equator - Always the Same)
Big Gambler's Finances
(Up and Down)Student's Finances
(Always Low)
Price of Gold
Price of Milk
Popularity of DISCO music
Popularity of Classical Music
And so on. Clearly, things which are dispersed vary from greatly
one instance to the next. Things with low variance are always the
same. Whoa ... we just snuck a new technical term up on you:
VARIANCE. Variance can be defined as how much a distribution
varies about the mean. Any guesses as to how we could calculate it?
We could just try to add up all the amounts that each number differs from
the mean. The problem with this approach is that sometimes these
numbers will be postive, and sometimes they'll be negative - and these
will always cancel, giving sums equal to zero even for highly dispersed
distributions. Statisticians solved this problem by squaring all
the differences between each value and the mean. This variance calculation
gives you a number in the original units squared. The variance is
always positive or zero. A variance of zero means that all the numbers
in a sample have exactly the same value - which is also the mean.
The exact method for calculating VARIANCE is:
Var = Sum( SQR( X - Avg ) )
---------------------
N-1
Where the Sum() operator adds everything up,
the SQR() operator squares the differences between each X and
the average of the X's, and N is the number of X's
House Cakes Deviations Deviations-squared
----- ----- ---------- ------------------
3 0 -2 4
4 0 -2 4
5 1 -1 1
6 2 0 0
9 2 0 0
1 3 1 1
7 3 1 1
8 3 1 1
2 4 2 4
STANDARD DEVIATION = sqrt(VARIATION)
The standard deviation, like the variance, will be zero for samples that
all have the same value. The standard deviation will be in the original
units of the variable of interest. This is convenient, since we can
use it to reason about the expected values that the variable might have.
Most values will fall in the range defined by the mean plus or minus the
standard deviation. Thus, for our pastry product example, Louis concluded
that the peasants, on average, have two cakes, but that some peasants might
have none at all.
Mastery
Year Lead (in parts per billion)
---- ---------------------------
1995 222
1996 120
1997 222
1998 176
1999 473
2000 538
2001 456
Calculate the mean, median, and mode for these data points. Which measure is most appropriate
for assessing public safety requirements? Calculate the variance and standard deviance of the
amount lead of lead. The R&R INC plant plant was built in 1999. Does it seem reasonable to
attribute the increase in lead amounts to random chance?
Follow-up Units
Resources
Back To Core Curriculum for Technical Programs Welcome Page
Currently maintained by Steve Palladino
Created: May 14, 1997. Last updated: October 5, 1998.
Content comments to
Chris Funk
Formatting comments to Steve
Palladino