Unit 46: Address Matching
Written by Susan Jampoler, Geoknowledge
Context:
Address matching allows the user to convert postal addresses and/or
zip codes to geographic coordinates, create a new data layer containing
these points, and display the information on a map. Three components are
necessary to complete the address matching process: a geographic base file
(GBF), a table containing address information, and a computer software
package that performs the conversion. Address geocoding functionality is
available in most geographic information system (GIS) software packages.
The resultant new point data layer can subsequently be used to analyze
spatial patterns.
The following examples are typical problems where address geocoding
can be applied. Often, just visualizing the information on a map is enough
to answer the questions. However, the geocoding process is frequently a
preliminary step used in preparing the information for further spatial
analysis.
Example Applications
-
Medical
You work in the MIS division of a health maintenance organization (HMO)
which has recently received several complaints from participants. Waiting
time to get a doctor’s appointment is excessive and patients must travel
too far when they need to see a specialist. Several participating companies
are considering switching to another health care service. In order to retain
these organizations, your senior management has asked you to evaluate where
to increase physician coverage and how to improve service.
You maintain several databases, including information on participating
companies, individuals, physicians, and local hospital and diagnostic facilities.
It is hard to visualize where patients live, or where doctors and facilities
are located by sorting and studying these databases. Fortunately, all the
databases include a field containing address information.
The first task is to convert those addresses to points on a map using
address geocoding. You will need to obtain a geographic base file, probably
from a commercial vendor. Using the GBF and your databases, you now create
new point data layers which show the distribution of patients, physicians,
hospitals and diagnostic centers. You will then need to know where other
doctors not currently in your HMO are located. You can purchase this information
from several vendors using the standard industrial classification (SIC)
for the type of doctors you need, and add the layer to your analysis. Finally,
you will use other GIS capabilities to determine where to recruit additional
physicians.
-
Local Government
Each day, citizens and builders come into your department to obtain building
permits. Your supervisor wants a monthly report describing the number,
type and distribution of permits throughout the county. Until now, you
have provided some charts that show the volume of permits according to
type (room additions, driveway repair, swimming pool, deck, etc.) and by
requester. You have located the permits using a push pin on a paper map.
Your county has experienced a rapid increase in the number of permits requested
and it is taking you a full day each month to create your map. There must
be an easier way! Address geocoding will allow you to locate each permit
by using the work site address in your permits database and the street
centerline file (the GBF) that your county’s mapping division has just
completed. You can now make several maps. For example, you can provide
a map showing all permits for the month, reclassify the information by
permit type or requester, and show change from one month to the next using
historical information.
-
Distribution
You work for a specialty attire company. Your product is very popular in
large metropolitan areas throughout the United States, particularly in
the west and south. You currently have two distribution centers, one in
the east and one in the west. The stores that carry your attire in the
west are complaining that your product is not arriving on time. Your western
distribution center is overwhelmed and is not capable of completing all
the delivery orders. Clearly, you need an additional distribution center.
Your task is to choose the correct location. Location analysis is a complicated
process and involves many GIS operations. Geographic information you must
consider includes, but is not limited to: local zoning, supplier location,
means of delivery (both how you receive and how you send out your goods),
and customer location. To map your customer locations you will use address
geocoding using your customer database and an available geographic base
file. This base file can be a zip code, zip+4, TIGER or TIGER derivative
file that you purchase. The points representing your customers will then
be considered in the geographic analysis when locating your new distribution
center.
-
Marketing
You work for a national computer chain in the direct mail department. You
send out a mass mailing once a month advertising sales. Rather than sending
out just one generic mailing, you want to develop advertisements that will
appeal to the postal patron. For several months your company has been asking
customers for their zip code when they make a purchase as part of a "market
survey to determine where to place new stores". The company now has a large
database that includes everything that was purchased. You can map this
information using address geocoding and then reclassify and sort the data
to show spatial patterns. For this problem, you need your database and
a zip code geographic base file. The resulting map will show you
which products are selling in a specific area. You can then advertise
complementary products and increase your sales.
Learning Outcomes
The following list describes the skills that students are expected
to master at each level of training (Awareness, Competency, Mastery).
Awareness:
The learning goals are to identify sources and to
develop a working knowledge of the three components necessary to complete
the geocoding process: the geographic base file, the address file and the
software. (Suggested time: one 50-minute unit)
Competency:
The learning goals are to define and evaluate appropriate
base files, understand the importance of standardized address files, bring
the necessary files into a software package, perform the geocoding process,
and visualize the results. (Suggested time: one 50-minute unit and one
50-minute lab)
Mastery:
The learning goals are to effectively evaluate the
accuracy of both base files and address files, standardize address files,
evaluate non-matches, understand the rematch process, and perform a basic
reclassification analysis using attribute information provided in the address
file. (Suggested time: one 50-minute unit)
Preparatory Units:
Recommended Units:
Unit
1 Data acquisition
Unit
2 Demographic data
Unit
19 Planning a tabular database
Unit
21 Using spreadsheets
Unit
30 Validating databases
Unit
31 Managing database files
Highly recommended background for instructor
Unit
016 NCGIA Core Curriculum in GIScience: Discrete Georeferencing
Complementary Units:
Unit
7 Metadata
Unit
47 Visualization
Awareness
Learning Objectives:
-
Identify sources of geographic
base (reference) files.
-
Identify sources of address files.
-
Determine address geocoding applications.
-
Evaluate desktop mapping software.
-
Understand the components of address
geocoding.
Topics:
1. Identify sources of geographic
base (reference) files
TIGER: U.S. Bureau of the Census
-
Available on CD, from libraries, on-line (Census
Tiger Files)
-
Must be converted to appropriate software format
-
Line files that are organized by county and contain:
Roads, railroads, rivers
Census statistical boundaries
Political boundaries
Metropolitan address ranges and zip codes for streets
-
Local government: County mapping organizations
-
Normally only available in the format supported by the
county
-
Requires conversion to appropriate software format
-
Data vendors
-
Can be purchased from a variety of vendors
-
Enhanced TIGER files
-
May be more accurate and up-to-date (location and attribute)
-
Converted to specific software format
Example of enhanced TIGER
format (Note: the address is separated into several fields)
Graphic 1: Example of a GBF
road: inset
2. Identify sources of Address
files
-
Any organizational data base that contains a field with
address information
i.e., customer records, permit sites, crime locations,
school children, store locations, disease outbreaks
Example of an address file
(Note: The address is in one field)
-
Can be purchased address files, usually collected through
yellow pages entries
i.e., fast food restaurants, hospitals, child care
centers, competitor’s locations
-
Available on-line, on CDROM (Unit
016 NCGIA Core Curriculum in GIScience, section 5.1.1)
3. Determine address geocoding
applications
-
Identifying location
i.e., customers, competitors, permits, crimes, fires,
available real estate
-
Siting facilities
i.e., existing distribution centers, health care
providers, service facilities
-
Determining patterns:
i.e., fires near schools, potential customers not
served by a store
-
Delivery:
i.e., mail, mass mailing, goods, services
-
Market analysis
i.e., location of competitors
-
Anytime the location cannot be directly georeferenced
4. Evaluating desktop software
-
Most desktop packages have address matching capabilities
-
Some packages come with geographic base files
-
Software must incorporate the capability to:
-
Be tolerant of errors in address files
-
Allow for consideration and review of "almost" matches
Matching rules
Data tables
Cut-off thresholds
-
Operate in both sequential batch and single event modes
5. Understand the components of
address geocoding
-
Reference Files (Geographic Base Files (GBF))
i.e., street network, zip codes
-
Table of addresses and other attribute information
i.e., crime data, customers, store locations
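The two data inputs can be pictured as simple record types. The following is only a sketch: the class and field names are hypothetical and do not come from any particular GIS package, and the sample values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class StreetSegment:
    """One street segment in a reference file (GBF). Field names are illustrative."""
    street_name: str
    street_type: str
    left_from: int
    left_to: int
    right_from: int
    right_to: int

@dataclass
class AddressRecord:
    """One row of the address table: an address plus other attributes."""
    address: str
    attributes: dict = field(default_factory=dict)

# Made-up sample values, for illustration only.
segment = StreetSegment("MAIN", "ST", 101, 199, 100, 198)
customer = AddressRecord("125 Main St", {"category": "customer"})
print(segment.street_name, customer.address)
```

The software is the third component: it reads both kinds of record and decides which segment, if any, each address belongs to.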
Competency
Learning Objectives:
-
Evaluate appropriate reference
files.
-
Evaluate address files for completeness
and standardization.
-
Perform address matching operations.
-
Perform visual analysis of resulting
point data layers.
-
Practical Exercise: geocoding.
Topics:
1. Evaluate appropriate reference
files
-
Detail and accuracy of the address file
Contains full address or just zip code information?
Contains direction information (i.e., N. Main St. or
Main St.)?
-
Range of detail in reference files
Single field
i.e., zip code, address all in one field, zip+4
Single house with range
i.e., house number, range along a street, no information
on what is on the left or right side of the street
U.S. streets with zones
i.e., house, range along a street, information on
what is on the left or right side of the street
Example of zip code base file
Example of US Streets base
file
-
Determine geographic extent of the application
Local, regional or national
More detailed reference files cost more to acquire
Ask: Does the application support increased resolution
of the reference file?
(i.e. rural routes should not use street style addresses)
-
Successful implementation requires
Careful data preparation
Selecting the appropriate geocoding preferences in the
geographic base file used to match to the address file
2. Evaluate address file for completeness
and standardization
-
Addresses provide information about the location of
an event or an incident
-
Usually collected without regard to standard format:
no standard method for identifying features
i.e., Ave., Avenue, Av all stand for the same feature
type
Direction sometimes a suffix, sometimes a prefix
Often contain errors and omissions
i.e., Spelling errors, duplicate records, data
base not up to date
i.e., Phonetic errors, transpositions, random letter
insertion, character deletion or replacement
-
Files can be commercially standardized using U.S.
Postal Service format
-
The more complete and standardized the address file,
the more successful the address matching process
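As an illustration of what standardization involves, here is a minimal sketch in Python. The lookup tables and the function name are hypothetical and far shorter than a real standardizer (such as one following U.S. Postal Service conventions) would need; the sketch only handles a prefix direction after the house number and a street type or suffix direction at the end.

```python
import re

# Hypothetical, deliberately short lookup tables.
STREET_TYPES = {"AVE": "AVE", "AVENUE": "AVE", "AV": "AVE",
                "ST": "ST", "STREET": "ST",
                "RD": "RD", "ROAD": "RD",
                "DR": "DR", "DRIVE": "DR"}
DIRECTIONS = {"N": "N", "NORTH": "N", "S": "S", "SOUTH": "S",
              "E": "E", "EAST": "E", "W": "W", "WEST": "W"}

def standardize(address):
    """Uppercase, drop periods and commas, and normalize direction and type words."""
    out = re.sub(r"[.,]", " ", address.upper()).split()
    if len(out) >= 3:
        if out[-1] in DIRECTIONS and out[-2] in STREET_TYPES:
            # Suffix direction after a street type, e.g. "MAIN ST SOUTH"
            out[-1] = DIRECTIONS[out[-1]]
            out[-2] = STREET_TYPES[out[-2]]
        elif out[-1] in STREET_TYPES:
            out[-1] = STREET_TYPES[out[-1]]
        # Prefix direction directly after the house number
        if out[0].isdigit() and out[1] in DIRECTIONS and len(out) > 3:
            out[1] = DIRECTIONS[out[1]]
    return " ".join(out)

print(standardize("125 N. Main Avenue"))       # 125 N MAIN AVE
print(standardize("406 West Rhapsody Drive"))  # 406 W RHAPSODY DR
```

With this sketch, "10 Oak Ave.", "10 Oak Avenue" and "10 Oak Av" all reduce to the same string, which is what makes a later field-by-field comparison possible.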
3. Perform address matching
operations
-
Prepare the data
-
Identify the base and address files
Example of US Streets address
style (Note: Some fields are required, others are optional but may provide
a higher match success rate if used to index the base file)
Example of address file
-
Define matching strategies for reference and address
files
What fields need to be indexed?
What fields will be matched?
What is a match?
What about errors?
-
Standardize the base and address files
-
Prepare the base file: Separate data into individual
fields and standardize abbreviations (this is usually done by the data
provider)
Example of defining the index
process
-
Prepare the Address Table by separating the data into
individual fields and sorting (this is done by the software)
-
Match the address file to the GBF
-
Set up the match process by identifying how the address
file will link to the base reference file by defining the comparison methods
(this is done by the software based on the parameters you have set)
-
Compares the address file to the base reference file
field by field
i.e., prefix direction, prefix type, street type,
suffix direction
-
Compares the address character-by-character
i.e., Main compared to Maine
-
Specify probabilities to compute matching score
Example of setting the matching
parameters (Note: In this ArcView example you 1) identify the GBF; 2) identify
the address file and address field; and, 3) set the comparison preferences.)
-
Perform the match
-
Software scores how close each candidate match is
-
Interpolates along the street network to determine the
address location
Example of how the software
compares the address file to the base reference file (Note: The software
determines possible matches to the address file in the GBF and picks the
best match based on the parameters set.)
-
Create the new geographic data layer containing one
point for each address found
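The matching steps outlined in this topic can be sketched in a few lines of Python. This is not how any particular package implements matching; it is a minimal illustration that assumes already standardized input, hypothetical field weights, a hypothetical minimum score of 75, and made-up segment records. The street name is compared character by character with the standard library's difflib, and the other fields are compared for equality.

```python
from difflib import SequenceMatcher

DIRS = {"N", "S", "E", "W"}
TYPES = {"ST", "AVE", "RD", "DR", "BLVD"}
# Hypothetical field weights; real packages let the operator tune these.
WEIGHTS = {"name": 0.6, "type": 0.2, "prefix": 0.2}

def parse(address):
    """Split a standardized one-field address into components."""
    tokens = address.upper().split()
    number = int(tokens.pop(0)) if tokens and tokens[0].isdigit() else None
    prefix = tokens.pop(0) if len(tokens) > 1 and tokens[0] in DIRS else ""
    stype = tokens.pop() if len(tokens) > 1 and tokens[-1] in TYPES else ""
    return {"number": number, "prefix": prefix, "name": " ".join(tokens), "type": stype}

def score(addr, seg):
    """Weighted field-by-field score from 0 to 100."""
    name_similarity = SequenceMatcher(None, addr["name"], seg["name"]).ratio()
    total = WEIGHTS["name"] * name_similarity
    total += WEIGHTS["type"] * (addr["type"] == seg["type"])
    total += WEIGHTS["prefix"] * (addr["prefix"] == seg["prefix"])
    return round(100 * total)

def best_match(address, segments, min_score=75):
    """Return (score, segment) for the best candidate, or None if nothing qualifies."""
    addr = parse(address)
    if addr["number"] is None:
        return None
    candidates = [s for s in segments if s["from"] <= addr["number"] <= s["to"]]
    scored = [(score(addr, s), s) for s in candidates]
    if not scored:
        return None
    top = max(scored, key=lambda pair: pair[0])
    return top if top[0] >= min_score else None

# Made-up reference records, echoing the "Main compared to Maine" example.
segments = [
    {"prefix": "", "name": "MAIN", "type": "ST", "from": 100, "to": 198},
    {"prefix": "", "name": "MAINE", "type": "ST", "from": 100, "to": 198},
]
print(best_match("125 MAIN ST", segments))
```

Both segments are candidates for "125 MAIN ST", but the exact name scores higher than the near miss, so the software picks it. Raising or lowering the weights and the minimum score changes how tolerant the match is.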
4. Perform visual analysis
of resulting point data layers
-
Display the resulting geographic point data layer
-
Relate new information to other pieces of information
5. Practical Exercise: Geocoding
Address geocoding capabilities are available in
most desktop packages. This exercise uses ArcView Version 3.0a. The data
sets and an ArcView project for the exercise can be downloaded.
They are in ArcView shapefile format.
You work for the Office of Economic Development in
San Antonio, Texas, and are doing a market survey to determine how many
aircraft manufacturing facilities are in San Antonio, and where they are
located. You want to use address geocoding to create a map of the facilities.
The three steps you will take are to:
1) prepare the data;
2) match the addresses; and,
3) display the results.
Prepare the data: You
obtain the addresses of manufacturing plants through the electronic yellow
pages (http://www.bigbook.com
is one of many places to look). You create a database containing
this information and obtain a geographic base reference file from a local
data provider. Your third piece of information is the location of airfields
within the San Antonio area. You open your GIS desktop software package
and add your database (the aircraft manufacturers) plus the two geographic
data layers (airports and streets). (Example
of how this view may look.)
You are now ready to index the geographic base file
so the software can compare the information in the aircraft manufacturers
address table to your geographic
base file (streets). Let’s take
the case of Zee Systems, Inc., which has an office at 406 West Rhapsody
Drive. The software will take the address from the database. It will then
look for all the Rhapsody Drive street segments in the geographic base
file (see example).
Using the match rules you set up, it will exclude any streets that are
on East Rhapsody, identify the segment going from 306 to 598 West Rhapsody,
and interpolate that the office is about 2/3 of the way down the street
on the right side. (see
example) Once the match is identified, a new record is added
to your point data layer of aircraft manufacturing facilities and the results
are displayed on your map.
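The interpolation in the Zee Systems case can be worked through as a short calculation. The sketch below assumes evenly spaced addresses and odd/even parity by side of street. The right-side range (306 to 598) comes from the example; the left-side range and the coordinates are hypothetical. Measured from the 306 end, 406 falls about 0.34 of the way along the segment, which is the same place as roughly 2/3 of the way measured from the 598 end; which end counts as the start depends on the direction in which the segment was digitized.

```python
def interpolate(house_number, left_range, right_range):
    """Return (side, fraction) for a house number on one street segment.

    Each range is (from_address, to_address). The fraction runs from 0.0 at
    the from end to 1.0 at the to end, assuming evenly spaced addresses.
    """
    for side, (start, end) in (("left", left_range), ("right", right_range)):
        low, high = min(start, end), max(start, end)
        same_parity = house_number % 2 == start % 2
        if low <= house_number <= high and same_parity:
            fraction = 0.5 if start == end else (house_number - start) / (end - start)
            return side, fraction
    return None

def point_on_segment(start_xy, end_xy, fraction):
    """Linear interpolation between the segment's two end points."""
    (x0, y0), (x1, y1) = start_xy, end_xy
    return (x0 + fraction * (x1 - x0), y0 + fraction * (y1 - y0))

# Right range 306 to 598 is from the text; the left range is hypothetical.
side, fraction = interpolate(406, left_range=(307, 599), right_range=(306, 598))
print(side, round(fraction, 2))  # right 0.34

# Arbitrary planar coordinates, only to show the arithmetic.
print(point_on_segment((0.0, 0.0), (292.0, 0.0), fraction))
```

A real package would also offset the point a short distance to the matched side of the centerline; that step is left out here.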
In order for the software to make this comparison
between a geographic data layer and address table, you must complete several
steps. The first step is to determine the type of base file you have. In
this example, you are using a US Streets formatted
file. When using the US Streets format, the base file’s attribute table must contain fields
holding the left address from, left address to, right address from, right
address to, and street name. Optional fields can contain the street type
and the prefix or suffix direction (see
example). Notice that the necessary fields are available. This
database is complicated by having two direction fields (prefix and suffix).
You can specify both when setting up the index parameters. In ArcView,
you need to set the Theme Preferences to recognize that the data layer
contains US Street information. Once
you set the preferences, the software asks you to build the index. The
indexing process allows the software to make the comparison between the
geographic base layer and the address file.
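Conceptually, indexing groups the base file records by the fields that will be searched, after checking that the required fields are present. The sketch below is an assumption about the general idea, not a description of ArcView's internal index; the field names are hypothetical, and apart from the 306 to 598 West Rhapsody range the sample values are made up.

```python
from collections import defaultdict

# Hypothetical field names for the required and optional US Streets fields.
REQUIRED = ("left_from", "left_to", "right_from", "right_to", "name")
OPTIONAL = ("prefix_dir", "street_type", "suffix_dir")

def build_index(segments):
    """Group segment records by street name so a lookup need not scan the whole file."""
    index = defaultdict(list)
    for i, seg in enumerate(segments):
        missing = [f for f in REQUIRED if seg.get(f) in (None, "")]
        if missing:
            raise ValueError(f"record {i} is missing required fields: {missing}")
        index[seg["name"].upper()].append(seg)
    return index

streets = [
    {"prefix_dir": "W", "name": "Rhapsody", "street_type": "DR",
     "left_from": 307, "left_to": 599, "right_from": 306, "right_to": 598},
    {"prefix_dir": "E", "name": "Rhapsody", "street_type": "DR",
     "left_from": 101, "left_to": 199, "right_from": 100, "right_to": 198},
]
index = build_index(streets)
print(len(index["RHAPSODY"]))  # 2
```

Looking up "RHAPSODY" returns both the East and West segments; the direction and address range then narrow the candidates to one.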
Match the addresses: You
are now ready to geocode your manufacturers table. You set up the link
between the geographic base file and the address field in the manufacturers
table. In ArcView, you will choose View, Geocode Addresses (see
example) and set up the relationship (see
example). Your reference theme
is the geographic base file (streets). You have already set the type of
base file you are using to US Streets. Aircraft Manufacturer is the address
table; you must tell the software you will use Address as the address field.
You must also create a new file that will contain the point where each
manufacturer is located. When you choose to match the two databases, the
software takes the first record in the address table and tries to find
the appropriate street (see
example). It moves through each record and identifies which
records are matched and which are not (see
example). Notice that 73% of the address records were matched.
In this example, do not worry about non-matches.
Display the results: The
software now creates the new point data layer containing the aircraft manufacturing
companies (see
results). You can see that the manufacturing facilities are
clustered around San Antonio International Airport and Kelly Air Force
Base.
Mastery
Learning Objectives:
-
Determine potential problems with
address and reference files.
-
Complete the matching process
including
-
Evaluating non-matched records
-
Standardizing an address table
-
Practical exercise: the rematch
process.
-
Practical exercise: creating a
map using attribute information.
Topics:
1. Determine potential problems
with address and reference files
-
Overall problems
-
Geocoding is based on assumptions
-
addresses are in a range and equally spaced along the
range
-
odd numbers are on one side of the street and evens
on the other
-
places have addresses
i.e., The White House is 1600 Pennsylvania Ave
-
Base file
-
Not current: i.e., streets not in file
-
Inaccurate locations
-
Incorrect or unidentified streets
-
Incorrect or unidentified address ranges
-
Inconsistent attribution i.e., I10 is also McArthur
Freeway
-
Address file
-
Incomplete
-
Inaccurate
-
Not standardized
-
Preferences
-
Spelling sensitivity set too high or low
-
Score to be considered is too high or low
2. Complete the matching process
-
Evaluate non-matched records to determine the problem
-
GBF file
-
Increase geographic area covered
-
Add new developments
-
Address file
-
Preferences
-
Adjust index search (blocking rules)
-
Adjust matching weights (how close a match is necessary)
-
Adjust minimum score to be considered a match
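The effect of the minimum score preference can be illustrated with a small sketch. The threshold values, record labels and scores below are hypothetical and are not the defaults of any package; the point is only that the same scores produce a different number of matches when the cut-off changes.

```python
def classify(scores, min_match=75, min_candidate=30):
    """Sort records into matched, partial and unmatched by score (0 to 100)."""
    result = {"matched": [], "partial": [], "unmatched": []}
    for record, value in scores.items():
        if value >= min_match:
            result["matched"].append(record)
        elif value >= min_candidate:
            result["partial"].append(record)
        else:
            result["unmatched"].append(record)
    return result

# Hypothetical scores for three address records.
scores = {"record A": 100, "record B": 62, "record C": 0}
print(classify(scores))                # record B is only a partial match
print(classify(scores, min_match=60))  # record B now counts as a match
```

Lowering the cut-off finds more matches but also accepts weaker ones, so each adjustment should be followed by a review of the newly matched records.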
3. Practical exercise: the
rematch process
In the previous example, 73% of the address file
was matched to a geographic location in the GBF. Based on the initial parameters,
there was one partial match and three addresses that did not match. The
rematch process allows you to evaluate why the record did not match, fix
any problems, and find more matches. Non-matched records are caused by
incorrect or incomplete address file records, errors or omissions in the
geographic base file, or preferences set incorrectly for the data being
matched.
Incorrect or incomplete
address file records: In the previous example, the address "10823
Northeast Entrance R" scored as a 62% match. When you look at the record
compared to the GBF file, it looks like you have found a match. (see
example) The software does not recognize "R" as "Road", and
when the record is parsed it identifies the street name as "Entrance R"
and does not identify a street type. You need to fix the incorrect address
record to be "Rd" or "Road" and a match will be found. In this ArcView
example, you can click on the match button to interactively match the record
to the GBF.
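The parsing problem described above can be reproduced with a simplified parser. This is a sketch with hypothetical word lists, not the parser ArcView uses; it shows why an unrecognized abbreviation such as "R" ends up in the street name.

```python
# Hypothetical, deliberately short word lists.
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
              "NORTH", "SOUTH", "EAST", "WEST",
              "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST"}
STREET_TYPES = {"RD", "ROAD", "ST", "STREET", "DR", "DRIVE", "AVE", "AVENUE"}

def parse(address):
    """Split a one-field address into number, prefix direction, name and type."""
    words = address.upper().split()
    number = words.pop(0) if words and words[0].isdigit() else ""
    prefix = words.pop(0) if len(words) > 1 and words[0] in DIRECTIONS else ""
    stype = words.pop() if len(words) > 1 and words[-1] in STREET_TYPES else ""
    return {"number": number, "prefix": prefix, "name": " ".join(words), "type": stype}

print(parse("10823 Northeast Entrance R"))
# name is "ENTRANCE R" and type is empty, so the record scores poorly
print(parse("10823 Northeast Entrance Rd"))
# name is "ENTRANCE" and type is "RD", so the record can match
```

Correcting the address record is usually better than adding "R" to the list of street types, since a single letter could just as easily be part of a street name.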
Let’s look at the unmatched records. (see
example) The Alcor Aviation record is an example of an incomplete
address. Colwick Street has no street number. Once a street number is entered
into the database, you can rematch the record and find a match. Alternatively,
you can locate Colwick interactively, see that it is a small street near
the airport, and interpolate the point location. (see
example)
Errors or omissions in the
geographic base file: The other two unmatched records appear
to have adequate addresses. (see
example) The next step is to evaluate the GBF file. You can
sort the street database to show all the streets named "410". (see
example) Several problems become evident. Notice that the name
is inconsistent and there are no address ranges. What other problems do
you see?
Finally, find all the Presa Street records. (see
example) The software is looking for 9594 South Presa Street
in the GBF. The highest range in the GBF file is 699 and all the Presa
Streets in the database have an "N" prefix. Either the GBF file is incomplete
or the address file is incorrect. You will need to do more research.
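Checks like the ones just made by eye can also be scripted. The sketch below is illustrative only: the function name and record layout are hypothetical, and the way the Presa ranges are split into segments is made up. Only the "N" prefix and the highest range of 699 come from the example.

```python
def diagnose(number, prefix, name, segments):
    """List reasons an address finds no segment; an empty list means no check failed."""
    same_name = [s for s in segments if s["name"] == name]
    if not same_name:
        return ["street name not in base file"]
    reasons = []
    if not any(s["prefix"] == prefix for s in same_name):
        found = sorted({s["prefix"] for s in same_name})
        reasons.append(f"no segment with prefix {prefix!r}; base file has {found}")
    highest = max(s["to"] for s in same_name)
    if number > highest:
        reasons.append(f"house number {number} exceeds highest range {highest}")
    return reasons

# Hypothetical segment breakdown for Presa Street.
presa = [
    {"prefix": "N", "name": "PRESA", "from": 100, "to": 399},
    {"prefix": "N", "name": "PRESA", "from": 400, "to": 699},
]
reasons = diagnose(9594, "S", "PRESA", presa)
for reason in reasons:
    print(reason)
```

The output names the symptoms but cannot say which file is at fault; that still requires checking the address against another source.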
4. Practical exercise: creating
a map using attribute information
An aircraft manufacturing company is trying to locate
interior design and manufacturing companies and propeller manufacturers
in Texas. A list of these three types of companies containing addresses
is obtained. (see
example) You have a zip code geographic data base containing
city names and zip codes. You can address geocode the file containing company
addresses to this file using the zip field. (see
example) Follow the three steps in the geocoding process:
1) prepare the data;
2) match the addresses; and,
3) display the results.
First, index the zip code file so that the software knows
you are sorting by zip code. (see
example) Second, identify the files to match (see
example) and perform the batch match. All 47 records match.
Third, display your results. (see
example)
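Matching on a zip code field is essentially a table lookup, which is why every record with a valid zip code matches. A minimal sketch follows; the function name, field name, zip codes and coordinates are placeholders invented for the illustration.

```python
def geocode_by_zip(records, zip_points, zip_field="zip"):
    """Attach the zip code's reference point to each record that has a known zip."""
    matched, unmatched = [], []
    for rec in records:
        key = str(rec.get(zip_field, "")).strip()[:5]  # also accepts zip+4 input
        point = zip_points.get(key)
        if point is None:
            unmatched.append(rec)
        else:
            matched.append({**rec, "x": point[0], "y": point[1]})
    return matched, unmatched

# Placeholder zip codes and coordinates, for illustration only.
zip_points = {"00001": (10.0, 20.0), "00002": (11.0, 21.0)}
companies = [
    {"name": "Company A", "zip": "00001"},
    {"name": "Company B", "zip": "00002-1234"},
    {"name": "Company C", "zip": "00003"},
]
matched, unmatched = geocode_by_zip(companies, zip_points)
print(len(matched), len(unmatched))  # 2 1
```

Note that all companies sharing a zip code land on the same point, so the resulting map shows distribution by zip code area rather than exact locations.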
You can display the propeller companies separately.
In this ArcView example, make the Aircraft Companies theme the active theme, and
choose Edit, Copy Theme. Then choose Edit, Paste. You now have a duplicate
theme. Under Theme, Properties, rename this theme "Propeller Manufacturers"
and define the subset as ([Specialty] = "P"). (see
example) The companies are in the Dallas/Fort Worth and San
Antonio areas. To see just the interior companies, make the Aircraft Companies
theme active. Under Theme, Properties, define the subset as ([Specialty]
= "ID") or ([Specialty] = "IP"). Then, under Theme, Edit Legend, set the
Legend Type to Unique Value and the Values Field to Specialty. When you
look at the interior companies, they cluster around Dallas/Fort Worth
and Houston. (see
example)
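The two theme definitions above are attribute queries, and the unique value legend is a grouping on the same field. For readers working outside ArcView, the same logic can be sketched in Python. The company names are placeholders; only the Specialty codes "P", "ID" and "IP" come from the exercise.

```python
from collections import Counter

# Placeholder records; only the Specialty codes come from the exercise.
companies = [
    {"name": "Company A", "Specialty": "P"},
    {"name": "Company B", "Specialty": "ID"},
    {"name": "Company C", "Specialty": "IP"},
]

def subset(records, field_name, values):
    """Keep the records whose field holds one of the given values."""
    return [r for r in records if r[field_name] in values]

# ([Specialty] = "P")
propellers = subset(companies, "Specialty", {"P"})
# ([Specialty] = "ID") or ([Specialty] = "IP")
interiors = subset(companies, "Specialty", {"ID", "IP"})

# A unique value legend draws one symbol class per distinct value.
legend_classes = Counter(r["Specialty"] for r in interiors)
print(len(propellers), dict(legend_classes))
```

The subset does not change the underlying point layer; it only controls which records are drawn, which is why the same geocoded file can serve several maps.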
Follow-up Units
Unit
40 Using reclassification operators
Unit
45 Location allocation
Unit
47 Boolean search
Resource
-
Address geocoding: Conversion of postal addresses to
geographic coordinates.
-
DIME: Dual Independent Map Encoding.
Predecessor to the TIGER files. Developed for the 1970 census.
-
GBF: Geographic Base File. A digital
geographic data base containing street networks (lines) or zip code files
(point or polygon).
-
Indirect georeferencing: Location of a feature based
on some information (usually addresses) other than geographic coordinates
(such as latitude/longitude or UTM meters). The process interpolates where
the feature will be placed based on a reference file and a matching strategy.
-
Reference files: A GBF used in the address geocoding
process to indirectly reference a table of addresses to geographic coordinates.
-
SIC: Standard Industrial Classification.
A grouping of industries, classified by a government, usually with major
groupings and sub-codes.
-
Table of addresses: A database containing an address
field and, usually, other attribute information.
-
TIGER: Topologically Integrated Geographic
Encoding and Referencing. A digital geographic data layer
of streets and other census boundaries, created by the U.S. Bureau of the
Census to support U.S. Census operations.
Reference Material:
-
An excellent list of print and web references
is found in (Unit
016 NCGIA Core Curriculum in GIScience: Discrete Georeferencing, section
7)
Review and study guide
-
What are the three components necessary for address
geocoding?
-
Geographic base file
-
Address file
-
Software
-
What should you consider when beginning a geocoding
application?
-
Level of detail in address file (i.e. just zip code or
full address) and accuracy of the database
-
Accuracy of the geographic database and whether you
need to use more than one (i.e. streets for full street addresses, zip
codes for rural routes)
-
Purpose of the application
-
Describe the basic process used by the software to
locate a US street address.
-
Finds the address in the address file
-
Parses the address out based on the rules selected by
the operator
-
Locates the street based on the blocking strategy selected
-
Determines the side of the street
-
Interpolates where the address is located on the street
-
Places the point
-
Create an attribute table for a geographic base file
that contains the correct fields for the US Streets address style.
-
What field is necessary in the address file?
Describe its characteristics.
-
Any field containing an address
-
Entire address is in one field (i.e. 125 Spring Valley
Loop)
-
Name some problems you can encounter with address files.
-
Standardization not considered when designing and populating
the database
-
Inconsistencies
-
Incomplete information (i.e. 125 Valley Rd. should be
125 N. Valley Rd.)
-
Multiple spellings of same place
-
Rural routes and P.O. boxes
-
How do you correct the problems?
-
Standardize database prior to address matching
-
US Postal Service standardization
-
What problems might you find in the geographic base
file?
-
Location inaccurate
-
New subdivisions not included: database not current
-
Inaccurate attributes (i.e. wrong street names)
Currently maintained by Steve Palladino
Created: May 14, 1997. Last updated: January 5,
1999.
Content comments to Suzy
Jampoler
Formatting comments to Steve
Palladino