Home > Spatial Analysis of a Small Area Problem
Hawaii
International Conference on Statistics and Related Fields
Submission
Title Page
Title:
Spatial Analysis of a Small Area Problem
Topic Area(s):
Applied Statistics, Population Statistics
Keywords:
Kriging, Small Area Estimation
Author:
Brady West (student, University of Michigan at Ann Arbor)
Mailing Address:
2222 Fuller Ct., Apartment 708A, Ann Arbor, MI, 48105
E-Mail Address:
bwest@umich.edu
Phone Numbers:
734.998.0498 (home), 734.223.9793 (cell)
Fax Number:
734.647.2440 (work)
Advisor:
Edward Rothman, University of Michigan Department of Statistics
Spatial Analysis
of a Small Area Problem
Brady T. West
University of Michigan
Department of Statistics
April, 2001
“Spatial Analysis of a Small Area Problem”
Table of Contents
I. Introduction……………………………………………………………………. 3
II. Current Model-Based
Approaches to Small Area Estimation…………….. 4
III. The Kriging Framework…………………………………………………….. 5
IV.
Analysis: An Application of Kriging to Small Area Estimation………….. 9
in
Washtenaw County……………..10
V. Discussion……………………………………………………………………… 17
VI. Works Cited………………………………………………………………….. 18
VII. Appendix A:
Census Block Group Population Model Details…………... 20
VIII. Appendix B:
Census Block Group Population Model Diagnostics…….. 28
IX. Appendix C:
Kriging Details……………………………………………….. 37
X. Appendix D:
Kriging Plots…………………………………………………... 49
XI. Appendix E:
Additional SAS Code…………………………………………59
Introduction
The United States Census Bureau
currently gathers population data for certain bounded geographic regions
in all U.S. counties known as census tracts, and several different methods
have been developed by statisticians and population analysts to estimate
the populations of smaller areas within these tracts. In recent
years, the statistical technique of small area estimation has been a
very hot topic, and there is an ever-growing demand for reliable estimates
of small area populations of all types. Reliable estimates of
the populations of small areas (i.e. city blocks) within the currently
established census tracts are important for several reasons. These
extremely important estimates are used for, among other things, determination
of state funding allocations, determination of exact boundaries for
school and voting districts, administrative planning, marketing guidance,
and as data for detailed descriptive and analytical studies of cities
(Bryan: 1999).
Current small area estimation
techniques attempt to derive an estimate of some attribute for a small
area (i.e. population) by using available auxiliary data for the area;
practitioners of these techniques now generally accept that indirect
estimators should be used, based on explicit models relating small areas
through supplementary data such as administrative records and recent
Census data (Rao: 1999). Rao proposes that small area models of
this nature can be broadly classified into two types, relating available
auxiliary data for small areas to parameters of interest for the small
areas (i.e. total population).
Such model-based approaches
truly are beneficial, in that they increase the precision of small area
estimates. Surveys attempting to derive direct small area estimates
are often faced with the problem of extremely small sample sizes in
particular regions, which can produce estimates that are lacking in
precision. However, these model-based methods rely on the availability
of both auxiliary data and data for the parameter of interest (Census
tract data, for instance), and they ignore the spatial relationships
between the small areas requiring estimates and the areas where data
of interest are actually available. A need exists for a new small
area estimation technique that takes into account the spatial relationships
between small areas, where the data of interest are unavailable for
the construction of models, and those larger areas where the desired
data are present.
The geo-statistical spatial analysis technique known as kriging is frequently used by geologists for interpolation of attributes at points not physically sampled over a particular region, based on knowledge about the spatial relationships of the sampled points. Tobler’s Law of Geography1 provides the rationale behind such spatial interpolation: points close together in space are more likely to have similar attributes than points far apart. The small area analogy suggests that adjacent small areas will have characteristics extremely similar to both each other and the larger regions that they are near or a part of. Using this analogy, we investigate an application of kriging to the small area estimation problem described above, and determine the precision of small area estimates that take into account the spatial relationships between small areas and those larger areas where data are available for those attributes desired in the small areas.
Current Model-Based Approaches
to Small Area Estimation2
The model-based approach to
small area estimation permits validation of the models from sample data.
Rao (1999) classifies small area models into two types:
(1)
i
= xti + vi
In this model, area-specific auxiliary data xi (administrative records, census data) are available for the areas i = 1,…,m. The population small area total Yi, or some function
i
= g(Yi),
is assumed to be related to xi through the above linear model (1).
The vi’s are assumed to be normally distributed,
random, uncorrelated small area effects, with mean zero and variance 2v. represents the vector of regression
parameters. The second type of model is as follows:
(2)
yij = xtij
+ vi + eij
This model is appropriate for
continuous variables y. In this model, unit-specific
auxiliary data xij are again available for areas i
= 1,…,m , where j = 1,…,Ni
and Ni represents the number of population
units in the i-th area. The unit y-values, yij, are assumed to be related to the
auxiliary values xij through the nested error regression
model above (2). The vi’s are normal, independent, and identically
distributed, with mean zero and variance 2v.
The eij’s are independent of the vi’s, normal, independent of each other,
and identically distributed, with mean zero and variance 2e.
again represents the vector of regression parameters.
Rao further asserts that in
the case of type (1) models, direct survey estimators Yi are available whenever the sample sizes
ni >= 1. Then, it can be assumed that
(3)
i
= i
+ ei
where i = g(Yi) and the sampling errors ei are independent and normally distributed
with mean zero and known variance i.
Then, when model (3) is combined with model (1),
we have
(4)
i
= xti + vi +
ei
Models of this form, with i
= log(Yi), have recently been used to produce
model-based county estimates of poor school-age children in the United
States (Rao: 1999). According to Rao, “The success of small
area estimation…largely depends on getting good auxiliary information
{xi} that leads to a small model variance 2v
relative to i.”
The Kriging Framework
The geo-statistical technique
known as kriging was developed in the late 1950’s, building on the
pioneering work of a South African mining engineer named Danie Krige,
who in 1951 proposed innovative new concepts for mining estimation (Guyaguler:
2000; Goovaerts: 1997). Based on Krige’s work, Georges Matheron
developed the Theory of Regionalized Variables, combining Krige’s
concepts into a single framework which he coined “kriging.”
Formally, kriging now refers to a family of least-squares linear regression
algorithms that attempt to predict values of a regionalized variable,
or a phenomenon that is spread out in space, at locations where data
for the variable is not available, based on the spatial pattern of the
available data.
In general, kriging provides
an optimal means of estimating the values of a certain continuous attribute z
at any points u not physically sampled over
a specific area A, using knowledge about the underlying
spatial relationships in a data set. Statistical models known
as variograms (p. 7) provide knowledge about these underlying
spatial relationships, and express spatial variation in the available
data. The estimated values are weighted linear combinations of
the available data, where data points that are closer to the locations
requiring estimation are given more weight in producing the estimates,
and the resulting estimators are unbiased (Lang: 1997). What differentiates
kriging from other linear estimation methods is its aim to minimize
the variance of the errors of the estimates, which because they are
unbiased have mean zero. In most cases, kriging is applied to
produce a fine grid of accurate estimates of some variable (i.e. standing
water level, or richness of soil) over a physical site of interest.
Goovaerts (1997) outlines the
basic methodology behind the kriging procedure, and differentiates between
the different types of kriging:
Given n data at sampled locations ui of some study area A for some attribute z(ui),
i =
1,…,n,
we wish to produce values for the estimator z*(u). This estimator is defined as
follows:
(5) z*(u)
- m(u) = i i(u)[z(ui)
- m(ui)]
Here, i(u) = the weight assigned to the datum z(ui) using a theoretical variogram [see (8)], m(u) =
the expected value of z(u), and m(ui)
= the expected value
of z(ui). The vector of weights (u)
is determined to minimize the variance of the estimation errors, denoted
as
(6) 2e
= Var{z*(u) -
z(u)}
The variance of the estimation errors 2e is minimized under the constraint that
E{z*(u)
- z(u)} =
0, making the estimator
unbiased. The estimation errors can be computed using cross-validation,
where estimates are derived for actual data points by removing sample
values from the data set and then using the kriging procedure to re-estimate
them.
There are three main kriging
variants, distinguished according to the model considered for m(u),
which recall is the expected value of z(u). Ordinary kriging is
the most widely used kriging method (Guyaguler: 2000), which accounts
for local fluctuations of the mean for the attribute z by limiting the area where the mean
for z is fixed to some local neighborhood W(u) (Goovaerts: 1997). This variant
of kriging considers the mean m(u) for z to be constant but unknown for all
locations in W(u) . Ordinary kriging estimates
are derived in the following manner:
(7) z*(u)
= i i(u)z(ui)
The weights i(u) assigned to each datum z(ui) are constrained such that i i(u) = 1, to ensure the unbiasedness of the estimates
(Guyaguler: 2000). The weights are again determined using the
experimental variogram so as to minimize the error variance described
in (6).
The vector of weights is derived by solving the following system of
equations (Guyaguler: 2000):
(8) i i(u)(uj
- ui) +
= (uj
- uo) j = 1,…,n
Here, (uj
- ui) represents the value of the theoretical
variogram at distance (uj
- ui), where uj and
ui are
the locations of actual data points, and uo is the location where an estimate
is desired.
represents the “Lagrange parameter.” (Lang: 1997)
Ordinary kriging is an exact
interpolator, since the estimated value of the desired attribute at
some point is equal to the exact data value. This variant of kriging
only uses a local neighborhood of data in deriving the estimates, where
only the closest data are assigned weights and the expected value m(u) of
the desired attribute is considered to be constant, and this fact can
be used to show that the estimates are unbiased:
E{z*(u) - z(u)} = E{i i(u)z(ui)} - E{z(u)} = m(u) i i(u) - m(u) = 0.
Ordinary kriging is usually
preferred to other types of kriging, because it requires neither knowledge
nor stationarity of the mean m(u) over the entire area of interest A (Goovaerts:
1997).
The other two chief variants
of kriging include simple kriging, which considers the mean m(u)
of the attribute z to be known and constant throughout
the area of interest A, and kriging with a trend,
which assumes that the unknown local mean m(u) varies within each local neighborhood W(u), throughout A.
Some other “members” of the kriging family include block
kriging, which involves the estimation of average z-values
over some segment, surface, or volume of any size or shape, and cokriging,
which allows one to better estimate values of z if the spatial distribution of some
secondary variate sampled more intensely than z is known. Cokriging basically
incorporates the additional information provided by other covariates
into the above kriging methods, and can greatly improve interpolation
estimates if it is difficult or expensive to sample the primary variate z.
The values of the additional covariates are measured at locations where z
is measured, and also at several other locations, and this additional
information greatly improves the accuracy of the estimates of z.
Calculation of the Sample
and Theoretical Variograms
As mentioned earlier, the sample
variogram (or semivariogram, representing spatial covariance), a statistical
model computed by using the available sample data, is the key part of
the kriging procedure, in that it expresses how the data vary spatially
across the area of interest. A sample variogram takes as input
a distance h between two sampled points, and outputs
a variogram estimate (h) that
explains the variance in the attribute z over the distance between the two
points. The
sample variogram is derived in the following manner:
This results in a plot that
only has as many points as there are distance intervals, with a variogram
estimate for each distance interval. The next important step in
computing the variogram is to fit a model to these data (a theoretical
variogram), which will allow for estimation of the variogram at all
possible distances, rather than just the midpoint of each defined distance
interval. Geo-statistics literature generally suggests the choice
of one of five theoretical models to fit to sample variogram data, each
of which has the parameters defined in the plot3
below:
Plot 1:
Example of Theoretical Variogram
In a theoretical variogram, h
(above, separation distance) represents the distance between two points; ao represents the range parameter,
or the distance at which the upper limit of the variogram (above, semivariance)
is reached; co represents the nugget effect
parameter, or the apparent non-zero intersection of the sample variogram
with the y-axis (several factors, like sampling error and short-scale
variability, make it is possible for two points very close to each other
to have unusual variance in the attribute z, and this is the nugget effect); and c
represents the sill parameter, or the upper limit of the variogram.
One of five possibilities for
the theoretical variogram is chosen, generally based on visual rules
of thumb. If there is no nugget effect, the parameter co = 0.
(h) = co
+ c[1.5(h/ao) -
0.5(h/ao)3] for
ao h (h) = co
+ c for h ao
Once a theoretical variogram
has been fit to the sampled data, the system of equations described
in (8) can be solved for the vector of weights (u), and kriging estimates for the attribute z
at specific locations may be obtained.
Analysis: An Application
of Kriging to Small Area Estimation
Overview
We consider the following small
area estimation problem: population counts, as well as various demographic,
housing, and geographic data, are available at the census tract level
for a given county in the United States, and population estimates are
desired for smaller subdivisions of the census tracts known as census
block groups. Only geographic data (land area, etc.) are available
for the census block groups.
Definitions are of course in
order. According to the United States Bureau of the Census4,
census tracts are defined as “small, relatively permanent statistical
subdivisions of a county…Census tracts usually have between
2,500 and 8,000 persons and, when first delineated, are designed to
be homogeneous with respect to population characteristics, economic
status, and living conditions. Census tracts do not cross county boundaries.”
In addition, a census block group (BG) is defined as “a cluster
of blocks having the same first digit of their three-digit identifying
numbers within a census tract…Geographic BG's never cross census tract
boundaries…BG's generally contain between 250 and 550 housing units,
with the ideal size being 400 housing units.”
The U.S. Census Bureau currently
produces all of the data proposed to be available at the census tract
level in the above setting at the census block group level as well,
but for the purposes of investigating the effectiveness of kriging in
small area estimation, we consider these data to be unavailable for
the small areas of interest (the census block groups). The following
methodology can then be considered analogous to a variety of situations
where data are available for some larger area, but not for smaller subdivisions
of the larger area, where estimates for some parameter are desired.
Modeling the Census Block
Group Population in Washtenaw County
We begin our investigation
of this problem by considering 1990 census data for the county of Washtenaw
in the state of Michigan. The 1990 TIGER data sets produced by
the U.S. Census Bureau after the completion of the 1990 U.S. census
are available publicly over the web, at http://www.esri.com/data/online/tiger/data.html.
These TIGER data sets include geographic features such as roads, railroads,
rivers, and lakes, political boundaries, statistical boundaries, and
demographic attributes for the entire United States, making them very
useful for this setting. However, they do not contain the geographical
coordinates of the various geographic areas, which are essential for
the application of kriging. The SAS code used to merge the TIGER
data sets with the coordinate data for the geographic areas of interest5
in this study is available in Appendix E.
The images on the following
pages, produced using the spatial analysis software ArcView GIS Version
3.1, provide an idea of what the 1990 census tract and census block
group divisions of the county of Washtenaw looked like.
Figure 1:
1990 Census Tracts in Washtenaw County
The continuous variables below,
all considered relevant to the population of a particular area, are
available in the aforementioned TIGER data sets for both the Washtenaw
census tracts and the Washtenaw census block groups, in addition to
population counts6:
Land Area (km2) Water
Area (km2) %Units Occupied %Units Vacant
%Units Owned %Units Rented Median
Unit Value Median Monthly Rent
%Single Detached Units %Single
Attached Units %Duplex Units %Apartment Units
%Persons Employed %Persons Unemployed %Persons Not In Med. Household Income
Work Force
Income Per Capita %Children
in Poverty %In Poverty
%Units Built < 1970 %Units
Built 70-79 %Units Built 80-84 %Units Built > 1984
Here is the 1990 spatial distribution
of the variable %Apartment Units across the county of Washtenaw,
for both the census tracts and the block groups:
Figure 3: Spatial Distribution of % Apartment Units
Across the
1990 Census Tracts in Washtenaw County
One can see that the majority
of the apartments in Washtenaw County are concentrated around the areas
of Ann Arbor and Ypsilanti, which are college towns.
Figure 4: Spatial Distribution of % Apartment Units
Across the
1990 Census Block Groups in Washtenaw County
Because these data are all
available at the census block group level, it is possible to fit a linear
model to the block group data using ordinary regression analysis, with
Persons as the response, and produce fitted values for each of the
block groups. The accuracy of this model-based technique, using
data that are actually available for the small areas, can then be compared
with the accuracy of a new kriging-based technique, which will attempt
to estimate the small area populations when data are only available
for larger areas, or in this setting, when data are only available at
the census tract level. In order to compare the accuracies of
these two techniques, the following errors will be computed for each
census block group in Washtenaw, and the mean and standard deviation
(sd) of these errors for each technique will then be derived:
(9) errormodel
= actual population - predicted population
(10) errorkriging
= actual population - estimated population
A linear model of the form
described in (1)
was fit to the 1990 Washtenaw census block group data, with Persons
as the parameter of interest and all other variables as predictors.
The technical details of this model fitting process are described in
Appendix A and Appendix B.
The final model that resulted was
(11) sqrt(Y) = 0.0912x1 + 31.67x2 + 0.00002363x3 +
-0.0001586x4
+ -11.05x5 +
22.92x6
where Y is the population of the census block
group, x1
is the area of the block group in square kilometers, x2 is the % of housing units that are
occupied, x3 is the median value of the housing
units, x4 is the income per capita, x5 is the % of housing units built between
1980 and 1984, and x6 is the % of housing units built after
1984.
After back-transforming to
get the predicted population values for each block group by taking
Yo2, where Yo
was the fitted value for the block group in the above model, we find
the following results based on (9):
sd{errormodel} = 394.94771
mean{errormodel}
= 35.40443
Using Kriging for Small
Area Estimation
The situation is considered
where population data, various demographic, housing, and geographic
data, and geographical coordinates (physical longitude and latitude)
are all available for some large areas (in this example, these data
are available for the Washtenaw census tracts). Population estimates
are desired for smaller areas within these larger areas, and the only
data that are available for the small areas are physical areas and geographical
coordinates. In this example, the geographical coordinates of
a particular area are assumed to be the coordinates of some point internal
to the area.
1.
The first step will be to determine a linear model like that in (11)
relating the small areas, in order to establish those variables that
are significant predictors of population for the small areas.
This model may come from previous investigations, or by fitting a model
to data that actually are available for similar small areas. Predictor
variables in the model should represent percentages or summary measures
describing the small areas, and not counts for the entire areas.
The reason for this is as follows.
After determining the appropriate
small area model, we will use the available large area data to krig
for those variables found to be significant predictors of the parameter
of interest over the entire grid representing the collection of the
large areas (in this case, Washtenaw county). Then, we will use
these kriging estimates to predict the populations at the locations
of the small areas, based on the small area model. Kriging is
a form of point interpolation, where the estimated value of some attribute
at a location near a point where a value for the attribute is given
will likely be very similar to the given value.
Suppose a count variable, such
as the total number of occupied homes, is a significant predictor of
the desired attribute (population) in the small areas. When kriging
estimates for this variable are produced over the entire area of interest
(Washtenaw County), estimates for the total number of occupied homes
in small areas that are within the larger areas where data are available
(and thus have geographical coordinates that are extremely close to
those of the large areas) will be very similar to the total number of
occupied homes for the large areas, which will not make sense.
However, suppose a variable such as “% of homes occupied” is a significant
predictor of the desired attribute, and we obtain kriging estimates
for this variable over the entire area of interest. The kriging
estimates for this variable, at geographical coordinates representing
small area locations that are extremely close to the geographical coordinates
of the large areas, will again be similar to the given values for the
large areas, which now makes sense. The same would be true for
summary variables, such as median income.
Based on the population model
derived in (11), with Persons as the parameter
of interest, the following variables were found to be significant predictors
of population in the small areas:
Land Area (km2) %Units
Occupied Median Unit Value Income Per Capita
%Units Built 1980-84 %Units
Built > 1984
2.
The second step will be to krig for all of the variables that are significant
predictors of the parameter of interest (population), over the entire
area of interest (Washtenaw County), using the available large area
(tract level) data. This will result in estimated values for each
of the variables listed above (except for the area variable, which is
assumed to be given for the small areas) at all locations over the entire
area of interest. The estimated values of these variables at the
locations representing the small areas can then be used, along with
the given area values for the small areas, to estimate the populations
of the small areas using the pre-determined small area model.
Kriging estimates were all
derived using the SAS System, and the procedures variogram
and krige2d. Please see Appendix C
and Appendix D for the technical details behind the calculations
of all sample variograms, theoretical variograms, and kriging estimates,
and all of the associated SAS code.
After kriging for all of the
continuous variables above (except land area), and then using the kriging
estimates in the small area model described in (11) to obtain population estimates for
the small areas, we find the following results based on (9):
sd{errorkriging} = 437.7639
mean{errorkriging}
= 26.1729
We see a smaller mean error
than that found using ordinary regression to predict the small area
populations, but a larger standard deviation in the errors. These
errors have a distribution that is comparable to that found using the
model-based technique.
Discussion
The standard deviation of the
errors for the small area estimates derived using the kriging-based
technique is understandably larger than that found using the model-based
technique of ordinary regression. As indicated in (6),
all kriging estimates of an attribute computed at locations over a specific
area have their own standard deviation, because they are linear combinations
of all of the available data for the attribute, with the weights for
each datum derived by solving the system of equations defined in (8).
When these kriging estimates are input into a pre-determined model,
they are not fixed values but rather the expected values from a series
of distributions. Each individual kriging estimate for a particular
attribute at a particular location thus in effect brings its own error
with it, incorporating additional error into the predictions derived
by using the kriging estimates as “new” values in the model relating
the small areas.
Despite this fact, the mean
and standard deviation of the errors for the small area estimates derived
using a kriging-based technique were still comparable to those found
using a model-based technique with available small area data.
This finding suggests that the kriging-based technique can provide reasonable
estimates for small area attributes when model-based techniques cannot
be used due to a lack of small area data.
The technique of kriging also
produces estimates for a particular attribute at all locations over
an area of interest. Kriging was originally designed for geological
use, where the areas of interest are often fields or plots that do not
have roads, bodies of water, or parks dividing the sampled points.
When applying kriging to population estimation, kriging estimates of
attributes such as “% of housing units occupied” are computed at
all locations over an entire area, including points where there are
roads, forests, parks, lakes, etc. Future research into the use
of kriging-based or other spatial analytic techniques for small area
estimation should investigate the possibility that barriers (i.e. roads,
lakes, etc.) between “sampled points” such as census block groups,
where population or housing attributes have meaning, might have an effect
on the spatial continuity of attributes related to population.
For example, kriging techniques
may predict very similar values for the variable “% of housing units
occupied” in two census block groups that are very close to each other
but divided by a river, when in fact the actual values of this variable
may vary significantly due to the presence of the river. Local
kriging systems may need to be solved in order to derive kriging estimates
of attributes at all points lying between such barriers, if the barriers
are found to have a significant effect on the spatial continuity of
the desired attributes.
Works Cited
1990 Block Group Descriptions: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/bg_info.html
1990 Census Tract and Block Numbering Areas Descriptions: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/tr_info.html
Bryan, Thomas. 1999. “Small-Area Population Estimation Technique Using Administrative Records, and Evaluation of Results with Loss Functions and Optimization Criteria.” Federal Committee on Statistical Methodology Research Conference, November 15-17, 1999. Washington, DC: U.S. Bureau of the Census.
Census Block Group Cartographic Boundary Files: U.S. Bureau of the Census.
http://www.census.gov/geo/www/cob/bg.html
Census TIGER Data: ArcData
Online. http://www.esri.com/data/online/tiger/data.html
Census Tract Cartographic Boundary Files: U.S. Bureau of the Census.
http://www.census.gov/geo/www/cob/tr.html
Faraway, Julian J. 1999.
Practical Regression and Anova using R.
Gamma Design Software.
http://www.geostatistics.com/GSWIN/GSWINIsotropic_Variogram_Models.html
Gill, Andrew. 1996. “Kriging Example.”
http://www.maths.adelaide.edu.au/Applied/UA_DAM_FLUIDS/GROUNDWATER/GEOSTATS/KrigEx/krigex.html
Goovaerts, Pierre.
1997. Geostatistics for Natural Resources Evaluation.
New York: Oxford Press. Pp. 125-158, 203.
Goovaerts, Pierre.
2001. “Study of scale-dependent correlation structures using
factorial kriging analysis.” http://www-personal.engin.umich.edu/~goovaert/pg-suj3.html
Guyaguler, Baris. “Ordinary Kriging.”
http://pangea.stanford.edu/~baris/professional/theoryok.html
Lang, Chao-Yi. “Kriging Interpolation.”
http://www.tc.cornell.edu/Visualization/contrib/cs490-94to95/clang/kriging.html
Rao, J.N.K. 1999.
“Current Trends in Sample Survey Theory and Methods.” Pp.
16-22 in Indian Journal of Statistics, vol. 61.
Rao, Mahesh. 1999. “Interpolation.”
http://reserves.library.okstate.edu/geog5343/lec12
SAS Institute Inc.
1999. Online documentation: the KRIGE2D
and VARIOGRAM Procedures.
Schabenberger, Oliver.
1997. “Developing Variograms in SAS + Ordinary Kriging.”
http://www.cas.vt.edu/schabenb/Spatial.htm
Census Block Group Population Model:
Fitting Details and Associated R Code
Washtenaw Census Block Group (1990) Population Modeling
All of the following code was written for the R Software Package for interactive statistical analysis.
> wash <- read.table("c:\\wash1.txt", sep=",", h=T)
#variable selection
> wash2 <- wash[,1:181]
> wash3 <- wash2[,-c(8:58)]
> wash4 <- wash3[,-c(13:26)]
> wash5 <- wash4[,-c(14:19,25:70)]
> wash6 <- wash5[,-c(19,23:44)]
> wash7 <- wash6[,-c(23:30,34:37)]
Variables (Continuous):
State County Tract Blockgroup Land.km Water.km Persons Housing Occupied Vacant Ownr.occ Rent.occ Val.medi Rnt.medi Detached Attached Duplex Apartmnt Employed Unemploy Notinwrk Inc.medn Incprcap Childpov Inpovrty Bltbfr70 Blt.7079 Blt.8084 Bltaft84
#Convert specific variables to proportions:
> wash7$Vacant <- wash7$Vacant / wash7$Housing
> wash7$Ownr.occ <- wash7$Ownr.occ / wash7$Occupied
> wash7$Rent.occ <- wash7$Rent.occ / wash7$Occupied
> wash7$Occupied <- wash7$Occupied / wash7$Housing
> wash7$Detached <- wash7$Detached / wash7$Housing
> wash7$Attached <- wash7$Attached / wash7$Housing
> wash7$Duplex <- wash7$Duplex / wash7$Housing
> wash7$Apartmnt <- wash7$Apartmnt / wash7$Housing
> wash7$Employed <- wash7$Employed / wash7$Persons
> wash7$Unemploy <- wash7$Unemploy / wash7$Persons
> wash7$Notinwrk <- wash7$Notinwrk / wash7$Persons
> wash7$Bltbfr70 <- wash7$Bltbfr70 / wash7$Housing
> wash7$Blt.7079 <- wash7$Blt.7079 / wash7$Housing
> wash7$Blt.8084 <- wash7$Blt.8084 / wash7$Housing
> wash7$Bltaft84 <- wash7$Bltaft84 / wash7$Housing
> wash7$Childpov <- wash7$Childpov / wash7$Persons
> wash7$Inpovrty <- wash7$Inpovrty / wash7$Persons
> washmod <- wash7[,-c(1:4)] #for fitting purposes
#remove bad cases
> washmod <- washmod[-c(66,254),]
> wash7 <- wash7[-c(66,254),]
> washmod1 <- washmod[,-c(4,6,8)]
#remove housing, %vacant, %rent.occ (%occupied + %vacant = 1)
#5-number numerical summary
> summary(washmod1)
Land.km Water.km Persons Occupied
Min. : 0.0120 Min. :0.000 Min. : 28.0 Min. :0.6648
1st Qu.: 0.3680 1st Qu.:0.000 1st Qu.: 720.8 1st Qu.:0.9304
Median : 0.6685 Median :0.000 Median : 999.0 Median :0.9601
Mean : 6.8540 Mean :0.121 Mean :1054.0 Mean :0.9438
3rd Qu.: 2.1520 3rd Qu.:0.000 3rd Qu.:1220.0 3rd Qu.:0.9775
Max. :96.8800 Max. :3.272 Max. :5853.0 Max. :1.0000
Ownr.occ Val.medi Rnt.medi Detached
Min. :0.0000 Min. : 0 Min. : 0.0 Min. :0.0000
1st Qu.:0.2126 1st Qu.: 68330 1st Qu.: 391.3 1st Qu.:0.1695
Median :0.6467 Median : 90200 Median : 476.5 Median :0.6017
Mean :0.5623 Mean :100200 Mean : 487.2 Mean :0.5467
3rd Qu.:0.8838 3rd Qu.:118000 3rd Qu.: 571.3 3rd Qu.:0.9170
Max. :1.0000 Max. :500000 Max. :1001.0 Max. :1.0000
Attached Duplex Apartmnt Employed
Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.0000
1st Qu.:0.003135 1st Qu.:0.00318 1st Qu.:0.01137 1st Qu.:0.4919
Median :0.010070 Median :0.01210 Median :0.17220 Median :0.5380
Mean :0.056940 Mean :0.03666 Mean :0.32210 Mean :0.5416
3rd Qu.:0.033450 3rd Qu.:0.04498 3rd Qu.:0.62250 3rd Qu.:0.6101
Max. :0.679200 Max. :0.31360 Max. :0.99080 Max. :0.8768
Unemploy Notinwrk Inc.medn Incprcap
Min. :0.00000 Min. :0.0000 Min. : 0 Min. : 2415
1st Qu.:0.01022 1st Qu.:0.1599 1st Qu.: 24910 1st Qu.:12870
Median :0.02026 Median :0.2099 Median : 37510 Median :16800
Mean :0.02687 Mean :0.2359 Mean : 38860 Mean :17660
3rd Qu.:0.03485 3rd Qu.:0.2607 3rd Qu.: 50130 3rd Qu.:20250
Max. :0.16410 Max. :0.9911 Max. :111200 Max. :66660
Childpov Inpovrty Bltbfr70 Blt.7079
Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.000000 1st Qu.:0.02024 1st Qu.:0.2634 1st Qu.:0.04358
Median :0.006738 Median :0.05479 Median :0.3877 Median :0.18410
Mean :0.025130 Mean :0.12040 Mean :0.4195 Mean :0.22330
3rd Qu.:0.026300 3rd Qu.:0.14900 3rd Qu.:0.5733 3rd Qu.:0.36030
Max. :0.421100 Max. :0.74070 Max. :1.3570 Max. :0.84270
Blt.8084 Bltaft84
Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000
Median :0.02214 Median :0.02094
Mean :0.04213 Mean :0.08072
3rd Qu.:0.06329 3rd Qu.:0.10920
Max. :0.33700 Max. :1.06400
Variables with data indicating that particular cases might be outliers: Persons, Val.medi, Bltbfr70, Bltaft84, Incprcap, Notinwrk, Land.km.
According to ArcView, land cases are not errors in data entry.
> wrk <- (washmod1$Bltbfr70 > 1)
error in data entry.
> washmod1 <- washmod1[-108,]
> wash7 <- wash7[-108,]
Notinwrk values do not appear to be errors in data entry.
Persons values might be errors in data entry. More than most
Census Tracts!
Val.medi may not be an error in data entry.
> (washmod1$Bltaft84 > 1)
Bltaft84 value is an error in data entry.
> wash7 <- wash7[-181,]
> washmod1 <- washmod1[-181,]
Incprcap value may not be an error in data entry.
Fit full model:
> mod <- lm(Persons ~ .,washmod1)
Model Diagnostics
-----------------
> x <- model.matrix(mod)
> lev <- hat(x)
> plot(lev, ylab = "Leverages") #see Plot 1, Appendix B.
> abline(h=2*22/266)
> sum(lev)
[1] 22
> sum <- summary(mod)
> stud <- mod$res / (sum$sig*sqrt(1-lev))
> plot(stud,ylab="Studentized Residuals") #see Plot 2, App. B.
> jack <- stud*sqrt((266-22-1)/(266-22-stud^2))
> plot(jack,ylab="Jackknife Residuals") #see Plot 3, App. B.
> jack[jack > 3]
71 85 151
8.757113 3.755840 7.153207
Check to see if these cases are outliers using a Bonferonni-based test:
> qt(0.05/(266*2),266-22-1)
[1] -3.792889
Two of the cases appear to be outliers! Further investigation yields that these are the two cases with Persons counts above 5,000.
Check for influential observations:
> cook <- stud^2*lev/(22*(1-lev))
> plot(cook,ylab = "Cook Statistics") #see Plot 4, Appendix B.
> cook[cook > 0.15]
12 71 86 151
0.1690986 0.3085052 0.2805020 0.1715737
There appear to be four influential cases with relatively large Cook statistics, two of which were identified as earlier as outliers. Temporarily exclude them from analysis:
> washmod1 <- washmod1[-c(12,70,85,149),]
> wash7 <- wash7[-c(12,70,85,149),]
> dim(washmod1)
[1] 262 22
> dim(wash7)
[1] 262 29
Variable Transformation
-----------------------
> plot(mod$fit, mod$res, xlab = "Fitted Values", ylab = "Residuals") #see Plot 5, Appendix B.
#test for non-constant variance
> summary(lm(abs(mod$res) ~ mod$fit))
Call:
lm(formula = abs(mod$res) ~ mod$fit)
Residuals:
Min 1Q Median 3Q Max
-635.57 -219.80 -58.22 80.81 3381.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -90.06772 88.47109 -1.018 0.310
mod$fit 0.40512 0.08035 5.042 8.58e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 394.2 on 264 degrees of freedom
Multiple R-Squared: 0.08783, Adjusted R-squared: 0.08438
F-statistic: 25.42 on 1 and 264 degrees of freedom, p-value: 8.577e-007
VERY strong evidence of non-constant variance in the residuals. Plot 5 suggests a square-root transformation of the count response (Persons).
> mod2 <- lm(sqrt(Persons) ~ ., washmod1)
> plot(mod2$fit, mod2$res, xlab = "Fitted Values", ylab = "Residuals", main = "Square Root Response") #see Plot 6, App. B.
> summary(lm(abs(mod2$res) ~ mod2$fit))
Call:
lm(formula = abs(mod2$res) ~ mod2$fit)
Residuals:
Min 1Q Median 3Q Max
-4.594 -2.669 -0.755 1.349 17.844
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.38401 1.80611 3.535 0.000483 ***
mod2$fit -0.06380 0.05719 -1.116 0.265606
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 3.838 on 260 degrees of freedom
Multiple R-Squared: 0.004764, Adjusted R-squared: 0.0009365
F-statistic: 1.245 on 1 and 260 degrees of freedom, p-value: 0.2656
There is no longer any evidence of non-constant variance in the residuals after the square-root transformation of the response (Persons).
Variable Selection
------------------
Stepwise backward elimination with a critical alpha value of 0.15 (for prediction performance) results in the following model:
> summary(mod2)
Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -
Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -
Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1)
Residuals:
Min 1Q Median 3Q Max
-19.3348 -3.5081 -0.2181 3.6621 24.2039
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.300e+01 7.446e+00 1.746 0.081991 .
Land.km 7.794e-02 2.464e-02 3.163 0.001752 **
Occupied 1.959e+01 7.748e+00 2.529 0.012056 *
Val.medi 3.255e-05 9.736e-06 3.343 0.000954 ***
Inc.medn 8.833e-05 3.816e-05 2.315 0.021436 *
Incprcap -3.926e-04 8.437e-05 -4.653 5.27e-06 ***
Bltbfr70 -3.385e+00 2.095e+00 -1.615 0.107473
Blt.8084 -1.320e+01 7.098e+00 -1.860 0.064105 .
Bltaft84 2.032e+01 3.510e+00 5.790 2.07e-08 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 6.029 on 253 degrees of freedom
Multiple R-Squared: 0.3133, Adjusted R-squared: 0.2916
F-statistic: 14.43 on 8 and 253 degrees of freedom, p-value: 0
Check for influential cases:
> sum <- summary(mod2)
> x <- model.matrix(mod2)
> lev <- hat(x)
> sum(lev)
[1] 9
> stud <- mod2$res / (sum$sig*sqrt(1-lev))
> cook <- stud^2*lev/(9*(1-lev))
> plot(cook,main="Cook Statistics after Variable Selection")
#see Plot 7, Appendix B.
> cook[cook > 0.15]
120
0.1631331
Is there a significant change in the model when this case is excluded?
> summary(mod2)
Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -
Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -
Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1,
subset = (cook < max(cook)))
Residuals:
Min 1Q Median 3Q Max
-19.4777 -3.5513 -0.2323 3.5122 24.7954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.024e+01 7.545e+00 1.358 0.17580
Land.km 8.157e-02 2.458e-02 3.318 0.00104 **
Occupied 2.204e+01 7.812e+00 2.822 0.00516 **
Val.medi 2.792e-05 9.980e-06 2.798 0.00554 **
Inc.medn 6.137e-05 4.047e-05 1.517 0.13064
Incprcap -2.961e-04 9.781e-05 -3.027 0.00272 **
Bltbfr70 -2.778e+00 2.108e+00 -1.318 0.18873
Blt.8084 -1.361e+01 7.064e+00 -1.927 0.05515 .
Bltaft84 2.055e+01 3.494e+00 5.883 1.28e-08 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 5.997 on 252 degrees of freedom
Multiple R-Squared: 0.2871, Adjusted R-squared: 0.2644
F-statistic: 12.68 on 8 and 252 degrees of freedom, p-value: 2.665e-015
Two predictors are no longer significant! This case should also be excluded.
> wash7 <- wash7[-115,]
> washmod1 <- washmod1[-115,]
> dim(wash7)
[1] 261 29
> dim(washmod1)
[1] 261 22
Further stepwise backward elimination results in:
> summary(mod2)
Call:
lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -
Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -
Childpov - Rnt.medi - Apartmnt - Ownr.occ - Bltbfr70 - 1 -
Inc.medn, data = washmod1)
Residuals:
Min 1Q Median 3Q Max
-19.71474 -3.78245 -0.03491 3.49497 25.02833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Land.km 9.120e-02 2.354e-02 3.874 0.000136 ***
Occupied 3.167e+01 9.984e-01 31.718 < 2e-16 ***
Val.medi 2.363e-05 9.639e-06 2.451 0.014913 *
Incprcap -1.586e-04 6.117e-05 -2.592 0.010084 *
Blt.8084 -1.105e+01 6.720e+00 -1.645 0.101265
Bltaft84 2.292e+01 3.092e+00 7.412 1.83e-12 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 6.02 on 255 degrees of freedom
Multiple R-Squared: 0.9658, Adjusted R-squared: 0.965
F-statistic: 1200 on 6 and 255 degrees of freedom, p-value: 0
We have a very good fit. Plot 8 in Appendix B indicates that the residuals in this model appear to follow a normal distribution.
> dim(washmod1)
[1] 261 22
> dim(wash7)
[1] 261 29
> fit <- mod2$fit
> fit <- as.matrix(fit)
> dim(fit)
[1] 260 1
> wash8$fitted <- fit
> wash8$fitted2 <- (wash8$fitted)^2 #actual fitted values
> dim(wash8)
[1] 260 28
#write out data set with fitted values, included cases
> write.table(wash8,"c:\\washdata.txt",row.names=F,sep=",")
#compute mean, standard deviation of desired errors
> data <- read.table("c:\\washdata.txt",h=T,sep=",")
> data$error <- data$Persons - data$fitted2
> var(data$error)
[1] 155983.7
> mean(data$error)
[1] 35.40443
> sqrt(var(data$error))
[1] 394.9477
APPENDIX B
Census Block Group Population Model:
Diagnostic Plots
According to the rather arbitrary (2p / n) rule of thumb, several cases appear to have high leverage.
Plot 2: Studentized Residuals (Full Model)
Three cases appear to have unusually large studentized residuals.
Plot 3: Jackknife Residuals (Full Model)
Three cases again appear to have large jackknife residuals.
Plot 4: Cook Statistics (Full Model)
There appear to be four influential cases, each having an unusually large Cook statistic.
Plot 5: Residuals vs. Fitted Values (Full Model)
There is rather strong evidence of non-constant variance in the residuals, in a pattern that suggests a square-root transformation of the response variable Persons.
Plot 6: Residuals vs. Fitted Values AFTER square-root transformation of response
The transformation of the response appears to have cleared up the non-constant variance in the residuals.
Plot 7: Cook Statistics AFTER Variable Selection
There appears to be one rather influential case.
Kriging Analysis: Technical Details and Associated SAS Code
Summary of Kriging Analysis
1. Reconstruct the complete 1990 Washtenaw Census Tract data set, in R:
> wash <- read.table("c:\\washt.txt",h=T,sep=",")
> dim(wash)
[1] 81 193
> wash2 <- wash[,-c(173:193)]
> wash3 <- wash2[,-c(7:57)]
> wash4 <- wash3[,-c(12:25,27:32,38:76)]
> wash5 <- wash4[,-c(21:42,44:51,55:58)]
> dim(wash5)
[1] 81 28
> wash5$Vacant <- wash5$Vacant / wash5$Housing
> wash5$Ownr.occ <- wash5$Ownr.occ / wash5$Occupied
> wash5$Rent.occ <- wash5$Rent.occ / wash5$Occupied
> wash5$Occupied <- wash5$Occupied / wash5$Housing
> wash5$Detached <- wash5$Detached / wash5$Housing
> wash5$Attached <- wash5$Attached / wash5$Housing
> wash5$Duplex <- wash5$Duplex / wash5$Housing
> wash5$Apartmnt <- wash5$Apartmnt / wash5$Housing
> wash5$Employed <- wash5$Employed / wash5$Persons
> wash5$Unemploy <- wash5$Unemploy / wash5$Persons
> wash5$Notinwrk <- wash5$Notinwrk / wash5$Persons
> wash5$Bltbfr70 <- wash5$Bltbfr70 / wash5$Housing
> wash5$Blt.7079 <- wash5$Blt.7079 / wash5$Housing
> wash5$Blt.8084 <- wash5$Blt.8084 / wash5$Housing
> wash5$Bltaft84 <- wash5$Bltaft84 / wash5$Housing
> wash5$Childpov <- wash5$Childpov / wash5$Persons
> wash5$Inpovrty <- wash5$Inpovrty / wash5$Persons
> write.table(wash5,"c:\\prekrig.txt",row.names=F,sep=",")
2. Merge the Census Tract data set with the file containing the locations of all of the Census Tracts (SAS code):
data wash;
infile "c:\prekrig.txt" dlm = ",";
input State County Tract Land_km Water_km Persons Housing Occupied Vacant Ownr_occ Rent_occ Val_medi Rnt_medi Detached Attached Duplex Apartmnt Employed Unemploy Notinwrk Inc_medn Incprcap Childpov Inpovrty Bltbfr70 Blt_7079 Blt_8084 Bltaft84;
run;
data locs;
infile "c:\tractloc2.txt" dlm = ",";
input obs stcty tract long lat;
run;
proc sort data = locs;
by tract;
run;
proc sort data = wash;
by tract;
run;
data last;
merge wash (in=ina) locs (in=inb);
by tract;
if ina and inb;
run;
libname brady "h:\thesis";
data brady.washkrig;
set last;
run;
3. Use the given Census Tract data to krig for the significant predictors of population in the small area model across all of Washtenaw County, using proc variogram and proc krige2d in SAS:
Variable: % Housing Units Occupied
----------------------------------
(see all plots for % Housing Units Occupied in Appendix D)
*create individual variable subset:
data washocc;
set brady.washkrig (keep = Occupied long lat);
run;
*plot measurement locations (see Plot 1 in Appendix D):
proc gplot data = washocc;
title 'Scatter Plot of Measurement Locations';
plot lat*long / frame cframe = ligr haxis = axis1 vaxis = axis2;
symbol1 v = dot color = blue;
axis1 minor = none;
axis2 minor = none label = (angle = 90 rotate = 0);
label lat = 'Latitude'
long = 'Longitude';
run;
quit;
*look at 3d surface plot of variable:
proc g3d data = washocc;
title 'Surface Plot of Variable Measurements';
scatter lat*long=Occupied / xticknum = 5 yticknum = 5
grid zmin = 0.8 zmax = 1.0;
label long = 'Longitude'
lat = 'Latitude'
Occupied = '% Housing Units Occupied';
run;
quit;
*look at the distribution of the pairwise distances between the tracts:
(see Plot 2 in Appendix D)
proc variogram data = washocc outdistance = outd;
compute nhc = 20 novariogram;
coordinates xc = long yc = lat;
var Occupied;
run;
title 'Distance Intervals';
proc print data = outd;
run;
data outd;
set outd;
mdpt = round((lb+ub) / 2,0.001);
label mdpt = 'Midpoint of Interval';
run;
axis1 minor = none;
axis2 minor = none label = (angle = 90 rotate = 0);
title 'Distribution of Pairwise Distances';
proc gchart data = outd;
vbar mdpt / type = sum sumvar = count discrete frame
cframe = ligr gaxis = axis1 raxis = axis2 nolegend;
run;
quit;
*look at the lower bound of distance interval where there are still sufficient (more than 30) pairs of points, and divide this number by the apparent lag distance (LAGD) to yield MAXLAGS. Plot 2 in Appendix D suggests setting MAXLAGS = 15, and lagd = 0.03. Compute and plot standard and robust semivariograms.
proc variogram data = washocc outv = outv;
compute lagd = 0.03 maxlag = 15 robust;
coordinates xc = long yc = lat;
var Occupied;
run;
title 'Variogram Results';
proc print data = outv label;
var lag count distance variog rvario;
run;
data outv2;
set outv;
vari = variog;
type = 'regular';
output;
vari = rvario;
type = 'robust';
output;
run;
title 'Standard and Robust Semivariogram for %Occupied Data';
proc gplot data = outv2;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;
*fit an appropriate theoretical variogram to the data, and plot it against the standard and robust semivariograms:
data outv3;
set outv;
c0 = 0.0012; c = 0.0022; a0 = 0.50;
vari = c0 + c * (1 - exp(-distance * distance / (a0 * a0)));
type = 'Gaussian';
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for %Occupied Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
symbol3 i = join l = 1 c = red v = diamond;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;
*krig using the theoretical variogram, and plot the resulting estimates:
proc sort data = washocc;
by long;
run;
proc sort data = washocc;
by lat;
run;
proc krige2d data = washocc outest = est;
pred var = Occupied;
model nugget = 0.0012 scale = 0.0022 range = 0.50 form = gauss;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by 0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for %Occupied';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = '%Occupied';
run;
4. Merge the kriging estimates with the final Census Block Group data set:
data est2 (keep = long lat koccu);
set est;
long = gxc;
lat = gyc;
long = round(long, 0.01);
lat = round(lat, 0.01);
koccu = estimate;
run;
data block;
set brady.krigdata; *copy of brady.washdata
long = round(long, 0.01);
lat = round(lat, 0.01);
run;
proc sort data = est2;
by long lat;
run;
proc sort data = block;
by long lat;
run;
data final;
merge block (in = ina) est2 (in = inb);
by long lat;
if ina and inb;
run;
*continue to append brady.krigdata, so that the final data set will contain both actual data and kriging estimates for all of the significant predictors of population for each of the Census Block Groups:
data brady.krigdata;
set final;
run;
5. Continue to krig for all other significant predictors of population in the small area model, following the same methodology outlined above:
Variable: Median Housing Unit Value
-----------------------------------
(see all plots for Median Housing Unit Value in Appendix D)
*fit theoretical variogram:
data outv3;
set outv;
c = 3500000000; a0 = 0.06;
if (distance <= 0.06) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
type = 'Spherical';
end;
else if (distance > 0.06) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
*krig using the theoretical variogram, and plot the resulting estimates:
proc krige2d data = wash_med outest = est;
pred var = Val_medi;
model scale = 3500000000 range = 0.06 form = spherical;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for Med.Value';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = 'Median Unit Value';
run;
Variable: % Units Built After 1984
----------------------------------
(see all plots for % Units Built After 1984 in Appendix D)
*fit theoretical variogram:
data outv3;
set outv;
c = 0.015; a0 = 0.10;
if (distance <= 0.10) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
type = 'Spherical';
end;
else if (distance > 0.10) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for %BltAft84 Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
symbol3 i = join l = 1 c = red v = diamond;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*krig using the theoretical variogram, and plot the resulting estimates:
proc krige2d data = wash84 outest = est;
pred var = Bltaft84;
model scale = 0.015 range = 0.10 form = sph;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for %BltAft84';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = '% Units Built After 1984';
run;
*zero out any negative kriging estimates:
data final;
set final;
if (kaft84 < 0) then kaft84 = 0;
run;
Variable: Income Per Capita
---------------------------
(see all plots for Income Per Capita in Appendix D)
data washinc;
set brady.washkrig (keep = Incprcap long lat);
run;
proc g3d data = washinc;
title 'Surface Plot of Variable Measurements';
scatter lat*long=Incprcap / xticknum = 5 yticknum = 5
grid zmin = 4120 zmax = 59400;
label long = 'Longitude'
lat = 'Latitude'
Incprcap = 'Income Per Capita';
run;
quit;
proc variogram data = washinc outv = outv;
compute lagd = 0.03 maxlag = 15 robust;
coordinates xc = long yc = lat;
var Incprcap;
run;
title 'Variogram Results';
proc print data = outv label;
var lag count distance variog rvario;
run;
data outv2;
set outv;
vari = variog;
type = 'regular';
output;
vari = rvario;
type = 'robust';
output;
run;
title 'Standard and Robust Semivariogram for IncPrCap Data';
proc gplot data = outv2;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;
*fit theoretical variogram:
data outv3;
set outv;
c = 75000000; a0 = 0.03;
if (distance <= 0.03) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
type = 'Spherical';
end;
else if (distance > 0.03) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for IncPrCap Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
symbol3 i = join l = 1 c = red v = diamond;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*krig using the theoretical variogram, and plot the resulting estimates:
proc krige2d data = washinc outest = est;
pred var = Incprcap;
model scale = 75000000 range = 0.03 form = sph;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for IncPrCap';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = 'Income Per Capita';
run;
Variable: % Units Built Between 1980-1984
-----------------------------------------
(see all plots for % Units Built Between 1980-1984 in Appendix D)
data wash8084;
set brady.washkrig (keep = Blt_8084 long lat);
run;
proc g3d data = wash8084;
title 'Surface Plot of Variable Measurements';
scatter lat*long=Blt_8084 / xticknum = 5 yticknum = 5
grid zmin = 0 zmax = 0.1375;
label long = 'Longitude'
lat = 'Latitude'
Blt_8084 = '% Units Built 1980-1984';
run;
quit;
proc variogram data = wash8084 outv = outv;
compute lagd = 0.03 maxlag = 15 robust;
coordinates xc = long yc = lat;
var Blt_8084;
run;
title 'Variogram Results';
proc print data = outv label;
var lag count distance variog rvario;
run;
data outv2;
set outv;
vari = variog;
type = 'regular';
output;
vari = rvario;
type = 'robust';
output;
run;
title 'Standard and Robust Semivariogram for %Blt8084 Data';
proc gplot data = outv2;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');
run;
quit;
*fit theoretical variogram:
data outv3;
set outv;
c = 0.00155; a0 = 0.05;
if (distance <= 0.05) then do;
vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);
type = 'Spherical';
end;
else if (distance > 0.05) then do;
vari = c;
type = 'Spherical';
end;
output;
vari = variog; type = 'regular'; output;
vari = rvario; type = 'robust'; output;
run;
title 'Theoretical and Sample Semivariogram for %Blt8084 Data';
proc gplot data = outv3;
plot vari*distance=type / frame cframe = ligr vaxis = axis2
haxis = axis1;
symbol1 i = join l = 1 c = blue v = star;
symbol2 i = join l = 1 c = black v = square;
symbol3 i = join l = 1 c = red v = diamond;
axis1 minor = none label = (c = black 'Lag Distance');
axis2 minor = none label = (angle = 90 rotate = 0 c = black
'Variogram');
run;
quit;
*krig using the theoretical variogram, and plot the resulting estimates:
proc krige2d data = wash8084 outest = est;
pred var = Blt_8084;
model scale = 0.00155 range = 0.05 form = sph;
coord xc = long yc = lat;
grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by
0.01;
run;
proc g3d data = est;
title 'Surface Plot of Kriged Estimates for %Blt8084';
scatter gyc*gxc = estimate / grid;
label gyc = 'Latitude'
gxc = 'Longitude'
estimate = '% Units Built 1980-1984';
run;
6. Using the final data set constructed after all kriging estimates have been derived, obtain population estimates at the locations representing the small areas by entering the kriging estimates of the significant predictors for each small area into the model relating the small areas:
data test;
set brady.krigdata;
estimate = 0.0912 * land_km + 31.67 * koccu + 0.00002363 * kmedv + -0.0001586 * kinc + -11.05 * k8084 + 22.92 * kaft84;
est2 = estimate * estimate;
error = Persons - est2;
run;
*compute the mean and standard deviation of the desired errors:
proc means data = test;
var error;
run;
data brady.krigdata;
set test;
run;
APPENDIX D
Kriging Plots
Additional SAS Code
SAS code used to create the data file that includes the locations of each Census Block Group in Washtenaw County:
data bgid;
infile "c:\bg26_d90_pa.dat";
input number : 4. county $char5. tract 4. block 3.;
run;
data bgid2;
set bgid;
if county = '26161';
run;
data coords;
infile "c:\bg26_d90_p.dat" lrecl = 70;
input number 1-10 long 17-23 lat 46-51;
run;
data coords2;
set coords;
if number ne -0.8 and number ne -0.9;
run;
proc sort data = bgid2;
by number;
run;
proc sort data = coords2;
by number;
run;
data final;
merge bgid2 (in=ina) coords2(in=inb);
by number;
if ina;
run;
data temp;
set final;
file "c:\bgloc2.txt";
put @1 number @5 "," @6 county @11 "," @12 tract @16 ","
@17 block @18 "," @19 long @26 "," @27 lat;
run;
SAS code used to create the data file that includes the locations of each Census Tract in Washtenaw County:
data tractid;
infile "c:\tr26_d90_pa.dat";
input number : 4. county $char5. tract 4.;
run;
data tractid2;
set tractid;
if county = '26125' or county = '26161' or county = '26163';
run;
data coords;
infile "c:\tr26_d90_p.dat" lrecl = 46;
input number 1-10 long 11-28 lat 29-46;
run;
data coords2;
set coords;
if number ne -8 and number ne -9;
run;
proc sort data = tractid2;
by number;
run;
proc sort data = coords2;
by number;
run;
data final;
merge tractid2 (in=ina) coords2(in=inb);
by number;
if ina;
run;
data temp;
set final;
file "c:\tractloc2.txt";
put @1 number @4 "," @5 county @10 "," @11 tract @15 "," @16 long @26 "," @27 lat;
run;
SAS code used to create the master data set, including all variables for each Census Block Group (demographics, population and housing characteristics, latitude and longitude coordinates, etc.):
data wash;
infile "c:\washdata.txt" dlm = ",";
input state county tract block land_km water_km persons housing occupied vacant ownr_occ rent_occ val_medi rnt_medi detached attached duplex apartmnt employed unemploy notinwrk inc_medn incprcap childpov inpovrty bltbfr70 blt_7079 blt_8084 bltaft84 fitted fitted2;
run;
data locs;
infile "c:\bgloc2.txt" dlm = ",";
input number county tract block long lat;
run;
proc sort data = locs;
by tract block;
run;
proc sort data = wash;
by tract block;
run;
data last;
merge wash (in=ina) locs (in=inb);
by tract block;
if ina and inb;
run;
libname brady "h:\thesis";
data brady.washdata;
set last;
long = long * 100;
lat = lat * 100;
run;
1 Taken from “http://reserves.library.okstate.edu/geog5343/lec12/sld001.htm”
2 From: Rao, J.N.K. 1999. Current Trends in Sample Survey Theory and Methods. Indian Journal of Statistics 61: 16-22.
3 Taken from “http://www.geostatistics.com/GSWin/GSWINGaussian_Isotropic_Model.html”
4 United States Bureau of the Census - Geographic Area Descriptions: http://www.census.gov/geo/www/cob/tr_info.html; http://www.census.gov/geo/www/cob/bg_info.html
5 Latitude and longitude coordinate data for the census tracts and block groups in the county of Washtenaw were obtained from http://www.census.gov/geo/www/cob/tr.html and http://www.census.gov/geo/www/cob/bg.html.
6 Note that “Units” refers to the total number of housing units in the given area. Hence “% Apartment Units” refers to the proportion of housing units that are apartments.
All Rights Reserved Powered by Free Document Search and Download
Copyright © 2011