Hawaii International Conference on Statistics and Related Fields

Submission Title Page

Title: Spatial Analysis of a Small Area Problem

Topic Area(s): Applied Statistics, Population Statistics

Keywords: Kriging, Small Area Estimation

Author: Brady West (student, University of Michigan at Ann Arbor)

Mailing Address: 2222 Fuller Ct., Apartment 708A, Ann Arbor, MI, 48105

E-Mail Address: bwest@umich.edu

Phone Numbers: 734.998.0498 (home), 734.223.9793 (cell)

Fax Number: 734.647.2440 (work)

Advisor: Edward Rothman, University of Michigan Department of Statistics

Spatial Analysis of a Small Area Problem

Brady T. West

University of Michigan

Department of Statistics

April, 2001

“Spatial Analysis of a Small Area Problem”

Table of Contents

I. Introduction……………………………………………………………………. 3

II. Current Model-Based Approaches to Small Area Estimation…………….. 4

III. The Kriging Framework…………………………………………………….. 5

Calculation of the Sample and Theoretical Variograms……….. 7

IV. Analysis: An Application of Kriging to Small Area Estimation………….. 9

Overview……………………………………………………………9

Modeling the Census Block Group Population

in Washtenaw County……………..10

Using Kriging for Small Area Estimation………………………. 15

V. Discussion……………………………………………………………………… 17

VI. Works Cited………………………………………………………………….. 18

VII. Appendix A: Census Block Group Population Model Details…………... 20

VIII. Appendix B: Census Block Group Population Model Diagnostics…….. 28

IX. Appendix C: Kriging Details……………………………………………….. 37

X. Appendix D: Kriging Plots…………………………………………………... 49

XI. Appendix E: Additional SAS Code…………………………………………59

Introduction

The United States Census Bureau currently gathers population data for certain bounded geographic regions in all U.S. counties known as census tracts, and several different methods have been developed by statisticians and population analysts to estimate the populations of smaller areas within these tracts. In recent years, the statistical technique of small area estimation has been a very hot topic, and there is an ever-growing demand for reliable estimates of small area populations of all types. Reliable estimates of the populations of small areas (i.e. city blocks) within the currently established census tracts are important for several reasons. These extremely important estimates are used for, among other things, determination of state funding allocations, determination of exact boundaries for school and voting districts, administrative planning, marketing guidance, and as data for detailed descriptive and analytical studies of cities (Bryan: 1999).

Current small area estimation techniques attempt to derive an estimate of some attribute for a small area (i.e. population) by using available auxiliary data for the area; practitioners of these techniques now generally accept that indirect estimators should be used, based on explicit models relating small areas through supplementary data such as administrative records and recent Census data (Rao: 1999). Rao proposes that small area models of this nature can be broadly classified into two types, relating available auxiliary data for small areas to parameters of interest for the small areas (i.e. total population).

Such model-based approaches truly are beneficial, in that they increase the precision of small area estimates. Surveys attempting to derive direct small area estimates are often faced with the problem of extremely small sample sizes in particular regions, which can produce estimates that are lacking in precision. However, these model-based methods rely on the availability of both auxiliary data and data for the parameter of interest (Census tract data, for instance), and they ignore the spatial relationships between the small areas requiring estimates and the areas where data of interest are actually available. A need exists for a new small area estimation technique that takes into account the spatial relationships between small areas, where the data of interest are unavailable for the construction of models, and those larger areas where the desired data are present.

The geo-statistical spatial analysis technique known as kriging is frequently used by geologists for interpolation of attributes at points not physically sampled over a particular region, based on knowledge about the spatial relationships of the sampled points. Tobler’s Law of Geography¹ provides the rationale behind such spatial interpolation: points close together in space are more likely to have similar attributes than points far apart. The small area analogy suggests that adjacent small areas will have characteristics extremely similar to both each other and the larger regions that they are near or a part of. Using this analogy, we investigate an application of kriging to the small area estimation problem described above, and determine the precision of small area estimates that take into account the spatial relationships between small areas and those larger areas where data are available for those attributes desired in the small areas.

Current Model-Based Approaches to Small Area Estimation²

The model-based approach to small area estimation permits validation of the models from sample data. Rao (1999) classifies small area models into two types:

(1) _i = x^t_i + v_i

In this model, area-specific auxiliary data x_i (administrative records, census data) are available for the areas i = 1,…,m. The population small area total Y_i, or some function

_i = g(Y_i), is assumed to be related to x_i through the above linear model (1). The v_i’s are assumed to be normally distributed, random, uncorrelated small area effects, with mean zero and variance ²_v.  represents the vector of regression parameters. The second type of model is as follows:

(2) y_ij = x^t_ij + v_i + e_ij

This model is appropriate for continuous variables y. In this model, unit-specific auxiliary data x_ij are again available for areas i = 1,…,m , where j = 1,…,N_i and N_i represents the number of population units in the i-th area. The unit y-values, y_ij, are assumed to be related to the auxiliary values x_ij through the nested error regression model above (2). The v_i’s are normal, independent, and identically distributed, with mean zero and variance ²_v. The e_ij’s are independent of the v_i’s, normal, independent of each other, and identically distributed, with mean zero and variance ²_e.  again represents the vector of regression parameters.

Rao further asserts that in the case of type (1) models, direct survey estimators Y_i are available whenever the sample sizes n_i >= 1. Then, it can be assumed that

(3) _i = _i + e_i

where _i = g(Y_i) and the sampling errors e_i are independent and normally distributed with mean zero and known variance _i. Then, when model (3) is combined with model (1), we have

(4) _i = x^t_i + v_i + e_i

Models of this form, with _i = log(Y_i), have recently been used to produce model-based county estimates of poor school-age children in the United States (Rao: 1999). According to Rao, “The success of small area estimation…largely depends on getting good auxiliary information {x_i} that leads to a small model variance ²_v relative to _i.”

The Kriging Framework

The geo-statistical technique known as kriging was developed in the late 1950’s, building on the pioneering work of a South African mining engineer named Danie Krige, who in 1951 proposed innovative new concepts for mining estimation (Guyaguler: 2000; Goovaerts: 1997). Based on Krige’s work, Georges Matheron developed the Theory of Regionalized Variables, combining Krige’s concepts into a single framework which he coined “kriging.” Formally, kriging now refers to a family of least-squares linear regression algorithms that attempt to predict values of a regionalized variable, or a phenomenon that is spread out in space, at locations where data for the variable is not available, based on the spatial pattern of the available data.

In general, kriging provides an optimal means of estimating the values of a certain continuous attribute z at any points u not physically sampled over a specific area A, using knowledge about the underlying spatial relationships in a data set. Statistical models known as variograms (p. 7) provide knowledge about these underlying spatial relationships, and express spatial variation in the available data. The estimated values are weighted linear combinations of the available data, where data points that are closer to the locations requiring estimation are given more weight in producing the estimates, and the resulting estimators are unbiased (Lang: 1997). What differentiates kriging from other linear estimation methods is its aim to minimize the variance of the errors of the estimates, which because they are unbiased have mean zero. In most cases, kriging is applied to produce a fine grid of accurate estimates of some variable (i.e. standing water level, or richness of soil) over a physical site of interest.

Goovaerts (1997) outlines the basic methodology behind the kriging procedure, and differentiates between the different types of kriging:

Given n data at sampled locations u_i of some study area A for some attribute z(u_i),

i = 1,…,n, we wish to produce values for the estimator z*(u). This estimator is defined as follows:

(5) z*(u) - m(u) = _i _i(u)[z(u_i) - m(u_i)]

Here, _i(u) = the weight assigned to the datum z(u_i) using a theoretical variogram [see (8)], m(u) = the expected value of z(u), and m(u_i) = the expected value of z(u_i). The vector of weights (u) is determined to minimize the variance of the estimation errors, denoted as

(6) ²_e = Var{z*(u) - z(u)}

The variance of the estimation errors ²_e is minimized under the constraint that

E{z*(u) - z(u)} = 0, making the estimator unbiased. The estimation errors can be computed using cross-validation, where estimates are derived for actual data points by removing sample values from the data set and then using the kriging procedure to re-estimate them.

There are three main kriging variants, distinguished according to the model considered for m(u), which recall is the expected value of z(u). Ordinary kriging is the most widely used kriging method (Guyaguler: 2000), which accounts for local fluctuations of the mean for the attribute z by limiting the area where the mean for z is fixed to some local neighborhood W(u) (Goovaerts: 1997). This variant of kriging considers the mean m(u) for z to be constant but unknown for all locations in W(u) . Ordinary kriging estimates are derived in the following manner:

(7) z*(u) = _i _i(u)z(u_i)

The weights _i(u) assigned to each datum z(u_i) are constrained such that _i _i(u) = 1, to ensure the unbiasedness of the estimates (Guyaguler: 2000). The weights are again determined using the experimental variogram so as to minimize the error variance described in (6). The vector of weights is derived by solving the following system of equations (Guyaguler: 2000):

(8) _i _i(u)(u_j - u_i) +  = (u_j - u_o) j = 1,…,n

Here, (u_j - u_i) represents the value of the theoretical variogram at distance (u_j - u_i), where u_j and u_i are the locations of actual data points, and u_o is the location where an estimate is desired.  represents the “Lagrange parameter.” (Lang: 1997)

Ordinary kriging is an exact interpolator, since the estimated value of the desired attribute at some point is equal to the exact data value. This variant of kriging only uses a local neighborhood of data in deriving the estimates, where only the closest data are assigned weights and the expected value m(u) of the desired attribute is considered to be constant, and this fact can be used to show that the estimates are unbiased:

E{z*(u) - z(u)} = E{_i _i(u)z(u_i)} - E{z(u)} = m(u) _i _i(u) - m(u) = 0.

Ordinary kriging is usually preferred to other types of kriging, because it requires neither knowledge nor stationarity of the mean m(u) over the entire area of interest A (Goovaerts: 1997).

The other two chief variants of kriging include simple kriging, which considers the mean m(u) of the attribute z to be known and constant throughout the area of interest A, and kriging with a trend, which assumes that the unknown local mean m(u) varies within each local neighborhood W(u), throughout A. Some other “members” of the kriging family include block kriging, which involves the estimation of average z-values over some segment, surface, or volume of any size or shape, and cokriging, which allows one to better estimate values of z if the spatial distribution of some secondary variate sampled more intensely than z is known. Cokriging basically incorporates the additional information provided by other covariates into the above kriging methods, and can greatly improve interpolation estimates if it is difficult or expensive to sample the primary variate z. The values of the additional covariates are measured at locations where z is measured, and also at several other locations, and this additional information greatly improves the accuracy of the estimates of z.

Calculation of the Sample and Theoretical Variograms

As mentioned earlier, the sample variogram (or semivariogram, representing spatial covariance), a statistical model computed by using the available sample data, is the key part of the kriging procedure, in that it expresses how the data vary spatially across the area of interest. A sample variogram takes as input a distance h between two sampled points, and outputs a variogram estimate (h) that explains the variance in the attribute z over the distance between the two points. The sample variogram is derived in the following manner:

The range of distances between sampled points (from zero to the maximum distance) is divided into a set of discrete intervals. The width of each lag, or discrete interval h, must be large enough such that there are enough point pairs for estimation in all of the intervals. A rule of thumb in computing sample variograms is to have at least 30 point pairs in each discrete interval. The reliability of a variogram hinges on having sufficient pairs of observations in each lag (Schabenberger: 1997). However, for plotting and estimation purposes, it is desirable to have as many points as possible for a plot of (h) against h, so there is an important tradeoff between the number of lags and the number of point pairs within each lag.

For every pair of sampled points, the distance between the points is computed, along with the squared difference in the z-values.

Each pair of sampled points is assigned to one of the distance intervals, and the total variance in each interval is accumulated.

After every pair of points has been used, the average variance in each distance interval is computed. This value (h) is then plotted at the midpoint distance of each distance interval h.

This results in a plot that only has as many points as there are distance intervals, with a variogram estimate for each distance interval. The next important step in computing the variogram is to fit a model to these data (a theoretical variogram), which will allow for estimation of the variogram at all possible distances, rather than just the midpoint of each defined distance interval. Geo-statistics literature generally suggests the choice of one of five theoretical models to fit to sample variogram data, each of which has the parameters defined in the plot³ below:

Plot 1: Example of Theoretical Variogram

In a theoretical variogram, h (above, separation distance) represents the distance between two points; a_o represents the range parameter, or the distance at which the upper limit of the variogram (above, semivariance) is reached; c_o represents the nugget effect parameter, or the apparent non-zero intersection of the sample variogram with the y-axis (several factors, like sampling error and short-scale variability, make it is possible for two points very close to each other to have unusual variance in the attribute z, and this is the nugget effect); and c represents the sill parameter, or the upper limit of the variogram.

One of five possibilities for the theoretical variogram is chosen, generally based on visual rules of thumb. If there is no nugget effect, the parameter c_o = 0.

The linear model describes a straight-line variogram: (h) = c_o + [h(c/a_o)]

The linear-to-sill model describes a variogram that appears to be linear and then reaches an abrupt asymptote: (h) = c_o + [h(c/a_o)] for a_o  h (h) = c_o + c for h  a_o

The exponential model describes a variogram that approaches the sill gradually but never converges with the sill (the a_o parameter is used to provide range): (h) = c_o + c[1 - exp (-h/a_o)]

The spherical model is a modified quadratic function that describes a variogram similar to that described by the exponential model which actually reaches an asymptote at the sill:

(h) = c_o + c[1.5(h/a_o) - 0.5(h/a_o)³] for a_o  h (h) = c_o + c for h  a_o

The Gaussian model is similar to the exponential model, and describes a variogram that begins rising slowly from the y-intercept, and then rises more rapidly toward the sill (again, the a_o parameter is used to provide range): (h) = c_o + c[1 - exp (-h²/a_o²)]

Once a theoretical variogram has been fit to the sampled data, the system of equations described in (8) can be solved for the vector of weights (u), and kriging estimates for the attribute z at specific locations may be obtained.

Analysis: An Application of Kriging to Small Area Estimation

Overview

We consider the following small area estimation problem: population counts, as well as various demographic, housing, and geographic data, are available at the census tract level for a given county in the United States, and population estimates are desired for smaller subdivisions of the census tracts known as census block groups. Only geographic data (land area, etc.) are available for the census block groups.

Definitions are of course in order. According to the United States Bureau of the Census⁴, census tracts are defined as “small, relatively permanent statistical subdivisions of a county…Census tracts usually have between 2,500 and 8,000 persons and, when first delineated, are designed to be homogeneous with respect to population characteristics, economic status, and living conditions. Census tracts do not cross county boundaries.” In addition, a census block group (BG) is defined as “a cluster of blocks having the same first digit of their three-digit identifying numbers within a census tract…Geographic BG's never cross census tract boundaries…BG's generally contain between 250 and 550 housing units, with the ideal size being 400 housing units.”

The U.S. Census Bureau currently produces all of the data proposed to be available at the census tract level in the above setting at the census block group level as well, but for the purposes of investigating the effectiveness of kriging in small area estimation, we consider these data to be unavailable for the small areas of interest (the census block groups). The following methodology can then be considered analogous to a variety of situations where data are available for some larger area, but not for smaller subdivisions of the larger area, where estimates for some parameter are desired.

Modeling the Census Block Group Population in Washtenaw County

We begin our investigation of this problem by considering 1990 census data for the county of Washtenaw in the state of Michigan. The 1990 TIGER data sets produced by the U.S. Census Bureau after the completion of the 1990 U.S. census are available publicly over the web, at http://www.esri.com/data/online/tiger/data.html. These TIGER data sets include geographic features such as roads, railroads, rivers, and lakes, political boundaries, statistical boundaries, and demographic attributes for the entire United States, making them very useful for this setting. However, they do not contain the geographical coordinates of the various geographic areas, which are essential for the application of kriging. The SAS code used to merge the TIGER data sets with the coordinate data for the geographic areas of interest⁵ in this study is available in Appendix E.

The images on the following pages, produced using the spatial analysis software ArcView GIS Version 3.1, provide an idea of what the 1990 census tract and census block group divisions of the county of Washtenaw looked like.

Figure 1: 1990 Census Tracts in Washtenaw County

Figure 2: 1990 Census Block Groups in Washtenaw County

The continuous variables below, all considered relevant to the population of a particular area, are available in the aforementioned TIGER data sets for both the Washtenaw census tracts and the Washtenaw census block groups, in addition to population counts⁶:

Land Area (km²) Water Area (km²) %Units Occupied %Units Vacant

%Units Owned %Units Rented Median Unit Value Median Monthly Rent

%Single Detached Units %Single Attached Units %Duplex Units %Apartment Units

%Persons Employed %Persons Unemployed %Persons Not In Med. Household Income

Work Force

Income Per Capita %Children in Poverty %In Poverty

%Units Built < 1970 %Units Built 70-79 %Units Built 80-84 %Units Built > 1984

Here is the 1990 spatial distribution of the variable %Apartment Units across the county of Washtenaw, for both the census tracts and the block groups:

Figure 3: Spatial Distribution of % Apartment Units

Across the 1990 Census Tracts in Washtenaw County

One can see that the majority of the apartments in Washtenaw County are concentrated around the areas of Ann Arbor and Ypsilanti, which are college towns.

Figure 4: Spatial Distribution of % Apartment Units

Across the 1990 Census Block Groups in Washtenaw County

Because these data are all available at the census block group level, it is possible to fit a linear model to the block group data using ordinary regression analysis, with Persons as the response, and produce fitted values for each of the block groups. The accuracy of this model-based technique, using data that are actually available for the small areas, can then be compared with the accuracy of a new kriging-based technique, which will attempt to estimate the small area populations when data are only available for larger areas, or in this setting, when data are only available at the census tract level. In order to compare the accuracies of these two techniques, the following errors will be computed for each census block group in Washtenaw, and the mean and standard deviation (sd) of these errors for each technique will then be derived:

(9) error_model = actual population - predicted population

(10) error_kriging = actual population - estimated population

A linear model of the form described in (1) was fit to the 1990 Washtenaw census block group data, with Persons as the parameter of interest and all other variables as predictors. The technical details of this model fitting process are described in Appendix A and Appendix B. The final model that resulted was

(11) sqrt(Y) = 0.0912x₁ + 31.67x₂ + 0.00002363x₃ +

-0.0001586x₄ + -11.05x₅ + 22.92x₆

where Y is the population of the census block group, x₁ is the area of the block group in square kilometers, x₂ is the % of housing units that are occupied, x₃ is the median value of the housing units, x₄ is the income per capita, x₅ is the % of housing units built between 1980 and 1984, and x₆ is the % of housing units built after 1984.

After back-transforming to get the predicted population values for each block group by taking Y_o², where Y_o was the fitted value for the block group in the above model, we find the following results based on (9):

sd{error_model} = 394.94771

mean{error_model} = 35.40443

Using Kriging for Small Area Estimation

The situation is considered where population data, various demographic, housing, and geographic data, and geographical coordinates (physical longitude and latitude) are all available for some large areas (in this example, these data are available for the Washtenaw census tracts). Population estimates are desired for smaller areas within these larger areas, and the only data that are available for the small areas are physical areas and geographical coordinates. In this example, the geographical coordinates of a particular area are assumed to be the coordinates of some point internal to the area.

1. The first step will be to determine a linear model like that in (11) relating the small areas, in order to establish those variables that are significant predictors of population for the small areas. This model may come from previous investigations, or by fitting a model to data that actually are available for similar small areas. Predictor variables in the model should represent percentages or summary measures describing the small areas, and not counts for the entire areas. The reason for this is as follows.

After determining the appropriate small area model, we will use the available large area data to krig for those variables found to be significant predictors of the parameter of interest over the entire grid representing the collection of the large areas (in this case, Washtenaw county). Then, we will use these kriging estimates to predict the populations at the locations of the small areas, based on the small area model. Kriging is a form of point interpolation, where the estimated value of some attribute at a location near a point where a value for the attribute is given will likely be very similar to the given value.

Suppose a count variable, such as the total number of occupied homes, is a significant predictor of the desired attribute (population) in the small areas. When kriging estimates for this variable are produced over the entire area of interest (Washtenaw County), estimates for the total number of occupied homes in small areas that are within the larger areas where data are available (and thus have geographical coordinates that are extremely close to those of the large areas) will be very similar to the total number of occupied homes for the large areas, which will not make sense. However, suppose a variable such as “% of homes occupied” is a significant predictor of the desired attribute, and we obtain kriging estimates for this variable over the entire area of interest. The kriging estimates for this variable, at geographical coordinates representing small area locations that are extremely close to the geographical coordinates of the large areas, will again be similar to the given values for the large areas, which now makes sense. The same would be true for summary variables, such as median income.

Based on the population model derived in (11), with Persons as the parameter of interest, the following variables were found to be significant predictors of population in the small areas:

Land Area (km²) %Units Occupied Median Unit Value Income Per Capita

%Units Built 1980-84 %Units Built > 1984

2. The second step will be to krig for all of the variables that are significant predictors of the parameter of interest (population), over the entire area of interest (Washtenaw County), using the available large area (tract level) data. This will result in estimated values for each of the variables listed above (except for the area variable, which is assumed to be given for the small areas) at all locations over the entire area of interest. The estimated values of these variables at the locations representing the small areas can then be used, along with the given area values for the small areas, to estimate the populations of the small areas using the pre-determined small area model.

Kriging estimates were all derived using the SAS System, and the procedures variogram and krige2d. Please see Appendix C and Appendix D for the technical details behind the calculations of all sample variograms, theoretical variograms, and kriging estimates, and all of the associated SAS code.

After kriging for all of the continuous variables above (except land area), and then using the kriging estimates in the small area model described in (11) to obtain population estimates for the small areas, we find the following results based on (9):

sd{error_kriging} = 437.7639

mean{error_kriging} = 26.1729

We see a smaller mean error than that found using ordinary regression to predict the small area populations, but a larger standard deviation in the errors. These errors have a distribution that is comparable to that found using the model-based technique.

Discussion

The standard deviation of the errors for the small area estimates derived using the kriging-based technique is understandably larger than that found using the model-based technique of ordinary regression. As indicated in (6), all kriging estimates of an attribute computed at locations over a specific area have their own standard deviation, because they are linear combinations of all of the available data for the attribute, with the weights for each datum derived by solving the system of equations defined in (8). When these kriging estimates are input into a pre-determined model, they are not fixed values but rather the expected values from a series of distributions. Each individual kriging estimate for a particular attribute at a particular location thus in effect brings its own error with it, incorporating additional error into the predictions derived by using the kriging estimates as “new” values in the model relating the small areas.

Despite this fact, the mean and standard deviation of the errors for the small area estimates derived using a kriging-based technique were still comparable to those found using a model-based technique with available small area data. This finding suggests that the kriging-based technique can provide reasonable estimates for small area attributes when model-based techniques cannot be used due to a lack of small area data.

The technique of kriging also produces estimates for a particular attribute at all locations over an area of interest. Kriging was originally designed for geological use, where the areas of interest are often fields or plots that do not have roads, bodies of water, or parks dividing the sampled points. When applying kriging to population estimation, kriging estimates of attributes such as “% of housing units occupied” are computed at all locations over an entire area, including points where there are roads, forests, parks, lakes, etc. Future research into the use of kriging-based or other spatial analytic techniques for small area estimation should investigate the possibility that barriers (i.e. roads, lakes, etc.) between “sampled points” such as census block groups, where population or housing attributes have meaning, might have an effect on the spatial continuity of attributes related to population.

For example, kriging techniques may predict very similar values for the variable “% of housing units occupied” in two census block groups that are very close to each other but divided by a river, when in fact the actual values of this variable may vary significantly due to the presence of the river. Local kriging systems may need to be solved in order to derive kriging estimates of attributes at all points lying between such barriers, if the barriers are found to have a significant effect on the spatial continuity of the desired attributes.

Works Cited

1990 Block Group Descriptions: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/bg_info.html

1990 Census Tract and Block Numbering Areas Descriptions: U.S. Bureau of the Census. http://www.census.gov/geo/www/cob/tr_info.html

Bryan, Thomas. 1999. “Small-Area Population Estimation Technique Using Administrative Records, and Evaluation of Results with Loss Functions and Optimization Criteria.” Federal Committee on Statistical Methodology Research Conference, November 15-17, 1999. Washington, DC: U.S. Bureau of the Census.

Census Block Group Cartographic Boundary Files: U.S. Bureau of the Census.

http://www.census.gov/geo/www/cob/bg.html

Census TIGER Data: ArcData Online. http://www.esri.com/data/online/tiger/data.html

Census Tract Cartographic Boundary Files: U.S. Bureau of the Census.

http://www.census.gov/geo/www/cob/tr.html

Faraway, Julian J. 1999. Practical Regression and Anova using R.

Gamma Design Software. http://www.geostatistics.com/GSWIN/GSWINIsotropic_Variogram_Models.html

Gill, Andrew. 1996. “Kriging Example.”

http://www.maths.adelaide.edu.au/Applied/UA_DAM_FLUIDS/GROUNDWATER/GEOSTATS/KrigEx/krigex.html

Goovaerts, Pierre. 1997. Geostatistics for Natural Resources Evaluation. New York: Oxford Press. Pp. 125-158, 203.

Goovaerts, Pierre. 2001. “Study of scale-dependent correlation structures using factorial kriging analysis.” http://www-personal.engin.umich.edu/~goovaert/pg-suj3.html

Guyaguler, Baris. “Ordinary Kriging.”

http://pangea.stanford.edu/~baris/professional/theoryok.html

Lang, Chao-Yi. “Kriging Interpolation.”

http://www.tc.cornell.edu/Visualization/contrib/cs490-94to95/clang/kriging.html

Rao, J.N.K. 1999. “Current Trends in Sample Survey Theory and Methods.” Pp. 16-22 in Indian Journal of Statistics, vol. 61.

Rao, Mahesh. 1999. “Interpolation.”

http://reserves.library.okstate.edu/geog5343/lec12

SAS Institute Inc. 1999. Online documentation: the KRIGE2D and VARIOGRAM Procedures.

Schabenberger, Oliver. 1997. “Developing Variograms in SAS + Ordinary Kriging.” http://www.cas.vt.edu/schabenb/Spatial.htm

APPENDIX A

Census Block Group Population Model:

Fitting Details and Associated R Code

Washtenaw Census Block Group (1990) Population Modeling

All of the following code was written for the R Software Package for interactive statistical analysis.

> wash <- read.table("c:\\wash1.txt", sep=",", h=T)

#variable selection

> wash2 <- wash[,1:181]

> wash3 <- wash2[,-c(8:58)]

> wash4 <- wash3[,-c(13:26)]

> wash5 <- wash4[,-c(14:19,25:70)]

> wash6 <- wash5[,-c(19,23:44)]

> wash7 <- wash6[,-c(23:30,34:37)]

Variables (Continuous):

State County Tract Blockgroup Land.km Water.km Persons Housing Occupied Vacant Ownr.occ Rent.occ Val.medi Rnt.medi Detached Attached Duplex Apartmnt Employed Unemploy Notinwrk Inc.medn Incprcap Childpov Inpovrty Bltbfr70 Blt.7079 Blt.8084 Bltaft84

#Convert specific variables to proportions:

> wash7$Vacant <- wash7$Vacant / wash7$Housing

> wash7$Ownr.occ <- wash7$Ownr.occ / wash7$Occupied

> wash7$Rent.occ <- wash7$Rent.occ / wash7$Occupied

> wash7$Occupied <- wash7$Occupied / wash7$Housing

> wash7$Detached <- wash7$Detached / wash7$Housing

> wash7$Attached <- wash7$Attached / wash7$Housing

> wash7$Duplex <- wash7$Duplex / wash7$Housing

> wash7$Apartmnt <- wash7$Apartmnt / wash7$Housing

> wash7$Employed <- wash7$Employed / wash7$Persons

> wash7$Unemploy <- wash7$Unemploy / wash7$Persons

> wash7$Notinwrk <- wash7$Notinwrk / wash7$Persons

> wash7$Bltbfr70 <- wash7$Bltbfr70 / wash7$Housing

> wash7$Blt.7079 <- wash7$Blt.7079 / wash7$Housing

> wash7$Blt.8084 <- wash7$Blt.8084 / wash7$Housing

> wash7$Bltaft84 <- wash7$Bltaft84 / wash7$Housing

> wash7$Childpov <- wash7$Childpov / wash7$Persons

> wash7$Inpovrty <- wash7$Inpovrty / wash7$Persons

> washmod <- wash7[,-c(1:4)] #for fitting purposes

#remove bad cases

> washmod <- washmod[-c(66,254),]

> wash7 <- wash7[-c(66,254),]

> washmod1 <- washmod[,-c(4,6,8)]

#remove housing, %vacant, %rent.occ (%occupied + %vacant = 1)

#5-number numerical summary

> summary(washmod1)

Land.km Water.km Persons Occupied

Min. : 0.0120 Min. :0.000 Min. : 28.0 Min. :0.6648

1st Qu.: 0.3680 1st Qu.:0.000 1st Qu.: 720.8 1st Qu.:0.9304

Median : 0.6685 Median :0.000 Median : 999.0 Median :0.9601

Mean : 6.8540 Mean :0.121 Mean :1054.0 Mean :0.9438

3rd Qu.: 2.1520 3rd Qu.:0.000 3rd Qu.:1220.0 3rd Qu.:0.9775

Max. :96.8800 Max. :3.272 Max. :5853.0 Max. :1.0000

Ownr.occ Val.medi Rnt.medi Detached

Min. :0.0000 Min. : 0 Min. : 0.0 Min. :0.0000

1st Qu.:0.2126 1st Qu.: 68330 1st Qu.: 391.3 1st Qu.:0.1695

Median :0.6467 Median : 90200 Median : 476.5 Median :0.6017

Mean :0.5623 Mean :100200 Mean : 487.2 Mean :0.5467

3rd Qu.:0.8838 3rd Qu.:118000 3rd Qu.: 571.3 3rd Qu.:0.9170

Max. :1.0000 Max. :500000 Max. :1001.0 Max. :1.0000

Attached Duplex Apartmnt Employed

Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.0000

1st Qu.:0.003135 1st Qu.:0.00318 1st Qu.:0.01137 1st Qu.:0.4919

Median :0.010070 Median :0.01210 Median :0.17220 Median :0.5380

Mean :0.056940 Mean :0.03666 Mean :0.32210 Mean :0.5416

3rd Qu.:0.033450 3rd Qu.:0.04498 3rd Qu.:0.62250 3rd Qu.:0.6101

Max. :0.679200 Max. :0.31360 Max. :0.99080 Max. :0.8768

Unemploy Notinwrk Inc.medn Incprcap

Min. :0.00000 Min. :0.0000 Min. : 0 Min. : 2415

1st Qu.:0.01022 1st Qu.:0.1599 1st Qu.: 24910 1st Qu.:12870

Median :0.02026 Median :0.2099 Median : 37510 Median :16800

Mean :0.02687 Mean :0.2359 Mean : 38860 Mean :17660

3rd Qu.:0.03485 3rd Qu.:0.2607 3rd Qu.: 50130 3rd Qu.:20250

Max. :0.16410 Max. :0.9911 Max. :111200 Max. :66660

Childpov Inpovrty Bltbfr70 Blt.7079

Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.00000

1st Qu.:0.000000 1st Qu.:0.02024 1st Qu.:0.2634 1st Qu.:0.04358

Median :0.006738 Median :0.05479 Median :0.3877 Median :0.18410

Mean :0.025130 Mean :0.12040 Mean :0.4195 Mean :0.22330

3rd Qu.:0.026300 3rd Qu.:0.14900 3rd Qu.:0.5733 3rd Qu.:0.36030

Max. :0.421100 Max. :0.74070 Max. :1.3570 Max. :0.84270

Blt.8084 Bltaft84

Min. :0.00000 Min. :0.00000

1st Qu.:0.00000 1st Qu.:0.00000

Median :0.02214 Median :0.02094

Mean :0.04213 Mean :0.08072

3rd Qu.:0.06329 3rd Qu.:0.10920

Max. :0.33700 Max. :1.06400

Variables with data indicating that particular cases might be outliers: Persons, Val.medi, Bltbfr70, Bltaft84, Incprcap, Notinwrk, Land.km.

According to ArcView, land cases are not errors in data entry.

> wrk <- (washmod1$Bltbfr70 > 1)

error in data entry.

> washmod1 <- washmod1[-108,]

> wash7 <- wash7[-108,]

Notinwrk values do not appear to be errors in data entry.

Persons values might be errors in data entry. More than most

Census Tracts!

Val.medi may not be an error in data entry.

> (washmod1$Bltaft84 > 1)

Bltaft84 value is an error in data entry.

> wash7 <- wash7[-181,]

> washmod1 <- washmod1[-181,]

Incprcap value may not be an error in data entry.

Fit full model:

> mod <- lm(Persons ~ .,washmod1)

Model Diagnostics

-----------------

> x <- model.matrix(mod)

> lev <- hat(x)

> plot(lev, ylab = "Leverages") #see Plot 1, Appendix B.

> abline(h=2*22/266)

> sum(lev)

[1] 22

> sum <- summary(mod)

> stud <- mod$res / (sum$sig*sqrt(1-lev))

> plot(stud,ylab="Studentized Residuals") #see Plot 2, App. B.

> jack <- stud*sqrt((266-22-1)/(266-22-stud^2))

> plot(jack,ylab="Jackknife Residuals") #see Plot 3, App. B.

> jack[jack > 3]

71 85 151

8.757113 3.755840 7.153207

Check to see if these cases are outliers using a Bonferonni-based test:

> qt(0.05/(266*2),266-22-1)

[1] -3.792889

Two of the cases appear to be outliers! Further investigation yields that these are the two cases with Persons counts above 5,000.

Check for influential observations:

> cook <- stud^2*lev/(22*(1-lev))

> plot(cook,ylab = "Cook Statistics") #see Plot 4, Appendix B.

> cook[cook > 0.15]

12 71 86 151

0.1690986 0.3085052 0.2805020 0.1715737

There appear to be four influential cases with relatively large Cook statistics, two of which were identified as earlier as outliers. Temporarily exclude them from analysis:

> washmod1 <- washmod1[-c(12,70,85,149),]

> wash7 <- wash7[-c(12,70,85,149),]

> dim(washmod1)

[1] 262 22

> dim(wash7)

[1] 262 29

Variable Transformation

-----------------------

> plot(mod$fit, mod$res, xlab = "Fitted Values", ylab = "Residuals") #see Plot 5, Appendix B.

#test for non-constant variance

> summary(lm(abs(mod$res) ~ mod$fit))

Call:

lm(formula = abs(mod$res) ~ mod$fit)

Residuals:

Min 1Q Median 3Q Max

-635.57 -219.80 -58.22 80.81 3381.34

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -90.06772 88.47109 -1.018 0.310

mod$fit 0.40512 0.08035 5.042 8.58e-07 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 394.2 on 264 degrees of freedom

Multiple R-Squared: 0.08783, Adjusted R-squared: 0.08438

F-statistic: 25.42 on 1 and 264 degrees of freedom, p-value: 8.577e-007

VERY strong evidence of non-constant variance in the residuals. Plot 5 suggests a square-root transformation of the count response (Persons).

> mod2 <- lm(sqrt(Persons) ~ ., washmod1)

> plot(mod2$fit, mod2$res, xlab = "Fitted Values", ylab = "Residuals", main = "Square Root Response") #see Plot 6, App. B.

> summary(lm(abs(mod2$res) ~ mod2$fit))

Call:

lm(formula = abs(mod2$res) ~ mod2$fit)

Residuals:

Min 1Q Median 3Q Max

-4.594 -2.669 -0.755 1.349 17.844

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.38401 1.80611 3.535 0.000483 ***

mod2$fit -0.06380 0.05719 -1.116 0.265606

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.838 on 260 degrees of freedom

Multiple R-Squared: 0.004764, Adjusted R-squared: 0.0009365

F-statistic: 1.245 on 1 and 260 degrees of freedom, p-value: 0.2656

There is no longer any evidence of non-constant variance in the residuals after the square-root transformation of the response (Persons).

Variable Selection

------------------

Stepwise backward elimination with a critical alpha value of 0.15 (for prediction performance) results in the following model:

> summary(mod2)

Call:

lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -

Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -

Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1)

Residuals:

Min 1Q Median 3Q Max

-19.3348 -3.5081 -0.2181 3.6621 24.2039

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.300e+01 7.446e+00 1.746 0.081991 .

Land.km 7.794e-02 2.464e-02 3.163 0.001752 **

Occupied 1.959e+01 7.748e+00 2.529 0.012056 *

Val.medi 3.255e-05 9.736e-06 3.343 0.000954 ***

Inc.medn 8.833e-05 3.816e-05 2.315 0.021436 *

Incprcap -3.926e-04 8.437e-05 -4.653 5.27e-06 ***

Bltbfr70 -3.385e+00 2.095e+00 -1.615 0.107473

Blt.8084 -1.320e+01 7.098e+00 -1.860 0.064105 .

Bltaft84 2.032e+01 3.510e+00 5.790 2.07e-08 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 6.029 on 253 degrees of freedom

Multiple R-Squared: 0.3133, Adjusted R-squared: 0.2916

F-statistic: 14.43 on 8 and 253 degrees of freedom, p-value: 0

Check for influential cases:

> sum <- summary(mod2)

> x <- model.matrix(mod2)

> lev <- hat(x)

> sum(lev)

[1] 9

> stud <- mod2$res / (sum$sig*sqrt(1-lev))

> cook <- stud^2*lev/(9*(1-lev))

> plot(cook,main="Cook Statistics after Variable Selection")

#see Plot 7, Appendix B.

> cook[cook > 0.15]

120

0.1631331

Is there a significant change in the model when this case is excluded?

> summary(mod2)

Call:

lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -

Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -

Childpov - Rnt.medi - Apartmnt - Ownr.occ, data = washmod1,

subset = (cook < max(cook)))

Residuals:

Min 1Q Median 3Q Max

-19.4777 -3.5513 -0.2323 3.5122 24.7954

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.024e+01 7.545e+00 1.358 0.17580

Land.km 8.157e-02 2.458e-02 3.318 0.00104 **

Occupied 2.204e+01 7.812e+00 2.822 0.00516 **

Val.medi 2.792e-05 9.980e-06 2.798 0.00554 **

Inc.medn 6.137e-05 4.047e-05 1.517 0.13064

Incprcap -2.961e-04 9.781e-05 -3.027 0.00272 **

Bltbfr70 -2.778e+00 2.108e+00 -1.318 0.18873

Blt.8084 -1.361e+01 7.064e+00 -1.927 0.05515 .

Bltaft84 2.055e+01 3.494e+00 5.883 1.28e-08 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 5.997 on 252 degrees of freedom

Multiple R-Squared: 0.2871, Adjusted R-squared: 0.2644

F-statistic: 12.68 on 8 and 252 degrees of freedom, p-value: 2.665e-015

Two predictors are no longer significant! This case should also be excluded.

> wash7 <- wash7[-115,]

> washmod1 <- washmod1[-115,]

> dim(wash7)

[1] 261 29

> dim(washmod1)

[1] 261 22

Further stepwise backward elimination results in:

> summary(mod2)

Call:

lm(formula = sqrt(Persons) ~ . - Notinwrk - Duplex - Blt.7079 -

Water.km - Inpovrty - Unemploy - Attached - Detached - Employed -

Childpov - Rnt.medi - Apartmnt - Ownr.occ - Bltbfr70 - 1 -

Inc.medn, data = washmod1)

Residuals:

Min 1Q Median 3Q Max

-19.71474 -3.78245 -0.03491 3.49497 25.02833

Coefficients:

Estimate Std. Error t value Pr(>|t|)

Land.km 9.120e-02 2.354e-02 3.874 0.000136 ***

Occupied 3.167e+01 9.984e-01 31.718 < 2e-16 ***

Val.medi 2.363e-05 9.639e-06 2.451 0.014913 *

Incprcap -1.586e-04 6.117e-05 -2.592 0.010084 *

Blt.8084 -1.105e+01 6.720e+00 -1.645 0.101265

Bltaft84 2.292e+01 3.092e+00 7.412 1.83e-12 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 6.02 on 255 degrees of freedom

Multiple R-Squared: 0.9658, Adjusted R-squared: 0.965

F-statistic: 1200 on 6 and 255 degrees of freedom, p-value: 0

We have a very good fit. Plot 8 in Appendix B indicates that the residuals in this model appear to follow a normal distribution.

> dim(washmod1)

[1] 261 22

> dim(wash7)

[1] 261 29

> fit <- mod2$fit

> fit <- as.matrix(fit)

> dim(fit)

[1] 260 1

> wash8$fitted <- fit

> wash8$fitted2 <- (wash8$fitted)^2 #actual fitted values

> dim(wash8)

[1] 260 28

#write out data set with fitted values, included cases

> write.table(wash8,"c:\\washdata.txt",row.names=F,sep=",")

#compute mean, standard deviation of desired errors

> data <- read.table("c:\\washdata.txt",h=T,sep=",")

> data$error <- data$Persons - data$fitted2

> var(data$error)

[1] 155983.7

> mean(data$error)

[1] 35.40443

> sqrt(var(data$error))

[1] 394.9477

APPENDIX B

Census Block Group Population Model:

Diagnostic Plots

Plot 1: Leverages of all cases after fitting the full model

According to the rather arbitrary (2p / n) rule of thumb, several cases appear to have high leverage.

Plot 2: Studentized Residuals (Full Model)

Three cases appear to have unusually large studentized residuals.

Plot 3: Jackknife Residuals (Full Model)

Three cases again appear to have large jackknife residuals.

Plot 4: Cook Statistics (Full Model)

There appear to be four influential cases, each having an unusually large Cook statistic.

Plot 5: Residuals vs. Fitted Values (Full Model)

There is rather strong evidence of non-constant variance in the residuals, in a pattern that suggests a square-root transformation of the response variable Persons.

Plot 6: Residuals vs. Fitted Values AFTER square-root transformation of response

The transformation of the response appears to have cleared up the non-constant variance in the residuals.

Plot 7: Cook Statistics AFTER Variable Selection

There appears to be one rather influential case.

Plot 8: Normal Q-Q Plot, Indicating Normal Residuals in the Final Model

APPENDIX C

Kriging Analysis: Technical Details and Associated SAS Code

Summary of Kriging Analysis

1. Reconstruct the complete 1990 Washtenaw Census Tract data set, in R:

> wash <- read.table("c:\\washt.txt",h=T,sep=",")

> dim(wash)

[1] 81 193

> wash2 <- wash[,-c(173:193)]

> wash3 <- wash2[,-c(7:57)]

> wash4 <- wash3[,-c(12:25,27:32,38:76)]

> wash5 <- wash4[,-c(21:42,44:51,55:58)]

> dim(wash5)

[1] 81 28

> wash5$Vacant <- wash5$Vacant / wash5$Housing

> wash5$Ownr.occ <- wash5$Ownr.occ / wash5$Occupied

> wash5$Rent.occ <- wash5$Rent.occ / wash5$Occupied

> wash5$Occupied <- wash5$Occupied / wash5$Housing

> wash5$Detached <- wash5$Detached / wash5$Housing

> wash5$Attached <- wash5$Attached / wash5$Housing

> wash5$Duplex <- wash5$Duplex / wash5$Housing

> wash5$Apartmnt <- wash5$Apartmnt / wash5$Housing

> wash5$Employed <- wash5$Employed / wash5$Persons

> wash5$Unemploy <- wash5$Unemploy / wash5$Persons

> wash5$Notinwrk <- wash5$Notinwrk / wash5$Persons

> wash5$Bltbfr70 <- wash5$Bltbfr70 / wash5$Housing

> wash5$Blt.7079 <- wash5$Blt.7079 / wash5$Housing

> wash5$Blt.8084 <- wash5$Blt.8084 / wash5$Housing

> wash5$Bltaft84 <- wash5$Bltaft84 / wash5$Housing

> wash5$Childpov <- wash5$Childpov / wash5$Persons

> wash5$Inpovrty <- wash5$Inpovrty / wash5$Persons

> write.table(wash5,"c:\\prekrig.txt",row.names=F,sep=",")

2. Merge the Census Tract data set with the file containing the locations of all of the Census Tracts (SAS code):

data wash;

infile "c:\prekrig.txt" dlm = ",";

input State County Tract Land_km Water_km Persons Housing Occupied Vacant Ownr_occ Rent_occ Val_medi Rnt_medi Detached Attached Duplex Apartmnt Employed Unemploy Notinwrk Inc_medn Incprcap Childpov Inpovrty Bltbfr70 Blt_7079 Blt_8084 Bltaft84;

run;

data locs;

infile "c:\tractloc2.txt" dlm = ",";

input obs stcty tract long lat;

run;

proc sort data = locs;

by tract;

run;

proc sort data = wash;

by tract;

run;

data last;

merge wash (in=ina) locs (in=inb);

by tract;

if ina and inb;

run;

libname brady "h:\thesis";

data brady.washkrig;

set last;

run;

3. Use the given Census Tract data to krig for the significant predictors of population in the small area model across all of Washtenaw County, using proc variogram and proc krige2d in SAS:

Variable: % Housing Units Occupied

----------------------------------

(see all plots for % Housing Units Occupied in Appendix D)

*create individual variable subset:

data washocc;

set brady.washkrig (keep = Occupied long lat);

run;

*plot measurement locations (see Plot 1 in Appendix D):

proc gplot data = washocc;

title 'Scatter Plot of Measurement Locations';

plot lat*long / frame cframe = ligr haxis = axis1 vaxis = axis2;

symbol1 v = dot color = blue;

axis1 minor = none;

axis2 minor = none label = (angle = 90 rotate = 0);

label lat = 'Latitude'

long = 'Longitude';

run;

quit;

*look at 3d surface plot of variable:

proc g3d data = washocc;

title 'Surface Plot of Variable Measurements';

scatter lat*long=Occupied / xticknum = 5 yticknum = 5

grid zmin = 0.8 zmax = 1.0;

label long = 'Longitude'

lat = 'Latitude'

Occupied = '% Housing Units Occupied';

run;

quit;

*look at the distribution of the pairwise distances between the tracts:

(see Plot 2 in Appendix D)

proc variogram data = washocc outdistance = outd;

compute nhc = 20 novariogram;

coordinates xc = long yc = lat;

var Occupied;

run;

title 'Distance Intervals';

proc print data = outd;

run;

data outd;

set outd;

mdpt = round((lb+ub) / 2,0.001);

label mdpt = 'Midpoint of Interval';

run;

axis1 minor = none;

axis2 minor = none label = (angle = 90 rotate = 0);

title 'Distribution of Pairwise Distances';

proc gchart data = outd;

vbar mdpt / type = sum sumvar = count discrete frame

cframe = ligr gaxis = axis1 raxis = axis2 nolegend;

run;

quit;

*look at the lower bound of distance interval where there are still sufficient (more than 30) pairs of points, and divide this number by the apparent lag distance (LAGD) to yield MAXLAGS. Plot 2 in Appendix D suggests setting MAXLAGS = 15, and lagd = 0.03. Compute and plot standard and robust semivariograms.

proc variogram data = washocc outv = outv;

compute lagd = 0.03 maxlag = 15 robust;

coordinates xc = long yc = lat;

var Occupied;

run;

title 'Variogram Results';

proc print data = outv label;

var lag count distance variog rvario;

run;

data outv2;

set outv;

vari = variog;

type = 'regular';

output;

vari = rvario;

type = 'robust';

output;

run;

title 'Standard and Robust Semivariogram for %Occupied Data';

proc gplot data = outv2;

plot vari*distance=type / frame cframe = ligr vaxis = axis2

haxis = axis1;

symbol1 i = join l = 1 c = blue v = star;

symbol2 i = join l = 1 c = black v = square;

axis1 minor = none label = (c = black 'Lag Distance');

axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');

run;

quit;

*fit an appropriate theoretical variogram to the data, and plot it against the standard and robust semivariograms:

data outv3;

set outv;

c0 = 0.0012; c = 0.0022; a0 = 0.50;

vari = c0 + c * (1 - exp(-distance * distance / (a0 * a0)));

type = 'Gaussian';

output;

vari = variog; type = 'regular'; output;

vari = rvario; type = 'robust'; output;

run;

title 'Theoretical and Sample Semivariogram for %Occupied Data';

proc gplot data = outv3;

plot vari*distance=type / frame cframe = ligr vaxis = axis2

haxis = axis1;

symbol1 i = join l = 1 c = blue v = star;

symbol2 i = join l = 1 c = black v = square;

symbol3 i = join l = 1 c = red v = diamond;

axis1 minor = none label = (c = black 'Lag Distance');

axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');

run;

quit;

*krig using the theoretical variogram, and plot the resulting estimates:

proc sort data = washocc;

by long;

run;

proc sort data = washocc;

by lat;

run;

proc krige2d data = washocc outest = est;

pred var = Occupied;

model nugget = 0.0012 scale = 0.0022 range = 0.50 form = gauss;

coord xc = long yc = lat;

grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by 0.01;

run;

proc g3d data = est;

title 'Surface Plot of Kriged Estimates for %Occupied';

scatter gyc*gxc = estimate / grid;

label gyc = 'Latitude'

gxc = 'Longitude'

estimate = '%Occupied';

run;

4. Merge the kriging estimates with the final Census Block Group data set:

data est2 (keep = long lat koccu);

set est;

long = gxc;

lat = gyc;

long = round(long, 0.01);

lat = round(lat, 0.01);

koccu = estimate;

run;

data block;

set brady.krigdata; *copy of brady.washdata

long = round(long, 0.01);

lat = round(lat, 0.01);

run;

proc sort data = est2;

by long lat;

run;

proc sort data = block;

by long lat;

run;

data final;

merge block (in = ina) est2 (in = inb);

by long lat;

if ina and inb;

run;

*continue to append brady.krigdata, so that the final data set will contain both actual data and kriging estimates for all of the significant predictors of population for each of the Census Block Groups:

data brady.krigdata;

set final;

run;

5. Continue to krig for all other significant predictors of population in the small area model, following the same methodology outlined above:

Variable: Median Housing Unit Value

-----------------------------------

(see all plots for Median Housing Unit Value in Appendix D)

*fit theoretical variogram:

data outv3;

set outv;

c = 3500000000; a0 = 0.06;

if (distance <= 0.06) then do;

vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);

type = 'Spherical';

end;

else if (distance > 0.06) then do;

vari = c;

type = 'Spherical';

end;

output;

vari = variog; type = 'regular'; output;

vari = rvario; type = 'robust'; output;

run;

*krig using the theoretical variogram, and plot the resulting estimates:

proc krige2d data = wash_med outest = est;

pred var = Val_medi;

model scale = 3500000000 range = 0.06 form = spherical;

coord xc = long yc = lat;

grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by

0.01;

run;

proc g3d data = est;

title 'Surface Plot of Kriged Estimates for Med.Value';

scatter gyc*gxc = estimate / grid;

label gyc = 'Latitude'

gxc = 'Longitude'

estimate = 'Median Unit Value';

run;

Variable: % Units Built After 1984

----------------------------------

(see all plots for % Units Built After 1984 in Appendix D)

*fit theoretical variogram:

data outv3;

set outv;

c = 0.015; a0 = 0.10;

if (distance <= 0.10) then do;

vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);

type = 'Spherical';

end;

else if (distance > 0.10) then do;

vari = c;

type = 'Spherical';

end;

output;

vari = variog; type = 'regular'; output;

vari = rvario; type = 'robust'; output;

run;

title 'Theoretical and Sample Semivariogram for %BltAft84 Data';

proc gplot data = outv3;

plot vari*distance=type / frame cframe = ligr vaxis = axis2

haxis = axis1;

symbol1 i = join l = 1 c = blue v = star;

symbol2 i = join l = 1 c = black v = square;

symbol3 i = join l = 1 c = red v = diamond;

axis1 minor = none label = (c = black 'Lag Distance');

axis2 minor = none label = (angle = 90 rotate = 0 c = black

'Variogram');

run;

quit;

*krig using the theoretical variogram, and plot the resulting estimates:

proc krige2d data = wash84 outest = est;

pred var = Bltaft84;

model scale = 0.015 range = 0.10 form = sph;

coord xc = long yc = lat;

grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by

0.01;

run;

proc g3d data = est;

title 'Surface Plot of Kriged Estimates for %BltAft84';

scatter gyc*gxc = estimate / grid;

label gyc = 'Latitude'

gxc = 'Longitude'

estimate = '% Units Built After 1984';

run;

*zero out any negative kriging estimates:

data final;

set final;

if (kaft84 < 0) then kaft84 = 0;

run;

Variable: Income Per Capita

---------------------------

(see all plots for Income Per Capita in Appendix D)

data washinc;

set brady.washkrig (keep = Incprcap long lat);

run;

proc g3d data = washinc;

title 'Surface Plot of Variable Measurements';

scatter lat*long=Incprcap / xticknum = 5 yticknum = 5

grid zmin = 4120 zmax = 59400;

label long = 'Longitude'

lat = 'Latitude'

Incprcap = 'Income Per Capita';

run;

quit;

proc variogram data = washinc outv = outv;

compute lagd = 0.03 maxlag = 15 robust;

coordinates xc = long yc = lat;

var Incprcap;

run;

title 'Variogram Results';

proc print data = outv label;

var lag count distance variog rvario;

run;

data outv2;

set outv;

vari = variog;

type = 'regular';

output;

vari = rvario;

type = 'robust';

output;

run;

title 'Standard and Robust Semivariogram for IncPrCap Data';

proc gplot data = outv2;

plot vari*distance=type / frame cframe = ligr vaxis = axis2

haxis = axis1;

symbol1 i = join l = 1 c = blue v = star;

symbol2 i = join l = 1 c = black v = square;

axis1 minor = none label = (c = black 'Lag Distance');

axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');

run;

quit;

*fit theoretical variogram:

data outv3;

set outv;

c = 75000000; a0 = 0.03;

if (distance <= 0.03) then do;

vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);

type = 'Spherical';

end;

else if (distance > 0.03) then do;

vari = c;

type = 'Spherical';

end;

output;

vari = variog; type = 'regular'; output;

vari = rvario; type = 'robust'; output;

run;

title 'Theoretical and Sample Semivariogram for IncPrCap Data';

proc gplot data = outv3;

plot vari*distance=type / frame cframe = ligr vaxis = axis2

haxis = axis1;

symbol1 i = join l = 1 c = blue v = star;

symbol2 i = join l = 1 c = black v = square;

symbol3 i = join l = 1 c = red v = diamond;

axis1 minor = none label = (c = black 'Lag Distance');

axis2 minor = none label = (angle = 90 rotate = 0 c = black

'Variogram');

run;

quit;

*krig using the theoretical variogram, and plot the resulting estimates:

proc krige2d data = washinc outest = est;

pred var = Incprcap;

model scale = 75000000 range = 0.03 form = sph;

coord xc = long yc = lat;

grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by

0.01;

run;

proc g3d data = est;

title 'Surface Plot of Kriged Estimates for IncPrCap';

scatter gyc*gxc = estimate / grid;

label gyc = 'Latitude'

gxc = 'Longitude'

estimate = 'Income Per Capita';

run;

Variable: % Units Built Between 1980-1984

-----------------------------------------

(see all plots for % Units Built Between 1980-1984 in Appendix D)

data wash8084;

set brady.washkrig (keep = Blt_8084 long lat);

run;

proc g3d data = wash8084;

title 'Surface Plot of Variable Measurements';

scatter lat*long=Blt_8084 / xticknum = 5 yticknum = 5

grid zmin = 0 zmax = 0.1375;

label long = 'Longitude'

lat = 'Latitude'

Blt_8084 = '% Units Built 1980-1984';

run;

quit;

proc variogram data = wash8084 outv = outv;

compute lagd = 0.03 maxlag = 15 robust;

coordinates xc = long yc = lat;

var Blt_8084;

run;

title 'Variogram Results';

proc print data = outv label;

var lag count distance variog rvario;

run;

data outv2;

set outv;

vari = variog;

type = 'regular';

output;

vari = rvario;

type = 'robust';

output;

run;

title 'Standard and Robust Semivariogram for %Blt8084 Data';

proc gplot data = outv2;

plot vari*distance=type / frame cframe = ligr vaxis = axis2

haxis = axis1;

symbol1 i = join l = 1 c = blue v = star;

symbol2 i = join l = 1 c = black v = square;

axis1 minor = none label = (c = black 'Lag Distance');

axis2 minor = none label = (angle = 90 rotate = 0 c = black 'Variogram');

run;

quit;

*fit theoretical variogram:

data outv3;

set outv;

c = 0.00155; a0 = 0.05;

if (distance <= 0.05) then do;

vari = c * (1.5 * (distance / a0) - 0.5 * (distance / a0)**3);

type = 'Spherical';

end;

else if (distance > 0.05) then do;

vari = c;

type = 'Spherical';

end;

output;

vari = variog; type = 'regular'; output;

vari = rvario; type = 'robust'; output;

run;

title 'Theoretical and Sample Semivariogram for %Blt8084 Data';

proc gplot data = outv3;

plot vari*distance=type / frame cframe = ligr vaxis = axis2

haxis = axis1;

symbol1 i = join l = 1 c = blue v = star;

symbol2 i = join l = 1 c = black v = square;

symbol3 i = join l = 1 c = red v = diamond;

axis1 minor = none label = (c = black 'Lag Distance');

axis2 minor = none label = (angle = 90 rotate = 0 c = black

'Variogram');

run;

quit;

*krig using the theoretical variogram, and plot the resulting estimates:

proc krige2d data = wash8084 outest = est;

pred var = Blt_8084;

model scale = 0.00155 range = 0.05 form = sph;

coord xc = long yc = lat;

grid x = -84.0710 to -83.5600 by 0.01 y = 42.0885 to 42.4118 by

0.01;

run;

proc g3d data = est;

title 'Surface Plot of Kriged Estimates for %Blt8084';

scatter gyc*gxc = estimate / grid;

label gyc = 'Latitude'

gxc = 'Longitude'

estimate = '% Units Built 1980-1984';

run;

6. Using the final data set constructed after all kriging estimates have been derived, obtain population estimates at the locations representing the small areas by entering the kriging estimates of the significant predictors for each small area into the model relating the small areas:

data test;

set brady.krigdata;

estimate = 0.0912 * land_km + 31.67 * koccu + 0.00002363 * kmedv + -0.0001586 * kinc + -11.05 * k8084 + 22.92 * kaft84;

est2 = estimate * estimate;

error = Persons - est2;

run;

*compute the mean and standard deviation of the desired errors:

proc means data = test;

var error;

run;

data brady.krigdata;

set test;

run;

APPENDIX D

Kriging Plots

1990 Washtenaw Census Tract Data: Location Summary

Plot 1: Scatter Plot of all Measurement Locations for Tracts in Washtenaw County

Plot 2: Distribution of Pairwise Distances Between Tracts

Variable: % Housing Units Occupied

Variable: Median Housing Unit Value

Variable: Income Per Capita

Variable: % Units Built Between 1980-1984

Variable: % Units Built After 1984

APPENDIX E

Additional SAS Code

SAS code used to create the data file that includes the locations of each Census Block Group in Washtenaw County:

data bgid;

infile "c:\bg26_d90_pa.dat";

input number : 4. county $char5. tract 4. block 3.;

run;

data bgid2;

set bgid;

if county = '26161';

run;

data coords;

infile "c:\bg26_d90_p.dat" lrecl = 70;

input number 1-10 long 17-23 lat 46-51;

run;

data coords2;

set coords;

if number ne -0.8 and number ne -0.9;

run;

proc sort data = bgid2;

by number;

run;

proc sort data = coords2;

by number;

run;

data final;

merge bgid2 (in=ina) coords2(in=inb);

by number;

if ina;

run;

data temp;

set final;

file "c:\bgloc2.txt";

put @1 number @5 "," @6 county @11 "," @12 tract @16 ","

@17 block @18 "," @19 long @26 "," @27 lat;

run;

SAS code used to create the data file that includes the locations of each Census Tract in Washtenaw County:

data tractid;

infile "c:\tr26_d90_pa.dat";

input number : 4. county $char5. tract 4.;

run;

data tractid2;

set tractid;

if county = '26125' or county = '26161' or county = '26163';

run;

data coords;

infile "c:\tr26_d90_p.dat" lrecl = 46;

input number 1-10 long 11-28 lat 29-46;

run;

data coords2;

set coords;

if number ne -8 and number ne -9;

run;

proc sort data = tractid2;

by number;

run;

proc sort data = coords2;

by number;

run;

data final;

merge tractid2 (in=ina) coords2(in=inb);

by number;

if ina;

run;

data temp;

set final;

file "c:\tractloc2.txt";

put @1 number @4 "," @5 county @10 "," @11 tract @15 "," @16 long @26 "," @27 lat;

run;

SAS code used to create the master data set, including all variables for each Census Block Group (demographics, population and housing characteristics, latitude and longitude coordinates, etc.):

data wash;

infile "c:\washdata.txt" dlm = ",";

input state county tract block land_km water_km persons housing occupied vacant ownr_occ rent_occ val_medi rnt_medi detached attached duplex apartmnt employed unemploy notinwrk inc_medn incprcap childpov inpovrty bltbfr70 blt_7079 blt_8084 bltaft84 fitted fitted2;

run;

data locs;

infile "c:\bgloc2.txt" dlm = ",";

input number county tract block long lat;

run;

proc sort data = locs;

by tract block;

run;

proc sort data = wash;

by tract block;

run;

data last;

merge wash (in=ina) locs (in=inb);

by tract block;

if ina and inb;

run;

libname brady "h:\thesis";

data brady.washdata;

set last;

long = long * 100;

lat = lat * 100;

run;

¹ Taken from “http://reserves.library.okstate.edu/geog5343/lec12/sld001.htm”

² From: Rao, J.N.K. 1999. Current Trends in Sample Survey Theory and Methods. Indian Journal of Statistics 61: 16-22.

³ Taken from “http://www.geostatistics.com/GSWin/GSWINGaussian_Isotropic_Model.html”

⁴ United States Bureau of the Census - Geographic Area Descriptions: http://www.census.gov/geo/www/cob/tr_info.html; http://www.census.gov/geo/www/cob/bg_info.html

⁵ Latitude and longitude coordinate data for the census tracts and block groups in the county of Washtenaw were obtained from http://www.census.gov/geo/www/cob/tr.html and http://www.census.gov/geo/www/cob/bg.html.

⁶ Note that “Units” refers to the total number of housing units in the given area. Hence “% Apartment Units” refers to the proportion of housing units that are apartments.

Spatial Analysis of a Small Area Problem

Figure 2: 1990 Census Block Groups in Washtenaw County

APPENDIX A

Plot 1: Leverages of all cases after fitting the full model

Plot 8: Normal Q-Q Plot, Indicating Normal Residuals in the Final Model

APPENDIX C

Kriging Plots

1990 Washtenaw Census Tract Data: Location Summary

Plot 1: Scatter Plot of all Measurement Locations for Tracts in Washtenaw County

Plot 2: Distribution of Pairwise Distances Between Tracts

Variable: % Housing Units Occupied

Variable: Median Housing Unit Value

Variable: Income Per Capita

Variable: % Units Built Between 1980-1984

Variable: % Units Built After 1984

APPENDIX E

Recent Documents:

Recent Search: