Fundamentals of GIS Application in Exposure Assessment

Using GIS in exposure assessment for epidemiologic studies requires knowledge and expertise in at least three core scientific areas: geospatial sciences, environmental sciences, and epidemiology.

Geospatial Science

For a GIS to accurately represent occurrences on the earth's surface, the location of data must be reliable, accurate, and pertinent (Falbo et al. 1991). Geospatial science is the systematic study of geographic variables relating to, occupying, or having the character of space. Fundamental elements of geospatial sciences relevant to GIS applications in exposure assessment include data representation, scale, and accuracy.

Data representation is the format of the unit of analysis used in the GIS. The most commonly used representations of space in a GIS are the raster and vector data models. In the raster model, grid cells serve as the basic units of analysis; pixels of remotely sensed satellite imagery are one example. The vector model uses points, lines, or polygons based on continuous geometry of space to represent data. Other, more specialized data models are available in most GIS software. For example, the triangulated irregular network (TIN) model provides an efficient means of representing elevation data and is often used for terrain analysis. GIS software packages contain algorithms for translating between formats (e.g., raster → vector, vector → raster, point → TIN), although these data transformations may introduce some error. More complete information on data models can be found in textbooks such as those by Chrisman (2002) and DeMers (2000).

Selection of scale is perhaps the most important factor in creating and analyzing GIS databases for exposure assessment and epidemiology. The scaling factors most likely to be encountered in an epidemiology study are defined below:

Cartographic scale: The traditional map scale ratio relates the size of a feature on the map to its size on the ground. This is the scale normally listed on a road map. The scale selected determines the amount of detail that can be shown, including roads, water bodies, and land use patterns.

Geographic extent: Refers to the size of the study area; for example, a study can be regional or global in scale. The extent of the study area and/or its subsets can affect the analysis results (e.g., different results might be obtained when looking at cancer incidence in one state or province versus nationwide).

Spatial resolution: Refers to the grain, or smallest unit, that is distinguishable. Map data at different scales will allow for resolution of different objects. For example, a house site represented on a 1:24,000-scale map would not appear on a 1:100,000-scale map. In remotely sensed imagery, resolution is directly related to the pixel size, the area on the ground from which the radiances are integrated. Lower-resolution data (1-km pixels) may be less useful than higher-resolution Landsat data (30-m pixels) for some environmental health studies.

Operational scale: Refers to the scale at which the process of interest occurs. For example, contaminant transport may occur at a small or large scale. Processes can be resolution dependent; that is, they can be detected at one scale but not another.

Homogeneity and heterogeneity of spatial data are affected by scale, and the scale chosen may affect the ability of the study to detect a relationship between the environmental exposure and the health outcome. A brief sketch below illustrates how spatial resolution interacts with the vector and raster data models.
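As a concrete illustration of the raster and vector models and of spatial resolution, the following sketch (a minimal example using numpy, with hypothetical coordinates and cell sizes; not a prescribed GIS workflow) rasterizes a set of vector points onto grids of two different resolutions. A pair of features that are distinguishable at a Landsat-like 30-m cell size collapse into a single cell at a 1-km cell size.

```python
import numpy as np

def rasterize_points(points, cell_size, extent=1000.0):
    """Convert vector point data to a raster grid of counts.

    points: (x, y) coordinates in meters (vector model)
    cell_size: grid cell edge length in meters (raster resolution)
    """
    n = max(int(extent / cell_size), 1)
    grid = np.zeros((n, n), dtype=int)
    for x, y in points:
        col = min(int(x / cell_size), n - 1)
        row = min(int(y / cell_size), n - 1)
        grid[row, col] += 1
    return grid

# Two hypothetical house sites 40 m apart (vector representation).
houses = [(500.0, 500.0), (540.0, 500.0)]

fine = rasterize_points(houses, cell_size=30.0)      # Landsat-like 30-m pixels
coarse = rasterize_points(houses, cell_size=1000.0)  # 1-km pixels

# At 30-m resolution the two houses fall in separate cells;
# at 1-km resolution they collapse into a single occupied cell.
print((fine > 0).sum())    # 2 distinguishable features
print((coarse > 0).sum())  # 1
```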
The scale issue is similar to the modifiable areal unit problem, a term introduced by Openshaw and Taylor (1979) that has long been recognized as an issue in the analysis of aggregated data such as disease incidence rates and census enumeration (Fotheringham and Wong 1991; Holt et al. 1996). For example, studies of disease incidence reported at the county level require the environmental data to be aggregated to an exposure metric at the same resolution. Such aggregation may obscure intracounty variation in exposure (operational scale) and thus the relationship between the target contaminant and the disease.

Accuracy can be defined as how well the GIS data represent reality in terms of positional, attribute, and temporal accuracy. Positional accuracy relates to the agreement between the data representation in the GIS and the actual location of the data, or "ground truth." Attribute accuracy is a measure of how well the information linked to the data representation is correct (e.g., is the line segment tagged with the correct street information?). Temporal accuracy concerns the appropriateness of using a particular snapshot or snapshots of time for a particular GIS-based analysis or modeling effort. For example, temporal accuracy would reflect how well a single-year crop map represents proximity to pesticide use for exposure assessment of a particular disease outcome.

Errors in GIS can be categorized as source errors or processing errors. Source errors relate to the accuracy of the data per se, that is, the differences between the data in the GIS and reality. For example, geocoding is often used to estimate the location of residences and pollutant sources; however, the positional error generated at this first step in the exposure assessment process is rarely evaluated. A study by Krieger et al. (2001) compared geocoding firms and found widely varying geocoding success rates as well as large differences in the accuracy of census tract assignment. The positional accuracy of geocoded addresses in epidemiology studies was evaluated in a breast cancer case–control study in western New York (Bonner et al. 2003) and in a non-Hodgkin lymphoma case–control study in Iowa (Ward et al., in press). The positional errors were comparable in the two studies; the majority of homes were geocoded to within 100 meters of their location determined by GPS. However, positional errors were greater for homes outside the large metropolitan areas (Bonner et al. 2003), and rural addresses in Iowa had a median positional error of around 200 meters (Ward et al., in press).

Processing errors can be introduced into the database as a result of GIS-based analysis and modeling. For each layer of data combined in a GIS analysis, additional uncertainty will be introduced into the analysis because of error propagation. Beyea and Hatch (1999) provide an in-depth discussion of uncertainty in GIS-based exposure modeling. The two sketches that follow illustrate the aggregation and positional-error issues.
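To make the modifiable areal unit problem concrete, the following sketch (an illustrative simulation with made-up numbers, not data from any cited study) aggregates the same simulated exposure and disease surfaces to areal units of different sizes and shows that the observed exposure–disease correlation depends on the unit chosen.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a fine 60 x 60 grid of "tract-level" exposure, plus a disease
# rate that depends on exposure with local noise.
exposure = rng.gamma(shape=2.0, scale=1.0, size=(60, 60))
disease = 0.5 * exposure + rng.normal(0.0, 1.5, size=(60, 60))

def aggregate(field, block):
    """Average a 2-D field over square blocks of side `block` cells."""
    n = field.shape[0] // block
    return field[:n * block, :n * block].reshape(n, block, n, block).mean(axis=(1, 3))

for block in (1, 6, 20):  # 1 = original units; larger = coarser "counties"
    e = aggregate(exposure, block).ravel()
    d = aggregate(disease, block).ravel()
    r = np.corrcoef(e, d)[0, 1]
    print(f"unit = {block:2d} x {block:2d} cells: correlation = {r:.2f}")
```

The point is not the particular correlations printed, but that the same underlying data yield different apparent strengths of association depending on the areal unit of aggregation.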
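Similarly, the effect of geocoding positional error on a proximity-based exposure metric can be explored by Monte Carlo simulation. The sketch below (hypothetical residences, source location, buffer distance, and error distribution; the 100-m and 200-m medians merely echo the magnitudes reported above) perturbs each geocoded residence by a random positional error and records how often its classification relative to a 500-m buffer around a pollutant source flips.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical geocoded residences (x, y in meters) and one point source.
residences = rng.uniform(0, 2000, size=(500, 2))
source = np.array([1000.0, 1000.0])
BUFFER_M = 500.0  # proximity metric: "exposed" if within 500 m of source

def exposed(points):
    return np.linalg.norm(points - source, axis=1) <= BUFFER_M

baseline = exposed(residences)

for median_error in (100.0, 200.0):  # urban-like vs. rural-like positional error
    frac_flipped = 0.0
    n_sims = 1000
    for _ in range(n_sims):
        # Random direction, Rayleigh-distributed error with the given median.
        sigma = median_error / np.sqrt(2 * np.log(2))
        r = rng.rayleigh(sigma, size=len(residences))
        theta = rng.uniform(0, 2 * np.pi, size=len(residences))
        jitter = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
        frac_flipped += (exposed(residences + jitter) != baseline).mean()
    print(f"median error {median_error:.0f} m: "
          f"{100 * frac_flipped / n_sims:.1f}% of residences misclassified on average")
```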
Environmental Science

Environmental science is the systematic study of the complex of physical, chemical, and biotic factors that act upon an organism or an ecologic community and ultimately determine its form and survival. It can include circumstances, objects, or conditions by which an organism or community is surrounded, as well as the aggregate of social and cultural conditions that influence the life of an individual or community. Fundamental elements of environmental science relevant to GIS applications in exposure assessment include measurement data and predictive algorithms for the fate and transport of chemical compounds in the environment.

Environmental science studies rely heavily on measurement data for the factors that influence life. Institutions in almost every country in the world, such as the U.S. Environmental Protection Agency (U.S. EPA), have been established with a primary mission of collecting and analyzing environmental samples to understand the impact of these factors on the health of the earth's ecosystem. As a result, an abundance of measurement data concerning the chemical composition of air and water resources is available to environmental epidemiology studies. A basic principle in the environmental sciences is that measurement data should be used within the bounds of the purpose for which the sample was collected. Often this purpose is to define regional or systematic trends in environmental quality at a scale and resolution that may not be adequate for epidemiologic studies, especially studies of individuals. For example, public water utilities operating in the United States with a service population > 10,000 are required by federal law to report levels of certain byproducts of the disinfection process to the U.S. EPA. Most utilities meet this requirement by taking four samples at different locations in their water distribution system every 3 months. Although this sampling design may be sufficient to demonstrate compliance with the law, it may not adequately capture the spatial and temporal variability in exposure necessary to classify exposure for individuals using the water.

Environmental scientists often use computer-based simulation models to supplement measurement data in environmental studies. These models are generally composed of mathematical algorithms designed to predict the interactions between, and the effects of, the complex factors acting on an organism or ecologic community. The models can be stochastic (based on statistical probability) or deterministic (based on physical processes). In either case the models depend on measurement data for calibration of the predictive algorithms and validation of the predicted results. A fundamental rule in environmental modeling is not to transfer a model from one geographic region to another without validating it with measurement data from the new study area; often such a transfer will require recalibration of the model as well. It is also a general rule in environmental modeling to reserve a statistically sufficient portion of the available measurement data for model validation, as the sketch below illustrates. Caution should also be employed in using a model at a spatial scale or temporal pattern for which it was not designed. A number of textbooks address environmental science and modeling (Clark 1996; Crawford-Brown 2001).
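As a minimal illustration of calibrating a deterministic model and reserving measurement data for validation, the sketch below uses synthetic monitoring data and a deliberately simple distance-decay model; the functional form, sample sizes, and split are all assumptions for illustration, not any agency's method.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic monitoring data: contaminant concentration declining with
# distance from a source, plus measurement noise.
distance = rng.uniform(10, 2000, size=120)            # meters from source
true_conc = 50.0 * np.exp(-distance / 600.0)          # hidden "truth"
measured = true_conc + rng.normal(0, 2.0, size=120)   # field samples

# Reserve a portion of the measurement data for validation.
idx = rng.permutation(len(distance))
cal, val = idx[:80], idx[80:]

# Calibrate a log-linear distance-decay model on the calibration set only.
log_conc = np.log(np.clip(measured[cal], 0.1, None))
slope, intercept = np.polyfit(distance[cal], log_conc, 1)

def predict(d):
    return np.exp(intercept + slope * d)

# Validate on the reserved samples the model never saw during calibration.
rmse = np.sqrt(np.mean((predict(distance[val]) - measured[val]) ** 2))
print(f"fitted decay length ~ {-1 / slope:.0f} m; validation RMSE = {rmse:.2f}")
```

Transferring this fitted model to another study area would, per the rule above, require comparing its predictions against new local measurements, and likely refitting the slope and intercept.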
"Geophysical plausibility" is the term we have coined for the application of environmental science to exposure assessment for epidemiology. In the simplest terms, this axiom dictates that an association between a contaminant source and exposure of an organism or ecologic community cannot exist unless there is a plausible geophysical route of transport for the contaminant between the source and the receptor. For example, assume we are conducting a study of drinking water as the sole source of exposure to a specific contaminant and a disease outcome. If a landfill is leaching the contaminant into a groundwater resource (aquifer) in our study area, but our study population has always used another water supply source with no geophysical connectivity to the aquifer, it is implausible that the contaminant from the landfill is causing the adverse health outcome through a drinking water route of exposure. This axiom is particularly relevant when GIS-based processing functions (e.g., kriging of measurement data) are used to develop exposure estimates in environmental epidemiology studies.

Epidemiology

The fundamental guidelines for the design of an environmental epidemiology study are relevant whether or not GIS technology is being used for exposure assessment. A well-designed epidemiologic study takes into account potential confounding factors, including other exposures that may co-occur with the exposure of interest. The study should be designed to have adequate power to detect an association between the exposure and the health outcome and to evaluate exposure–response relationships. For many environmental exposures the anticipated magnitude of the association with disease is likely to be modest; therefore, a careful evaluation of the expected prevalence of exposure is critical to determining adequate study power. A GIS can be used to evaluate the population potentially exposed and to determine whether there is likely to be adequate variation in exposure across a study area. Wartenberg et al. (1993) used a GIS to develop an automated method for identifying populations living near high-voltage power lines for the purpose of evaluating childhood leukemia and electromagnetic radiation. Another example is the use of a GIS to link disease registry information with public water supply monitoring and location data to determine potential study areas for evaluating the relation between disinfection byproduct exposure and adverse reproductive outcomes and cancer (Raucher et al. 2000).

The epidemiologic study should have the capability to evaluate the exposure in relation to an appropriate latency for the disease and to evaluate critical time windows of exposure. One limitation of a GIS is that mapped data often represent only one snapshot in time. However, several recent efforts have used GIS to reconstruct historical exposure to pesticides (Brody et al. 2002) and drinking water contaminants (Swartz et al. 2003) over a period of decades for a study of breast cancer on Cape Cod, Massachusetts. A study of fetal death in California (Bell et al. 2001) used an exposure metric based on agricultural pesticide use near the mother's residence during specific time periods of the pregnancy; a sketch of such a time-window metric follows.
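The sketch below shows one simple way such a time-window metric might be computed; the application records, window definitions, and distance cutoff are all hypothetical, and this is not a reproduction of the Bell et al. (2001) method. Dated pesticide applications near a residence are summed within each trimester of a pregnancy.

```python
from datetime import date, timedelta

# Hypothetical dated pesticide applications near one residence:
# (application date, kilograms applied, distance from home in meters)
applications = [
    (date(1999, 4, 10), 12.0, 300),
    (date(1999, 6, 2), 5.0, 800),
    (date(1999, 9, 20), 20.0, 150),
    (date(2000, 1, 5), 8.0, 400),
]

conception = date(1999, 5, 1)
MAX_DIST_M = 500  # only count applications within 500 m of the residence

def window_exposure(start, end):
    """Sum nearby applications whose date falls in [start, end)."""
    return sum(kg for d, kg, dist in applications
               if start <= d < end and dist <= MAX_DIST_M)

# Three 13-week trimester windows following conception.
for i in range(3):
    w_start = conception + timedelta(weeks=13 * i)
    w_end = conception + timedelta(weeks=13 * (i + 1))
    print(f"trimester {i + 1}: {window_exposure(w_start, w_end):.1f} kg "
          f"applied within {MAX_DIST_M} m")
```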
Misclassification of exposure is of particular concern in environmental epidemiology studies because of the challenges in estimating exposure to environmental contaminants, which can occur across multiple locations and often at low levels. Exposure errors in time–series studies can occur along a continuum of measurement errors between classic-type errors and Berkson errors, as presented in detail by Zeger et al. (2000) regarding air pollution and health. Each type of error has a different effect on the estimation of risk. Berkson error occurs when the exposure metric is at the population level and individual exposures vary because of different activity patterns; an example of such a population-level or aggregate exposure metric is the assignment of air pollutant levels from a stationary air monitor to the population living in the vicinity of the monitor. Berkson error does not lead to bias in the risk estimate, although the variance of the risk estimate is increased (Zeger et al. 2000). In the classic error model, the exposure metric used in an epidemiologic study is measured with error and is an imperfect surrogate for the true exposure. If misclassification of exposure is nondifferential with respect to the health outcome, the effect is generally to bias risk estimates toward the null, thus potentially missing true associations (Copeland et al. 1977; Flegal et al. 1986).

To evaluate the degree of misclassification that may occur in an epidemiologic study, it is important to consider the sensitivity and specificity of the exposure metric employed. Sensitivity is the ability of an exposure metric to correctly classify as exposed those who are truly exposed; specificity is the ability of the metric to correctly classify as unexposed those who are truly unexposed. Most epidemiologists do not formally assess the validity of their exposure metric before a study is launched; however, small reductions in the sensitivity and/or specificity of the exposure metric can have substantial effects on the estimates of risk. When the true prevalence of exposure is low (e.g., less than 10%), small reductions in specificity cause substantial reductions in the risk estimates, whereas reductions in sensitivity have smaller effects. When the exposure is common in the study population, the sensitivity of the exposure metric becomes more important (Stewart and Correa-Villasenor 1991). A common metric used in studies employing GIS is the proximity between a pollutant source and a residence. Simple proximity metrics are likely to overestimate the population truly exposed (high sensitivity but low specificity). If those truly exposed represent only a small percentage of the study population, there will be substantial attenuation of the risk estimate if a true risk exists.

Rull and Ritz (2003) compared several methods of classifying a study population in California on the basis of agricultural pesticide use reported in the California Pesticide Use Reporting (CPUR) database (http://www.cdpr.ca.gov/). The prevalence of exposure differed substantially depending on the metric used. They assumed that a metric accounting for the location of crop fields more accurately represented true exposures; this metric yielded a lower exposure prevalence than a metric based on the CPUR database alone. In a simulation study, they demonstrated that the reduced specificity of the CPUR metric resulted in substantial attenuation of the risk estimates. The two sketches that follow illustrate these error models and the effect of imperfect sensitivity and specificity on risk estimates.
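A small simulation can make the contrast between the two error models concrete. The sketch below uses entirely synthetic numbers and assumes a simple linear health model. Under classical error, the measured exposure is the true exposure plus noise; under Berkson error, individual exposures vary around an assigned group-level value (e.g., a stationary-monitor reading). The fitted classical slope is attenuated toward the null, while the Berkson slope is not.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100_000
true_slope = 2.0

# --- Classical error: measured exposure = true exposure + noise ---
x_true = rng.normal(10, 2, n)
y = true_slope * x_true + rng.normal(0, 1, n)    # health outcome
x_measured = x_true + rng.normal(0, 2, n)        # imperfect surrogate
slope_classical = np.polyfit(x_measured, y, 1)[0]

# --- Berkson error: true exposure = assigned group level + noise ---
x_assigned = rng.normal(10, 2, n)                # e.g., monitor value
x_indiv = x_assigned + rng.normal(0, 2, n)       # activity patterns
y_b = true_slope * x_indiv + rng.normal(0, 1, n)
slope_berkson = np.polyfit(x_assigned, y_b, 1)[0]

print(f"true slope:            {true_slope:.2f}")
print(f"classical-error slope: {slope_classical:.2f}  (attenuated toward null)")
print(f"Berkson-error slope:   {slope_berkson:.2f}  (unbiased, larger variance)")
```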
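Finally, the attenuation caused by imperfect sensitivity and specificity can be computed directly from the expected 2 × 2 table. The sketch below uses illustrative values only (a true odds ratio of 2.0 and a 5% true exposure prevalence); it is not the Rull and Ritz simulation itself, but it shows the same qualitative result: at low prevalence, a modest loss of specificity shrinks the observed odds ratio far more than the same loss of sensitivity.

```python
def observed_odds_ratio(true_or, prevalence, sensitivity, specificity):
    """Odds ratio observed under nondifferential exposure misclassification.

    prevalence: true exposure prevalence among controls.
    """
    # True exposure probabilities among controls and cases.
    p0 = prevalence
    odds1 = true_or * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)

    def apparent(p):  # probability of being *classified* as exposed
        return sensitivity * p + (1 - specificity) * (1 - p)

    a0, a1 = apparent(p0), apparent(p1)
    return (a1 / (1 - a1)) / (a0 / (1 - a0))

# True OR = 2.0, true exposure prevalence 5% (e.g., living near a source).
for sens, spec in [(1.0, 1.0), (0.9, 1.0), (1.0, 0.9), (0.9, 0.9)]:
    or_obs = observed_odds_ratio(2.0, 0.05, sens, spec)
    print(f"sensitivity {sens:.1f}, specificity {spec:.1f}: observed OR = {or_obs:.2f}")
```

With these inputs, perfect classification recovers the true odds ratio of 2.0, reducing sensitivity to 0.9 barely moves it, but reducing specificity to 0.9 attenuates it to roughly 1.3.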