Methods

The data and factors

This section describes the data provided by CAPC and lists the factors considered by our analysis. The raw test results were supplied by the Antech and IDEXX corporations [5,6] and report whether each performed test was positive or negative; no uncertainty margin is supplied for the results. While individual tests are reported by zip code of the testing clinic, the raw data were aggregated into the number of positive tests and the total number of tests conducted over each calendar year in each county of the conterminous United States. Since only two calendar years of records are available, no attempt is made to include seasonal structure. The IDEXX samples represent both the results of pet-side heartworm antigen test kits and results from an IDEXX capture system [Heartworm RT and the 4Dx Plus (and originally 4Dx tests)] along with tests run by the IDEXX diagnostic laboratories [Heartworm Antigen by ELISA-Canine*]. Antech tests were performed at Antech Laboratories and utilized the Dirochek Assay and the AccuPlex4 heartworm antigen detection assay. Over 2011 and 2012, there were 9,580,719 total tests performed by either method, of which 111,259 were positive. Rudimentary statistical checks do not show vast differences between Antech and IDEXX samples.

In most of the southeastern United States, veterinarians assume that outdoor dogs are at risk of heartworm infection and thus recommend that clients place their dogs on preventive protection. This may not be practical for all pet owners due to costs. In the CAPC data, many counties in the eastern United States report a small number (say fewer than 20) of tests. This is likely because tests are not being reported or are incommensurate with CAPC protocols, not because they are uncommonly performed. Tests from such counties are likely performed for the same reasons as in other southeastern counties reporting a greater number of tests.
In other areas of the United States, such as Montana or Idaho, testing is likely only performed if dogs have signs suggestive of heartworm disease and the veterinarian requires a confirmatory test. Sometimes, veterinarians are aware that dogs travel to heartworm endemic areas for part of the year. These dogs may be tested annually when they return to their home state. Also, as Figures 1 and 2 show, heartworm occurs in much of the western United States. In some of these areas, heartworm testing is probably conducted for the same reasons as in areas of higher prevalence. Overall, while it is understood that there may be some sampling biases in certain areas of the United States, the CAPC data seem fairly reflective of a true random sample for many counties in the United States.

Figure 1. Raw reported heartworm prevalence rates for 2011 and 2012.

Figure 2. Head-banging smoothed heartworm prevalence rates for 2011 and 2012.

Other data aspects are worth illuminating. First, heartworm infections are not detectable by almost all testing methods until about 6 or 7 months after the dogs have become infected; this is how long it takes for either microfilariae or antigen to appear in the blood after infection. Thus, many of the detected infections likely commenced the year prior to a positive test result. Because no travel information exists, it cannot be known where a dog acquired its infection. However, it is suspected that the majority of infected dogs were infected close to home. Second, it is not known whether a dog has been tested more than once. Dogs may be tested more than once annually to verify the need for treatment or to verify successful treatment, or they may be tested annually after the infection is first identified.

The factors chosen for inclusion in this study are those envisioned to impact whether a dog is likely to have heartworm.
These factors are a subset of those listed in Brown et al. [2] and contain climate variables (annual temperature, precipitation, and relative humidity; Figures 3, 4, 5); geographic factors (elevation, forest coverage, surface water coverage; Figures 6, 7, 8); societal factors (human population density and household income; Figures 9 and 10); and the presence or absence of Aedes aegypti, Aedes albopictus and six other mosquito species (Figures 11, 12, 13, 14, 15, 16, 17, 18). Presence or absence of mosquito species was used because abundance data are not available. Table 1 lists all considered factors. Many (but not all) of the factors are available on a county-by-county basis across the United States. Our methods do not a priori assume that all factors significantly influence heartworm prevalence rates, but rather seek to determine which factors do.

Figure 3. 2011 U.S. annual average temperature (Degrees F). The temperature data included in this study were annual in nature and were aggregated by the National Climatic Data Center (NCDC) [7] by climate region [8]. These data are not county-by-county; all counties within a climate region are assigned the same annual temperature. For example, the state of Alabama has 67 counties and 8 climate regions. Annual temperatures for 2011 were used to generate this graphic. Temperature dependence on latitude is clear.

Figure 4. 2011 U.S. annual total precipitation (Inches). The precipitation data were also obtained from the NCDC and have the same spatial resolution as the temperature data. Data used in this figure are for 2011. One sees a relatively dry southwestern United States and higher precipitation in the southeastern United States.

Figure 5. 2011 U.S. annual relative humidities (Percent). Our humidity data were relative (and not absolute) humidities.
Relative humidity is not measured directly, but can be estimated from temperature and dew point (dew point is measured) via

$$\mathrm{RH} = \frac{\exp\left(\dfrac{17.27\,(D-32)\,5/9}{(D-32)\,5/9 + 237.3}\right)}{\exp\left(\dfrac{17.27\,(T-32)\,5/9}{(T-32)\,5/9 + 237.3}\right)} \times 100\%,$$

where T is annual temperature, D is annual dew point (the factors (T - 32) 5/9 and (D - 32) 5/9 convert degrees Fahrenheit to degrees Celsius), and exp is the exponential function [9]. As expected, the Southeast is the most humid region in the United States.

Figure 6. U.S. county elevations (Feet). Elevation data were obtained on a county-by-county basis [10], with the height of the highest point in each county used to produce this figure. Data containing average county elevations would be preferable but were not readily available. Any resulting biases should be minor in the eastern United States, where most counties are homogeneous in elevation. However, the western states are less homogeneous. For example, Inyo County in California contains both the highest point (Mount Whitney, 14,505 ft.) and the lowest point (Death Valley, -282 ft.) in the conterminous United States. These limitations aside, elevation is a potentially important factor for heartworm prevalence, as higher elevations are often associated with drier conditions.

Figure 7. 2007 U.S. county forest coverage (Percent). County forest coverage was obtained from the United States Department of Agriculture (USDA) [11]. Total county area was obtained from the Census Bureau [12]. Percentages of forest coverage were calculated by dividing the forest coverage area by the county area (times 100%). The definition of forest coverage here is restricted to agricultural woodland, which means land supporting trees capable of producing timber or other wood products, including but not limited to logs, lumber, posts, and firewood. These data are updated by the USDA every five years. However, the 2012 data are not yet available; hence, the 2007 data were used to generate the graph.

Figure 8. 2007 U.S. county water coverage (Percent).
County surface water coverage was obtained from the Census Bureau [12] and was calculated by dividing the surface water area by the total county area reported by the Census Bureau [12]. The surface water coverage data were last updated in 2011; that update was used in our analysis.

Figure 9. 2010 U.S. population density (100 people per square mile). Population densities were calculated by dividing each county's population, expressed in units of 100 people, by the county area. The county populations and areas were taken from the most recent (2010) census data (Census Bureau [13]).

Figure 10. 2011 U.S. median household income (Dollars). Median household incomes were obtained from the Census Bureau (2010 census). These data were adjusted for inflation based on 2010 dollars; the Census Bureau adjusts by multiplying 2011 median household income by the ratio of the Consumer Price Index of 2010 and 2011 [14].

Figure 11. Presence of Aedes aegypti.
Figure 12. Presence of Aedes albopictus.
Figure 13. Presence of Aedes canadensis.
Figure 14. Presence of Aedes sierrensis.
Figure 15. Presence of Aedes trivittatus.
Figure 16. Presence of Anopheles punctipennis.
Figure 17. Presence of Anopheles quadrimaculatus.
Figure 18. Presence of Culex quinquefasciatus.

Table 1. Heartworm factors considered for inclusion in the study

| Category | Factor | Data available period | Scale | Source |
| Climate factors | Annual temperature | 2011 and 2012 | Division | National Climatic Data Center (NCDC) |
| Climate factors | Annual precipitation | 2011 and 2012 | Division | NCDC |
| Climate factors | Annual relative humidity | 2011 and 2012 | Station | NCDC |
| Geographic factors | Elevation | 2012 | County | http://www.cohp.org/ |
| Geographic factors | Percentage forest coverage | 2007 | County | United States Department of Agriculture (USDA) |
| Geographic factors | Percentage surface water coverage | 2010 | County | U.S. Census Bureau |
| Societal factors | Population density | 2010 | County | U.S. Census Bureau |
| Societal factors | Median household income | 2011 | County | U.S. Census Bureau |
| Mosquito species | Aedes aegypti | 2008 | County | Moore, CG. [15] |
| Mosquito species | Aedes albopictus | 2012 | County | Hynes NA [16] |
| Mosquito species | Aedes canadensis | 2004 | County | RF Darsie, Jr. and RA Ward [17] |
| Mosquito species | Aedes sierrensis | 2004 | County | RF Darsie, Jr. and RA Ward |
| Mosquito species | Aedes trivittatus | 2004 | County | RF Darsie, Jr. and RA Ward |
| Mosquito species | Anopheles punctipennis | 2004 | County | RF Darsie, Jr. and RA Ward |
| Mosquito species | Anopheles quadrimaculatus | 2004 | County | RF Darsie, Jr. and RA Ward |
| Mosquito species | Culex quinquefasciatus | 2004 | County | RF Darsie, Jr. and RA Ward |

Construction of the baseline heartworm prevalence map

We first construct a baseline heartworm prevalence map. This is done on an annual basis, as there are too few data to consider seasonal effects. As will become apparent, this analysis is a necessary precursor to assessing factor importance: informative factors should be able to reproduce the structure of our baseline prevalence map.

For the years 2011 and 2012, all data were combined into a single sample. For a county s, let p(s) denote the probability that a single dog tests heartworm positive. For notation, n(s) is the number of tests in county s and k(s) is the number of positive tests in county s. For example, if county s has three positive tests out of 100 during 2011 and eight positive tests out of 200 during 2012, then k(s) = 11, n(s) = 300, and $\hat{p}(s) = 11/300$ (a hat over a quantity indicates it is an estimate). Figure 1 displays county-by-county values of $\hat{p}(s)$. This figure indicates that heartworm is most problematic in the Lower Mississippi Valley. The role of factors in explaining the prevalence rates will be discussed in Section "Factor quantification". No factors are involved in the calculation of $\hat{p}(s)$.

Since the number of dogs tested varies greatly from county to county, the raw values of $\hat{p}(s)$ need to be weighted. Estimated values of p(s) are more accurate for a sample of 100 dogs than for a sample of 10 dogs. To quantify this, the classical standard error is used. In particular, the estimated variance of $\hat{p}(s)$ is

$$\widehat{\mathrm{Var}}\big(\hat{p}(s)\big) = \frac{\hat{p}(s)\,\big(1 - \hat{p}(s)\big)}{n(s)}. \tag{1}$$
The estimated standard error of $\hat{p}(s)$ is the square root of (1). We weight the values of $\hat{p}(s)$ inversely proportional to this standard error. Before doing this, an adjustment was made to the values of $\hat{p}(s)$ for small sample sizes. Technically, counties where all tests are positive or all are negative have $\widehat{\mathrm{Var}}(\hat{p}(s)) = 0$ (hence, its reciprocal is infinite), which could adversely impact our ensuing smoothing methods. To combat this, the Wilson estimator, which adds two to numerator counts and four to denominator counts, is used in lieu of $\hat{p}(s)$:

$$\hat{p}_W(s) = \frac{k(s) + 2}{n(s) + 4}.$$

This estimator has desirable sampling properties and cannot be zero or unity [18]. Of course, for large k(s) and n(s), $\hat{p}(s)$ and $\hat{p}_W(s)$ are approximately equivalent.

The raw values of $\hat{p}_W(s)$ vary greatly across counties. Even counties in close proximity to one another often have highly different prevalence estimates. This said, some spatial structure clearly exists in Figure 1. Our next goal is to extract and explore this structure. To accomplish this, the weighted head-banging spatial smoothing algorithm was applied to the county-by-county values of $\hat{p}_W(s)$. This procedure serves to remove localized small-scale variations due to random chance, illuminating large-scale structures that are actual features of the prevalence rates. While we will not delve into the details of weighted head-banging procedures, the technique is a median-based algorithm proposed by Tukey and Tukey [19] for smoothing spatial data. One needs to input the longitude and latitude of the centroid of each county, the county value to be smoothed, and the corresponding weights. Mungiole et al. [20] discuss the algorithm in detail.

Head-banging takes its name from a child's game, where the child presses their face against pins of various lengths protruding from a board. The result leaves an impression of the child's face, smoothing the lengths of adjacent pins but leaving the general structure of the face's impression.
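As an illustration of the adjustment described above, the following minimal Python sketch compares the raw estimate with the Wilson estimate and computes a weight as the inverse of the standard error with the Wilson estimate substituted. The function names are hypothetical, and taking the weight to be exactly the reciprocal of the square root of (1) is an assumption drawn from the phrase "inversely proportional to this standard error"; any proportionality constant is set to one here.

```python
import math

def raw_prevalence(k, n):
    """Raw estimate: positives divided by tests."""
    return k / n

def wilson_prevalence(k, n):
    """Wilson-type estimate: add 2 to the positives and 4 to the tests."""
    return (k + 2) / (n + 4)

def inverse_se_weight(k, n):
    """Reciprocal of the standard error, using the Wilson estimate
    in place of the raw estimate (assumed form, constant set to 1)."""
    p_w = wilson_prevalence(k, n)
    return math.sqrt(n / (p_w * (1.0 - p_w)))

# Worked example from the text: 11 positives out of 300 tests.
k, n = 11, 300
print(raw_prevalence(k, n), wilson_prevalence(k, n), inverse_se_weight(k, n))

# A county with no positives: the raw estimate is 0, so the estimated
# variance in (1) is 0 and its reciprocal is undefined; the Wilson
# estimate stays strictly between 0 and 1, so the weight is finite.
print(raw_prevalence(0, 10), wilson_prevalence(0, 10), inverse_se_weight(0, 10))
```

With the example counts, the raw estimate is 11/300 and the Wilson estimate is 13/304, so the two differ only slightly; the difference matters mainly for counties with very few tests.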
Head-banging techniques are very effective for down-weighting or removing noisy ‘spikes’ while preserving edge structures. A spike is an isolated observation that lacks confirmation from nearby data. Because of different testing practices from county to county, many spikes exist in the heartworm prevalence estimates. An edge occurs where the data change markedly in pattern, perhaps because of a mountain range. Edges are informative as they often demarcate distinct data regions.

To run the weighted head-banging algorithm, a parameter called the number of triples must be selected and the weights need to be specified. At each county where data are present, a set of triples (a triple for a county consists of the county itself and two nearby counties) was selected based on the criteria proposed by Hansen [21]. The weight of county s, denoted by w(s), comes from the inverse of the standard error of $\hat{p}(s)$, with $\hat{p}_W(s)$ replacing $\hat{p}(s)$:

$$w(s) = \sqrt{\frac{n(s)}{\dfrac{k(s)+2}{n(s)+4}\left(1 - \dfrac{k(s)+2}{n(s)+4}\right)}}.$$

Figure 2 shows our smoothed prevalence rates based on the weighted head-banging procedure. The larger the triple parameter is, the smoother (less rough) the resulting map will be. We have intentionally left the graphic slightly under-smoothed. This is because it is easy to visually smooth variability away with the eye, but impossible to recover true fluctuations that are erroneously smoothed away. Thirty triples were used to produce this graphic.

Figure 2 has interesting implications. First and foremost, heartworm is most prevalent in the Lower Mississippi Valley. While the northern latitudes show less activity, places where the prevalence rates are relatively higher do exist. Michigan, Vermont, and northwest Washington, for example, show greater heartworm disease prevalence than some of the other states at the same latitude. The Northern Rockies perhaps show the least heartworm disease prevalence.
While many inferences can be made from Figure 2, we caution the reader not to over-interpret minutiae. The map is constructed from only two years of data and there are variations in the results that may be spurious. For example, two very close locations, say Baton Rouge and New Orleans, Louisiana, might be shaded different colors on the map, but should not be expected to have radically different prevalence rates. As additional years of data are collected, we expect our baseline to become more accurate. Another issue involves dispersal of the disease: there is no a priori reason to think that prevalence rates are static in time. With only two years of observations, time trends will be difficult to discern and are not explored herein.

Factor quantification

This section examines the significance of the individual factors presented in Table 1 in predicting heartworm prevalence. A logistic regression model was created using data from 2011 and 2012. Logistic regression methods (as opposed to ordinary regression methods) [22] are specifically designed for cases involving a binary outcome that can be summarized by a probability, which is limited to values in the interval [0,1]. Our goal is to reproduce the structure in Figure 2.

Let $X(s) = (f_1(s), \ldots, f_8(s); \mathbf{1}_1(s), \ldots, \mathbf{1}_8(s))'$ be the collection of all predictive factors at county s. The logistic regression model attempts to explain spatial variations in p(s) from the factors via

$$p(s) = \frac{e^{g(X(s))}}{1 + e^{g(X(s))}}, \tag{2}$$

where g(X(s)) has form

$$g(X(s)) = \mathrm{logit}\big(p(s)\big) = \ln\!\left(\frac{p(s)}{1 - p(s)}\right) = \beta_0 + \sum_{i=1}^{8} \beta_i f_i(s) + \sum_{i=1}^{8} \gamma_i \mathbf{1}_i(s). \tag{3}$$

Clarifying terms, ln denotes the natural logarithm and logit(x) = ln(x) - ln(1 - x). Notice that $e^{g(X(s))}/(1 + e^{g(X(s))}) \in [0,1]$ for any value of g(X(s)). This guarantees that all predicted prevalence rates lie between zero and unity.
The overall location parameter, $\beta_0$, is common to all counties, while $\beta_1, \ldots, \beta_8$ are regression coefficients for the eight non-mosquito factors and $\gamma_1, \ldots, \gamma_8$ are regression coefficients for the eight mosquito species. The quantities $\mathbf{1}_i(s)$, $1 \le i \le 8$, are zero-one indicators: $\mathbf{1}_i(s)$ is taken as unity if the ith mosquito type is present in county s and zero otherwise. To estimate the parameters $\beta_0$, $\beta_i$, $1 \le i \le 8$, and $\gamma_i$, $1 \le i \le 8$, from the data, the classical method of maximum likelihood [23] is used. Once the logistic regression parameters are estimated, an estimate of p(s) based upon the fitted model is computed via

$$\hat{p}_{\mathrm{Logistic}}(s) = \frac{e^{\hat{g}(X(s))}}{1 + e^{\hat{g}(X(s))}}, \tag{4}$$

where quantities in (4) are estimated by

$$\hat{g}(X(s)) = \hat{\beta}_0 + \sum_{i=1}^{8} \hat{\beta}_i f_i(s) + \sum_{i=1}^{8} \hat{\gamma}_i \mathbf{1}_i(s). \tag{5}$$
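To make the fitting step concrete, the following Python sketch estimates a model of the form (3) by maximum likelihood and then evaluates (4) and (5). It is a sketch under stated assumptions, not the analysis code: the data are synthetic, only two continuous factors and one presence/absence indicator are used instead of eight of each, the counts k(s) out of n(s) are treated as binomial, and Newton-Raphson is one illustrative optimizer (the text does not say which was used). All function and variable names are hypothetical.

```python
import numpy as np

def inv_logit(g):
    """Map a linear predictor g to a probability, as in (2) and (4)."""
    return 1.0 / (1.0 + np.exp(-g))

def fit_binomial_logit(X, k, n, max_iter=50, tol=1e-10):
    """Newton-Raphson maximum likelihood for grouped binomial counts.

    X : (S, m) factor matrix, k : positives, n : tests per county.
    Returns the coefficient vector with the intercept first.
    """
    X1 = np.column_stack([np.ones(len(k)), X])   # prepend intercept column
    coef = np.zeros(X1.shape[1])
    for _ in range(max_iter):
        p = inv_logit(X1 @ coef)
        score = X1.T @ (k - n * p)                        # gradient of log-likelihood
        info = X1.T @ (X1 * (n * p * (1 - p))[:, None])   # Fisher information
        step = np.linalg.solve(info, score)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef

# Synthetic stand-in data (assumption): 500 counties, two standardized
# continuous factors and one mosquito presence/absence indicator.
rng = np.random.default_rng(0)
S = 500
factors = rng.normal(size=(S, 2))
indicator = rng.integers(0, 2, size=(S, 1)).astype(float)
X = np.hstack([factors, indicator])
true_coef = np.array([-4.0, 0.8, -0.3, 0.5])   # intercept, two betas, one gamma
n = rng.integers(50, 2000, size=S)             # tests per county
k = rng.binomial(n, inv_logit(true_coef[0] + X @ true_coef[1:]))

coef_hat = fit_binomial_logit(X, k, n)
g_hat = coef_hat[0] + X @ coef_hat[1:]         # fitted linear predictor, as in (5)
p_hat = inv_logit(g_hat)                       # fitted prevalence, as in (4)
print(coef_hat)
```

Because the fitted prevalence is obtained through the inverse logit, every value of `p_hat` lies strictly between zero and one regardless of the estimated coefficients.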