Results

How Smooth Are the Posterior Means?

The results of a Bayesian disease-mapping analysis are typically presented in the form of a map displaying a point estimate (usually the mean or median of the posterior distribution) of the relative risk for each area. To interpret such maps, one needs to understand the extent to which the statistical model is able to smooth the risk estimates to eliminate random noise while at the same time avoiding oversmoothing that might flatten any true variations in risk. To address this issue, we consider the two aspects separately: a) do the Bayesian methods provide adequate smoothing of the background rates, and b) to what extent does the posterior mean estimate differ from the background risk in the small number of areas simulated with a true elevated risk?

In all the cases simulated, we found substantial shrinkage of the relative risk estimates for the background rates. This is well illustrated in Figure 1, which displays raw and smoothed estimates for all the background areas of Simu 2 with an SF of 1 or 4. When SF = 1, the histogram of the raw standardized mortality or morbidity ratio (SMR) estimates is very dispersed (Figure 1A), with a range of 0–11 and a skewed distribution. Clearly, mapping the raw SMRs would present a misleading picture of the risk pattern, whereas each of the three Bayesian models gives posterior mean relative risk estimates for the background areas that are well centered on 1 (Figure 1B–D), with just a few areas having estimates outside the 0.9–1.1 range. When the expected counts are higher (SF = 4), the histogram of the raw SMRs is less spread out but still substantially overdispersed, whereas the histograms corresponding to the three models are even more concentrated on 1 than when SF = 1 (Figure 1F–H). Thus, the false patterns created by Poisson noise are adequately smoothed out by all the disease-mapping models.

Details of the performance of the BYM model in estimating the relative risk of the high-risk areas are presented in Table 1, with findings for L1-BYM and MIX shown in Tables 2 and 3, respectively. Overall, for the BYM model, a great deal of smoothing of the relative risks is apparent. For the isolated areas in Simu 1, relative risks of 1.5 in any single area are smoothed away, even in the most favorable case of an area with an expected count of 70 (90% area, SF = 10). When the simulated relative risk is 2, the posterior mean risk estimate is above 1.2 only when the expected count is around 50 or more (e.g., 75% area with SF = 10). Relative risks of 3 are smoothed to about half their value when the expected counts are around 10 (e.g., 25% area with SF = 10 or 75% area with SF = 2). Comparison of Simu 2 with Simu 1 (75% area) shows that having a cluster of high-risk areas rather than a single area with elevated risk slightly decreases the amount of smoothing for the same average expected count. This is again apparent in the many-cluster situation of Simu 3, where even though the true θ*i are smaller, the relative risk estimates are higher than those for Simu 2.

Overall, the performance of the L1-BYM model (Table 2) is similar to that of the BYM model. However, as expected, the L1-BYM model effects a little less smoothing in cases of large expected counts or high relative risk estimates. For Simu 3 the estimates are nearly identical to those of the BYM model.
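To make the shrinkage mechanism concrete, the sketch below contrasts raw SMRs with shrunken estimates on simulated background counts. It uses a simple conjugate gamma-Poisson posterior mean as a stand-in for the fully Bayesian spatial models considered here, which require MCMC; the number of areas, the expected counts, and the prior parameters a and b are all hypothetical values chosen for illustration, not quantities from our analysis.

```python
import numpy as np

rng = np.random.default_rng(42)

# 500 hypothetical background areas, all with true relative risk 1.
n_areas = 500
E = np.full(n_areas, 4.0)   # expected counts per area (e.g., SF = 4)
y = rng.poisson(E)          # observed counts: Poisson noise around E

# Raw SMRs: observed / expected. Very dispersed when E is small.
smr_raw = y / E

# Stand-in shrinkage: a conjugate gamma(a, b) prior on theta_i gives
# posterior mean (a + y_i) / (b + E_i); a = b = 20 encodes a prior
# concentrated near 1 (hypothetical values chosen for illustration).
a, b = 20.0, 20.0
theta_shrunk = (a + y) / (b + E)

print(f"raw SMR range:      {smr_raw.min():.2f} to {smr_raw.max():.2f}")
print(f"shrunken estimates: {theta_shrunk.min():.2f} to {theta_shrunk.max():.2f}")
```

Even this crude stand-in reproduces the qualitative behavior in Figure 1: the raw SMRs spread far from 1, whereas the shrunken estimates concentrate tightly around it.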
In sum, simply changing the distributional assumptions in the autoregressive specification results in only a small modification of the estimates.

The results for the MIX model given in Table 3 show a different pattern from those for BYM or L1-BYM. For Simu 1 and an elevated relative risk of 1.5, strong smoothing toward 1 is apparent, as for BYM. However, for Simu 2, the posterior mean relative risks rise above 1.2 for the largest SF. At the other end of the spectrum, relative risks of 3 are well estimated, with posterior means above 2.5, as soon as the expected count is above 10, either for single areas (e.g., 50% area with SF = 4) or for the 1% clustered areas with SF = 2. These results are in accordance with the nature of the MIX model: when there is sufficient evidence in the data to create a group of areas with higher risk, the posterior mean risks for the areas in this group are well estimated and close to the simulated values; otherwise, all areas are allocated to the background category and smoothed toward 1. Having many heterogeneous clusters, as in Simu 3, does not improve the performance of MIX as much as that of BYM. Because of the more diffuse nature of some of the clusters, more background areas are randomly included in the group of areas with higher risk. Thus, the MIX model still has a mode close to the true relative risk, but the histogram of the posterior mean risks for all the high-risk areas has a longer left-hand tail than in the Simu 2 scenario (Figure 2).

The difference in performance of the three models is further illustrated in Figure 3, which displays, for each model, box plots of the posterior mean estimates of the relative risk in the raised-risk areas over the 100 replicates for Simu 2 with true relative risks of 3 and 2. When the true relative risk is 3, the MIX model clearly performs better than the other two models, whereas for a relative risk of 2 and the lowest SF, the MIX model produces the most smoothing.

Interpreting the Posterior Distribution of the Risk

Mapping the posterior mean relative risk as discussed previously does not make full use of the output of the Bayesian analysis, which provides, for each area, samples from the whole posterior distribution of the relative risk. Mapping the probability that a relative risk is greater than a specified threshold of interest has been proposed by several authors [e.g., Clayton and Bernardinelli (1992)]. We carry this further and investigate the performance of decision rules for classifying an area Ai as having an increased risk based on how much of the posterior distribution of θi exceeds a reference threshold. Figure 4 presents an example of the posterior distribution of the relative risk for such an area; the shaded proportion corresponds to the posterior probability that θ > 1. To be precise, to classify any area as having an elevated risk, we define the decision rule D(c, R0), which depends on a cutoff probability c and a reference threshold R0, such that area Ai is classified as having an elevated risk according to D(c, R0) ↔ Prob(θi > R0) > c. A sketch of how this rule operates on posterior samples is given below.
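The code below is a minimal sketch of D(c, R0) applied to MCMC output: it computes the exceedance probability Prob(θi > R0) as the fraction of posterior draws above R0 and then applies the cutoff c. The array shapes and the lognormal draws standing in for real posterior output are assumptions made purely for illustration.

```python
import numpy as np

def classify_elevated(theta_samples, R0=1.0, c=0.8):
    """Apply the rule D(c, R0): declare area i elevated when the
    posterior probability Prob(theta_i > R0) exceeds the cutoff c.

    theta_samples: array of shape (n_draws, n_areas) holding posterior
    samples of the area relative risks (e.g., retained MCMC draws).
    Returns the exceedance probabilities and the boolean decisions.
    """
    exceed_prob = (theta_samples > R0).mean(axis=0)
    return exceed_prob, exceed_prob > c

# Toy usage with fabricated posterior draws for three areas: the third
# area's posterior is centered well above 1, so D(0.8, 1) flags it.
rng = np.random.default_rng(0)
draws = rng.lognormal(mean=[0.0, 0.1, 0.5], sigma=0.2, size=(5000, 3))
prob, elevated = classify_elevated(draws, R0=1.0, c=0.8)
print(np.round(prob, 2), elevated)
```

Note that the rule uses the whole posterior distribution, not just its mean: an area whose posterior mean is modest can still be flagged if most of its posterior mass lies above R0.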
The appropriate rules to investigate depend on the shape of the posterior distribution of θi for the elevated areas. We first discuss rules adapted to the autoregressive BYM and L1-BYM models. For these two models we have seen that, in general, the mean of the posterior distribution of θi in the raised-risk areas is greater than 1 but rarely above 1.5 in many of the scenarios investigated. Thus, it seems sensible to take R0 = 1 as the reference threshold. We would also expect the bulk of the posterior distribution to be shifted above 1 for these areas, suggesting that cutoff probabilities well above 0.5 are indicated. In the first instance, we choose c = 0.8. Thus, for the BYM and L1-BYM models, we report results corresponding to the decision rule D(0.8, 1). See Appendix B for a detailed justification of this choice of c and for the performance of different decision rules.

In contrast, we have seen that the mean of the posterior distribution of θi for raised-risk areas under the MIX model is closer to the true value for many scenarios, and there is clear indication that the upper tail of this distribution can be well above 1. Furthermore, the spread of this distribution is less than the corresponding one for the BYM or L1-BYM models, as noted by Green and Richardson (2002). The choice of threshold is thus more crucial for this model, making it harder to find an appropriate decision rule. After some exploratory analyses of the simple clusters in Simu 1 and Simu 2, we found that a suitable decision rule for the MIX model in these two scenarios is to choose R0 = 1.5. For such a high threshold, one would expect that it is enough for a small fraction (e.g., 5 or 10%) of the posterior distribution of θi to be above 1.5 to indicate that an area has elevated risk. Thus, for the MIX model we report results corresponding to the decision rule D(0.05, 1.5).

Two types of errors are associated with any decision rule: a) a false-positive result, that is, declaring an area as having elevated risk when in fact its underlying true rate equals the background level (an error traditionally referred to as type I error or lack of specificity); and b) a false-negative result, that is, declaring an area to be in the background when in fact its underlying rate is elevated (an error also referred to as type II error or lack of sensitivity). In epidemiology, performance is discussed either by reporting these error rates or by reporting their complements, which measure the success rates of the decision rule. The two goals of disease mapping can be summarized as follows: a) not to overinterpret excesses arising by chance, that is, to minimize the false-positive rate; and b) to detect patterns of true heterogeneity, that is, to maximize the sensitivity. We thus choose to report these two easily interpretable quantities. To be precise, for any decision rule D(c, R0), we compute a) the false-positive rate (or 1 − specificity), that is, the proportion of background areas falsely declared elevated by the decision rule D(c, R0); and b) the sensitivity (or 1 − false-negative rate), that is, the proportion of areas generated with elevated rates correctly declared elevated by the decision rule D(c, R0). It is clear that there must be a compromise between these two goals: a stricter rule (i.e., one with a higher value of c or R0 or both) reduces the false-positive rate but also decreases the sensitivity and thus increases the false-negative rate. Thus, to judge the performance of any decision rule, one has to consider both types of errors, not necessarily equally weighted. See Appendix B for an illustration of the implication of different weightings on the overall performance of the decision rule.
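A minimal sketch of how these two quantities are computed from the simulation output is given below. The area counts and classification flags are hypothetical, standing in for the declared and true statuses of the areas in one replicate.

```python
import numpy as np

def error_rates(declared, truly_elevated):
    """Compute the two reported measures for a decision rule D(c, R0).

    declared:        boolean array, areas declared elevated by the rule
    truly_elevated:  boolean array, areas simulated with a raised risk
    Returns (false-positive rate, sensitivity).
    """
    background = ~truly_elevated
    fpr = declared[background].mean()              # 1 - specificity
    sensitivity = declared[truly_elevated].mean()  # 1 - false-negative rate
    return fpr, sensitivity

# Toy usage: 1,000 areas, 2 truly elevated; the rule finds one of them
# and wrongly flags five background areas.
truth = np.zeros(1000, dtype=bool)
truth[:2] = True
flags = np.zeros(1000, dtype=bool)
flags[0] = True          # one true positive
flags[500:505] = True    # five false positives
fpr, sens = error_rates(flags, truth)
print(f"false-positive rate = {fpr:.3f}, sensitivity = {sens:.2f}")
```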
Table 4 summarizes the probabilities of false-positive results for the three models. For BYM and L1-BYM, the probabilities stay below 10%, with no discernible pattern, for Simu 1 and Simu 2. The error rates are clearly smaller, around 3%, for Simu 3. In this scenario, the background relative risk is shifted below 1, so a decision rule with R0 = 1 is, in effect, more stringent than in Simu 1 and Simu 2, where the background relative risks are close to 1. For the MIX model, the false-positive rates are quite low for Simu 1 and Simu 2, staying mostly below 3%. However, as shown in the last line of Table 4, these rates increase greatly in the Simu 3 scenario, indicating that the decision rule D(0.05, 1.5) is no longer appropriate in this heterogeneous context. The heterogeneity creates a great deal of uncertainty, with some background areas being grouped with nearby high-risk areas; consequently, the rule D(0.05, 1.5) is not stringent (specific) enough. We therefore investigated a series of rules D(c, 1.5) for c = 0.1–0.4 for the MIX model in the Simu 3 scenario. As c increases, the false-positive probability decreases; for D(0.4, 1.5), the probability is, on average, around 3% and always below 7% (Table 5).

Concerning the detection of truly increased relative risks, that is, sensitivity, we first discuss the results for the BYM and L1-BYM models. As expected from the posterior means shown in Tables 1 and 2, the ability to detect areas with a truly increased risk is limited when the increase is only of the order of 1.5. If one takes as a guideline the cases where the rate of detecting true positives is 50% or more, Tables 6 and 7 show that this sensitivity is reached for an expected count of around 50 in the case of a single isolated area and around 20 in the 1% cluster scenario. This shows that for rare diseases and small areas, there is little chance of detecting increased risks of around 1.5 while adequately controlling the false-positive rate. True relative risks of 2 are detected with at least 75% probability when expected counts are between 10 and 20 per area, depending on the spatial structure of the risk surface, whereas true relative risks of 3 are detected almost certainly when expected counts per area are 5 or more. There is no clear pattern of difference between the results for BYM and L1-BYM; overall, the sensitivity is similar.

For Simu 3 we see that the sensitivity is lower than for the other simulation scenarios with equivalent expected counts (as were the false-positive rates in Table 4), in line with the true relative risks being closer to 1 than for Simu 1 and Simu 2. Hence, the decision rule D(0.8, 1) is more specific but less sensitive in this scenario. In situations with a large degree of heterogeneity, akin to Simu 3, it thus might be advantageous to consider alternative rules, even if the false-positive rate is less well controlled. For example, for a true relative risk θ = 1.65 and SF = 4, using the rule D(0.7, 1) for the BYM model leads to a higher probability of a false positive (6% compared with the 3% shown in Table 4). However, the corresponding gain in sensitivity is more than 10 percentage points, with the probability of detecting a true positive increasing to 82% compared with 71% under the rule D(0.8, 1) (Table 5). Nevertheless, even with this relaxed and more sensitive rule, the chance of detecting a true relative risk as small as 1.3 is only around 50% if the SF is 4 (i.e., an average cluster with total expected count around 80). On the other hand, true relative risks of around 2 are detected with high probability as soon as the SF is 2 (which corresponds, on average, to a cluster with total expected count of 40).
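Calibrating the cutoff, whether for the MIX family D(c, 1.5) or for the comparison of D(0.7, 1) with D(0.8, 1) above, amounts to sweeping c and recomputing both error rates. The sketch below shows such a sweep under stated assumptions: the exceedance probabilities are fabricated, standing in for real model output, with elevated areas tending to have higher Prob(θi > R0) than background areas.

```python
import numpy as np

def sweep_cutoffs(exceed_prob, truly_elevated, cutoffs):
    """Re-evaluate the family of rules D(c, R0) over a grid of cutoffs c,
    given the exceedance probabilities Prob(theta_i > R0) for each area.
    Returns (c, false-positive rate, sensitivity) for each cutoff."""
    background = ~truly_elevated
    results = []
    for c in cutoffs:
        declared = exceed_prob > c
        results.append((c,
                        declared[background].mean(),
                        declared[truly_elevated].mean()))
    return results

# Toy usage: 200 areas, 10 truly elevated, fabricated probabilities.
rng = np.random.default_rng(1)
truth = np.zeros(200, dtype=bool)
truth[:10] = True
p = np.where(truth, rng.uniform(0.3, 1.0, 200), rng.uniform(0.0, 0.5, 200))
for c, fpr, sens in sweep_cutoffs(p, truth, [0.1, 0.2, 0.3, 0.4]):
    print(f"c = {c:.1f}: false-positive rate {fpr:.3f}, sensitivity {sens:.2f}")
```

As in Table 5, raising c trades sensitivity for a lower false-positive rate; the grid makes the compromise explicit.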
The contrasting behavior of the MIX model is again apparent in Table 8 when one compares the results for the θ = 1.5 scenario with those in the other columns. For Simu 1 and Simu 2 the sensitivity is generally below that of the BYM model, especially when the true relative risk is 1.5; single clusters with θ = 1.5 are simply not detected. In the 1% cluster case, expected counts of at least 20 (10) are necessary to be over 95% certain of detecting a true relative risk of 2 (3) (Table 8). Note that the results in the last line of Table 8 should be discounted in view of the high probability of false-positive results corresponding to this scenario (Simu 3) under the D(0.05, 1.5) rule shown in Table 4.

Thus, it is apparent that for the MIX model, it is hard to calibrate a decision rule that performs well across a variety of spatial patterns of elevated risk. In Table 5 we summarize the results corresponding to the decision rule D(0.4, 1.5), which offers a reasonable compromise between keeping the false-positive rate below 7% and maintaining an acceptable detection rate for true clusters. With this rule, true relative risks of 1.65 with an SF of 2 (i.e., an average cluster with total expected count slightly under 40) or larger have more than a 50% chance of being detected, and true relative risks of around 2 are nearly always detected. However, this model does not detect a true relative risk as small as 1.3.