The following issues were debated (Table 1):

Table 1 Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature (columns: Topics; Arguments; CNBSS Literature)

Randomisation (n = 45)
Critique (n = 25; [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]): Randomisation was performed at each site after clinical examination (a change of, or violation of, the initial protocol); according to an independent external review, on-site randomisation after palpation could no longer guarantee blinding [1]. Various possibilities of subversion existed for anyone working in the study, and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. The disproportionately large number of first-round participants < age 50 years with advanced cancers entering the mammography group is considered a strong indicator of flawed randomisation. A low number of late-stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables would probably not yet be affected (owing to their low association with breast cancer mortality).
Defense (n = 14; [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [6]): “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; the presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. Proponents refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6], and argue that “>50 other variables showed no statistically significant difference between study and control groups” [7]. Individual randomisation is assumed to be generally superior to cluster randomisation.
Not commented, but mentioned (n = 7): [41], [42], [43], [44], [45], [46], [47]

Inclusion criteria (n = 14)
Critique (n = 9; [8], [48], [13], [3], [4], [49], [25], [24], [23]): By including mainly symptomatic women, the study by definition cannot test the value of screening (68 % of cancers in the mammography arm were palpable). Consequence: dilution of the measurable effect; the study is underpowered for testing a true screening effect.
Defense (n = 3; [37], [31], [29]): The issue is not directly addressed (reasons for choosing this protocol are explained).
Not commented, but mentioned (n = 2): [47], [46]

Mammography quality (n = 46)
Critique (n = 28; [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28]): According to the responsible physicist, outdated equipment was used; the equipment at some centres was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. Image quality was rated by external reviewers as satisfactory in less than 40 % of cases during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements in image quality were regularly demanded, and two reviewers resigned during the study because of unacceptable quality. Reported problems included incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed (too dark or too bright) images, no training of readers, and high numbers of obviously missed cancers.
Defense (n = 10; [29], [38], [33], [34], [36], [59], [60], [61], [62], [36]): The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. Quality control procedures were established for radiation physics and mammography interpretation.”
Not commented, but mentioned (n = 10): [43], [45], [63], [64], [65], [66], [67], [68], [69], [70]

Reading quality (n = 26)
Critique (n = 16; [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57]): Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low, and an extraordinarily high number of missed cancers occurred within the 1-year intervals.
Defense (n = 7; [33], [34], [36], [38], [7], [71], [62]): Reading quality and the training of readers have not been specifically addressed by proponents of the study.
Not commented, but mentioned (n = 3): [68], [63], [67]

Recommended biopsies not performed (n = 3)
Critique (n = 1; [4]): “25 % of needle localization were recommended but not performed” [4, 5]; “…at least one physician refused to do a biopsy on nonpalpable (= mammographically detected) lesions…” [4]
Defense (n = 2; [36], [5]): “Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36]
Not commented, but mentioned (n = 0)

Contamination (n = 9)
Critique (n = 5; [24], [48], [13], [72], [21]): 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but it may be more pronounced in trials with individual randomisation; it falsely dilutes the measurable effect.
Defense (n = 1; [36]): Baines argues that a problem concerning 26 % of the control group can exert only a “small” effect.
Not commented, but mentioned (n = 3): [66], [47], [41]

This table gives an overview of the main issues discussed concerning the CNBSS. The list of references can be reviewed in the electronic supplementary material (ESM).

Randomisation: Randomisation was performed after a clinical breast examination (CBE) by the principal investigators at each site [28, 31, 32]. This contradicted the initial study design [33] but was documented in the handbook of operation [34] and by an external investigation [32]. While Baines states that “the center coordinators … were blind to the CBE”, other authors, including external reviewers of the study, reported the opposite [28, 34]. Moreover, the coordinator of one of the CNBSS sites was removed from her position because of suspected subversion of the randomisation process [31]. An external investigation [32] into possible fraud stated that in 12 out of 15 sites “…the nurses and probably the coordinators were aware of the findings of the clinical examination when the allocation was made”. In addition, more alterations of the allocation book were found in the mammography screening arm than in the control group (>100 unexplained alterations) [28, 31, 32, 34]. This investigation, which mainly concentrated on checking erasures of participants’ names in the allocation book, did “not uncover credible evidence of subversion” [32]. However, Boyd [31] and others (e.g. Kopans, Burhenne) pointed out that various other easy possibilities of subversion existed that could not be excluded in the absence of blinding. The external investigation itself conceded that “…referral would not have ensured mammography. The charge has been made that there remained a motive for the examiner or coordinator to subvert the randomisation, if for clinical or other reasons he or she believed that the subject should…have a mammogram” [32]. Unfortunately, a severe imbalance in the distribution of advanced cancers (>4 involved lymph nodes) was noted among the women screened in the first round < age 50 years: nineteen women with far-advanced cancers were allocated to the screening group versus five in the control group, and more women with prior breast cancer were reported in the screening group (n = 8 versus n = 1) [28, 31, 34–36]. While Baines argues that more than 50 other variables (“…demographic and risk factors”) showed “virtually identical distribution across control and study groups”, Kopans pointed out that “….shifting a much smaller number of advanced cancers to the study group would substantially affect mortality…..without producing a demographic imbalance”.
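As a rough illustration of the weight of this imbalance (a back-of-the-envelope calculation, not an analysis taken from the cited literature): if the two arms are taken as equally sized, so that under intact randomisation each of the 19 + 5 = 24 far-advanced cancers is equally likely to fall into either arm, the probability of a split at least as extreme as 19 versus 5 is

\[
P(X \ge 19) \;=\; \sum_{k=19}^{24} \binom{24}{k} \left(\frac{1}{2}\right)^{24} \;=\; \frac{55455}{16777216} \;\approx\; 0.0033, \qquad X \sim \mathrm{Bin}\!\left(24, \tfrac{1}{2}\right),
\]

i.e. about 1 in 300 one-sided, or roughly 0.007 two-sided.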
A bias of the randomisation must also be suspected when comparing the hazard ratios of the mammography arm in the prevalence round versus subsequent rounds: 1.47 versus 0.9 [20]. Goetzsche investigated none of the above issues of the randomisation process. Instead, he classified the CNBSS as one of two “high quality” RCTs on the basis of one formal criterion: he rated individual randomisation higher than cluster randomisation (in which whole demographic regions, rather than individuals, are invited or not invited). Comparable to the Malmö study, which also used individual randomisation, a high proportion of cross-over was reported: 26 % of the women in the control group underwent mammography [37].
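The dilution caused by such cross-over can be sketched with a simple intention-to-treat calculation (an illustration under simplifying assumptions: full attendance in the study arm, and contaminated controls receiving the full benefit of screening). If screening truly reduces breast cancer mortality by a fraction r and a fraction c of the control group undergoes screening, the observed relative risk becomes

\[
\mathrm{RR}_{\mathrm{obs}} \;=\; \frac{1-r}{1-c\,r}; \qquad \text{for } r = 0.25 \text{ and } c = 0.26: \quad \mathrm{RR}_{\mathrm{obs}} = \frac{0.75}{0.935} \approx 0.80,
\]

so a true 25 % mortality reduction would be measured as a reduction of only about 20 % from contamination alone.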
Inclusion criteria: Women with palpable tumours were included in the CNBSS [1, 2]. This fact alone calls for scrutiny of the inclusion criteria: in the mammography arm of the CNBSS, 68 % of breast cancers were already palpable. Furthermore, women with previous breast cancers were included [28, 31], participants whose prognosis may already have been determined by the prior disease.

Quality of the screening procedure: Miller claimed that all equipment was new and that quality assurance was in place [11, 6]. The responsible physicist stated [30], “quality was far below state of the art – even at that time (early 1980s). Problems … from inadequate equipment, … inappropriate imaging technique and lack of….specialized training….”. This was confirmed by several others, including external reviewers [28, 29, 34], who specified that “training for mammography technologists or radiologists and a quality assurance were not present”. The study had initially been overseen by two highly renowned advisors (S. Feig and W. A. Logan), who resigned after 3 years because of the unacceptable quality. Subsequent external review rated image quality as acceptable in < 40 % of cases during 1980-1984, in about 60 % in 1985, and in < 85 % in 1986-7. Problems included incomplete inclusion of the breast tissue, unsharp images, low image contrast, and over- or underexposed images [28, 38] (Fig. 3). Overall, a very large number of interval cancers (143/575) occurred in this annual screening trial, exceeding the rate considered acceptable for the first year of the interval in modern quality-assured mammography screening by a factor of about 3. According to the reviewers, the above problems of poor image quality and inadequate training of personnel and readers may have accounted for the majority of the excess interval cancers. A subgroup analysis by Moskowitz [39] reported an increase in the detection rate as image quality improved towards the end of the study, supporting a correlation with image quality.

Fig. 3 Images demonstrating the significant changes in technology between 1984 and 1987 (CNBSS study) and a follow-up mammogram of the same patient from 1993. Even though the later technology is still far from present-day contrast resolution, it is obvious that on the earlier mammograms almost no structures can be discerned in 80 % of the breast, making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto, Canada.
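For orientation regarding the interval-cancer figures above (reading 143/575 as the share of interval cancers among all cancers in the mammography arm, an interpretation of the reported ratio rather than a statement from the cited reviews):

\[
\frac{143}{575} \approx 0.25, \qquad \frac{0.25}{3} \approx 0.08,
\]

i.e. roughly one in four cancers surfaced between the annual screens, while the stated factor of about 3 implies an acceptable first-year share in the order of 8 %.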