PMC:4712234

Conclusions for mammography screening after 25-year follow-up of the Canadian National Breast Cancer Screening Study (CNBSS)

Abstract

Twenty-five-year follow-up data of the Canadian National Breast Cancer Screening Study (CNBSS) indicated no mortality reduction. What conclusions should be drawn? After conducting a systematic literature search and narrative analysis, we wish to recapitulate important details of this study, which may have been neglected: Sixty-eight percent of all included cancers were palpable, a situation that does not allow testing the value of early detection. Randomisation was performed at the sites after palpation, while blinding was not guaranteed. In the first round, this “randomisation” assigned 19/24 late stage cancers to the mammography group and only five to the control group, supporting the suspicion of severe errors in the randomisation process. The responsible physicist rated mammography quality as “far below state of the art of that time”. Radiological advisors resigned during the study due to unacceptable image quality, training, and medical quality assurance. Each described problem may strongly influence the results between study and control groups. Twenty-five years of follow-up cannot heal these fundamental problems. This study is inappropriate for evidence-based conclusions. The technology and quality assurance of the diagnostic chain is shown to be contrary to today's screening programmes, and the results of the CNBSS are not applicable to them.

Key Points
• The evidence base of the Canadian study (CNBSS) has to be questioned.
• Severe flaws in the randomization process and test methods occurred.
• Problems were criticized during and after conclusion of the trial by experts.
• The results are not applicable to quality-assured screening programs.
• The evidence base of this study must be re-analyzed.
Electronic supplementary material

The online version of this article (doi:10.1007/s00330-015-3849-2) contains supplementary material, which is available to authorized users.

Introduction

The Canadian National Breast Cancer Screening Study (CNBSS) is one of eight large-scale randomized controlled trials (RCTs) on mammography screening. Following its first publication in 1992 [1, 2], the quality, randomization, and design of this study were highly debated in the scientific community. The CNBSS is an outlier among the eight usually cited RCTs, including those with comparable follow-up [3], which showed an average mortality reduction for invited versus non-invited women aged 50–69 years of 23 % and for examined ages about 20 % [4–11]. It contradicts the results of today’s quality-assured mammography screening programmes and of the present Canadian screening programme, which report a mortality reduction of around 40 % for participating versus nonparticipating women [12–18]. However, it has been classified as one of two studies of high quality in Cochrane reviews, published by Goetzsche since 2000 [6, 19].

A 25-year follow-up of the CNBSS, with added data from two different studies, was published recently [20]. It claimed that mammography screening, performed annually for 5 years (from 1980 to 1985) in the age group 40–59 years, showed no benefit for breast cancer-specific mortality over clinical examination. If this outcome were taken at face value, mammography screening would have to be stopped immediately. Is this a reasonable conclusion for ongoing mammography screening programs? To answer this question, critical details of the CNBSS, which may have been lost, forgotten, or suppressed, are reconsidered. Another unasked question concerns its applicability to the present situation and to quality-assured screening programmes following European guidelines.
Materials and methods

A systematic search concerning the CNBSS was conducted, followed by thematic analysis of study findings and a narrative synthesis. The search included articles from 1981 to 2014 and was performed using the PubMed and Embase databases. The following adapted MeSH terms were used: “CNBSS”, “Canadian National Breast Screening Study”, “NBSS”, and “Breast cancer”. The selected studies were limited to articles published in English. Papers, comments, editorials, letters, reports, and handbooks dealing with the subject were included.

A total of 240 articles (PubMed n = 129, Embase n = 111) were identified. Additionally, a manual search and cited reference search generated another 41 articles, leading to a total of 281 articles, whose titles and abstracts were screened independently by two authors. They identified those articles that addressed specific structural, procedural, and evaluation issues of the CNBSS. One hundred and twenty-nine potentially relevant articles were saved for full text screening, while 53 articles were duplicates, 97 articles did not meet the inclusion criteria, and two articles were not available. These 129 articles were screened for the following topics: inclusion criteria, randomisation process, contamination of the control group, quality of mammograms, mammographic reading, and diagnostic chain. Forty-three articles were excluded, as they did not address these topics. Finally, 86 articles were selected for review (Fig. 1).

Fig. 1 Flow diagram of systematic literature search

Results

Sixteen of the 86 eligible articles mentioned > 1 of the debated issues, but argued neither for nor against the quality of the study. Thirty-seven articles by various authors, including 15 articles (co-)authored by five radiological reviewers (footnote 1) [21–29] (Kopans, Feig, Logan, Sickles, Tabar), one advisor (Moskowitz), and the responsible physicist [30] (Yaffe), reported “severe flaws” of the study.
Thirty-three articles, of which 27 were (co-)authored by the two principal investigators, Miller or Baines, defended the study (Fig. 2). Two of the radiological reviewers resigned during the study.

Fig. 2 This chart shows the number of publications that criticized or defended the CNBSS. With very few exceptions, only the main investigators defended the trial. Numerous authors criticized it. Seven of these authors were involved with review or quality assurance of the CNBSS. These authors are mentioned explicitly. Some published several articles concerning the CNBSS. Further publications mentioned some of the issues, but did not comment on them. The list of references can be reviewed in the electronic supplementary material (ESM).

The following issues were debated (Table 1):

Table 1 Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature
(Columns: Topics; Arguments; CNBSS Literature)

Topic: Randomisation (n = 45)
Arguments (critique): Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study, and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. The disproportionally large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered a strong indicator of flawed randomisation. A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association with BC mortality).
Arguments (defense): “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6]. They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7]. Individual randomisation is assumed generally superior to cluster randomisation.
Literature, critique (n = 25): [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]
Literature, defense (n = 14): [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [6]
Literature, not commented, but mentioned (n = 7): [41], [42], [43], [44], [45], [46], [47]

Topic: Inclusion criteria (n = 14)
Arguments (critique): By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable). Consequence: dilution of the measurable effect; the study is underpowered for testing the true screening effect.
Arguments (defense): Issue not directly addressed (reasons for choosing this protocol are explained).
Literature, critique (n = 9): [8], [48], [13], [3], [4], [49], [25], [24], [23]
Literature, defense (n = 3): [37], [31], [29]
Literature, not commented, but mentioned (n = 2): [47], [46]

Topic: Mammography quality (n = 46)
Arguments (critique): According to the responsible physicist, outdated equipment was used. The equipment at some centres was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded, and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, and high numbers of obviously missed cancers.
Arguments (defense): The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. Quality control procedures were established for radiation physics and mammography interpretation.”
Literature, critique (n = 28): [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28]
Literature, defense (n = 10): [29], [38], [33], [34], [36], [59], [60], [61], [62], [36]
Literature, not commented, but mentioned (n = 10): [43], [45], [63], [64], [65], [66], [67], [68], [69], [70]

Topic: Reading quality (n = 26)
Arguments (critique): Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1-year intervals.
Arguments (defense): Reading quality and training of readers have not been specifically addressed by proponents of the study.
Literature, critique (n = 16): [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57]
Literature, defense (n = 7): [33], [34], [36], [38], [7], [71], [62]
Literature, not commented, but mentioned (n = 3): [68], [63], [67]

Topic: Recommended biopsies not performed (n = 3)
Arguments (critique): “25 % of needle localization were recommended but not performed” [4, 5]. “…at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4]
Arguments (defense): “Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36]
Literature, critique (n = 1): [4]
Literature, defense (n = 2): [36], [5]
Literature, not commented, but mentioned (n = 0)

Topic: Contamination (n = 9)
Arguments (critique): 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect.
Arguments (defense): Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect.
Literature, critique (n = 5): [24], [48], [13], [72], [21]
Literature, defense (n = 1): [36]
Literature, not commented, but mentioned (n = 3): [66], [47], [41]

This table gives an overview of the main issues discussed concerning the CNBSS. The list of references can be reviewed in the electronic supplementary material (ESM).

Randomisation: Randomisation was performed after a clinical breast examination (CBE) by the principal investigators at each site [28, 31, 32]. This contradicted the initial study design [33], but was documented in the handbook of operation [34] and by an external investigation [32]. While Baines states that “the center coordinators … were blind to the CBE”, other authors, including external reviewers of the study, reported the opposite [28, 34]. Also, the coordinator of one of the CNBSS sites was removed from her position because of suspected subversion of the randomisation process [31].
An external investigation into possible fraud [32] stated that in 12 out of 15 sites “…the nurses and probably the coordinators were aware of the findings of the clinical examination when the allocation was made”. In addition, more alterations of the allocation book were found in the mammography screening arm than in the control group (>100 unexplained alterations) [28, 31, 32, 34]. This investigation, which mainly concentrated on checking erasures of participants’ names in the allocation book, did “not uncover credible evidence of subversion” [32]. However, Boyd [31] and others (e.g. Kopans, Burhenne) pointed out that various other easy possibilities of subversion existed that could not be excluded in the absence of blinding. The external investigation itself noted that “…referral would not have ensured mammography. The charge has been made that there remained a motive for the examiner or coordinator to subvert the randomisation, if for clinical or other reasons he or she believed that the subject should…have a mammogram” [32].

Unfortunately, a severe imbalance in the distribution of advanced cancers (>4 involved lymph nodes) was noted among those women screened in the first round < age 50 years. Nineteen women with far advanced cancers (>4 involved lymph nodes) were allocated to the screening group versus five in the control group, and more women with prior breast cancer were reported in the screening group (n = 8 versus n = 1) [28, 31, 34–36]. While Baines argues that more than 50 other variables (“…demographic and risk factors”) showed “virtually identical distribution across control and study groups”, Kopans pointed out that “…shifting a much smaller number of advanced cancers to the study group would substantially affect mortality…without producing a demographic imbalance”. A bias of randomisation must also be suspected when comparing the hazard ratios of the mammography arm in the prevalence round versus subsequent rounds: 1.47 versus 0.9 [20].
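To give a rough sense of how unusual a 19 versus 5 split is, the following minimal sketch computes an exact binomial tail probability. It is an illustration only, not an analysis reported in the source: it assumes that the two arms were of equal size in this subgroup, so that under unbiased 1:1 randomisation each of the 24 advanced cancers would land in either arm with probability 0.5, and it ignores that this comparison was chosen after seeing the data. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts taken from the text: 19 advanced cancers in the mammography arm, 5 in the control arm.
n_advanced = 19 + 5

# Assumption (not from the source): equal arm sizes, so p = 0.5 per cancer under fair allocation.
one_sided = binomial_upper_tail(19, n_advanced, p=0.5)
two_sided = min(1.0, 2 * one_sided)  # doubling is valid because the distribution is symmetric at p = 0.5

print(f"one-sided P(X >= 19 of 24) = {one_sided:.4f}")
print(f"two-sided p-value          = {two_sided:.4f}")
```

Under these assumptions the one-sided probability is 55455 / 2^24, roughly 0.003. A formal assessment would need the actual arm sizes and an adjustment for the post hoc selection of this subgroup.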
Goetzsche investigated none of the above issues of the randomisation process. Instead, he classified the CNBSS as one of two “high quality” RCTs using one formal criterion: he rated individual randomisation higher than cluster randomisation (where demographic regions, as opposed to individuals, are invited or not invited). Comparable to the Malmö study, which also used individual randomisation, a high proportion of cross-over was reported: 26 % of the women in the control group underwent mammography [37].

Inclusion criteria: Women with palpable tumours were included in the CNBSS [1, 2]. This fact alone calls for scrutinizing the inclusion criteria. In the mammography arm of the CNBSS, 68 % of breast cancers were already palpable. Furthermore, women with previous breast cancers were included [28, 31], participants whose prognosis may already have been determined by the prior disease.

Quality of the screening procedure: Miller claimed that all equipment was new and that quality assurance was in place [11, 6]. The responsible physicist stated [30], “quality was far below state of the art – even at that time (early 1980s). Problems … from inadequate equipment, … inappropriate imaging technique and lack of….specialized training….”. This was confirmed by several others, including external reviewers [28, 29, 34], who specified that “training for mammography technologists or radiologists and a quality assurance were not present”. The study had initially been overseen by two highly renowned advisors (S. Feig and W. A. Logan). They resigned from the study after 3 years due to unacceptable quality. Subsequent external review rated image quality as acceptable in < 40 % of the cases during 1980-1984, in 1985 in about 60 %, and in 1986-7 in < 85 %. Problems included incomplete inclusion of the breast tissue, unsharp images, low image contrast, and over- or underexposed images [28, 38] (Fig. 3).
Overall, a very large number of interval cancers (143/575) resulted in this annual screening trial. This exceeds by a factor of about 3 the rate that modern quality-assured mammography screening considers acceptable for year one of the interval. According to the reviewers, the above problems of poor image quality and inadequate training of personnel and readers may have accounted for the majority of the excess interval cancers. A subgroup analysis by Moskowitz [39] reported an increase of the detection rate with improving image quality at the end of the study, supporting a correlation with image quality.

Fig. 3 Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS study), and a follow-up mammogram of the same patient from 1993. Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast, making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto, Canada

Finally, it has been estimated that 25 % of the recommended biopsies were not performed. One of the surgeons was convinced that nonpalpable lesions detected by mammography did not require biopsy [1, 28].

Discussion and conclusion

The above summary of the published results demonstrates several debatable issues concerning the CNBSS:

It has to be emphasized that, by definition, screening should address asymptomatic women [40]. Current screening is population-based and aims to invite asymptomatic women. The prevalent round of the Canadian trial had a very high proportion of palpable cancers (partly skewed by the nature of its recruitment strategy). Thus, its results are not applicable to current practice, where there is uniform access to high quality symptomatic services for women with symptoms.
By accepting palpable tumours (most probably advanced and with worse prognosis), the results are skewed: the overall numbers of cancers are artificially high, while the palpable cancers cannot contribute to an improved mortality reduction. Thus, the screening effect will be considerably diluted and underestimated. This also leads to underpowered statistics [28, 31].

The documented process of randomisation did not guarantee blinding. Thus, any person involved in the study could subvert the randomisation. Also, the probability of subversion was enhanced since mammography was not necessarily offered to women in the control group. We do not assume that the principal investigators committed any fraud. However, they could not have prevented subversion with the chosen protocol. The disproportional distribution of far advanced stages (cancers with > 4 involved lymph nodes) in the prevalence screen < age 50 years is highly significant and supports the doubts concerning correct randomisation. An even distribution of demographic and risk factors cannot exclude a bias toward late stage cancers, which may severely affect the assessment of mortality reduction and the calculation of overdiagnoses.

Long-term mortality reduction was calculated from a maximum of five annual rounds. Because of the short overall duration and continuing entrance of first round screenees, the maximum screening effect could not be reached for many of the participants. Mortality reduction was calculated based on cumulative rates of a mixed trial participation of one to five rounds during up to 5 years. This might lead to a substantial underestimation of the true screening effect compared to a screening programme following approved guidelines (in which participants undergo approximately 10 complete rounds in 20 years) [41].

The higher evidence classification of individually versus cluster randomised studies is correct in principle. But for screening trials, where non-invited and invited women cannot be blinded, individual randomisation may lead to a much higher contamination of the control group than in a cluster randomised setting. Thus, in the CNBSS, as in other individually randomised screening trials, a substantial underestimation of the screening effect through contamination cannot be ruled out.

The fact that none of the radiological reviewers, including the responsible physicist, considered the quality sufficient is highly concerning, as is the described lack of technologist and reader training and the high rate of interval cancers. Two reviewers resigned during the study. How can a method be tested if it is not properly performed and interpreted? What is the value of the results?

The fact that recommended biopsies of mammographically detected abnormalities were not systematically performed is likely to have distorted the results. What effect is expected from early detection if suspicious findings are not followed by adequate assessment and therapy?

Both obvious and probable protocol deficiencies are likely to have had an impact on the results, counteracting a possible effect of mammography screening on breast cancer mortality and distorting estimates of overdiagnosis. As appropriate randomisation is one of the key validity criteria for RCTs, a study with that kind of violation should not be rated as a high quality study. Whether evidence from such a trial can be used at all must be questioned.

Even if the raised concerns were insignificant and the results were valid, the question remains whether the results and conclusions from a screening trial performed in 1980-85 are applicable to or useful for the assessment of present screening programmes. The most appropriate answer is obviously “no”.
First, the age range in the CNBSS was 40–59 years, which does not match the age range of most mammography screening programs today (50–69 or 74 years), as recommended in national and European guidelines. Because of the lower incidence of breast cancer at younger ages, the absolute effect is lower. Today we know that mammography quality is even more important in the younger age group due to more difficult detection within dense breast tissue.

Secondly, the mammographic technique and quality assurance of the complete chain from screening to screen reading, assessment, and treatment in modern population-based mammography screening programs is almost completely different from the CNBSS [42]. Abnormalities are routinely assessed using state of the art minimally invasive methods, and treatment is increasingly standardized and adapted to the stage at detection and the aggressiveness of the cancer.

Finally, the improved quality today clearly has increased the sensitivity and specificity of mammography screening. The result of 68 % of palpable breast cancers in the mammography arm (average size 1.9 cm) would today be unacceptable for annual (!) screening.

What can we conclude from this short review? Probable deficits of randomisation and of proper application of the test cannot be repaired by performing a follow-up study. The results of such a study remain biased. We do not want to discount the CNBSS, which represents an enormous and exceptional effort in the 1980s, and we took note of its size and of the time and circumstances in which it was conducted. Also, the important question asked in this trial was different from all other screening trials. However, the chosen methodology led to obvious biases with significant impact on the results, especially when long-term results are considered. Furthermore, it is more than obvious that the setting of mammography in the CNBSS is not comparable to present mammography screening programs. Therefore, using the CNBSS as “highest evidence” to assess the effects of modern mammography screening programs of the new millennium is not scientifically justified.

Considering that properly performed cohort studies and nested case-control studies with appropriate consideration of length time bias demonstrate a much higher effect of mortality reduction, the assumption that the null effect of the CNBSS is “due to availability of chemotherapy” [20] is unproven and highly speculative.

It remains an unanswered question why high-ranking journals, like the BMJ, and representatives of evidence-based medicine close their eyes to these arguments [43] and still refer to the CNBSS as “superior” evidence (against mammography screening).

Taking the CNBSS as an example, the authors want to point out how evidence in the field of breast cancer screening has systematically been omitted, distorted, or inappropriately used over the last decades. When using CNBSS data, opponents of screening mammography [6, 19, 44, 45] ignore or misinterpret an important part of the existing evidence. The consequence of this recommendation is “waiting until a cancer becomes palpable”. This means that, contrary to early detection, women would present at a stage that usually requires aggressive treatment including chemotherapy and more often axillary dissection. There is no doubt that evidence shows that the earlier the stage of breast cancer at diagnosis, the better the prognosis.

In conclusion, the comparison of the settings of the CNBSS with the setting of modern mammography screening is akin to comparing apples with pears. Drawing conclusions from the CNBSS for today’s quality-assured population-based screening programmes is an act of negligence. What we need today is the continuous evaluation of the ongoing mammography screening programmes, including, but not only, breast cancer mortality as an outcome.
Electronic supplementary material

ESM 1 (DOCX 21 kb)

Footnote 1: assigned during different time periods

Acknowledgements

The scientific guarantor of this publication is Sylvia H. Heywang-Koebrunner. The authors of this manuscript declare relationships with the following companies:

Prof. Sylvia Heywang-Koebrunner: 1. I am head of the Reference Center Mammography Screening Munich which is responsible for quality assurance and training of screeners in Bavaria and Thuringia. This makes 50 % of my work. 2. I work as head of a screening unit and I work in private practice, specialized in breast imaging and interventions. About 70 % of my work is associated with mammography screening.

Prof. Ingrid Schreer: No other relationships/conditions/circumstances that present a potential conflict of interest.

Dr. Astrid Hacker: I'm an employee of the Reference Center for Mammography Screening Munich.

Dr. Maria Noftz: No other relationships/conditions/circumstances that present a potential conflict of interest.

Prof. Dr. Alexander Katalinic: I am chair of the scientific committee of the German Mammography Screening. This is an independent scientific committee, which is giving advice for the ongoing program. The task is honorary; there is no fee paid to me or any committee member. Only costs for travel to the committee meetings are paid by the screening program.

The authors state that this work has not received any funding. One of the authors has significant statistical expertise. Institutional Review Board approval was obtained. No study subjects or cohorts have been previously reported in any other journal.

Methodology: The special report is a retrospective review on an existing trial performed in the 1980s (CNBSS) and a lately published follow-up study of the CNBSS. It is performed at one institution.

Document structure show

article-title	Conclusions for mammography screening after 25-year follow-up of the Canadian National Breast Cancer Screening Study (CNBSS)
abstract	Twenty-five-year follow-up data of the Canadian National Breast Cancer Screening Study (CNBSS) indicated no mortality reduction. What conclusions should be drawn? After conducting a systematic literature search and narrative analysis, we wish to recapitulate important details of this study, which may have been neglected: Sixty-eight percent of all included cancers were palpable, a situation that does not allow testing the value of early detection. Randomisation was performed at the sites after palpation, while blinding was not guaranteed. In the first round, this “randomisation" assigned 19/24 late stage cancers to the mammography group and only five to the control group, supporting the suspicion of severe errors in the randomisation process. The responsible physicist rated mammography quality as “far below state of the art of that time". Radiological advisors resigned during the study due to unacceptable image quality, training, and medical quality assurance. Each described problem may strongly influence the results between study and control groups. Twenty-five years of follow-up cannot heal these fundamental problems. This study is inappropriate for evidence-based conclusions. The technology and quality assurance of the diagnostic chain is shown to be contrary to today's screening programmes, and the results of the CNBSS are not applicable to them. Key Points • The evidence base of the Canadian study (CNBSS) has to be questioned. • Severe flaws in the randomization process and test methods occurred. • Problems were criticized during and after conclusion of the trial by experts. • The results are not applicable to quality-assured screening programs. • The evidence base of this study must be re-analyzed. Electronic supplementary material The online version of this article (doi:10.1007/s00330-015-3849-2) contains supplementary material, which is available to authorized users.
p	Twenty-five-year follow-up data of the Canadian National Breast Cancer Screening Study (CNBSS) indicated no mortality reduction. What conclusions should be drawn? After conducting a systematic literature search and narrative analysis, we wish to recapitulate important details of this study, which may have been neglected: Sixty-eight percent of all included cancers were palpable, a situation that does not allow testing the value of early detection. Randomisation was performed at the sites after palpation, while blinding was not guaranteed. In the first round, this “randomisation" assigned 19/24 late stage cancers to the mammography group and only five to the control group, supporting the suspicion of severe errors in the randomisation process. The responsible physicist rated mammography quality as “far below state of the art of that time". Radiological advisors resigned during the study due to unacceptable image quality, training, and medical quality assurance. Each described problem may strongly influence the results between study and control groups. Twenty-five years of follow-up cannot heal these fundamental problems. This study is inappropriate for evidence-based conclusions. The technology and quality assurance of the diagnostic chain is shown to be contrary to today's screening programmes, and the results of the CNBSS are not applicable to them.
p	Key Points
p	• The evidence base of the Canadian study (CNBSS) has to be questioned.
p	• Severe flaws in the randomization process and test methods occurred.
p	• Problems were criticized during and after conclusion of the trial by experts.
p	• The results are not applicable to quality-assured screening programs.
p	• The evidence base of this study must be re-analyzed.
sec	Electronic supplementary material The online version of this article (doi:10.1007/s00330-015-3849-2) contains supplementary material, which is available to authorized users.
title	Electronic supplementary material
p	The online version of this article (doi:10.1007/s00330-015-3849-2) contains supplementary material, which is available to authorized users.
body	Introduction The Canadian National Breast Cancer Screening Study (CNBSS) is one of eight large-scale randomized controlled trials on mammography screening. Compared to the other studies, it is an outlier. Following its first publication in 1992 [1, 2], the quality, randomization, and design of this study were highly debated in the scientific community. The CNBSS is an outlier among the eight usually cited randomized controlled trials (RCTs), including those with comparable follow-up [3], which showed an average mortality reduction for invited versus non-invited women aged 50–69 years of 23 % and for examined ages about 20 % [4–11]. It contradicts the results of today’s quality-assured mammography screening programmes and of the present Canadian screening programme, which report a mortality reduction ranging around 40 % for participating versus nonparticipating women [12–18]. However, it has been classified as one of two studies of high quality in Cochrane reviews, published by Goetzsche since 2000 [6, 19]. A 25-year follow-up of the CNBSS, with added data from two different studies, was published recently [20]. It claimed that mammography screening, performed annually for 5 years (from 1980 to 1985) in the age group 40–59 years, showed no benefit for breast cancer-specific mortality over clinical examination. If this outcome were taken at face value, mammography screening would have to be stopped immediately. Is this a reasonable conclusion for ongoing mammography screening programs? To answer this question, critical details of the CNBSS, which may have been lost, forgotten, or suppressed, are reconsidered. Another unasked question concerns its applicability to the present situation and to quality-assured screening programmes following European guidelines. Materials and methods A systematic search concerning the CNBSS was conducted, which was followed by thematic analysis of study findings and a narrative synthesis. 
The search included articles from 1981 to 2014 and was performed using PubMed and Embase databases. The following adapted MeSH terms were used: “CNBSS”, “Canadian National Breast Screening Study”, “NBSS”, and “Breast cancer”. The selected studies were limited to articles published in English. Papers, comments, editorials, letters, reports, and handbooks dealing with the subject were included. A total of 240 articles (PubMed n = 129, Embase n = 111) were identified. Additionally, a manual search and cited reference search generated another 41 articles, leading to a total of 281 articles, whose titles and abstracts were screened independently by two authors. They identified the articles that addressed specific structural, procedural, and evaluation issues of the CNBSS. One hundred and twenty-nine potentially relevant articles were saved for full text screening, while 53 articles were duplicates, 97 articles did not meet the inclusion criteria, and two articles were not available. These 129 articles were screened for the following topics: inclusion criteria, randomisation process, contamination of the control group, quality of mammograms, mammographic reading, and diagnostic chain. Forty-three articles were excluded, as they did not meet these inclusion criteria. Finally, 86 articles were selected for review (Fig. 1). Fig. 1 Flow diagram of systematic literature search Results Sixteen of the 86 eligible articles mentioned > 1 of the debated issues, but argued neither for nor against the quality of the study. Thirty-seven articles by various authors, including 15 articles (co-)authored by five radiological reviewers [21–29] (Kopans, Feig, Logan, Sickles, Tabar), one advisor (Moskowitz), and the responsible physicist [30] (Yaffe), described “severe flaws” of the study. Thirty-three articles, of which 27 were (co-)authored by the two principal investigators, Miller or Baines, defended the study (Fig. 2). 
Two of the radiological reviewers resigned during the study. Fig. 2 This chart shows the number of publications which criticized or defended the CNBSS. With very few exceptions, only the main investigators defended the trial. Numerous authors criticized it. Seven of these authors were involved with review or quality assurance of the CNBSS. These authors are mentioned explicitly. Some published several articles concerning the CNBSS. Further publications mentioned some of the issues, but did not comment on them. The list of references can be reviewed in the electronic supplementary material (ESM). The following issues were debated (Table 1): Table 1 Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature Topics Arguments CNBSS Literature Randomisation (n = 45) Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. Critique (n = 25) [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28] The disproportionately large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered a strong indicator of flawed randomisation. 
A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association to BC mortality) Defense: “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. Defense (n = 14) [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40],[6] They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6]. They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7]. Individual randomisation is assumed generally superior to cluster randomisation Not commented, but mentioned (n = 7) [41], [42], [43], [44], 45, [46], [47] Inclusion criteria (n = 14) By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable). Critique (n = 9) [8], [48], [13], [3], [4], [49], [25], [24], [23], Consequence: dilution of the measurable effect, study is underpowered for testing true screening effect Defense (n = 3) [37], [31], [29] Defense: issue not directly addressed (reasons for choosing this protocol are explained) Not commented, but mentioned (n = 2) [47], [46] Mammography quality (n = 46) According to the responsible physicist, outdated equipment was used. The equipment at some centers was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. 
Critique (n = 28) [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28] Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, high numbers of obviously missed cancers. The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. Quality control procedures were established for radiation physics and mammography interpretation.” Defense (n = 10) [29], [38], [33], [34], [36], [59], [60], [61], [62], [36] Not commented, but mentioned (n = 10) [43], [45], [63], [64], [65], [66], [67], [68], [69], [70] Reading quality (n = 26) Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1 year intervals. 
Critique (n = 16) [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57] Reading quality and training of readers has not been specifically addressed by proponents of the study Defense (n = 7) [33], [34], [36], [38], [7], [71], [62] Not commented, but mentioned (n = 3) [68], [63], [67] Recommended biopsies not performed (n = 3) “25 % of needle localization were recommended but not performed” [4, 5] Critique (n = 1) [4] “ …at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4] “Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36] Defense (n = 2) [36], [5] Not commented, but mentioned (n = 0) Contamination (n = 9) 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect Critique (n = 5) [24], [48], [13], [72], [21] Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect Defense (n = 1) [36] Not commented, but mentioned (n = 3) [66], [47], [41] This table gives an overview of the main issues discussed concerning the CNBSS. The list of references can be reviewed in the electronic supplementary material (ESM)Randomisation: Randomisation was performed after a clinical breast examination (CBE) by the principal investigators at each site [28, 31, 32]. This contradicted the initial study design [33], but was documented in the handbook of operation [34] and by an external investigation [32]. While Baines states that “the center coordinators … were blind to the CBE”, other authors including external reviewers of the study reported the opposite [28, 34]. Also, the coordinator of one of the CNBSS sites was removed from her position because of suspected subversion of the randomisation process [31]. 
An external investigation of possible fraud [32] stated that in 12 out of 15 sites “…the nurses and probably the coordinators were aware of the findings of the clinical examination when the allocation was made”. In addition, more alterations of the allocation book were found in the mammography screening arm than in the control group (>100 unexplained alterations) [28, 31, 32, 34]. This investigation, which mainly concentrated on checking erasures of participants’ names in the allocation book, did “not uncover credible evidence of subversion” [32]. However, Boyd [31] and others (e.g. Kopans, Burhenne) pointed out that various other easy possibilities of subversion existed that could not be excluded in the absence of blinding. The external investigation stated, considering that “…referral would not have ensured mammography. The charge has been made that there remained a motive for the examiner or coordinator to subvert the randomisation, if for clinical or other reasons he or she believed that the subject should…have a mammogram” [32]. Unfortunately, a severe imbalance of the distribution of advanced cancers (>4 involved lymph nodes) was noted among women younger than 50 years screened in the first round. Nineteen women with far advanced cancers (>4 involved lymph nodes) were allocated to the screening group versus five in the control group, and more women with prior breast cancer were reported in the screening group (n = 8 versus n = 1) [28, 31, 34–36]. While Baines argues that more than 50 other variables (“…demographic and risk factors”) showed “virtually identical distribution across control and study groups”, Kopans pointed out that “….shifting a much smaller number of advanced cancers to the study group would substantially affect mortality…..without producing a demographic imbalance”. A bias of randomisation must also be suspected when comparing the hazard ratios of the mammography arm in the prevalence round versus subsequent rounds: 1.47 versus 0.9 [20]. 
Goetzsche investigated none of the above issues of the randomisation process. Instead, he classified the CNBSS as one of two “high quality” RCTs using one formal criterion: he rated individual randomisation higher than cluster randomisation (where demographic regions are invited/not invited as opposed to individuals). Comparable to the Malmö study, which also used individual randomisation, a high proportion of cross-over was reported: 26 % of the women in the control group underwent mammography [37]. Inclusion criteria: Women with palpable tumours were included in the CNBSS [1, 2]. This fact alone calls for scrutinizing the inclusion criteria. In the mammography arm of the CNBSS 68 % of breast cancers were already palpable. Furthermore, women with previous breast cancers were included [28, 31], participants whose prognosis may already have been determined by the prior disease. Quality of the screening procedure: Miller claimed that all equipment was new and that quality assurance was in place [11, 6]. The responsible physicist stated [30], “quality was far below state of the art – even at that time (early 1980s). Problems … from inadequate equipment, … inappropriate imaging technique and lack of….specialized training….”. This was confirmed by several others including external reviewers [28, 29, 34], who specified that “training for mammography technologists or radiologists and a quality assurance were not present”. The study had initially been overseen by two highly renowned advisors (S. Feig and W. A. Logan). They resigned from the study after 3 years due to unacceptable quality. Subsequent external review rated image quality as acceptable in < 40 % of the cases during 1980-1984, in 1985 in about 60 %, and in 1986-7 in < 85 %. Problems included incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images [28, 38] (Fig. 3). 
Overall, a very large number of interval cancers (143/575) resulted in this annual screening trial. This exceeds, by a factor of about 3, the rate considered acceptable for year one of the interval in modern quality-assured mammography screening. According to the reviewers, the above problems of poor image quality and inadequate training of personnel and readers may have accounted for the majority of the excess interval cancers. A subgroup analysis by Moskowitz [39] reported an increase of the detection rate with improving image quality at the end of the study, supporting a correlation with image quality. Fig. 3 Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS-study), and a follow-up mammogram of the same patient of 1993. Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast, making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto, Canada. Finally, it has been estimated that 25 % of the recommended biopsies were not performed. One of the surgeons was convinced that nonpalpable lesions detected by mammography did not require biopsy [1, 28]. Discussion and conclusion The above summary of the published results demonstrates several debatable issues concerning the CNBSS: It has to be emphasized that by definition, screening should address asymptomatic women [40]. Current screening is population-based and aims to invite asymptomatic women. The prevalent round of the Canadian trial had a very high proportion of palpable cancers (partly skewed by the nature of their recruitment strategy). Thus, their results are not applicable to current practice where there is uniform access to high quality symptomatic services for women with symptoms. 
By accepting palpable tumours (most probably advanced and with worse prognosis), the results are skewed: the overall numbers of cancers are artificially high while the palpable cancers cannot contribute to an improved mortality reduction. Thus, the screening effect will be considerably diluted and underestimated. This leads also to underpowered statistics [28, 31]. The documented process of randomisation did not warrant blinding. Thus, any person involved in the study could subvert the randomisation. Also, the probability of subversion was enhanced since mammography was not necessarily offered to women in the control group. We do not assume that the principal investigators committed any fraud. However, they could not have prevented subversion with the chosen protocol. The disproportional distribution of far advanced stages (cancers with > 4 involved lymph nodes) in the prevalence screen < age 50 years is highly significant and supports the doubts concerning correct randomisation. An even distribution of demographic and risk factors cannot exclude a bias toward late stage cancers, which may severely impact on the assessment of mortality reduction and calculation of overdiagnoses. Long-term mortality reduction was calculated from a maximum of five annual rounds. Because of the short overall duration and continuing entrance of first round screenees, the maximum screening effect could not be reached for many of the participants. Mortality reduction was calculated based on cumulative rates of a mixed trial participation of one to five rounds during up to 5 years. This might lead to a substantial underestimation of the true screening effect compared to a screening programme following approved guidelines (in which participants undergo approximately 10 complete rounds in 20 years) [41]. The higher evidence classification of individual versus cluster randomized studies is correct in principle. 
But for screening trials, where non-invited and invited women cannot be blinded, individual randomisation may lead to a much higher contamination of the control group than in a cluster randomized setting. Thus, in the CNBSS, as in other individual randomized screening trials, a substantial underestimation of the screening effect through contamination cannot be ruled out. The fact that none of the radiological reviewers including the responsible physicist considered the quality sufficient is highly concerning, as is the described lack of technologist and reader training and the high rate of interval cancers. Two reviewers resigned during the study. How can a method be tested if it is not properly performed and interpreted? What is the value of the results? The fact that recommended biopsies of mammographically detected abnormalities were not systematically performed is likely to have distorted the results. What effect is expected from early detection if suspicious findings are not followed by adequate assessment and therapy? Both obvious and probable protocol deficiencies are likely to have had an impact on the results counteracting a possible effect of mammography screening on breast cancer mortality and distorting estimates on overdiagnosis. As appropriate randomisation is one of the key validity criteria for RCTs, a study with that kind of violation should not be rated as a high quality study. Whether evidence from such a trial can be used at all must be questioned. Even if the raised concerns were insignificant and the results were valid, the question remains whether the results and conclusions from a screening trial performed in 1980-85 are applicable to or useful for the assessment of present screening programmes. The most appropriate answer is obviously “no”. 
First, the age range in the CNBSS was 40–59 years, which does not apply to the age range of most mammography screening programs today (50–69 or 74 years), as recommended in National and European Guidelines. Because of the lower incidence of breast cancer at younger ages, the absolute effect is lower. Today we know that mammography quality is even more important in the younger age group due to more difficult detection within dense breast tissue. Secondly, the mammographic technique and quality assurance of the complete chain from screening to screen reading, assessment, and treatment in modern population-based mammography screening programs is almost completely different from the CNBSS [42]. Abnormalities are routinely assessed using state of the art minimally invasive methods, and treatment is increasingly standardized and adapted to the stage at detection and the aggressiveness of the cancer. Finally the improved quality today clearly has increased the sensitivity and specificity of mammography screening. The result of 68 % of palpable breast cancers in the mammography arm (average size 1.9 cm) would today be unacceptable for annual (!) screening. What can we conclude from this short review? Probable deficits of randomisation and of proper application of the test cannot be repaired by performing a follow-up study. The results of such a study remain biased. We do not want to discount the CNBSS which represents an enormous and exceptional effort in the 1980s, and we took note of its size, the time, and circumstances when it was conducted. Also, the important question asked in this trial was different from all other screening trials. However, the chosen methodology led to obvious biases with significant impact on the results, especially when long-term results are considered. Furthermore, it is more than obvious that the setting of mammography in the CNBSS is not comparable to present mammography screening programs. 
Therefore, using the CNBSS as “highest evidence” to assess the effects of modern mammography screening programs of the new millennium is not scientifically justified. Considering the fact that properly performed cohort studies and nested case-control studies with appropriate consideration of length time bias demonstrate a much higher effect of mortality reduction the assumption that the null effect of the CNBSS is “due to availability of chemotherapy” [20] is unproven and highly speculative. It is an unanswered question, why high-ranking journals, like the BMJ, and representatives of evidence-based medicine close their eyes to these arguments [43] and still refer to the CNBSS as “superior” evidence (against mammography screening). Taking the CNBSS as an example, the authors want to point out how evidence in the field of breast cancer screening has systematically been omitted, distorted, or inappropriately used over the last decades. When using CNBSS data, opponents of screening mammography [6, 19, 44, 45] ignore or misinterpret an important part of the existing evidence. The consequence of this recommendation is “waiting until a cancer becomes palpable”. This means that contrary to early detection, women would present at a stage that usually requires aggressive treatment including chemotherapy and more often axillary dissection. There is no doubt that evidence shows that the earlier the stage of breast cancer at diagnosis, the better the prognosis. In conclusion, the comparison of the settings of the CNBSS with the setting of modern mammography screening is akin to comparing apples with pears. Drawing conclusions from the CNBSS for today’s quality-assured population based screening programmes is an act of negligence. What we need today is the continuous evaluation of the ongoing mammography screening programmes, including, but not only, breast cancer mortality as an outcome. Electronic supplementary material ESM 1 (DOCX 21 kb)
sec	Introduction The Canadian National Breast Cancer Screening Study (CNBSS) is one of eight large-scale randomized controlled trials on mammography screening. Compared to the other studies, it is an outlier. Following its first publication in 1992 [1, 2], the quality, randomization, and design of this study were highly debated in the scientific community. The CNBSS is an outlier among the eight usually cited randomized controlled trials (RCTs), including those with comparable follow-up [3], which showed an average mortality reduction for invited versus non-invited women aged 50–69 years of 23 % and for examined ages about 20 % [4–11]. It contradicts the results of today’s quality-assured mammography screening programmes and of the present Canadian screening programme, which report a mortality reduction ranging around 40 % for participating versus nonparticipating women [12–18]. However, it has been classified as one of two studies of high quality in Cochrane reviews, published by Goetzsche since 2000 [6, 19]. A 25-year follow-up of the CNBSS, with added data from two different studies, was published recently [20]. It claimed that mammography screening, performed annually for 5 years (from 1980 to 1985) in the age group 40–59 years, showed no benefit for breast cancer-specific mortality over clinical examination. If this outcome were taken at face value, mammography screening would have to be stopped immediately. Is this a reasonable conclusion for ongoing mammography screening programs? To answer this question, critical details of the CNBSS, which may have been lost, forgotten, or suppressed, are reconsidered. Another unasked question concerns its applicability to the present situation and to quality-assured screening programmes following European guidelines.
title	Introduction
p	The Canadian National Breast Cancer Screening Study (CNBSS) is one of eight large-scale randomized controlled trials on mammography screening. Compared to the other studies, it is an outlier.
p	Following its first publication in 1992 [1, 2], the quality, randomization, and design of this study were highly debated in the scientific community. The CNBSS is an outlier among the eight usually cited randomized controlled trials (RCTs), including those with comparable follow-up [3], which showed an average mortality reduction for invited versus non-invited women aged 50–69 years of 23 % and for examined ages about 20 % [4–11].
p	It contradicts the results of today’s quality-assured mammography screening programmes and of the present Canadian screening programme, which report a mortality reduction ranging around 40 % for participating versus nonparticipating women [12–18].
p	However, it has been classified as one of two studies of high quality in Cochrane reviews, published by Goetzsche since 2000 [6, 19]. A 25-year follow-up of the CNBSS, with added data from two different studies, was published recently [20]. It claimed that mammography screening, performed annually for 5 years (from 1980 to 1985) in the age group 40–59 years, showed no benefit for breast cancer-specific mortality over clinical examination. If this outcome were taken at face value, mammography screening would have to be stopped immediately.
p	Is this a reasonable conclusion for ongoing mammography screening programs? To answer this question, critical details of the CNBSS, which may have been lost, forgotten, or suppressed, are reconsidered. Another unasked question concerns its applicability to the present situation and to quality-assured screening programmes following European guidelines.
sec	Materials and methods A systematic search concerning the CNBSS was conducted, which was followed by thematic analysis of study findings and a narrative synthesis. The search included articles from 1981 to 2014 and was performed using PubMed and Embase databases. The following adapted MeSH-Terms were used: “CNBSS”, “Canadian National Breast Screening Study”, “NBSS”, and “Breast cancer”. The selected studies were limited to articles published in English. Papers, comments, editorials, letters, reports, and handbooks dealing with the subject were included. A total of 240 articles (PubMed n = 129, Embase n = 111) were identified. Additionally, a manual search and cited reference search generated another 41 articles leading to a total of 281 articles, of which a title and abstract screening was executed by two authors independently. They identified those articles, which treated specific structural, procedural and evaluation issues of the CNBSS. One hundred and twenty-nine potentially relevant articles were saved for full text screening, while 53 articles were duplicates, 97 articles did not meet the inclusion criteria, and two articles were not available. These 129 articles were screened for the following topics: inclusion criteria, randomisation process, contamination of the control group, quality of mammograms, mammographic reading, and diagnostic chain. Forty-three articles were excluded, as they did not meet these inclusion criteria. Finally, 86 articles were selected for review (Fig. 1). Fig. 1 Flow diagram of systematic literature search
title	Materials and methods
p	A systematic search concerning the CNBSS was conducted, which was followed by thematic analysis of study findings and a narrative synthesis. The search included articles from 1981 to 2014 and was performed using PubMed and Embase databases. The following adapted MeSH-Terms were used: “CNBSS”, “Canadian National Breast Screening Study”, “NBSS”, and “Breast cancer”. The selected studies were limited to articles published in English. Papers, comments, editorials, letters, reports, and handbooks dealing with the subject were included.
p	A total of 240 articles (PubMed n = 129, Embase n = 111) were identified. Additionally, a manual search and cited reference search generated another 41 articles, leading to a total of 281 articles, whose titles and abstracts were screened independently by two authors. They identified the articles that addressed specific structural, procedural, and evaluation issues of the CNBSS. One hundred and twenty-nine potentially relevant articles were saved for full text screening, while 53 articles were duplicates, 97 articles did not meet the inclusion criteria, and two articles were not available. These 129 articles were screened for the following topics: inclusion criteria, randomisation process, contamination of the control group, quality of mammograms, mammographic reading, and diagnostic chain. Forty-three articles were excluded, as they did not meet these inclusion criteria. Finally, 86 articles were selected for review (Fig. 1).
figure	Fig. 1 Flow diagram of systematic literature search
label	Fig. 1
caption	Flow diagram of systematic literature search
p	Flow diagram of systematic literature search
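p	The article counts reported in the Materials and methods section can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the numbers stated in the text; the variable names are illustrative assumptions and are not part of the original study.

```python
# Counts as reported in the Materials and methods section.
pubmed, embase = 129, 111
manual_and_cited = 41  # manual search plus cited reference search

identified = pubmed + embase              # database hits
screened = identified + manual_and_cited  # entered title/abstract screening

duplicates, not_eligible, unavailable = 53, 97, 2
full_text = screened - duplicates - not_eligible - unavailable

excluded_on_topic = 43                    # did not address the six review topics
included = full_text - excluded_on_topic

print(identified, screened, full_text, included)
```

p	Under these assumptions, each intermediate total agrees with the figure given in the text (240, 281, 129, and 86 articles).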
sec	Results Sixteen of the 86 eligible articles mentioned > 1 of the debated issues, but argued neither for nor against the quality of the study. Thirty-seven articles by various authors, including 15 articles (co-)authored by five radiological reviewers [21–29] (Kopans, Feig, Logan, Sickles, Tabar), one advisor (Moskowitz), and the responsible physicist [30] (Yaffe), described “severe flaws” of the study. Thirty-three articles, of which 27 were (co-)authored by the two principal investigators, Miller or Baines, defended the study (Fig. 2). Two of the radiological reviewers resigned during the study. Fig. 2 This chart shows the number of publications which criticized or defended the CNBSS. With very few exceptions, only the main investigators defended the trial. Numerous authors criticized it. Seven of these authors were involved with review or quality assurance of the CNBSS. These authors are mentioned explicitly. Some published several articles concerning the CNBSS. Further publications mentioned some of the issues, but did not comment on them. The list of references can be reviewed in the electronic supplementary material (ESM). The following issues were debated (Table 1): Table 1 Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature Topics Arguments CNBSS Literature Randomisation (n = 45) Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. 
Critique (n = 25) [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22] [23],[24], [25], [26], [27], [28] The disproportionally large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered as a strong indicator of flawed randomisation. A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association to BC mortality) Defense: “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. Defense (n = 14) [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40],[6] They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6]. They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7]. Individual randomisation is assumed generally superior to cluster randomisation Not commented, but mentioned (n = 7) [41], [42], [43], [44], 45, [46], [47] Inclusion criteria (n = 14) By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable). 
Critique (n = 9) [8], [48], [13], [3], [4], [49], [25], [24], [23], Consequence: dilution of the measurable effect, study is underpowered for testing true screening effect Defense (n = 3) [37], [31], [29] Defense: issue not directly addressed (reasons for choosing this protocol are explained) Not commented, but mentioned (n = 2) [47], [46] Mammography quality (n = 46) According to the responsible physicist, outdated equipment was used. The equipment at some centers was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. Critique (n = 28) [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28] Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, high numbers of obviously missed cancers. The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. Quality control procedures were established for radiation physics and mammography interpretation.” Defense (n = 10) [29], [38], [33], [34], [36], [59], [60], [61], [62], [36] Not commented, but mentioned (n = 10) [43], [45], [63], [64], [65], [66], [67], [68], [69], [70] Reading quality (n = 26) Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1 year intervals. 
Critique (n = 16) [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57] Reading quality and training of readers has not been specifically addressed by proponents of the study Defense (n = 7) [33], [34], [36], [38], [7], [71], [62] Not commented, but mentioned (n = 3) [68], [63], [67] Recommended biopsies not performed (n = 3) “25 % of needle localization were recommended but not performed” [4, 5] Critique (n = 1) [4] “ …at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4] “Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36] Defense (n = 2) [36], [5] Not commented, but mentioned (n = 0) Contamination (n = 9) 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect Critique (n = 5) [24], [48], [13], [72], [21] Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect Defense (n = 1) [36] Not commented, but mentioned (n = 3) [66], [47], [41] This table gives an overview of the main issues discussed concerning the CNBSS. The list of references can be reviewed in the electronic supplementary material (ESM)Randomisation: Randomisation was performed after a clinical breast examination (CBE) by the principal investigators at each site [28, 31, 32]. This contradicted the initial study design [33], but was documented in the handbook of operation [34] and by an external investigation [32]. While Baines states that “the center coordinators … were blind to the CBE”, other authors including external reviewers of the study reported the opposite [28, 34]. Also, the coordinator of one of the CNBSS sites was removed from her position because of suspected subversion of the randomisation process [31]. 
An external investigation [32] for possible fraud stated that in 12 out of 15 sites “…the nurses and probably the coordinators were aware of the findings of the clinical examination when the allocation was made”. In addition, more alterations of the allocation book were found in the mammography screening arm than in the control group (>100 unexplained alterations) [28, 31, 32, 34]. This investigation, which mainly concentrated on checking erasures of participants’ names in the allocation book, did “not uncover credible evidence of subversion” [32]. However, Boyd [31] and others (e.g. Kopans, Burhenne) pointed out that various other easy possibilities of subversion existed that could not be excluded in the absence of blinding. The external investigation stated, considering that “…referral would not have ensured mammography. The charge has been made that there remained a motive for the examiner or coordinator to subvert the randomisation, if for clinical or other reasons he or she believed that the subject should…have a mammogram” [32]. Unfortunately, a severe imbalance of the distribution of advanced cancers (>4 involved lymph nodes) was noted among those women screened in the first round < age 50 years. Nineteen women with far advanced cancers (>4 involved lymph nodes) were allocated to the screening group versus five in the control group, and more women with prior breast cancer were reported in the screening group (n = 8 versus n = 1) [28, 31, 34–36]. While Baines argues that more than 50 other variables (“…demographic and risk factors”) showed “virtually identical distribution across control and study groups”, Kopans pointed out that “….shifting a much smaller number of advanced cancers to the study group would substantially affect mortality…..without producing a demographic imbalance”. A bias of randomisation must also be suspected when comparing the hazard ratios of the mammography arm in the prevalence round versus subsequent rounds: 1.47 versus 0.9 [20]. 
Goetzsche investigated none of the above issues of the randomisation process. Instead, he classified the CNBSS as one of two “high quality” RCTs using one formal criterion: he rated individual randomisation higher than cluster randomisation (where demographic regions are invited/not invited as opposed to individuals). Comparable to the Malmö study, which also used individual randomisation, a high proportion of cross-over was reported: 26 % of the women in the control group underwent mammography [37]. Inclusion criteria: Women with palpable tumours were included in the CNBSS [1, 2]. This fact alone calls for scrutinizing the inclusion criteria. In the mammography arm of the CNBSS 68 % of breast cancers were already palpable. Furthermore, women with previous breast cancers were included [28, 31], participants whose prognosis may already have been determined by the prior disease. Quality of the screening procedure: Miller claimed that all equipment was new and that quality assurance was in place [11, 6]. The responsible physicist stated [30], “quality was far below state of the art – even at that time (early 1980s). Problems … from inadequate equipment, … inappropriate imaging technique and lack of….specialized training….”. This was confirmed by several others including external reviewers [28, 29, 34], who specified that “training for mammography technologists or radiologists and a quality assurance were not present”. The study had initially been overseen by two highly renowned advisors (S. Feig and W. A. Logan). They resigned from the study after 3 years due to unacceptable quality. Subsequent external review rated image quality as acceptable in < 40 % of the cases during 1980-1984, in 1985 in about 60 %, and in 1986-7 in < 85 %. Problems included incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images [28, 38] (Fig. 3). 
Overall, a very large number of interval cancers (143/575) resulted in this annual screening trial. It exceeds the rate, which in modern quality-assured mammography screening is considered acceptable for year one of the interval, by a factor of about 3. According to the reviewers, the above problems of poor image quality, inadequate training of personnel and readers may have accounted for the majority of the excess interval cancers. A subgroup analysis by Moskowitz [39] reported an increase of the detection rate with improving image quality at the end of the study, supporting a correlation with image quality. Fig. 3 Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS-study), and a follow-up mammogram of the same patient of 1993. Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto Canada Finally, it has been estimated that 25 % of the recommended biopsies were not performed. One of the surgeons was convinced that non palpable lesions detected by mammography did not require biopsy [1, 28].
title	Results
p	Sixteen of the 86 eligible articles mentioned > 1 of the debated issues, but argued neither for nor against the quality of the study. Thirty-seven articles by various authors, including 15 articles (co-)authored by five radiological reviewers [21–29] (Kopans, Feig, Logan, Sickles, Tabar), one advisor (Moskowitz), and the responsible physicist [30] (Yaffe), reported “severe flaws” of the study. Thirty-three articles, of which 27 were (co-)authored by the two principal investigators, Miller or Baines, defended the study (Fig. 2). Two of the radiological reviewers resigned during the study. Fig. 2 This chart shows the number of publications which criticized or defended the CNBSS. With very few exceptions, only the main investigators defended the trial. Numerous authors criticized it. Seven of these authors were involved with review or quality assurance of the CNBSS. These authors are mentioned explicitly. Some published several articles concerning the CNBSS. Further publications mentioned some of the issues, but did not comment on them. The list of references can be reviewed in the electronic supplementary material (ESM)
figure	Fig. 2 This chart shows the number of publications which criticized or defended the CNBSS. With very few exceptions, only the main investigators defended the trial. Numerous authors criticized it. Seven of these authors were involved with review or quality assurance of the CNBSS. These authors are mentioned explicitly. Some published several articles concerning the CNBSS. Further publications mentioned some of the issues, but did not comment on them. The list of references can be reviewed in the electronic supplementary material (ESM)
label	Fig. 2
caption	This chart shows the number of publications which criticized or defended the CNBSS. With very few exceptions, only the main investigators defended the trial. Numerous authors criticized it. Seven of these authors were involved with review or quality assurance of the CNBSS. These authors are mentioned explicitly. Some published several articles concerning the CNBSS. Further publications mentioned some of the issues, but did not comment on them. The list of references can be reviewed in the electronic supplementary material (ESM)
p	This chart shows the number of publications which criticized or defended the CNBSS. With very few exceptions, only the main investigators defended the trial. Numerous authors criticized it. Seven of these authors were involved with review or quality assurance of the CNBSS. These authors are mentioned explicitly. Some published several articles concerning the CNBSS. Further publications mentioned some of the issues, but did not comment on them. The list of references can be reviewed in the electronic supplementary material (ESM)
p	The following issues were debated (Table 1): Table 1 Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature Topics Arguments CNBSS Literature Randomisation (n = 45) Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. Critique (n = 25) [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22] [23],[24], [25], [26], [27], [28] The disproportionally large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered as a strong indicator of flawed randomisation. A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association to BC mortality) Defense: “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. Defense (n = 14) [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40],[6] They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6]. 
They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7]. Individual randomisation is assumed generally superior to cluster randomisation Not commented, but mentioned (n = 7) [41], [42], [43], [44], 45, [46], [47] Inclusion criteria (n = 14) By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable). Critique (n = 9) [8], [48], [13], [3], [4], [49], [25], [24], [23], Consequence: dilution of the measurable effect, study is underpowered for testing true screening effect Defense (n = 3) [37], [31], [29] Defense: issue not directly addressed (reasons for choosing this protocol are explained) Not commented, but mentioned (n = 2) [47], [46] Mammography quality (n = 46) According to the responsible physicist, outdated equipment was used. The equipment at some centers was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. Critique (n = 28) [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28] Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, high numbers of obviously missed cancers. The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. 
Quality control procedures were established for radiation physics and mammography interpretation.” Defense (n = 10) [29], [38], [33], [34], [36], [59], [60], [61], [62], [36] Not commented, but mentioned (n = 10) [43], [45], [63], [64], [65], [66], [67], [68], [69], [70] Reading quality (n = 26) Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1 year intervals. Critique (n = 16) [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57] Reading quality and training of readers has not been specifically addressed by proponents of the study Defense (n = 7) [33], [34], [36], [38], [7], [71], [62] Not commented, but mentioned (n = 3) [68], [63], [67] Recommended biopsies not performed (n = 3) “25 % of needle localization were recommended but not performed” [4, 5] Critique (n = 1) [4] “ …at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4] “Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36] Defense (n = 2) [36], [5] Not commented, but mentioned (n = 0) Contamination (n = 9) 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect Critique (n = 5) [24], [48], [13], [72], [21] Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect Defense (n = 1) [36] Not commented, but mentioned (n = 3) [66], [47], [41] This table gives an overview of the main issues discussed concerning the CNBSS. 
The list of references can be reviewed in the electronic supplementary material (ESM) Randomisation: Randomisation was performed after a clinical breast examination (CBE) by the principal investigators at each site [28, 31, 32]. This contradicted the initial study design [33], but was documented in the handbook of operation [34] and by an external investigation [32]. While Baines states that “the center coordinators … were blind to the CBE”, other authors including external reviewers of the study reported the opposite [28, 34]. Also, the coordinator of one of the CNBSS sites was removed from her position because of suspected subversion of the randomisation process [31]. An external investigation [32] for possible fraud stated that in 12 out of 15 sites “…the nurses and probably the coordinators were aware of the findings of the clinical examination when the allocation was made”. In addition, more alterations of the allocation book were found in the mammography screening arm than in the control group (>100 unexplained alterations) [28, 31, 32, 34]. This investigation, which mainly concentrated on checking erasures of participants’ names in the allocation book, did “not uncover credible evidence of subversion” [32]. However, Boyd [31] and others (e.g. Kopans, Burhenne) pointed out that various other easy possibilities of subversion existed that could not be excluded in the absence of blinding. The external investigation stated, considering that “…referral would not have ensured mammography. The charge has been made that there remained a motive for the examiner or coordinator to subvert the randomisation, if for clinical or other reasons he or she believed that the subject should…have a mammogram” [32]. Unfortunately, a severe imbalance of the distribution of advanced cancers (>4 involved lymph nodes) was noted among those women screened in the first round < age 50 years. 
Nineteen women with far advanced cancers (>4 involved lymph nodes) were allocated to the screening group versus five in the control group, and more women with prior breast cancer were reported in the screening group (n = 8 versus n = 1) [28, 31, 34–36]. While Baines argues that more than 50 other variables (“…demographic and risk factors”) showed “virtually identical distribution across control and study groups”, Kopans pointed out that “….shifting a much smaller number of advanced cancers to the study group would substantially affect mortality…..without producing a demographic imbalance”. A bias of randomisation must also be suspected when comparing the hazard ratios of the mammography arm in the prevalence round versus subsequent rounds: 1.47 versus 0.9 [20]. Goetzsche investigated none of the above issues of the randomisation process. Instead, he classified the CNBSS as one of two “high quality” RCTs using one formal criterion: he rated individual randomisation higher than cluster randomisation (where demographic regions are invited/not invited as opposed to individuals). Comparable to the Malmö study, which also used individual randomisation, a high proportion of cross-over was reported: 26 % of the women in the control group underwent mammography [37]. Inclusion criteria: Women with palpable tumours were included in the CNBSS [1, 2]. This fact alone calls for scrutinizing the inclusion criteria. In the mammography arm of the CNBSS 68 % of breast cancers were already palpable. Furthermore, women with previous breast cancers were included [28, 31], participants whose prognosis may already have been determined by the prior disease. Quality of the screening procedure: Miller claimed that all equipment was new and that quality assurance was in place [11, 6]. The responsible physicist stated [30], “quality was far below state of the art – even at that time (early 1980s). 
Problems … from inadequate equipment, … inappropriate imaging technique and lack of….specialized training….”. This was confirmed by several others including external reviewers [28, 29, 34], who specified that “training for mammography technologists or radiologists and a quality assurance were not present”. The study had initially been overseen by two highly renowned advisors (S. Feig and W. A. Logan). They resigned from the study after 3 years due to unacceptable quality. Subsequent external review rated image quality as acceptable in < 40 % of the cases during 1980-1984, in 1985 in about 60 %, and in 1986-7 in < 85 %. Problems included incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images [28, 38] (Fig. 3). Overall, a very large number of interval cancers (143/575) resulted in this annual screening trial. It exceeds the rate, which in modern quality-assured mammography screening is considered acceptable for year one of the interval, by a factor of about 3. According to the reviewers, the above problems of poor image quality, inadequate training of personnel and readers may have accounted for the majority of the excess interval cancers. A subgroup analysis by Moskowitz [39] reported an increase of the detection rate with improving image quality at the end of the study, supporting a correlation with image quality. Fig. 3 Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS-study), and a follow-up mammogram of the same patient of 1993. Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto Canada
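To give a sense of how unusual the reported 19 versus 5 split of advanced cancers would be under unbiased allocation, the binomial tail probability can be computed directly. This is only an illustrative sketch and not an analysis from the cited literature: it assumes that each of the 24 women had an equal and independent chance (p = 0.5) of entering either arm, and the function name `upper_tail` is hypothetical.

```python
from math import comb


def upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Probability of observing k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Advanced cancers (>4 involved lymph nodes) in the first round, age < 50 years:
# 19 in the mammography arm and 5 in the control arm, as reported in the text.
in_screening_arm, in_control_arm = 19, 5
n_advanced = in_screening_arm + in_control_arm

# Assumption: 1:1 allocation, so each case lands in either arm with p = 0.5.
p_one_sided = upper_tail(in_screening_arm, n_advanced)
print(f"P(X >= {in_screening_arm} of {n_advanced}) = {p_one_sided:.4f}")
```

Under these assumptions the one-sided probability is roughly 0.003. Such a figure cannot by itself prove subversion, but it illustrates why critics regarded the imbalance as a strong indicator of flawed randomisation.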
table-wrap	Table 1 Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature Topics Arguments CNBSS Literature Randomisation (n = 45) Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. Critique (n = 25) [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22] [23],[24], [25], [26], [27], [28] The disproportionally large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered as a strong indicator of flawed randomisation. A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association to BC mortality) Defense: “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. Defense (n = 14) [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40],[6] They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6]. They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7]. 
Individual randomisation is assumed generally superior to cluster randomisation Not commented, but mentioned (n = 7) [41], [42], [43], [44], 45, [46], [47] Inclusion criteria (n = 14) By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable). Critique (n = 9) [8], [48], [13], [3], [4], [49], [25], [24], [23], Consequence: dilution of the measurable effect, study is underpowered for testing true screening effect Defense (n = 3) [37], [31], [29] Defense: issue not directly addressed (reasons for choosing this protocol are explained) Not commented, but mentioned (n = 2) [47], [46] Mammography quality (n = 46) According to the responsible physicist, outdated equipment was used. The equipment at some centers was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. Critique (n = 28) [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28] Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, high numbers of obviously missed cancers. The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. 
Quality control procedures were established for radiation physics and mammography interpretation.” Defense (n = 10) [29], [38], [33], [34], [36], [59], [60], [61], [62], [36] Not commented, but mentioned (n = 10) [43], [45], [63], [64], [65], [66], [67], [68], [69], [70] Reading quality (n = 26) Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1 year intervals. Critique (n = 16) [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57] Reading quality and training of readers has not been specifically addressed by proponents of the study Defense (n = 7) [33], [34], [36], [38], [7], [71], [62] Not commented, but mentioned (n = 3) [68], [63], [67] Recommended biopsies not performed (n = 3) “25 % of needle localization were recommended but not performed” [4, 5] Critique (n = 1) [4] “ …at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4] “Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36] Defense (n = 2) [36], [5] Not commented, but mentioned (n = 0) Contamination (n = 9) 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect Critique (n = 5) [24], [48], [13], [72], [21] Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect Defense (n = 1) [36] Not commented, but mentioned (n = 3) [66], [47], [41] This table gives an overview of the main issues discussed concerning the CNBSS. The list of references can be reviewed in the electronic supplementary material (ESM)
label	Table 1
caption	Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature
p	Issues concerning design, quality assurance, and evaluation of the CNBSS: all cited literature
table	Topics Arguments CNBSS Literature Randomisation (n = 45) Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. Critique (n = 25) [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22] [23],[24], [25], [26], [27], [28] The disproportionally large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered as a strong indicator of flawed randomisation. A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association to BC mortality) Defense: “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. Defense (n = 14) [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40],[6] They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6]. They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7]. 
Individual randomisation is assumed generally superior to cluster randomisation Not commented, but mentioned (n = 7) [41], [42], [43], [44], 45, [46], [47] Inclusion criteria (n = 14) By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable). Critique (n = 9) [8], [48], [13], [3], [4], [49], [25], [24], [23], Consequence: dilution of the measurable effect, study is underpowered for testing true screening effect Defense (n = 3) [37], [31], [29] Defense: issue not directly addressed (reasons for choosing this protocol are explained) Not commented, but mentioned (n = 2) [47], [46] Mammography quality (n = 46) According to the responsible physicist, outdated equipment was used. The equipment at some centers was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. Critique (n = 28) [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28] Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, high numbers of obviously missed cancers. The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. 
Quality control procedures were established for radiation physics and mammography interpretation.” Defense (n = 10) [29], [38], [33], [34], [36], [59], [60], [61], [62], [36] Not commented, but mentioned (n = 10) [43], [45], [63], [64], [65], [66], [67], [68], [69], [70] Reading quality (n = 26) Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1 year intervals. Critique (n = 16) [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57] Reading quality and training of readers has not been specifically addressed by proponents of the study Defense (n = 7) [33], [34], [36], [38], [7], [71], [62] Not commented, but mentioned (n = 3) [68], [63], [67] Recommended biopsies not performed (n = 3) “25 % of needle localization were recommended but not performed” [4, 5] Critique (n = 1) [4] “ …at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4] “Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36] Defense (n = 2) [36], [5] Not commented, but mentioned (n = 0) Contamination (n = 9) 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect Critique (n = 5) [24], [48], [13], [72], [21] Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect Defense (n = 1) [36] Not commented, but mentioned (n = 3) [66], [47], [41]
tr	Topics Arguments CNBSS Literature
th	Topics
th	Arguments
th	CNBSS
th	Literature
tr	Randomisation (n = 45) Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4]. Critique (n = 25) [8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]
td	Randomisation (n = 45)
td	Randomisation was performed at each site after clinical examination (change or violation of initial protocol); on-site randomisation after palpation could no longer guarantee blinding according to independent external review [1]. Various possibilities of subversion existed for anyone working in the study and the probable or possible incentive was to give patients with a highly suspicious finding the best possible diagnosis [2–4].
td	Critique (n = 25)
td	[8], [9], [10], [11], [12], [2], [1], [13], [14], [15], [16], [3], [4], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]
tr	The disproportionally large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered as a strong indicator of flawed randomisation.
td	The disproportionally large number of first round participants < age 50 years with advanced cancers entering the mammography group is considered as a strong indicator of flawed randomisation.
tr	A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association to BC mortality)
td	A low number of late stage cancers, possibly shifted from the control to the study group, could strongly affect and bias the calculation of mortality reduction or overdiagnosis, while other available variables will probably not yet be affected (due to their low association to BC mortality)
tr	Defense: “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram. Defense (n = 14) [29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40],[6]
td	Defense: “Irrespective of the findings on physical examination….women were independently and blindly assigned randomly” [5]; presence of an incentive is rejected, assuming that symptomatic women would have been seen by a surgeon who “if indicated” [5] could have requested a diagnostic mammogram.
td	Defense (n = 14)
td	[29], [26], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40],[6]
tr	They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6].
td	They refer to one analysis of a subgroup (Manitoba), which “showed no definitive evidence to support a nonrandom allocation of women with prior breast disease to the mammography arms of the study” [6].
tr	They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7].
td	They argue that “>50 other variables showed no statistically significant difference between study and control groups” [7].
tr	Individual randomisation is assumed generally superior to cluster randomisation
td	Individual randomisation is assumed generally superior to cluster randomisation
tr	Not commented, but mentioned (n = 7) [41], [42], [43], [44], [45], [46], [47]
td	Not commented, but mentioned (n = 7)
td	[41], [42], [43], [44], [45], [46], [47]
tr	Inclusion criteria (n = 14) By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable). Critique (n = 9) [8], [48], [13], [3], [4], [49], [25], [24], [23]
td	Inclusion criteria (n = 14)
td	By including mainly symptomatic women in this study, the value of screening by definition cannot be tested (68 % of cancers of the mammography arm were palpable).
td	Critique (n = 9)
td	[8], [48], [13], [3], [4], [49], [25], [24], [23]
tr	Consequence: dilution of the measurable effect, study is underpowered for testing true screening effect Defense (n = 3) [37], [31], [29]
td	Consequence: dilution of the measurable effect, study is underpowered for testing true screening effect
td	Defense (n = 3)
td	[37], [31], [29]
tr	Defense: issue not directly addressed (reasons for choosing this protocol are explained) Not commented, but mentioned (n = 2) [47], [46]
td	Defense: issue not directly addressed (reasons for choosing this protocol are explained)
td	Not commented, but mentioned (n = 2)
td	[47], [46]
tr	Mammography quality (n = 46) According to the responsible physicist, outdated equipment was used. The equipment at some centers was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30]. Critique (n = 28) [8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28]
td	Mammography quality (n = 46)
td	According to the responsible physicist, outdated equipment was used. The equipment at some centers was quite old, and at many centres it lacked key features such as automatic exposure control and grids [30].
td	Critique (n = 28)
td	[8], [9], [13], [25], [4], [24], [3], [23], [49], [22], [18], [19], [9], [27], [21], [17], [20], [51], [52], [53], [50], [54], [54], [55], [56], [57], [58], [28]
tr	Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, high numbers of obviously missed cancers.
td	Image quality was rated by external reviewers to be satisfactory in less than 40 % during 1980-1984 (active recruitment until 1985). According to the external reviewers, improvements of the image quality were regularly demanded and two reviewers resigned during the study due to unacceptable quality. Reported problems concerned incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images resulting in too dark or too bright images, no training of readers, high numbers of obviously missed cancers.
tr	The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. Quality control procedures were established for radiation physics and mammography interpretation.” Defense (n = 10) [29], [38], [33], [34], [36], [59], [60], [61], [62], [36]
td	The principal investigators state [5], “Facilities and equipment for modern film screen mammography were prerequisites. Quality control procedures were established for radiation physics and mammography interpretation.”
td	Defense (n = 10)
td	[29], [38], [33], [34], [36], [59], [60], [61], [62], [36]
tr	Not commented, but mentioned (n = 10) [43], [45], [63], [64], [65], [66], [67], [68], [69], [70]
td	Not commented, but mentioned (n = 10)
td	[43], [45], [63], [64], [65], [66], [67], [68], [69], [70]
tr	Reading quality (n = 26) Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1 year intervals. Critique (n = 16) [25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57]
td	Reading quality (n = 26)
td	Radiologists were not trained in reading mammograms. According to the reviewers, the quality of mammography reading was low and an extraordinarily high number of missed cancers occurred within the 1 year intervals.
td	Critique (n = 16)
td	[25], [4], [24], [3], [23], [18], [9], [27], [17], [51], [52], [50], [54], [55], [56], [57]
tr	Reading quality and training of readers has not been specifically addressed by proponents of the study Defense (n = 7) [33], [34], [36], [38], [7], [71], [62]
td	Reading quality and training of readers has not been specifically addressed by proponents of the study
td	Defense (n = 7)
td	[33], [34], [36], [38], [7], [71], [62]
tr	Not commented, but mentioned (n = 3) [68], [63], [67]
td	Not commented, but mentioned (n = 3)
td	[68], [63], [67]
tr	Recommended biopsies not performed (n = 3) “25 % of needle localization were recommended but not performed” [4, 5] Critique (n = 1) [4]
td	Recommended biopsies not performed (n = 3)
td	“25 % of needle localization were recommended but not performed” [4, 5]
td	Critique (n = 1)
td	[4]
tr	“ …at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4]
td	“ …at least one physician refused to do a biopsy on nonpalpable (=mammographically detected) lesions…” [4]
tr	“Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36] Defense (n = 2) [36], [5]
td	“Study surgeons decided if diagnostic follow-up was recommended.” “Most biopsies were done” [36]
td	Defense (n = 2)
td	[36], [5]
tr	Not commented, but mentioned (n = 0)
td	Not commented, but mentioned (n = 0)
tr	Contamination (n = 9) 26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect Critique (n = 5) [24], [48], [13], [72], [21]
td	Contamination (n = 9)
td	26 % of women allocated to the control group underwent mammography during the study period. This is a known problem, but may be more pronounced in the trials with individual randomisation. It falsely dilutes the measurable effect
td	Critique (n = 5)
td	[24], [48], [13], [72], [21]
tr	Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect Defense (n = 1) [36]
td	Baines argues that a problem concerning 26 % of the control group can only exert a “small” effect
td	Defense (n = 1)
td	[36]
tr	Not commented, but mentioned (n = 3) [66], [47], [41]
td	Not commented, but mentioned (n = 3)
td	[66], [47], [41]
table-wrap-foot	This table gives an overview of the main issues discussed concerning the CNBSS. The list of references can be reviewed in the electronic supplementary material (ESM)
p	This table gives an overview of the main issues discussed concerning the CNBSS. The list of references can be reviewed in the electronic supplementary material (ESM)
p	Randomisation: Randomisation was performed after a clinical breast examination (CBE) by the principal investigators at each site [28, 31, 32]. This contradicted the initial study design [33], but was documented in the handbook of operation [34] and by an external investigation [32]. While Baines states that “the center coordinators … were blind to the CBE”, other authors including external reviewers of the study reported the opposite [28, 34]. Also, the coordinator of one of the CNBSS sites was removed from her position because of suspected subversion of the randomisation process [31].
p	An external investigation [32] for possible fraud stated that in 12 out of 15 sites “…the nurses and probably the coordinators were aware of the findings of the clinical examination when the allocation was made”. In addition, more alterations of the allocation book were found in the mammography screening arm than in the control group (>100 unexplained alterations) [28, 31, 32, 34]. This investigation, which mainly concentrated on checking erasures of participants’ names in the allocation book, did “not uncover credible evidence of subversion” [32]. However, Boyd [31] and others (e.g. Kopans, Burhenne) pointed out that various other easy possibilities of subversion existed that could not be excluded in the absence of blinding.
p	The external investigation stated, considering that “…referral would not have ensured mammography. The charge has been made that there remained a motive for the examiner or coordinator to subvert the randomisation, if for clinical or other reasons he or she believed that the subject should…have a mammogram” [32].
p	A severe imbalance in the distribution of advanced cancers (>4 involved lymph nodes) was noted among women younger than 50 years who were screened in the first round. Nineteen women with such far advanced cancers were allocated to the screening group versus five to the control group, and more women with prior breast cancer were reported in the screening group (n = 8 versus n = 1) [28, 31, 34–36].
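p	The size of this imbalance can be checked with simple arithmetic. The following is a minimal sketch and not the analysis used by the cited authors: it assumes that, under correct 1:1 randomisation, each of the 24 advanced cancers had an equal chance of falling into either arm, and it computes an exact binomial tail probability. The function name `binomial_tail` is illustrative only.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts reported for first-round participants younger than 50 years
screening_arm = 19  # advanced cancers allocated to mammography
control_arm = 5     # advanced cancers allocated to control
total = screening_arm + control_arm

# Assumption: equal allocation probability (p = 0.5) for each case
one_sided = binomial_tail(screening_arm, total)
two_sided = min(1.0, 2 * one_sided)  # the distribution is symmetric for p = 0.5

print(f"one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")
```

Under these assumptions the one-sided probability is about 0.003, so a 19 versus 5 split would be unlikely to arise by chance alone. The calculation ignores any multiplicity from examining several subgroups and should be read as an illustration only.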
p	While Baines argues that more than 50 other variables (“…demographic and risk factors”) showed “virtually identical distribution across control and study groups”, Kopans pointed out that “….shifting a much smaller number of advanced cancers to the study group would substantially affect mortality…..without producing a demographic imbalance”.
p	A bias of randomisation must also be suspected when comparing the hazard ratios of the mammography arm in the prevalence round versus subsequent rounds: 1.47 versus 0.9 [20].
p	Goetzsche investigated none of the above issues of the randomisation process. Instead, he classified the CNBSS as one of two “high quality” RCTs using one formal criterion: he rated individual randomisation higher than cluster randomisation (where demographic regions are invited/not invited as opposed to individuals).
p	Comparable to the Malmö study, which also used individual randomisation, a high proportion of cross-over was reported: 26 % of the women in the control group underwent mammography [37].
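p	How strongly such cross-over can dilute a measured effect may be illustrated with a deliberately simple model. This sketch rests on assumptions that are not taken from the CNBSS: every woman in the study arm is screened, a fraction `c` of the control arm receives mammography with its full effect, and the "true" relative risk of 0.80 is a purely hypothetical input. The function name `observed_rr` is illustrative only.

```python
def observed_rr(true_rr: float, c: float) -> float:
    """Relative risk observed between arms when a fraction c of controls is screened.

    Assumed model: mortality is m without screening and true_rr * m with it.
    Study arm: true_rr * m. Control arm: (1 - c) * m + c * true_rr * m.
    """
    return true_rr / ((1 - c) + c * true_rr)


hypothetical_true_rr = 0.80  # assumption for illustration, not a CNBSS result
contamination = 0.26         # share of control women who underwent mammography

diluted = observed_rr(hypothetical_true_rr, contamination)
print(f"true RR = {hypothetical_true_rr:.2f}, observed RR = {diluted:.3f}")
```

With these inputs the relative risk seen between the arms moves from 0.80 towards 1 (to roughly 0.84), that is, part of the effect disappears from the comparison. Non-attendance in the study arm, which this sketch leaves out, would dilute the measured effect further.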
p	Inclusion criteria: Women with palpable tumours were included in the CNBSS [1, 2]. This fact alone calls for scrutinizing the inclusion criteria. In the mammography arm of the CNBSS 68 % of breast cancers were already palpable. Furthermore, women with previous breast cancers were included [28, 31], participants whose prognosis may already have been determined by the prior disease.
p	Quality of the screening procedure: Miller claimed that all equipment was new and that a quality assurance was in place [11, 6]. The responsible physicist stated [30], “quality was far below state of the art – even at that time (early 1980s). Problems … from inadequate equipment, … inappropriate imaging technique and lack of….specialized training….”. This was confirmed by several others including external reviewers [28, 29, 34], who specified that “training for mammography technologists or radiologists and a quality assurance were not present”. The study had initially been overseen by two highly renowned advisors (S. Feig and W. A. Logan). They resigned from the study after 3 years due to unacceptable quality. Subsequent external review rated image quality as acceptable in < 40 % of the cases during 1980-1984, in 1985 in about 60 %, and in 1986-7 in < 85 %. Problems included incomplete inclusion of the breast tissue, unsharp images, low image contrast, over- or underexposed images [28, 38] (Fig. 3). Overall, a very large number of interval cancers (143/575) resulted in this annual screening trial. It exceeds the rate, which in modern quality-assured mammography screening is considered acceptable for year one of the interval, by a factor of about 3. According to the reviewers, the above problems of poor image quality, inadequate training of personnel and readers may have accounted for the majority of the excess interval cancers. A subgroup analysis by Moskowitz [39] reported an increase of the detection rate with improving image quality at the end of the study, supporting a correlation with image quality. Fig. 3 Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS-study), and a follow-up mammogram of the same patient of 1993. 
Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto Canada
figure	Fig. 3 Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS-study), and a follow-up mammogram of the same patient of 1993. Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto Canada
label	Fig. 3
caption	Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS-study), and a follow-up mammogram of the same patient of 1993. Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto Canada
p	Images demonstrating the significant changes of technology between 1984, 1987 (CNBSS-study), and a follow-up mammogram of the same patient of 1993. Even though the later technology is still far from present contrast resolution, it becomes obvious that on the former mammograms almost no structures can be discerned in 80 % of the breast making detection of both masses and microcalcifications almost impossible. Images reproduced courtesy of Dr. Roberta Jong, Sunnybrook Health Sciences Centre, Toronto Canada
p	Finally, it has been estimated that 25 % of the recommended biopsies were not performed. One of the surgeons was convinced that nonpalpable lesions detected by mammography did not require biopsy [1, 28].
sec	Discussion and conclusion The above summary of the published results demonstrates several debatable issues concerning the CNBSS: It has to be emphasized that by definition, screening should address asymptomatic women [40]. Current screening is population-based and aims to invite asymptomatic women. The prevalence round of the Canadian trial had a very high proportion of palpable cancers (partly skewed by the nature of its recruitment strategy). Thus, its results are not applicable to current practice where there is uniform access to high quality symptomatic services for women with symptoms. By accepting palpable tumours (most probably advanced and with worse prognosis), the results are skewed: the overall numbers of cancers are artificially high while the palpable cancers cannot contribute to an improved mortality reduction. Thus, the screening effect will be considerably diluted and underestimated. This also leads to underpowered statistics [28, 31]. The documented process of randomisation did not guarantee blinding. Thus, any person involved in the study could subvert the randomisation. Also, the probability of subversion was enhanced since mammography was not necessarily offered to women in the control group. We do not assume that the principal investigators committed any fraud. However, they could not have prevented subversion with the chosen protocol. The disproportionate distribution of far advanced stages (cancers with > 4 involved lymph nodes) in the prevalence screen < age 50 years is highly significant and supports the doubts concerning correct randomisation. An even distribution of demographic and risk factors cannot exclude a bias toward late stage cancers, which may severely impact on the assessment of mortality reduction and calculation of overdiagnoses. Long-term mortality reduction was calculated from a maximum of five annual rounds. 
Because of the short overall duration and continuing entrance of first round screenees, the maximum screening effect could not be reached for many of the participants. Mortality reduction was calculated based on cumulative rates of a mixed trial participation of one to five rounds during up to 5 years. This might lead to a substantial underestimation of the true screening effect compared to a screening programme following approved guidelines (in which participants undergo approximately 10 complete rounds in 20 years) [41]. The higher evidence classification of individual versus cluster randomized studies is correct in principle. But for screening trials, where non-invited and invited women cannot be blinded, individual randomisation may lead to a much higher contamination of the control group than in a cluster randomized setting. Thus, in the CNBSS, as in other individual randomized screening trials, a substantial underestimation of the screening effect through contamination cannot be ruled out. The fact that none of the radiological reviewers including the responsible physicist considered the quality sufficient is highly concerning, as is the described lack of technologist and reader training and the high rate of interval cancers. Two reviewers resigned during the study. How can a method be tested if it is not properly performed and interpreted? What is the value of the results? The fact that recommended biopsies of mammographically detected abnormalities were not systematically performed is likely to have distorted the results. What effect is expected from early detection if suspicious findings are not followed by adequate assessment and therapy? Both obvious and probable protocol deficiencies are likely to have had an impact on the results counteracting a possible effect of mammography screening on breast cancer mortality and distorting estimates on overdiagnosis. 
As appropriate randomisation is one of the key validity criteria for RCTs, a study with that kind of violation should not be rated as a high quality study. Whether evidence from such a trial can be used at all must be questioned. Even if the raised concerns were insignificant and the results were valid, the question remains whether the results and conclusions from a screening trial performed in 1980-85 are applicable to or useful for the assessment of present screening programmes. The most appropriate answer is obviously “no”. First, the age range in the CNBSS was 40–59 years, which does not apply to the age range of most mammography screening programs today (50–69 or 74 years), as recommended in National and European Guidelines. Because of the lower incidence of breast cancer at younger ages, the absolute effect is lower. Today we know that mammography quality is even more important in the younger age group due to more difficult detection within dense breast tissue. Secondly, the mammographic technique and quality assurance of the complete chain from screening to screen reading, assessment, and treatment in modern population-based mammography screening programs is almost completely different from the CNBSS [42]. Abnormalities are routinely assessed using state of the art minimally invasive methods, and treatment is increasingly standardized and adapted to the stage at detection and the aggressiveness of the cancer. Finally the improved quality today clearly has increased the sensitivity and specificity of mammography screening. The result of 68 % of palpable breast cancers in the mammography arm (average size 1.9 cm) would today be unacceptable for annual (!) screening. What can we conclude from this short review? Probable deficits of randomisation and of proper application of the test cannot be repaired by performing a follow-up study. The results of such a study remain biased. 
We do not want to discount the CNBSS which represents an enormous and exceptional effort in the 1980s, and we took note of its size, the time, and circumstances when it was conducted. Also, the important question asked in this trial was different from all other screening trials. However, the chosen methodology led to obvious biases with significant impact on the results, especially when long-term results are considered. Furthermore, it is more than obvious that the setting of mammography in the CNBSS is not comparable to present mammography screening programs. Therefore, using the CNBSS as “highest evidence” to assess the effects of modern mammography screening programs of the new millennium is not scientifically justified. Considering the fact that properly performed cohort studies and nested case-control studies with appropriate consideration of length time bias demonstrate a much higher effect of mortality reduction, the assumption that the null effect of the CNBSS is “due to availability of chemotherapy” [20] is unproven and highly speculative. It is an unanswered question why high-ranking journals, like the BMJ, and representatives of evidence-based medicine close their eyes to these arguments [43] and still refer to the CNBSS as “superior” evidence (against mammography screening). Taking the CNBSS as an example, the authors want to point out how evidence in the field of breast cancer screening has systematically been omitted, distorted, or inappropriately used over the last decades. When using CNBSS data, opponents of screening mammography [6, 19, 44, 45] ignore or misinterpret an important part of the existing evidence. The consequence of this recommendation is “waiting until a cancer becomes palpable”. This means that contrary to early detection, women would present at a stage that usually requires aggressive treatment including chemotherapy and more often axillary dissection. 
There is no doubt that evidence shows that the earlier the stage of breast cancer at diagnosis, the better the prognosis. In conclusion, the comparison of the settings of the CNBSS with the setting of modern mammography screening is akin to comparing apples with pears. Drawing conclusions from the CNBSS for today’s quality-assured population based screening programmes is an act of negligence. What we need today is the continuous evaluation of the ongoing mammography screening programmes, including, but not only, breast cancer mortality as an outcome.
title	Discussion and conclusion
p	The above summary of the published results demonstrates several debatable issues concerning the CNBSS: It has to be emphasized that by definition, screening should address asymptomatic women [40]. Current screening is population-based and aims to invite asymptomatic women. The prevalence round of the Canadian trial had a very high proportion of palpable cancers (partly skewed by the nature of its recruitment strategy). Thus, its results are not applicable to current practice where there is uniform access to high quality symptomatic services for women with symptoms. By accepting palpable tumours (most probably advanced and with worse prognosis), the results are skewed: the overall numbers of cancers are artificially high while the palpable cancers cannot contribute to an improved mortality reduction. Thus, the screening effect will be considerably diluted and underestimated. This also leads to underpowered statistics [28, 31]. The documented process of randomisation did not guarantee blinding. Thus, any person involved in the study could subvert the randomisation. Also, the probability of subversion was enhanced since mammography was not necessarily offered to women in the control group. We do not assume that the principal investigators committed any fraud. However, they could not have prevented subversion with the chosen protocol. The disproportionate distribution of far advanced stages (cancers with > 4 involved lymph nodes) in the prevalence screen < age 50 years is highly significant and supports the doubts concerning correct randomisation. An even distribution of demographic and risk factors cannot exclude a bias toward late stage cancers, which may severely impact on the assessment of mortality reduction and calculation of overdiagnoses. Long-term mortality reduction was calculated from a maximum of five annual rounds. 
Because of the short overall duration and continuing entrance of first round screenees, the maximum screening effect could not be reached for many of the participants. Mortality reduction was calculated based on cumulative rates of a mixed trial participation of one to five rounds during up to 5 years. This might lead to a substantial underestimation of the true screening effect compared to a screening programme following approved guidelines (in which participants undergo approximately 10 complete rounds in 20 years) [41]. The higher evidence classification of individual versus cluster randomized studies is correct in principle. But for screening trials, where non-invited and invited women cannot be blinded, individual randomisation may lead to a much higher contamination of the control group than in a cluster randomized setting. Thus, in the CNBSS, as in other individual randomized screening trials, a substantial underestimation of the screening effect through contamination cannot be ruled out. The fact that none of the radiological reviewers including the responsible physicist considered the quality sufficient is highly concerning, as is the described lack of technologist and reader training and the high rate of interval cancers. Two reviewers resigned during the study. How can a method be tested if it is not properly performed and interpreted? What is the value of the results? The fact that recommended biopsies of mammographically detected abnormalities were not systematically performed is likely to have distorted the results. What effect is expected from early detection if suspicious findings are not followed by adequate assessment and therapy?
p	It has to be emphasized that, by definition, screening should address asymptomatic women [40]. Current screening is population-based and aims to invite asymptomatic women. The prevalent round of the Canadian trial had a very high proportion of palpable cancers (partly skewed by the nature of its recruitment strategy). Thus, its results are not applicable to current practice, where women with symptoms have uniform access to high-quality symptomatic services. By accepting palpable tumours (most probably advanced and with worse prognosis), the results are skewed: the overall number of cancers is artificially high, while the palpable cancers cannot contribute to a mortality reduction. Thus, the screening effect will be considerably diluted and underestimated. This also leads to underpowered statistics [28, 31].
p	The documented process of randomisation did not guarantee blinding. Thus, any person involved in the study could have subverted the randomisation. The probability of subversion was further increased because mammography was not necessarily offered to women in the control group. We do not assume that the principal investigators committed any fraud. However, they could not have prevented subversion with the chosen protocol. The disproportionate distribution of far advanced stages (cancers with > 4 involved lymph nodes) in the prevalence screen of women younger than 50 years is highly significant and supports the doubts concerning correct randomisation. An even distribution of demographic and risk factors cannot exclude a bias toward late stage cancers, which may severely affect the assessment of mortality reduction and the calculation of overdiagnosis.
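p	As a rough illustration of why such an imbalance raises doubts, the following minimal sketch computes an exact two-sided binomial p-value for the split reported in the abstract (19 of 24 late stage cancers in the mammography group). It assumes equally sized arms, so that under correct randomisation each cancer is equally likely to fall into either arm. The function name is hypothetical, and the test actually used in the cited analyses may differ.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k of n events in one arm under a 50/50 split.

    Assumption (not from the source): both arms are the same size, so each
    event has probability 0.5 of landing in either arm.
    """
    # Use the larger side of the split so we always sum the upper tail.
    extreme = max(k, n - k)
    upper_tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    # The distribution is symmetric at p = 0.5, so double the tail (capped at 1).
    return min(1.0, 2 * upper_tail)


# Counts taken from the abstract: 19 of 24 late stage cancers in the mammography arm.
p_value = binomial_two_sided_p(19, 24)
print(f"{p_value:.4f}")  # prints 0.0066
```

p	Under these assumptions, a split at least this uneven would be expected by chance in well under 1 % of correctly randomised trials.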
p	Long-term mortality reduction was calculated from a maximum of five annual rounds. Because of the short overall duration and the continuing entry of first round screenees, the maximum screening effect could not be reached for many of the participants. Mortality reduction was calculated from cumulative rates across a mixed trial participation of one to five rounds over up to 5 years. This might lead to a substantial underestimation of the true screening effect compared to a screening programme following approved guidelines (in which participants undergo approximately 10 complete rounds in 20 years) [41].
p	The higher evidence classification of individually randomized versus cluster-randomized studies is correct in principle. But for screening trials, where non-invited and invited women cannot be blinded, individual randomisation may lead to a much higher contamination of the control group than a cluster-randomized setting. Thus, in the CNBSS, as in other individually randomized screening trials, a substantial underestimation of the screening effect through contamination cannot be ruled out.
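p	The dilution argument can be made concrete with a simple sketch. It assumes a deliberately simplified model that is not taken from the CNBSS publications: every woman has the same baseline risk, screening multiplies that risk by a fixed factor, and only a fraction of each arm is actually screened. The function name and all numbers are hypothetical and serve only to show the direction of the effect.

```python
def observed_relative_risk(true_rr: float, attendance: float, contamination: float) -> float:
    """Relative risk seen when comparing arms as randomised (intention to treat).

    true_rr        -- relative risk of breast cancer death in screened women (hypothetical)
    attendance     -- fraction of the study arm that is actually screened
    contamination  -- fraction of the control arm that obtains screening anyway
    """
    # Each arm is a mixture of screened women (risk * true_rr) and unscreened women (risk * 1).
    study_arm_risk = attendance * true_rr + (1 - attendance)
    control_arm_risk = contamination * true_rr + (1 - contamination)
    return study_arm_risk / control_arm_risk


# Hypothetical inputs: a true 40 % reduction among screened women and 90 % attendance.
for contamination in (0.0, 0.25, 0.5):
    rr = observed_relative_risk(true_rr=0.6, attendance=0.9, contamination=contamination)
    print(f"contamination {contamination:.0%}: observed relative risk {rr:.2f}")
```

p	In this toy model the observed relative risk moves toward 1 as contamination rises, which is the underestimation described above; it says nothing about the actual size of contamination in the CNBSS.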
p	The fact that none of the radiological reviewers, including the responsible physicist, considered the quality sufficient is highly concerning, as are the described lack of technologist and reader training and the high rate of interval cancers. Two reviewers resigned during the study. How can a method be tested if it is not properly performed and interpreted? What is the value of the results?
p	The fact that recommended biopsies of mammographically detected abnormalities were not systematically performed is likely to have distorted the results. What effect is expected from early detection if suspicious findings are not followed by adequate assessment and therapy?
p	Both obvious and probable protocol deficiencies are likely to have had an impact on the results, counteracting a possible effect of mammography screening on breast cancer mortality and distorting estimates of overdiagnosis. As appropriate randomisation is one of the key validity criteria for RCTs, a study with that kind of violation should not be rated as a high quality study. Whether evidence from such a trial can be used at all must be questioned.
p	Even if the raised concerns were insignificant and the results were valid, the question remains whether the results and conclusions from a screening trial performed in 1980–85 are applicable to or useful for the assessment of present screening programmes. The most appropriate answer is obviously “no”.
p	First, the age range in the CNBSS was 40–59 years, which does not match the age range of most mammography screening programs today (50–69 or 74 years), as recommended in national and European guidelines. Because of the lower incidence of breast cancer at younger ages, the absolute effect is lower. Today we know that mammography quality is even more important in the younger age group because detection within dense breast tissue is more difficult. Secondly, the mammographic technique and the quality assurance of the complete chain from screening to screen reading, assessment, and treatment in modern population-based mammography screening programs are almost completely different from those of the CNBSS [42]. Abnormalities are routinely assessed using state of the art minimally invasive methods, and treatment is increasingly standardized and adapted to the stage at detection and the aggressiveness of the cancer. Finally, the improved quality today has clearly increased the sensitivity and specificity of mammography screening. A proportion of 68 % palpable breast cancers in the mammography arm (average size 1.9 cm) would today be unacceptable for annual (!) screening.
p	What can we conclude from this short review?
p	Probable deficits of randomisation and of proper application of the test cannot be repaired by performing a follow-up study. The results of such a study remain biased.
p	We do not want to discount the CNBSS, which represents an enormous and exceptional effort in the 1980s, and we took note of its size and of the time and circumstances in which it was conducted. Also, the important question asked in this trial was different from all other screening trials.
p	However, the chosen methodology led to obvious biases with significant impact on the results, especially when long-term results are considered. Furthermore, it is more than obvious that the setting of mammography in the CNBSS is not comparable to present mammography screening programs. Therefore, using the CNBSS as “highest evidence” to assess the effects of modern mammography screening programs of the new millennium is not scientifically justified.
p	Considering the fact that properly performed cohort studies and nested case-control studies with appropriate consideration of length time bias demonstrate a much larger mortality reduction, the assumption that the null effect of the CNBSS is “due to availability of chemotherapy” [20] is unproven and highly speculative.
p	It remains an unanswered question why high-ranking journals, like the BMJ, and representatives of evidence-based medicine close their eyes to these arguments [43] and still refer to the CNBSS as “superior” evidence (against mammography screening).
p	Taking the CNBSS as an example, the authors want to point out how evidence in the field of breast cancer screening has systematically been omitted, distorted, or inappropriately used over the last decades.
p	When using CNBSS data, opponents of screening mammography [6, 19, 44, 45] ignore or misinterpret an important part of the existing evidence. The consequence of recommending against screening is “waiting until a cancer becomes palpable”. This means that, contrary to early detection, women would present at a stage that usually requires aggressive treatment, including chemotherapy and, more often, axillary dissection. There is no doubt that evidence shows that the earlier the stage of breast cancer at diagnosis, the better the prognosis.
p	In conclusion, the comparison of the settings of the CNBSS with the setting of modern mammography screening is akin to comparing apples with pears. Drawing conclusions from the CNBSS for today’s quality-assured population-based screening programmes is an act of negligence.
p	What we need today is the continuous evaluation of the ongoing mammography screening programmes, including, but not limited to, breast cancer mortality as an outcome.
title	Electronic supplementary material
p	ESM 1 (DOCX 21 kb)
footnote	1 assigned during different time periods
title	Acknowledgements
p	The scientific guarantor of this publication is Sylvia H. Heywang-Koebrunner. The authors of this manuscript declare relationships with the following companies:
p	Prof. Sylvia Heywang-Koebrunner:
p	1. I am head of the Reference Center Mammography Screening Munich, which is responsible for quality assurance and training of screeners in Bavaria and Thuringia. This accounts for 50 % of my work.
p	2. I work as head of a screening unit and in private practice, specialized in breast imaging and interventions. About 70 % of my work is associated with mammography screening.
p	Prof. Ingrid Schreer: No other relationships/conditions/circumstances that present a potential conflict of interest.
p	Dr. Astrid Hacker: I'm an employee of the Reference Center for Mammography Screening Munich.
p	Dr. Maria Noftz: No other relationships/conditions/circumstances that present a potential conflict of interest.
p	Prof. Dr. Alexander Katalinic: I am chair of the scientific committee of the German Mammography Screening. This is an independent scientific committee, which advises the ongoing program. The task is honorary; there is no fee paid to me or any committee member. Only costs for travel to the committee meetings are paid by the screening program.
p	The authors state that this work has not received any funding. One of the authors has significant statistical expertise. Institutional Review Board approval was obtained. No study subjects or cohorts have been previously reported in any other journal. Methodology: The special report is a retrospective review of an existing trial performed in the 1980s (CNBSS) and a recently published follow-up study of the CNBSS. It was performed at one institution.
