Results

Quality of evidence for each outcome
The quality of evidence for each outcome, as assessed by the 17 graders, is shown in Table 4. Much of the disagreement was due to information missing from the evidence summaries that we prepared based on the information available in the chosen examples. We agreed that the evidence summaries should include footnotes explaining the basis for judgements about study quality, consistency and directness. We also agreed that it was necessary to include information about baseline risk and the setting as part of the background information, since different assumptions about these factors also explained some of the disagreement. It was possible to reach a consensus about the quality of evidence for most outcomes when we discussed our judgements. Of the 48 outcomes that were included across the 12 examples, we were not able to reach a consensus regarding five. The lack of consensus resulted from disagreement about whether the data were sparse for three outcomes and from insufficient information for two outcomes.
Table 4  Results, summary of the judgements made by the 17 evaluators of the quality for each of the outcomes presented in the 12 examples in the pilot study.
We found that in addition to study design, quality, consistency and directness, other quality criteria also influenced judgements about evidence. These additional criteria were sparse data, strong associations, publication bias, dose response, and situations where all plausible confounders strengthened rather than weakened our confidence in the direction of the effect. Consequently, the consistency with which we considered these additional issues was affected, and disagreements regarding the quality of evidence for each outcome were reduced.

Relative importance of each outcome
Specification of outcomes in the question that each example addressed resulted in some confusion regarding the relative importance of each outcome and the overall quality of evidence across outcomes. We therefore agreed that outcomes should not be included in the questions and that all important outcomes should be considered. There was good agreement about the relative importance of the 48 outcomes that were considered. We reached a consensus about the relative importance of all but two of the outcomes. The lack of consensus was due to uncertainty and true disagreement about the importance of these two outcomes, dental fluorosis and bone fractures, in relation to the question about water fluoridation.

Overall quality of important outcomes
There was a lack of agreement about the overall quality of evidence across the critical outcomes for each question (Table 5). This poor agreement reflected an accumulation of disagreements about the quality of evidence and the importance of the individual outcomes that were considered for each question. In addition, we found that it did not make sense to downgrade the overall quality of evidence because of lower quality evidence for one of several critical outcomes when all of the outcomes showed an effect in the same direction. We therefore agreed that the overall quality of evidence should be based on the higher quality evidence, rather than the lowest quality of evidence, when all of the results are in favour of the same option.
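The rule agreed on above can be written down as a short sketch. This is an illustration only, not the procedure used in the pilot study. It assumes the four GRADE quality levels (high, moderate, low, very low), that the default is to take the lowest quality among the critical outcomes, and that "higher quality evidence" means the highest level among those outcomes. The names `Outcome`, `favours` and `overall_quality`, and the example outcomes, are hypothetical.

```python
from dataclasses import dataclass

# Quality levels ordered from lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]


@dataclass
class Outcome:
    name: str
    quality: str    # one of LEVELS
    critical: bool  # judged critical to the decision
    favours: str    # hypothetical label for the option the result favours


def overall_quality(outcomes):
    """Overall quality of evidence across the critical outcomes.

    Assumed default: the lowest quality among critical outcomes.
    Exception described in the text: if all critical outcomes favour
    the same option, use the highest quality instead.
    """
    critical = [o for o in outcomes if o.critical]
    if not critical:
        raise ValueError("at least one critical outcome is required")
    ranks = [LEVELS.index(o.quality) for o in critical]
    same_option = len({o.favours for o in critical}) == 1
    return LEVELS[max(ranks) if same_option else min(ranks)]


# Hypothetical example: both critical outcomes favour the same option.
agreeing = [
    Outcome("mortality", "high", True, "treatment"),
    Outcome("stroke", "low", True, "treatment"),
]
# Hypothetical example: the critical outcomes point in different directions.
conflicting = [
    Outcome("mortality", "high", True, "treatment"),
    Outcome("bleeding", "low", True, "control"),
]
print(overall_quality(agreeing))
print(overall_quality(conflicting))
```

In this sketch only critical outcomes are checked for direction; whether non-critical outcomes should also be considered is a judgement the text leaves open.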
Table 5  Results, summary of the judgements made by the 17 evaluators of the overall quality in the 12 examples in the pilot study
The kappa statistics for each question are shown in Table 6. The number of outcomes per example ranged from two to seven and the kappa ranged from 0 to 0.82. In some instances, the agreement among the graders was slightly worse than expected by chance, as indicated by the negative kappa values seen in Table 6. The kappa across the 46 outcomes included in the calculation was 0.395 (SE 0.008). Kappa for agreement beyond chance for the 12 final judgements about the quality of evidence was 0.270 (SE 0.015).
Table 6  Results, kappa agreement among the evaluators for each of the 12 examples in the pilot study
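
To illustrate how a kappa for several graders measures agreement beyond chance, and why it can fall below zero, the following is a minimal sketch of Fleiss' kappa. The text does not state which multi-rater kappa was used, so the choice of Fleiss' kappa is an assumption, as is the requirement that every grader rates every outcome. The function name `fleiss_kappa` and the example counts are hypothetical and are not data from the pilot study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of graders who assigned outcome i to
    quality category j. Every row must sum to the same number of graders.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two graders are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every outcome must be rated by the same number of graders")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each outcome: proportion of agreeing grader pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_i) / n_subjects
    p_expected = sum(p * p for p in p_j)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical data: 17 graders, four quality categories, two outcomes,
# with complete agreement on each outcome.
perfect = [[17, 0, 0, 0], [0, 17, 0, 0]]
# Hypothetical data: two graders who never agree, giving agreement
# worse than expected by chance (a negative kappa).
worse_than_chance = [[1, 1], [1, 1]]

print(fleiss_kappa(perfect))
print(fleiss_kappa(worse_than_chance))
```

The observed agreement is compared with the agreement expected if graders assigned categories at random in the overall observed proportions, so a kappa of 0 means agreement no better than chance and negative values mean agreement below that level.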

Balance between benefits and harms
The graders' assessments of the balance between benefits and harms are shown in Table 7. Agreement was visibly poor, which can in part be explained by the accumulation of all the previous differences in grading the quality and importance of the evidence. Some of the judges made assumptions or considered information that was not included in the evidence profiles. When we discussed these judgements, we reached a consensus about the balance between benefits and harms for all but three questions. For the first of these, we found that we needed more information. For the second, we disagreed about the importance of two of the outcomes. For the third, we disagreed about the relative values we attached to the benefits and the harms.
Table 7  Results, summary of the judgements made by the 17 evaluators about the balance between benefits and harms for each of the 12 examples in the pilot study

Recommendation
The graders' individual judgements about the recommendations are shown in Table 8. During the discussion, we reached a consensus on a recommendation for the nine examples where we agreed on the balance between benefits and harms. We found that first agreeing on the balance between the benefits and harms clarified our judgements about recommendations and facilitated a consensus. There was not a one-to-one correspondence between our judgements about trade-offs and our judgements about recommendations, because the latter took into account additional considerations.
Table 8  Results, summary of the recommendations made by the 17 evaluators for each of the 12 examples in the pilot study

Sensibility and understandability
Eleven raters provided feedback on the sensibility and understandability of the GRADE system for grading evidence and formulating recommendations. Nine of the 11 respondents agreed or strongly agreed that the judgements about the overall quality of evidence were clear and understandable, and that the judgements about the balance between benefits and harms were clear and understandable using the GRADE approach. Everyone agreed or strongly agreed that the judgements about recommendations were clear and understandable. Eight of the judges agreed or strongly agreed that the GRADE approach to judging the overall quality of evidence was better than other grading systems with which they were familiar. Two disagreed and one was not sure. Eight also agreed that the GRADE approach to formulating recommendations was better than approaches with which the raters were familiar. Three raters were not sure whether the GRADE approach was superior to other approaches to formulating recommendations.
Nine of the 11 respondents agreed or strongly agreed that the GRADE approach was applicable to different types of interventions, and that the approach was clear and simple to apply. Five judges disagreed that the information that is needed is generally available, two were not sure and four agreed. Six of the eleven judges disagreed or strongly disagreed that subjective decisions were generally not needed, four were not sure and one agreed. Ten of the eleven judges agreed or strongly agreed that all the components included in each of the four types of judgements should be included; one judge was not sure. Asked whether no important components were missing from any of the four types of judgements, five of the judges were unsure, one disagreed and three agreed or strongly agreed. Eight judges agreed or strongly agreed that the ways in which the components were aggregated for each of the four types of judgements were clear and simple; three were unsure. Seven judges agreed or strongly agreed that the ways in which the included components were aggregated were appropriate for each of the four types of judgements, two were unsure and two disagreed. Ten of the eleven judges agreed or strongly agreed that the categories were sufficient to discriminate between different grades for each of the four types of judgements; one disagreed. All eleven judges agreed or strongly agreed that the GRADE approach successfully discriminated between different levels of quality of evidence, and between different grades of recommendations.