Discussion

This pilot study of the GRADE approach to grading the quality of evidence and strength of recommendations helped us to identify problems with the approach and to address them. We found that it was possible to resolve most of the disagreements that arose when we made judgements independently, and there was agreement that the approach warrants further development and evaluation.

Many of the disagreements were a direct result of a lack of information. We concluded that detailed additional information is needed in evidence profiles, and we have modified the evidence profiles accordingly. Where we found an empirical basis or compelling arguments, we have also provided precise definitions; for example, we have agreed on a basis for defining strong and very strong associations. In many cases, however, we continue to rely on judgement. We have addressed this by always including the rationale for such judgements in footnotes attached to the evidence profile.

The evidence profiles used in the pilot study were based on systematic reviews [2-13]. Much of the information we found lacking was missing from these original systematic reviews, particularly information about harms and side effects. It was outside the scope of this study to collect this information systematically. However, systematic reviews of evidence of harms, as well as of benefits, are essential for guideline development panels. If reviews, such as Cochrane reviews, are to meet the needs of guideline development panels, and of others making decisions about health care, it is essential that evidence of adverse effects is systematically included in them.

An important benefit of the approach to grading evidence and recommendations used in this study is that it clarifies the source of true disagreements, as well as helping to resolve disagreements by discussing each type of judgement sequentially.
Judgements about the relative importance of different outcomes, about trade-offs, and about the quality of evidence are made explicitly rather than implicitly. This facilitates discussion and clarification of these judgements, and it may be helpful to guideline panels and others to use this approach before making decisions and recommendations.

The most common source of disagreement that we encountered was differences in what we consider to be sparse data. We have not reached a consensus on a definition of sparse data, but we have acknowledged that we have different thresholds, and we now recognise this when we make judgements about the quality of evidence [16].

As a result of this pilot study, we have been able to make considerable improvements to our system for grading the quality of evidence and strength of recommendations. The evidence profiles used in the pilot study have been modified and now include information that was missing and that was found to be an important source of disagreement, as illustrated in Table 9 and Table 10, and the criteria used for grading the quality of evidence for each important outcome have been modified as summarised in Table 11.

Guideline development involves judgement, and individual, residual judgements will affect the agreement we measured in this study; some reduction in kappa values is therefore to be expected. Further refinement of the GRADE system and additional instructions should improve agreement.

Table 9 Example of a modified GRADE evidence profile: quality assessment. Tables 9 and 10 are what Tables 1 and 2 became after incorporating the improvements made on the basis of the pilot study experience.

Question: Should depressed patients be treated with SSRIs rather than tricyclics?
Setting: Primary care
Patients: Moderately depressed adult patients
Reference: North of England Evidence Based Guideline Development Project. Evidence based clinical practice guideline: the choice of antidepressants for depression in primary care.
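The agreement discussed above is conventionally quantified with Cohen's kappa, which corrects observed agreement for the agreement expected by chance. As a minimal illustrative sketch (not code from this study; the function and example gradings are hypothetical), kappa for two raters could be computed as follows:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters grading the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement implied by each rater's marginal
    frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters grade identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters graded independently.
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
              for c in categories)
    if p_e == 1:
        return 1.0  # Both raters used a single category throughout.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters grading six outcomes on the GRADE scale.
a = ["high", "moderate", "low", "low", "high", "very low"]
b = ["high", "moderate", "moderate", "low", "high", "very low"]
print(round(cohens_kappa(a, b), 2))  # -> 0.78
```

Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is why residual judgement in the grading process pulls the measured values below 1.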
Newcastle upon Tyne: Centre for Health Services Research, 1997.

*There was uncertainty about the directness of the outcome measure because of the short duration of the trials.
**It is possible that people at lower risk were more likely to have been given SSRIs, and it is uncertain whether changing antidepressant would have deterred suicide attempts.

SD = sparse data (yes or no)
SA = strong association (no; + = strong; ++ = very strong)
RB = reporting bias (yes or no)
DR = dose response (yes or no)
PC = all plausible confounders would have reduced the effect (yes or no)
CI = confidence interval
WMD = weighted mean difference
RRR = relative risk reduction

Table 10 Example of a modified GRADE evidence profile: summary of findings. Tables 9 and 10 are what Tables 1 and 2 became after incorporating the improvements made on the basis of the pilot study experience.

***There is uncertainty about the baseline risk for poisoning fatalities.

Table 11 Modified GRADE quality assessment criteria

* 1 = move up or down one grade (for example, from high to moderate); 2 = move up or down two grades (for example, from high to low). The highest possible score is High (4) and the lowest possible score is Very low (1); thus, for example, randomised trials with a strong association would not move up a grade.
** A relative risk of > 2 (< 0.5), based on consistent evidence from two or more observational studies, with no plausible confounders
*** A relative risk of > 5 (< 0.2), based on direct evidence with no major threats to validity

Judgements about confidence in evidence and recommendations are complex. The GRADE system represents our current thinking about how to reduce errors and improve communication of these complex judgements.
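The criteria summarised in Table 11 can be read as a simple scoring procedure: start from a score set by study design, subtract grades for limitations, and add grades for factors such as strong associations, clamping the result to the four-point scale. The sketch below is our illustrative interpretation of that procedure (the function and its parameters are hypothetical, not GRADE software; the starting scores, the clamp, and the "randomised trials with a strong association would not move up" behaviour follow from the table's footnote):

```python
GRADES = {1: "Very low", 2: "Low", 3: "Moderate", 4: "High"}

def grade_quality(study_design, downgrades=0, upgrades=0):
    """Illustrative GRADE quality score for one outcome.

    study_design: "randomised" starts at High (4); observational
        studies start at Low (2).
    downgrades: total grades subtracted for factors such as study
        limitations, inconsistency, indirectness, sparse data, or
        reporting bias (1 or 2 grades each).
    upgrades: total grades added, e.g. +1 for a strong association
        (RR > 2 or < 0.5) or +2 for a very strong association
        (RR > 5 or < 0.2).
    """
    score = 4 if study_design == "randomised" else 2
    score = score - downgrades + upgrades
    # Clamp to the scale: the highest possible score is High (4) and the
    # lowest is Very low (1), so a randomised trial with a strong
    # association does not move up a grade.
    return GRADES[max(1, min(4, score))]

print(grade_quality("randomised", upgrades=1))     # -> High (no move above 4)
print(grade_quality("observational", upgrades=2))  # -> High
print(grade_quality("randomised", downgrades=2))   # -> Low
```

Treating the criteria this way makes the footnoted edge cases explicit: the clamp is what prevents scores from leaving the High-to-Very-low range, whatever combination of factors applies.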
Ongoing developments include:
• Exploring the extent to which the same system should be applied to public health and health policy decisions as well as clinical decisions
• Developing guidance for when and how costs (resource utilisation) should be considered
• Developing guidance for judgements regarding sparse data
• Adapting the approach to accommodate recommendations about diagnostic tests when these are based on evidence of test accuracy
• Incorporating considerations about equity
• Preparing tools to support the application of the GRADE system

Plans for further development include studies of the reliability and sensibility of this approach and a study comparing alternative ways of presenting these judgements [17]. We invite other organisations responsible for systematic reviews of the effects of health care, or for practice guidelines, to work with us to further develop and evaluate the system described here.