Methods

Seventeen people independently judged the quality of evidence, the balance between benefits and harms, and the formulation of a recommendation for 12 examples. All 17 judges had experience using other approaches to grade evidence and recommendations.

Evidence profiles

For each example we prepared an evidence profile. Each profile was based on information available in a systematic review and consisted of two tables: one for quality assessment of the available information and one presenting a summary of the findings (Table 1 and Table 2). For the purpose of testing our grading approach in this pilot study, we assumed that the systematic reviews we used were all well conducted. The examples presented here were selected to test our new approach, not to make actual recommendations for a specific setting based on up-to-date systematic reviews.

The quality assessment table was designed so that the quality of each outcome was evaluated separately. For each outcome, the table contained the number of studies that had reported the outcome, the study design (RCTs or observational studies) and the quality of the studies that reported on that outcome (whether there were any limitations in the design or conduct of these studies). The table also included information about the consistency of the results across studies for each outcome and about the directness of the study population, outcome measure, intervention and comparison.

The summary of findings table was likewise designed so that each outcome was presented separately. For each outcome, information was presented for both the experimental and the control group: for dichotomous outcomes, the number of events and the total number of participants; for continuous outcomes, the mean (standard deviation) and the number of patients.
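The per-group counts recorded for a dichotomous outcome are the raw inputs for the usual effect measures. The following is a minimal sketch of how a relative risk with an approximate 95% confidence interval and an absolute risk difference could be derived from such counts. It assumes the standard log-scale (Wald) interval for the relative risk; the source does not say how the intervals in the evidence profiles were computed, and the function name and example counts are hypothetical.

```python
import math


def relative_and_absolute_effect(events_exp, total_exp, events_ctl, total_ctl, z=1.96):
    """Return (relative risk, (ci_low, ci_high), risk difference).

    Assumption: the interval is the standard log-scale Wald interval,
    which requires at least one event in each group.
    """
    if events_exp <= 0 or events_ctl <= 0:
        raise ValueError("the log-scale interval needs at least one event per group")
    if events_exp > total_exp or events_ctl > total_ctl:
        raise ValueError("events cannot exceed the number of participants")

    risk_exp = events_exp / total_exp
    risk_ctl = events_ctl / total_ctl

    # Relative effect: ratio of the two risks.
    rr = risk_exp / risk_ctl

    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / events_exp - 1 / total_exp + 1 / events_ctl - 1 / total_ctl
    )
    ci_low = math.exp(math.log(rr) - z * se_log_rr)
    ci_high = math.exp(math.log(rr) + z * se_log_rr)

    # Absolute effect: difference between the two risks.
    rd = risk_exp - risk_ctl
    return rr, (ci_low, ci_high), rd


# Hypothetical counts, not taken from any of the 12 examples.
rr, (ci_low, ci_high), rd = relative_and_absolute_effect(10, 100, 20, 100)
print(f"RR = {rr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}), RD = {rd:+.2f}")
```

Reporting both measures matters because the same relative effect can correspond to very different absolute effects, depending on the baseline risk in the control group.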
The summary of findings table also included information about the effect, relative effect (95% confidence interval) and absolute effect for each outcome.

Table 1 Example of an evidence profile quality assessment given to the evaluators for them to grade in the pilot study. Example question: Should depressed patients in primary care be treated with SSRIs rather than tricyclics?

Table 2 Example of an evidence profile summary of findings given to the evaluators for them to grade in the pilot study. Example question: Should depressed patients in primary care be treated with SSRIs rather than tricyclics?
* Uncertainty about baseline risk: Fatality data may be influenced by which pills are given to whom, and it is uncertain if changing antidepressant would deter suicide attempts.

Instructions and a form for recording each judgement were included with each example [see Additional file 1]. The judges were instructed to apply the approach without second-guessing the information presented in the evidence profile or the approach itself. They were asked to note problems that they encountered and judgements that did not make sense to them when they adhered to the approach as instructed.

Questions and judgements

The 12 examples were selected to include a variety of health care interventions, types of evidence and types of recommendations. The questions posed in the 12 examples were:
• Should depressed patients in primary care be treated with SSRIs or tricyclics? [2]
• Should patients with atrial fibrillation be treated with warfarin or aspirin for prevention of stroke? [3]
• Should patients with pain believed to be due to degenerative arthritis be treated with non-steroidal anti-inflammatory drugs (NSAIDs) or paracetamol? [4]
• Should patients who have had a myocardial infarction be given antiplatelet therapy to reduce all-cause mortality? [5]
• Should patients who have had a myocardial infarction be offered exercise rehabilitation? [5]
• Should patients with deep venous thrombosis be treated with low molecular weight heparin (LMWH) or IV unfractionated heparin for prevention of pulmonary embolism? [6]
• Should antibiotics be used to treat acute maxillary sinusitis? [7]
• Should BCG vaccine be used to prevent tuberculosis? [8]
• Should surgical discectomy be recommended for patients with sciatica due to lumbar disc prolapse? [9]
• Should community water fluoridation be used to reduce dental caries? [10,11]
• Should distribution of child safety seats and education programs be used to increase correct use of child safety seats? [12]
• Should hormone replacement therapy be given to prevent cardiovascular heart disease in healthy postmenopausal women? [13]

For each example, each person made judgements about:
• the quality of evidence for each outcome, scored as high, intermediate, low, or very low;
• the relative importance of each outcome, scored as critical to the decision (7–9), important but not critical to the decision (4–6), or not important to the decision (1–3);
• the overall quality of all the critical outcomes, scored as high, intermediate, low, or very low;
• the balance between benefits and harms, scored as net benefit, trade offs, uncertain net benefit, or not net benefit; and
• the recommendation, scored as do it, probably do it, toss up, probably don't do it, or don't do it.

For each example, the judgements made by all 17 people were collected and summarised as illustrated in Table 3. Disagreements were discussed at a meeting attended by 15 of the 17 judges. Because of a lack of time, the last two examples were discussed at another meeting attended by six of the 17 judges, but all 17 judges provided judgements for all 12 examples.
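The scoring scheme and the per-example summaries described above could be represented as follows. This is a minimal sketch, not the tooling used in the pilot study: the function names are hypothetical and the sample ratings are invented for illustration.

```python
from collections import Counter

# The four quality-of-evidence levels named in the text.
QUALITY_LEVELS = ("high", "intermediate", "low", "very low")


def importance_category(score):
    """Map a 1-9 importance score to the three bands described in the text."""
    if not isinstance(score, int) or not 1 <= score <= 9:
        raise ValueError("importance score must be an integer from 1 to 9")
    if score >= 7:
        return "critical"
    if score >= 4:
        return "important but not critical"
    return "not important"


def tally(judgements, levels=QUALITY_LEVELS):
    """Count how many judges chose each level, keeping levels nobody chose."""
    counts = Counter(judgements)
    unknown = set(counts) - set(levels)
    if unknown:
        raise ValueError(f"unrecognised levels: {sorted(unknown)}")
    return {level: counts[level] for level in levels}


# Invented ratings from 17 judges for a single outcome.
ratings = ["high"] * 9 + ["intermediate"] * 6 + ["low"] * 2
print(tally(ratings))
print(importance_category(8))
```

Keeping the levels with zero votes in the tally makes it straightforward to line up the counts for several outcomes in one summary table.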
Kappa agreement among the 17 graders across the four levels of quality of evidence was calculated [14] for the outcomes within each example (the number of outcomes per example ranged from two to seven), across all 46 outcomes, and for the 12 judgements about the overall quality of the evidence.

Table 3 Summary of the judgements made by the 17 evaluators for Example 1 of the pilot study. Should depressed patients in primary care be treated with SSRIs rather than tricyclics?

Sensibility and understandability

After grading all 12 examples, the judges were asked 16 questions regarding the sensibility and understandability of the approach. Each question consisted of a statement and five response options: strongly disagree, disagree, not sure, agree, and strongly agree. Eleven people completed this questionnaire. The questionnaire was adapted from Feinstein [15] and the 16 statements were:
1. The approach is applicable to different types of interventions, including drugs, surgery, counselling, and community-based interventions.
2. The approach is clear and simple to apply.
3. The information that is needed is generally available.
4. Subjective decisions are generally not needed.
5. All of the components included in each of the five types of judgements should be included.
6. There are no important components missing for any of the five types of judgements.
7. The ways in which the components are aggregated for each of the five types of judgements are clear and simple.
8. The ways in which the included components are aggregated are appropriate for each of the five types of judgements.
9. The categories are sufficient to discriminate between different grades for each of the five types of judgements.
10. The approach successfully discriminates between different grades of evidence.
11. The approach successfully discriminates between different grades of recommendations.
12. The overall quality of evidence is clear and understandable.
13. The balance between the benefits and harms is clear and understandable.
14. The recommendation is clear and understandable.
15. The way in which the overall quality of evidence was graded is better than other ways of doing this with which I am familiar.
16. The way in which the recommendation was graded is better than other ways of doing this with which I am familiar.
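The text above does not say which multi-rater kappa statistic reference [14] describes. Assuming it is Fleiss' kappa, which is a common choice when a fixed number of raters assign items to nominal categories, the agreement calculation could be sketched as below. The function name and the sample tallies are hypothetical, and the statistic treats the four quality levels as unordered categories.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of counts.

    table[i][j] is the number of raters who assigned item i (here, an
    outcome) to category j (here, a quality level). Every row must sum to
    the same number of raters.
    """
    n_items = len(table)
    if n_items == 0:
        raise ValueError("table must contain at least one item")
    n_raters = sum(table[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    n_categories = len(table[0])
    shares = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented tallies: 3 outcomes rated by 17 judges on the four quality levels
# (high, intermediate, low, very low).
example = [
    [12, 4, 1, 0],
    [2, 9, 5, 1],
    [0, 3, 8, 6],
]
print(round(fleiss_kappa(example), 3))
```

The same function applies to each of the three analyses described in the text. Only the rows change: the outcomes of one example, all 46 outcomes, or the 12 overall-quality judgements.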