Am Fam Physician. 2004;70(12):2268-2270
to the editor: There are clear advantages to using a single evidence-grading system that is simple, comprehensible, and usable for researchers, clinicians, and others. We applaud the authors of the article1 on strength of recommendation taxonomy (SORT) in the February 1, 2004, issue of American Family Physician for tackling this challenging and important task.
We think SORT1 may prove a good starting point, though some areas need polishing because of omissions and inconsistencies. We want to address some conceptual issues to engender further discussion.
The biggest issue is that applying SORT could yield misleading conclusions of effectiveness for ineffective interventions. For example, observational studies of hormone therapy (cohort studies such as Sullivan, et al2) could earn SORT level 2 for study quality, resulting in a grade B recommendation (clear recommendation of moderate strength) for secondary prevention of coronary artery disease and leading users to conclude an effectiveness that we know is not true.
We very much like the focus on patient-centeredness. However, we would recommend this be separated from the issue of validity. A valid study of an intermediate marker not demonstrated to provide clinical benefit would receive a low grade. Yet, over time, that marker might be found to have clinical significance. It would be beneficial to have a method that indicates that the study was valid.
Also, the use of grades A, B, and C as a scale already has connotations in our culture (e.g., grade C is associated with "satisfactory" or a "passing" mark). In SORT, grade C evidence is not "passing" evidence. Furthermore, "strong" to "weak" implies a continuum, when some "evidence" comes from studies so invalid that it should not be considered evidence at all. We would recommend demarcating clearer lines, including a way to label a study as not being a good one, perhaps using a short descriptor.
This risk of being misled will not be mitigated by physicians simply applying their “understanding” of evidence-based medicine to “retranslate” SORT when they lack this understanding in the first place. The majority of physicians (76 percent) fail a simple critical appraisal pretest we give before our teaching programs. Most physicians are not going to look at grade C evidence with an understanding that they should not draw conclusions from such information.
We would encourage more emphasis on “validity” in labeling preferential types of studies and in criteria. For example, systematic reviews or meta-analyses should meet criteria for validity, not just include valid studies. The last criterion for levels of evidence in Figure 3 of the article1 omits many possible biases that could invalidate a study.
We also do not find what the authors refer to as "quantity" addressed in the grading scale. It appears possible for a single high-quality, powered clinical trial with a very small "n" to achieve the top grade.
The authors’1 work is a major undertaking. We especially like having an algorithm that facilitates assessments. A simple, clear method would be a boon, and we look forward to learning others’ ideas.
to the editor: The Strength of Recommendation Taxonomy (SORT)1 developed by the family medicine editors is a significant contribution that may enhance evidence-based practice by simplifying the communication of evidence-based medicine (EBM) concepts within the family medicine community.
DynaMed (www.DynamicMedical.com) is an evidence-based clinical reference with a user base of practicing clinicians who vary widely in awareness and acceptance of EBM concepts. DynaMed processes (systematic monitoring of the research literature with article selection based on validity and relevance, summarization of clinically relevant information, and explicit descriptions of study methodology) have been designed to provide the best available evidence in a transparent fashion, but without "labeling" or branding the evidence. The DynaMed Systematic Literature Surveillance process is described on their Web site at http://www.dynamicmedical.com/policy.
We have previously avoided codified evidence labeling in DynaMed because existing labeling systems were inconsistent, had potential for misinterpretation, and were not commonly used across different sources. We did not wish to independently create a new rating system and add to the problem.
Sixteen DynaMed editors actively participated in a process in which we examined the SORT criteria and came to a consensus regarding their use in DynaMed. We have determined that the SORT criteria are easy to apply, separate reported evidence into clinically meaningful categories and, given the nature of their development, have the potential to be used in multiple sources and develop recognition across the discipline.
Therefore, we will start using the SORT criteria in DynaMed. We would like to suggest two areas for improvement. First, information presentation should be intuitive. Many clinicians who have not had orientation to the SORT criteria could misinterpret the meanings of levels 2 and 3 and grades B and C. It is impractical to expect busy clinicians to read "how to use" instructions when they are reading information for purposes other than evidence rating. Therefore, we will add the following short explanatory phrases as we incorporate SORT labels into DynaMed:
Level 1 [likely reliable] evidence
Level 2 [mid-level] evidence
Level 3 [lacking direct] evidence
Grade A recommendation [consistent high-quality evidence]
Grade B recommendation [inconsistent or limited evidence]
Grade C recommendation [lacking direct evidence]
Second, there is no evidence that clinicians without training in evidence labeling correctly interpret the meaning of evidence labels.2 We are willing to assist others who are interested in researching this area.
in reply: As we stated in our article,1 the Strength of Recommendation Taxonomy (SORT) "is a work in progress." We thank the letters' authors for their thoughtful comments.
Ms. Strite and Dr. Stuart state that their most important concern is that possibly ineffective interventions could receive grade B. They use the example of hormone therapy for prevention of coronary heart disease. However, we clearly state in Figure 1 of our article1 that: “Recommendations should be based on the highest quality evidence available. For example, vitamin E was found in some cohort studies (level 2 study quality) to have a benefit for cardiovascular protection, but good-quality randomized trials (level 1) have not confirmed this effect. Therefore, it is preferable to base clinical recommendations in a manuscript on the level 1 studies.” Thus, the recommendation regarding hormone therapy should be based on the best available evidence from more recent randomized controlled trials (RCTs) rather than from older observational studies.
Their second concern is that the type of outcome measured in a study (patient-oriented or disease-oriented) should be classified separately from validity. There are two reasons not to take this approach. One of our goals was to keep the taxonomy simple, ideally with no more than three categories (A, B, and C). Adding a second dimension (patient-oriented versus disease-oriented for each of the validity categories) would double the number of categories. Second, interventions based only on data from intermediate markers of disease should be interpreted cautiously. There are too many examples of surrogate end points such as blood pressure, peak flow rate, and blood sugar levels that make perfect physiologic sense but that do not consistently predict improved patient outcomes. We identified a number of examples of inconsistency between disease-oriented and patient-oriented outcomes in Table 1 of our article.1
The authors’ third point that a grade of “C” implies a “passing grade” is well taken. However, a recommendation in an article by definition recommends that the clinician do or not do something. We are confident that clinicians will understand that in a taxonomy using A, B, and C, a rating of C is the weakest support for such a recommendation.
Figure 3 of our article1 is criticized for failing to include criteria for the validity of systematic reviews, meta-analyses, and other studies. As we state in the text, “the algorithms are to be considered general guidelines, and special circumstances may dictate assignment of a different strength of recommendation.” We stand by the algorithms as being useful when applied by thoughtful authors and clinicians.
Finally, we recognize the importance of quantity in addition to quality and consistency of evidence, and have made it implicit in the assessments of study quality. For example, a high-quality RCT is defined as having adequate statistical power, which requires adequate size. To avoid introducing yet another layer of classification, we made this characteristic implicit in the study quality assessment.
In response to Dr. Alper, we are pleased that DynaMed has chosen to use SORT in their database. Regarding the addition of clarifying labels to the levels of evidence for individual studies (1, 2, and 3) and strengths of recommendation (A, B, and C), we cannot endorse them without further study. For example, we are concerned that the terms “mid-level evidence” and “direct evidence” are subject to misinterpretation. We share Dr. Alper’s interest in the GRADE Working Group and their proposed evidence taxonomy.2 However, we are concerned that their system may lack the simplicity and ease of use of SORT. We agree that any taxonomy would benefit from more detailed study and welcome his call for research into these issues.