Meta-statistics: help or hindrance?

ACP J Club. 1993 May-June;118:A13. doi:10.7326/ACPJC-1993-118-3-A13

In an earlier editorial in ACP Journal Club, we discussed our selection criteria for review articles (1). Most of the review articles that meet our criteria are meta-analyses, that is, systematic reviews of specific clinical questions that use statistical methods to combine the results of previous research (2, 3).

The reason for insisting that reviews be systematic is to ensure that the conclusions are valid (1, 4, 5). The need for statistical analysis in reviews may be less clear. Sometimes the use of statistics in a meta-analysis may even seem to be more of a hindrance than a help to readers who are not familiar with the techniques. In this editorial we try to clarify the role of statistics in reviews and to provide help with assessing the results of meta-analyses.

Why use statistics in reviews?

To prepare a research review, investigators must collect data from individual studies, just as they must collect data from individual patients in primary studies. In either case, statistical methods can be used to analyze and summarize data. If used appropriately, they provide a powerful tool for deriving meaningful conclusions from the data and help prevent errors in interpretation.

A common error when statistical methods are not used in reviews is to compare the number of “positive” studies with the number of “negative” studies. With such a “vote counting” approach, a study may be counted as “positive” in one review and “negative” in another, depending on how the results are interpreted by the reviewers. There is also a tendency to overlook small but clinically important effects when counting votes, particularly when counting studies with statistically “nonsignificant” results as “negative” (6, 7).

Another error that can easily occur when statistical methods are not used is inappropriate weighting of the results of the individual studies. For example, vote counting gives studies equal weight, ignoring differences in the size and quality of the studies. It is also possible for a reviewer to inappropriately stress the results of one study over another when using nonquantitative methods.
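The contrast between vote counting and a weighted summary can be illustrated with a minimal sketch. The five studies below are invented for illustration (each is a hypothetical log odds ratio with its standard error), and the weighting by the inverse of the squared standard error is only one common choice, not a method prescribed by any particular review.

```python
import math

# Hypothetical studies, invented for illustration:
# (log odds ratio, standard error). Negative values favor treatment.
studies = [(-0.30, 0.20), (-0.25, 0.22), (-0.35, 0.25), (-0.20, 0.18), (-0.28, 0.24)]

def is_significant(estimate, se, z_crit=1.96):
    """Two-sided test at roughly the 5% level, using a normal approximation."""
    return abs(estimate / se) > z_crit

def vote_count(studies):
    """Count 'positive' (significant) and 'negative' (nonsignificant) studies."""
    positive = sum(1 for est, se in studies if is_significant(est, se))
    return positive, len(studies) - positive

def pooled_estimate(studies):
    """Weighted average in which each study's weight is 1 / se**2."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(weights)
    estimate = sum(w * est for w, (est, _) in zip(weights, studies)) / total
    return estimate, math.sqrt(1.0 / total)

positive, negative = vote_count(studies)
estimate, se = pooled_estimate(studies)
print("vote count:", positive, "positive,", negative, "negative")
print("pooled estimate significant:", is_significant(estimate, se))
```

With these invented numbers, every study is individually "nonsignificant," so a vote count would call the evidence negative, yet the weighted summary of the same data is statistically significant.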

Of course, the use of statistical methods does not guarantee that the results of a meta-analysis are valid, any more than it does for a primary study. Moreover, like any tool, statistical methods can be misused.

Are the results of the meta-analysis likely to be valid?

The results of a meta-analysis are only worth considering if they are valid. Guides for assessing the validity of meta-analyses and other review articles include the following (2-5):

1. Did the review address a clearly focused question?

2. Were inclusion criteria clearly stated and appropriate?

3. Is it unlikely that important, relevant studies were missed?

4. Was the validity of the included studies appraised?

5. Were assessments of studies reproducible?

All of the review articles included in ACP Journal Club have been screened using the first two guides. Once readers have used guides like these to determine that the results are likely to be valid, the following questions can help them understand and make good use of the results.

Are the studies comparable?

Decisions about whether it makes sense to combine studies rely primarily on clinical judgment. Before considering the results, it is important to ensure that the patients, “exposures” (interventions, diagnostic tests, prognosticators), and outcomes in each study included in a meta-analysis are similar enough that it makes sense to combine them. ACP Journal Club only includes reviews that report explicit criteria for selecting studies for inclusion. This makes it possible for readers to decide for themselves whether the criteria that were used were sensible for the question that the review addresses.

A frequent complaint about meta-analyses is that they combine apples and oranges. This is often true, but there is nothing wrong with it if one is interested in fruit. For example, trials of secondary prevention of myocardial infarction with β-blockers have differed in patient characteristics, type and dose of β-blocker, and how outcomes were measured. Nonetheless, the best estimate of the effectiveness of β-blockers after a myocardial infarction is from the overall results of all of the valid studies (8). Differences in the results of these trials are more likely to be due to chance than to differences in the effectiveness of the drug regimens that were used or the responsiveness of the patients who were studied (9).

Are the results of the comparable studies similar?

If it makes clinical sense to combine a group of studies, the next question to ask is whether differences among the results of the studies are greater than could be expected due to chance alone. One way of doing this is to look at a graphical display of the results. If the confidence intervals for the results of each study (typically represented by horizontal lines) do not overlap, it suggests that the differences are likely to be “statistically significant” (10). Tests of homogeneity are formal statistical analyses for estimating the probability of observed differences in results having occurred because of chance alone. The more significant the test (the closer its P value is to zero), the less likely it is that the observed differences were caused by chance alone.
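As a rough sketch of what a test of homogeneity computes, the example below uses Cochran's Q statistic, which is one commonly used test and is assumed here rather than taken from any specific review. The effect estimates and standard errors are invented for illustration. Q is compared with a chi-square distribution with one fewer degrees of freedom than the number of studies; the 5% critical value for 3 degrees of freedom is about 7.815.

```python
def cochran_q(estimates, standard_errors):
    """Return Cochran's Q and its degrees of freedom.

    Q is the weighted sum of squared deviations of each study's estimate
    from the inverse-variance weighted average of all the studies.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, len(estimates) - 1

# Hypothetical log odds ratios, invented for illustration.
standard_errors = [0.2, 0.2, 0.2, 0.2]
similar_results = [-0.30, -0.25, -0.35, -0.28]
dissimilar_results = [-0.90, -0.10, 0.40, -0.50]

CRITICAL_5_PERCENT_DF3 = 7.815  # chi-square critical value, 3 degrees of freedom

for label, estimates in [("similar", similar_results), ("dissimilar", dissimilar_results)]:
    q, df = cochran_q(estimates, standard_errors)
    print(label, "Q =", round(q, 3), "df =", df,
          "heterogeneity at 5% level:", q > CRITICAL_5_PERCENT_DF3)
```

An exact P value would require the chi-square distribution function from a statistical library; the fixed critical value is used here only to keep the sketch self-contained.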

When there is “statistically significant” heterogeneity, it suggests that the observed differences in results are likely caused by factors other than chance. When looking for possible explanations for the differences, however, such as differences in the interventions that were studied, it is important to be cautious (9, 10). Even if a meta-analysis included only well-designed, randomized controlled trials, patients were not randomized to one study rather than another. Because studies usually differ in many ways, it is risky to attribute differences in results to any one factor. If the author of a meta-analysis did this, even if a technique such as logistic regression was used to control for other potential explanatory factors (3), the analysis should be considered “hypothesis generating” rather than “hypothesis testing.”

What is the best overall estimate of the results?

Meta-analyses use a variety of techniques. Unfortunately, there is not one “correct” technique. The choice of technique depends on the nature of the data being analyzed. Fortunately, as with other uses of statistics, a conceptual understanding of the principles is more important than detailed knowledge of the specific techniques.

The central aim of most meta-analyses is to provide a more precise estimate (of, say, the effectiveness of an intervention) based on a weighted average of the results from more than one study. Typically, the weight given to each study is the inverse of the variance; that is, more weight is given to studies with more events (11, 12). It is possible to give studies more or less weight based on their methodologic quality and their precision, but this is rarely done (13). Readers more often encounter a “sensitivity analysis” that tests the robustness of the results by excluding some studies, such as those of poorer methodologic quality.
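The weighting described above can be sketched in a few lines. The three trials below are invented for illustration, and the variance formula used for the log odds ratio (the sum of the reciprocals of the four cell counts) is a standard large-sample approximation that we assume here; it fails when any cell count is zero.

```python
import math

# Hypothetical trials, invented for illustration:
# (events in treated, number treated, events in control, number in control)
trials = [
    (10, 100, 15, 100),
    (40, 500, 60, 500),
    (5, 50, 8, 50),
]

def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Return the log odds ratio and its approximate variance for one trial."""
    a, b = events_t, n_t - events_t   # treated: events, non-events
    c, d = events_c, n_c - events_c   # control: events, non-events
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d  # undefined if any cell is zero
    return log_or, variance

def pool_inverse_variance(trials):
    """Pool within-trial log odds ratios, weighting each by 1 / variance."""
    summaries = [log_odds_ratio(*trial) for trial in trials]
    weights = [1 / variance for _, variance in summaries]
    total = sum(weights)
    pooled = sum(w * lo for w, (lo, _) in zip(weights, summaries)) / total
    se = math.sqrt(1 / total)
    interval = (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))
    shares = [w / total for w in weights]  # fraction of total weight per trial
    return math.exp(pooled), interval, shares

pooled_or, interval, shares = pool_inverse_variance(trials)
print("pooled odds ratio:", round(pooled_or, 2))
print("95% confidence interval:", tuple(round(x, 2) for x in interval))
print("share of weight per trial:", [round(s, 2) for s in shares])
```

In this sketch the second trial, which has by far the most events, receives most of the weight, and each trial contributes only its own within-trial comparison of treated and control patients.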

Generally, each study is summarized using a measure of association that represents the within-study comparison of the treatment (exposed) and control groups. In this way patients in each study are only compared with patients in the same study. Occasionally a meta-analysis is published in which an “effect size” is calculated for each study group based on the difference in outcomes before and after treatment, and groups from different studies are directly compared with each other. The power of randomization is completely lost with this approach, and the data have been reduced to the equivalent of much weaker before-after studies. Readers should regard them as such.

The choice of which measure of association to use is not always straightforward, and the measure used in the analysis may not be the best measure for clinical decision making. For instance, the odds ratio is commonly used in meta-analyses because of advantages in the statistical techniques available for combining odds ratios, but it can be difficult to interpret the clinical importance of an odds ratio. Conversely, although it is easy to interpret the clinical importance of a risk difference, the statistical methods available for combining risk differences can have disadvantages.

For the most part, clinicians should be more concerned about how they interpret the summary statistic that was used than about which summary statistic was used in the analysis. For example, when odds ratios summarize results, they usually represent the odds of an adverse outcome occurring in the group that received the intervention compared with the odds of an adverse outcome in the control group. If the proportion of patients who have the outcome is very low, the odds ratio and the relative risk are approximately the same, and the proportional risk reduction is approximately equal to one minus the odds ratio. As the proportion of patients who have the outcome increases, however, the odds ratio for a beneficial intervention becomes progressively smaller than the relative risk. Thus, in situations where the baseline risk is high, the assumption that the odds ratio equals the relative risk does not hold true and can lead to overestimating the proportional risk reduction.
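The relation between the odds ratio and the relative risk can be checked with simple arithmetic. The sketch below assumes a single odds ratio of 0.5 applied at several hypothetical control-group risks; the function name and the chosen risks are ours, for illustration only.

```python
def relative_risk_from_odds_ratio(odds_ratio, control_risk):
    """Convert an odds ratio to a relative risk at a given control-group risk."""
    control_odds = control_risk / (1 - control_risk)
    treated_odds = odds_ratio * control_odds
    treated_risk = treated_odds / (1 + treated_odds)
    return treated_risk / control_risk

odds_ratio = 0.5  # hypothetical summary odds ratio

for control_risk in (0.01, 0.10, 0.30, 0.50):
    relative_risk = relative_risk_from_odds_ratio(odds_ratio, control_risk)
    print("control risk", control_risk,
          "| relative risk", round(relative_risk, 3),
          "| risk reduction", round(1 - relative_risk, 3),
          "| one minus odds ratio", 1 - odds_ratio)
```

At a control-group risk of 1%, the relative risk is very close to the odds ratio. At a control-group risk of 50%, the same odds ratio corresponds to a relative risk of about 0.67, so reading one minus the odds ratio as the proportional risk reduction would suggest 50% when the actual reduction is about 33%.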

How precise is the overall estimate of the results?

The confidence interval around an estimate derived from a meta-analysis tells us how precise that estimate is. It is, in principle, no different from a confidence interval around an estimate derived from primary research (14). One important consideration that arises in meta-analyses, however, is whether to incorporate between-study variation in estimating the confidence interval. If between-study variation is minimal, the difference is unimportant. If between-study variation is considerable, an analysis that ignores it will give a narrower, less conservative confidence interval than one that takes it into account. Most of the statistical methods commonly used in meta-analyses of clinical studies, particularly those that use odds ratios, ignore between-study variation and hence may provide a confidence interval that overestimates the precision of the results. As a result, readers should be cautious about interpreting the overall estimate derived from a meta-analysis that reports statistically significant heterogeneity, as well as from analyses that seek to attribute between-study differences to any one factor.
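The effect of between-study variation on the confidence interval can be sketched as follows. The method-of-moments estimate of between-study variance used here is the DerSimonian and Laird approach, which we assume as one common way of incorporating between-study variation; it is not the only one. The estimates and variances are invented for illustration.

```python
import math

def fixed_effect(estimates, variances):
    """Inverse-variance pooled estimate that ignores between-study variation."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, math.sqrt(1 / total)

def random_effects(estimates, variances):
    """Pooled estimate that adds an estimated between-study variance (tau2)."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    pooled_fixed, _ = fixed_effect(estimates, variances)
    q = sum(w * (e - pooled_fixed) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    scale = total - sum(w ** 2 for w in weights) / total
    tau2 = max(0.0, (q - df) / scale)  # truncated at zero
    adjusted = [1 / (v + tau2) for v in variances]
    pooled = sum(w * e for w, e in zip(adjusted, estimates)) / sum(adjusted)
    return pooled, math.sqrt(1 / sum(adjusted)), tau2

# Hypothetical log odds ratios with equal within-study variances.
estimates = [-0.90, -0.10, 0.40, -0.50]
variances = [0.04, 0.04, 0.04, 0.04]

_, se_fixed = fixed_effect(estimates, variances)
_, se_random, tau2 = random_effects(estimates, variances)
print("half-width of 95% CI ignoring between-study variation:", round(1.96 * se_fixed, 3))
print("half-width of 95% CI including between-study variation:", round(1.96 * se_random, 3))
```

With these heterogeneous invented results, the interval that incorporates between-study variation is considerably wider. When the results are homogeneous, the estimated between-study variance is zero and the two intervals coincide.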

What are the clinical implications of the results?

The results of a valid meta-analysis provide the best estimate of the outcomes that can be expected for a clinical intervention. It is still necessary, however, to weigh the expected benefits against the potential harms and costs (15). This requires judgment about how much patients value the different consequences. The authors of meta-analyses and the commentators in ACP Journal Club often make these judgments implicitly. Their conclusions about the implications of the results are worth considering, but before accepting their conclusions, it is worthwhile to consider the results that are reported. You can then make an informed decision about whether their conclusions, and your own, are supported by the results.

Andrew D. Oxman, MD, MSc


1. Oxman AD. Readers' guide for review articles: Why worry about methods? [Editorial]. ACP J Club. 1991 Jul-Aug;115:A12.

2. Sacks HS, Berrier J, Reitman D, et al. Meta-analyses of randomized controlled trials. N Engl J Med. 1987;316:450-5.

3. L'Abbé KA, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med. 1987;107:224-33.

4. Mulrow CD. The medical review article: state of the science. Ann Intern Med. 1987;106:485-8.

5. Oxman AD, Guyatt GH. Guidelines for reading literature reviews. Can Med Assoc J. 1988;138:697-703.

6. Cooper HM, Rosenthal R. Statistical versus traditional procedures for summarizing research findings. Psychol Bull. 1980;87:442-9.

7. Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC. A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA. 1992;268:240-8.

8. Yusuf S, Peto R, Lewis J, Collins R, Sleight P. β blockade during and after myocardial infarction: an overview of the randomized trials. Prog Cardiovasc Dis. 1985;27:335-71.

9. Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA. 1991;266:93-8.

10. Walker AM, Martin-Moreno JM, Artalejo FR. Odd man out: a graphical approach to meta-analysis. Am J Public Health. 1988;78:961-6.

11. Oxman AD, Guyatt GH. A consumer's guide to subgroup analyses. Ann Intern Med. 1992;116:78-84.

12. Laird NM, Mosteller F. Some statistical methods for combining experimental results. Int J Technol Assess Health Care. 1990;6:5-30.

13. Detsky AS, Naylor CD, O'Rourke K, McGeer AJ, L'Abbé KA. Incorporating variations in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol. 1992;45:255-65.

14. Altman DG. Confidence intervals in research evaluation [Editorial]. ACP J Club. 1992 Mar-Apr;116:A28.

15. Eddy DM. Anatomy of a decision. JAMA. 1990;263:441-3.