Chapter 16: Outcome Studies in Trauma

Mohit Bhandari

Chapter Outline


The “outcomes” movement in orthopedic surgery involves careful attention to the design, statistical analysis, and critical appraisal of clinical research. The delineation between “outcomes” research and “evidence-based medicine (EBM)” is vague. Orthopedic surgeons and researchers have adopted their own style of critical appraisal, often termed “evidence-based orthopedics” (EBO). EBO entails a clear delineation of relevant clinical questions, a thorough search of the literature relating to those questions, a critical appraisal of the available evidence and its applicability to the clinical situation, and a balanced application of the conclusions to the clinical problem.29,50,51 
The balanced application of the evidence (the clinical decision-making) is the central point of practicing EBO and involves, according to EBM principles, integration of our clinical expertise and judgment, patients’ perceptions and societal values, and the best available research evidence.2,22 
EBO involves a hierarchy of evidence, from meta-analyses of high-quality randomized trials showing definitive results directly applicable to an individual patient, to relying on physiologic rationale or previous experience with a small number of similar patients. The hallmark of the evidence-based surgeon is that, for particular clinical decisions, he or she knows the strength of the evidence, and therefore the degree of uncertainty. 
In the process of adopting EBO strategies, surgeons must avoid common misconceptions about EBO. Critics have mistakenly suggested that evidence can be derived only from the results of randomized trials, or that statistical significance automatically implies clinical relevance; neither claim is true. That being said, new methods for the measurement of fracture healing, function, and quality of life will likely see their value demonstrated in the conduct of high-quality clinical trials that test innovative new approaches to trauma care.5 
This chapter provides an evaluation of the major study designs used in orthopedic clinical research, with recommendations for their appropriate use. 

Hierarchy of Evidence

Among various study designs, there exists a hierarchy of evidence with randomized controlled trials (RCTs) at the top, controlled observational studies in the middle, and uncontrolled studies and opinion at the bottom (Fig. 16-1).19,22,23,50 Understanding the association between study design and level of evidence is important. The Journal of Bone and Joint Surgery (JBJS), as of January 2003, has published the level of evidence associated with each published scientific article to provide readers with a gauge of the validity of the study results. Based upon a review of several existing evidence ratings, the JBJS uses five levels for each of the four different study types (therapeutic, prognostic, diagnostic, and economic or decision-modeling studies) (Table 16-1).60 Level I studies may be deemed appropriate for the application to patient care, whereas level IV studies should be interpreted with caution. For example, readers should be more confident about the results of a high-quality multicenter randomized trial of arthroplasty versus internal fixation on revision rates and mortality (level I study) than two separate case series evaluating either arthroplasty or internal fixation on the same outcomes (level IV studies). 
Figure 16-1
The hierarchy of evidence with high-quality randomized trials at the top and expert opinion at the bottom.
Table 16-1
Level of Evidence
Types of Studies: Therapeutic (investigating the results of treatment); Prognostic (investigating the outcome of disease); Diagnostic (investigating a diagnostic test); Economic and Decision Analyses (developing an economic or decision model)
Level I
    Therapeutic: Randomized trial with a statistically significant difference, or with no statistically significant difference but narrow CIs; systematic review of level I RCTs (with homogeneous studies)
    Prognostic: Prospective study; systematic review of level I studies
    Diagnostic: Testing of previously developed diagnostic criteria on consecutive patients (with universally applied reference criterion standard); systematic review of level I studies
    Economic: Clinically sensible costs and alternatives; values obtained from many studies; multiway sensitivity analyses; systematic review of level I studies
Level II
    Therapeutic: Prospective cohort study; poor-quality RCT (e.g., <80% follow-up); systematic review of level II studies or of nonhomogeneous level I studies
    Prognostic: Retrospective study; untreated controls from an RCT; systematic review of level II studies
    Diagnostic: Development of diagnostic criteria on consecutive patients (with universally applied reference criterion standard); systematic review of level II studies
    Economic: Clinically sensible costs and alternatives; values obtained from limited studies; multiway sensitivity analyses; systematic review of level II studies
Level III
    Therapeutic: Case-control study; retrospective cohort study; systematic review of level III studies
    Diagnostic: Study of nonconsecutive patients (without consistently applied reference criterion standard); systematic review of level III studies
    Economic: Analyses based on limited alternatives and costs, and poor estimates; systematic review of level III studies
Level IV
    Therapeutic: Case series (no, or historical, control group)
    Prognostic: Case series
    Diagnostic: Case-control study; poor reference standard
    Economic: Analyses with no sensitivity analyses
Level V
    Expert opinion (all four study types)

Adapted from JBJS Guidelines. Available online at

Bhandari and Tornetta18 have evaluated the interobserver agreement among reviewers with varying levels of epidemiology training in categorizing clinical studies published in the JBJS into levels of evidence. Among 51 included articles, the majority were studies of therapy (68.6%) constituting level IV evidence (56.9%). Overall, the agreement among reviewers for the study type, level of evidence, and subcategory within each level was substantial (range: 0.61 to 0.75). Epidemiology-trained reviewers demonstrated greater agreement (range: 0.99 to 1) across all aspects of the classification system when compared with nonepidemiology-trained reviewers (range: 0.6 to 0.75). The findings suggested that epidemiology- and nonepidemiology-trained reviewers can apply the levels of evidence guide to published studies with acceptable interobserver agreement. Although reliable, it remains unknown whether this system is valid.18 
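Interobserver agreement of this kind is conventionally quantified with the kappa statistic, which corrects raw agreement for the agreement expected by chance (values of 0.61 to 0.80 are commonly labeled “substantial”). The following is a minimal sketch of Cohen’s kappa for two reviewers, using hypothetical level-of-evidence ratings rather than data from the cited study:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected agreement between two reviewers."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Proportion of articles on which the two reviewers agree exactly
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each reviewer's marginal frequencies
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical level-of-evidence ratings by two reviewers for ten articles
a = ["IV", "IV", "I", "II", "IV", "III", "I", "IV", "II", "IV"]
b = ["IV", "IV", "I", "III", "IV", "III", "I", "IV", "II", "III"]
print(round(cohens_kappa(a, b), 2))  # 0.72: "substantial" agreement
```

Note that raw agreement here is 80%, yet kappa is lower because some agreement would occur by chance alone; this is why kappa, not percent agreement, is the usual metric for reliability studies of this type.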
The hierarchy of evidence bases its classification on the validity of the study design. Thus, those designs that limit bias to the greatest extent find themselves at the top of the pyramid and those inherently biased designs are at the bottom (Fig. 16-1). Application of the levels of evidence also requires a fundamental understanding of various study designs. 
Sackett et al.50 proposed a grading system that categorizes the hierarchy of research designs as levels of evidence. Each level (from 1 to 5) is associated with a corresponding grade of recommendation: (i) grade A—consistent level I studies, (ii) grade B—consistent level II or level III studies, (iii) grade C—level IV studies, and (iv) grade D—level V studies.19,22,23,50 
More recently, the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group suggested that, when making a recommendation for treatment, four areas should be considered (Table 16-2)3,6: (i) what are the benefits versus the harms? Are there clear benefits to an intervention, or does it do more harm than good?; (ii) what is the quality of the evidence?; (iii) are there modifying factors affecting the clinical setting, such as the proximity of qualified persons able to carry out the intervention?; and (iv) what is the baseline risk for the potential population being treated? 
Table 16-2
Criteria for Assessing Grade of Evidence
Type of Evidence
Randomized trial = high quality
Quasi-randomized = moderate quality
Observational study = low quality
Any other evidence = very low quality
Decrease Grade(s) If
Serious (−1) or very serious (−2) limitation to study quality
Important inconsistency (−1)
Some (−1) or major (−2) uncertainty about directness
Imprecise or sparse data (−1)
High probability of reporting bias (−1)
Increase Grade(s) If
Strong evidence of association—significant relative risk greater than 2 (or less than 0.5) based on consistent evidence from two or more observational studies, with no plausible confounders (+1)
Very strong evidence of association—significant relative risk greater than 5 (or less than 0.2) based on direct evidence with no major threats to validity (+2)
Evidence of a dose response gradient (+1)
All plausible confounders would have reduced the effect (+1)

Study Designs

The types of study designs used in clinical research can be classified broadly according to whether the study focuses on describing the distributions or characteristics of a disease or on elucidating its determinants (Fig. 16-2).23 Descriptive studies describe the distribution of a disease, particularly what type of people have the disease, in what locations, and when. Cross-sectional studies, case reports, and case series represent the types of descriptive studies. Analytic studies focus on determinants of a disease by testing a hypothesis with the ultimate goal of judging whether a particular exposure causes or prevents disease. Analytic design strategies are broken into two types: Observational studies, such as case-control and cohort studies, and experimental studies, also called clinical trials. The difference between the two types of analytic studies is the role that the investigator plays in each of the studies. In the observational study, the investigator simply observes the natural course of events. In the trial, the investigator assigns the intervention or treatment. 
Figure 16-2
Categorization of study designs.
Bhandari et al.17 reviewed each type of study to highlight methodologic issues inherent in their design (Table 16-3). 
Table 16-3
Study Designs and Common Errors
Meta-analysis
    Summary: High-quality studies addressing a focused clinical question are critically reviewed and their results statistically combined
    Common errors: Major differences between pooled studies (heterogeneity); pooling poor-quality studies yields less valid results
Randomized trial
    Summary: Patients are randomized to receive alternative treatments (i.e., cast vs. intramedullary nail for tibial shaft fracture); outcomes (i.e., infection rates) are measured prospectively
    Common errors: Type II (β) errors from insufficient sample size; type I (α) errors from overuse of statistical tests and multiple outcomes; lack of blinding; lack of concealed randomization
Prospective cohort (with comparison group)
    Summary: Patients who receive two different treatments are followed forward in time; choice of treatment is not randomly assigned (i.e., surgeon preference, patient preference); a comparison group is identified and followed at the same time as the treatment group (i.e., concurrent comparison group); outcomes (i.e., infection rates) are measured prospectively
    Common errors: Type II (β) errors from insufficient sample size; type I (α) errors from overuse of statistical tests and multiple outcomes; lack of adjustment for differences in characteristics between treatment and comparison groups
Prospective case series (without comparison group)
    Summary: Patients who receive a particular treatment are followed forward in time (i.e., intramedullary nailing of tibial fractures); no concurrent comparison group is utilized
    Common errors: Lack of independent or blinded assessment of outcomes; lack of follow-up
Case-control study
    Summary: Patients with an outcome of interest (i.e., infection) are compared backward in time (retrospectively) with similar patients without the outcome of interest (i.e., no infection); risk factors for a particular outcome can be determined between cases and controls
    Common errors: Type II (β) errors from insufficient sample size; type I (α) errors from overuse of statistical tests and multiple outcomes; problems in ascertainment of cases and controls
Retrospective case series (with comparison group)
    Summary: Patients with a particular treatment are identified backward in time (i.e., retrospectively); comparison patients are also identified retrospectively
    Common errors: Type II (β) errors from insufficient sample size; type I (α) errors from overuse of statistical tests and multiple outcomes; incomplete reporting in patient charts

Meta-Analysis (Level I Evidence; Grade A Recommendation)

Although not considered to be a primary study design, meta-analysis deserves mention because it is frequently utilized in the surgical literature. A meta-analysis is a systematic review that statistically combines the results of multiple studies (often of small sample size) to answer a focused clinical question. Meta-analyses are retrospective in nature. The main advantage of meta-analysis is the ability to increase the “total sample size” of the study by combining the results of many smaller studies. When well-designed studies are available on a particular question of interest, a meta-analysis can provide important information to guide clinical practice. Consider the following example. Several small randomized trials have attempted to resolve the issue of whether operative repair of acute Achilles tendon ruptures in younger patients reduces the risk of rerupture compared with conservative treatment. Of five randomized trials (ranging in sample size from 27 to 111 patients), four found nonsignificant differences in rerupture rates. These studies were underpowered. Using meta-analytic techniques, the results of these small studies were combined (n = 336 patients) to produce a summary estimate of the rerupture rate (3.6% with surgery versus 10.6% with conservative treatment; relative risk = 0.41; 95% confidence interval [CI], 0.17 to 0.99; p = 0.05) with adequate study power (>80%) to help guide patient care.10 
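A pooled estimate of this kind is built from study-level 2 × 2 tables. For a single table, the relative risk and its 95% CI can be computed as follows; the counts below are hypothetical, chosen only to illustrate the arithmetic, and are not the actual pooled trial data:

```python
import math

def relative_risk(events_a, n_a, events_b, n_b):
    """Relative risk of group A vs. group B with a 95% CI (log-scale Wald method)."""
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of ln(RR)
    se = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    return rr, lo, hi

# Hypothetical counts: 6/170 reruptures after surgery vs. 18/166 after
# conservative treatment (illustrative only)
rr, lo, hi = relative_risk(6, 170, 18, 166)
print(f"RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Because the upper CI limit falls below 1.0, the hypothetical surgical group has a significantly lower rerupture risk; an actual meta-analysis weights each trial’s estimate before pooling, so the pooled RR differs from a naive calculation on summed counts.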
Another benefit of meta-analysis is the increased impact over traditional reviews (i.e., narrative or nonsystematic reviews). Rigorous systematic reviews received over twice the number of mean citations compared with other systematic or narrative reviews (13.8 vs. 6; p = 0.008).13 
Authors of meta-analyses can be limited to summarizing the outcomes available rather than the outcomes of interest. There is often a trade-off between pooling data from many studies on common but sometimes less relevant outcomes (i.e., nonunion) versus fewer studies reporting less common outcomes of interest (i.e., avascular necrosis). Thus, the definition of eligibility criteria for the studies to be included is an important step in the conduct of a meta-analysis. 
Meta-analysis of high-quality randomized trials represents the current standard in the translation of evidence to practice. Although meta-analysis can be a powerful tool, its value is diminished when poor quality studies (i.e., case series) are included in the pooling. Pooled analyses of nonrandomized studies are prone to bias and have limited validity. Surgeons should be aware of these limitations when extrapolating such data to their particular clinical settings. 

Randomized Trial (Level I Evidence; Grade A Recommendation)

When considering a single study, the randomized trial is the single most important design to limit bias in clinical research.12 Randomized trials are by no means easy to conduct even when the fracture is a common one. In a systematic review of hip fracture trials around the world, Yeung and Bhandari20 identified 199 randomized trials.61 Sweden ranked highest with 50 trials (8,941 patients). The United Kingdom followed with 40 trials (7,589 patients). The United States and Canada together contributed only a tenth of the total number of trials contributed by European countries. 
Although it may seem elementary to explain the term “randomization,” most surgeons are unfamiliar with the rationale for random allocation of patients in a trial. Orthopedic treatment studies attempt to determine the impact of an intervention on events such as nonunions, infections, or death—occurrences that we call the trial’s target outcomes or target events. Patients’ age, the underlying severity of fracture, the presence of comorbid conditions, health habits, and a host of other factors typically determine the frequency with which a trial’s target outcome occurs (prognostic factors). Randomization gives a patient entering a clinical trial an equal probability (or chance) of being allocated to the alternative treatments. Patients can be randomized to alternative treatments by random number tables or computerized randomization systems. Randomization is the only method for controlling both known and unknown prognostic factors between two comparison groups. For instance, in a study comparing plates and intramedullary nails for the treatment of tibial shaft fractures in patients with concomitant head injury, investigators reported an imbalance in acetabular fractures between treatment groups. Readers will agree that differences in patient function or mortality may not be attributable to the treatments, but rather to differences in the proportion of patients with acetabular fractures. Because this imbalance arose from the lack of randomization, the investigators were forced to employ a less attractive strategy to deal with it—statistical adjustment for differences between groups. By controlling for the difference in the number of acetabular fractures between groups, the effect of plates versus nails in patients was determined. 
Equally important is the concept of “concealment” (not to be confused with blinding).12 Concealed randomization ensures that surgeons are unable to predict the treatment to which their next patient will be allocated. The safest manner in which to limit this occurrence is a remote 24-hour telephone randomization service. Historically, treatment allocations in surgical trials have been placed within envelopes; although seemingly concealed, envelopes are prone to tampering. 
Although it is widely believed that surgical trials cannot be double-blinded because of the relative impossibility of blinding surgeons, Devereaux et al.26 have challenged the “classic” definition of double-blinding. In a survey of 91 internists and researchers, 17 unique definitions of “double-blinding” were obtained. Moreover, randomized trials in five high-profile medical journals (The New England Journal of Medicine, The Lancet, British Medical Journal, Annals of Internal Medicine, and Journal of the American Medical Association) revealed considerable variability in the reporting of blinding terminology. Groups commonly blinded in a randomized trial include physicians, patients, outcome assessors, and data analysts. Current recommendations for reporting randomized trials include explicit statements about who was blinded in the study rather than use of the term “double-blinded.” Surgical trials can always blind the data analyst, almost always blind the outcome assessor, occasionally blind the patient, and never blind the surgeon. In a review of orthopedic trials, outcome assessors were blinded only 44% of the time and data analysts were never blinded. However, at least two-thirds of surgical trials could have achieved double-blinding by blinding the outcome assessors, patients, or data analysts.14 
The principle of attributing all patients to the group to which they were randomized is known as the intention-to-treat (ITT) principle (Fig. 16-3).12 This strategy preserves the value of randomization: Prognostic factors that we know about, and those we do not know about, will be, on average, equally distributed in the two groups, and the effect we see will be attributable to the treatment assigned. When reviewing a report of a randomized trial, one should look for evidence that the investigators analyzed all patients in the groups to which they were randomized. Some suggest that an ITT approach is too conservative and more susceptible to type II error because of increased biologic variability. Their argument is that an ITT analysis is less likely to show a positive treatment effect, especially in studies that randomized patients who had little or no chance of benefiting from the intervention. 
Figure 16-3
The intention to treat principle: A per protocol analysis analyzes patient outcomes to the treatment they “actually received” whereas intention to treat analysis evaluates outcomes based upon the treatment to which patients were originally randomized.
An alternative approach, referred to as a per protocol analysis, reports outcomes according to the treatments patients actually received, regardless of the number of crossovers from one treatment to another. This approach is often utilized to determine whether imbalances in baseline factors actually affect the final result. It may be particularly important when patients who are randomized to one treatment (i.e., reamed or unreamed tibial nail) never receive either treatment. For example, in a trial of reamed versus unreamed tibial nailing, a patient randomized to a reamed tibial nail who ultimately receives an external fixator because of an intraoperative surgical decision will be excluded from a per protocol analysis; however, recall that this same patient would be included in the reamed tibial nail group in an ITT analysis. 
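The distinction between the two analyses can be made concrete with a small sketch (all data hypothetical): each patient record carries the arm assigned at randomization and the treatment actually received, and the two analyses simply group patients by different fields.

```python
# Hypothetical trial records: (assigned arm, treatment received, healed?)
patients = [
    ("reamed",   "reamed",   True),
    ("reamed",   "external", False),  # intraoperative crossover to external fixation
    ("reamed",   "reamed",   True),
    ("unreamed", "unreamed", False),
    ("unreamed", "reamed",   True),   # crossover to the other nail
    ("unreamed", "unreamed", True),
]

def healing_rate(records, arm, key):
    """Healing rate for one arm, grouping by the given field index."""
    group = [r for r in records if r[key] == arm]
    return sum(r[2] for r in group) / len(group)

# Intention to treat: group by the ASSIGNED arm (index 0); crossovers stay
# in the group to which they were originally randomized.
itt_reamed = healing_rate(patients, "reamed", 0)
# Per protocol: group by the treatment RECEIVED (index 1); the patient
# converted to an external fixator drops out of both nail groups entirely.
pp_reamed = healing_rate(patients, "reamed", 1)
print(itt_reamed, pp_reamed)
```

In this toy data set the reamed group's healing rate is 2/3 under ITT but 1.0 per protocol, illustrating how excluding crossovers can flatter a treatment.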
The overall quality of a randomized trial can be evaluated with a simple checklist (Table 16-4). This checklist provides guides to the assessment of the methodologic rigor of a trial. 
Table 16-4
Checklist for Assessing Quality of Reporting
Randomization
    Were the patients assigned randomly? (1 Yes; 1 Partly; 0 No)
    Randomization adequately described? (2 Yes; 0 No)
    Was treatment group concealed to investigator? (1 Yes; 0 No)
    Description of outcome measurement adequate? (1 Yes; 1 Partly; 0 No)
    Outcome measurements objective? (2 Yes; 0 No)
    Were the assessors blind to treatment? (1 Yes; 0 No)
    Were inclusion/exclusion criteria well defined? (2 Yes; 1 Partly; 0 No)
    Number of patients excluded and reason given? (2 Yes; 1 Partly; 0 No)
    Was the therapy fully described for the treatment group? (2 Yes; 1 Partly; 0 No)
    Was the therapy fully described for the controls? (2 Yes; 1 Partly; 0 No)
Statistics
    Was the test stated and was there a p value? (1 Yes; 1 Partial; 0 No)
    Was the statistical analysis appropriate? (2 Yes; 0 No)
    If the trial was negative, were confidence intervals or post hoc power calculations performed? (1 Yes; 0 No)
    Sample size calculation before the study? (1 Yes; 0 No)
Total/4 (if positive trial); total/5 (if negative trial)
Total score: 20 points (if positive trial); 21 points (if negative trial)

Randomized Trial (Expertise-Based Design)

In conventional surgical hip fracture trials, all surgeons involved in the trial perform both total hip arthroplasties (THAs) and hemiarthroplasties. Surgeons performing arthroplasty are frequently less experienced (or expert) in one or both surgical alternatives. An expertise-based trial aims to limit this differential expertise across treatment alternatives. In an expertise-based design, patients are randomized to receive THA (performed by surgeons who are experienced in and committed to performing only THA) or hemiarthroplasty (performed by surgeons with expertise in hemiarthroplasty who are committed to performing only hemiarthroplasty). Devereaux et al.26 have outlined the advantages of this trial design, which include the following: (i) elimination of differential expertise bias where, in conventional designs, a larger proportion of surgeons are expert in one procedure under investigation than the other; (ii) differential performance, cointervention, data collection, and outcome assessment are less likely than in conventional RCTs; (iii) procedural crossovers are less likely because surgeons are committed to and experienced in their procedures; and (iv) ethical concerns are reduced because all surgeries are conducted by surgeons with expertise and conviction concerning the procedure.26 

Observational Study (Cohort, Case Series)

Studies in which randomization is not employed can be referred to as nonrandomized, or observational, study designs. The role of observational comparative studies in evaluating treatments is an area of continued debate: Deliberate choice of the treatment for each patient implies that observed outcomes may be caused by differences among people being given the two treatments, rather than the treatments alone.11 Unrecognized confounding factors can interfere with the attempts to correct for identified differences between groups. There has been considerable debate about whether the results of nonrandomized studies are consistent with the results of RCTs.8,25,32,36 Nonrandomized studies have been reported to overestimate or underestimate the treatment effects.32,36 
One example of the pitfalls of nonrandomized studies was reported in a study comparing study designs that addressed the general topic of comparison of arthroplasty and internal fixation for hip fracture.19 Mortality data was available in 13 nonrandomized studies (n = 3,108 patients) and in 12 randomized studies (n = 1,767 patients). Nonrandomized studies overestimated the risk of mortality by 40% when compared with the results of randomized trials (relative risk: 1.44 vs. 1.04, respectively) (Fig. 16-4). If we believe the data from the nonrandomized trials, then no surgeon would offer a patient a hemiarthroplasty for a displaced hip fracture, given the significant risk of mortality. However, in practice, arthroplasty is generally favored over internal fixation in the treatment of displaced femoral neck fractures. Thus, surgeons believe the randomized trials that report no significant differences in mortality and significant reductions in revisions with arthroplasty. 
Figure 16-4
Nonrandomized studies overestimate the benefit of internal fixation regarding mortality by 40%. Estimates from randomized trials tend to provide a more conservative estimate of a treatment effect when compared with nonrandomized studies.
Important contradictory examples of observational and RCT results can be found in the surgical literature. An observational study of extracranial-to-intracranial bypass surgery suggested a “dramatic improvement in the symptomatology of virtually all patients” undergoing the procedure.31 However, a subsequent large RCT demonstrated a 14% relative increase in the risk of fatal and nonfatal stroke in patients undergoing this procedure compared with medical management.1 These considerations have supported a hierarchy of evidence, with RCTs at the top, controlled observational studies in the middle, and uncontrolled studies and opinion at the bottom. However, these findings have not been supported in two publications in the New England Journal of Medicine that identified nonsignificant differences in results between RCTs and observational studies.8,25 
Although randomized trials, when available, represent the most valid evidence, information from nonrandomized studies can provide invaluable data to generate hypotheses for future studies. 

Prospective Observational Study (Level II Evidence; Grade B Recommendation)

A prospective observational study identifies a group of patients at a similar point in time and follows them forward in time. Outcomes are determined prior to the start of the study and evaluated at regular time intervals until the conclusion of the study. A comparison group (controls) may also be identified concurrently and followed for the same time period. 
Whereas comparison groups are helpful when comparing the outcomes of two surgical alternatives, a prospective evaluation of a single group of patients with complex injuries can provide information on the frequency of success (radiographic and functional outcomes) and expected complications. This information is most useful when the data collected remains consistent over time, the data collected includes important baseline patient characteristics and patient outcomes, and efforts are made to ensure patients are followed over time. Professor Joel Matta’s acetabular fracture database is one striking example of a carefully designed single-surgeon, prospective database that has consistently collected data on patients for more than 20 years (personal communication). With over 1,000 patients with acetabular fractures included in this database, the current limits of technique, results, and complications can be reported to serve as a benchmark for future studies. In addition, these types of studies can assist surgeons in discussing the expected risk and outcomes of surgery with their patients during the informed consent process. 

Case-Control Study (Level III Evidence; Grade B Recommendation)

If the outcome of interest is rare (i.e., mortality or infection), conducting a prospective cohort study may be cost-prohibitive. A case-control study is a useful strategy in such circumstances.23 Cases with the outcome of interest are identified retrospectively from a group of patients (i.e., databases) and matched (i.e., by age, gender, severity of injury) with control patients who do not have the outcome of interest. Both groups can then be compared for differences in “risk” factors.11 One control may be matched for each case that is identified (1:1 matching). Alternatively, multiple controls may be matched to each case (i.e., 3:1 or 4:1 matching). The validity of results from case-control studies depends upon the accuracy of the reporting of the outcomes of interest. For example, investigators conducted a study to determine risk factors for hip fracture among elderly women.30 To accomplish this, they identified 159 women with their first hip fracture and 159 controls (1:1 matching) matched for gender, age, and residence. Candidate risk factors included the perceived safety of the residence, psychotropic drug use, and a tendency to fall. Comparison of these factors between the hip fracture and control groups revealed that perceived safety of the residence (odds ratio = 5.8), psychotropic drug use (odds ratio = 2.6), and a tendency to fall (odds ratio = 2.3) were associated with increased odds of fracture. 
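In a case-control study the odds ratio is computed from the 2 × 2 table of exposure among cases and controls (the cross-product ratio). A minimal sketch, using hypothetical counts rather than the counts from the cited study:

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio from a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Hypothetical: 60 of 159 hip fracture cases used psychotropic drugs,
# vs. 30 of 159 controls (illustrative counts only)
or_drugs = odds_ratio(60, 99, 30, 129)
print(round(or_drugs, 1))  # 2.6
```

Because cases are sampled by outcome rather than by exposure, a case-control study cannot estimate absolute risks or a relative risk directly; the odds ratio is the appropriate measure of association, and it approximates the relative risk only when the outcome is rare.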

Retrospective Case Series (Level IV Evidence; Grade C Recommendation)

The retrospective study design, although less costly and less time consuming, is often limited by bias in the ascertainment of cases and the evaluation of outcomes. Comparison groups can be identified during the same time period as the treatment group (concurrent controls). However, controls from a different period of time can also be utilized (historical controls). Patient follow-up may be conducted passively (via patient records) or actively (patient follow-up appointment and examination). When patient charts have formed the basis for the outcome evaluation, readers should be convinced that the outcomes were objective measures accurately obtained from patient records. For example, in-hospital mortality data is an objective outcome that is likely to have been well documented in patient charts; however, patient satisfaction or functional outcome is subjective and far less likely to have been recorded with any standardization or consistency. 
A case series can provide initial useful information about the safety and complication profile of a new surgical technique or implant. This information is most valid when eligibility criteria for patient inclusion are clearly defined, consecutive patients are screened for eligibility, surgery and perioperative care are consistent, outcomes are objective and independently assessed, and follow-up is complete. Unfortunately, the validity of the results can be compromised by inadequate and incomplete reporting of patient characteristics and outcomes in patient charts. 

Case Study: The Study to Prospectively Evaluate Reamed Intramedullary Nails in Tibial Fractures Trial (Level I Study)

The debate over reamed versus nonreamed insertion of tibial intramedullary nails was largely fueled decades ago by case series (level IV evidence). Case series eventually led to prospective cohort comparisons of reamed and nonreamed nailing techniques (level II). Realizing the biases inherent in nonrandomized designs, a number of investigators conducted randomized trials ranging in sample size from 50 to 136 patients.55 Despite a strong design, these trials were limited by small sample sizes, imprecise treatment effects, lack of outcome assessment blinding, and unconcealed allocation of patients to treatment groups. 
The Study to Prospectively evaluate Reamed Intramedullary Nails in Tibial fractures (SPRINT) trial was designed to compare the effects of reamed and nonreamed intramedullary nailing approaches.56 To overcome the limitations of previous studies, the design involved concealed central randomization, blind adjudication of outcomes, and disallowing reoperation before 6 months. 
SPRINT enrolled 1,339 patients from July 2000 to September 2005 across 29 clinical sites in Canada, the United States, and the Netherlands. The final follow-up occurred in September 2006 and final outcomes adjudication was completed in January 2007. Participating investigators randomized patients by accessing a 24-hour toll-free remote telephone randomization system that ensured concealment. Randomization was stratified by center and severity of soft tissue injury (open, closed, or both open and closed) in randomly permuted blocks of two and four. Patients and clinicians were unaware of block sizes. Patients were allocated to fracture fixation with an intramedullary nail following reaming of the intramedullary canal (reamed group) or with an intramedullary nail without prior reaming (nonreamed group). 
All patients received postoperative care according to the same protocol. SPRINT investigators hypothesized that the benefits of reamed nails suggested by previous literature may have been because of a lower threshold for early reoperation in patients with nonreamed nails. Therefore, reoperations were disallowed within the first 6 months following surgery. Exceptions to the 6-month rule included reoperations for infections, fracture gaps, nail breakage, bone loss, or malalignment. Patients, outcome assessors, and data analysts were blinded to treatment allocation. Reoperation rates were monitored at hospital discharge; 2 weeks post discharge; 6 weeks post surgery; and 3, 6, 9, and 12 months post surgery. 
The SPRINT trial set a number of important benchmarks in study methodology including: (i) a sample size 10-fold greater than that of the largest previous tibial fracture trial; (ii) a modern trial organization including an independent blinded adjudication committee and a data safety monitoring committee; (iii) use of innovative trial infrastructure for randomization and data management; and (iv) large-scale, multimillion-dollar collaborative funding from the National Institutes of Health and the Canadian Institutes of Health Research, proving that orthopedic surgical trials belong in the same arena as the large cardiovascular and osteoporosis trials. 

Understanding Statistics in Trauma Outcome Studies

Hypothesis Testing

The essential paradigm for statistical inference in the medical literature has been that of hypothesis testing. The investigator starts with what is called a null hypothesis that the statistical test is designed to consider and possibly disprove. Typically, the null hypothesis is that there is no difference between treatments being compared. In a randomized trial in which investigators compare an experimental treatment with a placebo control, one can state the null hypothesis as follows: The true difference in effect on the outcome of interest between the experimental and control treatments is zero. We start with the assumption that the treatments are equally effective, and we adhere to this position unless data make it untenable. 
In this hypothesis-testing framework, the statistical analysis addresses the question of whether the observed data are consistent with the null hypothesis. The logic of the approach is as follows: Even if the treatment truly has no positive or negative impact on the outcome (i.e., the effect size is zero), the results observed will seldom show exact equivalence; that is, no difference at all will be observed between the experimental and control groups. As the results diverge further from the finding of “no difference,” the null hypothesis that there is no difference between treatment effects becomes less and less credible. If the difference between results of the treatment and control groups becomes large enough, clinicians must abandon belief in the null hypothesis. We will further develop the underlying logic by describing the role of chance in clinical research. 
Let us conduct a hypothetical experiment in which the suspected coin is tossed 10 times and, on all 10 occasions, the result is heads.2 How likely is this to have occurred if the coin was indeed unbiased? Most people would conclude that it is highly unlikely that chance could explain this extreme result. We would therefore be ready to reject the hypothesis that the coin is unbiased (the null hypothesis) and conclude that the coin is biased. Statistical methods allow us to be more precise by ascertaining just how unlikely the result is to have occurred simply as a result of chance if the null hypothesis is true. The law of multiplicative probabilities for independent events (where one event in no way influences the other) tells us that the probability of 10 consecutive heads can be found by multiplying the probability of a single head (1/2) 10 times over; that is, 1/2 × 1/2 × 1/2, and so on.2 The probability of getting 10 consecutive heads is slightly less than 1 in 1,000. In a journal article, one would likely see this probability expressed as a p value, such as p < 0.001. 
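The multiplicative rule is easy to verify; a one-line Python check:

```python
# Probability of 10 consecutive heads from a fair coin,
# by the multiplicative rule for independent events.
p_ten_heads = 0.5 ** 10
print(p_ten_heads)  # 0.0009765625, slightly less than 1 in 1,000
```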

What is the p Value?

What is the precise meaning of a p value? The p value is defined as the probability, under the assumption of no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. Statistical convention calls results that fall beyond the boundary of p < 0.05 statistically significant. The meaning of statistically significant, therefore, is that the result is “sufficiently unlikely to be due to chance alone that we are ready to reject the null hypothesis.” Let us use the example of a study that reports the following: Patient function scores following tibial intramedullary nailing were significantly greater than those of patients treated with plates (75 points vs. 60 points, p < 0.05). This may be interpreted as follows: If there were truly no difference between treatments, the probability of observing a difference of 15 points or greater by chance alone is less than 5% (or 1 in 20). 

The 95% Confidence Interval

Investigators usually (though arbitrarily) use the 95% CI when reporting the precision around a proportion. One can consider the 95% CI as defining the range that includes the true difference 95% of the time.12 In other words, if the investigators repeated their study 100 times, the confidence intervals computed would be expected to contain the true value in approximately 95 of those 100 repetitions. The true value will lie beyond these extremes only 5% of the time, a property of the CI that relates closely to the conventional level of statistical significance of p < 0.05. For example, if a study reports that nails reduced the risk of infection by 50% compared with plates in patients with tibial shaft fractures (95% CI: 25% to 75%), the results are consistent with as little as a 25% risk reduction or as much as a 75% risk reduction. In other words, the true risk reduction of infection with nails lies somewhere between 25% and 75% (with 95% confidence). 
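The repeated-sampling interpretation can be illustrated with a short Python simulation using the normal-approximation (Wald) interval for a proportion; the true event rate of 0.20 and the sample size of 200 are arbitrary choices for illustration:

```python
import math
import random

def ci_95(successes, n):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Repeated-sampling interpretation: the true rate (here 0.20) should
# fall inside roughly 95 of every 100 intervals computed this way.
random.seed(1)
true_p, n, trials, covered = 0.20, 200, 1000, 0
for _ in range(trials):
    successes = sum(random.random() < true_p for _ in range(n))
    lo, hi = ci_95(successes, n)
    covered += lo <= true_p <= hi
print(covered / trials)  # close to 0.95
```

The normal approximation slightly undercovers for small samples or extreme proportions, which is why exact (binomial) intervals are sometimes preferred.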

Measures of Central Tendency and Spread

Investigators will often provide a general summary of data from a clinical or experimental study. A number of measures can be utilized. These include measures of central tendency (mean, median, and mode) and measures of spread (standard deviation, range). The sample mean is equal to the sum of the measurements divided by the number of observations. The median of a set of measurements is the number that falls in the middle. The mode is the most frequently occurring number in a set of measurements. Continuous variables (such as blood pressure or body weight) can be summarized with a mean if the data are normally distributed. If the data are not normally distributed, then the median may be a better summary statistic. Categorical variables (pain grade: 0, 1, 2, 3, 4, or 5) can be summarized with a median. 
Along with measures of central tendency, investigators will often include a measure of spread. The standard deviation is derived from the square root of the sample variance. One standard deviation away from the mean accounts for somewhere around 68% of the observations. Two standard deviations away from the mean account for roughly 95% of the observations and three standard deviations account for about 99% of the observations. 
The variance is calculated as the average of the squares of the deviations of the measurements about their mean. The range of a dataset reflects the smallest value and the largest value. 
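All of these summary measures are available in Python’s standard library; the hospital-stay values below are invented, with one outlier included to show how the mean and median can diverge:

```python
import statistics

# Hypothetical hospital-stay data (days); the value 21 is an outlier.
stays = [3, 4, 4, 5, 6, 7, 21]

mean = statistics.mean(stays)          # sum / count; pulled upward by the outlier
median = statistics.median(stays)      # middle value; robust to the outlier
mode = statistics.mode(stays)          # most frequently occurring value
sd = statistics.stdev(stays)           # sample standard deviation
variance = statistics.variance(stays)  # square of the standard deviation
data_range = (min(stays), max(stays))  # smallest and largest values

print(mean, median, mode)  # 7.142857142857143 5 4
```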

Measures of Treatment Effect (Dichotomous Variables)

Information comparing the outcomes (dichotomous: Mortality, reoperation) of two procedures can be presented to patients as an odds ratio, a relative risk, a relative risk reduction (RRR), an absolute risk reduction, and the number needed to treat. Both reduction in relative risk and reduction in absolute risk have been reported to have the strongest influences on patient decision-making.15 
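All of these effect measures derive from the same pair of event rates; a minimal Python sketch with invented counts (10 reoperations in 100 treated patients vs. 20 in 100 controls):

```python
# Effect measures from hypothetical event counts (numbers invented).
def effect_measures(events_t, n_t, events_c, n_c):
    risk_t = events_t / n_t
    risk_c = events_c / n_c
    rr = risk_t / risk_c                  # relative risk
    rrr = 1 - rr                          # relative risk reduction
    arr = risk_c - risk_t                 # absolute risk reduction
    nnt = 1 / arr                         # number needed to treat
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    or_ = odds_t / odds_c                 # odds ratio
    return rr, rrr, arr, nnt, or_

rr, rrr, arr, nnt, or_ = effect_measures(10, 100, 20, 100)
print(rr, rrr, arr, nnt)  # 0.5 0.5 0.1 10.0
```

Note that the same data yield a 50% relative risk reduction but only a 10% absolute risk reduction, which is why the framing of the result can influence patient decision-making.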

Common Statistical Tests

Common statistical tests include those that examine differences between two or more means, differences between proportions, and associations between two or more variables (Table 16-5).28 
Table 16-5. Common Statistical Tests

Samples | Categorical | Ordered categorical or continuous and nonnormal | Continuous and normal
Two samples, different individuals | χ2 test; Fisher’s exact test | Mann–Whitney U test; Wilcoxon rank-sum test | Unpaired t-test
Two samples, related or matched | McNemar’s test | Wilcoxon signed-rank test | Paired t-test
Three or more samples, different individuals | χ2 test; Fisher’s exact test | Kruskal–Wallis statistic | ANOVA
Three or more samples, related | Cochran Q test | Friedman statistic | Repeated-measures ANOVA

Adapted from Griffin D, Audige L. Common statistical methods in orthopaedic clinical studies. Clin Orthop Relat Res. 2003;413:70–79.


Comparing Two Independent Means

When we wish to test the null hypothesis that the means of two independent samples of normally distributed continuous data are the same, the appropriate test statistic is called t, hence the t-test. The author of the original article describing the distribution of the t-statistic used the pseudonym Student leading to the common attribution Student’s t-test.21 When the data is nonnormally distributed, a nonparametric test such as the Mann–Whitney U or Wilcoxon rank-sum test can be utilized. If the means are paired, such as left and right knees, a paired t-test is most appropriate. The nonparametric correlate of this test is the Wilcoxon signed-rank test. 
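A pooled-variance (Student’s) t statistic can be computed from first principles; the two small groups of scores below are invented for illustration:

```python
import math
import statistics

def unpaired_t(a, b):
    """Student's (pooled-variance) two-sample t statistic."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical functional scores in two treatment groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(unpaired_t(group_a, group_b))  # -1.0
```

The t statistic is then compared against the t distribution with na + nb − 2 degrees of freedom to obtain a p value.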

Comparing Multiple Independent Means

When three or more different means have to be compared (i.e., hospital stay among three tibial fracture treatment groups: Plate fixation, intramedullary nail, and external fixation), single factor analysis of variance is a test of choice. If the test yields statistical significance, investigators can conduct post hoc comparison tests (usually a series of pairwise comparisons using t-tests) to determine where the differences lie. It should be recalled that the p value (α-level) should be adjusted for multiple post hoc tests. One rather conservative method is the Bonferroni correction factor that simply divides the α-level (p = 0.05) by the number of tests performed. 
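The Bonferroni adjustment itself is a one-line calculation; the pairwise p values below are hypothetical:

```python
# Bonferroni correction: divide alpha by the number of post hoc tests.
alpha = 0.05
n_comparisons = 3  # e.g., plate vs. nail, plate vs. ex-fix, nail vs. ex-fix
adjusted_alpha = alpha / n_comparisons  # about 0.0167

p_values = [0.01, 0.03, 0.20]  # hypothetical pairwise t-test results
significant = [p < adjusted_alpha for p in p_values]
print(significant)  # [True, False, False]
```

Note that p = 0.03 would have been declared significant at the unadjusted 0.05 level but survives only the first comparison here.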

Comparing Two Proportions

A common situation in the orthopedic literature is that two proportions are compared. For example, these may be the proportion of patients in each of two treatment groups who experience an infection. The chi-squared (χ2) test is a simple method of determining whether the proportions are really different. When samples are small, the χ2 test becomes rather approximate because the data is discrete but the χ2 distribution from which the p value is calculated is continuous. A “Yates’ correction” is a device that is sometimes used to account for this, but when cell counts in the contingency table become very low (say, less than five), the χ2 test becomes unreliable and a Fisher’s exact test is the test of choice. 
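For a 2 × 2 table the χ2 statistic has a closed form; a short Python sketch with invented infection counts:

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic (no Yates' correction) for a 2x2 table:
                 event  no event
    treatment A    a       b
    treatment B    c       d
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: 10/100 infections with nails vs. 20/100 with plates.
chi2 = chi_squared_2x2(10, 90, 20, 80)
print(round(chi2, 3))  # 3.922; with 1 degree of freedom, p is just under 0.05
```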

Determining Association Between One or More Variables Against One Continuous Variable

When two variables have been shown to be associated, it may be logical to try to use one variable to predict the other. The variable to be predicted is called the dependent variable and the one to be used for prediction is the independent variable. For such a linear relationship, the equation y = a + bx is defined as the regression equation. a is a constant and b the regression coefficient. Fitting the regression equation, generally using a software package, is the process of calculating values for a and b, which allows the regression line represented by this equation to best fit the observed data. The p value reflects the result of a hypothesis test that x and y are in fact unrelated, or in this case that b is equal to zero. 


The strength of the relationship between two variables (i.e., age vs. hospital stay in patients with ankle fractures) can be summarized in a single number: The correlation coefficient. The correlation coefficient, which is denoted by the letter r, can range from −1 (representing the strongest possible negative relationship in which the person who scores the highest on one variable scores the lowest on the other variable) to 1 (representing the strongest possible positive relationship in which the person who is older also has the longest hospital stay). A correlation coefficient of 0 denotes no relationship between the two variables. 
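Both the regression coefficients and the correlation coefficient come from the same sums of squares; a Python sketch with invented age and hospital-stay data:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b) and Pearson r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx          # regression coefficient (slope)
    a = my - b * mx        # constant (intercept)
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient
    return a, b, r

# Hypothetical data: patient age (years) vs. hospital stay (days).
ages = [40, 50, 60, 70, 80]
stays = [3, 4, 6, 7, 10]
a, b, r = fit_line(ages, stays)
print(round(b, 2), round(r, 3))  # 0.17 0.981
```

Here the slope suggests roughly 0.17 extra days of stay per year of age in this invented dataset, and r close to 1 indicates a strong positive relationship.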

Common Errors in the Design of Orthopedic Studies

Any study that compares two or more treatments (i.e., comparative study: Randomized trial, observational study with control group, case-control) can be subject to errors in hypothesis testing. For example, when investigators conduct studies to determine whether two treatments have different outcomes, there are four potential outcomes (Fig. 16-5)50: (i) a true positive result (i.e., the study correctly identifies a true difference between treatments); (ii) a true negative result (i.e., the study correctly identifies no difference between treatments); (iii) a false negative result—type II (β) error (i.e., the study incorrectly concludes no difference between treatments when a difference really exists); and (iv) a false positive result—type I (α) error (i.e., the study incorrectly concludes a difference between treatments when no difference exists). 
Figure 16-5
Errors in hypothesis testing: Type I and type II errors are presented along with the power of a study (1−β).

Type II Errors (β-Error)

Trials of surgical therapies are often perceived to be too small in sample size to yield results that can meaningfully change clinical practice. Such trials of small sample size are subject to β-errors (type II errors): The probability of concluding that no difference between treatment groups exists when, in fact, there is a difference (Fig. 16-6). Typically, investigators will accept a β-error rate of 20% (β = 0.2), which corresponds with a study power of 80%. Most investigators agree that β-error rates greater than 20% (study power less than 80%) are subject to unacceptably high risks of false negative results. 
Figure 16-6
The current conceptual framework for evidence-based practice encompassing research findings, patients’ values and preferences, clinical circumstances, and expertise.
In an effort to quantify the extent to which orthopedic trauma trials were underpowered, Lochner et al.37 reviewed 117 randomized trials in trauma for type II error rates. The mean overall study power was 24.65% (range 2% to 99%). The potential type II error rate for primary outcomes was 91%. For example, one study demonstrated “no difference” between reamed and nonreamed tibial intramedullary nailing; however, this study was underpowered for this conclusion (study power = 32%). Thus, these conclusions should be interpreted with caution. 

Case Study—The Risk of Small Sample Sizes

The SPRINT trial evaluated reamed versus unreamed nailing of the tibia in 1,226 patients, as well as in open and closed fracture subgroups (N = 400 and N = 826, respectively).16 To evaluate the impact of smaller sample sizes on the results, the SPRINT investigators analyzed the reoperation rates and relative risk comparing treatment groups at 50, 100, and then increments of 100 patients up to the final sample size. Results at various enrollments were compared with the final SPRINT findings. In the final analysis, there was a statistically significant decreased risk of reoperation with reamed nails for closed fractures (RRR 35%). Results for the first 35 patients enrolled suggested reamed nails increased the risk of reoperation in closed fractures by 165%. Only after 543 patients with closed fractures were enrolled did the results reflect the final advantage for reamed nails in this subgroup. Had the SPRINT trial stopped at fewer than 100 patients, the findings may have represented a misleading estimate of the true effect of reamed nailing. 

Type I Error (α-Error)

Most surgeons are less familiar with the concept of concluding that the results of a particular study are true when, in fact, they are due to chance (or random sampling error). This erroneous false positive conclusion is designated as a type I or α-error (Fig. 16-6).20 By convention, most studies in orthopedics adopt an α-error rate of 0.05. Thus, investigators can expect a false positive error about 5% of the time. Ideally, a type I error rate is based on one comparison between alternative treatment groups, usually designated as the primary outcome measure. In situations where no primary outcome variable has been determined, there is a risk of conducting multiple tests of significance on multiple outcome measures. This form of data dredging by investigators risks spurious false positive findings. Several techniques are available to adjust for multiple comparisons, such as the Bonferroni correction. 
Most readers are intuitively skeptical when 1 in a list of 20 outcomes measured by an investigator is significant (p < 0.05) between two treatment groups. This situation typically occurs when investigators are not sure what they are looking for and therefore test several hypotheses hoping that one may be true. Statistical aspects of the multiple testing issue are straightforward. If n independent associations are examined for statistical significance, the probability that at least one of them will be found statistically significant is 1 − (1 − α)^n if all n of the individual null hypotheses are true. Therefore, it is argued that studies that generate a large number of measures of association have a markedly greater probability of generating some false positive results because of random error than the stated α-level for individual comparisons. 
Bhandari et al.20 conducted a review of recently published randomized trials (within the last 2 years) to determine the risk of type I errors among surgical trials that did not explicitly state a primary outcome. One study examining outcomes in two different uncemented total knee arthroplasty designs evaluated 21 different outcome measures and found 13 outcomes that were significantly different between groups. As there was no clear statement about a designated primary outcome measure, the risk of a false positive result was 66%.20 
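The family-wise false positive probability 1 − (1 − α)^n can be computed directly; the second line below reproduces the roughly 66% risk quoted for the 21-outcome knee arthroplasty example:

```python
# Probability of at least one false positive among n independent tests,
# each conducted at significance level alpha, when all null hypotheses hold.
def family_wise_error(alpha, n):
    return 1 - (1 - alpha) ** n

print(round(family_wise_error(0.05, 20), 2))  # 0.64
print(round(family_wise_error(0.05, 21), 2))  # 0.66, the 21-outcome example
```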

The Misuse of Subgroup Analyses in Orthopedic Outcome Studies

Subgroup analyses can be defined as treatment outcome comparisons for patients subdivided by baseline characteristics.46,62 For instance, in a study of operative versus nonoperative management of calcaneal fractures, investigators may report no difference in the overall outcome (patient function), but subsequently conduct a series of comparisons across different patient subgroups (gender, disability status, or comorbidities). Subgroup analyses are frequently post hoc analyses that risk false positive results (type I error) in which ineffective (or even harmful) treatments may be deemed beneficial in a subgroup. Conducting multiple statistical tests risks spurious false positive findings. Alternatively, false negative results may occur because negative subgroup analyses are often underpowered. 
Bhandari et al.9 identified important errors in surgical RCTs related to subgroup analyses. The majority of authors did not report whether subgroup analyses were planned a priori, and these analyses often formed the basis of the RCT conclusions. Inferences from such RCTs may be misleading and their application to clinical practice unwarranted.46,62 
In a review of 72 RCTs published in orthopedics and other surgical subspecialties, 27 (38%) RCTs reported a total of 54 subgroup analyses with a minimum of 1 and a maximum of 32 subgroup analyses per study.9 The majority of subgroup analyses, 49 (91%), were performed post hoc and not stated to be preplanned at the outset of the study nor included in the hypothesis. The majority of investigators inappropriately used tests of significance when comparing outcomes between subgroups of patients (41 subgroup analyses, 76%); however, only three of the analyses were performed using statistical tests for interaction. Investigators reported differences between subgroups in 31 (57%) of the analyses, all of which were featured in the summary or conclusion of the published paper. 
Subgroup analyses should be undertaken and interpreted with caution. The validity of a subgroup analysis can be improved by defining a few important (and biologically plausible) subgroups before conducting a study and conducting statistical tests of interaction. When faced with a subgroup analysis in a published scientific paper, readers should ask the following questions: Is the subgroup difference suggested by comparisons within rather than between studies? Did the hypothesis precede rather than follow the analysis? Was the subgroup effect one of a small number of hypothesized effects tested? Is the magnitude of the effect large? Was the effect statistically significant? Is the effect consistent across studies? Is there indirect evidence that supports the hypothesized subgroup effect? 

Statistical Versus Clinical Significance

Statistically significant differences between two treatments may not necessarily reflect a clinically important difference. Although it is well known that orthopedic studies with small sample sizes risk underpowered false negative conclusions (β-errors), statistically significant findings in small trials can occur at the consequence of very large differences between treatments (treatment effect). It is not uncommon for randomized trials to report RRRs larger than 50% when comparing one treatment with another. 
Sung et al.57 conducted a comprehensive search for all RCTs between January 1, 1995 and December 31, 2004. Eligible studies included those that focused upon orthopedic trauma. Baseline characteristics and treatment effects were abstracted by two reviewers. Briefly, for continuous outcome measures (i.e., functional scores), effect sizes (mean difference/standard deviation) were calculated. Dichotomous variables (i.e., infection, nonunion) were summarized as absolute risk differences and RRRs. Effect sizes >0.8 and RRRs greater than 50% were defined as large effects. 
These investigators identified 433 RCTs, of which 76 RCTs had statistically significant findings on 184 outcomes (122 continuous / 62 dichotomous outcomes). The average study reported large reductions (>50% RRR) in the risk of an adverse outcome event versus a comparative treatment; however, almost 1 in 2 study outcomes (47%) had RRRs less than 50%, and over 1 in 5 (23%) had RRRs less than 20%. 

Study Power and Sample Size Calculation

The power of a study is the probability of concluding a difference between two treatments when one actually exists. Power (1−β) is simply the complement of the type II error (β). Thus, if we accept a 20% chance of an incorrect study conclusion (β = 0.2), we are also accepting that we will come to the correct conclusion 80% of the time. Study power can be used before the start of a clinical trial to assist with sample size determination, or following the completion of a study to determine whether a negative finding was true (or due to chance). 
The power of a statistical test is typically a function of the magnitude of the treatment effect, the designated type I error rate (α), and the sample size (n). When designing a trial, investigators can decide upon the desired study power (1−β) and calculate the necessary sample size to achieve this goal.28 Numerous free sample size calculators are available on the internet and use the same principles and formulae for estimating sample size in clinical trials. 

Comparing Two Continuous Variables

A continuous variable is one measured on a scale (i.e., blood pressure, functional outcome score, time to healing). For example, in planning a trial of alternative strategies for the treatment of humeral shaft fractures, an investigator may identify a systematic review of the literature reporting that the time to fracture healing with treatment A is 110 ± 45 days, whereas time to healing with treatment B (control group) can be expected to be up to 130 ± 40 days. The expected treatment difference is 20 days and the effect size (mean difference/standard deviation) is 0.5 (20/40). By convention, effect sizes are categorized as small (0.2), medium (0.5), and large (0.8). The anticipated sample size for this continuous outcome measure is determined by a standard equation. 
A particular study will require approximately 63 patients per group to have sufficient power to identify a difference of 20 days between treatments, if it occurs. An investigator may then audit his or her center’s previous year and decide if enough patients will present to the center to meet the sample size requirements. Table 16-6 provides additional scenarios and the sample size requirements for varying differences in healing times between treatment and control groups. As the difference between treatments diminishes, the sample size requirements increase (Table 16-6). 
Table 16-6
Sample Size Requirements for Continuous Outcome (Time to Fracture Healing)
Time to Healing (Control Group) Time to Healing (Treatment Group) % Reduction in Time to Healing Number of Patients Needed per Group
150 days 120 20% 16
150 days 135 10% 63
150 days 143 5% 289
Let us consider another study that aims to compare functional outcome scores in patients with ankle fractures treated operatively versus nonoperatively. Previous studies using the functional outcome score have reported standard deviations of 12 points in both the operative and nonoperative groups. Based upon previous studies, we want to be able to detect a difference of 5 points on this functional outcome score between treatments. 
From the equation in the Appendix at the end of this chapter, our proposed study will require 90 patients per treatment arm to have adequate study power. 
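The standard two-sample formula, n per group = 2 × ((z(α/2) + zβ) × σ/d)², reproduces this figure; a Python sketch (z(α/2) = 1.96 and zβ = 0.84 correspond to a two-sided α of 0.05 and 80% power):

```python
import math

def n_per_group(delta, sd, z_alpha=1.96, z_beta=0.84):
    """Patients per group to detect a mean difference `delta` with
    standard deviation `sd` (two-sided alpha = 0.05, power = 80%)."""
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# Ankle example from the text: 5-point difference, SD of 12 points.
n = n_per_group(delta=5, sd=12)
print(round(n))  # 90 per arm (texts that round up report 91)
```

Halving the detectable difference quadruples the required sample size, which is why small trials can only detect large effects.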
Reworking the above equation, the study power can be calculated for any given sample size by transforming the formula and solving for the z-score: z = (d/σ) × √(n/2) − z(α/2), where d is the difference to be detected, σ is the standard deviation, n is the sample size per group, and z(α/2) = 1.96 for a two-sided α of 0.05. 
The actual study power that corresponds to the calculated z-score can be looked up in readily available statistical literature19 or on the internet (keyword: “z-table”).23,60 From the above example, the z-score will be 0.84 for a sample size of 90 patients. The corresponding study power for a z-score of 0.84 is 80%. 
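The conversion from z-score to study power is simply the standard normal cumulative distribution, which Python exposes through math.erf; a sketch of the ankle example:

```python
import math

def power_from_z(z):
    """Study power = Phi(z), the standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Ankle example: n = 90 per arm, difference = 5 points, SD = 12 points.
z = (5 / 12) * math.sqrt(90 / 2) - 1.96
print(round(z, 2), round(power_from_z(z), 2))  # 0.84 0.8
```

This replaces the z-table lookup described in the text: Phi(0.84) is approximately 0.80, that is, 80% power.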

When the Outcome Measure is Dichotomous (Proportion)

A dichotomous variable is typically one that takes one of two values (i.e., infection or no infection, nonunion or union, alive or dead). Let us assume that this same investigator chooses nonunion as the primary outcome instead of time to union. Based upon the previous literature, he or she believes that treatment A will result in a 95% union rate and treatment B (control group) will result in a 90% union rate. A total of 869 patients are required for the study to identify a 5% difference in nonunion rates between treatments. An investigator may realize that this number is large enough to prohibit the trial being conducted at one center and may elect to gain support at multiple sites for this trial. For example, in a proposed trial using pulmonary embolus risk as the primary outcome, the number of patients required may be prohibitive (Table 16-7). 
Table 16-7
Sample Size Requirements for Different Baseline Risks of Pulmonary Embolus
Pulmonary Embolus Rate Control Group Pulmonary Embolus Rate Treatment Group % Reduction in Pulmonary Embolus Risk Number of Patients Needed Per Group
10% 8% 20% 3,213
1% 0.8% 20% 35,001
0.10% 0.08% 20% 352,881
Returning to our example of ankle fractures, let us now assume that we wish to change our outcome measure to differences in secondary surgical procedures between operatively and nonoperatively treated ankle fractures. A clinically important difference is considered to be 5%. Based upon the previous literature, it is estimated that the secondary surgical rates in operative and nonoperative treated ankles will be 5% and 10%, respectively. The number of patients required for our study can now be calculated from the equation presented in the Appendix. 
Thus, we need 433 patients per treatment arm to have adequate study power for our proposed trial. 
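A common (uncorrected) formula for comparing two proportions gives a figure within a few patients of the chapter’s 433; exact values differ slightly across formula variants and z-value rounding:

```python
import math

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Patients per group to detect event rates p1 vs. p2
    (two-sided alpha = 0.05, power = 80%; simple uncorrected formula)."""
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return numerator / (p1 - p2) ** 2

# Ankle example: 5% vs. 10% secondary surgery rates.
n = n_per_group(0.05, 0.10)
print(math.ceil(n))  # 432 per arm; formula variants place this at 431-435
```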
Reworking the above equation, the study power can be calculated for any given sample size by transforming the formula and solving for the z-score: z = |p1 − p2| / √(p̄(1 − p̄)(2/n)) − z(α/2), where p̄ is the average of the two proportions and n is the sample size per group. 
From the above example, the z-score will be 0.84 for a sample size of 433 patients. The corresponding study power for a z-score of 0.84 is 80%. 

Measuring Patient Health and Function

The basis of the “outcomes movement” in trauma is the move toward identifying patient-relevant and clinically important measures to evaluate the success (or failure) of surgical interventions. Common to any outcome measure that gains widespread use should be its reliability and validity. Reliability refers to the extent to which an instrument yields the same results in repeated applications in a population with stable health. In other words, reliability represents the extent to which the instrument is free of random error. Validity is an estimation of the extent to which an instrument measures what it was intended to measure. The process of validating an instrument involves accumulating evidence that indicates the degree to which the measure represents what it was intended to represent. Some of these methods include face, content, and construct validity.7,33 

What is Health-Related Quality of Life?

The World Health Organization defines health as “a state of complete physical, mental, and social well-being.” Thus, when measuring health in a clinical or research setting, questioning a patient’s well-being within each of these domains is necessary to comprehensively represent the concept of health. Instruments that measure aspects of this broad concept of health are often referred to as health-related quality of life (HRQOL) measures. These measures encompass a broad spectrum of items including those associated with activities of daily life such as work, recreation, household management, and relationships with family, friends, and social groups. HRQOL considers not only the ability to function within these roles, but also the degree of satisfaction derived from performing them. 
A generic instrument is one that measures general health status inclusive of physical symptoms, function, and emotional dimensions of health. A disadvantage of generic instruments, however, is that they may not be sensitive enough to be able to detect small but important changes.28 
Disease-specific measures, on the other hand, are tailored to inquire about the specific physical, mental, and social aspects of health affected by the disease in question, allowing them to detect small, important changes.33 Therefore, to provide the most comprehensive evaluation of treatment effects, no matter the disease or intervention, investigators often include both a disease-specific and a generic health measure. In fact, many granting agencies and ethics boards insist that a generic instrument be included in the design of proposed clinical studies. 
Often, the combination of objective endpoints in a surgical study (i.e., quality of fracture reduction) and validated measures of patient function and quality of life is an ideal combination. Whereas an intra-articular step-off in a tibial plafond fracture may be viewed as a less-than-satisfactory radiographic outcome, there may be no detectable effect on patient function or quality of life.38 
Another factor to consider is the ability of the outcome measure to discriminate between patients across a spectrum of the injury in question. Questionnaires may sometimes exhibit ceiling and floor effects. Ceiling effects occur when the instrument is too easy and all respondents achieve the highest possible score. Alternatively, floor effects can occur if the instrument is very difficult or taps into rare issues associated with the disease, so that most patients achieve the lowest possible score. Miranda et al.,42 in a study of 80 patients with pelvic fractures, found that the severity of pelvic fracture did not alter Short Form-36 (SF-36) and Iowa pelvic scores. 
Despite increasing severity of the pelvic injury, functional outcomes remained equally poor. This was likely related to the associated soft tissue injuries that created a “floor effect” limiting the ability to discriminate between the orthopedic injuries. 
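Ceiling and floor effects can be screened for directly by tabulating how many respondents sit at the extremes of the scale. A common rule of thumb (an assumption here, not a standard from this chapter) flags an effect when more than roughly 15% of respondents achieve the minimum or maximum score.

```python
def extreme_score_rates(scores, min_score, max_score):
    """Fraction of respondents at the floor and ceiling of the scale."""
    n = len(scores)
    floor_rate = sum(s == min_score for s in scores) / n
    ceiling_rate = sum(s == max_score for s in scores) / n
    return floor_rate, ceiling_rate

# Hypothetical 0-100 functional scores after severe pelvic injury
scores = [0, 0, 0, 0, 0, 0, 12, 30, 45, 8]
floor_rate, ceiling_rate = extreme_score_rates(scores, 0, 100)
print(floor_rate, ceiling_rate)  # 0.6 0.0 -> pronounced floor effect
```

In this invented sample, 60% of patients sit at the lowest possible score, so the instrument cannot discriminate between the most severely affected patients, much as in the pelvic fracture example above.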

Common Outcome Instruments Used in Trauma

Beaton and Schemitsch6 have reported commonly used measures of outcome in orthopedics (Table 16-8). These include both generic and disease-specific instruments. Properties of these instruments follow. 
Table 16-8
Commonly Used Outcome Measures
Measurement Properties
Type Measure Domains/Scales Number of Items Response Categories Target Population Internal Consistency Test–Retest Reliability Construct Validity Responsiveness Comments
Utility EQ-5D Mobility
Self care
Usual activities
Total: 5
3 All NA Y YY Y Describes health state that is transcribed into utility using UK data. Indirect measure of utility.
Generic SF-36 version 2 Physical function
Bodily pain
Role function—physical
Role function—emotional
Mental health
Vitality
Social functioning
General health
Total = 35 + 1 item
3–6 All YY Y YY YY Version 2 now in use.
Uses improved scaling for role functioning, and clearer wording.
Reliability is lower than desired for individual level of interpretation, fine for group.
Region SMFA Daily activities
Emotional status
Arm/hand function
Above combined for functional index
Bothersome index
5 points Musculoskeletal YY YY YY YY Normative data now available
Only measure designed for any musculoskeletal problem.
DASH Physical function, symptoms (one scale) 30 5 All upper limb musculoskeletal disorders YY YY Y YY Normative data now available.
Manual available.
Toronto extremity salvage score (TESS) Physical function in surgical oncology 30 5 Lower limb sarcoma YY YY Y YY Developed in oncology; used in hip fractures.
Specific WOMAC Physical function
5 or VAS Osteoarthritis of knee, hip YY YY YY YY Adopted as key outcome for evaluating knee arthroplasty.
Roland and Morris Physical function because of low back pain 24 2 (Yes/No) Low back pain Y YY YY YY Excellent review and comparison with Oswestry in Roland and Fairbanks.48
Oswestry Pain
Personal care
Sex life
Social life
1 each 6 points Low back pain YY YY YY YY Excellent review and comparison with Roland in Roland and Fairbanks.48
Simple Shoulder Test (SST) Function-8
Sleep position
2 (Yes difficult Yes/No) Shoulder disorders Y YY YY YY Developers suggest reporting % with difficulty in each item, not a summative score. Some psychometrics done using sum of items.
Neck disability index Pain
Personal care
1 each 6 points Whiplash disorders Y Y Y Y Neck pain has few instruments that have been evaluated for psychometrics. This is most tested.
Patient-specific No patient-specific measure found in literature reviewed.

NA, not available; Y, one or two articles found in support of this attribute; YY, multiple articles supporting this attribute.


From Beaton DE, Schemitsch E. Measures of health-related quality of life and physical function. Clin Orthop Relat Res. 2003;413:90–105.



The EQ-5D, formerly known as the EuroQol, is a five-item scale that is designed to allow people to describe their health state across five dimensions.19 There are three response categories per dimension, which combine for a total of 243 possible health states. The preference weighting yields a single numeric score from slightly less than zero (theoretically worse than death) to one (best health state). EQ-5D scores are used in economic appraisals (such as cost-utility analyses) in the construction of quality-adjusted life years for the calculation of cost per quality-adjusted life year gained and its comparison across interventions. 
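The arithmetic behind a cost-utility comparison is straightforward once a utility weight is available. The sketch below uses invented costs and utility values purely to illustrate how an incremental cost per quality-adjusted life year (QALY) is derived; real analyses use published value sets (such as the UK EQ-5D tariff) and discounting.

```python
def qalys(utility, years):
    """Quality-adjusted life years: utility weight (0 = dead, 1 = full health) x time."""
    return utility * years

def cost_per_qaly(cost_new, cost_old, qalys_new, qalys_old):
    """Incremental cost per QALY gained when moving from the old to the new treatment."""
    return (cost_new - cost_old) / (qalys_new - qalys_old)

# Hypothetical figures: operative care costs $12,000 with utility 0.80 over 5 years;
# nonoperative care costs $4,000 with utility 0.70 (all values invented)
print(cost_per_qaly(12_000, 4_000, qalys(0.80, 5), qalys(0.70, 5)))  # ~$16,000 per QALY
```

The resulting ratio is what decision-makers compare against a willingness-to-pay threshold when judging interventions against one another.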

Short Form-36

The SF-36 is a generic measure of health status. It is probably one of the most widely used measures. The SF-36 has 35 items that fit into one of 8 subscales; one additional item is not used in the scores. In 1994, the developers, led by Ware,59 produced two summary scores for the SF-36: The physical component score (which more heavily weights dimensions of pain, physical function, and role function-physical) and the mental component score (which gives more weight to mental health, vitality, etc.). Both component scores are standardized, so the general population (based on a US sample) will score 50 on average, with a standard deviation of 10. The subscale scores, often presented as a profile graph, are scored on a scale of 0 to 100, where 100 is a good health state. 
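The norm-based standardization step described above is a simple T-score transformation. The sketch below shows only that step, with invented population values; actual SF-36 component scoring additionally applies factor-analytic weights derived from the US normative sample.

```python
def t_score(raw, pop_mean, pop_sd):
    """Norm-based standardization: population mean maps to 50, one SD to 10 points."""
    return 50 + 10 * (raw - pop_mean) / pop_sd

# Hypothetical subscale raw score of 30 against an assumed population
# mean of 50 and SD of 25 (illustrative values, not the published SF-36 norms)
print(t_score(30, 50, 25))  # 42.0 -> 0.8 SD below the population mean
```

On this metric, any score below 50 is immediately interpretable as below the general-population average, which is the practical appeal of norm-based scoring.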

Short Musculoskeletal Function Assessment Form

The short musculoskeletal function assessment (SMFA) form is a 46-item questionnaire that is a shortened version of Swiontkowski's full musculoskeletal function assessment.53 The SMFA has two main scores: The function index (items 1 to 34) and the bothersome index (items 35 to 46). The function index is subdivided into four subscales (daily activities, emotional status, arm and hand function, and mobility). The SMFA has been tested in patients with musculoskeletal disorders, as this is the target population. Its psychometric properties are high, suggesting that it can be used for monitoring individual patients. The SMFA was designed to describe the various levels of function in people with musculoskeletal disorders, as well as to monitor change over time. The SMFA correlates highly with the SF-36, and use of both instruments in the same patient population is likely redundant. 

Disabilities of the Arm, Shoulder, and Hand Form

The Disabilities of the Arm, Shoulder, and Hand (DASH) form is a 30-item questionnaire designed to measure physical function and disability in any or all disorders of the upper limb. It is therefore designed to be sensitive to disability and change in disability in the hand as well as in the shoulder. In one study, it was directly compared to a shoulder and a wrist measure, and had similar levels of construct validity, responsiveness, and reliability. Another study showed slightly lower properties in the DASH as compared with a wrist-specific measure in patients with wrist fracture. Like the SMFA, the measurement properties of the DASH are quite high (internal consistency 0.96, test–retest 0.95, good validity and responsiveness) suggesting it could also be used in individual patients in a clinical setting. 
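The published DASH scoring rule is simple enough to sketch: each of the 30 items is rated 1 (no difficulty) to 5 (unable), the item mean is shifted and rescaled to 0 to 100, and the score is considered invalid if more than three items are missing. The code below is an illustrative implementation of that rule, not the official scoring software.

```python
def dash_score(responses):
    """DASH disability score from 30 items rated 1-5; None marks a missing item.
    Returns None when more than 3 items are missing (published validity rule)."""
    answered = [r for r in responses if r is not None]
    if len(responses) != 30 or len(answered) < 27:
        return None
    mean = sum(answered) / len(answered)
    return (mean - 1) * 25   # 0 = no disability, 100 = most severe disability

print(dash_score([1] * 30))                # 0.0  (no difficulty on any item)
print(dash_score([3] * 30))                # 50.0 (moderate difficulty throughout)
print(dash_score([3] * 26 + [None] * 4))   # None (too many missing items)
```

Because missing items are handled by averaging the answered ones, a handful of skipped questions does not invalidate the score, which suits busy clinical settings.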

Western Ontario and McMaster Universities Osteoarthritis Index

The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) is a 24-item scale divided into three dimensions: Function, pain, and stiffness. The most commonly used response scale is a five-point Likert; however, there is a visual analogue scale version. It has been widely used and tested in the field of osteoarthritis and rheumatoid arthritis and a review of its psychometric properties was summarized by McConnell et al.40 in 2001. The WOMAC is the most commonly used and endorsed patient-based outcome after hip or knee arthroplasty. 

Hip Rating Questionnaire

The Hip Rating Questionnaire (HRQ) is a patient-administered, 14-item questionnaire that uses a 100-point summated rating scale. A higher score suggests better health status. Equal weight is given to the domains of the overall impact of arthritis, pain, walking, and function. This questionnaire is designed to assess outcomes after total hip replacement surgery. According to Johanson et al.,35 2-week test–retest administrations produced a weighted κ-score of 0.7, and the sensitivity to change was deemed to be excellent. 

Harris Hip Score

The Harris Hip Score (HHS) is a patient- and clinician-administered questionnaire designed to assess patients with traumatic arthritis of the hip.47 It is a 10-item questionnaire that uses a 100-point summated rating scale and takes approximately 15 to 30 minutes to administer. There are four domains: The pain domain contributes 44 points; function, 47; range of motion, 5; and absence of deformity, 4. The function domain is divided into gait and activities, whereas the deformity domain considers hip flexion, adduction, internal rotation, and limb-length discrepancy; range of motion is measured separately.47 A higher score suggests better health status. The HHS is the most commonly used scoring system for evaluating hip arthroplasty. Its responsiveness has been found to be comparable to, and in some cases better than, the WOMAC pain and function subscales.47 

The Hospital for Joint Diseases Hip Fracture Recovery Score (Functional Recovery Score)

The Hospital for Joint Diseases Hip Fracture Recovery Score (FRS) is an interviewer-administered questionnaire with 11 items comprising three main components: Basic activities of daily living assessed by four items and contributing 44 points, instrumental activities of daily living assessed by six items and contributing 23 points, and mobility assessed by one item and contributing 33 points. Therefore, complete independence in basic and instrumental activities of daily living and mobility will give a score of 100 points.63,64 It is a patient-oriented outcomes measure that is designed to assess functional recovery for ambulatory hip fracture patients.63,64 Use of the FRS can provide the means of assessing the recovery of prefracture function.63,64 The FRS has been found to be responsive to change, reliable, and has predictive validity as well as discriminant validity.64 

Get-Up and Go Test

The Get-Up and Go (GUG) test was developed as a clinical measure of balance in elderly people and is an in-person assessment. The GUG test measures the time a person takes to get up from a chair and walk 15.2 m (50 ft) as fast as possible along a level and unobstructed corridor. Thus, this performance-based measure of physical function requires the patient to be able to rise from a seated position, walk, and maintain his or her balance.45 The scoring of this instrument is based on balance function, which is scored on a 5-point scale, with 1 indicating normal and 5 indicating severely abnormal. A patient with a score of 3 or more is at risk for falling. Mathias et al.39 found that when patients underwent laboratory tests of balance and gait, there was good correlation between the laboratory tests and the objective assessment. 

Merle d’Aubigné-Postel Score

The Merle d’Aubigné-Postel (MDP) score contains three domains: Pain, mobility, and walking ability. The three domains are weighted equally. The scores for pain and walking ability are added and subsequently classified into the grades very good, good, medium, fair, and poor. These grades are then adjusted down by one to two grades to account for the mobility score, which yields the final clinical grade. The modified MDP differs slightly from the original in language and grading: The modified version is calculated on a scale of 0 to 6 (as opposed to 1 to 6) and does not combine the scores to obtain a total score.44 

Knee Injury and Osteoarthritis Outcome Score

The Knee injury and Osteoarthritis Outcome Score (KOOS) is designed to assess short- and long-term patient-relevant outcomes after knee injury.49 The KOOS was designed based on the WOMAC, literature review, and an expert panel and has been statistically validated for content validity, construct validity, reliability, and responsiveness. The questionnaire is composed of 42 items that are scored on a Likert scale. A higher score indicates better health status. Subscales include pain, symptoms, activities of daily living, sport and recreation, and knee-related quality of life.49 

Lower Extremity Measure

The Lower Extremity Measure is a patient-administered instrument designed to assess physical function.34 This questionnaire is a modification of the Toronto Extremity Salvage Score and has been statistically confirmed for reliability, validity, and responsiveness. The Lower Extremity Measure is composed of 29 items on a Likert scale and administration takes approximately 5 minutes. This questionnaire has been designed for an elderly population, with 10 points indicating significant clinical change.34 

Olerud and Molander Scoring System

The Olerud and Molander Scoring System is a patient-administered questionnaire designed to assess the symptoms after ankle fracture.43 It is composed of nine items on a summated rating scale and has been compared with the visual analog scale (VAS), range of motion, osteoarthritis, and dislocation for statistical validation. A higher score indicates better health status.43 

American Shoulder and Elbow Surgeons Assessment Form

The American Shoulder and Elbow Surgeons (ASES) Assessment Form is designed to assess the shoulder and elbow and is patient- and clinician-administered.41 There is no cost to obtain this instrument. Subscales of the shoulder score index include pain, instability, activities of daily living, range of motion, signs, and strength. A higher score indicates better health status. The instrument is a combination of VAS and Yes/No scaled questions. Administration by the patient takes approximately 3 minutes.41 

American Orthopedic Foot and Ankle Scale

The American Orthopedic Foot and Ankle Scale was designed for use among patients with foot or ankle dysfunction. It contains four region-specific scales, including ankle–hindfoot, midfoot, hallux metatarsophalangeal, and lesser metatarsophalangeal–interphalangeal scales. Patients self-report information about pain and function in each region. This scale also incorporates physical examination results recorded by the clinician. Although the American Orthopedic Foot and Ankle Scale has been widely used in studies of foot and ankle surgical outcomes, limitations have also been reported.52,54 

Utilizing Outcome Studies in Decision-Making (Evidence-Based Orthopedics)

What is Evidence-Based Orthopedics?

The term EBM first appeared in the fall of 1990 in a document for applicants to the Internal Medicine Residency Program at McMaster University in Ontario, Canada, which described EBM as an attitude of enlightened skepticism toward the application of diagnostic, therapeutic, and prognostic technologies. As outlined in the text Clinical Epidemiology and first described in the literature in the ACP Journal Club in 1991, the EBM approach to practicing medicine relies on an awareness of the evidence upon which a clinician’s practice is based and the strength of inference permitted by that evidence.29 The most sophisticated practice of EBM requires, in turn, a clear delineation of relevant clinical questions, a thorough search of the literature relating to the questions, a critical appraisal of available evidence and its applicability to the clinical situation, and a balanced application of the conclusions to the clinical problem. The balanced application of the evidence (i.e., the clinical decision-making) is the central point of practicing EBM and involves, according to EBM principles, integration of our clinical expertise and judgment with patients’ preferences and societal values and with the best available research evidence (Fig. 16-6). The EBM working group at McMaster University has proposed a working model for evidence-based clinical practice that encompasses current research evidence, patient preferences, clinical circumstances, and clinical expertise. EBM is commonly misunderstood as removing clinical expertise as a factor in patient decision-making. This is not so. The common thread that weaves the relationships between patients, circumstances, and research is the experience and skill of the surgeon. 

Finding Current Evidence in Trauma

To be effective EBM practitioners, surgeons must acquire the necessary skills to find the “best” evidence available to answer clinically important questions. Reading a few articles published in common orthopedic journals each month is insufficient preparation for answering the questions that emerge in daily practice. There are at least 100 orthopedic journals indexed by MEDLINE.2 For surgeons whose principal interest is orthopedic traumatology, the list is even larger. Given their large clinical demands, surgeons’ evidence searches must be time-efficient. Evidence summaries (such as those published in the Journal of Orthopaedic Trauma) and systematic reviews (comprehensive literature reviews) are useful resources for surgeons (Table 16-9). The most efficient way to find them is by electronic searching of databases and/or the internet. With time at a premium, it is important to know where to look and how to develop a search strategy, or filter, to identify the evidence most efficiently and effectively. Recently, we have developed a point-of-care resource in orthopedics that provides timely and regularly updated evidence reports in trauma. The site, known as OrthoEvidence, searches journals each month and identifies high-quality evidence (namely randomized clinical trials or meta-analyses). Data from these trials are abstracted and a careful risk of bias assessment is conducted. The end result, termed an “Advanced Clinical Evidence (ACE) report,” is posted on the site. 
Table 16-9
Finding Current Evidence: Resources
    Using the Medical Literature
    Journal of American Medical Association User’s Guides
    Canadian Medical Association Journal User’s Guides
    Journal of Bone and Joint Surgery User’s Guides
    Canadian Journal of Surgery User’s Guides
Electronic Publications
    ACP Journal Club (American College of Physicians)
    Bandolier: Evidence-based healthcare
    National Guideline Clearinghouse (Agency for Health Care Policy and Research [AHCPR])
Internet Resources

User’s Guide to Evaluate an Orthopedic Intervention

Most surgical interventions have inherent benefits and associated risks. Before implementing a new therapy, one should ascertain the benefits and risks of the therapy and be assured that the resources consumed in the intervention will not be exorbitant. A simple three-step approach can be used when reading an article from the orthopedic literature (Table 16-10). It is prudent to ask whether the study can provide valid results (internal validity), to review the results, and to consider how the results can be applied to patient care (generalizability). Lack of randomization, no concealment of treatment allocation, lack of blinding, and incomplete follow-up are serious threats to the validity of a published randomized trial. The user’s guide focuses the assessment on assuring that investigators have considered these issues in the conduct of their study. Understanding the language of EBM is also important. Table 16-11 provides a summary of common terms used when considering the results of a clinical paper. Although randomized trials sit atop the hierarchy of evidence for evaluating an intervention, not all orthopedic research questions are suitable for randomized trials. For example, observational studies (prospective cohorts) are more suitable designs when evaluating prognosis (or risk factors) for outcome following a surgical procedure. However, common problems with alternative (and accepted) surgical treatments argue strongly in favor of randomized trials. Complex problems with a lack of consensus in surgical technique, or lack of acceptance of one approach, argue in favor of observational studies to further elucidate the technique and to understand the indications for alternative approaches before embarking on a randomized trial. 
Table 16-10
User’s Guide to Orthopedic Randomized Trials
Did experimental and control groups begin the study with a similar prognosis?
Were patients randomized?
Was randomization concealed?
Were patients analyzed in the groups to which they were randomized?
Were patients in the treatment and control groups similar with respect to known prognostic factors?
Did experimental and control groups retain a similar prognosis after the study started?
Did investigators avoid effects of patient awareness of allocation—were patients blinded?
Were aspects of care that affect prognosis similar in the two groups—were clinicians blinded?
Was outcome assessed in a uniform way in experimental and control groups—were those assessing the outcome blinded?
Was follow-up complete?
How large was the treatment effect?
How precise was the estimate of the treatment effect?
Can the results be applied to my patient?
Were all patient-important outcomes considered?
Are the likely treatment benefits worth the potential harms and costs?
Table 16-11
Presentation of Results
                 Infection  No Infection
Treatment Group  A = 10     B = 90
Control Group    C = 50     D = 50
Treatment event rate (TER): A / (A + B) = 10/100 = 10%
The incidence of infection in the treatment group
Control event rate (CER): C / (C + D) = 50/100 = 50%
The incidence of infection in the control group
Relative risk: TER / CER = 10/50 = 0.2 or 20%
The relative risk of infection in the treatment group relative to the control group
RRR: 1−RR = 1 − 0.2 = 0.8 or 80%
Treatment reduces the risk of infection by 80% compared with controls
Absolute risk reduction (ARR): CER − TER = 50% − 10% = 40%
The actual numerical difference in infection rates between treatment and controls
Number needed to treat: 1 / ARR = 1 / 0.4 = 2.5
On average, one infection is prevented for every 2.5 patients treated (in practice, rounded up to 3)
Odds ratio: AD / BC = (10)(50) / (90)(50) = 500 / 4500 = 0.11
The odds of infection in the treatment group relative to the control group is 0.11
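The calculations in Table 16-11 can be reproduced with a short script, which is a convenient way to check the arithmetic when appraising a paper. The function below is a minimal sketch using the same 2 x 2 cell labels (A through D) as the table.

```python
def two_by_two(a, b, c, d):
    """Measures of treatment effect from a 2x2 table.
    a, b = treatment group (event, no event); c, d = control group."""
    ter = a / (a + b)            # treatment event rate
    cer = c / (c + d)            # control event rate
    return {
        "TER": ter,
        "CER": cer,
        "RR": ter / cer,         # relative risk
        "RRR": 1 - ter / cer,    # relative risk reduction
        "ARR": cer - ter,        # absolute risk reduction
        "NNT": 1 / (cer - ter),  # number needed to treat
        "OR": (a * d) / (b * c), # odds ratio
    }

stats = two_by_two(10, 90, 50, 50)   # data from Table 16-11
print(stats["NNT"], round(stats["OR"], 2))  # 2.5 and 0.11
```

Running the function on the table's data reproduces every value shown above: a relative risk of 0.2, a relative risk reduction of 80%, an absolute risk reduction of 40%, a number needed to treat of 2.5, and an odds ratio of 0.11.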

Incorporating Evidence-Based Orthopedics into Daily Trauma Practice

EBM is becoming an accepted educational paradigm in medical education at a variety of levels. An analysis of the literature related to journal clubs in residency programs in specialties other than orthopedic surgery reveals that the three most common goals were to teach critical appraisal skills (67%), to have an impact on clinical practice (59%), and to keep up with the current literature (56%).58 The implementation of the structured article review checklist has been found to increase resident satisfaction and improve the perceived educational value of the journal club without increasing resident workload or decreasing attendance at the conference. 
Structured review instruments have been applied in a number of orthopedic training programs; assessments of the outcomes and effectiveness of this format for journal club are ongoing. One example of a structured review instrument for use in orthopedic training programs is provided in Figure 16-7. 
Figure 16-7
A checklist to assess the quality of surgical therapies.

The Future of Outcome Studies in Orthopedic Trauma

Over the past 50 years, there has been a vast proliferation of randomized trials. Although the strength of evidence is most persuasive in large randomized trials with narrow confidence intervals around their treatment effect, such trials are not always feasible for many clinical problems in orthopedics. Indeed, only 3% (72 of 2,498 studies) of studies published in orthopedics reflect randomized trial methodology.14 The design, conduct, and analysis of clinical research has gained widespread appreciation in surgery, particularly in orthopedic surgery. Still, only 14% of the original contributions in JBJS represent level I evidence.18 When randomization is either not feasible or unethical, prospective observational studies represent the best evidence. Approximately one in five scientific articles published in JBJS represents this level II evidence.18 In a more recent review of the literature, Chan and Bhandari23 identified 87 randomized trials in orthopedic surgical procedures, representing 14% of the published studies. JBJS contributed 4.1% of the published randomized trials in this report. 
Future studies can provide high-quality data on which to base practice if we conduct RCTs whenever feasible, ensure adequate sample size, involve biostatisticians and methodologists, collect data meticulously, and accurately report our results using sensible outcomes and measures of treatment effect. Limiting type II errors (β-errors) will require multicenter initiatives. These larger trials have the advantage of increased generalizability of the results and the potential for large-scale and efficient recruitment (1,000 patients or more). Single-center trials that may have taken a decade to recruit enough patients can now be completed in a few years through collaborative research trials. The obvious drawback of multicenter initiatives is the relative complexity of the design and the cost. It is reasonable to expect that a trial of over 1,000 patients will cost more than $3 to $4 million to conduct. 


The purpose of the “outcomes movement” and EBM is to provide healthcare practitioners and decision-makers (physicians, nurses, administrators, regulators) with tools that allow them to gather, access, interpret, and summarize the evidence required to inform their decisions and to explicitly integrate this evidence with the values of patients. In this sense, EBM is not an end in itself, but rather a set of principles and tools that help clinicians distinguish ignorance of evidence from real scientific uncertainty, distinguish evidence from unsubstantiated opinions, and ultimately provide better patient care. 

Appendix: Sample Size Calculations

1. Continuous Variables

The number of patients required per treatment arm to obtain 80% study power (β = 0.2) at a 0.05 α-level of significance is as follows:

n1 = n2 = 2σ²(z1-α/2 + z1-β)² / Δ²

where 
    n1 = sample size of group one
    n2 = sample size of group two
    Δ = difference of outcome parameter between groups (5 points)
    σ = sample standard deviations (12)
    z1-α/2 = z0.975 = 1.96 (for α = 0.05)
    z1-β = z0.8 = 0.84 (for β = 0.2)
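This continuous-outcome formula can be evaluated directly; with the chapter's example values (a 5-point difference and a standard deviation of 12), it yields roughly 91 patients per arm. The sketch below is a minimal illustration using those values.

```python
import math

def n_per_arm_continuous(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Appendix formula 1: per-arm sample size for a continuous outcome."""
    return math.ceil(2 * (sigma ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2)

# Example: detect a 5-point difference given a standard deviation of 12 points
print(n_per_arm_continuous(5, 12))  # 91 patients per arm
```

Note how the required sample size scales with the square of the standard deviation and inversely with the square of the difference to be detected: halving the clinically important difference quadruples the sample size.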

2. Dichotomous Variables

The number of patients required per treatment arm to obtain 80% study power (β = 0.2) at a 0.05 α-level of significance is as follows:

n1 = n2 = [z1-α/2√(2pmqm) + z1-β√(p1q1 + p2q2)]² / Δ²

where 
    n1 = sample size of group one
    n2 = sample size of group two
    p1, p2 = sample probabilities (5% and 10%)
    q1, q2 = 1 − p1, 1 − p2 (95% and 90%)
    pm = (p1 + p2)/2 (7.5%)
    qm = 1 − pm (92.5%)
    Δ = difference = p2 − p1 (5%)
    z1-α/2 = z0.975 = 1.96 (for α = 0.05)
    z1-β = z0.8 = 0.84 (for β = 0.2)


American Medical Association. User’s guides to the medical literature: a manual for evidence-based clinical practice. In Guyatt GH, Rennie D, eds. 2nd ed. Chicago, IL: American Medical Association Press; 2001.
Atkins D, Best D, Briss PA, et al. Grading quality of evidence and strength of recommendations. BMJ. 2004;328(7454):1490.
Atkins D, Briss PA, Eccles M, et al. Systems for grading the quality of evidence and the strength of recommendations II: pilot study of a new system. BMC Health Serv Res. 2005;5(1):25.
Atkins D, Eccles M, Flottorp S, et al. Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches. The GRADE Working Group. BMC Health Serv Res. 2004;4:38.
Balogh ZJ, Reumann MK, Gruen RL, et al. Advances and future directions for management of trauma patients. Lancet. 2012;380(9847):1109–1119.
Beaton DE, Schemitsch E. Measures of health-related quality of life and physical function. Clin Orthop Relat Res. 2003;413:90–105.
Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med. 2000;342:1878–1886.
Bhandari M, Devereaux PJ, Li P, et al. The misuse of baseline comparison tests and subgroup analyses in surgical randomized controlled trials. Clin Orthop Relat Res. 2006;447:247–251.
Bhandari M, Guyatt GH, Siddiqui F, et al. Operative versus nonoperative treatment of achilles tendon rupture—a systematic overview and meta-analysis. Clin Orthop Relat Res. 2002;400:190–200.
Bhandari M, Guyatt GH, Swiontkowski MF. User’s guide to the orthopaedic literature: how to use an article about a prognosis. J Bone Joint Surg. 2001;83A:1555–1564.
Bhandari M, Guyatt GH, Swiontkowski MF. User’s guide to the orthopaedic literature: how to use an article about a surgical therapy. J Bone Joint Surg. 2001;83A:916–926.
Bhandari M, Montori VM, Devereaux PJ, et al. Doubling the impact: publication of systematic review articles in orthopaedic journals. J Bone Joint Surg Am. 2004;86:1012–1016.
Bhandari M, Richards R, Schemitsch EH. The quality of randomized trials in Journal of Bone and Joint Surgery from 1988–2000. J Bone Joint Surg Am. 2002;84A:388–396.
Bhandari M, Swiontkowski MF, Einhorn TA, et al. Interobserver agreement in the application of levels of evidence to scientific papers in the American volume of the Journal of Bone and Joint Surgery. J Bone Joint Surg Am. 2004;86A:1717–1720.
Bhandari M, Tornetta P III. Issues in the hierarchy of study design, hypothesis testing, and presentation of results. Tech Orthop. 2004;19:57–65.
Bhandari M, Tornetta P 3rd, Rampersad SA, et al. (Sample) Size Matters! An Examination of Sample Size from the SPRINT trial study to prospectively evaluate reamed intramedullary nails in patients with tibial fractures. J Orthop Trauma. 2013;27:183–188.
Bhandari M, Tornetta P III, Ellis T, et al. Hierarchy of evidence: differences in results between nonrandomized studies and randomized trials in patients with femoral neck fractures. Arch Orthop Trauma Surg. 2004;124(1):10–16.
Bhandari M, Tornetta P III. Communicating the risks of surgery to patients. Eur J Trauma. 2004;30:177–180.
Bhandari M, Whang W, Kuo JC, et al. The risk of false-positive results in orthopaedic surgical trials. Clin Orthop Relat Res. 2003;413:63–69.
Bhandari M, Zlowodzki M, Cole PA. From eminence-based practice to evidence-based practice: a paradigm shift. Minn Med. 2004;4:51–54.
Box JF. Guinness, Gosset, Fisher, and small samples. Statistical Science. 1987;2:45–52.
Brighton B, Bhandari M, Tornetta P III, et al. Hierarchy of evidence: from case reports to randomized controlled trials. Clin Orthop Relat Res. 2003;413:19–24.
Chan S, Bhandari M. The quality of reporting of orthopaedic randomized trials with use of a checklist for nonpharmacological therapies. J Bone Joint Surg Am. 2007;89:1970–1978.
Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med. 2000;342:1887–1894.
Devereaux PJ, Bhandari M, Clarke M, et al. Need for expertise-based randomized controlled trials. BMJ. 2005;330(7482):88.
Devereaux PJ, Manns BJ, Ghali W, et al. In the dark: physician interpretations and textbook definitions of blinding terminology in randomized controlled trials. JAMA. 2001;285:2000–2003.
Dirschl DR, Tornetta P III, Bhandari M. Designing, conducting, and evaluating journal clubs in orthopaedic surgery. Clin Orthop Relat Res. 2003;413:146–157.
Griffin D, Audige L. Common statistical methods in orthopaedic clinical studies. Clin Orthop Relat Res. 2003;413:70–79.
Guyatt GH. Evidence-based medicine. ACP J Club. 1991;114:A16.
Haentjens P, Autier P, Boonen S. Clinical risk factors for hip fracture in elderly women: a case-control study. J Orthop Trauma. 2002;6:379–385.
Haynes RB, Mukherjee J, Sackett D, et al. Functional status changes following medical or surgical treatment for cerebral ischemia: results in the EC/IC Bypass Study. JAMA. 1987;257:2043–2046.
Ioannidis JP, Haidich AB, Pappa M, et al. Comparison of evidence of treatment effects in randomized and nonrandomized studies. JAMA. 2001;286:821–830.
Jackowski D, Guyatt G. A guide to health measurement. Clin Orthop Relat Res. 2003;413:80–89.
Jaglal S, Lakhani Z, Schatzker J. Reliability, validity, and responsiveness of the lower extremity measure for patients with a hip fracture. J Bone Joint Surg Am. 2000;82A:955–962.
Johanson NA, Charlson ME, Szatrowski TP, et al. A self-administered hip-rating questionnaire for the assessment of outcome after total hip replacement. J Bone Joint Surg Am. 1992;74:587–597.
Kunz R, Oxman AD. The unpredictability paradox: review of empirical comparisons of randomized and nonrandomized clinical trials. BMJ. 1998;317:1185–1190.
Lochner H, Bhandari M, Tornetta P III. Type II error rates (beta errors) in randomized trials in orthopaedic trauma. J Bone Joint Surg. 2002;83A:1650–1655.
Marsh JL, Weigel DP, Dirschl DR. Tibial plafond fractures. How do these ankles function over time? J Bone Joint Surg Am. 2003;85A:287–295.
Mathias S, Nayak USL, Isaacs B. Balance in elderly patients: the “get-up-and-go” test. Arch Phys Med Rehabil. 1986;67:387–389.
McConnell S, Kolopack P, Davis AM. The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC): a review of its utility and measurement properties. Arthritis Rheum. 2001;45:453–461.
Michener LA, McClure PW, Sennett BJ. American Shoulder and Elbow Surgeons Standardized Shoulder Assessment Form patient self-report section: reliability, validity, and responsiveness. J Shoulder Elbow Surg. 2002;11:587–594.
Miranda MA, Riemer BL, Butterfield SL, et al. Pelvic ring injuries. A long-term functional outcome study. Clin Orthop Relat Res. 1996;329:152–159.
Olerud C, Molander H. A scoring scale for symptom evaluation after ankle fracture. Arch Orthop Trauma Surg. 1984;103:190–194.
Ovre S, Sandvik L, Madsen JE, et al. Comparison of distribution, agreement, and correlation between the original and modified Merle d’Aubigne-Postel Score and the Harris Hip Score after acetabular fracture treatment: moderate agreement, high ceiling effect, and excellent correlation in 450 patients. Acta Orthop Scand. 2005;76:796–802.
Piva SR, Fitzgerald GK, Irrgang JJ, et al. Get-up-and-go test in patients with knee osteoarthritis. Arch Phys Med Rehabil. 2004;85:284–289.
Pocock S, Assman S, Enos L, et al. Subgroup analysis, covariate adjustment, and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21:2917–2930.
Rogers JC, Irrgang JJ. Measures of adult lower extremity function. Arthritis Rheum. 2003;49:S67–S84.
Roland M, Fairbank J. The Roland-Morris disability questionnaire and the Oswestry disability questionnaire. Spine. 2000;25:3115–3124.
Roos EM, Toksvig-Larsen S. Knee injury and Osteoarthritis Outcome Score (KOOS)— validation and comparison to the WOMAC in total knee replacement. Health Qual Life Outcomes. 2003;1:17.
Sackett DL, Haynes RB, Guyatt GH, et al. Clinical Epidemiology: A Basic Science for Clinical Medicine. Boston, MA: Little Brown; 1991.
Sackett DL, Richardson WS, Rosenberg WM, et al. Evidence-based Medicine: How to Practice and Teach EBM. New York, NY: Churchill Livingstone; 1997.
Saltzman CL, Domsic RT, Baumhauer JF. Foot and ankle research priority: report from the Research Council of the American Orthopaedic Foot and Ankle Society. Foot Ankle Int. 1997;18:447–448.
Swiontkowski MF. Short Musculoskeletal Function Assessment (SMFA). Available online. Accessed September 10, 2009.
SooHoo NF, Shuler M, Fleming LL. Evaluation of the validity of the AOFAS Clinical Rating Systems by correlation to the SF-36. Foot Ankle Int. 2003;24:50–55.
SPRINT Investigators, Bhandari M, Guyatt G, et al. Randomized trial of reamed and unreamed intramedullary nailing of tibial shaft fractures. J Bone Joint Surg Am. 2008;90:2567–2578.
SPRINT Investigators, Bhandari M, Guyatt G, et al. Study to prospectively evaluate reamed intramedullary nails in patients with tibial fractures (SPRINT): study rationale and design. BMC Musculoskelet Disord. 2008;9:91.
Sung J, Siegel J, Tornetta P III, et al. The orthopaedic trauma literature: an evaluation of statistically significant findings in orthopaedic trauma randomized trials. BMC Musculoskelet Disord. 2008;9:14.
The EC/IC Bypass Study Group. Failure of extracranial-intracranial arterial bypass to reduce the risk of ischemic stroke: results of an international randomized trial. N Engl J Med. 1985;313:1191–1200.
Ware J. Available online. Accessed September 10, 2009.
Wright JG, Swiontkowski MF, Heckman JD. Introducing levels of evidence to the journal. J Bone Joint Surg Am. 2003;85A:1–3.
Yeung M, Bhandari M. Uneven global distribution of randomized trials in hip fracture surgery. Acta Orthop. 2012;83(4):328–333.
Yusuf S, Wittes J, Probstfield J, et al. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA. 1991;266:93–98.
Zuckerman JD, Koval KJ, Aharonoff GB, et al. A functional recovery score for elderly hip patients: I. Development. J Orthop Trauma. 2000;14:20–25.
Zuckerman JD, Koval KJ, Aharonoff GB, et al. A functional recovery score for elderly hip patients: II. Validity and reliability. J Orthop Trauma. 2000;14:26–30.