Nicholas Bellamy
▪ Measurements of symptom severity, disease activity and consequence, and adverse effects of treatment as well as global assessments are all commonly used.
▪ Outcome measures must be valid, reliable, responsive, and feasible, and should be brief, simple, and easy to score, particularly for use in routine clinical care.
▪ The choice of outcome measures depends on disease state, the therapeutic repertoire of the intervention(s) exhibited, and whether they are to be used for research or routine practice.
▪ Core sets of outcome measures have been recommended for use in clinical trials for several rheumatic diseases.
▪ Clinical benchmarks are emerging against which the treatment response can be interpreted at the individual patient level.
Measurement forms an essential component of health care at all levels, from the treatment of individual patients to policy development. The measurements may provide descriptive information or play a role in evaluation of response to treatment or in prognostication. The focus of this chapter is on evaluating clinical health status in individual patients and groups of patients; that is, aspects of the condition that can be appreciated through questioning and clinical examination. In addition to allowing evaluation of individuals and groups of patients, clinical outcome measures find important applications in making formal assessments of the impact of health care decisions on health care costs, quality of care, quality of life, and burden of disease, as measured by morbidity and mortality in the population. Such assessments are an integral part of health care delivery systems. A list of abbreviations related to clinical measurement that are used in this chapter can be found in Box 2.1.
Clinical outcome measures examine the clinical consequences of a disease or disorder in terms of mortality (survivorship) and morbidity (symptom severity, impact of the condition on various aspects of health-related quality of life, negative consequences of treatment). In painful musculoskeletal conditions the impact on function is particularly important, as is the bidirectional relationship between perceived pain and emotional well-being. The paradigm of the “5 Ds”—death, disability, discomfort, drug (or other iatrogenic) reactions, and dollar (or economic) costs—advanced by Fries encompasses outcomes that are relevant to all patients because they are discernible at the individual patient level, represent the key consequences of disease and its treatment, and are the ultimate outcomes of the disease process.1 The components of health and the consequences of arthritis and musculoskeletal conditions are many and diverse. It is possible to measure multiple aspects of a patient's condition with evaluation procedures that are relevant to specific clinical conditions and assessment and measurement environments, and are aligned with assessment objectives.
It is important to appreciate that in the discipline of clinical measurement (synonym: clinical metrology) there is variable use of common terminology. The terms endpoint and outcome are used interchangeably. Similarly, the terms instrument and tool can refer not only to electromechanical devices but also to questionnaires and other assessment techniques. The questionnaires, in turn, are variably referred to as health status questionnaires, health status measures, or health status instruments. The terms subjective (soft) and objective (hard) are best avoided, and instead the measurement technique should be referred to according to its qualitative (e.g., type of measure, purpose of measurement) and quantitative (e.g., reliability, validity, responsiveness) measurement characteristics.
Outcome measures may be divided into two broad categories: observer dependent (or assessor rated) and observer independent (or self-rated). In general, observer-independent clinical measures are based on self-administered questionnaires, usually referred to as patient-reported outcome measures (PROMs). In contrast, observer-dependent clinical measures include interviewer-scored questionnaires, physical examination findings, and tests of performance rated using technical instruments (e.g., grip strength, walk time). Certain techniques, such as face-to-face interview, walk time, grip strength, and articular index, involve an interaction between the assessor and the patient, which on occasion may bias the result obtained.
A fundamental aspect of measurement is whether all patients should be asked the same questions on every occasion, whether the questions should be adapted to individual patients, or whether the responses patients give to initial questions should determine the nature of subsequent questions, as in computer adaptive testing. The National Institutes of Health are currently exploring aspects of this issue in the Patient-Reported Outcomes Measurement Information System (PROMIS) initiative.2-4 This initiative has resulted in the development of item banks that form the basis of an alternative approach to index construction and presentation.
Outcome measurement procedures should fulfill three major criteria that are encompassed by those proposed by the Outcome Measures in Rheumatology Clinical Trials (OMERACT) group as the OMERACT filter.1 The three components of the filter are “truth,” “discrimination,” and “feasibility.” Truth concerns issues of validity, discrimination addresses issues of both reliability and responsiveness, and feasibility encompasses ease, time, cost, interpretability, and arguably ethics. Measurement processes that are potentially hazardous to patients raise ethical issues that must be explained to and understood and accepted by patients and study participants. The importance and necessity of acquiring new information should be weighed against any attendant risks.
Validity is a measure of the extent to which an instrument specifically measures the phenomenon of interest. In contrast, reliability defines the extent to which the measurement procedure yields the same result on repeated determinations. Validity and reliability are separate clinimetric issues, and it does not follow that because a measure is reliable it is also valid or vice versa (Fig. 2.1). More specifically, validity is concerned with sources of nonrandom error (i.e., systematic error or bias), whereas reliability is concerned with sources of random error, also known as noise. Systematic error or bias may prevent an instrument from truly measuring what is intended, which results in inaccuracy. A number of different strategies may be employed to determine validity, each conceptually different and relying on different methods of assessment. There are four types of validity: face, content, construct, and criterion.
A measure has face validity if experts judge that it measures at least part of the defined phenomenon. In many instances this is self-evident, whereas in others, particularly in questionnaire-based measures of functional status, it may not be entirely obvious whether the measurement reflects physical, social, or emotional function.
An instrument can have face validity but fail to capture in its entirety the dimension of interest. Content validity, therefore, is a measure of comprehensiveness—the extent to which the measure encompasses all relevant aspects of the defined attribute. Content validity is generally determined by group consensus (e.g., nominal or Delphi techniques). The decision regarding which items should be included in a questionnaire-based instrument and which should be excluded is critical, because it defines the nature of the instrument and its future applicability. Any subsequent addition or deletion results in an instrument that requires further validation.
Given the variability in symptom expression by patients, one of two approaches can be employed to achieve comprehensiveness in questionnaire-based measures. The first is to include a standard battery of measures that probes frequently occurring symptoms that are clinically important to the majority of patients. In the second approach, the measurement process is tailored to the individual by measuring only those items that are important to that particular person. The issue of importance can be decided by patients, who rate the importance of their symptoms, or by clinical assessors, whose decision is based on their perception of the patient's symptoms.
Construct validity is of two types: convergent and discriminant. Both are tested by demonstrating relationships between measurement scores and a theoretical manifestation (i.e., construct or consequences of an attribute) of the disease. Convergent construct validity is assessed by the statistical correlation between scores on a single health component, as measured by two different instruments. If the correlation coefficient is positive and appreciably above zero, the new measure is said to have convergent construct validity. In contrast, discriminant construct validity testing compares correlation coefficients between scores on the same health component, as measured by two different instruments (e.g., separate measures of physical function) and between scores on that health component and each of several other health components (e.g., measures of social and emotional function). A measure has discriminant construct validity if the proposed measure correlates better with a second measure, accepted as more closely related to the construct, than it does with a third, more distantly related measure.
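The comparison of correlation coefficients described above can be illustrated with a minimal sketch. The scores below are hypothetical values invented for illustration (they are not patient data), and the variable names are assumptions; the Pearson coefficient is used here only as one common choice of correlation statistic.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for 8 patients (illustrative only)
new_function = [10, 14, 22, 25, 31, 38, 40, 47]          # proposed physical function measure
established_function = [12, 15, 20, 28, 30, 35, 43, 45]  # second physical function measure
emotional = [30, 12, 25, 18, 33, 15, 28, 22]             # more distantly related component

# Convergent: same health component measured by two instruments
r_convergent = pearson_r(new_function, established_function)
# Comparison for discriminant testing: a different health component
r_distant = pearson_r(new_function, emotional)

print(f"same component:      r = {r_convergent:.2f}")
print(f"different component: r = {r_distant:.2f}")
```

In this invented data set the proposed measure correlates far more strongly with the second measure of physical function than with the emotional function scores, which is the pattern expected of a measure with discriminant construct validity.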
Criterion validity is assessed by statistically testing a new measurement technique against an independent criterion or standard (concurrent validity) or against a future standard (predictive validity). Criterion validity is an estimate of the extent to which a measure agrees with a gold standard (i.e., an external criterion of the phenomenon being measured). The major problem in criterion validity testing, for questionnaire-based measures, is the general lack of gold standards. Indeed, some purported gold standards may not themselves provide completely accurate estimates of the true value of a phenomenon. In contrast, electromechanical devices such as those evaluating strength, resistance, and range of movement can more often be validated using standard calibration techniques found in the engineering literature.
Consistency, reproducibility, repeatability, and agreement are synonyms for reliability. Reliability is the extent to which a measurement procedure yields the same result on repeated determinations (see Fig. 2.1). These determinations may be either different measurements performed at the same time (internal consistency) or the same measurements performed at different times (stability). Repeated measurements are rarely exactly the same, because there is usually some random error, noise, or degree of inconsistency. There are various methods of calculating reliability, each method reflecting a different aspect of instrument performance, so that different coefficients are derived using different methods. Defined sources of measurement error include the subject, the assessor, and the measuring instrument. Patient variability often arises from normal variation in symptoms or from fatigue or inattention. When observers are involved in capturing information, both interobserver and intraobserver reliability are important. Observer variability may be the result of inadequate standardization, the necessity for complex judgments, perceptual elements, or fatigue. When technical instruments are involved in the measurement process, variation in, for example, the cuff configuration of a modified sphygmomanometer or in the mechanical resistance of a dynamometer or dolorimeter may be important sources of error.
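As an illustration of one reliability coefficient, internal consistency is often summarized with the Cronbach α statistic. The sketch below uses a commonly cited variance-based form, α = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of items. The function name and the item scores are assumptions made for illustration only.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach alpha from a list of items.

    item_scores: list of k items; each item is a list of scores for the
    same respondents, in the same order.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(item_scores[0])
    if any(len(item) != n for item in item_scores):
        raise ValueError("every item needs a score for every respondent")
    # Total score per respondent across all items
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

# Hypothetical 3-item subscale answered by 5 respondents (illustrative only)
items = [
    [2, 3, 4, 1, 3],
    [3, 3, 4, 2, 3],
    [2, 4, 4, 1, 2],
]
alpha = cronbach_alpha(items)
print(f"Cronbach alpha = {alpha:.2f}")
```

Sample variances are used throughout; because the same denominator appears in both the item variances and the total-score variance, the choice between sample and population variance does not change the result.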
The primary goal of outcome assessment in evaluative research is to detect clinically important changes in some aspect of a condition. To detect change, a measurement technique needs to be targeted to aspects of the disease amenable to change, using content and scaling methods that allow detection of change, and be applied at a point in time when change might have occurred. Validity and reliability are important aspects of all measurement techniques, but the capacity to detect change is the quintessential requirement of a successful outcome measure for evaluating change over time and the patient's response to treatment. An assessment technique may fail to record any clinical improvement for a number of reasons (e.g., patient lacks response potential, lacks compliance with the treatment program, or has received inefficacious treatment; the outcome measure is insensitive; or a type II error occurs due to inadequate sample size).
It is necessary to have a basic understanding of the statistics used in the development and validation of health status measures. Although correlation coefficients are frequently used, they are a measure of the strength of association, not a measure of agreement, so that it is possible to have perfect correlation but zero agreement between two measurements. In evaluating construct validity using correlation coefficients, both strength and direction should be considered a priori; otherwise, data interpretation can become complex and biased.
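The distinction between association and agreement can be made concrete with a small sketch. It assumes two hypothetical assessors scoring the same five patients on a 0 to 100 scale, with the second assessor always scoring exactly 10 points higher; the data and names are invented for illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores (illustrative only): assessor B is always 10 points higher
assessor_a = [20, 35, 50, 65, 80]
assessor_b = [score + 10 for score in assessor_a]

r = pearson_r(assessor_a, assessor_b)
differences = [b - a for a, b in zip(assessor_a, assessor_b)]
bias = mean(differences)  # systematic difference between assessors
exact_agreement = sum(d == 0 for d in differences) / len(differences)

print(f"correlation = {r:.2f}")
print(f"mean difference (B - A) = {bias}")
print(f"proportion of identical scores = {exact_agreement}")
```

The correlation coefficient is at its maximum because the two sets of scores rise and fall together, yet the assessors never record the same value for any patient. Examining the paired differences, rather than the correlation alone, exposes the constant offset.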
The following are commonly used statistical tests for evaluating reliability, validity, and responsiveness:
▪ Bland and Altman scatter plots permit the distribution of pairwise data to be appreciated.
▪ The Cohen κ statistic, a metric reflecting agreement beyond chance, can be weighted or unweighted and adjustments made for bias and prevalence.
▪ Intraclass correlation coefficients are often used in intrarater and interrater agreement analyses, based on continuous data.
▪ The Cronbach α statistic, a measure of average interitem correlation, is used to evaluate internal consistency.
▪ Factor analysis is a useful procedure for evaluating structure provided the sample is sufficiently large and is also representative. Repeated factor analyses on the same instrument using different data sets may yield conflicting results.
▪ Rasch analysis is increasingly being employed to test outcome scales against mathematical models using a probabilistic form of hierarchical or Guttman scaling. The Rasch model attempts to scale variables along a continuum, but not all Rasch analyses of the same instrument necessarily agree with one another. Although, in general, individuals who can perform tasks of a particular degree of difficulty should be able to perform all tasks of lesser difficulty, this is not always the case.
▪ A number of techniques, including interperiod correlation matrix analysis, may be used to estimate measurement error, which can in turn be used to evaluate change beyond measurement error.
▪ ANOVA (analysis of variance) and Student's t-test are commonly used to evaluate responsiveness.
▪ Effect size (change score/baseline standard deviation) is used to evaluate responsiveness.
▪ Standardized response mean (change score/change score standard deviation) is used to evaluate responsiveness. Responsiveness, whether evaluated by ANOVA, Student's t-test, effect size, or standardized response mean, may differ between outcome measures used to evaluate different variables or even the same variable and may differ even within an outcome measure when it is presented in different scaling formats.
▪ Relative efficiency (square of the ratio of two t or z statistics) is used to evaluate relative responsiveness.
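The three responsiveness statistics defined in the list above (effect size, standardized response mean, and relative efficiency) can be sketched as follows. The function names are assumptions, the change score is taken as the mean of follow-up minus baseline, and a paired t statistic is used as the input to relative efficiency; other conventions exist, so this is an illustration rather than a definitive implementation.

```python
from math import sqrt
from statistics import mean, stdev

def change_scores(baseline, follow_up):
    """Per-patient change (follow-up minus baseline); negative means a lower score."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up must be paired")
    return [f - b for b, f in zip(baseline, follow_up)]

def effect_size(baseline, follow_up):
    """Mean change score divided by the baseline standard deviation."""
    return mean(change_scores(baseline, follow_up)) / stdev(baseline)

def standardized_response_mean(baseline, follow_up):
    """Mean change score divided by the standard deviation of the change scores."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / stdev(changes)

def paired_t(baseline, follow_up):
    """Paired t statistic: mean change divided by its standard error."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / (stdev(changes) / sqrt(len(changes)))

def relative_efficiency(t_a, t_b):
    """Square of the ratio of two t (or z) statistics, measure A relative to B."""
    return (t_a / t_b) ** 2

# Hypothetical pain scores for 6 patients on two scales (illustrative only)
scale_a_before, scale_a_after = [60, 55, 70, 65, 50, 75], [40, 45, 50, 40, 35, 50]
scale_b_before, scale_b_after = [6, 5, 7, 7, 5, 8], [5, 5, 5, 6, 3, 7]

print("effect size, scale A:", round(effect_size(scale_a_before, scale_a_after), 2))
print("SRM, scale A:", round(standardized_response_mean(scale_a_before, scale_a_after), 2))
print("relative efficiency, A vs B:",
      round(relative_efficiency(paired_t(scale_a_before, scale_a_after),
                                paired_t(scale_b_before, scale_b_after)), 2))
```

For scales on which a lower score indicates improvement, the effect size and standardized response mean computed this way are negative; some authors reverse the sign so that improvement is positive. A relative efficiency greater than 1 suggests that the first measure was the more responsive of the two in that sample.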
There are numerous techniques for assessing the beneficial and adverse outcomes of treatment. Some health status instruments have been developed specifically to assess certain conditions (condition-specific measures), whereas others have been developed to permit measurement across several or many diseases (generic measures). Readers interested in specific measurement issues should consult standard measurement texts. In this section a review is presented of the measurement of pain, impairment, ability/disability, and participation/handicap; patient and physician global assessments; multidimensional health status instruments (quality of life, utility); and measures of adverse drug reactions. Readers should refer to other sections of this textbook for information on laboratory-based and imaging-based measurement procedures.
Pain is an entirely subjective phenomenon, the perception of which is modulated by a variety of influences, and results in pain behaviors that may be observed. The severity of perceived pain can be rated only by the sufferer. In contrast, pain behavior can be rated by a trained assessor. Complex bidirectional interactions occur between pain and other factors, including personal factors unique to the individual. Pain can be assessed using a variety of techniques (Fig. 2.2), including the following:
Comparative studies of pain rating scales in rheumatoid arthritis (RA) and osteoarthritis (OA) suggest that there is a positive relationship between initial pain rating and subsequent pain relief scores and that all the aforementioned types of scales are responsive. Likert and visual analogue scales are the most frequently applied types of scales. Likert scales usually employ an odd number of descriptors (3, 5, 7, or 9), whereas visual analogue scales are most often 100 mm long and horizontal in orientation, with terminal descriptors. The numerical rating scale is often scaled from 0 to 10, displayed in linked boxes, and has terminal descriptors. The focus, phraseology, descriptors, and time frames for recall used in such scales often vary among different questionnaires. In comparative studies in RA and OA, the single-item global numerical rating pain scale appears comparable in responsiveness to the same question presented on a visual analogue scale and is slightly more responsive than the corresponding five-point Likert scale.
When pain scales are applied, it is important to specify either the global nature of the question or the specific aspect of pain that is being assessed (e.g., while stair climbing, while walking, at rest, during the night, overall) and clearly indicate the time interval over which pain is being evaluated (e.g., previous 48 hours, 1 week). Several of the standard established segregated multidimensional health status instruments, including the Health Assessment Questionnaire (HAQ),5 Arthritis Impact Measurement Scales (AIMS) and AIMS2,6,7 Western Ontario and McMaster (WOMAC) Osteoarthritis Index,8 and Australian-Canadian (AUSCAN) Hand Osteoarthritis Index,9 contain distinct pain subscales. These contrast with other instruments that either contain no pain subscale (e.g., Functional Index for Hand Osteoarthritis [FIHOA],10 Cochin index11) or in which pain is an integral part of an aggregated scoring system combining scores on several different dimensions (e.g., Indices of Clinical Severity [ICS]).12 It should be noted that the recall period differs among measures. A short recall period may offer advantage when the intervals between repeat assessments are short.
The terms impairment, disability, and handicap may be confused in everyday usage, even though the World Health Organization (WHO) has previously characterized these three different entities.13 Impairment is defined as any loss or abnormality of psychological, physiologic, or anatomic structure or function. It signifies that a pathologic state has reached a stage of detectability. In contrast, disability includes any restriction or lack of ability to perform an activity in the manner considered normal. Such disabilities include alteration in behavior and interactive processes such as communication, as well as strictly physical function. A handicap is manifest as a disadvantage experienced by an individual as a result of an impairment or disability, such that it limits or prevents the fulfillment of a role considered normal for that individual. In essence, therefore, impairment occurs at an organ level (intellectual, sensory, visceral, musculoskeletal), disability occurs at a personal level (behavior, self-care, locomotion, dexterity), and handicap occurs at a social level (independence, geographic mobility, employability, social integration).13 The concepts of disability and handicap, in particular, are considered unipolar; and partly for this reason the newer WHO International Classification of Functioning, Disability, and Health (ICF) is to be preferred, because it considers these same concepts in a bipolar framework of ability and participation.14 The ICF framework proposed by the WHO is particularly useful in conceptualizing the functional impact of many of the conditions described in other chapters of this text. 
The ICF framework considers a hierarchy of impairments, activities, and participation, modulated by contextual factors (environmental and personal), and extends the original WHO classification of impairment, disability, and handicap to incorporate positive aspects of health, including physical activity and participation in society.14 Several more modern outcome measures have been mapped to the ICF framework.1 The ICF framework may also provide a construct on which to base instrument development.
In clinically based musculoskeletal applications in research and practice, measurement is often directed at the assessment of abnormalities in structure (e.g., swelling, tenderness). The most important examination-based measurement procedure is the separate enumeration of swollen and tender joints in patients with inflammatory polyarthritis.1 The usual method employs separate counts of the number of swollen and tender joints using either the American College of Rheumatology (ACR) 68-joint count or the European League Against Rheumatism (EULAR) 28-joint count. The 28-joint count is often conducted as part of determining the 28-joint Disease Activity Score (DAS28).
It is convenient to chart individual joint involvement on a homunculus or mannequin (Fig. 2.3), as is the practice with the DAS28. Other methods have been developed for scoring joint involvement in OA, enthesopathy in ankylosing spondylitis (AS), and tender points in fibromyalgia.1 Methods for assessing other clinical signs, such as heat, erythema, and crepitus, have been less rigorously studied. All such procedures are liable to interrater and intrarater error and are best performed by skilled standardized assessors. Nevertheless, various impairment assessments can be conducted at acceptable levels of reliability by trained assessors.
The American Medical Association Guides to the Evaluation of Permanent Impairment14a provides a standardized method to rate impairment at maximum medical improvement. However, the guides are primarily employed in compensation/litigation applications, and different jurisdictions may use different versions of the guides.15
Functional measures can be subdivided into those based on observed tests of performance and questionnaires that assess functional capacity.
Performance-based measures might at first seem the most direct way to assess the patient's function. However, such measures often involve an interaction between the assessor and subject that may augment or diminish the true performance and raise concern regarding intraassessor and interassessor consistency and the need for assessor standardization. Furthermore, change in performance on the performance test may not correspond to a comparable change in the performance of relevant activities of daily living. The relevance of the change to the patient may therefore require interpretation. Nevertheless, some measures, such as the modified Schober test in AS, grip strength in RA, and range of movement in knee OA, remain in common usage in clinical rheumatology. In health disciplines such as orthopedic surgery and physiotherapy extensive use is made of performance-based measures in the routine management of patients with bone and joint diseases.1 Interrater and intrarater reliability have been quantified and standard procedures developed to train assessors and perform the evaluations.
Patients with musculoskeletal disorders may experience three types of functional disability: physical, emotional, and social. All three forms are amenable to outcome assessment using valid, reliable, and responsive questionnaires. Physical disability, in particular, has received considerable attention, and some techniques have been designed for a specific purpose whereas others have multipurpose applications.1 Physical disability is one of the two most important and immediate consequences of many musculoskeletal conditions. The capacity to measure this aspect of the disease, both in clinical trials and clinical practice, is paramount. Fully validated health status instruments including separate measures of pain and function are to be preferred over ad-hoc scales. Examples of disease-specific measures of function include the following:
▪ WOMAC Osteoarthritis Hip/Knee Index8
▪ AUSCAN Osteoarthritis Hand Index9
▪ Cochin hand index11
▪ FIHOA10
▪ Bath Ankylosing Spondylitis Functional Index (BASFI)1
▪ Fibromyalgia Impact Questionnaire (FIQ)1
In contrast, measures such as the HAQ,5 AIMS,6 and AIMS27 have multipurpose musculoskeletal applications, and generic quality-of-life measures such as the Short Form-36 Health Survey (SF-36) and the European Health-Related Quality of Life (EuroQol) instrument can be used across a wide variety of diverse medical conditions.16 The SF-36 is available in several different formats. In recent years, there has been an increasing number of comparative studies of the performance of different health status measures. Different measures are underpinned by different constructs, vary in their responsiveness, and may impose different sample size requirements when used in clinical research applications.
The measurement of emotional disability in musculoskeletal disease is important because considerable psychological morbidity has been noted in patients with musculoskeletal disorders. There is a bidirectional inverse relationship between pain and emotional well-being. Furthermore, it is not surprising, given the degree of pain, disability, anxiety, depression, and diminished sense of well-being experienced by patients, that many musculoskeletal disorders have social consequences.
Measures of handicap reflect the social consequence of disease. Whether an individual is handicapped can be defined either by society or by an individual. The measurement of handicap in the rheumatic diseases has attracted relatively little attention. The Disease Repercussion Profile (DRP) is a valid and reliable measure of patient-perceived handicap. Relatively few interventional studies, even recent studies, include a measure of participation/handicap.
Global assessments by patient and physician of the patient's overall condition are commonly used in clinical trials and in clinical practice. It is extremely important to specify in the wording of the global question which aspects of the patient's condition are being considered (e.g., overall health status, symptom severity, disease activity, or an anatomic area). The patient global assessment is particularly important because it can be phrased to assess current status or change in symptom status and can be focused on a particular anatomic area, the condition in general, or the patient as a whole. Alternatively, it can be used to assess drug tolerability and/or efficacy or other aspects of the treatment program (palatability, compliance, affordability, convenience). The time frame over which the patient should consider his or her status should be defined (e.g., 48 hours, 1 week). It is debatable whether the physician global assessment of overall health adds much to the measurement process over and above the patient's own global assessment. However, the physician (or other assessor) can consider, in addition, aspects of the condition that are not assessable by the patient (e.g., radiographic change and biochemical, hematologic, and immunologic abnormalities) and may provide insight into whether the patient tends to amplify or minimize reported symptoms. Again, the physician requires clear specification as to which aspects of the condition should be considered when making his or her assessment. The time frame for the physician global assessment usually should be specified as “today” since the assessor generally has no knowledge of the patient's interval status other than that described by the patient and captured by the patient global assessment. At the present time there is no international consensus on the exact wording of the patient global assessment question or the preferred scaling format. 
However, a recent study has explored clinical benchmarking issues in RA, OA (hip/knee), OA (hand), AS, and low back pain using standardized patient global assessment questions developed through group consensus of expert opinion.17
Patients and physicians may differ in their perception of the patient's global assessment of disease activity or symptom severity. The determinants and consequences of this discrepancy need careful consideration, particularly by practitioners and researchers using interviewer-administered outcome questionnaires, in which there is potential for interaction between the interviewer and patient.
A large number of disease-specific and generic multidimensional health status measures have been developed. Many are very sophisticated instruments that have undergone extensive validation procedures and have high performance characteristics with superior levels of validity, reliability, and responsiveness. They are either self-administered or occasionally interviewer administered. It is very important for users to contact the originator before initial application to obtain instructions regarding usage (e.g., presentation, scoring, analysis), to procure an authentic current version of the instrument and a copy of the user's guide, and to obtain any required permissions. Many outcome measures are protected by copyright and/or trademark. The main reason for an originator to register copyright and trademark is to secure necessary protection for the measure, and this forms the basis for maintaining its integrity, authenticity, and appropriate usage.
Some health status measures have been developed in the form of segregated multidimensional indices that contain separate, distinct subscales exploring different aspects of the condition, such as HAQ, AIMS, AIMS2, WOMAC index, and AUSCAN index. They provide subscale scores on each of several separate dimensions. Others are in the form of aggregated multidimensional indices in which scores from several different dimensions are weighted and aggregated into a single score, such as the ICS, Pooled Index, and DAS. Health-related quality of life can be assessed either using a multi-item questionnaire, such as the SF-36 (and its derivatives), Nottingham Health Profile (NHP), or EuroQol, or using a utility-based methodology, for example the Health Utilities Index (HUI). A utility is a holistic measure of the quality of life that rates an individual along a continuum from death (0.0) to full health (1.0). Models for predicting utilities from nonutility instrument data are still evolving. For example, a model has been developed and validated to derive a utility score from WOMAC data that, at the group level, approximates the utility score estimated from the HUI in the same group of patients.1 This provides an opportunity to derive utility scores from previously published studies that contain WOMAC data but that did not include a utility-based measure.
Adverse reactions caused by treatment can result in symptoms, clinical signs, laboratory abnormalities, or death. Problems often arise in detecting, categorizing, attributing, and grading adverse events. Adverse event rates differ significantly depending on whether the assessment is based on open-ended questioning or structured questionnaires or is determined by a standard protocol. Various systems have been developed for categorizing adverse reactions to treatment. Attribution is the process of ascribing adverse reactions to interventions or other causes. The etiologic relationship is often graded as “none,” “possible,” “probable,” or “definite.” A number of factors determine the assigned level of the relationship, including prior knowledge of the patient, the pharmacodynamic profile of the intervention, and the duration of treatment. The most difficult attribution decisions are in assigning the grades of “none” versus “possible” and “probable” versus “definite” and in grading the intensity of an adverse reaction. Often severity is rated as being “mild,” “moderate,” or “severe.” A mild adverse reaction is one that is easily tolerated by the patient, causes minimal discomfort, and does not interfere with everyday activities. A moderate adverse effect is an adverse experience that causes sufficient discomfort to interfere with normal everyday activities, whereas a severe reaction is an adverse experience that is incapacitating and prevents normal everyday activities. Finally, it is important to categorize the outcome of an adverse reaction, for example, as “resolved,” “improved,” “unchanged,” “worsened,” “hospitalization required,” “hospitalization prolonged,” or “death.”
There are more than 100 different rheumatic disorders, each presenting a different challenge in outcome measurement. Regulatory authorities such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency as well as international nongovernment organizations such as the Osteoarthritis Research Society International (OARSI), OMERACT group, ACR, and Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) have pioneered the development of several evidence-based consensus-driven guidance documents. The FDA document on PROMs is particularly important in understanding the challenges of measuring patient-centered outcomes in regulatory environments.18 In addition to specifying domains and core set variables (measures) and identifying instruments, some organizations have also initiated investigations leading to the proposal of responder criteria, that is, criteria that could be used to stratify patients according to whether they have experienced a meaningful improvement with treatment. Although improvement with treatment is a very important goal, it is also important to determine the value of the final health state attained. Work in this area is, by comparison, relatively limited. However, several proposals have emerged that identify achievable, acceptable, or desirable health states. Finally, although there is a distinct paucity of general population-based normative values for PROMs, there has been some recent progress in this area.
There are a large and continually growing number of outcome measures to evaluate RA, OA, AS, reactive arthritis, systemic lupus erythematosus (SLE), low back pain, and fibromyalgia, as well as more generally applicable measures and measures of handicap, coping, helplessness, and quality of life. It is beyond the scope of this chapter to describe the development, validation, and scope of application of each of the currently available measures or the exact methods used to develop guidance documents or to develop and propose response and state-attainment criteria or normative values. Measurement preferences differ between different disorders, but pain, function, and patient global assessment are often emphasized. Where patient responses are captured and transmitted electronically, the FDA document Guidance for Industry: Computerized Systems Used in Clinical Investigations (2007) is relevant.19
Three areas that require special attention when considering data capture on a global scale are short-forms, alternate-language translations, and electronic data capture. Concerns regarding respondent burden in completing PROMs have stimulated the development of shortened versions of existing measures. Although time to completion may be reduced, short-forming also reduces content validity. When short-forming is replicated in different cultures and interventional environments, the resulting short forms differ. For example, in the case of proposed short forms of the WOMAC index, no two short forms are the same but each of the 24 items is included in at least one of the proposed short forms.20 The creation of alternate-language translations of PROMs requires skill, experience, and close attention to the standard operating procedures necessary to create linguistically valid and culturally appropriate translations.21 The process should be conducted only in association with the originator of the measure. Users should clearly differentiate between alternate-language translations created to a commercial standard and those resulting from ad-hoc processes.20 Electronic data capture is gradually replacing traditional paper-based capture of patient responses to PROMs. The development of electronic forms of existing measures may require modifications to language and format. The extent to which further linguistic and clinimetric validation is required is unclear at present. As with alternate-language creation, the development of electronic versions should be undertaken only by entities with the required special skills and experience and should adhere to standard operating procedures and be conducted only in association with the originator of the measure.20
In general, RA clinical trials evaluate either fast-acting, symptom-modifying drugs, such as nonsteroidal antiinflammatory drugs and analgesics, or slow-acting disease-modifying antirheumatic drugs, which include the biologic agents. Several valid, reliable, and responsive outcome measures have been developed for the evaluation of patients with RA. In addition, ICF core sets have been specified for RA and validated from a patient perspective, and commonly used instruments have been mapped to the framework. Normative data for the HAQ have been published and minimum clinically important improvement (MCII) and patient-acceptable symptom state (PASS) estimates explored for RA (Box 2.2). The following recommendations, developed by the OMERACT group and ratified by the ACR,1 currently comprise the ACR minimum core set for RA clinical trials: pain, physical function, number of swollen joints, number of tender joints, patient global assessment, investigator global assessment, erythrocyte sedimentation rate or C-reactive protein level, and imaging. Individual response criteria based on these clinical variables and referred to as the ACR20 criteria1 have been proposed as follows: 20% or more improvement in tender and swollen joint counts and 20% or more improvement in scores on three of the five remaining ACR core set measures. These are particularly relevant for evaluating the response to symptom-modifying drugs. The sufficiency of the ACR20 criteria has been considered, and higher-level response requirements based on the ACR50 and ACR70 criteria evaluated.1 Compared with ACR20/50/70 criteria, cut points of 50/70/85 for the Simplified Disease Activity Index (SDAI) and the Clinical Disease Activity Index (CDAI) define minor, moderate, and major responses.22
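The ACR20 rule described above can be illustrated with a short sketch. It is not an official scoring algorithm: the measure names, the dictionary layout, the assumption that lower values are better for every measure, and the handling of a zero baseline are all illustrative choices. The five "remaining" measures are taken here to be pain, physical function, patient global assessment, investigator global assessment, and the acute-phase reactant.

```python
# Illustrative sketch of an ACR20-style responder check (not an official algorithm).
# Assumption: every measure is scored so that lower values are better.

JOINT_COUNTS = ("tender_joints", "swollen_joints")
OTHER_MEASURES = (
    "pain",
    "physical_function",
    "patient_global",
    "investigator_global",
    "acute_phase_reactant",  # ESR or CRP
)


def percent_improvement(baseline, follow_up):
    """Percentage reduction from baseline (positive means improvement)."""
    if baseline <= 0:
        # Assumption: a zero baseline leaves no room for measurable improvement.
        return 0.0
    return 100.0 * (baseline - follow_up) / baseline


def acr_response(baseline, follow_up, threshold=20.0):
    """True if both joint counts and at least three of the five other
    measures improve by `threshold` percent or more."""
    joints_ok = all(
        percent_improvement(baseline[k], follow_up[k]) >= threshold
        for k in JOINT_COUNTS
    )
    others_improved = sum(
        percent_improvement(baseline[k], follow_up[k]) >= threshold
        for k in OTHER_MEASURES
    )
    return joints_ok and others_improved >= 3


# Hypothetical patient data, for illustration only.
baseline = {"tender_joints": 10, "swollen_joints": 8, "pain": 60,
            "physical_function": 1.5, "patient_global": 50,
            "investigator_global": 50, "acute_phase_reactant": 40}
follow_up = {"tender_joints": 6, "swollen_joints": 4, "pain": 30,
             "physical_function": 1.4, "patient_global": 30,
             "investigator_global": 45, "acute_phase_reactant": 20}

print(acr_response(baseline, follow_up))                  # 20% threshold
print(acr_response(baseline, follow_up, threshold=50.0))  # 50% threshold
```

Raising the `threshold` argument to 50 or 70 gives the corresponding higher-level requirement under the same assumptions.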
The treat-to-target approach to management, which characterizes the ACR23 and EULAR24 recommendations for the management of RA, has stimulated further consideration of the definition of remission in RA. Definitions of remission and minimum disease activity have been proposed based on several outcome measures, including the DAS28, ACR/EULAR remission criteria, SDAI, and CDAI.25-27 Remission rates are instrument and cut-score dependent. Thus, in a comparative study in early RA, higher remission rates were observed when DAS28 criteria were used than when ACR/EULAR criteria were employed.28
Clinical trials in OA can be subdivided into those that assess symptom-modifying OA drug effects and those that assess structure-modifying drug effects. Several valid, reliable, and responsive outcome measures have been developed for the evaluation of patients with OA of the hip, knee, and hand. ICF core sets for OA have been elaborated and the WOMAC index, ICS, and Osteoarthritis Knee and Hip Quality of Life (OAKHQOL) questionnaire mapped to the ICF core set. A unique set of age- and gender-specific general population-based normative data have been reported for the WOMAC and AUSCAN index subscales (Figs. 2.4 and 2.5)29,30 and MCII and PASS estimates explored for OA based on the WOMAC index for hip and knee OA and the AUSCAN index for hand OA (see Box 2.2). MCII and PASS estimates may vary across instruments, subscales, countries, and therapeutic environments.31 Although there are no clinical measures specific for structure-modifying drug trials in OA, it is of note that the WOMAC index detected clinical benefits associated with structural conservation in a recent double-blind, randomized, controlled clinical trial of strontium ranelate.32


Core set measures for OA have been proposed by the OMERACT group and ratified, or in the case of hand OA, developed by the OARSI.1 OMERACT and OARSI core set measures for future phase 3 studies in hip, knee, and hand OA are pain, function, patient global assessment, and, for studies of 1 year or longer, joint imaging using validated methods and standardized techniques. OMERACT and OARSI have jointly proposed responder criteria for hip and knee OA, based on combinations of absolute and percentage changes in pain, function, and patient global assessment.1 The OMERACT-OARSI responder criteria are applicable to both hip and knee OA patients and are not intervention class specific. Response criteria developed for OA and proposed by other groups include definitions of minimum perceptible clinical improvement, minimum clinically important difference, MCII, WOMAC 20-50-70, and AUSCAN 20-50-70. In addition to these definitions based on the degree of change, other definitions have been proposed based on the health state attained, the so-called state-attainment criteria. Definitions based on PASS and the Bellamy Low-Intensity Symptom State-Attainment (BLISS) Index (see Box 2.2)33 are examples of initial attempts to establish working definitions for use in clinical practice and clinical research. Observations that statistically significant between-group differences are detectable at low or very low symptom intensity levels, so-called BLISS states, are of particular importance in the evaluation of therapeutic interventions33 because they provide indicators of achievable health states (see Box 2.2). Although further research is necessary, observations of BLISS states have been made with the WOMAC and AUSCAN indices and in different classes of interventions and confirm that such states are achievable by some, but not all, patients. 
Attempts by the OARSI to develop a surrogate indicator for total joint replacement in hip and knee OA have not, to date, resulted in cut scores for pain and function capable of discriminating between patients who are and who are not considered candidates for total joint replacement by their orthopedic surgeons.34,35
Several valid, reliable, and responsive outcome measures have been developed for the evaluation of symptom severity and disease status in patients with AS.36 Clinical measurement in AS encompasses both the peripheral joints and extraarticular involvement, as well as axial skeletal involvement. Through the OMERACT process and the Assessment in Ankylosing Spondylitis (ASAS) Working Group, core set measures have been identified for use in different clinical research settings.1 The ASAS group has also developed a definition of short-term improvement (ASAS20) based on four domains: physical function, pain, patient global assessment, and inflammation. ASAS20 defines improvement as 20% or more and 1 unit or more (0 to 10 scale) improvement in at least three of the four domains, with no worsening (defined as deterioration of 20% or more and 1 unit or more [0 to 10 scale]) in the fourth domain.1 The recent emergence from the ASAS/OMERACT initiative of the Ankylosing Spondylitis Disease Activity Score (ASDAS), based on the weighted sum of measures of back pain, morning stiffness, patient global assessment, peripheral joint pain/swelling, and either C-reactive protein level or erythrocyte sedimentation rate, provides a composite index for evaluating disease activity in AS.37
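The ASAS20 definition combines a relative and an absolute threshold, which is easy to misapply. The following sketch shows one way to encode it. It is illustrative only: the domain keys, the assumption that each domain is scored 0 to 10 with lower values better, and the treatment of a zero baseline are assumptions, not part of the published definition.

```python
# Illustrative sketch of an ASAS20-style check (not an official algorithm).
# Assumption: each domain is scored on a 0-to-10 scale, lower is better.

ASAS_DOMAINS = ("physical_function", "pain", "patient_global", "inflammation")


def meets_change(change, baseline, pct=20.0, units=1.0):
    """True if `change` is at least `units` in absolute terms and at least
    `pct` percent of the baseline score."""
    if change < units:
        return False
    if baseline <= 0:
        # Assumption: any change of >= 1 unit from a zero baseline
        # is treated as exceeding the relative threshold.
        return True
    return 100.0 * change / baseline >= pct


def asas20(baseline, follow_up):
    """True if at least three domains improve and no domain worsens."""
    improved = 0
    worsened = 0
    for domain in ASAS_DOMAINS:
        b, f = baseline[domain], follow_up[domain]
        if meets_change(b - f, b):    # score fell: improvement
            improved += 1
        elif meets_change(f - b, b):  # score rose: worsening
            worsened += 1
    return improved >= 3 and worsened == 0


# Hypothetical scores, for illustration only.
before = {"physical_function": 6.0, "pain": 6.0,
          "patient_global": 6.0, "inflammation": 6.0}
after_stable = {"physical_function": 4.0, "pain": 3.0,
                "patient_global": 4.0, "inflammation": 6.0}
after_worse = dict(after_stable, inflammation=8.0)

print(asas20(before, after_stable))  # three domains improved, fourth unchanged
print(asas20(before, after_worse))   # fourth domain worsened by 2 units
```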
The BASFI, Dougados Functional Index (DFI), Health Assessment Questionnaire for the Spondyloarthropathies (HAQ-S), and Revised Leeds Disability Questionnaire (RLDQ) instruments have been mapped to the ICF framework, and MCII and PASS estimates have been explored for AS (see Box 2.2). PASS estimates for the Bath Ankylosing Spondylitis Disease Activity Index (BASDAI) and BASFI have differed for observations made on patients with AS in different countries.38,39
Outcome measurement in psoriatic arthritis (PsA) has undergone considerable refinement in recent years, and there are now available several instruments for evaluating different aspects of the condition.40 Both single-domain and composite measures have been developed; the former assess only a single aspect of the condition, whereas the latter encompass more complex issues such as quality of life and disease activity.40,41 Consensus has been reached on core set domains for randomized controlled trials and longitudinal observational studies in PsA, namely, peripheral joint activity, skin activity, patient global assessment, pain, physical function, and health-related quality of life. In preparation for treat-to-target approaches to PsA management, a minimal disease activity definition has been proposed.42 Some progress has been made toward the development of response definitions for PsA-relevant outcome measures.41-43 In mapping standard measures of functioning to the ICF framework, it has been noted that several areas of concern to patients with PsA are inadequately covered.44
There are several validated instruments for conducting outcome measurement in patients with SLE. Modifications of some of those measures, in particular, the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI), Lupus Activity Index (LAI), and Systemic Lupus Activity Measure (SLAM), have been proposed for use in SLE in pregnancy. These modifications are referred to respectively as SLEPDAI, LAI-P, and m-SLAM.
Core set domains for clinical trials and longitudinal studies have been proposed by the OMERACT group and include (1) disease activity, (2) health-related quality of life, (3) organ damage, and (4) toxicity/adverse events. The ACR has proposed response criteria for SLE clinical trials.1 These criteria are based on absolute change, are instrument specific, and have been proposed for the British Isles Lupus Assessment Group (BILAG), SLEDAI, SLAM, European Consensus Lupus Activity Measurement (ECLAM), Safety of Estrogens in Lupus Erythematosus National Assessment Modification of SLEDAI (SELENA-SLEDAI), and Responder Index for Lupus Erythematosus (RIFLE) instruments. The criteria are in the form of threshold values for improvement and worsening. The Pediatric Rheumatology International Trials Organization and the ACR have proposed provisional response criteria for juvenile SLE based on at least 50% improvement in any two of five core set measures and with no more than one of the remaining measures worsening by more than 30%.1 Most recently a valid and reliable responder index based on the SLEDAI-2000 has been proposed and assesses 50% or more improvement in lupus disease activity.45,46 Definitions of “active disease” and minimal clinically meaningful change for the SLEDAI-2000 have also been published.47
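The provisional juvenile SLE rule quoted above (at least 50% improvement in any two of five core set measures, with no more than one of the remaining measures worsening by more than 30%) can be sketched as follows. The sketch is illustrative only. The text does not list the five core set measures, so the keys below are hypothetical placeholders, and percentage changes are assumed to be precomputed with positive values meaning improvement.

```python
# Illustrative sketch of the provisional juvenile SLE response rule.
# Assumption: inputs are percentage changes from baseline,
# positive = improvement, negative = worsening.


def jsle_responder(pct_change):
    """`pct_change` maps each of the five core set measures to its
    percentage change from baseline."""
    values = list(pct_change.values())
    if len(values) != 5:
        raise ValueError("expected exactly five core set measures")
    improved = sum(v >= 50.0 for v in values)   # improved by 50% or more
    worsened = sum(v < -30.0 for v in values)   # worsened by more than 30%
    return improved >= 2 and worsened <= 1


# Hypothetical placeholder names and values, for illustration only.
example = {"measure_1": 60.0, "measure_2": 55.0, "measure_3": 10.0,
           "measure_4": -35.0, "measure_5": 0.0}
print(jsle_responder(example))
```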
In reviewing outcome measures for clinical trials in systemic sclerosis, the OMERACT group considered that the HAQ and Scleroderma Health Assessment Questionnaire (SHAQ) were the only two patient-centered measures ready for use in clinical trials. Although the measurement of disease activity in systemic sclerosis remains challenging, the European Scleroderma Study Group has proposed preliminary activity criteria based on the summation of scores across multiple variables. In addition, estimates of minimal clinically important difference for several patient-centered outcome measures (Health Assessment Questionnaire Disability Index [HAQ-DI], fatigue, pain, sleep, patient global assessment, SF-36) have been developed based on a clinical practice setting.48
There is a paucity of measurement tools to assess disease activity and condition-specific quality of life in Behçet disease. The development and validation of the Behçet's Disease Activity Index (BDAI)1 and the Behçet's Disease Quality of Life (BD-QoL) index1 therefore represent important contributions to outcome measurement in Behçet disease.
Core set measures for Sjögren syndrome were proposed several years ago: sicca symptoms, sicca objective (oral), sicca objective (ocular), health-related quality of life, laboratory measures, and fatigue. Subsequently two indices, the Sjögren's Syndrome Disease Damage Index (SSDDI) and the Sjögren's Syndrome Disease Activity Index (SSDAI), were proposed.1 Most recently the EULAR Sjögren's Task Force has developed two new indices, one to assess symptoms (EULAR Sjögren's Syndrome Patient Reported Index [ESSPRI]), and the other to assess disease activity (EULAR Sjögren's Syndrome Disease Activity Index [ESSDAI]).49,50 Although both indices require further validation, they create novel measurement opportunities since they are based respectively on patient49 and expert50 opinion.
Core set measures for polymyalgia rheumatica proposed by the European Collaborating Polymyalgia Rheumatica Group include five measures: pain, physician global assessment, morning stiffness, C-reactive protein level, and ability to elevate the upper limbs. A recent multinational collaborative evaluation of outcome measures in polymyalgia rheumatica has concluded that patient-reported global pain, hip pain, morning stiffness, physical function (Modified Health Assessment Questionnaire [MHAQ]), mental function and an inflammatory marker are the best measures of disease activity and treatment response.51
The International Myositis Assessment and Clinical Studies group has proposed three tools for the evaluation of patients with idiopathic inflammatory myopathies. These tools are the Myositis Intention to Treat Activity Index (MITAX), Myositis Disease Activity Assessment Visual Analogue Scales (MYOACT), and Myositis Damage Index (MDI). The interrater reliability and construct validity of the MDI were confirmed in a recent evaluation.52 Preliminary consensus-based estimates of clinically important improvement in adult and juvenile idiopathic inflammatory myopathies have been proposed as follows: muscle strength and physical function, 15%+; physician and patient global assessment, 20%+; and serum levels of muscle-associated enzymes, 30%+. The Pediatric Rheumatology International Trials Organization and the ACR have proposed core set measures for juvenile dermatomyositis that include physician global assessment of disease activity, muscle strength, global disease activity measure, parent's global assessment of patient's well-being, functional ability, and health-related quality of life.
Outcome assessment in the different pathologic entities that collectively form the vasculitides is challenging. Different conditions may require different indices and multidimensional measures to capture the impact of pathologic processes affecting different organs and structures. Attempts to develop a multisystem measure include the Combined Damage Assessment (CDA) Index for vasculitides involving small- and medium-sized vessels, Vasculitis Damage Index (VDI), and Birmingham Vasculitis Activity Score/Wegener's Granulomatosis (BVAS/WG) for evaluating patients with Wegener granulomatosis. The reliability, construct, and discriminant validity of the BVAS/WG have been assessed and the index proposed for use in clinical investigation. In a recent comparative study, although the CDA and VDI were both reliable, the CDA was considered more complex and less practical than the VDI.53 Since its inception, the BVAS has undergone revision, and version 3 has been validated for use in systemic vasculitis.54,55 The OMERACT group has established core set domains for clinical trials of antineutrophil cytoplasmic antibody–associated vasculitis: namely, disease activity, damage assessment, PROMs, and mortality.56
The main focus of osteoporosis studies is bone loss and fracture prevention. Core set measures for bone loss prevention studies (bone mineral density measured at two sites, biochemical markers) and fracture prevention studies (fracture; hip, knee, and spine bone mineral density; biochemical markers; change in height) in osteoporosis have been proposed by the OMERACT group. In addition, ICF core sets have been suggested for osteoporosis and the content of osteoporosis-targeted health status measures compared using the ICF framework. There are currently several condition-specific quality-of-life measures for osteoporosis, which differ in their content, format, and method of administration.57
Outcomes in patients with low back pain can be assessed with various instruments as well as with quality-of-life measures. The OMERACT group has proposed preliminary response criteria (Therapeutic Responder Index [TRI]) for chronic low back pain, based on at least 30% improvements in pain and in patient global assessment and no worsening in function.1 A further evaluation of the TRI suggests that it is particularly sensitive to the cutoff point used for improvement in the patient global assessment component.58 MCII and PASS estimates have been explored for low back pain based on the Roland Morris Disability Questionnaire.
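Because the TRI appears sensitive to the cutoff chosen for the patient global assessment, it can help to see the rule with that cutoff exposed as a parameter. The sketch below is illustrative only and is not the published scoring procedure. It assumes that percentage changes are precomputed, that positive values mean improvement, and that "no worsening in function" means a change of zero or better.

```python
# Illustrative sketch of a TRI-style responder rule for chronic low back pain.
# Assumption: inputs are percentage changes from baseline,
# positive = improvement, negative = worsening.


def tri_responder(pain_pct, global_pct, function_pct,
                  pain_cutoff=30.0, global_cutoff=30.0):
    """True if pain and patient global assessment each improve by at least
    their cutoff and function does not worsen."""
    return (pain_pct >= pain_cutoff
            and global_pct >= global_cutoff
            and function_pct >= 0.0)


# Hypothetical patient: 40% pain improvement, 25% global improvement,
# 10% functional improvement.
print(tri_responder(40.0, 25.0, 10.0))                      # default 30% cutoffs
print(tri_responder(40.0, 25.0, 10.0, global_cutoff=20.0))  # lower global cutoff
```

The same hypothetical patient is classified differently under the two global cutoffs, which illustrates the sensitivity noted in the text.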
The OMERACT group has developed core set domains for fibromyalgia: namely, pain, tenderness, fatigue, patient global assessment, multidimensional function, and sleep disturbance.59 Various outcome measures have been used in fibromyalgia clinical trials.60 An evaluation of outcome measurement performance has suggested that existing instruments adequately measure core OMERACT domains for fibromyalgia.60 Development of a fibromyalgia responder index and disease activity score is considered both feasible and a priority.60 Recently two responder definitions have been proposed, both including 30% or more reduction in pain and 10% or more improvement in physical function.61 In addition, the reliability, validity, and responsiveness of the Combined Index of Severity of Fibromyalgia (ICAF) have been confirmed.62
For many rheumatologists in routine practice, the goal is to track efficiently the progress of individual patients, even if there is no investigational goal. In addition to the reliability, validity, and responsiveness of the measure, practitioners also value attributes such as simplicity, rapid completion, and easy scoring.63
The major emphasis in musculoskeletal clinical metrology has been on the development of instruments for clinical research rather than for routine clinical purposes. The challenge now is to apply these techniques in clinical practice, in which feasibility issues become paramount (Box 2.3).
Combinations of disease-specific, generic health-related quality-of-life and global measures are particularly useful in clinical practice. Six issues of practical importance deserve special consideration:
1. Although measures of discomfort (pain), disability (function), and patient and physician global assessment are common to all rheumatic diseases, conditions with extraarticular consequences may require additional organ-specific assessments (e.g., skin scores in scleroderma).
2. Outcome measurement in children can be challenging. It is important to select scales that are valid, reliable, responsive, age-appropriate, and comprehensible in this particular group of patients.
3. In addition to defining a priori a schedule for assessing the beneficial effects of a treatment program, establishing a schedule of monitoring for adverse events, particularly for pharmacologic interventions, is equally important. Such a schedule should be capable of detecting both clinical and nonclinical (laboratory) forms of toxicity in a timely fashion.
4. Methods for handling data are important. For paper-based applications, flowcharts are particularly useful in documenting longitudinal change in clinical status as well as for charting the nonclinical (laboratory) tolerability of antirheumatic drugs. Expanding opportunities for electronic data capture are progressively solving many of the current problems with collection, storage, and analysis of quantitative data. Electronic data capture can be achieved on a variety of platforms including desktop and laptop computers, personal digital assistants, Web-based applications, and mobile telephones. Mobile telephones in particular deserve special attention because they are ubiquitous and provide opportunity for bidirectional or even multidirectional communication between patients and their health care provider(s). It should be noted that electronic data capture using mobile (cellular) phones is acceptable to patients; produces valid, reliable, and responsive data showing high levels of correlation with paper-based counterparts; can be achieved remotely, independently, and repeatedly; is able to transmit data over long distances directly into a host server for instant archiving; and can be conducted with a variety of commonly available mobile (cellular) phones.64-66
5. The concomitant use of disease-specific patient and physician global assessments and generic health-related quality-of-life measures provides a particularly powerful combination. In inflammatory polyarthritis these measures can readily be supplemented by an articular index in the form of a simple count of tender and swollen joints.
6. The development of responder criteria, state-attainment criteria, and normative data is an important step in the transfer of measurement procedures used in research environments into routine clinical practice. They permit clinical benchmarking and thereby facilitate data interpretation. Some of the proposed benchmarks should permit more effective monitoring of patient progress and response to treatment (Fig. 2.6).

Implementation of what might be termed quantitative rheumatology may require the development of new instruments or modifications to existing instruments, measurement and data-handling processes, and analysis, interpretation, and communication procedures to meet the demands of routine clinical practice. However, there is no reason to delay using some of the shorter, user-friendly, patient self-administered questionnaires and emerging technologies currently available. The ability to quantitate the clinical condition provides patients, medical practitioners, allied health professionals, third-party payers, and litigators with much important and useful information. Furthermore, proposed definitions of response and state attainment, the elaboration of general population-based normative values, and innovations in electronic data capture all contribute to an expanding capability to collect and interpret health status information in a manner capable of informing shared decision-making and goal setting in clinical practice, and the evaluation of new interventions and therapeutic strategies in clinical research.