Chapter Sixteen: Correlation
A fundamental aim of scientific and clinical research is to establish relationships between two or more sets of observations or variables. Finding such relationships or co-variations is often an initial step for identifying causal relationships.
The topic of correlation is concerned with expressing quantitatively the degree and direction of the relationship between variables. Correlations are useful in the health sciences in areas such as determining the validity and reliability of clinical measures (Ch. 12) or in expressing how health problems are related to crucial biological, behavioural or environmental factors (Ch. 5).
Consider the following two statements:

1. Cigarette smoking is related to lung damage.
2. Being overweight is related to reduced life expectancy.
You probably have a fair idea what the above two statements mean. The first statement implies that there is evidence that if you score high on one variable (cigarette smoking) you are likely to score high on the other variable (lung damage). The second statement describes the finding that scoring high on the variable ‘overweight’ tends to be associated with low measures on the variable ‘life expectancy’. The information missing from each of the statements is the numerical value for degree or magnitude of the association between the variables.
A correlation coefficient is a statistic which expresses numerically the magnitude and direction of the association between two variables.
In order to demonstrate that two variables are correlated, we must obtain measures of both variables for the same set of subjects or events. Let us look at an example to illustrate this point.
Assume that we are interested to see whether scores for anatomy examinations are correlated with scores for physiology. To keep the example simple, assume that there were only five (n = 5) students who sat for both examinations (see Table 16.1).
| Student | Anatomy score, X (out of 10) | Physiology score, Y (out of 10) |
|---|---|---|
| 1 | 3 | 2.5 |
| 2 | 4 | 3.5 |
| 3 | 1 | 0 |
| 4 | 8 | 6 |
| 5 | 2 | 1 |
To provide a visual representation of the relationship between the two variables, we can plot the above data on a scattergram. A scattergram is a graph of the paired scores for each subject on the two variables. By convention, we call one of the variables x and the other y. It is evident from Figure 16.1 that there is a positive relationship between the two variables. That is, students who have high scores for anatomy (variable X) tend to have high scores for physiology (variable Y). Also, for this set of data we can fit a straight line that closely approximates the points on the scattergram. This line is referred to as the line of ‘best fit’; the topic is discussed in statistics under ‘linear regression’. In general, a variety of relationships is possible between two variables; the scattergrams in Figure 16.2 illustrate some of these.
Figure 16.2 Scattergrams showing relationships between two variables: (A) positive linear correlation; (B) negative linear correlation; (C) non-linear correlation.
Figure 16.2A and B represent a linear correlation between the variables x and y. That is, a straight line is the most appropriate representation of the relationship between x and y. Figure 16.2C represents a non-linear correlation, where a curve best represents the relationship between x and y.
Figure 16.2A represents a positive correlation, indicating that high scores on x are related to high scores on y. For example, the relationship between cigarette smoking and lung damage is a positive correlation. Figure 16.2B represents a negative correlation, where high scores on x are associated with low scores on y. For example, the correlation between the variables ‘being overweight’ and ‘life expectancy’ is negative, meaning that the more you are overweight, the lower your life expectancy.
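To make this concrete, a scattergram like Figure 16.1 can be produced directly from the Table 16.1 data. The following is only a sketch: the plotting library (matplotlib) and the axis labels are assumptions, not part of the text.

```python
# Minimal scattergram of the Table 16.1 examination scores (illustrative only).
import matplotlib.pyplot as plt

anatomy    = [3, 4, 1, 8, 2]      # variable X
physiology = [2.5, 3.5, 0, 6, 1]  # variable Y

plt.scatter(anatomy, physiology)
plt.xlabel("Anatomy score (X)")
plt.ylabel("Physiology score (Y)")
plt.title("Scattergram of paired examination scores")
plt.show()
```

The points rise from left to right, which is what a positive linear correlation looks like on a scattergram.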
When we need to know or express the numerical value of the correlation between x and y, we calculate a statistic called the correlation coefficient. The correlation coefficient expresses quantitatively the magnitude and direction of the correlation.
There are several types of correlation coefficients used in statistics. Table 16.2 shows some of these correlation coefficients, and the conditions under which they are used. As the table indicates, the scale of measurements used determines the selection of the appropriate correlation coefficient.
Table 16.2 Correlation coefficients

| Coefficient | Conditions where appropriate |
|---|---|
| φ (phi) | Both x and y measured on a nominal scale |
| ρ (rho) | Both x and y measured on, or transformed to, ordinal scales |
| r | Both x and y measured on an interval or ratio scale |
All of the correlation coefficients shown in Table 16.2 are appropriate for quantifying linear relationships between variables. There are other correlation coefficients, such as η (eta) which are used for quantifying non-linear relationships. However, the discussion of the use and calculation of all the correlation coefficients is beyond the scope of this text. Rather, we will examine only the commonly used Pearson’s r, and Spearman’s ρ (rho).
Regardless of which correlation coefficient we employ, these statistics share the following characteristics:

1. The value of the coefficient always lies between − 1 and + 1.
2. The sign of the coefficient (+ or −) indicates the direction of the correlation.
3. The absolute size of the coefficient indicates the magnitude of the correlation: the closer it is to 1, the stronger the linear association; a value close to 0 indicates little or no linear association.
We have already stated that Pearson’s r is the appropriate correlation coefficient when both variables x and y are measured on an interval or a ratio scale. Further assumptions in using r are that both variables x and y are normally distributed, and that we are describing a linear (rather than curvilinear) relationship.
Pearson’s r is a measure of the extent to which paired scores are correlated. To calculate r we need to represent the position of each paired score within its own distribution, so we convert each raw score to a z score. This transformation corrects for the two distributions x and y having different means and standard deviations. The formula for calculating Pearson’s r is:
$$r = \frac{\sum z_x z_y}{n}$$

where $z_x$ = standard score corresponding to any raw x score, $z_y$ = standard score corresponding to any raw y score, $\sum$ = sum of the standard score products and $n$ = number of paired measurements.
Table 16.3 gives the calculations for the correlation coefficient for the data given in the earlier examination scores example.
(The z scores are calculated as discussed in Ch. 15.)
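As a cross-check on the hand calculation, here is a minimal Python sketch of the definitional formula applied to the Table 16.1 scores (a worked calculation, not a reproduction of the book’s Table 16.3; the z scores use the population standard deviation, as the formula requires):

```python
# Pearson's r from z scores: r = sum(zx * zy) / n (illustrative calculation).
anatomy    = [3, 4, 1, 8, 2]
physiology = [2.5, 3.5, 0, 6, 1]

def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population SD
    return [(v - mean) / sd for v in values]

zx = z_scores(anatomy)
zy = z_scores(physiology)
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)
print(round(r, 2))   # approximately 0.98
```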
This is quite a high correlation, indicating that the paired z scores fall roughly on a straight line. In general, the closer the relationship between the two variables approximates a straight line, the more r approaches + 1. Note that in social and biological sciences correlations this high do not usually occur. In general, we consider anything over 0.7 to be quite high, 0.3–0.7 to be moderate and less than 0.3 to be weak. The scattergrams in Figure 16.3 illustrate this point.
When n is large, the above equation becomes inconvenient for calculating r by hand. We are not concerned here with calculating r for a large n, although appropriate computational formulae and computer programs are available. For example, Coates & Steed (2003, Chapter 5) described how to use the program ‘SPSS’ to calculate correlation coefficients. The printouts provide not only the values of the correlation coefficients but also the statistical significance of the associations.
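For illustration, the same coefficients can be obtained with general-purpose statistical libraries rather than SPSS; the sketch below assumes SciPy (scipy.stats), which also reports the significance of each coefficient:

```python
# Illustrative only: scipy.stats returns each coefficient with its p-value.
from scipy import stats

anatomy    = [3, 4, 1, 8, 2]
physiology = [2.5, 3.5, 0, 6, 1]

r, p_r = stats.pearsonr(anatomy, physiology)        # Pearson's r
rho, p_rho = stats.spearmanr(anatomy, physiology)   # Spearman's rho
print(f"r = {r:.2f} (p = {p_r:.3f}), rho = {rho:.2f} (p = {p_rho:.3f})")
```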
It was pointed out earlier that r is used when two variables are scaled on interval or ratio scales and when it is shown that they are linearly associated. In addition, the sets of scores on each variable should be approximately normally distributed.
If any of the above assumptions is violated, the correlation coefficient might be spuriously low, and other correlation coefficients should be used to represent the association between the two variables. A further problem may arise from truncation of the range of values in one or both of the variables: when the range of scores is restricted, the distributions deviate markedly from a normal shape and the calculated r is depressed.
If we measure the correlation between examination scores and IQs of a group of health science students, we might find a low correlation because, by the time students present themselves to tertiary courses, most students with low IQs are eliminated. In this way, the distribution of IQs would not be normal but rather negatively skewed. In effect, the question of appropriate sampling is also relevant to correlations, as was outlined in Chapter 3.
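A small simulation can make the effect of a truncated range concrete. Everything in this sketch is invented for illustration (the 0.6 population correlation, the cut-off of 110), but it shows the general pattern: restricting the range of IQ scores depresses the observed r.

```python
# Illustrative simulation of range restriction depressing Pearson's r.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
iq = rng.normal(100, 15, n)                           # hypothetical IQ scores
exam = 0.6 * (iq - 100) / 15 + rng.normal(0, 0.8, n)  # exam scores built to correlate ~0.6 with IQ

def pearson_r(x, y):
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return (zx * zy).mean()

print(round(pearson_r(iq, exam), 2))                      # full range: about 0.6
admitted = iq > 110                                       # only high-IQ students admitted
print(round(pearson_r(iq[admitted], exam[admitted]), 2))  # truncated range: noticeably lower
```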
When the obtained data are such that at least one of the variables x or y was measured on an ordinal scale and the other on an ordinal scale or higher, we use ρ to calculate the correlation between the two variables. The higher scale can be readily reduced to an ordinal scale.
If one or both variables were measured on a nominal scale, ρ is no longer appropriate as a statistic.
The formula for calculating ρ is:

$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where $d$ = difference between a pair of ranks and $n$ = number of pairs.
The derivation of this formula will not be discussed here. The 6 is placed in the formula as a scaling device; it ensures that the possible range of ρ is from − 1 to + 1 and thus enables ρ and r to be compared. Let us consider an example to illustrate the use of ρ.
If an investigator is interested in the correlation between socioeconomic status and severity of respiratory illness, and assuming that both variables were measured on, or transformed to, an ordinal scale, the investigator rank-orders the scores from highest to lowest on each variable (Table 16.4).
Clearly, the association among the ranks for the paired scores on the two variables becomes closer the more ρ approaches + 1. If the ranks tend to be inverse, then ρ approaches − 1.
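As an illustration of the calculation itself, the sketch below applies the formula to hypothetical ranks for five people (the actual values of Table 16.4 are not reproduced here). Because the ranks tend to be inverse, ρ comes out negative:

```python
# Spearman's rho from the formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
ses_rank     = [1, 2, 3, 4, 5]   # socioeconomic status, ranked (hypothetical)
illness_rank = [4, 5, 3, 1, 2]   # severity of respiratory illness, ranked (hypothetical)

n = len(ses_rank)
sum_d2 = sum((x - y) ** 2 for x, y in zip(ses_rank, illness_rank))
rho = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(rho)   # -0.8: the ranks tend to be inverse, so rho is negative
```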
When the correlation coefficient has been calculated it may be used to predict the value of one variable (y) given the value of the other variable (x). For instance, take a hypothetical example in which the correlation between cigarette smoking and lung damage is r = + 1. We can see from Figure 16.4 that, given any score on x, we can transform it into a z score (zx) and then, using the graph, read off the corresponding z score on y (zy). Of course, it is extremely rare for there to be a perfect (r = + 1) correlation between two variables, and the smaller the correlation coefficient, the greater the probability of making an error in prediction. For example, consider Figure 16.5, where the scattergram represents a hypothetical correlation of approximately r = 0.7. Here, for any transformed value on variable x (say zx1) there is a range of corresponding values of zy. Our best guess is zy1, the average of these values, but clearly a range of scores is possible, as shown in the figure.
Figure 16.5 Hypothetical relationship between cigarette smoking and lung damage, r = + 0.7. With a correlation of less than 1.0, the data points cluster around, rather than lie exactly on, the line, and may vary within the range of values shown for any given value of x or y (and the corresponding z scores).
That is, as the correlation coefficient approaches 0, the range of error in prediction increases. A more appropriate and precise way of making predictions is in terms of regression analysis, but this topic is not covered in this introductory book.
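Although regression itself is not covered, it may help to record the standard result that underlies Figures 16.4 and 16.5 (this formula is an addition here, not part of the text): expressed in standard scores, the best linear prediction of y from x is

$$\hat{z}_y = r\,z_x$$

so when r = + 1 the prediction is exact ($\hat{z}_y = z_x$), whereas when r = 0.7 the prediction captures only part of the picture, and the remaining scatter corresponds to the range of zy values shown in Figure 16.5.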
As you will recall, reliability refers to the extent to which measurements made with instruments, or subjective judgments, remain relatively stable on repeated administration (Ch. 12). This is called test–retest reliability, and its degree is measured by a correlation coefficient. The correlation coefficient can also be used to determine the degree of inter-observer reliability.
As an example, assume that we are interested in the inter-observer reliability for two neurologists who are assessing patients for degrees of functional disability following spinal cord damage. Table 16.5 represents a set of hypothetical results of their independent assessment of the same set of five patients.
Observer Y clearly attributes greater degrees of disability to the patients than observer X. However, as stated earlier, this need not affect the correlation coefficient. If we treat the measurement as an ordinal scale, we can see from Table 16.5 that the ranks given to the patients by the two observers correspond, so that it can be shown that ρ = + 1.
Clearly, the higher the correlation, the greater the reliability. If we had treated the measurements as representing interval or ratio data, we would have calculated Pearson’s r to represent quantitatively the reliability of the measurement.
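The point that a constant shift between observers does not affect the correlation can be checked with a small sketch. The ratings below are hypothetical (the values of Table 16.5 are not reproduced): observer Y rates every patient one point higher than observer X, yet the rank orders agree exactly, so ρ = + 1.

```python
# Hypothetical disability ratings for five patients (not the book's Table 16.5).
observer_x = [2, 3, 1, 5, 4]
observer_y = [3, 4, 2, 6, 5]   # consistently one point higher than observer X

def ranks(values):
    # rank 1 = smallest value; there are no ties in this illustration
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(observer_x), ranks(observer_y)
n = len(rx)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
print(1 - 6 * sum_d2 / (n * (n ** 2 - 1)))   # 1.0: perfect agreement in rank order
```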
Predictive validity is also expressed as a correlation coefficient. For example, say that you devise an assessment procedure to predict how much people will benefit from a rehabilitation programme. If the correlation between the assessment and a measure of the success of the rehabilitation programme is high (say 0.7 or 0.8), the assessment procedure has high predictive validity. If, however, the correlation is low (say 0.3 or 0.4), the predictive validity of the assessment is low, and it would be unwise to use the assessment as a screening procedure for the entry of patients into the rehabilitation programme.
A useful statistic is the square of the correlation coefficient (r²), which represents the proportion of variance in one variable accounted for by the other. This is called the coefficient of determination.
If, for example, the correlation between variable x (height) and variable y (weight) is r = 0.7, then the coefficient of determination is r² = 0.49, or 49%. This means that 49% of the variability of weight can be accounted for in terms of height. You might ask: what about the other 51% of the variability? This would be accounted for by other factors, for instance a tendency to eat fatty foods. The point here is that even a relatively high correlation coefficient (r = 0.7) accounts for less than 50% of the variability.
This is a difficult concept, so it might be worth remembering that ‘variability’ (see Ch. 14) refers to how scores are spread out about the mean. That is, as in the above example, some people will be heavy, some average, some light. So we can account for 49% of the total variability of weight (y) in terms of height (x) if r = 0.7. The other 51% can be explained in terms of other factors, such as somatotype. The greater the correlation coefficient, the greater the coefficient of determination, and the more the variability in y can be accounted for in terms of x.
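A quick calculation shows how sharply the shared variability falls away as r decreases; the values below are simply r² at a few illustrative correlation levels:

```python
# Coefficient of determination (r squared) at several correlation levels.
for r in (0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"r = {r:.1f} -> r^2 = {r * r:.2f} ({r * r:.0%} of variability accounted for)")
# note that even r = 0.7 accounts for only 49% of the variability
```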
In Chapter 4, we pointed out that there were at least three criteria for establishing a causal relationship. Correlation or co-variation is only one of these criteria. We have already discussed in Chapter 6 that even a high correlation between two variables does not necessarily imply a causal relationship. That is, there are a variety of plausible explanations for finding a correlation. As an example, let us take cigarette smoking (x) and lung damage (y). A high positive correlation could result from any of the following circumstances:

1. x causes y.
2. y causes x.
3. Some third variable causes changes in both x and y.
4. The association is spurious, or due to chance.
For the example above, (1) would imply that cigarette smoking causes lung damage, (2) that lung damage causes cigarette smoking, and (3) that there was a variable (e.g. stress) which caused both increased smoking and lung damage. We need further information to identify which of the competing hypotheses is most plausible.
Some associations between variables are completely spurious (4). For example, there might be a correlation between the amount of margarine consumed and the number of cases of influenza over a period of time in a community, but each of the two events might have entirely different, unrelated causes. Also, some correlation coefficients do not reach statistical significance, that is, they may be due to chance (see Chs 18 and 19).
Also, if we are using a sample to estimate the correlation in a population, we must be certain that the outcome is not due to biased sampling or sampling error. That is, the investigator needs to show that a correlation coefficient is statistically significant and not just due to random sampling error. We will look at the concept of statistical significance in the next chapter.
While demonstrating correlation does not establish causality, we can use correlations as a source for subsequent hypotheses. For instance, in this example, work on carcinogens in cigarette tars and the causes of emphysema has confirmed that it is probably true that smoking does, in fact, cause lung damage (x causes y). However, given that there is often multiple causation of health problems, option (3) cannot be ruled out.
Techniques are available which can, under favourable conditions, enable investigators to use correlations as a basis for distinguishing between competing plausible hypotheses. One such technique, called path analysis, involves postulating a probable temporal sequence of events and then establishing a correlation between them. This technique was borrowed from systems theory and has been applied in areas of clinical research, such as epidemiology.
In this chapter we outlined how the degree and direction of the association between two variables can be quantified using statistics called correlation coefficients. The use of two of these coefficients was outlined: Pearson’s r and Spearman’s ρ. Definitional formulae and simple calculations were presented, with the understanding that more complex calculations of correlation coefficients are done with computers.
Several uses for r and ρ were discussed: for prediction, for quantifying the reliability and validity of measurements and, through r², for estimating the amount of shared variability. Finally, we discussed the caution necessary in the causal interpretation of correlation coefficients. Showing a strong association between variables is an important step towards establishing causal links, but further evidence is required to decide the direction of causality.
As with other descriptive statistics, caution is necessary when correlation coefficients are calculated for a sample and then generalized to a population. The question of generalization from a sample statistic to a population parameter will be discussed in the following chapters.
Explain the meaning of the following terms:
| Test | Test–retest reliability (r) | Predictive validity (r) |
|---|---|---|
| P | 0.8 | 0.50 |
| Q | 0.9 | 0.18 |
| R | 0.5 | 0.40 |
| S | 0.2 | 0.03 |