Chapter Thirteen Organization and presentation of data

CHAPTER CONTENTS

Introduction

The organization and presentation of nominal or ordinal data

Organization of discrete data

Graphing discrete data

Organization and presentation of interval or ratio data

Grouped frequency distributions

Graphing frequency distributions

Simple descriptive statistics

Ratios

Proportions

Percentages

Rates

Summary

Self-assessment

True or false

Multiple choice

Introduction

The summary and interpretation of data from quantitative research entail the use of statistics. We use statistics to organize and interpret research observations and measurements. A statistic is also a number that is obtained by the mathematical manipulation of the research data. Descriptive statistics describe characteristics of the researcher’s data.

In previous chapters we examined how interviews, observations and measurement are used to produce data in clinical investigations or research. It can be difficult to make sense of raw data when they consist of a large number of measurements. Before we can interpret or communicate the information provided by a research project, the raw data must be organized and presented in a clear and intelligible fashion. We do this using statistics. This chapter outlines methods used in descriptive statistics for the organization, tabulation and graphic presentation of data, and examines some simple statistics derived directly from the tabulation of the data.

The aims of this chapter are to:

1. Outline methods for organizing and representing data in the form of frequency distributions, tables or graphs.

2. Demonstrate how the measurement scale used for data collection influences the organization and presentation of the evidence.

3. Discuss the calculation and use of some simple descriptive statistics, including percentages, ratios and rates.

The organization and presentation of nominal or ordinal data

A fundamental consideration in selecting appropriate statistics is the question of whether the data are discrete or continuous. Nominal and ordinal data are necessarily discrete, so that the organization of the data involves counting the number (frequency) of cases falling into each category of measurement. Let us examine two simple examples as an illustration.

Organization of discrete data

Example 1: nominal data

We are interested in the sex of patients (nominal data) undergoing gall bladder surgery (cholecystectomy) at a public hospital over a period of one year. The raw data indicating the sex (M or F) of the patients are simply read off the patients’ records, as follows:

F, M, M, F, F, F, M, F, M, F, M, F, F, F, F, F, F, F, F, F, M, F, F, F, M, M, F, M, F, M

Grouping the above nominal data involves counting the number of cases (or measurements) falling into each category. The totals are M = 10 and F = 20. The data can be presented in tabular form. Table 13.1 shows the following conventions in tabulating data:

1. Tables must be clearly and fully labelled – both the table as a whole and the categories – so that readers can interpret unambiguously what they are observing.

2. f represents frequency of cases or measurements falling into a given category.

3. n represents the total number of cases or measurements in a sample.

4. N represents the number of cases in a population. (See Section 3 for the difference between samples and populations.)

Table 13.1 Frequency distribution of gender of patients undergoing cholecystectomy at a hospital over a period of 1 year

Gender	f
Males (M)	10
Females (F)	20
	n = 30
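
The tally behind Table 13.1 can be reproduced in a few lines. The following is a minimal sketch in Python (a language choice made here for illustration, not one used by the chapter); the variable names are hypothetical.

```python
from collections import Counter

# Raw data from Example 1: sex of each patient, read off the records.
raw = ("F, M, M, F, F, F, M, F, M, F, M, F, F, F, F, F, F, F, F, F, "
       "M, F, F, F, M, M, F, M, F, M")
sexes = [s.strip() for s in raw.split(",")]

# f = frequency of each category, n = total number of cases in the sample.
freq = Counter(sexes)
n = sum(freq.values())

print("Sex\tf")
for category in ("M", "F"):
    print(f"{category}\t{freq[category]}")
print(f"\tn = {n}")
```

Counting by program rather than by hand makes it easy to confirm that the category frequencies add up to n.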

Example 2: ordinal data

Ordinal data are presented by counting the number of cases (frequency) of each ordered rank making up the scale.

An investigator intends to evaluate the effectiveness of a new analgesic versus placebo treatment. A post-test only control group design is used: the experimental group receives the analgesic and the control group the placebo. Twenty patients are randomly assigned into each of the two groups. Pain intensity is assessed by the patients’ pain reports five hours after minor surgery, on the following scale:

5: Excruciating pain

4: Severe pain

3: Moderate pain

2: Mild pain

1: No pain

The raw data are:

Experimental group: 3, 4, 5, 3, 3, 3, 4, 2, 1, 3, 2, 1, 3, 4, 5, 2, 3, 3, 3, 3

Control group: 5, 4, 4, 4, 5, 3, 4, 3, 2, 4, 4, 2, 4, 5, 3, 4, 4, 4, 5, 5

After tallying the results, the above data can be presented as a frequency distribution, as shown in Table 13.2. This demonstrates that when the data have been tabulated, we can see the outcome of the investigation. Here, the pain reported by the experimental group is less than that of the control group.

Table 13.2 Reported pain intensity of patients following placebo and analgesic treatments

Pain intensity	Experimental group (analgesic)	Control group (placebo)
	f	f
1	2	0
2	3	2
3	10	3
4	3	10
5	2	5
	n = 20	n = 20
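
The same counting approach extends to ordinal data with two groups. The sketch below, again in Python with a hypothetical helper named `tally`, lists every rank on the scale so that ranks nobody selected still appear with a frequency of zero.

```python
from collections import Counter

# Raw pain ratings from Example 2 (1 = no pain ... 5 = excruciating pain).
experimental = [3, 4, 5, 3, 3, 3, 4, 2, 1, 3, 2, 1, 3, 4, 5, 2, 3, 3, 3, 3]
control = [5, 4, 4, 4, 5, 3, 4, 3, 2, 4, 4, 2, 4, 5, 3, 4, 4, 4, 5, 5]

def tally(scores, ranks=range(1, 6)):
    """Return the frequency of every rank, including unused ranks."""
    counts = Counter(scores)
    return {rank: counts.get(rank, 0) for rank in ranks}

exp_f = tally(experimental)
ctl_f = tally(control)

print("Pain\tAnalgesic f\tPlacebo f")
for rank in range(1, 6):
    print(f"{rank}\t{exp_f[rank]}\t\t{ctl_f[rank]}")
print(f"\tn = {len(experimental)}\t\tn = {len(control)}")
```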

Graphing discrete data

Once a frequency distribution of the raw data has been tabulated, a variety of techniques is available for the pictorial or graphical presentation of a given set of measurements. Frequency distributions of qualitative data are often plotted as bar graphs (also termed ‘column’ graphs), or shown pictorially as pie diagrams.

A bar graph involves plotting the frequency of each category and drawing a bar, the height of which represents the frequency of a given category. Figure 13.1 graphs the data given in Table 13.1.

Figure 13.1 Bar (column) graph of patients undergoing cholecystectomy.

Figure 13.1 demonstrates conventions in plotting bar graphs:

1. The y-axis, also called the ordinate, is used to plot frequencies.

2. The x-axis, also called the abscissa, is used to indicate the categories.

3. The bars do not touch each other, reflecting the discontinuity of the measurement categories.

It should be noted that care must be exercised in interpreting graphs, as the axes may be translated or compressed causing a false visual impression of the data. Make sure that you inspect the values along the axes, so that you are not misled. It is also acceptable to calculate the percentage of scores falling into each category and to display the percentages instead of frequencies.

It can be seen in Figure 13.2 that by presenting the data for the experimental and control groups on the same graph, the reader gains a visual impression of the possible effectiveness of the analgesic treatment in contrast to that of the control intervention or treatment. Nominal data can also be meaningfully presented as a pie chart, where the percentage of each category is converted into a proportional part of a circle or ‘pie’. For example, in a given hospital we have the hypothetical spending patterns shown in Table 13.3.

Figure 13.2 Bar (column) graph of patients’ pain intensity.

Table 13.3 Hypothetical spending patterns

Item	Cost ($)	Percentage of total
Wages and salaries	1 500 000	50.00
Medical supplies	500 000	16.67
Food and provisions	500 000	16.67
Administrative costs	500 000	16.67
Total	$3 000 000	100.00

Figure 13.3 represents a pie chart of the information. In constructing Figure 13.3, we converted the numbers into percentages and then into degrees (out of a total of 100% = 360°), that is, each 1% of the total is represented by 3.6° in the circle.
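
The conversion from costs to percentages and then to degrees can be sketched as follows. This is an illustrative Python fragment using the figures from Table 13.3; the function name `pie_slices` is an assumption introduced for the example.

```python
# Hypothetical spending pattern from Table 13.3 (in dollars).
costs = {
    "Wages and salaries": 1_500_000,
    "Medical supplies": 500_000,
    "Food and provisions": 500_000,
    "Administrative costs": 500_000,
}

def pie_slices(amounts):
    """Map each item to (percentage of total, angle in degrees)."""
    total = sum(amounts.values())
    return {
        item: (100 * value / total, 360 * value / total)
        for item, value in amounts.items()
    }

slices = pie_slices(costs)
for item, (pct, deg) in slices.items():
    # Each 1% of the total corresponds to 3.6 degrees of the circle.
    print(f"{item}: {pct:.2f}% -> {deg:.1f} degrees")
```

The angles should always sum to 360°, which is a quick check that no category has been left out.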

Figure 13.3 Pie diagram of a hospital’s spending pattern.

Organization and presentation of interval or ratio data

Previously we have noted that interval and ratio scales of measurement produce real numbers that can be processed according to the standard rules of arithmetic. Interval and ratio measurements typically produce continuous data (such as weight, length, time, IQ), implying that increasingly accurate values of the variable are possible to obtain, depending on the precision of the measurement process. For example, the weight of a neonate could be measured as 4, 4.2, 4.18 or 4.183 kg.

Grouped frequency distributions

When the continuous data are made up of a large number of varied measurements, it is useful to present the data as grouped frequency distributions. When drawing up a grouped frequency distribution, the following conventions should be taken into account:

1. The table of grouped frequency distributions should have no more than nine groups of values, otherwise it is too difficult to inspect. However, if too few groups are used, the meaning of the data is obscured, as varied measurements are combined into too few equivalent categories.

2. There should be equally sized class intervals, the width of which is represented by i.

3. Individual scores within a given class interval ‘lose’ their precise identity. The midpoint of each class interval is taken to represent the class interval.

Example

On admission to hospital, patients are routinely weighed. You are asked to summarize the weights of 50 male patients who were admitted in your ward over a period of time. The weights (raw data) are as follows (to nearest kg):

75, 67, 76, 71, 73, 86, 72, 77, 80, 75, 80, 96, 93, 75, 73, 83, 81, 82, 73, 92, 81, 87, 76, 84, 78, 79, 99, 100, 88, 77, 71, 76, 75, 83, 66, 79, 95, 85, 77, 87, 90, 73, 72, 68, 84, 69, 78, 77, 84, 94

The steps in constructing a grouped frequency distribution are:

1. Organize data into an ordered array, and find the frequency of each score (see Table 13.4).

2. Find the range of scores. The range is the difference between the highest and lowest score plus 1. We add 1 to include the real limits for continuous data. In this case the range is 100 − 66 + 1 = 35.

3. Decide on the width (i) of the class intervals; i can be approximated by dividing the range by the number of groups or class intervals. In this instance, if we decide on seven classes, i will be 35 ÷ 7 = 5.0. When i is a decimal, it should be rounded up to the nearest whole number: here, i = 5. As stated earlier, the number of class intervals is arbitrary and will be chosen by the researcher, depending on the properties of the data. By convention, more than nine class intervals are rarely employed.

4. The next step is to determine the lowest class interval, and then list the limits of each class interval. Clearly, the lowest class interval must include the lowest score in the distribution.

5. Then, the frequency of scores is determined from each class interval and tabulated, as in Table 13.5.

Table 13.4 Ordered array of data

Table 13.5 Grouped frequency distribution of patients’ weight in a given ward

Class interval	f
66–70	4
71–75	12
76–80	13
81–85	9
86–90	5
91–95	4
96–100	3
	n = 50

It is easier to understand the data by inspecting Table 13.5 than by looking at the raw data. However, some precision in the data has been lost as somewhat different scores have been assigned into the same class intervals.
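
The steps for constructing a grouped frequency distribution can be sketched in Python as below. The function `grouped_frequencies` is a hypothetical helper written for this example. It assumes whole-number scores, starts the lowest class at the lowest score, and rounds the class width up, as described in the steps above.

```python
import math

# Weights (kg) of the 50 male patients from the example.
weights = [
    75, 67, 76, 71, 73, 86, 72, 77, 80, 75, 80, 96, 93, 75, 73, 83, 81,
    82, 73, 92, 81, 87, 76, 84, 78, 79, 99, 100, 88, 77, 71, 76, 75, 83,
    66, 79, 95, 85, 77, 87, 90, 73, 72, 68, 84, 69, 78, 77, 84, 94,
]

def grouped_frequencies(scores, n_classes):
    """Return (class width i, list of (lower, upper, f)) for integer scores."""
    lowest, highest = min(scores), max(scores)
    score_range = highest - lowest + 1           # step 2: range, plus 1
    width = math.ceil(score_range / n_classes)   # step 3: round i up
    table = []
    lower = lowest                               # step 4: lowest class interval
    for _ in range(n_classes):
        upper = lower + width - 1
        f = sum(lower <= s <= upper for s in scores)  # step 5: count scores
        table.append((lower, upper, f))
        lower = upper + 1
    return width, table

i, table = grouped_frequencies(weights, n_classes=7)
print(f"i = {i}")
for lower, upper, f in table:
    print(f"{lower}-{upper}\t{f}")
print(f"\tn = {sum(f for _, _, f in table)}")
```

With seven classes this reproduces the class intervals and frequencies of Table 13.5. Changing `n_classes` shows how the choice of grouping trades detail for readability.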

Graphing frequency distributions

The two common types of graphs used to present frequency distributions of quantitative data are histograms and frequency polygons.

Histograms

A histogram resembles a bar graph but the bars are drawn to touch each other. The fact that the bars touch each other reflects the underlying continuity of the data. The height of the bars along the y-axis represents the frequency of each score or class interval plotted along the x-axis. With grouped data, the midpoint of each class interval becomes the midpoint of each bar, and the width of the bar corresponds to the real limits of each class interval.

For example, consider the lowest class interval 66–70 in Table 13.5. Because the data are continuous, the real lower and upper limits of the class interval are 65.5 and 70.4 (Fig. 13.4); strictly, the interval extends to just under 70.5, where the next class begins. Although all the weights are given as whole numbers of kilograms, these will in most cases be the result of rounding off by the nursing staff to the nearest whole number. Thus, someone who actually weighed 70.2 kg would have been recorded as weighing 70 kg and would fall into the 66–70 class. As Figure 13.4 shows, i = 5, and the midpoint of the class interval is 68.

Figure 13.4 Real limits of a class interval.
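
The real limits, width and midpoint of a class interval can be computed directly from its stated limits. The sketch below is a hypothetical Python helper, not a method given in the chapter. It assumes scores were rounded to the nearest whole unit and takes the upper real limit as the boundary shared with the next class (70.5 for the 66–70 class), whereas the text quotes 70.4 as the largest one-decimal value that still rounds to 70.

```python
def class_summary(lower, upper, unit=1):
    """Return (real lower limit, real upper limit, width i, midpoint).

    `unit` is the precision to which scores were rounded (1 kg here).
    """
    half = unit / 2
    real_lower = lower - half
    real_upper = upper + half
    width = real_upper - real_lower
    midpoint = (real_lower + real_upper) / 2
    return real_lower, real_upper, width, midpoint

print(class_summary(66, 70))  # (65.5, 70.5, 5.0, 68.0)
```

Under this convention the width comes out at exactly 5 and the midpoint at 68, matching the values of i and the midpoint given for Figure 13.4.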

Frequency polygons

Any data which can be represented by a histogram can also be graphed as a frequency polygon. For this type of graph, a point is plotted over the midpoint of each class interval, at a height representing the frequency of the scores. Figure 13.5 represents a histogram and a frequency polygon for the data in Table 13.5.

Figure 13.5 Combined histogram and frequency polygon for the same data (see Table 13.5).

Frequency polygons allow the reader to interpolate, that is, to estimate the frequency of values in between those actually measured or graphed. Of course, interpolation cannot be done for discrete data (for example, Fig. 13.1), as values between categories have no meaning. When a frequency polygon is plotted, it can take on a variety of shapes. The shapes which are of particular importance for frequency polygons are shown in Figure 13.6.

Figure 13.6 Shapes of frequency polygons: (A) normal distribution; (B) negative skewing; (C) positive skewing.

Figure 13.6A represents a bell-shaped, normal distribution. It is symmetrical, in the sense that one-half is the mirror image of the other. The curve indicates that most of the scores fell in the middle, with relatively few scores towards either ‘tail’. Figure 13.6B represents a negatively skewed distribution, with most of the scores being high and spreading out toward the lower end of the distribution. Figure 13.6C represents a positively skewed distribution, with most of the scores being low, but with some scores spreading out towards the upper end of the distribution.

An easy way to remember the direction of the skew is to consider the region where the ‘tail’ or portion of the graph with lower frequencies falls. For example, if the tail is located at the low score end of the x-axis, the graph is negatively skewed. The significance of the skew or a symmetrical, normal distribution will be discussed in subsequent sections.

Simple descriptive statistics

Once the data have been summarized in a frequency distribution, it is often useful to make comparisons concerning the relative frequencies of scores falling into specific categories. The following statistics are useful for understanding comparative trends in the data, and can be used for measurements on any scale: nominal, ordinal, interval or ratio. A statistic is a number resulting from a computation based upon the raw data. The calculation of statistics is essential for ‘crunching’ the raw data into single numbers that represent the characteristics of the full data set.

Ratios

Ratios are statistics which express the relative frequency of one set of frequencies, A, in relation to another, B. The formula for ratios is:

\[ \text{Ratio} = \frac{A}{B} \]

Therefore, the ratio of males to females for the data presented in Table 13.1 is:

\[ \text{Ratio (males to females)} = \frac{10}{20} = 0.5 \]

Ratios are useful in the health sciences when we are interested in the distribution of illnesses or symptoms or the categories of subjects requiring or benefiting from treatment. The ratio calculated above tells us about the relative frequency of gall bladder surgery for males and females.

Proportions

Proportions are statistics which are calculated by putting the frequency of one category over the total number of cases in the sample or the population:

\[ \text{Proportion} = \frac{f}{n} \]

Therefore, the proportion of males in the sample represented in Table 13.1 is:

\[ \text{Proportion of males} = \frac{10}{30} = 0.33 \]

Percentages

Proportions can be transformed into percentages by multiplying by 100. Of course, this is how we obtained the values of the y-axis for Figure 13.2. To illustrate, for patients scoring 5 (excruciating pain) in the control group:

\[ \text{Percentage} = \frac{5}{20} \times 100 = 25\% \]

The values of the pie chart (Fig. 13.3) also involved such calculations.
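
Ratios, proportions and percentages are all simple quotients of frequencies. A minimal Python sketch, using the frequencies from Tables 13.1 and 13.2 and hypothetical function names, might look like this:

```python
def ratio(a, b):
    """Frequency of one category relative to another."""
    return a / b

def proportion(f, n):
    """Frequency of one category relative to the whole sample."""
    return f / n

def percentage(f, n):
    """A proportion multiplied by 100."""
    return 100 * f / n

males, females = 10, 20                  # Table 13.1
n = males + females

print(ratio(males, females))             # 0.5
print(round(proportion(males, n), 2))    # 0.33
print(percentage(5, 20))                 # 25.0 (control group scoring 5)
```

The distinction between the first two functions is the denominator: a ratio compares one category with another, whereas a proportion compares one category with the total.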

Rates

When summarizing the results of epidemiological investigations it is often useful to use rates to represent the level at which a disorder is present in a given population. The two rates which are commonly used in the health sciences are:

• Incidence rates, which represent the number of new cases of a disorder reported within a time period.

• Prevalence rates, which represent the total number of cases suffering from a disorder.

Both rates are calculated from the same general equation:

\[ \text{Rate} = \frac{\text{number of cases}}{\text{population at risk}} \]

Let us illustrate the above equation by applying it to hypothetical data. Let us consider herpes, a nasty little condition associated with the virus herpes simplex that attacks various parts of the body (lips, etc.). Assume that an epidemiologist is interested in the spread of the condition in a given community.

1. Assume that all the population above the age of 15 years (N = 1 000 000) is at risk of herpes.

2. In 2005, 5000 new cases were reported.

3. In 2005, there was a total (old and new active cases) of 15 000 known cases.

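Here, substituting into the equation:

\[ \text{Incidence rate} = \frac{5000}{1\,000\,000} = 0.005 \qquad \text{Prevalence rate} = \frac{15\,000}{1\,000\,000} = 0.015 \]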

The statistic 0.005 is not seen as the best way to represent a rate.

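Often, epidemiologists select a base to make the statistic more understandable. The base represents a number for transforming the rate. The base selected depends on the magnitude of the rate; conventionally a multiple of 10, such as 1000, 10 000 or 100 000 is selected. In this instance we select 1000 as the base. Therefore, substituting into the equation, we obtain:

\[ \text{Incidence rate} = \frac{5000}{1\,000\,000} \times 1000 = 5 \text{ per } 1000 \qquad \text{Prevalence rate} = \frac{15\,000}{1\,000\,000} \times 1000 = 15 \text{ per } 1000 \]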

The above statistics can be graphed. For example, we may wish to represent pictorially the incidence of herpes simplex in the community over a period of 5 years. The (fictitious) evidence is shown in Table 13.6.

Table 13.6 Incidence of herpes

Year	Incidence of herpes (per 1000)
1982	8
1983	15
1984	17
1985	16
1986	10

The graph of the time-series for the incidence over time (Fig. 13.7) gives us a visual impression of the changing incidence rate of the problem in question.

Figure 13.7 Incidence rate of genital herpes 1982–1986.
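
The incidence and prevalence calculations in this section can be sketched in Python as follows. The function `rate` and its `base` parameter are illustrative names, and the figures are the hypothetical herpes data used above.

```python
def rate(cases, population_at_risk, base=1):
    """Cases per `base` members of the population at risk."""
    # Multiply before dividing so whole-number results stay exact.
    return cases * base / population_at_risk

population = 1_000_000   # everyone above the age of 15 is assumed at risk
new_cases = 5_000        # cases first reported during the year
all_cases = 15_000       # old and new active cases during the year

incidence = rate(new_cases, population, base=1000)
prevalence = rate(all_cases, population, base=1000)

print(f"Raw incidence rate: {rate(new_cases, population)}")
print(f"Incidence: {incidence:g} per 1000")
print(f"Prevalence: {prevalence:g} per 1000")
```

Choosing a larger base, such as 100 000, only rescales the result; it is useful when the disorder is rare and the rate per 1000 would be a small fraction.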

Summary

We outlined several techniques for organizing, tabulating and graphically presenting both discontinuous (nominal, ordinal) and continuous (interval, ratio) data. It was shown that raw data can be organized and tabulated as a frequency distribution, by counting the number of cases falling into specific categories or class intervals. Data composed of a large number of highly varied measurements were shown to be best presented by grouping the scores into class intervals.

Several techniques of graphing data were discussed: bar graphs and pie charts for discontinuous data, and histograms and frequency polygons for continuous data. The possible shapes of frequency polygons were also examined. It was shown that data grouped in frequency distributions could also be represented as ratios, proportions, percentages or rates. These statistics were obtained by computations based upon the raw data, and were shown to be useful for representing the characteristics of all the raw data. In the next chapter, we will examine further techniques of ‘crunching’ or condensing data by using appropriate descriptive statistics.

Self-assessment

Explain the meaning of the following terms:

bar graph

bell-shaped curve

continuous data

cumulative frequency

discontinuous data

frequency distribution

frequency polygon

histogram

incidence rate

negative skew

ordered array

pie diagram

positive skew

proportion

rate

ratio

real limit

True or false

1. The organization of nominal or ordinal data involves counting the number of scores falling into discrete categories.

2. Nominal and ordinal data are best graphed as histograms.

3. It is useful to organize interval or ratio data into ordered arrays before constructing frequency distributions.

4. Given a large number of varied scores, we can construct a grouped frequency distribution, usually with about seven class intervals.

5. The midpoint of a given class interval is i.

6. Frequency distributions of continuous data can be presented graphically as histograms or frequency polygons.

7. In a negatively skewed distribution most of the scores are low, with a few high scores spread along the x-axis.

8. With a bell-shaped or normal distribution, most of the scores are located at the two extreme ends of the distribution.

9. ‘Incidence rates’ are statistics which represent the number of active cases of a disorder within a specific time period.

10. For a chronic condition like arthritis, we would expect the prevalence rates to be smaller than the incidence rates.

11. A ‘base’ represents a number for transforming a rate into a more easily understandable statistic.

12. The wider i is, the more specific information is lost about the actual data.

13. A frequency polygon is appropriate for graphing continuously distributed variables.

14. The percentile rank of a score is equal to the frequency of the scores falling up to and including the score.

15. If a curve is negatively skewed, the distribution of the scores has a ‘tail’ towards the lower values of the variable.

16. The height of a bar (column) represents the frequency, rather than the value, of a variable.

17. A pie diagram is inappropriate for representing nominally scaled data.

18. Incidence rates represent the number of new cases of a disorder reported within a time period.

19. The ‘base’ for calculating incidence rates is always 10 000.

20. We cannot construct frequency tables for nominally or ordinally scaled data.

21. Ratios are the same as proportions.

22. We can only calculate proportions for nominally scaled data.

Multiple choice

1. Patients indicate their satisfaction with treatment by responding to a question with four options:

(1) very dissatisfied

(2) dissatisfied

(3) satisfied

(4) very satisfied.

This is an example of a(n) (i) scale, and the resulting frequency distribution should be plotted as a (ii):

a (i) nominal (ii) bar graph

b (i) ordinal (ii) bar graph

c (i) interval (ii) histogram

d (i) ordinal (ii) histogram

e none of the above.

2. Interval or ratio data should be graphed as a:

a histogram

b bar graph

c frequency polygon

d a and b

e a and c.

3. Your class is asked to do an exam in theoretical physics. Given that only a few students know anything about this subject, the distribution of scores would be:

a symmetrical

b negatively skewed

c positively skewed

d bell-shaped

e both a and b.

4. If a curve is symmetrical:

a most of the scores fall at the higher values of the x-axis

b most of the scores fall at the lower end of the x-axis

c most of the scores fall at the higher end of the y-axis

d most of the scores fall at the lower end of the y-axis

e if folded in half, the two sides of the curves will coincide.

5. The percentile rank of a person’s score on a test is 35. This means that:

a the person got 35% of the items correct

b the person performed better than 65% of the sample doing the test

c both a and b

d the person’s score was equal to or better than 35% of the sample doing the test.

6. You are interested in calculating the incidence rate for Huntington’s disease, which is a very rare disorder. Which of the following numbers would be your best ‘base’?

a 10

b 100

c 1000

d 10 000

7. A continuous scale of measurement is different from a discrete scale in that a continuous scale:

a is an interval scale, not a ratio scale

b never provides exact measurements

c can take an infinite number of intermediate possible values

d never uses decimal numbers

e b and c.

A researcher wished to study the effectiveness of a new treatment, A, upon the severity of migraine. A random sample of 50 subjects (n = 50) was selected from a group of migraine sufferers attending a pain clinic. Patients were randomly allocated to be treated either with A, or with a currently available biofeedback treatment, B. A pre-test/post-test experimental design was used. Level of pain was assessed using a standardized pain questionnaire, measuring pain responses on a scale from 1 to 100.

Questions 8–14 refer to the above information.

8. The independent variable in this study was:

a the new treatment A

b the type of treatment used

c the migraine

d the patients’ scores on the pain questionnaire.

9. The dependent variable in the above study was:

a the new treatment A

b the type of treatment used

c the migraine

d the patients’ scores on the pain questionnaire.

10. The scale of measurement used to assess the pain responses was:

a nominal

b ordinal

c interval

d ratio.

The data for the post-test pain scores for the two groups are summarized in the table below.

Pain ratings	Treatment A	Treatment B
31–40	1	1
41–50	2	2
51–60	3	−
61–70	10	3
71–80	4	4
81–90	3	10
91–100	2	5

Questions 11–14 refer to this table.

11. What would be an appropriate way for this information to be graphed?

a frequency polygon

b histogram

c bar graph

d either a or b.

12. What are the real limits of the lowest category?

a 31–40

b 31.5–40.4

c 30.5–40.4

d 30.5–39.5.

13. Which of the following statements is (are) true?

a The post-test scores for treatment B are skewed.

b i is equal to 9.

c 6% of treatment A post-test scores are under 50.5.

d All of the above statements are true.

14. Which of the following statements is supported by the data, assuming that the pre-test scores were equivalent for the two groups?

a Treatment A appears to be more effective than B.

b Treatment B appears to be more effective than A.

c The two treatments appear to be equivalent.

d Treatment B appears to be harmful.

e a and d.

The total number of deaths reported in a hypothetical country for a given year was 120 000. The following lists ‘deaths by cause’ as a percentage of all deaths:

Heart disease	35%
Cancer	25%
Cerebrovascular disease	15%
Trauma	10%
Respiratory illness	5%
Infections	5%
Other causes	5%

15. Data such as the above are compiled by asking hospitals, physicians, etc. to report deaths to a central agency. This type of information collection is best described as:

a a survey

b an experiment

c a mathematical model

d field research.

16. The variable ‘cause of death’ is measured on a(n):

a nominal scale

b ordinal scale

c interval scale

d ratio scale.

17. The above table should be graphed as a:

a frequency polygon

b histogram

c bar graph

d a and b.

18. The number of people who died of either heart disease or cancer was:

a 35 000

b 60 000

c 72 000

d 90 000.

19. Which of the following statements is false?

a The proportion of all deaths caused by cancer is 0.25.

b 5000 persons died of respiratory illnesses.

c Twice as many persons died of trauma as of infections.

d 18 000 persons died of cerebrovascular disease.

20. Of the people who died of trauma, the male:female ratio was 2:1. How many females died of trauma?

a 4000

b 8000

c 12 000

d Insufficient information to calculate answer.

21. Which of the following statements is true?

a A graph of continuous data enables the reader to interpolate.

b When constructing a frequency distribution for grouped scores, i should be equal to the number of class intervals.

c A histogram is like a bar graph, except that with a histogram the bars do not touch each other.

d A bell-shaped curve is an example of a skewed distribution.

22. Descriptive statistics:

a are used to make inferences about populations from small samples

b are only appropriate in non-experimental designs

c summarize data about samples or populations

d are derived from probability theory.

A researcher has collected data concerning the amount of time in seconds that it took a group of normal and a group of brain-damaged subjects to complete a standard motor task. The data are shown in the table below, arranged in a grouped frequency distribution.

Questions 23–29 refer to these data.

23. The total number of subjects used in this study was:

a 7

b 26

c 46

d 50.

24. The class interval width i is:

a 2

b 3

c 4

d 7.

Class interval (time in seconds)	Normals	Brain-damaged
12–14	1	0
15–17	2	0
18–20	5	1
21–23	10	2
24–27	4	4
28–30	3	10
31–33	1	3

25. The real limits for the class interval 15–17 are:

a 14.5–17.4

b 15 + 0.5

c 14–18

d 2.

26. The scale of measurement used for the dependent variable was:

a nominal

b ordinal

c interval

d ratio.

27. Which of the following statements is (are) true?

a The distribution of scores for the normals approximates a normal distribution.

b The scores for the brain-damaged subjects were highly skewed.

c The above investigation should not be classified as an experiment.

d All the above are true.

28. On the basis of previous evidence, the score of 27.4 is taken as demonstrating adequate motor function, while longer times for completing the task demonstrate some motor impairment. The proportion of normals who scored under 27.4 is X, while the proportion of brain-damaged subjects who scored under 27.4 is Y (insert values for X and Y).

a 4, 13

b 0.85, 0.35

c 0.80, 0.17

d 0.26, 0.20.

29. Assuming external validity, which of the following statements is supported by the evidence?

a Brain-damaged subjects take longer to complete the standard task.

b The greater the degree of brain damage, the slower the task completion.

c The task is valid for discriminating subjects with brain damage.

d a and c.

30. The selection of appropriate descriptive statistics for organizing quantitative data is most influenced by the:

a design of the research project

b scales of measurement used to collect data

c number of research participants

d expected magnitude of the differences among the groups.