Introduction to Biostatistics: Part 1. Measurement scales and their summary statistics
ACP J Club. 2005 Jul-Aug;143:A8. doi:10.7326/ACPJC-2005-143-1-A08
This is the first in a series of editorials that will be appearing in ACP Journal Club. The aim of the series is to introduce readers to the basic principles of statistics to enable effective evaluation of research evidence and data presented in clinical journals. This first article provides an overview of measurement scales and their summary statistics. Subsequent notes will focus on measures of association (e.g., absolute and relative risk), measuring statistical uncertainty using confidence intervals and P values, precision and bias (including sample size and power), the evaluation of diagnostic tests (predictive values), correlation and regression (interpreting scatter plots and coefficients), meta-analysis (interpreting forest plots), and survival analysis.
In any study, observations are made on each individual. These observations vary both between and within individuals and are thus referred to as “variables.” We may summarize the data collected in a study either numerically, in the form of summary statistics, or in tabular or graphical form. Summary statistics (such as means or proportions) have the advantage of condensing the data into a few values; a table or figure, on the other hand, can present all, or most, of the data. The appropriate summary method (as well as the statistical analysis) depends on the type of variable and its measurement scale. For example, the mean should be used to summarize a numerical variable only if its distribution is approximately normal (symmetrical and bell-shaped).
The 2 main types of measurement scales are categorical and numerical (Figure 1). Categorical variables have a set of labels for category membership (e.g., diabetic and nondiabetic); numerical variables are a count (e.g., number of physician visits), a measure on a particular instrument (e.g., blood pressure), or a summary score (e.g., SF-36 score).
Tabular and graphical presentation
Tables and graphs can present a distribution simply (Figure 1). For a single categorical variable, the frequency of observations in each category can be tabulated. The graphical equivalent is a bar graph or bar chart. For a numerical variable, a histogram is the simplest way to present the data. To present the data in a table, unless the scale is very narrow, the values need to be grouped into intervals. The number of observations within each interval is presented in the frequency table, allowing the calculation of both relative frequency (the percentage of observations in each interval) and cumulative relative frequency (the percentage of observations in that interval or below it).
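As an illustration of the frequency table described above, the following minimal Python sketch groups a numerical variable into equal-width intervals and computes the frequency, relative frequency, and cumulative relative frequency for each. The function name `frequency_table`, the interval width, and the blood pressure values are assumptions invented for this example; they are not data from any study.

```python
from collections import Counter

def frequency_table(values, width, start):
    """Group values into equal-width intervals [lo, lo + width).

    Assumes every value is >= start. Returns one row per interval:
    (lower bound, upper bound, frequency, relative %, cumulative %).
    """
    # Index of the interval each observation falls into
    counts = Counter((v - start) // width for v in values)
    n = len(values)
    rows = []
    cumulative = 0
    for k in range(int(max(counts)) + 1):
        count = counts.get(k, 0)  # empty intervals are kept with frequency 0
        cumulative += count
        lo = start + k * width
        rows.append((lo, lo + width, count, 100 * count / n, 100 * cumulative / n))
    return rows

# Hypothetical systolic blood pressures (mm Hg), invented for illustration
sbp = [112, 118, 121, 125, 127, 130, 133, 138, 142, 155]
table = frequency_table(sbp, width=10, start=110)
for lo, hi, count, rel, cum in table:
    print(f"{lo}-{hi - 1}: n={count}, relative={rel:.0f}%, cumulative={cum:.0f}%")
```

The cumulative relative frequency in the last row is always 100%, which is a useful check that no observation has been left out of the table.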
Both categorical and numerical data can be summarized using summary statistics (Figure 2). Appropriate summary statistics for categorical data are the number of observations, and their proportion or percentage, in each category. Numerical data are summarized using an “average” value, such as the mean or median, together with a measure of the spread of the observations around this value, such as range or standard deviation. The mode is only rarely used. The mean and standard deviation are the most informative measures, since they use all the data in their calculation. They should, however, only be used for normally distributed numerical variables, since any skewness in the data (see Comments in Figure 2) also distorts the values of the mean and standard deviation. Nonnormally distributed variables should be summarized using the median and either the range or interquartile range.
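The effect of skewness on the mean can be seen in a small worked example. The sketch below uses Python's standard `statistics` module to compute the summary statistics named above; the function name `summarize` and the length-of-stay values are hypothetical and chosen only to produce a right-skewed distribution. Note that statistical packages differ in how they calculate quartiles, so the interquartile range shown here (the default method of `statistics.quantiles`) may not match other software exactly.

```python
import statistics

def summarize(values):
    """Return common summary statistics for a numerical variable."""
    # Quartiles: the cut points dividing the ordered data into 4 equal parts
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
        "range": (min(values), max(values)),
        "iqr": (q1, q3),                     # interquartile range
    }

# Hypothetical lengths of hospital stay (days), with one very long stay
stay = [2, 3, 3, 4, 4, 5, 5, 6, 7, 41]
s = summarize(stay)
for name, value in s.items():
    print(name, value)
```

In this invented sample the single 41-day stay pulls the mean up to 8 days, although 9 of the 10 patients stayed 7 days or fewer; the median of 4.5 days is a better description of a typical stay.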
Stuart Carney, MB, ChB, MPH, MRCPsych
Helen Doll, BSc, Dip App Stats, MSc
Department of Psychiatry
Department of Public Health
University of Oxford
Oxford, England, UK