Presented at a workshop on evidence-based decision making organized by the Ministry of Health, Kingdom of Saudi Arabia, Riyadh, 24-26 April 2010, by Professor Omar Hasan Kasule, MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Professor of Epidemiology and Bioethics, Faculty of Medicine, King Fahd Medical College
1.0 DISCRETE / CATEGORICAL DATA SUMMARY
Discrete data is categorized in groups and can be qualitative or quantitative. It can be categorized before summarization. The main types of statistics used are measures of location such as rates, ratios, and proportions.
A rate is the number of events in a given population over a defined time period and has 3 components: a numerator, a denominator, and time. The numerator is included in the denominator. The incidence rate of disease is defined as a /{(a+b)t} where a = number of new cases, b = number free of disease at start of time interval, and t = duration of the time of observation.
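As a minimal sketch of the incidence rate formula a / {(a+b)t}, the following Python function computes the rate from its three components. The function name and the figures used are hypothetical and chosen only for illustration.

```python
def incidence_rate(new_cases, disease_free, duration):
    """Incidence rate a / ((a + b) * t), following the definition in the text.

    new_cases    -- a, number of new cases
    disease_free -- b, number free of disease at start of the interval
    duration     -- t, duration of observation (e.g. in years)
    """
    a, b, t = new_cases, disease_free, duration
    if a + b <= 0 or t <= 0:
        raise ValueError("population and duration must be positive")
    return a / ((a + b) * t)


# Hypothetical figures: 30 new cases, 970 free of disease, 2 years of observation.
rate = incidence_rate(30, 970, 2)
print(f"{rate * 1000:.1f} new cases per 1000 per year")  # 15.0 new cases per 1000 per year
```

Multiplying by 1000 is only a presentation choice; rates are often quoted per 1000 or per 100,000 so that small values are easier to read.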
A crude rate for a population assumes homogeneity and ignores subgroup differences. It is therefore unweighted and can be misleading and unrepresentative. Inference and population comparisons based on crude rates may not be valid when the populations differ in subgroup composition. Rates can be specific for age, gender, race, and cause.
Specific rates are more informative than crude rates, but it is cognitively difficult to internalize, digest, and understand so many rates or to reach conclusions from them.
An adjusted or standardized rate is a representative summary that is a weighted average of specific rates free of the deficiencies of both the crude and specific rates. Standardization eliminates the ‘confusing’ or ‘confounding’ effects due to subgroups.
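The idea of a weighted average of specific rates can be sketched with direct standardization, which is one common method (the text does not say which method is intended). The two populations, the age groups, and the standard population below are hypothetical. Both populations have identical age-specific death rates but different age structures.

```python
def crude_rate(events, population):
    """Total events divided by total population, ignoring subgroups."""
    return sum(events) / sum(population)


def directly_standardized_rate(events, population, standard_population):
    """Weighted average of stratum-specific rates.

    The weights are the shares of each stratum in a chosen standard population.
    """
    total_std = sum(standard_population)
    return sum(
        (e / p) * (w / total_std)
        for e, p, w in zip(events, population, standard_population)
    )


# Hypothetical data for two age groups: [young, old].
deaths_a, pop_a = [10, 90], [5000, 1000]   # population A is mostly young
deaths_b, pop_b = [2, 450], [1000, 5000]   # population B is mostly old
standard = [3000, 3000]                    # assumed standard population

crude_a = crude_rate(deaths_a, pop_a)
crude_b = crude_rate(deaths_b, pop_b)
std_a = directly_standardized_rate(deaths_a, pop_a, standard)
std_b = directly_standardized_rate(deaths_b, pop_b, standard)

print(f"crude:        A = {crude_a:.4f}, B = {crude_b:.4f}")
print(f"standardized: A = {std_a:.4f}, B = {std_b:.4f}")
```

The crude rates differ widely only because population B is older. After standardization the two rates are equal, which reflects the identical age-specific rates.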
A ratio is generally defined as a : b where a = number of cases of a disease and b = number without disease. Examples of ratios are: the proportional mortality ratio, the maternal mortality ratio, and the fetal death ratio. The proportional mortality ratio is the number of deaths in a year due to a specific disease divided by the total number of deaths in that year. This ratio is useful in occupational studies because it provides information on the relative importance of a specific cause of death. The maternal mortality ratio is the total number of maternal deaths divided by the total live births. The fetal death ratio is the ratio of fetal deaths to live births.
A proportion is the number of events expressed as a fraction of the total population at risk without a time dimension. The formula of a proportion is a/(a+b) and the numerator is part of the denominator. Examples of proportions are: prevalence proportion, neonatal mortality proportion, and the perinatal mortality proportion. The term prevalence rate is a common misnomer since prevalence is a proportion and not a rate. Prevalence describes a still/stationary picture of disease. Like rates, proportions can be crude, specific, and standardized.
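The difference between a ratio a : b and a proportion a/(a+b) can be seen in a short sketch. The function names and the counts are hypothetical.

```python
def ratio(a, b):
    """Ratio a : b; the numerator is NOT part of the denominator."""
    return a / b


def proportion(a, b):
    """Proportion a / (a + b); the numerator IS part of the denominator."""
    return a / (a + b)


# Hypothetical survey: 50 people with the disease, 950 without.
with_disease, without_disease = 50, 950

print(f"ratio of diseased to non-diseased: {ratio(with_disease, without_disease):.4f}")
print(f"prevalence proportion:             {proportion(with_disease, without_disease):.4f}")
```

A proportion always lies between 0 and 1, whereas a ratio can take any non-negative value.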
2.0 CONTINUOUS DATA SUMMARY 1: MEASURES OF CENTRAL TENDENCY
The arithmetic mean is the sum of the observations' values divided by the total number of observations and reflects the impact of all observations. The robust arithmetic mean is the mean of the remaining observations when a fixed percentage of the smallest and largest observations are eliminated. The mid-range is the arithmetic mean of the values of the smallest and the largest observations. The weighted arithmetic mean is used when there is a need to place extra emphasis on some values by using different weights. The indexed arithmetic mean is stated with reference to an index mean.
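The variants of the arithmetic mean described above can be sketched as follows. The helper names, the trimming fraction, and the data are hypothetical. The robust mean is implemented here as a trimmed mean that drops the same fraction from each end, which is one way of reading the definition.

```python
from statistics import mean


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the observations from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)          # observations dropped per end
    return mean(ordered[k:len(ordered) - k])


def mid_range(values):
    """Arithmetic mean of the smallest and largest observations."""
    return (min(values) + max(values)) / 2


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]      # hypothetical, one extreme value

print(mean(data))                  # 15
print(trimmed_mean(data, 0.10))    # 6.625
print(mid_range(data))             # 48.5
print(weighted_mean([70, 80], [1, 3]))  # 77.5
```

The single extreme value pulls the ordinary mean and the mid-range upward, while the trimmed mean stays close to the bulk of the data.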
The mode is the value of the most frequent observation. It is rarely used in science and its mathematical properties have not been explored. It is intuitive, easy to compute, and is the only average suitable for nominal data. It is useless for small samples because it is unstable due to sampling fluctuation. It cannot be manipulated mathematically. It is not a unique average; one data set can have more than one mode.
The median is the value of the middle observation in a series ordered by magnitude. It is intuitive and is best used for erratically spaced or heavily skewed data. The median can be computed even if the extreme values are unknown in open-ended distributions. It is less stable to sampling fluctuation than the arithmetic mean.
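A short sketch using Python's standard `statistics` module illustrates two of the points above: the median is less affected by heavy skew than the mean, and a data set can have more than one mode. The data are hypothetical, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Hypothetical, heavily skewed data with one very large value.
incomes = [20, 22, 23, 25, 27, 30, 400]

print(f"mean   = {mean(incomes):.2f}")   # pulled upward by the value 400
print(f"median = {median(incomes)}")     # 25

# Hypothetical nominal data: the mode is the only usable average.
blood_groups = ["A", "O", "O", "B", "A", "AB"]
print(multimode(blood_groups))           # ['A', 'O']
```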
3.0 CONTINUOUS DATA SUMMARY 2: MEASURES OF DISPERSION/VARIATION
3.1 MEASURES OF VARIATION BASED ON THE MEAN
Mean deviation is the arithmetic mean of the absolute differences of each observation from the mean. It is simple to compute but is rarely used because it is not intuitive and allows no further mathematical manipulation. The variance is the sum of the squared deviations of each observation from the mean divided by the sample size, n (for large samples), or by n-1 (for small samples). It can be manipulated mathematically but is not intuitive because it is in square units. The standard deviation, the commonest measure of variation, is the square root of the variance. It is intuitive and is in linear and not in square units. The standard deviation, s, describes the variation of individual observations, whereas the standard error of the mean, SE, describes the variation of sample means and is smaller than s. The relation between the standard deviation and the standard error is given by the expression SE = s/√n, where n = sample size and s is computed with the divisor n-1.
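The measures based on the mean can be computed step by step as in the sketch below. The data are hypothetical. The standard error is taken here as SE = s/√n, the usual relation when s is computed with the divisor n-1, which is what `statistics.variance` and `statistics.stdev` use.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [4, 8, 6, 5, 3, 7, 9, 6]   # hypothetical sample
n = len(data)
m = mean(data)

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - m) for x in data) / n

# Sample variance and standard deviation (divisor n - 1).
sample_variance = variance(data)
sd = stdev(data)

# Standard error of the mean, assuming SE = s / sqrt(n).
se = sd / sqrt(n)

print(f"mean = {m}, mean deviation = {mean_deviation}")
print(f"variance = {sample_variance}, SD = {sd:.3f}, SE = {se:.3f}")
```

For this sample the variance is 4 (in squared units) and the standard deviation is 2 (in the original units), which shows why the standard deviation is the easier of the two to interpret.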
For normally distributed data, the percentage of observations covered by mean +/- 1 SD is about 68%, mean +/- 2 SD is about 95%, and mean +/- 4 SD virtually 100%.
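The coverage of mean +/- k SD can be checked against the normal distribution with the error function from Python's `math` module. This is a sketch that assumes the data follow a normal distribution; the function name is hypothetical.

```python
from math import erf, sqrt


def normal_coverage(k):
    """Fraction of a normal distribution lying within mean +/- k SD."""
    return erf(k / sqrt(2))


for k in (1, 2, 3, 4):
    print(f"mean +/- {k} SD covers {100 * normal_coverage(k):.2f}%")
# mean +/- 1 SD covers 68.27%
# mean +/- 2 SD covers 95.45%
# mean +/- 3 SD covers 99.73%
# mean +/- 4 SD covers 99.99%
```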
The standardized z-score defines the distance of a value of an observation from the mean in SD units.
The coefficient of variation (CV) is the ratio of the standard deviation to the arithmetic mean usually expressed as a percentage. CV is used to compare variations among samples with different units of measurement and from different populations.
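The z-score and the coefficient of variation can be sketched as below. The function names and the measurements are hypothetical. The example compares the variation of heights in centimetres with that of weights in kilograms, which is possible because CV has no units.

```python
from statistics import mean, stdev


def z_score(x, m, sd):
    """Distance of x from the mean m, expressed in SD units."""
    return (x - m) / sd


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the arithmetic mean."""
    return 100 * stdev(values) / mean(values)


heights_cm = [160, 165, 170, 175, 180]   # hypothetical
weights_kg = [50, 60, 70, 80, 90]        # hypothetical

z = z_score(180, mean(heights_cm), stdev(heights_cm))
print(f"z-score of 180 cm: {z:.2f}")
print(f"CV of heights: {coefficient_of_variation(heights_cm):.2f}%")
print(f"CV of weights: {coefficient_of_variation(weights_kg):.2f}%")
```

In this example the weights vary far more relative to their mean than the heights do, even though the two sets are measured in different units.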
3.2 MEASURES OF VARIATION BASED ON QUANTILES
Quantiles (quartiles, deciles, and percentiles) are measures of variation based on division of a set of observations (arranged in order by size) into equal intervals and stating the value of observation at the end of the given interval. Quantiles have an intuitive appeal.
Quartiles are based on dividing observations into 4 equal intervals, deciles on 10, and percentiles on 100 intervals. The inter-quartile range, Q3 - Q1, and the semi inter-quartile range, ½ (Q3 - Q1), have the advantages of being simple, intuitive, related to the median, and less sensitive to extreme values. Quartiles have the disadvantages of being unstable for small samples and not allowing further mathematical manipulation.
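The inter-quartile range can be sketched with `statistics.quantiles` from Python's standard library (Python 3.8 or later). Several conventions exist for computing quartiles and they can give slightly different values; the default "exclusive" method is used here as an assumption, not as the method intended in the text. The data are hypothetical.

```python
from statistics import quantiles

# Hypothetical data with one extreme value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 95]

q1, q2, q3 = quantiles(data, n=4)        # three cut points: Q1, median, Q3
iqr = q3 - q1
semi_iqr = iqr / 2
full_range = max(data) - min(data)

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"inter-quartile range = {iqr}, semi inter-quartile range = {semi_iqr}")
print(f"range = {full_range}")
```

The extreme value 95 inflates the range but leaves the inter-quartile range untouched, which illustrates why quartile-based measures are less sensitive to extreme values.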
Deciles are rarely used.
Percentiles, also called centile scores, are a form of cumulative frequency and can be read off a cumulative frequency curve. They are direct and very intelligible. For normally distributed data, the 2.5th percentile corresponds approximately to mean - 2SD, the 16th percentile to mean - 1SD, the 50th percentile to mean + 0 SD, the 84th percentile to mean + 1SD, and the 97.5th percentile to mean + 2SD. The percentile rank indicates the percentage of the observations exceeded by the observation of interest. The percentile range gives the difference between the values of any two centiles.
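The percentile rank and the correspondence between percentiles and SD units can be sketched as follows. The function name and the scores are hypothetical, and the second part assumes a normal distribution, using `statistics.NormalDist` (Python 3.8 or later).

```python
from statistics import NormalDist


def percentile_rank(values, x):
    """Percentage of observations that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


scores = [55, 60, 62, 68, 70, 71, 75, 80, 88, 93]   # hypothetical
print(percentile_rank(scores, 80))                  # 70.0

# Percentile reached at mean + k SD under a standard normal distribution.
for k in (-2, -1, 0, 1, 2):
    print(f"mean {k:+d} SD -> {100 * NormalDist().cdf(k):.1f}th percentile")
# mean -2 SD -> 2.3th percentile
# mean -1 SD -> 15.9th percentile
# mean +0 SD -> 50.0th percentile
# mean +1 SD -> 84.1th percentile
# mean +2 SD -> 97.7th percentile
```

The 2.5th and 97.5th percentiles fall exactly at mean -/+ 1.96 SD, so quoting them at 2 SD is a convenient rounding.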