080921L - DATA SUMMARY

Background material by Professor Omar Hasan Kasule Sr, for Year 1 Semester 1 Biomed SPSS session on 21st September 2008


1.0 DISCRETE DATA SUMMARY
1.1  Rates
A rate is the number of events in a given population over a defined time period and has 3 components: a numerator, a denominator, and time. The numerator is included in the denominator.

The incidence rate is a commonly used measure in medicine and public health. The incidence rate of disease is defined as a / {(a + b) x t}, where a = number of new cases, b = number free of disease at the start of the time interval, and t = duration of the time of observation.
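As a small illustration, the incidence rate formula can be computed directly. The counts below are made up for the example, and the function name is only a convenience, not part of any package.

```python
def incidence_rate(a, b, t):
    """Incidence rate a / ((a + b) * t), using the symbols from the text.

    a: number of new cases
    b: number free of disease at the start of the interval
    t: duration of observation (here in years)
    """
    return a / ((a + b) * t)


# Hypothetical figures: 30 new cases, 970 disease-free, followed for 2 years.
rate = incidence_rate(a=30, b=970, t=2)
print(f"{rate * 1000:.1f} new cases per 1000 persons per year")
```

Rates are usually multiplied by a convenient constant such as 1000 so that they can be quoted as whole numbers.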

A crude rate is computed for the whole population. It assumes homogeneity and ignores differences between subgroups. It is therefore unweighted and can be misleading and unrepresentative. Inference and population comparisons based on crude rates are not valid.

Specific rates take subgroup differences into consideration. They can be specific for age, gender, race, or cause. Specific rates are more informative than crude rates, but when there are many of them it is cognitively difficult to internalize, digest, and understand them all or to reach conclusions from them.

Adjusted (standardized) rates are another way of taking care of subgroup differences. An adjusted or standardized rate is a representative summary: a weighted average of the specific rates that is free of the deficiencies of both crude and specific rates.
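The difference between crude, specific, and adjusted rates is easiest to see with numbers. The sketch below uses direct standardization, one common way of forming the weighted average, with two invented towns and an invented standard population. None of the figures come from real data.

```python
# Hypothetical age-specific data: (deaths, population) for each age group.
town_a = {"young": (10, 8000), "old": (40, 2000)}
town_b = {"young": (3, 2000), "old": (140, 8000)}

# Hypothetical standard population supplying the weights.
standard = {"young": 5000, "old": 5000}


def crude_rate(groups):
    """Total deaths divided by total population, ignoring subgroups."""
    deaths = sum(d for d, _ in groups.values())
    population = sum(n for _, n in groups.values())
    return deaths / population


def specific_rates(groups):
    """One rate per subgroup."""
    return {age: d / n for age, (d, n) in groups.items()}


def adjusted_rate(groups, weights):
    """Average of the specific rates, weighted by the standard population."""
    rates = specific_rates(groups)
    total = sum(weights.values())
    return sum(rates[age] * weights[age] for age in weights) / total


for name, town in (("A", town_a), ("B", town_b)):
    print(
        f"Town {name}: crude {crude_rate(town) * 1000:.1f}, "
        f"adjusted {adjusted_rate(town, standard) * 1000:.1f} per 1000"
    )
```

In this made-up example the crude rate of town B is almost three times that of town A, only because town B has more old people. Once both towns are weighted by the same standard population, the adjusted rates are close, and town A's is slightly higher.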

1.2 Ratios
A ratio is generally defined as a : b, where a = number of cases of a disease and b = number without the disease. Examples of ratios are the proportional mortality ratio, the maternal mortality ratio, and the fetal death ratio. The proportional mortality ratio is the number of deaths in a year due to a specific disease divided by the total number of deaths in that year. The maternal mortality ratio is the total number of maternal deaths divided by the total number of live births. The fetal death ratio is the ratio of fetal deaths to live births.

1.3 Proportions
A proportion is the number of events expressed as a fraction of the total population at risk, without a time dimension. The formula for a proportion is a/(a+b), so the numerator is part of the denominator. The proportion most commonly used in medicine is the prevalence of disease. Prevalence describes a still/stationary picture of disease. Like rates, proportions can be crude, specific, or standardized.
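The ratio a : b of section 1.2 and the proportion a/(a+b) use the same two counts but divide by different things. A minimal sketch with invented counts:

```python
# Hypothetical survey: 50 people with the disease, 950 without.
a = 50   # number with the disease
b = 950  # number without the disease

ratio = a / b             # a : b, numerator not part of the denominator
prevalence = a / (a + b)  # proportion, numerator part of the denominator

print(f"ratio a:b  = {ratio:.4f}")
print(f"prevalence = {prevalence:.4f} ({prevalence:.1%})")
```

When the disease is rare the two values are close, which is why they are easily confused.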

2.0  CONTINUOUS DATA SUMMARY 1: MEASURES OF CENTRAL TENDENCY
2.1 Concept of averages
Biological phenomena vary around the average. The average represents what is normal by being the point of equilibrium. The average is a representative summary of the data using one value. Three averages are commonly used: the mean, the mode, and the median.

There are 3 types of means: the arithmetic mean, the geometric mean, and the harmonic mean. The most popular is the arithmetic mean. The arithmetic mean is considered the most useful measure of central tendency in data analysis. The geometric and harmonic means are not usually used in public health. The median is gaining popularity. It is the basis of some non-parametric tests as will be discussed later. The mode has very little public health importance.

2.2 The arithmetic mean
The arithmetic mean is the sum of the observations' values divided by the total number of observations, so it reflects the impact of every observation. It has 2 desirable statistical advantages: it is the best single summary statistic, and it has a rigorous mathematical definition. Its disadvantage is that it is affected by extreme values.

2.3 The mode
The mode is the value of the most frequent observation. It is rarely used in science. It is intuitive, easy to compute, and the only average suitable for nominal data. It is of little use for small samples because it is unstable under sampling fluctuation. It cannot be manipulated mathematically. It is not a unique average: one data set can have more than 1 mode.

2.4 The median
The median is the value of the middle observation in a series ordered by magnitude. For an even number of observations it is taken as the mean of the 2 middle values. It is intuitive and is best used for erratically spaced or heavily skewed data. The median can be computed even if the extreme values are unknown in open-ended distributions. It is less stable to sampling fluctuation than the arithmetic mean.
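The three averages can be compared on a small data set using Python's standard `statistics` module. The values are invented, with one deliberately extreme observation to show how the mean and the median react to it.

```python
import statistics

# Hypothetical observations; 40 is a deliberately extreme value.
data = [2, 3, 3, 4, 5, 6, 40]
without_extreme = data[:-1]

mean_all = statistics.mean(data)
median_all = statistics.median(data)
mode_all = statistics.mode(data)

mean_trimmed = statistics.mean(without_extreme)
median_trimmed = statistics.median(without_extreme)

# A data set can have more than one mode; multimode returns all of them.
modes = statistics.multimode([1, 1, 2, 2, 3])

print("with extreme value:    mean", mean_all, "median", median_all)
print("without extreme value: mean", round(mean_trimmed, 2), "median", median_trimmed)
print("mode:", mode_all, "| modes of [1, 1, 2, 2, 3]:", modes)
```

Dropping the single extreme value moves the mean from 9 to about 3.8, while the median only moves from 4 to 3.5.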

3.0 CONTINUOUS DATA SUMMARY 2: MEASURES OF DISPERSION / VARIATION
3.1 Measures of variation based on the mean
·        The variance is the sum of the squared deviations of each observation from the mean divided by n (for a whole population, and approximately for large samples) or by n-1 (for a sample; the difference matters most when the sample is small). It can be manipulated mathematically but is not intuitive because it is expressed in square units.

·        The standard deviation, the most commonly used measure of variation, is the square root of the variance. It is intuitive because it is in linear and not in square units. For normally distributed data, mean +/- 1 SD covers about 68% of observations, mean +/- 2 SD about 95%, and mean +/- 4 SD virtually 100%. The standard deviation has the following advantages: it is resistant to sampling variation, it can be manipulated mathematically, and together with the mean it fully describes a normal curve. Its disadvantage is that it is affected by extreme values.
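The two divisors for the variance, and the share of a normal distribution covered by the mean plus or minus a number of SDs, can be checked with the standard `statistics` module. The data are invented so that the arithmetic is easy to follow by hand.

```python
import statistics
from statistics import NormalDist

# Hypothetical observations with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var_n = statistics.pvariance(data)         # divides by n
var_n_minus_1 = statistics.variance(data)  # divides by n - 1
sd_n = statistics.pstdev(data)             # square root of var_n


def coverage(k):
    """Fraction of a normal distribution lying within mean +/- k SD."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


print("variance (n):  ", var_n)
print("variance (n-1):", round(var_n_minus_1, 3))
print("SD (n):        ", sd_n)
for k in (1, 2, 4):
    print(f"mean +/- {k} SD covers {coverage(k):.2%}")
```

The coverage figures hold only for a normal distribution; for skewed data the percentages can differ noticeably.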

3.2 Measures of variation based on quantiles
·        Quantiles (quartiles, deciles, and percentiles) are measures of variation based on dividing a set of observations (arranged in order of size) into equal intervals and stating the value of the observation at the end of a given interval. Quantiles have an intuitive appeal.

·        Quartiles are based on dividing the observations into 4 equal intervals, deciles on 10, and percentiles on 100. The inter-quartile range, Q3 - Q1, and the semi inter-quartile range, ½ (Q3 - Q1), have the advantages of being simple, intuitive, related to the median, and less sensitive to extreme values. Quartiles have the disadvantages of being unstable for small samples and not allowing further mathematical manipulation.

·        Percentiles, also called centile scores, are a form of cumulative frequency and can be read off a cumulative frequency curve. They are direct and very intelligible. For normally distributed data, the 2.5th percentile corresponds approximately to mean - 2SD, the 16th percentile to mean - 1SD, the 50th percentile to mean + 0 SD, the 84th percentile to mean + 1SD, and the 97.5th percentile to mean + 2SD. The percentile rank indicates the percentage of the observations exceeded by the observation of interest. The percentile range gives the difference between the values of any two centiles.
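The quantile-based measures above can be sketched with the standard `statistics` module. The data are invented, and `percentile_rank` is a helper written for this example. Statistical packages use several slightly different conventions for interpolating quartiles, so other software may give slightly different values for the same data; the "inclusive" method is used here only because it gives round numbers for this data set.

```python
import statistics
from statistics import NormalDist

# Hypothetical observations, with one extreme value (90).
data = [50, 52, 55, 58, 60, 63, 67, 70, 90]

# Quartiles: the 3 cut points that divide the ordered data into 4 intervals.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
semi_iqr = iqr / 2


def percentile_rank(values, x):
    """Percentage of the observations exceeded by the observation x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


rank_of_63 = percentile_rank(data, 63)

# Percentiles of a standard normal curve, expressed in SD units from the mean.
z = NormalDist()
sd_units = {p: z.inv_cdf(p / 100) for p in (2.5, 16, 50, 84, 97.5)}

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr, "| semi-IQR:", semi_iqr)
print(f"percentile rank of 63: {rank_of_63:.1f}")
for p, units in sd_units.items():
    print(f"{p}th percentile = mean {units:+.2f} SD")
```

The exact normal-curve values are close to, but not exactly, the round figures of 1 and 2 SD quoted above.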

3.3 The range
The full range is based on the extreme values. It is defined by giving the minimum and maximum values or by giving the difference between the maximum and the minimum values. The range has several advantages: it is simple, intuitive, easy to compute, and useful for preliminary or rough work. Its disadvantages are that it is affected by extreme values, it is sensitive to sampling fluctuations, and it cannot be manipulated mathematically any further.
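A short sketch of both ways of stating the range, using invented data with one extreme value to show how strongly a single observation affects it:

```python
# Hypothetical observations; 90 is an extreme value.
data = [50, 52, 55, 58, 60, 63, 67, 70, 90]


def full_range(values):
    """Return (minimum, maximum, maximum - minimum)."""
    low, high = min(values), max(values)
    return low, high, high - low


low, high, width = full_range(data)
_, _, width_without_extreme = full_range(data[:-1])

print(f"range: {low} to {high} (width {width})")
print(f"width without the extreme value: {width_without_extreme}")
```

Removing one observation halves the range in this example.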