
220111P - DESCRIPTIVE STATISTICS FOR DISCRETE DATA


Presented at the Research Methodology Winter Camp of AlMaarefa University on January 11, 2022 at 1.00pm by Omar Hasan Kasule MB ChB (MUK), MPH (Harvard), DrPH (Harvard) Professor of Epidemiology and Bioethics

 

OVERVIEW

  • Data can be summarized using parameters, computed from populations, or statistics, computed from samples.
  • Discrete data is based on counting. It is data categorized into groups.
  • The main statistics used to measure location (validity or accuracy) are rates, hazards, ratios, and proportions.
  • The main statistics used to measure spread (precision) are the variance and the range.
  • The commonest descriptive statistic of discrete data is the proportion.

 

PROPORTIONS

  • A proportion is the number of events expressed as a fraction of the total population at risk without a time dimension.
  • The formula of a proportion is a/(a+b) and the numerator is part of the denominator.
  • An example of a proportion is the prevalence of disease defined as the total number of disease cases divided by the total population.
  • The variance of a proportion is defined as p(1-p)/n where n = sample size and p = the proportion, for example the prevalence of the disease.
  • An example:
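
A minimal sketch of the proportion a/(a+b) and its variance p(1-p)/n, using invented counts. The function name `proportion_summary` and the survey figures are assumptions made for illustration only; the standard error is simply the square root of the variance.

```python
import math

def proportion_summary(cases: int, non_cases: int) -> tuple[float, float, float]:
    """Return (proportion, variance, standard error) for a/(a+b)."""
    n = cases + non_cases          # the denominator includes the numerator
    p = cases / n                  # proportion, e.g. prevalence
    variance = p * (1 - p) / n     # variance of the proportion
    return p, variance, math.sqrt(variance)

# Hypothetical survey: 20 people with the disease, 80 without.
p, var_p, se_p = proportion_summary(20, 80)
print(f"prevalence = {p:.2f}, variance = {var_p:.4f}, SE = {se_p:.2f}")
```

With these made-up counts the prevalence is 0.20 and the variance is 0.0016.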

 

OTHER DESCRIPTIVE STATISTICS FOR DISCRETE DATA

  • RATES: A rate is the number of events in a given population over a defined time period. It has 3 components: a numerator, a denominator, and time. An example is the incidence of disease, which is the number of new cases of a disease in 1 year divided by the total population.
  • RATIOS: A ratio is generally defined as a:b where a = number of cases of disease and b = number without the disease. Unlike a proportion, the numerator is not part of the denominator.
  • HAZARD: A hazard is defined as the number of events at time t among those who survive until time t.
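
The difference between a rate and a ratio can be sketched with invented numbers. The function names and all figures below are hypothetical. The rate divides by population multiplied by time, which reduces to the definition given here when the period is 1 year.

```python
def incidence_rate(new_cases: int, population: int, years: float) -> float:
    """Events per person per year: numerator, denominator, and time."""
    return new_cases / (population * years)

def ratio(a: int, b: int) -> float:
    """Ratio a:b, where the numerator is NOT part of the denominator."""
    return a / b

# Hypothetical figures: 50 new cases in a population of 10,000 over 1 year.
rate = incidence_rate(50, 10_000, 1.0)
print(f"incidence = {rate * 1000:.1f} per 1,000 per year")

# Hypothetical figures: 20 with the disease, 80 without.
print(f"ratio = {ratio(20, 80):.2f}, proportion = {20 / (20 + 80):.2f}")
```

The same 20 and 80 give a ratio of 0.25 but a proportion of 0.20, because only the proportion counts the cases in its denominator.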

 


UNIT 3.4

CONTINUOUS DATA SUMMARY 1:

MEASURES OF CENTRAL TENDENCY

 

Learning Objectives:

  • Use of averages in the data summary.
  • Definition, properties, advantages, and disadvantages of various types of averages.
  • Relations among the various averages.
  • Choice of average to use.

 

Key Words and Terms:

  • Arithmetic mean, indexed mean
  • Arithmetic mean, robust mean
  • Arithmetic mean, the midrange
  • Arithmetic mean, weighted mean
  • Mean, arithmetic mean
  • Mean, geometric mean
  • Mean, harmonic mean
  • Median
  • Mode

 

3.4.1 CONCEPT OF AVERAGES

Biological phenomena vary around the average. The average represents what is normal by being the point of equilibrium. The average is a representative summary of the data using one value. Three averages are commonly used: the mean, the mode, and the median. There are 3 types of means: the arithmetic mean, the geometric mean, and the harmonic mean. The most popular is the arithmetic mean. The arithmetic mean is considered the most useful measure of central tendency in data analysis. The geometric and harmonic means are not usually used in public health. The median is gaining popularity. It is the basis of some non-parametric tests as will be discussed later. The mode has very little public health importance.

 

3.4.2 MEANS

The arithmetic mean is the sum of the observations' values divided by the total number of observations and reflects the impact of all observations. The robust arithmetic mean is the mean of the remaining observations when a fixed percentage of the smallest and largest observations are eliminated. The mid-range is the arithmetic mean of the values of the smallest and the largest observations. The weighted arithmetic mean is used when there is a need to place extra emphasis on some values by using different weights. The indexed arithmetic mean is stated with reference to an index mean. The consumer price index (CPI) is an example of an indexed mean. The arithmetic mean has 4 properties under the central limit theorem (CLT) assumptions: the sample mean is an unbiased estimator of the population mean, the mean of all sample means is the population mean, the variance of the sample means is smaller than the population variance, and the distribution of sample means tends to the normal as the sample size increases regardless of the shape of the underlying population distribution.
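
The variants of the arithmetic mean described above can be sketched as follows. The helper names, the 10% trimming fraction, and the data are assumptions chosen for illustration; only `mean` comes from Python's standard `statistics` module.

```python
from statistics import mean

def trimmed_mean(values, fraction):
    """Robust mean: drop `fraction` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # number dropped per tail
    return mean(ordered[k:len(ordered) - k])

def midrange(values):
    """Arithmetic mean of the smallest and the largest observation."""
    return (min(values) + max(values)) / 2

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 45]   # hypothetical, one extreme value
print("arithmetic mean:", mean(data))
print("10% trimmed mean:", trimmed_mean(data, 0.10))
print("mid-range:", midrange(data))
print("weighted mean:", weighted_mean([70, 80, 90], [1, 1, 2]))
```

The single extreme value pulls the arithmetic mean and the mid-range upward, while the trimmed mean stays close to the bulk of the data.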


The arithmetic mean enjoys 4 desirable statistical advantages: best single summary statistic, rigorous mathematical definition, further mathematical manipulation, and stability with regard to sampling error. Its disadvantage is that it is affected by extreme values. It is more sensitive to extreme values than the median or the mode. The geometric mean (GM) is defined as the nth root of the product of n observations and, for positive data, is never greater than the arithmetic mean of the same data. It is used if the observations vary by a constant proportion, such as in serological and microbiological assays, to summarize divergent tendencies of very skewed data. It exaggerates the impact of small values while it diminishes the impact of big values. Its disadvantages are that it is cumbersome to compute and it is not intuitive. The harmonic mean (HM) is defined as the reciprocal of the arithmetic mean of the reciprocals of a series of values. It is used in economics and business and not in public health. Its computation is cumbersome and it is not intuitive.
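
As a brief illustration, the three means can be compared on data that change by a constant proportion. The titre values are invented; the functions are from Python's standard `statistics` module, where the harmonic mean is computed as n divided by the sum of the reciprocals.

```python
from statistics import mean, geometric_mean, harmonic_mean

titres = [10, 20, 40, 80, 160]   # hypothetical doubling dilutions

am = mean(titres)             # sum of values / n
gm = geometric_mean(titres)   # nth root of the product
hm = harmonic_mean(titres)    # n / sum of reciprocals

print(f"AM = {am:.1f}, GM = {gm:.1f}, HM = {hm:.1f}")
```

For positive data that are not all equal, the ordering is HM < GM < AM, and the geometric mean falls on the middle dilution.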

 

3.4.3 MODE

The mode is the value of the most frequent observation. It is rarely used in science and its mathematical properties have not been explored. It is intuitive, easy to compute, and is the only average suitable for nominal data. It is useless for small samples because it is unstable due to sampling fluctuation. It cannot be manipulated mathematically. It is not a unique average because one data set can have more than 1 mode.
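
A short sketch of the mode for nominal data and of a data set with more than one mode. The data are invented; `multimode` is from Python's standard `statistics` module and returns every value that ties for the highest frequency.

```python
from statistics import multimode

blood_groups = ["A", "O", "B", "O", "A", "AB", "O"]   # hypothetical nominal data
print(multimode(blood_groups))    # ['O']

scores = [1, 2, 2, 3, 4, 4, 5]                        # two values tie
print(multimode(scores))          # [2, 4]
```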

 

3.4.4 MEDIAN

The median is the value of the middle observation in a series ordered by magnitude. It is intuitive and is best used for erratically spaced or heavily skewed data. The median can be computed even if the extreme values are unknown in open-ended distributions. It is less stable to sampling fluctuation than the arithmetic mean.
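
The behaviour of the median with heavily skewed data can be sketched by adding one extreme observation to a small invented data set and comparing it with the mean.

```python
from statistics import mean, median

stay_days = [2, 3, 3, 4, 5, 6, 7]   # hypothetical lengths of hospital stay
with_outlier = stay_days + [90]     # add one very long stay

print(f"without outlier: mean = {mean(stay_days):.2f}, median = {median(stay_days)}")
print(f"with outlier:    mean = {mean(with_outlier):.2f}, median = {median(with_outlier)}")
```

The mean more than triples while the median moves only from 4 to 4.5.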

 

3.4.5 DISCUSSIONS

Mean = mode = median for symmetrical data. Mean > median for right-skewed data. Mean < median for left-skewed data. In general, for moderately skewed data with a single mode, mode - median = 2(median - mean) approximately. The mean with the standard deviation is best used to summarize symmetrical data. The median with inter-quartile ranges is best used to summarize skewed data. For some data sets, it is best to show all the 3 types of averages. The following rules govern mathematical operations on averages involving constants. If a constant is added to each observation, the same constant is added to the average. If a constant is subtracted from each observation, the same constant is subtracted from the average. If each observation is multiplied by a constant, the average is multiplied by the same constant. If each observation is divided by a constant, the average is divided by the same constant.
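
The rules for constants can be checked on a small invented right-skewed data set. The values and the constant are assumptions for illustration.

```python
from statistics import mean, median

data = [1, 2, 2, 3, 4, 6, 10]   # hypothetical, right-skewed
c = 5

shifted = [x + c for x in data]   # add a constant to each observation
scaled = [x * c for x in data]    # multiply each observation by a constant

print(mean(data) > median(data))             # right skew: mean exceeds median
print(mean(shifted) == mean(data) + c)       # the average shifts by c
print(median(scaled) == median(data) * c)    # the average scales by c
```

All three comparisons print True for these data.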

 

 

UNIT 3.5

CONTINUOUS DATA SUMMARY 2:

MEASURES OF DISPERSION/VARIATION

 

Learning Objectives:

  • Definition, properties, advantages and disadvantages of common measures of variation: variance, standard deviation, and z-score.
  • Definition and use of quartiles and percentiles.
  • Relation among percentile, standard deviation, and area under a normal curve

 

Key Words and Terms:

  • Analysis of Variance
  • Coefficient of Variation
  • Inter-quartile range
  • Mean deviation
  • Percentile range
  • Quantiles
  • Quartiles
  • Range
  • Standard deviation
  • Variance
  • Variation, biological variation
  • Variation, inter-subject variation
  • Variation, intra-subject variation
  • Variation, measurement variation
  • Variation, observer variation
  • Variation, seasonal variation
  • Variation, temporal variation
  • Z-Score / Standard Score

 

3.5.1 INTRODUCTION

Variations are biological, measurement, or temporal. Time series analysis relates biological to temporal variation. Analysis of variance (ANOVA) relates biological variation (inter- or between-subject) to measurement variation (intra- or within-subject). Biological variation is more common than measurement variation. Temporal variation is measured in calendar time or in chronological time. Measures of variation can be classified as absolute (range, inter-quartile range, mean deviation, variance, standard deviation, quantiles) or relative (coefficient of variation and standardized z-score). Some measures are based on the mean (mean deviation, the variance, the standard deviation, the z-score, the t score, the stanine, and the coefficient of variation) whereas others are based on quantiles (quartiles, deciles, and percentiles).

 

3.5.2 MEASURES OF VARIATION BASED ON THE MEAN

Mean deviation is the arithmetic mean of absolute differences of each observation from the mean. It is simple to compute but is rarely used because it is not intuitive and allows no further mathematical manipulation. The variance is the sum of the squared deviations of each observation from the mean divided by the sample size, n (for large samples), or by n-1 (for small samples). It can be manipulated mathematically but is not intuitive due to the use of square units. The standard deviation, the commonest measure of variation, is the square root of the variance. It is intuitive and is linear and not in square units. The standard deviation (SD) describes the variation of individual observations, whereas the standard error of the mean (SE) describes the variation of sample means and is smaller than the SD. The relation between the standard deviation and the standard error is given by the expression SE = SD/√n where n = sample size.
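
The mean-based measures can be sketched on a small invented sample. The data are hypothetical. The standard error is computed with the standard relation SE = SD/√n, and `pvariance` and `variance` from Python's `statistics` module use the divisors n and n-1 respectively.

```python
import math
from statistics import mean, pvariance, variance, stdev

data = [4, 8, 6, 5, 3, 7, 9, 6]   # hypothetical measurements
n = len(data)
m = mean(data)

mean_deviation = sum(abs(x - m) for x in data) / n
var_n = pvariance(data)           # sum of squared deviations / n
var_n_minus_1 = variance(data)    # sum of squared deviations / (n - 1)
sd = stdev(data)                  # square root of variance(data)
se = sd / math.sqrt(n)            # standard error of the mean

print(f"mean deviation = {mean_deviation}")
print(f"variance (n) = {var_n}, variance (n-1) = {var_n_minus_1}")
print(f"SD = {sd:.2f}, SE = {se:.2f}")
```

Here the sum of squared deviations is 28, so the two variances are 3.5 and 4, the SD is 2, and the SE is about 0.71.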

 

For normally distributed data, the percentage of observations covered by mean +/- 1 SD is about 68%, mean +/- 2 SD is about 95%, and mean +/- 4 SD is virtually 100%. The standard deviation has the following advantages: it is resistant to sampling variation, it can be manipulated mathematically, and together with the mean it fully describes a normal curve. Its disadvantage is that it is affected by extreme values. The standardized z-score defines the distance of a value of an observation from the mean in SD units. The coefficient of variation (CV) is the ratio of the standard deviation to the arithmetic mean usually expressed as a percentage. CV is used to compare variations among samples with different units of measurement and from different populations.
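
For a normal distribution, the coverage of mean +/- k SD can be checked directly, and the z-score and CV computed for a sample. This is a sketch using `NormalDist` from Python's standard `statistics` module; the height values are invented.

```python
from statistics import NormalDist, mean, stdev

# Share of a normal distribution lying within k SD of the mean.
std_normal = NormalDist(mu=0, sigma=1)
for k in (1, 2, 4):
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"mean +/- {k} SD covers {coverage:.2%}")

heights_cm = [150, 160, 165, 170, 180]   # hypothetical sample
m, sd = mean(heights_cm), stdev(heights_cm)
z = (180 - m) / sd     # distance of 180 cm from the mean, in SD units
cv = sd / m * 100      # coefficient of variation, as a percentage
print(f"z = {z:.2f}, CV = {cv:.1f}%")
```

Because the CV is unit-free, it would be the same if the heights were recorded in inches.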

 

3.5.3 MEASURES OF VARIATION BASED ON QUANTILES

Quantiles (quartiles, deciles, and percentiles) are measures of variation based on the division of a set of observations (arranged in order by size) into equal intervals and stating the value of the observation at the end of the given interval. Quantiles have an intuitive appeal. Quartiles are based on dividing observations into 4 equal intervals, deciles into 10, and percentiles into 100. The inter-quartile range, Q3 - Q1, and the semi-interquartile range, ½ (Q3 - Q1), have the advantages of being simple, intuitive, related to the median, and less sensitive to extreme values. Quartiles have the disadvantages of being unstable for small samples and not allowing further mathematical manipulation. Deciles are rarely used. Percentiles, also called centile scores, are a form of cumulative frequency and can be read off a cumulative frequency curve. They are direct and very intelligible. For normally distributed data, the percentiles correspond approximately to the following points: the 2.5th percentile to mean - 2SD, the 16th percentile to mean - 1SD, the 50th percentile to the mean, the 84th percentile to mean + 1SD, and the 97.5th percentile to mean + 2SD. The percentile rank indicates the percentage of the observations exceeded by the observation of interest. The percentile range gives the difference between the values of any two centiles.
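
A minimal sketch of quartiles and the inter-quartile range, followed by the percentile positions of mean +/- 1 SD and mean +/- 2 SD under a normal curve. The data are invented. `quantiles` from Python's `statistics` module uses its default "exclusive" method here, and other software may interpolate quartiles slightly differently.

```python
from statistics import NormalDist, quantiles

data = [3, 5, 7, 8, 9, 11, 13, 15, 18, 21, 40]   # hypothetical observations
q1, q2, q3 = quantiles(data, n=4)     # cut points for 4 equal intervals
iqr = q3 - q1                         # inter-quartile range
semi_iqr = iqr / 2                    # semi-interquartile range
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}, semi-IQR = {semi_iqr}")

# Cumulative percentage of a normal distribution below mean + z SD.
for z in (-2, -1, 0, 1, 2):
    print(f"mean {z:+d} SD -> {NormalDist().cdf(z):.1%}")
```

The extreme value 40 has no effect on the quartiles, which is why the inter-quartile range suits skewed data.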

 

3.5.4 THE RANGE AND OTHER MEASURES OF VARIATION

The full range is based on extreme values. It is defined by giving the minimum and maximum values or by giving the difference between the maximum and the minimum values. The modified range is determined after eliminating the top 10% and bottom 10% of observations. The range has several advantages: it is a simple measure, intuitive, easy to compute, and useful for preliminary or rough work. Its disadvantages are that it is affected by extreme values, it is sensitive to sampling fluctuations, and it does not allow further mathematical manipulation. The numerical rank expresses the observation's position in counting when the observations are arranged in order of magnitude from the best to the worst. The percentile rank indicates the percentage of the observations exceeded by the observation of interest.
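
The range, the modified range, and the percentile rank can be sketched as follows. The function names and the exam marks are hypothetical. The percentile rank is taken here as the percentage of observations strictly below the value of interest, which is one reading of "exceeded by".

```python
def full_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)

def modified_range(values, fraction=0.10):
    """Range after dropping `fraction` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return full_range(ordered[k:len(ordered) - k])

def percentile_rank(values, x):
    """Percentage of observations that x exceeds."""
    return 100 * sum(v < x for v in values) / len(values)

marks = [35, 48, 52, 55, 60, 61, 67, 70, 74, 98]   # hypothetical exam marks
print(full_range(marks))            # 63
print(modified_range(marks))        # 26
print(percentile_rank(marks, 70))   # 70.0
```

Dropping one observation from each end more than halves the range, showing how strongly the full range depends on the extremes.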

 

3.5.5 OPERATIONS / MANIPULATIONS

Adding a constant to, or subtracting a constant from, each observation has no effect on the variance. Multiplying or dividing each observation by a constant multiplies or divides the variance by the square of that constant, and the standard deviation by the absolute value of that constant. A pooled variance can be computed as a weighted average of the respective variances of the samples involved.
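
These operations can be verified numerically. Note that scaling each observation by a constant c scales the variance by c squared. The data are invented, and the pooled variance below uses weights of n - 1 per sample, which is one common convention and an assumption here.

```python
from statistics import variance

data = [4, 8, 6, 5, 3, 7, 9, 6]    # hypothetical measurements
c = 3

print(variance([x + c for x in data]) == variance(data))          # True
print(variance([x * c for x in data]) == c**2 * variance(data))   # True

def pooled_variance(samples):
    """Weighted average of the sample variances, with weights n_i - 1."""
    numerator = sum((len(s) - 1) * variance(s) for s in samples)
    denominator = sum(len(s) - 1 for s in samples)
    return numerator / denominator

group_a = [4, 8, 6, 5, 3, 7, 9, 6]   # hypothetical sample A
group_b = [10, 12, 14]               # hypothetical sample B
print(pooled_variance([group_a, group_b]))
```

Both samples happen to have a variance of 4, so the pooled variance is also 4.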