Background reading by Professor Omar Hasan Kasule Sr. for the July 19-23 sessions of the course 'Essentials of Epidemiology in Public Health' at the Department of Social and Preventive Medicine, University of Malaya.
3.1 DATA STORAGE AND RETRIEVAL
3.1.1 DATA STORAGE
Data gives rise to information, which in turn gives rise to knowledge. Knowledge leads to understanding, and understanding leads to wisdom. Data may be univariate if it has only one variable, bivariate if it has two variables allowing correlation, or multivariate if it has several variables allowing more sophisticated analyses. A document is stored data in any form: paper, book, letter, message, image, e-mail, voice, and sound. Some documents are ephemeral but can still be retrieved for the brief time that they exist and are recoverable. Data is physically stored as bytes. A byte has 8 bits and can therefore represent 2^8 = 256 characters. ASCII is a character encoding that uses 128 codes (95 printable character codes and 33 control codes). ANSI is an extension of ASCII used by Microsoft. Different languages use different numbers of codes: for example, Greek uses 219 characters, Cyrillic uses 259 characters, Arabic uses 196 characters, and Chinese uses 65,536 characters. Data compression facilitates both data storage and data retrieval, because the search is carried out in a smaller space. Character, image, and sound data can all be compressed; however, compression may involve loss of some data. Data may be formatted in tables of several types of databases (relational, hierarchical, and network), or it may be unformatted, such as images, sound, or electronic monitoring in the hospital. Formatted documents are easier to retrieve. Files may be described as sequential, indexed, tree structured, or clustered. MEDLINE and PDQ are examples of medical databases. MEDLINE was established in 1971; every year 400,000 articles from 3,700 journals are added and are indexed using medical subject headings (MeSH). GRATEFUL MED is a search program used to query MEDLINE. PDQ (Physician Data Query) is a database about cancer.
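As a minimal illustration of the storage concepts above, the following Python sketch shows that one byte (8 bits) can represent 2^8 = 256 values, that ASCII codes fit in the range 0-127, and that non-ASCII characters (e.g. Chinese) need more than one byte in UTF-8. The specific characters chosen are arbitrary examples.

    # One byte has 8 bits and can represent 2**8 = 256 distinct values.
    print(2 ** 8)                       # 256

    # ASCII is a 7-bit encoding: codes 0-127.
    print(ord('A'))                     # 65, the ASCII code of 'A'
    print(chr(65))                      # 'A'

    # Non-ASCII characters need more than one byte in UTF-8.
    print('A'.encode('utf-8'))          # b'A' (1 byte)
    print('中'.encode('utf-8'))          # 3 bytes for one Chinese character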
3.1.2 DATA RETRIEVAL
Document surrogates used in data retrieval include identifiers, abstracts, extracts, reviews, indexes, and queries. Queries are short documents used to retrieve larger documents by matching, mapping, or use of Boolean logic (AND, OR, NOT). Queries may be in natural or probabilistic language. Fuzzy queries are deliberately not rigid, to increase the probability of retrieval. Other aids to data retrieval are term extraction (based on low frequency of important terms), term association (based on terms that normally occur together), lexical measures (using specialized formulas), trigger phrases (like figure, table, conclusion), synonyms (same meaning), antonyms (opposite meaning), homographs (same spelling but different meaning), and homophones (same sound but different spelling). Stemming algorithms help in retrieval by removing the ends of words, leaving only the roots. Specialized mathematical techniques are used to assess the effectiveness of data retrieval.
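The following Python sketch illustrates two of the ideas above: Boolean (AND) query matching and stemming by suffix removal. The document collection and the suffix list are invented for illustration; real systems use full stemming algorithms such as Porter's.

    # Toy document collection (illustrative only).
    documents = {
        1: "measles incidence rates in children",
        2: "measles vaccination coverage",
        3: "cancer mortality rates",
    }

    def stem(word):
        # Crude stemming: strip a few common suffixes, leaving the root.
        for suffix in ("ation", "ence", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def matches(text, query):
        # Boolean AND: every stemmed query term must appear in the text.
        terms = {stem(w) for w in text.split()}
        return all(stem(q) in terms for q in query.split())

    print([i for i, t in documents.items() if matches(t, "measles rates")])  # [1]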
3.1.3 DATA WAREHOUSING
Data warehousing is a method of extracting data from various sources and storing it as historical and integrated data for use in decision-support systems. Metadata is the term used for the definitions of the data stored in the data warehouse (i.e. data about data). A data model is a graphic representation of the data as diagrams or charts. The data model reflects the essential features of an organization. The purpose of a data model is to facilitate communication between the analyst and the user; it also helps create a logical discipline in database design.
3.1.4 DATA MINING
Data mining is the discovery part of knowledge discovery in databases (KDD), involving knowledge engineering, classification, and problem solving. KDD starts with selection, cleaning, enrichment, and coding of the data. The product of data mining is recognized patterns, which are then applied to new situations for prediction and profiling. Artificial intelligence (AI), based on machine learning, imbues computers with some creativity and decision-making capability using specific algorithms.
3.1.5 DATA REPLICATION
Data replication is a copy management service that involves copying the data and also managing the copies. It ensures that all parts of the organization have access to updated data. It is also an insurance against data loss in case of computer crashes because there will be an alternative data source. Databases must be designed and configured to facilitate replication. The replication infrastructure must be in place from the start. Care must be taken to make sure that replicated data is consistent and in synchrony with the master copy. The process of replication may inadvertently create redundancy in the system. In synchronous data replication there is no latency in data consistency. All replicas of the data are the same because of immediate updating. In asynchronous data replication the updating is not immediate and consistency is loose. Asynchronous replication is easier and cheaper.
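A minimal sketch of the difference between synchronous and asynchronous replication, using in-memory dictionaries as stand-in replicas (all names and data are illustrative):

    class ReplicatedStore:
        def __init__(self, n_replicas=2):
            self.master = {}
            self.replicas = [{} for _ in range(n_replicas)]
            self.pending = []                  # queue used by the async mode

        def write_sync(self, key, value):
            # Synchronous: every replica is updated immediately,
            # so there is no latency in data consistency.
            self.master[key] = value
            for replica in self.replicas:
                replica[key] = value

        def write_async(self, key, value):
            # Asynchronous: the master is updated now; replicas catch up
            # later, so consistency is loose in the interim.
            self.master[key] = value
            self.pending.append((key, value))

        def flush(self):
            # Propagate queued writes to all replicas.
            for key, value in self.pending:
                for replica in self.replicas:
                    replica[key] = value
            self.pending.clear()

    store = ReplicatedStore()
    store.write_async("patients", 120)
    print(store.replicas[0])                   # {} -- replicas lag behind
    store.flush()
    print(store.replicas[0])                   # {'patients': 120}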
3.2 DATA PRESENTATION AS DIAGRAMS
3.2.1 DATA GROUPING
Data grouping summarizes data but leads to loss of information due to grouping errors. The suitable number of classes is 10-20. The bigger the class interval, the bigger the grouping error. Classes should be mutually exclusive, of equal width, and cover all the data. The upper and lower class limits can be true or approximate. The approximate limits are easier to tabulate. Data can be dichotomous (2 groups), trichotomous (3 groups) or polychotomous (>3 groups).
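Grouping can be illustrated with a short Python sketch that bins observations into mutually exclusive classes of equal width (the ages and the class width of 10 are hypothetical choices):

    ages = [23, 35, 37, 41, 44, 48, 52, 55, 59, 63, 67, 72]

    width = 10
    counts = {}
    for age in ages:
        lower = (age // width) * width               # true lower class limit
        key = (lower, lower + width)
        counts[key] = counts.get(key, 0) + 1

    for (lo, hi), n in sorted(counts.items()):
        print(f"{lo}-{hi - 1}: {n}")                 # approximate limits, e.g. 20-29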
3.2.2 DATA TABULATION
Tabulation summarizes data in logical groupings for easy visual inspection. A table shows cell frequency (cell number), cell number as a percentage of the overall total (cell %), cell number as a row percentage (row %), cell number as a column percentage (column %), cumulative frequency, cumulative frequency %, relative (proportional) frequency, and relative frequency %. Ideal tables are simple, easy to read, correctly scaled, titled, labeled, and self-explanatory, with marginal and overall totals. The commonest table is the 2 x 2 contingency table. Other configurations are the 2 x k table and the r x c table.
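The cell %, row %, and column % described above can be computed directly, as in this Python sketch for a 2 x 2 table with hypothetical counts:

    table = [[20, 30],        # exposed:     diseased, not diseased
             [10, 40]]        # non-exposed: diseased, not diseased

    total = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]

    for i, row in enumerate(table):
        row_total = sum(row)
        for j, cell in enumerate(row):
            print(f"cell[{i}][{j}]={cell}: "
                  f"cell%={100 * cell / total:.1f}, "
                  f"row%={100 * cell / row_total:.1f}, "
                  f"column%={100 * cell / col_totals[j]:.1f}")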
3.2.3 DIAGRAMS SHOWING 1 VARIABLE
Diagrams present data visually. An ideal diagram is self-explanatory, simple, not crowded, of appropriate size, and emphasizes data and not graphics. The 1-way bar diagram, the stem and leaf diagram, the pie chart, and the map are diagrams showing only 1 variable. A bar diagram uses 'bars' to indicate frequency and is classified as a bar chart, a histogram, or a vertical line graph. The bar chart, with spaces between bars, and the line graph, with vertical lines instead of bars, are used for discrete, nominal, or ordinal data. The histogram, with no spaces between bars, is used for continuous data. The area of the bar, and not its height, is proportional to frequency; if the class intervals are equal, the height of the bar is also proportional to frequency. The bar diagram is intuitive for the non-specialist. The stem and leaf diagram shows actual numerical values, with the aid of a key, and not their representation as bars. It has equal class intervals and shows the shape of the distribution with easy identification of the minimum value, maximum value, and modal class. The pie chart (pie diagram) shows relative frequency % converted into angles of a circle (called sector angles). The area of each sector is proportional to the frequency. Several pie charts make a doughnut chart. Values of one variable can be indicated on a map by use of different shading, cross-hatching, dotting, and colors. A pictogram shows pictures of the variable being measured, used instead of bars.
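A stem and leaf diagram is simple enough to build by hand or in a few lines of code. The sketch below uses the tens digit as the stem and the units digit as the leaf, with a key; the data are made up for illustration:

    values = [12, 15, 21, 24, 24, 27, 33, 35, 41]

    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)

    print("Key: 2|4 means 24")
    for stem in sorted(stems):
        print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")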
3.2.4 DIAGRAMS SHOWING 2 OR MORE QUANTITATIVE VARIABLES
Two variables can be shown on line graphs, dot plots, time series plots, 2-way bar charts, box plots, scatter diagrams (scatter-grams), and pictograms. More than 2 variables can be shown on scatter plots with varying dot sizes, scatter plot matrices, multiple time series plots, stacked bar charts, divided bar charts, overlay bar charts, and multiway bar charts. Use of different colors helps clarity.
A line graph is produced when frequency is plotted against the class interval midpoint. Joining the points by straight lines produces a frequency polygon; joining them with a smoothed line produces a frequency curve. A line graph can show cumulative frequency, cumulative frequency %, moving averages, time series, trends (cyclic and non-cyclic), medians, quartiles, and percentiles. Plotting the line graph with the y-axis in logarithmic units and the x-axis in arithmetic units enables representation of a wider variation than a linear scale allows. A dot plot uses dots instead of bars. A time series plot is a graph of the value of a variable against time. Bar diagrams showing 2 or more variables (2-way, 3-way, and even 4-way bar diagrams) can be constructed using computer graphics. The scatter diagram is also called the x-y scatter or the scatter plot.
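The points of a cumulative frequency line graph (ogive) can be computed as a running total, as in this sketch with hypothetical class frequencies:

    midpoints   = [25, 35, 45, 55, 65]        # class interval midpoints
    frequencies = [ 3,  8, 12,  9,  4]

    total = sum(frequencies)
    running = 0
    for m, f in zip(midpoints, frequencies):
        running += f
        print(f"midpoint {m}: cumulative frequency = {running}, "
              f"cumulative frequency % = {100 * running / total:.1f}")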
3.2.5 SHAPES OF DISTRIBUTIONS
Bar diagrams and line graphs show the shapes of distributions. The unimodal shape is the commonest. The 2 humps of a bimodal distribution need not be equal; more than 2 peaks is unusual. A perfectly symmetrical distribution is bell-shaped and is centered on the mean. Skew to the right (+ve skew) is more common than skew to the left (-ve skew). Leptokurtosis is a narrow sharp peak; platykurtosis is a wide flat hump. The common shapes are the normal, the s-curve (ogive), the reverse J-curve (exponential), and the uniform. Diagrams can be misleading due to poor labeling, inappropriate scaling, omission of the zero origin, presence of outliers or high-leverage points, or use of a wrong model (linear vs quadratic). Widening or narrowing the scales produces different impressions of the data. Double vertical scales can be used misleadingly to show spurious associations. Omitting zero misleads unless broken lines are used to show the discontinuity.
3.3 DISCRETE DATA SUMMARY
3.3.1 DEFINITIONS
Data can be summarized using parameters, computed from populations, or statistics, computed from samples. Discrete data is categorized in groups and can be qualitative or quantitative; it must be categorized before summarization. The main statistics used are measures of location, such as rates, hazards, ratios, and proportions, and measures of spread. Measures of location indicate accuracy or validity. Measures of spread or variation, such as the variance and the range, indicate precision.
3.3.2 RATES
A rate is the number of events in a given population over a defined time period and has 3 components: a numerator, a denominator, and time. The numerator is included in the denominator. The incidence rate of disease is defined as a/{(a+b)t} where a = number of new cases, b = number free of disease at the start of the time interval, and t = duration of the time of observation. A crude rate for a population assumes homogeneity and ignores subgroup differences. It is therefore un-weighted, misleading, and unrepresentative, and inference and population comparisons based on crude rates are not valid. Rates can be specific for age, gender, race, and cause. Specific rates are more informative than crude rates, but it is cognitively difficult to internalize, digest, and understand many separate rates and still reach a conclusion. An adjusted or standardized rate is a representative summary: a weighted average of specific rates that is free of the deficiencies of both the crude and the specific rates. Standardization eliminates the 'confusing' or 'confounding' effects due to subgroups.
Standardization can be achieved by direct standardization, indirect standardization, life expectancy, or regression techniques. Direct and indirect standardization involve the same principles but use different weights. Direct standardization is used when age-specific rates are available for the study population; indirect standardization is used when they are not. Both methods use a 'standard population', which can be a combination of the two or more populations being compared, one of the comparison populations used as a standard for the others, the national population, or the world population. Life expectancy is a form of age-standardized mortality rate. Regression techniques provide a means of simultaneously adjusting for the impact of various factors on the rate.
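Direct standardization is a weighted average of the specific rates, with the standard population supplying the weights. A worked sketch with hypothetical numbers (rates per 1,000 person-years):

    specific_rates = {"young": 2.0, "middle": 5.0, "old": 20.0}
    standard_pop   = {"young": 50_000, "middle": 30_000, "old": 20_000}

    total = sum(standard_pop.values())
    adjusted = sum(
        specific_rates[group] * standard_pop[group] for group in specific_rates
    ) / total
    print(f"age-adjusted rate: {adjusted:.2f} per 1,000")    # 6.50 per 1,000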
3.3.3 HAZARDS
A hazard is defined as the number of events at time t among those who survive until time t. Hazard can also be defined as relative hazard with respect to a specific risk factor. At a specific point in time, relative hazard expresses the hazard among the exposed compared to the hazard among the non-exposed.
3.3.4 RATIOS
A ratio is generally defined as a : b, where a = the number of cases of a disease and b = the number without the disease; unlike a proportion, the numerator is not included in the denominator. Examples of ratios are the proportional mortality ratio, the maternal mortality ratio, and the fetal death ratio. The proportional mortality ratio is the number of deaths in a year due to a specific disease divided by the total number of deaths in that year. This ratio is useful in occupational studies because it provides information on the relative importance of a specific cause of death. The maternal mortality ratio is the total number of maternal deaths divided by the total live births. The fetal death ratio is the ratio of fetal deaths to live births.
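These ratios are simple quotients, as the following sketch with hypothetical counts shows:

    deaths_from_cause = 150           # deaths due to a specific disease in a year
    total_deaths = 2_000              # all deaths in that year
    pmr = deaths_from_cause / total_deaths
    print(f"proportional mortality ratio = {pmr:.3f}")            # 0.075

    maternal_deaths = 12
    live_births = 40_000
    mmr = maternal_deaths / live_births * 100_000                 # per 100,000 live births
    print(f"maternal mortality ratio = {mmr:.1f} per 100,000")    # 30.0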
3.3.5 PROPORTIONS
A proportion is the number of events expressed as a fraction of the total population at risk, without a time dimension. The formula of a proportion is a/(a+b), and the numerator is part of the denominator. Examples of proportions are the prevalence proportion, the neonatal mortality proportion, and the perinatal mortality proportion. The term prevalence rate is a common misnomer, since prevalence is a proportion and not a rate. Prevalence describes a still/stationary picture of disease. Like rates, proportions can be crude, specific, and standardized. The variance of a sample proportion can be computed as p(1-p)/n if n < 0.05N, N is infinite, or N is otherwise large in relation to n. If n >= 0.05N, that is, N is small in relation to n, the variance is computed with the finite population correction as {p(1-p)/n} x {(N-n)/(N-1)}. The pooled estimate for the variance of the difference of 2 proportions is computed as var(p1 - p0) = {p1(1-p1)/n1} + {p0(1-p0)/n0}. The pooled estimate for the variance of the sum of 2 proportions is the same: var(p1 + p0) = {p1(1-p1)/n1} + {p0(1-p0)/n0}.
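The variance formulas above translate directly into code; this sketch implements them with hypothetical inputs:

    def var_proportion(p, n):
        # Valid when N is large relative to n (n < 0.05N) or infinite.
        return p * (1 - p) / n

    def var_proportion_fpc(p, n, N):
        # With the finite population correction, for n >= 0.05N.
        return (p * (1 - p) / n) * (N - n) / (N - 1)

    def var_diff_proportions(p1, n1, p0, n0):
        # Pooled variance of the difference (or sum) of 2 proportions.
        return p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0

    print(var_proportion(0.4, 100))                  # 0.0024
    print(var_proportion_fpc(0.4, 100, 500))         # about 0.0019
    print(var_diff_proportions(0.4, 100, 0.3, 120))  # 0.00415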
3.4 CONTINUOUS DATA SUMMARY 1: MEASURES OF CENTRAL TENDENCY
3.4.1 CONCEPT OF AVERAGES
Biological phenomena vary around the average. The average represents what is normal by being the point of equilibrium. The average is a representative summary of the data using one value. Three averages are commonly used: the mean, the mode, and the median. There are 3 types of means: the arithmetic mean, the geometric mean, and the harmonic mean. The most popular is the arithmetic mean. The arithmetic mean is considered the most useful measure of central tendency in data analysis. The geometric and harmonic means are not usually used in public health. The median is gaining popularity. It is the basis of some non-parametric tests as will be discussed later. The mode has very little public health importance.
3.4.2 MEANS
The arithmetic mean is the sum of the values of the observations divided by the total number of observations and reflects the impact of all observations. The robust arithmetic mean is the mean of the remaining observations when a fixed percentage of the smallest and largest observations is eliminated. The mid-range is the arithmetic mean of the values of the smallest and the largest observations. The weighted arithmetic mean is used when there is a need to place extra emphasis on some values, by using different weights. The indexed arithmetic mean is stated with reference to an index mean; the consumer price index (CPI) is an example of an indexed mean. The arithmetic mean has 4 properties under the central limit theorem (CLT) assumptions: the sample mean is an unbiased estimator of the population mean, the mean of all sample means is the population mean, the variance of the sample means (the population variance divided by n) is smaller than the population variance, and the distribution of sample means tends to the normal as the sample size increases, regardless of the shape of the underlying population distribution.
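The CLT properties can be checked with a small simulation; in the sketch below (all numbers illustrative), sample means drawn from a markedly skewed population still center on the population mean and have a much smaller variance:

    import random
    import statistics

    random.seed(1)
    population = [random.expovariate(1.0) for _ in range(100_000)]  # skewed

    sample_means = [
        statistics.mean(random.sample(population, 50)) for _ in range(1_000)
    ]

    print(statistics.mean(population))         # ~1.0 (population mean)
    print(statistics.mean(sample_means))       # ~1.0 (mean of sample means)
    print(statistics.variance(population))     # ~1.0
    print(statistics.variance(sample_means))   # ~0.02, far narrower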
The arithmetic mean enjoys 4 desirable statistical advantages: it is the best single summary statistic, it has a rigorous mathematical definition, it allows further mathematical manipulation, and it is stable with regard to sampling error. Its disadvantage is that it is affected by extreme values; it is more sensitive to extreme values than the median or the mode. The geometric mean (GM) is defined as the nth root of the product of n observations and is less than the arithmetic mean for the same data. It is used if the observations vary by a constant proportion, such as in serological and microbiological assays, to summarize the divergent tendencies of very skewed data. It exaggerates the impact of small values while it diminishes the impact of big values. Its disadvantages are that it is cumbersome to compute and it is not intuitive. The harmonic mean (HM) is defined as the reciprocal of the arithmetic mean of the reciprocals of a series of values. It is used in economics and business and not in public health. Its computation is cumbersome and it is not intuitive.
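The three means can be compared on the same data; this sketch uses hypothetical antibody titres (note that AM > GM > HM for the same positive data):

    import math

    titres = [2, 4, 8, 16, 64]

    am = sum(titres) / len(titres)                    # arithmetic mean
    gm = math.prod(titres) ** (1 / len(titres))       # geometric mean
    hm = len(titres) / sum(1 / t for t in titres)     # harmonic mean

    print(f"AM = {am:.2f}, GM = {gm:.2f}, HM = {hm:.2f}")   # 18.80, 9.19, 5.25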
3.4.3 MODE
The mode is the value of the most frequent observation. It is rarely used in science and its mathematical properties have not been explored. It is intuitive, easy to compute, and is the only average suitable for nominal data. It is useless for small samples because it is unstable due to sampling fluctuation. It cannot be manipulated mathematically. It is not a unique average; one data set can have more than 1 mode.
3.4.4 MEDIAN
The median is the value of the middle observation in a series ordered by magnitude. It is intuitive and is best used for erratically spaced or heavily skewed data. The median can be computed even if the extreme values are unknown, as in open-ended distributions. It is less stable under sampling fluctuation than the arithmetic mean.
3.4.5 DISCUSSION
Mean = mode = median for symmetrical data. Mean > median for right-skewed data. Mean < median for left-skewed data. In general, the empirical relation mode - median = 2(median - mean) holds. The mean together with the standard deviation is best used to summarize symmetrical data. The median with the inter-quartile range is best used to summarize skewed data. For some data sets it is best to show all 3 types of average. The following rules govern mathematical operations on averages involving constants. If a constant is added to each observation, the same constant is added to the average. If a constant is subtracted from each observation, the same constant is subtracted from the average. If each observation is multiplied by a constant, the average is multiplied by the same constant. If each observation is divided by a constant, the average is divided by the same constant.
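The skewness rule and the constant-operation rules are easy to verify, as in this sketch with a hypothetical right-skewed data set:

    import statistics

    data = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]        # long right tail

    print(statistics.mean(data))                   # 4.7
    print(statistics.median(data))                 # 3.0, so mean > median
    print(statistics.mode(data))                   # 3

    shifted = [x + 10 for x in data]
    print(statistics.mean(shifted))                # 14.7, constant added to the mean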
3.5 CONTINUOUS DATA SUMMARY 2: MEASURES OF DISPERSION/VARIATION
3.5.1 INTRODUCTION
Variations are biological, measurement, or temporal. Time series analysis relates biological variation to temporal variation. Analysis of variance (ANOVA) relates biological variation (inter- or between-subject) to measurement variation (intra- or within-subject). Biological variation is more common than measurement variation. Temporal variation is measured in calendar time or in chronological time. Measures of variation can be classified as absolute (range, inter-quartile range, mean deviation, variance, standard deviation, quantiles) or relative (coefficient of variation and standardized z-score). Some measures are based on the mean (mean deviation, variance, standard deviation, z-score, t-score, stanine, and coefficient of variation), whereas others are based on quantiles (quartiles, deciles, and percentiles).
3.5.2 MEASURES OF VARIATION BASED ON THE MEAN
Mean deviation is the arithmetic mean of the absolute differences of each observation from the mean. It is simple to compute but is rarely used because it is not intuitive and allows no further mathematical manipulation. The variance is the sum of the squared deviations of each observation from the mean divided by the sample size, n (for large samples), or by n-1 (for small samples). It can be manipulated mathematically but is not intuitive because it is in squared units. The standard deviation, the commonest measure of variation, is the square root of the variance; it is intuitive and is in linear, not squared, units. The standard deviation measures variation among the observations themselves, whereas the standard error of the mean measures the precision of the sample mean; the standard error is smaller than the standard deviation. The relation between the standard deviation, s, and the standard error, SE, is given by the expression SE = s/√n, where n = sample size.
For a normal distribution, mean +/- 1 SD covers about 68% of the observations, mean +/- 2 SD about 95%, and mean +/- 3 SD about 99.7%, which is virtually all of them. The standard deviation has the following advantages: it is resistant to sampling variation, it can be manipulated mathematically, and together with the mean it fully describes a normal curve. Its disadvantage is that it is affected by extreme values. The standardized z-score defines the distance of the value of an observation from the mean in SD units. The coefficient of variation (CV) is the ratio of the standard deviation to the arithmetic mean, usually expressed as a percentage. The CV is used to compare variation among samples with different units of measurement and from different populations.
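The standard deviation, standard error, z-score, and CV can be computed together, as in this sketch with hypothetical measurements:

    import statistics

    data = [60, 62, 65, 68, 70, 75]

    mean = statistics.mean(data)
    sd = statistics.stdev(data)                 # sample SD (divisor n - 1)
    se = sd / len(data) ** 0.5                  # standard error of the mean
    z = (75 - mean) / sd                        # z-score of the value 75
    cv = 100 * sd / mean                        # coefficient of variation (%)

    print(f"mean={mean:.2f}, SD={sd:.2f}, SE={se:.2f}, z={z:.2f}, CV={cv:.1f}%")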
3.5.3 MEASURES OF VARIATION BASED ON QUANTILES
Quantiles (quartiles, deciles, and percentiles) are measures of variation based on dividing a set of observations (arranged in order by size) into equal intervals and stating the value of the observation at the end of a given interval. Quantiles have an intuitive appeal. Quartiles divide the observations into 4 equal intervals, deciles into 10, and percentiles into 100. The inter-quartile range, Q3 - Q1, and the semi inter-quartile range, ½(Q3 - Q1), have the advantages of being simple, intuitive, related to the median, and less sensitive to extreme values. Quartiles have the disadvantages of being unstable for small samples and not allowing further mathematical manipulation. Deciles are rarely used. Percentiles, also called centile scores, are a form of cumulative frequency and can be read off a cumulative frequency curve. They are direct and very intelligible. For a normal distribution, the 2.5th percentile corresponds to mean - 2SD, the 16th percentile to mean - 1SD, the 50th percentile to the mean, the 84th percentile to mean + 1SD, and the 97.5th percentile to mean + 2SD. The percentile rank indicates the percentage of the observations exceeded by the observation of interest. The percentile range gives the difference between the values of any two centiles.
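Quartiles and the inter-quartile range can be read from ordered data or computed directly, as in this sketch with hypothetical values (the exact quartile values depend on the interpolation method used):

    import statistics

    data = [3, 5, 7, 8, 9, 11, 13, 14, 18, 21, 25, 30]

    q1, q2, q3 = statistics.quantiles(data, n=4)        # quartiles
    print(f"Q1={q1}, median={q2}, Q3={q3}")
    print(f"inter-quartile range = {q3 - q1}")
    print(f"semi inter-quartile range = {(q3 - q1) / 2}")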
3.5.4 THE RANGE AND OTHER MEASURES OF VARIATION
The full range is based on the extreme values. It is defined by giving the minimum and maximum values or by giving the difference between the maximum and the minimum values. The modified range is determined after eliminating the top 10% and bottom 10% of observations. The range has several advantages: it is simple, intuitive, easy to compute, and useful for preliminary or rough work. Its disadvantages are that it is affected by extreme values, it is sensitive to sampling fluctuations, and it allows no further mathematical manipulation. The numerical rank expresses the position of an observation when the observations are arranged in order of magnitude from the best to the worst. The percentile rank indicates the percentage of the observations exceeded by the observation of interest.
3.5.5 OPERATIONS / MANIPULATIONS
Adding or subtracting a constant to each observation has no effect on the variance. Multiplying or dividing each observation by a constant multiplies or divides the variance by the square of that constant, respectively. A pooled variance can be computed as a weighted average of the respective variances of the samples involved.
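These rules can be verified numerically, as in this sketch with a hypothetical data set:

    import statistics

    data = [2, 4, 6, 8, 10]
    v = statistics.variance(data)

    print(v)                                              # 10.0
    print(statistics.variance([x + 5 for x in data]))     # 10.0, unchanged
    print(statistics.variance([x * 3 for x in data]))     # 90.0 = 3**2 * 10.0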