
1007P- 2.0 MATHEMATICAL FOUNDATIONS OF BIOSTATISTICS


Background reading by Professor Omar Hasan Kasule Sr. for the July 19-23 sessions of the course 'Essentials of Epidemiology in Public Health' at the Department of Social and Preventive Medicine, University of Malaya.

2.1 PROBABILITY
2.1.1 DEFINITIONS
Probability models chance (random) events and measures the likelihood of their occurrence. The concept of pure chance is not absolutely true from a tauhidi perspective. All events are pre-determined by the Creator. Humans use chance or probability estimates because of limited knowledge. What appears random or due to chance to humans has an underlying deterministic order known only to the Creator. The consistency of probabilities and predictions is based on sunan al llaah. Probability is commonly defined as the relative frequency of an event on repeated trials under the same conditions. Each possible outcome is called a sample point. The set of all possible outcomes is called the probability space, S. If the probability space consists of a finite number of equally likely outcomes, the probability of event A is defined as Pr(A) = n(A) / n(S), where n(A) = the number of outcomes of type A and n(S) = the total number of outcomes in the probability space. Special mathematical techniques called arrangements, permutations, and combinations enable us to calculate the size of the probability space theoretically without having to carry out the trials.
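
A minimal sketch of the Pr(A) = n(A) / n(S) definition; the two-dice sample space used here is an illustrative assumption, not from the text:

```python
# Pr(A) = n(A) / n(S) for a finite, equally likely probability space:
# the 36 outcomes of throwing two fair dice.
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))   # S: all 36 (die1, die2) outcomes
event = [s for s in space if sum(s) == 7]      # A: outcomes whose sum is 7

pr_a = Fraction(len(event), len(space))        # Pr(A) = n(A) / n(S)
print(pr_a)                                    # 1/6
```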

2.1.2 CLASSIFICATION OF PROBABILITY
Probability can be subjective (based on personal feelings or intuition) or objective (based on real data or experience). Objective probability can be measured or computed. Prior probability is knowable or calculable without experimentation. Posterior probability is calculable from the results of experimentation. Bayesian probability combines a prior probability (objective, subjective, or a belief) with new data (from experimentation) to reach a conclusion called the posterior probability. Bayesian probability is a good representation of how conclusions are drawn from empirical observation in real life. Conditional probability is employed when there is partial information or when we want to make probability computations easier by assuming conditionality. In conditional probability, the probability of an event depends on the occurrence of a previous event.
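
As a hedged numerical illustration of Bayesian updating, the sketch below combines an assumed prior (a disease prevalence of 1%) with assumed test characteristics to give the posterior probability of disease after a positive test; every number is invented for demonstration:

```python
# Bayesian updating: prior + new data -> posterior (illustrative numbers).
prior = 0.01          # prior probability of disease (assumed prevalence)
sensitivity = 0.95    # Pr(test positive | disease), assumed
specificity = 0.90    # Pr(test negative | no disease), assumed

# Total probability of a positive test result
pr_pos = sensitivity * prior + (1 - specificity) * (1 - prior)

# Posterior probability Pr(disease | test positive) by Bayes' theorem
posterior = sensitivity * prior / pr_pos
print(round(posterior, 3))   # about 0.088
```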

2.1.3 TYPES OF EVENTS
On the scale of exclusion, events are classified as mutually exclusive or non-mutually exclusive. Mutually exclusive events are those that cannot occur together, like being dead and being alive. Not all mutually exclusive events are equally likely. On the scale of independence, events are classified as independent or dependent. Under independence, the occurrence of one event is not affected by the occurrence or non-occurrence of another. Independent events can occur at the same instant or in succession. Some independent events are equally likely while others are not. On the scale of exhaustion, two events A and B are said to be exhaustive if between them they occupy the whole probability space, i.e. A ∪ B = S and Pr(A ∪ B) = 1.

Confusion often arises between mutually exclusive and independent events. Mutually exclusive events cannot both occur at the same time, i.e. Pr(A ∩ B) = 0. Mutually exclusive events cannot be independent of one another because the occurrence of one prevents the other from occurring. Independent events can both occur at the same time, but the occurrence of one is not affected by the occurrence of the other, i.e. Pr(A ∩ B) = Pr(A) × Pr(B).
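
The distinction can be verified numerically; the simulation below uses two fair dice as an assumed illustration:

```python
# For two fair dice, A = "first die shows 6" and B = "second die shows 6"
# are independent; "sum is 2" and "sum is 12" are mutually exclusive.
import random

random.seed(1)
n = 100_000
rolls = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(n)]

pr_a = sum(d1 == 6 for d1, _ in rolls) / n                 # Pr(A)
pr_b = sum(d2 == 6 for _, d2 in rolls) / n                 # Pr(B)
pr_ab = sum(d1 == 6 and d2 == 6 for d1, d2 in rolls) / n   # Pr(A and B)

print(pr_ab, pr_a * pr_b)   # close: Pr(A and B) = Pr(A) x Pr(B)
# "sum is 2" and "sum is 12" can never occur on the same roll: Pr = 0.
```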



2.1.4 LAWS OF PROBABILITY and MATHEMATICAL PROPERTIES
The total probability space is equal to 1.0, stated mathematically as Pr(S) = 1.0. If the probability of occurrence of an event is p, the probability of its non-occurrence is 1 − p; that is, p + q = 1, where p is the probability of occurrence of an event and q is the probability of its non-occurrence. Note that certainty has a probability of 1. This can be restated as Pr(Ā) = 1 − Pr(A) or as Pr(A) + Pr(Ā) = 1. If the sample space has equally likely outcomes, then Pr(A) = n(A) / n(S). The additive law, also called the 'OR' rule, refers to the occurrence of either or both of two events and is stated as Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B), where Pr(A ∩ B) = 0 for mutually exclusive events. The multiplicative law for independent events, also called the 'AND' rule, refers to the joint occurrence of the events and is stated as Pr(A ∩ B) = Pr(A) × Pr(B). The range of probability is 0.0 to 1.0; it cannot be negative. Pr = 0.0 means the event is impossible; Pr = 1.0 means the event is absolutely certain. The odds of an event A are defined as Pr(A) / {1 − Pr(A)}.
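
A minimal sketch of these laws for a single fair die, with illustrative events A = "even" and B = "greater than 4" assumed for demonstration:

```python
# Complement rule, additive ('OR') rule, and odds for a single fair die.
from fractions import Fraction

S = set(range(1, 7))               # probability space of one die
A = {2, 4, 6}                      # A: the throw is even
B = {5, 6}                         # B: the throw is greater than 4

pr = lambda E: Fraction(len(E), len(S))

# Complement rule: Pr(not A) = 1 - Pr(A)
assert pr(S - A) == 1 - pr(A)

# Additive rule: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)

# Odds of A: Pr(A) / (1 - Pr(A))
print(pr(A) / (1 - pr(A)))         # 1, i.e. even odds
```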

2.1.5 USES OF PROBABILITY
Probability is used in classical statistical inference, Bayesian statistical inference, clinical decision-making, queueing theory, and probability trees.

2.2 VARIABLES
2.2.1 CONSTANTS AND VARIABLES
A constant has only one unvarying value under all circumstances, for example π and c, the speed of light. A random variable can be qualitative (descriptive, with no intrinsic numerical value) or quantitative (with intrinsic numerical value). A quantitative random variable results when numerical values are assigned to results of measurement or counting. It is called a discrete random variable if the assignment is based on counting and a continuous random variable if the assignment is based on measurement. The continuous random variable can be expressed as fractions and decimals; the discrete random variable can only be expressed as whole numbers. The choice of statistical analytic technique depends on the type of variable.

2.2.2 QUALITATIVE RANDOM VARIABLES
Qualitative variables (nominal, ordinal, and ranked) are attribute or categorical with no intrinsic numerical value. The nominal has no ordering, the ordinal has ordering, and the ranked has observations arrayed in ascending or descending orders of magnitude.

2.2.3 QUANTITATIVE (NUMERICAL) DISCRETE RANDOM VARIABLES
The discrete random variables are the Bernoulli, the binomial, the multinomial, the negative binomial, the Poisson, the geometric, the hypergeometric, and the uniform. The Bernoulli is the number of successes in a single unrepeated trial with only 2 outcomes. The binomial is the number of successes in a fixed number of repeated trials, each with a dichotomous outcome. The multinomial is the number of successes in several independent trials, each trial having more than 2 possible outcomes. The negative binomial is the total number of repeated trials until a given number of successes is achieved. The Poisson is the number of events for which no upper limit can be assigned a priori. The geometric is the number of trials until the first success is achieved. The hypergeometric is the number selected from a sub-group of a larger sample, for example the number of males in a sample of n persons drawn from a population N. The uniform assigns the same probability to each of its possible values.
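
The sketch below, assuming the scipy library is available, evaluates probabilities for several of these discrete random variables; all parameters are illustrative assumptions:

```python
# Probability mass functions of some common discrete random variables.
from scipy import stats

n, p = 10, 0.3
print(stats.binom.pmf(3, n, p))      # binomial: Pr(3 successes in 10 trials)
print(stats.poisson.pmf(2, mu=1.5))  # Poisson: Pr(2 events), mean rate 1.5
print(stats.geom.pmf(4, p))          # geometric: first success on trial 4
# hypergeometric: 2 males in a sample of 5 from N = 50 containing 20 males
print(stats.hypergeom.pmf(2, 50, 20, 5))
# negative binomial: 5 failures before the 3rd success
print(stats.nbinom.pmf(5, 3, p))
```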

2.2.4 QUANTITATIVE (NUMERICAL) CONTINUOUS RANDOM VARIABLES
The continuous random variables can be natural, such as the normal, the exponential, and the uniform, or artificial, such as the chi-square, t, and F variables. The normal represents the result of a measurement on the continuous numerical scale, such as height and weight. The exponential is the time until the first occurrence of the event of interest. The uniform represents the result of a measurement that is equally likely to fall anywhere in its range.
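
A brief sketch, assuming scipy is available, of evaluating the three natural continuous random variables named above; all parameters are illustrative:

```python
# Density and distribution functions of natural continuous random variables.
from scipy import stats

print(stats.norm.pdf(0, loc=0, scale=1))       # normal density at the mean
print(stats.expon.cdf(2.0, scale=1.0))         # Pr(time to first event <= 2)
print(stats.uniform.pdf(0.5, loc=0, scale=1))  # uniform density on [0, 1]
```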

The continuous random variable can be measured on either the interval or the ratio scale. Only 2 common measurements are made on the interval scale: the calendar and the thermometer. The rest of measurements are on the ratio scale. The interval scale has the following properties: the difference between 2 readings has a meaning, the magnitude of the difference between 2 readings is the same at all parts of the scale, the ratio of 2 readings has no meaning, zero is arbitrary with no biological meaning, and both negative and positive values are allowed. The ratio scale has the following properties: zero has a biological significance, values can only be positive, the difference between 2 readings has a meaning, the ratio of 2 readings has a meaning and can be interpreted, and intervals between 2 readings have the same meaning at different parts of the scale.

2.2.5 RANDOM VARIABLES: PROPERTIES AND MATHEMATICAL OPERATIONS
A random variable has 6 properties. The expectation of a random variable is a central value around which it hovers most of the time. The variation of the random variable around the expectation is measured by its variance. Covariance measures the co-variability of two random variables. Correlation measures the linear relation between two random variables. Skewness measures the asymmetry of the distribution of the random variable about its center. Kurtosis measures the peakedness of the distribution of the random variable at the point of its expectation.
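
The sketch below, assuming numpy and scipy are available and using simulated data for illustration, computes all six properties:

```python
# Expectation, variance, covariance, correlation, skewness, and kurtosis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=10_000)   # a random variable X
y = 0.5 * x + rng.normal(size=10_000)           # a second, related variable

print(x.mean())                 # expectation (central value)
print(x.var(ddof=1))            # variance around the expectation
print(np.cov(x, y)[0, 1])       # covariance of X and Y
print(np.corrcoef(x, y)[0, 1])  # correlation (linear relation)
print(stats.skew(x))            # skewness (asymmetry about the centre)
print(stats.kurtosis(x))        # excess kurtosis (0 for a normal variable)
```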

Quantitative variables can be transformed into qualitative ones. Qualitative variables can be transformed into quantitative ones but this is less desirable. The continuous variable can be transformed into the discrete variable. Transformation of the discrete into the continuous may be misleading. Choice of statistical analytic technique is made according to the scale.

Statistical distributions are graphical representations of mathematical functions of random variables. Each random variable mentioned above has a corresponding statistical distribution that specifies all its possible values together with their corresponding probabilities. Each statistical distribution is associated with a specific statistical analytic technique.

Permutations and combinations are mathematical descriptions of arranging and grouping objects in various ways. Permutations are ordered arrangements. Combinations are arrangements in which the order in which objects are selected does not matter. Sets can be defined in two ways, either as a roster of component elements or by a rule. The complement of set A, designated Ac, consists of all elements of the universe that are not members of set A, such that A ∪ Ac = U. The union of sets is designated A ∪ B. The intersection of sets is designated A ∩ B. A ⊂ B means that set A is a subset of set B; B ⊃ A likewise means that A is a subset of B. A null set contains nothing: Ø = {}. Set A is said to be equal to set B if and only if A ⊂ B and B ⊂ A. De Morgan's first law is (A ∪ B)′ = A′ ∩ B′ and the second law is (A ∩ B)′ = A′ ∪ B′.
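
A minimal sketch of these counting rules and set operations using only the Python standard library; the particular sets and numbers are illustrative assumptions:

```python
# Permutations, combinations, and basic set algebra.
import math

# Ordered arrangements of 3 objects taken from 5: 5!/(5-3)! = 60
print(math.perm(5, 3))
# Unordered selections of 3 objects from 5: 5!/(3! 2!) = 10
print(math.comb(5, 3))

U = {1, 2, 3, 4, 5, 6}           # the universe
A, B = {1, 2, 3}, {3, 4}
Ac = U - A                       # complement of A
print(A | B)                     # union, A ∪ B
print(A & B)                     # intersection, A ∩ B
print(A <= U)                    # True: A is a subset of U

# de Morgan's laws: (A ∪ B)' = A' ∩ B' and (A ∩ B)' = A' ∪ B'
assert U - (A | B) == (U - A) & (U - B)
assert U - (A & B) == (U - A) | (U - B)
```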

2.3 THE NORMAL CURVE and ESTIMATION
2.3.1 INTRODUCTION
Abraham de Moivre first described the formula for the normal curve in 1733. In the 19th century, Pierre Simon Laplace and Carl Friedrich Gauss re-discovered the normal curve, each working independently. Around 1835 Adolphe Quetelet first used the normal curve as an approximation to the histogram. The normal curve may be one of the unifying principles of nature, reflecting sunan al llaah. The normal curve fits so many natural data distributions that it is very useful in statistics. The normal curve can also be used for data that are not initially normally distributed; such data can be made normally distributed by suitable mathematical transformations. The binomial, the Poisson, the t, and the chi-square distributions approach the normal curve when the sample size is large enough.

2.3.2 PROPERTIES & CHARACTERISTICS OF THE NORMAL CURVE
The normal curve is described fully by its mean and its standard deviation. A standardized normal curve has mean = 0 and standard deviation = 1. Two curves may have the same mean but different standard deviations. Two curves may have different means but the same standard deviation. For a normal curve the quartiles lie approximately 0.67 standard deviations either side of the mean, so the ratio of the inter-quartile range to the standard deviation is approximately 1.35. The normal curve is perfectly symmetrical about the mean. Although continuous, it models discrete data well for large sample sizes. It is asymptotic, i.e. it approaches the x-axis but never touches it.
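
A quick numerical check of the quartile property, assuming scipy is available:

```python
# The quartiles of N(0, 1) sit about 0.67 SD either side of the mean,
# so the inter-quartile range is about 1.35 SD.
from scipy import stats

q1 = stats.norm.ppf(0.25)   # first quartile: about -0.674
q3 = stats.norm.ppf(0.75)   # third quartile: about +0.674
print(q3 - q1)              # inter-quartile range: about 1.349
```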

2.3.3 USE OF THE NORMAL CURVE FOR NON-NORMAL DATA
Before the normal curve is used to model a data set, the normality of the data should be checked. These checks include a bell-shaped histogram, a straight line on normal probability paper, and the use of special computer programs. If the data are not normal, they can often be normalized by a logarithmic, power, or reciprocal transformation. (A z-score transformation standardizes the scale but does not change the shape of the distribution.)
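
As an assumed illustration of normalization by transformation, the sketch below (requiring numpy and scipy) uses the Shapiro-Wilk test, one example of a computer-based normality check, on simulated right-skewed data before and after a logarithmic transformation:

```python
# Normality check before and after a logarithmic transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)   # right-skewed data

_, p_raw = stats.shapiro(x)           # tests H0: the data are normal
_, p_log = stats.shapiro(np.log(x))   # the same test on the log scale
print(p_raw, p_log)   # p_raw small (not normal); p_log large (normal)
```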

2.3.4 THE Z-SCORE and the AREA UNDER THE CURVE
The z-score is the deviation of a measurement from the mean expressed in SD units. The standard normal variable z has mean 0 and variance 1, written z ~ N(0, 1). Z-scores are used to compare different data sets, to determine a cut-off or critical value, and to replace the original variable in analysis. The area under the curve represents relative frequency or probability. Mean ± 1 SD covers about 68% of observations, mean ± 2 SD about 95%, and mean ± 3 SD about 99.7%. Mean − 0.67 SD is Q1 and mean + 0.67 SD is Q3; the mean itself is Q2, the median. The area under the curve between mean − 1.96 SD and mean + 1.96 SD corresponds to the 95% confidence interval (CI).
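
A minimal sketch, assuming scipy and an illustrative height distribution, of computing a z-score and the areas under the standard normal curve:

```python
# Z-scores and areas under the standard normal curve.
from scipy import stats

mean, sd = 170.0, 10.0       # assumed height distribution in cm
z = (185.0 - mean) / sd      # z-score of a 185 cm reading
print(z)                     # 1.5

# Area within mean +/- 1, 2, and 3 SD
for k in (1, 2, 3):
    area = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(area, 4))   # 0.6827, 0.9545, 0.9973
```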

2.3.5 ESTIMATION
The difference between population parameters and sample statistics is due to estimation error. The error gets smaller as the sample size gets bigger. There are 3 types of estimates: the point estimate, the pooled estimate, and the interval estimate. A point estimate, being a single value, may be in error. A pooled estimate is a weighted combination of parameters from more than one population or sample. Interval estimation is preferred to point estimation because it shows the influence of random sampling and is used both for hypothesis testing and for assessing precision. In interval estimation the confidence interval is stated as a lower confidence limit and an upper confidence limit, customarily at a confidence level of 95%. In a common-sense way, the 95% confidence interval (CI) means that we are 95% sure that the true value of the parameter is within the interval. We can also say that the probability that the true parameter lies in the confidence interval is 95%. Expressed mathematically, the 95% confidence interval for a parameter θ is defined as Pr(a < θ < b) = 0.95. A third way of describing the 95% CI is to imagine taking repeated samples from a population and computing a confidence interval for each sample; 95% of these intervals will contain the true parameter.

The 95% CI for the mean is computed as mean ± 1.96 × s/√n, where n = sample size, s = sample standard deviation, and s/√n = the standard error of the mean. The 95% CI for a proportion is computed as ps ± 1.96 × {(ps qs)/n}^1/2, where ps = sample proportion, qs = 1 − ps, and n = sample size. A narrow 95% CI indicates higher precision. Ninety-five percent confidence intervals can also be defined for the difference of 2 random variables and for the ratio of 2 random variables.
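
A minimal sketch of both formulas, with all summary statistics assumed for illustration:

```python
# 95% confidence intervals for a mean and for a proportion.
import math

# Mean: n = 100 observations, sample mean 120, sample SD 15
n, xbar, s = 100, 120.0, 15.0
se = s / math.sqrt(n)                        # standard error of the mean
print(xbar - 1.96 * se, xbar + 1.96 * se)    # about (117.06, 122.94)

# Proportion: 30 successes out of n = 100
ps = 0.30
se_p = math.sqrt(ps * (1 - ps) / n)          # standard error of the proportion
print(ps - 1.96 * se_p, ps + 1.96 * se_p)    # about (0.210, 0.390)
```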

Validity tells us how well an instrument measures what it is supposed to measure. The mean is a measure of validity (parameter of location). The standard deviation is a measure of precision (spread). Validity and precision are both desirable but may not always be achieved simultaneously. A valid measurement may not be precise. A precise measurement may not be valid.

2.4 HYPOTHESES
2.4.1 HYPOTHESES AND THE SCIENTIFIC METHOD
The scientific method consists of hypothesis formulation, experimentation to test the hypothesis, and drawing conclusions. Hypotheses are statements of prior belief. They are modified by results of experiments to give rise to new hypotheses. The new hypotheses then in turn become the basis for new experiments. There are two traditions of formal hypothesis testing: the p-value approach (significance testing) and the confidence interval approach (Neyman-Pearson testing). The two approaches are mathematically and conceptually related.

2.4.2 NULL HYPOTHESIS (H0) & ALTERNATIVE HYPOTHESIS (HA)
The null hypothesis, H0, states that there is no difference between the two comparison groups and that any apparent difference is due to sampling error. The alternative hypothesis, HA, disagrees with the null hypothesis. H0 and HA are complementary and exhaustive: between them they cover all the possibilities. A hypothesis can be rejected but cannot be proved. A hypothesis cannot be proved conclusively, but an objective measure of the probability of its truth can be given in the form of a p-value. The concepts of conditional probability can be used to define parameters related to statistical testing. Type I error = α error = probability of rejecting a true H0 = false positive = Pr(rejecting H0 | H0 is true). Type II error = β error = probability of not rejecting a false H0 = false negative = Pr(not rejecting H0 | H0 is false). The confidence level (1 − α) = true negative = Pr(not rejecting H0 | H0 is true). Power (1 − β) = true positive = Pr(rejecting H0 | H0 is false).

2.4.3 HYPOTHESIS TESTING USING P-VALUES
Parameters of significance testing are the critical region, the significance level, the p-value, Type I error, Type II error, and power. The critical or rejection region, designated α, is the far end or tail of the distribution. The non-rejection region consists of normal or moderate values and is designated 1 − α. α, the pre-set level of significance customarily set at 0.05, is the probability that a test statistic falls in the rejection region; it is alternatively defined as the probability of wrongly rejecting H0 5% of the time, a ratio of 1 in 20. The p-value, the observed significance level, is the probability of obtaining observations as extreme as, or more extreme than, those observed, away from the null or mean value. The p-value is not set in advance but is computed from the data. The p-value can be defined in a common-sense way as the probability of rejecting a true hypothesis by mistake. P-values for large samples that are normally distributed are derived from 4 test statistics computed from the data: t, F, χ², and β. P-values for small samples that are not normally distributed are computed directly from the data using exact methods based on the binomial distribution. The decision rules are: if p < 0.05, H0 is rejected (the test is statistically significant); if p ≥ 0.05, H0 is not rejected (the test is not statistically significant).
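
As a sketch of the p-value decision rule, assuming scipy is available, the fragment below runs a two-sample t-test on simulated groups; all parameters are illustrative:

```python
# P-value decision rule with a two-sample t-test on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=120, scale=15, size=50)   # e.g. treated group
group_b = rng.normal(loc=128, scale=15, size=50)   # e.g. control group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject H0 (statistically significant)")
else:
    print(f"p = {p_value:.4f}: do not reject H0 (not significant)")
```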

2.4.4 HYPOTHESIS TESTING USING CONFIDENCE INTERVALS
The 95% confidence interval approach is more informative than the p-value approach because it indicates precision. Under H0 the null value is defined as 0 (when the difference between comparison groups = 0) or as 1.0 (when the ratio between comparison groups = 1). The 95% CIs can be computed from the data using approximate Gaussian methods (for large samples) or exact binomial methods (for small samples). The decision rules are: if the interval contains the null value, H0 is not rejected; if the interval does not contain the null value, H0 is rejected.
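
A minimal sketch of this decision rule for the difference between two means, with summary statistics assumed for illustration:

```python
# Confidence-interval decision rule for a difference of means (null value 0).
import math

# Assumed summary statistics for two groups
n1, m1, s1 = 50, 120.0, 15.0
n2, m2, s2 = 50, 128.0, 15.0

diff = m1 - m2
se = math.sqrt(s1**2 / n1 + s2**2 / n2)     # SE of the difference
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(lo, hi)                               # about (-13.88, -2.12)

# H0 is rejected only if the interval excludes the null value 0
print("reject H0" if not (lo <= 0 <= hi) else "do not reject H0")
```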

2.4.5 CONCLUSIONS and INTERPRETATIONS
A statistically significant test implies that the following are true: H0 is false, H0 is rejected, the observations are not compatible with H0, the observations are not due to sampling variation, and the observations reflect real biological phenomena. A statistically non-significant test implies that the following are true: H0 is not false (we do not say true), H0 is not rejected, the observations are compatible with H0, the observations are due to sampling variation or random errors of measurement, and the observations are artificial or apparent rather than real biological phenomena. Statistical significance may have no clinical or practical importance; this may be because other factors are involved that were not studied, or because the measurements were invalid. Conversely, clinically important differences may fail to reach statistical significance because of a small sample size or because the measurements are not discriminating enough. Hypothesis testing may be 1-sided or 2-sided. The 1-sided test considers extreme values on one side (1 tail) and is rarely used. The 2-sided test considers extreme values on both sides (2 tails), is the more popular and conservative test, and looks for any change in the parameter whatever its direction.

2.5 SAMPLES
2.5.1 SAMPLES and POPULATIONS
The word population in statistical usage is defined as a set of objects, states, or events with a common observable characteristic or attribute. Elements are the members of the population or of a sample thereof. A sample is a representative subset of the population selected to obtain information about the population. A sampling plan is the whole process of selecting a sample. A sampling design is both the sampling plan and the estimation methods. Sampling starts by defining a sampling frame (the list of individuals to be sampled). The sampling units are the people or objects to be sampled. Samples are studied because they cost less and are logistically easier. Some populations are hypothetical and cannot be studied except by sampling. Samples are used for estimation of population parameters, estimation of population totals, and inference about populations. The sample is selected from the study population (the population of interest). The study population is definable in an exact way and is part of the target population. The population studied may be finite or infinite.

2.5.2 RANDOM (PROBABILITY) SAMPLING
In random sampling every element has the same inclusion probability. Randomness does not always assure representativeness, especially for small samples. Sampling with replacement is based on the binomial distribution and sampling without replacement on the hypergeometric distribution; the two types are similar for large samples. Simple random sampling is random selection from the population, used when the population is approximately homogeneous. Stratified random sampling involves dividing the population into groups called strata; simple random sampling is then carried out within each stratum. Systematic random sampling is used when an ordered list is available, such that every nth unit is included. Multi-stage random sampling is simple random sampling carried out in 2 or more stages.
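
A minimal sketch of three of these probability sampling schemes using the Python standard library; the population of 1,000 numbered units is an assumption for illustration:

```python
# Simple random, systematic, and stratified sampling from a numbered frame.
import random

random.seed(1)
population = list(range(1000))   # assumed sampling frame of 1,000 units

# Simple random sampling: every unit has the same inclusion probability
srs = random.sample(population, k=50)

# Systematic sampling: every nth unit starting from a random point
step = len(population) // 50
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: simple random sampling within each stratum
strata = {"A": population[:600], "B": population[600:]}
stratified = [u for s in strata.values() for u in random.sample(s, k=25)]

print(len(srs), len(systematic), len(stratified))   # 50 50 50
```
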
2.5.3 NON-SCIENTIFIC SAMPLING
Convenience or casual sampling is subjective, depends on whims, and makes no claim to objectivity. A quota sample is a subjective selection of a pre-fixed number from each category.

2.5.4 OTHER TYPES OF SAMPLING
Cluster sampling uses clusters (groups of individuals) as sampling units instead of individuals. Epidemiological samples involve random sampling of human populations. There are basically three types of epidemiological sampling schemes: cross-sectional, case control, and follow-up (or cohort). Environmental sampling, static or continuous, uses direct measurements and has the advantages of being objective, individualized, quantitative, specific, and sensitive.