Integrated Medical Education Resources: 0900L - MODULE 5.0 STUDY DESIGN AND ANALYSIS

MODULE OUTLINE

5.1 FIELD EPIDEMIOLOGY

5.1.1 Sample Size Determination

5.1.2 Sources of Secondary Data

5.1.3 Primary Data Collection by Questionnaire

5.1.4 Physical Primary Data Collection

5.1.5 Data Management and Data Analysis

5.2 CROSS-SECTIONAL DESIGN

5.2.1 Definition

5.2.2 Design and Data Collection

5.2.3 Statistical Parameters

5.2.4 Ecologic Design

5.2.5 Health Surveys

5.3 CASE-CONTROL DESIGN

5.3.1 Basics

5.3.2 Design and Data Collection of Case-Base Studies

5.3.3 Statistical Parameters

5.3.4 Strengths and Weaknesses

5.3.5 Sample Size Computation

5.4 FOLLOW-UP DESIGN

5.4.1 Definition

5.4.2 Design and Data Collection

5.4.3 Statistical Parameters

5.4.4 Strengths and Weaknesses

5.4.5 Sample Size Computation

5.5 RANDOMIZED DESIGN: COMMUNITY TRIALS

5.5.1 Overview

5.5.2 Design of a Community Intervention Study

5.5.3 Community Trials: Strengths and Weaknesses

5.5.4 Procedure of the Community Trial

5.5.5 Data Interpretation

5.6 RANDOMIZED DESIGN: CLINICAL TRIALS

5.6.1 Study Design for Phase 3 Randomized Clinical Trials

5.6.2 Data Collection

5.6.3 Analysis and Interpretation

UNIT 5.1

FIELD EPIDEMIOLOGY

· Sample determination

· Data collection

· Data Management

Key Words and Terms:

· Analysis, bivariate analysis

· Analysis, multivariate analysis

· Analysis, simple analysis

· Analysis, stratified analysis

· Analysis, univariate analysis

· Analytic models, likelihood model

· Analytic models, probability model

· Analytic models, regression model

· Data coding

· Data compression

· Data editing

· Data encryption

· Data entry

· Data interpretation

· Data modeling

· Data processing

· Data reduction

· Data replication

· Data transformation

· Data value

· Data, data summary

· Data, grouped

· Data, primary data collection

· Data, secondary data collection

· Database design

· Database management system

· Estimation

· Inference

· Questionnaire, face-to-face

· Questionnaire, mail

· Questionnaire, telephone

· Questionnaire, computer

· Sample, sample size

· Study power

· Test for association

· Test for effect

· Test for interaction

· Test for effect modification

· Test for trend

UNIT OUTLINE

5.1.1 SAMPLE SIZE DETERMINATION

A. Introduction

B. Sample Size for Estimation of Population Parameters

C. Sample Size for Inference on Sample Means

D. Sample Size for Inference on 2 Sample Proportions

E. Sample Size for Experimental Studies

5.1.2 SOURCES OF SECONDARY DATA

A. General Population and Household Census

B. Vital Statistics

C. Routinely-Collected Data

D. Epidemiological Studies

E. Special Surveys

5.1.3 PRIMARY DATA COLLECTION BY QUESTIONNAIRE

A. Questionnaire Design

B. Preparation for Data Collection

C. Questionnaire Administration by Face-To-Face Interview

D. Questionnaire Administration by Telephone

E. Questionnaire Administration by Mail

F. Computer-Administered Questionnaire

5.1.4 PHYSICAL PRIMARY DATA COLLECTION

A. Clinical Examination

B. Psychological/Psychiatric Examination

C. Environmental or Occupational Exposure

D. Biological Measurements

E. Experiments

5.1.5 DATA MANAGEMENT AND DATA ANALYSIS

A. Data Management

B. Preliminaries to Data Analysis

C. Discrete Data Analysis: Unstratified Analysis

D. Discrete Data: Stratified Analysis

E. Multivariate Models

F. Polytomous Exposures and Outcomes

5.1.1 SAMPLE SIZE DETERMINATION

A. INTRODUCTION

Samples are selected so that data can be collected from them to answer specific questions. At the conceptual level, sample selection is a tool for studying the heterogeneity of the population. If a population is perfectly homogeneous, a sample of 1 person, however selected, is sufficient to study that population. If a population consists of several perfectly homogeneous subgroups, selecting one element from each subgroup provides a sample that sufficiently describes the population. Similarly, when every group mirrors the heterogeneity of the whole population, a sample of one group with all its elements is sufficient to represent the population.

The size of the sample needed depends on the nature of the question or hypothesis being tested. The following are considerations in determining the sample size: the budget available for the study, the time within which results are needed, minimization of sampling error, and achievement of pre-specified levels of precision. The most important consideration is the precision of the estimates. If the sample is too small, the study will not have sufficient power to answer the question under consideration accurately. If the sample is bigger than necessary, resources are wasted because information is collected from more persons than are needed.

Power is the ability to detect a difference when one truly exists. Power is determined by the significance level, the magnitude of the difference, and the sample size. Power = 1 - beta = Pr (rejecting H₀ when H₀ is false) = Pr (true positive). The larger the sample size, the narrower the confidence interval. The higher the confidence level, the wider the confidence interval. The bigger the sample size, the more powerful the study. Beyond an optimal sample size, the increase in power does not justify the cost of a larger sample, so the desired power has to be balanced against the cost associated with large studies. Both power and sample size can be computed or looked up in appropriate tables.
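
The dependence of power on sample size can be illustrated with a short calculation. The sketch below is only an illustration: it assumes a two-sided test comparing the means of two groups under a normal approximation, and the function name power_two_means and the example values (a difference of 5 units, a standard deviation of 10) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_means(n1, d, sigma, r=1.0, alpha=0.05):
    """Approximate power of a two-sided test comparing two group means.

    n1    : number of subjects in group 1 (group 2 has r * n1 subjects)
    d     : true difference between the two group means
    sigma : standard deviation of the measurements (assumed equal in both groups)
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Normal approximation: Z_beta = (d / sigma) * sqrt(n1 * r / (r + 1)) - Z_alpha/2
    z_beta = (d / sigma) * sqrt(n1 * r / (r + 1)) - z_alpha
    return STD_NORMAL.cdf(z_beta)


# Power rises with sample size, but with diminishing returns.
for n1 in (10, 20, 40, 80, 160):
    print(n1, round(power_two_means(n1, d=5.0, sigma=10.0), 3))
```

Under these assumptions, about 63 subjects per group give roughly 80% power, and doubling the sample beyond that point adds progressively less power.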

There are procedures and formulas for computing sample sizes. There are special computer programs such as EPI-INFO that can be used to compute sample sizes.

B. SAMPLE SIZE FOR ESTIMATION OF POPULATION PARAMETERS

SIMPLE RANDOM SAMPLES

If it is desired to estimate the mean with accuracy such that the lower bound is μ - c and the upper bound is μ + c with probability 1 - α, the sample size is given by the formula n = Nσ² / {(N - 1)D² + σ²} where D = c/Z_α/2 and Z_α/2 = 1.96 for 95% confidence. The (1 - α)100% confidence interval for the mean estimated from a simple random sample is (sample average) ± Z_α/2 {Var(x̄)}^1/2 where Var(x̄) = {s² / (n - 1)}{(N - n)/N} and s² is the variance computed from the sample. A simpler formula, which ignores the finite size of the population, gives the sample size as n = (Z_α/2)² σ²/d² where σ = population standard deviation and d = minimum detectable difference.
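
As a worked illustration of the two sample size formulas for a mean, the following sketch computes both. The function name n_for_mean and the example values (standard deviation 15, half-width c = 2, population of 2000) are hypothetical, and c is used as the margin in both formulas.

```python
from math import ceil
from statistics import NormalDist


def n_for_mean(sigma, c, N=None, confidence=0.95):
    """Sample size to estimate a mean to within +/- c.

    sigma : population standard deviation (an assumed planning value)
    c     : half-width of the desired interval
    N     : population size; None uses the simpler formula for a large population
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    if N is None:
        # Simpler formula: n = z^2 * sigma^2 / c^2
        return ceil(z ** 2 * sigma ** 2 / c ** 2)
    D = c / z
    # Finite-population formula: n = N sigma^2 / ((N - 1) D^2 + sigma^2)
    return ceil(N * sigma ** 2 / ((N - 1) * D ** 2 + sigma ** 2))


print(n_for_mean(sigma=15, c=2))          # large population
print(n_for_mean(sigma=15, c=2, N=2000))  # population of 2000
```

The finite-population result is smaller than the simple one, and the two converge as N grows.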

If a population proportion, p, is to be determined with accuracy such that the estimate ranges from a lower bound of p - c to an upper bound of p + c, the sample size required is given by the formula n = Nσ² / {(N - 1)(c/Z_α/2)² + σ²}. Because σ² = p(1 - p) for a proportion, this formula can be rewritten as n = {Np(1 - p)} / {(N - 1)(c/Z_α/2)² + p(1 - p)}. It can be rewritten without the ‘c’ term as n = Np(1 - p) / {(N - 1)D² + p(1 - p)} where D = c/Z_α/2 and Z_α/2 = 1.96 for 95% confidence. A simpler formula for the sample size is n = {(Z_α/2)²/d²} p(1 - p). The (1 - α)100% confidence interval for the proportion computed from the sample is (sample proportion) ± Z_α/2 {Var(p)}^1/2 where Var(p) = {p(1 - p) / (n - 1)}{(N - n)/N}.
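
The formulas for a proportion can be sketched in the same way. In this illustration the function name n_for_proportion and the example values (p = 0.5, margin 0.05, population of 5000) are hypothetical. The value p = 0.5 is a common planning choice because it maximizes p(1 - p).

```python
from math import ceil
from statistics import NormalDist


def n_for_proportion(p, c, N=None, confidence=0.95):
    """Sample size to estimate a proportion p to within +/- c.

    p : anticipated proportion (an assumed planning value)
    c : half-width of the desired interval
    N : population size; None uses the simpler large-population formula
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    if N is None:
        # Simpler formula: n = (z^2 / c^2) * p * (1 - p)
        return ceil(z ** 2 / c ** 2 * p * (1 - p))
    D = c / z
    # Finite-population formula: n = N p(1-p) / ((N - 1) D^2 + p(1-p))
    return ceil(N * p * (1 - p) / ((N - 1) * D ** 2 + p * (1 - p)))


print(n_for_proportion(0.5, 0.05))          # large population
print(n_for_proportion(0.5, 0.05, N=5000))  # population of 5000
```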

STRATIFIED RANDOM SAMPLE

The sample size needed to determine the average with accuracy of +c or -c and (1 - α)100% confidence is given by the expression n = {Σ (N_i² σ_i² / ν_i)} / {N² (c/Z_α/2)² + Σ N_i σ_i²} where n_i = nν_i, so that ν_i is the fraction of the sample allocated to stratum i. The sample size needed to determine the proportion with accuracy of +c or -c is given by the expression n = {Σ (N_i² p_i (1 - p_i) / ν_i)} / {N² (c/Z_α/2)² + Σ N_i p_i (1 - p_i)} where n_i = nν_i.

The unbiased estimator of the population average is given by the summation Σ w_i x̄_i taken over the strata, where x̄_i is the sample mean of stratum i. The unbiased estimator of the variance of the average is given by the summation Σ w_i² (s_i² / n_i) {(N_i - n_i) / (N_i - 1)} over the strata. The unbiased estimator of the population proportion is given by the summation Σ w_i p_i over the strata. The unbiased estimator of the variance of the proportion is given by the summation Σ w_i² {p_i (1 - p_i) / n_i} {(N_i - n_i) / (N_i - 1)} over the strata.
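
The stratified estimators of the average and its variance can be sketched as follows. The text does not define the weights w_i, so this illustration assumes w_i = N_i / N (the share of the population in stratum i). The function name stratified_mean and the two-stratum data are hypothetical.

```python
def stratified_mean(strata):
    """Stratified estimate of a population mean and its variance.

    strata : list of tuples (N_i, n_i, xbar_i, s2_i) giving, for each stratum,
             the stratum size, the number sampled, the sample mean and the
             sample variance.
    Assumption: the weight of stratum i is w_i = N_i / N.
    """
    N = sum(N_i for N_i, _, _, _ in strata)
    mean = 0.0
    variance = 0.0
    for N_i, n_i, xbar_i, s2_i in strata:
        w_i = N_i / N
        mean += w_i * xbar_i
        # w_i^2 * (s_i^2 / n_i) * finite population correction
        variance += w_i ** 2 * (s2_i / n_i) * (N_i - n_i) / (N_i - 1)
    return mean, variance


# Two hypothetical strata: (N_i, n_i, sample mean, sample variance)
example = [(600, 30, 50.0, 100.0), (400, 20, 70.0, 144.0)]
mean, variance = stratified_mean(example)
print(mean, variance)
```

The same loop gives the estimators for a proportion if xbar_i is replaced by p_i and s2_i by p_i (1 - p_i).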

MULTI-STAGE RANDOM SAMPLE

In 2-stage sampling, the sample average is given by the expression (M/m) Σ w_i x̄_i. The variance of the average is given by the expression (M/N)² (s₀₁/m) {(M - m) / (M - 1)} + (M/m) Σ w_i² (s_i² / n_i) {(N_i - n_i) / (N_i - 1)}.

The sample proportion is given by the expression (M/m) Σ w_i p_i. The variance of the proportion is given by the expression (M/N)² (s₀₂/m) {(M - m) / (M - 1)} + (M/m) Σ w_i² {p_i (1 - p_i) / n_i} {(N_i - n_i) / (N_i - 1)} where M = number of groups in the population, m = number of groups selected in the first stage, N = number of elements in the population, N_i = number of elements in the ith group, n_i = number of elements selected from the ith group, x̄_i = sample mean from the ith group, and p_i = sample proportion from the ith group.

CLUSTER SAMPLE

C. SAMPLE SIZE FOR INFERENCE ON SAMPLE MEANS

SIMPLE RANDOM SAMPLES

The sample size needed to compare averages of measurements of two independent groups is given by the formula n₁ = (1 + 1/r)(Z_α/2 + Z_β)² σ² / (μ₂ - μ₁)² where r = n₂/n₁ (the number in group 2 divided by the number in group 1), Z_α/2 = 1.96 for 95% confidence, σ = standard deviation of the measurements in each group, μ₁ = average of group 1 and μ₂ = average of group 2. For a given sample size, Z_β = [(d/σ){n₁r/(r + 1)}^1/2] - [Z_α/2] or, in simplified form, Z_β = [{d - d*} / se(d)] - [Z_α/2], where d = the magnitude of the difference one wishes to detect (the non-null value of the difference) and d* = the value of the difference under the null hypothesis.

The sample size needed to compare averages of measurements of two matched groups in a matched study is given by the formula n = (Z_α/2 + Z_β)² σ_d² / (μ₂ - μ₁)² where Z_α/2 = 1.96 for 95% confidence, σ_d = standard deviation of the differences within matched pairs, μ₁ = average of group 1 and μ₂ = average of group 2. If the correlation coefficient, ρ, between the measurements in the two groups is known, the formula becomes n = 2(1 - ρ)(Z_α/2 + Z_β)² σ² / (μ₂ - μ₁)², because σ_d² = 2(1 - ρ)σ².
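
A minimal sketch of the two sample size calculations for comparing means follows. It assumes that, for independent groups, sigma is the standard deviation of the individual measurements (taken as equal in both groups), and that, for matched groups, sigma_d is the standard deviation of the within-pair differences. The function names and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def z_sum(alpha, power):
    """Z_alpha/2 + Z_beta for a two-sided test."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_independent(d, sigma, r=1.0, alpha=0.05, power=0.80):
    """Size of group 1 for two independent groups; group 2 has r * n1."""
    return ceil((1 + 1 / r) * z_sum(alpha, power) ** 2 * sigma ** 2 / d ** 2)


def n_matched(d, sigma_d, alpha=0.05, power=0.80):
    """Number of matched pairs needed."""
    return ceil(z_sum(alpha, power) ** 2 * sigma_d ** 2 / d ** 2)


# Detect a difference of 5 units with 80% power at alpha = 0.05.
print(n_independent(d=5.0, sigma=10.0))  # subjects in each group when r = 1
print(n_matched(d=5.0, sigma_d=8.0))     # matched pairs
```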

The values of Z_α/2 usually used for various levels of significance are as follows: for α = 0.001, Z_α/2 = 3.291; for α = 0.005, Z_α/2 = 2.807; for α = 0.01, Z_α/2 = 2.576; for α = 0.02, Z_α/2 = 2.326; for α = 0.05, Z_α/2 = 1.96; for α = 0.10, Z_α/2 = 1.645 (Jennifer L. Kelsey et al., Methods in Observational Epidemiology, 2nd edition, OUP, New York and Oxford, 1996).
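
These tabulated values can be reproduced from the standard normal distribution. The sketch below uses NormalDist from the Python standard library. The helper name z_two_sided is hypothetical.

```python
from statistics import NormalDist


def z_two_sided(alpha):
    """Standard normal deviate that leaves alpha/2 in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.001, 0.005, 0.01, 0.02, 0.05, 0.10):
    print(f"alpha = {alpha:<5}  Z = {z_two_sided(alpha):.3f}")
```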

STRATIFIED RANDOM SAMPLE

MULTI-STAGE RANDOM SAMPLE

CLUSTER SAMPLE

D. SAMPLE SIZE FOR INFERENCE ON 2 SAMPLE PROPORTIONS

SIMPLE RANDOM SAMPLES

The sample size needed to compare percentages (proportions) in two independent samples is given by n₁ = [Z_α/2 {(1 + 1/r) p̄(1 - p̄)}^1/2 + Z_β {p₁(1 - p₁) + p₂(1 - p₂)/r}^1/2]² / (p₁ - p₂)² where Z_α/2 = 1.96 for 95% confidence, p̄ = (p₁ + p₂)/2 or, when the groups differ in size, the weighted value p̄ = {p₁ + (r)(p₂)} / {1 + r}, p₁ = proportion in group 1, p₂ = proportion in group 2, r = n₂/n₁, n₂ = number in group 2 and n₁ = number in group 1. For a given sample size, Z_β = [{n₁d²r} / {(r + 1) p̄(1 - p̄)}]^1/2 - [Z_α/2], where d = p₁ - p₂.

In an unmatched case-control study, the formula is modified to become n₁ = [Z_α/2 {(1 + 1/r) / (p̄(1 - p̄))}^1/2 + Z_β {1/(p₁(1 - p₁)) + 1/(r p₂(1 - p₂))}^1/2]² / {ln(OR)}² where Z_β is the standard normal deviate for the desired power, the other symbols are as defined above, and OR = expected odds ratio.
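
A sketch of the sample size calculation for comparing two independent proportions is given below. It assumes the usual normal-approximation form, in which each Z term multiplies the square root of its variance term and the sum is squared, and it uses the weighted average proportion (p₁ + r p₂)/(1 + r). The function name and the example proportions are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_two_proportions(p1, p2, r=1.0, alpha=0.05, power=0.80):
    """Size of group 1 needed to compare two independent proportions.

    p1, p2 : anticipated proportions in group 1 and group 2
    r      : n2 / n1, the ratio of group 2 size to group 1 size
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    p_bar = (p1 + r * p2) / (1 + r)  # equals (p1 + p2) / 2 when r = 1
    null_term = z_alpha * sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2) / r)
    return ceil((null_term + alt_term) ** 2 / (p1 - p2) ** 2)


# Detect 30% versus 20% with 80% power at alpha = 0.05.
print(n_two_proportions(0.30, 0.20))         # equal group sizes
print(n_two_proportions(0.30, 0.20, r=2.0))  # group 2 twice as large
```

Enlarging group 2 reduces the number needed in group 1, although the total number of subjects increases.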

The relation between p₁ and p₀, the proportion in the reference group, differs according to the effect measure being used. If the effect measure is the odds ratio (OR), the relation is p₁ = {(p₀)(OR)} / {1 + p₀(OR - 1)}. If the effect measure is the risk ratio (RR), the relation is p₁ = (p₀)(RR).
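
The two conversions can be checked numerically. In this sketch the function names and the example values (p₀ = 0.2, OR = 2, RR = 1.5) are hypothetical.

```python
def p1_from_odds_ratio(p0, odds_ratio):
    """Proportion p1 implied by a reference proportion p0 and an odds ratio."""
    return p0 * odds_ratio / (1 + p0 * (odds_ratio - 1))


def p1_from_risk_ratio(p0, risk_ratio):
    """Proportion p1 implied by a reference proportion p0 and a risk ratio."""
    return p0 * risk_ratio


def odds(p):
    """Odds corresponding to a proportion p."""
    return p / (1 - p)


p0 = 0.2
p1 = p1_from_odds_ratio(p0, 2.0)
print(p1)                    # proportion implied by OR = 2
print(odds(p1) / odds(p0))   # recovers the odds ratio
print(p1_from_risk_ratio(p0, 1.5))
```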

STRATIFIED RANDOM SAMPLE

MULTI-STAGE RANDOM SAMPLE

CLUSTER SAMPLE

E. SAMPLE SIZE FOR EXPERIMENTAL STUDIES

CLINICAL TRIALS

Formulas for suitable sample sizes for clinical trials are complicated. Recourse is often made to rule-of-thumb estimates such as the following. The 50:50 rule of thumb for counted outcomes (discrete events) says that for an 80% chance of detecting a 50% relative reduction in event rate, at least 50 events are needed in the control group. The rule of thumb for measured outcomes states that the sample size in each group is approximated by 16(s/d)² where s = the standard deviation of individual measurements in each group and d = minimum difference in average measurement that needs to be detected.
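
The rule of thumb for measured outcomes can be compared with the normal-approximation multiplier it rounds. The sketch below assumes two equal groups, a two-sided alpha of 0.05 and 80% power, for which 2(Z_α/2 + Z_β)² is about 15.7. The function name and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def rule_of_thumb_n(s, d):
    """Approximate sample size per group for a measured outcome: 16 (s/d)^2."""
    return ceil(16 * (s / d) ** 2)


# Where the 16 comes from: 2 * (Z_alpha/2 + Z_beta)^2 at alpha = 0.05, power = 0.80.
nd = NormalDist()
multiplier = 2 * (nd.inv_cdf(0.975) + nd.inv_cdf(0.80)) ** 2

print(round(multiplier, 2))            # close to 16
print(rule_of_thumb_n(s=10.0, d=5.0))  # subjects per group
```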

In a clinical trial comparing outcomes as proportions in 2 groups, we use the formula for comparison of 2 proportions discussed above. The experimenter has to state the following: the alpha level, the study power, and the outcome difference that the study should detect. Alpha is usually set at 0.05 or 5%. Study power is usually set at 80% (a beta level of 0.2).

In a clinical trial comparing outcome as means in 2 groups, we use the formula for comparison of 2 means as discussed before.

LABORATORY EXPERIMENTS