Synopsis for use in teaching sessions of the postgraduate course ‘Essentials of Epidemiology in Public Health’ Department of Social and Preventive Medicine, Faculty of Medicine, University Malaya Malaysia July 17th 2009
5.1 FIELD EPIDEMIOLOGY
5.1.1 Sample Size Determination
5.1.2 Sources of Secondary Data
5.1.3 Primary Data Collection by Questionnaire
5.1.4 Physical Primary Data Collection
5.1.5 Data Management and Data Analysis
5.2 CROSS-SECTIONAL DESIGN
5.2.1 Definition
5.2.2 Design and Data Collection
5.2.3 Statistical Parameters
5.2.4 Ecologic Design
5.2.5 Health Surveys
5.3 CASE CONTROL DESIGN
5.3.1 Basics 
5.3.2 Design and Data Collection of Case-Base Studies
5.3.3 Statistical Parameters
5.3.4 Strengths and Weaknesses
5.3.5 Sample Size Computation
5.4 FOLLOW-UP DESIGN
5.4.1 Definition
5.4.2 Design and Data Collection
5.4.3 Statistical Parameters
5.4.4 Strengths and Weaknesses
5.4.5 Sample Size Computation
5.5 RANDOMIZED DESIGN
5.5.1 Randomized Design in Community Trials
5.5.2 Study Design for Phase 3 Randomized Clinical Trials 
5.5.3 Data Collection in Randomized Clinical Trials
5.5.4 Analysis and Interpretation of Randomized Clinical Trials
5.1.1 SAMPLE SIZE DETERMINATION
The size of the sample depends on the hypothesis, the budget, the study duration, and the precision required. If the sample is too small, the study will lack sufficient power to answer the study question. A sample bigger than necessary is a waste of resources. Power is the ability to detect a difference and is determined by the significance level, the magnitude of the difference, and the sample size. Power = 1 - β = Pr (rejecting H0 when H0 is false) = Pr (true positive), where β is the probability of a type II error. The bigger the sample size, the more powerful the study. Beyond an optimal sample size, the increase in power does not justify the cost of a larger sample. There are procedures, formulas, and computer programs for determining sample sizes for different study designs.
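The relation between sample size and power described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a two-sided z-test comparing two proportions under the normal approximation; the function name `power_two_proportions` and the example proportions (0.30 versus 0.20) are hypothetical and are not taken from the course material.

```python
from statistics import NormalDist


def power_two_proportions(p1, p0, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    Assumes equal group sizes and the normal approximation; the small
    contribution of the opposite tail is ignored.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two sample proportions
    se = ((p1 * (1 - p1) + p0 * (1 - p0)) / n_per_group) ** 0.5
    return z.cdf(abs(p1 - p0) / se - z_alpha)


# Hypothetical example: power rises with sample size, with diminishing returns
for n in (50, 100, 200, 400):
    print(n, round(power_two_proportions(0.30, 0.20, n), 2))
```

Dedicated sample size software uses more refined formulas, so the figures from this sketch should be read as rough guides only.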
5.1.2 SOURCES OF SECONDARY DATA
Secondary data is from decennial censuses, vital statistics, routinely collected data, epidemiological studies, and special health surveys. Census data is reliable. It is wide in scope, covering demographic, social, economic, and health information. The census describes population composition by sex, race/ethnicity, residence, marriage, and socio-economic indicators. Vital events are births, deaths, marriage and divorce, and some disease conditions. Routinely collected data are cheap but may be unavailable or incomplete. They are obtained from medical facilities, life and health insurance companies, institutions (like prisons, army, schools), disease registries, and administrative records. Observational epidemiological studies are of 3 types: cross-sectional, case-control, and follow-up/cohort studies. Special surveys cover a larger population than epidemiological studies and may be health, nutritional, or socio-demographic surveys.
5.1.3 PRIMARY DATA COLLECTION BY QUESTIONNAIRE
Questionnaire design involves content, wording of questions, format and layout. The reliability and validity of the questionnaire as well as practical logistics should be tested during the pilot study. Informed consent and confidentiality must be respected. A protocol sets out data collection procedures. Questionnaire administration by face-to-face interview is the best but is expensive. Questionnaire administration by telephone is cheaper. Questionnaire administration by mail is very cheap but has a lower response rate. Computer-administered questionnaire is associated with more honest responses.
5.1.4 PHYSICAL PRIMARY DATA COLLECTION
Data can be obtained by clinical examination, standardized psychological/psychiatric evaluation, measurement of environmental or occupational exposure, assay of biological specimens (endobiotic or xenobiotic), and laboratory experiments. Pharmacological experiments involve bioassay, quantal dose-effect curves, dose-response curves, and studies of drug elimination. Physiology experiments involve measurements of parameters of the various body systems. Microbiology experiments involve bacterial counts, immunoassays, and serological assays. Biochemical experiments involve measurements of concentrations of various substances. Statistical and graphical techniques are used to display and summarize this data.
5.1.5 DATA MANAGEMENT AND DATA ANALYSIS
Self-coding or pre-coded questionnaires are preferable. Data is input as text, multiple choice, numeric, date and time, and yes/no responses. In double entry techniques, 2 data entry clerks enter the same data and a check is made by computer on items on which they differ. Data in the computer can be checked manually against the original questionnaire. Interactive data entry enables detection and correction of logical and entry errors immediately. Data replication is a copy management service that involves copying the data and also managing the copies. Synchronous data replication is instantaneous updating with no latency in data consistency. In asynchronous data replication the updating is not immediate and consistency is loose. 
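The double entry check described above can be sketched as follows. This is a minimal illustration, assuming each clerk's entries are held as a dictionary keyed by record number; the function name `double_entry_differences` and the sample records are hypothetical.

```python
def double_entry_differences(entry_a, entry_b):
    """List every item on which two independent data entries disagree.

    Each argument maps a record id to a dict of field -> value.
    Returns tuples of (record_id, field, value_a, value_b).
    """
    differences = []
    for record_id in sorted(set(entry_a) | set(entry_b)):
        rec_a = entry_a.get(record_id, {})
        rec_b = entry_b.get(record_id, {})
        for field in sorted(set(rec_a) | set(rec_b)):
            if rec_a.get(field) != rec_b.get(field):
                differences.append(
                    (record_id, field, rec_a.get(field), rec_b.get(field))
                )
    return differences


# Hypothetical example: the second clerk transposed the digits of an age
clerk_1 = {1: {"age": 34, "sex": "F"}, 2: {"age": 51, "sex": "M"}}
clerk_2 = {1: {"age": 34, "sex": "F"}, 2: {"age": 15, "sex": "M"}}
print(double_entry_differences(clerk_1, clerk_2))
```

Each flagged item would then be resolved by checking the original questionnaire.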
Data editing is the process of correcting data collection and data entry errors. The data is 'cleaned' using logical, statistical, range, and consistency checks. All values should be kept at the same level of precision (number of decimal places) to make computations consistent and decrease rounding-off errors. The kappa statistic is used to measure inter-rater agreement. Data editing identifies and corrects errors such as invalid or inconsistent values. Data is validated and its consistency is tested. The main data problems are missing data, coding and entry errors, inconsistencies, irregular patterns, digit preference, outliers, rounding-off / significant figures, questions with multiple valid responses, and record duplication. Data transformation is the process of creating new derived variables preliminary to analysis. It includes mathematical operations such as division, multiplication, addition, or subtraction, and mathematical transformations such as logarithmic, trigonometric, power, and z-transformations.
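The kappa statistic mentioned above can be computed as in the following sketch. It assumes the simplest case, Cohen's kappa for two raters classifying the same subjects into nominal categories; the function name `cohen_kappa` and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters beyond chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of subjects on which the raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical example: two raters classify 10 records as yes/no
rater_1 = ["y", "y", "n", "n", "y", "n", "y", "n", "y", "y"]
rater_2 = ["y", "n", "n", "n", "y", "n", "y", "y", "y", "y"]
print(round(cohen_kappa(rater_1, rater_2), 3))
```

A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance.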
Data analysis consists of data summarization, estimation, and interpretation. Simple manual inspection of the data is needed before statistical procedures. Preliminary examination consists of looking at tables and graphics. Descriptive statistics are used to detect errors, ascertain the normality of the data, and know the size of cells. Missing values may be imputed or incomplete observations may be eliminated. Tests for association, effect, or trend involve construction and testing of hypotheses. The tests for association are the t, chi-square, linear correlation, and logistic regression tests or coefficients. The common effect measures are the odds ratio, the risk ratio, and the rate difference. Measures of trend can discover relationships that are not picked up by association and effect measures. The probability, likelihood, and regression models are used in analysis. Analytic procedures and computer programs vary for continuous and discrete data, for person-time and count data, for simple and stratified analysis, for univariate, bivariate, and multivariate analysis, and for polychotomous outcome variables. Procedures are different for large samples and small samples.
UNIT 5.2
CROSS-SECTIONAL DESIGN 
5.2.1 DEFINITION
The cross-sectional study, also called the prevalence study or naturalistic sampling, aims to determine the prevalence of risk factors and the prevalence of disease at a point in time (calendar time or an event like birth or death). Disease and exposure are ascertained simultaneously. A cross-sectional study can be descriptive or analytic or both. It may be done once or may be repeated. Individual-based studies collect information on individuals. Group-based (ecologic) studies collect aggregate information about groups of individuals. Cross-sectional studies are used in community diagnosis, preliminary study of disease etiology, assessment of health status, disease surveillance, public health planning, and program evaluation. Cross-sectional studies have the advantages of simplicity and rapid execution, so they provide rapid answers. Their disadvantages are: inability to study etiology because the time sequence between exposure and outcome is unknown, inability to study diseases with low prevalence, high respondent bias, poor documentation of confounding factors, and over-representation of diseases of long duration.
5.2.2 DESIGN AND DATA COLLECTION
The study may be based on the whole population or a sample. It may be based on individual sampling units or groups of individuals. The study sample is divided into 4 groups: a = exposed cases, b = unexposed cases, c = exposed noncases, and d = unexposed noncases. The total sample size is n = a + b + c + d; n is the only quantity fixed before data collection. The marginal totals are n1 = a + c (exposed), n0 = b + d (unexposed), m1 = a + b (cases), and m0 = c + d (noncases). None of the marginal totals is fixed. Sampling methods can be simple random sampling, cluster sampling, systematic sampling, and multi-stage sampling. Sample size is determined using specific formulas. Cases are identified from clinical examinations, interviews, or clinical records. Data is collected by clinical examination, questionnaires, personal interview, and review of clinical records.
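Two of the sampling methods listed above, simple random sampling and systematic sampling, can be sketched as follows. This is an illustration only, assuming the sampling frame is available as a list of unit identifiers; the function names and the frame of 100 numbered units are hypothetical.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement, each unit equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(frame, n, seed=None):
    """Take every k-th unit after a random start, with k = len(frame) // n."""
    k = len(frame) // n
    if k < 1:
        raise ValueError("sample size exceeds the size of the sampling frame")
    start = random.Random(seed).randrange(k)
    return [frame[start + i * k] for i in range(n)]


# Hypothetical sampling frame of 100 numbered units
frame = list(range(1, 101))
print(simple_random_sample(frame, 10, seed=1))
print(systematic_sample(frame, 10, seed=1))
```

The seed is fixed here only to make the example repeatable; in a real study the random start and selection would be documented in the protocol.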
5.2.3 STATISTICAL PARAMETERS
The following descriptive statistics can be computed from a cross-sectional study: mean, standard deviation, median, percentile, quartiles, ratios, proportions, the prevalence of the risk factor, n1/n, and the prevalence of the disease, m1/n. The following analytic statistics can be computed: correlation coefficient, regression coefficient, odds ratio, and rate difference. The prevalence difference is computed as p1 - p0 = a/n1 - b/n0. The prevalence ratio is computed as p1/p0 = (a/n1) / (b/n0). The prevalence odds ratio is computed as POR = {p1/(1 - p1)} / {p0/(1 - p0)}.
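The prevalence measures above can be worked through in a short sketch. It assumes the cell labels a = exposed cases, b = unexposed cases, c = exposed noncases, and d = unexposed noncases, so that the exposed total is n1 = a + c and the unexposed total is n0 = b + d; the function name and the counts are hypothetical.

```python
def prevalence_measures(a, b, c, d):
    """Prevalence statistics from a cross-sectional 2x2 table.

    a = exposed cases, b = unexposed cases,
    c = exposed noncases, d = unexposed noncases.
    """
    n1, n0 = a + c, b + d          # exposed and unexposed totals
    n = n1 + n0
    p1, p0 = a / n1, b / n0        # disease prevalence by exposure status
    return {
        "exposure_prevalence": n1 / n,
        "disease_prevalence": (a + b) / n,
        "prevalence_difference": p1 - p0,
        "prevalence_ratio": p1 / p0,
        "prevalence_odds_ratio": (p1 / (1 - p1)) / (p0 / (1 - p0)),
    }


# Hypothetical counts: 30 of 100 exposed and 20 of 200 unexposed are diseased
for name, value in prevalence_measures(30, 20, 70, 180).items():
    print(name, round(value, 3))
```

With these labels the prevalence odds ratio is algebraically the same as the cross-product ad/(bc).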
5.2.4 ECOLOGIC DESIGN 
Ecological studies, exploratory or analytic, study aggregate and not individual information. Groups commonly used are schools, factories, and countries. Exposure is measured as an overall group index. Outcome is measured as rates, proportions, and means. The correlation and regression coefficients are used as effect measures. The advantages of ecological studies are: low cost, convenience, easy analysis, and interpretation. They have several weaknesses. They generate but cannot test hypotheses. They cannot be used in definitive etiological research. They suffer from ecological fallacy (relation at the aggregate is not true at the individual level). They lack data to control for confounding. Data is often inaccurate or incomplete. Collinearity is a common problem. 
5.2.5 HEALTH SURVEYS
Surveys involve more subjects than the usual epidemiological sample and are used for measurement of health and disease, assessment of needs, and assessment of service utilization and care. They may be population or sample surveys. Planning of surveys includes: literature survey, stating objectives, identifying and prioritizing the problem, formulating a hypothesis, defining the population, defining the sampling frame, determining sample size and sampling method, training study personnel, considering logistics (approvals, manpower, materials and equipment, finance, transport, communication, and accommodation), and preparing and pre-testing the study questionnaire. Surveys may be cross-sectional or longitudinal. The household is the usual sampling unit. Sampling may be simple random sampling, systematic sampling, stratified sampling, cluster sampling, or multistage sampling. Existing data may be used or new data may be collected using a questionnaire (postal, telephone, diaries, and interview), physical examinations, direct observation, and laboratory investigations. The structure and contents of the survey report are determined by its potential readers. The report is used to communicate information and also to apply for funding.
5.3.1 BASICS 
The case-control study is popular because of its low cost, rapid results, and flexibility. It uses a small number of subjects. It is used for disease (rare and non-rare) as well as non-disease situations. A case-control study can be exploratory or definitive. The variants of case-control studies are the case-base, the case-cohort, the case-only, and the crossover designs. In the case-base design, cases are all diseased individuals in the population and controls are a random sample of disease-free individuals in the same base population. The case-cohort design is sampling from a cohort (closed or open). The case-only design is used in genetic studies in which the control exposure distribution can be worked out theoretically. The crossover design is used for sporadic exposures. The same individual can serve as a case or as a control several times without any prejudice to the study.
5.3.2 DESIGN AND DATA COLLECTION OF CASE-BASE STUDIES
The marginal totals, a+b and c+d, are fixed by design before data collection; thus prevalence cannot be computed. The source population for cases and controls must be the same. Cases are sourced from clinical records, hospital discharge records, disease registries, data from surveillance programs, employment records, and death certificates. Cases are either all cases of a disease or a sample thereof. Only incident cases (new cases) are selected. Controls must be from the same population base as the cases and must be like cases in everything except having the disease being studied. Information comparability between the case series and the control series must be assured. Hospital, community, neighborhood, friend, dead, and relative controls are used. There is little gain in efficiency beyond a 1:2 case control ratio unless control data is obtained at no cost. Confounding can be prevented or controlled by stratification and matching. Exposure information is obtained from interviews, hospital records, pharmacy records, vital records, disease registries, employment records, environmental data, genetic determinants, biomarkers, physical measurements, and laboratory measurements. A nested case-control study can be carried out within a follow-up study. In this case, blood and other biological specimens collected from the cohort at the start can be analyzed for exposure information when cases of disease appear.
5.3.3 STATISTICAL PARAMETERS
5.3.4 STRENGTHS AND WEAKNESSES
The case-control study design has the following strengths/advantages: computation of the OR as an approximation of the RR, low cost, short duration, and convenience for subjects because they are contacted/interviewed only once. The case-control design has several disadvantages: the RR is approximated and is not measured, Pr(E+/D+) is computed instead of the more informative Pr(D+/E+), rates are not obtained because the marginal totals are artificial and not natural, being fixed by design, the time sequence between exposure and disease outcome is not clear, vulnerability to bias (misclassification, selection, and confounding), inability to study multiple outcomes, lack of precision in evaluating rare exposures, inability to validate historical exposure information, and inability to control for relevant confounding factors.
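The odds ratio referred to above can be computed from the 2x2 table of a case-control study as in the following sketch. It assumes an unmatched design and uses the logit (Woolf) method for the confidence interval, which is one common choice and is not necessarily the method used in the course; the function name and the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio with a logit (Woolf) confidence interval.

    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls (all must be > 0).
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the natural log of the odds ratio
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 40 of 100 cases and 20 of 100 controls were exposed
estimate, low, high = odds_ratio_ci(40, 60, 20, 80)
print(round(estimate, 2), round(low, 2), round(high, 2))
```

The OR approximates the RR well only when the disease is rare in the source population.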
5.3.5 SAMPLE SIZE COMPUTATION
The bigger the sample size, the bigger the power. Since confounding reduces the power of a study, increasing the sample size mitigates the effects of confounding. For best results and ease of analysis, the number of cases should equal the number of controls. In actual practice the supply of cases is limited whereas controls are available in abundance. For a given number of cases, power increases as the number of controls increases; not much marginal increase in power is obtained if the case control ratio is higher than 1:6. Economic considerations play a part in determining the case:control ratio. Specific formulas are used to compute sample size under each of the following situations: unmatched design with equal numbers, unmatched design with unequal numbers, matched (1:1) design, and matched (1:many) design.
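The diminishing gain in power from adding controls can be illustrated with a rough calculation. The sketch below assumes an unmatched design, a two-sided z-test comparing the exposure proportions of cases and controls, and the normal approximation; the function name, the exposure proportions, and the fixed number of 100 cases are hypothetical.

```python
from statistics import NormalDist


def power_case_control(p_case, p_control, n_cases, controls_per_case, alpha=0.05):
    """Approximate power for an unmatched case-control comparison.

    p_case and p_control are the assumed exposure proportions among
    cases and controls; the number of controls is a multiple of n_cases.
    """
    z = NormalDist()
    n_controls = n_cases * controls_per_case
    se = (p_case * (1 - p_case) / n_cases
          + p_control * (1 - p_control) / n_controls) ** 0.5
    return z.cdf(abs(p_case - p_control) / se - z.inv_cdf(1 - alpha / 2))


# Hypothetical example: 100 cases, exposure 35% in cases and 20% in controls
for ratio in (1, 2, 4, 6, 10):
    print(ratio, round(power_case_control(0.35, 0.20, 100, ratio), 3))
```

In this example each additional control per case adds less power than the one before, which is why cost usually decides the ratio.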
UNIT 5.4
5.4.1 DEFINITION
A follow-up study (also called cohort study, incidence study, prospective study, or longitudinal study) compares disease in exposed groups to disease in non-exposed groups after a period of follow-up. It can be prospective (forward), retrospective (backward), or ambispective (both forward and backward) follow-up. In a nested case-control design, a case-control study is carried out within a larger follow-up study. The follow-up cohorts may be closed (fixed cohort) or open (dynamic cohort). Analysis of fixed cohorts is based on the cumulative incidence (CI) and that of open cohorts on the incidence rate (IR).
5.4.2 DESIGN AND DATA COLLECTION
The study population is divided into the exposed and unexposed populations. A sample is taken from the exposed and another sample is taken from the unexposed. Both the exposed and unexposed samples are followed for appearance of disease. The study may include matching (one-to-one or one-to-many), pre and post comparisons, multiple control groups, and stratification. The study cohort is from special exposure groups, such as factory workers, or groups offering special resources, such as health insurance subscribers. Information on exposure can be obtained from the following sources: existing records, interviews/questionnaires, medical examinations, laboratory tests for biomarkers, and testing or evaluation of the environment. The time of occurrence of the outcome must be defined precisely. The ascertainment of the outcome event must be standardized with clear criteria. Follow-up can be achieved by letter, telephone, and surveillance of death certificates and hospitals. Care must be taken to make sure that surveillance, follow-up, and ascertainment for the 2 groups are the same.
In non-random non-response on exposure, the risk ratio is valid but the distribution of exposure in the community is not valid. In non-random non-response on outcome, the odds ratio is valid but the disease incidence rate is not valid. There is a more complex situation when there is non-response on both exposure and outcome. In general random non-response is better than non-random or differential non-response. Loss to follow-up can be related to the outcome, the exposure and to both outcome and exposure. The consequences of loss to follow-up are similar to those of non-response. In cases of regular follow-up, it is assumed that the loss occurred immediately after the last follow-up. If the loss to follow-up is related to an event such as death, it can be assumed that the loss was half-way between the last observation and the death. 
Five types of bias can arise in follow-up studies. Selection bias arises when the sample is not representative of the population. Follow-up bias arises when the loss to follow-up is unequal among the exposed and the unexposed, when disease occurrence leads to loss to follow up, when people may move out of the study area because of the exposure being studied, and when the observation of the two groups is unequal. Information/misclassification bias arises due to measurement error or misdiagnosis. Confounding bias arises usually due to age and smoking because both are associated with many diseases. Post-hoc bias arises when cohort data is used to make observations that were not anticipated before.
5.4.3 STATISTICAL PARAMETERS
Both incidence and risk statistics can be computed. The incidence statistics are the incidence rate and the cumulative incidence. The risk statistics are either the risk difference or the various ratio statistics (risk ratio, the rate ratio, the relative risk, or the odds ratio). 
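The statistics named above can be worked through with a small numerical sketch. The function names and the cohort figures are hypothetical and serve only to show how each measure is formed.

```python
def cumulative_incidence(new_cases, persons_at_risk_at_start):
    """Proportion of an initially disease-free cohort that develops disease."""
    return new_cases / persons_at_risk_at_start


def incidence_rate(new_cases, person_time):
    """New cases per unit of person-time at risk."""
    return new_cases / person_time


# Hypothetical cohort: 1000 exposed and 1000 unexposed persons
ci_exposed = cumulative_incidence(30, 1000)
ci_unexposed = cumulative_incidence(15, 1000)
ir_exposed = incidence_rate(30, 4800)      # 4800 person-years of follow-up
ir_unexposed = incidence_rate(15, 4900)    # 4900 person-years of follow-up

risk_difference = ci_exposed - ci_unexposed
risk_ratio = ci_exposed / ci_unexposed
rate_ratio = ir_exposed / ir_unexposed

print(round(risk_difference, 3), round(risk_ratio, 2), round(rate_ratio, 2))
```

The risk ratio compares cumulative incidences and suits closed cohorts, whereas the rate ratio compares incidence rates and suits open cohorts with unequal follow-up.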
5.4.4 STRENGTHS AND WEAKNESSES
The cohort design has 4 advantages: it gives a true risk ratio based on incidence rates, the time sequence is clear since exposure precedes disease, incidence rates can be determined directly, and several outcomes of the same exposure can be studied simultaneously. It has the following disadvantages: loss of subjects and of interest due to long follow-up, inability to compute the prevalence rate of the risk factor, use of large samples to ensure enough cases of the outcome, and high cost. The cost can be decreased by using existing monitoring/surveillance systems, historical cohorts, general population information instead of studying the unexposed population, and the nested case-control design. Follow-up studies are not suitable for the study of diseases with low incidence.
5.4.5 SAMPLE SIZE COMPUTATION
Two factors are considered: the estimated proportion of the risk factor in the general unexposed population, and the minimum detectable difference in outcome between the exposed and unexposed groups. Sample size computations are usually made assuming 95% confidence intervals.
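A sample size computation of this kind can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the baseline proportion is read as the expected proportion with the outcome in the unexposed group, the test is two-sided at the 5% level (matching 95% confidence), power is set at 80%, and the groups are of equal size. The function name and the example values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p0, difference, alpha=0.05, power=0.80):
    """Subjects needed per group to detect p1 = p0 + difference.

    Uses the usual normal-approximation formula for comparing two
    independent proportions with equal group sizes.
    """
    p1 = p0 + difference
    p_bar = (p0 + p1) / 2
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return ceil(numerator / difference ** 2)


# Hypothetical example: 10% outcome in the unexposed, detect a rise to 20%
print(sample_size_per_group(0.10, 0.10))
# Halving the detectable difference requires a much larger sample
print(sample_size_per_group(0.10, 0.05))
```

The result does not allow for loss to follow-up, so in practice the computed size is usually inflated by the expected attrition.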
UNIT 5.5
5.5.1 OVERVIEW
A community intervention study targets the whole community and not individuals. It has 3 advantages over individual intervention. It is easier to change the community social environment than to change individual behavior. High-risk lifestyles and behaviors are influenced more by community norms than by individual preferences. Interventions are tested in the actual natural conditions of the community and are cheaper. The Salk vaccine trial carried out in 1954 had 200,000 subjects in the experimental group and a similar number in the control group. The aspirin-myocardial infarction study was a therapeutic intervention that randomized 4524 men to two groups. The intervention group received 1.0 gram of aspirin daily whereas the reference group received a placebo. The Women’s Health Study involved randomization of 40,000 healthy women into two groups to study prevention of cancer and cardiovascular disease. One group received vitamin E and low dose aspirin. The other group received a placebo. The alpha tocopherol and beta carotene cancer prevention trial randomized 29,133 middle-aged men who were cigarette smokers.
B. DESIGN OF A COMMUNITY INTERVENTION STUDY
There are basically 4 different study designs. In a single community design, disease incidence is measured before and after intervention. In a 2-community design, one community receives an intervention whereas another one serves as the control. In a one-to-many design, the intervention community has several control communities. In a many-to-many design, there are multiple intervention communities and multiple control communities. Allocation of a community to either the intervention or the control group is by randomization. Matching and stratification can also be used in more sophisticated designs. The intervention and the assessment of the outcome may involve the whole community or a sample of the community. Outcome measures may be individual level measures or community level measures.
C. COMMUNITY TRIALS: STRENGTHS AND WEAKNESSES
The strength of the community intervention study is that it can evaluate a public health intervention in natural field circumstances. It however suffers from 2 main weaknesses: selection bias and controls getting the intervention. Selection bias is likely to occur when allocation is by community. People in the control community may receive the intervention under study on their own because tight control as occurs in laboratory experimental or animal studies is not possible with humans.
D. PROCEDURE OF THE COMMUNITY TRIAL
Rare phenomena and short follow-up periods require larger sample sizes. The intensity, frequency, and duration of the intervention must be adequate. Very short follow-up leads to insufficient data and too long follow-up has high attrition. All procedures must be identical for both areas. Quantitative criteria are best for end-point assessment. Interviews using questionnaires, physical and biochemical parameters, morbidity, and mortality may be used as end-points. Use of morbidity and mortality as end-points is not the best option because of the existence of many competing causes of mortality and morbidity. Examination for and recording of the end-point must be blind. The assessment of the end-point may be based on longitudinal change or on repeated cross-sectional surveys. The net change is computed as {(I1 - I0) / I0} - {(R1 - R0) / R0} or as {(I1 / I0) / (R1 / R0)} - 1, where I and R are the end-point measures in the intervention and reference communities and the subscripts 0 and 1 denote before and after the intervention.
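The two net change formulas can be compared numerically in a short sketch. It assumes that I and R stand for the end-point measure in the intervention and reference communities and that the subscripts 0 and 1 stand for before and after the intervention; the function names and the rates are hypothetical.

```python
def net_change_difference(i0, i1, r0, r1):
    """Relative change in the intervention area minus that in the reference area."""
    return (i1 - i0) / i0 - (r1 - r0) / r0


def net_change_ratio(i0, i1, r0, r1):
    """Ratio of the after/before ratios of the two areas, minus 1."""
    return (i1 / i0) / (r1 / r0) - 1


# Hypothetical end-point rates per 1000: both areas start at 40;
# the intervention area falls to 30, the reference area falls to 38
print(round(net_change_difference(40, 30, 40, 38), 3))
print(round(net_change_ratio(40, 30, 40, 38), 3))
```

The two forms give similar but not identical answers; the reference area's own change is what separates the intervention effect from the secular trend.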
E. DATA INTERPRETATION
Interpretation of the results may be complicated by secular trends; it is therefore recommended that the study duration be as short as is reasonable. Negative findings could be due to an inadequate intervention effort either not intense enough or not long enough. Negative findings will be found when intervention was against a non-causal factor, the intervention was against a wrong target group, or the sample size was not adequate. End-point assessment may be biased by more diagnostic effort in the intervention group. Clustering must be taken into account in analysis of community intervention data. Community level measures such as means and proportions may be heavily confounded and therefore not reliable.
5.5.2 STUDY DESIGN FOR PHASE 3 RANDOMIZED CLINICAL TRIALS
The study protocol describes the objectives, the background, the sample, the treatments, data collection and analysis, informed consent, regulatory requirements, and drug ordering. Trials may be single center or multi-center, single-stage or multi-stage, factorial, or crossover. The aim of randomization in controlled clinical trials is to make sure that there is no selection bias and that the two series are as alike as possible by randomly balancing confounding factors. Equal allocation in randomization is the most efficient design. Methods of randomization include alternate cases and sealed serially numbered envelopes. Stratified randomization is akin to the block design of experimental studies. Randomization is not successful with small samples and does not always ensure correct conclusions.
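Equal allocation can be maintained throughout recruitment by randomizing in small blocks. The sketch below uses permuted-block randomization, a technique introduced here for illustration and not named in the text; the function name, the arm labels, and the block size of 4 are hypothetical choices. Running it separately within each stratum would give a simple form of stratified randomization.

```python
import random


def permuted_block_allocation(n_subjects, block_size=4, arms=("A", "B"), seed=None):
    """Allocate subjects to arms in randomly permuted, balanced blocks."""
    if block_size % len(arms) != 0:
        raise ValueError("block size must be a multiple of the number of arms")
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_subjects:
        # Each block holds every arm equally often, in random order
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_subjects]


# Hypothetical example: allocation list for 20 consecutive patients
schedule = permuted_block_allocation(20, block_size=4, seed=42)
print(schedule)
print(schedule.count("A"), schedule.count("B"))
```

The fixed seed is used only so that the example is repeatable; in a trial the list would be generated and concealed by someone not involved in recruitment.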
5.5.3 DATA COLLECTION IN RANDOMIZED CLINICAL TRIALS
Data collected is on patients (eg weight), tumors (eg TNM staging), tumor markers (eg AFP), response to treatment (complete response, partial response, no response, disease progression, no evidence of disease, recurrence), survival (disease-free survival, time to recurrence, survival until death), adverse effects (type of toxicity, severity, onset, duration), and quality of life (clinical observation, clinical interview, self report by patient). Case report forms must have a logical order, be clear and unambiguous, minimize text, have self-explanatory questions, and ensure that every question is answered. In single blinding the patient does not know which treatment is being given. In double blinding neither the patient nor the investigator assessing the outcome knows the treatment allocation. The trial is stopped when there is evidence of a difference or when there is risk to the treatment group. Quality control involves measures to ensure that information is not lost. Institutional differences in reporting and patient management must be analyzed and eliminated if possible. A review panel may be used, or inter-observer rating carried out, to assure data consistency.
5.5.4 ANALYSIS AND INTERPRETATION IN RANDOMIZED CLINICAL TRIALS
Comparison of response proportions is by the chi-square test, the exact test, and the chi-square test for trend. Survival curves are drawn by the Kaplan-Meier (K-M) and life-table methods. Survival and remission are compared by the Wilcoxon and log-rank tests. Prognostic factors of response, remission, duration, and survival times are investigated using Cox’s proportional hazards regression model. Meta-analysis combines data from several related clinical trials. Differences between the treatment and control groups are due to sampling variation/chance, inherent differences not controlled by randomization, unequal evaluation not controlled by double-blinding, true effects of the treatment, and non-compliance. Problems in trials are incomplete patient accounting, removing 'bad' cases from series, failure to censor the dead, removing cases due to ‘competing causes of death’, analysis before study maturation, misuse of the ‘p-value’, lack of proper statistical questions and conclusions, lack of proper substantive questions and conclusions, use of partial data, use of inappropriate formulas, errors in measuring response, and censoring of various types.
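The product-limit calculation behind a Kaplan-Meier survival curve can be sketched as follows. This is a minimal illustration without confidence limits, assuming right censoring only and the usual convention that subjects censored at a given time are still counted as at risk at that time; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate at each distinct event time.

    times  : follow-up time for each subject
    events : True if the end-point was observed, False if censored
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up in months; False marks a censored observation
months = [2, 3, 3, 5, 8, 9]
observed = [True, True, False, True, False, True]
for t, s in kaplan_meier(months, observed):
    print(t, round(s, 3))
```

Censored subjects contribute to the number at risk up to their last observation but never count as events, which is how the method uses incomplete follow-up.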