
180502P - INTRODUCTION TO RESEARCH


Presentation at the Ministry of Health, Riyadh, 03 May 2018, by Professor Omar Hasan Kasule Sr. MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Chairman of IRB, King Fahad Medical City


LECTURE 1: BASIC CONCEPTS

EMPIRICISM 

Epidemiological methodology, following the scientific method, is empirical.

Epidemiology relies on and respects only empirical findings.

Empiricism refers to reliance on physical proof.


INDUCTIVE VS DEDUCTIVE INFERENCE 

Epidemiological methodology, following the scientific method, is inductive.

Inductive inference proceeds from the specific to the general.

Induction builds a theory from several individual observations.

Deductive inference proceeds from the general to the specific.


RELATIVITY VS. ABSOLUTISM 

Nothing is absolute; everything is relative.

Science is not deterministic or absolute.

Some sciences are more deterministic than others, for example laboratory data vs. epidemiological data.


CLASSICAL VS BAYESIAN INFERENCES

Classical inference depends only on the data collected at the moment. It assumes starting the experiment with a clean slate.

Bayesian inference combines prior information (objective, subjective, or a belief) with new information (from experimentation) to reach a conclusion 

Bayesian inference is a good representation of how conclusions are made from empirical observation in real life 
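
A minimal numerical sketch of such an update, in Python with hypothetical values for prevalence and test accuracy, is:

# Minimal sketch of Bayesian updating with hypothetical numbers:
# a prior disease prevalence is combined with the likelihood of a
# positive test result to give a posterior probability of disease.

prior = 0.01          # prior: assumed disease prevalence (1%)
sensitivity = 0.90    # P(test positive | disease)
specificity = 0.95    # P(test negative | no disease)

# total probability of a positive test (the "evidence")
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes' theorem: posterior = likelihood x prior / evidence
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive test) = {posterior:.3f}")  # about 0.154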


STATISTICAL VS SUBSTANTIVE QUESTIONS AND CONCLUSIONS 

An investigator starts with a substantive question that is formulated as a statistical question.

Data is then collected and is analyzed to reach a statistical conclusion. 

The statistical conclusion is used with other knowledge to reach a substantive conclusion. 

Statistics has a limitation: it gives statistical and not substantive answers.

The statistical conclusion refers to groups and not individuals. 

The statistical conclusion summarizes but does not interpret data. 


LECTURE 2: VARIABLES

QUALITATIVE RANDOM VARIABLES

Qualitative variables (nominal, ordinal, and ranked) are attribute or categorical with no intrinsic numerical value.

The nominal has no ordering, the ordinal has ordering, and the ranked has observations arrayed in ascending or descending order of magnitude.


QUANTITATIVE (NUMERICAL) DISCRETE RANDOM VARIABLES

The discrete random variables are based on counting.

There are several ways of counting, each giving rise to a different discrete variable.

The commonest ways of counting give rise to: the Bernoulli, the binomial, the multinomial, the negative binomial, the Poisson, the geometric, the hypergeometric, and the uniform.
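
As a sketch (assuming scipy is available; the parameters are hypothetical), probabilities from some of these distributions can be computed directly:

# Hypothetical parameters; scipy.stats provides the pmf of each
# discrete distribution named above.
from scipy.stats import binom, poisson

# Binomial: number of successes in n = 10 trials, each with p = 0.3.
print(binom.pmf(3, n=10, p=0.3))   # P(X = 3)

# Poisson: count of events when the mean rate is mu = 2 per period.
print(poisson.pmf(3, mu=2))        # P(X = 3)

# The Bernoulli is the n = 1 special case of the binomial.
print(binom.pmf(1, n=1, p=0.3))    # P(X = 1) = 0.3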


QUANTITATIVE (NUMERICAL) CONTINUOUS RANDOM VARIABLES 

The continuous random variables are based on measurement. There are 3 common ones: the normal, the exponential, and the uniform. 

The normal represents the result of a measurement on the continuous numerical scale such as height and weight.

The exponential is the time until the first occurrence of the event of interest. 

The uniform represents the result of a measurement whose possible values are all equally likely over a fixed interval.


[Figures: density curves of the normal, exponential, and uniform distributions]


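As a sketch (assuming scipy; the parameters are illustrative), the three densities can be evaluated numerically:

import numpy as np
from scipy.stats import norm, expon, uniform

x = np.array([0.5, 1.0, 2.0])
print(norm.pdf(x, loc=0, scale=1))     # normal with mean 0, sd 1
print(expon.pdf(x, scale=2))           # exponential with mean 2
print(uniform.pdf(x, loc=0, scale=3))  # uniform on [0, 3]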

LECTURE 3: SIX PROPERTIES OF RANDOM VARIABLES AS A BASIS FOR DESCRIPTIVE STATISTICS:


THE SIX (6) PROPERTIES

1. EXPECTATION: The expectation of a random variable is a central value around which it hovers most of the time.

2. VARIANCE: The variations of the random variable around the expectation are measured by its variance.

3. COVARIANCE: Covariance measures the co-variability of the two random variables. 

4. CORRELATION: Correlation measures the linear relation between two random variables. 

5. SKEWNESS: Skewness measures the asymmetry of the distribution of the random variable about its center.

6. KURTOSIS: Kurtosis measures how peaked the distribution of the random variable is at the point of its expectation.
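
A short sketch (assuming numpy and scipy, with simulated data) computing all six properties:

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)   # y is linearly related to x

print(np.mean(x))                # 1. expectation (sample mean)
print(np.var(x, ddof=1))         # 2. variance
print(np.cov(x, y)[0, 1])        # 3. covariance of x and y
print(np.corrcoef(x, y)[0, 1])   # 4. correlation
print(skew(x))                   # 5. skewness
print(kurtosis(x))               # 6. (excess) kurtosis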


[Figures: normal distribution showing expectation; normal distribution showing variance; scatter plot showing correlation; distributions showing positive and negative skew; distribution showing kurtosis]

LECTURE 4: STATISTICAL PROPERTIES OF DISEASE MEASURES- DISCRETE DATA


RATES 

A rate is the number of events in a given population over a defined time period and has 3 components: a numerator, a denominator, and time.

The numerator is included in the denominator. The incidence rate of disease is defined as a / {(a + b) × t}, where a = number of new cases, b = number free of disease at the start of the time interval, and t = duration of the time of observation.

Types of rates: crude, specific, and standardized.


HAZARDS

A hazard is defined as the number of events at time t among those who survive until time t.

Hazard can also be defined as relative hazard with respect to a specific risk factor. At a specific point in time, relative hazard expresses the hazard among the exposed compared to the hazard among the non-exposed. 


RATIOS

A ratio is generally defined as a : b, where a = number of cases of a disease and b = number without the disease.

Examples of ratios are: the proportional mortality ratio, the maternal mortality ratio, and the fetal death ratio. 


PROPORTIONS (aka prevalence)

A proportion is the number of events expressed as a fraction of the total population at risk without a time dimension.

The formula of a proportion is a/(a+b) and the numerator is part of the denominator. 
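
A worked sketch of the rate and proportion formulas with hypothetical counts:

a = 30     # new cases during the interval
b = 970    # disease-free at the start of the interval
t = 2.0    # years of observation

incidence_rate = a / ((a + b) * t)   # numerator, denominator, and time
proportion = a / (a + b)             # no time dimension
print(incidence_rate)   # 0.015 cases per person-year
print(proportion)       # 0.03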



LECTURE 5: STATISTICAL PROPERTIES OF DISEASE MEASURES- CONTINUOUS DATA:


MEANS 

The arithmetic mean is the sum of the observations' values divided by the total number of observations and reflects the impact of all observations. 


MODE 

The mode is the value of the most frequent observation. It is rarely used in science and its mathematical properties have not been explored.

It is intuitive, easy to compute, and is the only average suitable for nominal data. 

It is not a unique average; one data set can have more than one mode.


MEDIAN

The median is the value of the middle observation in a series ordered by magnitude.

It is intuitive and is best used for erratically spaced or heavily skewed data.

The median can be computed even if the extreme values are unknown, as in open-ended distributions.
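
A sketch of the three averages using Python's statistics module; the right-skewed data are hypothetical:

import statistics

data = [2, 3, 3, 5, 8, 13, 40]       # skewed by the extreme value 40
print(statistics.mean(data))          # about 10.6, pulled up by 40
print(statistics.median(data))        # 5, the middle observation
print(statistics.multimode(data))     # [3], the most frequent value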


LECTURE 6: INFERENTIAL STATISTICS


HYPOTHESES AND THE SCIENTIFIC METHOD 

The scientific method consists of hypothesis formulation, experimentation to test the hypothesis, and drawing conclusions.

Hypotheses are statements of prior belief. They are modified by results of experiments to give rise to new hypotheses. The new hypotheses then in turn become the basis for new experiments. 


NULL HYPOTHESIS (H0) & ALTERNATIVE HYPOTHESIS (HA): 

The null hypothesis, H0, states that there is no difference between the two comparison groups and that any apparent difference is due to sampling error.

The alternative hypothesis, HA, disagrees with the null hypothesis. H0 and HA are complementary and exhaustive.

Together they cover all the possibilities.

A hypothesis can be rejected but cannot be proved. 


HYPOTHESIS TESTING USING P-VALUES

The p-value can be defined in a commonsense way as the probability of rejecting a true hypothesis by mistake.

P-values for large samples that are normally distributed are derived from 4 test statistics computed from the data: t, F, χ², and β.

P-values for small samples that are not normally distributed are computed directly from the data using exact methods based on the binomial distribution.

The decision rules are: if p < 0.05, H0 is rejected (the test is statistically significant); if p ≥ 0.05, H0 is not rejected (the test is not statistically significant).
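
A sketch of this decision rule using a two-sample t test from scipy on simulated data:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, size=30)
group_b = rng.normal(loc=1.0, size=30)

t_stat, p = ttest_ind(group_a, group_b)
if p < 0.05:
    print(f"p = {p:.4f}: reject H0 (statistically significant)")
else:
    print(f"p = {p:.4f}: do not reject H0 (not statistically significant)")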


CONCLUSIONS and INTERPRETATIONS

A statistically significant test implies that the following are true: H0 is false, H0 is rejected, the observations are not compatible with H0, and the observations are real/true biological phenomena.

A statistically non-significant test implies that the following are true: H0 is not false (we do not say true), H0 is not rejected, and the observations are artificial or apparent rather than real biological phenomena.

Statistical significance may have no clinical/practical significance/importance. 


LECTURE 7: DISCRETE DATA ANALYSIS


SIMPLE ANALYSIS OF PROPORTIONS 

Inference on discrete data is based on the binomial/multinomial distribution.

It uses an approximate method (the chi-square test) for large samples and an exact method (Fisher's exact test) for small samples.

Approximate methods are accurate for large samples and inaccurate for small samples. There is nothing to prevent exact methods from being used for large samples.
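
A sketch applying both methods to a hypothetical 2x2 table with scipy:

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[12, 5],
                  [8, 15]])

chi2, p_approx, dof, expected = chi2_contingency(table)  # large samples
odds_ratio, p_exact = fisher_exact(table)                # small samples
print(p_approx, p_exact)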


LECTURE 8: CONTINUOUS DATA ANALYSIS:


OVERVIEW OF PARAMETRIC ANALYSIS 

Inference on numeric continuous data is based on the comparison of sample means. Two test statistics are commonly used: t- and F-statistics. 

The t and F are used for small or moderate samples. 

The t-statistic is used to compare 2 samples.

The F statistic is used to compare 3 or more samples.
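
A sketch of both tests on simulated samples (assuming scipy):

import numpy as np
from scipy.stats import ttest_ind, f_oneway

rng = np.random.default_rng(2)
g1 = rng.normal(0.0, 1.0, 25)
g2 = rng.normal(0.5, 1.0, 25)
g3 = rng.normal(1.0, 1.0, 25)

print(ttest_ind(g1, g2))      # t statistic: compares 2 samples
print(f_oneway(g1, g2, g3))   # F statistic: compares 3 or more samples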


LECTURE 9: MIXED DATA ANALYSIS (DISCRETE AND CONTINUOUS) CORRELATION: 


CORRELATION

Correlation analysis is used as preliminary data analysis before applying more sophisticated methods. 

Correlation indicates only association; the association is not necessarily causative. It measures linear relation and not variability.

The first step in correlation analysis is to inspect a scatter plot of the data to obtain a visual impression of the data layout and identify outliers.

Pearson's coefficient of correlation (product-moment correlation), r, is then computed; it is the commonest statistic for linear correlation.
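
A sketch of these two steps on simulated data (assuming scipy; the plotting step is indicated in a comment):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)   # a roughly linear relation

# In practice the scatter plot is inspected first, e.g. with
# matplotlib: plt.scatter(x, y), to check linearity and outliers.
r, p = pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3g}")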


LECTURE 10: MIXED DATA ANALYSIS (DISCRETE AND CONTINUOUS): REGRESSION


LINEAR REGRESSION

The simple linear regression equation is y = a + bx, where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable.
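
A sketch fitting this equation by least squares with numpy; the data are simulated, with assumed true values a = 3.0 and b = 1.5:

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 40)
y = 3.0 + 1.5 * x + rng.normal(0, 1, 40)   # y = a + b*x plus noise

b, a = np.polyfit(x, y, deg=1)   # polyfit returns slope, then intercept
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")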


LOGISTIC REGRESSION

Logistic regression is non-linear regression with y dichotomous/binary such that logit(y) = a + b1x1 + b2x2 + ... + bnxn.

Logistic regression is used in epidemiology because of the dichotomized outcome variable and the direct derivation of the odds ratio from the regression coefficient, as shown in the formula OR = e^β.

Multiple logistic regression is used for matched analysis, stratified analysis to control for confounders, and prediction.
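
A sketch (assuming statsmodels, with simulated data) fitting a logistic regression and deriving the odds ratio as OR = e^β:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.5, 500)                    # binary exposure
logit_p = -1.0 + 0.8 * x                         # assumed true beta = 0.8
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # binary outcome

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
beta = model.params[1]
print(f"beta = {beta:.2f}, OR = e^beta = {np.exp(beta):.2f}")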


LECTURE 11: FIELD EPIDEMIOLOGY


SAMPLE SIZE DETERMINATION 

The size of the sample depends on the hypothesis, the budget, the study duration, and the precision required.

If the sample is too small the study will lack sufficient power to answer the study question.

A sample bigger than necessary is a waste of resources. 

Power is the ability to detect a difference. The bigger the sample size, the more powerful the study.

Beyond an optimal sample size, the increase in power does not justify the cost of a larger sample. There are procedures, formulas, and computer programs for determining sample sizes for different study designs.
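
One such formula, for comparing two proportions with a two-sided alpha of 0.05 and power of 0.80, is sketched below (assuming scipy; the input proportions are hypothetical):

from math import ceil
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    # standard normal quantiles for the significance level and the power
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

print(n_per_group(0.10, 0.20))   # about 197 subjects per group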


SOURCES OF SECONDARY DATA

Secondary data is from decennial censuses, vital statistics, routinely collected data, epidemiological studies, and special health surveys. Census data is reliable. It is wide in scope, covering demographic, social, economic, and health information. Vital events are births, deaths, marriages and divorces, and some disease conditions.

Routinely collected data are cheap but may be unavailable or incomplete. They are obtained from medical facilities, life and health insurance companies, institutions (like prisons, army, schools), disease registries, and administrative records. 


PRIMARY DATA COLLECTION BY QUESTIONNAIRE

Questionnaire design involves content, wording of questions, format and layout.

The reliability and validity of the questionnaire as well as practical logistics should be tested during the pilot study.

Informed consent and confidentiality must be respected.

A protocol sets out data collection procedures. 

Questionnaire administration by face-to-face interview is the best but is expensive. Questionnaire administration by telephone is cheaper. 

Questionnaire administration by mail is very cheap but has a lower response rate. 

Computer-administered questionnaire is associated with more honest responses.


DATA MANAGEMENT AND DATA ANALYSIS 

Self-coding or pre-coded questionnaires are preferable.

Data editing is the process of correcting data collection and data entry errors. It identifies and corrects errors such as invalid or inconsistent values.

Data analysis consists of data summarization, estimation and interpretation.

Descriptive statistics are used to detect errors, ascertain the normality of the data, and know the size of cells.

The tests for association are the t, chi-square, linear correlation, and logistic regression tests or coefficients.

The common effect measures are the odds ratio, the risk ratio, and the rate difference.
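
A sketch computing the three effect measures from a hypothetical 2x2 table (a, b = exposed; c, d = unexposed):

a, b = 20, 80   # exposed: diseased, not diseased
c, d = 10, 90   # unexposed: diseased, not diseased

odds_ratio = (a * d) / (b * c)
risk_ratio = (a / (a + b)) / (c / (c + d))
rate_difference = a / (a + b) - c / (c + d)
print(odds_ratio, risk_ratio, rate_difference)   # 2.25, 2.0, 0.1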


LECTURE 12: CROSS SECTIONAL STUDY DESIGN

 

DEFINITION

The cross-sectional study has the objective of determination of prevalence of risk factors and prevalence of disease at a point in time (calendar time or an event like birth or death).

Disease and exposure are ascertained simultaneously.

Cross-sectional studies have the advantages of simplicity and rapid execution, providing rapid answers.

The disadvantage of cross-sectional studies is the inability to study etiology, because the time sequence between exposure and outcome is unknown.


2x2 CONTINGENCY TABLE FOR A CROSS-SECTIONAL STUDY

 
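The conventional layout, using the a, b, c, d cell notation of the formulas above, is:

                Disease present    Disease absent
Exposed               a                  b
Not exposed           c                  d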


HEALTH SURVEYS

Surveys involve more subjects than the usual epidemiological sample and are used for measurement of health and disease, assessment of needs, and assessment of service utilization and care.

Surveys may be cross sectional or longitudinal.

The household is the usual sampling unit.


LECTURE 13: CASE CONTROL STUDY DESIGN


BASICS

The case-control study is popular because of its low cost, rapid results, and flexibility. It uses a small number of subjects. It is used for disease (rare and non-rare) as well as non-disease situations.

Controls must be from the same population base as the cases and must be like cases in everything except having the disease being studied. 

Information comparability between the case series and the control series must be assured. 


2x2 CONTINGENCY TABLE OF A CASE CONTROL STUDY



STRENGTHS OF A CASE CONTROL STUDY DESIGN

Low cost,

Short duration,

Convenience for subjects because they are contacted/interviewed only once.


WEAKNESSES OF THE CASE CONTROL STUDY DESIGN

The time sequence between exposure and disease outcome is not clear,

Vulnerability to bias (misclassification, selection, and confounding),

Inability to study multiple outcomes.


LECTURE 14: FOLLOW UP STUDY DESIGN


DEFINITION

A follow up study (also called a cohort study, incidence study, prospective study, or longitudinal study) compares disease in exposed to disease in non-exposed groups after a period of follow-up.

It can be prospective (forward), retrospective (backward), or ambispective (both forward and backward) follow-up.


DESIGN and DATA COLLECTION

The study population is divided into the exposed and unexposed populations.

A sample is taken from the exposed and another sample is taken from the unexposed.

Both the exposed and unexposed samples are followed for appearance of disease.


STRENGTHS OF THE FOLLOW UP DESIGN

The time sequence is clear since exposure precedes disease,

Several outcomes of the same exposure can be studied simultaneously.


WEAKNESSES OF THE FOLLOW UP STUDY DESIGN

Loss of subjects and of interest due to the long follow-up,

Use of large samples to ensure enough cases of outcome,

High cost.

Not suitable for the study of diseases with low incidence.


LECTURE 15: RANDOMIZED STUDY DESIGN: COMMUNITY TRIAL


OVERVIEW

A community intervention study targets the whole community and not individuals.

It has 3 advantages over individual intervention: (a) it is easier to change the community social environment than to change individual behavior; (b) high-risk lifestyles and behaviors are influenced more by community norms than by individual preferences; (c) interventions are tested under the actual natural conditions of the community, and they are cheaper.

Outcome measures may be individual level measures or community level measures.


DESIGNS OF A COMMUNITY INTERVENTION STUDY

In a single community design, disease incidence is measured before and after intervention. 

In a 2-community design, one community receives an intervention whereas another one serves as the control. 


STRENGTHS and WEAKNESSES OF THE COMMUNITY RANDOMIZED STUDY DESIGN

Strength: it can evaluate a public health intervention in natural field circumstances. 

Weakness: selection bias

Weakness: controls getting the intervention. 


LECTURE 16: RANDOMIZED STUDY DESIGN: CLINICAL TRIAL


STUDY DESIGN FOR PHASE 3 RANDOMIZED CLINICAL TRIALS

The study protocol describes the objectives, the background, the sample, the treatments, data collection and analysis, informed consent, regulatory requirements, and drug ordering.

Trials may be single-center or multi-center, single-stage or multi-stage, factorial, or crossover.

The aim of randomization in controlled clinical trials is to make sure that there is no selection bias and that the two series are as alike as possible by randomly balancing confounding factors.


DATA COLLECTION IN RANDOMIZED CLINICAL TRIALS

Case report form design must have a logical order, be clear and unambiguous, minimize text, have self-explanatory questions, and ensure that every question is answered.

In single blinding the diagnosis is known but the treatment is not. In double blinding both the treatment and the diagnosis are unknown. 


LECTURE 17: STUDY ANALYSIS AND INTERPRETATION: MEASURES OF ASSOCIATION and EFFECT:

GENERAL CONCEPTS

Data analysis involves construction of hypotheses and testing them.

Simple manual inspection of the data can help identify outliers, assess the normality of the data, identify commonsense relationships, and alert the investigator to errors in computer analysis.

Two procedures are employed in analytic epidemiology: test for association and measures of effect. The test for association is done first. The assessment of the effect measures is done after finding an association. Measures of effect are applied to discrete data.

Measures of trend can discover relationships that are too small to be picked up by association and effect measures.


TESTS OF ASSOCIATION FOR DISCRETE DATA

The Pearson chi-square test is used to test the association of 2 or more proportions in contingency tables.

The exact test is used to test proportions for small sample sizes. 


MEASURES OF EFFECT

The Mantel-Haenszel odds ratio is used for 2 proportions in a single or a stratified 2x2 contingency table.

Logistic regression can be used as an alternative to the MH procedure. 
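
A sketch computing the Mantel-Haenszel pooled odds ratio directly from its formula, sum(a_i*d_i/n_i) / sum(b_i*c_i/n_i), on hypothetical stratified tables:

# Each stratum is a 2x2 table [[a, b], [c, d]]; the values are hypothetical.
strata = [
    [[15, 35], [10, 40]],   # stratum 1
    [[8, 22], [5, 25]],     # stratum 2
]

num = den = 0.0
for (a, b), (c, d) in strata:
    n = a + b + c + d
    num += a * d / n
    den += b * c / n

print(f"MH odds ratio = {num / den:.2f}")   # 1.75 for these tables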


META ANALYSIS

Meta-analysis refers to methods used to combine data from more than one study to produce a quantitative summary statistic. 

Meta-analysis enables computation of an effect estimate for a larger number of study subjects thus enabling picking up statistical significance that would be missed if analysis were based on small individual studies. 
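
A sketch of one common approach, fixed-effect inverse-variance pooling of log odds ratios; the study values are hypothetical:

import math

# (odds ratio, variance of its log) for each hypothetical study
studies = [(1.8, 0.10), (1.4, 0.05), (2.1, 0.20)]

weights = [1 / v for _, v in studies]
pooled_log_or = sum(
    w * math.log(or_) for (or_, _), w in zip(studies, weights)
) / sum(weights)
print(f"pooled OR = {math.exp(pooled_log_or):.2f}")   # about 1.59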


LECTURE 18: STUDY ANALYSIS AND INTERPRETATION: SOURCES AND TREATMENT OF BIAS:


MISCLASSIFICATION BIAS

Misclassification is inaccurate assignment of exposure or disease status. It may be random or non-random.

Misclassification bias is classified as information bias, detection bias, and protopathic bias.


SELECTION BIAS

Selection bias arises when subjects included in the study differ in a systematic way from those not included. 

Selection bias due to disease ascertainment procedures includes publicity, exposure, diagnostic, detection, referral, self-selection, and Berkson biases. 

Self-selection bias includes the healthy worker effect, which arises because sick people are not employed or are dismissed.

The Berkson fallacy arises from differential admission of some cases to hospital in such proportions that hospital-based studies give a wrong picture of disease-exposure relations in the community.

Selection bias during data collection is represented by non-response bias and follow-up bias. 

Prevention of selection bias is by avoiding its causes that were mentioned above.  There is no treatment for selection bias once it has occurred. 


CONFOUNDING BIAS

Confounding is the mixing up of effects. Confounding bias arises when the disease-exposure relationship is disturbed by an extraneous factor, the confounding variable, which is related to both disease and exposure but is unequally distributed between the comparison groups.

Prevention of confounding at the design stage, by eliminating the effect of the confounding factor, can be achieved using 4 strategies: pair-matching, stratification, randomization, and restriction.

Confounding can be treated at the analysis stage by various adjustment methods (both non-multivariate and multi-variate). 

Non-multivariate treatment of confounding employs standardization and stratified Mantel-Haenszel analysis. 

Multivariate treatment of confounding employs multivariate adjustment procedures: multiple linear regression, linear discriminant function, and multiple logistic regression.


SURVEY ERROR and SAMPLING BIAS

Total survey error is the sum of the sampling error and three non-sampling errors (measurement error, non-response error, and coverage error). 

Sampling error decreases with increasing sample size. 
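
A small simulation sketch of this point (assuming numpy; the true proportion of 0.3 is hypothetical):

import numpy as np

rng = np.random.default_rng(6)
for n in (25, 100, 400, 1600):
    # 2000 repeated surveys, each estimating the proportion from n subjects
    estimates = rng.binomial(n, 0.3, size=2000) / n
    print(n, estimates.std())   # sampling error falls roughly as 1/sqrt(n)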

Sampling bias, positive or negative, arises when results from the sample are consistently wrong (biased) away from the true population parameter. 

The sources of bias are: incomplete or inappropriate sampling frame, use of a wrong sampling unit, non-response bias, measurement bias, coverage bias, and sampling bias.