Synopsis for use in teaching sessions of the postgraduate course ‘Essentials of Epidemiology in Public Health’ Department of Social and Preventive Medicine, Faculty of Medicine, University Malaya Malaysia July 16th 2009
MODULE OUTLINE
5.1 DISCRETE DATA ANALYSIS
5.1.1 Simple Analysis of Proportions
5.1.2 Stratified and Matched Analysis of Proportions
5.1.3 Exact Analysis of Proportions
5.1.4 Analysis of Rates and Hazards
5.1.5 Analysis of Ratios
5.2 CONTINUOUS DATA ANALYSIS
5.2.1 Overview of Parametric Analysis
5.2.2 Parametric Analysis for 2 Sample Means
5.2.3 Parametric Analysis for 3 or More Sample Means
5.2.4 Overview of Non-Parametric Analysis
5.2.5 Procedures of Non-Parametric Analysis
5.3 CORRELATION
5.3.1 Description
5.3.2 Pearson's Correlation Coefficient, r
5.3.3 Other Correlation Coefficients
5.3.4 The Coefficient of Determination, r2
5.3.5 Non-Parametric Correlation Analysis
5.4 REGRESSION ANALYSIS
5.4.1 Linear Regression
5.4.2 Logistic Regression
5.4.3 Fitting Regression Models
5.4.4 Assessing Regression Models
5.4.5 Alternatives to Regression
5.5 TIME SERIES and SURVIVAL ANALYSIS
5.5.1 Time Series Analysis
5.5.2 Introduction to Survival Analysis
5.5.3 Non-Regression Survival Analysis
5.5.4 Regression Methods for Survival Analysis
5.5.5 Comparing Survival Curves
UNIT 5.1
DISCRETE DATA ANALYSIS
5.1.1 SIMPLE ANALYSIS OF PROPORTIONS
Inference on discrete data is based on the binomial/multinomial distribution. Two approximate methods (the z-statistic and the chi-square) are used for large samples, and one exact method (Fisher's exact method) is used for small samples. Approximate methods are accurate for large samples and inaccurate for small samples; there is nothing to prevent exact methods from being used for large samples. The first steps in the analysis are to ascertain the approximate normality of the data, the equality of variances of the sample proportions being compared, and the adequacy of the sample size. The data is laid out in contingency tables and inspected manually before application of statistical tests. The z and chi-square tests give approximately the same results because the chi-square is the square of z. The z statistic is computed as the difference between the compared proportions expressed in z-score units. The z test is used to compare one proportion against a standard or to compare two proportions. The Pearson chi-square is computed from the observed and expected frequencies of each cell in the contingency table and is in essence a measure of the deviation of the observed from the expected frequencies. It can be used to test 2 or more proportions. Large contingency tables are better partitioned or collapsed before applying the chi-square test.
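As a minimal sketch (assuming the Python packages scipy and statsmodels are installed, and using hypothetical counts), the chi-square and z tests for two proportions can be run as follows; for a 2 x 2 table the chi-square is the square of z, so the two p-values agree:

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical 2 x 2 table: rows = exposed/unexposed, columns = cases/non-cases.
table = np.array([[30, 70],
                  [15, 85]])

# Pearson chi-square: compares observed with expected cell frequencies.
# correction=False so that chi-square equals z squared exactly.
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")

# z test comparing the two proportions 30/100 and 15/100.
z, p_z = proportions_ztest(count=[30, 15], nobs=[100, 100])
print(f"z = {z:.2f}, p = {p_z:.4f}")
```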
5.1.2 STRATIFIED and MATCHED ANALYSIS OF PROPORTIONS
The Mantel-Haenszel chi-square is used to test 2 proportions in stratified data. The McNemar chi-square is used for pair-matched data.
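A minimal sketch of both tests, assuming the statsmodels Python package is installed; the stratified 2 x 2 tables and the paired table are hypothetical:

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable, mcnemar

# Two hypothetical strata (e.g., males and females), each a 2 x 2 table
# of exposure by disease.
strata = [np.array([[20, 80], [10, 90]]),
          np.array([[35, 65], [25, 75]])]
st = StratifiedTable(strata)
mh = st.test_null_odds()               # Mantel-Haenszel chi-square test of OR = 1
print("MH chi-square:", mh.statistic, "p:", mh.pvalue)
print("Pooled odds ratio:", st.oddsratio_pooled)

# McNemar chi-square for pair-matched data; the discordant pairs drive the test.
pairs = np.array([[40, 25],
                  [10, 25]])
print(mcnemar(pairs, exact=False))
```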
5.1.3 EXACT ANALYSIS OF PROPORTIONS
Exact methods are used instead of the chi-square test for small samples (fewer than 20 observations). They can be used for 2 x 2, 2 x k, and r x c contingency tables. They involve direct computation of the p-value using factorials and probability. The p-value is computed as the probability of obtaining results as extreme as, or more extreme than, the observed data.
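A minimal sketch of Fisher's exact test on a small 2 x 2 table, assuming scipy is installed and using hypothetical counts:

```python
from scipy.stats import fisher_exact

# Hypothetical small-sample 2 x 2 table (exposure by disease).
table = [[3, 7],
         [1, 9]]
odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, exact p = {p:.3f}")
```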
5.1.4 ANALYSIS OF RATES and HAZARDS
Methods are available for simple and stratified analysis of incidence rates.
5.1.5 ANALYSIS OF RATIOS
Methods are available for simple and stratified analysis of risk ratios.
UNIT 5.2
CONTINUOUS DATA ANALYSIS
5.2.1 OVERVIEW OF PARAMETRIC ANALYSIS
Inference on numeric continuous data is based on the comparison of sample means. Three test statistics are commonly used: the z, t, and F statistics. The z statistic is used for large samples. The t and F statistics are used for small or moderate samples. The z and t statistics are used to compare 2 samples. The F statistic is used to compare 3 or more samples.
The Student t-test is the most commonly used test statistic for inference on continuous numerical data. It is defined for independent and paired samples. It is robust and can give valid results even if the assumptions of normal distribution and equal variance are not perfectly fulfilled. It is used uniformly for sample sizes below 60 and for larger samples if the population standard deviation is not known. For larger samples there is no distinction between testing based on the z statistic and testing based on the t statistic.
The F-test is a generalized test used in inference on 3 or more sample means in procedures called analysis of variance (ANOVA). Assumptions of independent observations, normal distribution, and equal variances in the samples compared are necessary for validity of all 3 test statistics. If variances are not equal, the data can be transformed and harmonic or weighted means may be used. If sample sizes are not equal, equality can be achieved by randomly discarding some observations. The first step is to ascertain whether the data distribution follows an approximate Gaussian distribution, whether the variances are approximately equal, and whether the sample size is adequate. The formulas for the z, t, and F statistics vary depending on whether the samples are paired or unpaired. They also vary depending on whether the samples have equal or unequal numbers of observations.
5.2.2 PARAMETRIC ANALYSIS FOR 2 SAMPLE MEANS
Simple testing procedures assume single-factor analysis, an approximately normal distribution, equal variances, and equal numbers in each sample. Both the z and t test statistics can be used with either the p-value or the confidence interval approach.
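A minimal sketch of the two approaches for 2 sample means, assuming scipy is installed; the two samples of systolic blood pressure readings are hypothetical:

```python
import numpy as np
from scipy import stats

group_a = np.array([120, 125, 130, 118, 127, 135, 122, 128])
group_b = np.array([132, 138, 129, 140, 135, 131, 137, 133])

# p-value approach: pooled-variance t test for two independent means.
t, p = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t:.2f}, p = {p:.4f}")

# Confidence interval approach: 95% CI for the difference in means,
# built from the pooled standard error and the t distribution.
n1, n2 = len(group_a), len(group_b)
pooled_var = ((n1 - 1) * group_a.var(ddof=1) + (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
diff = group_a.mean() - group_b.mean()
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
print("95% CI for the difference:", (diff - t_crit * se, diff + t_crit * se))
```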
5.2.3 PARAMETRIC ANALYSIS FOR 3 OR MORE SAMPLE MEANS
For the F-test only the p-value approach can be used since the confidence interval approach is inapplicable. One-way ANOVA involves comparison of 3 or more samples on one factor like height or weight. The F test and 1-way ANOVA are 2 names for the same procedure. ANOVA has become less popular because modern regression packages can do everything it used to do. ANOVA can discover an omnibus association. However, carrying out several pair-wise t tests to discover the specific sources of the omnibus association can lead to the problem of multiple comparisons, in which some pair-wise associations may be significant by chance. Multivariate analysis of variance (MANOVA) is used to study 3 or more factors simultaneously. Such analyses are used for randomized block, factorial, Latin square, nested, and cross-over designs.
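A minimal sketch of the omnibus F test (one-way ANOVA) on three hypothetical groups, assuming scipy is installed:

```python
from scipy.stats import f_oneway

# Hypothetical weights (kg) in three treatment groups.
group_1 = [62, 65, 70, 68, 64]
group_2 = [71, 74, 69, 75, 72]
group_3 = [66, 67, 70, 65, 69]

# Omnibus F test of the null hypothesis that all three means are equal.
f_stat, p = f_oneway(group_1, group_2, group_3)
print(f"F = {f_stat:.2f}, p = {p:.4f}")
# A significant omnibus p-value only says that at least one mean differs;
# follow-up pair-wise t tests need a multiple-comparison adjustment
# (e.g., Bonferroni) to avoid chance findings.
```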
5.2.4 OVERVIEW OF NON-PARAMETRIC ANALYSIS FOR CONTINUOUS DATA
Non-parametric methods were first introduced as rough, quick-and-dirty methods and became popular because they are unconstrained by normality assumptions. They are about 95% as efficient as the more complicated and involved parametric methods. They are simple, easy to understand, and easy to use. They can be used for non-Gaussian data or data whose distribution is unknown. They work well for small data sets but not for large data sets. They also cannot be used with complicated experimental designs. Generally non-parametric methods are used where parametric methods are not suitable. Such situations occur when the data fail a test for normality, when the assumptions of the central limit theorem do not apply, and when the distribution of the parent population is not known. Virtually every parametric test has a non-parametric equivalent.
5.2.5 PROCEDURES OF NON-PARAMETRIC ANALYSIS
Specialized computer programs can carry out all the non-parametric tests. The sign test, the signed rank test, and the rank sum test are based on the median. The sign test is used for analysis of 1 sample median. The signed rank test is used for 2 paired sample medians. The rank sum test is used for 2 independent sample medians. The Kruskal-Wallis test is a 1-way test for 3 or more independent sample medians. The Friedman test is a 2-way test for 3 or more related (matched) sample medians. Note that the Mann-Whitney test gives results equivalent to those of the rank sum test. The Kendall rank correlation test gives conclusions similar to those of the Spearman correlation coefficient.
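A minimal sketch of these tests, assuming scipy is installed; the pain scores below are hypothetical (scipy may warn about ties in such small samples):

```python
from scipy import stats

before = [6, 7, 5, 8, 6, 7, 9, 5]          # paired scores before treatment
after  = [4, 6, 4, 5, 4, 6, 7, 4]          # paired scores after treatment
drug_a, drug_b, drug_c = [3, 4, 2, 5], [6, 7, 5, 6], [4, 5, 4, 6]

# Signed rank (Wilcoxon) test for 2 paired sample medians.
print(stats.wilcoxon(before, after))

# Rank sum test (Mann-Whitney U) for 2 independent sample medians.
print(stats.mannwhitneyu(drug_a, drug_b))

# Kruskal-Wallis test for 3 or more independent sample medians.
print(stats.kruskal(drug_a, drug_b, drug_c))

# Friedman test for 3 or more related (matched) sample medians.
print(stats.friedmanchisquare(drug_a, drug_b, drug_c))
```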
UNIT 5.3
CORRELATION ANALYSIS
5.3.1 DESCRIPTION
Correlation analysis is used as preliminary data analysis before applying more sophisticated methods. A correlation matrix is used to explore for pairs of variables likely to be associated. Correlation describes the relation between 2 random variables (a bivariate relation) measured on the same person or object with no prior evidence of inter-dependence. Correlation indicates only association; the association is not necessarily causative. It measures linear relation and not variability. Correlation analysis has the objectives of describing the relation between x and y, predicting y if x is known, predicting x if y is known, studying trends, and studying the effect of a third factor on the relation between x and y. The first step in correlation analysis is to inspect a scatter plot of the data to obtain a visual impression of the data layout and to identify outliers. Pearson's coefficient of correlation (the product-moment correlation), r, is the commonest statistic for linear correlation. It has a complicated formula but can be computed easily by modern computers. It is essentially a measure of the scatter of the data.
5.3.2 PEARSON'S CORRELATION COEFFICIENT, r
The value of the Pearson simple linear correlation coefficient is invariant when a constant is added to the y or x variable or when the x and y variables are multiplied or divided by a constant. The coefficient can be used to compare scatter in 2 data sets measured in different units because it is not affected by the unit of measure. Inspecting a scattergram helps interpret the coefficient. The correlation is not interpretable for small samples. Values of 0.25 - 0.50 indicate a fair degree of association. Values of 0.50 - 0.75 indicate moderate to good relation. Values above 0.75 indicate good to excellent relation. A value of r = 0 indicates either no correlation or that the two variables are related in a non-linear way; in cases of no correlation the scatter plot is circular. In perfect positive correlation, r = 1. In perfect negative correlation, r = -1. Very high correlation coefficients may be due to collinearity or to restriction of the range of x or y and not to an actual biological relationship. The t test is used to test the significance of the coefficient and to compute 95% confidence intervals of the coefficient of correlation. Random measurement errors, selection bias, sample heterogeneity, and non-linear (curvilinear) relations reduce r, whereas differential (non-random) errors increase the correlation. The coefficient will be wrong or misleading for non-linear relations. The linear correlation coefficient is not used when the relation is non-linear, when outliers exist, when the observations are clustered in 2 or 4 groups, or when one of the variables is fixed in advance.
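A minimal sketch of Pearson's r and its t test of significance, assuming scipy is installed; the height and weight pairs are hypothetical:

```python
import numpy as np
from scipy import stats

height = np.array([150, 155, 160, 165, 170, 175, 180, 185])   # cm
weight = np.array([52, 55, 61, 64, 68, 72, 79, 83])           # kg

r, p = stats.pearsonr(height, weight)     # product-moment correlation and p-value
print(f"r = {r:.3f}, p = {p:.4f}")

# The same significance test written out: t = r * sqrt((n - 2) / (1 - r^2)),
# referred to the t distribution with n - 2 degrees of freedom.
n = len(height)
t = r * np.sqrt((n - 2) / (1 - r**2))
print(f"t = {t:.2f}, df = {n - 2}")
```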
5.3.3 OTHER CORRELATION COEFFICIENTS
When the relation between x and y is influenced by a third variable, the coefficient of partial correlation expresses the net relationship. The correlation ratio, used for curvilinear relations, is interpreted as the variability of y accounted for by x. The biserial correlation coefficient is used when one variable is quantitative and the other is dichotomous; the tetrachoric correlation coefficient is used when both variables are dichotomous. The contingency coefficient is used for 2 qualitative nominal (i.e. unordered) variables each of which has 2 or more categories. The coefficient of mean square contingency is used when both variables are qualitative. The multiple correlation coefficient is used to describe the relationship in which a given variable is correlated with several other variables. It describes the strength of the linear relation between y and a set of x variables. It is obtained from the multiple regression function as the positive square root of the coefficient of determination. The partial correlation coefficient denotes the conditional relation between one independent variable and a response variable when all other variables are held constant.
5.3.4 THE COEFFICIENT OF DETERMINATION, r2
The square of the linear correlation coefficient is called the coefficient of determination. It is the proportion of variation in the dependent variable, y, explained by the variation in the independent variable, x.
5.3.5 NON-PARAMETRIC CORRELATION ANALYSIS
The Spearman rank correlation coefficient is used for non-normal data for which the Pearson linear correlation coefficient would be invalid. Its significance is tested using the t test. The advantage of rank correlation is that comparisons can be carried out even if actual values of the observations are not known. It suffices to know the ranks.
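A minimal sketch of the Spearman rank correlation, assuming scipy is installed; the ranks and scores are hypothetical:

```python
from scipy import stats

exam_rank = [1, 2, 3, 4, 5, 6, 7, 8]                 # only the ranks are needed
clinical_score = [14, 18, 15, 22, 25, 21, 30, 28]

rho, p = stats.spearmanr(exam_rank, clinical_score)  # correlation of the ranks
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")
```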
UNIT 5.4
REGRESSION ANALYSIS
5.4.1 LINEAR REGRESSION
Regression to the mean, first described by Francis Galton (1822-1911), is one of the basic laws of nature, sunan Allah fi al-kawn. Parametric regression models are cross-sectional (linear, logistic, or log-linear) or longitudinal (linear and proportional hazards). Regression relates independent with dependent variables. The variables may be raw data, dummy indicator variables, or scores. The simple linear regression equation is y = a + bx, where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable. Its validity is based on 4 assumptions: linearity of the x-y relation, normal distribution of the y variable for any given value of x, homoscedasticity (constant y variance for all x values), and independence of the y values for each value of x. The t test can be used to test the significance of the regression coefficient and to compare the regression coefficients of 2 lines. Multiple linear regression, a form of multivariate analysis, is defined by y = a + b1x1 + b2x2 + … + bnxn. The y variable is interval (continuous), and x can be interval or dichotomous but not ordinal or nominal. Interactive (product) variables can be included in the model. Linear regression is used for prediction (interpolation and extrapolation) and for analysis of variance.
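A minimal sketch of fitting y = a + bx, assuming scipy is installed; the age (years) and systolic blood pressure (mmHg) values are hypothetical:

```python
import numpy as np
from scipy import stats

age = np.array([25, 30, 35, 40, 45, 50, 55, 60])
sbp = np.array([118, 121, 124, 128, 131, 136, 139, 144])

fit = stats.linregress(age, sbp)
print("intercept a =", fit.intercept)
print("slope b =", fit.slope)
print("t-test p-value for b = 0:", fit.pvalue)
print("predicted SBP at age 48:", fit.intercept + fit.slope * 48)   # interpolation
```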
5.4.2 LOGISTIC REGRESSION
Logistic regression is non-linear regression with a dichotomous/binary y such that logit(y) = a + b1x1 + b2x2 + … + bnxn. Logistic regression is used in epidemiology because outcome variables are often dichotomous and because the odds ratio is derived directly from the regression coefficient, as shown in the formula OR = e^b. The significance of the regression coefficient is tested using either the likelihood ratio test or the Wald test. Multiple logistic regression is used for matched analysis, for stratified analysis to control for confounders, and for prediction.
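A minimal sketch of logistic regression and the derivation OR = e^b, assuming statsmodels is installed; the exposure and disease indicators are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

# 1 = exposed / diseased, 0 = unexposed / free of disease.
exposure = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1])
disease  = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1])

X = sm.add_constant(exposure)              # intercept plus one x variable
model = sm.Logit(disease, X).fit(disp=0)   # maximum likelihood fit
print(model.summary())                     # Wald tests for each coefficient
print("odds ratio =", np.exp(model.params[1]))   # OR = e^b
```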
5.4.3 FITTING REGRESSION MODELS
Fitting the simple regression model is straightforward since it has only one independent variable. Fitting the multiple regression model is by step-up, step-down, or step-wise selection of x variables. Step-up or forward selection starts with a minimal set of x variables and adds one x variable at a time. Step-down or backward elimination starts with a full model and eliminates one variable at a time. Step-wise selection is a combination of step-up and step-down selection. Variables are retained or eliminated on the basis of their p-values. Model validation is by using new data, data splitting, the jackknife procedure, and the bootstrap procedure. Mis-specification occurs when a linear relation is assumed for a curvilinear one. Over-specification is including too many unnecessary variables. Extraneous variables cause model over-fit. Omitting important variables causes an under-fit model. Bias due to missing data can be dealt with by deleting incomplete observations, using an indicator variable for missing data, estimating missing values, or collecting additional data.
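As an illustration of step-up selection, a minimal sketch assuming statsmodels, pandas, and numpy are installed; the forward_select helper and the simulated data frame are hypothetical teaching constructs, not a standard library routine:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(df, outcome, candidates, alpha=0.05):
    """Step-up selection: add one x variable at a time, keeping the variable
    with the smallest p-value, until no remaining candidate is significant."""
    selected, remaining = [], list(candidates)
    while remaining:
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            pvals[var] = sm.OLS(df[outcome], X).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical simulated data: y depends on age and bmi but not on noise.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["age", "bmi", "noise"])
df["y"] = 2.0 * df["age"] + 0.5 * df["bmi"] + rng.normal(size=50)
print(forward_select(df, "y", ["age", "bmi", "noise"]))
```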
5.4.4 ASSESSING REGRESSION MODELS
The best model is the one with the highest coefficient of determination or the one for which further additions do not make any significant change in the coefficient. The model is assessed by the following: testing linearity, row diagnostics, column diagnostics, hypothesis testing, residual analysis, impact assessment of individual observations, and the coefficient of determination. Row diagnostics identify the following: outliers, influential observations, unequal variances (heteroscedasticity), and correlated errors. Column diagnostics deal mainly with multicollinearity, that is, correlations among several x variables causing model redundancy and imprecision. Collinear variables should be dropped, leaving only the important one. Hypothesis testing of the omnibus significance of the model uses the F ratio. Hypothesis testing of individual x variables uses the t test. Residuals are defined as the differences between the observed values and the predicted values. A good model fit will have most residuals near zero, and the residual plot will be normal in shape. The impact of specific observations is measured by their leverage or by Cook's distance. The coefficient of determination, r2, ranges from 0 to 1.0 and is a measure of goodness of fit. The fit of the model can be improved by using polynomial functions, linearizing transformations, creating categorical or interaction variables, and dropping outliers.
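A minimal sketch of these diagnostics, assuming statsmodels and numpy are installed; the data are simulated purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=40)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("coefficient of determination r2:", fit.rsquared)
print("omnibus F-test p-value:", fit.f_pvalue)
print("t-test p-values for individual coefficients:", fit.pvalues)

influence = fit.get_influence()
print("first residuals (observed - predicted):", fit.resid[:5])
print("first leverage (hat) values:", influence.hat_matrix_diag[:5])
print("first Cook's distances:", influence.cooks_distance[0][:5])
```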
5.4.5 ALTERNATIVES TO REGRESSION
Methods based on grouping or general linear models (GLIM) are alternatives to regression. Methods based on grouping/classification are principal components analysis, discriminant analysis, factor analysis, and cluster analysis. The general linear model (GLIM), unlike the general regression model, allows for the fact that explanatory variables can be linear combinations of other variables and does not give unique parameter estimates. It works well with continuous as well as categorical variables and has no restrictions on parameters.
UNIT 5.5
TIME SERIES and SURVIVAL ANALYSIS
5.5.1 TIME SERIES ANALYSIS
Longitudinal data is summarized in the following ways: graphical presentation, longitudinal regression, auto-regression, autocorrelation, repeated measures ANOVA, and tests for trend. A time series plot of y against time shows time trends, seasonal patterns, random / irregular patterns, or mixtures of the above. Moving averages may be plotted instead of raw scores for a more stable curve. Time series plots are used for showing trends and forecasting. Longitudinal regression models, additive or multiplicative, can be used to model time-varying data. Auto-regression is a regression model relating a variable to its immediate predecessor. Autocorrelation is correlation between a variable and its lagged version (immediate predecessor). A chi-square test for trend can be constructed for 2 x k contingency tables where k represents time periods. Forecasts can be made using time series.
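A minimal sketch of a moving average and a lag-1 autocorrelation, assuming pandas is installed; the monthly case counts are hypothetical:

```python
import pandas as pd

cases = pd.Series([12, 15, 14, 20, 22, 25, 23, 28, 30, 27, 33, 35],
                  index=pd.period_range("2008-01", periods=12, freq="M"))

# A 3-month moving average gives a more stable curve than the raw counts.
print(cases.rolling(window=3).mean())

# Autocorrelation between the series and its immediate predecessor (lag 1).
print("lag-1 autocorrelation:", cases.autocorr(lag=1))
```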
5.5.2 INTRODUCTION TO SURVIVAL ANALYSIS
Survival analysis is used to study survival duration and the effects of covariates on survival. It uses parametric methods (Weibull, lognormal, or gamma) or non-parametric and semi-parametric methods (the life-table, Kaplan-Meier, and proportional hazards methods). Time is measured as time to relapse, length of remission, remission duration, survival after relapse, time to death, or time to a complication. The best zero time is the point of randomization. Other zero times are: enrolment, the first visit, first symptoms, diagnosis, and start of treatment. Problems of survival analysis are censoring, truncation, and competing causes of death. Censoring is loss of information due to withdrawal from the study, study termination, loss to follow-up, or death due to a competing risk. In left censoring, observation ends before a given point in time. In right censoring, the subject is last seen alive at a given time and is not followed up subsequently. Interval censoring, a mixture of left and right censoring, occurs between two given points in time. Right censoring is more common than left censoring. Random censoring occurs uniformly throughout the study, is not related to outcome, and is not biased. Non-random censoring is due to investigator manipulation and can cause bias. Progressive censoring occurs in studies in which entry and censoring times differ for each subject. Clinical trial analysis based on the intention to treat is more conservative than censored analysis. In left truncation, only individuals who survive beyond a certain time are included in the sample. In right truncation, only individuals who have experienced the event of interest by a given time are included in the sample. Competing causes of death are one cause of censoring that biases survival estimates.
5.5.3 NON-REGRESSION SURVIVAL ANALYSIS
Two non-regression methods are used in survival analysis: the life-table and Kaplan-Meier methods. The life-table method works better with large data sets and when the time of occurrence of an event cannot be measured precisely. It leads to bias by assuming that withdrawals occur at the start of each interval when in reality they occur throughout the interval; this assumption can create bias or imprecision. The Kaplan-Meier method is best used for small data sets in which the time of event occurrence is measured precisely. It is an improvement on the life-table method in the handling of withdrawals and avoids this complication by not fixing the time intervals in advance.
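A minimal sketch of the Kaplan-Meier estimate, assuming the lifelines Python package is installed; the survival times (months) and event indicators (1 = died, 0 = censored) are hypothetical:

```python
from lifelines import KaplanMeierFitter

durations = [5, 8, 12, 12, 15, 20, 22, 30, 34, 40]
events    = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)
print(kmf.survival_function_)        # estimated S(t) at each observed event time
print("median survival:", kmf.median_survival_time_)
```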
5.5.4 REGRESSION METHODS FOR SURVIVAL ANALYSIS
The proportional hazards model, a semi-parametric method proposed by Sir David Cox in 1972, is the most popular regression method for survival analysis. It is used on data whose distribution is unknown.
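A minimal sketch of a Cox proportional hazards fit with one covariate, assuming the lifelines and pandas packages are installed; the small data frame is hypothetical:

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time":  [5, 8, 12, 12, 15, 20, 22, 30, 34, 40],   # months of follow-up
    "event": [1, 1, 0, 1, 1, 0, 1, 0, 1, 0],           # 1 = died, 0 = censored
    "age":   [70, 55, 63, 68, 48, 45, 58, 61, 66, 52],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()    # hazard ratio = exp(coefficient), with p-value
```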
5.5.5 COMPARING SURVIVAL CURVES
The non-parametric methods for comparing 2 survival distributions are: Gehan’s generalized Wilcoxon test, the Cox-Mantel test, the log-rank test, Peto’s generalized Wilcoxon test, the Mantel-Haenszel test, and Cox’s F test. The parametric tests are the likelihood ratio test and Cox’s F test. The log-rank test is more sensitive if the assumptions of proportional hazards hold. The Wilcoxon test is more sensitive to differences between the curves at the earlier failure times. It is less sensitive than the log-rank test for later failure times. It gives more weight to the earlier part of the survival curve. The Mantel-Haenszel test relies on methods of analyzing incidence density ratios. The log-rank test attaches equal importance to all failure times irrespective of whether they are early or late. A modification of the log-rank test by Peto attaches more importance to earlier failure times. Cox’s regression is a semi-parametric method for studying several covariates simultaneously. The log-linear exponential and the linear exponential regression methods are parametric approaches to studying prognostic covariates. Risk factors for death can be identified using linear discriminant functions and the linear logistic regression method.
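A minimal sketch of the log-rank comparison of two survival curves, assuming the lifelines package is installed; both hypothetical arms mix observed deaths and censored times:

```python
from lifelines.statistics import logrank_test

time_a  = [5, 8, 12, 15, 20, 22, 30]
event_a = [1, 1, 1, 1, 0, 1, 0]        # 1 = died, 0 = censored
time_b  = [10, 14, 18, 25, 28, 33, 40]
event_b = [1, 0, 1, 1, 0, 1, 0]

result = logrank_test(time_a, time_b,
                      event_observed_A=event_a, event_observed_B=event_b)
print("log-rank chi-square:", result.test_statistic, "p:", result.p_value)
```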