170717P - PRINCIPLES OF EPIDEMIOLOGY HEALTH RESEARCH COURSE: MIXED DATA ANALYSIS (DISCRETE AND CONTINUOUS) CORRELATION

Presentation at a Course on Principles of Epidemiology Health Research, Faculty of Medicine, King Fahad Medical City, October 11-12, 2017, by Professor Omar Hasan Kasule Sr., MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Chairman of the Institutional Review Board / Research Ethics Committee at King Fahad Medical City, Riyadh.


LECTURE # 10-A: MIXED DATA ANALYSIS (DISCRETE AND CONTINUOUS) CORRELATION


CORRELATION:

Correlation analysis is used as preliminary data analysis before applying more sophisticated methods. 

Correlation indicates only association; the association is not necessarily causative. It measures linear relation and not variability. 


OVERVIEW OF NON-PARAMETRIC ANALYSIS FOR CONTINUOUS DATA:

Non-parametric methods were first introduced as rough, quick and dirty methods and became popular because they are unconstrained by normality assumptions.

They are about 95% as efficient as the more complicated and involved parametric methods. They are simple, easy to understand, and easy to use. 

Generally non-parametric methods are used where parametric methods are not suitable. 


OVERVIEW OF NON-PARAMETRIC ANALYSIS FOR CONTINUOUS DATA: Con’t…

The first step in correlation analysis is to inspect a scatter plot of the data to obtain a visual impression of the data layout and identify outliers.

Pearson’s coefficient of correlation (the product-moment correlation), r, is then computed; it is the commonest statistic for linear correlation.
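A minimal sketch of how r can be computed from paired observations, using only the Python standard library. The function name `pearson_r` and the age / blood pressure figures are hypothetical and serve only as illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: age (years) and systolic blood pressure (mmHg)
age = [25, 35, 45, 55, 65]
sbp = [118, 121, 130, 135, 146]
print(round(pearson_r(age, sbp), 3))
```

A value close to +1 or -1 indicates a strong linear relation; it says nothing about causation.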


OTHER CORRELATION COEFFICIENTS:

When the relation between x and y is influenced by a third variable, the coefficient of partial correlation explains the net relationship.

The correlation ratio, used for curvilinear relations, is interpreted as the variability of y accounted for by x. 


OTHER CORRELATION COEFFICIENTS, Con’t. - 1:

The biserial correlation coefficient is used in linear relations when one variable is quantitative and the other is qualitative (dichotomous). The tetrachoric correlation coefficient is used when both variables are dichotomous.

The contingency coefficient is used for 2 qualitative nominal (i.e. unordered) variables each of which has 2 or more categories. 

The coefficient of mean square contingency is used when both variables are qualitative. 
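Both coefficients for qualitative variables are derived from the chi-square statistic of the contingency table. The sketch below assumes the usual Pearson definitions: mean square contingency = chi-square / n, and contingency coefficient C = sqrt(chi-square / (chi-square + n)). The function name and the 2 x 2 counts are hypothetical.

```python
import math

def contingency_measures(table):
    """Return (chi2, mean_square_contingency, contingency_coefficient)
    for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / n                      # mean square contingency
    c = math.sqrt(chi2 / (chi2 + n))     # contingency coefficient
    return chi2, phi2, c

# Hypothetical counts: rows = exposure (yes/no), columns = disease (yes/no)
chi2, phi2, c = contingency_measures([[30, 10], [10, 30]])
print(chi2, phi2, round(c, 3))
```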


OTHER CORRELATION COEFFICIENTS, Con’t. - 2:

The multiple correlation coefficient is used to describe the relationship in which a given variable is being correlated with several other variables.

The partial correlation coefficient denotes the conditional relation between one independent variable and a response variable if all other variables are held constant. 
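For three variables, the partial correlation of x and y with z held constant can be computed from the three pairwise correlations. The sketch below uses the standard first-order formula; the function name and the input values are hypothetical.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical pairwise correlations: x and y are each correlated with z
print(round(partial_r(r_xy=0.6, r_xz=0.5, r_yz=0.5), 3))
```

When z is uncorrelated with both x and y, the partial correlation equals the ordinary correlation r_xy.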


THE COEFFICIENT OF DETERMINATION, r²:

The square of the linear correlation coefficient is called the coefficient of determination. 

It is the proportion of variation in the dependent variable, y, explained by the variation in the independent variable, x.
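The "proportion of variation explained" reading can be checked numerically: for a least-squares line, 1 - (residual variation / total variation) equals the square of the correlation coefficient. A minimal sketch with hypothetical height / weight data:

```python
def explained_variation(x, y):
    """Fit y = a + b*x by least squares; return (r2_from_fit, r2_from_r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)   # total variation in y
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Variation in y left unexplained by the fitted line
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    r2_from_fit = 1 - ss_residual / syy
    r2_from_r = sxy ** 2 / (sxx * syy)          # square of Pearson's r
    return r2_from_fit, r2_from_r

# Hypothetical data: height (cm) and weight (kg)
height = [150, 160, 170, 180, 190]
weight = [52, 58, 67, 73, 80]
print(explained_variation(height, weight))
```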


NON-PARAMETRIC CORRELATION ANALYSIS:

The Spearman rank correlation coefficient is used for non-normal data for which the Pearson linear correlation coefficient would be invalid.

The advantage of rank correlation is that comparisons can be carried out even if actual values of the observations are not known. It suffices to know the ranks. 
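A minimal sketch of the Spearman coefficient computed as Pearson's r applied to ranks, with tied observations given the average of their ranks. Function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1   # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# A monotonic but non-linear relation (hypothetical data)
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```

Because only the ranks enter the calculation, any monotonic transformation of x or y leaves the coefficient unchanged.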



LECTURE # 10-B: MIXED DATA ANALYSIS (DISCRETE AND CONTINUOUS): REGRESSION 


LINEAR REGRESSION - 1:

Regression to the mean, first described by Francis Galton (1822-1911), is one of the basic laws of nature, sunan al llah fi al kawn.

Parametric regression models are cross sectional (linear, logistic, or log-linear) or longitudinal (linear and proportional hazards).

Regression relates independent with dependent variables.


LINEAR REGRESSION - 2: 

The simple linear regression equation is y = a + bx where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable.

Multiple linear regression, a form of multivariate analysis, is defined by y = a + b1x1 + b2x2 + ... + bnxn.

Linear regression is used for prediction (interpolation and extrapolation) and for analysis of variance.
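A minimal least-squares sketch of fitting y = a + bx and using the fitted line for prediction. The function names and the dose / response data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope / regression coefficient
    a = mean_y - b * mean_x    # intercept
    return a, b

def predict(a, b, x_new):
    """Predicted y for a new x value."""
    return a + b * x_new

# Hypothetical data: dose (x) and response (y)
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.0]
a, b = fit_line(dose, response)
print(predict(a, b, 3.5))   # interpolation: inside the observed range of x
print(predict(a, b, 8.0))   # extrapolation: outside the observed range of x
```

Extrapolated predictions rest on the assumption that the linear relation continues beyond the observed data, so they deserve more caution than interpolated ones.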


LOGISTIC REGRESSION:

Logistic regression is non-linear regression with y dichotomous/binary such that logit(p) = a + b1x1 + b2x2 + ... + bnxn, where p is the probability that y = 1 and logit(p) = ln[p / (1 - p)].

Logistic regression is used in epidemiology because of a dichotomized outcome variable and direct derivation of the odds ratio from the regression coefficient, as shown in the formula OR = e^b, where b is the regression coefficient.
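The link between the coefficient and the odds ratio can be seen in the simplest case, a single binary exposure. There the fitted logistic model reproduces the observed proportions, so the coefficient is the difference of the two log-odds and its exponential is the cross-product odds ratio of the 2 x 2 table. The sketch below uses hypothetical counts and function names.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def logistic_coefficients(cases_exp, noncases_exp, cases_unexp, noncases_unexp):
    """Intercept a and slope b of logit(p) = a + b*x for a binary exposure x,
    obtained directly from the observed proportions of a 2 x 2 table."""
    p_exposed = cases_exp / (cases_exp + noncases_exp)
    p_unexposed = cases_unexp / (cases_unexp + noncases_unexp)
    a = logit(p_unexposed)          # log-odds when x = 0
    b = logit(p_exposed) - a        # change in log-odds when x goes 0 -> 1
    return a, b

# Hypothetical 2 x 2 table: 40/60 among exposed, 20/80 among unexposed
a, b = logistic_coefficients(40, 60, 20, 80)
odds_ratio = math.exp(b)
cross_product = (40 * 80) / (60 * 20)
print(round(odds_ratio, 3), round(cross_product, 3))
```

With several predictors the coefficients must be estimated iteratively, but each e^b is still read as an odds ratio adjusted for the other variables in the model.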

Multiple logistic regression is used for matched analysis, stratified analysis to control for confounders, and prediction.


FITTING REGRESSION MODELS:

Step-up or forward selection starts with a minimal set of x variables and one x variable is added at a time.

Step-down or backward elimination starts with a full model and one variable is eliminated at a time.

Step-wise selection is a combination of step up and step down selection. 

Variables are retained or eliminated on the basis of their p-value.


ASSESSING REGRESSION MODELS:

The best model is one with the highest coefficient of determination.

The coefficient of determination, defined as r², varies from 0 to 1.0 and is a measure of goodness of fit.



LECTURE # 10-C: MIXED DATA ANALYSIS (DISCRETE AND CONTINUOUS): TIME SERIES ANALYSIS and SURVIVAL ANALYSIS


TIME SERIES ANALYSIS:

Longitudinal data is summarized in the following ways: graphical presentation, longitudinal regression, auto-regression, autocorrelation, repeated measures ANOVA, and tests for trend.

A time series plot of y against time shows time trends, seasonal patterns, random / irregular patterns, or mixtures of the above.

Forecasts can be made using time series.


TIME SERIES ANALYSIS, Con’t.:

Longitudinal regression models, additive or multiplicative, can be used to model time-varying data.

Auto-regression is a regression model relating a variable to its immediate predecessor.

Autocorrelation is correlation between a variable and its lagged version (immediate predecessor). 
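A minimal sketch of the sample autocorrelation at a given lag, using the common estimator that divides the lagged cross-products of deviations by the total sum of squared deviations. The function name and the series are hypothetical.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag (lag >= 0)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(series) - 1")
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    # Products of each deviation with the deviation `lag` steps earlier
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

# Hypothetical monthly case counts with a steady upward trend
print(autocorrelation([1, 2, 3, 4, 5], lag=1))
```

A trending series gives a positive lag-1 value, while a series that alternates above and below its mean gives a negative one.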

A chi-square test for trend can be constructed for 2 x k contingency tables where k represents time periods.
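One common way to construct such a test is the Cochran-Armitage statistic, which has 1 degree of freedom. The sketch below is an assumption about which trend test is meant: it uses equally spaced scores 0, 1, 2, ... for the time periods unless others are supplied, and the counts are hypothetical.

```python
import math

def chi_square_trend(cases, totals, scores=None):
    """Cochran-Armitage chi-square for trend in a 2 x k table.
    cases[i]  = number with the outcome in period i
    totals[i] = number observed in period i
    Returns (chi2, p_value) with 1 degree of freedom."""
    k = len(cases)
    if scores is None:
        scores = list(range(k))          # equally spaced period scores
    n = sum(totals)
    p = sum(cases) / n                   # overall proportion with the outcome
    # Score-weighted sum of (observed - expected) cases
    t = sum(s * (r - m * p) for s, r, m in zip(scores, cases, totals))
    sum_ns = sum(m * s for s, m in zip(scores, totals))
    sum_ns2 = sum(m * s * s for s, m in zip(scores, totals))
    variance = p * (1 - p) * (sum_ns2 - sum_ns ** 2 / n)
    chi2 = t ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: cases out of 100 subjects in three successive periods
print(chi_square_trend(cases=[10, 20, 30], totals=[100, 100, 100]))
```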


INTRODUCTION TO SURVIVAL ANALYSIS:

Survival analysis is used to study survival duration and the effects of covariates on survival. It uses parametric methods (Weibull, lognormal, or gamma), non-parametric methods (life-table and Kaplan-Meier), or the semi-parametric proportional hazards method.

Time is measured as time to relapse, length of remission, remission duration, survival after relapse, time to death, or time to a complication. 


INTRODUCTION TO SURVIVAL ANALYSIS, Con’t.: 

The best zero time is the point of randomization. Other zero times are: enrolment, the first visit, first symptoms, diagnosis, and start of treatment.

Problems of survival analysis are censoring, truncation, and competing causes of death. Censoring is loss of information due to withdrawal from the study, study termination, loss to follow-up, or death due to a competing risk. 


NON-REGRESSION SURVIVAL ANALYSIS:

Two non-regression methods are used in survival analysis: the life-table and the Kaplan-Meier methods.

The life-table method works better with large data sets and when the time of occurrence of an event cannot be measured precisely. It leads to bias by assuming that withdrawals occur at the start of the interval when in reality they occur throughout the interval.


NON-REGRESSION SURVIVAL ANALYSIS, Con’t.:

The Kaplan-Meier method is best used for small data sets in which the time of event occurrence is measured precisely.
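A minimal sketch of the Kaplan-Meier (product-limit) estimate: at each distinct event time the running survival probability is multiplied by (1 - deaths / number at risk). Subjects censored at a given time are counted as at risk at that time. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability), ...] at each distinct event time.
    events[i] is 1 if the event occurred at times[i], 0 if censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up in months; 0 marks a censored observation
months = [2, 3, 3, 5, 8, 9]
event = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(months, event):
    print(t, round(s, 3))
```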


REGRESSION METHODS FOR SURVIVAL ANALYSIS:

The proportional hazards model, a semi-parametric method proposed by Sir David Cox in 1972, is the most popular regression method for survival analysis.

It is used on data whose distribution is unknown.


COMPARING SURVIVAL CURVES:

The non-parametric methods for comparing 2 survival distributions are: Gehan’s generalized Wilcoxon test, the Cox-Mantel test, the log-rank test, Peto’s generalized Wilcoxon test, the Mantel-Haenszel test, and Cox’s F test.

The parametric tests are the likelihood ratio test and Cox’s F test. The log-rank test is more sensitive if the assumptions of proportional hazards hold.
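A minimal sketch of the two-group log-rank test: at each distinct event time the observed deaths in group 1 are compared with the number expected if both groups shared the same hazard, and the differences are combined into a chi-square statistic with 1 degree of freedom. The function name and the survival times are hypothetical.

```python
import math

def log_rank_test(times1, events1, times2, events2):
    """Two-group log-rank test. events are 1 (event) or 0 (censored).
    Returns (chi2, p_value) with 1 degree of freedom."""
    all_times = list(times1) + list(times2)
    all_events = list(events1) + list(events2)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed1 = expected1 = variance = 0.0
    for t in event_times:
        n1 = sum(1 for u in times1 if u >= t)    # at risk in group 1
        n2 = sum(1 for u in times2 if u >= t)    # at risk in group 2
        d1 = sum(1 for u, e in zip(times1, events1) if u == t and e)
        d2 = sum(1 for u, e in zip(times2, events2) if u == t and e)
        n, d = n1 + n2, d1 + d2
        observed1 += d1
        expected1 += d * n1 / n
        if n > 1:
            # Hypergeometric variance of the deaths falling in group 1
            variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    chi2 = (observed1 - expected1) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))     # chi-square, 1 df
    return chi2, p_value

# Hypothetical survival times (months); every subject had the event
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
print(log_rank_test(group_a, [1] * 5, group_b, [1] * 5))
```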