Integrated Medical Education Resources: 200702P

Presented in the Biostatistics module of the Clinical Research Coordinators Course on June 23, 2020, 11.00-12.00, by Professor Omar Hasan Kasule, MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Professor of Epidemiology and Bioethics, King Fahad Medical City

FIGURE 1: CORRELATION MATRIX

WHAT IS CORRELATION?

Correlation analysis is used as a preliminary data analysis before applying more sophisticated methods.
Correlation describes the relation between 2 random variables (bivariate relation) measured on the same person or object, with no prior evidence of inter-dependence.
Correlation indicates only association; the association is not necessarily causative.
Correlation measures linear relation and not variability.

OBJECTIVES OF CORRELATION

A correlation matrix is used to search for pairs of variables that are likely to be associated.
Correlation analysis has the objective of describing the relation between x and y.
Correlation helps predict y if x is known.
Correlation helps predict x if y is known.
Correlation can be used to study trends.
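
The exploratory use of a correlation matrix can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method used in the course: the variable names and values (age, sbp, height) are hypothetical, and pearson_r and correlation_matrix are helper names introduced here.

```python
import math

def pearson_r(x, y):
    """Pearson coefficient of linear correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(data):
    """Return {(name_a, name_b): r} for every pair of variables in data."""
    names = list(data)
    return {(a, b): pearson_r(data[a], data[b]) for a in names for b in names}

# Hypothetical illustrative values, not data from the lecture.
data = {
    "age":    [25, 32, 41, 48, 56, 63],
    "sbp":    [118, 130, 124, 139, 135, 150],
    "height": [172, 165, 180, 168, 175, 170],
}

matrix = correlation_matrix(data)

# Print one row per variable; the diagonal is always 1.00.
for a in data:
    print(a.ljust(8), "  ".join(f"{matrix[(a, b)]:6.2f}" for b in data))
```

Pairs with a large absolute value of r in the printed table are the candidates worth examining further, for example with a scatterplot.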

FIGURE 2: SCATTERPLOT: gives a visual impression of the data and helps identify outliers

THE PEARSON COEFFICIENT OF LINEAR CORRELATION, r

• The Pearson coefficient of correlation, r, is the most commonly used statistic for linear correlation.

• The Pearson coefficient of linear correlation essentially measures the scatter of the data around a straight line: the less the scatter, the closer the absolute value of r is to 1.0.

• The Pearson coefficient has a complicated formula but can be computed easily by modern computers.

• Its values range from -1.0 to 1.0; the magnitude runs from 0.0 (no linear correlation) to 1.0 (perfect linear correlation).

• It takes negative values if there is a correlation but in the opposite direction, that is, y decreases as x increases.
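
For reference, the formula can be written out as below. This is the standard textbook definition rather than something reproduced from the slides; the notation is assumed here, with x_i and y_i the paired observations, x-bar and y-bar their means, and n the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when x and y tend to lie on the same side of their means and negative when they tend to lie on opposite sides, which is what gives r its sign.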

FIGURE 3: POSITIVE CORRELATION

FIGURE 4: NEGATIVE CORRELATION

INTERPRETATION OF THE PEARSON LINEAR CORRELATION COEFFICIENT, r

• r = 0.25 - 0.50 indicates a fair degree of association.

• r = 0.50 - 0.75 indicates a moderate to good relation.

• r > 0.75 indicates a good to excellent relation.

• r = 0 indicates either no correlation or a non-linear relation.

• r = 1.0 is perfect positive linear correlation.

• r = -1.0 is perfect negative linear correlation.

• Very high correlations above 0.9 are suspicious (something may be wrong with the data) and should be checked.
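
The boundary cases in the list above can be checked numerically. The sketch below uses made-up values and a helper named pearson_r introduced here for illustration; it shows that a perfect curvilinear relation (y = x squared over a symmetric range of x) gives r = 0 even though y is completely determined by x.

```python
import math

def pearson_r(x, y):
    """Pearson coefficient of linear correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values chosen to make the three cases exact.
x = [-3, -2, -1, 0, 1, 2, 3]
y_up = [2 * v + 1 for v in x]      # perfect positive linear relation
y_down = [-2 * v + 1 for v in x]   # perfect negative linear relation
y_curve = [v ** 2 for v in x]      # perfect but curvilinear relation

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)

print("linear, increasing:", r_up)
print("linear, decreasing:", r_down)
print("curvilinear:       ", r_curve)
```

A value of r near 0 therefore calls for a look at the scatterplot before concluding that the two variables are unrelated.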

SITUATIONS IN WHICH THE PEARSON LINEAR CORRELATION COEFFICIENT IS NOT USED/IS MISLEADING

• Relation between x and y is non-linear

• The data has outliers

• Observations are clustered

• One of the variables is fixed
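
The effect of outliers, one of the situations listed above, can be illustrated with a small example. The data are hypothetical and the helper name pearson_r is introduced here; the sketch assumes nothing beyond the standard definition of r.

```python
import math

def pearson_r(x, y):
    """Pearson coefficient of linear correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Five hypothetical observations with no linear relation between x and y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 4, 2]
r_without = pearson_r(x, y)

# The same data with one extreme observation added at (20, 20).
r_with = pearson_r(x + [20], y + [20])

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

A single extreme point is enough to move r from about zero to above 0.9 in this example, which is why the scatterplot should be inspected before r is interpreted.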

ALTERNATIVES TO PEARSON COEFFICIENT (NOT USED REGULARLY)

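• The coefficient of partial correlation is used when the relation between x and y is influenced by a third variable; it measures the relation between x and y with the third variable held constant.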

• Correlation ratio is used for curvilinear relations

• The biserial correlation coefficient is used in linear relations when one variable is quantitative and the other is qualitative (dichotomous); the tetrachoric correlation coefficient is used when both variables are dichotomous.

• The contingency coefficient is used for 2 qualitative nominal (i.e. unordered) variables, each of which has 2 or more categories.

• The coefficient of mean square contingency is used when both variables are qualitative

• The multiple correlation coefficient is used to describe the relationship in which a given variable is being correlated with several other variables.

• The square of the linear correlation coefficient, r², is called the coefficient of determination; it is the proportion of the variation in y that is explained by its linear relation with x.
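
Two of the coefficients above are simple functions of the pairwise Pearson coefficients and can be sketched directly. The code uses the standard first-order partial correlation formula; the function names and the three pairwise values are hypothetical and chosen only for illustration.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def coefficient_of_determination(r):
    """Square of the linear correlation coefficient."""
    return r ** 2

# Hypothetical pairwise correlations among x, y and a third variable z.
r_xy, r_xz, r_yz = 0.6, 0.7, 0.8

adjusted = partial_r(r_xy, r_xz, r_yz)
explained = coefficient_of_determination(r_xy)

print(f"partial correlation of x and y given z: {adjusted:.3f}")
print(f"coefficient of determination for x and y: {explained:.2f}")
```

With these assumed values the apparent correlation of 0.6 between x and y falls to roughly 0.09 once z is held constant, showing how a third variable can account for most of an observed association.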

NON-PARAMETRIC CORRELATION COEFFICIENTS (USED FOR DATA THAT ARE NOT NORMAL ie NOT BELL SHAPED)

• The Spearman rank correlation coefficient is the non-parametric equivalent of the Pearson correlation coefficient.

• The Spearman rank correlation coefficient is used for non-normal data for which the Pearson linear correlation coefficient would be invalid.
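
One common way to obtain the Spearman coefficient is to replace each value by its rank and then apply the Pearson formula to the ranks. The sketch below follows that approach, giving tied values the average of their ranks; the helper names and the data are hypothetical and not taken from the lecture.

```python
import math

def pearson_r(x, y):
    """Pearson coefficient of linear correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson coefficient computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y rises steadily with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Because the Spearman coefficient depends only on the ordering of the values, it reaches 1.0 for any strictly increasing relation, whereas the Pearson coefficient stays below 1.0 when the relation is not a straight line.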

200702P - CORRELATION ANALYSIS