
200702P - CORRELATION ANALYSIS


Presented in the Biostatistics module of the Clinical Research Coordinators Course on June 23, 2020, 11.00-12.00, by Professor Omar Hasan Kasule, MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Professor of Epidemiology and Bioethics, King Fahad Medical City.


FIGURE 1: CORRELATION MATRIX


WHAT IS CORRELATION?

  • Correlation analysis is used as a preliminary step in data analysis, before more sophisticated methods are applied.
  • Correlation describes the relation between 2 random variables (a bivariate relation) measured on the same person or object, with no prior evidence of inter-dependence.
  • Correlation indicates only association; the association is not necessarily causative.
  • Correlation measures the linear relation between the variables, not their variability.


OBJECTIVES OF CORRELATION

  • A correlation matrix is used to explore for pairs of variables likely to be associated.
  • Correlation analysis aims to describe the relation between x and y.
  • Correlation predicts y if x is known.
  • Correlation predicts x if y is known.
  • Correlation can be used to study trends.
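
As a rough illustration of the first objective, the sketch below builds a small correlation matrix in plain Python. The variable names (age, sbp, hr) and their values are hypothetical data invented for this example, and `pearson_r` and `correlation_matrix` are helper names assumed here, not taken from the lecture.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b])
            for a in names for b in names}

# Hypothetical illustrative data: one list of measurements per variable.
data = {
    "age": [25, 32, 41, 50, 63],
    "sbp": [118, 121, 130, 138, 145],
    "hr": [78, 74, 76, 70, 72],
}

matrix = correlation_matrix(data)
for a in data:
    # One row of the matrix per variable, rounded for display.
    print(a, [round(matrix[(a, b)], 2) for b in data])
```

Pairs with a large absolute r in such a matrix are candidates for closer study, for example with a scatterplot.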


FIGURE 2: SCATTERPLOT: gives a visual impression of the data and helps identify outliers






THE PEARSON COEFFICIENT OF LINEAR CORRELATION, r

The Pearson coefficient of correlation, r, is the commonest statistic for linear correlation.

The Pearson coefficient of linear correlation essentially measures how closely the data points scatter around a straight line.

The Pearson coefficient has a complicated formula but can be computed easily by modern computers.
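
For reference, the formula usually given for r is shown below. The notation is assumed here rather than taken from the lecture: n paired observations (x_i, y_i) with sample means \bar{x} and \bar{y}.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```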

Its absolute value ranges from 0.0 (no linear correlation) to 1.0 (perfect correlation).

It takes negative values, down to -1.0, when there is a correlation in the opposite direction (y decreases as x increases).
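
A minimal sketch of the sign of r, using made-up numbers: one series rises with x and the other falls. The helper `pearson_r` is a name assumed for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y increases in a straight line with x
falling = [10, 8, 6, 4, 2]   # y decreases in a straight line as x increases

r_pos = pearson_r(x, rising)
r_neg = pearson_r(x, falling)
print(r_pos, r_neg)
```

Both series lie exactly on a straight line, so the two coefficients sit at the extremes of the range, differing only in sign.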


FIGURE 3: POSITIVE CORRELATION




FIGURE 4: NEGATIVE CORRELATION





INTERPRETATION OF THE PEARSON LINEAR CORRELATION COEFFICIENT, r

r = 0.25 - 0.50 indicates a fair degree of association.

r = 0.50 - 0.75 indicates a moderate to good relation.

r > 0.75 indicates a good to excellent relation.

r = 0 indicates no linear correlation: the variables are either unrelated or related in a non-linear way.

r = 1.0 is a perfect positive linear correlation.

r = -1.0 is a perfect negative linear correlation.

Very high correlations (above 0.9) are suspicious and may indicate that something is wrong with the data.
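
The bands above can be sketched as a small lookup. Several choices here are assumptions and not part of the lecture: the bands are applied to the absolute value of r, a boundary value such as 0.50 is placed in the higher band, values below 0.25 get a label of their own, and the labels are shortened. `describe_r` is a hypothetical helper name.

```python
def describe_r(r):
    """Return (label, direction, suspicious) for a Pearson coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # assumption: bands apply to the magnitude of r
    if size > 0.75:
        label = "good to excellent"
    elif size >= 0.50:
        label = "moderate"
    elif size >= 0.25:
        label = "fair"
    else:
        # Assumption: the notes give no label for values below 0.25.
        label = "little or no linear relation"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    suspicious = size > 0.9  # very high values deserve a data check
    return label, direction, suspicious

for value in (0.3, -0.6, 0.8, 0.95):
    print(value, describe_r(value))
```

Such labels are rules of thumb only; the scatterplot and the study context still need to be examined.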


SITUATIONS IN WHICH THE PEARSON LINEAR CORRELATION COEFFICIENT IS NOT USED/IS MISLEADING

The relation between x and y is non-linear.

The data have outliers.

The observations are clustered.

One of the variables is fixed rather than random.
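
Two of these situations can be illustrated with made-up numbers. In the first, y depends perfectly on x through a curve, yet r is zero. In the second, a single outlying point turns an essentially uncorrelated data set into one with a very high r. All values are hypothetical, and `pearson_r` is a helper name assumed for this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# 1. Non-linear relation: y = x squared, with x symmetric around zero.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v * v for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# 2. Outlier: five points with no linear trend, then one extreme point added.
x_base = [1, 2, 3, 4, 5]
y_base = [2, 1, 2, 1, 2]
r_without = pearson_r(x_base, y_base)
r_with = pearson_r(x_base + [20], y_base + [20])

print(round(r_curve, 3), round(r_without, 3), round(r_with, 3))
```

In both cases a scatterplot would reveal the problem at once, which is why the data should be plotted before r is interpreted.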


ALTERNATIVES TO PEARSON COEFFICIENT (NOT USED REGULARLY)

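The coefficient of partial correlation is used when the relation between x and y is influenced by a third variable; it measures the correlation between x and y with that variable held constant.

The correlation ratio is used for curvilinear relations.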

The biserial correlation coefficient is used in linear relations when one variable is quantitative and the other is qualitative with 2 categories; the tetrachoric correlation coefficient is used when both variables have 2 categories.

The contingency coefficient is used for 2 qualitative nominal (i.e. unordered) variables, each of which has 2 or more categories.

The coefficient of mean square contingency is used when both variables are qualitative.

The multiple correlation coefficient is used to describe the relationship in which a given variable is being correlated with several other variables.

The square of the linear correlation coefficient is called the coefficient of determination.
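
Two of the quantities above are simple to compute from pairwise Pearson coefficients. The sketch below uses the usual first-order partial correlation formula, which is supplied here and not quoted from the lecture. The three input coefficients are hypothetical, and the function names are assumptions.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with a third variable z held constant.

    Uses the first-order formula; assumes neither r_xz nor r_yz is
    exactly +1 or -1, which would make the denominator zero.
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def coefficient_of_determination(r):
    """Square of the linear correlation coefficient."""
    return r ** 2

# Hypothetical pairwise coefficients among x, y, and a third variable z.
r_xy, r_xz, r_yz = 0.6, 0.7, 0.8

adjusted = partial_r(r_xy, r_xz, r_yz)
r_squared = coefficient_of_determination(r_xy)
print(round(adjusted, 3), round(r_squared, 2))
```

In this made-up case most of the apparent relation between x and y disappears once z is held constant, because both are strongly related to z.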


NON-PARAMETRIC CORRELATION COEFFICIENTS (USED FOR DATA THAT ARE NOT NORMALLY DISTRIBUTED, i.e. NOT BELL-SHAPED)

The Spearman rank correlation coefficient is the non-parametric equivalent of the Pearson correlation coefficient.

It is used for non-normal data for which the Pearson linear correlation coefficient would be invalid.
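
One common way to obtain the Spearman coefficient is to replace each value by its rank and then compute the Pearson coefficient on the ranks, giving tied values the average of their ranks. The sketch below follows that approach on made-up data; `ranks`, `pearson_r`, and `spearman_rho` are helper names assumed for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical skewed data: y always rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(round(rho, 3), round(r, 3))
```

Because y rises every time x rises, the rank-based coefficient is perfect, while the Pearson coefficient is pulled down by the extreme last value.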