Presented in the Biostatistics module of the Clinical Research Coordinators Course on July 2, 2020 10.00-12.00 by Professor Omar Hasan Kasule MB ChB (MUK), MPH (Harvard), DrPH (Harvard) Professor of Epidemiology and Bioethics King Fahad Medical City
FIGURE 1: CORRELATION MATRIX
WHAT IS CORRELATION?
• Correlation analysis is used as preliminary data analysis before applying more sophisticated methods.
• Correlation describes the relation between 2 random variables (bivariate relation) about the same person or object with no prior evidence of inter-dependence.
• Correlation indicates only association; the association is not necessarily causative
• Correlation measures linear relation and not variability.
OBJECTIVES OF CORRELATION
• A correlation matrix is used to explore for pairs of variables likely to be associated.
• Correlation analysis has the objectives of describing the relation between x and y
• Correlation predicts y if x is known
• Correlation predicts x if y is known
• Correlation can study trends
FIGURE 2: SCATTERPLOT: visual impression of the data and identifying outliers
THE PEARSON COEFFICIENT OF LINEAR CORRELATION, r
• The Pearson coefficient of correlation, r, is the commonest statistic for linear correlation.
• The Pearson coefficient of linear correlation essentially measures how closely the data points scatter around a straight line
• The Pearson coefficient has a complicated formula but can be computed easily by modern computers.
• Its absolute value ranges from 0.0 (no linear correlation) to 1.0 (perfect correlation)
• It takes negative values, down to -1.0, if the correlation is in the opposite direction (y decreases as x increases).
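As a minimal sketch, the usual textbook formula for r is the sum of the products of the deviations of x and y from their means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the data are hypothetical and chosen only to show the two extremes.

```python
import math

def pearson_r(x, y):
    """Pearson r computed directly from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations (numerator)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (denominator terms)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises exactly in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly in step with x
print(r_up, r_down)
```

The first pair lies exactly on a rising straight line and the second on a falling one, so the two coefficients sit at opposite ends of the range.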
FIGURE 3: POSITIVE CORRELATION
FIGURE 4: NEGATIVE CORRELATION
INTERPRETATION OF THE PEARSON LINEAR CORRELATION COEFFICIENT, r
• r = 0.25 - 0.50 indicates a fair degree of association.
• r = 0.50 - 0.75 indicates a moderate to good relation.
• r > 0.75 indicates a good to excellent relation.
• r = 0 indicates either no correlation or a non-linear correlation
• r=1.0 is perfect positive linear correlation
• r=-1.0 is perfect negative linear correlation
• Very high correlations (above 0.9) are suspicious and may indicate that something is wrong with the data
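The bands above can be turned into a small lookup, sketched below. The function `interpret_r` is hypothetical, the labels are shortened, and three choices are assumptions rather than part of the list: the bands are applied to the absolute value of r, values below 0.25 are labelled as little or no linear association, and a value on a boundary goes to the higher band.

```python
def interpret_r(r):
    """Return (label, direction, suspicious) for a Pearson coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1.0 and 1.0")
    size = abs(r)  # assumption: bands apply to the magnitude of r
    if size < 0.25:
        label = "little or no linear association"  # assumption: band not listed
    elif size < 0.50:
        label = "fair"
    elif size < 0.75:
        label = "moderate"
    else:
        label = "good to excellent"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    suspicious = size > 0.9  # very high values call for a check of the data
    return label, direction, suspicious

print(interpret_r(0.3))
print(interpret_r(-0.8))
print(interpret_r(0.95))
```

Such labels are only a rule of thumb; the scatterplot should always be inspected alongside the coefficient.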
SITUATIONS IN WHICH THE PEARSON LINEAR CORRELATION COEFFICIENT IS NOT USED/IS MISLEADING
• Relation between x and y is non-linear
• The data has outliers
• Observations are clustered
• One of the variables is fixed
ALTERNATIVES TO PEARSON COEFFICIENT (NOT USED REGULARLY)
• The coefficient of partial correlation is used when the relation between x and y is influenced by a third variable
• Correlation ratio is used for curvilinear relations
• The biserial correlation coefficient is used in linear relations when one variable is quantitative and the other is qualitative (dichotomous); the tetrachoric correlation coefficient is used when both variables are dichotomous
• The contingency coefficient is used for 2 qualitative nominal (i.e. unordered) variables, each of which has 2 or more categories
• The coefficient of mean square contingency is used when both variables are qualitative
• The multiple correlation coefficient is used to describe the relationship in which a given variable is being correlated with several other variables.
• The square of the linear correlation coefficient is called the coefficient of determination.
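Two of these measures follow directly from Pearson coefficients, as the sketch below shows. It uses the standard first-order formula for the partial correlation of x and y controlling for a third variable z; the function names and the input coefficients are hypothetical.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation of x and y with the effect of z removed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def coefficient_of_determination(r):
    """Square of the linear correlation coefficient."""
    return r ** 2

# Hypothetical pairwise Pearson coefficients among x, y and z
r_partial = partial_r(r_xy=0.5, r_xz=0.5, r_yz=0.5)
r_squared = coefficient_of_determination(0.8)
print(round(r_partial, 3), round(r_squared, 2))
```

The coefficient of determination is commonly read as the proportion of the variation in one variable that is accounted for by its linear relation with the other.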
NON-PARAMETRIC CORRELATION COEFFICIENTS (USED FOR DATA THAT IS NOT NORMAL, i.e. NOT BELL-SHAPED)
• The Spearman rank correlation coefficient is the non-parametric equivalent of the Pearson correlation
• The Spearman rank correlation coefficient is used for non-normal data for which the Pearson linear correlation coefficient would be invalid.
ASSIGNMENT
Using the class data set
1. Draw a scatter diagram of height (in cm) on the y-axis (vertical axis) against weight (in kg) on the x-axis (horizontal axis)
2. Compute the Pearson Linear Correlation Coefficient between weight and height
3. Interpret the coefficient