Paper presented at a workshop on data analysis using SPSS held at the Kulliyah of Medicine, International Islamic University Kuantan MALAYSIA on 25th May 2004 by Prof Dr Omar Hasan Kasule, Sr. MB ChB (MUK), MPH & DrPH (Harvard)
VARIABLES
Understanding variables and their properties is essential to understanding statistical analysis. A constant has only one unvarying value under all circumstances, for example π or c, the speed of light. A random variable can be qualitative (descriptive, with no intrinsic numerical value) or quantitative (with intrinsic numerical value). Qualitative variables can be nominal (no specific order of magnitude), ordinal (specific order), or ranked. A quantitative random variable results when numerical values are assigned to the results of measurement or counting. It is called a discrete random variable if the assignment is based on counting and a continuous random variable if it is based on measurement. A continuous random variable can be expressed as fractions and decimals; a discrete random variable can only be expressed as whole numbers. The choice of the technique of statistical analysis depends on the type of variable, and many mistakes in data analysis arise from not knowing the difference between discrete and continuous variables and consequently applying the wrong statistical technique.
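As an illustration of these variable types (not part of the original handout), the sketch below encodes a few hypothetical patient variables in Python with pandas, which is assumed to be available; declaring the types explicitly helps the software apply a valid technique later.

import pandas as pd

# Hypothetical patient records illustrating the four kinds of variables
df = pd.DataFrame({
    "blood_group": ["A", "O", "B", "AB"],                     # qualitative, nominal (no order)
    "pain_severity": ["mild", "moderate", "severe", "mild"],  # qualitative, ordinal (ordered)
    "parity": [0, 2, 1, 3],                                   # quantitative, discrete (counts)
    "weight_kg": [61.5, 78.2, 55.0, 90.4],                    # quantitative, continuous (measurements)
})

# Record the qualitative types explicitly
df["blood_group"] = df["blood_group"].astype("category")
df["pain_severity"] = pd.Categorical(
    df["pain_severity"], categories=["mild", "moderate", "severe"], ordered=True
)
print(df.dtypes)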
PRELIMINARIES OF DATA ANALYSIS
Simple manual inspection of the data is needed before applying sophisticated statistical tests. Indiscriminate application of tests to the data leads to wrong or misleading conclusions. Acquiring familiarity with the data by simple manual inspection can help identify outliers, assess the normality of the data distribution, and identify commonsense relationships among variables that could alert the investigator to errors in computer analysis.
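A minimal sketch of such preliminary checks, assuming a single numeric variable held in a pandas Series (the values and the screening rules are illustrative choices, not prescriptions):

import pandas as pd
from scipy import stats

# Hypothetical body weights; 250.0 was entered deliberately as a suspicious value
weight = pd.Series([61.5, 59.9, 66.8, 68.3, 72.1, 74.0, 78.2, 81.2, 90.4, 55.0, 250.0])

print(weight.describe())   # n, mean, SD, min, quartiles, max: a quick screen for odd values

# Tukey's rule: flag values beyond 1.5 interquartile ranges from the quartiles as possible outliers
q1, q3 = weight.quantile([0.25, 0.75])
iqr = q3 - q1
print(weight[(weight < q1 - 1.5 * iqr) | (weight > q3 + 1.5 * iqr)])

# Shapiro-Wilk test as a rough check of normality (a small p-value suggests non-normality)
w_stat, p_value = stats.shapiro(weight)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")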
Data analysis is essentially the construction and testing of hypotheses. Two procedures are employed in statistical analysis: the test for association is done first, and the effect measures are assessed only after an association has been found. Effect measures are useless in situations in which tests for association are negative. The tests for association commonly employed are the t-test, the chi-square test, the linear correlation coefficient, and the linear regression coefficient. The effect measures commonly employed are the odds ratio, the risk ratio, and the rate difference. Measures of trend can discover relationships that are not picked up by association and effect measures.
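The sketch below illustrates this two-step sequence on a hypothetical 2x2 exposure-disease table: a chi-square test for association first, then the effect measures computed from the same cell counts (SciPy is assumed to be available).

from scipy import stats

# Hypothetical 2x2 table: rows are exposed/unexposed, columns are diseased/not diseased
a, b = 30, 70     # exposed:   diseased, not diseased
c, d = 15, 85     # unexposed: diseased, not diseased

# Step 1: test for association
chi2, p, dof, expected = stats.chi2_contingency([[a, b], [c, d]])
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# Step 2: effect measures, examined only if an association was found
odds_ratio = (a * d) / (b * c)
risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
risk_ratio = risk_exposed / risk_unexposed
rate_difference = risk_exposed - risk_unexposed
print(odds_ratio, risk_ratio, rate_difference)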
TYPES OF ANALYSIS
Univariate analysis is testing a hypothesis about one mean or one proportion. The t-test is used to test hypotheses about a single sample mean. The chi-square test is used to test hypotheses about a single sample proportion. Univariate testing answers the question of whether the given mean or proportion is significantly different from a hypothesized null value (for example, zero).
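A minimal sketch of both univariate tests using SciPy (the blood-pressure values, the hypothesized mean of 120, and the hypothesized proportion of 0.5 are all invented for illustration):

from scipy import stats

# One-sample t-test: is the mean systolic blood pressure different from a hypothesized 120 mmHg?
bp = [118, 125, 131, 122, 128, 135, 119, 127]
t_stat, p_val = stats.ttest_1samp(bp, popmean=120)
print(f"t = {t_stat:.2f}, p = {p_val:.3f}")

# One-sample chi-square (goodness of fit): is an observed 60/200 compatible with a proportion of 0.5?
observed = [60, 140]
expected = [100, 100]
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")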
Bivariate analysis is testing the hypothesis of whether two means or two proportions are significantly different from one another. The choice of the statistical test for association in bivariate analysis is made according to Table #1 (a worked sketch follows the table).
Multivariate analysis in its commonest form is essentially bivariate analysis with adjustment for extraneous variables that confuse (confound) the bivariate relation. The choice of the statistical test of association for multivariate analysis is made according to Table #2 (a worked sketch follows the table).
STATISTICAL MODELS IN DATA ANALYSIS
Observations or raw data have to be fitted to a specific statistical model. Once the model is fitted, it can be used for prediction. There are basically three types of models: probability models, likelihood models, and regression models. A probability model may be deterministic or stochastic; the probability models commonly used in statistical analysis are the binomial and the normal distributions. The likelihood model derives the maximum likelihood estimator from the data. The maximum likelihood estimate, MLE, is the most likely value of the parameter given the data and is derived iteratively. The regression model may be a Poisson regression model or a binomial logistic regression model. The model allows modeling the interaction among confounders and the interaction between the exposure and the confounders, and it can be used to explore additive and synergistic relations.
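As a sketch of these regression models (an illustration using statsmodels, which is assumed to be installed; the data and variable names are simulated), the binomial logistic and Poisson fits below are obtained by maximum likelihood, estimated iteratively:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "exposure": rng.integers(0, 2, n),   # hypothetical exposure indicator
    "age": rng.normal(50, 10, n),        # hypothetical confounder
})
# Outcomes simulated purely for illustration
linear_predictor = -2 + 0.8 * df["exposure"] + 0.02 * df["age"]
df["disease"] = rng.binomial(1, 1 / (1 + np.exp(-linear_predictor)))
df["events"] = rng.poisson(np.exp(-1 + 0.5 * df["exposure"]))

# Binomial logistic regression, fitted by iterative maximum likelihood
logit_model = smf.logit("disease ~ exposure + age", data=df).fit()
print(logit_model.summary())

# Poisson regression for counts, also fitted by maximum likelihood
poisson_model = smf.glm("events ~ exposure + age", data=df,
                        family=sm.families.Poisson()).fit()
print(poisson_model.summary())

# Interaction between the exposure and the confounder (exposure:age term)
interaction_model = smf.logit("disease ~ exposure * age", data=df).fit()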
Multivariate models solve two problems that arise when stratified analysis is used: stratified analysis breaks down when the data are sparse, with very low numbers in some strata, and it would be very cumbersome if it were used for more than three variables. There are three main types of multivariate models: the linear model, the logistic model, and the proportional hazards model. The linear model is E(Y) = b0 + b1x1 + b2x2 + … + bkxk. The binary logistic model is of the form ln{p/(1 - p)} = b0 + b1x1 + b2x2 + … + bkxk. The proportional hazards regression relates the hazard at a given time to the risk factors such that yi = ln{hi(t) / h0(t)} = b1x1i + b2x2i + …. The coefficients of proportional hazards regression are interpreted like the coefficients of logistic regression.
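The sketch below fits each of the three model types to simulated data as an illustration; statsmodels and lifelines are assumed to be installed, and all variable names are invented. Exponentiated coefficients give odds ratios for the logistic model and hazard ratios for the proportional hazards model.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({"x1": rng.normal(0, 1, n), "x2": rng.integers(0, 2, n)})
df["y_cont"] = 2 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(0, 1, n)  # continuous outcome
df["y_bin"] = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + df["x1"]))))        # binary outcome
df["time"] = rng.exponential(10, n)                                       # follow-up time
df["event"] = rng.binomial(1, 0.7, n)                                     # event indicator

# Linear model: E(Y) = b0 + b1*x1 + b2*x2
linear = smf.ols("y_cont ~ x1 + x2", data=df).fit()

# Logistic model: ln{p/(1-p)} = b0 + b1*x1 + b2*x2; exp(b) is an odds ratio
logistic = smf.logit("y_bin ~ x1 + x2", data=df).fit()
print(np.exp(logistic.params))

# Proportional hazards model: ln{h_i(t)/h_0(t)} = b1*x1 + b2*x2; exp(b) is a hazard ratio
cox = CoxPHFitter().fit(df[["time", "event", "x1", "x2"]],
                        duration_col="time", event_col="event")
cox.print_summary()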
TABLE #1:
CHOICE OF STATISTICAL TECHNIQUE FOR BIVARIATE ANALYSIS[1]
First variable | Second variable | Test |
Continuous | Dichotomous, unpaired | 2-sample t test |
Continuous | Dichotomous, paired | Paired t test (1-sample t test after taking differences for each pair) |
Continuous | Nominal (3 or more groups) | 1-way ANOVA |
Continuous | Continuous | Linear correlation (Pearson) or linear regression |
Ordinal | Dichotomous, unpaired | Mann-Whitney U test or Chi-square test for linear trend |
Ordinal | Dichotomous, paired | Wilcoxon matched-pairs signed-rank test |
Ordinal | Ordinal | Spearman correlation or Kendall correlation |
Ordinal | Continuous | Categorize the continuous variable and use Spearman correlation, Kendall correlation, or the chi-square test |
Dichotomous | Dichotomous, unpaired | Chi-square test or Fisher exact probability test |
Dichotomous | Dichotomous, paired | McNemar chi-square test |
Dichotomous | Nominal | Chi-square test |
Nominal | Nominal | Chi-square test |
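As a worked illustration of several rows of Table #1 (not part of the original handout), the sketch below applies the corresponding SciPy functions to small invented data sets:

from scipy import stats

# Continuous outcome in two unpaired groups: 2-sample t test
group_a = [5.1, 6.2, 5.8, 6.5, 5.9]
group_b = [4.8, 5.0, 5.2, 4.6, 5.1]
print(stats.ttest_ind(group_a, group_b))

# Continuous outcome measured twice on the same subjects: paired t test
before = [120, 132, 128, 140, 125]
after = [115, 130, 126, 134, 123]
print(stats.ttest_rel(before, after))

# Ordinal outcome in two unpaired groups: Mann-Whitney U test
print(stats.mannwhitneyu(group_a, group_b))

# Two ordinal variables: Spearman correlation
print(stats.spearmanr([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))

# Two continuous variables: Pearson correlation
print(stats.pearsonr([1.2, 2.3, 3.1, 4.8], [2.0, 2.9, 3.9, 5.1]))

# Two dichotomous variables (2x2 table): chi-square test or, with small counts, Fisher exact test
table = [[12, 8], [5, 15]]
print(stats.chi2_contingency(table))
print(stats.fisher_exact(table))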
TABLE #2:
CHOICE OF STATISTICAL TECHNIQUE FOR MULTIVARIATE ANALYSIS[1]
Dependent variable | Independent variables | Test |
Continuous | All categorical | ANOVA (analysis of variance) |
Continuous | Mixture of categorical and continuous | ANCOVA (Analysis of covariance) |
Continuous | All continuous | Multiple linear regression |
Dichotomous | All categorical | Multiple logistic regression or log-linear analysis |
Dichotomous | Mixture of categorical and continuous | Logistic regression |
Time-dependent dichotomous | Mixture of categorical and continuous | Cox’s proportional hazards model |
Dichotomous | All continuous | Logistic regression or discriminant function analysis |
Nominal | All categorical | Log-linear analysis |
Nominal | Mixture of categorical and continuous | Group the continuous variables and perform log-linear analysis |
Nominal | All continuous | Discriminant function analysis or categorize the continuous and perform log-linear analysis |
NB: Categorical includes nominal, ordinal, and dichotomous variables.
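The sketch below illustrates two common rows of Table #2 using statsmodels (assumed to be installed; the data and variable names are simulated): analysis of covariance for a continuous outcome with a mixture of categorical and continuous predictors, and logistic regression for a dichotomous outcome with the same mixture.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({
    "treatment": rng.choice(["A", "B", "C"], n),   # categorical independent variable
    "age": rng.normal(45, 12, n),                  # continuous independent variable (covariate)
})
df["score"] = 50 + 3 * (df["treatment"] == "B") + 0.2 * df["age"] + rng.normal(0, 5, n)
df["cured"] = rng.binomial(1, 0.4, n)

# Continuous dependent variable, mixed predictors: ANCOVA via a linear model
ancova = smf.ols("score ~ C(treatment) + age", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))

# Dichotomous dependent variable, mixed predictors: logistic regression
logit = smf.logit("cured ~ C(treatment) + age", data=df).fit()
print(logit.summary())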
NOTE
[1] Jekel et al. Epidemiology, Biostatistics, and Preventive Medicine. WB Saunders, p. 175.