040525P - INTRODUCTION TO MULTIVARIATE ANALYSIS

Paper presented at a workshop on data analysis using SPSS held at the Kulliyah of Medicine, International Islamic University Kuantan MALAYSIA on 25th May 2004 by Prof Dr Omar Hasan Kasule, Sr. MB ChB (MUK), MPH & DrPH (Harvard)


VARIABLES
Understanding variables and their properties is essential to understanding statistical analysis. A constant has a single unvarying value under all circumstances, for example π and c (the speed of light). A random variable can be qualitative (descriptive, with no intrinsic numerical value) or quantitative (with intrinsic numerical value). Qualitative variables can be nominal (no specific order of magnitude), ordinal (specific order), or ranked. A quantitative random variable results when numerical values are assigned to the results of measurement or counting. It is called a discrete random variable if the assignment is based on counting, and a continuous random variable if it is based on measurement. A continuous variable can take fractional and decimal values; a discrete variable can take only whole numbers. The choice of statistical technique depends on the type of variable. Many mistakes in data analysis arise from not knowing the difference between discrete and continuous variables and, as a result, applying the wrong statistical technique.

PRELIMINARIES OF DATA ANALYSIS
Simple manual inspection of the data is needed before applying sophisticated statistical tests. Indiscriminate application of tests to data leads to wrong or misleading conclusions. Acquiring familiarity with the data by simple manual inspection can help identify outliers, assess the normality of the data distribution, and identify commonsense relationships among variables that could alert the investigator to errors in computer analysis.
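The kind of preliminary inspection described above can be sketched in a few lines of Python. This is only an illustration: the function name inspect, the 1.5 x IQR outlier fences, and the blood pressure readings are assumptions made for the example, not part of the workshop material.

```python
import statistics

def inspect(values):
    """Summarise a numeric sample and flag values outside the 1.5*IQR fences."""
    # Quartiles using the statistics module's default ("exclusive") method.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": q2,
        "sd": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }

# Hypothetical systolic blood pressure readings (mmHg); 410 looks like an entry error.
sbp = [118, 122, 125, 130, 121, 119, 127, 124, 410, 123]
summary = inspect(sbp)
print(summary)
```

A mean far above the median, together with a flagged value, is the sort of signal that should prompt a check of the raw records before any formal test is run.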

Data analysis is essentially the construction and testing of hypotheses. Two procedures are employed in statistical analysis. The test for association is done first. Effect measures are assessed only after an association has been found; they are useless when the tests for association are negative. The tests for association commonly employed are the t test, the chi-square test, the linear correlation coefficient, and the linear regression coefficient. The effect measures commonly employed are the odds ratio, the risk ratio, and the rate difference. Measures of trend can uncover relationships that are not picked up by association and effect measures.
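As an illustration of the effect measures named above, the following Python sketch computes them from a 2x2 table. The cell labels a, b, c, d, the function name, and the counts are hypothetical. Because the table holds counts of persons rather than person-time, the difference computed here is a risk difference; a rate difference would need person-time denominators.

```python
def effect_measures(a, b, c, d):
    """Effect measures from a 2x2 table.

    a = exposed with disease,   b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed develop disease.
measures = effect_measures(a=30, b=70, c=10, d=90)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```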

TYPES OF ANALYSIS
Univariate analysis is testing a hypothesis about one mean or one proportion. The t test is used to test hypotheses about a single sample mean. The chi-square test is used to test hypotheses about a single sample proportion. Univariate testing answers the question whether the given mean or proportion is significantly different from a hypothesized value (zero in the simplest case).
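A minimal sketch of the one-sample t statistic follows, assuming hypothetical haemoglobin values and a hypothesized mean of 12.0. Only the statistic and its degrees of freedom are computed; the p-value would be read from t tables or obtained from statistical software such as SPSS.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1

# Hypothetical haemoglobin measurements (g/dL).
hb = [13.1, 12.4, 14.0, 13.6, 12.9, 13.8, 13.3, 12.7]
t_stat, df = one_sample_t(hb, mu0=12.0)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```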

Bivariate analysis is testing the hypothesis that two means or two proportions are significantly different from one another. The choice of the statistical test for association in bivariate analysis is made according to Table #1.
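As one example of a bivariate test, the comparison of two proportions can be sketched with the Pearson chi-square statistic for a 2x2 table. The counts are hypothetical, no continuity correction is applied, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 30/100 diseased among exposed, 10/100 among unexposed.
chi2, p_value = chi_square_2x2(30, 70, 10, 90)
print(f"chi-square = {chi2:.2f}, p = {p_value:.5f}")
```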

Multivariate analysis in its commonest form is essentially bivariate analysis with adjustment for extraneous variables that confuse (or confound) the bivariate relation. The choice of the statistical test of association for multivariate analysis is made according to Table #2.

STATISTICAL MODELS IN DATA ANALYSIS
Observations, or raw data, have to be fitted to a specific statistical model. Once the model is fitted, it can be used for prediction. There are basically three types of models: probability models, likelihood models, and regression models. A probability model has both a deterministic and a stochastic component. Probability models commonly used in statistical analysis are the binomial and the normal distributions. The likelihood model derives the maximum likelihood estimator from the data. The maximum likelihood estimate (MLE) is the most likely value of the parameter given the data and is derived iteratively. The regression model may be, for example, a Poisson regression model or a binomial logistic regression model. A regression model allows modeling of the interaction among confounders and of the interaction between the exposure and the confounders. It can be used to explore additive and synergistic relations.
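To illustrate how a maximum likelihood estimate is reached by repeated approximation, the sketch below applies Newton-Raphson steps to the binomial log-likelihood. The data (7 events in 20 trials), the starting value, and the function name are assumptions for the example. For the binomial the answer is also available in closed form as x/n, which makes the result easy to check.

```python
def binomial_mle(x, n, start=0.2, tol=1e-10, max_iter=50):
    """Newton-Raphson maximisation of l(p) = x*ln(p) + (n - x)*ln(1 - p)."""
    p = start
    for _ in range(max_iter):
        score = x / p - (n - x) / (1 - p)               # first derivative of l(p)
        hessian = -x / p**2 - (n - x) / (1 - p) ** 2    # second derivative of l(p)
        step = score / hessian
        p -= step
        if abs(step) < tol:
            break
    return p

# Hypothetical data: 7 events observed in 20 trials.
p_hat = binomial_mle(x=7, n=20)
print(f"MLE of p = {p_hat:.4f}, closed form x/n = {7 / 20:.4f}")
```

Regression models such as logistic regression have no closed-form solution, so statistical software relies on the same kind of iteration to fit them.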

NON REGRESSION MULTIVARIATE ANALYSIS (STRATIFIED ANALYSIS)
Stratified analysis has two main purposes: to study effect modification / interaction (variation of effect measures by stratum) and to control bias (confounding bias and other types of bias). It usually starts with an examination of stratum-specific effect measures. If there is variation by stratum (heterogeneity), no further analysis is undertaken and the final results are reported as stratum-specific measures. If there is homogeneity of effect measures across strata, a summary estimate is computed. Heterogeneity is identified using the chi-square test for homogeneity. The summary estimate may be a chi-square statistic or an odds ratio computed using the Mantel-Haenszel procedure.
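The Mantel-Haenszel summary odds ratio can be sketched as follows. Each stratum is a 2x2 table (a, b, c, d) with a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases. The two strata are hypothetical and were chosen so that the stratum-specific odds ratios are homogeneous, which is the situation in which a summary estimate is appropriate.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a single 2x2 table."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio over a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical data stratified by a confounder with two levels.
strata = [(10, 40, 5, 45), (30, 20, 20, 30)]

stratum_ors = [odds_ratio(*s) for s in strata]
crude_or = odds_ratio(*[sum(cells) for cells in zip(*strata)])  # strata collapsed
mh_or = mantel_haenszel_or(strata)

print("stratum-specific ORs:", stratum_ors)
print(f"crude OR = {crude_or:.2f}, Mantel-Haenszel OR = {mh_or:.2f}")
```

In this made-up example the crude odds ratio differs from the adjusted one even though both strata agree with each other, which is the pattern expected when the stratifying variable confounds the association.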

REGRESSION MULTIVARIATE ANALYSIS
Multivariate regression models solve two problems that arise with stratified analysis. First, stratified analysis breaks down when data are sparse, with very low numbers in some strata. Second, stratified analysis becomes very cumbersome when used for more than 3 variables. There are three main types of multivariate models: the linear model, the logistic model, and the proportional hazards model. The linear model is $E(Y) = \beta_0 + \sum_{i=1}^{k} \beta_i x_i$, where $k$ is the number of independent variables. The binary logistic model is of the form $\ln\{p/(1-p)\} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i$. The proportional hazards regression relates the hazard at a given time to risk factors such that $y_i = \ln\{h_i(t)/h_0(t)\} = \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots$ The coefficients of proportional hazards regression are interpreted like coefficients of logistic regression.
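A short sketch of how the binary logistic model is used once its coefficients are known: the linear predictor is converted to a probability, and the exponential of a coefficient gives the odds ratio for a one-unit increase in that variable. The intercept, the coefficients, and the variable names (smoker, age) are hypothetical values chosen for illustration, not fitted estimates.

```python
import math

def logistic_probability(b0, coefs, xs):
    """Invert ln(p / (1 - p)) = b0 + sum(b_i * x_i) to obtain p."""
    linear_predictor = b0 + sum(b * x for b, x in zip(coefs, xs))
    return 1 / (1 + math.exp(-linear_predictor))

def odds(p):
    return p / (1 - p)

# Hypothetical coefficients: intercept, smoker (0/1), age in years.
b0, coefs = -2.0, [0.7, 0.03]

p_smoker = logistic_probability(b0, coefs, [1, 50])
p_nonsmoker = logistic_probability(b0, coefs, [0, 50])

print(f"P(disease | smoker, age 50)     = {p_smoker:.3f}")
print(f"P(disease | non-smoker, age 50) = {p_nonsmoker:.3f}")
print(f"odds ratio for smoking = exp(0.7) = {math.exp(coefs[0]):.3f}")
```

The ratio of the two odds equals exp(0.7) whatever the age, which is what adjustment for the other variables in the model means. In proportional hazards regression the same exponentiation of a coefficient yields a hazard ratio.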


TABLE #1:
CHOICE OF STATISTICAL TECHNIQUE FOR BIVARIATE ANALYSIS[1]

| First variable | Second variable | Test |
|---|---|---|
| Continuous | Dichotomous, unpaired | 2-sample t test |
| Continuous | Dichotomous, paired | Paired t test (1-sample t test after taking differences for each pair) |
| Continuous | Nominal (>= groups) | 1-way ANOVA |
| Continuous | Continuous | Linear correlation (Pearson) or linear regression |
| Ordinal | Dichotomous, unpaired | Mann-Whitney U test or chi-square test for linear trend |
| Ordinal | Dichotomous, paired | Wilcoxon test |
| Ordinal | Ordinal | Spearman correlation or Kendall correlation |
| Ordinal | Continuous | Categorize the continuous variable and use Spearman correlation, Kendall correlation, or the chi-square test |
| Dichotomous | Dichotomous, unpaired | Chi-square test or Fisher exact probability test |
| Dichotomous | Dichotomous, paired | McNemar chi-square test |
| Dichotomous | Nominal | Chi-square test |
| Nominal | Nominal | Chi-square test |
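
The selection rule in Table #1 can also be written as a simple lookup, sketched below in Python. The dictionary is a transcription of the table; the function name, the lower-casing of labels, and the fallback message are assumptions added for convenience.

```python
# Transcription of Table #1; keys are (first variable, second variable).
BIVARIATE_TESTS = {
    ("continuous", "dichotomous, unpaired"): "2-sample t test",
    ("continuous", "dichotomous, paired"): "Paired t test",
    ("continuous", "nominal"): "1-way ANOVA",
    ("continuous", "continuous"): "Linear correlation (Pearson) or linear regression",
    ("ordinal", "dichotomous, unpaired"): "Mann-Whitney U test or chi-square test for linear trend",
    ("ordinal", "dichotomous, paired"): "Wilcoxon test",
    ("ordinal", "ordinal"): "Spearman correlation or Kendall correlation",
    ("ordinal", "continuous"): "Categorize the continuous variable; Spearman, Kendall, or chi-square",
    ("dichotomous", "dichotomous, unpaired"): "Chi-square test or Fisher exact probability test",
    ("dichotomous", "dichotomous, paired"): "McNemar chi-square test",
    ("dichotomous", "nominal"): "Chi-square test",
    ("nominal", "nominal"): "Chi-square test",
}

def choose_bivariate_test(first, second):
    """Look up the test for a pair of variable types, in either order."""
    key = (first.strip().lower(), second.strip().lower())
    if key in BIVARIATE_TESTS:
        return BIVARIATE_TESTS[key]
    # Try the reverse order before giving up.
    return BIVARIATE_TESTS.get((key[1], key[0]), "not listed in Table #1")

print(choose_bivariate_test("Continuous", "Dichotomous, unpaired"))
print(choose_bivariate_test("Nominal", "Continuous"))
```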


TABLE #2:
CHOICE OF STATISTICAL TECHNIQUE FOR MULTIVARIATE ANALYSIS[1]

| Dependent variable | Independent variables | Test |
|---|---|---|
| Continuous | All categorical | ANOVA (analysis of variance) |
| Continuous | Mixture of categorical and continuous | ANCOVA (analysis of covariance) |
| Continuous | All continuous | Multiple linear regression |
| Dichotomous | All categorical | Multiple logistic regression or log-linear analysis |
| Dichotomous | Mixture of categorical and continuous | Logistic regression |
| Time-dependent dichotomous | Mixture of categorical and continuous | Cox's proportional hazards model |
| Dichotomous | All continuous | Logistic regression or discriminant function analysis |
| Nominal | All categorical | Log-linear analysis |
| Nominal | Mixture of categorical and continuous | Group the continuous variables and perform log-linear analysis |
| Nominal | All continuous | Discriminant function analysis, or categorize the continuous variables and perform log-linear analysis |

NB: Categorical includes nominal, ordinal, and dichotomous variables.


NOTE

[1] Jekel et al. Epidemiology, Biostatistics, and Preventive Medicine. WB Saunders, page 175.