Paper presented at a workshop on data analysis using SPSS held at the Kulliyah of Medicine, International Islamic University Kuantan MALAYSIA on 25th May 2004 by Prof Dr Omar Hasan Kasule, Sr. MB ChB (MUK), MPH & DrPH (Harvard)
VARIABLES
Understanding variables and their properties is essential to understanding statistical analysis. A constant has only one unvarying value under all circumstances, for example π or c, the speed of light. A random variable can be qualitative (descriptive, with no intrinsic numerical value) or quantitative (with intrinsic numerical value). Qualitative variables can be nominal (no specific order of magnitude), ordinal (specific order), or ranked. A quantitative random variable results when numerical values are assigned to the results of measurement or counting. It is called a discrete random variable if the assignment is based on counting and a continuous random variable if it is based on measurement. A continuous random variable can be expressed as fractions and decimals; a discrete random variable can only be expressed as whole numbers. The choice of the technique of statistical analysis depends on the type of variable, and many mistakes in data analysis arise from not knowing the difference between discrete and continuous variables and consequently applying the wrong statistical technique.
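As an illustration of these variable types (not part of the original handout), the sketch below encodes a few hypothetical patient variables in Python with pandas, which is assumed to be available; declaring the types explicitly helps the software apply a valid technique later.

import pandas as pd

# Hypothetical patient records illustrating the four kinds of variables
df = pd.DataFrame({
    "blood_group": ["A", "O", "B", "AB"],                     # qualitative, nominal (no order)
    "pain_severity": ["mild", "moderate", "severe", "mild"],  # qualitative, ordinal (ordered)
    "parity": [0, 2, 1, 3],                                   # quantitative, discrete (counts)
    "weight_kg": [61.5, 78.2, 55.0, 90.4],                    # quantitative, continuous (measurements)
})

# Record the qualitative types explicitly
df["blood_group"] = df["blood_group"].astype("category")
df["pain_severity"] = pd.Categorical(
    df["pain_severity"], categories=["mild", "moderate", "severe"], ordered=True
)
print(df.dtypes)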
PRELIMINARIES OF DATA ANALYSIS
Simple manual inspection of the data is needed before applying sophisticated statistical tests. Indiscriminate application of tests to the data leads to wrong or misleading conclusions. Acquiring familiarity with the data by simple manual inspection can help identify outliers, assess the normality of the data distribution, and identify commonsense relationships among variables that could alert the investigator to errors in computer analysis.
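A minimal sketch of such preliminary checks, assuming a single numeric variable held in a pandas Series (the values and the screening rules are illustrative choices, not prescriptions):

import pandas as pd
from scipy import stats

# Hypothetical body weights; 250.0 was entered deliberately as a suspicious value
weight = pd.Series([61.5, 59.9, 66.8, 68.3, 72.1, 74.0, 78.2, 81.2, 90.4, 55.0, 250.0])

print(weight.describe())   # n, mean, SD, min, quartiles, max: a quick screen for odd values

# Tukey's rule: flag values beyond 1.5 interquartile ranges from the quartiles as possible outliers
q1, q3 = weight.quantile([0.25, 0.75])
iqr = q3 - q1
print(weight[(weight < q1 - 1.5 * iqr) | (weight > q3 + 1.5 * iqr)])

# Shapiro-Wilk test as a rough check of normality (a small p-value suggests non-normality)
w_stat, p_value = stats.shapiro(weight)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")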
Data analysis is essentially the construction and testing of hypotheses. Two procedures are employed in statistical analysis: the test for association is done first, and the effect measures are assessed only after an association has been found. Effect measures are useless in situations in which tests for association are negative. The tests for association commonly employed are the t-test, the chi-square test, the linear correlation coefficient, and the linear regression coefficient. The effect measures commonly employed are the odds ratio, the risk ratio, and the rate difference. Measures of trend can discover relationships that are not picked up by association and effect measures.
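The sketch below illustrates this two-step sequence on a hypothetical 2x2 exposure-disease table: a chi-square test for association first, then the effect measures computed from the same cell counts (SciPy is assumed to be available).

from scipy import stats

# Hypothetical 2x2 table: rows are exposed/unexposed, columns are diseased/not diseased
a, b = 30, 70     # exposed:   diseased, not diseased
c, d = 15, 85     # unexposed: diseased, not diseased

# Step 1: test for association
chi2, p, dof, expected = stats.chi2_contingency([[a, b], [c, d]])
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# Step 2: effect measures, examined only if an association was found
odds_ratio = (a * d) / (b * c)
risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
risk_ratio = risk_exposed / risk_unexposed
rate_difference = risk_exposed - risk_unexposed
print(odds_ratio, risk_ratio, rate_difference)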
TYPES OF ANALYSIS
Univariate analysis is testing a hypothesis about one mean or one proportion. The t-test is used to test hypotheses about a single sample mean. The chi-square test is used to test hypotheses about a single sample proportion. Univariate testing answers the question of whether the given mean or proportion is significantly different from a hypothesized null value (for example, zero).
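A minimal sketch of both univariate tests using SciPy (the blood-pressure values, the hypothesized mean of 120, and the hypothesized proportion of 0.5 are all invented for illustration):

from scipy import stats

# One-sample t-test: is the mean systolic blood pressure different from a hypothesized 120 mmHg?
bp = [118, 125, 131, 122, 128, 135, 119, 127]
t_stat, p_val = stats.ttest_1samp(bp, popmean=120)
print(f"t = {t_stat:.2f}, p = {p_val:.3f}")

# One-sample chi-square (goodness of fit): is an observed 60/200 compatible with a proportion of 0.5?
observed = [60, 140]
expected = [100, 100]
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")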
Bivariate analysis is testing the hypothesis of whether two means or two proportions are significantly different from one another. The choice of the statistical test for association in bivariate analysis is made according to Table #1 (a worked sketch follows the table).
Multivariate analysis in its commonest form is essentially bivariate analysis with adjustment for extraneous variables that confuse (confound) the bivariate relation. The choice of the statistical test of association for multivariate analysis is made according to Table #2 (a worked sketch follows the table).
STATISTICAL MODELS IN DATA ANALYSIS
Observations or raw data have to be fitted to a specific statistical model. Once the model is fitted, it can be used for prediction. There are basically three types of models: probability models, likelihood models, and regression models. A probability model may be deterministic or stochastic; the probability models commonly used in statistical analysis are the binomial and the normal distributions. The likelihood model derives the maximum likelihood estimator from the data. The maximum likelihood estimate, MLE, is the most likely value of the parameter given the data and is derived iteratively. The regression model may be a Poisson regression model or a binomial logistic regression model. The model allows modeling the interaction among confounders and the interaction between the exposure and the confounders, and it can be used to explore additive and synergistic relations.
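As a sketch of these regression models (an illustration using statsmodels, which is assumed to be installed; the data and variable names are simulated), the binomial logistic and Poisson fits below are obtained by maximum likelihood, estimated iteratively:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "exposure": rng.integers(0, 2, n),   # hypothetical exposure indicator
    "age": rng.normal(50, 10, n),        # hypothetical confounder
})
# Outcomes simulated purely for illustration
linear_predictor = -2 + 0.8 * df["exposure"] + 0.02 * df["age"]
df["disease"] = rng.binomial(1, 1 / (1 + np.exp(-linear_predictor)))
df["events"] = rng.poisson(np.exp(-1 + 0.5 * df["exposure"]))

# Binomial logistic regression, fitted by iterative maximum likelihood
logit_model = smf.logit("disease ~ exposure + age", data=df).fit()
print(logit_model.summary())

# Poisson regression for counts, also fitted by maximum likelihood
poisson_model = smf.glm("events ~ exposure + age", data=df,
                        family=sm.families.Poisson()).fit()
print(poisson_model.summary())

# Interaction between the exposure and the confounder (exposure:age term)
interaction_model = smf.logit("disease ~ exposure * age", data=df).fit()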
Multivariate models solve two problems that arise when stratified analysis is used: stratified analysis breaks down when the data are sparse, with very low numbers in some strata, and it would be very cumbersome if it were used for more than three variables. There are three main types of multivariate models: the linear model, the logistic model, and the proportional hazards model. The linear model is E(Y) = b0 + b1x1 + b2x2 + … + bkxk. The binary logistic model is of the form ln{p/(1 - p)} = b0 + b1x1 + b2x2 + … + bkxk. The proportional hazards regression relates the hazard at a given time to the risk factors such that yi = ln{hi(t) / h0(t)} = b1x1i + b2x2i + …. The coefficients of proportional hazards regression are interpreted like the coefficients of logistic regression.
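The sketch below fits each of the three model types to simulated data as an illustration; statsmodels and lifelines are assumed to be installed, and all variable names are invented. Exponentiated coefficients give odds ratios for the logistic model and hazard ratios for the proportional hazards model.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({"x1": rng.normal(0, 1, n), "x2": rng.integers(0, 2, n)})
df["y_cont"] = 2 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(0, 1, n)  # continuous outcome
df["y_bin"] = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + df["x1"]))))        # binary outcome
df["time"] = rng.exponential(10, n)                                       # follow-up time
df["event"] = rng.binomial(1, 0.7, n)                                     # event indicator

# Linear model: E(Y) = b0 + b1*x1 + b2*x2
linear = smf.ols("y_cont ~ x1 + x2", data=df).fit()

# Logistic model: ln{p/(1-p)} = b0 + b1*x1 + b2*x2; exp(b) is an odds ratio
logistic = smf.logit("y_bin ~ x1 + x2", data=df).fit()
print(np.exp(logistic.params))

# Proportional hazards model: ln{h_i(t)/h_0(t)} = b1*x1 + b2*x2; exp(b) is a hazard ratio
cox = CoxPHFitter().fit(df[["time", "event", "x1", "x2"]],
                        duration_col="time", event_col="event")
cox.print_summary()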
TABLE #1:
CHOICE OF STATISTICAL TECHNIQUE FOR BIVARIATE ANALYSIS[1]
First variable | Second variable | Test |
Continuous | Dichotomous, unpaired | 2-sample t test |
Continuous | Dichotomous, paired | Paired t test (1-sample t test after taking differences for each pair) |
Continuous | Nominal (3 or more groups) | 1-way ANOVA |
Continuous | Continuous | Linear correlation (Pearson) or linear regression |
Ordinal | Dichotomous, unpaired | Mann-Whitney U test or Chi-square test for linear trend |
Ordinal | Dichotomous, paired | Wilcoxon matched-pairs signed-rank test |
Ordinal | Ordinal | Spearman correlation or Kendall correlation |
Ordinal | Continuous | Categorize the continuous variable and use Spearman correlation, Kendall correlation, or the chi-square test |
Dichotomous | Dichotomous, unpaired | Chi-square test or Fisher exact probability test |
Dichotomous | Dichotomous, paired | McNemar chi-square test |
Dichotomous | Nominal | Chi-square test |
Nominal | Nominal | Chi-square test |
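As a worked illustration of several rows of Table #1 (not part of the original handout), the sketch below applies the corresponding SciPy functions to small invented data sets:

from scipy import stats

# Continuous outcome in two unpaired groups: 2-sample t test
group_a = [5.1, 6.2, 5.8, 6.5, 5.9]
group_b = [4.8, 5.0, 5.2, 4.6, 5.1]
print(stats.ttest_ind(group_a, group_b))

# Continuous outcome measured twice on the same subjects: paired t test
before = [120, 132, 128, 140, 125]
after = [115, 130, 126, 134, 123]
print(stats.ttest_rel(before, after))

# Ordinal outcome in two unpaired groups: Mann-Whitney U test
print(stats.mannwhitneyu(group_a, group_b))

# Two ordinal variables: Spearman correlation
print(stats.spearmanr([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))

# Two continuous variables: Pearson correlation
print(stats.pearsonr([1.2, 2.3, 3.1, 4.8], [2.0, 2.9, 3.9, 5.1]))

# Two dichotomous variables (2x2 table): chi-square test or, with small counts, Fisher exact test
table = [[12, 8], [5, 15]]
print(stats.chi2_contingency(table))
print(stats.fisher_exact(table))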
TABLE #2:
CHOICE OF STATISTICAL TECHNIQUE FOR MULTIVARIATE ANALYSIS[1]
Dependent variable | Independent variables | Test |
Continuous | All categorical | ANOVA (analysis of variance) |
Continuous | Mixture of categorical and continuous | ANCOVA (Analysis of covariance) |
Continuous | All continuous | Multiple linear regression |
Dichotomous | All categorical | Multiple logistic regression or log-linear analysis |
Dichotomous | Mixture of categorical and continuous | Logistic regression |
Time-dependent dichotomous | Mixture of categorical and continuous | Cox’s proportional hazards model |
Dichotomous | All continuous | Logistic regression or discriminant function analysis |
Nominal | All categorical | Log-linear analysis |
Nominal | Mixture of categorical and continuous | Group the continuous variables and perform log-linear analysis |
Nominal | All continuous | Discriminant function analysis or categorize the continuous and perform log-linear analysis |
NB: Categorical includes nominal, ordinal, and dichotomous variables.
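The sketch below illustrates two common rows of Table #2 using statsmodels (assumed to be installed; the data and variable names are simulated): analysis of covariance for a continuous outcome with a mixture of categorical and continuous predictors, and logistic regression for a dichotomous outcome with the same mixture.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({
    "treatment": rng.choice(["A", "B", "C"], n),   # categorical independent variable
    "age": rng.normal(45, 12, n),                  # continuous independent variable (covariate)
})
df["score"] = 50 + 3 * (df["treatment"] == "B") + 0.2 * df["age"] + rng.normal(0, 5, n)
df["cured"] = rng.binomial(1, 0.4, n)

# Continuous dependent variable, mixed predictors: ANCOVA via a linear model
ancova = smf.ols("score ~ C(treatment) + age", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))

# Dichotomous dependent variable, mixed predictors: logistic regression
logit = smf.logit("cured ~ C(treatment) + age", data=df).fit()
print(logit.summary())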
NOTE
[1] Jekel et al. Epidemiology, Biostatistics, and Preventive Medicine. WB Saunders, p. 175.