Integrated Medical Education Resources: 1004L- DATA ANALYSIS

Presented at a workshop on evidence-based decision making organized by the Ministry of Health, Kingdom of Saudi Arabia, Riyadh, 24-26 April 2010, by Professor Omar Hasan Kasule, MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Professor of Epidemiology and Bioethics, Faculty of Medicine, King Fahd Medical College

1.0 DISCRETE DATA ANALYSIS

Discrete data, also called categorical data, is based on counting and has no fractions or decimals. It is analyzed using the chi-square statistic for large samples and Fisher's exact test for small samples.
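As an illustration of the chi-square statistic for the simplest case, a 2x2 table, the following minimal Python sketch may help. The function names and the counts are hypothetical, and the threshold of about 5 for the smallest expected count is a widely used rule of thumb rather than something stated in this handout.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut form of sum((observed - expected)^2 / expected) for a 2x2 table.
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def min_expected_count(a, b, c, d):
    """Smallest expected cell count under the hypothesis of no association."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return min(r * k / n for r in rows for k in cols)


# Hypothetical counts, for illustration only.
table = (20, 30, 30, 20)
print(chi_square_2x2(*table))      # compare with 3.84, the 5% critical value for 1 degree of freedom
print(min_expected_count(*table))  # a value below about 5 is a common cue to prefer the exact test
```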

The Mantel-Haenszel chi-square is used to test 2 proportions in stratified data.
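One common form of the Mantel-Haenszel statistic, without continuity correction, sums the observed count, its expected value, and its variance for one cell across strata. The sketch below assumes each stratum is given as a 2x2 table (a, b, c, d) with more than one observation; the function name and the counts are hypothetical.

```python
def mantel_haenszel_chi_square(strata):
    """strata: iterable of (a, b, c, d) 2x2 tables, one per stratum."""
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        sum_a += a
        # Expected value and variance of cell a given the stratum margins.
        sum_expected += row1 * col1 / n
        sum_variance += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    return (sum_a - sum_expected) ** 2 / sum_variance


# Hypothetical strata (for example, two age groups), for illustration only.
strata = [(10, 10, 5, 15), (10, 10, 5, 15)]
print(mantel_haenszel_chi_square(strata))  # referred to chi-square with 1 degree of freedom
```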

The McNemar chi-square is used for pair-matched data.
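For pair-matched data, the uncorrected form of this statistic depends only on the discordant pairs. A minimal sketch, with b and c standing for the two discordant-pair counts (names and numbers are hypothetical):

```python
def mcnemar_chi_square(b, c):
    """Uncorrected McNemar chi-square from the two discordant-pair counts."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    return (b - c) ** 2 / (b + c)


# Hypothetical matched study: 15 pairs discordant one way, 5 the other way.
print(mcnemar_chi_square(15, 5))  # referred to chi-square with 1 degree of freedom
```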

2.0 CONTINUOUS DATA

Inference on numeric continuous data is based on the comparison of sample means. Three test statistics are commonly used: the z-, the t- and the F-statistics. The z-statistic is used for large samples. The t and F are used for small or moderate samples. The z-statistic and the t-statistic are used to compare 2 samples. The F statistic is used to compare 3 or more samples.

The Student's t-test is the most commonly used test statistic for inference on continuous numerical data. It is used for independent and paired samples. It is used uniformly for sample sizes below 60 and for larger samples if the population standard deviation is not known. The F-test is used to compare 3 or more groups.
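The comparison of 3 or more groups can be sketched as a one-way analysis of variance, in which the F statistic is the ratio of the between-group mean square to the within-group mean square. The function name and the data below are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """F statistic for one-way ANOVA; groups is a list of lists of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    ms_between = ss_between / (k - 1)       # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)   # n_total - k degrees of freedom
    return ms_between / ms_within


# Hypothetical measurements in three groups, for illustration only.
print(one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```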

The formulas for the z, t, and F statistics vary depending on whether the samples are paired or unpaired. They also vary depending on whether the samples have equal or different numbers of observations.
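To show how the formula changes between unpaired and paired samples, here is a minimal sketch of the two t statistics. It assumes the pooled-variance (equal variances) form for independent samples; the function names and data are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def t_independent(x, y):
    """Pooled-variance t for two independent samples; sample sizes may differ."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def t_paired(first, second):
    """Paired t: a one-sample t on the within-pair differences (second - first)."""
    diffs = [s - f for f, s in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical data, for illustration only.
print(t_independent([1, 2, 3], [2, 3, 4]))  # n1 + n2 - 2 degrees of freedom
print(t_paired([1, 2, 3], [2, 4, 6]))       # n - 1 degrees of freedom
```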

3.0 CORRELATION ANALYSIS

Correlation analysis is used as preliminary data analysis before applying more sophisticated methods. Correlation describes the relation between 2 random variables (bivariate relation) about the same person or object with no prior evidence of inter-dependence. Correlation indicates only association; the association is not necessarily causative. The first step in correlation analysis is to inspect a scatter plot of the data to obtain a visual impression of the data layout and identify outliers. The next step is to compute a correlation coefficient; Pearson's coefficient of correlation (product-moment correlation), r, is the commonest statistic for linear correlation.
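Pearson's r can be computed directly from its definition as the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. A minimal sketch with hypothetical names and data:

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements on the same subjects, for illustration only.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, increasing
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly linear, decreasing
```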

4.0 REGRESSION ANALYSIS

The simple linear regression equation is y = a + bx where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable.
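The intercept a and slope b are usually estimated by ordinary least squares; the handout does not name the estimation method, so that is an assumption here. A minimal sketch with hypothetical names and data:

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope / regression coefficient
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x, for illustration only.
a, b = fit_line([1, 2, 3], [3, 5, 7])
print(a, b)
print(a + b * 2.5)  # predicted y at x = 2.5 (interpolation)
```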

Multiple linear regression, a form of multivariate analysis, is defined by y = a + b₁x₁ + b₂x₂ + … + bₙxₙ. Linear regression is used for prediction (interpolation and extrapolation) and for analysis of variance.

Logistic regression is non-linear regression with y dichotomous/binary being predicted by one x or several x's.
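In the usual formulation, the logistic model passes the linear combination a + b₁x₁ + … through the logistic function to obtain the probability that y = 1. The sketch below only evaluates that probability for given coefficients; fitting the coefficients (typically by maximum likelihood) is not shown, and the names and coefficient values are hypothetical.

```python
import math


def logistic_probability(a, coefficients, xs):
    """P(y = 1) under a logistic model with intercept a and one coefficient per x."""
    linear = a + sum(b * x for b, x in zip(coefficients, xs))
    return 1 / (1 + math.exp(-linear))


# Hypothetical fitted values: intercept -1 and slope 0.5 for a single predictor.
print(logistic_probability(-1.0, [0.5], [4.0]))
```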

5.0 TIME SERIES ANALYSIS

Longitudinal data is summarized as a time series plot of y against time showing time trends, seasonal patterns, random/irregular patterns, or mixtures of these. Moving averages may be plotted instead of raw values for a more stable curve. Time series plots are used for showing trends and forecasting.
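A simple (unweighted, trailing) moving average replaces each point with the mean of a fixed-width window of consecutive observations. The sketch below assumes that form; the window width, function name, and data are hypothetical.

```python
def moving_average(values, window):
    """Means of each run of `window` consecutive values, in time order."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical monthly counts, for illustration only.
print(moving_average([1, 2, 3, 4, 5], 3))
```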

6.0 SURVIVAL ANALYSIS

Survival analysis is used to study survival duration and the effects of various factors on survival. Two non-regression methods are used in survival analysis: the life-table and the Kaplan-Meier methods. The life-table method works better with large data sets and when the time of occurrence of an event cannot be measured precisely. It fixes the time intervals in advance and assumes that withdrawals occur at the start of the interval when in reality they occur throughout the interval; this assumption can create bias or imprecision. The Kaplan-Meier method is best used for small data sets in which the time of event occurrence is measured precisely. It improves on the life-table method in the handling of withdrawals, avoiding this complication by not fixing the time intervals in advance.
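The product-limit calculation can be sketched as follows: at each distinct time at which an event occurs, the running survival probability is multiplied by (1 - events / number at risk), so the intervals are set by the data rather than in advance. The input layout, a list of (time, event) pairs with event = 1 for an observed event and 0 for a withdrawal, is an assumption made for this sketch, as are the names and the data.

```python
def kaplan_meier(observations):
    """Return [(event_time, survival_probability), ...] from (time, event) pairs.

    event is 1 if the event was observed at that time, 0 if the subject
    was withdrawn (censored) at that time.
    """
    event_times = sorted({t for t, e in observations if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for time, _ in observations if time >= t)
        events = sum(1 for time, e in observations if time == t and e == 1)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data: one withdrawal at time 2, for illustration only.
data = [(1, 1), (2, 0), (3, 1), (4, 1)]
for time, s in kaplan_meier(data):
    print(time, s)
```

Withdrawn subjects count toward the number at risk up to their withdrawal time and then drop out without changing the survival estimate at that moment.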