Integrated Medical Education Resources: 061201P - ARTIFICIAL DATA SET FOR PRACTICE ANALYSIS BY POSTGRADUATE EPIDEMIOLOGY STUDENTS

Data analysis workshop by Professor Omar Hasan Kasule for MPH candidates at Universiti Malaya, 24th November – 1st December 2006

INSTRUCTIONS

The attached data set is quite abstract, and the numbers were selected in a roughly random way. The aim is to give you practice in managing and analyzing data. Some of the conclusions you reach may not be logical because the data are not natural. The advantage of this is that you will focus on what the data are telling you and not on any preconceived ideas or prior knowledge.

The assignment instructions are deliberately general, to force you to think of everything that can be done and to make choices, so do not limit your imagination. You will have to use your ingenuity to complete the data management and analysis, starting from converting a Word file into an SPSS file and then looking for the various analytic procedures. It is possible that some analyses are not available in SPSS, and you may have to do them by hand using formulas looked up in specialized books, or with analytic programs other than SPSS. You may need to compute and use extra variables. Be humble: there may be some analyses you cannot carry out.

Please note that completing this data analysis exercise will require a heavy investment of time, so budget your time carefully. The analyses are numerous, and you may consider working in one of two groups so that you discuss together but share the actual computer work, because it will be extensive.

The data set is basically a cohort study with a nested case-control study. It can also be analyzed as a cross-sectional study using the status at the point in time that the rectangular data file shows. All the analytic procedures will have to be repeated three times, once for each of the 3 study designs: cross-sectional, case-control, and follow-up (cohort). Cross-sectional analysis will use the data as shown in the rectangular file. For case-control analysis you will have to randomly select 20 cases from the cases of throat cancer and 20 controls from the non-cancer patients. For cohort analysis you will use the follow-up times provided.
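
The random selection for the case-control analysis is easier to defend if it can be repeated. The following is a minimal sketch in Python rather than SPSS, assuming each record is a dictionary with a hypothetical field `throat_cancer` coded 1 for cases and 0 for non-cases; the field name, the seed, and the demonstration records are illustrative and are not taken from the data set.

```python
import random

def draw_case_control_sample(records, n_per_group=20, seed=2006):
    """Randomly pick n cases and n controls without replacement.

    `records` is a list of dicts; the key "throat_cancer" (1 = case,
    0 = non-case) is an assumed name, not one taken from the data set.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    cases = [r for r in records if r["throat_cancer"] == 1]
    non_cases = [r for r in records if r["throat_cancer"] == 0]
    if len(cases) < n_per_group or len(non_cases) < n_per_group:
        raise ValueError("not enough cases or controls to sample from")
    return rng.sample(cases, n_per_group), rng.sample(non_cases, n_per_group)

# Illustrative made-up records: 30 cases and 70 non-cases.
demo = [{"id": i, "throat_cancer": 1 if i < 30 else 0} for i in range(100)]
cases, controls = draw_case_control_sample(demo)
print(len(cases), "cases and", len(controls), "controls selected")
```

Recording the seed alongside your results lets another group reproduce exactly the same 20 cases and 20 controls.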

DATA MANAGEMENT

Undertake data validation and data editing, and solve any data problems you identify, for example by handling missing data and outliers, if any. Problems in the data should not be a bar to further analysis, since this is an exercise.
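
One possible way to screen a numeric variable for missing values and outliers is sketched below. It assumes that missing values are coded as `None` and uses Tukey's fences (1.5 times the interquartile range) as the outlier rule; both are illustrative choices rather than requirements of the exercise, and the weights shown are made up.

```python
import statistics

def screen_variable(values, k=1.5):
    """Return (missing_indices, outlier_indices) for one numeric variable.

    Missing values are assumed to be coded as None. Outliers are flagged
    with Tukey's fences: below Q1 - k*IQR or above Q3 + k*IQR.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    outliers = [i for i, v in enumerate(values)
                if v is not None and (v < low or v > high)]
    return missing, outliers

# Made-up weights in kg; 200 is an implausible entry, None is missing.
weights = [60, 62, 65, 61, 63, 64, 200, None]
missing, outliers = screen_variable(weights)
print("missing rows:", missing, "outlier rows:", outliers)
```

A flagged value is only a candidate for checking; whether to correct, exclude, or keep it remains your decision.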

DESCRIPTIVE STATISTICS

Assess the normality of relevant variables in the data set and normalize the non-normal ones.
Find out how you would check equality of variances of cancer and smoking prevalence as a condition for using large sample tests.
Produce all relevant summary statistics for all variables in the data set, giving both point estimates and measures of variation/dispersion.
Draw and interpret a scattergram of weight against height.
Compute a linear correlation matrix for relevant variables and compute other types of correlation coefficients between each pair of relevant variables. Test for the significance of the linear correlation coefficients.
Construct a multiple linear regression model relating weight to height and adjusting for relevant confounders. Interpret the indicators of goodness of fit from the printout.
Using the t test, determine whether throat cancer risk is associated with weight.
Repeat the analysis above using a corresponding non-parametric test, assuming for the purposes of this exercise that the data are not normally distributed.
Compute the incidence rate of throat cancer and give a 95% confidence interval.
Compute the prevalence of throat cancer and give a 95% confidence interval.
Compute and draw a survival curve for throat cancer patients using the life table method.
Compute and draw a survival curve for throat cancer patients using the Kaplan-Meier method.
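
If you want to check a Kaplan-Meier printout by hand, the product-limit calculation can be sketched as follows. This is a minimal illustration, assuming the event indicator is coded 1 for an observed death and 0 for a censored observation; the follow-up times are made up.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  : follow-up time for each subject
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects both leave the risk set
    return curve

# Made-up follow-up times (months) with two censored observations.
demo_times = [2, 3, 3, 5, 8]
demo_events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(demo_times, demo_events):
    print(f"time {t}: estimated survival {s:.3f}")
```

Subjects censored at the same time as a death are counted as at risk at that time, which is the usual convention.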

ANALYTIC STATISTICS: UNSTRATIFIED ANALYSIS

Compute the chi-square for association between throat cancer and smoking.
Use Fisher’s exact test to test for association between throat cancer and smoking.
Compute the rate ratio of throat cancer in smokers vs nonsmokers and give the 95% confidence interval.
Compute the rate difference of throat cancer between smokers and nonsmokers and give a 95% confidence interval.
Compute the prevalence difference of throat cancer between smokers and nonsmokers and give a 95% confidence interval.
Compute the prevalence odds ratio of throat cancer in smokers vs nonsmokers and give the 95% confidence interval.
Using the odds ratio from above, compute all the various attributable measures that you know.
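
As an example of a hand calculation for this section, the sketch below computes an odds ratio from a 2x2 table with a 95% confidence interval by Woolf's logit method, and then two attributable fractions derived from the odds ratio. The cell counts are made up, and the attributable fractions are approximations that assume the odds ratio is close to the risk ratio.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (logit) confidence interval.

    a = exposed cases,   b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper

def attributable_fractions(a, c, or_):
    """Attributable fraction in the exposed and in the population.

    Both use the odds ratio in place of the risk ratio, which is only
    an approximation. The population fraction weights the exposed
    fraction by the proportion of cases that are exposed.
    """
    af_exposed = (or_ - 1) / or_
    af_population = (a / (a + c)) * af_exposed
    return af_exposed, af_population

# Made-up counts: 20 smoker cases, 30 smoker non-cases,
# 10 nonsmoker cases, 40 nonsmoker non-cases.
or_, lower, upper = odds_ratio_ci(20, 30, 10, 40)
af_exposed, af_population = attributable_fractions(20, 10, or_)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"AF exposed = {af_exposed:.3f}, AF population = {af_population:.3f}")
```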

ANALYTIC STATISTICS: STRATIFYING BY RELEVANT POTENTIAL CONFOUNDERS

Carry out tests for homogeneity of the chi-squares/odds ratios of throat cancer in smokers vs nonsmokers across different levels of (a) potential confounding variable(s).
Compute the Mantel-Haenszel (MH) chi-square of association between throat cancer and smoking, stratifying by (a) relevant confounder(s).
Compute the MH odds ratio with 95% confidence intervals for throat cancer in smokers vs nonsmokers, stratifying by (a) relevant confounder(s).
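
The stratified summary measures can also be checked by hand. The sketch below computes the Mantel-Haenszel pooled odds ratio and the Mantel-Haenszel chi-square (without continuity correction) from a list of stratum-specific 2x2 tables. The two strata are made up, and the confidence interval is left out because its variance formula is longer.

```python
def mantel_haenszel(strata):
    """Pooled odds ratio and chi-square over a list of 2x2 tables.

    Each stratum is (a, b, c, d):
      a = exposed cases,   b = exposed non-cases,
      c = unexposed cases, d = unexposed non-cases.
    The chi-square is computed without a continuity correction.
    """
    num = den = observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return num / den, (observed - expected) ** 2 / variance

# Made-up tables for two levels of a hypothetical confounder.
demo_strata = [(10, 20, 5, 40), (15, 10, 10, 20)]
or_mh, chi2_mh = mantel_haenszel(demo_strata)
print(f"MH odds ratio = {or_mh:.2f}, MH chi-square = {chi2_mh:.2f}")
```

With a single stratum the pooled odds ratio reduces to the crude odds ratio, which is a quick check on your own arithmetic.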

ANALYTIC STATISTICS: REGRESSION

Construct a logistic regression model relating throat cancer to smoking, identifying and adjusting for (a) potential confounder(s). Try all 3 methods of model fitting (step up, step down, and stepwise) and use a 0.05 cut-off point. Derive the odds ratio, test for its significance, and derive its 95% confidence interval. Interpret the indicators of model fit from your printouts.
Explore interaction/effect modification by using interaction (multiplication) variables. If you find a significant interaction term, determine whether it changes the odds ratio and show how this is done.
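
To see where the odds ratio on a logistic regression printout comes from, the sketch below fits a model with a single binary covariate by Newton-Raphson and converts the coefficient into an odds ratio with a 95% confidence interval. It is a deliberately reduced illustration: it handles one covariate only, has no confounders, interaction terms, or stepwise selection, and uses made-up data.

```python
import math

def fit_logistic_one_covariate(x, y, iterations=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, se_b1). Written for a single covariate so that the
    2x2 information matrix can be inverted by hand.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p            # score for b0
            g1 += (yi - p) * xi     # score for b1
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(information) * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the information matrix of the final iteration,
    # which is effectively at the converged estimates.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

# Made-up data: 50 smokers (20 cases) and 50 nonsmokers (10 cases).
smoking = [1] * 50 + [0] * 50
cancer = [1] * 20 + [0] * 30 + [1] * 10 + [0] * 40
b0, b1, se_b1 = fit_logistic_one_covariate(smoking, cancer)
odds_ratio = math.exp(b1)
ci = (math.exp(b1 - 1.96 * se_b1), math.exp(b1 + 1.96 * se_b1))
print(f"OR = {odds_ratio:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```

With only one binary covariate, the fitted odds ratio equals the crude odds ratio from the 2x2 table; the two diverge once confounders are added to the model.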

ANALYTIC STATISTICS: SURVIVAL

Use the life table method to construct separate survival curves for drug A and drug B. Use (a) suitable test(s) of significance.
Use the Kaplan-Meier method to construct separate survival curves for drug A and drug B. Use (a) suitable test(s) of significance.
Use Cox’s model to explore the effects of treatment on survival and the effect(s) of prognostic variable(s).
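
One commonly used test of significance for comparing two survival curves is the log-rank test. The sketch below shows one way of computing its chi-square statistic (1 degree of freedom) by hand; it does not cover Cox's model. The follow-up times for the two drugs are made up, and events are assumed to be coded 1 for death and 0 for censored.

```python
def logrank_chi_square(times_a, events_a, times_b, events_b):
    """Log-rank chi-square (1 degree of freedom) comparing two groups."""
    observed_a = expected_a = variance = 0.0
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    for t in event_times:
        n_a = sum(1 for ti in times_a if ti >= t)  # at risk in A just before t
        n_b = sum(1 for ti in times_b if ti >= t)  # at risk in B just before t
        d_a = sum(1 for ti, e in zip(times_a, events_a) if ti == t and e == 1)
        d_b = sum(1 for ti, e in zip(times_b, events_b) if ti == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0  # no information to separate the groups
    return (observed_a - expected_a) ** 2 / variance

# Made-up follow-up times (months) for the two treatment groups.
drug_a_times, drug_a_events = [6, 7, 10, 15, 19, 25], [1, 1, 0, 1, 1, 0]
drug_b_times, drug_b_events = [1, 2, 4, 5, 8, 12], [1, 1, 1, 1, 0, 1]
chi2 = logrank_chi_square(drug_a_times, drug_a_events,
                          drug_b_times, drug_b_events)
print(f"log-rank chi-square = {chi2:.2f}")
```

Compare the statistic with the chi-square distribution on 1 degree of freedom (critical value 3.84 at the 0.05 level).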