Presented at a pre-conference workshop held in conjunction with the Malaysia-Indonesia-Brunei Medical Conference, Rizqun International Hotel, Gadong, Brunei, 27-29 July 2007, by Professor Omar Hasan Kasule MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Professor of Epidemiology and Islamic Medicine, Institute of Medicine, Universiti Brunei Darussalam.
Learning Objectives:
· Use of multivariate methods in controlling confounding and prediction
· Definition of the logistic regression line and equation
· Three methods of fitting multiple regression: step-up, step-down, and stepwise
· Assessment of goodness of fit
Key Words and Terms:
· Confounding
· Interactive factor
· Likelihood Function
· Model, overfit
· Model, reduced
· Model, under-fit
· Model, validation
· Principle of parsimony
· Regression, multiple logistic
· Selection, automatic variable
· Selection, backward
· Selection, forward
· Selection, stepwise
· Standard Error Of The Estimate
· Variable, dependent
· Variable, dummy
· Variable, independent
· Variable, indicator
1.0 LOGISTIC REGRESSION
1.1 DEFINITION
Logistic regression is a type of non-linear regression. In logistic regression the dependent variable, y, is stated as the logit, which is the logarithmic transformation of a proportion or a probability. The outcome variable, y, is binary (dichotomous).
The derivation of the logistic model is simple and straightforward. If the outcome y is dichotomous, it can take on only two values, 0 and 1. We can define p = Pr(y = 1). The odds of y = 1 can be computed as p/(1 - p). The logit is defined as the log transformation of the odds, thus logit(p) = natural logarithm of [p/(1 - p)] = a + b1x1 + … + bnxn.
By simple mathematical manipulation we can solve for p as p = 1 / [1 + e^-(a + b1x1 + … + bnxn)]. The parameters are fitted by maximum likelihood estimation (MLE). The logistic function, written as e^x / (1 + e^x), is the inverse of the logit function, written as log[p/(1 - p)].
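As a minimal illustration of the logit and its inverse (a sketch in Python with numpy; the variable names are hypothetical):

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p: log[p/(1 - p)]."""
    return np.log(p / (1 - p))

def inverse_logit(x):
    """Logistic function e^x / (1 + e^x): maps a linear predictor to a probability."""
    return np.exp(x) / (1 + np.exp(x))

p = 0.25
x = logit(p)                      # log(0.25/0.75) is about -1.10
print(x, inverse_logit(x))        # inverse_logit recovers 0.25
```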
Logistic regression is very useful in epidemiological analysis for two reasons. (a) It handles a dichotomous outcome variable and allows the odds ratio to be derived directly from the regression coefficient. (b) It can also be used for classification into two categories. A cut-off point is set for dichotomizing the predicted outcome, for example at Pr(y = 1) = 0.2, 0.5, or 0.7. The model is set up, the coefficients are computed, and the coefficients are then used to classify any person with a given profile of independent x variables.
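A sketch of both uses (assuming Python with numpy and statsmodels; the data and the 0.5 cut-off are hypothetical illustrations, not part of the original text):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y is the binary outcome, x1 and x2 are covariates.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.binomial(1, 0.4, size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * x1 + 1.2 * x2))))

X = sm.add_constant(np.column_stack([x1, x2]))   # intercept a plus x1, x2
model = sm.Logit(y, X).fit(disp=0)               # fitted by maximum likelihood

print(np.exp(model.params[1:]))                  # odds ratios e^b for x1 and x2

# Classification: dichotomize the predicted probability at a chosen cut-off.
cutoff = 0.5
predicted_class = (model.predict(X) >= cutoff).astype(int)
```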
1.2 TESTS OF SIGNIFICANCE
The likelihood ratio test is constructed by comparing the likelihoods of two models, one with and the other without the covariate whose significance is being tested. The chi-square statistic, with 1 degree of freedom, is χ² = (-2 ln L0) - (-2 ln L1) = -2 ln(L0/L1), where L0 is the likelihood of the model without the covariate and L1 the likelihood of the model with it. The alternative test is the Wald test, which is based on the regression coefficient: z = b/se(b). The likelihood ratio and Wald tests give the same result for large samples; for smaller samples the likelihood ratio test is more reliable. The confidence limits for the odds ratio can be computed as e^(b ± 1.96 se(b)).
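A sketch of both tests and the confidence limits (assuming Python with numpy, scipy, and statsmodels; the data are hypothetical and the covariate tested is x2):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical data; x2 is the covariate whose significance is tested.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.5 * x1 + 0.7 * x2))))

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

# Likelihood ratio test: (-2 ln L0) - (-2 ln L1), chi-square with 1 df.
lr_chi2 = -2 * (reduced.llf - full.llf)
lr_p = stats.chi2.sf(lr_chi2, df=1)

# Wald test: z = b / se(b) for the coefficient of x2 (the last parameter).
b, se = full.params[-1], full.bse[-1]
wald_z = b / se
wald_p = 2 * stats.norm.sf(abs(wald_z))

# 95% confidence limits for the odds ratio: e^(b +/- 1.96 se(b)).
or_ci = np.exp([b - 1.96 * se, b + 1.96 * se])
print(lr_chi2, lr_p, wald_z, wald_p, or_ci)
```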
1.3 STATISTICAL PACKAGE
When using statistical packages to model the logistic relation, care must be taken to make sure that the right response is being modeled; some packages model the logit of the non-event (y = 0) by default. The usual output of a logistic regression is: the parameter estimate (the logistic regression coefficient); the standard error of the estimate; the Wald chi-square, defined as {b/se(b)}²; the p-value, being the probability of a result higher than the given value of the chi-square; the standardized estimate, defined as b·s/(π/√3), where s is the standard deviation of the explanatory variable and π²/3 is the variance of the standard logistic distribution; the odds ratio (OR), defined as the exponent of b, OR = e^b; the 95% confidence interval for the OR; the global chi-square; and six statistics that describe the association of predicted and observed probabilities: concordant pairs, discordant pairs, tied pairs, Somers' D, Gamma, and Tau-a.
1.4 ANALYSIS OF MATCHED DATA
ANALYSIS OF 1:1 MATCHED DATA
The variables are transformed: for each case-control pair, the difference between the case and control values of each covariate is used as the explanatory variable, and the usual logistic regression model is fitted (with no intercept term).
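A sketch of this difference approach (assuming Python with numpy and statsmodels; the matched-pair data are hypothetical). For 1:1 matching the conditional likelihood reduces to an ordinary logistic regression on the case-minus-control differences, fitted with no intercept and with every response set to 1:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical covariate values for 50 matched case-control pairs (2 covariates).
rng = np.random.default_rng(2)
case_x = rng.normal(loc=1.0, size=(50, 2))
control_x = rng.normal(loc=0.5, size=(50, 2))

diff = case_x - control_x            # explanatory variables: pair differences
y = np.ones(len(diff))               # every pair is coded as a "success"

# No intercept term: the pair differences enter the model directly.
pair_model = sm.Logit(y, diff).fit(disp=0)
print(np.exp(pair_model.params))     # matched (conditional) odds ratios
```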
ANALYSIS OF N:M MATCHED DATA
For 1:M or N:M matched data, a conditional logistic regression model is fitted using the proportional hazards (Cox regression) procedure. A stratum is formed for each matched set based on age or some other matching variable. A survival time variable is created such that all cases in a stratum have the same event time and the controls are censored at a later time. The event indicator is 1 for cases and 0 (censored) for controls.
2.0 MULTIPLE LOGISTIC REGRESSION
The purpose of multiple logistic regression is to adjust for many co-factors in situations with a dichotomous outcome variable. Stratified analysis is an alternative method of adjustment but it breaks down rapidly if there are too many strata or if the strata are thin. Multiple logistic regression is able to model many thin strata and give meaningful results.
In the logistic regression model, the dependent variable, y, is nominal or discrete. The independent variables, x, can be nominal or discrete, with nominal preferred.
The multiple logistic regression model is: logit(p) = a + b1x1 + b2x2 + … + bnxn. It can be seen that this is mathematically equivalent to p = 1 / (1 + e^-(a + b1x1 + b2x2 + … + bnxn)).
The logistic model is fitted using maximum likelihood estimation, MLE. The conditional logistic model is used for matched data.
The adjusted odds ratio is estimated directly from the regression coefficient as OR = e^b.
The predicted probability is given as p-hat = 1 / [1 + exp(-(a-hat + b-hat1x1 + … + b-hatnxn))]. The predicted probabilities can then be compared with the actual observed probabilities. A 2 x 2 table is then created as follows:
PREDICTED | OBSERVED = 0 | OBSERVED = 1
1 | A | B
0 | C | D
The Brier score (the mean squared difference between the predicted probabilities and the observed outcomes) is used to assess prediction. The smaller the score, the better the prediction. However, in assessing model prediction we have to use a different set of data: bias can occur if the data used for modeling are the same data used for assessing model prediction.
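A sketch of this assessment (assuming Python with numpy, pandas, and statsmodels; the fitting and validation data are hypothetical and are kept separate, as recommended above):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.normal(size=n)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-0.3 + 1.1 * x))))
    return sm.add_constant(x), y

X_train, y_train = make_data(400)    # data used to fit the model
X_valid, y_valid = make_data(200)    # separate data used to assess prediction

model = sm.Logit(y_train, X_train).fit(disp=0)
p_hat = model.predict(X_valid)                    # predicted probabilities

predicted = (p_hat >= 0.5).astype(int)            # dichotomized prediction
print(pd.crosstab(predicted, y_valid,
                  rownames=["predicted"], colnames=["observed"]))

brier = np.mean((p_hat - y_valid) ** 2)           # smaller is better
print("Brier score:", brier)
```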
3.0 FITTING THE MULTIPLE REGRESSION MODEL
For any data set we can set up several linear models and an infinite number of non-linear models. Selecting the best regression model can be quite complicated. Not every variable available in the data set should be included in the model; only relevant variables should be included. A model is said to be correctly specified if it contains all relevant independent variables, including interaction terms, with no redundant or extraneous terms. A model is said to be under-specified if it misses important independent variables. An over-specified model contains redundant independent variables; the extraneous variables may be unrelated to the other independent variables or to the dependent variable.
There are several approaches to reducing the number of potential variables in the model. Variables can be excluded on theoretical grounds using biological knowledge of causal or non-causal associations. A correlation matrix is useful for preliminary exploration of relations among variables. If two variables are correlated, we drop the one with a larger amount of missing data, greater measurement error, or less theoretical importance. Also dropped are variables that are unrelated to the outcome variable in bivariate analysis. The number of variables can also be reduced by combining variables into a single variable or a single scale.
Four procedures are used for fitting the multiple regression model: best subset, step-up, step-down, and step-wise. The best-fitting model is one with an unbiased estimate of the b coefficient and minimum variance. Residual diagnostics and evaluation of multicollinearity are carried out on the fitted model to make sure it is the best.
In subset or best-subset regression, the computer is told to compute all possible models with 1, 2, 3, or more covariates and to select the best-fitting one based on the likelihood score or the chi-square.
Step-up is forward entry or forward selection and it starts with a minimal model. It involves adding one variable at a time without trying to delete any variable.
In step-down or backward elimination we start with a full model or maximal model consisting of all variables then we delete one variable at a time without trying to add any new variables.
Step-wise selection is a combination of step-up and step-down selection. All variables are screened to select the one with the largest absolute value of the t ratio; the selected variable is entered first into the model. Variables are then added to the model one at a time if they make a significant contribution as assessed by a pre-specified t value. Alternatively, the selection could be based on changes in the p-value, the point estimate, or the standard error of the estimate. After each addition of a new variable, the variable with the least contribution is removed based on a pre-specified t value. The following rules of thumb are used to make decisions about variable inclusion and exclusion: if the t ratio is ≤ 1.0 the variable is omitted; if the t ratio is between 1.0 and 2.0 the variable is considered and a decision is made to include or exclude it; if the t ratio is ≥ 2.0 the variable is included. Stepwise model selection has the following disadvantages: (a) too many models have to be checked before arriving at the best model (b) it ignores the effect of outliers (c) it ignores non-linear models (d) it uses the t ratio as a criterion and ignores R² and s (e) it does not consider the joint effects of independent variables (f) the order in which variables are introduced may affect the final result (g) purely automatic routines do not consider the investigator's special knowledge.
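As an illustration of the forward-entry step only, here is a simplified sketch (assuming Python with numpy, pandas, and statsmodels, and using Wald p-values rather than t ratios as the entry criterion; the data frame and the 0.05 threshold are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data frame with outcome y and candidate predictors x1..x4.
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=["x1", "x2", "x3", "x4"])
p = 1 / (1 + np.exp(-(0.8 * df["x1"] - 1.0 * df["x3"])))
df["y"] = rng.binomial(1, p.to_numpy())

selected, candidates, entry_p = [], ["x1", "x2", "x3", "x4"], 0.05

# Forward selection: at each step add the candidate with the smallest
# Wald p-value, provided it is below the pre-specified entry threshold.
while candidates:
    pvals = {}
    for var in candidates:
        X = sm.add_constant(df[selected + [var]])
        fit = sm.Logit(df["y"], X).fit(disp=0)
        pvals[var] = fit.pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= entry_p:
        break
    selected.append(best)
    candidates.remove(best)

print("Selected variables:", selected)
```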
Variable selection procedures are useful if the purpose of the regression is prediction. They are less useful if the purpose is study of causal relations.
Significance testing and 95% CI can be done for the intercept and the regression coefficients using the t-test. The test hypotheses are of the form H0: a = 0 and H0: b = 0.
Data splitting is a method used to validate variable selection. The data is split into 2 parts. One part is used for variable selection and the other part is used to evaluate the variable selection.
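A minimal sketch of such a split (assuming Python with numpy, pandas, and scikit-learn's train_test_split; the data frame is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data frame with an outcome and a few candidate predictors.
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = rng.binomial(1, 0.5, size=200)

# One half for variable selection, the other half to evaluate that selection.
selection_half, evaluation_half = train_test_split(df, test_size=0.5,
                                                   random_state=0)
```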
The actual fitting of the regression model can be carried out using several approaches. The most popular is the maximum likelihood method which could be based on the Poisson, binomial, or hypergeometric distributions.
Model specification errors can occur when important variables are omitted from the model. Failure to account for non-linear relations also leads to mis-specification. Over-specification is the inclusion of too many variables in the model, with the risk of introducing collinearity. Variable selection procedures can be used to overcome this problem by selecting the best subset of explanatory variables, the one that gives the maximum R² for a given number of variables, p. A model is said to be overfit if extraneous variables are included; these variables, however, do not bias the parameter estimates. A model is said to be under-fit if important variables are not included. Under-fitting is a cause of bias in parameter estimates.
Missing data cause bias. The extent of bias due to missing data can be assessed by comparing observations with missing data against those without missing data on the most important variables. There are several approaches to dealing with missing data. Cases with missing data can be deleted. Alternatively, in order to keep track of missing cases and adjust for them, a new dummy (indicator) variable is created for an independent variable, with value 1 if the data are missing and 0 if they are not; the dummy variable will adjust for missing data in the analysis. Additional efforts can be made to obtain extra data. The number of independent variables can be reduced by combining or scaling variables, which reduces the problem of missing data. There are also several methods of estimating the value of missing data.
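A sketch of the missing-data dummy variable (assuming Python with numpy and pandas; the data frame, the column bmi, and mean filling are hypothetical illustrations):

```python
import numpy as np
import pandas as pd

# Hypothetical data frame in which the covariate bmi has some missing values.
df = pd.DataFrame({"bmi": [22.0, np.nan, 30.5, np.nan, 27.1],
                   "y":   [0, 1, 1, 0, 1]})

# Dummy variable: 1 if bmi is missing, 0 otherwise.
df["bmi_missing"] = df["bmi"].isna().astype(int)

# One simple option: fill the missing values (here with the mean) and keep
# the dummy in the model so that the analysis adjusts for missingness.
df["bmi_filled"] = df["bmi"].fillna(df["bmi"].mean())
print(df)
```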
Non-convergence is a common problem. In a situation of non-convergence, the likelihood equation for the logistic regression model does not have a finite solution and the software returns a message such as 'infinite parameters'. The following actions can be taken in that case: (a) checking the raw data for transcription errors (b) categorizing quantitative variables (c) using fewer explanatory variables (d) collecting more data or (e) reclassifying the response variable by using a different cut-off point.
4.0 ASSESSING REGRESSION MODELS
4.1 The ideal model
Selection of the best model is guided by the coefficient of determination, the significance of the regression coefficients, and residual analysis. The best model is one with the highest coefficient of determination, or one for which any additions do not make any significant change in the coefficient. Insignificant predictors are best eliminated from the model unless there is a special reason for wishing to retain them. Model misspecification occurs when a linear relation is assumed for a curvilinear situation. A model may also be misspecified if important variables are omitted. After fitting the model, several diagnostic procedures can be carried out to assess its validity and appropriateness. Tests of linearity are carried out first; then row and column diagnostics are performed.
4.2 Validating a regression model
There are basically four approaches to validating a regression model that has been fitted. New data may be collected and may be used to test the model. Alternatively existing data may be randomly split into 2 parts; one part is used to develop the model and the other part is used to test the model. In the jack-knife approach, observations are deleted from the model one at a time with the model being recomputed to see whether there are any differences; a valid model will not change because of such removal of some observations. In the bootstrap approach, random samples are selected from the data (with replacement) and the model is refit for each sample. Constancy of the model indicates its validity.
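A sketch of the bootstrap approach (assuming Python with numpy and statsmodels; the data and the number of resamples are hypothetical). The model is refitted on repeated resamples drawn with replacement and the stability of the coefficients is inspected:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.4 + 0.9 * x))))
X = sm.add_constant(x)

boot_coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)           # sample rows with replacement
    fit = sm.Logit(y[idx], X[idx]).fit(disp=0)
    boot_coefs.append(fit.params)

boot_coefs = np.array(boot_coefs)
# Roughly constant coefficients across resamples support the model's validity.
print(boot_coefs.mean(axis=0), boot_coefs.std(axis=0))
```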
4.3 Assessment of goodness of fit in logistic regression models
There are basically three options: (a) the Hosmer and Lemeshow goodness-of-fit test (b) the generalized coefficient of determination and (c) the adjusted generalized coefficient of determination.
The Hosmer and Lemeshow goodness-of-fit test calculates the Pearson chi-square for a 2 x g table with g groups. It essentially involves comparing observed with expected (predicted) values. The chi-square with g - 2 degrees of freedom is given by: χ² = Σ(i = 1 to g) [(Oi - Ni·p̄i)² / {Ni·p̄i(1 - p̄i)}], where Ni = number of observations in group i, Oi = number of outcomes (events) in group i, and p̄i = average estimated probability of the event in the ith group.
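A sketch of this calculation (assuming Python with numpy, pandas, scipy, and statsmodels; the data are hypothetical and the groups are formed as deciles of the predicted probability, a common but not the only choice):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.2 + 0.8 * x))))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
p_hat = fit.predict()

g = 10                                           # number of groups
data = pd.DataFrame({"y": y, "p": p_hat,
                     "group": pd.qcut(p_hat, q=g, labels=False,
                                      duplicates="drop")})
grouped = data.groupby("group").agg(N=("y", "size"),
                                    O=("y", "sum"),
                                    p_bar=("p", "mean"))

# chi-square = sum over groups of (Oi - Ni*p_bar_i)^2 / [Ni*p_bar_i*(1 - p_bar_i)]
chi2_stat = (((grouped["O"] - grouped["N"] * grouped["p_bar"]) ** 2)
             / (grouped["N"] * grouped["p_bar"] * (1 - grouped["p_bar"]))).sum()
p_value = stats.chi2.sf(chi2_stat, df=len(grouped) - 2)
print(chi2_stat, p_value)
```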
The generalized coefficient of determination is given by the expression R² = 1 - [L(0) / L(b)]^(2/n), where L(0) = likelihood of a model consisting of the intercept only, L(b) = likelihood of the specified model, and n = number of observations.
The adjusted generalized coefficient of determination is computed as the ratio of the observed coefficient of determination to the maximum coefficient of determination. The maximum coefficient of determination is given by the expression 1 - [L(0)]^(2/n).
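A sketch of both quantities (assuming Python with numpy and statsmodels, using the fitted model's log-likelihood llf and the intercept-only log-likelihood llnull; the data are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(size=400)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.9 * x))))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

n = len(y)
ll_model = fit.llf        # ln L(b), the specified model
ll_null = fit.llnull      # ln L(0), intercept-only model

# Generalized coefficient of determination: 1 - [L(0)/L(b)]^(2/n)
r2 = 1 - np.exp((2 / n) * (ll_null - ll_model))

# Adjusted (max-rescaled) version: divide by the maximum attainable value,
# which is 1 - [L(0)]^(2/n).
r2_max = 1 - np.exp((2 / n) * ll_null)
r2_adjusted = r2 / r2_max
print(r2, r2_adjusted)
```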
4.4 Improving the fit of the regression model
An interaction term is defined as the product of two terms, for example var3 = var1 * var2. Use of interaction terms can improve model fit. More than one method of creating interaction variables may be used to improve the model; for example, interaction and indicator variables may be combined. In the model y = a + b1x1 + b2x2 + b3x1x2, x2 is a dummy variable. If x2 = 0, the model becomes y = a + b1x1. If x2 = 1, the model becomes y = (a + b2) + (b1 + b3)x1. A dummy variable can be attached to each indicator variable, for example in the model y = a + b1x1 + b2x2 + b4x1x3 + b5x2x3, where x3 is the dummy variable. Some significant interactions may turn out to be difficult to interpret clinically. Interaction is suspected when a variable thought to be significant on theoretical grounds turns out to be insignificant in the regression model; this indicates that its significance holds only under certain conditions of interaction. Thus testing for interaction becomes a form of sub-group analysis.
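A sketch of adding an interaction term (assuming Python with numpy, pandas, and the statsmodels formula interface; the data frame and variable names are hypothetical; in the formula, var1 * var2 expands to both main effects plus their product):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"var1": rng.normal(size=300),
                   "var2": rng.binomial(1, 0.5, size=300)})
lin = 0.5 * df["var1"] + 0.8 * df["var2"] + 1.0 * df["var1"] * df["var2"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-lin.to_numpy())))

# 'var1 * var2' expands to var1 + var2 + var1:var2 (the interaction term).
fit = smf.logit("y ~ var1 * var2", data=df).fit(disp=0)
print(fit.params)
```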
Other approaches: The regression can be improved by addition of a suppressor variable to the model in order to enhance the importance of other variables. The regression model can also be improved by dropping outliers. A constant could be added or subtracted from each independent variable as in the model y = a + b (x –100) or y = a + b (x + 50).