
0900L - MODULE 4.0 INFERENTIAL STATISTICS


Copyright by Professor Omar Hasan Kasule Sr.


MODULE OUTLINE

4.1 DISCRETE DATA ANALYSIS
4.1.1 Simple Analysis of Proportions
4.1.2 Stratified and Matched Analysis of Proportions
4.1.3 Exact Analysis of Proportions
4.1.4 Analysis of Rates and Hazards
4.1.5 Analysis of Ratios

4.2 CONTINUOUS DATA ANALYSIS
4.2.1 Overview of Parametric Analysis
4.2.2 Parametric Analysis for 2 Sample Means
4.2.3 Parametric Analysis for 3 or More Sample Means
4.2.4 Overview of Non-Parametric Analysis
4.2.5 Procedures of Non-Parametric Analysis

4.3 CORRELATION ANALYSIS
4.3.1 Description
4.3.2 Pearson's Correlation Coefficient, r
4.3.3 Other Correlation Coefficients
4.3.4 The Coefficient of Determination, R²
4.3.5 Non-Parametric Correlation Analysis

4.4 REGRESSION ANALYSIS
4.4.1 Linear Regression
4.4.2 Logistic Regression
4.4.3 Fitting Regression Models
4.4.4 Assessing Regression Models
4.4.5 Alternatives to Regression

4.5 TIME SERIES AND SURVIVAL ANALYSIS
4.5.1 Time Series Analysis
4.5.2 Introduction to Survival Analysis
4.5.3 Non-Regression Survival Analysis
4.5.4 Regression Methods for Survival Analysis
4.5.5 Comparing Survival Curves


UNIT 4.1

DISCRETE DATA ANALYSIS


Learning Objectives:

·    Analysis of proportions using approximate and exact methods
·    Analysis of rates and hazards
·    Analysis of ratios


Key Words and Terms:

·    Analysis, analysis of 2 proportions
·    Analysis, analysis of one proportion
·    Analysis, approximate methods of analysis
·    Analysis, binary data analysis
·    Analysis, categorical data analysis
·    Analysis, discrete data analysis
·    Analysis, exact methods of analysis
·    Analysis, matched analysis
·    Analysis, simple analysis
·    Analysis, stratified analysis
·    Chi-square, chi-square statistic of homogeneity
·    Chi-square, Mantel-Haenszel chi-square
·    Chi-square, McNemar chi-square
·    Chi-square, Pearson chi-square statistic of association
·    Chi-square, trend
·    Chi-square, weighted
·    Degrees of freedom
·    Expected, frequency
·    Expected, probability
·    Homogeneity of variance
·    Homogeneity, test of homogeneity
·    Observed, frequency
·    Observed, probability
·    Observed, values
·    Table, contingency table
·    Table, collapsing of complex tables
·    Table, partitioning of complex tables
·    Total, column total
·    Total, grand total
·    Total, marginal total
·    Total, row total


UNIT OUTLINE

4.1.1 SIMPLE ANALYSIS OF PROPORTIONS
A. Preliminary Considerations
B. Testing Of One Binomial Proportion
C. Testing 2 Independent Binomial Proportions Using Z
D. Testing For Two Binomial Proportions In 2 x 2 Table Using χ²
E. Testing Multinomials

4.1.2 STRATIFIED and MATCHED ANALYSIS OF PROPORTIONS
A. Mantel-Haenszel Chi-Square of Association in Stratified 2 x 2 Tables
B. Mantel-Haenszel Chi-Square of Homogeneity in Stratified 2 x 2 Tables
C. Matched Analysis of Stratified 2 x 2 tables

4.1.3 EXACT ANALYSIS OF PROPORTIONS
A. Exact Methods
B. Testing One Sample Proportion
C. Testing 2 Sample Proportions in a 2 x 2 Table
D. Testing In More Complex Tables

4.1.4 ANALYSIS OF RATES and HAZARDS
A. Simple Analysis for Incidence Rates
B. Mantel-Haenszel Chi-Square of Association for Incidence Rates in Stratified 2 x 2 Tables
C. Analysis for Hazards

4.1.5 ANALYSIS OF RATIOS
A. Simple Analysis for Risk Ratio
B. Mantel-Haenszel Chi-Square of Association for Risk Ratio in Stratified 2 x 2 Tables


4.1.1 SIMPLE (UNSTRATIFIED) ANALYSIS OF PROPORTIONS
A. PRELIMINARY CONSIDERATIONS
APPROXIMATE vs EXACT METHODS
Discrete data is analyzed as proportions. Inference on discrete data is based on the binomial/multinomial distribution. It uses 2 approximate methods (the z-statistic and the chi-square statistic) and one exact method (Fisher's Exact Method). The approximate methods are parametric and are valid for large sample sizes. They are based on the large-sample result that the sampling distribution of a proportion approximates the normal distribution when the sample size is large enough. These methods are therefore not sufficiently accurate for small samples. The exact methods are valid for both large and small sample sizes. In practice the z and chi-square statistics are used for large samples and the exact methods for small samples, although there is nothing to prevent exact methods from being used for large samples. Hypothesis testing for proportions is similar to that for means. For a binomial count the mean is np and the variance is np(1-p); for the sample proportion the mean can be taken to be p and the variance p(1-p)/n. We can then proceed to use the same formulas as we used for the z-test.
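The gap between exact and approximate methods in small samples can be illustrated with a minimal sketch. It assumes a one-sided test of one proportion, uses the exact binomial tail as the exact method, and uses made-up data (8 events in 10 trials against a null proportion of 0.5); the function names are illustrative only.

```python
import math

def exact_upper_tail(x, n, p0):
    """Exact binomial probability P(X >= x) when X ~ Binomial(n, p0)."""
    return sum(math.comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(x, n + 1))

def normal_upper_tail(x, n, p0):
    """Large-sample (z) approximation to P(X >= x), no continuity correction."""
    z = (x / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical small sample: 8 events in 10 trials, null proportion 0.5
exact = exact_upper_tail(8, 10, 0.5)
approx = normal_upper_tail(8, 10, 0.5)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

With so few observations the two tail probabilities differ noticeably, which is why the approximate methods are reserved for large samples.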

DATA CHARACTERISTICS
Gaussian Distribution Of The Data: The first step is to ascertain whether the data distribution follows an approximate Gaussian distribution. The approximate methods are most valid when the sampling distribution of the proportion is approximately Gaussian.

Equal Variances: It is possible to compute variances for proportions using the binomial distribution. The variances of proportions in the compared samples must be approximately equal for the statistical tests to be valid.
Adequacy of The Sample Size: For approximate methods to be valid, the sample size must be adequate. There are special statistical procedures for ascertaining sample size.

DATA LAY-OUT
The data for approximate methods is laid out in the form of contingency tables: 2 x 2, 2 x k, m x n. Visual inspection is recommended before application of statistical tests. The 2 x 2 table could be laid out in three ways. It can show the actual observations using the nij notation:

n11 | n12 | n1.
n21 | n22 | n2.
n.1 | n.2 | n


The table can be laid out showing probabilities using the pij notation:
p11 | p12 | p1.
p21 | p22 | p2.
p.1 | p.2 | 1.0


We could also write the expected cell values under the null hypothesis, assuming that the margins are held constant:

n1. x n.1/n | n1. x n.2/n | n1.
n2. x n.1/n | n2. x n.2/n | n2.
n.1 | n.2 | n
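The expected cell values can be computed directly from the margins. The following is a minimal sketch with made-up counts; `expected_counts` is an illustrative name, not a library function.

```python
def expected_counts(table):
    """Expected cell counts under the null hypothesis, margins held constant.

    Each expected value is (row total x column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical 2 x 2 table of observed counts (n11, n12 / n21, n22)
observed = [[20, 30],
            [30, 20]]
expected = expected_counts(observed)
print(expected)
```

The same function works unchanged for 2 x k and m x n tables, since it only uses the row and column totals.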

STATING THE HYPOTHESES
The null hypothesis and the alternative hypotheses must be stated clearly. The following formulations are acceptable. In inference on 1 proportion using the z or chi-square tests, H0: sample proportion - population proportion = 0. HA: sample proportion > or < population proportion. In inference on 2 sample proportions using the z or chi-square test, H0: sample proportion #1 - sample proportion #2 = 0. HA: sample proportion #1 > or < sample proportion #2. The 2 sample hypothesis can alternatively be stated in terms of probability as H0: p11 = p21 and HA: p11 ≠ p21. In inference on 3 or more sample proportions using the chi-square test, H0: sample proportion #1 = sample proportion #2 = sample proportion #3 = … = sample proportion #n.

FIXING THE TESTING PARAMETERS
Testing parameters are fixed in advance. For the p-value approach, the 5% or 0.05 level of significance is customarily used. Other levels such as 2.5% or 10% may also be used.

CHOOSING THE TEST STATISTICS
Z-statistic: The z-statistic is used for large samples that have a Gaussian distribution. It is computed as the difference between 2 compared proportions expressed in standard or z-score units. Thus $z = (p_2 - p_1) / \sqrt{p(1-p)(1/n_1 + 1/n_2)}$, where $p$ is the pooled proportion of the two samples and $n_1$ and $n_2$ are the sample sizes. In case of comparing one sample proportion $\hat{p}$ against the population proportion $p_0$, the formula simplifies to $z = (\hat{p} - p_0) / \sqrt{p_0 (1 - p_0)/n}$. The chi-square statistic is the most commonly used approximate method for sample data. It is computationally easy.
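As an illustration of the one-sample case, here is a minimal sketch. It assumes the standard error is the square root of p0(1 - p0)/n, evaluated at the population proportion, and a two-sided p-value; the data (60 events in 100 subjects, population proportion 0.5) are made up.

```python
import math

def one_sample_z(x, n, p0):
    """z statistic and two-sided p-value for a sample proportion x/n against p0."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)            # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return z, p_value

# Hypothetical data: 60 events among 100 subjects, population proportion 0.5
z, p_value = one_sample_z(60, 100, 0.5)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```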

Chi-square Statistic: The chi-square is popular because it is easy to compute. The chi-square test is based on comparing expected with observed values. In a 2 x 2 table, the chi-square statistic is the square of the z-statistic (also called the chi statistic). The chi-square statistic essentially measures the deviation of the observed values from those expected under the null hypothesis. The Pearson chi-square statistic is the summation over all cells of (observed - expected)² / expected, that is, $\chi^2 = \sum (O - E)^2 / E$. Each chi-square computed is associated with degrees of freedom computed as (number of rows - 1) x (number of columns - 1). The expected frequency of a cell is computed as (row total x column total) / grand total.
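The Pearson chi-square and its degrees of freedom can be sketched as follows. The counts are made up, and the final lines check, for this one 2 x 2 table, that the chi-square equals the square of the pooled two-sample z statistic.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square: sum over all cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)   # (rows - 1) x (columns - 1)
    return chi2, df

# Hypothetical 2 x 2 table: 20/50 events in row 1, 30/50 events in row 2
table = [[20, 30],
         [30, 20]]
chi2, df = pearson_chi_square(table)

# Pooled two-sample z statistic for the same two row proportions
p1, p2 = 20 / 50, 30 / 50
pooled = (20 + 30) / (50 + 50)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / 50 + 1 / 50))
print(chi2, df, z ** 2)
```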

B. TESTING OF ONE BINOMIAL PROPORTION
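The binomial model is used. The incidence proportion is computed as n/N where n = number of cases and N = total population. The 95% confidence interval for the incidence proportion is computed on the logit scale as $\operatorname{logit}(n/N) \pm 1.96 \sqrt{\operatorname{Var}(\operatorname{logit}(n/N))}$, where $\operatorname{logit}(n/N) = \ln\{(n/N) / (1 - n/N)\}$ and $\operatorname{Var}(\operatorname{logit}(n/N)) = 1/n + 1/(N-n)$. Each limit $L$ is then back-transformed to the proportion scale as $e^{L} / (1 + e^{L})$. The null hypothesis is tested using the chi statistic $\chi = (O - E) / \sqrt{E(N - E)/N}$, whose square is the chi-square statistic with 1 degree of freedom, where O = n and E is computed according to the known population risk.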
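A minimal sketch of the logit-based interval follows. It assumes that the limits computed on the logit scale are back-transformed with the inverse logit, exp(L) / (1 + exp(L)), so that they land on the proportion scale; the counts (20 cases in a population of 100) are made up.

```python
import math

def logit_ci(n_cases, n_total, z_crit=1.96):
    """Incidence proportion with a 95% CI computed on the logit scale."""
    p = n_cases / n_total
    logit = math.log(p / (1 - p))
    se = math.sqrt(1 / n_cases + 1 / (n_total - n_cases))  # sqrt of Var(logit)
    lower_logit = logit - z_crit * se
    upper_logit = logit + z_crit * se

    def inverse_logit(value):
        # Assumption: back-transform to a proportion between 0 and 1
        return 1 / (1 + math.exp(-value))

    return p, inverse_logit(lower_logit), inverse_logit(upper_logit)

# Hypothetical data: 20 cases in a population of 100
p, lower, upper = logit_ci(20, 100)
print(f"p = {p:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Working on the logit scale keeps both limits inside the interval from 0 to 1, which a symmetric interval around n/N does not guarantee.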

C. TESTING 2 INDEPENDENT BINOMIAL PROPORTIONS USING Z
The binomial model is used. The data is laid out as shown in the table below:

          | Exposure + | Exposure - | Total
Disease + | a          | b          | m1
Disease - | c          | d          | m0
Total     | n1         | n0         | N

The risk ratio is computed as $RR = (a/n_1) / (b/n_0)$ with 95% confidence interval $\exp[\ln(RR) \pm 1.96 \sqrt{\operatorname{Var}(\ln RR)}]$, where $\operatorname{Var}(\ln RR) = 1/a - 1/n_1 + 1/b - 1/n_0$. The risk difference is computed as $RD = (a/n_1) - (b/n_0)$ with 95% confidence interval $RD \pm 1.96 \sqrt{\operatorname{Var}(RD)}$, where $\operatorname{Var}(RD) = ac / \{n_1^2 (n_1 - 1)\} + bd / \{n_0^2 (n_0 - 1)\}$. The odds ratio is computed as $OR = \{p_1 / (1 - p_1)\} / \{p_0 / (1 - p_0)\} = (a/c) / (b/d) = ad/bc$, where $p_1 = a/n_1$ and $p_0 = b/n_0$. The 95% confidence interval for the odds ratio is $\exp[\ln(OR) \pm 1.96 \sqrt{\operatorname{Var}(\ln OR)}]$, where $\operatorname{Var}(\ln OR) = 1/a + 1/b + 1/c + 1/d$.
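The three measures and their intervals can be sketched together. The sketch assumes the usual large-sample variances, Var(ln RR) = 1/a - 1/n1 + 1/b - 1/n0 and Var(ln OR) = 1/a + 1/b + 1/c + 1/d, together with the Var(RD) expression given in the text; the cell counts are made up and the function name is illustrative.

```python
import math

def two_by_two_measures(a, b, c, d, z_crit=1.96):
    """Ratio, difference and odds ratio of two proportions with 95% CIs.

    Layout: a, b = diseased among exposed, unexposed;
            c, d = non-diseased among exposed, unexposed.
    """
    n1, n0 = a + c, b + d
    rr = (a / n1) / (b / n0)
    var_ln_rr = 1 / a - 1 / n1 + 1 / b - 1 / n0
    rd = a / n1 - b / n0
    var_rd = a * c / (n1**2 * (n1 - 1)) + b * d / (n0**2 * (n0 - 1))
    odds_ratio = (a * d) / (b * c)
    var_ln_or = 1 / a + 1 / b + 1 / c + 1 / d

    def log_scale_ci(estimate, variance):
        half_width = z_crit * math.sqrt(variance)
        return (math.exp(math.log(estimate) - half_width),
                math.exp(math.log(estimate) + half_width))

    rd_half_width = z_crit * math.sqrt(var_rd)
    return {
        "RR": (rr, log_scale_ci(rr, var_ln_rr)),
        "RD": (rd, (rd - rd_half_width, rd + rd_half_width)),
        "OR": (odds_ratio, log_scale_ci(odds_ratio, var_ln_or)),
    }

# Hypothetical counts: 30/100 exposed and 10/100 unexposed develop disease
results = two_by_two_measures(a=30, b=10, c=70, d=90)
for name, (estimate, (low, high)) in results.items():
    print(f"{name} = {estimate:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```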

The z test statistic is computed as the difference between the sample and standard proportion divided by the standard error: $z = |p - p_0| / \sqrt{p_0 (1 - p_0)/n} \sim N(0,1)$, where $p_0 (1 - p_0)/n$ is the variance of the sample proportion under the null hypothesis. We can explain the testing of two sample proportions in symbols as follows. Let the respective proportions of the two samples be $p_1 = x_1/n_1$ and $p_2 = x_2/n_2$. The difference between the two proportions is $p_1 - p_2$. The standard error of the difference, computed from the separate sample variances, is $se(p_1 - p_2) = \sqrt{p_1 (1 - p_1)/n_1 + p_2 (1 - p_2)/n_2}$. It could alternatively be computed from the pooled proportion $p_p = (x_1 + x_2) / (n_1 + n_2)$ as $\sqrt{p_p (1 - p_p)/n_1 + p_p (1 - p_p)/n_2}$. The z test statistic is thus $z = |p_1 - p_2| / se(p_1 - p_2) \sim N(0,1)$, using either form of the standard error; the pooled form is the one that matches the null hypothesis of equal proportions. The 95% CI for the difference between the proportions is computed as $p_1 - p_2 \pm 1.96 \sqrt{p_p (1 - p_p)/n_1 + p_p (1 - p_p)/n_2}$ or $p_1 - p_2 \pm 1.96 \sqrt{p_1 (1 - p_1)/n_1 + p_2 (1 - p_2)/n_2}$. In the notation of the 2 x 2 table above, the test is formulated as $z = |p_1 - p_0| / \sqrt{p_p (1 - p_p)/n_1 + p_p (1 - p_p)/n_0}$ where $p_p = (n_1 p_1 + n_0 p_0) / (n_1 + n_0)$.
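The two-sample test can be sketched as below. The sketch assumes variances of the form p(1 - p)/n, uses the pooled standard error for the test statistic and the unpooled standard error for the confidence interval (one of the two options described), and uses made-up counts.

```python
import math

def two_sample_z(x1, n1, x2, n2, z_crit=1.96):
    """z test for two independent proportions, plus a 95% CI for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = abs(p1 - p2) / se_pooled
    p_value = math.erfc(z / math.sqrt(2))        # two-sided normal tail area
    diff = p1 - p2
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci

# Hypothetical data: 30/100 events in sample 1, 10/100 events in sample 2
z_stat, p_val, ci_diff = two_sample_z(30, 100, 10, 100)
print(f"z = {z_stat:.3f}, p = {p_val:.5f}, "
      f"95% CI = ({ci_diff[0]:.3f}, {ci_diff[1]:.3f})")
```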

D. TESTING FOR TWO BINOMIAL PROPORTIONS IN 2 x 2 TABLE USING χ²
The chi-square is computed from the data using appropriate formulas and takes various forms. Although a chi-square can be computed for one proportion, most chi-square testing involves two proportions, and a 2 x 2 contingency table is used. The contingency table lay-out and the formulas are different for paired and independent samples. The chi-square for paired data is called the McNemar chi-square. The formula of the Pearson chi-square for independent samples is $\chi^2 = \sum (O - E)^2 / E \sim \chi^2_1$. The same formula is used for the McNemar chi-square for two paired samples, but the observed and expected values are confined to the discordant pairs only, since the concordant pairs provide no additional information or contrast. The chi-square statistic computed as explained above is referred to the appropriate table to look up the p-value under the appropriate degrees of freedom. The general formula for degrees of freedom is (rows - 1) x (columns - 1). The following decision rules are then used: if the p-value is larger than the level of significance of 0.05, the null hypothesis is not rejected; if the p-value is smaller than the level of significance of 0.05, the null hypothesis is rejected. Before the advent of high-speed computers, special computational formulas had been developed for computing the chi-square for 2 x 2 contingency tables. The Mantel-Haenszel formula for independent samples is $\chi^2_1 = (N - 1)(ad - bc)^2 / (m_1 m_0 n_1 n_0)$, where $m_1 = a + b$, $m_0 = c + d$, $n_1 = a + c$ and $n_0 = b + d$ are the marginal totals. The statistic could alternatively be written as $\chi = \{a - E(a)\} / \sqrt{V(a)}$, whose square is the same chi-square, where $E(a) = m_1 n_1 / N$ and $V(a) = m_1 n_1 m_0 n_0 / \{N^2 (N - 1)\}$. The formula for the McNemar chi-square is $\chi^2_1 = (b - c)^2 / (b + c)$, where b and c here denote the numbers of the two kinds of discordant pairs in the paired table.
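The computational formulas can be sketched as follows. The sketch assumes the standard forms: the Mantel-Haenszel chi-square with the product of the four marginal totals in the denominator, its equivalent (a - E(a))² / V(a), and the McNemar chi-square (b - c)² / (b + c) on the two discordant pair counts. All counts are made up.

```python
def mantel_haenszel_chi_square(a, b, c, d):
    """Mantel-Haenszel chi-square (1 df) for a single 2 x 2 table."""
    n = a + b + c + d
    m1, m0 = a + b, c + d          # row totals
    n1, n0 = a + c, b + d          # column totals
    return (n - 1) * (a * d - b * c) ** 2 / (m1 * m0 * n1 * n0)

def chi_square_from_expectation(a, b, c, d):
    """Same statistic written as (a - E(a))^2 / V(a)."""
    n = a + b + c + d
    m1, m0 = a + b, c + d
    n1, n0 = a + c, b + d
    expected_a = m1 * n1 / n
    variance_a = m1 * n1 * m0 * n0 / (n ** 2 * (n - 1))
    return (a - expected_a) ** 2 / variance_a

def mcnemar_chi_square(b, c):
    """McNemar chi-square (1 df) from the two discordant pair counts."""
    return (b - c) ** 2 / (b + c)

# Hypothetical independent-sample table and hypothetical discordant pairs
mh = mantel_haenszel_chi_square(30, 10, 70, 90)
mh_alt = chi_square_from_expectation(30, 10, 70, 90)
mcnemar = mcnemar_chi_square(15, 5)
print(mh, mh_alt, mcnemar)
```

Both forms of the Mantel-Haenszel statistic give the same value, and either is referred to the chi-square distribution with 1 degree of freedom.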

E. TESTING MULTINOMIALS
TESTING 1 MULTINOMIAL SAMPLE 1 x k
We use the formula $\chi^2 = \sum (O - E)^2 / E \sim \chi^2_{k-1}$, where the sum is taken over the k categories.

TESTING TWO INDEPENDENT MULTINOMIALS 2 x k
We use the expression $\chi^2 = \sum_{ij} (O_{ij} - E_{ij})^2 / E_{ij} \sim \chi^2_{(r-1)(c-1)}$ where $E_{ij} = O_{i.} \times O_{.j} / O_{..}$. For a 2 x k table the degrees of freedom are k - 1.

TESTING SEVERAL INDEPENDENT MULTINOMIAL PROPORTIONS r x c
The chi-square statistic is versatile and can be computed for r x c tables as $\chi^2 = \sum_{ij} (O_{ij} - E_{ij})^2 / E_{ij} \sim \chi^2_{(r-1)(c-1)}$. Large tables could be partitioned and chi-square statistics computed for the partitions. They can also be reduced or collapsed.
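Collapsing a larger table can be illustrated with a minimal sketch. It merges columns of a made-up 2 x 3 table into a 2 x 2 table and recomputes the Pearson chi-square; `collapse_columns` and the grouping chosen are illustrative assumptions, not a prescribed procedure.

```python
def pearson_chi_square(table):
    """Pearson chi-square and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
        / (row_totals[i] * col_totals[j] / grand_total)
        for i in range(len(table))
        for j in range(len(table[0]))
    )
    return chi2, (len(table) - 1) * (len(table[0]) - 1)

def collapse_columns(table, groups):
    """Merge columns; groups is a list of lists of original column indices."""
    return [[sum(row[j] for j in group) for group in groups] for row in table]

# Hypothetical 2 x 3 table; columns 2 and 3 are merged into one category
full_table = [[10, 20, 30],
              [20, 20, 20]]
collapsed = collapse_columns(full_table, [[0], [1, 2]])

chi2_full, df_full = pearson_chi_square(full_table)
chi2_collapsed, df_collapsed = pearson_chi_square(collapsed)
print(collapsed, df_full, df_collapsed)
```

Collapsing reduces the degrees of freedom, here from 2 to 1, so the collapsed statistic is referred to a different chi-square distribution than the full-table statistic.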
