Integrated Medical Education Resources: 240422P - STATISTICAL CONSIDERATIONS OF DATA WITH SMALL SAMPLE SIZE OR MISSING DATA: An Overview

Presented at an online workshop for students of the Faculty of Pharmacy PNU on April 22, 2024 by Prof Omar Hasan Kasule Sr MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Chairman, Research Ethics Committee King Abdullah Ibn Abdulaziz University Hospital.

1.0 OBJECTIVES:

To learn the statistical tests used for data with a small sample size.
To learn methods of handling missing data.
To understand the implications of small sample size and missing data for research findings.

2.0 DEFINITION OF A SMALL SAMPLE

The central limit theorem (CLT) approximation cannot be relied on for small samples.
A small sample is defined as n < 30 for a quantitative outcome, or np < 8 or n(1–p) < 8 (where p is the proportion) for a categorical outcome.
With a small sample, the sampling distribution cannot be assumed to be normal, which is the most important condition that large-sample tests depend on.
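The rule of thumb above can be sketched as a small check. The function name and the default cutoffs are illustrative assumptions that simply restate the thresholds given in this section.

```python
def is_small_sample(n, p=None, n_cutoff=30, count_cutoff=8):
    """Return True if the sample counts as small under the rule of thumb above.

    n: sample size
    p: proportion with the outcome (None for a quantitative outcome)
    """
    if p is None:
        # Quantitative outcome: compare n with the cutoff.
        return n < n_cutoff
    # Categorical outcome: small if either np or n(1 - p) is below the cutoff.
    return min(n * p, n * (1 - p)) < count_cutoff


print(is_small_sample(25))         # quantitative outcome, n = 25
print(is_small_sample(100, 0.05))  # categorical outcome, np = 5
print(is_small_sample(100, 0.50))  # categorical outcome, np = 50
```

A sample of 100 can still be "small" for a categorical outcome when the event is rare, because np is then only 5.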

3.0 STATISTICAL TESTS FOR DATA WITH SMALL SAMPLE SIZE

Continuous data: t-test (n < 30; some authors use n < 60)
Categorical data: Fisher’s exact test
Others: non-parametric tests, regression models, meta-analysis models
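As an illustration of the categorical case, the following is a minimal sketch of a two-sided Fisher's exact test for a 2x2 table, written with the standard library only. The function name and the example counts are hypothetical; in practice a statistics package would normally be used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables that are as likely as, or less
    # likely than, the observed table (small tolerance for rounding).
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )


# Hypothetical small study: 4 treated and 4 control subjects.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 0.4857
```

The test enumerates every table with the same margins, so it makes no large-sample approximation, which is why it suits small counts.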

4.0 META ANALYSIS: ACHIEVING GENERALIZABILITY FROM SEVERAL SIMILAR SMALL SAMPLES USING EXCEL META ESSENTIALS

Reference: Suurmond, R., van Rhee, H., & Hak, T. (2017). Introduction, comparison, and validation of Meta‐Essentials: A free and simple tool for meta‐analysis. Research Synthesis Methods, 8(4), 537–553. http://doi.org/10.1002/jrsm.1260
Set inclusion/exclusion criteria and identify reports using specific keywords
Check the quality of each report using its methodology and results
Select reports that are homogeneous enough to be combined
Replace the numbers in the Excel sheet with numbers from your search
Compute the combined estimate and other parameters
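The idea behind the combined estimate can be sketched with fixed-effect inverse-variance weighting. This is an illustration only: the function name and the study values are hypothetical, and the model and settings used by the Meta-Essentials workbook may differ.

```python
from math import sqrt


def pooled_estimate(effects, std_errors):
    """Fixed-effect inverse-variance pooled estimate with a 95% confidence interval."""
    # Each study is weighted by the inverse of its variance, so more
    # precise studies contribute more to the combined estimate.
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical effect sizes and standard errors from three small studies.
effects = [0.50, 0.30, 0.40]
std_errors = [0.20, 0.25, 0.20]

pooled, pooled_se, ci = pooled_estimate(effects, std_errors)
print(pooled, pooled_se, ci)
```

The pooled standard error is smaller than that of any single study, which is how combining several small samples gains precision.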

5.0 HANDLING MISSING DATA

Deletion of missing items or variables
Imputation to replace missing data

6.0 IMPLICATIONS OF SMALL SAMPLE SIZE AND MISSING DATA ON RESEARCH FINDINGS

Very small samples undermine the internal and external validity of a study by being prone to errors: type II (false negative) due to inadequate power and type I (false positive) due to bias.
Small sample size is not always bad. It is associated with more careful and accurate high-quality data collection to achieve internal validity, in the knowledge that external validity is achieved by meta-analysis.
Comparative studies with good control of confounders can use small samples. Animal studies carried out in very controlled conditions do not require large samples.
If the population is less than 100, do not sample; survey all of them, and you will have population parameters rather than sample statistics.
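The link between small samples and type II error can be illustrated with an approximate power calculation for comparing two group means. This sketch uses a normal approximation, which is itself rough for very small samples and tends to overstate power there; the function name and the effect size used are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    effect_size: standardized difference between the group means
    n_per_group: number of subjects in each group
    """
    z = NormalDist()
    z_critical = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic exceeds the critical value
    # (the negligible opposite tail is ignored).
    return z.cdf(abs(effect_size) * sqrt(n_per_group / 2) - z_critical)


for n in (10, 30, 64):
    print(n, round(approx_power(0.5, n), 2))
```

For a medium standardized effect of 0.5, power is only about 0.2 with 10 subjects per group and reaches about 0.8 with roughly 64 per group.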

7.0 IMPLICATIONS OF LARGE SAMPLE SIZE ON RESEARCH FINDINGS

The move to mega samples is driven by the easy availability of data through IT. Mega samples are not always necessary, though journal editors and research funders like them.
Large samples overcome problems of missing data and distribute unknown confounders randomly.
Large samples are needed to detect small differences or study rare events.

8.0 RELATION BETWEEN MISSING DATA AND SMALL SAMPLE SIZE

Missing data reduces the effective sample size because most analysis programs eliminate the missing items.
The larger the sample size the less the relative effect of missing data. Very large sample sizes can tolerate missing data.
Most statistical analysis formulas are based on large samples. They are not efficient when data is missing or the sample size is small.
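How quickly missing items erode the effective sample size can be shown with simple arithmetic. The sketch below assumes, purely for illustration, that every variable is missing independently and at the same rate; neither assumption comes from the text, and real data rarely behave this neatly.

```python
def expected_complete_cases(n, missing_rate, n_variables):
    """Expected number of fully observed subjects when subjects with any
    missing item are eliminated, assuming each variable is missing
    independently with the same probability (a simplifying assumption)."""
    return n * (1 - missing_rate) ** n_variables


# 5% missing on each of 10 variables.
print(round(expected_complete_cases(100, 0.05, 10), 1))   # 59.9
print(round(expected_complete_cases(5000, 0.05, 10), 1))  # 2993.7
```

The same 40% loss leaves about 60 subjects from a sample of 100 but nearly 3000 from a sample of 5000, which is why very large samples tolerate missing data better.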

9.0 EXAMPLE OF MISSING DATA

Subject	Sex	Age	Education	Income	Weight	Height
1	na	20	na	na	56.0	na
2	2	30	2	High	60.0	130.0
3	1	25	na	High	58.0	125.0
4	na	na	na	High	na	na
5	2	40	1	High	65.0	na
6	2	50	2	High	66.0	na
7	1	60	na	High	68.0	200
8	2	70	na	High	70.0	189
9	1	12	1	Low	20.0	20
10	2	10	na	Low	25.0	25

Deletion?

Association between sex and weight: t-test
Association between sex and weight (grouped into categories): Fisher’s exact test
Correlation between weight and height: Spearman correlation coefficient
Association between age and height: regression coefficient
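To see what deletion would cost for the analyses listed above, the example table can be encoded and the available subjects counted. This is a minimal sketch: the helper names are hypothetical, and "na" is represented as None.

```python
# Example table from this section; None marks a missing ("na") value.
COLUMNS = ["sex", "age", "education", "income", "weight", "height"]
ROWS = [
    (None, 20, None, None, 56.0, None),
    (2, 30, 2, "High", 60.0, 130.0),
    (1, 25, None, "High", 58.0, 125.0),
    (None, None, None, "High", None, None),
    (2, 40, 1, "High", 65.0, None),
    (2, 50, 2, "High", 66.0, None),
    (1, 60, None, "High", 68.0, 200.0),
    (2, 70, None, "High", 70.0, 189.0),
    (1, 12, 1, "Low", 20.0, 20.0),
    (2, 10, None, "Low", 25.0, 25.0),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def missing_per_variable(records):
    """Count the missing values in each column."""
    return {col: sum(r[col] is None for r in records) for col in COLUMNS}


def available_n(records, variables):
    """Number of subjects with every listed variable observed."""
    return sum(all(r[v] is not None for v in variables) for r in records)


missing = missing_per_variable(data)
n_sex_weight = available_n(data, ["sex", "weight"])
n_weight_height = available_n(data, ["weight", "height"])
n_age_height = available_n(data, ["age", "height"])
n_complete = available_n(data, COLUMNS)

print(missing)
print(n_sex_weight, n_weight_height, n_age_height, n_complete)
```

With these data, 8 subjects are available for the sex and weight analysis, 6 each for weight with height and age with height, and only 2 subjects have complete data on all six variables.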

10.0 CLASSIFICATION OF MISSING DATA

Random and non-random. Random missing data reduces efficiency and reliability. Non-random missing data introduces bias.
Variable and item missing data.
Sometimes it is worth collecting data again.

11.0 CAUSES OF MISSING DATA

Non-response: respondent does not know (include a do not know category).
Non-response due to the item not being clear to the respondent.
Data entry errors: data lost in keying or management.

12.0 PRACTICAL APPROACHES TO PREVENT MISSING DATA:

Make the hypothesis narrow and concise so that it leads to specific items.
Limit items to what is relevant as independent, dependent, or confounding variables.
Pilot test and remove problematic items; you may even return and realign your hypotheses.
Use a relevant sample.
Short and simple questionnaire to encourage completion
Call respondents to fill in missing data.
Care in recording, transcribing, and editing data

13.0 HANDLING MISSING DATA BY DELETION:

Complete case analysis (listwise deletion) removes every subject with a missing value on any variable, so only fully observed subjects are analyzed and the sample size is reduced.
Available case analysis (pairwise deletion) uses all the data available for each analysis and so retains more of the sample. A subject is excluded only from the analyses that involve a variable missing for that subject.
Software programs often perform deletions without the analyst being aware. Check the effective sample size in the output.
Deletion can remove an entire row (subject) or an entire column (variable).
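Row deletion and column deletion can be compared with a short sketch. The records, the function names, and the 50% threshold for dropping a variable are all hypothetical choices made for illustration.

```python
def listwise_delete(records):
    """Row deletion: keep only subjects with no missing value in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]


def drop_sparse_columns(records, max_missing_fraction=0.5):
    """Column deletion: drop variables whose missing fraction exceeds the limit."""
    n = len(records)
    keep = [
        col
        for col in records[0]
        if sum(r[col] is None for r in records) / n <= max_missing_fraction
    ]
    return [{col: r[col] for col in keep} for r in records]


# Hypothetical records; None marks a missing value.
records = [
    {"age": 20, "weight": 56.0, "education": None},
    {"age": 30, "weight": 60.0, "education": 2},
    {"age": None, "weight": 58.0, "education": None},
    {"age": 40, "weight": 65.0, "education": None},
    {"age": 50, "weight": None, "education": 1},
]

rows_only = listwise_delete(records)
columns_first = listwise_delete(drop_sparse_columns(records))

print(len(rows_only))      # 1
print(len(columns_first))  # 3
```

Deleting rows alone leaves a single subject here, while first dropping the sparsely recorded variable leaves three, at the cost of losing that variable from the analysis.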

14.0 HANDLING MISSING DATA BY IMPUTATION

Imputation is the substitution of estimated values for missing data. Imputation enables you to make use of the data collected even though some is missing. Imputation is not prediction.
If data is missing completely at random, the best imputation method is to use the mean, median, or mode. Mean imputation is the simplest method of imputation; it preserves the mean of the data but lowers the variance.
If missing at random, use multiple imputation or regression imputation (which preserves correlation but reduces variability).
If missing not at random, use pattern substitution or maximum likelihood estimation.
Simple imputation uses the mean or median. Multiple imputation uses multivariate methods; for example, predicted regression values replace the missing data.
Evaluate imputation effectiveness by comparing the original data with the imputed data and how well they match.
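The effect of mean imputation on the mean and the variance can be checked directly. The sketch below uses the weight column from the example table in Section 9, with None standing for the missing value; the function name is hypothetical.

```python
from statistics import mean, variance

# Weights from the example table; None marks the missing value for subject 4.
weights = [56.0, 60.0, 58.0, None, 65.0, 66.0, 68.0, 70.0, 20.0, 25.0]


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed_values = [v for v in values if v is not None]
    fill = mean(observed_values)
    return [fill if v is None else v for v in values]


observed = [w for w in weights if w is not None]
imputed = mean_impute(weights)

# Compare the original (observed) data with the imputed data.
mean_before, mean_after = mean(observed), mean(imputed)
var_before, var_after = variance(observed), variance(imputed)

print(mean_before, mean_after)
print(var_before, var_after)
```

The mean is unchanged, but the sample variance falls because the imputed value sits exactly at the mean and adds no spread while increasing n.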

15.0 META ESSENTIALS FREE SOFTWARE