240422P - STATISTICAL CONSIDERATIONS OF DATA WITH SMALL SAMPLE SIZE OR MISSING DATA: An Overview

Presented at an online workshop for students of the Faculty of Pharmacy PNU on April 22, 2024 by Prof Omar Hasan Kasule Sr MB ChB (MUK), MPH (Harvard), DrPH (Harvard), Chairman, Research Ethics Committee King Abdullah Ibn Abdulaziz University Hospital.

 

1.0 OBJECTIVES:

  • To learn the statistical tests used for data with a small sample size.
  • To learn methods of handling missing data.
  • To understand the implications of small sample size and missing data for research findings.

 

2.0 DEFINITION OF A SMALL SAMPLE

  • The normal approximation given by the central limit theorem (CLT) is unreliable for small samples.
  • A small sample is defined as n < 30 for a quantitative outcome, or np or n(1 − p) < 8 (where p is the proportion) for a categorical outcome.
  • With a small sample, the sampling distribution of the mean cannot be assumed to be approximately normal, which is the most important condition for large-sample tests.
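The rule of thumb above can be written as a short check. This is a minimal sketch: the function name and parameters are illustrative, and the cutoffs of 30 and 8 are taken from the definition in this section.

```python
def is_small_sample(n, p=None, n_cutoff=30, count_cutoff=8):
    """Return True if the sample is 'small' under the rule of thumb above.

    n : sample size
    p : proportion for a categorical outcome (None for a quantitative outcome)
    """
    if p is None:
        # Quantitative outcome: compare n with the cutoff
        return n < n_cutoff
    # Categorical outcome: small if either np or n(1 - p) is below the cutoff
    return min(n * p, n * (1 - p)) < count_cutoff


print(is_small_sample(25))         # True: n < 30
print(is_small_sample(100, 0.05))  # True: np = 5 < 8
print(is_small_sample(100, 0.5))   # False: np = n(1 - p) = 50
```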

 

3.0 STATISTICAL TESTS FOR DATA WITH SMALL SAMPLE SIZE

  • Continuous data: t-test, for n < 30 (or, by a looser cutoff, n < 60)
  • Categorical data: Fisher’s exact test
  • Others: non-parametric tests, regression models, meta-analysis models
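As an illustration of the categorical case, Fisher's exact test for a 2x2 table can be computed directly from the hypergeometric distribution. The sketch below uses only the Python standard library. The function name and the example counts are hypothetical and are not taken from the workshop data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x in the top-left cell with all margins held fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every table that is no more probable than the observed one
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: 3 of 4 exposed and 1 of 4 unexposed have the outcome
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```

With only 8 subjects the test stays valid, but a p-value of about 0.49 shows how little power such a small table has.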

 

4.0 META ANALYSIS: ACHIEVING GENERALIZABILITY FROM SEVERAL SIMILAR SMALL SAMPLES USING EXCEL META ESSENTIALS

  • Reference: Suurmond, R., van Rhee, H., & Hak, T. (2017). Introduction, comparison, and validation of Meta‐Essentials: A free and simple tool for meta‐analysis. Research Synthesis Methods, 8(4), 537–553. http://doi.org/10.1002/jrsm.1260
  • Set inclusion/exclusion criteria and identify reports using specific keywords
  • Check the quality of each report using its methodology and results
  • Select reports that are homogeneous enough to be combined
  • Replace the numbers in the Excel sheet with numbers from your search
  • Compute the combined estimate and other parameters
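To show what "compute the combined estimate" involves, here is a minimal sketch of a fixed-effect, inverse-variance pooled estimate with Cochran's Q as a rough homogeneity check. It is a generic textbook calculation, not the Meta-Essentials workbook's own computation. The function name and the study values are hypothetical.

```python
from math import sqrt


def fixed_effect_pool(effects, std_errors):
    """Pool study effects using inverse-variance (fixed-effect) weights."""
    weights = [1 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    # Cochran's Q: weighted squared deviations of each study from the pooled value
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci, q


# Hypothetical effect sizes and standard errors from three small studies
pooled, pooled_se, ci, q = fixed_effect_pool([0.20, 0.40, 0.30], [0.10, 0.10, 0.20])
print(f"pooled = {pooled:.3f}, SE = {pooled_se:.3f}, Q = {q:.2f}")
```

The pooled standard error is smaller than that of any single study, which is how several small samples together gain precision. A large Q relative to its degrees of freedom (number of studies minus 1) suggests the reports may not be homogeneous enough to combine.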

 

5.0 HANDLING MISSING DATA

  • Deletion of missing items or variables
  • Imputation to replace missing data

 

6.0 IMPLICATIONS OF SMALL SAMPLE SIZE AND MISSING DATA ON RESEARCH FINDINGS
  • Very small samples undermine the internal and external validity of a study by being prone to errors: type II (false negative) due to inadequate power and type I (false positive) due to bias.
  • A small sample size is not always bad. It can allow more careful, accurate, high-quality data collection that supports internal validity, in the knowledge that external validity can be achieved by meta-analysis.
  • Comparative studies with good control of confounders can use small samples. Animal studies carried out in very controlled conditions do not require large samples.
  • If the population is less than 100, do not sample; survey all of them. You will then have population statistics rather than sample statistics.
 
7.0 IMPLICATIONS OF LARGE SAMPLE SIZE ON RESEARCH FINDINGS
  • The move to mega samples is driven by easy data availability through IT. Such samples are not always necessary, though journal editors and research funders like them.
  • Large samples reduce the relative impact of missing data and help randomization distribute unknown confounders evenly.
  • Large samples are needed to detect small differences or to study rare events.
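The link between the size of the difference and the required sample can be illustrated with the standard normal-approximation formula for comparing two means, n per group = 2(z₁₋α/₂ + z₁₋β)² / d², where d is the difference divided by the common standard deviation. This formula is not part of the workshop notes. The sketch assumes two equal groups, a two-sided test, and a known common standard deviation, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a standardized difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


for d in (1.0, 0.5, 0.2):
    print(d, n_per_group(d))
# 1.0 16
# 0.5 63
# 0.2 393
```

Halving the difference to be detected roughly quadruples the required sample size.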
 
8.0 RELATION BETWEEN MISSING DATA AND SMALL SAMPLE SIZE
  • Missing data reduces the effective sample size because most analysis programs eliminate the missing items.
  • The larger the sample size the less the relative effect of missing data. Very large sample sizes can tolerate missing data.
  • Most statistical analysis formulas are based on large samples. They are not efficient when data are missing or the sample is small.
 

9.0 EXAMPLE OF MISSING DATA

Subject  Sex  Age  Education  Income  Weight  Height
1        na   20   na         na      56.0    na
2        2    30   2          High    60.0    130.0
3        1    25   na         High    58.0    125.0
4        na   na   na         High    na      na
5        2    40   1          High    65.0    na
6        2    50   2          High    66.0    na
7        1    60   na         High    68.0    200
8        2    70   na         High    70.0    189
9        1    12   1          Low     20.0    20
10       2    10   na         Low     25.0    25

 

Deletion?

  • Association between sex and weight: t-test
  • Association between sex and weight (weight grouped into categories, since Fisher's exact test needs two categorical variables): Fisher's exact test
  • Correlation between weight and height: Spearman correlation coefficient
  • Association between age and height: regression coefficient
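To see what deletion does to the example table, the sketch below counts how many subjects remain for each analysis listed above. The table values are copied from this section, with None standing for "na". The helper names are illustrative.

```python
COLUMNS = ("subject", "sex", "age", "education", "income", "weight", "height")

# Example table from this section; None stands for "na"
ROWS = [
    (1, None, 20, None, None, 56.0, None),
    (2, 2, 30, 2, "High", 60.0, 130.0),
    (3, 1, 25, None, "High", 58.0, 125.0),
    (4, None, None, None, "High", None, None),
    (5, 2, 40, 1, "High", 65.0, None),
    (6, 2, 50, 2, "High", 66.0, None),
    (7, 1, 60, None, "High", 68.0, 200.0),
    (8, 2, 70, None, "High", 70.0, 189.0),
    (9, 1, 12, 1, "Low", 20.0, 20.0),
    (10, 2, 10, None, "Low", 25.0, 25.0),
]


def complete_n(rows, columns):
    """Count subjects with no missing value in any of the given columns."""
    idx = [COLUMNS.index(c) for c in columns]
    return sum(all(row[i] is not None for i in idx) for row in rows)


# Listwise deletion: a subject must be complete on every variable
n_listwise = complete_n(ROWS, COLUMNS[1:])

# Pairwise deletion: a subject only needs the two variables being analyzed
n_sex_weight = complete_n(ROWS, ("sex", "weight"))
n_weight_height = complete_n(ROWS, ("weight", "height"))
n_age_height = complete_n(ROWS, ("age", "height"))

print(n_listwise, n_sex_weight, n_weight_height, n_age_height)  # 2 8 6 6
```

Only subjects 2 and 9 are complete on every variable, so listwise deletion leaves 2 of 10 subjects, while pairwise deletion keeps 6 to 8 subjects depending on the analysis.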

 

10.0 CLASSIFICATION OF MISSING DATA

  • Random and non-random. Random missing data reduces efficiency and reliability. Non-random missing data introduces bias.
  • Variable and item missing data.
  • Sometimes it is worth collecting data again.

 

11.0 CAUSES OF MISSING DATA

  • Non-response: respondent does not know (include a do not know category).
  • Non-response due to the item not being clear to the respondent.
  • Data entry errors: data lost in keying or management.

 

12.0 PRACTICAL APPROACHES TO PREVENT MISSING DATA:

  • Make the hypothesis narrow and concise so that it leads to specific items, limited to what is relevant as independent, dependent, or confounding variables.
  • Pilot test and remove problematic items; you may even return and realign your hypotheses.
  • Use a relevant sample.
  • Keep the questionnaire short and simple to encourage completion.
  • Call respondents to fill in missing data.
  • Take care in recording, transcribing, and editing data.

 

13.0 HANDLING MISSING DATA BY DELETION:

  • Complete case analysis (listwise deletion) uses only subjects with no missing data on any variable in the analysis, which reduces the sample size.
  • Available case analysis (pairwise deletion) uses all the data available for each analysis and so uses more of the sample. A subject is dropped only from the analyses for which that subject's data is missing.
  • Software programs often perform these deletions without informing the user. Check the effective sample size in the output.
  • Deletion can remove an entire row (subject) or an entire column (variable).

 

14.0 HANDLING MISSING DATA BY IMPUTATION

  • Imputation is the substitution of values for missing data. It enables you to make use of the data collected even though some is missing. Imputation is not prediction.
  • If data is missing completely at random, a simple imputation method is to use the mean, median, or mode. Mean imputation is the simplest method; it preserves the mean of the data but lowers the variance.
  • If data is missing at random, use multiple imputation or regression imputation (which preserves correlation but reduces variability).
  • If data is missing not at random, use pattern substitution or maximum likelihood estimation.
  • Simple imputation uses the mean or median. Multiple imputation uses multivariate methods; for example, predicted regression values replace the missing data.
  • Evaluate imputation effectiveness by comparing the original data with the imputed data and how well they match.
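The effect of mean imputation on the mean and variance can be checked on the Age column of the example table, where subject 4 is missing. This is a minimal sketch using the standard library. The function name is illustrative.

```python
from statistics import mean, variance

# Age column of the example table; None stands for "na"
ages = [20, 30, 25, None, 40, 50, 60, 70, 12, 10]


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


observed_ages = [a for a in ages if a is not None]
imputed_ages = mean_impute(ages)

print(f"mean:     {mean(observed_ages):.2f} -> {mean(imputed_ages):.2f}")
print(f"variance: {variance(observed_ages):.2f} -> {variance(imputed_ages):.2f}")
```

The mean is unchanged. The sample variance drops because the imputed value adds nothing to the sum of squared deviations while the sample size increases by one.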

 

15.0 META ESSENTIALS FREE SOFTWARE