Practical and Theoretical Implications of Epidemiological Research Based on Full Sample Large Data Analysis
Omar Hasan Kasule Sr1
1Institute of Medicine, Universiti Brunei Darussalam
This study was motivated by the observation that recent epidemiological research is increasingly based on existing large databases, with 100% sampling the usual practice. This is a reversal of the traditional epidemiological practice of selecting a probability sample from a study population in order to reach conclusions about the target population.
The objective of the research was to survey epidemiological research published in 2006 in three high-impact journals to ascertain whether the tendency toward 100% sampling from large databases had become the norm. The three journals selected for study were the American Journal of Epidemiology, the European Journal of Epidemiology, and the International Journal of Epidemiology.
A pre-tested data abstraction form was used to abstract the following essential information from each original research article: title, authors, issue and volume number, date of publication, type of study (cross-sectional, case-control, cohort, randomized community trial, randomized clinical trial), target population, study population, sampling fraction, type of sampling (simple random, stratified random, systematic random, multi-stage, non-random), source of data (existing database, fresh data collection, prior study), and total number of study subjects. The data were keyed into an SPSS database for categorical analysis using the chi-square test statistic to test for association, as sketched below.
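For illustration only, a minimal sketch of the equivalent categorical analysis in Python (the actual analysis was performed in SPSS; the variable names and records here are hypothetical, not data from the study):

```python
# Illustration only: the study used SPSS. This sketch reproduces the same
# kind of categorical analysis (chi-square test of association) in Python.
# The variables "journal" and "data_source" are hypothetical stand-ins for
# fields on the data abstraction form.
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical abstracted records: one row per original research article.
articles = pd.DataFrame({
    "journal": ["AJE", "AJE", "EJE", "IJE", "IJE", "EJE"],
    "data_source": ["existing database", "fresh data collection",
                    "existing database", "existing database",
                    "prior study", "fresh data collection"],
})

# Cross-tabulate journal against source of data.
table = pd.crosstab(articles["journal"], articles["data_source"])

# Chi-square test of association between journal and data source.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
```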
Results will be presented showing the increasing trend of epidemiological research based on large data sets of routinely collected data or data left over from previous research. The research trends will be described and characterized with respect to size of study, methods of sampling, and implications for both internal and external validity.
The findings of the study indicate a major change in epidemiological research with serious practical and theoretical implications. The availability of large databases and high-speed computers has encouraged epidemiologists to analyze data without probability sampling. A large data set gives very stable parameter estimates, but the same degree of precision could have been obtained from a much smaller sample, as the illustration below shows. What is lost is the ability of the epidemiologist to inspect a small, manageable data set, internalize it, and let intuition act before the data are analyzed. The more intimate contact of the epidemiologist with the data traditionally accounted for the deep understanding and discussion that are missed in the new trend. Easy availability of large databases also encourages epidemiologists to plunge into data analysis before thinking seriously about the research questions. In some cases the research questions are prompted by preliminary analysis, which can lead to numerous biases. Use of large data sets has the advantage of external validity, which has never been the primary objective of epidemiological research. Epidemiologists have traditionally aimed to carry out a small study based on probability sampling so that they can easily identify and control confounding and other sources of bias, with the ultimate aim of internal validity. They knew that external validity (generalization) would be attained inductively by considering several studies that are each internally valid. Use of large sets of routinely collected data also raises the issue of data quality, since such data are collected with service and administrative, rather than research, considerations in mind.
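To make the precision point concrete, a standard back-of-envelope calculation (illustrative numbers, not results from the study): the standard error of an estimated proportion shrinks only with the square root of the sample size, so beyond a moderate n the additional precision bought by millions of records is negligible.

```latex
% Standard error of an estimated proportion \hat{p} from a sample of size n.
% Illustrative numbers (not from the study), taking p = 0.5:
%   n = 10,000:     SE = sqrt(0.25 / 10,000)    = 0.005  (half a percentage point)
%   n = 1,000,000:  SE = sqrt(0.25 / 1,000,000) = 0.0005
% A hundred-fold increase in n yields only a ten-fold reduction in SE.
\[
  \mathrm{SE}(\hat{p}) \;=\; \sqrt{\frac{p(1-p)}{n}}
  \qquad\Longrightarrow\qquad
  \mathrm{SE} \;\propto\; \frac{1}{\sqrt{n}}
\]
```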
The paper concludes that more thought should be given to the implications of this observed change in the paradigms and practices of epidemiological research.