
200623P - DATA COLLECTION AND MANAGEMENT


Presented in the Biostatistics module of the Clinical Research Coordinators Course on June 23, 2020, 10.00 – 11.00 by Professor Omar Hasan Kasule MB ChB (MUK), MPH (Harvard), DrPH (Harvard)  Professor of Epidemiology and Bioethics King Fahad Medical City


SOURCES OF SECONDARY DATA

Decennial censuses: Census data is reliable. It is wide in scope, covering demographic, social, economic, and health information. The census describes population composition by sex, race/ethnicity, residence, marital status, and socio-economic indicators.

Vital statistics: Vital events are births, deaths, marriages and divorces, and some disease conditions.

Routinely collected data: Routinely collected data are cheap but may be unavailable or incomplete. They are obtained from medical facilities, life and health insurance companies, institutions (like prisons, army, schools), disease registries, and administrative records.

Epidemiological studies: Observational epidemiological studies are of 3 types: cross-sectional, case-control, and follow-up/cohort studies.

Special health surveys: Special surveys cover a larger population than epidemiological studies and may be health, nutritional, or socio-demographic surveys.


PRIMARY DATA COLLECTION BY QUESTIONNAIRE

Questionnaire design involves content, the wording of questions, format, and layout. 

The reliability and validity of the questionnaire, as well as practical logistics, should be tested during the pilot study.

Informed consent and confidentiality must be respected. 

A protocol must be written to set out data collection procedures. 

Questionnaire administration by face-to-face interview is the best method but is expensive.

Questionnaire administration by telephone is cheaper. 

Questionnaire administration by mail is very cheap but has a lower response rate. 

A computer-administered questionnaire is associated with more honest responses.


PHYSICAL PRIMARY DATA COLLECTION

Data can be obtained by clinical examination, standardized psychological/psychiatric evaluation, measurement of environmental or occupational exposure, assay of biological specimens (endobiotic or xenobiotic), and laboratory experiments.

Pharmacological experiments involve bioassay, quantal dose-effect curves, dose-response curves, and studies of drug elimination. 

Physiology experiments involve measurements of parameters of the various body systems. 

Microbiology experiments involve bacterial counts, immunoassays, and serological assays. 

Biochemical experiments involve measurements of concentrations of various substances. 

Statistical and graphical techniques are used to display and summarize this data.


DATA ENTRY AND VALIDATION

Self-coding or pre-coded questionnaires are preferable. 

Data is input as text, multiple-choice, numeric, date and time, and yes/no responses. 

In double-entry techniques, 2 data entry clerks enter the same data and a check is made by computer on items on which they differ. 
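The double-entry check described above can be sketched as a field-by-field comparison of the two clerks' files. This is a minimal illustration, not a prescribed procedure: the record identifiers, field names, and the `find_discrepancies` helper are all hypothetical.

```python
def find_discrepancies(entry_a, entry_b):
    """Return (record_id, field, value_a, value_b) for every field that differs."""
    discrepancies = []
    # Look at every record that appears in either clerk's file.
    for record_id in sorted(set(entry_a) | set(entry_b)):
        rec_a = entry_a.get(record_id, {})
        rec_b = entry_b.get(record_id, {})
        # Compare every field that appears in either version of the record.
        for field in sorted(set(rec_a) | set(rec_b)):
            val_a = rec_a.get(field)
            val_b = rec_b.get(field)
            if val_a != val_b:
                discrepancies.append((record_id, field, val_a, val_b))
    return discrepancies


# Hypothetical data: the same two questionnaires keyed in by two clerks.
clerk_1 = {
    "P001": {"age": 34, "sex": "F", "sbp": 120},
    "P002": {"age": 51, "sex": "M", "sbp": 145},
}
clerk_2 = {
    "P001": {"age": 34, "sex": "F", "sbp": 120},
    "P002": {"age": 15, "sex": "M", "sbp": 145},  # transposed digits
}

for item in find_discrepancies(clerk_1, clerk_2):
    print(item)
```

Each flagged item would then be resolved by going back to the original questionnaire.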

Data in the computer can be checked manually against the original questionnaire. 

Interactive data entry enables detection and correction of logical and entry errors immediately. 


DATA REPLICATION

Data replication is a copy management service that involves copying the data and also managing the copies. 

Synchronous data replication is instantaneous updating with no latency in data consistency. 

In asynchronous data replication the updating is not immediate and consistency is loose. 
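The difference between the two modes can be shown with a toy in-memory model. This is a simplified sketch under stated assumptions: the `ReplicatedStore` class and its methods are hypothetical, and real replication systems also deal with networks, failures, and ordering, which are ignored here.

```python
class ReplicatedStore:
    """Toy model of one primary copy and one replica (hypothetical)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}
        self.replica = {}
        self.pending = []  # updates not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: the replica is updated as part of the write.
            self.replica[key] = value
        else:
            # Asynchronous: the update is queued and applied later.
            self.pending.append((key, value))

    def sync(self):
        """Apply queued updates to the replica (asynchronous mode)."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


sync_store = ReplicatedStore(synchronous=True)
sync_store.write("P001", "admitted")
print(sync_store.replica == sync_store.primary)  # copies agree immediately

async_store = ReplicatedStore(synchronous=False)
async_store.write("P001", "admitted")
print(async_store.replica == async_store.primary)  # replica lags behind
async_store.sync()
print(async_store.replica == async_store.primary)  # consistent after sync
```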


DATA EDITING

Data editing is the process of correcting data collection and data entry errors. 

The data is 'cleaned' using logical, statistical, range, and consistency checks. 
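Range, logical, and consistency checks can be sketched as simple rules applied to each record. The field names, the 0 to 120 age limits, and the `check_record` helper below are assumptions made for illustration; statistical checks are not shown.

```python
from datetime import date


def check_record(record):
    """Return a list of problems found in one record (hypothetical rules)."""
    problems = []
    # Range check: the value must fall within plausible limits.
    if not (0 <= record["age"] <= 120):
        problems.append("age out of range")
    # Logical check: two fields must not contradict each other.
    if record["sex"] == "M" and record["pregnant"] == "yes":
        problems.append("male recorded as pregnant")
    # Consistency check: dates must be in a sensible order.
    if record["discharge_date"] < record["admission_date"]:
        problems.append("discharge before admission")
    return problems


record = {
    "age": 134,
    "sex": "M",
    "pregnant": "no",
    "admission_date": date(2020, 6, 23),
    "discharge_date": date(2020, 6, 20),
}
print(check_record(record))
```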

All values should be recorded at the same level of precision (number of decimal places) to make computations consistent and decrease rounding-off errors.

The kappa statistic is used to measure inter-rater agreement. 
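As an illustration, assuming two raters and categorical ratings (Cohen's kappa, one common form of the statistic), kappa compares the observed proportion of agreement p_o with the proportion expected by chance p_e: kappa = (p_o - p_e) / (1 - p_e). The function name and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who rated the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: proportion of items given the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 10 items by two raters.
rater_a = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"]
rater_b = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]
print(round(cohen_kappa(rater_a, rater_b), 2))
```

In this example the raters agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.5, so kappa is about 0.6.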

Data editing identifies and corrects errors such as invalid or inconsistent values. 

The main data problems are missing data, coding and entry errors, inconsistencies, irregular patterns, digit preference, outliers, rounding-off / significant figures, questions with multiple valid responses, and record duplication. 
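Three of these problems (missing data, record duplication, and digit preference) can be screened for with simple counts. The records and field names below are invented for illustration only.

```python
from collections import Counter

# Hypothetical records: systolic blood pressure (sbp) by patient id.
records = [
    {"id": "P001", "sbp": 120},
    {"id": "P002", "sbp": 130},
    {"id": "P002", "sbp": 130},   # duplicated record
    {"id": "P003", "sbp": None},  # missing value
    {"id": "P004", "sbp": 140},
    {"id": "P005", "sbp": 137},
]

# Missing data: records with no value for the field.
missing_ids = [r["id"] for r in records if r["sbp"] is None]

# Record duplication: ids that occur more than once.
id_counts = Counter(r["id"] for r in records)
duplicate_ids = sorted(i for i, count in id_counts.items() if count > 1)

# Digit preference: an excess of one terminal digit (often 0 or 5).
last_digits = Counter(r["sbp"] % 10 for r in records if r["sbp"] is not None)

print(missing_ids)
print(duplicate_ids)
print(last_digits.most_common())
```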

Data transformation is the process of creating new derived variables preliminary to analysis. It includes mathematical operations such as division, multiplication, addition, or subtraction, and mathematical transformations such as logarithmic, trigonometric, power, and z-transformations.
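A short sketch of these transformations follows, using body mass index (weight divided by height squared) as a hypothetical derived variable. The data values are invented, and the z-transformation here uses the sample standard deviation.

```python
import math
import statistics

# Hypothetical raw measurements.
weights_kg = [60.0, 72.5, 81.0, 95.0]
heights_m = [1.60, 1.75, 1.80, 1.90]

# Derived variable from division and a power: BMI = weight / height^2.
bmi = [w / h ** 2 for w, h in zip(weights_kg, heights_m)]

# Logarithmic transformation (natural log).
log_bmi = [math.log(b) for b in bmi]

# z-transformation: subtract the mean, divide by the standard deviation.
mean_bmi = statistics.mean(bmi)
sd_bmi = statistics.stdev(bmi)
z_bmi = [(b - mean_bmi) / sd_bmi for b in bmi]

print([round(b, 1) for b in bmi])
print([round(z, 2) for z in z_bmi])
```

After the z-transformation the values have mean 0 and standard deviation 1, which puts variables measured on different scales on a common footing.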


DATA STORAGE

Data gives rise to information that in turn gives rise to knowledge. Knowledge leads to understanding. Understanding leads to wisdom. 

Data may be univariate if it has only one variable. It may be bivariate if it has two variables allowing correlation. It may be multivariate with several variables allowing more sophisticated analyses. 

A document is stored data in any form: paper, book, letter, message, image, e-mail, voice, and sound. Some documents are ephemeral but can still be retrieved for the brief time that they exist and are recoverable. 

Data is physically stored as bytes. A byte has 8 bits and can therefore represent 2^8 = 256 characters.
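The arithmetic can be checked directly: 8 bits, each either 0 or 1, give 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 distinct patterns. The sketch below also shows one character stored as one byte, assuming ASCII encoding.

```python
bits_per_byte = 8
distinct_values = 2 ** bits_per_byte
print(distinct_values)  # 256

# One ASCII character occupies one byte; "A" is stored as the value 65.
encoded = "A".encode("ascii")
print(len(encoded), encoded[0], format(encoded[0], "08b"))  # 1 65 01000001
```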


DATA COMPRESSION AND FORMATTING

Data compression makes document retrieval easier because the search is carried out in a smaller space. Character, image, and sound data can all be compressed; however, compression may involve loss of some data. Data compression facilitates data storage and data retrieval.
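As a small illustration of compressing character data, the sketch below uses Python's standard `zlib` module on an invented, highly repetitive text. Note that this is lossless compression, where the original is recovered exactly; lossy methods, typically used for images and sound, discard some data and are not shown.

```python
import zlib

# Hypothetical repetitive character data, encoded as bytes.
text = ("patient_id,visit,systolic_bp\n" + "P001,1,120\n" * 200).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), len(compressed))  # original size vs compressed size in bytes
print(restored == text)            # lossless: the original is recovered exactly
```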

Data may be formatted in tables of several types of databases (relational, hierarchical, and network). It may be unformatted such as images, sound, or electronic monitoring in the hospital. Formatted documents are easier to retrieve. 
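To illustrate why formatted data is easier to retrieve, the sketch below stores a few invented records in a relational table using Python's built-in `sqlite3` module and retrieves a subset with a query. The table name, columns, and values are hypothetical.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id TEXT PRIMARY KEY, age INTEGER, diagnosis TEXT)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [
        ("P001", 34, "asthma"),
        ("P002", 51, "diabetes"),
        ("P003", 67, "diabetes"),
    ],
)

# Because the data is formatted into named columns, retrieval is one query.
rows = conn.execute(
    "SELECT patient_id, age FROM patients WHERE diagnosis = ? ORDER BY patient_id",
    ("diabetes",),
).fetchall()
conn.close()

print(rows)
```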

Files may be described as sequential, indexed, tree-structured, or clustered.

MEDLINE and PDQ are examples of medical databases. MEDLINE was established in 1971. Every year 400,000 articles from 3,700 journals are added and indexed using Medical Subject Headings (MeSH). GRATEFUL MED was a software interface used to search MEDLINE. PDQ (Physician Data Query) is a database about cancer.


DATA WAREHOUSING

Data warehousing is a method of extraction of data from various sources, storing it as historical and integrated data for use in decision-support systems. 

Metadata is a term used for the definition of data stored in the data warehouse (i.e. data about data). 

A data model is a graphic representation of the data either as diagrams or charts. The data model reflects the essential features of an organization. The purpose of a data model is to facilitate communication between the analyst and the user. It also helps create a logical discipline in database design. 


DATA MINING

Data mining is the discovery part of knowledge discovery in databases (KDD), involving knowledge engineering, classification, and problem-solving. KDD starts with selection, cleaning, enrichment, and coding.

The products of data mining are recognized patterns. These patterns are then applied to new situations in predicting and profiling.

Artificial intelligence (AI), based on machine learning, imbues computers with some creativity and decision-making capabilities using specific algorithms.