Workshop presented at the Research Database Management Workshop organized by the Directorate of Research, Ministry of Health, on 26th March by Professor Omar Hasan Kasule Sr., MB ChB (MUK), MPH (Harvard), DrPH (Harvard) EM; omarkasule@yahoo.com
Terminology
· A field/attribute/variable/variate is the characteristic measured for each member, e.g. name and weight. A value/element is the actual measurement or count, such as 5 cm or 10 kg.
· A record/observation is a collection of all the variables belonging to one individual. A file is a collection of records.
· A database is a collection of files.
· A data dictionary is an explanation or index of the data.
· A database goes through a life cycle of its own called the data life cycle. Data is collected and stored, and new data has to be collected to update the old.
· A census is obtained when the values of all members of a defined finite population are measured, i.e. the totality.
· Data models can take any of three shapes: relational, hierarchical, and network. In a relational model, any file or data element can be accessed from the outside in random order. A hierarchical data set is organized in layers: folders opened at the beginning contain other folders that cannot be seen from the outside.
Data coding
· Self-coding or pre-coded questionnaires are preferable to those requiring coding after data collection.
· Errors and inconsistencies can be introduced into the data during manual coding.
· A good pre-coded questionnaire can be produced after piloting the study.
Data entry
· Both random and non-random errors can occur in data entry. The following methods can be used to detect such errors:
· Double entry, in which 2 data entry clerks enter the same data and the computer checks for items on which they differ. The discrepant items are then re-checked in the questionnaires and reconciled. This method is based on the assumption that the probability of 2 persons making the same random error when entering an item of information is very low; the items on which the 2 agree are therefore likely to be valid. (A sketch of the computer check appears at the end of this list.)
· The data entered in the computer can be checked manually against the original questionnaire.
· Interactive data entry is becoming popular. It enables immediate detection and correction of logical and entry errors: the data entry program can be written to reject entries with unacceptable values or entries that are logically inconsistent.
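The computer check in double entry can be a simple field-by-field comparison. Below is a minimal Python sketch under the assumption that the two clerks' files are CSVs sharing an ID column; the file and column names are hypothetical.

```python
import csv

def double_entry_check(file_a, file_b, id_field="ID"):
    """Compare two independently keyed copies of the same data.

    Returns a list of (record id, field, value_a, value_b) tuples,
    one for every item on which the two data entry clerks differ.
    """
    with open(file_a, newline="") as fa, open(file_b, newline="") as fb:
        entries_a = {row[id_field]: row for row in csv.DictReader(fa)}
        entries_b = {row[id_field]: row for row in csv.DictReader(fb)}

    discrepancies = []
    for rec_id, row_a in entries_a.items():
        row_b = entries_b.get(rec_id)
        if row_b is None:  # record keyed by one clerk only
            discrepancies.append((rec_id, "<whole record>", "present", "missing"))
            continue
        for field, value_a in row_a.items():
            if row_b.get(field) != value_a:
                discrepancies.append((rec_id, field, value_a, row_b.get(field)))
    return discrepancies

# Items on which the two clerks differ are re-checked in the questionnaires.
for item in double_entry_check("entry_clerk1.csv", "entry_clerk2.csv"):
    print(item)
```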
Data replication
· Data replication is a copy management service that involves copying the data and also managing the copies.
· Synchronous data replication is instantaneous updating with no latency in data consistency.
· In asynchronous data replication the updating is not immediate and consistency is loose.
Data problems
· Missing data
· Coding and entry errors
· Inconsistencies
· Irregular patterns
· Digit preference
· Outliers
· Rounding-off / significant figures
· Questions with multiple valid responses
· Record duplication
Data input, processing, and editing
· Data input is of 5 different types: text, multiple choice, numeric, date and time, and yes/no responses.
· Full-screen data entry is very easy. It may be a dedicated data entry screen or may take the form of a spreadsheet or grid.
· Special import and export programs enable conversion of data from one format to another.
· Data editing consists of identifying and correcting errors. Some values of variables are unusual or impossible, and the distribution of a variable may not be reasonable.
· There may be inconsistencies in the data; for example, the age of a mother may not be consistent with her number of children, a male reported with a hysterectomy is an obvious mistake, and a person with a height of 2 meters is very unlikely to weigh only 50 kilograms.
· The codes may not be used in a consistent way; for example, 1 may denote male in some records and female in others. Checks like these can be automated, as in the sketch below.
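The following is a minimal pandas sketch of such automated edit checks, assuming hypothetical column names (age, sex, hysterectomy, height_cm, weight_kg) and an illustrative data file:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# Impossible values: flag records outside plausible ranges.
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Logical inconsistency: a male reported with a hysterectomy.
male_hysterectomy = df[(df["sex"] == "M") & (df["hysterectomy"] == 1)]

# Implausible combination: very tall subjects with very low weight.
tall_light = df[(df["height_cm"] >= 200) & (df["weight_kg"] <= 50)]

# Inconsistent coding: the sex variable should use one code set only.
unexpected_codes = df[~df["sex"].isin(["M", "F"])]

for name, frame in [("impossible age", bad_age),
                    ("male with hysterectomy", male_hysterectomy),
                    ("height/weight mismatch", tall_light),
                    ("unexpected sex codes", unexpected_codes)]:
    print(name, ":", len(frame), "record(s) to re-check")
```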
Validation of epidemiological data
· Some epidemiological data can be validated and other data cannot.
· Cigarette smoking can be validated by assessment of urine nicotine and blood carboxyhemoglobin.
· Food frequency data can be validated using the 7-day weighed diary method.
· Self-reported weight or height can be validated by standardized measurement.
· Self-reported disease status can be validated by clinical evaluation.
Assessing inconsistency due to various factors
· Consistency is measured as repeatability/reproducibility, reliability, and agreement.
· Reliability is consistency between observers or between techniques of measurement.
· Agreement refers to consistency between observers. Variation in measurements may be intra-observer or inter-observer.
· Repeatability/reproducibility is obtaining the same result on repeat assessment under the same conditions. Repeatability can be analyzed to separate the effects due to various factors.
· In a Latin square design, several observers interview the same patients in a pre-determined order. ANOVA is then used to separate the effects due to the observer, the order, and the subject, as sketched below.
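As an illustration, the following is a minimal sketch of such an analysis with statsmodels, assuming a long-format table with one row per interview and hypothetical columns score, observer, order, and subject:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Long format: one row per (subject, observer) interview.
df = pd.read_csv("latin_square.csv")  # hypothetical file

# Additive model separating observer, order, and subject effects.
model = smf.ols("score ~ C(observer) + C(order) + C(subject)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # one F-test per factor
```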
Assessing consistency of categorical data using the kappa statistic
The kappa statistic can be used to study reproducibility when two observers assess subjects on a categorical variable, as shown in the table below:
|             | Observer A+ | Observer A- | Total |
| Observer B+ | a           | b           | a+b   |
| Observer B- | c           | d           | c+d   |
| Total       | a+c         | b+d         | N     |
· The kappa statistic is computed as k = (p1 - p0) / (1 - p0), where p1 = observed agreement = (a+d)/N and p0 = expected agreement = {(a+c)/N × (a+b)/N} + {(b+d)/N × (c+d)/N}, as in the sketch below.
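A minimal sketch of this computation in Python, with hypothetical cell counts:

```python
def cohen_kappa(a, b, c, d):
    """Kappa from the 2x2 agreement table above.

    a: both observers positive, d: both observers negative,
    b and c: the two kinds of disagreement.
    """
    n = a + b + c + d
    p1 = (a + d) / n                            # observed agreement
    p0 = ((a + c) / n) * ((a + b) / n) \
       + ((b + d) / n) * ((c + d) / n)          # expected agreement
    return (p1 - p0) / (1 - p0)

# Hypothetical counts: 40 both-positive, 10 and 6 disagreements, 44 both-negative.
print(round(cohen_kappa(40, 10, 6, 44), 3))     # 0.68
```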
Assessing consistency of continuous data using the mean difference
· For continuous variables the data layout is as shown below:
| Subject | Observer A | Observer B | Difference |
| 1       |            |            |            |
| 2       |            |            |            |
| 3       |            |            |            |
| ...     |            |            |            |
| N       |            |            |            |
· We compute the mean difference and the standard error of the mean difference in order to construct a 95% confidence interval for the mean difference.
· If the interval contains the null value of zero, we conclude that there is agreement between the two observers. A sketch of the computation follows.
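A minimal sketch of this computation, using two hypothetical vectors of paired readings (the normal-approximation 95% interval is used; with few subjects a t critical value is more appropriate):

```python
import numpy as np

obs_a = np.array([120.0, 118.0, 130.0, 125.0, 140.0, 135.0])  # hypothetical readings
obs_b = np.array([122.0, 117.0, 133.0, 124.0, 143.0, 138.0])

diff = obs_a - obs_b
mean_diff = diff.mean()
se = diff.std(ddof=1) / np.sqrt(len(diff))           # SE of the mean difference
ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)  # approximate 95% CI

print(f"mean difference = {mean_diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# If the interval contains 0, the two observers are judged to agree.
```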
Preliminary data examination
· Examination of tables and graphics (histograms, stem and leaf plots, dot plots, box plots, side-by-side plots, scatter plots, line graphs, etc.).
· Descriptive statistics are used to detect errors, ascertain the normality of the data, and know the size of the cells.
· Tabulation requires creating about 5 categories. The category boundaries can be chosen in several ways, including using percentiles, as sketched below.
· Open-ended categories should be avoided.
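A minimal pandas sketch of these preliminary checks, assuming a hypothetical numeric column named age:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# Descriptive statistics: detect impossible values and gauge the distribution.
print(df["age"].describe())

# About 5 categories with percentile (quintile) boundaries.
df["age_group"] = pd.qcut(df["age"], q=5)
print(df["age_group"].value_counts().sort_index())  # size of each cell
```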
How to deal with missing values
· Analysis may be confined to complete records only, with records that have missing data being deleted. This approach is easy to implement and is valid if the incomplete records occur at random; if the missingness is systematic, this procedure will introduce bias.
· Alternatively, the missing data can be predicted (imputed) using regression analysis or maximum likelihood methods, as sketched below.
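A minimal sketch of both approaches in pandas, assuming a hypothetical file where weight_kg is sometimes missing and can be predicted from height_cm by simple linear regression:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# 1. Complete-case analysis: drop records with any missing value.
complete = df.dropna()

# 2. Regression imputation: predict missing weight from height.
known = df.dropna(subset=["weight_kg", "height_cm"])
slope, intercept = np.polyfit(known["height_cm"], known["weight_kg"], deg=1)
missing = df["weight_kg"].isna()
df.loc[missing, "weight_kg"] = intercept + slope * df.loc[missing, "height_cm"]
```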
Data transformation 1: creating new variables
· Transformation is the process of creating new derived variables preliminary to analysis. The transformations may be simple, using ordinary arithmetical operators, or more complex, using mathematical transformations.
· New variables may be generated using arithmetical operations: (a) carrying out mathematical operations on old variables, such as division or multiplication; (b) combining 2 or more variables to generate a new one by addition, subtraction, multiplication, or division (see the sketch below).
· New variables can also be generated by mathematical transformations of variables for the purposes of stabilizing variances, linearizing relations, normalizing distributions (making them conform to the Gaussian distribution), or presenting data in a more acceptable scale of measurement.
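A minimal sketch of deriving new variables; body mass index is used here as an illustrative combination of two variables (column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# Simple arithmetic on one variable: convert height from cm to m.
df["height_m"] = df["height_cm"] / 100

# Combining two variables: BMI = weight (kg) / height (m) squared.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```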
Data transformation 2: mathematical transformations
· Four types of mathematical transformations are carried out on count or measurement data: logarithmic, trigonometric, power, and z-transformations.
· Logarithmic transformation: both the natural (Napierian, base e) and the common (base 10) logarithmic transformations can be used.
· Trigonometric transformation involves re-expressing data as its sine, cosine, or tangent.
· Power transformations can take any of three forms: the exponential transformation, the square root transformation, and the reciprocal transformation.
· Data can also be expressed as the z-score, which is the difference between the data value and the group mean divided by the group standard deviation.
· The probit and logit transformations are used for data expressed as proportions. Some of these transformations are sketched below.
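A minimal numpy sketch of several of these transformations, using hypothetical measurement and proportion vectors:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])   # hypothetical measurements
p = np.array([0.10, 0.35, 0.60, 0.85])      # hypothetical proportions

log_e = np.log(x)                   # natural (base e) logarithm
log_10 = np.log10(x)                # common (base 10) logarithm
sqrt_x = np.sqrt(x)                 # square root (power) transformation
recip = 1 / x                       # reciprocal transformation
z = (x - x.mean()) / x.std(ddof=1)  # z-score
logit = np.log(p / (1 - p))         # logit transformation for proportions
```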
Data transformation 3: variance-stabilizing transformations
· There are preferred transformations for the purpose of stabilizing the variance.
· The log transformation is preferred for measured data that follow the gamma distribution.
· The square root transformation is preferred for count data that follow the Poisson distribution.
· For proportions following the binomial distribution, the preferred transformation is the arcsine of the square root of the proportion, i.e. sin^-1(√x), as sketched below.
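A minimal numpy sketch of the three variance-stabilizing transformations, with hypothetical data:

```python
import numpy as np

measurements = np.array([1.2, 3.5, 8.9, 2.4])  # hypothetical gamma-like data
counts = np.array([0, 3, 7, 12])               # hypothetical Poisson counts
proportions = np.array([0.05, 0.40, 0.75])     # hypothetical binomial proportions

log_t = np.log(measurements)                 # for gamma-distributed measurements
sqrt_t = np.sqrt(counts)                     # for Poisson counts
arcsine_t = np.arcsin(np.sqrt(proportions))  # sin^-1(sqrt(p)) for proportions
```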
Data transformation 4: the ladder of transformations
· The ladder of transformations is used as a guide to which type of transformation to make. Both the x and y variables can be transformed.
· The range of the independent variable is divided into 3 roughly equal portions with roughly equal numbers of data points in each. A representative point is selected in each portion; it should lie roughly in the middle of the portion and need not be an actual data point.
· There is no need for a transformation if the line connecting the 1st and 2nd points has the same slope as the line connecting the 2nd and 3rd points.
· If the slopes are not equal, a transformation is needed.
· A line is drawn connecting the first and third representative points. If the middle point is above the line, the case is concave; if the middle point is below the line, the case is convex.
· In the convex case we go down the ladder of y transformations in the order y, y^(1/2), log(y), -1/y^(1/2), -1/y, -1/y^2, ... or up the ladder of x transformations in the order x, x^2, x^3, x^4, ...
· In the concave case we go up the ladder of y transformations in the order y, y^2, y^3, y^4, ... or down the ladder of x transformations in the order x, x^(1/2), log(x), -1/x^(1/2), -1/x, -1/x^2, ... (page 198, Ashish Sen and Muni Srivastava, Regression Analysis: Theory, Methods, and Applications, Springer). A sketch of the slope comparison follows.
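A minimal sketch of the slope comparison behind the ladder, using three hypothetical representative points:

```python
def ladder_hint(p1, p2, p3):
    """Compare the two half-slopes through three representative points.

    Returns 'none' if no transformation is needed, otherwise whether
    the pattern is concave (middle point above the chord) or convex.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    slope_12 = (y2 - y1) / (x2 - x1)
    slope_23 = (y3 - y2) / (x3 - x2)
    if abs(slope_12 - slope_23) < 1e-6 * max(abs(slope_12), abs(slope_23), 1):
        return "none"
    chord_y = y1 + (y3 - y1) * (x2 - x1) / (x3 - x1)  # chord height at x2
    return "concave" if y2 > chord_y else "convex"

# Hypothetical representative points from a square-root-like (concave) pattern.
print(ladder_hint((1, 1.0), (5, 2.2), (9, 3.0)))  # concave -> y up or x down
```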
EXERCISE #1: class data questionnaire: collection and entry
1. Construct a questionnaire to collect the following information from members of the class. Please do not give your identity, and approximate if you do not have full information: age, gender, weight, height, diastolic blood pressure, total cholesterol, number of pets at home, left- or right-handedness, and wearing glasses or contact lenses.
2. Pass the questionnaires around the class and let each person enter the data in Excel on their laptop.
| ID | Age | Gender | Weight (kg) | Height (cm) | Diastolic BP (mmHg) | No. pets | Handedness | Glasses |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
EXERCISE #2: data editing
Edit the data looking for the following:
· Missing data
· Coding and entry errors
· Inconsistencies
· Irregular patterns
· Digit preference
· Outliers
· Rounding-off / significant figures
· Questions with multiple valid responses
· Record duplication
EXERCISE #3: data transformation
Think of any new variables that can be derived from the primary variables.