Workshop presented at the Research Database Management Workshop organized by the Directorate of Research, Ministry of Health, on 26th March by Professor Omar Hasan Kasule Sr., MB ChB (MUK), MPH (Harvard), DrPH (Harvard) EM; omarkasule@yahoo.com
Terminology
· A field/attribute/variable/variate is the characteristic measured for each member, e.g. name and weight. A value/element is the actual measurement or count, such as 5 cm or 10 kg.
· A record/observation is a collection of all the variables belonging to one individual. A file is a collection of records.
· A database is a collection of files.
· A data dictionary is an explanation or index of the data.
· A database goes through a life cycle of its own called the data life cycle. Data is collected and stored, and new data has to be collected to update the old.
· A census is obtained when the values of all members of a defined finite population are measured, i.e. the totality.
· Data models can take any of three shapes: relational, hierarchical, and network. In a relational model, any file or data element can be accessed from the outside in random order. A hierarchical data set is organized in layers: folders opened at the beginning contain other folders that cannot be seen from the outside.
Data coding
· Self-coding or pre-coded questionnaires are preferable to those requiring coding after data collection.
· Errors and inconsistencies can be introduced into the data during manual coding.
· A good pre-coded questionnaire can be produced after piloting the study.
Data entry
· Both random and non-random errors can occur in data entry. The following methods can be used to detect such errors:
· Double entry, in which 2 data entry clerks enter the same data and the computer checks for items on which they differ. The discrepant items are then re-checked in the questionnaires and reconciled. This method is based on the assumption that the probability of 2 persons making the same random error when entering an item of information is very low; the items on which the 2 agree are therefore likely to be valid. (A sketch of the computer check appears at the end of this list.)
· The data entered in the computer can be checked manually against the original questionnaire.
· Interactive data entry is becoming popular. It enables immediate detection and correction of logical and entry errors: the data entry program can be written to reject entries with unacceptable values or entries that are logically inconsistent.
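The computer check in double entry can be a simple field-by-field comparison. Below is a minimal Python sketch under the assumption that the two clerks' files are CSVs sharing an ID column; the file and column names are hypothetical.

```python
import csv

def double_entry_check(file_a, file_b, id_field="ID"):
    """Compare two independently keyed copies of the same data.

    Returns a list of (record id, field, value_a, value_b) tuples,
    one for every item on which the two data entry clerks differ.
    """
    with open(file_a, newline="") as fa, open(file_b, newline="") as fb:
        entries_a = {row[id_field]: row for row in csv.DictReader(fa)}
        entries_b = {row[id_field]: row for row in csv.DictReader(fb)}

    discrepancies = []
    for rec_id, row_a in entries_a.items():
        row_b = entries_b.get(rec_id)
        if row_b is None:  # record keyed by one clerk only
            discrepancies.append((rec_id, "<whole record>", "present", "missing"))
            continue
        for field, value_a in row_a.items():
            if row_b.get(field) != value_a:
                discrepancies.append((rec_id, field, value_a, row_b.get(field)))
    return discrepancies

# Items on which the two clerks differ are re-checked in the questionnaires.
for item in double_entry_check("entry_clerk1.csv", "entry_clerk2.csv"):
    print(item)
```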
Data replication
· Data replication is a copy management service that involves copying the data and also managing the copies.
· Synchronous data replication is instantaneous updating with no latency in data consistency.
· In asynchronous data replication the updating is not immediate and consistency is loose.
Data problems
· Missing data
· Coding and entry errors
· Inconsistencies
· Irregular patterns
· Digit preference
· Outliers
· Rounding-off / significant figures
· Questions with multiple valid responses
· Record duplication
Data input, processing, and editing
· Data input is of 5 different types: text, multiple choice, numeric, date and time, and yes/no responses.
· Full-screen data entry is very easy. It may be a dedicated data entry screen or may take the form of a spreadsheet or grid.
· Special import and export programs enable conversion of data from one format to another.
· Data editing consists of identifying and correcting errors. Some values of variables are unusual or impossible, and the distribution of a variable may not be reasonable.
· There may be inconsistencies in the data; for example, the age of a mother may not be consistent with her number of children, a male reported with a hysterectomy is an obvious mistake, and a person with a height of 2 meters is very unlikely to weigh only 50 kilograms.
· The codes may not be used in a consistent way; for example, 1 may denote male in some records and female in others. Checks like these can be automated, as in the sketch below.
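The following is a minimal pandas sketch of such automated edit checks, assuming hypothetical column names (age, sex, hysterectomy, height_cm, weight_kg) and an illustrative data file:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# Impossible values: flag records outside plausible ranges.
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Logical inconsistency: a male reported with a hysterectomy.
male_hysterectomy = df[(df["sex"] == "M") & (df["hysterectomy"] == 1)]

# Implausible combination: very tall subjects with very low weight.
tall_light = df[(df["height_cm"] >= 200) & (df["weight_kg"] <= 50)]

# Inconsistent coding: the sex variable should use one code set only.
unexpected_codes = df[~df["sex"].isin(["M", "F"])]

for name, frame in [("impossible age", bad_age),
                    ("male with hysterectomy", male_hysterectomy),
                    ("height/weight mismatch", tall_light),
                    ("unexpected sex codes", unexpected_codes)]:
    print(name, ":", len(frame), "record(s) to re-check")
```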
Validation of epidemiological data
· Some epidemiological data can be validated and other data cannot.
· Cigarette smoking can be validated by assessment of urine nicotine and blood carboxyhemoglobin.
· Food frequency data can be validated using the 7-day weighed diary method.
· Self-reported weight or height can be validated by standardized measurement.
· Self-reported disease status can be validated by clinical evaluation.
Assessing inconsistency due to various factors
· Consistency is measured as repeatability/reproducibility, reliability, and agreement.
· Reliability is consistency between observers or between techniques of measurement.
· Agreement refers to consistency between observers. Variation in measurements may be intra-observer or inter-observer.
· Repeatability/reproducibility is obtaining the same result on repeat assessment under the same conditions. Repeatability can be analyzed to separate the effects due to various factors.
· In a Latin square design, several observers interview the same patients in a pre-determined order. ANOVA is then used to separate the effects due to the observer, the order, and the subject, as sketched below.
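As an illustration, the following is a minimal sketch of such an analysis with statsmodels, assuming a long-format table with one row per interview and hypothetical columns score, observer, order, and subject:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Long format: one row per (subject, observer) interview.
df = pd.read_csv("latin_square.csv")  # hypothetical file

# Additive model separating observer, order, and subject effects.
model = smf.ols("score ~ C(observer) + C(order) + C(subject)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # one F-test per factor
```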
Assessing consistency of categorical data using the kappa statistic
The kappa statistic can be used to study reproducibility when two observers assess subjects on a categorical variable, as shown in the table below:
|             | Observer A+ | Observer A- | Total |
| Observer B+ | a           | b           | a+b   |
| Observer B- | c           | d           | c+d   |
| Total       | a+c         | b+d         | N     |
· The kappa statistic is computed as k = (p1 - p0) / (1 - p0), where p1 = observed agreement = (a+d)/N and p0 = expected agreement = {(a+c)/N × (a+b)/N} + {(b+d)/N × (c+d)/N}, as in the sketch below.
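A minimal sketch of this computation in Python, with hypothetical cell counts:

```python
def cohen_kappa(a, b, c, d):
    """Kappa from the 2x2 agreement table above.

    a: both observers positive, d: both observers negative,
    b and c: the two kinds of disagreement.
    """
    n = a + b + c + d
    p1 = (a + d) / n                            # observed agreement
    p0 = ((a + c) / n) * ((a + b) / n) \
       + ((b + d) / n) * ((c + d) / n)          # expected agreement
    return (p1 - p0) / (1 - p0)

# Hypothetical counts: 40 both-positive, 10 and 6 disagreements, 44 both-negative.
print(round(cohen_kappa(40, 10, 6, 44), 3))     # 0.68
```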
Assessing consistency of continuous data using the mean difference
· For continuous variables the data layout is as shown below:
| Subject | Observer A | Observer B | Difference |
| 1       |            |            |            |
| 2       |            |            |            |
| 3       |            |            |            |
| ...     |            |            |            |
| N       |            |            |            |
· We compute the mean difference and the standard error of the mean difference in order to construct a 95% confidence interval for the mean difference.
· If the interval contains the null value of zero, we conclude that there is agreement between the two observers. A sketch of the computation follows.
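A minimal sketch of this computation, using two hypothetical vectors of paired readings (the normal-approximation 95% interval is used; with few subjects a t critical value is more appropriate):

```python
import numpy as np

obs_a = np.array([120.0, 118.0, 130.0, 125.0, 140.0, 135.0])  # hypothetical readings
obs_b = np.array([122.0, 117.0, 133.0, 124.0, 143.0, 138.0])

diff = obs_a - obs_b
mean_diff = diff.mean()
se = diff.std(ddof=1) / np.sqrt(len(diff))           # SE of the mean difference
ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)  # approximate 95% CI

print(f"mean difference = {mean_diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# If the interval contains 0, the two observers are judged to agree.
```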
Preliminary data examination
· Examination of tables and graphics (histograms, stem and leaf plots, dot plots, box plots, side-by-side plots, scatter plots, line graphs, etc.).
· Descriptive statistics are used to detect errors, ascertain the normality of the data, and know the size of the cells.
· Tabulation requires creating about 5 categories. The category boundaries can be chosen in several ways, including using percentiles, as sketched below.
· Open-ended categories should be avoided.
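A minimal pandas sketch of these preliminary checks, assuming a hypothetical numeric column named age:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# Descriptive statistics: detect impossible values and gauge the distribution.
print(df["age"].describe())

# About 5 categories with percentile (quintile) boundaries.
df["age_group"] = pd.qcut(df["age"], q=5)
print(df["age_group"].value_counts().sort_index())  # size of each cell
```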
How to deal with missing values
· Analysis may be confined to complete records only, with records that have missing data being deleted. This approach is easy to implement and is valid if the incomplete records occur at random; if the missingness is systematic, this procedure will introduce bias.
· Alternatively, the missing data can be predicted (imputed) using regression analysis or maximum likelihood methods, as sketched below.
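A minimal sketch of both approaches in pandas, assuming a hypothetical file where weight_kg is sometimes missing and can be predicted from height_cm by simple linear regression:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# 1. Complete-case analysis: drop records with any missing value.
complete = df.dropna()

# 2. Regression imputation: predict missing weight from height.
known = df.dropna(subset=["weight_kg", "height_cm"])
slope, intercept = np.polyfit(known["height_cm"], known["weight_kg"], deg=1)
missing = df["weight_kg"].isna()
df.loc[missing, "weight_kg"] = intercept + slope * df.loc[missing, "height_cm"]
```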
Data transformation 1: creating new variables
· Transformation is the process of creating new derived variables preliminary to analysis. The transformations may be simple, using ordinary arithmetical operators, or more complex, using mathematical transformations.
· New variables may be generated using arithmetical operations: (a) carrying out mathematical operations on old variables, such as division or multiplication; (b) combining 2 or more variables to generate a new one by addition, subtraction, multiplication, or division (see the sketch below).
· New variables can also be generated by mathematical transformations of variables for the purposes of stabilizing variances, linearizing relations, normalizing distributions (making them conform to the Gaussian distribution), or presenting data in a more acceptable scale of measurement.
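A minimal sketch of deriving new variables; body mass index is used here as an illustrative combination of two variables (column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical file

# Simple arithmetic on one variable: convert height from cm to m.
df["height_m"] = df["height_cm"] / 100

# Combining two variables: BMI = weight (kg) / height (m) squared.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```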
Data transformation 2: mathematical transformations
· Four types of mathematical transformations are carried out on count or measurement data: logarithmic, trigonometric, power, and z-transformations.
· Logarithmic transformation: both the natural (Napierian, base e) and the common (base 10) logarithmic transformations can be used.
· Trigonometric transformation involves re-expressing data as its sine, cosine, or tangent.
· Power transformations can take any of three forms: the exponential transformation, the square root transformation, and the reciprocal transformation.
· Data can also be expressed as the z-score, which is the difference between the data value and the group mean divided by the group standard deviation.
· The probit and logit transformations are used for data expressed as proportions. Some of these transformations are sketched below.
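A minimal numpy sketch of several of these transformations, using hypothetical measurement and proportion vectors:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])   # hypothetical measurements
p = np.array([0.10, 0.35, 0.60, 0.85])      # hypothetical proportions

log_e = np.log(x)                   # natural (base e) logarithm
log_10 = np.log10(x)                # common (base 10) logarithm
sqrt_x = np.sqrt(x)                 # square root (power) transformation
recip = 1 / x                       # reciprocal transformation
z = (x - x.mean()) / x.std(ddof=1)  # z-score
logit = np.log(p / (1 - p))         # logit transformation for proportions
```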
Data transformation 3: variance-stabilizing transformations
· There are preferred transformations for the purpose of stabilizing the variance.
· The log transformation is preferred for measured data that follow the gamma distribution.
· The square root transformation is preferred for count data that follow the Poisson distribution.
· For proportions following the binomial distribution, the preferred transformation is the arcsine of the square root of the proportion, i.e. sin^-1(√x), as sketched below.
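A minimal numpy sketch of the three variance-stabilizing transformations, with hypothetical data:

```python
import numpy as np

measurements = np.array([1.2, 3.5, 8.9, 2.4])  # hypothetical gamma-like data
counts = np.array([0, 3, 7, 12])               # hypothetical Poisson counts
proportions = np.array([0.05, 0.40, 0.75])     # hypothetical binomial proportions

log_t = np.log(measurements)                 # for gamma-distributed measurements
sqrt_t = np.sqrt(counts)                     # for Poisson counts
arcsine_t = np.arcsin(np.sqrt(proportions))  # sin^-1(sqrt(p)) for proportions
```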
Data transformation 4: the ladder of transformations
· The ladder of transformations is used as a guide to which type of transformation to make. Both the x and y variables can be transformed.
· The range of the independent variable is divided into 3 roughly equal portions with roughly equal numbers of data points in each. A representative point is selected in each portion; it should lie roughly in the middle of the portion and need not be an actual data point.
· There is no need for a transformation if the line connecting the 1st and 2nd points has the same slope as the line connecting the 2nd and 3rd points.
· If the slopes are not equal, a transformation is needed.
· A line is drawn connecting the first and third representative points. If the middle point is above the line, the case is concave; if the middle point is below the line, the case is convex.
· In the convex case we go down the ladder of y transformations in the order y, y^(1/2), log(y), -1/y^(1/2), -1/y, -1/y^2, ... or up the ladder of x transformations in the order x, x^2, x^3, x^4, ...
· In the concave case we go up the ladder of y transformations in the order y, y^2, y^3, y^4, ... or down the ladder of x transformations in the order x, x^(1/2), log(x), -1/x^(1/2), -1/x, -1/x^2, ... (page 198, Ashish Sen and Muni Srivastava, Regression Analysis: Theory, Methods, and Applications, Springer). A sketch of the slope comparison follows.
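A minimal sketch of the slope comparison behind the ladder, using three hypothetical representative points:

```python
def ladder_hint(p1, p2, p3):
    """Compare the two half-slopes through three representative points.

    Returns 'none' if no transformation is needed, otherwise whether
    the pattern is concave (middle point above the chord) or convex.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    slope_12 = (y2 - y1) / (x2 - x1)
    slope_23 = (y3 - y2) / (x3 - x2)
    if abs(slope_12 - slope_23) < 1e-6 * max(abs(slope_12), abs(slope_23), 1):
        return "none"
    chord_y = y1 + (y3 - y1) * (x2 - x1) / (x3 - x1)  # chord height at x2
    return "concave" if y2 > chord_y else "convex"

# Hypothetical representative points from a square-root-like (concave) pattern.
print(ladder_hint((1, 1.0), (5, 2.2), (9, 3.0)))  # concave -> y up or x down
```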
EXERCISE #1: class data questionnaire: collection and entry
1. Construct a questionnaire to collect the following information from members of the class. Please do not give your identity, and approximate if you do not have full information: age, gender, weight, height, diastolic blood pressure, total cholesterol, number of pets at home, left- or right-handedness, and wearing glasses or contact lenses.
2. Pass the questionnaires around the class and let each person enter the data in Excel on their laptop.
| ID | Age | Gender | Weight (kg) | Height (cm) | Diastolic BP (mmHg) | No. pets | Handedness | Glasses |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
|    |     |        |             |             |                     |          |            |         |
EXERCISE #2: data editing
Edit the data looking for the following:
· Missing data
· Coding and entry errors
· Inconsistencies
· Irregular patterns
· Digit preference
· Outliers
· Rounding-off / significant figures
· Questions with multiple valid responses
· Record duplication
EXERCISE #3: data transformation
Think of any new variables that can be derived from the primary variables.