130501P - DISCRETE DATA SUMMARY

Presentation at a Training Program on Biostatistics for physician managers working in Public Health Administration, Qassim Province, on May 1, 2013 by Professor Omar Hasan Kasule Sr., MB ChB (MUK), MPH (Harvard), DrPH (Harvard). EM: omarkasule@yahoo.com


Parameters and statistics
  • Data can be summarized using parameters or statistics.
  • Parameters are computed from population data. Statistics are computed from sample data.
  • Since most statistical work involves samples, sample statistics are the most popular.
  • Statistics are supposed to be good estimates of population parameters that are the real focus of interest.

Discrete data
  • Discrete data arise from counting the number of objects or phenomena categorized in various groups.
  • Discrete data can be from qualitative variables or from quantitative variables.
  • Qualitative variables are easier to summarize as discrete counts and frequencies.
  • Summarization of quantitative variables is more complex. In many cases quantitative variables are dichotomized to enable summarization as qualitative variables.
  • Numerical discrete variables are used directly in the analysis.
  • Numerical continuous variables have to be transformed into numerical discrete or qualitative variables before analysis.

Statistics for discrete data
  • The main types of statistics used are measures of location, such as rates, hazards, ratios, and proportions, and measures of spread.
  • Measures of location indicate accuracy or validity.
  • Measures of spread or variation indicate precision.    

Rates: definition
  • A rate is a measure of the number of events in a given population over a defined time period.
  • A rate has 3 components: a numerator, a denominator, and time. The numerator of a rate is included in its denominator.
  • Incidence of disease is a type of rate. It describes a moving and dynamic picture of disease, e.g. the infant mortality rate (IMR), the crude birth rate (CBR), and the incidence rate (IR).
  • The general formula of a rate is the total number of new cases of a disease or characteristic in a given time period divided by the total number of persons at risk (those with disease + those without disease), per unit of time.
  • In symbols this is written as a / [(a+b) × t], where a = number of new cases, b = number without disease, and t = time of observation.
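
The formula a / [(a+b) × t] can be sketched in code as follows. This is only an illustration: the function name, the `per` scaling factor, and the numbers in the usage line are assumptions, not part of the presentation.

```python
def rate(new_cases, without_disease, time, per=1000):
    """Rate = a / ((a + b) * t), scaled to `per` persons per unit of time.

    new_cases       -- a, number of new cases
    without_disease -- b, number without disease
    time            -- t, time of observation
    per             -- hypothetical scaling factor (e.g. per 1,000 persons)
    """
    at_risk = new_cases + without_disease  # a + b: the numerator is part of the denominator
    if at_risk <= 0 or time <= 0:
        raise ValueError("population at risk and observation time must be positive")
    return per * new_cases / (at_risk * time)


# Hypothetical numbers: 30 new cases among 2,000 persons at risk observed for 2 years.
print(rate(30, 1970, 2))  # 7.5 per 1,000 persons per year
```

Dropping the division by `time` would turn the result into a proportion, which is why time is treated as the third component of a rate.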

Crude rates
  • A crude rate is computed for the general population assuming conditions of homogeneity.
  • A crude rate ignores differences in events among subgroups of the population.
  • Crude rates are unweighted and can be misleading.
  • Crude rates of 2 populations cannot be validly compared. No valid inference based on crude rates is possible because of confounding by differences in subgroup composition.
  • Simpson's paradox, due to confounding, arises when the conclusion based on crude rates contradicts that based on the rates of specific subgroups in the population.
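
A small numerical sketch may make the paradox concrete. All counts below are hypothetical and chosen only so that population A has the lower death rate in every age group yet the higher crude rate, because A is much older than B.

```python
# (deaths, persons) per age stratum; every number here is hypothetical.
pop_a = {"young": (10, 1000), "old": (450, 9000)}
pop_b = {"young": (180, 9000), "old": (60, 1000)}


def crude_rate(pop):
    """Total deaths divided by total persons, ignoring the strata."""
    deaths = sum(d for d, _ in pop.values())
    persons = sum(n for _, n in pop.values())
    return deaths / persons


def specific_rates(pop):
    """Death rate within each stratum."""
    return {stratum: d / n for stratum, (d, n) in pop.items()}


print("crude A:", crude_rate(pop_a), "crude B:", crude_rate(pop_b))
print("specific A:", specific_rates(pop_a))
print("specific B:", specific_rates(pop_b))
```

The crude comparison suggests that A fares worse, while each age-specific comparison suggests that A fares better. The contradiction comes entirely from the different age structures.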

Example of crude rates
  • A village of 100 inhabitants may have a death rate of 60 per 100 per year, which may appear alarming until we get more information about the age distribution. If more than 90% of the population are senior citizens above 85 years of age, the rate may be understandable.
  • A death rate of 60 per 100 would be completely incomprehensible for another village of the same size but with a younger population distribution.

Specific rates
  • Specific rates are rates computed for specific groups in the population. They are more informative than crude rates because they relate to subgroups that are likely to enjoy internal homogeneity.
  • The following types of specific rates are commonly used: age-specific, sex-specific, place-specific, race-specific, and cause-specific rates.
  • A disadvantage of specific rates is that there will be so many of them depending on the number of subgroups. It is difficult to internalize, digest, and understand so many rates or be able to reach some conclusions.
  • The human mind always looks for summary indices and always tries to summarize detailed information into 1 or 2 indicators.

Adjusted/standardized rates 1
  • Adjusted or standardized rates are summary population rates used instead of crude rates. They avoid the disadvantages of both specific and crude rates. Each is a single accurate and representative summary statistic.
  • The process of adjustment or standardization is used to turn specific rates into standardized rates. Standardization seeks to summarize the rate after removing the ‘confusing’ or ‘confounding’ effects of subgroups.
  • We thus talk about standardizing for age, gender, or race to remove the direct impact of these 3 factors, which are not equally distributed in the parent population.

Adjusted/standardized rates 2
  • Adjusted rates have the advantage that they provide a means of comparison across populations with different subgroup distributions.
  • For example, standardized rates of 2 populations with different proportions of the elderly can be compared directly because the impact of age was removed during the process of computing the standardized rate. Such a comparison is not possible if crude rates are used.

Standardization 1
  • Standardization is a statistical technique that involves adjustment of a rate or a proportion for 1 or 2 confounding factors.
  • The main objective of standardization is to enable comparison of rates in different populations. There are 4 approaches to standardization: direct standardization, indirect standardization, computation of life expectancies, and regression techniques.
  • Standardization provides a single summary index that is easier to compare across populations than several specific rates that are cognitively difficult to process.
  • Use of specific rates in comparison may not be valid when some strata have too few subjects to be reliable. In many cases specific rates may not be available especially for occupational studies.

Standardization 2
  • Both direct and indirect standardization involve the same principles but use different weights.
  • Direct standardization is used when age-specific rates are available. The population rates are applied to the age distribution of a standard population to compute the standardized rate.
  • Indirect standardization is used when age-specific rates are not available. The rates of the standard population are applied to the age distribution of the study sample to compute the observed/expected ratio.
  • Both direct and indirect standardization use a ‘standard population’. There are 4 possible sources of the standard: a combination of the two or more populations being compared, use of just one of the comparison populations as a standard for the others, using the national population, and using the world population.
  • Life expectancy is a form of age-standardized mortality rate. Regression techniques provide a means of simultaneous adjustment for the impact of various factors on the rate.
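
Direct and indirect standardization as described above can be sketched as follows. The function names and all counts are hypothetical, and the standard population is taken to be the combination of the two populations being compared, which is one of the 4 sources listed above.

```python
# (deaths, persons) per age stratum; every number here is hypothetical.
pop_a = {"young": (10, 1000), "old": (450, 9000)}
pop_b = {"young": (180, 9000), "old": (60, 1000)}

# Standard population: the two comparison populations combined.
standard = {s: (pop_a[s][0] + pop_b[s][0], pop_a[s][1] + pop_b[s][1]) for s in pop_a}


def direct_standardized_rate(study, standard):
    """Apply the study population's specific rates to the standard age distribution."""
    expected = sum((d / n) * standard[s][1] for s, (d, n) in study.items())
    return expected / sum(n for _, n in standard.values())


def observed_expected_ratio(study, standard):
    """Apply the standard population's rates to the study age distribution."""
    observed = sum(d for d, _ in study.values())
    expected = sum((standard[s][0] / standard[s][1]) * n for s, (_, n) in study.items())
    return observed / expected


for name, pop in (("A", pop_a), ("B", pop_b)):
    print(name,
          "direct:", round(direct_standardized_rate(pop, standard), 4),
          "observed/expected:", round(observed_expected_ratio(pop, standard), 3))
```

With these numbers the crude rate of A (460/10,000) exceeds that of B (240/10,000), but both standardized summaries rank A below B, in agreement with the age-specific rates.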

Definition of hazard
  • A hazard is defined as the number of events at time t among those who survive until time t.
  • For example, suppose 100 children are born and the following numbers die at various ages: 5 at age 1, 15 at age 2, 30 at age 3, and 10 at age 4.
  • The hazard of death at each age is the number of deaths divided by the number still alive at that age: 5/100 at age 1, 15/95 at age 2, 30/80 at age 3, and 10/50 at age 4.
  • Hazard can also be expressed as a relative hazard with respect to a specific risk factor: the ratio of the hazard in the presence of the risk factor to the hazard in its absence.
  • At a specific point in time, relative hazard expresses the hazard among the exposed compared to the hazard among the non-exposed.
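
The hazard calculation can be sketched in code. The death counts used here (5, 15, 30, and 10 at ages 1 to 4) are the ones consistent with the hazards 5/100, 15/95, 30/80, and 10/50 given above. The function names and the relative hazard figures are hypothetical.

```python
def hazards_by_age(cohort_size, deaths_by_age):
    """Return {age: (deaths, at_risk)}, where at_risk is the number who survived to that age."""
    at_risk = cohort_size
    table = {}
    for age in sorted(deaths_by_age):
        deaths = deaths_by_age[age]
        table[age] = (deaths, at_risk)
        at_risk -= deaths  # those who died leave the risk set
    return table


def relative_hazard(hazard_exposed, hazard_unexposed):
    """Hazard with the risk factor divided by hazard without it."""
    return hazard_exposed / hazard_unexposed


table = hazards_by_age(100, {1: 5, 2: 15, 3: 30, 4: 10})
for age, (deaths, at_risk) in table.items():
    print(f"age {age}: {deaths}/{at_risk} = {deaths / at_risk:.3f}")

# Hypothetical hazards of 0.10 among the exposed and 0.05 among the non-exposed.
print("relative hazard:", relative_hazard(0.10, 0.05))
```

The denominator shrinks at each age because only survivors remain at risk. This is what distinguishes a hazard from a proportion of the original cohort.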

Definition of ratio
  • The general formula for a ratio is the number of cases of a disease divided by the number without the disease, a/b.
  • Examples of ratios are: the proportional mortality ratio, the maternal mortality ratio, and the fetal death ratio.
  • The proportional mortality ratio is the number of deaths in a year due to a specific disease divided by the total number of deaths in that year. This ratio is useful in occupational studies because it provides information on the relative importance of a specific cause of death.
  • The maternal mortality ratio is the total number of maternal deaths divided by the total live births.
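
As a brief illustration of the a/b form, the sketch below computes a maternal mortality ratio. The counts are hypothetical, and the scaling to 100,000 live births is a common reporting convention assumed here rather than something stated in the presentation.

```python
def ratio(a, b, per=1):
    """Ratio a/b, optionally scaled by `per`; the numerator is not part of the denominator."""
    if b <= 0:
        raise ValueError("denominator must be positive")
    return per * a / b


# Hypothetical numbers: 12 maternal deaths and 8,000 live births in one year.
print(ratio(12, 8000, per=100_000))  # 150.0 maternal deaths per 100,000 live births
```

The denominator counts live births, not mothers, so the deaths in the numerator are not a subset of the denominator.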

Definition of a proportion
  • Proportions are used for enumeration. A proportion is the number of events expressed as a fraction of the total population at risk. It has only 2 components: the numerator and the denominator. The numerator is included in the denominator. The time period is not defined explicitly but is implicitly assumed. The general formula for a proportion is a/(a+b).
  • Examples of proportions are: the prevalence proportion, the neonatal mortality proportion, and the perinatal mortality proportion. Prevalence describes a still/stationary picture of disease. Like rates, proportions can be crude, specific, or standardized.
  • The term prevalence rate has become very common. However, prevalence is not a rate because the time dimension is not involved. Prevalence is a proportion. The term maternal mortality rate is also used extensively. It is actually a ratio and is neither a rate nor a proportion.
  • A variance and standard deviation can be computed for a proportion to enable computation of a 95% confidence interval.
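  • The standard deviation of a sample proportion (its standard error) is given by the expression $se(p) = \sqrt{p(1-p)/n}$ if $n/N < 0.05$, $N = \infty$, or $N$ is large in relation to $n$, where $n$ is the sample size and $N$ is the population size.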
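
A minimal sketch of how the standard error, the square root of p(1-p)/n, leads to a 95% confidence interval is given below. The interval uses the normal approximation, p ± 1.96 × se(p). That choice, the function names, and the sample numbers are assumptions made for illustration, and the approximation is less reliable when n is small or p is close to 0 or 1.

```python
import math


def proportion_se(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def proportion_ci(events, n, z=1.96):
    """Normal-approximation interval p +/- z * se(p), clipped to [0, 1]."""
    p = events / n
    half_width = z * proportion_se(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical sample: 40 of 200 persons have the condition (p = 0.20).
low, high = proportion_ci(40, 200)
print(f"p = {40 / 200:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```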