Integrated Medical Education Resources: 1009P- LONG ESSAY QUESTIONS

Q1. Discuss 3 ethical issues in epidemiological research

A1. There are basically 7 issues. (a) ETHICAL APPROVAL: A study involving humans must get approval from a recognized body. For approval, the study must fulfill certain criteria. It must be scientifically valid; it is unethical to waste resources (time and money) on a study that will give invalid conclusions. In 1992 the Council for International Organizations of Medical Sciences published ‘Guidelines for Ethical Review of Epidemiological Studies’. (b) INDIVIDUAL vs. COMMUNITY RIGHTS: There is sometimes a conflict between the requirement to protect the rights of the individual and protection of the community. Restrictions may have to be placed on an individual in the public interest. (c) BENEFITS vs. RISKS: Public health interventions carry risks and costs that must be balanced against the benefits. (d) INFORMED CONSENT: Study subjects must be free to participate in the study, abstain from participation, or elect to withdraw from the study at any stage. (e) PRIVACY AND CONFIDENTIALITY: Data collected in an epidemiological study should not be released to any third party without the consent of the subject. However, such data can be subpoenaed by a court of law when public interest takes precedence over individual rights. Data is reported in the aggregate without any personal identifiers. Access to data is limited. There are issues that must be resolved, such as who owns the data. An epidemiologic study may uncover previously unrecognized disease. Pre-symptomatic disorders that do not require immediate medical attention cause no ethical problems. Disorders that require intervention create an ethical problem because the epidemiologist is required to breach confidentiality in the process of making sure that the patient gets the necessary care and that innocent persons will not be exposed to infectious disease. (f) CONFLICT OF INTEREST: Epidemiologists employed in academia can work relatively independently. Those working in government and industry are controlled by vested interests.
(g) STUDY INTERPRETATION and COMMUNICATION: Risk reports that are not yet confirmed can be picked up by the media. It is difficult to keep epidemiological findings secret. The media have a tendency to sensationalize issues, which complicates later informed debate. They may not understand differences among published epidemiological findings and may exaggerate controversies. These controversies are best evaluated by a careful study of the underlying evidence. MacMahon et al. 1981 found that coffee causes pancreatic cancer whereas Feinstein et al. 1981 found that coffee did not cause cancer. Barefoot et al. 1983 found that type A personality was associated with heart disease but Shekelle et al. 1987 found that it was not. Vegetable-derived margarine had been thought to be good for the heart but Willett and Ascherio 1994 found that it was bad for the heart. Falck et al. 1992 found that pesticides caused breast cancer whereas Krieger et al. 1994 found that they did not. Steinberg et al. 1991 found that estrogen replacement therapy causes breast cancer whereas Kaufmann et al. 1984 found that it did not. Beta carotene, thought to prevent cancer, was found by Omenn et al. 1996 to cause cancer. Miller et al. 1989 found oral contraceptives to cause cancer but the Cancer and Steroid Hormone Study Group of 1986 found that they did not (page 330, Ross C Brownson and Diana B Petitti: Applied Epidemiology: Theory to Practice. OUP New York and Oxford 1998). Study findings affect policy. Epidemiologists must know how to communicate risk to the public. It is an ethical obligation to report research findings to subjects so that they may take measures to lessen risk. Epidemiological evidence is different from legal evidence but the two sometimes meet in a court of law. Epidemiological evidence may not be accepted in a court of law because it has few certainties; it is all probabilistic. Epidemiological evidence is concerned with populations whereas legal evidence pertains to individuals.

Q2. Illustrate with examples 2 ways of using clinical epidemiological methods in diagnosis of disease and abnormality

A2. Disease is an anatomical, biochemical, physiological, or psychological derangement. Clinical diagnosis is an effort to recognize the class or group to which a person's illness belongs. Epidemiology, as the study of the distribution and determinants of disease, provides background information needed in clinical diagnosis. Statistical abnormality, used to define disease, is defined as deviation beyond 2 standard deviations from the mean. The most often used strategy in clinical diagnosis is the hypothetico-deductive one, in which a hypothesis is formed from early clues and then history, clinical examination, and diagnostic tests are undertaken to confirm or reject the hypothesis. Epidemiological knowledge provides prior probabilities for clinical decision making. The clinician combines his empirical findings with the prior probabilities to reach a diagnosis. This may be informal or formal using Bayesian techniques. In formal clinical decision making, the problem is defined. Alternative actions and possible outcomes are determined. Probabilities are entered on the decision tree and the value of each outcome is computed. Diagnostic tests are used for assessing severity, predicting prognosis, estimating likely response to treatment, and determining the actual response to treatment. The precision of each test, its sensitivity, and its specificity must be taken into consideration in interpreting its findings. Diagnostic procedures can be evaluated by computing their predictive value. Epidemiological parameters are used to choose a diagnostic procedure. Diagnostic tests are also useful in predicting illness outcome. Randomized controlled trials, follow-up studies, and case-control study designs can be used to assess the role of diagnostic tests in predicting outcome. A hospital stores a lot of clinical data about patients. This data may be shared with other hospitals using local area networks (LAN). Databases have been developed with AI capabilities and they can provide much support to the physician who is trying to diagnose a disease. This is done by comparison of the patient's data with several profiles stored in the database.
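The way prior probability (prevalence) combines with sensitivity and specificity to give predictive values can be sketched in a few lines of Python. This is only an illustration: the function name `predictive_values` and the test characteristics (90% sensitivity, 95% specificity) are assumptions chosen for the example, not figures from any real test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test, given the prior probability of disease."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # Pr(disease | positive test)
    npv = true_neg / (true_neg + false_neg)   # Pr(no disease | negative test)
    return ppv, npv


# Hypothetical test applied in a low-prevalence and a high-prevalence setting.
for prevalence in (0.01, 0.20):
    ppv, npv = predictive_values(0.90, 0.95, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With these assumed figures the same test has a positive predictive value of about 15% when prevalence is 1% and about 82% when prevalence is 20%, which is why the epidemiological setting matters when interpreting a result.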

Q3. Describe the main elements of any 2 public health programs and the potential impact of the information and communication technology on their implementation

(a) DISEASE PREVENTION and HEALTH PROMOTION: Control of infectious disease: immunization, screening programs, contact tracing, disease surveillance, epidemic investigation and control. Health education: health information dissemination through the media, schools, and community groups. Maternal and child health: genetic counseling, genetic screening, family planning, prenatal care, well child care, school health services, disabled children services, and medical social work. Chronic disease control: screening programs, education. Occupational health programs: health education, immunizations, screening. Dental public health: screening and referral, water fluoridation, education on nutrition and dental hygiene. Public health nursing. Nutrition: nutritional education, nutritional supplementation for women and children. Consumer protection. Public health laboratory: support for environmental, occupational, and infectious disease control programs. Mental health: education, community support for discharged patients, alcohol and addiction services. School health: screening. (b) MEDICAL and SOCIAL SERVICES: Medical and nursing services for early diagnosis and treatment of disease. Care for the elderly. Social welfare services needed for health. (c) ENVIRONMENTAL PROTECTION: Environmental sanitation. Air pollution control. Water and sewage control. Occupational health. Food hygiene and inspection: inspection of restaurants. Animal health. Housing inspection. Insect and rodent control.

Q4. Discuss the definition, objectives, methods of disease surveillance using an example from Malaysia

A4. Definition: In 1968 the World Health Organization defined surveillance as the systematic collection and use of epidemiological information for planning, implementing, and assessing disease control. In active surveillance, mechanisms are set up to actively look for and identify disease conditions. Passive surveillance does not set up any special monitoring mechanisms but relies on the existing systems to report disease occurrence.

Objectives: (a) identification of changes in disease incidence (b) epidemiological description of disease: incidence, causes, and associated factors (c) evaluation of disease control and prevention programs (d) assessment of the burden of disease to help health care delivery (e) planning of public health programs by future projection of disease burden (f) using surveillance data for formulating public health policy (g) prediction of the occurrence of epidemics (h) provision of information for researchers.

Scope: Surveillance covers infectious disease, occupational health, environmental health, injuries, maternal and child health, and non-communicable disease. In addition to health events, surveillance includes information on risk factors, use of preventive and curative services.

Components of a surveillance system: The essentials of a surveillance system are: data collection, data analysis, data interpretation, and feedback or data dissemination. Before the start of surveillance, the case definition and the target population must be defined. Appropriate personnel must be selected and trained. The logistics of data collection, analysis, and dissemination must be set up. Approvals from the relevant authorities must be obtained. A good surveillance system must be ongoing, practicable, uniform, frequent and rapid, sensitive, timely, representative, of high predictive value, accurate, complete, simple, flexible, and acceptable. The surveillance system is evaluated using the following criteria: (a) attainment of the objectives of surveillance: decrease of incidence and prevalence and decrease of mortality and case fatality (b) operational characteristics (sensitivity, timeliness, representativeness, predictive value), acceptability, flexibility, simplicity, and cost.

Sources of data:

The sources of surveillance data are: (a) mandatory morbidity and mortality notification systems (b) health information systems: vital records (births and deaths), coroner reports, medical care records (integrated health information system and hospital discharge summary), insurance records, worker compensation records, records of work absence due to illness, and school records (c) disease registries, e.g. cancer registries (hospital-based, population-based, and exposure registers) (d) public health laboratory reports (e) reports of disease outbreaks, epidemics, and individual case studies (f) vaccine utilization data (g) records of hazard exposure surveillance (h) special surveys such as health interview surveys and other types of surveys (i) sentinel surveillance that focuses on key health indicators for early warning, using sentinel events such as infant mortality and sentinel sites such as hospitals and clinics (j) studies of animal reservoirs and vector distribution (k) study of biological markers (l) study of drug resistance (m) demographic and environmental data (n) media reports.

Data analysis

Data analysis is usually descriptive and consists mainly of comparison with the baseline. Surveillance data is interpreted with the following in mind: identification of epidemics, identification of new syndromes, monitoring trends, evaluation of public policy, and projecting future needs. Observation of departures from the usual disease distribution does not necessarily mean that there is a problem. Completeness of coverage of the surveillance system is an issue that may arise. Capture-recapture methods may be used to ascertain whether the surveillance system has adequate coverage. Surveillance data has limitations such as under-reporting, unrepresentative case series, and inconsistent case definitions. Surveillance information must be disseminated through publications and the mass media. The message to be communicated must be packaged correctly and appropriately. Data dissemination need not wait until the surveillance system is fully operational; provisional surveillance data is still useful in public health.
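A two-source capture-recapture check of completeness can be sketched as follows. The sketch uses the Lincoln-Petersen estimator and Chapman's small-sample correction, and assumes that the two sources are independent, that cases can be matched accurately between them, and that the population is closed. The case counts (120 notified, 90 in hospital records, 60 in both) are hypothetical.

```python
def lincoln_petersen(n1, n2, matched):
    """Estimate the total number of cases from two overlapping sources."""
    if matched == 0:
        raise ValueError("no overlap between sources; estimate undefined")
    return n1 * n2 / matched


def chapman(n1, n2, matched):
    """Chapman's variant, less biased when the overlap is small."""
    return (n1 + 1) * (n2 + 1) / (matched + 1) - 1


# Hypothetical counts for one disease in one year.
notified = 120      # cases in the notification system
hospital = 90       # cases in hospital discharge records
both = 60           # cases found in both sources

estimated_total = lincoln_petersen(notified, hospital, both)
observed_total = notified + hospital - both   # distinct cases seen by either source

print(f"Lincoln-Petersen estimate: {estimated_total:.0f}")
print(f"Chapman estimate: {chapman(notified, hospital, both):.1f}")
print(f"Completeness of notification: {notified / estimated_total:.0%}")
print(f"Completeness of both sources combined: {observed_total / estimated_total:.0%}")
```

Under these assumptions the notification system alone would be capturing about two thirds of the estimated 180 cases. If the sources are positively dependent (for example, hospitalized cases are more likely to be notified), the true total is underestimated and completeness is overstated.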

Q5. Explain what you understand by the term emerging or re-emerging infectious disease. Select an emerging or re-emerging infectious disease and discuss the epidemiological, biological, and ecological factors behind its increasing incidence. Discuss public health measures that can be adopted to prevent its further spread in Malaysia

Factors: The re-emergence of infectious diseases in the developed countries, after their incidence fell over most of the 20th century, is due to socio-demographic, lifestyle or human behavior, environmental, and medical technological factors. The socio-demographic factors contributing to emerging infectious diseases are demographic changes (aging, migration), wide-scale commercial and tourist travel, increasing crowding especially in large urban areas, and behavioral and lifestyle changes. Some diseases are old diseases and are old problems. Some diseases are old but are new problems, for example tuberculosis and malaria. Some are new diseases with new pathogens such as Ebola and HIV. The environmental factors are climatic changes (global warming, rising sea levels, heat waves, and ozone depletion) that disturb the ecosystem and thus favor growth and transmission of old and new pathogens. Immune suppression in organ transplantation and nosocomial infections are side effects of medical technology.

Sexually transmitted diseases: Breakdown of traditional society and emergence of liberal ideas about sexual relations is behind the increase of STDs. The traditional STDs are syphilis, gonorrhoea, and chancroid. New diseases are chlamydia, genital warts, trichomoniasis, scabies, pediculosis, genital herpes, vaginal candidiasis, E. histolytica infection, G. lamblia infection, HAV, HBV, HCV, and HIV. Data on STDs is inadequate because of incomplete notification, non-uniform diagnostic criteria, and asymptomatic cases. Health education does not seem to be very effective in the prevention of STDs. Use of condoms by commercial sex workers is effective in decreasing STD incidence. Acquired immunodeficiency syndrome (AIDS) appeared in 1981. HIV-1, isolated in 1983, belongs to the retrovirus family. It attaches to the CD4+ lymphocytes, which are depleted as the infection progresses. HIV is transmitted by semen, blood, and vaginal and cervical secretions. Direct transmission occurs through sharing of contaminated syringe needles by intravenous drug users, in the perinatal period, through breast milk, through transfusion of blood and blood products, through insemination with donated semen, and through transplantation of organs and tissues. Primary prevention consists of safe sex (monogamy or condom use), voluntary testing and contact tracing, safe blood supplies with use of antigen screening during the window between infection and appearance of antibodies, safe organ donation programs, precautions in medical facilities, and development of a vaccine.

Viral diseases: Marburg virus disease was first recognized in Yugoslavia and Germany when people fell ill after contact with monkeys imported from Uganda. The Ebola/Marburg virus epidemic started in 1976 and has been recurring, being imported into Europe and the US by importation of monkeys from Africa. The swine influenza A epidemic was recognized in 1976. Lassa fever spread is favored by urbanization leading to rodent exposure in homes. Travel, migration, and urbanization contribute to the spread of dengue fever and dengue hemorrhagic fever. The hantavirus pulmonary syndrome, due to a hantavirus associated with contaminated droppings of deer mice, appeared in 1993. Hantaviruses are spreading because of ecological and environmental changes that increase contact with rodents. Hepatitis B and C are spreading due to transfusion, organ transplantation, intravenous drug abuse, and sexual transmission. Rift Valley fever transmission is favored by dam building, agriculture, and irrigation. Yellow fever is being transmitted in new areas because of conditions that favor mosquitoes.

Bacterial diseases: Group A Streptococcus is an invasive necrotizing ‘flesh-eating’ bacterium whose increased transmission is not understood. The toxic shock syndrome due to infection of ultra-absorbent tampons by Staphylococcus aureus appeared in 1980. Infections by enteropathogens such as Shigella are increasing. Cholera transmission is due to poor sanitation and introduction of new strains (such as O139) due to travel. The hemolytic uremic syndrome is due to mass food processing technology that allows Escherichia coli O157:H7 to contaminate meat. Brazilian purpuric fever is due to a new strain of Haemophilus influenzae. Helicobacter pylori infection is probably not a new disease but has only recently been recognized as associated with gastric ulcers and other gastro-intestinal disorders. The decline of TB incidence in Europe and America registered in the 19th and 20th centuries due to socio-economic improvement started being reversed in the 1980s and 1990s due to bad social conditions (poverty, homelessness, and unemployment), infected immigrants, HIV infection, and the rise of drug-resistant TB. Control of TB is achieved by contact tracing, chemoprophylaxis, and adherence to treatment schedules. Directly observed therapy (DOT) helps in ensuring treatment compliance. Shorter drug regimens also help to reduce the problem of non-compliance. Prevention of TB is achieved by overall improvement in nutrition, social and environmental conditions, and alleviation of poverty. Primary prevention is based on BCG vaccination and chemoprophylaxis with INH, which prevents reactivation of latent TB. Secondary prevention is treatment of multi-drug resistant conditions.

Parasitic and other diseases: Malaria is spreading due to increasing travel and migration. Schistosomiasis is spreading due to dam building. Lyme disease, due to a spirochete called Borrelia burgdorferi, appeared in 1975. Its transmission is aided by reforestation around homes that favors the tick vector and the deer, a secondary reservoir host. Legionnaire’s disease, due to a small infectious agent spread via air-conditioning systems, appeared in 1976. Biofilms that form on water tanks and plumbing favor growth of the causative organisms. P. carinii and Cryptococcus spp cause opportunistic infections. Infections with Cryptosporidium spp, Cyclospora spp, and other water-borne pathogens are due to contaminated surface water and improper water purification.

Q6. Discuss the epidemiological association between alcohol and tobacco consumption and mortality from major non-communicable diseases in Malaysia. What preventive approaches can be taken?

A7. ALCOHOL: The psychological and behavioral disorders are acute alcohol intoxication, acute alcohol poisoning, hangover, blackouts, and alcohol dependency. The acute alcohol withdrawal syndrome manifests as delirium tremens, acute auditory hallucinosis, depression, attempted suicide, and suicide. The neurological disorders are epilepsy, peripheral neuropathy, cerebral atrophy, cerebellar atrophy, the Wernicke-Korsakoff syndrome, post-traumatic neurological disease, and cerebrovascular disease. The gastrointestinal disorders are oropharyngeal carcinoma, Mallory-Weiss syndrome, esophageal varices, esophageal carcinoma, gastric and duodenal ulceration, atrophic gastritis, gastric carcinoma, disturbed bowel motility, intestinal malabsorption, colon carcinoma, pancreatitis, pancreatic carcinoma, alcoholic hepatitis, liver cirrhosis, and hepatocellular carcinoma. The cardiovascular disorders are cardiac arrhythmias, alcoholic cardiomyopathy, cardiac beriberi, hypertension, and ischemic heart disease. The respiratory disorders are obstructive sleep apnoea, chronic obstructive lung disease, pneumonia, lung abscess, pulmonary tuberculosis, laryngeal carcinoma, and carcinoma of the lung. Reproductive and pregnancy-related disorders are depressed testicular function, depressed ovarian function, carcinoma of the breast, spontaneous abortion, perinatal mortality, low birth weight, impaired development (physical, mental, and behavioral), congenital birth defects, fetal alcohol syndrome, pseudo-Cushing syndrome in breast-fed infants, and the alcohol withdrawal syndrome in the newborn. Alcohol is also associated with metabolic, endocrine, musculoskeletal, and hematological disorders, traumatic injuries, adverse drug interactions, and nutritional deficiencies. The metabolic and endocrine disorders are hypoglycemia, hyperglycemia, diabetes, gout, lactic acidosis, and deranged mineral metabolism.

TOBACCO: Smoking is a risk factor for coronary heart disease/ischemic heart disease and chronic obstructive lung disease. In the 1960s a dose-response relation between smoking and cardiovascular disease was demonstrated. In the 1980s smoking, oral contraceptive use, and cardiovascular disease were found to be linked in women. In the 1990s passive smoking was linked to cardiovascular disease. In the 1960s smoking was associated with emphysema and respiratory disease. In the 1970s passive exposure to maternal smoking was linked to childhood asthma. Cigarette smoking is a risk factor for cancer of the lung, cancer of the larynx, cancer of the oral cavity, and cancer of the bladder. Smoking was related to lung cancer in the 1950s. In the 1970s passive smoking was linked to lung cancer. In the 1990s tobacco was classified as a carcinogen. In the 1970s maternal smoking was associated with low birth weight and other adverse pregnancy outcomes (premature rupture of membranes, abruptio placentae). Cigarette smoking is associated with unintentional injury by fire. Smoking behavior can be modified by: knowledge of the health risks, attitude to smoking, cigarette advertising, cost of cigarettes, peer influence, and legislation.

PREVENTION: open

Q8. For any of the 3 main sources of bias in epidemiological research (misclassification, selection, and confounding) discuss with examples the definition, causes, impact on the effect estimate, prevention, and cure if possible

MISCLASSIFICATION BIAS

Misclassification is inaccurate assignment of exposure or disease status. Random or non-differential misclassification generally biases the effect measure towards the null, so the effect is underestimated. Non-random or differential misclassification is a systematic error that can bias the effect measure either away from or towards the null, exaggerating or underestimating it. Positive associations may become negative and negative associations may become positive. Misclassification bias is classified as information bias, detection bias, and protopathic bias. Information bias is systematic incorrect measurement of response due to questionnaire defects, observer errors, respondent errors, instrument errors, diagnostic errors, and exposure mis-specification. Detection bias arises when disease or exposure is sought more vigorously in one comparison group than in the other. Protopathic bias arises when early signs of disease cause a change in behaviour with regard to the risk factor. Misclassification bias can be prevented by using double-blind techniques to decrease observer and respondent bias. Treatment of misclassification bias is by the probabilistic approach or measurement of inter-rater variation.
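The pull towards the null under non-differential misclassification can be illustrated with expected counts in a case-control table. The sketch below is an illustration only: the true counts, and the sensitivity (0.8) and specificity (0.9) of exposure measurement, are hypothetical, and the same error rates are applied to cases and controls, which is what makes the misclassification non-differential.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Odds ratio for a case-control 2x2 table."""
    return (exp_cases * unexp_controls) / (unexp_cases * exp_controls)


def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed counts after imperfect exposure measurement."""
    obs_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    obs_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return obs_exposed, obs_unexposed


# Hypothetical true exposure status.
cases = (200, 100)       # (exposed, unexposed)
controls = (100, 200)

true_or = odds_ratio(cases[0], cases[1], controls[0], controls[1])

# Same measurement error in cases and controls (non-differential).
obs_cases = misclassify(*cases, sensitivity=0.8, specificity=0.9)
obs_controls = misclassify(*controls, sensitivity=0.8, specificity=0.9)
observed_or = odds_ratio(obs_cases[0], obs_cases[1], obs_controls[0], obs_controls[1])

print(f"true OR = {true_or:.2f}, observed OR = {observed_or:.2f}")
```

With these assumed figures the true odds ratio of 4.0 is observed as roughly 2.6: the association is still positive but is attenuated towards 1. Giving cases and controls different sensitivity or specificity values in the sketch shows how differential misclassification can move the estimate in either direction.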

SELECTION BIAS

Selection bias arises when subjects included in the study differ in a systematic way from those not included. It is due to biological factors, disease ascertainment procedures, or data collection procedures. Selection bias due to biological factors includes the Neyman fallacy and susceptibility bias. The Neyman fallacy arises when the risk factor is related to prognosis (survival), thus biasing prevalence studies. Susceptibility bias arises when susceptibility to disease is indirectly related to the risk factor. Selection bias due to disease ascertainment procedures includes publicity, exposure, diagnostic, detection, referral, self-selection, and Berkson biases. One form of self-selection bias is the healthy worker effect, which arises because sick people are not employed or are dismissed. The Berkson fallacy arises due to differential admission of some cases to hospital in proportions such that studies based on the hospital give a wrong picture of disease-exposure relations in the community. Selection bias during data collection is represented by non-response bias and follow-up bias. Prevention of selection bias is by avoiding the causes mentioned above. There is no treatment for selection bias once it has occurred, and there are no easy methods of adjusting for its effect.

CONFOUNDING BIAS

Confounding is mixing up of effects. Confounding bias arises when the disease-exposure relationship is disturbed by an extraneous factor called the confounding variable. The confounding variable is not actually involved in the exposure-disease relationship. It is, however, predictive of disease and is unequally distributed between exposure groups. Being related both to the disease and the risk factor, the confounding variable could lead to a spurious apparent relation between disease and exposure if it is a factor in the selection of subjects into the study. A confounder must fulfil the following criteria: relation to both disease and exposure, not being part of the causal pathway, being a true risk factor for the disease, being associated with the exposure in the source population, and not being affected by either disease or exposure. Prevention of confounding at the design stage by eliminating the effect of the confounding factor can be achieved using 4 strategies: pair-matching, stratification, randomisation, and restriction. Confounding can be treated at the analysis stage by various adjustment methods (both non-multivariate and multivariate). Non-multivariate treatment of confounding employs standardization and stratified Mantel-Haenszel analysis. Multivariate treatment of confounding employs multivariate adjustment procedures: multiple linear regression, linear discriminant function, and multiple logistic regression. Care must be taken to deal only with true confounders. Adjusting for non-confounders reduces the precision of the study.
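Stratified Mantel-Haenszel analysis can be sketched as follows. The data are hypothetical: an exposure that has no effect on disease within either stratum of a confounder, but is more common in the high-risk stratum. The function names and the layout of each table as (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases) are assumptions made for this illustration.

```python
def crude_odds_ratio(strata):
    """Odds ratio from the single table obtained by ignoring the strata."""
    a = sum(s[0] for s in strata)   # exposed cases
    b = sum(s[1] for s in strata)   # exposed non-cases
    c = sum(s[2] for s in strata)   # unexposed cases
    d = sum(s[3] for s in strata)   # unexposed non-cases
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Summary odds ratio adjusted for the stratifying variable."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical strata of the confounder (for example non-smokers and smokers).
strata = [
    (5, 95, 20, 380),    # low-risk stratum: 5% diseased in both exposure groups
    (80, 320, 20, 80),   # high-risk stratum: 20% diseased in both exposure groups
]

print(f"crude OR = {crude_odds_ratio(strata):.2f}")
print(f"Mantel-Haenszel OR = {mantel_haenszel_odds_ratio(strata):.2f}")
```

In this constructed example the crude odds ratio is about 2.4 although the odds ratio is 1.0 within each stratum. The Mantel-Haenszel summary recovers 1.0, showing that the crude association was produced entirely by the unequal distribution of the confounder between exposure groups.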

Q9. Discuss with examples the sources, effects, and prevention of environmental radiation exposure

Radiations may be ionizing photons (gamma rays and x-rays), ionizing particles (alpha and beta), non-ionizing radiation (UV light, visible light, and infrared light), low frequency electromagnetic fields from power lines, or ultrasound.

There are 2 main sources of radiation in the environment: natural background radiation (cosmic or solar radiation from space such as UV, geological or terrestrial radiation from rocks, inhaled radioactive material, and the radioactive gas radon) and man-made radioactive sources (nuclear bombs, emissions from nuclear power plants, medical exposure, residential exposure to TV and appliances, and occupational exposure). UV is associated with skin cancers (basal cell carcinoma, squamous cell carcinoma, and malignant melanoma). Radon can reach high concentrations in houses and causes lung cancer. Skin cancer is prevented by avoiding sun exposure. Exposure to radon is prevented by filling up cracks in homes and using concrete foundations. Electromagnetic fields are suspected to cause cancer, adverse reproductive outcomes, and behavioral or neural effects. Residential exposure (from TV, video, and appliances) may lead to childhood and adult malignancy (leukemia and brain cancer). Occupational exposure may also lead to leukemia and brain cancer.

Ionizing radiations cause DNA damage, DNA mutations, and chromosomal aberrations. Effects on germ cells, unlike those on somatic cells, can be transmitted to the next generation. The biological effects depend on the type and energy of radiation, time of exposure, accumulated dose, and the target tissue. The bone marrow, intestine, skin, and lungs are most affected. Health effects of radiation may be acute (sunburn, photosensitivity, and the acute radiation syndrome) or chronic (cancer, infertility, teratogenesis, and dermatological disorders). There is disagreement about the existence of a threshold.

Primary prevention is to prevent or limit exposure (minimal medical or occupational exposure, use of personal monitoring by dosimeters, prevention of nuclear war and nuclear accidents, safe disposal of nuclear waste). Secondary prevention is mostly supportive: treatment of infection, replacement of bone marrow, and emergency measures during nuclear accidents (staying indoors, iodine tablets, evacuation of residents, controlling exposed foodstuffs, and decontaminating the environment). Tertiary prevention is by long-term follow-up of those exposed because effects may appear late. Genetic counseling may be necessary.

SHORT ESSAY QUESTIONS

Q1. Illustrate the use of public health strategies (surveillance, intervention, evaluation, and economic development) on the control of an infectious disease of your choice

A1. (a) SURVEILLANCE: Public health surveillance is a continuous process of monitoring and analysis to be able to identify problems early. (b) INTERVENTION: Public health intervention is directed against disease and its determinants. (c) EVALUATION: The results of evaluation of public health programs are used to guide further action. (d) ECONOMIC STRATEGIES WITH IMPACT ON HEALTH: The role of socio-economic development and infrastructure.

Q2. Explain with examples the distinction between public health and community health

A2. The distinction between public and community health is not easy to define because the two terms are used interchangeably. The term public health was used earlier than the term community health and we can perhaps define the difference between the two based on history. The public health movement that became strong in the mid 19th century was aimed at getting governments or public authorities to take action to improve health. This meant that public health was a governmental function carried out by public authorities. By the 1960s new ideas of personal freedom, autonomy, and empowerment became prominent. Individuals and groups of individuals developed more awareness of their health and environmental problems and desired to be involved both in the identification and solution of the problems. The genesis of community medicine therefore was this non-governmental organization movement. The trend to more popular involvement in community health was aided by the fall, and therefore the discrediting, of authoritarian centrally-planned regimes in the former Soviet Union and its satellite states in Europe, Asia, and Africa. Growing privatization of the economic systems and withdrawal of the state from providing health and other social services have led to growth of community power and community health. People now organize themselves to solve their health problems.

Q3. Illustrate the limitations of biostatistics by using the following 2 contrasts: substantive vs statistical and analytic vs interpretative.

A3. (a) STATISTICAL VS SUBSTANTIVE: An investigator starts with a substantive question. This is formulated as a statistical question. Data is then collected and analyzed to answer the statistical question. The answer to the statistical question is the statistical conclusion. The investigator uses the statistical conclusion and other knowledge available to him to reach a substantive conclusion. Statistics therefore gives statistical and not substantive answers. A substantive question is the subject matter stated in ordinary language. Technical terminology may or may not be used. The less technical the formulation is, the better, so that statisticians who are not specialists in the subject matter can understand it. Care must be taken to make sure that accuracy and exactness are not sacrificed for the sake of simplification. A statistical question is the substantive question stated in statistical language. Since the language of statistics is mathematical, the statistical question is stated as numbers, parameters, relations of equality, and relations of inequality. A statistical conclusion is the result of mathematical manipulation of parameters or data. Statistical conclusions are made about groups and not individuals. Any inference to the individual is to a hypothetical individual. In other words the statistical conclusion is depersonalized. A substantive conclusion is the translation of the statistical conclusion back to normal language to answer the substantive question that was posed at the start. (b) ANALYSIS VS INTERPRETATION: Statistical results are dry unless well interpreted and put in the right context. Biostatistics only summarizes the data but does not interpret it. Interpretation involves knowledge of the context, prior knowledge, and prior suppositions. Personal familiarity with the data may also influence how it is interpreted. Data that is well analyzed may be poorly interpreted.

Q4. Describe and illustrate the following types of probability classification:

A4. (a) BAYESIAN PROBABILITY: Bayesian probability combines both subjective and objective probability to reach a conclusion. The prior probability can be objective or subjective. It can also be a belief. Objective priors are based on previous data. Subjective priors are based on considered opinions of the investigators. Priors are combined with new empirical evidence to reach a posterior probability. Bayesian probability is a good representation of how conclusions are made from empirical observation in real life. (b) A PRIORI AND A POSTERIORI PROBABILITY: A priori (theoretical or classical) probability is knowable or calculable without experimentation. It can be determined by abstract reasoning. On the other hand a posteriori (empirical, frequentist) probability is knowable or calculable only from the results of an experiment. Both a priori and a posteriori probabilities are types of objective probability. In experimentation based on the scientific method, we start with the a priori assumption and, after considering the results of the experiment, end with an a posteriori probability.

Q5. Describe and illustrate the use of Bayesian probability in statistical inference

A5. Bayes' theorem was named posthumously after an English clergyman, Thomas Bayes (1702-1761). It enables us to combine subjective with empirical probabilities. This overcomes the limitation of classical frequentist probability, which does not allow allotting a numerical probability to the truth of a hypothesis or proposition. The mathematical formulation of Bayes' theorem is: Pr (B|A) = Pr (A ∩ B) / Pr (A). Since Pr (A ∩ B) = Pr (A|B) Pr (B), this can be written as Pr (B|A) = Pr (A|B) Pr (B) / Pr (A), where Pr (B) is the prior and Pr (B|A) is the posterior. The above formula can be stated in a reverse way that is still valid as Pr (A|B) = Pr (A ∩ B) / Pr (B). As can be seen in its formula, Bayesian probability assumes and is based on conditional probability. The denominator is obtained from Pr (A) = Pr (A ∩ B) + Pr (A ∩ B’) = Pr (A|B) Pr (B) + Pr (A|B’) Pr (B’). The Bayesian formulation is useful in reversing probabilities and also for combining results of a new study with those of a previous study. There are differences between Bayesian and classical statistical inference. Classical statistics is rigid and does not consider uncertainty or prior belief. In the real world the investigator has some prior belief before starting a new experiment. The experimental evidence just changes the degree of prior belief. Thus the posterior is a function of both the prior and experimental data. Conditional probabilities such as are used in Bayesian formulations satisfy all the laws of probability. The range of conditional probabilities is from 0.0 to 1.0, 0 ≤ Pr (A|B) ≤ 1.0. The probability space adds up to 1.0, Pr (S|B) = 1.0. For mutually exclusive events A_i, Pr (∪ A_i |B) = Σ Pr (A_i |B) for i = 1, …, ∞.

Q6. Describe and illustrate the use of Bayesian probability in clinical decision making

A6. The following quote illustrates the importance of converting prior into posterior probabilities. Claude Bernard said that a physician ‘never makes experiments to confirm his ideas but simply to control them’ (in Bernard C (1956): Introduction to Experimental Medicine. Dover Reprint. New York). In clinical usage some prior probabilities are known from the database. Bayes' theorem enables combining new empirical data with evidence already available to reach a conclusion. For example if a decision has to be made whether a patient has a disease, D, on the basis of a laboratory test, T, we can use the Bayesian formulation as follows: Pr (D+|T+) = {Pr (T+|D+) Pr (D+)} / {Pr (T+|D+) Pr (D+) + Pr (T+|D-) Pr (D-)}. This can be rewritten as the predictive value of a positive test = Pr (D+|T+) = {(sensitivity) (prevalence)} / [{(sensitivity) (prevalence)} + {(false positive rate) (1 – prevalence)}]. The Bayesian formulation can also be used to convert old (prior) odds to new (posterior) odds as follows: new odds = old odds x likelihood ratio, the likelihood ratio being the ratio of two conditional probabilities. A likelihood ratio compares the likelihood of a result in two circumstances; for example we can compute a likelihood ratio LR = Pr (A+|B+) / Pr (A+|B-). Issues that affect decision-making in public health are access to care, quality of care, effectiveness of care, cost of care, and efficiency of care. We must contain costs yet deliver quality care. Data linkage in MIS enables good decision-making.
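
As an illustrative sketch of the predictive value formula above, the following Python functions apply Bayes' theorem to a laboratory test and repeat the calculation through odds and a likelihood ratio. The function names and the figures used (sensitivity 0.90, specificity 0.95, prevalence 0.01) are assumptions chosen for illustration and do not describe any real test.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return Pr(D+|T+) using Bayes' theorem."""
    false_positive_rate = 1.0 - specificity                      # Pr(T+|D-)
    true_positives = sensitivity * prevalence                    # Pr(T+|D+) Pr(D+)
    false_positives = false_positive_rate * (1.0 - prevalence)   # Pr(T+|D-) Pr(D-)
    return true_positives / (true_positives + false_positives)


def update_odds(old_odds, likelihood_ratio):
    """Return the new odds after multiplying the old odds by a likelihood ratio."""
    return old_odds * likelihood_ratio


# Hypothetical test characteristics, for illustration only.
sensitivity, specificity, prevalence = 0.90, 0.95, 0.01

ppv = positive_predictive_value(sensitivity, specificity, prevalence)

# The same answer reached through odds: LR = Pr(T+|D+) / Pr(T+|D-).
likelihood_ratio = sensitivity / (1.0 - specificity)
old_odds = prevalence / (1.0 - prevalence)
new_odds = update_odds(old_odds, likelihood_ratio)
ppv_from_odds = new_odds / (1.0 + new_odds)

print(f"Predictive value of a positive test: {ppv:.3f}")
print(f"Same value reached through odds:     {ppv_from_odds:.3f}")
```

With these assumed figures a positive result raises the probability of disease from 1% to roughly 15%, which suggests how strongly a low prevalence limits the predictive value of even a fairly accurate test.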

Q7. Describe and illustrate 2 of the 6 properties of a random variable

A7. (a) EXPECTATION OF A RANDOM VARIABLE: Even if the random variable changes a lot, there is some central or middle value around which it hovers most of the time. This is called the expectation of the random variable, also written as exp (x). We will in due course learn that the expectation is the same as the familiar concept of average. The strong law of large numbers states that the average of a sequence of independent random variables having a common distribution will converge to the mean of the distribution. Stated precisely, {x₁ + x₂ + … + x_n} / n → μ as n → ∞. Another useful mathematical theorem is that the expected value of a sum of random variables is the sum of their expected values. The expectation of a random variable that takes values of only one sign has that same sign. The expectation or mean of a random variable is its first moment about the origin, defined as Σ x_i / N where x_i = value of the random variable and N = total population; the first moment about the mean, Σ (x_i - μ)¹ / N, is always zero. (b) VARIANCE OF A RANDOM VARIABLE: The variations of the random variable around the expectation are measured by its variance, written as var (x). Variance is in effect an average measure of how much the random variable varies either above or below the expectation. Variance is computed as the average of the squared deviations of x from the expectation, var (x) = Σ {obs (x) – exp (x)}² / N. The variance of a random variable is its second moment about the mean, defined as Σ (x_i - μ)² / N where x_i = value of the random variable, μ = population average and N = total population. (c) COVARIANCE: It is not enough to study the variation of one random variable. It may be necessary to study the variation of one random variable in relation to the variation of another random variable. Two measures for comparative study of random variation are used, covariance and correlation. Covariance measures the covariability of the two variables. Covariance of two random variables, x and y, is defined as cov (x, y) = Σ {obs (x) – exp (x)} {obs (y) – exp (y)} / N where obs (x) is the random value of x, exp (x) is the expectation of x, obs (y) is the random value of y and exp (y) is the expectation of y. Covariance can be negative, positive, or zero. Correlation measures the linear relation between two random variables. The correlation between x and y is defined in terms of the covariance as shown in the equation corr (x, y) = cov (x, y) / {var (x) var (y)}^1/2. The sign of the covariance is always the same as that of the correlation. If the two variables are independent of one another such that any change in one does not affect the other, their covariance and correlation will both be equal to zero. However finding a zero covariance or a zero correlation does not automatically imply independence because the two variables could be dependent on one another in ways that are not measured by the covariance and correlation. (d) SKEWNESS OF A RANDOM VARIABLE: Skewness is a measure of how skewed or biased the distribution of the random variable is away from the center. The skew may be above the center (positive skew) or below the center (negative skew). The skew of a random variable is its third moment about the mean and is defined as Σ (x_i - μ)³ / N where x_i = value of the random variable, μ = population average and N = total population. (e) KURTOSIS OF A RANDOM VARIABLE: Kurtosis is a measure of how peaked the distribution of the random variable is at the point of its expectation. High kurtosis means that many values of the random variable crowd around the expectation. It is the fourth moment about the mean, defined as Σ (x_i - μ)⁴ / N where x_i = value of the random variable, μ = population average and N = total population.
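
The moment definitions above can be sketched in Python as follows. This is a minimal illustration that divides by N, as in the moment formulas in the text; the function names and the two small data sets are assumptions made for the example. The second data set shows a pair of variables that are fully dependent (y is the square of x) and yet have zero covariance.

```python
def central_moment(values, k):
    """k-th moment about the mean: sum((x_i - mean)**k) / N."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def covariance(xs, ys):
    """Average product of the deviations of x and y from their means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """corr(x, y) = cov(x, y) / sqrt(var(x) * var(y))."""
    return covariance(xs, ys) / (central_moment(xs, 2) * central_moment(ys, 2)) ** 0.5


# Hypothetical data: y rises in a straight line with x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print("variance of x:", central_moment(xs, 2))
print("third moment of x:", central_moment(xs, 3))
print("fourth moment of x:", central_moment(xs, 4))
print("correlation of x and y:", correlation(xs, ys))

# Hypothetical data: y depends completely on x (y = x squared),
# yet the covariance is zero because the relation is not linear.
us = [-2, -1, 0, 1, 2]
vs = [u ** 2 for u in us]
print("covariance of u and u squared:", covariance(us, vs))
```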

Q8. Describe and illustrate properties of a normal curve

A8. (a) PARAMETERS: The normal curve is described fully by two parameters only, the mean and the standard deviation. The formula for the normal curve has only those two parameters as shown: y = {1 / (σ √(2π))} exp [-1/2 {(x - μ) / σ}²]. A standardized normal curve has mean = 0 and standard deviation = 1. Two curves may have the same mean but different standard deviations. Two curves may have different means but the same standard deviation. For a normal curve the ratio of the semi-interquartile range to SD is approximately 0.67 (the full inter-quartile range is about 1.35 SD). (b) SYMMETRY: The normal curve is perfectly symmetrical about the mean. (c) CONTINUOUS DISTRIBUTION: The normal curve is a continuous distribution. For large data sets it models discrete data fairly well. (d) ASYMPTOTIC: The normal curve approaches the x-axis but never touches it. (e) CENTRAL LIMIT THEOREM ASSUMPTIONS: The normal curve conforms to the 3 assumptions of the central limit theorem. These assumptions are true when the sample size is large. The first assumption is that sample means have a normal distribution regardless of the distribution of the population from which the sample was selected. The second assumption is that the mean of the sample means = population mean. The third assumption is that the standard error of the sample mean = population standard deviation / √n where n = sample size. Where the population standard deviation is not known, the sample standard deviation is used. A major benefit of the central limit theorem is that it makes it unnecessary to assume that observations came from a normal distribution.
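
A short Python sketch can make the properties of the normal curve concrete. It evaluates the density formula directly, compares it with `NormalDist` from the Python standard library, and checks symmetry and the position of the quartiles. The function name `normal_density` and the points at which the curve is evaluated are assumptions made for this example.

```python
import math
from statistics import NormalDist


def normal_density(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2.0 * math.pi))


standard = NormalDist(mu=0.0, sigma=1.0)

# The hand-written formula and the library agree at the mean.
peak = normal_density(0.0, 0.0, 1.0)
print("height at the mean:", peak, standard.pdf(0.0))

# Symmetry: equal heights at equal distances either side of the mean.
left = normal_density(-1.5, 0.0, 1.0)
right = normal_density(1.5, 0.0, 1.0)
print("heights at -1.5 and +1.5:", left, right)

# The upper quartile lies about 0.67 standard deviations above the mean.
upper_quartile = standard.inv_cdf(0.75)
print("upper quartile in SD units:", round(upper_quartile, 4))

# Asymptotic tails: the curve is tiny but never zero far from the mean.
print("height at 6 SD:", normal_density(6.0, 0.0, 1.0))
```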

Q9 Compare and contrast with illustrations 3 ways of defining the 95% confidence interval

A9. (a) COMMON SENSE CONFIDENCE: The 95% CI can be defined in a common sense way or in a statistical way. The common sense definition is not strictly accurate but it is intuitive. In a common sense way the 95% CI means that we are 95% sure that the true value of the parameter is within the interval. As mentioned before there is nothing magical about 95%. We can use 90%, 80%, or any other figure. In order to generalize the above definition we can say we are (1-α) 100% confident that the true value of the parameter is in the interval. The level of significance is denoted by α, conventionally set at 0.05, but any other value is acceptable. (b) COMMON SENSE OF CHANCE: We can say that the probability that the true parameter lies in the confidence interval is 95% or (1-α) 100%. (c) DEFINITION BASED ON SAMPLE MEANS: Imagine repeated sampling from a population and computing a mean for each sample selected. Ninety-five percent (95%) of the sample means will be within the 95% CI. (d) DEFINITION BASED ON THE INTERVALS THAT COVER THE PARAMETER: The statistical definition can be stated more rigorously: An interval estimate of an unknown population parameter is a random interval constructed so that it has a given probability of including the parameter. Thus the 95% confidence interval for parameter θ is defined as Pr (a < θ < b) = 0.95. Note that we talk of the probability that the interval includes θ and not the probability that θ lies in the interval. This is because θ is fixed but the intervals that cover it vary. The probability can be stated as (1-α) 100%.
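
The definition based on intervals that cover the parameter can be illustrated by simulation. The Python sketch below repeatedly draws samples from a population whose mean is known, builds an approximate 95% interval from each sample, and counts how often the interval includes the fixed true mean. The population values (mean 50, standard deviation 10), the sample size, the number of repeats, the seed, and the use of the multiplier 1.96 are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def interval_coverage(true_mean=50.0, sd=10.0, n=40, repeats=2000, seed=1):
    """Fraction of approximate 95% intervals that include the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = mean(sample)
        standard_error = stdev(sample) / n ** 0.5
        lower = sample_mean - 1.96 * standard_error
        upper = sample_mean + 1.96 * standard_error
        # The parameter is fixed; it is the interval that varies from sample to sample.
        if lower < true_mean < upper:
            hits += 1
    return hits / repeats


coverage = interval_coverage()
print(f"Proportion of intervals covering the true mean: {coverage:.3f}")
```

The proportion printed should be close to 0.95, though not exactly so, because the number of repeats is finite and the interval uses a large-sample approximation.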

Q10. Discuss with examples the difference between control and eradication of communicable disease

A10. Control is the containment of disease and includes both prevention and control measures. Eradication is complete uprooting of a disease and its total elimination. Prevention is pre-empting disease. Primary prevention is at the pre-pathogenic stage such as health promotion and health protection (specific & general). Secondary prevention is early detection and treatment. Tertiary prevention is disability limitation and rehabilitation. Curative medicine (for individuals) and preventive medicine (for the community) are synergistic. Preventive medicine has priority and has been more effective, with a bigger impact on disease over the past 100 years. Prevention strategies can be disease-oriented such as immunization and tobacco control, environment-oriented such as food quality control, or host-oriented such as immunization, nutrition, and medical care.

Q11. Discuss with examples the statement 'biomarkers are the future of epidemiological research'

A11. Molecular epidemiology is the use of biological markers (cellular, biochemical, molecular, genetic, immunologic, or physiological) to study disease-exposure relations. Both exposure and outcome biomarkers are used. A marker is selected on the basis of biologic relevance, pharmacokinetics, temporal relevance, background variability, confounding, reproducibility, specificity, sensitivity, and predictive value. Markers are validated as correct measures of exposure, disease, and susceptibility by use of dose response relations, inter-personal or intra-personal variation, and correlation with clinical status or with other biomarkers. Cardiovascular markers are of three main classes: lipid-related (total cholesterol, triglycerides, low density lipoproteins, high density lipoproteins, Apo B and Apo A1), markers of thrombosis (factor VIII, plasma fibrinogen, platelet counts, and platelet aggregation) and markers of disease outcome (various enzymes in myocardial infarction). Markers in genetic diseases are metabolites such as in PKU or antigen systems such as HLA. Biomarkers are used in carcinogenesis to measure doses of potential carcinogens (e.g. DDT, PCB, aflatoxin), assay early biological effects (chromosomal aberrations), and screen for mutagens (Ames test). Biomarkers in infectious disease are serum antibodies.

Q12. Discuss with examples the use of any one of epidemiological study designs for studying the etiology of genetic diseases

A12. The traditional genetic epidemiology studies are case control studies, cohort studies, and cross sectional studies, with case control studies being the most popular. Molecular analysis can be used in case control studies to explore genetic and environmental interactions. New study designs are: family studies, twin studies, adoption studies, migrant studies, affected relative studies, and various adaptations of the case control design: case only and case parent studies. Family studies include study of first-degree relative disease risk, concordance studies, and gene isolation by segregation or linkage. Increased disease risk in a first-degree relative points towards a genetic cause. Concordance of certain variables related to disease within the family (between parents and offspring or among siblings) is assessed by using the correlation coefficient. Concordance among spouses indicates environmental causes. Gene isolation involves investigating the relation between an allele and a disease condition by analysis of DNA polymorphism or by family studies to establish segregation or linkage between disease-associated loci. Segregation analysis uses statistical methods to determine if the pattern of familial disease is compatible with Mendelian inheritance. Linkage analysis seeks to investigate whether two alleles from 2 loci segregate together in a family as they are passed from parents to child. Twin studies may be based on study of monozygotic twins or dizygotic twins. Monozygotic twins share 100% of their genetic material. Dizygotic twins share on average only 50% of their genetic material. The results of a twin study are set out as shown in the 2x2 contingency table below:

	Twin 1 +	Twin 1 -
Twin 2 +	a	b
Twin 2 -	c	d

The concordant rates and discordant rates for disease are computed. The concordant rate is computed as a / (a + b + c) and the discordant rate is computed as (b+c) / (a + b + c). A strong concordance in monozygotic twins suggests a genetic cause of disease.
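
As a worked illustration of the two formulas above, the following Python sketch computes the concordant and discordant rates from the cells of the 2x2 twin table. The function name and the cell counts are hypothetical and are used only to show the arithmetic.

```python
def twin_rates(a, b, c):
    """Return (concordant rate, discordant rate) among pairs with at least one affected twin.

    a = both twins affected; b and c = only one twin affected.
    Cell d (neither twin affected) does not enter either rate.
    """
    pairs_with_disease = a + b + c
    concordant = a / pairs_with_disease
    discordant = (b + c) / pairs_with_disease
    return concordant, discordant


# Hypothetical counts for monozygotic and dizygotic twin pairs.
mz_concordant, mz_discordant = twin_rates(a=30, b=10, c=10)
dz_concordant, dz_discordant = twin_rates(a=10, b=20, c=20)

print(f"Monozygotic: concordant {mz_concordant:.2f}, discordant {mz_discordant:.2f}")
print(f"Dizygotic:   concordant {dz_concordant:.2f}, discordant {dz_discordant:.2f}")
```

In this made-up example the concordant rate is higher in the monozygotic pairs than in the dizygotic pairs, which is the pattern that would point towards a genetic cause.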

Adoption studies are used to evaluate the relative contributions of genetic and environmental factors to disease. Disease risk can be compared in twins adopted by different families. Disease risk is also compared in adopted children and biological children. Disease risk can also be compared in parents and their offspring adopted into other families.

In affected relative studies, the alleles of the proband are compared to those of a second affected case as well as the parents. The case only study is a simple design that yields the OR; it compares observed genotype with the expected based on the population. The case parent design compares actual with expected genotype based on parental genotype.

Inbreeding studies: inbreeding increases the number of homozygous sites, which results in a higher risk of autosomal recessive disorders. It is possible to compute an inbreeding coefficient and to relate it to disease risk in any given community.

Admixture studies are used to study the effect of racial mixing. For example in the US admixture of black and white results in higher risk of diabetes mellitus. Black DNA markers are used to assess the degree of racial admixture.

Genetic mapping in relation to clinical disease: The Human Genome Project, which aims at mapping the sequence of the human genome of about 50,000 – 100,000 genes, will contribute new information for genetic studies. Human genes play roles in both rare and common diseases.

Using genetic distribution to compute disease risk

Genetic markers: Genetic markers can be gene products such as ABO, HLA, proteins, or enzymes. They can also be based on direct analysis of DNA. These studies suffer from three main disadvantages: confounding bias, misclassification of genotype, and gene-environmental interactions.

Time trend studies indicate whether disease is biological or environmental. Environmental disease changes with time.

Migrant studies are also used to evaluate the relative roles of genetic and environmental factors in disease. Interpretation of migrant studies is complicated by three considerations: the migrants are a self-selected group that does not represent the general population, age at migration determines the type and length of exposure to environmental causes in the home and migrant countries, and migrants may carry with them some of the cultures and lifestyles of the original country.

Q13. Discuss the primary and secondary prevention of genetic diseases

A13. PRIMARY PREVENTION: (a) Genetic counseling: discouraging consanguinity; pre-marital / pre-pregnancy risk assessment based on a detailed family history and diagnosis of disease in family members; pre-natal diagnosis, which is controversial. (b) Screening: screening is available for phenylketonuria, hypothyroidism, cystic fibrosis, sickle cell disease, Tay-Sachs disease, adult-onset polycystic kidney disease, multiple endocrine adenomatosis, and familial polyposis coli. SECONDARY PREVENTION: surgical correction; replacement therapy, e.g. giving insulin in diabetes mellitus; amelioration therapy, e.g. restricting diet in phenylketonuria; preventive therapy, e.g. removing polyposis coli; gene therapy.

Q14. Discuss with examples the definition and prevention of Type A and type B Adverse Drug Reactions

A14. Adverse drug reactions (ADR) are classified as type A and type B. Type A reactions are due to the known pharmacological effects of the drug. They are dose dependent, predictable, and not so severe. Type B reactions are rare idiosyncratic reactions to the drug. They are not dose dependent, are unpredictable, and carry higher mortality. In the UK, 5% of all hospital admissions are due to ADR. 1 in 10 patients admitted for other reasons develops an ADR. 1 in 1000 hospital deaths is due to ADR. Primary prevention of ADR is by control of prescription, knowing allergies, avoiding polypharmacy, and rational drug use. Secondary prevention of ADR is by stopping the drug, using an antidote, and monitoring for further side-effects. Post-marketing surveillance of drugs is necessary to pick up more ADRs.

PROBLEM QUESTIONS

PROBLEM #1: The City of Kuala Lumpur experienced a severe dry season for 2 months followed by very heavy rainfall for a week that resulted in flash floods all over the city that overwhelmed the drainage system. Many houses were destroyed and people were housed in temporary camps. In the 4 weeks following the floods, hospitals reported an increase in the number of various disease conditions, the most frequent being childhood diarrhea and vomiting as well as first trimester abortions. The Ministry of Health has asked you to investigate and advise on public health measures to be undertaken.

Q1. List various null and alternative hypotheses that you would generate.

A1. The null hypothesis, H₀, states that there is no difference between two comparison groups and that the apparent difference seen is due to sampling error. The alternative hypothesis, H_A, disagrees with the null hypothesis. H₀ and H_A are complementary and exhaustive. Together they cover all the possibilities.

Q2. State the substantive and statistical questions for each of your hypotheses

A2. Open

Q3. List and describe types of data you would need to collect to test the hypotheses. You will need to provide justification for each data type in view of your hypothesis / hypotheses.

A3. Open

Q4. Describe the study design and methods of data collection that you would use to test your hypothesis / hypotheses

A4. Open

Q5. Describe the test parameters and steps of hypothesis testing using the p-value method

A5. Parameters of significance testing are the critical region, the significance level, the p-value, type I error, type II error, and power. P values for large samples that are normally distributed are derived from 4 test statistics that are computed from the data: t, F, χ², and β. P values for small samples that are not normally distributed are computed directly from the data using exact methods based on the binomial distribution. The decision rules are: If p < 0.05, H₀ is rejected (test statistically significant). If p > 0.05, H₀ is not rejected (test not statistically significant).
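
The steps of the p-value method can be sketched in Python for one possible hypothesis in this problem, namely that the proportion of children with diarrhea is the same in flooded and non-flooded areas. The counts are invented for illustration, and the large-sample z statistic for two proportions is used here as an assumed, convenient choice of test statistic; it is not one prescribed by the answer above.

```python
from statistics import NormalDist


def two_proportion_z_test(cases_1, n_1, cases_2, n_2):
    """Return (z statistic, two-sided p-value) for H0: the two proportions are equal."""
    p_1 = cases_1 / n_1
    p_2 = cases_2 / n_2
    # Under H0 both groups share one proportion, estimated by pooling the data.
    pooled = (cases_1 + cases_2) / (n_1 + n_2)
    standard_error = (pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2)) ** 0.5
    z = (p_1 - p_2) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 of 200 children in flooded areas and
# 30 of 200 children in non-flooded areas had diarrhea.
z_statistic, p_value = two_proportion_z_test(60, 200, 30, 200)

significance_level = 0.05
decision = "reject H0" if p_value < significance_level else "do not reject H0"
print(f"z = {z_statistic:.2f}, p = {p_value:.5f}, decision: {decision}")
```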

Q5. Describe the test parameters and steps of hypothesis testing using the confidence interval method

A5. The 95% confidence interval is more informative than the p-value approach because it indicates precision. Under H₀ the null value is defined as 0 (when the difference between comparison groups = 0) or as 1.0 (when the ratio between comparison groups = 1). The 95% CIs can be computed from the data using approximate Gaussian methods (for large samples) or exact binomial methods (for small samples). The decision rules are: if the CI contains the null value, H₀ is not rejected. If the CI does not contain the null value, H₀ is rejected.
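
The confidence interval decision rule can be sketched in Python using the approximate Gaussian method for a difference between two proportions, for which the null value is 0. The counts, the function names, and the multiplier 1.96 are assumptions made for illustration.

```python
def difference_ci(cases_1, n_1, cases_2, n_2, z=1.96):
    """Approximate 95% CI for the difference between two proportions."""
    p_1 = cases_1 / n_1
    p_2 = cases_2 / n_2
    standard_error = (p_1 * (1.0 - p_1) / n_1 + p_2 * (1.0 - p_2) / n_2) ** 0.5
    difference = p_1 - p_2
    return difference - z * standard_error, difference + z * standard_error


def reject_null(confidence_interval, null_value):
    """H0 is rejected only when the interval does not contain the null value."""
    lower, upper = confidence_interval
    return not (lower <= null_value <= upper)


# Hypothetical data: 60 of 200 exposed and 30 of 200 unexposed children had diarrhea.
interval = difference_ci(60, 200, 30, 200)
print(f"95% CI for the difference: {interval[0]:.3f} to {interval[1]:.3f}")
print("H0 rejected:", reject_null(interval, null_value=0.0))
```

The width of the interval shows the precision of the estimate, which is the extra information that the p-value alone does not give.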

Q6. List and describe errors of statistical testing that you may encounter and describe their impact on the conclusions from the testing of 2 of your main hypotheses

A6. The concepts of conditional probability can be used to define parameters related to statistical testing. Type I error = α error = Probability of rejecting a true H₀ = False positive = Pr (rejecting H₀ | H₀ is true). Type II error = β error = Probability of not rejecting a false H₀ = False negative = Pr (not rejecting H₀ | H₀ is false).

Q7. Describe confounding factors you would consider in reaching a substantive conclusion from the statistical conclusion of one of your hypothesis

A7. Open

PROBLEM #2: The Ministry of Health is proud of being able to provide a sufficient number of hospital beds for the whole population over the past 15 years. They are concerned about planning for enough hospital beds in the next 5 years. They have asked you to develop a regression model that they can use to predict the number of beds needed based on demographic, epidemiological, and economic variables. They have put at your disposal data on these variables and also hospital beds over the past 10 years.

Q1. Compare and contrast the advantages and disadvantages of cross sectional or longitudinal regression models for this problem. Indicate which model you would prefer and why?

A1. Parametric regression models are cross sectional (linear, logistic, or log-linear) or longitudinal (linear and proportional hazards). Both cross sectional and longitudinal models can be used and the student has to discuss relative merits (open answer). The simple linear regression equation is y = a + bx where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable. Multiple linear regression, a form of multivariate analysis, is defined by y = a + b₁x₁ + b₂x₂ + … + b_nx_n.
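
The simple linear regression equation can be illustrated with a short Python sketch that fits the intercept a and slope b by ordinary least squares and then uses the fitted line to predict. The data (population in millions against hospital beds in thousands) are entirely hypothetical, and the function name is an assumption made for the example.

```python
def fit_simple_linear(xs, ys):
    """Return (a, b) for the least-squares line y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical yearly data: population (millions) and beds needed (thousands).
population = [20, 21, 22, 23, 24]
beds = [41, 43, 44, 47, 48]

a, b = fit_simple_linear(population, beds)
predicted_beds = a + b * 25  # prediction for a population of 25 million
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print(f"predicted beds (thousands) at 25 million: {predicted_beds:.1f}")
```

A real model for this problem would include several predictors and would call for a statistical package; the sketch only shows what the intercept and slope mean.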

Q2. List and describe the variables that you will include in the model

A2. Regression relates independent to dependent variables. The variables may be raw data, dummy indicator variables, or scores. Some variables are included in the model to control for confounding. The candidate will have to think of demographic, epidemiological, and economic variables that can predict bed requirements in the future. Y can be interval, dichotomous, ordinal, or nominal and x can be interval or dichotomous but not ordinal or nominal. Interactive (product) variables can be included in the model.

Q3. List and describe assumptions that must hold for a regression model to be valid

A3. Validity is based on 4 assumptions: linearity of the x-y relation, normal distribution of the y variable for any given value of x, homoscedasticity (constant y variance for all x values), and independence of the y values for each value of x.

Q4. Describe how you can test for statistical significance of the regression coefficient

A4. The t test can be used to test the significance of the regression coefficient and to compare regression coefficients of 2 lines.

Q5. Describe 3 methods of adding / removing variables in fitting regression lines explaining advantages and disadvantages of each

A5. Fitting the simple regression model is very straightforward since it has only one independent variable. Fitting the multiple regression model is by step-up, step-down, and step-wise selection of x variables. Step-up or forwards selection starts with a minimal set of x variables and one x variable is added at a time. Step-down or backward elimination starts with a full model and one variable is eliminated at a time. Step-wise selection is a combination of step up and step down selection.

Q6. Describe 2 methods of regression model validation

A6. Model validation is by using new data, data splitting, the jackknife procedure, and the bootstrap procedure.

Q7. Describe 2 methods of assessing regression models

A7. The best model is one with the highest coefficient of determination or one for which any additions do not make any significant changes in the coefficient. The model is assessed by the following: testing linearity, row diagnostics, column diagnostics, hypothesis testing, residual analysis, impact assessment of individual observations, and the coefficient of determination. Row diagnostics identify the following: outliers, influential observations, unequal variances (heteroscedasticity), and correlated errors. Column diagnostics deal mainly with multicollinearity, that is correlations among several x variables causing model redundancy and imprecision. Collinear variables should be dropped leaving only the important one. Hypothesis testing of omnibus significance of the model uses the F ratio. Hypothesis testing of individual x variables uses the t test. Residuals are defined as the difference between the observed values and the predicted values. A good model fit will have most residuals near zero and the residual plot will be normal in shape. The impact of specific observations is measured by their leverage or by Cook’s distance. The coefficient of determination, defined as r², varies from 0 to 1.0 and is a measure of goodness of fit. The fit of the model can be improved by using polynomial functions, linearizing transformations, creation of categorical or interaction variables, and dropping outliers.
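
Two of the assessment tools named above, residuals and the coefficient of determination, can be sketched in Python as follows. The data and the fitted line y = 5.0 + 1.8x are hypothetical, and the line is assumed to have been fitted beforehand by least squares; the function names are likewise assumptions for the example.

```python
def residuals(ys, predicted):
    """Observed value minus predicted value for each observation."""
    return [y - p for y, p in zip(ys, predicted)]


def coefficient_of_determination(ys, predicted):
    """r squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    residual_ss = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - residual_ss / total_ss


# Hypothetical data and an assumed, already fitted line y = 5.0 + 1.8 x.
xs = [20, 21, 22, 23, 24]
ys = [41, 43, 44, 47, 48]
intercept, slope = 5.0, 1.8

predicted = [intercept + slope * x for x in xs]
model_residuals = residuals(ys, predicted)
r_squared = coefficient_of_determination(ys, predicted)

print("residuals:", [round(r, 2) for r in model_residuals])
print(f"coefficient of determination: {r_squared:.3f}")
```

Residuals that are small and scattered evenly around zero, together with a coefficient of determination near 1.0, are what a good fit would be expected to show.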

PROBLEM #3: The Ministry of Health wanted to set up a permanent annual health and nutritional survey of children aged 0-5 years. A request for proposals was published and 2 faculties of medicine presented their sampling plans: stratified random sampling, multi-stage random sampling and cluster sampling. You are asked to assess the 2 sampling plans and advise the Ministry which of the 2 to adopt

Q1. Describe the rationale for / advantage of using samples in investigations

A1. Data can be collected from the whole population as in the census (all surveyed) or from a sample survey (including some members of the population). Most bio-statistics is study of samples and not target or study populations. There are only a few exceptional cases when the whole population is studied. Samples and not whole populations are studied for three main reasons (a) study of populations is costly and logistically difficult. More manpower is needed and more time is spent in carrying out studies of populations. (b) Due to logistic considerations, it is easier to be more accurate when studying a small sample than when studying the whole population. (c) Some populations are hypothetical. It is not possible to identify or enumerate all their members. There is no way of studying them except by sampling.

Q2. Explain with illustration the rationale of sampling with unequal inclusion probabilities

A2. In a self-weighting sample inclusion probabilities are the same for all elements. There are some situations in which selection of sample elements is carried out with unequal inclusion probabilities in order to gain more precise estimates, but these are not routine; they are complex and give rise to complex sample estimators. An example is the selection of a sample of shoplifters in which bigger shops are favored because they experience more shoplifting. The selection probability can be kept constant throughout the process of sample selection or can be allowed to vary according to pre-determined criteria. In practice it is better to stick to simple and straightforward procedures such as simple random sampling, which is self-weighting and has a constant inclusion probability.

Q3. Explain how a sample selected randomly may end up not being representative

A3. Random selection does not always assure representativeness. There is no 100% certainty that any sample, however selected, will represent the population perfectly. Small samples even if selected randomly are rarely representative. As a rule of thumb, a sample above 60 elements is generally representative. A sample may initially be selected as random and representative; however, by the time data is collected it may no longer be random or representative. This may occur when there is differential non-response, i.e. those members of the sample who refuse to participate in the study have characteristics that distinguish them from the rest of the population. The non-response could be unit non-response, in which data is not obtained about some members of the sample, or item non-response, in which data is not obtained about some items of information in a consistently biased way. Unit non-response effectively makes the sample smaller than was planned.

Q4. Discuss the differences between sampling with replacement and sampling without replacement

A4. There are two types of random sampling: sampling with replacement based on the binomial random variable and sampling without replacement based on the hypergeometric random variable. Both types of sampling satisfy conditions of random sampling. The two types of sampling give rise to distributions of similar means and shape; however, the standard deviation is smaller for sampling without replacement because of higher precision. Mathematical formulas are easier for sampling with replacement. When sampling with replacement, it makes no difference whether the population sampled is finite or not. When sampling without replacement, the size and finiteness of the population matter and the formulas for variance for finite and infinite populations are different. When sampling without replacement from a small finite population, sample units are successively selected from a diminishing pool. Thus the chance of selection changes from one draw to the next. Although the two types of sampling can be used, in practice most sampling is without replacement. There is practically no difference between the two types of sampling if the sample is small compared to the study population.
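
The difference between the two types of sampling can be illustrated in Python. The sketch draws one sample each way using the standard library and compares the standard error of the sample mean under the two schemes. The finite population correction, the factor √{(N - n) / (N - 1)}, is a standard result that is not stated in the answer above and is brought in here as an assumption for illustration; the population, sample size, and seed are hypothetical.

```python
import random


def se_mean_with_replacement(sigma, n):
    """Standard error of the sample mean when sampling with replacement."""
    return sigma / n ** 0.5


def se_mean_without_replacement(sigma, n, population_size):
    """Standard error of the sample mean with the finite population correction."""
    correction = ((population_size - n) / (population_size - 1)) ** 0.5
    return se_mean_with_replacement(sigma, n) * correction


rng = random.Random(7)
population = list(range(1, 201))                     # hypothetical population of 200 units

sample_with = rng.choices(population, k=50)          # a unit may be drawn more than once
sample_without = rng.sample(population, k=50)        # every unit appears at most once

print("distinct units, with replacement:   ", len(set(sample_with)))
print("distinct units, without replacement:", len(set(sample_without)))

# Small finite population: sampling without replacement is more precise.
print(se_mean_with_replacement(10.0, 50), se_mean_without_replacement(10.0, 50, 200))

# Very large population: the two standard errors are practically equal.
print(se_mean_with_replacement(10.0, 50), se_mean_without_replacement(10.0, 50, 1_000_000))
```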

Q5. Define stratified random sampling with mention of its advantages and explain how you would carry it out in this case.

A5. In this type of sampling the whole population is divided into groups called strata. It forces the investigator to select some elements from each of the strata, thus achieving some sort of balance for the whole sample. A pre-determined proportion or fraction of each stratum is randomly selected into the sample. Selection is carried out separately in each stratum using random selection. The sampling fraction from each stratum may be the same or may vary from stratum to stratum. The variation of sampling fractions enables deliberate over-sampling or under-sampling of some strata. Another way of stating this is to give each stratum a weighting. The inclusion probabilities are then different for the different strata. Those to be over-sampled have a higher weighting than those to be under-sampled. The strata may be defined qualitatively or quantitatively. The strata usually employed are: SES (low / middle / high), sex (male / female), age (young / old), race (black / Caucasian / mongoloid), occupational groups, and geographical units. A stratified sample has lower variance and is therefore more precise than a simple random sample. The reason for this higher precision is that strata are more homogeneous than the whole population. Post-sampling stratification can improve the estimates of simple random sampling without the logistic burden of carrying out a full stratified random sampling. A variation of the stratified sample is the proportional stratified sample, in which the number of sample elements selected from each stratum bears the same proportion to the total sample size as the stratum bears to the whole population. A technical term used for this type of sampling is Inclusion Probability Proportional to Size (IPPS). Stated algebraically, n_i / n = N_i / N where n_i = number of elements selected from the stratum, n = total sample size, N_i = size of the stratum, and N = population size. A proportional stratified sample is better than a simple random sample of the same size.
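
The proportional allocation n_i / n = N_i / N can be sketched in a few lines. This is only an illustration under stated assumptions: the strata, their sizes, and the function name are hypothetical, and rounding n_i to a whole number may make the stratum samples add up to slightly more or less than n in other examples.

```python
import random


def proportional_stratified_sample(strata, n, seed=0):
    """Draw a simple random sample from each stratum with n_i = n * N_i / N (rounded)."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    N = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_i = round(n * len(units) / N)        # proportional allocation
        sample[name] = rng.sample(units, n_i)  # selection without replacement within the stratum
    return sample


# Hypothetical population of 1000 people identified by number, in three SES strata.
strata = {
    "low": list(range(0, 500)),
    "middle": list(range(500, 800)),
    "high": list(range(800, 1000)),
}
sample = proportional_stratified_sample(strata, n=100)
sizes = {name: len(units) for name, units in sample.items()}
print(sizes)
```

Deliberate over-sampling of a stratum would replace the proportional n_i with a larger pre-determined number for that stratum, at the cost of needing weights in the analysis.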

Q6. Define multistage random sampling with mention of its advantages and explain how you would carry it out in this case.

A6. This is a random sample selected in 2 or more stages. The sample selected at the second stage is a sub-sample of that selected at the first stage. An example of 5-stage sampling may involve the following administrative units in descending order: city, neighborhood, block, household, and individual. Another example is when a random sample is first selected from each of the 2 gender categories, male and female, and random samples are then selected from each age category within each gender category. If households are selected at the first stage, the households are the primary sampling units (PSUs). Household members selected randomly from each household are the secondary sampling units (SSUs). We can talk of the first-stage inclusion probability and the conditional inclusion probability at the second stage. The resulting multi-stage sample has the advantage of being balanced with respect to gender, age, or household characteristics. Multi-stage sampling produces less efficient estimates of population parameters than simple random sampling. It saves time and money, thus becoming cheaper than simple random sampling. Its convenience is that it does not require prior enumeration of the entire sampling frame before the start of the sampling process. It is especially convenient when the complete sampling frame is not known. It has the great advantage of ensuring balanced representation of groups, which may not occur with simple random sampling. It is possible to have a sampling scheme that combines stratified with 2-stage sampling.
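
The first-stage and conditional second-stage inclusion probabilities mentioned above multiply to give the overall chance that an individual enters the sample. A minimal sketch, assuming simple random sampling at both stages; the function name and the figures (100 of 2000 households, 1 member chosen from a household of 4) are hypothetical.

```python
def two_stage_inclusion_probability(psu_sampled, psu_total, ssu_sampled, ssu_in_psu):
    """Overall inclusion probability = P(PSU selected) * P(SSU selected | PSU selected)."""
    first_stage = psu_sampled / psu_total    # e.g. households sampled / households in the area
    second_stage = ssu_sampled / ssu_in_psu  # e.g. members sampled / members in that household
    return first_stage * second_stage


# Hypothetical survey: 100 of 2000 households, then 1 member from a household of 4.
p = two_stage_inclusion_probability(100, 2000, 1, 4)
weight = 1 / p  # number of people in the population that this respondent represents
print(p, weight)
```

Because members of large households have a smaller conditional probability of selection, their sampling weights are larger; this is one reason multi-stage estimates are less efficient than those from a simple random sample.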

Q7. Define cluster sampling with mention of its advantages and explain how you would carry it out in this case.

A7. This is easy and cheap but less precise. Instead of using individuals as sampling units, groups of individuals (clusters) are used. The clusters may be natural or artificial. For example, instead of sampling individuals, households may be sampled. Clusters are normally selected as natural sub-groupings of the population. A random sample of clusters is selected and all elements of each selected cluster are included in the study sample. Cluster sampling can be viewed as a form of simple random sampling of clusters and not of individual sampling units. Cluster sampling can also be looked at as a form of 2-stage sampling in which all elements of the groups drawn in the first stage are included in the study sample. Cluster sampling proceeds by selecting geographical units like districts or zip codes. Then a house is selected at random in each unit. A cluster of given size is then formed around the index house. Sophisticated methods for this selection have been developed. For example, the researcher may walk in a straight line in a pre-determined direction while counting until a pre-determined number of houses is counted. These houses together with the index house will then constitute the cluster. Similar clusters are formed in the other zip codes and members of the households are interviewed as study subjects. Cluster sampling has several advantages. There is no need to have a complete sampling frame for the whole population. Cluster sampling is easy, quick, and cheap. Clusters can be selected from the more accessible areas. Cluster sampling has some disadvantages. Selection of households within a cluster, as in the walking method above, is not strictly random. It is less precise than the simple random sample because units selected within each cluster are similar to one another. Thus a cluster sample shows more similarity than there is in the actual population. Cluster sampling is used in studies of immunization coverage and in emergency situations.
The sample size for cluster sampling is computed as for the simple random sample and is then multiplied by a design factor (also called the design effect) to account for the clustering. The design factor is obtained from previous studies.
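
The sample size calculation just described can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: it uses the usual formula for estimating a proportion, n = z² p (1 - p) / d², and the expected coverage of 50%, the precision of ±5%, and the design factor of 2 are hypothetical values chosen for the example.

```python
import math


def srs_sample_size(p, d, z=1.96):
    """Sample size to estimate a proportion p within +/- d at about 95% confidence (SRS)."""
    return z ** 2 * p * (1 - p) / d ** 2


def cluster_sample_size(p, d, design_factor, z=1.96):
    """Simple-random-sample size inflated by the design factor, rounded up."""
    return math.ceil(srs_sample_size(p, d, z) * design_factor)


# Hypothetical immunization coverage survey: expected coverage 50%, precision +/- 5%.
n_srs = math.ceil(srs_sample_size(0.5, 0.05))
n_cluster = cluster_sample_size(0.5, 0.05, design_factor=2)
print(n_srs, n_cluster)
```

The more alike the members of a cluster are, the larger the design factor and the larger the sample needed to match the precision of a simple random sample.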

Q8. Write out a recommendation for the Ministry of Health comparing the 3 sampling schemes indicating your preference giving full justification.

A8. open

PROBLEM #4: The Ministry of Health, in response to widespread complaints about 'dirty air' in Kuala Lumpur, called you as an expert to advise on the problem. You have been told to produce a report describing the problem and its causes as well as suggesting solutions. You have an unlimited budget for your investigations and can hire any number of professionals to help you.

Q1. Define air pollution and mention the most likely pollutants in Kuala Lumpur

A1. Air pollution is defined as contamination of the air by substances in amounts great enough to interfere with the comfort, safety, and health of living organisms. The three commonest causes of air pollution are automobile emissions, burning of fossil fuels for energy generation, and industrial plants like refineries and mills. The most pervasive air pollutants are CO, Pb, NO₂, SO₂, O₃, and particulate matter. Airborne pollutants can be gases, vapors, aerosols, mist, dust, or smoke.

Q2. Explain the moral problem associated with control of air pollution in newly industrialized countries (NIC) like Malaysia

A2. NICs face a real moral and economic dilemma. The economic and industrial activities that they need in order to get out of poverty cause degradation of air quality, and the choice between the two is difficult to make.

Q3. Describe sources of indoor pollution

A3. The sources of indoor pollutants are either internal or external. The concentration of indoor pollutants is higher than that of outdoor pollutants because pollutants are trapped and concentrated. Humans spend more than 90% of their time indoors. There is little information on the chronic effects of indoor pollutants. The internal sources are cigarette smoke, heating and air-conditioning, building materials like asbestos, wood combustion, radon, formaldehyde, and nitrogen dioxide from gas stoves. Carbon monoxide is the most dangerous indoor pollutant. Formaldehyde is from indoor insulating material. The external sources are pollutants entering the house from the external atmosphere. Indoor pollutants include organic and inorganic compounds, viruses, bacteria, and fungi. The problem of indoor pollution has come to prominence only recently with the emergence of a politically powerful anti-smoking movement. Even non-smoking members of the household suffer from passive smoking. Indoor air quality surveys need to be undertaken. Housing codes should incorporate appropriate measures to prevent such pollution.

Q4. Explain sources of outdoor pollution

A4. The sources of outdoor pollution are: burning of coal or heavy oil, automobile emissions that are products of incomplete combustion of petrol, and emissions from the chemical industry. Control of indoor smoking can increase outdoor pollution, with smokers preferring to smoke in the open air rather than drop the habit. Air pollutants may be organic or inorganic. The inorganic are either gases or particles. The commonest gases are: nitrogen oxides (NO₂), sulfur oxides (SO₂), carbon monoxide (CO), carbon dioxide (CO₂), ozone (O₃), and hydrocarbons. The particles are from mining and construction (lead, asbestos, beryllium, cadmium, mercury, iron) or radioactive material. Ozone is a highly reactive oxidant which irritates mucous membranes and causes pulmonary epithelial inflammation. Levels of ozone are high in the summer. Levels are higher in the day than at night. They are highest in mid-afternoon. Sulfur dioxide, particles, and aerosols are due to burning fossil fuels. Nitrogen oxides from auto emissions affect the immune system. Carbon monoxide is due to incomplete combustion of organic matter. Its primary source is auto emissions. Acute carbon monoxide exposure is a cause of fatal poisoning. Auto emissions contain the following carcinogenic substances: benzene, polycyclic aromatic hydrocarbons (PAH), and nitro-PAH. Exposure to benzene is from 4 sources: cigarette smoke, home solvents, gasoline, and leaky underground tanks that contaminate water supplies. Auto emissions containing lead contaminate vegetables and water. Environmental tobacco smoke (ETS) has two types of effects: mainstream smoke for the smokers themselves and side-stream smoke for the non-smokers.

Q5. Describe methods of assessing the type and extent of air pollution

A5. It is almost impossible to measure individual exposure because levels vary throughout the day. What is practical is environmental measurement. Different substances are measured for each type of pollution. Reducing pollutants produced from fossil fuels are assessed by measuring the concentration of smoke particles, the concentration of sulphur dioxide, and the concentration of sulphuric acid/sulphates. Polycyclic aromatic hydrocarbons, which come from incomplete combustion of fossil fuels, are measured by chromatography. The assessment of photochemical oxidizing pollutants is based on assessing nitrogen oxides (most of them from auto emissions), hydrocarbons (from auto and chemical refinery emissions), and ozone. Carbon monoxide, from auto emissions, cigarette smoking, and combustion, is measured by continuous monitoring. Lead (from burning coal, burning heavy oils, factories, and petrol engine vehicles) is measured by spectrophotometry.

Q6. Briefly describe how you would investigate the relation between one of the pollutants and common disease conditions in Kuala Lumpur

A6. Open. Pollutants produce health effects physically or chemically. Physical effects include injury to skin, irritation and inflammation, and asphyxia from gases. Chemical effects include enzymatic damage and binding to active compounds, which impairs or changes their properties. There are controversies about the level at which pollutants are harmful to health. One view is that there is a threshold dose below which a pollutant is not harmful. The alternative view is that pollutants have harmful effects at any dose. The clinical effects may be acute or chronic. Acute effects occur at high doses of exposure and include: death, pain, irritation, and respiratory disease (asthmatic attacks, wheezing). Chronic effects occur with low continuous doses and include: neurologic disorders, cardiovascular disorders, genetic disorders, cancer, and respiratory problems. Impaired respiratory function in children can lead to growth failure. Pollution exacerbates existing chronic disease. Lead has neuropsychological effects in young children. Carbon monoxide leads to impaired psychomotor performance, headache, nausea, dizziness, and coma. Suspended particles and sulfur dioxide are responsible for bronchitis, lung cancer, and other respiratory disease. Smoking exacerbates the respiratory effects of air pollution. Aromatic hydrocarbons such as benzene lead to leukemia. Release of halogenated hydrocarbons such as chlorofluorocarbons (CFCs) into the atmosphere has increased. They reach the stratosphere, where they react with and deplete the protective ozone layer. This exposes humans to dangerous solar ultraviolet radiation with resultant skin cancer and genetic change.

Q7. Describe the phenomenon of global warming, its causes and its control

A7. Global warming is a consequence of the greenhouse effect, in which greenhouse gases (CO₂, CFCs, methane, and nitrous oxide) prevent the re-radiation of infra-red radiation (heat) from the earth into space. With global warming, the polar ice masses thaw, releasing extra water that causes sea levels to rise. The rising sea levels will affect coastal dwellers, and higher temperatures result in higher demand for electricity for cooling. Agricultural production is also affected since more irrigation is needed. The most important greenhouse gas is CO₂. Carbon dioxide is released into the atmosphere in the following ways: burning of fossil fuels (in electricity generation, gas production, automobiles), deforestation, which decreases the number of plants that consume carbon dioxide in photosynthesis, primitive slash-and-burn agricultural methods, and use of wood for home cooking. Control of CO₂ is achieved by finding alternatives to fossil fuels and growing more vegetation to absorb released carbon dioxide.

Q8. Describe methods of prevention of air pollution

A8. Primary prevention of outdoor air pollution is by controlling or limiting industrial and car emissions. Many countries, including the US and the UK, have clean air legislation controlling car and factory emissions. The USEPA established national standards for allowable concentration levels called the National Ambient Air Quality Standards (NAAQS). The daily reported Pollutant Standards Index (PSI) relates pollutant concentration to health effects. Primary prevention of indoor pollution is to change individual behavior: proper ventilation, avoiding smoking, and testing homes for radon. Secondary prevention is by staying indoors and reducing indoor pollution. Tertiary prevention is surveillance. Surveillance may be instant or continuous.

Problem #5: A foreign company has presented a screening test for Hepatitis B antigen based on saliva examination and has tried to convince the Center for Disease Prevention of the Ministry of Health to adopt the test for mass screening of all school-going children in Malaysia. You have been asked to evaluate the technical aspects of the test and advise the Ministry on what to do.

Q1. What is your definition of screening? Distinguish mass screening from other types of screening that you know

A1. Screening, a type of secondary prevention, is the identification of unrecognized disease by the application of tests, examinations, or other procedures which can be applied easily. Screening can be described as routine or episodic/ad hoc, individual or mass, selective or comprehensive.

Q2. What would be the benefits of screening for HBV?

A2. Its benefits may be public (infectious disease), private (insurance screening), and individual (early treatment and reassurance)

Q3. What would be the disadvantages of screening for HBV

A3. Longer morbidity for untreatable screen-detected cases, over-treatment of borderline cases, false reassurance of false negatives, unnecessary treatment of false positives, risks and costs of the screening tests

Q4. Do you think this test can achieve the objective of early disease detection and treatment? Give your reasons

A4. Open

Q5. If this test is introduced, what measure would be used to assess its effectiveness

A5. Morbidity, mortality, survival, and quality of life.

Q6. Explain biological and epidemiological characteristics of HBV infection that make it a suitable disease for mass screening

A6. A disease suitable for screening must be definable clearly, with known natural history and a relatively long detectable pre-clinical phase, common (high prevalence), serious, and effectively treatable if detected early.

Q7. What characteristics of the proposed screening test would you consider before approving its use

A7. The screening test must be simple, cheap and cost-effective, acceptable, safe, and perform optimally (high sensitivity, high specificity, low false positive, suitable cut-off level, and reliability).

Q8. Discuss how you would plan to evaluate this screening program if the Ministry approved its implementation

A8. Process parameters of screening program effectiveness are accuracy, validity, reliability, and predictive value. The outcome parameters of a screening program are health outcomes (reduction of morbidity, reduction of mortality, survival, and improvement in the quality of life) or economic outcomes. Outcome assessment can be by pre- and post-screening comparisons of the same population or by comparison of morbidity and/or mortality in the screened and non-screened using the case-control or random allocation designs.
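
The validity and predictive value of the test can be computed from a 2x2 table that compares the saliva test with a reference (gold standard) test. A minimal sketch follows; the function name and the counts are hypothetical and are not data about any real test.

```python
def screening_metrics(tp, fp, fn, tn):
    """Validity and predictive values from a 2x2 table of screening test vs. reference test."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased who test positive
        "specificity": tn / (tn + fp),  # non-diseased who test negative
        "ppv": tp / (tp + fp),          # test positives who are truly diseased
        "npv": tn / (tn + fn),          # test negatives who are truly disease-free
    }


# Hypothetical evaluation: 1000 children, 100 of them truly HBsAg positive.
metrics = screening_metrics(tp=90, fp=45, fn=10, tn=855)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up table the test has 90% sensitivity and 95% specificity, yet only two thirds of the positives are true positives. Predictive value depends on prevalence as well as on the test, which is why it must be assessed in the population where screening will take place.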

Q9. Discuss how you would make cost benefit analysis for this program

A9. Cost-benefit analysis is used to decide on program initiation or continuation. The costs include the cost of screening, the cost of diagnosis and treatment, patient costs such as lost earnings, and human emotional and other costs. The quality-adjusted life year (QALY) is used as a summary measure of benefits.