
0900L - MODULE 3.0 DESCRIPTIVE STATISTICS


Copyright by Professor Omar Hasan Kasule Sr


MODULE OUTLINE

3.1 DATA STORAGE, RETRIEVAL, and MANAGEMENT
3.1.1 Data Storage and Retrieval
3.1.2 Data Warehousing
3.1.3 Data Mining
3.1.4 Data Management
3.1.5 Data Transformation

3.2 DATA PRESENTATION AS DIAGRAMS
3.2.1 Data Grouping
3.2.2 Data Tabulation
3.2.3 Data Diagrams Showing One Quantitative Variable
3.2.4 Diagrams Showing 2 or More Quantitative Variables
3.2.5 Shapes of Distributions

3.3 DISCRETE DATA SUMMARY
3.3.1 Definitions
3.3.2 Rates
3.3.3 Hazards
3.3.4 Ratios
3.3.5 Proportions

3.4 CONTINUOUS DATA SUMMARY 1: MEASURES OF CENTRAL TENDENCY
3.4.1 Concept of Averages
3.4.2 Means
3.4.3 Mode
3.4.4 Median
3.4.5 Discussions

3.5 CONTINUOUS DATA SUMMARY 2: MEASURES OF DISPERSION/VARIATION
3.5.1 Types and Sources of Variation
3.5.2 Measures of Variation Based on the Mean
3.5.3 Measures of Variation Based on Quantiles
3.5.4 Other Measures of Variation
3.5.5 Operations / Manipulations


UNIT 3.1
DATA STORAGE and RETRIEVAL

Learning Objectives:
·    Data storage
·    Data retrieval
·    Data security


Key Words and Terms:
·    Data coding
·    Data compression
·    Data encryption
·    Data mining
·    Data modeling
·    Data processing
·    Data protection
·    Data recovery
·    Data reduction
·    Data replication
·    Data retrieval
·    Data storage
·    Data structures
·    Data value


UNIT OUTLINE:
3.1.1 DATA STORAGE
A. Data
B. Document
C. Physical Storage:
D. Data Formatting
E. File Structures

3.1.2 DATA RETRIEVAL
A. Document Surrogates
B. Queries for Retrieval
C. Retrieval by Matching
D. Other Methods of Retrieval
E. Effectiveness of Retrieval

3.1.3 DATA WAREHOUSING
A. Definition
B. Characteristics of Warehouse Data
C. Metadata
D. Data Modeling
E. Online Analytical Processing

3.1.4 DATA MINING
A. Definition
B. The Knowledge Discovery Process
C. Tips for Successful KDD
D. Applications of KDD
E. Artificial Intelligence

3.1.5 DATA REPLICATION
A. Definition
B. Replication Infrastructure
C. Types of Replication


3.1.1 DATA STORAGE
A. DATA
Data gives rise to information that in turn gives rise to knowledge.  Knowledge leads to understanding. Understanding leads to wisdom. Data may be univariate if it has only one variable. It may be bivariate if it has two variables allowing correlation. It may be multivariate allowing more sophisticated analyses.

B. DOCUMENT
A document is stored data in any form: paper, book, letter, message, image, e-mail, voice, and sound. Some documents are ephemeral but can still be retrieved for the brief time that they exist and are recoverable.

C. PHYSICAL STORAGE:
A byte has 8 bits and can therefore represent 2^8 = 256 characters. ASCII is a character encoding that uses only 128 codes (95 printable character codes and 33 control codes). ANSI is an extension of ASCII used by Microsoft. Different languages use different numbers of codes; for example, Greek uses 219 characters, Cyrillic uses 259 characters, Arabic uses 196 characters, and Chinese uses 65,536 characters. Data compression makes document retrieval easier because the search is carried out in a smaller space. Character, image, and sound data can all be compressed; however, compression may involve loss of some data.
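
The byte arithmetic and the lossless case of compression can be checked with a minimal Python sketch. The sample text is invented for illustration, and zlib is used only as one example of a lossless compressor; these notes do not name a particular compression method.

```python
import zlib

# A byte has 8 bits, so it can take 2**8 distinct values.
values_per_byte = 2 ** 8

# ASCII text needs one byte per character.
text = "data storage and retrieval " * 20   # invented sample text
raw = text.encode("ascii")

# Lossless compression: the original bytes are recovered exactly.
packed = zlib.compress(raw)
restored = zlib.decompress(packed)

print(values_per_byte)          # 256
print(len(raw), len(packed))    # the compressed form is smaller for repetitive text
print(restored == raw)          # True
```

Lossy methods, such as those commonly applied to images and sound, would not pass the final equality check.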

D. DATA FORMATTING
Data may be formatted in tables of several types of databases (relational, hierarchical, and network). It may be unformatted such as images, sound, or electronic monitoring in the hospital. Formatted documents are easier to retrieve.

E. FILE STRUCTURES
Files may be sequential files, indexed files, tree structured files, and clustered files.

EXAMPLES OF MEDICAL DATABASES
MEDLINE was established in 1971. Every year 400,000 articles from 3,700 journals are added and are indexed using Medical Subject Headings (MeSH). GRATEFUL MED is a software interface used to search MEDLINE.

PDQ
PDQ is a database about cancer.

3.1.2 DATA RETRIEVAL
A. DOCUMENT SURROGATES
A document surrogate is a brief extract of the original data that helps in the retrieval of the whole document. Examples of document surrogates are: document identifiers, abstracts, extracts, reviews, indexes, matrix representations, term extraction, term association, lexical measures, and trigger phrases. An index must be exhaustive and user-specific. A matrix representation has columns representing terms and rows representing documents. Term extraction is identifying terms that are important in a document by their low frequency, according to Zipf's law, which states that rank x frequency = constant. Term association is looking for a pair of terms that occur near one another, like 'information' and 'retrieval'. Lexical measures of term significance use specialized formulas. Trigger phrases are terms like table, figure, and conclusion. There are about 250-300 common grammatical words that account for 50% of any text, such as the, of, and, to, a, and in. These can be excluded from queries with little loss of efficiency. Stemming algorithms remove the ends of words and leave only the roots. The thesaurus can help in retrieval because it gives synonyms and antonyms of words. Homographs are words that have the same spelling but different meanings. Homophones are words with the same sound but different spellings.
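
As an illustration of stop-word exclusion and the matrix representation, here is a minimal Python sketch. The three documents and the six-word stop list are invented for the example; a real stop list would hold the 250-300 words mentioned above, and the helper name content_terms is hypothetical.

```python
from collections import Counter

# Hypothetical mini-collection; real surrogates would come from full documents.
documents = [
    "the retrieval of information in the library",
    "information retrieval and the storage of data",
    "the storage of data in a database",
]

# A tiny illustrative stop list of common grammatical words.
STOP_WORDS = {"the", "of", "and", "to", "a", "in"}

def content_terms(text):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

term_lists = [content_terms(d) for d in documents]
vocabulary = sorted({t for terms in term_lists for t in terms})

# Matrix representation: one row per document, one column per term.
matrix = [[Counter(terms)[t] for t in vocabulary] for terms in term_lists]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row can then serve as a compact surrogate for its document when queries are matched against the collection.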

B. QUERIES FOR RETRIEVAL
Retrieval technology for formatted character documents is now quite sophisticated. Retrieval technology for images is still in its infancy. Queries are short documents used to retrieve larger documents by matching, mapping, or use of Boolean logic. Examples of Boolean logical connectors are AND, OR, and NOT. For example, a query may be written to retrieve animals AND plants but NOT machines OR minerals. Queries may be in the form of natural language. They may also be written in probabilistic formulations. Fuzzy queries are becoming popular because they can retrieve documents where the more rigid queries fail. A data query can be in the form of a computer program that has terms or keywords used in the data retrieval process. Document retrieval is easier if authors use a controlled vocabulary.
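
The Boolean example can be sketched with Python sets. The index contents are invented, and the query is read here as (animals AND plants) AND NOT (machines OR minerals), which is one plausible interpretation of the example rather than the only one.

```python
# Hypothetical inverted index: term -> set of document identifiers.
index = {
    "animals":  {1, 2, 3, 5},
    "plants":   {2, 3, 4, 5},
    "machines": {3, 6},
    "minerals": {5, 7},
}

def boolean_query(index):
    """Evaluate (animals AND plants) AND NOT (machines OR minerals)."""
    wanted = index["animals"] & index["plants"]        # AND is set intersection
    excluded = index["machines"] | index["minerals"]   # OR is set union
    return wanted - excluded                           # NOT is set difference

result = boolean_query(index)
print(sorted(result))   # [2]
```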

C. RETRIEVAL BY MATCHING
In the most common form of retrieval, the query is matched to the document being sought. The matching must be significant enough to retrieve the right document. Determinations of what is significant must be made. Not all terms in a query are equally important. It may be necessary to give different terms different weightings. Filtering is used to limit the range of search, for example limiting the search to certain years of publication or to a language. A false drop occurs when an irrelevant document is retrieved because it happens to match the query. Retrieval can be carried out by submitting a user profile, which then acts as a query. The profile includes language, educational level, job, interests, and types of journals searched.
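
Weighted matching combined with filtering might look like the following Python sketch. The document records, term weights, year filter, and threshold are all invented for illustration, and the function name search is hypothetical.

```python
# Hypothetical document records; fields and values are invented.
documents = [
    {"id": "D1", "year": 1998, "terms": {"cancer", "screening", "cost"}},
    {"id": "D2", "year": 2004, "terms": {"cancer", "therapy"}},
    {"id": "D3", "year": 2005, "terms": {"screening", "cost", "diabetes"}},
]

# Query terms with weights: a higher weight marks a more important term.
query = {"cancer": 3.0, "screening": 2.0, "cost": 0.5}

def search(documents, query, min_year, threshold):
    """Filter by year, score by the summed weights of matched terms, then rank."""
    hits = []
    for doc in documents:
        if doc["year"] < min_year:      # filtering limits the range of the search
            continue
        score = sum(w for term, w in query.items() if term in doc["terms"])
        if score >= threshold:          # the match must be significant enough
            hits.append((doc["id"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

print(search(documents, query, min_year=2000, threshold=2.5))
```

Raising the threshold reduces false drops at the cost of missing some relevant documents.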

D. OTHER METHODS OF RETRIEVAL
Natural language processing uses syntactic or semantic analyses. In citation analysis the search goes for documents cited in the footnotes. Use can be made of hypertext links made by the author to other documents. Image and sound processing can also be used.

E. EFFECTIVENESS OF RETRIEVAL
The following 2 x 2 table can be used to assess efficiency of retrieval.

RELEVANT?    RETRIEVED: YES    RETRIEVED: NO
YES          a                 b
NO           c                 d

The precision of the retrieval is evaluated as a/(a+c). Recall is evaluated as a/(a+b). Fallout is evaluated as c/{N – (a+b)}, where N = a+b+c+d is the total number of documents. Generality is (a+b)/N. Other methods of assessing efficiency are: (a) the coverage ratio, which is the proportion of relevant documents known to the user that are actually retrieved, and (b) the novelty ratio, which is the proportion of relevant retrieved documents that were previously unknown to the user.
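
The measures can be computed from the four cell counts with a short Python sketch. It assumes a = relevant and retrieved, b = relevant but not retrieved, c = retrieved but not relevant, and d = neither, and it takes generality as (a+b)/N, the share of relevant documents in the collection. The counts in the usage line are invented.

```python
def retrieval_measures(a, b, c, d):
    """Return retrieval measures from the cells of the 2 x 2 table."""
    n = a + b + c + d                       # total documents in the collection
    return {
        "precision": a / (a + c),           # relevant share of what was retrieved
        "recall": a / (a + b),              # retrieved share of what is relevant
        "fallout": c / (n - (a + b)),       # retrieved share of the non-relevant
        "generality": (a + b) / n,          # relevant share of the collection
    }

measures = retrieval_measures(a=30, b=20, c=10, d=940)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts precision is 0.75 and recall is 0.60, showing that a search can be fairly precise while still missing a large share of the relevant documents.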

3.1.3 DATA WAREHOUSING
A. DEFINITION
Data warehousing is a method of extracting data from various sources and storing it as historical, integrated data for use in decision-support systems. A data warehouse is separated from the organization's operational systems. Data from the operational systems is input into the data warehouse, from where it is retrieved for use in decision making. A data warehouse integrates data from all over the organization for purposes of making general reports and making strategic decisions. A data warehouse is usually a read-only database. A data warehouse may be stand-alone or may be part of a LAN. The components of a data warehouse are: access, transformation, distribution, storage, finding, displaying, and analysis.

B. CHARACTERISTICS OF WAREHOUSE DATA
The data in the warehouse is time-variant and non-volatile (i.e., it is not updated but is used). It is subject-oriented and integrated.

C. METADATA
Metadata is a term used for the definition of data stored in the data warehouse (i.e., data about data). The metadata database is used to query the data warehouse. Metadata has the following components: description of the physical state of the data, indexing structure for easy access, and data characteristics. The following are the data characteristics that are usually involved: data format, date data was acquired, date data was compiled, person who compiled the data, method of data compilation, accuracy of data compilation, scale of compilation, interpretation of data items, where the data is available, and relation of the data to other databases.

D. DATA MODELING
A data model is a graphic representation of the data either as diagrams or charts. The data model reflects the essential features of an organization. The model is used for building systems to be used by the organization. Data models can sometimes be very difficult to read necessitating use of standardized models that everybody can understand. Modeling conventions have been developed. They may be syntactic indicating what symbols mean. They may be positional indicating how symbols are organized on a page.  They may be semantic informing how to group entities according to their meaning.  Data modeling is a prerequisite for data warehousing and supports decision making better. A model is a theoretical representation of the real world using mathematical language. Models simplify the complex world. The purpose of a data model is to facilitate communication between the analyst and the user. It also helps create a logical discipline in database design.

E. ONLINE ANALYTICAL PROCESSING
Online Analytical Processing (OLAP) is live and ad hoc data access and analysis for purposes of decision support systems. OLAP can manipulate data in many ways. Parallel computing increases the speed and efficiency of data analysis.
