Statistical Computing Certificate Course
Batch -10
Organized by
Department of Statistics University of Rajshahi
Introduction to Data
Prof. Dr. Md. Rezaul Karim Email: mrezakarim@yahoo.com
July 05, 2014
Outline
1. Data, Information and Knowledge
2. Stages of a Research Process
3. Sampling Technique
4. Data Classification
5. Data Processing
6. Graphical Representation of Data
7. Summary Chart Concerning Analysis of Data
Prof. Dr. M. Rezaul Karim, Statistics, RU
1. Data, Information and Knowledge
Data are collected to extract information, which in turn generates knowledge. An understanding of this chain is important in any data collection and analysis.
The word "data" was originally the plural of "datum", which means a single piece of information.
Data are typically the results of measurements: individual pieces of information.
Raw data (also known as primary data or unprocessed data) are data collected from a source.
Field data are raw data collected in an uncontrolled environment, that is, in the field.
Experimental data are data generated within the context of a scientific investigation by observation and recording.
Information is extracted from data through analysis.
(i) Data represents a fact or statement of an event without relation to other things. Information embodies the understanding of a relationship of some sort, possibly cause and effect (Bellinger et al. 1997).
(ii) Data are raw facts that have not been organized or cannot possibly be interpreted. Information is data that are understood. Information comes from the relationship between pieces of data (Benyon 1990).
Knowledge is the ability of individuals to understand the information and the manner in which the information is used in a specific context, as illustrated by the following:
(i) Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described and what will happen next (Bellinger et al. 1997).
(ii) Data gets transformed into information through an understanding of the relationships, and information yields knowledge through an understanding of the patterns (Bellinger et al. 1997).
Example:
The height of Mt. Everest is generally considered "data" [its peak is 8,848 meters (29,029 ft) above sea level].
A book on Mt. Everest's geological characteristics may be considered "information".
A report containing practical information on the best way to reach Mt. Everest's peak may be considered "knowledge".
The link between data, information and knowledge can be characterized through the DIKW (Data, Information, Knowledge, and Wisdom) hierarchy, a term attributed to Ackoff (1989).
According to this, the content of the human mind can be classified into five categories:
1. Data: symbols
2. Information: data that are processed to be useful;
provides answers to "who", "what", "where", and
"when" questions
3. Knowledge: application of data and information;
answers "how" questions
4. Understanding: appreciation of "why"
5. Wisdom: evaluated understanding.
2. Stages of a Research Process
Problem Discovery and Definition -> Research Design -> Sampling -> Data Gathering -> Data Processing and Analysis -> Conclusions and Report -> Discovery and Definition, and so on
3. Sampling Technique
Sampling
A process of selecting units from a population
A process of selecting a sample to determine certain characteristics of a population
[Figure: a sample drawn from a population]
From the characteristics of a sample, we can infer the characteristics of the population, provided the sample is representative of the population.
Population: The complete set of people or things being studied. It could be all the citizens in a country, all farms in a region, or all children under the age of five in a country, etc.
Sample: A (representative) subset of the population that is actually studied, and from which the raw data are actually obtained.
Sampling frame: A complete list of every unit in the population of interest. A sampling frame is needed to select a random sample.
Sample design: The method of sample selection.
Random (or Probability) Sampling
Simple Random Sampling:
By definition, a simple random sample is one selected so that each element in the population has an equal chance of being included in the sample.
A lottery draw is a good example of simple random sampling: a sample of 5 numbers is randomly generated from a population of 48, with each number having an equal chance of being selected.
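The lottery draw described above can be sketched in a few lines of Python. This is only an illustration: the function name `lottery_draw` and the seed value are assumptions, not part of the course material.

```python
import random

def lottery_draw(population_size=48, sample_size=5, seed=None):
    """Draw a simple random sample without replacement from 1..population_size."""
    rng = random.Random(seed)  # a seed is used only to make the sketch repeatable
    population = range(1, population_size + 1)
    # random.sample gives every element the same chance of being selected
    return sorted(rng.sample(population, sample_size))

draw = lottery_draw(seed=2014)
print(draw)
```

Sampling is done without replacement, so no number can appear twice in one draw.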
Stratified Random Sampling:
In this sampling technique the population is divided into two or more homogeneous subgroups, or strata, and a simple random sample is taken from each subgroup.
Suppose a farmer wishes to work out the average milk yield of each cow type in his herd, which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide his herd into the four subgroups and take a sample from each.
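A minimal sketch of the farmer's procedure, assuming the herd is already recorded by breed. The milk yields below are made-up numbers for illustration only, and `stratified_sample` is a hypothetical helper name.

```python
import random
from statistics import mean

# Hypothetical daily milk yields in litres, keyed by breed (the stratum).
HERD = {
    "Ayrshire": [18.2, 20.1, 19.4, 21.0, 17.8],
    "Friesian": [25.3, 27.0, 26.1, 24.8, 28.2],
    "Galloway": [9.5, 10.2, 8.9, 11.0, 9.8],
    "Jersey": [16.0, 15.2, 17.1, 16.5, 14.9],
}

def stratified_sample(strata, per_stratum, seed=None):
    """Take a simple random sample of the same size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, per_stratum) for name, units in strata.items()}

sample = stratified_sample(HERD, per_stratum=3, seed=10)
for breed, yields in sample.items():
    print(f"{breed}: sampled {yields}, mean yield {mean(yields):.1f}")
```

Every breed is guaranteed to appear in the sample, which a single simple random sample of the whole herd would not ensure.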
Systematic Sampling:
A sample drawn from a list using a random start followed by a fixed sampling interval.
Often used in industry, where an item is selected for testing from a production line (say, every fifteen minutes) to ensure that machines and equipment are working to specification.
Alternatively, the manufacturer might decide to select every 20th item on a production line to test for defects and quality.
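The "every 20th item" rule can be sketched as follows. The production line of 200 numbered items and the function name `systematic_sample` are assumptions made for the example.

```python
import random

def systematic_sample(items, interval, seed=None):
    """Pick a random start within the first interval, then every interval-th item."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random start in 0..interval-1
    return items[start::interval]

production_line = list(range(1, 201))  # hypothetical item serial numbers 1..200
tested = systematic_sample(production_line, interval=20, seed=1)
print(tested)
```

Only the starting point is random; once it is chosen, the rest of the sample is fixed by the interval.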
Cluster Sampling
Cluster sampling is another form of random sampling. A cluster is any naturally occurring aggregate of the units that are to be sampled. Thus households (or homes) are clusters of people, and towns are clusters of households.
Cluster samples are most often used when:
o one does not have a complete list of everyone in the population of interest but does have a complete list of the clusters in which they occur, or
o one has a complete list of everyone, but they are so widely dispersed that it would be too time consuming and expensive to send data collectors out to a simple random sample.
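A one-stage cluster sample can be sketched as below: clusters are chosen at random, and everyone in a chosen cluster is included. The households and member names are invented for illustration, and `cluster_sample` is a hypothetical helper.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random; every unit in a chosen cluster is kept."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names only
    return {name: clusters[name] for name in chosen}

# Hypothetical households (clusters) and the people living in them (units).
households = {
    "H1": ["Ana", "Bo"],
    "H2": ["Chen", "Dia", "Eli"],
    "H3": ["Fay"],
    "H4": ["Gus", "Hana"],
}

selected = cluster_sample(households, n_clusters=2, seed=3)
print(selected)
```

Only a list of households is needed to draw the sample, not a list of every person.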
Two Stage and Multistage Cluster Sample
Nonrandom (nonprobability) Sampling
Quota sampling
A sample in which a specific number of different types of units are selected. For example, we may want to interview 10 teachers and decide that five will be men and five will be women.
Snowball sampling
This type of sampling is used when we do not know who or what should be included. Typically used in interviews: we ask the interviewees who else we should talk to, and continue until no new suggestions are obtained.
Judgmental sampling
In this kind of sample, selections are made based on
pre-determined criteria that, in your judgment, will
provide the data you need. For example, you may
want to interview primary school principals and
decide to interview some from rural areas as well as
some from urban areas (but no quota is established).
Convenience sampling
In this type, selections are made based on the
convenience to the evaluator. Principals from local
schools may be selected because they are near where
the evaluators are located.
4. Data Classification
Data can be classified in several ways, for example:
1. According to Representation level
Qualitative (or Categorical) data
Quantitative (or numerical) data
Qualitative Data
Qualitative data take on values that are categories,
characteristic names or labels or descriptions
Data can be observed but not measured
Example: Gender, Colors, smells, tastes,
appearance, beauty, etc.
Data analysis includes the coding of the data
Quantitative Data
Quantitative data are numeric and represent a
measurable quantity with numbers
Data which can be measured
Example: Length, height, weight, speed, time,
cost, ages, etc.
Data analysis is mainly statistical
2. According to Measurement
Discrete
Continuous
Discrete Data
Countable numerical observation
o whole numbers only
o has an equal whole number interval
o obtained through counting
Example: number of occurrences, number of students, etc.
Continuous Data
Measurable observations
Decimals or fractions
Obtained through measuring
Example: Height, Weight, Bank deposits, Volume
of liquid, etc.
3. According to Source
Primary data
o First-hand information
o Example: Autobiography, a financial statement recorded for the first time, etc.
Secondary data
o Second-hand information
o Example: Weather forecasts from newspapers, data taken from published journals, books, web pages, etc.
4. According to Arrangement
Ungrouped data
o Raw data
o No specific arrangement
Grouped data
o Organized set of data
o At least 2 groups
o Arranged in any order
5. According to dependency of time
Time series data
o a sequence of data points, measured typically at
successive points in time spaced at uniform time
intervals
o Example: weekly share prices, daily rainfall,
temperature, etc.
Cross-section or Cross-sectional data
o data collected by observing many subjects (such as
individuals, firms, countries, etc.) at the same point
of time, or without regard to differences in time.
o Example: Weight and height of 100 randomly selected people.
6. According to Measuring scale
Nominal
Ordinal
Interval
Ratio
Nominal scale
Nominal scale is simply a system of assigning number symbols to events in order to label them.
Nominal scales provide convenient ways of keeping track of people, objects and events.
Data are categorical.
Examples: car brands, gender, marital status (coded as 1 for single, 2 for married, 3 for widowed or 4 for divorced).
Allowable operations: counts only; no ranking or numerical operations.
For instance, if we record marital status as 1, 2, 3, or 4 as stated above, we cannot write
4 > 2 or 3 < 4,
3 - 1 = 4 - 2,
1 + 3 = 4 or 4/2 = 2.
Descriptive Statistics used:
o mode (most often observed data category), and
o percent.
Note: averages (mean) and standard deviation are not appropriate!
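As a sketch of the two summaries that are appropriate for nominal data, the code below computes the mode and the percentages for marital status codes. The ten responses are hypothetical; the code-to-label mapping follows the coding given earlier (1 single, 2 married, 3 widowed, 4 divorced).

```python
from collections import Counter

# Codes are labels only; their numeric order and size carry no meaning.
MARITAL_STATUS = {1: "single", 2: "married", 3: "widowed", 4: "divorced"}

responses = [1, 2, 2, 4, 1, 2, 3, 2, 1, 2]  # hypothetical survey responses

counts = Counter(MARITAL_STATUS[code] for code in responses)
mode_label, mode_count = counts.most_common(1)[0]
percentages = {label: 100 * n / len(responses) for label, n in counts.items()}

print("mode:", mode_label)
print("percentages:", percentages)
```

Translating the codes to labels before summarizing makes it harder to average them by accident.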
Ordinal scale
Data are categorical with a rank-order relationship.
One value is greater or less than another, but the magnitude of the difference is unknown.
Examples: a student's rank in his class, rating scales (severity of damage on a scale of 1 to 4; quality of sound of a speaker).
Allowable operations: counts and ranking; no numerical operations.
For instance, if A's position in his class is 10 and B's position is 40, it cannot be said that A's position is four times as good as that of B.
Interval scale
Data are numerical values on an equal-interval scale.
(Note: on an interval scale there is no true zero; the scale cannot measure the complete absence of a trait or characteristic.)
Example: temperature in °C.
Allowable operations: ranking; addition and subtraction (and therefore averaging); multiplication and division are not meaningful.
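A small numerical illustration of why ratios are not meaningful on an interval scale. The two temperatures are arbitrary example values, and the Kelvin conversion is brought in from outside the slides purely as a contrast: Kelvin has a true zero, Celsius does not.

```python
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0  # two example temperatures in degrees Celsius

difference = b_c - a_c   # meaningful on an interval scale: 10 degrees apart
naive_ratio = b_c / a_c  # 2.0, but "twice as hot" is not a valid reading
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(difference, naive_ratio, round(kelvin_ratio, 3))
```

The ratio changes when the zero point changes, while the difference does not; that is the practical meaning of "no true zero".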
Ratio scale
Data are numerical values on an equal-interval scale with a uniquely defined zero. (For example, the zero point on a centimeter scale indicates the complete absence of length or height.)
Examples: height, weight, distance, time to failure of an item, cost of repair, number of replacements under warranty.
Allowable operations: all ordinary numerical and mathematical operations.
Scale of measurement
Note:
It is essential to understand the above differences in the nature of data and to choose appropriate methods to store and analyze them.
Many software tools (e.g. MS Excel and R) do not automatically understand the nature of the data, so we need to define the data explicitly for those tools.
Nominal -> Ordinal -> Interval -> Ratio (from least precise to most precise)
7. According to Failure/Survival characteristic
Complete or Exact failure data
Incomplete or Censored data
Failure and Censored data
1. Complete or Exact failure data
The value of each sample unit is observed or known.
Example: the fan failed at exactly 500 days
2. Incomplete or Censored data
All of the units in the sample may not have failed or the exact times-to-failure of all the units are not
known.
Example: the fan had not yet failed at 500 days;
the fan failed sometime before 500 days
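One way to store such observations is to keep a status flag next to each recorded time, as in the sketch below. The names `Status` and `Observation` are hypothetical, and the labels "right-censored" and "left-censored" are the usual reliability terms for the two censored fan examples; they do not appear in the slides.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    FAILED = "exact failure time observed"
    RIGHT_CENSORED = "still working at the recorded time"
    LEFT_CENSORED = "failed sometime before the recorded time"

@dataclass(frozen=True)
class Observation:
    days: float
    status: Status

# The three fan examples from the slide, all recorded at 500 days.
fans = [
    Observation(500, Status.FAILED),
    Observation(500, Status.RIGHT_CENSORED),
    Observation(500, Status.LEFT_CENSORED),
]

exact = [obs for obs in fans if obs.status is Status.FAILED]
censored = [obs for obs in fans if obs.status is not Status.FAILED]
print(len(exact), "exact,", len(censored), "censored")
```

The same number, 500 days, means three different things here, so the time alone is not enough to analyze the data.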
Data related problems
Data problems include
Too much data (massive data sets, irrelevant data)
Corrupt and/or noisy data
Too little data (missing entries, missing variables, too few observations)
Fractured data (multiple sources, incompatible data, data obtained at different levels)
Techniques for dealing with these problems
include
Data transformation (data filtering, ordering, editing, and modeling)
Interactive techniques (data visualization, elimination, selection, identification of principal
components, sampling)
New information generation (time series analysis, data fusion, simulation, dimensional analysis,
etc.)
5. Data Processing
What is Data Processing?
Data -> Process -> Information
o Data: the raw facts; recorded measures of certain phenomena
o Process: editing, coding, classification and tabulation of the collected data
o Information: facts in a form suitable for decision making by researchers
Data Processing Overview
Validation & Editing -> Coding -> Classification -> Tabulation -> Using percentages
Step One:
Validation: Confirming that the interviews/surveys occurred.
Editing: The procedure that improves the quality of the data for coding, that is, the process of checking and adjusting the data for:
o Consistency
o Completeness
o Questions answered out of order
Step Two:
Coding: Grouping and assigning numeric codes to the question responses. (Codes may also be other character symbols.)
Rules for coding:
o Categories should be exhaustive
o Categories should be mutually exclusive and independent
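The two coding rules can be checked mechanically. The sketch below codes a respondent's age into one of four groups and raises an error if an answer matches no category (not exhaustive) or more than one (not mutually exclusive). The age bands and the names `AGE_CODEBOOK` and `code_response` are assumptions chosen for the example.

```python
# Hypothetical codebook: code -> (lower bound inclusive, upper bound exclusive).
AGE_CODEBOOK = {
    1: (0, 18),
    2: (18, 40),
    3: (40, 65),
    4: (65, float("inf")),
}

def code_response(age, codebook):
    """Return the single code whose interval contains the response."""
    matches = [code for code, (low, high) in codebook.items() if low <= age < high]
    if len(matches) != 1:
        # zero matches: categories not exhaustive; several: not mutually exclusive
        raise ValueError(f"age {age} matches {len(matches)} categories")
    return matches[0]

codes = [code_response(age, AGE_CODEBOOK) for age in [17, 18, 39, 70]]
print(codes)
```

Half-open intervals are used so that a boundary value such as 18 falls into exactly one category.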
Step Three:
Classification: Large volumes of raw data are reduced to homogeneous groups (if we are to get meaningful relationships). Classification can be (i) according to attributes or (ii) according to class intervals.
Step Four:
Tabulation: Tabulation is the process of summarizing raw
data and displaying the same in compact form (i.e., in the
form of statistical tables) for further analysis.
Step Five:
Percentages: Percentages are often used in data
presentation as they simplify numbers, reducing all of
them to a 0 to 100 range.
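Classification by class intervals, tabulation and percentages can be combined in one short sketch. The examination marks and the class width of 10 are hypothetical, and `tabulate` is an illustrative helper rather than a standard function.

```python
from collections import Counter

def tabulate(values, width):
    """Group integer values into class intervals and report counts and percentages."""
    counts = Counter((v // width) * width for v in values)  # lower class limits
    total = len(values)
    rows = []
    for lower in sorted(counts):
        n = counts[lower]
        rows.append((f"{lower}-{lower + width - 1}", n, round(100 * n / total, 1)))
    return rows

marks = [35, 42, 47, 51, 55, 58, 63, 64, 68, 77]  # hypothetical raw data
table = tabulate(marks, width=10)

print(f"{'Class':<8}{'Count':>6}{'Percent':>9}")
for label, n, pct in table:
    print(f"{label:<8}{n:>6}{pct:>9}")
```

The ten raw marks reduce to five rows, and the percentage column puts every class on the same 0 to 100 range regardless of sample size.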
6. Graphical representation of Data
Commonly used graphs and their uses in analysis of data.
[Table: graphs (rows) versus objective of analysis (columns)]
Objective of analysis:
o Assess relationships between pairs of variables
o Assess distributions
o Compare summaries or individual values of a variable
o Assess distributions of counts
o Plot a series of data over time
o Assess relationships among three variables
Graphs: 3D Scatter Plot, 3D Surface Plot, Area Graph, Bar Chart, Box Plot, Contour Plot, Dot Plot, Empirical CDF, Histogram, Individual Value Plot, Interval Plot, Marginal Plot, Matrix Plot, Pie Chart, Probability Plot, Scatter Plot, Stem-and-Leaf, Time Series Plot
7. Summary Chart Concerning Analysis of Data
References
Ackoff, R.L. (1989). From Data to Wisdom. J. Appl. Sys. Analysis, 16:3-9.
Bellinger, G., Castro, D., Mills, A. (1997). Data, information, knowledge, and wisdom. From http://www.outsights.com/systems/dikw/dikw.htm
Benyon, D. (1990). Information and Data Modeling. Alfred Waller, Henley-on-Thames.
Blischke, W.R., Karim, M.R. and Murthy, D.N.P. (2011). Warranty Data Collection and Analysis. Springer-Verlag London Ltd.
Kothari, C.R. (2004). Research Methodology: Methods and Techniques, 2nd Ed., New Age International (P) Ltd.,
Thank you
Any Questions?