MRK Lecture 01 Introduction to Data 20140705 4x4

  • 1

    Statistical Computing Certificate Course

    Batch -10

    Organized by

    Department of Statistics University of Rajshahi

    Introduction to Data

    Prof. Dr. Md. Rezaul Karim Email: [email protected]

    July 05, 2014

    Outline

    1. Data, Information and Knowledge

    2. Stages of a Research Process

    3. Sampling Technique

    4. Data Classification

    5. Data Processing

    6. Graphical Representation of Data

    7. Summary Chart Concerning Analysis of Data

    Prof. Dr. M. Rezaul Karim, Statistics, RU

    1. Data, Information and Knowledge

    Data are collected to extract information, which in turn generates knowledge. Understanding this chain is important in any data collection and analysis.

    The word "data" was traditionally treated as the plural of "datum", which means a single piece of information.

    Data are typically the results of measurements: individual pieces of information.

  • 2

    Raw data (also known as primary data or unprocessed data) are data collected directly from a source.

    Field data are raw data collected in an uncontrolled environment, that is, in the field.

    Experimental data are data generated within the context of a scientific investigation by observation and recording.

    Information is extracted from data through analysis.

    (i) Data represent a fact or statement of an event without relation to other things. Information embodies the understanding of a relationship of some sort, possibly cause and effect (Bellinger et al. 1997).

    (ii) Data are raw facts that have not been organized or cannot possibly be interpreted. Information is data that are understood. Information comes from the relationships between pieces of data (Benyon 1990).

    Knowledge is the ability of individuals to understand the information and the manner in which the information is used in a specific context, as illustrated by the following:

    (i) Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described and what will happen next (Bellinger et al. 1997).

    (ii) Data are transformed into information through an understanding of the relationships, and information yields knowledge through an understanding of the patterns (Bellinger et al. 1997).

    Example:

    The height of Mt. Everest is generally considered "data" [its peak is 8,848 meters (29,029 ft) above sea level].

    A book on Mt. Everest's geological characteristics may be considered "information", and

    a report containing practical information on the best way to reach Mt. Everest's peak may be considered "knowledge".

  • 3

    The link between data, information and knowledge can be characterized through the DIKW (Data, Information, Knowledge, and Wisdom) hierarchy, a term attributed to Ackoff (1989).

    According to this, the content of the human mind can be classified into five categories:

    1. Data: symbols

    2. Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions

    3. Knowledge: application of data and information; answers "how" questions

    4. Understanding: appreciation of "why"

    5. Wisdom: evaluated understanding

    2. Stages of a Research Process

    The research process moves through the following stages:

    1. Problem Discovery and Definition

    2. Research Design

    3. Sampling

    4. Data Gathering

    5. Data Processing and Analysis

    6. Conclusions and Report

    The cycle then returns to Discovery and Definition, and so on.

  • 4

    3. Sampling Technique

    Sampling

    Sampling is the process of selecting units from a population, that is, selecting a sample in order to determine certain characteristics of the population.

    [Figure: a sample drawn from a population]

    From the characteristics of a sample we can infer the characteristics of the population, provided the sample is representative of the population.

    Population: The complete set of people or things being studied. It could be all the citizens of a country, all farms in a region, or all children under the age of five in a country, etc.

    Sample: A (representative) subset of the population that is actually studied, and from which the raw data are actually obtained.

    Sampling frame: A complete list of every unit in the population of interest. A sampling frame is needed to select a random sample.

    Sample design: The method of sample selection.

  • 5

    Random (or Probability) Sampling

    Simple Random Sampling:

    By definition, a simple random sample is one selected so that each element in the population has an equal chance of being included in the sample. (More generally, in any probability sample each element has a known chance of selection.)

    A lottery draw is a good example of simple random sampling: a sample of 5 numbers is randomly generated from a population of 48, with each number having an equal chance of being selected.
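
    The lottery example can be sketched in a few lines of Python. This is only an illustration: the function name and the seed are assumptions made here, and the numbering 1 to 48 is one natural reading of "a population of 48".

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every unit is equally likely to be chosen."""
    rng = random.Random(seed)            # optional seed, only for reproducibility
    return rng.sample(list(population), n)


lottery_numbers = range(1, 49)           # assumed population: the numbers 1..48
draw = simple_random_sample(lottery_numbers, 5, seed=2014)
print(sorted(draw))
```

    Sampling is done without replacement, so no number can appear twice in the draw.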

    Stratified Random Sampling:

    In this sampling technique the population is divided into two or more homogeneous subgroups, or strata, and a simple random sample is taken from each subgroup.

    Suppose a farmer wishes to work out the average milk yield of each cow type in his herd, which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide his herd into the four subgroups and take samples from these.
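
    A minimal sketch of the farmer's procedure, assuming a made-up herd of 40 cows and a fixed sample of 3 cows per breed (both numbers, and all names in the code, are hypothetical):

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Split units into strata, then take a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {
        name: rng.sample(members, min(n_per_stratum, len(members)))
        for name, members in strata.items()
    }


breeds = ["Ayrshire", "Friesian", "Galloway", "Jersey"]
# Hypothetical herd: 40 cows, 10 of each breed, stored as (tag, breed) pairs.
herd = [(f"cow{i:03d}", breeds[i % 4]) for i in range(40)]

sample = stratified_sample(herd, stratum_of=lambda cow: cow[1],
                           n_per_stratum=3, seed=1)
for breed, cows in sample.items():
    print(breed, [tag for tag, _ in cows])
```

    Equal allocation per stratum is only one possible choice; allocation proportional to stratum size is another.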

    Systematic Sampling:

    A sample drawn from a list using a random start followed by a fixed sampling interval.

    Often used in industry, where an item is selected for testing from a production line (say, every fifteen minutes) to ensure that machines and equipment are working to specification.

    Alternatively, the manufacturer might decide to select every 20th item on a production line to test for defects and quality.
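
    The "every 20th item" rule can be sketched as follows, assuming a hypothetical run of 200 items identified by serial number:

```python
import random


def systematic_sample(items, interval, seed=None):
    """Pick a random start within the first interval, then every interval-th item."""
    rng = random.Random(seed)
    start = rng.randrange(interval)      # random start position: 0 .. interval-1
    return items[start::interval]


production_line = list(range(1, 201))    # hypothetical serial numbers 1..200
tested = systematic_sample(production_line, interval=20, seed=7)
print(tested)
```

    Only the starting point is random; once it is fixed, the rest of the sample is determined by the interval.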

    Cluster Sampling:

    Cluster sampling is another form of random sampling. A cluster is any naturally occurring aggregate of the units that are to be sampled. Thus households (or homes) are clusters of people, and towns are clusters of households.

    Cluster samples are most often used when:

    o we do not have a complete list of everyone in the population of interest but do have a complete list of the clusters in which they occur, or

    o we have a complete list of everyone, but they are so widely dispersed that it would be too time-consuming and expensive to send data collectors out to a simple random sample.

    Two-Stage and Multistage Cluster Sample
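
    A rough sketch of one-stage and two-stage cluster sampling, using households as clusters of people. The data are invented, and the two-stage variant shown here (a simple random sample of people inside each chosen household) is one common form, assumed for illustration:

```python
import random


def cluster_sample(clusters, n_clusters, n_per_cluster=None, seed=None):
    """Stage 1: choose clusters at random. Stage 2 (optional): subsample each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for name in chosen:
        members = clusters[name]
        if n_per_cluster is None:        # one-stage: keep every unit in the cluster
            result[name] = list(members)
        else:                            # two-stage: random subsample within cluster
            result[name] = rng.sample(members, min(n_per_cluster, len(members)))
    return result


# Hypothetical frame: 10 households, each with 4 people.
households = {
    f"household{h}": [f"h{h}-person{p}" for p in range(1, 5)]
    for h in range(1, 11)
}

one_stage = cluster_sample(households, n_clusters=3, seed=3)
two_stage = cluster_sample(households, n_clusters=3, n_per_cluster=2, seed=3)
print(one_stage)
print(two_stage)
```

    Note that only the list of households is needed up front, not a list of every person.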

  • 6

    Nonrandom (Nonprobability) Sampling

    Quota sampling

    A sample in which a specific number of different types of units are selected. For example, we may want to interview 10 teachers and decide that five will be men and five will be women.

    Snowball sampling

    This type of sampling is used when we do not know who or what should be included. It is typically used in interviews: we ask the interviewees who else we should talk to, and continue until no new suggestions are obtained.

    Judgmental sampling

    In this kind of sample, selections are made based on pre-determined criteria that, in the evaluator's judgment, will provide the data needed. For example, we may want to interview primary school principals and decide to interview some from rural areas as well as some from urban areas (but no quota is established).

    Convenience sampling

    In this type, selections are made based on convenience to the evaluator. Principals from local schools may be selected because they are near where the evaluators are located.

    4. Data Classification

    Data can be classified in several ways, for example:

    1. According to Representation level

    o Qualitative (or categorical) data

    o Quantitative (or numerical) data

  • 7

    Qualitative Data

    o Qualitative data take on values that are categories, characteristic names, labels or descriptions

    o Data can be observed but not measured

    o Example: gender, colors, smells, tastes, appearance, beauty, etc.

    o Data analysis includes the coding of the data

    Quantitative Data

    o Quantitative data are numeric and represent a measurable quantity

    o Data can be measured

    o Example: length, height, weight, speed, time, cost, ages, etc.

    o Data analysis is mainly statistical

    2. According to Measurement

    o Discrete

    o Continuous

    Discrete Data

    Countable numerical observations:

    o whole numbers only

    o has an equal whole-number interval

    o obtained through counting

    Example: number of occurrences, number of students, etc.

  • 8

    Continuous Data

    Measurable observations:

    o decimals or fractions

    o obtained through measuring

    Example: height, weight, bank deposits, volume of liquid, etc.

    3. According to Source

    Primary data

    o First-hand information

    o Example: autobiography, first-time taken financial statement, etc.

    Secondary data

    o Second-hand information

    o Example: weather forecast from newspapers, data taken from published journals, books, webpages, etc.

    4. According to Arrangement

    Ungrouped data

    o Raw data

    o No specific arrangement

    Grouped data

    o Organized set of data

    o At least 2 groups

    o Arranged in any order

    5. According to dependency on time

    Time series data

    o a sequence of data points, typically measured at successive points in time spaced at uniform time intervals

    o Example: weekly share prices, daily rainfall, temperature, etc.

    Cross-section or cross-sectional data

    o data collected by observing many subjects (such as individuals, firms, countries, etc.) at the same point in time, or without regard to differences in time

    o Example: weight and height of 100 randomly selected people

  • 9

    6. According to Measuring scale

    o Nominal

    o Ordinal

    o Interval

    o Ratio

    Nominal scale

    A nominal scale is simply a system of assigning number symbols to events in order to label them.

    Nominal scales provide convenient ways of keeping track of people, objects and events.

    Data are categorical.

    Examples: car brands, gender, marital status (coded as 1 for single, 2 for married, 3 for widowed or 4 for divorced).

    Allowable operations: counts only; no ranking or numerical operations.

    For instance, if we record marital status as 1, 2, 3, or 4 as stated above, we cannot write 4 > 2 or 3 < 4, 3 − 1 = 4 − 2, 1 + 3 = 4 or 4/2 = 2.

    Descriptive statistics used:

    o mode (most often observed data category), and

    o percent.

    Note: averages (mean) and standard deviation are not appropriate!
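
    One way to see why the mean is inappropriate for nominal codes: relabel the categories and the "average" changes, while the mode does not. The responses and the second coding below are made up for illustration; the first coding follows the marital-status codes given above.

```python
from statistics import mean, mode

# Hypothetical survey responses.
status = ["single", "married", "married", "divorced", "married", "widowed", "single"]

coding_a = {"single": 1, "married": 2, "widowed": 3, "divorced": 4}   # as in the slides
coding_b = {"married": 1, "single": 2, "divorced": 3, "widowed": 4}   # arbitrary relabelling

codes_a = [coding_a[s] for s in status]
codes_b = [coding_b[s] for s in status]

# The mean depends on which arbitrary numbers were used as labels ...
mean_a, mean_b = mean(codes_a), mean(codes_b)

# ... whereas the mode, decoded back to its label, is the same category.
decode_a = {code: label for label, code in coding_a.items()}
decode_b = {code: label for label, code in coding_b.items()}
mode_a = decode_a[mode(codes_a)]
mode_b = decode_b[mode(codes_b)]

print(mean_a, mean_b)
print(mode_a, mode_b)
```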

  • 10

    Ordinal scale

    Data are categorical with a rank-order relationship.

    One value is greater or less than another, but the magnitude of the difference is unknown.

    Examples: a student's rank in his class; rating scales (severity of damage on a scale of 1 to 4; quality of sound of a speaker).

    Allowable operations: counts and ranking; no numerical operations.

    For instance, if A's position in his class is 10 and B's position is 40, it cannot be said that A's position is four times as good as that of B.

    Interval scale

    Data are numerical values on an equal-interval scale.

    (Note: on an interval scale there is no true zero; it does not have the capacity to measure the complete absence of a trait or characteristic.)

    Example: temperature in °C.

    Allowable operations: ranking; addition and subtraction (and therefore averaging); multiplication and division are not meaningful.
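
    A small worked example of why ratios are not meaningful on an interval scale: the same two temperatures give different ratios in °C and °F, because each scale places its zero arbitrarily, while differences simply rescale by a constant factor. The two temperatures are arbitrary values chosen for illustration.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32


a_c, b_c = 20.0, 10.0                       # two temperatures in °C
a_f = celsius_to_fahrenheit(a_c)            # 68.0 °F
b_f = celsius_to_fahrenheit(b_c)            # 50.0 °F

ratio_c = a_c / b_c                         # 2.0: "twice as hot"?
ratio_f = a_f / b_f                         # 1.36: the claim does not survive a unit change

diff_c = a_c - b_c                          # 10.0 degrees C
diff_f = a_f - b_f                          # 18.0 degrees F, i.e. 10.0 * 9/5

print(ratio_c, ratio_f)
print(diff_c, diff_f)
```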

    Ratio scale

    Data are numerical values on an equal-interval scale with a uniquely defined zero. (For example, the zero point on a centimeter scale indicates the complete absence of length or height.)

    Examples: height, weight, distance, time to failure of an item, cost of repair, number of replacements under warranty.

    Allowable operations: all ordinary numerical and mathematical operations.

    Scale of measurement

    Note:

    It is essential to understand the above differences in the nature of data in order to choose appropriate methods to store and analyze them.

    Many software tools (e.g. MS Excel and R) do not automatically understand the nature of the data, so we need to define the data types explicitly for those tools.
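
    As an illustration of declaring the scale explicitly, the sketch below attaches a scale to a variable and computes only the summaries that make sense for it. The `Scale` and `summarize` names are hypothetical, and allowing the median from the ordinal level upward is a common convention assumed here, not something stated in the slides.

```python
from enum import IntEnum
from statistics import mean, median_low, mode


class Scale(IntEnum):
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def summarize(values, scale):
    """Return only the summaries that are meaningful for the declared scale."""
    summary = {"mode": mode(values)}              # counting works on every scale
    if scale >= Scale.ORDINAL:
        summary["median"] = median_low(values)    # needs ordering only (assumed convention)
    if scale >= Scale.INTERVAL:
        summary["mean"] = mean(values)            # needs equal intervals
    return summary


damage_rating = [1, 2, 2, 3, 4]                   # hypothetical ordinal ratings, 1 to 4
repair_cost = [120.0, 80.0, 100.0]                # hypothetical ratio-scale costs

print(summarize(damage_rating, Scale.ORDINAL))
print(summarize(repair_cost, Scale.RATIO))
```

    The software cannot infer that the damage ratings are ordinal from the numbers alone; the analyst has to say so.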

  • 11

    Nominal → Ordinal → Interval → Ratio (from least precise to most precise)

    7. According to Failure/Survival characteristic

    o Complete or exact failure data

    o Incomplete or censored data

    Failure and Censored Data

    1. Complete or exact failure data

    The value of each sample unit is observed or known.

    Example: the fan failed at exactly 500 days.

    2. Incomplete or censored data

    Not all of the units in the sample may have failed, or the exact times-to-failure of all the units are not known.

    Example: the fan had not yet failed at 500 days; the fan failed sometime before 500 days.
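
    One possible way to store such observations is to keep a time together with a flag saying how it was observed. In the usual terminology, "not yet failed at 500 days" is right-censored and "failed sometime before 500 days" is left-censored. The record layout and the label strings below are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    time_days: int
    kind: str        # "exact", "right" or "left" (labels chosen for this sketch)


def describe(obs):
    """Turn a stored observation back into a plain-language statement."""
    if obs.kind == "exact":
        return f"failed at exactly {obs.time_days} days"
    if obs.kind == "right":
        return f"had not yet failed at {obs.time_days} days"
    if obs.kind == "left":
        return f"failed sometime before {obs.time_days} days"
    raise ValueError(f"unknown kind: {obs.kind}")


fans = [Observation(500, "exact"), Observation(500, "right"), Observation(500, "left")]
n_censored = sum(obs.kind != "exact" for obs in fans)

for fan in fans:
    print("the fan", describe(fan))
print("censored observations:", n_censored)
```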

    Data related problems

    Data problems include:

    o Too much data (massive data sets, irrelevant data)

    o Corrupt and/or noisy data

    o Too little data (missing entries, missing variables, too few observations)

    o Fractured data (multiple sources, incompatible data, data obtained at different levels)

  • 12

    Techniques for dealing with these problems include:

    o Data transformation (data filtering, ordering, editing, and modeling)

    o Interactive techniques (data visualization, elimination, selection, identification of principal components, sampling)

    o New information generation (time series analysis, data fusion, simulation, dimensional analysis, etc.)

    5. Data Processing

    What is Data Processing?

    Data → Process → Information

    o Data: the raw facts, i.e. recorded measures of certain phenomena

    o Process: editing, coding, classification and tabulation of the collected data

    o Information: facts in a form suitable for decision-making by researchers

    Data Processing Overview

    Validation & Editing → Coding → Classification → Tabulation → Using percentages

  • 13

    Step One:

    Validation: Confirming that the interviews/surveys occurred.

    Editing: The procedure that improves the quality of the data for coding, that is, the process of checking and adjusting the data for:

    o Consistency

    o Completeness

    o Questions answered out of order
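
    The completeness and consistency checks can be sketched as a small function that lists the problems found in one questionnaire record. The field names and the particular consistency rule are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical questionnaire fields that every record must contain.
REQUIRED = ("respondent_id", "age", "years_teaching")


def edit_checks(record):
    """Return a list of editing problems found in one questionnaire record."""
    problems = []
    # Completeness: every required question has an answer.
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Consistency: answers must not contradict each other.
    age, years = record.get("age"), record.get("years_teaching")
    if isinstance(age, int) and isinstance(years, int) and years > age:
        problems.append("years_teaching exceeds age")
    return problems


good = {"respondent_id": 1, "age": 40, "years_teaching": 15}
incomplete = {"respondent_id": 2, "age": None, "years_teaching": 5}
inconsistent = {"respondent_id": 3, "age": 30, "years_teaching": 45}

for rec in (good, incomplete, inconsistent):
    print(rec["respondent_id"], edit_checks(rec))
```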

    Step Two:

    Coding: Grouping and assigning numeric codes to the question responses. (Codes may also be other character symbols.)

    Rules for coding:

    o Categories should be exhaustive

    o Categories should be mutually exclusive and independent
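
    A minimal coding sketch, assuming a hypothetical five-point agreement question. A catch-all code keeps the categories exhaustive, and a one-to-one lookup keeps them mutually exclusive; the codebook and the code 9 for "other" are illustrative choices.

```python
# Hypothetical codebook for one survey question.
CODEBOOK = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}
OTHER = 9   # catch-all code so that every possible answer receives some code


def code_response(answer):
    """Map a free-text answer to exactly one numeric code."""
    return CODEBOOK.get(answer.strip().lower(), OTHER)


answers = ["Agree", " neutral ", "don't know", "Strongly Agree"]
codes = [code_response(a) for a in answers]
print(codes)
```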

    Step Three:

    Classification: A large volume of raw data is reduced into homogeneous groups (if we are to get meaningful relationships). Classification can be (i) according to attributes or (ii) according to class-intervals.
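
    Classification according to class-intervals can be sketched as follows. The marks and the interval width of 10 are made up, and intervals are taken as closed on the left and open on the right, which is one convention among several.

```python
from collections import Counter


def class_interval(value, width, lower=0):
    """Return the interval [lo, lo + width) that contains value."""
    k = (value - lower) // width
    lo = lower + k * width
    return (lo, lo + width)


marks = [12, 47, 35, 8, 29, 41, 33, 19, 25, 38]   # hypothetical raw data
groups = Counter(class_interval(m, width=10) for m in marks)

for (lo, hi), count in sorted(groups.items()):
    print(f"{lo}-{hi}: {count}")
```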

    Step Four:

    Tabulation: The process of summarizing raw data and displaying them in compact form (i.e., in the form of statistical tables) for further analysis.

    Step Five:

    Percentages: Percentages are often used in data presentation as they simplify numbers, reducing all of them to a 0 to 100 range.
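
    Tabulation and percentages together amount to a frequency table. A small sketch with invented survey answers (the function name and table layout are assumptions):

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]


answers = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "yes"]
table = frequency_table(answers)

print(f"{'Answer':<10}{'Count':>5}{'Percent':>9}")
for category, count, percent in table:
    print(f"{category:<10}{count:>5}{percent:>8.1f}%")
```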

    6. Graphical Representation of Data

  • 14

    Commonly used graphs and their uses in analysis of data.

    [Chart: graphs (rows) against objective of analysis (columns)]

    Objective of analysis:

    o Assess relationships between pairs of variables

    o Assess distributions

    o Compare summaries or individual values of a variable

    o Assess distributions of counts

    o Plot a series of data over time

    o Assess relationships among three variables

    Graphs: Area Graph, Bar Chart, Box Plot, Contour Plot, Dot Plot, Empirical CDF, Histogram, Individual Value Plot, Interval Plot, Marginal Plot, Matrix Plot, Pie Chart, Probability Plot, Scatter Plot, Stem-and-Leaf, Time Series Plot, 3D Scatter Plot, 3D Surface Plot

    7. Summary Chart Concerning Analysis of Data

  • 15

    References

    Ackoff, R.L. (1989). From Data to Wisdom. J. Appl. Sys. Analysis, 16:3-9.

    Bellinger, G., Castro, D., Mills, A. (1997). Data, information, knowledge, and wisdom. From http://www.outsights.com/systems/dikw/dikw.htm

    Benyon, D. (1990). Information and Data Modeling. Alfred Waller, Henley-on-Thames.

    Blischke, W.R., Karim, M.R. and Murthy, D.N.P. (2011). Warranty Data Collection and Analysis. Springer-Verlag London Ltd.

    Kothari, C.R. (2004). Research Methodology: Methods and Techniques, 2nd Ed., New Age International (P) Ltd.

    Thank you

    Any Questions?