Statistics and Data

Statistics and Data. inference Sample (Data) statistics probability sampling Population Ch 2, Ch 4, Ch 6 Ch 5, Ch 9, Ch 21 Ch 7, Ch 8 Ch 3, Ch 23 Ch 10-16

4 major parts of STATISTICS

[Diagram: Population and Sample (Data), connected by the labels sampling, statistics, probability and inference. Chapter annotations: Ch 2, Ch 4, Ch 6; Ch 5, Ch 9, Ch 21; Ch 7, Ch 8; Ch 3, Ch 23; Ch 10-16.]

We will cover ch. 1-15, 16, 21 and 23 of SVV.

Sample & Population

population: the whole group of objects we want to know about.

sample: the part of the population that we observe(d).

data: (a set of) numbers and symbols indicating a fact in reality.

A sample is also represented as data.

The word DATA is the plural form of DATUM, although by convention people sometimes treat DATA as singular. To avoid this grammatical confusion, the word DATASET is often used as a singular form of DATA. Others use DATASET to mean a group of DATA.

Data obtained by observing (a part of) a population are a sample.

observations (records, cases)

variables (attributes)

In a data table, each column is a variable and each row is an observation. Database technicians use the terms attribute and record instead of variable and observation.

A data table is data shown in tabular form.
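As a minimal sketch of this layout, a data table can be held in Python as a list of rows. The customer records below are hypothetical and only illustrate the structure: each row is one observation (record) and each key is one variable (attribute).

```python
# Hypothetical data table: each dict is one observation (record, case),
# and each key is one variable (attribute). Values are made up.
table = [
    {"id": 1, "name": "Kim", "age": 34, "city": "Seoul"},
    {"id": 2, "name": "Lee", "age": 27, "city": "Busan"},
    {"id": 3, "name": "Park", "age": 45, "city": "Seoul"},
]


def column(rows, variable):
    """Return all values of one variable, i.e. one column of the table."""
    return [row[variable] for row in rows]


n_observations = len(table)        # number of rows
variables = list(table[0].keys())  # column names
ages = column(table, "age")        # one column

print(n_observations, variables, ages)
```

Reading down a column gives all values of one variable; reading across a row gives everything recorded about one observation.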

A database (DB) is a collection of closely related data, managed by Database Management System (DBMS) software. A relational DB uses a tabular representation of data.

Data and Statistics

Statistics is a way to read data and interpret the story the data tell.

To understand the story, it's important to delve into the CONTEXT of the data, as well as the dataset itself.

Context of data

Everything the data tell us, whether it is represented explicitly or only implicitly.

- Meaning of the variables and values of the data

- Who collected the data, and when

- How & Why ….

- All history & background of data

Types of variables

- identifier

- quantitative variable

- categorical variable (qualitative variable)

categorical or quantitative: decided by the scale of measurement

Categorical variable

- nominal scale: sex - male (1), female (2)

- ordinal (ordered) scale: grade of score - A, B, C, D, F

Quantitative variable

- interval scale: year of birth - BC300, AD1900

- ratio scale: price, weight, ….

Interval scale or Ratio scale

Is the quotient of two values meaningful? If so, the variable is on a ratio scale.

Test with the cases of -30kg, -50Won, BC1000, -10°C.

- 60kg is 2 times heavier than 30kg.

- AD2000 is 2 times older than AD1000.

- 20°C is 2 times warmer than 10°C.

- 100Won is 2 times more than 50Won.
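One way to check the quotient test is to change units. On a ratio scale the quotient stays the same; on an interval scale it does not. The sketch below is an illustration, not part of the slides: it uses the standard Celsius-to-Fahrenheit formula and an approximate kilogram-to-pound factor, and the function names are chosen here for the example.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion: the zero point shifts, so ratios change."""
    return c * 9 / 5 + 32


def kg_to_pounds(kg):
    """Approximate conversion factor: only the unit size changes."""
    return kg * 2.20462


# Weight (ratio scale): the quotient survives the change of unit.
ratio_kg = 60 / 30
ratio_lb = kg_to_pounds(60) / kg_to_pounds(30)

# Temperature (interval scale): the quotient depends on the unit.
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

print(ratio_kg, ratio_lb)  # both quotients agree
print(ratio_c, ratio_f)    # the two quotients differ
```

Because "2 times warmer" holds in Celsius but not in Fahrenheit, the statement is not meaningful, whereas "2 times heavier" holds in any unit of weight.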

Categorical or Quantitative

eg) number of phone calls

Customer    Daily call    Yearly call
Clinton     2             954
Ford        1             450
David       0             320
Gates       2             795

Daily call: rather likely to be categorical

Yearly call: rather likely to be quantitative
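A minimal sketch of one possible reading of this example, using the four customers from the table: a count with only a few distinct values (daily calls) is naturally summarized by a frequency table, like a categorical variable, while a count with many distinct values (yearly calls) is naturally summarized numerically. The choice of summaries is an assumption made here for illustration.

```python
from collections import Counter
from statistics import mean

# Values taken from the phone-call table.
daily_calls = {"Clinton": 2, "Ford": 1, "David": 0, "Gates": 2}
yearly_calls = {"Clinton": 954, "Ford": 450, "David": 320, "Gates": 795}

# Treated as categorical: how many customers fall in each category?
daily_counts = Counter(daily_calls.values())

# Treated as quantitative: a numerical summary such as the mean.
yearly_mean = mean(list(yearly_calls.values()))

print(dict(daily_counts))
print(yearly_mean)
```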

Cross-Sectional Data, Longitudinal Data and Time Series

When several variables are all measured at the same time point, the data are called cross-sectional data. For example, the sales revenue, number of customers, and expenses for the last month of business.

Variables that are measured at several time points are called longitudinal data. For example, the number of victims of earthquakes.

Longitudinal data measured at very many, regularly spaced time points are called a time series. For example, monthly recorded discount rates of US Treasury bonds.

How to get data

                No Treatment                        Treatment

Non-designed    Sampling by Voluntary Response,     Observational Study
                Convenience, etc.

Designed        Sampling                            Experiment

Using data collected from a computer DB is basically considered a non-designed study, for example using customer data from a DB. Data mining, dealt with in chapter 24, analyzes data collected from DBs.

Chapter 3 Chapter 23

Sampling & Experiment

If patients are selected by design and their blood pressures are measured without any treatment, the data are collected by sampling. If treatments are given before the blood pressures are measured, e.g. dosing different types of medicine in order to compare them, the data are collected by experiment.

Designed Study & Non-designed Study

A designed study aims to generalize the results obtained from the data to the assumed population. For example, the result of a well-performed political poll for an election can be interpreted as the opinion of all voters. To generalize, the sample (or experimental units) must be selected by a well-designed plan.

A non-designed study does not aim to generalize the results obtained from the data to a population; it aims just to see tentative results. For example, a poll run on an internet website cannot be generalized to all citizens.

Sampling schemes

- cluster sampling

- systematic sampling

- simple random sampling (SRS)

- stratified sampling

Assume we want to measure the math ability of the 5th grade students of a school. There are 300 students, divided into 10 classes. Examples of sampling schemes that select 60 students are as follows.

- Simple random sampling (SRS): after numbering the students from 1 to 300, select 60 students using a random number generator (see www.random.org).

- Stratified sampling: select 6 students by SRS from each class.

- Cluster sampling: select 2 classes, and measure all students in those classes.

- Systematic sampling: select the 60 students whose numbers are multiples of 5.
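The four schemes above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the course's own procedure: it uses Python's `random` module in place of www.random.org, assumes 10 equal classes of 30 students, and assumes students 1-30 form the first class, 31-60 the second, and so on.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

N_STUDENTS, N_CLASSES, SAMPLE_SIZE = 300, 10, 60
CLASS_SIZE = N_STUDENTS // N_CLASSES  # assumption: equal classes of 30

students = list(range(1, N_STUDENTS + 1))
# Assumption: consecutive student numbers belong to the same class.
classes = [students[i * CLASS_SIZE:(i + 1) * CLASS_SIZE]
           for i in range(N_CLASSES)]

# Simple random sampling: 60 of the 300 students, without replacement.
srs = rng.sample(students, SAMPLE_SIZE)

# Stratified sampling: an SRS of 6 students within every class.
per_class = SAMPLE_SIZE // N_CLASSES
stratified = [s for cls in classes for s in rng.sample(cls, per_class)]

# Cluster sampling: pick 2 whole classes and take everyone in them.
chosen_classes = rng.sample(range(N_CLASSES), 2)
cluster = [s for c in chosen_classes for s in classes[c]]

# Systematic sampling: every student whose number is a multiple of 5.
systematic = [s for s in students if s % 5 == 0]

for name, sample in [("SRS", srs), ("stratified", stratified),
                     ("cluster", cluster), ("systematic", systematic)]:
    print(name, len(sample))
```

All four samples contain 60 students, but they differ in how the students are spread over the classes. A common variant of systematic sampling chooses a random starting number between 1 and 5 instead of always starting at 5.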

Which sampling scheme will give the most precise result, if the school has streamed the students into classes by math ability beforehand?

Which sampling scheme is the most convenient for measuring the math ability of the 5th grade students of the school?

Voluntary Response Sampling and Convenience Sampling

• In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted. Voluntary response samples are almost always biased, and so conclusions drawn from them are almost always wrong.

• In convenience sampling we simply include the individuals who are convenient. Unfortunately, this group may not be representative of the population.
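A small simulation can illustrate why voluntary response samples tend to be biased. Every number below is a made-up assumption for illustration: a population of 10,000 in which 30% are dissatisfied, and dissatisfied people are assumed to be ten times as likely to respond as satisfied ones.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: True means "dissatisfied".
POPULATION_SIZE = 10_000
N_DISSATISFIED = 3_000            # assumed true share: 30%
P_RESPOND_DISSATISFIED = 0.50     # assumed response probability
P_RESPOND_SATISFIED = 0.05        # assumed response probability

population = ([True] * N_DISSATISFIED
              + [False] * (POPULATION_SIZE - N_DISSATISFIED))


def responds(is_dissatisfied):
    """Each person decides independently whether to answer the poll."""
    p = P_RESPOND_DISSATISFIED if is_dissatisfied else P_RESPOND_SATISFIED
    return rng.random() < p


# Voluntary response sample: everyone is invited, only responders count.
respondents = [person for person in population if responds(person)]

true_share = sum(population) / len(population)
voluntary_share = sum(respondents) / len(respondents)

print(f"true share dissatisfied:      {true_share:.2f}")
print(f"share among those responding: {voluntary_share:.2f}")
```

Under these assumptions the share of dissatisfied people among respondents is far above the true 30%, because the decision to respond is related to the opinion being measured.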

Thank you !!