Describing, Exploring, And Comparing Data

7/28/2019 Describing, Exploring, And Comparing Data


DESCRIBING, EXPLORING, AND COMPARING DATA

APPLIED STATISTICS

Submitted to : Dr. IMELDA E. CUATEL

GRADUATE SCHOOL, UNIVERSITY OF LUZON

Sunday 8:00-12:30

Prepared by : SAIFULDEEN SINAN



Introduction to Statistics

What is Statistics?

Statistics is a set of procedures and rules for reducing large masses of data to manageable proportions and for allowing us to draw conclusions from those data.

Statistics is a branch of mathematics that deals with the effective management and analysis of data.



What can Stats do?

Allows us to draw conclusions from the data

Makes data more manageable

Allows us to do this objectively and quantitatively



Why Statistics?

To develop an appreciation for variability and how it affects products and processes.

To build an appreciation for the advantages and limitations of informed observation and experimentation.

To determine how to analyze data from designed experiments in order to build knowledge and continuously improve.



Grouped Frequency Distributions

A frequency distribution is a table used to organize data. The left column (called classes or groups) includes numerical intervals on a variable being studied. The right column is a list of the frequencies, or number of observations, for each class.



Grouped frequency distributions can be used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width.



Construction of a Frequency Distribution

Find the highest and lowest values. Find the range.

Select the number of classes desired.

Find the width by dividing the range by the number of classes and rounding up.

Select a starting point (usually the lowest value); add the width repeatedly to get the lower class limits.

Find the upper class limits.

Find the class boundaries.

Tally the data, find the frequencies, and find the cumulative frequencies.



Example

In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day.

Construct a frequency distribution using six classes.

10 8 6 14

22 13 17 19

11 9 18 14

13 12 15 15

5 11 16 11



Answer

Step 1: Find the highest and lowest values: H = 22 and L = 5.

Step 2: Find the range: R = H - L = 22 - 5 = 17.

Step 3: Select the number of classes desired. In this case it is equal to 6.

Step 4: Find the class width by dividing the range by the number of classes. Width = 17/6 = 2.83. This value is rounded up to 3.

Step 5: Select a starting point for the lowest class limit. For convenience, this value is chosen to be 5, the smallest data value. The lower class limits will be 5, 8, 11, 14, 17 and 20.

Step 6: The upper class limits will be 7, 10, 13, 16, 19 and 22.



Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.

Step 8: Tally the data, write the numerical values for the tallies in the frequency column, and find the cumulative frequencies.

Class Limits    Class Boundaries    Frequency    Cumulative Frequency
 5 to  7         4.5 -  7.5         2             2
 8 to 10         7.5 - 10.5         3             5
11 to 13        10.5 - 13.5         6            11
14 to 16        13.5 - 16.5         5            16
17 to 19        16.5 - 19.5         3            19
20 to 22        19.5 - 22.5         1            20

Note: The dash "-" stands for "to".
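The eight steps can be sketched in a few lines of Python. This is a minimal illustration rather than a standard routine: the function name `grouped_frequency` is made up for this example, and it assumes integer data, so that consecutive class limits are one unit apart and the boundaries lie 0.5 beyond the limits.

```python
import math

def grouped_frequency(data, num_classes):
    """Build a grouped frequency distribution for integer data.

    Returns one tuple per class:
    (lower limit, upper limit, lower boundary, upper boundary,
     frequency, cumulative frequency).
    Assumption: the rounded-up width is wide enough to cover the
    largest value, as it is in the cigarette example.
    """
    low, high = min(data), max(data)
    width = math.ceil((high - low) / num_classes)  # Step 4: round the width up
    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = low + k * width       # Step 5: lower class limits
        upper = lower + width - 1     # Step 6: upper class limits
        freq = sum(lower <= x <= upper for x in data)  # Step 8: tally
        cumulative += freq
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    return rows

cigarettes = [10, 8, 6, 14, 22, 13, 17, 19, 11, 9,
              18, 14, 13, 12, 15, 15, 5, 11, 16, 11]
table = grouped_frequency(cigarettes, 6)
for row in table:
    print(row)
```

Run on the 20 survey values, the sketch reproduces the class limits 5 to 7 through 20 to 22 and the frequencies 2, 3, 6, 5, 3, 1.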



Histogram

What is a histogram?

It is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies."

A histogram is like a bar chart, but there are some important differences:

It can only be used to show continuous data.

It can only be used to show numerical data.

The data is always grouped.



So the width of a bar represents a quantitative variable x, such as age, rather than a category.

The height of each bar indicates frequency.

How is a Real Histogram Made?

Example

Consider the set below:

{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}

A graph which shows how many ones, how many twos, how many threes, etc. would be meaningless. Instead we bin the data into convenient ranges. In this case, with a bin width of 10, we can easily group the data as below.

Bin = the class size (width of the rectangles) in a histogram

SEE NEXT SLIDE



SOLUTION

{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}, with a bin width of 10:

Data Range    Frequency
 0-10         1
10-20         3
20-30         6
30-40         4
40-50         2

Note: Changing the size of the bin changes the appearance of the graph.
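The binning can be checked with a short Python sketch. The helper name `bin_counts` is hypothetical, and one assumption goes beyond the slide: a value that falls exactly on a bin edge (for example 10) is counted in the upper bin, since the slide does not say which side it belongs to.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per bin; each bin covers [start, start + bin_width)."""
    # Integer floor division finds the start of the bin a value falls into.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49]

counts = bin_counts(data, 10)
for start, freq in counts.items():
    # A crude text histogram: one '#' per observation in the bin.
    print(f"{start:>2}-{start + 10:<2} {'#' * freq} ({freq})")

# Doubling the bin width gives a coarser picture of the same data.
wide_counts = bin_counts(data, 20)
```

With a width of 10 the counts are 1, 3, 6, 4, 2; with a width of 20 they collapse to 4, 10, 2, which illustrates how the bin size changes the shape of the graph.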



Histogram shapes



Box plot

A box plot (also referred to as a box and whisker diagram) is a diagram showing statistical distribution.

A box plot summarizes data using the median, upper and lower quartiles, and the extreme (least and greatest) values. It allows you to see important characteristics of the data at a glance.

We need 5 numbers, called the 5-number summary:

1. minimum value

2. Q1

3. median

4. Q3

5. maximum value



Construction of BOX PLOT

28 32 42 37

30 25 44 38

24 32 33 44

38 34 30 44

31 28 31 29

39 29 32 29

MPG of 4-cylinder cars



To make a box plot, organize the data in order from least to greatest:

24 25 28 28 29 29 29 30 30 31 31 32 32 32 33 34 37 38 38 39 42 44 44 44

Then find the median of the data. It is 32.

This divides the data in half. The lower half: 24 25 28 28 29 29 29 30 30 31 31 32, and the upper half: 32 32 33 34 37 38 38 39 42 44 44 44.



Find the median of the top half of the data:

32 32 33 34 37 38 38 39 42 44 44 44

This is called the high median, upper quartile or quartile 3. Q3 = 38.

Take the lower half of the data and find the median of it:

24 25 28 28 29 29 29 30 30 31 31 32

This is called the low median, lower quartile or quartile 1. Q1 = 29.

Next, find the lowest value, 24, and the highest value, 44. Let's organize all 5 pieces of data together so we can see them:

Lower extreme = 24

Lower quartile (Q1) = 29

Median (Q2) = 32

Upper quartile (Q3) = 38

Upper extreme = 44
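As a cross-check, the five values can be computed with a short Python sketch of the median-of-halves method used on these slides. The function name is made up for this example. For an odd number of values the sketch leaves the middle value out of both halves; that is an assumption, since the slides only show an even-sized data set, and other quartile conventions exist.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, the middle value joins neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

mpg = [28, 32, 42, 37, 30, 25, 44, 38, 24, 32, 33, 44,
       38, 34, 30, 44, 31, 28, 31, 29, 39, 29, 32, 29]
summary = five_number_summary(mpg)
print(summary)
```

For the 24 MPG values this gives 24, 29, 32, 38 and 44, matching the summary listed above.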



Next, make a number line that will best display the 5 pieces of data (24, 29, 32, 38, 44).

Place a dot above the number line to show the lower extreme and one for the upper extreme.

Put a vertical slash above the number line for the median and one each for the lower and upper quartiles.

[Figure: number line with ticks at 20, 24, 28, 32, 36, 40, 44]



Enclose the vertical slashes in a box. Draw a line from the right center of the box to the upper extreme and one from the lower end of the box to the lower extreme, forming the whiskers.

Then give the graph a title. All graphs must have a title that clearly represents what the graph is showing.

[Figure: box plot titled "Miles per Gallon of 4-cylinder Cars"; horizontal axis "Miles per gallon (mpg)" with ticks at 20, 24, 28, 32, 36, 40, 44]



OGIVE

An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative percentage of observations below the upper limit of each class in a cumulative frequency distribution.

How to Construct Ogives?

Make a frequency table showing class boundaries and cumulative frequencies.

For each class, put a dot over the upper class boundary at the height of the cumulative class frequency.

Place a dot on the horizontal axis at the lower class boundary of the first class.

Connect the dots.
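The points to plot can be generated with a small Python sketch. It reuses the class boundaries and frequencies from the cigarette example earlier in these slides; the function name `ogive_points` is invented for illustration, and the plotting itself is left out.

```python
from itertools import accumulate

def ogive_points(boundaries, frequencies):
    """Return (x, y) points: x = class boundary, y = cumulative frequency.

    `boundaries` holds the lower boundary of the first class followed by
    the upper boundary of every class, so it has one more entry than
    `frequencies`.
    """
    cumulative = list(accumulate(frequencies))
    # First dot sits on the horizontal axis at the first lower boundary.
    points = [(boundaries[0], 0)]
    # One dot per class, over its upper boundary.
    points += list(zip(boundaries[1:], cumulative))
    return points

boundaries = [4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5]
frequencies = [2, 3, 6, 5, 3, 1]
points = ogive_points(boundaries, frequencies)

# Dividing each height by the total turns counts into cumulative percentages.
total = sum(frequencies)
percent_points = [(x, 100 * y / total) for x, y in points]
print(points)
```

Connecting these points in order with straight line segments gives the ogive; the last point always sits at the total number of observations (or at 100 percent).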



Example



Draw the x and y axes, then plot the points.



Pie Chart

Pie graph: A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.

How to make a Pie Chart?

1. Organize your information

2. Add the data all together and reach a sum

3. Work out the angle between the two sides of each piece (each item's share of the total, multiplied by 360 degrees)

4. Use a mathematical compass to draw a circle

5. Draw the radius

6. Draw each section division

7. Color each segment



Example

A family's weekly expenditure on its house mortgage, food

and fuel is as follows:

Draw a pie chart to display the information.



Solution :

We can find what percentage of the total expenditure each item equals.

Percentage of weekly expenditure on:



To draw a pie chart, divide the circle into 100 percentage parts. Then allocate the number of percentage parts required for each item.
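The percentage and angle calculation can be sketched in Python as follows. The expenditure table from the original slide did not survive, so the amounts below are hypothetical placeholders chosen only to make the arithmetic easy to follow; the function name `pie_shares` is likewise made up for this example.

```python
def pie_shares(amounts):
    """Map each category to (percentage of total, wedge angle in degrees)."""
    total = sum(amounts.values())
    return {name: (100 * value / total, 360 * value / total)
            for name, value in amounts.items()}

# Hypothetical weekly amounts, NOT the figures from the original slide.
spending = {"Mortgage": 300, "Food": 225, "Fuel": 75}
shares = pie_shares(spending)

for name, (percent, angle) in shares.items():
    print(f"{name}: {percent:.1f}% of the circle, {angle:.1f} degrees")
```

Whatever the real amounts are, the percentages always add up to 100 and the angles to 360 degrees, which is a quick check before drawing.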



Measures of Central Tendency (Averages)

A measure of central tendency is a univariate statistic that indicates, in one manner or another, the average or typical observed value of a variable in a data set.

Central tendency = values that summarize or represent the majority of scores in a distribution

Three main measures of central tendency:

Mean

Median

Mode




Mode

The mode (or modal value) of a variable in a set of data is the value of the variable that is observed most frequently in that data (or, given a continuous frequency curve, the value at the highest point of the curve).

Note: the mode is the value that is observed most frequently, not the frequency itself.

The mode is defined for every type of variable [i.e., nominal, ordinal, interval, or ratio].



[Figure: bar chart of Frequency (axis from 0 to 40) against DV (values 1 to 9)]



Mode = most frequently occurring data point

Mode = 4 (the data point with the highest frequency, 15)

Data Point    Frequency
0              2
1              5
2              7
3             14
4             15
5              8
6              5



Median

Middle-most value

50% of observations are above the median, 50% are below it

The difference in magnitude between the observations does not matter

Therefore, it is not sensitive to outliers

Formula: Median location = (n + 1)/2



Median = the middle number when data are arranged in numerical order

Data: 3 5 1

Step 1: Arrange in numerical order

1 3 5

Step 2: Pick the middle number (3)

Data: 3 5 7 11 14 15

Median = (7 + 11)/2 = 9



Median

Median location = (N + 1)/2 = (56 + 1)/2 = 28.5

The 28th ordered value is 3 and the 29th is 4, so Median = (3 + 4)/2 = 3.5

Data Point    Frequency
0              2
1              5
2              7
3             14
4             15
5              8
6              5



Mean

The mean (or mean value) of a variable in a set of data is the result of adding up all the observed values of the variable and dividing by the number of cases (the average, as the term is most commonly used).

The mean is defined if and only if the variable is at least interval in nature [i.e., interval or ratio].



Mean = Average = $\sum fX / N$

$\sum fX = 191$, $N = 56$

Mean = 191/56 = 3.41

Data Point (X)    Frequency (f)    fX
0                  2                0
1                  5                5
2                  7               14
3                 14               42
4                 15               60
5                  8               40
6                  5               30
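All three measures for this frequency table can be recomputed with Python's standard `statistics` module. This is only a sketch for checking the arithmetic: it expands the table into the 56 individual observations first, which is fine for a table this small but is not how one would handle very large counts.

```python
from statistics import mean, median, multimode

# Data point -> frequency, copied from the table above.
freq_table = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}

# Expand the table into the raw list of observations.
observations = [value
                for value, count in freq_table.items()
                for _ in range(count)]

n = len(observations)            # number of cases, N
modes = multimode(observations)  # list of the most frequent value(s)
med = median(observations)       # average of the 28th and 29th ordered values
avg = mean(observations)         # sum of fX divided by N

print(n, modes, med, round(avg, 2))
```

`multimode` returns a list because a data set can have more than one mode; here it contains a single value, the data point with frequency 15.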



Advantages and Disadvantages of the Measures:

Median

1. Unaffected by extreme scores

Data: 5 8 11 Median = 8

Data: 5 8 5 million Median = 8

2. Usually its value actually occurs in the data

3. But cannot be entered into equations, because there is no equation that defines it

4. And not as stable from sample to sample, because it depends on the number of scores in the sample



Advantages and Disadvantages of the Measures:

Mean

1. Defined algebraically

2. Stable from sample to sample

3. But usually does not actually occur in the data

4. And heavily influenced by outliers

Data: 5 8 11 Mean = 8

Data: 5 8 5 million Mean = 1,666,671
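The contrast between the two measures is easy to reproduce. The following Python sketch uses the same two tiny data sets as the slides and the standard `statistics` module; the variable names are arbitrary.

```python
from statistics import mean, median

typical = [5, 8, 11]
with_outlier = [5, 8, 5_000_000]   # one extreme score replaces 11

mean_typical, median_typical = mean(typical), median(typical)
mean_outlier, median_outlier = mean(with_outlier), median(with_outlier)

print("without outlier: mean", mean_typical, "median", median_typical)
print("with outlier:    mean", mean_outlier, "median", median_outlier)
```

The median stays at 8 in both cases, while the mean jumps from 8 to 1,666,671, which is why the median is often preferred for data with extreme values.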



Measures of Variation

A measure of variation describes how spread out or scattered a set of data is. Such measures are also known as measures of dispersion or measures of spread.

Measures of variation include:

1. The range

2. The variance

3. The standard deviation

The standard deviation is just the square root of the variance.



Range: difference between the extreme values (max - min). In the literature, the actual extreme values (min and max) are reported more often than their difference.

Variance: measure of variation in a sample of data; the mean of the squared deviations of the values from the mean, often referred to as the mean square or MS.

Standard deviation: square root of the variance; measures the amount of variation of values around the mean.




Example

Heights (in inches) of 5 starting players from basketball team A:

A: 72, 73, 76, 76, 78

The range is the difference between the maximum and minimum values of the data set.

Range of team A: 78 - 72 = 6

The sample standard deviation takes into account all data values. The following procedure is used to find the sample standard deviation.



Step 1.

Find the mean of the data:

$\bar{x} = \dfrac{72 + 73 + 76 + 76 + 78}{5} = \dfrac{375}{5} = 75$



Step 2.

Find the deviation of each score from the mean.

$x_i$    $x_i - \bar{x}$
72       72 - 75 = -3
73       73 - 75 = -2
76       76 - 75 = 1
76       76 - 75 = 1
78       78 - 75 = 3

Note that the sum of the deviations is zero.



Step 3.

Square each deviation from the mean. Find the sum of the squared deviations.

$x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
72       72 - 75 = -3       9
73       73 - 75 = -2       4
76       76 - 75 = 1        1
76       76 - 75 = 1        1
78       78 - 75 = 3        9
Sum      0                  24



Step 4.

The sample variance is determined by dividing the sum of the squared deviations by (n - 1), the number of scores minus one.

For Team A, the sample variance is

$s^2 = \dfrac{24}{5 - 1} = 6$



Step 5.

The standard deviation is the square root of the variance.

The mathematical formula for the sample standard deviation is

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

The sample standard deviation for Team A is

$s = \sqrt{6} \approx 2.45$
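The five steps translate directly into a few lines of Python. This is a sketch for checking the Team A arithmetic by hand; the function name `sample_variance` is chosen for this example, and the standard library's `statistics.stdev` is used only as an independent cross-check.

```python
import math
from statistics import stdev

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n                                  # Step 1
    deviations = [x - mean for x in values]                 # Step 2
    squared_deviations = [d ** 2 for d in deviations]       # Step 3
    return sum(squared_deviations) / (n - 1)                # Step 4

team_a = [72, 73, 76, 76, 78]
variance = sample_variance(team_a)
std_dev = math.sqrt(variance)                               # Step 5

print(variance, round(std_dev, 2))
# Cross-check against the library implementation of the same formula.
library_std_dev = stdev(team_a)
```

Dividing by n - 1 rather than n is what makes this the sample (as opposed to population) variance.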



Measures of Position

Identify the position of a data value in a data set, using various measures of position such as percentiles and quartiles

Are used to locate the relative position of a data value in a data set

Can be used to compare data values from different data sets

Can be used to compare data values within the same data set

Can be used to help determine outliers within a data set

Include the z-score (standard score), percentiles, and quartiles



z-scores

Also called the standard score

Represents the number of standard deviations that a data value is from the mean for a specific distribution

Always round the value to 2 decimal places

Can be used to compare data values from different data sets by converting raw data to a standardized scale

Calculation involves the mean and standard deviation of the data set



Z-score

Is obtained by subtracting the mean from the given data value and dividing the result by the standard deviation.

The symbol for both population and sample is z.

Can be positive, negative or zero.

A data point can be considered unusual if its z-score is sufficiently large or small.

Formula

Sample: $z = \dfrac{x - \bar{x}}{s}$



Example

Human body temperatures have a mean of 98.20 degrees and a standard deviation of 0.62 degrees. Find the z-score for temperatures of:

a. 100 degrees

b. 97 degrees

Solution

a. Z = (100 - 98.20)/0.62

Z = 2.90

b. Z = (97 - 98.20)/0.62

Z = -1.94



Significance of Z

Z-scores above 2 or below -2 are considered to be UNUSUAL.

Z-scores above 3 or below -3 are considered to be VERY UNUSUAL.

So

The temperature of 100 degrees is UNUSUAL.

The temperature of 97 degrees is ordinary.
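The two temperature calculations and the rule of thumb above can be combined in a short Python sketch. The function names and the label strings are illustrative choices for this example, not standard terminology.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std_dev

def classify(z):
    """Apply the rule of thumb: beyond 2 is unusual, beyond 3 very unusual."""
    if abs(z) > 3:
        return "very unusual"
    if abs(z) > 2:
        return "unusual"
    return "ordinary"

MEAN_TEMP = 98.20   # degrees, from the example
STD_TEMP = 0.62

z_hot = z_score(100, MEAN_TEMP, STD_TEMP)
z_cool = z_score(97, MEAN_TEMP, STD_TEMP)

for temp, z in ((100, z_hot), (97, z_cool)):
    # Round to 2 decimal places, as the slides recommend.
    print(temp, round(z, 2), classify(z))
```

Because the z-score is unit-free, the same `classify` thresholds could be applied to values from any data set once its mean and standard deviation are known.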



Percentiles

Are position measures used to indicate the position of an individual in a group

Divide the data set into 100 (per cent) equal groups

Used to compare an individual data value with the national norm

Symbolized by P1, P2, ...

Percentile rank indicates the percentage of data values that fall below the specified value:

$\text{Percentile rank of } x = \dfrac{B + 0.5E}{n} \times 100$

where B = number of scores below x, E = number of scores equal to x, and n = number of scores



A percentile tells the percent of scores that are lower than a given score.

Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's percentile rank would be:

$\dfrac{125 + 0.5(1)}{150} \times 100 \approx 84$

Jason's standing in the class at the 84th percentile is as high as or higher than 84% of the graduates.
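Jason's rank can be checked with a short Python sketch. It assumes the common percentile-rank formula (B + 0.5E)/n × 100, with B, E and n as defined on the previous slide; this formula is an assumption here in the sense that it is the one that reproduces the 84th percentile quoted above. The function name is made up for this example.

```python
def percentile_rank(below, equal, total):
    """Percent of scores below a value, counting half of the tied scores."""
    return (below + 0.5 * equal) / total * 100

# 125 students ranked below Jason, 1 score equal to his (his own), 150 in all.
jason = percentile_rank(below=125, equal=1, total=150)
print(round(jason, 2), "-> rounds to percentile", round(jason))
```

Counting half of the tied scores places a value in the middle of its own tie group, so the rank can never reach 0 or 100 for a score that actually occurs in the data.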




Quartiles

Quartiles divide the data set into 4 groups, each of which has the same number of members.

Q1 corresponds to P25

Q2 corresponds to P50 or the median

Q3 corresponds to P75

Q1, Q2 and Q3 divide the ranked scores into four equal parts



Example

Find Q1, Q2 and Q3.



Q2 (Median)

The median is the average of the 6th and 7th scores.

Q2 = (80.2 + 82.5)/2 = 81.35



Q1

Find the median of the first 6 scores.

Q1 = (78.6 + 79.2)/2 = 78.9



Q3

Find the median of the last 6 scores.

Q3 = (84.3 + 84.6)/2 = 84.45

THE END

