7/28/2019 Describing, Exploring, And Comparing Data
DESCRIBING, EXPLORING, AND COMPARING DATA
APPLIED STATISTICS
Submitted to : Dr. IMELDA E. CUATEL
GRADUATE SCHOOL
UNIVERSITY OF LUZON
Sunday 8:00-12:30
Prepared by : SAIFULDEEN SINAN
Introduction to Statistics
What is Statistics?
A set of procedures and rules for reducing large masses of data to manageable proportions and for allowing us to draw conclusions from those data
Statistics is a branch of mathematics that deals with
the effective management and
analysis of data.
What can Stats do?
Allows us to draw conclusions from the data
Makes data more manageable
Allows us to do this objectively and quantitatively
Why Statistics?
To develop an appreciation for variability and how it affects products and processes.
To build an appreciation for the advantages and limitations of informed observation and experimentation.
To determine how to analyze data from designed experiments in order to build knowledge and continuously improve.
Grouped Frequency Distributions
A frequency distribution is a table used to organize data. The left column (called classes or groups) includes numerical intervals on a variable being studied. The right column is a list of the frequencies, or number of observations, for each class.
Grouped frequency distributions can be used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width.
Construction of a Frequency Distribution
Find the highest and lowest value.
Find the range.
Select the number of classes desired.
Find the width by dividing the range by the number of classes and rounding up.
Select a starting point (usually the lowest value); add the width to get the lower limits.
Find the upper class limits.
Find the boundaries.
Tally the data, find the frequencies, and find the cumulative frequency.
Example
In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day. Construct a frequency distribution using six classes.
10 8 6 14
22 13 17 19
11 9 18 14
13 12 15 15
5 11 16 11
Answer
Step 1: Find the highest and lowest values: H = 22 and L = 5.
Step 2: Find the range: R = H - L = 22 - 5 = 17.
Step 3: Select the number of classes desired. In this case it is equal to 6.
Step 4: Find the class width by dividing the range by the number of classes. Width = 17/6 = 2.83. This value is rounded up to 3.
Step 5: Select a starting point for the lowest class limit. For convenience, this value is chosen to be 5, the smallest data value. The lower class limits will be 5, 8, 11, 14, 17 and 20.
Step 6: The upper class limits will be 7, 10, 13, 16, 19 and 22.
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.
Step 8: Tally the data, write the numerical values for the tallies in the frequency column, and find the cumulative frequencies.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
05 to 07       4.5 - 7.5          2           2
08 to 10       7.5 - 10.5         3           5
11 to 13       10.5 - 13.5        6           11
14 to 16       13.5 - 16.5        5           16
17 to 19       16.5 - 19.5        3           19
20 to 22       19.5 - 22.5        1           20
Note: The dash (-) means "to".
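The eight steps can be sketched in a few lines of Python. This is only an illustration for whole-number data: the function name `grouped_frequency` and the dictionary keys are made up for this sketch, and the width is computed as "integer quotient plus one", which matches rounding up here (17/6 becomes 3) and also keeps the largest value inside the last class when the division comes out exact.

```python
# Cigarettes smoked per day by the 20 patients in the example
data = [10, 8, 6, 14, 22, 13, 17, 19, 11, 9,
        18, 14, 13, 12, 15, 15, 5, 11, 16, 11]


def grouped_frequency(values, num_classes):
    """Build a grouped frequency table for whole-number data (hypothetical helper)."""
    low, high = min(values), max(values)
    # Range divided by the number of classes, rounded up (17 // 6 + 1 = 3)
    width = (high - low) // num_classes + 1
    table = []
    cumulative = 0
    for k in range(num_classes):
        lower = low + k * width      # lower class limit
        upper = lower + width - 1    # upper class limit
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        table.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "frequency": freq,
            "cumulative": cumulative,
        })
    return table


table = grouped_frequency(data, 6)
for row in table:
    print(row)
```

Each printed row corresponds to one line of the table above: class limits, class boundaries, frequency, and cumulative frequency.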
Histogram
What is a histogram?
It is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies."
A histogram is like a bar chart, but there are some important
differences.
It can only be used to show continuous data
It can only be used to show numerical data
The data is always grouped.
So:
The width of a bar represents a quantitative variable x, such as age, rather than a category
The height of each bar indicates frequency
How is a Real Histogram Made?
Example
Consider the set below:
{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}
A graph which shows how many ones, how many twos, how many threes, etc. would be meaningless. Instead we bin the data into convenient ranges. In this case, a bin width of 10 groups the data easily.
Bin = the class size (width of the rectangles) in a histogram
SOLUTION
{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}
With a bin width of 10:
Data Range   Frequency
0-10         1
10-20        3
20-30        6
30-40        4
40-50        2
Note: Changing the size of the bin changes the appearance of the graph.
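The binning step can be sketched in Python as follows. The helper name `bin_counts` is made up for this sketch, and it assumes that a value sitting exactly on an edge (such as 10) belongs to the bin that starts there, which the slide's ranges leave open. Bins with no values are simply not listed.

```python
values = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49]


def bin_counts(values, bin_width, start=0):
    """Count values per bin [lo, lo + bin_width); assumed edge convention."""
    counts = {}
    for v in values:
        # Lower edge of the bin that contains v
        lo = start + ((v - start) // bin_width) * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))


counts = bin_counts(values, 10)
for lo, n in counts.items():
    # A rough text histogram: one '#' per observation
    print(f"{lo}-{lo + 10}: {'#' * n} ({n})")

# A different bin width gives a differently shaped picture of the same data
wide = bin_counts(values, 25)
print(wide)
```

Comparing `counts` with `wide` shows the point of the note above: the same 16 values look quite different with five bins than with two.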
Histogram shapes
Box plot
A box plot (also referred to as a box and whisker diagram) is a
diagram showing statistical distribution.
A box plot summarizes data using the median, upper and lower quartiles, and the extreme (least and greatest) values. It allows you to see important characteristics of the data at a glance.
We need 5 numbers, called the five-number summary:
1. minimum value
2. Q1
3. median
4. Q3
5. maximum value
Construction of BOX PLOT
28 32 42 37
30 25 44 38
24 32 33 44
38 34 30 44
31 28 31 29
39 29 32 29
MPG of 4-cylinder cars
To make a box plot, organize the data in order from least to greatest:
24 25 28 28 29 29 29 30 30 31 31 32 32 32 33 34 37 38 38 39 42 44 44 44
Then find the median of the data. It is 32.
This divides the data in half.
The lower half: 24 25 28 28 29 29 29 30 30 31 31 32
The upper half: 32 32 33 34 37 38 38 39 42 44 44 44
Find the median of the top half of the data:
32 32 33 34 37 38 38 39 42 44 44 44
This is called the high median, upper quartile, or quartile 3. Q3 = 38.
Take the lower half of the data and find its median:
24 25 28 28 29 29 29 30 30 31 31 32
This is called the low median, lower quartile, or quartile 1. Q1 = 29.
Next, find the lowest value, 24, and the highest value, 44.
Let's organize all 5 pieces of data together so we can see them:
Lower extreme = 24
Lower quartile (Q1) = 29
Median (Q2) = 32
Upper quartile (Q3) = 38
Upper extreme (Q4) = 44
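The same five numbers can be reproduced with a short Python sketch that follows the median-of-halves method used on these slides. The function name `five_number_summary` is made up here, and leaving the middle value out of both halves when the count is odd is an assumption; other quartile conventions exist and can give slightly different results.

```python
from statistics import median

# MPG of 4-cylinder cars, as listed on the slide
mpg = [28, 32, 42, 37, 30, 25, 44, 38, 24, 32, 33, 44,
       38, 34, 30, 44, 31, 28, 31, 29, 39, 29, 32, 29]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower half of the ordered data
    upper = ordered[half + n % 2:]    # upper half; skips the middle value if n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


summary = five_number_summary(mpg)
print(summary)
```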
Next, make a number line that will best display the 5 pieces of data (24, 29, 32, 38, 44).
Place a dot above the number line to show the lower extreme and one for the upper extreme.
Put a vertical slash above the number line for the median and one each for the lower and upper quartiles.
[Figure: number line marked 20, 24, 28, 32, 36, 40, 44]
Enclose the vertical slashes in a box. Draw a line from the right center of the box to the upper extreme and one from the lower end of the box to the lower extreme, forming the whiskers.
Then:
All graphs must have a title that clearly represents what your graph is showing.
[Figure: box plot titled "Miles per Gallon of 4-cylinder Cars"; horizontal axis "Miles per gallon (mpg)" marked 20, 24, 28, 32, 36, 40, 44]
OGIVE
An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative percentage of observations below the upper limit of each class in a cumulative frequency distribution.
How to Construct Ogives?
Make a frequency table showing class boundaries and cumulative frequencies.
For each class, put a dot over the upper class boundary at the height of the cumulative class frequency.
Place a dot on the horizontal axis at the lower class boundary of the first class.
Connect the dots.
Example
Draw the x and y axes, then plot the points.
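As a rough sketch, the points to plot can be computed in Python. Because the data for this example did not survive, the sketch reuses the class boundaries and frequencies from the cigarette example earlier in this deck; the function name `ogive_points` is made up for illustration. Each point carries both the cumulative frequency and the cumulative percentage, since either can be used for the vertical axis.

```python
# Class boundaries and frequencies from the cigarette example earlier in the deck
boundaries = [(4.5, 7.5), (7.5, 10.5), (10.5, 13.5),
              (13.5, 16.5), (16.5, 19.5), (19.5, 22.5)]
frequencies = [2, 3, 6, 5, 3, 1]


def ogive_points(boundaries, frequencies):
    """Return (x, cumulative frequency, cumulative percent) points for an ogive."""
    total = sum(frequencies)
    # Start on the horizontal axis at the lower boundary of the first class
    points = [(boundaries[0][0], 0, 0.0)]
    running = 0
    for (_, upper), freq in zip(boundaries, frequencies):
        running += freq
        # One dot over each upper class boundary
        points.append((upper, running, 100 * running / total))
    return points


points = ogive_points(boundaries, frequencies)
for x, cum, pct in points:
    print(x, cum, pct)
```

Connecting these points in order with straight lines gives the ogive.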
Pie Chart
Pie graph - A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
How to make a Pie Chart?
1. Organize your information
2. Add the data all together to reach a sum
3. Work out the angle of each piece: its share of the sum multiplied by 360 degrees
4. Use a mathematical compass to draw a circle
5. Draw the radius
6. Draw each section division
7. Color each segment
Example
A family's weekly expenditure on its house mortgage, food
and fuel is as follows:
Draw a pie chart to display the information.
Solution:
We can find what percentage of the total expenditure each item equals.
Percentage of weekly expenditure on:
To draw a pie chart, divide the circle into 100 percentage parts. Then allocate the number of percentage parts required for each item.
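The percentage and angle calculation can be sketched in Python. The amounts below are hypothetical placeholders, not the figures from the original example (which are not available here), and the function name `pie_slices` is made up for this sketch.

```python
# Hypothetical weekly amounts: the figures from the original slide are not
# available, so these numbers are placeholders chosen for illustration only.
expenditure = {"Mortgage": 300, "Food": 225, "Fuel": 75}


def pie_slices(amounts):
    """Return {label: (percent of total, angle in degrees)} for each category."""
    total = sum(amounts.values())
    return {label: (100 * value / total, 360 * value / total)
            for label, value in amounts.items()}


slices = pie_slices(expenditure)
for label, (percent, angle) in slices.items():
    print(f"{label}: {percent}% of the circle, {angle} degrees")
```

The percentages add up to 100 and the angles to 360, so the slices fill the circle exactly.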
Measures of Central Tendency (Averages)
A measure of central tendency is a univariate statistic that indicates, in one manner or another, the average or typical observed value of a variable in a data set.
Central Tendency = values that summarize/represent the majority of scores in a distribution
Three main measures of central tendency:
Mean
Median
Mode
Mode
The mode (or modal value) of a variable in a set of data is the value of the variable that is observed most frequently in that data (or, given a continuous frequency curve, the value at the peak of the curve).
(Note: the mode is the value that is observed most frequently, not the frequency itself.)
The mode is defined for every type of variable [i.e., nominal, ordinal, interval, or ratio].
[Figure: bar chart with Frequency (0 to 40) on the vertical axis and DV (1 to 9) on the horizontal axis]
Mode = most frequently occurring data point
Mode = 4 (it has the highest frequency, 15)
Data Point   Frequency
0            2
1            5
2            7
3            14
4            15
5            8
6            5
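Reading the mode off a frequency table means finding the data point with the largest frequency. A minimal Python sketch, with a made-up function name `modes`, returns a list so that ties are reported as several modes rather than averaged:

```python
# Frequency table from the slide: data point -> frequency
freq = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}


def modes(freq_table):
    """Return every data point that shares the highest frequency."""
    top = max(freq_table.values())
    return [value for value, count in freq_table.items() if count == top]


print(modes(freq))
```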
Median
Middle-most value
50% of observations are above the median, 50% are below it
The difference in magnitude between the observations does not matter
Therefore, it is not sensitive to outliers
Formula: Median location = (n + 1)/2
Median = the middle number when data are arranged in numerical order
Data: 3 5 1
Step 1: Arrange in numerical order
1 3 5
Step 2: Pick the middle number (3)
Data: 3 5 7 11 14 15
Median = (7 + 11)/2 = 9
Median
Median Location = (N + 1)/2 = (56 + 1)/2 = 28.5
Median = (3 + 4)/2 = 3.5
Data Point   Frequency
0            2
1            5
2            7
3            14
4            15
5            8
6            5
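The location 28.5 means the median lies halfway between the 28th and 29th ordered values. A small Python sketch can walk the cumulative frequencies to find those two values; the helper names `value_at` and `median_from_table` are made up for illustration.

```python
# Frequency table from the slide: data point -> frequency
freq = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}


def value_at(freq_table, position):
    """Return the data point at a 1-based position in the ordered data."""
    running = 0
    for value in sorted(freq_table):
        running += freq_table[value]   # cumulative frequency so far
        if position <= running:
            return value
    raise ValueError("position is beyond the end of the data")


def median_from_table(freq_table):
    """Return (median location, median) for a frequency table."""
    n = sum(freq_table.values())
    location = (n + 1) / 2                               # 28.5 when n = 56
    below = value_at(freq_table, int(location))          # 28th value
    above = value_at(freq_table, int(location + 0.5))    # 29th value
    return location, (below + above) / 2


print(median_from_table(freq))
```

When N is odd the location is a whole number, both lookups land on the same value, and the average is just that value.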
Mean
The mean (or mean value) of a variable in a set of data is the result of adding up all the observed values of the variable and dividing by the number of cases (the average as the term is most commonly used).
The mean is defined if and only if the variable is at least interval in nature [i.e., interval or ratio].
Mean = Average = $\sum X / N$
$\sum X$ = 191
Mean = 191/56 = 3.41
Data Point   Frequency   Data Point × Frequency
0            2           0
1            5           5
2            7           14
3            14          42
4            15          60
5            8           40
6            5           30
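The arithmetic behind the table can be checked with a few lines of Python: multiply each data point by its frequency, add the products, and divide by the number of observations. The variable names are chosen for this sketch only.

```python
# Frequency table from the slide: data point -> frequency
freq = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}

n = sum(freq.values())                                        # number of observations
total = sum(value * count for value, count in freq.items())   # sum of all observed values
mean = total / n

print(n, total, round(mean, 2))
```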
Advantages and Disadvantages of the Measures:
Median
1. Unaffected by extreme scores
Data: 5 8 11 Median = 8
Data: 5 8 5 million Median = 8
2. Usually its value actually occurs in the data
3. But cannot be entered into equations, because there is no equation that defines it
4. And not as stable from sample to sample, because it depends on the number of scores in the sample
Advantages and Disadvantages of the Measures:
Mean
1. Defined algebraically
2. Stable from sample to sample
3. But usually does not actually occur in the data
4. And heavily influenced by outliers
Data: 5 8 11 Mean = 8
Data: 5 8 5 million Mean = 1,666,671
Measures of Variation
A measure of variation describes how spread out or scattered a set of data is. It is also known as a measure of dispersion or a measure of spread.
Measures of Variation include:
1. The Range
2. The Variance
3. The Standard Deviation
The standard deviation is just the square root of the variance
Range: difference between the extreme values (max - min); the actual values (min - max) are most often reported in the literature rather than the difference
Variance: measure of variation in a sample of data: mean squared deviations of a value from the mean, often referred to as the mean square or MS
Standard deviation: square root of the variance, measures the amount of variation of values around the mean
Example
Heights (in inches) of 5 starting players from basketball team A:
A: 72, 73, 76, 76, 78
The range is the difference between the maximum and minimum values of the data set.
Range of team A: 78 - 72 = 6
The sample standard deviation takes into account all data values. The following procedure is used to find the sample standard deviation.
Step 1.
Find the mean of the data:
$\bar{x} = (72 + 73 + 76 + 76 + 78)/5 = 375/5 = 75$
Step 2.
Find the deviation of each score from the mean.
$x_i$   $x_i - \bar{x}$
72      72 - 75 = -3
73      73 - 75 = -2
76      76 - 75 = 1
76      76 - 75 = 1
78      78 - 75 = 3
Note that the sum of the deviations is zero.
Step 3.
Square each deviation from the mean. Find the sum of the squared deviations.
$x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
72      72 - 75 = -3      9
73      73 - 75 = -2      4
76      76 - 75 = 1       1
76      76 - 75 = 1       1
78      78 - 75 = 3       9
Sum     0                 24
Step 4.
The sample variance is determined by dividing the sum of the squared deviations by (n - 1) (the number of scores minus one).
For Team A, the sample variance is
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{24}{5 - 1} = 6$
Step 5.
The standard deviation is the square root of the variance.
The mathematical formula for the sample standard deviation is
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The sample standard deviation for Team A is
$s = \sqrt{6} \approx 2.45$
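The five steps translate almost line for line into Python. This sketch uses the Team A heights from the example; the variable names are chosen for illustration only.

```python
import math

heights = [72, 73, 76, 76, 78]   # Team A, in inches

n = len(heights)
mean = sum(heights) / n                      # Step 1: the mean
deviations = [x - mean for x in heights]     # Step 2: deviation of each score
squared = [d ** 2 for d in deviations]       # Step 3: squared deviations
variance = sum(squared) / (n - 1)            # Step 4: divide by n - 1
std_dev = math.sqrt(variance)                # Step 5: square root of the variance

print(mean, sum(deviations), sum(squared), variance, round(std_dev, 2))
```

Python's standard library function `statistics.stdev` also divides by n - 1, so it can serve as a cross-check on the hand calculation.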
Measures of Position
Identify the position of a data value in a data set, using various measures of position such as percentiles and quartiles
Are used to locate the relative position of a data value in a data set
Can be used to compare data values from different data sets
Can be used to compare data values within the same data set
Can be used to help determine outliers within a data set
Include the z-score (standard score), percentiles, and quartiles
z-scores
Also called the standard score
Represents the number of standard deviations a score is from the mean
Always round the value to 2 decimal places
Can be used to compare data values from different data sets by converting raw data to a standardized scale
Calculation involves the mean and standard deviation of the data set
Represents the number of standard deviations that a data value is from the mean for a specific distribution
Z-score
Is obtained by subtracting the mean from the given data value and dividing the result by the standard deviation.
The symbol for BOTH population and sample is z
Can be positive, negative or zero
A data point can be considered unusual if its z-score is sufficiently large or small
Formula
Sample: $z = \frac{x - \bar{x}}{s}$
Example
Human body temperatures have a mean of 98.20 degrees and a standard deviation of 0.62 degrees.
Find the z score for temperatures of:
a. 100 degrees
b. 97 degrees
Solution
a. Z = (100 - 98.20)/0.62
Z = 2.90
b. Z = (97 - 98.20)/0.62
Z = -1.94
Significance of Z
Z scores above 2 or below -2 are considered to be UNUSUAL.
Z scores above 3 or below -3 are considered to be VERY UNUSUAL.
So:
The temperature of 100 degrees is UNUSUAL.
The temperature of 97 degrees is ordinary.
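The calculation and the rule of thumb can be sketched together in Python. The function names `z_score` and `describe` are made up for this sketch; the cutoffs of 2 and 3 are the ones given above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def describe(z):
    """Label a z-score using the cutoffs of 2 and 3 from the slide."""
    if abs(z) > 3:
        return "very unusual"
    if abs(z) > 2:
        return "unusual"
    return "ordinary"


for temp in (100, 97):
    z = round(z_score(temp, 98.20, 0.62), 2)   # rounded to 2 decimal places
    print(temp, z, describe(z))
```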
Percentiles
Are position measures used to indicate the position of an individual in a group
Divide the data set into 100 (per cent) equal groups
Used to compare an individual data value with the national norm
Symbolized by P1, P2, ...
Percentile rank indicates the percentage of data values that fall below the specified value
Percentile rank of x = $\frac{B + 0.5E}{n} \times 100$
Where B = number of scores below x, E = number of scores equal to x, n = number of scores
A percentile tells the percent of scores that are lower than a given score.
Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's percentile rank would be:
$\frac{125 + 0.5(1)}{150} \times 100 = 83.67 \approx 84$
Jason's standing in the class at the 84th percentile is as high as or higher than 84% of the graduates.
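A percentile rank can be sketched in Python under one assumption: that the rank counts every score below x plus half of the scores equal to x, which is the usual formula built from B, E and n and which reproduces the 84th percentile here. The function name `percentile_rank` and the way Jason's class is encoded as scores are hypothetical.

```python
def percentile_rank(scores, x):
    """Percent of scores below x, counting half of the scores equal to x."""
    below = sum(s < x for s in scores)    # B
    equal = sum(s == x for s in scores)   # E
    return 100 * (below + 0.5 * equal) / len(scores)


# Jason's class, encoded as hypothetical scores 1..150 where higher is better.
# Graduating 25th out of 150 leaves 125 classmates below him, i.e. score 126.
scores = list(range(1, 151))
jason = percentile_rank(scores, 126)
print(round(jason))
```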
Quartiles
Quartiles divide the data set into 4 groups, each of which has the same number of members.
Q1 corresponds to P25
Q2 corresponds to P50 or the median
Q3 corresponds to P75
Q1, Q2, and Q3 divide the ranked scores into four equal parts
Example
Find Q1, Q2, and Q3.
Q2 (Median)
The median is the average of the 6th and 7th scores.
Q2 = (80.2 + 82.5)/2 = 81.35
Q1
Find the median of the first 6 scores.
Q1 = (78.6 + 79.2)/2 = 78.9
Q3
Find the median of the last 6 scores.
Q3 = (84.3 + 84.6)/2 = 84.45
THE END