Unit-1
Statistics
Definition 1 :-
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In applying statistics to a scientific, industrial, or societal problem, it is necessary to begin with a population or process to be studied. Populations can be diverse topics such as "all persons living in a country" or "every atom composing a crystal". Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments.
Definition 2 :-
Statistics is a science of facts and figures and nothing beyond that. It is the measurement of data and the expression of that data in numerical form.
Uses of statistics:
1. It is more quantitative than qualitative
2. Statistical methods deal with two fundamental principles
3. Statistical units
4. Statistical data must be manipulated
5. Presentation of statistical data with the help of line diagrams
1. It is more quantitative than qualitative: Social statistics that present the data of an area must be numerical in nature, so that the tendency of a project can be measured. A figure such as a percentage is understood quickly by everyone who hears it, so numerical data is easy to record and easy to understand.
2. Statistical methods deal with two fundamental principles:
Fundamental regularity based on mathematical probability
The capacity of the researcher
Fundamental regularity based on mathematical probability: It states that every social phenomenon is influenced by a large number of variables, which are correlated and interrelated, and statistics is used to study this correlation. Therefore the theory of probability, linear programming and shadow prices are used to find out the reality.
The capacity of the researcher: Statistical terms are necessary to substantiate findings and conclusions, and they protect the researcher or scholar from danger and challenges. It is the data, facts and figures that show the capacity of the researcher. The skills and resources used by the researcher must be applied in the research findings.
3. Statistical units:
A statistical unit has four characteristics: appropriateness, clarity, measurability and comparability.
4. Statistical data must be manipulated: The statistical data must be manipulated, divided and totaled to formulate some conclusions.
5. Presentation of statistical data with the help of line diagrams: Statistical data is presented with the help of line diagrams, graphs, charts, histograms, frequency distributions, pie diagrams, etc.
Limitations of statistics:
Statistics is indispensable to almost all sciences - social, physical and natural. It is very often used in most spheres of human activity. In spite of the wide scope of the subject, it has certain limitations. Some important limitations of statistics are the following:
1. Statistics does not study qualitative phenomena:
Statistics deals with facts and figures. So the quality aspect of a variable, or a subjective phenomenon, falls outside the scope of statistics. For example, qualities like beauty, honesty and intelligence cannot be expressed numerically. So these characteristics cannot be examined statistically. This limits the scope of the subject.
2. Statistical laws are not exact:
Statistical laws are not exact, as the laws of the natural sciences are. These laws are true only on average. They hold good under certain conditions. They cannot be universally applied. So statistics has less practical utility.
3. Statistics does not study individuals:
Statistics deals with aggregates of facts. Single or isolated figures are not statistics. This is considered to be a major handicap of statistics.
4. Statistics can be misused:
Statistics is mostly a tool of analysis. Statistical techniques are used to analyze and interpret the information collected in an enquiry. By itself, statistics does not prove or disprove anything. It is just a means to an end. Statements supported by statistics are more appealing and are commonly believed. For this reason, statistics is often misused. Statistical methods rightly used are beneficial, but if misused they become harmful. Statistical methods used by less expert hands will lead to inaccurate results. Here the fault does not lie with the subject of statistics but with the person who makes wrong use of it.
Frequency Distribution
Frequency:- Frequency is how often something occurs.
Example: Sam played football on Saturday morning, Saturday afternoon and Thursday afternoon. The frequency was 2 on Saturday, 1 on Thursday and 3 for the whole week.
Frequency Distribution
By counting frequencies we can make a Frequency Distribution table.
Example: Goals
Sam's team has scored the following numbers of goals in recent games:
2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3
Sam put the numbers in order, then added up how often 1 occurs (2 times), how often 2 occurs (5 times), etc., and wrote them down as a Frequency Distribution table:
Goals Frequency
1 2
2 5
3 4
4 2
5 1
From the table we can see interesting things, such as:
getting 2 goals happens most often
only once did they get 5 goals
Frequency Distribution:- values and their frequency (how often each value occurs).
Example: Newspapers
These are the numbers of newspapers sold at a local shop over the last 10 days:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20
Let us count how many of each number there are:
Papers Sold Frequency
18 2
19 0
20 4
21 0
22 2
23 1
24 0
25 1
It is also possible to group the values. Here they are grouped in 5s:
Papers Sold Frequency
15-19 2
20-24 7
25-29 1
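The counting in the newspaper example can be sketched in Python. This is only an illustration: the helper names `frequency_table` and `grouped_table` are made up for this sketch, and the grouping assumes whole-number values placed in classes of equal width.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency) pairs for every integer from min to max."""
    counts = Counter(values)
    return [(v, counts.get(v, 0)) for v in range(min(values), max(values) + 1)]

def grouped_table(values, start, width):
    """Count integer values in consecutive classes: start..start+width-1, and so on."""
    counts = Counter((v - start) // width for v in values)
    rows = []
    for k in range(max(counts) + 1):
        low = start + k * width
        rows.append((f"{low}-{low + width - 1}", counts.get(k, 0)))
    return rows

papers = [22, 20, 18, 23, 20, 25, 22, 20, 18, 20]

for value, freq in frequency_table(papers):
    print(value, freq)

for label, freq in grouped_table(papers, start=15, width=5):
    print(label, freq)
```

Values that never occur (19, 21 and 24) are listed with frequency 0, matching the ungrouped table.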
Frequency Curve
A frequency curve is the smooth curve that the histogram of a continuous frequency distribution approaches as the number of data points becomes very large.
Measures of Central Tendency
Introduction
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. Measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode.
The mean, median and mode are all valid measures of central tendency.
Mean (Arithmetic)
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set and they have values $x_1, x_2, x_3, \ldots, x_n$, the sample mean, usually denoted by $\bar{x}$ (pronounced "x bar"), is:
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
This formula is usually written in a slightly different manner using the Greek capital letter $\Sigma$, pronounced "sigma", which means "sum of...":
$$\bar{x} = \frac{\sum x}{n}$$
When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to reflect the typical salary of a worker, as most workers have salaries in the $12k to $18k range. The mean is being skewed by the two large salaries. Therefore, in this situation, we would like to have a better measure of central tendency.
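A short Python sketch shows how strongly the two large salaries pull the mean. The 50k cut-off used to set them aside is a hypothetical threshold chosen only for this illustration; it is not part of the original example.

```python
from statistics import mean

salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]  # salaries in thousands of dollars

overall_mean = mean(salaries_k)

# Hypothetical cut-off for this illustration only: treat salaries above 50k as outliers.
typical = [s for s in salaries_k if s <= 50]
mean_without_outliers = mean(typical)

print(overall_mean)           # 30.7
print(mean_without_outliers)  # 15.25
```

With the two outliers set aside, the mean falls from 30.7k to 15.25k, which sits inside the range where most salaries lie.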
Median
The median is the middle score for a set of data that has been arranged in order of magnitude.
If the number of values is even, the average of the two middle values is taken.
When the data contain outliers, the median is better than the mean for describing the typical value.
Example:-
In order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark, in this case 56 (the sixth of the eleven ordered values).
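The procedure (sort, then take the middle value or the average of the two middle values) can be written as a small function. This is a minimal sketch; Python's standard library offers the same calculation as `statistics.median`.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(median(marks))  # 56

# The factory salaries (in thousands) from the section on when not to use the mean.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]
print(median(salaries_k))  # 15.5
```

For the salary data the median is 15.5k, far closer to a typical wage than the mean of 30.7k.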
Mode
The mode is the most frequent score in our data set.
What will happen to the measures of central tendency if we add the same amount to all data values, or multiply each data value by the same amount?
Data set | Values | Mean | Mode | Median
Original data set | 6, 7, 8, 10, 12, 14, 14, 15, 16, 20 | 12.2 | 14 | 13
Add 3 to each data value | 9, 10, 11, 13, 15, 17, 17, 18, 19, 23 | 15.2 | 17 | 16
Multiply each data value by 2 | 12, 14, 16, 20, 24, 28, 28, 30, 32, 40 | 24.4 | 28 | 26
When added: Since all values are shifted the same amount, the measures of central tendency all shifted by the same amount. If you add 3 to each data value, you will add 3 to the mean, mode and median.
When multiplied: Since all values are multiplied by the same factor, the measures of central tendency are multiplied by that factor too. If you multiply each data value by 2, you will multiply the mean, mode and median by 2.
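The shift and scale behaviour can be checked directly with Python's `statistics` module. The helper name `summary` is only a convenience for this sketch.

```python
from statistics import mean, median, mode

data = [6, 7, 8, 10, 12, 14, 14, 15, 16, 20]

def summary(values):
    """Return (mean, mode, median) for a list of numbers."""
    return mean(values), mode(values), median(values)

original = summary(data)
shifted = summary([x + 3 for x in data])   # add 3 to each value
scaled = summary([x * 2 for x in data])    # multiply each value by 2

print(original)  # (12.2, 14, 13.0)
print(shifted)   # (15.2, 17, 16.0)
print(scaled)    # (24.4, 28, 26.0)
```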
Example :-1
Find the mean, median and mode for the following data: 5, 15, 10, 15, 5, 10, 10, 20, 25, 15.
Answer:- (You will need to organize the data.) 5, 5, 10, 10, 10, 15, 15, 15, 20, 25
Mean: $\frac{\text{sum of data}}{\text{number of data}} = \frac{130}{10} = 13$
Median: 5, 5, 10, 10, 10, 15, 15, 15, 20, 25. Listing the data in order is the easiest way to find the median.
The numbers 10 and 15 both fall in the middle.
Average these two numbers to get the median: $\frac{10 + 15}{2} = 12.5$
Mode: Two numbers appear most often: 10 and 15. There are three 10's and three 15's. In this example there are two answers for the mode.
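As a cross-check of Example 1, the same three measures can be computed with Python's standard library. `multimode` (available from Python 3.8) returns every value that shares the highest frequency, which suits data with two modes.

```python
from statistics import mean, median, multimode

data = [5, 15, 10, 15, 5, 10, 10, 20, 25, 15]

print(sorted(data))             # [5, 5, 10, 10, 10, 15, 15, 15, 20, 25]
print(mean(data))               # 13
print(median(data))             # 12.5
print(sorted(multimode(data)))  # [10, 15]
```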
Example :- 2 For what value of x will 8 and x have the same mean (average) as 27 and 5?
Answer:-
First, find the mean of 27 and 5:
$$\frac{27 + 5}{2} = 16$$
Now, find the x value, knowing that the average of x and 8 must be 16:
$$\frac{x + 8}{2} = 16$$
$$\Rightarrow 32 = x + 8 \quad \text{(cross multiply)}$$
$$\Rightarrow x = 32 - 8 = 24$$
Example :- 3 On his first 5 biology tests, Bob received the following scores: 72, 86, 92, 63, and 77. What test score must Bob earn on his sixth test so that his average (mean score) for all six tests will be 80? Show how you arrived at your answer.
Answer:- Possible solution: Set up an equation to represent the situation. Remember to use all 6 test scores:
$$\frac{72 + 86 + 92 + 63 + 77 + x}{6} = 80$$
Cross multiply and solve:
$$(80)(6) = 390 + x \Rightarrow 480 = 390 + x \Rightarrow x = 480 - 390 = 90$$
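Examples 2 and 3 follow the same pattern: multiply the target mean by the total count, then subtract the values already known. A minimal sketch, where the function name `value_needed_for_mean` and its `unknown_count` parameter are hypothetical names introduced for illustration:

```python
def value_needed_for_mean(known_values, target_mean, unknown_count=1):
    """Value each of `unknown_count` equal unknowns must take so the overall mean equals target_mean."""
    total_count = len(known_values) + unknown_count
    required_total = target_mean * total_count       # the "cross multiply" step
    return (required_total - sum(known_values)) / unknown_count

print(value_needed_for_mean([8], 16))                   # 24.0  (Example 2)
print(value_needed_for_mean([72, 86, 92, 63, 77], 80))  # 90.0  (Example 3)
```

Setting `unknown_count=2` handles the case where two unknown values are equal, since the remaining total is then shared between them.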
Example:- 4 The mean (average) weight of three dogs is 38 pounds. One of the dogs, Sparky, weighs 46 pounds. The other two dogs, Eddie and Sandy, have the same weight. Find Eddie's weight.
Answer:- Let x = Eddie's weight and x = Sandy's weight (they weigh the same, so both are represented by "x").
Average: sum of the data divided by the number of data.
$$\frac{x + x + 46}{3 \text{ (dogs)}} = 38$$
Cross multiply and solve:
$$(38)(3) = 2x + 46 \Rightarrow 114 = 2x + 46$$
$$2x = 114 - 46 \Rightarrow x = \frac{68}{2} = 34$$
$\therefore$ Eddie weighs 34 pounds.
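For Class interval:
$$\text{Median} = L + \left(\frac{\frac{N}{2} - cf}{f}\right) \times i$$
where L = lower limit of the median class
N = total number of data items
cf = cumulative frequency of the class preceding the median class
f = frequency of the median class
i = the class interval (width) of the median class
$$\text{Mode} = l + \frac{C(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}}$$
where l = lower limit of the modal class
$f_m$ = maximum frequency (the frequency of the modal class)
$f_{m-1}$, $f_{m+1}$ = frequencies of the classes just before and just after the modal class
C = constant class width (the same for each class)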
Question:- Find the median of the following data.
Cost 10-20 20-30 30-40 40-50 50-60
Items in a group 4 5 3 6 3
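Solution:-
Cost | Number of items in the group | Cumulative frequency
10-20 | 4 | 4
20-30 | 5 | 9
30-40 | 3 | 12
40-50 | 6 | 18
50-60 | 3 | 21
Here N = 21, so $\frac{N}{2} = 10.5$.
The median class is 30-40, the first class whose cumulative frequency (12) reaches 10.5.
From the formula,
$$\text{Median} = L + \left(\frac{\frac{N}{2} - cf}{f}\right) \times i$$
L = 30, i = 10, cf = 9 (cumulative frequency of the class before the median class), f = 3 (frequency of the median class)
$$\text{Median} = 30 + \frac{10.5 - 9}{3} \times 10 = 30 + 5 = 35$$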
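The grouped-median formula can be sketched as a small Python function. The name `grouped_median` and the representation of classes as (lower, upper) pairs are assumptions made for this illustration. Applied to the cost data with L = 30, N = 21, cf = 9, f = 3 (the frequency of the 30-40 class) and class width 10, it returns 35.

```python
def grouped_median(classes, frequencies):
    """Median of grouped data.

    classes: list of (lower, upper) limits in ascending order.
    frequencies: frequency of each class, in the same order.
    """
    total = sum(frequencies)              # N
    if total <= 0:
        raise ValueError("the total frequency must be positive")
    half = total / 2                      # N / 2
    cumulative = 0                        # cf: cumulative frequency before the current class
    for (lower, upper), f in zip(classes, frequencies):
        if cumulative + f >= half:        # first class whose cumulative frequency reaches N/2
            width = upper - lower         # class interval of the median class
            return lower + (half - cumulative) / f * width
        cumulative += f

classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
items = [4, 5, 3, 6, 3]
print(grouped_median(classes, items))  # 35.0
```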
Question:- Find the Mode of the following distribution:
Class Interval 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Frequency 5 9 8 12 28 20 12 11
Solution:-
Maximum frequency = 28, modal class = 40-50
From the formula,
$$\text{Mode} = l + \frac{C(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}}$$
$l = 40,\ C = 10,\ f_m = 28,\ f_{m-1} = 12,\ f_{m+1} = 20$
$$\text{Mode} = 40 + \frac{10(28 - 12)}{(2 \times 28) - 12 - 20} = 40 + 6.667 = 46.667$$
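The grouped-mode formula can be sketched in the same way. The function name `grouped_mode` is hypothetical, and treating a missing neighbouring class as frequency 0 (when the modal class is the first or last class) is an assumption of this sketch rather than something stated in the notes.

```python
def grouped_mode(classes, frequencies):
    """Mode of grouped data with equal class widths.

    classes: list of (lower, upper) limits in ascending order.
    frequencies: frequency of each class, in the same order.
    """
    m = frequencies.index(max(frequencies))      # position of the modal class
    lower, upper = classes[m]
    width = upper - lower                        # C
    f_m = frequencies[m]
    # Assumption: a neighbouring class that does not exist counts as frequency 0.
    f_prev = frequencies[m - 1] if m > 0 else 0
    f_next = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    return lower + width * (f_m - f_prev) / (2 * f_m - f_prev - f_next)

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [5, 9, 8, 12, 28, 20, 12, 11]
print(round(grouped_mode(classes, frequencies), 3))  # 46.667
```

The denominator becomes zero when the modal class and both neighbours have the same frequency, so the sketch is only meaningful when a distinct peak exists.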
FAQs - Measures of Central Tendency
What is the best measure of central tendency?
There can often be a "best" measure of central tendency with regards to the data you are analyzing, but there is no one "best" measure of central tendency. This is because whether you use the median, mean or mode will depend on the type of data you have, such as nominal or continuous data; whether your data has outliers and/or is skewed; and what you are trying to show from your data.
In a strongly skewed distribution, what is the best indicator of central tendency?
It is usually inappropriate to use the mean in situations where your data is skewed. You would normally choose the median or mode, with the median usually preferred. This is discussed above under the heading "When not to use the mean".
Does all data have a median, mode and mean?
Yes and no. All continuous data has a median, mode and mean. However, strictly speaking, ordinal data has a median and mode only, and nominal data has only a mode. However, a consensus has not been reached among statisticians about whether the mean can be used with ordinal data, and you can often see a mean reported for Likert data in research.
When is the mean the best measure of central tendency?
The mean is usually the best measure of central tendency to use when your data distribution is continuous and symmetrical, such as when your data is normally distributed. However, it all depends on what you are trying to show from your data.
When is the mode the best measure of central tendency?
The mode is the least used of the measures of central tendency, and it is the only one that can be used with nominal data. For this reason, the mode will be the best measure of central tendency (as it is the only one appropriate to use) when dealing with nominal data. The mean and/or median are usually preferred when dealing with all other types of data, but this does not mean the mode is never used with these data types.
When is the median the best measure of central tendency?
The median is usually preferred to other measures of central tendency when your data set is skewed (i.e., forms a skewed distribution) or you are dealing with ordinal data. However, the mode can also be appropriate in these situations, but is not as commonly used as the median.
What is the most appropriate measure of central tendency when the data has outliers?
The median is usually preferred in these situations because the value of the mean can be distorted by the outliers. However, it will depend on how influential the outliers are. If they do not significantly distort the mean, using the mean as the measure of central tendency will usually be preferred.
In a normally distributed data set, which is greatest: mode, median or mean?
If the data set is perfectly normal, the mean, median and mode are equal to each other (i.e., the same value).
For any data set, which measures of central tendency have only one value?
The median and mean can only have one value for a given data set. The mode can have more than one value.
MERITS AND DEMERITS OF MEAN, MEDIAN AND MODE
MEAN
The arithmetic mean (or simply "mean") of a sample is the sum of the sampled values divided by the number of items in the sample.
MERITS OF ARITHMETIC MEAN
1. The arithmetic mean is rigidly defined by an algebraic formula.
2. It is easy to calculate and simple to understand.
3. It is based on all observations and can be regarded as representative of the given data.
4. It is capable of being treated mathematically and hence it is widely used in statistical analysis.
5. The arithmetic mean can be computed even if the detailed distribution is not known, as long as the sum of the observations and the number of observations are known.
6. It is least affected by fluctuations of sampling.
DEMERITS OF ARITHMETIC MEAN
1. It can be determined neither by inspection nor by graphical location.
2. The arithmetic mean cannot be computed for qualitative data, such as data on intelligence, honesty and smoking habits.
3. It is strongly affected by extreme observations and hence does not adequately represent data containing some extreme points.
4. The arithmetic mean cannot be computed when class intervals have open ends.
MEDIAN
The median is that value of the series which divides the group into two equal parts, one part comprising all values greater than the median value and the other part comprising all the values smaller than the median value.
MERITS OF MEDIAN
(1) Simplicity:- It is a very simple measure of the central tendency of the series. In the case of a simple statistical series, just a glance at the data is enough to locate the median value.
(2) Free from the effect of extreme values:- Unlike the arithmetic mean, the median value is not distorted by the extreme values of the series.
(3) Certainty:- Certainty is another merit of the median. The median is always a certain, specific value in the series.
(4) Real value:- The median is a real value and is a better representative value of the series than the arithmetic mean, the value of which may not exist in the series at all.
(5) Graphic presentation:- Besides the algebraic approach, the median value can also be estimated through the graphic presentation of data.
(6) Possible even when data is incomplete:- The median can be estimated even in the case of certain incomplete series. It is enough if one knows the number of items and the middle item of the series.
DEMERITS OF MEDIAN
Following are the various demerits of median:
(1) Lack of representative character:- The median fails to be a representative measure for series whose values are wide apart from each other. Also, the median is of limited representative character as it is not based on all the items in the series.
(2) Unrealistic:- When the median is located somewhere between the two middle values, it remains only an approximate measure, not a precise value.
(3) Lack of algebraic treatment:- The arithmetic mean is capable of further algebraic treatment, but the median is not. For example, multiplying the median by the number of items in the series will not give us the sum total of the values of the series.
However, the median is quite a simple method of finding an average of a series. It is quite commonly used for series related to qualitative observations, such as the health of the student.
MODE
The value of the variable which occurs most frequently in a distribution is called the mode.
MERITS OF MODE
Following are the various merits of mode:
(1) Simple and popular:- The mode is a very simple measure of central tendency. Sometimes, just a glance at the series is enough to locate the modal value. Because of its simplicity, it is a very popular measure of the central tendency.
(2) Less effect of marginal values:- Compared to the mean, the mode is less affected by marginal values in the series. The mode is determined only by the value with the highest frequency.
(3) Graphic presentation:- The mode can be located graphically, with the help of a histogram.
(4) Best representative:- The mode is the value which occurs most frequently in the series. Accordingly, the mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies:- The calculation of the mode does not require knowledge of all the items and frequencies of a distribution. In a simple series, it is enough if one knows the items with the highest frequencies in the distribution.
DEMERITS OF MODE
Following are the various demerits of mode:
(1) Uncertain and vague:- The mode is an uncertain and vague measure of the central tendency.
(2) Not capable of algebraic treatment:- Unlike the mean, the mode is not capable of further algebraic treatment.
(3) Difficult:- When the frequencies of all items are identical, it is difficult to identify the modal value.
(4) Complex procedure of grouping:- Calculation of the mode involves a cumbersome procedure of grouping the data. If the extent of grouping changes, there will be a change in the modal value.
(5) Ignores extreme marginal frequencies:- It ignores extreme marginal frequencies. To that extent the modal value is not a representative value of all the items in a series. Besides, one can question the representative character of the modal value, as its calculation does not involve all items of the series.
Dispersion
In statistics, dispersion (also called variability, scatter, or spread) denotes how stretched or squeezed a distribution is (whether theoretical or underlying a statistical sample). Common examples of measures of statistical dispersion are the variance, standard deviation and interquartile range.
Dispersion is contrasted with location or central tendency, and together they are the most used properties of distributions.
Measures of dispersion
The set of constants which would, in a concise way, explain the "variability" or "scatter" in data is called "measures of dispersion or variability".
The average for two groups of the same number of measurements may be equal, but one group may be more variable than the other.
e.g. the set of five values 5, 6, 7, 8, 9 has mean 7, while another set of five values 1, 6, 4, 10, 14 also has mean 7. The second set has more variability than the first.
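The two sets can be compared numerically. The sketch below uses the range and the population standard deviation (`statistics.pstdev`, which divides by n), both of which are defined later in this unit; it is only an illustration of "same mean, different spread".

```python
from statistics import mean, pstdev

first = [5, 6, 7, 8, 9]
second = [1, 6, 4, 10, 14]

for name, values in (("first", first), ("second", second)):
    value_range = max(values) - min(values)          # largest value minus smallest value
    print(name, mean(values), value_range, round(pstdev(values), 3))
```

Both sets report a mean of 7, but the second set has the larger range (13 against 4) and the larger standard deviation.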
Usually four measures of dispersion or variability are defined.
Range:-
The Range is the difference between the two extreme values.
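In a frequency distribution, $R = (\text{largest } x \text{ value}) - (\text{smallest } x \text{ value})$
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9. So the range is 9 - 3 = 6.
Quartile deviation:-
The median bisects the distribution. If the distribution is divided into four parts, quartiles are obtained. The first quartile is $Q_1$ and the third quartile is $Q_3$.
$$Q_1 = l + \frac{\left(\frac{N}{4} - cf_1\right)}{f} \times C \qquad Q_3 = l + \frac{\left(\frac{3N}{4} - cf_3\right)}{f} \times C$$
where $l$ = lower limit of the quartile class, $C$ = common factor (class width), $cf_1$ and $cf_3$ = cumulative frequencies of the classes preceding the respective quartile classes, and $f$ = frequency of the quartile class.
Quartile deviation is defined as $Q.D. = \frac{1}{2}(Q_3 - Q_1)$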
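Average Deviation:- If the chosen average is A, then the average of the absolute deviations from A is the average deviation about A.
$$A.D.(A) = \frac{1}{n}\sum |x_i - A| \quad \text{for discrete data}$$
$$A.D.(A) = \frac{1}{N}\sum f_i |x_i - A| \quad \text{for a frequency distribution, where } N = \sum f_i$$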
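A minimal Python sketch of the average deviation, covering both plain data and a frequency distribution. The function name `average_deviation` and its parameters are hypothetical; when no average A is supplied, the sketch assumes the mean is wanted.

```python
def average_deviation(values, about=None, frequencies=None):
    """Mean absolute deviation of `values` about `about` (default: the mean)."""
    if frequencies is None:
        frequencies = [1] * len(values)       # plain data: every value counted once
    total = sum(frequencies)                  # N, the total frequency
    if about is None:
        about = sum(f * x for x, f in zip(values, frequencies)) / total
    return sum(f * abs(x - about) for x, f in zip(values, frequencies)) / total

print(average_deviation([5, 6, 7, 8, 9]))     # 1.2
print(average_deviation([1, 6, 4, 10, 14]))   # 4.0
```

The two data sets are the ones used to introduce dispersion: both have mean 7, but their average deviations about the mean are 1.2 and 4.0.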
Standard deviation:-
$$\sigma = \sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2} \quad \text{for discrete data}$$
$$\sigma = \sqrt{\frac{1}{N}\sum f_i (x_i - \bar{x})^2} \quad \text{for a frequency distribution}$$
The square of the standard deviation, $\sigma^2$, is defined as the variance (V).
$$V = \sigma^2 = \frac{1}{N}\sum f_i (x_i - \bar{x})^2$$
Coefficient of variation
In probability theory and statistics, the coefficient of variation (CV) is a standardized measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$. It is also known as unitized risk or the variation coefficient. The absolute value of the CV is sometimes known as relative standard deviation (RSD), which is expressed as a percentage.
Definition
The coefficient of variation (CV) is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$:
$$C_v = \frac{\sigma}{\mu}$$
It shows the extent of variability in relation to mean of the population.
Example:- The owner of a restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected receipts for parties of four and writes down the following data: 44, 50, 38, 96, 42, 47, 40, 39, 46, 50
Find the mean, standard deviation and variance.
Solution:-
The mean is calculated by adding the values and dividing by 10.
Mean = $\bar{x} = \frac{492}{10} = 49.2$
The following table is used to find the standard deviation:
P | P - 49.2 | (P - 49.2)^2
44 | -5.2 | 27.04
50 | 0.8 | 0.64
38 | -11.2 | 125.44
96 | 46.8 | 2190.24
42 | -7.2 | 51.84
47 | -2.2 | 4.84
40 | -9.2 | 84.64
39 | -10.2 | 104.04
46 | -3.2 | 10.24
50 | 0.8 | 0.64
Total | | 2599.6
Because the ten receipts are a sample, the sum of squares is divided by n - 1 = 9:
Standard deviation = $\sigma = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} = \sqrt{\frac{2599.6}{10 - 1}} = \sqrt{288.84} \approx 16.995 \approx 17$
Variance = $\sigma^2 = 288.84$
Coefficient of variation (C.V.) = $C_v = \frac{\sigma}{\bar{x}} = \frac{16.995}{49.2} \approx 0.3454$
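The restaurant example can be recomputed with Python's `statistics` module as a check on the arithmetic. `variance` and `stdev` divide by n - 1 (the sample versions), matching the division by 9 in the solution; the variable names are illustrative only.

```python
from statistics import mean, stdev, variance

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

x_bar = mean(receipts)
sum_of_squares = sum((x - x_bar) ** 2 for x in receipts)
sample_variance = variance(receipts)   # sum of squares divided by n - 1
sample_sd = stdev(receipts)            # square root of the sample variance
cv = sample_sd / x_bar                 # coefficient of variation

print(x_bar)                      # 49.2
print(round(sum_of_squares, 2))   # 2599.6
print(round(sample_variance, 2))  # 288.84
print(round(sample_sd, 3))        # 16.995
print(round(cv, 4))               # 0.3454
```

Run on the ten receipts, the sum of squared deviations comes to 2599.6, giving a sample variance of about 288.84 and a standard deviation of about 17.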