LESSON 1

NATURE, SCOPE

AND LIMITATIONS OF STATISTICS

Introduction

The term “statistics” is used in two senses : first in plural sense meaning a collection of numerical

facts or estimates, i.e., the figures themselves. It is in this sense that the public usually thinks of statistics, e.g.,

figures relating to population, profits of different units in an industry etc. Secondly, as a singular noun, the term

‘statistics’ denotes the various methods adopted for the collection, analysis and interpretation of the facts

numerically represented. In singular sense, the term ‘statistics’ is better described as statistical

methods. In our study of the subject, we shall be more concerned with the second meaning of the word

‘statistics’.

Definition

Statistics has been defined differently by different authors and each author has assigned new limits

to the field which should be included in its scope. We can do no better than give selected definitions of

statistics by some authors and then come to the conclusion about the scope of the subject.

A.L. Bowley defines, “Statistics may be called the science of counting”. At another place he defines,

“Statistics may be called the science of averages”. Both these definitions are narrow and throw light only on

one aspect of Statistics.

According to King, “The science of statistics is the method of judging collective, natural or social,

phenomenon from the results obtained from the analysis or enumeration or collection of estimates”.

Many a time counting is not possible and estimates are required to be made. Therefore, Boddington

defines it as “the science of estimates and probabilities”. But this definition also does not cover the entire

scope of statistics.

The statistical methods are methods for the collection, analysis and interpretation of numerical data

and form a basis for the analysis and comparison of the observed phenomena. In the words of Croxton &

Cowden, “Statistics may be defined as the collection, presentation, analysis and interpretation of numerical

data”.

Horace Secrist has given an exhaustive definition of the term statistics in the plural sense.

According to him:

“By statistics we mean aggregates of facts affected to a marked extent by a multiplicity of causes

numerically expressed, enumerated or estimated according to reasonable standards of accuracy collected in

a systematic manner for a pre-determined purpose and placed in relation to each other”.

This definition makes it quite clear that as numerical statements of facts, ‘statistics’ should possess the

following characteristics.

1. Statistics are aggregates of facts

A single age of 20 or 30 years is not statistics; a series of ages is. Similarly, a single figure relating

to production, sales, birth, death etc., would not be statistics although aggregates of such figures would be

statistics because of their comparability and relationship.

2. Statistics are affected to a marked extent by a multiplicity of causes

A number of causes affect statistics in a particular field of enquiry, e.g., production statistics are

affected by climate, soil fertility, availability of raw materials and methods of quick transport.

3. Statistics are numerically expressed, enumerated or estimated

The subject of statistics is concerned essentially with facts expressed in numerical form—with their

quantitative details but not qualitative descriptions. Therefore, facts indicated by terms such as ‘good’, ‘poor’

are not statistics unless a numerical equivalent is assigned to each expression. Also, the figures may be either

enumerated or estimated, the latter where actual enumeration is either not possible or is very difficult.

4. Statistics are enumerated or estimated according to reasonable standards of accuracy

Personal bias and prejudices of the enumerator should not enter into the counting or estimation of

figures, otherwise conclusions from the figures would not be accurate. The figures should be counted or

estimated according to reasonable standards of accuracy. Absolute accuracy is neither necessary nor sometimes

possible in social sciences. But whatever standard of accuracy is once adopted, should be used throughout

the process of collection or estimation.

5. Statistics should be collected in a systematic manner for a predetermined purpose

The statistical methods to be applied depend on the purpose of the enquiry, since figures are always collected

with some purpose. If there is no predetermined purpose, all the efforts in collecting the figures may prove to

be wasteful. The purpose of a series of ages of husbands and wives may be to find whether young husbands

have young wives and the old husbands have old wives.

6. Statistics should be capable of being placed in relation to each other

The collected figures should be comparable and well-connected in the same department of inquiry.

Ages of husbands are to be compared only with the corresponding ages of wives, and not with, say, heights

of trees.

Functions of Statistics

The functions of statistics may be enumerated as follows :

(i) To present facts in a definite form : Without a statistical study our ideas are likely to be

vague, indefinite and hazy, but figures help us to represent things in their true perspective. For

example, the statement that some students out of 1,400 who had appeared for a certain

examination, were declared successful would not give as much information as the one that 300

students out of 400 who took the examination were declared successful.

(ii) To simplify unwieldy and complex data : It is not easy to treat large numbers and hence they

are simplified either by taking a few figures to serve as a representative sample or by taking

an average to give a bird’s eye view of the large masses. For example, complex data may be

simplified by presenting them in the form of a table, graph or diagram, or representing them through

an average etc.

(iii) To use it as a technique for making comparisons : The significance of certain figures can

be better appreciated when they are compared with others of the same type. The comparison

between two different groups is best represented by certain statistical methods, such as averages,

coefficients, rates, ratios, etc.

(iv) To enlarge individual experience : An individual’s knowledge is limited to what he can

observe and see; and that is a very small part of the social organism. His knowledge is extended

in various ways by studying certain conclusions and results, the basis of which are numerical

investigations. For example, we all have a general impression that the cost of living has increased.

But to know to what extent the increase has occurred, and how far the rise in prices has

affected different income groups, it would be necessary to ascertain the rise in prices of articles

consumed by them.

(v) To provide guidance in the formulation of policies : The purpose of statistics is to enable

correct decisions, whether they are taken by a businessman or Government. In fact statistics is

a great servant of business in management, governance and development. Sampling methods

are employed in industry in tackling the problem of standardisation of products. Big business

houses maintain a separate department for statistical intelligence, the work of which is to collect,

compare and coordinate figures for formulating future policies of the firm regarding production

and sales.

(vi) To enable measurement of the magnitude of a phenomenon : But for the development of

the statistical science, it would not be possible to estimate the population of a country or to

know the quantity of wheat, rice and other agricultural commodities produced in the country

during any year.

Importance of Statistics

These days statistical methods are applicable everywhere. There is no field of work in which statistical

methods are not applied. According to A.L. Bowley, “A knowledge of statistics is like a knowledge of foreign

languages or of Algebra, it may prove of use at any time under any circumstances”. The importance of the

statistical science is increasing in almost all spheres of knowledge, e.g., astronomy, biology, meteorology,

demography, economics and mathematics. Economic planning without statistics is bound to be baseless.

Statistics serve in administration, and facilitate the work of formulation of new policies. Financial institutions

and investors utilise statistical data to summarise past experience. Statistics are also helpful to an auditor,

when he uses sampling techniques or test checking to audit the accounts of his client.

Limitations of Statistics

The scope of the science of statistics is restricted by certain limitations :

1. The use of statistics is limited to numerical studies : Statistical methods cannot be applied to

study the nature of all types of phenomena. Statistics deal with only such phenomena as are

capable of being quantitatively measured and numerically expressed. For example, the health,

poverty and intelligence of a group of individuals, cannot be quantitatively measured, and thus

are not suitable subjects for statistical study.

2. Statistical methods deal with population or aggregate of individuals rather than with individuals.

When we say that the average height of an Indian is 1 metre 80 centimetres, it shows the height

not of an individual but as found by the study of all individuals.

3. Statistics relies on estimates and approximations : Statistical laws are not exact laws like

mathematical or chemical laws. They are derived by taking a majority of cases and are not true

for every individual. Thus the statistical inferences are uncertain.

4. Statistical results might lead to fallacious conclusions by deliberate manipulation of figures and

unscientific handling. This is so because statistical results are represented by figures, which are

liable to be manipulated. Also, data placed in the hands of a non-expert may lead to fallacious

results. The figures may be stated without their context or may be applied to a fact other than

the one to which they really relate. An interesting example is a survey made some years ago

which reported that 33% of all the girl students at Johns Hopkins University had married University

teachers. In fact, the University had only three girl students at that time, and one of them had

married a teacher.

Distrust of Statistics

Owing to the limitations of statistics, an attitude of distrust towards it has developed. There are some

people who place statistics in the category of lying and maintain that, “there are three degrees of comparison

in lying: lies, damned lies and statistics”. But this attitude is not correct. The person who is handling statistics

may be a liar or inexperienced. But that would be the fault not of statistics but of the person handling them.

The person using statistics should not take them at their face value. He should check the result from an

independent source. Also only experts should handle the statistics otherwise they may be misused. It may be

noted that the distrust of statistics is due more to insufficiency of knowledge regarding the nature, limitations

and uses of statistics than to any fundamental inadequacy in the science of statistics. Medicines are meant

for curing people, but if they are unscientifically handled by quacks, they may prove fatal to the patient. In both

the cases, the medicine is the same; but its usefulness or harmfulness depends upon the man who handles it.

We cannot blame medicine for such a result. Similarly, if a child cuts his finger with a sharp knife, it is not the

knife that is to be blamed, but the person who kept the knife at a place where the child could reach it. These

examples help us in emphasising that if statistical facts are misused by some people it would be wrong to

blame the statistics as such. It is the people who are to be blamed. In fact statistics are like clay which can be

moulded in any way.

Collection of data

To study a problem statistically, the data relevant to it must first of all be collected. The

numerical facts constitute the raw material of the statistical process. The interpretation of the ultimate conclusion

and the decisions depend upon the accuracy with which the data are collected. Unless the data are collected

with sufficient care and are as accurate as is necessary for the purposes of the inquiry, the result obtained

cannot be expected to be valid or reliable.

Before starting the collection of the data, it is necessary to know the sources from which the data are

to be collected.

Primary and Secondary Sources

The original compiler of the data is the primary source. For example, the office of the Registrar

General will be the primary source of the decennial population census figures.

A secondary source is the one that furnishes the data that were originally compiled by someone else.

If the population census figures issued by the office of the Registrar-General are published in the Indian Year

Book, this publication will be the secondary source of the population data.

The sources of data are also classified according to the character of the data yielded by them. Thus

the data which are gathered from the primary source are known as primary data and those gathered from the

secondary source are known as secondary data. When an investigator is making use of figures which he has

obtained by field enumeration, he is said to be using primary data and when he is making use of figures which

he has obtained from some other source, he is said to be using secondary data.

Choice between Primary and Secondary Data

An investigator has to decide whether he will collect fresh (primary) data or he will compile data

from the published sources. The former is reliable per se but the latter can be relied upon only by examining

the following factors :—

(i) source from which they have been obtained;

(ii) their true significance;

(iii) completeness and

(iv) method of collection.

In addition to the above factors, there are other factors to be considered while making choice between

the primary or secondary data :

(i) Nature and scope of enquiry.

(ii) Availability of time and money.

(iii) Degree of accuracy required and

(iv) The status of the investigator i.e., individual, Pvt. Co., Govt. etc.

However, it may be pointed out that in certain investigations both primary and secondary data may

have to be used, one supplementing the other.

Methods of Collection of Primary Data

The primary methods of collection of statistical information are the following :

1. Direct Personal Observation,

2. Indirect Personal Observation,

3. Schedules to be filled in by informants

4. Information from Correspondents, and

5. Questionnaires in charge of enumerators

The particular method adopted would depend upon the nature of the enquiry and the availability of

time, money and other facilities to the investigator.

1. Direct Personal Observation

According to this method, the investigator obtains the data by personal observation. The method is

adopted when the field of inquiry is small. Since the investigator is closely connected with the collection of

data, it is bound to be more accurate. Thus, for example, if an inquiry is to be conducted into the family

budgets and living conditions of industrial labour, the investigator himself would live in the industrial area as one of

the industrial workers, mix with other residents and make patient and careful personal observations regarding

how they spend, work and live.

2. Indirect Personal Observation

According to this method, the investigator interviews several persons who are either directly or

indirectly in possession of the information sought to be collected. It may be distinguished from the first

method in which information is collected directly from the persons who are involved in the inquiry. In the case

of indirect personal observation, the persons from whom the information is being collected are known as

witnesses or informants. However, it should be ascertained that the informants really possess the knowledge

and they are not prejudiced in favour of or against a particular view point.

This method is adopted in the following situations :

1. Where the information to be collected is of a complex nature.

2. When investigation has to be made over a wide area.

3. Where the persons involved in the inquiry would be reluctant to part with the information.

This method is generally adopted by enquiry committees or commissions appointed by the government.

3. Schedules to be filled in by the informants

Under this method properly drawn up schedules or blank forms are distributed among the persons

from whom the necessary figures are to be obtained. The informants would fill in the forms and return them

to the officer in charge of the investigation. The Government of India issued slips for the special enumeration of

scientific and technical personnel at the time of census. These slips are good examples of schedules to be

filled in by the informants.

The merit of this method is its simplicity and lesser degree of trouble and pain for the investigator. Its

greatest drawback is that the informants may not send back the schedules duly filled in.

4. Information from Correspondents

Under this method certain correspondents are appointed in different parts of the field of enquiry, who

submit their reports to the Central Office in their own manner. For example, estimates of agricultural wages

may be periodically furnished to the Government by village school teachers.

The local correspondents being on the spot of the enquiry are capable of giving reliable information.

But it is not always advisable to place much reliance on correspondents, who have often got their own

personal prejudices. However, by this method, a rough and approximate estimate is obtained at a very low

cost. This method is also adopted by various departments of the government in such cases where regular

information is to be collected from a wide area.

5. Questionnaires in charge of Enumerators

A questionnaire is a list of questions directly or indirectly connected with the work of the enquiry. The

answers to these questions would provide all the information sought. The questionnaire is put in the charge of

trained investigators whose duty is to go to all persons or selected persons connected with the enquiry.

This method is usually adopted in case of large inquiries. The method of collecting data is relatively

cheap. Also the information obtained is of good quality.

The main drawback of this method is that the enumerator (i.e., investigator in charge of the

questionnaire) may be a biased one and may not enter the answer given by the informant. Where there are

many enumerators, they may interpret various terms in questionnaire according to their whims. To that extent

the information supplied may be either inaccurate or inadequate or not comparable.

This drawback can be removed to a great extent by training the investigators before the enquiry

begins. The meaning of different questions may be explained to them so that they do not interpret them

according to their whims.

Drafting the Questionnaire

The success of questionnaire method of collecting information depends on the proper drafting of the

questionnaire. It is a highly specialized job and requires a great deal of skill and experience. However, the

following general principles may be helpful in framing a questionnaire :

1. The number of questions should be kept to the minimum; fifteen to twenty-five may be a fair

number.

2. The questions must be arranged in a logical order so that a natural and spontaneous reply to

each is induced.

3. The questions should be short, simple and easy to understand and they should convey one

meaning.

4. As far as possible, questions of a personal and pecuniary nature should not be asked.

5. As far as possible the questions should be such that they can be answered briefly in ‘Yes’ or

‘No’, or in terms of numbers, place, date, etc.

6. The questionnaire should provide necessary instructions to the informants. For instance, if there

is a question on weight, it should be specified whether weight is to be indicated in lbs or

kilograms.

7. Questions should be objective type and capable of tabulation.

Specimen Questionnaire

We are giving below a specimen questionnaire on the Expenditure Habits of Students residing in college

Hostels.

Name of Student ............................................ Class ............................................

State and District of origin ..............................

Age ..............................

1. How much amount do you get from your father/guardian p.m. ?

2. Do you get some scholarship ? If so, state the amount per month.

3. Is there any other source from which you get money regularly ? (e.g. mother, brother or uncle).

4. How much do you spend monthly on the following items :

Rs.

College Tuition Fee .........

Hostel Food Expenses ........

Other hostel fees ........

Clothing ........

Entertainment ........

Smoking ........

Miscellaneous ........

Total

5. Do you smoke ?

If so what is the daily expenditure on it ?

6. Any other item on which you spend money ?

Sources of Secondary Data

There are a number of sources from which secondary data may be obtained. They may be classified

as follows :

1. Published sources, and

2. Unpublished sources.

1. Published Sources

The various sources of published data are :

1. Reports and official publications of-

(a) International bodies such as the International Monetary Fund, International Finance

Corporation, and United Nations Organisation.

(b) Central and State Governments- such as the Report of the Patel Committee, etc.

2. Semi-official publications of various local bodies such as Municipal Corporations and District

Boards.

3. Private publications of-

(a) Trade and professional bodies such as the Federation of Indian Chambers of Commerce and the

Institute of Chartered Accountants of India.

(b) Financial and Economic Journals such as “Commerce”, ‘Capital’ etc.

(c) Annual Reports of Joint Stock Companies.

(d) Publications brought out by research agencies, research scholars, etc.

2. Unpublished Sources

There are various sources of unpublished data such as records maintained by various government

and private offices, studies made by research institutions, scholars, etc. Such sources can also be used where

necessary.

Census and Sampling Techniques of Collection of Data

There are two important techniques of data collection: (i) a census enquiry, which implies complete

enumeration of each unit of the universe, and (ii) a sample survey, in which only a small part of the group, taken

as representative, is considered. For example, the population census in India implies the counting of each and

every human being within the country.

In practice sometimes it is not possible to examine every item in the population. Also many a time it

is possible to obtain sufficiently accurate results by studying only a part of the “population”. For example, if

the marks obtained in statistics by 10 students in an examination are selected at random, say out of 100, then

the average marks obtained by 10 students will be reasonably representative of the average marks obtained

by all the 100 students. In such a case, the population will be the marks of the entire group of 100 students

and that of 10 students will be a sample.
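The example above can be tried out with a short sketch. The marks below are invented purely for illustration (generated by a random number generator), and the names `population_marks` and `sample_vs_population` are our own; the sketch only shows how a sample of 10 drawn at random from 100 yields an average that can be set beside the average of the whole group.

```python
import random
import statistics


def sample_vs_population(marks, sample_size, seed=0):
    """Draw a simple random sample (without replacement) and return
    the population mean, the sample mean and the sample itself."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    sample = rng.sample(marks, sample_size)   # every item has an equal chance
    return statistics.mean(marks), statistics.mean(sample), sample


# Hypothetical marks of 100 students (illustrative data, not from any real examination)
rng = random.Random(42)
population_marks = [rng.randint(20, 95) for _ in range(100)]

pop_mean, sample_mean, sample = sample_vs_population(population_marks, 10, seed=1)
print("Average of all 100 students:", pop_mean)
print("Average of the sample of 10:", sample_mean)
```

How close the two averages come will vary from one sample to another; a different seed gives a different sample and a somewhat different sample average.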

Objects of Sampling

1. To get as much information as possible of the whole universe by examining only a part of it.

2. To determine the reliability of the estimates. This can be done by drawing successive samples

from the same parent universe and comparing the results obtained from different samples.

Advantages of Census Method

1. As the entire ‘population’ is studied, the results obtained are most accurate.

2. In a census, information is available for each individual item of the population which is not

possible in the case of a sample. Thus no information is sacrificed under the census method.

3. If data are to be secured only from a small fraction of the aggregate, their completeness and

accuracy can be ensured only by the census method, since greater attention thereby is given to each item.

4. As the entire mass of data is taken into consideration, all the characteristics of the ‘population’

are maintained in their original form.

Disadvantages of Census Method

1. The cost of conducting enquiry by the census method is very high as the whole universe is to be

investigated.

2. The census method is not practicable in very big enquiries due to the inconvenience of individual

enumeration.

3. In the cases of very big enquiries, the census method can be resorted to by the government

agencies only. The application of this method is limited to those who are having adequate financial resources

and other facilities at their disposal.

4. As all the items in the universe are to be enumerated, there is a need for training of staff and

investigators. Sometimes it becomes very difficult to maintain uniformity of standards, when many investigators

are involved. Individual preferences and prejudices are there and it becomes very difficult to avoid bias in

such type of enquiries.

Advantages of Sampling Method

1. Sample method is less costly since the sample is a small fraction of the total population.

2. Data can be collected and summarized more quickly. This is a vital consideration when the

information is urgently needed.

3. A sample produces more accurate results than are ordinarily practicable on a complete

enumeration.

4. Personnel of high quality can be employed and given intensive training as the number of such

personnel would not be very large.

5. A sample method is not restricted to the Government agencies. Even private agencies can use

this method as the financial burden is not heavy. It is much more economical than the census method.

Disadvantages of Sampling Method

1. In a census, information is available for each individual item of the population which is not

possible in the case of a sample. Some information has to be sacrificed.

2. If data are to be secured only from a small fraction of the aggregate, their completeness and

accuracy can be ensured only through the census method, since greater attention thereby is given to each

item.

3. In using the technique of sampling, the investigator may not choose a representative sample. The

aim of sampling is that it should afford a sufficiently accurate picture of a large group without the need for a

complete enumeration of all the units of the group. If the sample chosen is not representative of the group,

the very object of sampling is defeated.

4. The sampling technique is based upon the fundamental assumption that the population to be

sampled is homogeneous. If it is not so, the sampling method should not be adopted unless the population is first

divided into groups or “strata” before the selection of the sample is made.

Principle of sampling

There are two important principles on which the theory of sampling is based ;

1. Principle of Statistical Regularity, and

2. Principle of ‘Inertia of Large Numbers’

1. Principle of Statistical Regularity

This principle points out that if a sample is taken at random from a population, it is likely to possess

almost the same characteristics as that of the population.

By random selection, we mean a selection where each and every item of the population has an equal

chance of being selected in the sample. In other words, the selection must not be made by deliberate exercise

of one’s discretion. A sample selected in this manner would be representative of the population. For example,

if one intends to make a study of the average weight of the students of Delhi University, it is not necessary to

take the weight of each and every student. A few students may be selected at random from every college,

their weights taken and the average weight of the University students in general may be inferred.

2. Principle of Inertia of Large Numbers

This principle is a corollary of the principle of statistical regularity. This principle is that, other things

being equal, larger the size of the sample, more accurate the results are likely to be. This is because large

numbers are more stable as compared to small ones. For example, if a coin is tossed 10 times we should

expect an equal number of heads and tails, i.e., 5 each. But since the experiment is tried a small number of

times it is likely that we may not get exactly 5 heads and 5 tails. The result may be a combination of 9 heads

and 1 tail or 8 heads and 2 tails or 7 heads and 3 tails etc. If the same experiment is carried out 1,000 times

the chance of the proportion of heads being close to one half, i.e., close to 500 heads and 500 tails, would be much higher. The basic reason for such likelihood is

that the experiment has been carried out a sufficiently large number of times and variations in one

direction tend to be compensated by variations in the other.
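The coin-tossing illustration can be checked by simulation. The following is only a sketch: the function names `heads_proportion` and `average_deviation`, the number of repetitions (200) and the seed are our own choices. It measures, on average, how far the proportion of heads strays from one half when the coin is tossed 10 times and when it is tossed 1,000 times.

```python
import random


def heads_proportion(n_tosses, rng):
    """Toss a fair coin n_tosses times and return the proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


def average_deviation(n_tosses, n_repeats=200, seed=0):
    """Repeat the experiment n_repeats times and return the average
    absolute deviation of the proportion of heads from one half."""
    rng = random.Random(seed)
    total = sum(abs(heads_proportion(n_tosses, rng) - 0.5) for _ in range(n_repeats))
    return total / n_repeats


dev_small = average_deviation(10)     # experiment of 10 tosses
dev_large = average_deviation(1000)   # experiment of 1,000 tosses
print("Average deviation from 0.5 with 10 tosses   :", dev_small)
print("Average deviation from 0.5 with 1,000 tosses:", dev_large)
```

The deviation for 1,000 tosses should come out far smaller than that for 10 tosses, which is what the principle asserts: the larger the number of trials, the more stable the proportion.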

Method of Sampling

The various methods available for sampling are :

(i) Conscious or Deliberate or Purposive Sampling.

(ii) Random Sampling or Chance Selection.

(iii) Stratified Sampling.

(iv) Systematic Sampling.

(v) Multi-stage Sampling.

(i) Purposive Sampling

Purposive sampling is representative sampling done by carefully analysing the universe of enquiry and selecting

only those units which seem to be most representative of the characteristics of the universe. If economic conditions

of people living in a state are to be studied according to this method, then a few villages and towns may be

purposively selected for intensive study on the principle that they shall be representative of the entire

state.

Thus the purposive sampling is a purposive selection by the investigator that depends on the nature

and purpose of the enquiry. This method is very much exposed to the dangers of personal prejudices. Also

there is a possibility of certain wrong cases being included in the data under collection, consciously or

unconsciously.

However, it may be noted that this method gives a very representative sample provided neither

bias nor prejudices influence the process of data selection.

(ii) Random Sampling

In order to avoid the danger of personal bias and prejudices, a random sample is adopted. Under this

method every item in the universe is given equal chance of being included in the sample.

A random sample is the simplest type of sample. For obtaining such sample, a certain number of units

are selected at random from the universe. But this sampling technique is based upon the fundamental assumption

that the population to be sampled is homogeneous. If it is not so, then stratified sampling is adopted.

(iii) Stratified Sampling

Under this method, the population is first sub-divided into groups or “strata” before the selection of

the samples is made. This is done to achieve homogeneity within each group or “stratum”. A stratified sample

is nothing but a set of random samples of a number of sub-populations, each representing a single group. The

major advantage of such a stratification is that the several sub-divisions of the population which are relevant

for purpose of inquiry are adequately represented.
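A stratified sample, as described above, is a set of random samples, one from each stratum. The sketch below assumes proportional allocation, i.e., the same fraction is drawn from every stratum; this is one common choice and is not prescribed by the text. The strata (faculties of a college) and the function name `stratified_sample` are hypothetical.

```python
import random


def stratified_sample(strata, fraction, seed=0):
    """Draw a simple random sample of the given fraction from each stratum.

    strata   : dict mapping stratum name -> list of units in that stratum
    fraction : share of each stratum to be selected (proportional allocation)
    """
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        k = max(1, round(len(units) * fraction))   # at least one unit per stratum
        sample[name] = rng.sample(units, k)        # random selection within the stratum
    return sample


# Hypothetical population divided into three strata of unequal size
strata = {
    "arts": [f"arts-{i}" for i in range(1, 201)],          # 200 students
    "science": [f"science-{i}" for i in range(1, 101)],    # 100 students
    "commerce": [f"commerce-{i}" for i in range(1, 51)],   # 50 students
}

chosen = stratified_sample(strata, 0.1)
for name, units in chosen.items():
    print(name, len(units))
```

Because every stratum contributes to the sample in proportion to its size, no sub-division of the population is left out, which is the advantage claimed for stratification.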

(iv) Systematic Sampling

This method is used where complete list of the population from which sample is to be drawn is

available. The method is to select every rth item* from the list, where ‘r’ refers to the sampling interval. The

first item between the first and the rth is selected at random. For example, if a list of 500 students of a college

is available and if we want to draw a sample of 100, we must select every fifth item (i.e., r = 5). The first item

between one and five shall be selected at random. Suppose it comes out to be 4. Now we shall add five successively

to obtain the numbers of the desired sample. Thus the second item would be the 9th student; the third the 14th student;

the fourth the 19th student; and so on.

Sampling interval, r = (size of the universe) / (size of the sample)
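The procedure for the list of 500 students can be sketched as follows. The function name systematic_sample and its parameters are chosen only for this illustration; the starting position is fixed at 4 to reproduce the figures of the example, and would otherwise be drawn at random between 1 and r.

```python
import random

def systematic_sample(universe_size, sample_size, start=None, seed=None):
    """Return the 1-based positions chosen by systematic sampling."""
    r = universe_size // sample_size                 # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, r)    # random start between 1 and r
    return [start + i * r for i in range(sample_size)]

positions = systematic_sample(500, 100, start=4)
print(positions[:4])    # [4, 9, 14, 19]
```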

This method is more convenient to adopt than the random sampling or stratified sampling method.

The time and work involved are relatively smaller. But the main drawback of this method is that systematic

samples are not always random samples.

(v) Multi-Stage Sampling

As the name implies this method refers to a sampling procedure which is carried out in several

stages. At the first stage, the first stage units are sampled by some statistical method, such as random sampling.

Then a sample of second stage units is selected from each of the selected first stage units. Further stages may be

added as required.

This method introduces flexibility in the sampling method which is lacking in the other methods.

However, a multi-stage sample is less accurate than a sample containing the same number of final stage units

which have been selected by some suitable single stage process.

LESSON 2

CONSTRUCTION OF FREQUENCY DISTRIBUTION

AND GRAPHICAL PRESENTATION

What is frequency distribution

Collected and classified data are presented in the form of a frequency distribution. A frequency distribution

is simply a table in which the data are grouped into classes on the basis of common characteristics and

the number of cases which fall in each class is recorded. It shows the frequency of occurrence of different

values of a single variable. A frequency distribution is constructed to satisfy three objectives :

(i) to facilitate the analysis of data,

(ii) to estimate frequencies of the unknown population distribution from the distribution of sample

data, and

(iii) to facilitate the computation of various statistical measures.

Frequency distribution can be of two types :

1. Univariate Frequency Distribution.

2. Bivariate Frequency Distribution.

In this lesson, we shall study the univariate frequency distribution. A univariate distribution incorporates

different values of one variable only, whereas a bivariate frequency distribution incorporates the

values of two variables. The univariate frequency distribution is further classified into three categories:

(i) Series of individual observations,

(ii) Discrete frequency distribution, and

(iii) Continuous frequency distribution.

A series of individual observations is a simple listing of the items, one for each observation. If the marks of 14

students of a class in statistics are given individually, they will form a series of individual observations.

Marks obtained in Statistics :

Roll Nos.   1   2   3   4   5   6   7   8   9  10  11  12  13  14
Marks:     60  71  80  41  81  41  85  35  98  52  50  91  30  88

Marks in Ascending Order    Marks in Descending Order
30                          98
35                          91
41                          88
41                          85
50                          81
52                          80
60                          71
71                          60
80                          52
81                          50
85                          41
88                          41
91                          35
98                          30

Discrete Frequency Distribution: In a discrete series, the data are presented in such a way that

exact measurements of units are indicated. In a discrete frequency distribution, we count the number of times

each value of the variable occurs in the given data. This is facilitated through the technique of tally bars.

In the first column, we write all values of the variable. In the second column, we put a vertical bar, called a tally

bar, against a value each time it occurs. After a particular value has occurred four times, for the fifth occurrence we put

a cross tally mark ( / ) on the four tally bars to make a block of 5. The technique of putting cross tally bars at

every fifth repetition facilitates the counting of the number of occurrences of the value. After putting tally

bars for all the values in the data, we count the number of times each value is repeated and write it against the

corresponding value of the variable in the third column entitled frequency. This type of representation of the

data is called discrete frequency distribution.

We are given marks of 42 students:

55 51 57 40 26 43 46 41 46 48 33 40 26 40 40 41

43 53 45 53 33 50 40 33 40 26 53 59 33 39 55 48

15 26 43 59 51 39 15 45 26 15

We can construct a discrete frequency distribution from the above given marks.

Marks of 42 Students

Marks    Tally Bars    Frequency
15       |||           3
26       ||||/         5
33       ||||          4
39       ||            2
40       ||||/         5
41       ||            2
43       |||           3
45       ||            2
46       ||            2
48       ||            2
50       |             1
51       ||            2
53       |||           3
55       |||           3
57       |             1
59       ||            2
Total                  42
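The tallying can also be sketched in Python with the standard Counter class. The list below copies the 42 marks exactly as they are printed above; writing a completed block of five as "||||/" is only a convention of this sketch. Counted in this way, the printed figures give 6 for the mark 40 and 2 for the mark 55, each differing by one from the table, so one of the printed marks is probably a misprint.

```python
from collections import Counter

marks = [
    55, 51, 57, 40, 26, 43, 46, 41, 46, 48, 33, 40, 26, 40, 40, 41,
    43, 53, 45, 53, 33, 50, 40, 33, 40, 26, 53, 59, 33, 39, 55, 48,
    15, 26, 43, 59, 51, 39, 15, 45, 26, 15,
]

frequency = Counter(marks)             # value -> number of occurrences
for value in sorted(frequency):
    f = frequency[value]
    tally = "||||/ " * (f // 5) + "|" * (f % 5)   # blocks of five, then the remainder
    print(f"{value:>5}  {tally:<14}{f}")
print("Total", sum(frequency.values()))
```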

The presentation of the data in the form of a discrete frequency distribution is better than merely arranging

the marks in order, but it does not condense the data as needed and is still quite difficult to grasp and comprehend. This distribution

is quite simple in case the values of the variable are repeated; otherwise there will be hardly any condensation.

Continuous Frequency Distribution: If neither the identity of the units about which a particular piece of information is

collected nor the order in which the observations occur is relevant, then the first step of condensation

is to classify the data into different classes by dividing the entire range of values of the variable into a suitable

number of groups and then recording the number of observations in each group. Thus, if we divide the total

range of values of the variable (marks of 42 students), i.e. 59 – 15 = 44, into groups of 10 each, we shall get

5 groups (44/10 = 4.4, taken as 5), and the distribution of marks is displayed by the following frequency distribution:

Marks of 42 Students

Marks (x)    Tally Bars          Number of Students (f)
15 – 25      |||                 3
25 – 35      ||||/ ||||          9
35 – 45      ||||/ ||||/ ||      12
45 – 55      ||||/ ||||/ ||      12
55 – 65      ||||/ |             6
Total                            42

The various groups into which the values of a variable are classified are known as classes; the length

of the class interval (10) is called the width of the class. The two values specifying a class are called the

class limits. The presentation of the data into continuous classes with the corresponding frequencies is

known as continuous frequency distribution. There are two methods of classifying the data according to

class intervals :

(i) exclusive method, and

(ii) inclusive method

In an exclusive method, the class intervals are fixed in such a manner that upper limit of one

class becomes the lower limit of the following class. Moreover, an item equal to the upper limit of a

class would be excluded from that class and included in the next class. The following data are classified

on this basis.

Income (Rs.)    No. of Persons
200 – 250       50
250 – 300       100
300 – 350       70
350 – 400       130
400 – 450       50
450 – 500       100
Total           500

It is clear from the example that the exclusive method ensures continuity of the data inasmuch as

the upper limit of one class is the lower limit of the next class. Therefore, 50 persons have their incomes

between 200 and 249.99, and a person whose income is 250 shall be included in the next class of 250 – 300.

According to the inclusive method, an item equal to upper limit of a class is included in that class

itself. The following table demonstrates this method.

Income (Rs.)    No. of Persons
200 – 249       50
250 – 299       100
300 – 349       70
350 – 399       130
400 – 449       50
450 – 499       100
Total           500

Hence in the class 200 – 249, we include persons whose income is between Rs. 200 and Rs. 249.

Principles for Constructing Frequency Distributions

In spite of the great importance of classification in statistical analysis, no hard and fast rules are laid

down for it. A statistician uses his discretion, sound experience,

wisdom, skill and aptness to arrive at an appropriate classification of the data. However, the following guidelines must

be considered to construct a frequency distribution:

1. Type of classes: The classes should be clearly defined and should not lead to any ambiguity.

They should be exhaustive and mutually exclusive so that any value of the variable corresponds to

only one class.

2. Number of classes: The choice of the number of classes into which a given frequency distribution

should be divided depends upon the following things:

(i) The total frequency which means the total number of observations in the distribution.

(ii) The nature of the data which means the size or magnitude of the values of the variable.

(iii) The desired accuracy.

(iv) The convenience regarding computation of the various descriptive measures of the

frequency distribution such as the mean, variance, etc.

The number of classes should not be too small or too large. If the classes are few, the classification

becomes very broad and rough which might obscure some important features and characteristics of the data.

The accuracy of the results decreases as the number of classes becomes smaller. On the other hand, too

many classes will result in too few observations in each class. This will give an irregular pattern of frequencies

in different classes and thus make the frequency distribution irregular. Moreover a large number of classes will

render the distribution too unwieldy to handle. The computational work for further processing of the data will

become quite tedious and time consuming without any proportionate gain in the accuracy of the results.

Hence a balance should be maintained between the loss of information in the first case and irregularity of

frequency distribution in the second case, to arrive at a suitable number of classes. Normally, the number of

classes should not be less than 5 or more than 20. Prof. Sturges has given a formula:

k = 1 + 3.322 log n

where k refers to the number of classes, n refers to the total frequency or number of observations, and log is the logarithm to base 10.

The value of k is rounded to the nearest integer :

If n = 100, k = 1 + 3.322 log 100 = 1 + 6.644 = 7.644, i.e., 8

If n = 10,000, k = 1 + 3.322 log 10,000 = 1 + 13.288 = 14.288, i.e., 14

However, this rule should be applied only when the number of observations is not very small.
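The two worked figures can be checked with a short Python sketch. The function name sturges_classes is chosen only for this illustration; the logarithm is taken to base 10, and the result is rounded to the nearest whole number, which reproduces the values 8 and 14 given above.

```python
import math

def sturges_classes(n):
    """Number of classes suggested by k = 1 + 3.322 log10(n), before rounding."""
    return 1 + 3.322 * math.log10(n)

for n in (100, 10_000):
    k = sturges_classes(n)
    print(n, round(k, 3), round(k))
```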

Further, the number of class intervals should be such that they give a uniform and unimodal distribution,

which means that the frequencies in the given classes increase and decrease steadily and there are no

sudden jumps. The number of classes should be an integer, preferably 5 or a multiple of 5 (10, 15, 20, 25, etc.),

which is convenient for numerical computations.

3. Size of Class Intervals : Because the size of the class interval is inversely proportional to the

number of classes in a given distribution, the choice of the size of the class interval will

depend upon the sound subjective judgment of the statistician. An approximate value of the

magnitude of the class interval, say i, can be calculated with the help of Sturges’ Rule :

i = Range / (1 + 3.322 log n)

where i stands for class magnitude or interval, Range refers to the difference between the

largest and smallest value of the distribution, and n refers to total number of observations.

If we are given the following information: n = 400, Largest item = 1300 and Smallest item = 340,

then, i = (1300 – 340) / (1 + 3.322 log 400) = 960 / 9.644 = 99.54, or 100 approximately.
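For the figures given (n = 400, largest item 1300, smallest item 340), the class interval obtained by dividing the range by Sturges' number of classes can be sketched as below. The function name class_interval is an assumption of this illustration, and the logarithm is taken to base 10.

```python
import math

def class_interval(largest, smallest, n):
    """Approximate class width: Range / (1 + 3.322 log10(n))."""
    return (largest - smallest) / (1 + 3.322 * math.log10(n))

i = class_interval(1300, 340, 400)
print(round(i, 2))    # about 99.54, so a width of 100 would be convenient
```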

Another rule to determine the size of class interval is that the length of the class interval should not be

greater than one-fourth of the estimated population standard deviation. If σ is the estimate of the population standard

deviation, then the length of class interval is given by: i ≤ σ/4.

The size of class intervals should be taken as 5 or a multiple of 5 (10, 15 or 20) for easy computation

of various statistical measures of the frequency distribution. Class intervals should be so fixed that each class

has a convenient mid-point around which all the observations in that class cluster. It means that the entire

frequency of the class is taken to be concentrated at the mid value of the class. It is always desirable to take the class

intervals of equal or uniform magnitude throughout the frequency distribution.

4. Class Boundaries: If in a grouped frequency distribution there are gaps between the upper

limit of any class and lower limit of the succeeding class (as in case of inclusive type of classification),

there is a need to convert the data into a continuous distribution by applying a correction

factor for continuity for determining new classes of exclusive type. The lower and upper

class limits of new exclusive type classes are called class boundaries.

If d is the gap between the upper limit of any class and lower limit of succeeding class, the class

boundaries for any class are given by:

Upper class boundary = Upper class limit + d/2

Lower class boundary = Lower class limit – d/2

d/2 is called the correction factor.

Let us consider the following example to understand :

Marks      Class Boundaries
20 – 24    (20 – 0.5, 24 + 0.5) i.e., 19.5 – 24.5
25 – 29    (25 – 0.5, 29 + 0.5) i.e., 24.5 – 29.5
30 – 34    (30 – 0.5, 34 + 0.5) i.e., 29.5 – 34.5
35 – 39    (35 – 0.5, 39 + 0.5) i.e., 34.5 – 39.5
40 – 44    (40 – 0.5, 44 + 0.5) i.e., 39.5 – 44.5

Correction factor = d/2 = (25 – 24)/2 = 0.5
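The conversion of inclusive class limits into class boundaries can be sketched as follows. The function name class_boundaries is chosen only for this illustration, and it assumes that the gap d is the same between every pair of adjoining classes, as it is in the table of marks above.

```python
def class_boundaries(classes):
    """Convert inclusive class limits to boundaries using the correction d/2."""
    # d is the gap between the upper limit of one class and the lower limit of the next
    d = classes[1][0] - classes[0][1]
    return [(lower - d / 2, upper + d / 2) for lower, upper in classes]

marks = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
boundaries = class_boundaries(marks)
for limits, bounds in zip(marks, boundaries):
    print(limits, "->", bounds)
```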

5. Mid-value or Class Mark: The mid value or class mark is the value of a variable which is

exactly at the middle of the class. The mid-value of any class is obtained by dividing the sum of

the upper and lower class limits by 2.

Mid value of a class = 1/2 [Lower class limit + Upper class limit]

The class limits should be selected in such a manner that the observations in any class are

evenly distributed throughout the class interval so that the actual average of the observations in

any class is very close to the mid-value of the class.

6. Open End Classes : The classification is termed as open end classification if the lower limit of

the first class or the upper limit of the last class or both are not specified and such classes in

which one of the limits is missing are called open end classes. Examples are classes such as

‘marks less than 20’ or ‘age above 60 years’. As far as possible open end classes should be avoided

because in such classes the mid-value cannot be accurately obtained. But if the open end

classes are inevitable then it is customary to estimate the class mark or mid-value for the first

class with reference to the succeeding class. In other words, we assume that the magnitude of

the first class is the same as that of the second class.

Example :Construct a frequency distribution from the following data by inclusive method taking 4 as the

class interval:

10 17 15 22 11 16 19 24 29 18

25 26 32 14 17 20 23 27 30 12

15 18 24 36 18 15 21 28 33 38

34 13 10 16 20 22 29 19 23 31

Solution : Because the minimum value of the variable is 10 which is a very convenient figure for taking the

lower limit of the first class and the magnitude of the class interval is given to be 4, the classes for preparing

frequency distribution by the Inclusive method will be 10 – 13, 14 – 17, 18 – 21, 22 – 25,.............. 38 – 41.

Frequency Distribution

Class Interval    Tally Bars      Frequency (f)
10 – 13           ||||/           5
14 – 17           ||||/ |||       8
18 – 21           ||||/ |||       8
22 – 25           ||||/ ||        7
26 – 29           ||||/           5
30 – 33           ||||            4
34 – 37           ||              2
38 – 41           |               1
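The same distribution can be obtained with a short Python sketch of the inclusive method. The variable names are chosen only for this illustration; the data, the width of 4 and the starting point of 10 are those of the example.

```python
data = [
    10, 17, 15, 22, 11, 16, 19, 24, 29, 18,
    25, 26, 32, 14, 17, 20, 23, 27, 30, 12,
    15, 18, 24, 36, 18, 15, 21, 28, 33, 38,
    34, 13, 10, 16, 20, 22, 29, 19, 23, 31,
]

width = 4
start = min(data)                                   # 10, the lower limit of the first class
n_classes = (max(data) - start) // width + 1
classes = [(start + k * width, start + k * width + width - 1)
           for k in range(n_classes)]               # 10-13, 14-17, ..., 38-41

# Inclusive method: both the lower and the upper limit belong to the class.
frequencies = [sum(lower <= x <= upper for x in data) for lower, upper in classes]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower} - {upper}: {f}")
```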

Example : Prepare a statistical table from the following :

Weekly wages (Rs.) of 100 workers of Factory A

88 23 27 28 86 96 94 93 86 99

82 24 24 55 88 99 55 86 82 36

96 39 26 54 87 100 56 84 83 46

102 48 27 26 29 100 59 83 84 48

104 46 30 29 40 101 60 89 46 49

106 33 36 30 40 103 70 90 49 50

104 36 37 40 40 106 72 94 50 60

24 39 49 46 66 107 76 96 46 67

26 78 50 44 43 46 79 99 36 68

29 67 56 99 93 48 80 102 32 51

Solution : The lowest value is 23 and the highest 107. The difference between the lowest and highest

value is 84. If we take a class interval of 10, nine classes would be made. The first class should be taken

as 20 – 30 instead of 23 – 33 as per the guidelines of classification.

Frequency Distribution of the Wages of 100 Workers

Wages (Rs.)    Tally Bars                 Frequency (f)
20 – 30        ||||/ ||||/ |||            13
30 – 40        ||||/ ||||/ |              11
40 – 50        ||||/ ||||/ ||||/ |||      18
50 – 60        ||||/ ||||/                10
60 – 70        ||||/ |                    6
70 – 80        ||||/                      5
80 – 90        ||||/ ||||/ ||||           14
90 – 100       ||||/ ||||/ ||             12
100 – 110      ||||/ ||||/ |              11
Total                                     100
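The grouping of the 100 wages by the exclusive method can be sketched in Python as below. The variable names are chosen only for this illustration; the wages are copied from the example, and each class includes its lower limit but excludes its upper limit.

```python
wages = [
    88, 23, 27, 28, 86, 96, 94, 93, 86, 99,
    82, 24, 24, 55, 88, 99, 55, 86, 82, 36,
    96, 39, 26, 54, 87, 100, 56, 84, 83, 46,
    102, 48, 27, 26, 29, 100, 59, 83, 84, 48,
    104, 46, 30, 29, 40, 101, 60, 89, 46, 49,
    106, 33, 36, 30, 40, 103, 70, 90, 49, 50,
    104, 36, 37, 40, 40, 106, 72, 94, 50, 60,
    24, 39, 49, 46, 66, 107, 76, 96, 46, 67,
    26, 78, 50, 44, 43, 46, 79, 99, 36, 68,
    29, 67, 56, 99, 93, 48, 80, 102, 32, 51,
]

width = 10
lower_limits = list(range(20, 110, width))      # 20, 30, ..., 100

# Exclusive method: a wage equal to the upper limit goes into the next class.
frequencies = [sum(lo <= w < lo + width for w in wages) for lo in lower_limits]

for lo, f in zip(lower_limits, frequencies):
    print(f"{lo} - {lo + width}: {f}")
print("Total", sum(frequencies))
```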

Graphs of Frequency Distributions

The guiding principles for the graphic representation of the frequency distributions are the same as for

the diagrammatic and graphic representation of other types of data. The information contained in a frequency

distribution can be shown in graphs which reveal the important characteristics and relationships that are not

easily discernible on a simple examination of the frequency tables. The most commonly used graphs for

charting a frequency distribution are :

1. Histogram

2. Frequency polygon

3. Smoothed frequency curves

4. Ogives or cumulative frequency curves.

1. Histogram

The term ‘histogram’ must not be confused with the term ‘historigram’ which relates to time charts.

Histogram is the best way of presenting graphically a simple frequency distribution. The statistical meaning

of histogram is that it is a graph that represents the class frequencies in a frequency distribution by vertical

adjacent rectangles.

While constructing a histogram, the variable is always taken on the X-axis, each class being marked off by its

class-interval. The distance for each rectangle on the X-axis shall remain the same in case the class-intervals are

uniform throughout; if they are different the width of the rectangles shall also change proportionately. The

Y-axis represents the frequencies of each class which constitute the height of its rectangle. We get a series of

rectangles each having a class interval distance as its width and the frequency distance as its height. The

area of the histogram represents the total frequency.

The histogram should be clearly distinguished from a bar diagram. A bar diagram is one-dimensional,

where the length of the bar is important and not the width, whereas a histogram is two-dimensional, where both the

length and width are important. However, a histogram can be misleading if the distribution has unequal class

intervals and suitable adjustments in frequencies are not made.

The technique of constructing histogram is explained for :

(i) distributions having equal class-intervals, and

(ii) distributions having unequal class-intervals.

Example : Draw a histogram from the following data :

Classes      Frequency
0 – 10       5
10 – 20      11
20 – 30      19
30 – 40      21
40 – 50      16
50 – 60      10
60 – 70      8
70 – 80      6
80 – 90      3
90 – 100     1

Solution :

When class-intervals are unequal the frequencies must be adjusted before constructing a histogram.

We take that class which has the lowest class-interval and adjust the frequencies of the other classes accordingly. If

one class interval is twice as wide as the one having the lowest class-interval, we divide the height of its

rectangle by two; if it is three times as wide, we divide it by three, and so on. The heights will then be proportional to the ratios

of the frequencies to the widths of the classes.

Example : Represent the following data on a histogram.

Average monthly income of 1035 employees in a construction industry is given below :

Monthly Income (Rs.) No. of Workers

600 – 700 25

700 – 800 100

800 – 900 150

900 – 1000 200

1000 – 1200 140

1200 – 1400 80

1400 – 1500 50

1500 – 1800 30

1800 or more 20

Solution : Histogram showing monthly incomes of workers :
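The adjusted heights of the rectangles for this example can be worked out with a short Python sketch. The variable names are chosen only for this illustration. The open class "1800 or more" has no upper limit, so it is left out here; how wide to draw it is a matter of judgment that the data do not settle.

```python
# (lower limit, upper limit, number of workers) for the closed classes
classes = [
    (600, 700, 25), (700, 800, 100), (800, 900, 150), (900, 1000, 200),
    (1000, 1200, 140), (1200, 1400, 80), (1400, 1500, 50), (1500, 1800, 30),
]

smallest_width = min(upper - lower for lower, upper, _ in classes)   # 100

heights = []
for lower, upper, f in classes:
    factor = (upper - lower) / smallest_width   # how many times wider than the narrowest class
    heights.append(f / factor)                  # adjusted frequency = height of the rectangle

for (lower, upper, f), h in zip(classes, heights):
    print(f"{lower} - {upper}: frequency {f}, height {h:g}")
```

Thus the class 1000 – 1200, being twice as wide as the narrowest class, is drawn with a height of 70 instead of 140, and the class 1500 – 1800 with a height of 10 instead of 30.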

When mid-points are given, we ascertain the upper and lower limits of each class and then

construct the histogram in the same manner.

Example : Draw a histogram of the following distribution :

Life of Electric Lamps    Frequency
(hours)                   Firm A    Firm B
1010                      10        287
1030                      130       105
1050                      482       26
1070                      360       230
1090                      18        352

Solution : Since we are given the mid-points, we should ascertain the class limits. To calculate the class limits

of the various classes, take the difference of two consecutive mid-points and divide it by 2, then subtract the value

obtained from each mid-point to get the lower class limit and add it to get the upper class limit.

Life of Electric Lamps    Frequency
(hours)                   Firm A    Firm B
1000 – 1020               10        287
1020 – 1040               130       105
1040 – 1060               482       26
1060 – 1080               360       230
1080 – 1100               18        352
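The step from mid-points to class limits can be sketched in Python as follows. The variable names are chosen only for this illustration, and the sketch assumes, as in this example, that the mid-points are equally spaced.

```python
mid_points = [1010, 1030, 1050, 1070, 1090]

# Half the difference between two consecutive mid-points
half_width = (mid_points[1] - mid_points[0]) / 2

# Subtract it for the lower limit and add it for the upper limit
class_limits = [(m - half_width, m + half_width) for m in mid_points]

for m, (lower, upper) in zip(mid_points, class_limits):
    print(f"{m}: {lower:g} - {upper:g}")
```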

2. Frequency Polygon

This is a graph of frequency distribution which has more than four sides. It is particularly effective in

comparing two or more frequency distributions. There are two ways of constructing a frequency polygon.

(i) We may draw a histogram of the given data and then join by straight lines the mid-points of the

upper horizontal side of each rectangle with the adjacent ones. The figure so formed shall be frequency

polygon. Both the ends of the polygon should be extended to the base line in order to make the area under

frequency polygons equal to the area under Histogram.

(ii) Another method of constructing frequency polygon is to take the mid-points of the various class-

intervals and then plot the frequency corresponding to each point and join all these points by straight lines.

The figure obtained by both the methods would be identical.

Frequency polygon has an advantage over the histogram. The frequency polygons of several distributions

can be drawn on the same axes, which makes comparisons possible, whereas histograms cannot be

used in the same way. To compare histograms we need to draw them on separate graphs.

3. Smoothed Frequency Curve

A smoothed frequency curve can be drawn through the various points of the polygon. The curve is

drawn by free hand in such a manner that the area included under the curve is approximately the same as that

of the polygon. The object of drawing a smoothed curve is to eliminate all accidental variations which exist

in the original data. While smoothening, the top of the curve would overtop the highest point of the polygon,

particularly when the magnitude of the class interval is large. The curve should look as regular as possible and

all sudden turns should be avoided. The extent of smoothening would depend upon the nature of the data. For

drawing smoothed frequency curve it is necessary to first draw the polygon and then smoothen it. We must

keep in mind the following points to smoothen a frequency graph:

(i) Only frequency distribution based on samples should be smoothened.

(ii) Only continuous series should be smoothened.

(iii) The total area under the curve should be equal to the area under the histogram or polygon.

The diagram given below will illustrate the point :

4. Cumulative Frequency Curves or Ogives

We have discussed the charting of simple distributions where each frequency refers to the measurement

of the class-interval against which it is placed. Sometimes it becomes necessary to know the

number of items whose values are greater or less than a certain amount. We may, for example, be interested

in knowing the number of students whose weight is less than 65 lbs. or more than say 15.5 lbs. To

get this information, it is necessary to change the form of frequency distribution from a simple to a

cumulative distribution. In a cumulative frequency distribution, the frequency of each class is made to

include the frequencies of all the lower or all the upper classes depending upon the manner in which

cumulation is done. The graph of such a distribution is called a cumulative frequency curve or an Ogive.

There are two methods of constructing ogives, namely:

(i) less than method, and

(ii) more than method

In less than method, we start with the upper limit of each class and go on adding the frequencies.

When these frequencies are plotted we get a rising curve.

In more than method, we start with the lower limit of each class and we subtract the frequency of

each class from total frequencies. When these frequencies are plotted, we get a declining curve.

This example would illustrate both types of ogives.

Example : Draw ogives by both the methods from the following data.

Distribution of weights of the students of a college (lbs.)

Weights No. of Students

90.5 – 100.5 5

100.5 – 110.5 34

110.5 – 120.5 139

120.5 – 130.5 300

130.5 – 140.5 367

140.5 – 150.5 319

150.5 – 160.5 205

160.5 – 170.5 76

170.5 – 180.5 43

180.5 – 190.5 16

190.5 – 200.5 3

200.5 – 210.5 4

210.5 – 220.5 3

220.5 – 230.5 1

Solution : First of all we shall find out the cumulative frequencies of the given data by less than method.

Less than (Weights) Cumulative Frequency

100.5 5

110.5 39

120.5 178

130.5 478

140.5 845

150.5 1164

160.5 1369

170.5 1445

180.5 1488

190.5 1504

200.5 1507

210.5 1511

220.5 1514

230.5 1515

Plot these frequencies and weights on a graph paper. The curve formed is called an Ogive.

Now we calculate the cumulative frequencies of the given data by more than method.

More than (Weights) Cumulative Frequencies

90.5 1515

100.5 1510

110.5 1476

120.5 1337

130.5 1037

140.5 670

150.5 351

160.5 146

170.5 70

180.5 27

190.5 11

200.5 8

210.5 4

220.5 1

By plotting these frequencies on a graph paper, we will get a declining curve which will be our

cumulative frequency curve or Ogive by more than method.
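Both sets of cumulative frequencies for the weights of the students can be computed with a short Python sketch. The variable names are chosen only for this illustration; the frequencies are those of the example.

```python
from itertools import accumulate

boundaries = [90.5 + 10 * k for k in range(15)]        # 90.5, 100.5, ..., 230.5
frequencies = [5, 34, 139, 300, 367, 319, 205, 76, 43, 16, 3, 4, 3, 1]

# Less than method: running total, paired with the upper limit of each class
less_than = list(accumulate(frequencies))

# More than method: total minus everything below the lower limit of each class
total = sum(frequencies)
more_than = [total - below for below in [0] + less_than[:-1]]

for upper, c in zip(boundaries[1:], less_than):
    print(f"less than {upper}: {c}")
for lower, c in zip(boundaries[:-1], more_than):
    print(f"more than {lower}: {c}")
```

Plotting the first set against the upper limits gives the rising curve, and the second set against the lower limits gives the declining curve.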

Although the graphs are a powerful and effective method of presenting statistical data, they are not

under all circumstances and for all purposes complete substitutes for tabular and other forms of presentation.

The specialist in this field is one who recognizes not only the advantages but also the limitations of these

techniques. He knows when to use and when not to use these methods and from his experience and expertise

is able to select the most appropriate method for every purpose.

Example : Draw an ogive by less than method and determine the number of companies earning profits

between Rs. 45 crores and Rs. 75 crores :

Profits (Rs. crores)    No. of Companies
10 – 20                 8
20 – 30                 12
30 – 40                 20
40 – 50                 24
50 – 60                 15
60 – 70                 10
70 – 80                 7
80 – 90                 3
90 – 100                1

Solution :

OGIVE BY LESS THAN METHOD

Profits No.of

(Rs. crores) Companies

Less than 20 8

Less than 30 20

Less than 40 40

Less than 50 64

Less than 60 79

Less than 70 89

Less than 80 96

Less than 90 99

Less than 100 100

It is clear from the graph that the number of companies getting profits less than Rs.75 crores is 92

and the number of companies getting profits less than Rs. 45 crores is 51. Hence the number of companies

getting profits between Rs. 45 crores and Rs. 75 crores is 92 – 51 = 41.
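Reading a value off the ogive amounts to interpolating between two plotted points. The sketch below assumes that the points are joined by straight lines; the function name companies_below is chosen only for this illustration. Straight-line interpolation gives 92.5 and 52 companies, which agree closely with the readings of 92 and 51 taken from the graph.

```python
from bisect import bisect_right

limits = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
cumulative = [0, 8, 20, 40, 64, 79, 89, 96, 99, 100]   # companies with profits below each limit

def companies_below(profit):
    """Read the less than ogive at `profit` by straight-line interpolation."""
    if profit <= limits[0]:
        return 0.0
    if profit >= limits[-1]:
        return float(cumulative[-1])
    j = bisect_right(limits, profit) - 1               # class in which the profit falls
    x0, x1 = limits[j], limits[j + 1]
    y0, y1 = cumulative[j], cumulative[j + 1]
    return y0 + (y1 - y0) * (profit - x0) / (x1 - x0)

below_75 = companies_below(75)
below_45 = companies_below(45)
print(below_75, below_45, below_75 - below_45)
```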

Example :The following distribution is with regard to weight in grams of mangoes of a given variety. If

mangoes of weight less than 443 grams be considered unsuitable for foreign market, what is the percentage

of total mangoes suitable for it? Assume the given frequency distribution to be typical of the variety:

Weight in gms.    No. of mangoes
410 – 419         10
420 – 429         20
430 – 439         42
440 – 449         54
450 – 459         45
460 – 469         18
470 – 479         7

Draw an ogive of ‘more than’ type of the above data and deduce how many mangoes will be more

than 443 grams.

Solution : Mangoes weighing more than 443 gms. are suitable for foreign market. Mangoes

weighing more than 443 gms. lie in the last four classes. The number of mangoes weighing between 444 and

449 grams would be (6/10) × 54 = 32.4.

Total number of mangoes weighing more than 443 gms. = 32.4 + 45 + 18 + 7 = 102.4

Percentage of mangoes = (102.4 / 196) × 100 = 52.25 (approx.)

Therefore, the percentage of the total mangoes suitable for foreign market is 52.25.

OGIVE BY MORE THAN METHOD

Weight more than (gms.) No. of Mangoes

410 196

420 186

430 166

440 124

450 70

460 25

470 7

From the graph it can be seen that there are 103 mangoes whose weight will be more than 443 gms.

and are suitable for foreign market.

DIAGRAM

Statistical data can be presented by means of frequency tables, graphs and diagrams. In this lesson,

so far we have discussed the graphical presentation. Now we shall take up the study of diagrams. There are

many varieties of diagrams but here we are concerned with the following types only :

(i) Bar diagrams

(ii) Rectangles, squares and circles

Bar Diagram

A bar diagram may be simple or component or multiple. A simple bar diagram is used to represent

only one variable. Length of the bars is proportional to the magnitude to be represented. But when we are

interested in showing various parts of a whole, then we construct component or composite bar diagrams.

Whenever a comparison of more than one variable is to be made at the same time, a multiple bar chart,

which groups two or more bar charts together, is made use of. We shall now illustrate these by examples.

Example 1 : The following table gives the approximate average yield of rice in lbs. per acre in various

countries of the world in 2000–05.

Country Yield in lbs. per acre

India 728

Siam 943

U.S.A. 1469

Italy 2903

Egypt 2153

Japan 2276

Indicate this by a suitable diagram

Solution :

In the above example, bars have been erected vertically. Also bars may be erected horizontally.

Example 2 : Draw a suitable diagram for the following data of expenditure of an average working class

family.

Item of Expenditure Percentage of Total Expenditure

Food 65

Clothing 10

Housing 12

Fuel and lighting 5

Miscellaneous 8

Solution : This is a case of percentage bar diagram as per cent of total expenditure is given.

Example 3 : Represent the following data with a suitable diagram.

Year Men Women Children Total

1990 180 110 100 390

1995 200 140 125 465

2000 250 200 150 600

Solution : This is a case of a component or composite bar diagram. In addition to the number of men, women

and children employed, the total labour force for each of the three years is shown.

Example 4 : The following table gives the number of companies at work in India for a few years. Represent

the data by a suitable diagram.

Year Public companies Private companies Total

2000 5000 20,000 25,000

2001 4000 16,000 20,000

2002 6000 18,000 24,000

2003 7000 21,000 28,000

2004 5000 15,000 20,000

Solution : The data can be shown with the help of a component bar diagram for each year. Also it

can be shown with the help of multiple bar diagram which is drawn below.

From the above diagram it is clear that comparison between the number of private companies and

public companies is very sharp as the data is placed side by side. But as compared to a component bar

diagram, no idea can be formed about the total number of companies at work.

Circles or Pie Diagrams

When circles are drawn with areas proportional to the figures they represent, they are said to form

pie-diagrams or circle diagrams. Since the area of a circle varies as the square of its radius, the radii are

taken proportional to the square roots of the magnitudes.

Suppose we are given the following figures

144, 81, 64, 16, 9

For the purpose of showing the data with the help of circles, we shall find out the square roots of all

the values. We get 12, 9, 8, 4 and 3. Now we shall use these values as radii of the different circles. It may be

noted that in this case, bar diagram would not show the comparison and also it would be difficult to draw as

there is a wide gap between the smallest and the highest value of the variate.
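The calculation of the radii can be sketched in Python as follows. The variable names are chosen only for this illustration; in practice the radii would be multiplied by any convenient common scale before drawing.

```python
import math

values = [144, 81, 64, 16, 9]

# The area of a circle varies as the square of its radius, so the radius
# of each circle is taken as the square root of the value it represents.
radii = [math.sqrt(v) for v in values]
print(radii)    # [12.0, 9.0, 8.0, 4.0, 3.0]
```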

Sub-divided Pie-diagrams

Sub-divided pie-diagrams are used when the component parts are to be compared with one another

and with the total.

The total value is equated to 360° and then the angles corresponding to the component parts are calculated.

Let us take an example.

Example : A rupee spent on “Khadi” is distributed as follows :

Farmer 20 Paise

Carder and spinner 35

Weaver 25

Washerman, dyer and printer 10

Administrative Agency 10

Total 100

Present the data in the form of a pie-diagram.

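Solution : The angles subtended at the centre would be calculated as follows :

Expenditure                     Paise    Angle
Farmer                          20       (20/100) × 360° = 72°
Carder and spinner              35       (35/100) × 360° = 126°
Weaver                          25       (25/100) × 360° = 90°
Washerman, dyer and printer     10       (10/100) × 360° = 36°
Administrative Agency           10       (10/100) × 360° = 36°
Total                           100      360°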

A sub-divided circle is drawn with the angles of 72°, 126°, 90°, 36° and 36° for the various items of

expenditure.
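The angles can be computed with a short Python sketch in which the total of 100 paise is equated to 360°. The variable names are chosen only for this illustration.

```python
shares = {
    "Farmer": 20,
    "Carder and spinner": 35,
    "Weaver": 25,
    "Washerman, dyer and printer": 10,
    "Administrative Agency": 10,
}

total = sum(shares.values())                      # 100 paise corresponds to 360 degrees
angles = {item: paise * 360 / total for item, paise in shares.items()}

for item, angle in angles.items():
    print(f"{item}: {angle:g} degrees")
```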

The above data could also be presented by a percentage bar diagram as there is not much difference

between the smallest and the highest values. It is simpler and easier to draw a bar diagram in this case.

Choice of a Suitable Diagram

The choice of diagram out of several ones in a given situation is a ticklish problem. The choice

primarily depends upon two factors, (i) the nature of the data; and (ii) the type of people for whom the

diagram is needed.

On the nature of the data would depend whether to use one dimensional, two dimensional or three

dimensional diagram, and if it is one dimensional, whether to adopt the simple bar or sub-divided bar, multiple

bar or some other type.

While selecting the diagram the type of the people for whom the diagram is intended must also be

considered. For example, for drawing attention of an uneducated mass, pictograms and cartograms are more

effective.

There are different types of bars and the appropriate type of bar chart can be decided on the following basis :

(a) Simple bar charts should be used where changes in totals are required to be conveyed.

(b) Component bar charts are more useful where changes in totals as well as in the size of component figures (absolute ones) are required to be displayed.

(c) Percentage composition bar charts are better suited where changes in the relative size of component figures are to be exhibited.

(d) Multiple bar charts should be used where changes in the absolute values of the component figures are to be emphasised and the overall total is of no importance.

However, multiple and component bar charts should be used only when there are not more than three

or four components as a large number of components make the bar charts too complex to enable worthwhile

visual impression to be gained. When a large number of components have to be shown a pie chart is more

suitable.

Occasionally, circles are used to represent size. But it is difficult to compare them and they should

not be used when it is possible to use bars. This is because it is easier to compare the lengths of lines or bars

than to compare areas or volume.


LESSON 3

MEASURES OF CENTRAL TENDENCY

What is Central Tendency

One of the important objectives of statistics is to find out various numerical values which explain the inherent characteristics of a frequency distribution. The first of such measures is averages. The averages are the measures which condense a huge unwieldy set of numerical data into single numerical values which represent the entire distribution. The inherent inability of the human mind to remember a large body of numerical data compels us to seek a few constants that will describe the data. Averages provide us the gist and give a bird’s eye view of the huge mass of unwieldy numerical data. Averages are the typical values around which other items of the distribution congregate. These values lie between the two extreme observations of the distribution and give us an idea about the concentration of the values in the central part of the distribution. They are called the measures of central tendency.

Averages are also called measures of location since they enable us to locate the position or place of

the distribution in question. Averages are statistical constants which enable us to comprehend in a single

value the significance of the whole group. According to Croxton and Cowden, an average value is a single

value within the range of the data that is used to represent all the values in that series. Since an average is

somewhere within the range of data, it is sometimes called a measure of central value. An average is the

most typical representative item of the group to which it belongs and which is capable of revealing all

important characteristics of that group or distribution.

What are the Objects of Central Tendency

The most important object of calculating an average or measuring central tendency is to determine a

single figure which may be used to represent a whole series involving magnitudes of the same variable.

The second object is that, since an average represents the entire data, it facilitates comparison within one group or between groups of data. Thus, the performance of the members of a group can be compared with the average performance of different groups.

Third object is that an average helps in computing various other statistical measures such as disper-

sion, skewness, kurtosis etc.

Essentials of a Good Average

Since an average represents the statistical data and is used for purposes of comparison, it must possess

the following properties.

1. It must be rigidly defined and not left to the mere estimation of the observer. If the definition is

rigid, the computed value of the average obtained by different persons shall be similar.

2. The average must be based upon all values given in the distribution. If the average is not based on all values it might not be representative of the entire group of data.

3. It should be easily understood. The average should possess simple and obvious properties. It should not be too abstract for the common people.

4. It should be capable of being calculated with reasonable care and rapidity.


5. It should be stable and unaffected by sampling fluctuations.

6. It should be capable of further algebraic manipulation.

Different methods of measuring “Central Tendency” provide us with different kinds of averages.

The following are the main types of averages that are commonly used:

1. Mean

(i) Arithmetic mean

(ii) Weighted mean

(iii) Geometric mean

(iv) Harmonic mean

2. Median

3. Mode

Arithmetic Mean: The arithmetic mean of a series is the quotient obtained by dividing the sum of the values by the number of items. In algebraic language, if $X_1, X_2, X_3, \ldots, X_n$ are the n values of a variate X, then the Arithmetic Mean is defined by the following formula:

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{N} = \frac{\sum X}{N}$$

where N is the number of items (here N = n).

Example : The following are the monthly salaries (Rs.) of ten employees in an office. Calculate the mean

salary of the employees: 250, 275, 265, 280, 400, 490, 670, 890, 1100, 1250.

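Solution : $\bar{X} = \frac{\sum X}{N} = \frac{250 + 275 + 265 + 280 + 400 + 490 + 670 + 890 + 1100 + 1250}{10} = \frac{5870}{10}$ = Rs. 587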

Short-cut Method: Direct method is suitable where the number of items is moderate and the figures are small integers. But if the number of items is large and/or the values of the variate are big, then the process of adding together all the values may be lengthy. To overcome this difficulty of computation, a short-cut method may be used. The short-cut method of computation is based on an important characteristic of the arithmetic mean, that is, the algebraic sum of the deviations of a series of individual observations from their mean is always equal to zero. Thus deviations of the various values of the variate from an assumed mean are computed and their sum is divided by the number of items. The quotient obtained is added to the assumed mean to find the arithmetic mean.

Symbolically, $\bar{X} = A + \frac{\sum dx}{N}$, where A is the assumed mean and dx are the deviations, dx = (X – A).

We can solve the previous example by short-cut method.

Computation of Arithmetic Mean

Serial Number    Salary (Rupees)    Deviations from assumed mean
                       X            dx = (X – A), where A = 400
     1.               250                 –150
     2.               275                 –125
     3.               265                 –135
     4.               280                 –120
     5.               400                    0
     6.               490                  +90
     7.               670                 +270
     8.               890                 +490
     9.              1100                 +700
    10.              1250                 +850
                    N = 10              Σdx = 1870

$\bar{X} = A + \frac{\sum dx}{N}$

By substituting the values in the formula, we get

$\bar{X} = 400 + \frac{1870}{10} = 400 + 187$ = Rs. 587
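The direct and short-cut methods for individual observations can be compared with a minimal Python sketch. The function names `mean_direct` and `mean_shortcut` are illustrative assumptions; the data are the ten salaries of the example.

```python
salaries = [250, 275, 265, 280, 400, 490, 670, 890, 1100, 1250]


def mean_direct(values):
    """Direct method: sum of the values divided by the number of items."""
    return sum(values) / len(values)


def mean_shortcut(values, assumed_mean):
    """Short-cut method: assumed mean plus the average deviation from it."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


direct = mean_direct(salaries)
shortcut = mean_shortcut(salaries, 400)
print(direct, shortcut)
```

Any assumed mean gives the same answer; a value near the middle of the data merely keeps the deviations small.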

Computation of Arithmetic Mean in Discrete series. In discrete series, arithmetic mean may be computed by both direct and short cut methods. The formula according to direct method is:

$$\bar{X} = \frac{\sum fX}{N}$$

where the variable values $X_1, X_2, \ldots, X_n$ have frequencies $f_1, f_2, \ldots, f_n$ and N = Σf.

Example : The following table gives the distribution of 100 accidents during seven days of the week in

a given month. During a particular month there were 5 Fridays and Saturdays and only four each of other

days. Calculate the average number of accidents per day.

Days :                 Sun.   Mon.   Tue.   Wed.   Thur.   Fri.   Sat.   Total
Number of accidents :   20     22     10      9     11       8     20    = 100

Solution : Calculation of Number of Accidents per Day

Day           No. of Accidents    No. of Days in Month    Total Accidents
                     X                     f                    fX
Sunday              20                     4                    80
Monday              22                     4                    88
Tuesday             10                     4                    40
Wednesday            9                     4                    36
Thursday            11                     4                    44
Friday               8                     5                    40
Saturday            20                     5                   100
                   100                  N = 30              ΣfX = 428

$\bar{X} = \frac{\sum fX}{N} = \frac{428}{30}$ = 14.27 ≈ 14 accidents per day


The formula for computation of arithmetic mean according to the short cut method is

$\bar{X} = A + \frac{\sum f\,dx}{N}$, where A is assumed mean, dx = (X – A) and N = Σf.

We can solve the previous example by short-cut method as given below :

Calculation of Average Accidents per Day

Day            X     dx = X – A        f      fdx
                     (where A = 10)
Sunday        20        +10            4      +40
Monday        22        +12            4      +48
Tuesday       10          0            4        0
Wednesday      9         –1            4       –4
Thursday      11         +1            4       +4
Friday         8         –2            5      –10
Saturday      20        +10            5      +50
                                    N = 30   Σfdx = +128

$\bar{X} = A + \frac{\sum f\,dx}{N} = 10 + \frac{128}{30} = 10 + 4.27$ = 14.27 ≈ 14 accidents per day

Calculation of arithmetic mean for Continuous Series: The arithmetic mean can be

computed both by direct and short-cut method. In addition, a coding method or step deviation method is

also applied for simplification of calculations. In any case, it is necessary to find out the mid-values of

the various classes in the frequency distribution before arithmetic mean of the frequency distribution

can be computed. Once the mid-points of various classes are found out, then the process of the calculation of arithmetic mean is the same as in the case of discrete series. In case of direct method, the formula to be used is:

$\bar{X} = \frac{\sum fm}{N}$, where m = mid-points of various classes and N = total frequency

In the short-cut method, the following formula is applied:

$\bar{X} = A + \frac{\sum f\,dx}{N}$, where dx = (m – A) and N = Σf

The short-cut method can further be simplified in practice and is named coding method. The

deviations from the assumed mean are divided by a common factor to reduce their size. The sum of the

products of the deviations and frequencies is multiplied by this common factor and then it is divided by the

total frequency and added to the assumed mean. Symbolically

$\bar{X} = A + \frac{\sum f\,d'x}{N} \times i$, where $d'x = \frac{m - A}{i}$ and i = common factor

Example : Following is the frequency distribution of marks obtained by 50 students in a test of Statistics:

Marks Number of Students

0 – 10 4

10 – 20 6


20 – 30 20

30 – 40 10

40 – 50 7

50 – 60 3

Calculate arithmetic mean by:

(i) direct method.

(ii) short-cut method, and

(iii) coding method

Solution : Calculation of Arithmetic Mean

Marks       f      m      fm     dx = (m – A)      d'x = dx/i        fdx      fd'x
  X                              (where A = 25)    (where i = 10)
0 – 10      4      5      20         –20               –2            –80       –8
10 – 20     6     15      90         –10               –1            –60       –6
20 – 30    20     25     500           0                0              0        0
30 – 40    10     35     350         +10               +1           +100      +10
40 – 50     7     45     315         +20               +2           +140      +14
50 – 60     3     55     165         +30               +3            +90       +9
         N = 50       Σfm = 1440                                 Σfdx = 190   Σfd'x = 19

Direct Method :

$\bar{X} = \frac{\sum fm}{N} = \frac{1440}{50}$ = 28.8 marks

Short-cut Method :

$\bar{X} = A + \frac{\sum f\,dx}{N} = 25 + \frac{190}{50}$ = 28.8 marks

Coding Method :

$\bar{X} = A + \frac{\sum f\,d'x}{N} \times i = 25 + \frac{19}{50} \times 10$ = 28.8 marks

We can observe that the average marks, i.e. 28.8, are identical by all three methods.
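The three methods for a continuous series can be sketched in Python as below. This is a minimal illustration under our own naming (`classes`, `freqs`, `mids` and the three result variables are assumptions); the data, the assumed mean A = 25 and the common factor i = 10 are those of the example.

```python
# Class limits and frequencies of the marks distribution in the example.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [4, 6, 20, 10, 7, 3]

mids = [(lower + upper) / 2 for lower, upper in classes]  # mid-values m
N = sum(freqs)
A = 25   # assumed mean
i = 10   # common factor (class interval)

# Direct method: sum of f*m divided by N.
direct = sum(f * m for f, m in zip(freqs, mids)) / N

# Short-cut method: A plus the mean of the deviations dx = m - A.
shortcut = A + sum(f * (m - A) for f, m in zip(freqs, mids)) / N

# Coding (step deviation) method: deviations divided by i, rescaled at the end.
coding = A + sum(f * ((m - A) / i) for f, m in zip(freqs, mids)) / N * i

print(direct, shortcut, coding)
```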

Mathematical Properties of the Arithmetic Mean

(i) The sum of the deviations of a given set of individual observations from the arithmetic mean is always zero. Symbolically, $\sum (X - \bar{X}) = 0$. It is due to this property that the arithmetic mean is characterised as the centre of gravity, i.e., the sum of positive deviations from the mean is equal to the sum of negative deviations.

(ii) The sum of squares of deviations of a set of observations is the minimum when deviations are taken from the arithmetic average. Symbolically, $\sum (X - \bar{X})^2$ is smaller than $\sum (X - \text{any other value})^2$.

We can verify the above properties with the help of the following data:


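Values        Deviations from Actual Mean          Deviations from Assumed Mean
  X        $(X - \bar{X})$    $(X - \bar{X})^2$       (X – A)      (X – A)²
  3              –6                 36                  –7            49
  5              –4                 16                  –5            25
 10               1                  1                   0             0
 12               3                  9                   2             4
 15               6                 36                   5            25
Total = 45        0                 98                  –5           103

$\bar{X} = \frac{\sum X}{N} = \frac{45}{5} = 9$, where A (assumed mean) = 10

The deviations from the actual mean add up to zero, and their sum of squares (98) is smaller than the sum of squares of deviations from the assumed mean (103).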
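Properties (i) and (ii) can be checked numerically with a short Python sketch. The helper names `sum_dev` and `sum_sq_dev` are hypothetical; the data are the five values of the table.

```python
values = [3, 5, 10, 12, 15]
mean = sum(values) / len(values)


def sum_dev(data, origin):
    """Algebraic sum of the deviations of the data from a chosen origin."""
    return sum(x - origin for x in data)


def sum_sq_dev(data, origin):
    """Sum of the squared deviations of the data from a chosen origin."""
    return sum((x - origin) ** 2 for x in data)


# Deviations taken from the arithmetic mean and from the assumed mean 10.
print(mean, sum_dev(values, mean), sum_sq_dev(values, mean))
print(sum_dev(values, 10), sum_sq_dev(values, 10))

# Property (ii): no other origin gives a smaller sum of squares.
others = [mean + step / 10 for step in range(-50, 51) if step != 0]
is_minimum = all(sum_sq_dev(values, o) > sum_sq_dev(values, mean) for o in others)
print(is_minimum)
```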

(iii) If each value of a variable X is increased or decreased or multiplied by a constant k, the

arithmetic mean also increases or decreases or multiplies by the same constant.

(iv) If we are given the arithmetic mean and number of items of two or more groups, we can compute the combined average of these groups by applying the following formula :

$$\bar{X}_{12} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2}$$

where $\bar{X}_{12}$ refers to combined average of two groups,

$\bar{X}_1$ refers to arithmetic mean of first group,

$\bar{X}_2$ refers to arithmetic mean of second group,

$N_1$ refers to number of items of first group, and

$N_2$ refers to number of items of second group.

We can understand the property with the help of the following examples.

Example : The average marks of 25 male students in a section is 61 and average marks of 35 female

students in the same section is 58. Find combined average marks of 60 students.

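Solution : We are given the following information.

$\bar{X}_1 = 61$, $N_1 = 25$, $\bar{X}_2 = 58$, $N_2 = 35$

Apply $\bar{X}_{12} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2} = \frac{(25 \times 61) + (35 \times 58)}{25 + 35} = \frac{1525 + 2030}{60} = \frac{3555}{60}$ = 59.25 marks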

Example : The mean wage of 100 workers in a factory, running two shifts of 60 and 40 workers respectively

is Rs.38. The mean wage of 60 workers in morning shift is Rs.40. Find the mean wage of 40 workers

working in the evening shift.

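Solution : We are given the following information

$\bar{X}_1 = 40$, $N_1 = 60$, $\bar{X}_2 = ?$, $N_2 = 40$, $\bar{X}_{12} = 38$, and N = 100

Apply $\bar{X}_{12} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2}$

$38 = \frac{(60 \times 40) + 40\bar{X}_2}{100}$ or $3800 = 2400 + 40\bar{X}_2$

$\bar{X}_2 = \frac{3800 - 2400}{40} = \frac{1400}{40}$ = Rs. 35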

Example : The mean age of a combined group of men and women is 30 years. If the mean age of the group of men is 32 and that of the women group is 27, find out the percentage of men and women in the group.

Solution : Let us take group of men as first group and women as second group. Therefore, $\bar{X}_1$ = 32 years, $\bar{X}_2$ = 27 years, and $\bar{X}_{12}$ = 30 years. In the problem, we are not given the number of men and women. We can assume $N_1 + N_2 = 100$ and therefore $N_1 = 100 - N_2$.

Apply $\bar{X}_{12} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2}$

$30 = \frac{32N_1 + 27N_2}{100}$ (Substitute $N_1 = 100 - N_2$)

$30 \times 100 = 32(100 - N_2) + 27N_2$ or $5N_2 = 200$

$N_2 = 200/5 = 40\%$

$N_1 = (100 - N_2) = (100 - 40) = 60\%$

Therefore, the percentage of men in the group is 60 and that of women is 40.
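The three uses of the combined-mean property in the examples above can be sketched in Python. All function names here are hypothetical helpers introduced for illustration; each simply rearranges the combined-mean formula.

```python
def combined_mean(means, sizes):
    """Combined mean of several groups from their means and sizes."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)


def other_group_mean(combined, mean_1, n_1, n_2):
    """Solve the combined-mean formula for the mean of the second group."""
    return (combined * (n_1 + n_2) - n_1 * mean_1) / n_2


def second_group_share(combined, mean_1, mean_2):
    """Percentage of items in the second group when only the means are known.

    Follows from combined = mean_1 * (1 - p) + mean_2 * p, solved for p.
    """
    return (mean_1 - combined) / (mean_1 - mean_2) * 100


marks = combined_mean([61, 58], [25, 35])          # male and female students
evening_wage = other_group_mean(38, 40, 60, 40)    # two factory shifts
women_percent = second_group_share(30, 32, 27)     # men first, women second
print(marks, evening_wage, women_percent)
```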

(v) Because $\bar{X} = \frac{\sum X}{N}$

∴ $\sum X = N\bar{X}$

If we replace each item in the series by the mean, the sum of these substitutions will be equal to the

sum of the individual items. This property is used to find out the aggregate values and corrected averages.

We can understand the property with the help of an example.

Example : The mean of 100 observations is found to be 44. If at the time of computation two items were wrongly taken as 30 and 27 in place of 3 and 72, find the corrected average.

Solution : $\bar{X} = \frac{\sum X}{N}$

∴ $\sum X = N\bar{X} = 100 \times 44 = 4400$

Corrected ΣX = ΣX + correct items – wrong items = 4400 + 3 + 72 – 30 – 27 = 4418

Corrected average = $\frac{\text{Corrected } \sum X}{N} = \frac{4418}{100}$ = 44.18

Calculation of Arithmetic mean for Open-End Classes

Open-end classes are those in which lower limit of the first class and the upper limit of the last class

are not defined. In these series, we can not calculate mean unless we make an assumption about the un-

known limits. The assumption depends upon the class-interval following the first class and preceding the last

class. For example:

Marks No. of Students

Below 15 4

15 – 30 6

30 – 45 12

45 – 60 8

Above 60 7

In this example, because all defined class-intervals are same, the assumption would be that the first

and last class shall have same class-interval of 15 and hence the lower limit of the first class shall be zero and

upper limit of last class shall be 75. Hence first class would be 0 – 15 and the last class 60 – 75.

What happens in this case?


Below 10 4


10 – 30 7

30 – 60 10

60 – 100 8

Above 100 4

In this problem the class interval is 20 in the second class, 30 in the third, 40 in the fourth class and so on, i.e. the class interval is increasing by 10. Therefore the appropriate assumption in this case would be that the lower limit of the first class is zero and the upper limit of the last class is 150. In case of other open-end class distributions the first class limit should be fixed on the basis of the succeeding class interval and the last class limit should be fixed on the basis of the preceding class interval.

If the class intervals are of varying width, an effort should be made to avoid calculating mean and

mode. It is advisable to calculate median.

Weighted Mean

In the computation of arithmetic mean, we give equal importance to each item in the series.

Raja Toy Shop sells : Toy Cars at Rs. 3 each; Toy Locomotives at Rs. 5 each; Toy Aeroplanes at Rs. 7 each; and Toy Double Deckers at Rs. 9 each.

What shall be the average price of the toys sold if the shop sells 4 toys, one of each kind?

$\bar{X}$ (Mean Price) = $\frac{\sum X}{N} = \frac{3 + 5 + 7 + 9}{4} = \frac{24}{4}$ = Rs. 6

In this case the importance of each toy is equal as one toy of each variety has been sold. While computing the arithmetic mean this fact has been taken care of by including the price of each toy once only.

But if the shop sells 100 toys, 50 cars, 25 locomotives, 15 aeroplanes and 10 double deckers, the

importance of the four toys to the dealer is not equal as a source of earning revenue. In fact their respec-

tive importance is equal to the number of units of each toy sold, i.e., the importance of Toy car is 50; the

importance of Locomotive is 25; the importance of Aeroplane is 15; and the importance of Double Decker

is 10.

It may be noted that 50, 25, 15, 10 are the quantities of the various classes of toys sold. These

quantities are called as ‘weights’ in statistical language. Weight is represented by symbol W and ΣW repre-

sents the sum of weights.

While determining the average price of toy sold these weights are of great importance and are taken

into account to compute weighted mean.

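$$\bar{X}_w = \frac{W_1X_1 + W_2X_2 + W_3X_3 + W_4X_4}{W_1 + W_2 + W_3 + W_4} = \frac{\sum WX}{\sum W}$$

where $W_1, W_2, W_3, W_4$ are weights and $X_1, X_2, X_3, X_4$ represent the prices of the 4 varieties of toy.

Hence by substituting the values of $W_1, W_2, W_3, W_4$ and $X_1, X_2, X_3, X_4$, we get

$$\bar{X}_w = \frac{(50 \times 3) + (25 \times 5) + (15 \times 7) + (10 \times 9)}{50 + 25 + 15 + 10} = \frac{150 + 125 + 105 + 90}{100} = \frac{470}{100} = \text{Rs. } 4.70$$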

The table given below demonstrates the procedure of computing the weighted Mean.

Weighted Arithmetic Mean of Toys sold by the Raja Toy Shop

Toy              Price per toy (Rs.)    Number Sold    Price × Weight
                         X                   W               WX
Car                      3                  50              150
Locomotive               5                  25              125
Aeroplane                7                  15              105
Double Decker            9                  10               90
                                         ΣW = 100       ΣWX = 470

∴ $\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{470}{100}$ = Rs. 4.70
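The difference between the simple and the weighted mean for the toy shop can be sketched in Python. The function name `weighted_mean` is an illustrative assumption; prices and quantities are those of the table.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


prices = [3, 5, 7, 9]          # car, locomotive, aeroplane, double decker
units_sold = [50, 25, 15, 10]  # weights W

simple = sum(prices) / len(prices)           # every toy counted once
weighted = weighted_mean(prices, units_sold)  # every toy counted W times
print(simple, weighted)
```

The weighted mean is lower than the simple mean here because the cheapest toy carries the largest weight.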

Example: The table below shows the number of skilled and unskilled workers in two localities along with their average hourly wages.

                          Ram Nagar                        Shyam Nagar
Worker Category     Number    Wages (per hour)       Number    Wages (per hour)
Skilled               150          1.80                350          1.75
Unskilled             850          1.30                650          1.25

Determine the average hourly wage in each locality. Also give reasons why the results show that the average hourly wage in Shyam Nagar exceeds the average hourly wage in Ram Nagar even though in Shyam Nagar the average hourly wages of both categories of workers are lower. It is required to compute weighted mean.

Solution :

                     Ram Nagar                     Shyam Nagar
                 X       W       WX            X       W        WX
Skilled        1.80     150      270         1.75     350     612.50
Unskilled      1.30     850     1105         1.25     650     812.50
Total                  1000     1375                 1000    1425.00

Ram Nagar : $\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{1375}{1000} = 1.375$

Shyam Nagar : $\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{1425}{1000} = 1.425$

It may be noted that weights are more evenly assigned to the different categories of workers in Shyam Nagar than in Ram Nagar. The better-paid skilled workers form 35 per cent of the workers in Shyam Nagar but only 15 per cent in Ram Nagar, which raises the weighted mean of Shyam Nagar.

Geometric Mean :

In general, if we have n numbers (none of them being zero), then the G.M. is defined as

G.M. = $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$

In case of a discrete series, if $x_1, x_2, \ldots, x_n$ occur $f_1, f_2, \ldots, f_n$ times respectively and N is the total frequency (i.e. $N = f_1 + f_2 + \cdots + f_n$), then

G.M. = $\sqrt[N]{x_1^{f_1} \cdot x_2^{f_2} \cdots x_n^{f_n}}$


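For convenience, use of logarithms is made extensively to calculate the nth root. In terms of logarithms

G.M. = $AL\left(\frac{\sum \log x}{N}\right)$ for individual observations

= $AL\left(\frac{\sum f \log x}{N}\right)$ for a discrete series, where AL refers to antilog,

and in case of continuous series, G.M. = $AL\left(\frac{\sum f \log m}{N}\right)$, where m refers to the mid-points of the classes.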

Example: Calculate G.M. of the following data :

2, 4, 8

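Solution: G.M. = $\sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$

In terms of logarithms, the question can be solved as follows :

log 2 = 0.3010, log 4 = 0.6021, and log 8 = 0.9031

Apply the formula :

G.M. = $AL\left(\frac{\sum \log x}{N}\right) = AL\left(\frac{0.3010 + 0.6021 + 0.9031}{3}\right) = AL\left(\frac{1.8062}{3}\right) = AL(0.6021) = 4$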

Example : Calculate geometric mean of the following data :

x 5 6 7 8 9 10 11

f 2 4 7 10 9 6 2

Solution : Calculation of G.M.

  x       log x       f      f log x
  5       0.6990      2       1.3980
  6       0.7782      4       3.1128
  7       0.8451      7       5.9157
  8       0.9031     10       9.0310
  9       0.9542      9       8.5878
 10       1.0000      6       6.0000
 11       1.0414      2       2.0828
                   N = 40   Σf log x = 36.1281

G.M. = $AL\left(\frac{\sum f \log x}{N}\right) = AL\left(\frac{36.1281}{40}\right) = AL(0.9032) = 8.002$

Example : Calculate G.M. from the following data :

X f

9.5 – 14.5 10

14.5 – 19.5 15


19.5 – 24.5 17

24.5 – 29.5 25

29.5 – 34.5 18

34.5 – 39.5 12

39.5 – 44.5 8

Solution: Calculation of G.M.

     X            m      log m       f      f log m
 9.5 – 14.5      12      1.0792     10      10.7920
14.5 – 19.5      17      1.2304     15      18.4560
19.5 – 24.5      22      1.3424     17      22.8208
24.5 – 29.5      27      1.4314     25      35.7850
29.5 – 34.5      32      1.5051     18      27.0918
34.5 – 39.5      37      1.5682     12      18.8184
39.5 – 44.5      42      1.6232      8      12.9856
                                 N = 105   Σf log m = 146.7496

G.M. = $AL\left(\frac{\sum f \log m}{N}\right) = AL\left(\frac{146.7496}{105}\right) = AL(1.3976) = 24.98$
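The logarithmic calculation of the geometric mean can be sketched in Python for all three kinds of series. The function name `geometric_mean` and its optional `freqs` argument are our own assumptions. The sketch uses full-precision logarithms, so the results may differ in the last digit from values obtained with four-figure log tables.

```python
import math


def geometric_mean(values, freqs=None):
    """G.M. = antilog of (sum of f * log x) / N, using base-10 logarithms."""
    if freqs is None:
        freqs = [1] * len(values)  # individual observations: each counted once
    if any(x <= 0 for x in values):
        raise ValueError("the logarithmic method needs positive values")
    total = sum(freqs)
    log_sum = sum(f * math.log10(x) for x, f in zip(values, freqs))
    return 10 ** (log_sum / total)


gm_individual = geometric_mean([2, 4, 8])
gm_discrete = geometric_mean([5, 6, 7, 8, 9, 10, 11], [2, 4, 7, 10, 9, 6, 2])
# Continuous series: the mid-points of the classes stand in for the values.
gm_continuous = geometric_mean([12, 17, 22, 27, 32, 37, 42],
                               [10, 15, 17, 25, 18, 12, 8])
print(round(gm_individual, 2), round(gm_discrete, 2), round(gm_continuous, 2))
```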

Specific uses of G.M. : The geometric Mean has certain specific uses, some of them are :

(i) It is used in the construction of index numbers.

(ii) It is also helpful in finding out the compound rates of change such as the rate of growth of

population in a country.

(iii) It is suitable where the data are expressed in terms of rates, ratios and percentage.

(iv) It is quite useful in computing the average rates of depreciation or appreciation.

(v) It is most suitable when large weights are to be assigned to small items and small weights to

large items.

Example : The gross national product of a country was Rs. 1,000 crores 10 years earlier. It is Rs. 2,000 crores now. Calculate the rate of growth in G.N.P.

Solution: In this case compound interest formula will be used for computing the average annual per cent increase of growth.

$P_n = P_0 (1 + r)^n$

where $P_n$ = principal sum (or any other variate) at the end of the period,

$P_0$ = principal sum in the beginning of the period,

r = rate of increase or decrease,

n = number of years.

It may be noted that the above formula can also be written in the following form :

$r = \sqrt[n]{\frac{P_n}{P_0}} - 1$

Substituting the values given in the formula, we have

$r = \sqrt[10]{\frac{2000}{1000}} - 1 = \sqrt[10]{2} - 1$

$= 1.0718 - 1 = 0.0718$

Hence, the rate of growth in GNP is 7.18%.

Example : The price of a commodity increased by 5 per cent from 2001 to 2002, 8 per cent from 2002 to 2003 and 77 per cent from 2003 to 2004. The average increase from 2001 to 2004 is quoted at 26 per cent and not 30 per cent. Explain this statement and verify the arithmetic.

Solution : Taking $P_n$ as the price at the end of the period and $P_0$ as the price in the beginning, we can substitute the values of $P_n$ and $P_0$ in the compound interest formula. Taking $P_0 = 100$, we get $P_n = 100 \times 1.05 \times 1.08 \times 1.77 = 200.72$.

$P_n = P_0 (1 + r)^n$

$200.72 = 100 (1 + r)^3$

or $(1 + r)^3 = \frac{200.72}{100}$ or $1 + r = \sqrt[3]{2.0072}$

$r = \sqrt[3]{2.0072} - 1 \approx 1.26 - 1 = 0.26 = 26\%$

Thus the average increase is not the arithmetic average (5 + 8 + 77)/3 = 30 per cent. It is 26% as found out by G.M.
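The compound interest formula rearranged for r can be sketched in Python as follows. The function name `average_growth_rate` is a hypothetical helper; the figures are those of the two examples above.

```python
def average_growth_rate(start, end, periods):
    """Solve end = start * (1 + r) ** periods for the rate r per period."""
    return (end / start) ** (1 / periods) - 1


# G.N.P. doubling from 1,000 to 2,000 crores over 10 years.
gnp_rate = average_growth_rate(1000, 2000, 10)

# Price rising by 5, 8 and 77 per cent in three successive years.
final_price = 100 * 1.05 * 1.08 * 1.77
price_rate = average_growth_rate(100, final_price, 3)
arithmetic_average = (5 + 8 + 77) / 3

print(f"G.N.P. growth: {gnp_rate:.2%}")
print(f"Average price increase: {price_rate:.1%} (arithmetic average: {arithmetic_average:.0f}%)")
```

Applying the rate found this way for three years in succession reproduces the final price, which the arithmetic average of 30 per cent would not.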

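Weighted G.M.: The weighted G.M. is calculated with the help of the following formula :

G.M. = $\sqrt[\sum W]{x_1^{W_1} \cdot x_2^{W_2} \cdots x_n^{W_n}}$

= $AL\left(\frac{W_1 \log x_1 + W_2 \log x_2 + \cdots + W_n \log x_n}{W_1 + W_2 + \cdots + W_n}\right)$

= $AL\left(\frac{\sum W \log x}{\sum W}\right)$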

Example : Find out weighted G.M. from the following data :

Group Index Number Weights

Food 352 48

Fuel 220 10

Cloth 230 8

House Rent 160 12

Misc. 190 15

Solution : Calculation of Weighted G.M.

Group          Index Number (x)    Weights (W)    log x      W log x
Food                 352               48         2.5465     122.2320
Fuel                 220               10         2.3424      23.4240
Cloth                230                8         2.3617      18.8936
House Rent           160               12         2.2041      26.4492
Misc.                190               15         2.2788      34.1820
                                    ΣW = 93             ΣW log x = 225.1808

G.M. = $AL\left(\frac{\sum W \log x}{\sum W}\right) = AL\left(\frac{225.1808}{93}\right) = AL(2.4213) = 263.8$

Example: A machine depreciates at the rate of 35.5% per annum in the first year, at the rate of 22.5% per

annum in the second year, and at the rate of 9.5% per annum in the third year, each percentage being

computed on the actual value. What is the average rate of depreciation?

Solution: Average rate of depreciation can be calculated by taking G.M.

Year      X (value, taking 100 as base)      log X
I           100 – 35.5 = 64.5               1.8096
II          100 – 22.5 = 77.5               1.8893
III         100 – 9.5 = 90.5                1.9566
                                       Σ log X = 5.6555

Apply G.M. = $AL\left(\frac{\sum \log X}{N}\right) = AL\left(\frac{5.6555}{3}\right) = AL(1.8852) = 76.77$

∴ Average rate of depreciation = 100 – 76.77 = 23.23%.

Example : The arithmetic mean and geometric mean of two values are 10 and 8 respectively. Find the

values.

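Solution : If two values are taken as a and b, then

$\frac{a + b}{2} = 10$ and $\sqrt{ab} = 8$

or a + b = 20, ab = 64

then a – b = $\sqrt{(a + b)^2 - 4ab} = \sqrt{(20)^2 - 4 \times 64} = \sqrt{400 - 256} = \sqrt{144} = 12$

Now, we have a + b = 20 ...(i)

a – b = 12 ...(ii)

Solving (i) and (ii) for a and b, we get a = 16 and b = 4.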

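Harmonic Mean : The harmonic mean is defined as the reciprocal of the average of the reciprocals of the items in a series. Symbolically,

H. M. = $\frac{N}{\frac{1}{X_1} + \frac{1}{X_2} + \cdots + \frac{1}{X_n}} = \frac{N}{\sum \frac{1}{X}}$

In case of a discrete series,

H. M. = $\frac{N}{\sum \left(f \times \frac{1}{X}\right)}$

and in case of a continuous series,

H. M. = $\frac{N}{\sum \left(f \times \frac{1}{m}\right)}$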

It may be noted that none of the values of the variable should be zero.

Example: Calculate harmonic mean from the following data: 5, 15, 25, 35 and 45.

Solution :

  X        1/X
  5       0.200
 15       0.067
 25       0.040
 35       0.029
 45       0.022
N = 5   Σ(1/X) = 0.358

H.M. = $\frac{N}{\sum \frac{1}{X}} = \frac{5}{0.358} = 13.97$

Example : From the following data compute the value of the harmonic mean :

x : 5 15 25 35 45

f : 5 15 10 15 5

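Solution : Calculation of Harmonic Mean

  X        f        1/X       f × (1/X)
  5        5       0.200        1.000
 15       15       0.067        1.005
 25       10       0.040        0.400
 35       15       0.029        0.435
 45        5       0.022        0.110
        Σf = 50             Σ(f × 1/X) = 2.950

H.M. = $\frac{N}{\sum \left(f \times \frac{1}{X}\right)} = \frac{50}{2.950} = 16.95$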

Example : Calculate harmonic mean from the following distribution :

x f

0 – 10 5

10 – 20 15

20 – 30 10

30 – 40 15

40 – 50 5

Solution : First of all, we shall find out mid points of the various classes. They are 5, 15, 25, 35 and 45.

Then we will calculate the H.M. by applying the following formula :

H.M. = $\frac{N}{\sum \left(f \times \frac{1}{m}\right)}$

Calculation of Harmonic Mean

m (Mid Points)      f        1/m       f × (1/m)
      5             5       0.200        1.000
     15            15       0.067        1.005
     25            10       0.040        0.400
     35            15       0.029        0.435
     45             5       0.022        0.110
                 Σf = 50             Σ(f × 1/m) = 2.950

H.M. = $\frac{50}{2.950} = 16.95$
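The harmonic mean of the three examples can be sketched in Python. The function name `harmonic_mean` and its optional `freqs` argument are illustrative assumptions. The sketch uses unrounded reciprocals, so its results (about 13.99 and 17.01) are slightly higher than those obtained from reciprocals rounded to three decimal places.

```python
def harmonic_mean(values, freqs=None):
    """H.M. = N divided by the sum of f * (1 / x); no value may be zero."""
    if freqs is None:
        freqs = [1] * len(values)  # individual observations
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean is undefined when a value is zero")
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))


values = [5, 15, 25, 35, 45]  # also the mid-points of the classes 0-10 ... 40-50
hm_individual = harmonic_mean(values)
hm_grouped = harmonic_mean(values, [5, 15, 10, 15, 5])
print(round(hm_individual, 2), round(hm_grouped, 2))
```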

Application of Harmonic Mean to special cases: Like Geometric means, the harmonic mean is

also applicable to certain special types of problems. Some of them are:

(i) If in averaging time rates, distance is constant, then H.M. is to be calculated.

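Example: A man travels 480 km a day. On the first day he travels for 12 hours @ 40 km. per hour and on the second day for 10 hours @ 48 km. per hour. On the third day he travels for 15 hours @ 32 km. per hour. Find his average speed.

Solution: We shall use the harmonic mean,

H.M. = $\frac{N}{\sum \frac{1}{X}} = \frac{3}{\frac{1}{40} + \frac{1}{48} + \frac{1}{32}} = \frac{3 \times 480}{12 + 10 + 15} = \frac{1440}{37}$ = 38.92 km. per hour

The arithmetic mean would be $\frac{40 + 48 + 32}{3}$ = 40 km. per hour, which is not correct, since 1,440 km. covered in 37 hours gives 38.92 km. per hour.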

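(ii) If, in averaging the price data, the prices are expressed as “quantity per rupee”, then harmonic mean should be applied.

Example : A man purchased one kilo of cabbage from each of four places at the rate of 20 kg., 16 kg., 12 kg. and 10 kg. per rupee respectively. On the average, how many kilos of cabbage has he purchased per rupee?

Solution : H.M. = $\frac{N}{\sum \frac{1}{X}} = \frac{4}{\frac{1}{20} + \frac{1}{16} + \frac{1}{12} + \frac{1}{10}} = \frac{4 \times 240}{12 + 15 + 20 + 24} = \frac{960}{71}$ = 13.52 kg. per rupee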
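Why the harmonic mean is the right average for rates over a constant distance or quantity can be checked with a small Python sketch. The variable names are our own, and the third day's travelling time is taken as 480/32 = 15 hours, as the constant distance of 480 km. implies.

```python
speeds = [40, 48, 32]   # km. per hour on the three days
distance_per_day = 480  # km., the same every day

# Harmonic mean of the speeds.
hm_speed = len(speeds) / sum(1 / s for s in speeds)

# Direct check: total distance divided by total time.
hours = [distance_per_day / s for s in speeds]
true_average = distance_per_day * len(speeds) / sum(hours)

am_speed = sum(speeds) / len(speeds)  # overstates the average speed
print(round(hm_speed, 2), round(true_average, 2), am_speed)

# Cabbage bought one kilo at a time at 20, 16, 12 and 10 kg. per rupee.
rates = [20, 16, 12, 10]
kilos_per_rupee = len(rates) / sum(1 / r for r in rates)
print(round(kilos_per_rupee, 2))
```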

POSITIONAL AVERAGES

Median

The median is that value of the variable which divides the group in two equal parts, one part comprising all the values greater than the median and the other all the values less than the median. Median of a distribution may be defined as that value of the variable which exceeds and is exceeded by the same number of observations. It is the value such that the number of observations above it is equal to the number of observations below it. Thus, whereas the arithmetic mean is based on all items of the distribution, the median is a positional average, that is, it depends upon the position occupied by a value in the frequency distribution.


When the items of a series are arranged in ascending or descending order of magnitude, the value of the middle item in the series is known as median in the case of individual observations. Symbolically,

Median = size of $\frac{N + 1}{2}$th item

If the number of items is even, then there is no value exactly in the middle of the series. In such a situation the median is arbitrarily taken to be halfway between the two middle items. Symbolically,

Median = $\frac{\text{size of } \frac{N}{2}\text{th item} + \text{size of } \left(\frac{N}{2} + 1\right)\text{th item}}{2}$

Example : Find the median of the following series:

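Solution : Computation of Median

          (i)                         (ii)
Serial No.      X           Serial No.      X
    1           3               1           5
    2           4               2           5
    3           4               3           7
    4           5               4           9
    5           6               5          11
    6           8               6          12
    7           8               7          15
    8           8               8          28
    9          10
  N = 9                       N = 8

For (i) series, Median = size of $\frac{N + 1}{2}$th item = size of $\frac{9 + 1}{2}$th item = size of 5th item = 6

For (ii) series, Median = size of $\frac{N + 1}{2}$th item = size of $\frac{8 + 1}{2}$th item = size of 4.5th item

= $\frac{\text{size of 4th item} + \text{size of 5th item}}{2} = \frac{9 + 11}{2} = 10$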

Location of Median in Discrete series: In a discrete series, median is computed in the following manner:

(i) Arrange the given variable data in ascending or descending order.

(ii) Find cumulative frequencies.

(iii) Apply Med. = size of $\frac{N + 1}{2}$th item.

(iv) Locate the median according to the size, i.e., the variable corresponding to the cumulative frequency equal to the size or, failing that, to the next higher cumulative frequency.

Example: Following are the number of rooms in the houses of a particular locality. Find median of the data:

No. of rooms: 3 4 5 6 7 8

No of houses: 38 654 311 42 12 2

Solution: Computation of Median

No. of Rooms    No. of Houses    Cumulative Frequency
     X                f                  Cf
     3               38                  38
     4              654                 692
     5              311                1003
     6               42                1045
     7               12                1057
     8                2                1059

Median = size of $\frac{N + 1}{2}$th item = size of $\frac{1059 + 1}{2}$th item = 530th item.

Median lies in the cumulative frequency of 692 and the value corresponding to this is 4.

Therefore, Median = 4 rooms.
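The steps for locating the median in a discrete series can be sketched in Python. The function name `discrete_median` is a hypothetical helper, and the sketch makes one simplification: when (N + 1)/2 falls halfway between two items with different values, it returns the higher item instead of averaging the two.

```python
from itertools import accumulate


def discrete_median(values, freqs):
    """Locate the median of a discrete series by cumulative frequencies.

    `values` must already be arranged in ascending order.
    """
    n = sum(freqs)
    position = (n + 1) / 2  # size of the (N + 1)/2 th item
    for value, cf in zip(values, accumulate(freqs)):
        # The first cumulative frequency that reaches the position marks the median.
        if cf >= position:
            return value


rooms = [3, 4, 5, 6, 7, 8]
houses = [38, 654, 311, 42, 12, 2]
median_rooms = discrete_median(rooms, houses)
print(median_rooms)
```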

In a continuous series, median is computed in the following manner:

(i) Arrange the given variable data in ascending or descending order.

(ii) If inclusive series is given, it must be converted into exclusive series to find the real class intervals.

(iii) Find cumulative frequencies.

(iv) Apply Median = size of $\frac{N}{2}$th item to ascertain median class.

(v) Apply formula of interpolation to ascertain the value of median.

Median = $l_1 + \frac{\frac{N}{2} - cf_0}{f} \times (l_2 - l_1)$ or Median = $l_2 - \frac{(cf_0 + f) - \frac{N}{2}}{f} \times (l_2 - l_1)$

where, $l_1$ refers to lower limit of median class,

$l_2$ refers to higher limit of median class,

$cf_0$ refers to cumulative frequency of the class previous to the median class,

f refers to frequency of median class.

Example: The following table gives you the distribution of marks secured by some students in an examina-

tion:


0—20 42

21—30 38

31—40 120

41—50 84

51— 60 48

61—70 36

71—80 31

Find the median marks.

Solution: Calculation of Median Marks

Marks       No. of Students      cf
 (x)             (f)
 0 – 20           42             42
21 – 30           38             80
31 – 40          120            200
41 – 50           84            284
51 – 60           48            332
61 – 70           36            368
71 – 80           31            399

Median = size of $\frac{N}{2}$th item = size of $\frac{399}{2}$th item = 199.5th item,

which lies in (31 – 40) group, therefore the median class is 30.5 – 40.5.

Applying the formula of interpolation,

Median = $l_1 + \frac{\frac{N}{2} - cf_0}{f} \times (l_2 - l_1)$

= $30.5 + \frac{199.5 - 80}{120} \times 10 = 30.5 + 9.96$ = 40.46 marks
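The interpolation for the median of a continuous series, including the conversion of an inclusive series into an exclusive one, can be sketched in Python. The function name `grouped_median` is a hypothetical helper, and the conversion by ±0.5 assumes that marks are recorded in whole numbers.

```python
def grouped_median(boundaries, freqs):
    """Median = l1 + (N/2 - cf0) / f * (l2 - l1) for the median class."""
    target = sum(freqs) / 2  # size of the N/2 th item
    cf_prev = 0              # cumulative frequency before the current class
    for (l1, l2), f in zip(boundaries, freqs):
        if cf_prev + f >= target:
            return l1 + (target - cf_prev) / f * (l2 - l1)
        cf_prev += f


# Inclusive classes of the example and their frequencies.
inclusive = [(0, 20), (21, 30), (31, 40), (41, 50), (51, 60), (61, 70), (71, 80)]
students = [42, 38, 120, 84, 48, 36, 31]

# Convert to exclusive class boundaries: 31-40 becomes 30.5-40.5, and so on.
boundaries = [(lower - 0.5, upper + 0.5) for lower, upper in inclusive]

median_marks = grouped_median(boundaries, students)
print(round(median_marks, 2))
```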

Related Positional Measures: The median divides the series into two equal parts. Similarly there are certain other measures which divide the series into certain equal parts; these are the first quartile, third quartile, deciles, percentiles etc. If the items are arranged in ascending or descending order of magnitude, $Q_1$ is that value which covers 1/4th of the total number of items. Similarly, if the total number of items is divided into ten equal parts, then there shall be nine deciles.

Symbolically,

First quartile ($Q_1$) = size of $\frac{N}{4}$th item

Third quartile ($Q_3$) = size of $\frac{3N}{4}$th item

First decile ($D_1$) = size of $\frac{N}{10}$th item

Sixth decile ($D_6$) = size of $\frac{6N}{10}$th item

First percentile ($P_1$) = size of $\frac{N}{100}$th item

Once values of the items are found out, then formulae of interpolation are applied for ascertaining the value of $Q_1$, $Q_2$, $D_1$, $D_4$, $P_{40}$ etc.

Example: Calculate $Q_1$, $Q_3$, $D_2$ and $P_5$ from the following data:

Marks :             Below 10    10 – 20    20 – 40    40 – 60    60 – 80    Above 80
No. of Students :       8          10         22         25         10          5

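Solution: Calculation of Positional Values

Marks        No. of Students (f)     C.f.
Below 10             8                 8
10 – 20             10                18
20 – 40             22                40
40 – 60             25                65
60 – 80             10                75
Above 80             5                80
                  N = 80

$Q_1$ = size of $\frac{N}{4}$th item = $\frac{80}{4}$ = 20th item

Hence $Q_1$ lies in the class 20 – 40, apply

$Q_1 = l_1 + \frac{\frac{N}{4} - Cf_0}{f} \times i$, where $l_1 = 20$, $Cf_0 = 18$, f = 22 and $i = (l_2 - l_1) = 20$

By substituting the values, we get

$Q_1 = 20 + \frac{20 - 18}{22} \times 20 = 20 + 1.82 = 21.82$

Similarly, we can calculate

$Q_3$ = size of $\frac{3N}{4}$th item = $\frac{3 \times 80}{4}$th item = 60th item

Hence $Q_3$ lies in the class 40 – 60, apply

$Q_3 = l_1 + \frac{\frac{3N}{4} - Cf_0}{f} \times i$, where $l_1 = 40$, $Cf_0 = 40$, f = 25, i = 20

∴ $Q_3 = 40 + \frac{60 - 40}{25} \times 20 = 40 + 16 = 56$

$D_2$ = size of $\frac{2N}{10}$th item = 16th item. Hence $D_2$ lies in the class 10 – 20.

$D_2 = l_1 + \frac{\frac{2N}{10} - Cf_0}{f} \times i$, where $l_1 = 10$, $Cf_0 = 8$, f = 10, i = 10

$D_2 = 10 + \frac{16 - 8}{10} \times 10 = 10 + 8 = 18$

$P_5$ = size of $\frac{5N}{100}$th item = $\frac{5 \times 80}{100}$th item = 4th item. Hence $P_5$ lies in the class 0 – 10.

$P_5 = l_1 + \frac{\frac{5N}{100} - Cf_0}{f} \times i$, where $l_1 = 0$, $Cf_0 = 0$, f = 8, i = 10

$P_5 = 0 + \frac{4 - 0}{8} \times 10 = 5$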
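Quartiles, deciles and percentiles all use the same interpolation, differing only in the fraction of N taken, so one function can serve for all of them. The sketch below is an illustration under our own naming (`grouped_quantile`); the open-end classes are closed at 0 and 100, which is an assumption (only the lower limit 0 is actually used in the calculations).

```python
def grouped_quantile(classes, freqs, fraction):
    """Value below which `fraction` of the items lie, by linear interpolation."""
    target = fraction * sum(freqs)  # e.g. N/4 for Q1, 2N/10 for D2
    cf_prev = 0                     # cumulative frequency of preceding classes
    for (l1, l2), f in zip(classes, freqs):
        if cf_prev + f >= target:
            return l1 + (target - cf_prev) / f * (l2 - l1)
        cf_prev += f


# "Below 10" is taken as 0-10 and "Above 80" as 80-100 (assumed limits).
classes = [(0, 10), (10, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
students = [8, 10, 22, 25, 10, 5]

q1 = grouped_quantile(classes, students, 1 / 4)
q3 = grouped_quantile(classes, students, 3 / 4)
d2 = grouped_quantile(classes, students, 2 / 10)
p5 = grouped_quantile(classes, students, 5 / 100)
print(round(q1, 2), round(q3, 2), round(d2, 2), round(p5, 2))
```

Passing a fraction of 1/2 gives the median by the same rule.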

Calculation of Missing Frequencies:

Example: In the frequency distribution of 100 families given below: the number of families corresponding to

expenditure groups 20 – 40 and 60 – 80 are missing from the table. However the median is known to be 50.

Find out the missing frequencies.

Expenditure: 0 – 20 20 – 40 40 – 60 60 – 80 80 – 100

No. of families: 14 ? 27 ? 15

Solution: We shall assume the missing frequencies for the classes 20 – 40 to be x and 60 – 80 to be y.

Expenditure (Rs.)     No. of Families      C.f.
  0 – 20                   14              14
 20 – 40                    x              14 + x
 40 – 60                   27              14 + 27 + x
 60 – 80                    y              41 + x + y
 80 – 100                  15              41 + 15 + x + y
                        N = 100            = 56 + x + y

From the table, we have N = Σf = 56 + x + y = 100

∴ x + y = 100 – 56 = 44

Median is given as 50, which lies in the class 40 – 60, which becomes the median class.

By using the median formula we get:

Median = $l_1 + \frac{\frac{N}{2} - cf_0}{f} \times (l_2 - l_1)$

∴ $50 = 40 + \frac{50 - (14 + x)}{27} \times (60 - 40)$ or $50 = 40 + \frac{36 - x}{27} \times 20$

or $50 - 40 = \frac{36 - x}{27} \times 20$ or $50 - 40 = \frac{720 - 20x}{27}$

or 10 × 27 = 720 – 20x or 270 = 720 – 20x

∴ 20x = 720 – 270

x = $\frac{450}{20}$ = 22.5

By substituting the value of x in the equation,

x + y = 44

We get, 22.5 + y = 44

∴ y = 44 – 22.5 = 21.5

Hence frequency for the class 20 – 40 is 22.5 and for the class 60 – 80 is 21.5.

Mode

Mode is that value of the variable which occurs or repeats itself the maximum number of times. The mode is the most “fashionable” size in the sense that it is the most common and typical, and is defined by Zizek as “the value occurring most frequently in series of items and around which the other items are distributed most densely.” In the words of Croxton and Cowden, the mode of a distribution is the value at the point where the items tend to be most heavily concentrated. According to A.M. Tuttle, mode is the value which has the greatest frequency density in its immediate neighbourhood. In the case of individual observations, the mode is that value which is repeated the maximum number of times in the series. The value of mode can also be denoted by the letter z.

Example : Calculate mode from the following data:

Sr. Number :       1    2    3    4    5    6    7    8    9    10
Marks obtained :   10   27   24   12   27   27   20   18   15   30

Solution :

Marks    No. of students
10       1
12       1
15       1
18       1
20       1
24       1
27       3      (Mode is 27 marks)
30       1
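For individual observations the tally above can be reproduced mechanically. The following is a minimal Python sketch using the standard library's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

marks = [10, 27, 24, 12, 27, 27, 20, 18, 15, 30]

# Count how many times each mark occurs (the "No. of students" column).
counts = Counter(marks)

# The mode is every value that attains the highest count; keeping a list
# makes it visible when a series has more than one mode.
top = max(counts.values())
modes = sorted(value for value, c in counts.items() if c == top)

print(modes, top)
```

Returning a list rather than a single value is a design choice: it shows at once whether the mode is well defined or not.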


Calculation of Mode in Discrete Series. In a discrete series, the mode is quite often determined by inspection. We can understand this with the help of an example:

X    1    2    3     4    5     6    7
f    4    5    13    6    12    8    6

By inspection, the modal size is 3, as it has the maximum frequency. But this test of greatest frequency is not foolproof, because it is not only the frequency of a single class but also the frequencies of the neighbouring classes that decide the mode. In such cases, we use the method of the grouping and analysis tables.

Size of shoe    1    2    3     4    5     6    7
Frequency       4    5    13    6    12    8    6

Solution : By inspection, the mode is 3, but the mode may actually be 5. This is so because the neighbouring frequencies of size 5 are greater than the neighbouring frequencies of size 3. This effect of neighbouring frequencies is seen with the help of the grouping and analysis table technique.

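Grouping table

Size of Shoe    Col. 1 (f)    Col. 2    Col. 3    Col. 4    Col. 5    Col. 6
1               4             9                   22
2               5                       18                  24
3               13            19                                      31
4               6                       18        26
5               12            20                            26
6               8                       14
7               6

Each total is written against the first size of the group it covers. Column 2 adds the frequencies in twos starting from the first item, column 3 in twos starting from the second item, column 4 in threes starting from the first item, column 5 in threes starting from the second item, and column 6 in threes starting from the third item.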

When there exist two groups of frequencies with equal magnitude, then we should consider either

both or omit both while analysing the sizes of items.

Analysis Table

Column    Size of Items with Maximum Frequency
1         3
2         5, 6
3         2, 3, 4, 5
4         4, 5, 6
5         5, 6, 7
6         3, 4, 5

Item 5 occurs maximum number of times, therefore, mode is 5. We can note that by inspection we

had determined 3 to be the mode.
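The grouping and analysis procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `grouping_votes` is hypothetical, the six columns are built as singles, pairs (two alignments) and triples (three alignments), and groups tied for the column maximum are all counted (the "consider both" option mentioned above).

```python
from collections import Counter


def grouping_votes(sizes, freqs):
    """Return how often each size falls in a maximum group across six columns."""
    n = len(freqs)
    votes = Counter()
    # (group length, starting index) for columns 1 to 6
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for length, start in columns:
        groups = [
            (sum(freqs[j:j + length]), range(j, j + length))
            for j in range(start, n - length + 1, length)
        ]
        best = max(total for total, _ in groups)
        for total, members in groups:
            if total == best:  # tied groups are all counted
                votes.update(sizes[k] for k in members)
    return votes


shoe_sizes = [1, 2, 3, 4, 5, 6, 7]
shoe_freqs = [4, 5, 13, 6, 12, 8, 6]
votes = grouping_votes(shoe_sizes, shoe_freqs)
mode_size, times = votes.most_common(1)[0]
print(mode_size, times)
```

For the shoe-size data, size 5 appears in a maximum group in five of the six columns, more often than any other size, which agrees with the analysis table.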

Determination of mode in continuous series: In the continuous series, the determination of

mode requires one additional step. Once the modal class is determined by inspection or with the help of

grouping technique, then the following formula of interpolation is applied:

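Mode = $l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$ or Mode = $l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times (l_2 - l_1)$

where
$l_1$ = lower limit of the class where the mode lies,
$l_2$ = upper limit of the class where the mode lies,
$f_0$ = frequency of the class preceding the modal class,
$f_1$ = frequency of the class where the mode lies,
$f_2$ = frequency of the class succeeding the modal class,
$i$ = class interval of the modal class, i.e. $l_2 - l_1$.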

Example: Calculate mode from the following frequency distribution:

Variable Frequency

0 – 10 5

10 – 20 10

20 – 30 15

30 – 40 14

40 – 50 10

50 – 60 5

60 – 70 3

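Solution: Grouping Table

X          Col. 1 (f)    Col. 2    Col. 3    Col. 4    Col. 5    Col. 6
0 – 10     5             15                  30
10 – 20    10                      25                  39
20 – 30    15            29                                      39
30 – 40    14                      24        29
40 – 50    10            15                            18
50 – 60    5                       8
60 – 70    3

Analysis Table

Column    Size of Item with Maximum Frequency
1         20 – 30
2         20 – 30, 30 – 40
3         10 – 20, 20 – 30
4         0 – 10, 10 – 20, 20 – 30
5         10 – 20, 20 – 30, 30 – 40
6         20 – 30, 30 – 40, 40 – 50

The modal group is 20 – 30 because it has occurred 6 times. Applying the formula of interpolation with $l_1 = 20$, $f_1 = 15$, $f_0 = 10$, $f_2 = 14$ and $i = 10$,

Mode = $l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 20 + \frac{15 - 10}{30 - 10 - 14} \times 10$

= $20 + \frac{5}{6} \times 10 = 20 + 8.33 = 28.33$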
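The interpolation formula for the mode is easy to express as a small function. The sketch below is illustrative only; the name `interpolated_mode` is an assumption, while the parameters follow the symbols l1, f0, f1, f2 and i defined in the text.

```python
def interpolated_mode(l1, f0, f1, f2, i):
    """Mode = l1 + (f1 - f0) / (2*f1 - f0 - f2) * i for the modal class."""
    return l1 + (f1 - f0) * i / (2 * f1 - f0 - f2)


# Modal class 20-30 with f0 = 10, f1 = 15, f2 = 14 and class width 10
mode_value = interpolated_mode(l1=20, f0=10, f1=15, f2=14, i=10)
print(round(mode_value, 2))
```

The function presupposes that the modal class has already been identified, by inspection or by grouping; it does not search for it.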

Calculation of mode where it is ill defined. The above formula is not applied where there are

many modal values in a series or distribution. For instance there may be two or more than two items having

the maximum frequency. In these cases, the series will be known as bimodal or multimodal series. The mode

is said to be ill-defined and in such cases the following formula is applied.

Mode = 3 Median – 2 Mean.

Example: Calculate mode of the following frequency data:

Variate Value Frequency

10 – 20 5

20 – 30 9

30 – 40 13

40 – 50 21

50 – 60 20

60 – 70 15

70 – 80 8

80 – 90 3

Solution : First of all, ascertain the modal group with the help of process of grouping.

Grouping Table

X          Col. 1 (f)    Col. 2    Col. 3    Col. 4    Col. 5    Col. 6
10 – 20    5             14                  27
20 – 30    9                       22                  43
30 – 40    13            34                                      54
40 – 50    21                      41        56
50 – 60    20            35                            43
60 – 70    15                      23                            26
70 – 80    8             11
80 – 90    3

Analysis Table

Column    Size of Item with Maximum Frequency
1         40 – 50
2         50 – 60, 60 – 70
3         40 – 50, 50 – 60
4         40 – 50, 50 – 60, 60 – 70
5         20 – 30, 30 – 40, 40 – 50, 50 – 60, 60 – 70, 70 – 80
6         30 – 40, 40 – 50, 50 – 60

There are two groups which occur an equal number of times (five times each). They are 40 – 50 and 50 – 60. Therefore, we will apply the following formula:

Mode = 3 median – 2 mean, and for this purpose the values of the mean and the median are required to be computed.

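Calculation of Mean and Median

Variate    Frequency    Mid value
X          f            m            d'x     fd'x     Cf
10 – 20    5            15           – 3     – 15     5
20 – 30    9            25           – 2     – 18     14
30 – 40    13           35           – 1     – 13     27
40 – 50    21           45           0       0        48
50 – 60    20           55           + 1     + 20     68
60 – 70    15           65           + 2     + 30     83
70 – 80    8            75           + 3     + 24     91
80 – 90    3            85           + 4     + 12     94
           N = 94                            Σfd'x = + 40

Here d'x = (m – 45)/10, i.e. step deviations from the assumed mean A = 45 with i = 10.

Median is the value of the $\frac{N}{2}$th item = $\frac{94}{2}$ = 47th item, which lies in the (40 – 50) group.

Mean = $A + \frac{\Sigma fd'x}{N} \times i = 45 + \frac{40}{94} \times 10 \approx 49.2$

Median = $l_1 + \frac{\frac{N}{2} - Cf_0}{f} \times i = 40 + \frac{47 - 27}{21} \times 10 = 40 + 9.5 = 49.5$

Mode = 3 median – 2 mean

= 3 (49.5) – 2 (49.2) = 148.5 – 98.4 = 50.1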
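As a cross-check, the mean, the median and the empirical mode can be computed directly from the class limits and frequencies. This is a minimal sketch; the helper names `grouped_mean` and `grouped_median` are assumptions introduced here, and the mean is computed from mid-values directly rather than by step deviations (both give the same result).

```python
def grouped_mean(lowers, width, freqs):
    """Arithmetic mean of a continuous series using class mid-values."""
    mids = [low + width / 2 for low in lowers]
    return sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)


def grouped_median(lowers, width, freqs):
    """Median = l1 + ((N/2 - Cf0) / f) * i for the median class."""
    half = sum(freqs) / 2
    cf_before = 0
    for low, f in zip(lowers, freqs):
        if cf_before + f >= half:
            return low + (half - cf_before) / f * width
        cf_before += f


lowers = [10, 20, 30, 40, 50, 60, 70, 80]
freqs = [5, 9, 13, 21, 20, 15, 8, 3]

mean = grouped_mean(lowers, 10, freqs)
median = grouped_median(lowers, 10, freqs)
mode = 3 * median - 2 * mean  # empirical relation used in the text

print(round(mean, 2), round(median, 2), round(mode, 2))
```

Without intermediate rounding the mode comes to about 50.06; the text's 50.1 results from rounding the mean and the median to one decimal place first.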

Determination of mode by curve fitting : Mode can also be computed by curve fitting. The

following steps are to be taken;

(i) Draw a histogram of the data.

(ii) Draw the lines diagonally inside the modal class rectangle, starting from each upper corner of

the rectangle to the upper corner of the adjacent rectangle.

(iii) Draw a perpendicular line from the intersection of the two diagonal lines to the X-axis.

The abscissa of the point at which the perpendicular line meets is the value of the mode.

Example : Construct a histogram for the following distribution and, determine the mode graphically:

X : 0 – 10 10 – 20 20 – 30 30 – 40 40 – 50

f : 5 8 15 12 7

Verify the result with the help of interpolation.

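Solution : The histogram is drawn with the class intervals on the X-axis and the frequencies on the Y-axis. The modal class is 20 – 30, which has the highest frequency (15). Verifying the result by interpolation:

Mode = $l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 20 + \frac{15 - 8}{30 - 8 - 12} \times 10$

= $20 + \frac{7}{10} \times 10 = 27$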

Example: Calculate mode from the following data:

Marks       Cumulative frequency
Below 10    4
Below 20    6
Below 30    24
Below 40    46
Below 50    67
Below 60    86
Below 70    96
Below 80    99
Below 90    100

Solution :


Since we are given the cumulative frequency distribution of marks, first we shall convert it

into the normal frequency distribution:

Marks Frequencies

0 – 10 4

10 – 20 6 – 4 = 2

20 – 30 24 – 6 = 18

30 – 40 46 – 24 = 22

40 – 50 67 – 46 = 21

50 – 60 86 – 67 = 19

60 – 70 96 – 86 = 10

70 – 80 99 – 96 = 3

80 – 90 100 – 99 = 1

It is evident from the table that the distribution is irregular, and the chances are that the distribution has more than one mode. You can verify this by applying the grouping and analysis tables.

The formula to calculate the value of the mode in the case of bimodal distributions is :

Mode = 3 median – 2 mean.

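Computation of Mean and Median:

Marks      Mid-value    Frequency
           (X)          (f)          Cf      (dx)    fdx
0 – 10     5            4            4       – 4     – 16
10 – 20    15           2            6       – 3     – 6
20 – 30    25           18           24      – 2     – 36
30 – 40    35           22           46      – 1     – 22
40 – 50    45           21           67      0       0
50 – 60    55           19           86      1       19
60 – 70    65           10           96      2       20
70 – 80    75           3            99      3       9
80 – 90    85           1            100     4       4
                        Σf = 100                     Σfdx = – 28

Here dx = (X – 45)/10, i.e. step deviations from the assumed mean A = 45 with i = 10.

Mean = $A + \frac{\Sigma fdx}{N} \times i = 45 + \frac{-28}{100} \times 10 = 45 - 2.8 = 42.2$

Median = size of $\frac{N}{2}$th item = $\frac{100}{2}$ = 50th item

Because the 50th item falls within the cumulative frequency 67 in the C.f. column, the median class is 40 – 50.

Median = $l_1 + \frac{\frac{N}{2} - Cf_0}{f} \times i$

Median = $40 + \frac{50 - 46}{21} \times 10 = 40 + 1.9 = 41.9$

Apply, Mode = 3 median – 2 mean

Mode = 3 × 41.9 – 2 × 42.2 = 125.7 – 84.4 = 41.3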

Example : Median and mode of the wage distribution are known to be Rs. 33.5 and 34 respectively. Find the

missing values.

Wages (Rs.) No. of Workers

0 – 10 4

10 – 20 16

20 – 30 ?

30 – 40 ?

40 – 50 ?

50 – 60 6

60 – 70 4

Total = 230

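Solution : Let the missing frequency of the class 20 – 30 be x, that of 30 – 40 be y, and that of 40 – 50 be 230 – (4 + 16 + x + y + 6 + 4) = 200 – x – y.

We now proceed further to compute the missing frequencies:

Wages (Rs.)    No. of workers    Cumulative frequencies
X              f                 C.f.
0 – 10         4                 4
10 – 20        16                20
20 – 30        x                 20 + x
30 – 40        y                 20 + x + y
40 – 50        200 – x – y       220
50 – 60        6                 226
60 – 70        4                 230
               N = 230

Since the median (33.5) and the mode (34) both lie in the class 30 – 40, this class is both the median class and the modal class.

Apply, Median = $l_1 + \frac{\frac{N}{2} - Cf_0}{f} \times i$

$33.5 = 30 + \frac{115 - (20 + x)}{y} \times 10$

y(33.5 – 30) = (115 – 20 – x)10

3.5y = 1150 – 200 – 10x

10x + 3.5y = 950 ...(i)

Apply, Mode = $l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$

$34 = 30 + \frac{y - x}{2y - x - (200 - x - y)} \times 10 = 30 + \frac{y - x}{3y - 200} \times 10$

4(3y – 200) = 10(y – x)

12y – 800 = 10y – 10x

10x + 2y = 800 ...(ii)

Subtract equation (ii) from equation (i),

1.5y = 150, y = $\frac{150}{1.5}$ = 100

Substitute the value of y = 100 in equation (i), we get

10x + 3.5(100) = 950

10x = 950 – 350

x = 600/10 = 60

∴ Third missing frequency = 200 – x – y = 200 – 60 – 100 = 40.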
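Equations (i) and (ii) form a pair of simultaneous linear equations, so the elimination can be verified numerically. The sketch below uses Cramer's rule; the function name `solve_two_equations` is a hypothetical helper introduced for illustration.

```python
def solve_two_equations(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the equations have no unique solution")
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y


# (i)  10x + 3.5y = 950   (from the median)
# (ii) 10x + 2y   = 800   (from the mode)
x, y = solve_two_equations(10, 3.5, 950, 10, 2, 800)
third = 200 - x - y  # frequency of the class 40-50

print(x, y, third)
```

The result matches the worked solution: 60, 100 and 40 workers for the three missing classes.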


LESSON 4

MEASURES OF DISPERSION

Why dispersion?

Measures of central tendency, Mean, Median, Mode, etc., indicate the central position of a series.

They indicate the general magnitude of the data but fail to reveal all the peculiarities and characteristics of the

series. In other words, they fail to reveal the degree of the spread out or the extent of the variability in

individual items of the distribution. This can be explained by certain other measures, known as ‘Measures of

Dispersion’ or Variation.

We can understand variation with the help of the following example :

       Series I    Series II    Series III
       10          2            10
       10          8            12
       10          20           8
ΣX =   30          30           30

In all three series, the value of the arithmetic mean is 10. On the basis of this average, we can say that the series are alike. If we carefully examine the composition of the three series, we find the following differences:

(i) In case of 1st series, three items are equal; but in 2nd and 3rd series, the items are unequal and

do not follow any specific order.

(ii) The magnitude of deviation, item-wise, is different for the 1st, 2nd and 3rd series. But all these

deviations cannot be ascertained if the value of simple mean is taken into consideration.

(iii) Although the arithmetic mean of each of the three series is 10, the value of the median may differ from one series to another. This can be seen by arranging each series in ascending order:

I      II     III
10     2      8
10     8      10     (Median)
10     20     12

The value of the median is 10 in the 1st series, 8 in the 2nd series and 10 in the 3rd series. Therefore, the values of the mean and the median are not identical in every series.

(iv) Even though the average remains the same, the nature and extent of the distribution of the size of the items may vary. In other words, the structure of the frequency distributions may differ even though their means are identical.

What is Dispersion?

The simplest meaning that can be attached to the word 'dispersion' is a lack of uniformity in the sizes or quantities of the items of a group or series. According to Reiglemen, "Dispersion is the extent to which the magnitudes or quantities of the items differ, the degree of diversity." The word dispersion may also be used to indicate the spread of the data.

In all these definitions, we can find the basic property of dispersion as a value that indicates the

extent to which all other values are dispersed about the central value in a particular distribution.

Properties of a good measure of Dispersion

There are certain pre-requisites for a good measure of dispersion:

1. It should be simple to understand.

2. It should be easy to compute.

3. It should be rigidly defined.

4. It should be based on each individual item of the distribution.

5. It should be capable of further algebraic treatment.

6. It should have sampling stability.

7. It should not be unduly affected by the extreme items.

Types of Dispersion

The measures of dispersion can be either 'absolute' or 'relative'. Absolute measures of dispersion

are expressed in the same units in which the original data are expressed. For example, if the series is

expressed as Marks of the students in a particular subject; the absolute dispersion will provide the value in

Marks. The only difficulty is that if two or more series are expressed in different units, the series cannot be

compared on the basis of dispersion.

‘Relative’ or ‘Coefficient’ of dispersion is the ratio or the percentage of a measure of absolute

dispersion to an appropriate average. The basic advantage of this measure is that two or more series can be

compared with each other despite the fact they are expressed in different units.

Theoretically, ‘Absolute measure’ of dispersion is better. But from a practical point of view, relative

or coefficient of dispersion is considered better as it is used to make comparison between series.

Methods of Dispersion

Methods of studying dispersion are divided into two types :

(i) Mathematical Methods: We can study the ‘degree’ and ‘extent’ of variation by these methods.

In this category, commonly used measures of dispersion are :

(a) Range

(b) Quartile Deviation

(c) Average Deviation

(d) Standard deviation and coefficient of variation.

(ii) Graphic Methods: Where we want to study only the extent of variation, whether it is higher or lower, a Lorenz curve is used.

Mathematical Methods


(a) Range

It is the simplest method of studying dispersion. Range is the difference between the smallest value

and the largest value of a series. While computing range, we do not take into account frequencies of different

groups.

Formula: Absolute Range = L – S

Coefficient of Range = $\frac{L - S}{L + S}$

where, L represents the largest value in a distribution
S represents the smallest value in a distribution

We can understand the computation of range with the help of examples of different series,

(i) Raw Data: Marks out of 50 in a subject of 12 students, in a class are given as follows:

12, 18, 20, 12, 16, 14, 30, 32, 28, 12, 12 and 35.

In the example, the maximum or the highest marks obtained by a candidate is ‘35’ and the lowest

marks obtained by a candidate is ‘12’. Therefore, we can calculate range;

L = 35 and S = 12

Absolute Range = L – S = 35 – 12 = 23 marks

Coefficient of Range = $\frac{L - S}{L + S} = \frac{35 - 12}{35 + 12} = \frac{23}{47} = 0.49$


(ii) Discrete Series

Marks of the students in Statistics (out of 50)    No. of students
(X)                                                (f)
10 (Smallest)                                      4
12                                                 10
18                                                 16
20 (Largest)                                       15
                                                   Total = 45

Absolute Range = 20 – 10 = 10 marks

Coefficient of Range = $\frac{20 - 10}{20 + 10} = \frac{10}{30} = 0.33$


(iii) Continuous Series

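X          Frequencies
10 – 15    4
15 – 20    10
20 – 25    26
25 – 30    8

S = 10 (lower limit of the lowest class) and L = 30 (upper limit of the highest class)

Absolute Range = L – S = 30 – 10 = 20 marks

Coefficient of Range = $\frac{30 - 10}{30 + 10} = \frac{20}{40} = 0.5$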


Range is the simplest method of studying dispersion. It takes less time to compute the 'absolute' and 'relative' range. Range does not take into account all the values of a series, i.e. it considers only the extreme items, and the middle items are not given any importance. Therefore, range cannot tell us anything about the character of the distribution. Range cannot be computed in the case of an 'open end' distribution, i.e., a distribution where the lower limit of the first group and the upper limit of the highest group are not given.

The concept of range is useful in the field of quality control and to study the variations in the prices

of the shares etc.
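The range calculations for raw data can be sketched in a few lines of Python. The function name `value_range` is an illustrative assumption; the data are the marks of the 12 students used in the raw-data example.

```python
def value_range(values):
    """Return (absolute range, coefficient of range) for raw data."""
    largest, smallest = max(values), min(values)
    absolute = largest - smallest                # L - S
    coefficient = absolute / (largest + smallest)  # (L - S) / (L + S)
    return absolute, coefficient


marks = [12, 18, 20, 12, 16, 14, 30, 32, 28, 12, 12, 35]
absolute, coefficient = value_range(marks)
print(absolute, round(coefficient, 2))
```

Only the two extreme values enter the calculation, which is exactly why the range says nothing about how the middle items are distributed.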

(b) Quartile Deviations (Q.D.)

The concept of 'Quartile Deviation' takes into account only the values of the 'Upper quartile' (Q3) and the 'Lower quartile' (Q1). It is based on the 'inter-quartile range' and is a better method when we are interested in knowing the range within which a certain proportion of the items fall.

'Quartile Deviation' can be obtained as :

(i) Inter-quartile range = $Q_3 - Q_1$

(ii) Semi-quartile range = $\frac{Q_3 - Q_1}{2}$

(iii) Coefficient of Quartile Deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$

Calculation of Inter-quartile Range, semi-quartile Range and Coefficient of Quartile Deviation in

case of Raw Data

Suppose the values of X are : 20, 12, 18, 25, 32, 10

In the case of quartile deviation, it is necessary to calculate the values of Q1 and Q3 by arranging the given data in ascending or descending order.

Therefore, the arranged data are (in ascending order):

X = 10, 12, 18, 20, 25, 32

No. of items = 6

$Q_1$ = the value of $\frac{N + 1}{4}$th item = $\frac{6 + 1}{4}$ = 1.75th item

= the value of 1st item + 0.75 (value of 2nd item – value of 1st item)

= 10 + 0.75 (12 – 10) = 10 + 0.75(2) = 10 + 1.50 = 11.50

$Q_3$ = the value of $\frac{3(N + 1)}{4}$th item

= the value of 3(7/4)th item = the value of 5.25th item

= 25 + 0.25 (32 – 25) = 25 + 0.25 (7) = 26.75

Therefore,

(i) Inter-quartile range = $Q_3 - Q_1$ = 26.75 – 11.50 = 15.25

(ii) Semi-quartile range = $\frac{Q_3 - Q_1}{2} = \frac{15.25}{2} = 7.625$

(iii) Coefficient of Quartile Deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{15.25}{38.25} = 0.40$
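The positional rule used above, the (N + 1)/4-th and 3(N + 1)/4-th items with linear interpolation between neighbouring items, can be sketched in Python as follows. The helper name `item_value` is hypothetical, and the sketch assumes the fractional position stays inside the series; other quartile conventions exist and may give slightly different values.

```python
def item_value(sorted_values, position):
    """Value of the position-th item (1-based, possibly fractional)."""
    whole = int(position)
    fraction = position - whole
    lower = sorted_values[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)


values = sorted([20, 12, 18, 25, 32, 10])
n = len(values)

q1 = item_value(values, (n + 1) / 4)      # 1.75th item
q3 = item_value(values, 3 * (n + 1) / 4)  # 5.25th item

inter_quartile_range = q3 - q1
semi_quartile_range = inter_quartile_range / 2
coefficient_qd = (q3 - q1) / (q3 + q1)

print(q1, q3, inter_quartile_range, semi_quartile_range)
```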


Calculation of Inter-quartile Range, semi-quartile Range and Coefficient of Quartile Deviation in

discrete series

Suppose a series consists of the salaries (Rs.) and the number of workers in a factory:

Salaries (Rs.)    No. of workers
60                4
100               20
120               21
140               16
160               9

In the problem, we will first compute the values of Q3 and Q1.

Salaries (Rs.)    No. of workers    Cumulative frequencies
(x)               (f)               (c.f.)
60                4                 4
100               20                24     (Q1 lies in this cumulative frequency)
120               21                45
140               16                61     (Q3 lies in this cumulative frequency)
160               9                 70
                  N = Σf = 70

Calculation of Q1 :

$Q_1$ = size of $\frac{N + 1}{4}$th item = size of $\frac{70 + 1}{4}$th item = 17.75th item

17.75 lies in the cumulative frequency 24, which corresponds to the value Rs. 100.

∴ $Q_1$ = Rs. 100

Calculation of Q3 :

$Q_3$ = size of $\frac{3(N + 1)}{4}$th item = size of $\frac{3 \times 71}{4}$th item = 53.25th item

53.25 lies in the cumulative frequency 61, which corresponds to Rs. 140.

∴ $Q_3$ = Rs. 140

(i) Inter-quartile range = $Q_3 - Q_1$ = Rs. 140 – Rs. 100 = Rs. 40

(ii) Semi-quartile range = $\frac{Q_3 - Q_1}{2} = \frac{40}{2}$ = Rs. 20

(iii) Coefficient of Quartile Deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{40}{240} = 0.17$


Calculation of Inter-quartile range, semi-quartile range and Coefficient of Quartile Deviation in

case of continuous series

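We are given the following data :

Salaries (Rs.)    No. of Workers
10 – 20           4
20 – 30           6
30 – 40           10
40 – 50           5
                  Total = 25

In this example, the values of Q3 and Q1 are obtained as follows:

Salaries (Rs.)    No. of workers    Cumulative frequencies
(x)               (f)               (c.f.)
10 – 20           4                 4
20 – 30           6                 10
30 – 40           10                20
40 – 50           5                 25
                  N = 25

$Q_1 = l_1 + \frac{\frac{N}{4} - Cf_0}{f} \times i$

Therefore, $\frac{N}{4} = \frac{25}{4} = 6.25$. It lies in the cumulative frequency 10, which corresponds to the class 20 – 30.

Therefore, the Q1 group is 20 – 30,

where $l_1 = 20$, $f = 6$, $i = 10$, and $Cf_0 = 4$

$Q_1 = 20 + \frac{6.25 - 4}{6} \times 10 = 20 + 3.75$ = Rs. 23.75

$Q_3 = l_1 + \frac{\frac{3N}{4} - Cf_0}{f} \times i$

Therefore, $\frac{3N}{4} = \frac{3 \times 25}{4} = 18.75$, which lies in the cumulative frequency 20, which corresponds to the class 30 – 40. Therefore, the Q3 group is 30 – 40,

where $l_1 = 30$, $i = 10$, $Cf_0 = 10$, and $f = 10$

$Q_3 = 30 + \frac{18.75 - 10}{10} \times 10$ = Rs. 38.75

Therefore :

(i) Inter-quartile range = $Q_3 - Q_1$ = Rs. 38.75 – Rs. 23.75 = Rs. 15.00

(ii) Semi-quartile range = $\frac{Q_3 - Q_1}{2} = \frac{15.00}{2}$ = Rs. 7.50

(iii) Coefficient of Quartile Deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{15.00}{62.50} = 0.24$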
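For a continuous series, the quartile formula can be written once and reused for any fraction of N. The following is a minimal sketch assuming equal class widths; the function name `grouped_quantile` and its parameters are assumptions made for illustration.

```python
def grouped_quantile(lowers, width, freqs, fraction):
    """l1 + ((fraction*N - Cf0) / f) * i for the class containing the item."""
    target = fraction * sum(freqs)
    cf_before = 0
    for low, f in zip(lowers, freqs):
        if cf_before + f >= target:
            return low + (target - cf_before) / f * width
        cf_before += f


lowers = [10, 20, 30, 40]  # lower limits of the salary classes
freqs = [4, 6, 10, 5]

q1 = grouped_quantile(lowers, 10, freqs, 1 / 4)
q3 = grouped_quantile(lowers, 10, freqs, 3 / 4)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)

print(q1, q3, quartile_deviation, coefficient_qd)
```

Passing 2/10 or 5/100 as the fraction would give a decile or a percentile in the same way.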


Advantages of Quartile Deviation

Some of the important advantages are :

(i) It is easy to calculate. We are required simply to find the values of Q1 and Q3 and then apply the formula of absolute and coefficient of quartile deviation.

(ii) It has better results than range method. While calculating range, we consider only the extreme

values that make dispersion erratic, in the case of quartile deviation, we take into account middle

50% items.

(iii) The quartile deviation is not affected by the extreme items.

Disadvantages

(i) It is completely dependent on the central items. If these values are irregular and abnormal, the result is bound to be affected.

(ii) All the items of the frequency distribution are not given equal importance in finding the values of Q1 and Q3.

(iii) Because it does not take into account all the items of the series, it is considered to be inaccurate.

Similarly, sometimes we calculate the percentile range, say between the 90th and 10th percentiles, as it gives a slightly better measure of dispersion in certain cases.

(i) Absolute percentile range = $P_{90} - P_{10}$.

(ii) Coefficient of percentile range = $\frac{P_{90} - P_{10}}{P_{90} + P_{10}}$

This method of calculating dispersion can be applied generally in the case of open end series, where the importance of extreme values is not considered.

(c) Average Deviation

Average deviation is defined as a value which is obtained by taking the average of the deviations of various items from a measure of central tendency (Mean or Median or Mode), ignoring negative signs. Generally, the measure of central tendency from which the deviations are taken is specified in the problem. If nothing is mentioned regarding the measure of central tendency, then deviations are taken from the median, because the sum of the deviations (after ignoring negative signs) is then minimum.

Computation in case of raw data

(i) Absolute Average Deviation about Mean or Median or Mode = $\frac{\Sigma |d|}{N}$

where: N = Number of observations,
|d| = deviations taken from Mean or Median or Mode ignoring signs.

(ii) Coefficient of A.D. = $\frac{\text{Average Deviation about Mean or Median or Mode}}{\text{Mean or Median or Mode}}$

Steps to Compute Average Deviation :

(i) Calculate the value of Mean or Median or Mode.

(ii) Take deviations from the given measure of central tendency; they are shown as d.

(iii) Ignore the negative signs of the deviations, which are then shown as |d|, and add them to find Σ|d|.

(iv) Apply the formula to get Average Deviation about Mean or Median or Mode.

Example : Suppose the values are 5, 5, 10, 15, 20. We want to calculate Average Deviation and Coefficient

of Average Deviation about Mean or Median or Mode.

Solution : Average Deviation about Mean (Absolute and Coefficient)

Mean = $\frac{\Sigma X}{N} = \frac{55}{5} = 11$, where N = 5, ΣX = 55

X          Deviation from mean    Deviations after ignoring signs
           d                      |d|
5          – 6                    6
5          – 6                    6
10         – 1                    1
15         + 4                    4
20         + 9                    9
ΣX = 55                           Σ|d| = 26

Average Deviation about Mean = $\frac{\Sigma |d|}{N} = \frac{26}{5} = 5.2$

Coefficient of Average Deviation about Mean = $\frac{5.2}{11} = 0.47$

Average Deviation (Absolute and Coefficient) about Median

X              Deviation from median    Deviations after ignoring negative signs
               d                        |d|
5              – 5                      5
5              – 5                      5
10 (Median)    0                        0
15             + 5                      5
20             + 10                     10
N = 5                                   Σ|d| = 25

Average Deviation about Median = $\frac{\Sigma |d|}{N} = \frac{25}{5} = 5$

Coefficient of Average Deviation about Median = $\frac{5}{10} = 0.5$

Average Deviation (Absolute and Coefficient) about Mode

X           Deviation from mode (d)    |d|
5 (Mode)    0                          0
5           0                          0
10          + 5                        5
15          + 10                       10
20          + 15                       15
N = 5                                  Σ|d| = 30

Average Deviation about Mode = $\frac{\Sigma |d|}{N} = \frac{30}{5} = 6$

Coefficient of Average Deviation about Mode = $\frac{6}{5} = 1.2$
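The three sets of calculations above differ only in the central value from which deviations are taken, so one helper covers them all. This sketch uses Python's standard `statistics` module for the mean, median and mode; the function name `average_deviation` is an illustrative assumption.

```python
import statistics


def average_deviation(values, centre):
    """Mean of the absolute deviations of the values from the given centre."""
    return sum(abs(v - centre) for v in values) / len(values)


values = [5, 5, 10, 15, 20]
centres = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
}

results = {}
for name, centre in centres.items():
    ad = average_deviation(values, centre)
    results[name] = (ad, ad / centre)  # (absolute A.D., coefficient of A.D.)

for name, (ad, coefficient) in results.items():
    print(name, ad, round(coefficient, 2))
```

The output also illustrates the remark made earlier: the average deviation is smallest (5) when it is taken about the median.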

Average deviation in case of discrete and continuous series

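Average Deviation about Mean or Median or Mode = $\frac{\Sigma f|d|}{N}$

where N = No. of items (Σf),
f = frequencies of the different values or groups,
|d| = deviations from Mean or Median or Mode after ignoring signs.

Coefficient of A.D. about Mean or Median or Mode = $\frac{\text{A.D. about Mean or Median or Mode}}{\text{Mean or Median or Mode}}$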

Example: Suppose we want to calculate the coefficient of Average Deviation about Mean from the following discrete series:

X     Frequency
10    5
15    10
20    15
25    10
30    5

Solution: First of all, we shall calculate the value of the arithmetic mean.

Calculation of Arithmetic Mean

X     f         fX
10    5         50
15    10        150
20    15        300
25    10        250
30    5         150
      N = 45    ΣfX = 900

Mean = $\frac{\Sigma fX}{N} = \frac{900}{45} = 20$

Calculation of Coefficient of Average Deviation about Mean

X     f         Deviation from mean    Deviations after ignoring    f|d|
                d                      negative signs |d|
10    5         – 10                   10                           50
15    10        – 5                    5                            50
20    15        0                      0                            0
25    10        + 5                    5                            50
30    5         + 10                   10                           50
      N = 45                                                        Σf|d| = 200

Average Deviation about Mean = $\frac{\Sigma f|d|}{N} = \frac{200}{45} = 4.44$

Coefficient of Average Deviation about Mean = $\frac{4.44}{20} = 0.222$


Suppose now that we want to calculate the coefficient of Average Deviation about Median from the following data:

Class Interval    Frequency
10 – 14           5
15 – 19           10
20 – 24           15
25 – 29           10
30 – 34           5
                  N = 45

First of all we shall calculate the value of the median, but for this it is necessary to find the 'real limits' of the given class intervals. This is done by subtracting 0.5 from all the lower limits and adding 0.5 to all the upper limits of the given classes. Hence, the real limits shall be : 9.5 – 14.5, 14.5 – 19.5, 19.5 – 24.5, 24.5 – 29.5 and 29.5 – 34.5

Calculation of Median

Class Interval    f         Cumulative Frequency
9.5 – 14.5        5         5
14.5 – 19.5       10        15
19.5 – 24.5       15        30
24.5 – 29.5       10        40
29.5 – 34.5       5         45
                  N = 45

Median = $l_1 + \frac{\frac{N}{2} - Cf_0}{f} \times i$

where $l_1$ = lower limit of the median group
i = magnitude of the median group
f = frequency of the median group
$Cf_0$ = cumulative frequency of the group preceding the median group
$\frac{N}{2}$ = size of the median item

∴ Median size = $\frac{N}{2}$th item, i.e. $\frac{45}{2}$ = 22.5

It lies in the cumulative frequency 30, which corresponds to the class 19.5 – 24.5.

The median group is 19.5 – 24.5

Median = $19.5 + \frac{22.5 - 15}{15} \times 5 = 19.5 + 2.5 = 22$

Calculation of Coefficient of Average Deviation about Median

Class          Frequency    Mid points    Deviation from    Deviations after ignoring    f|d|
Interval       f            x             median (22)       negative signs |d|
9.5 – 14.5     5            12            – 10              10                           50
14.5 – 19.5    10           17            – 5               5                            50
19.5 – 24.5    15           22            0                 0                            0
24.5 – 29.5    10           27            + 5               5                            50
29.5 – 34.5    5            32            + 10              10                           50
               N = 45                                                                    Σf|d| = 200

Average Deviation about Median = $\frac{\Sigma f|d|}{N} = \frac{200}{45} = 4.44$

Coefficient of Average Deviation about Median = $\frac{4.44}{22} = 0.20$

Advantages of Average Deviations

1. Average deviation takes into account all the items of a series and hence, it provides sufficiently

representative results.

2. It simplifies calculations since all signs of the deviations are taken as positive.


3. Average Deviation may be calculated either by taking deviations from Mean or Median or

Mode.

4. Average Deviation is not affected by extreme items.

5. It is easy to calculate and understand.

6. Average deviation is used to make healthy comparisons.

Disadvantages of Average Deviations

1. It is illogical and mathematically unsound to assume all negative signs as positive signs.

2. Because the method is not mathematically sound, the results obtained by this method are not

reliable.

3. This method is unsuitable for making comparisons either of the series or structure of the series.

This method is more effective in reports presented to the general public or to groups who are not familiar with statistical methods.

(d) Standard Deviation

The standard deviation, which is denoted by the Greek letter σ (read as sigma), is extremely useful in judging the representativeness of the mean. The concept of standard deviation, which was introduced by Karl Pearson, has practical significance because it is free from the defects which exist in the range, quartile deviation and average deviation.

Standard deviation is calculated as the square root of the average of the squared deviations taken from the actual mean. It is also called root mean square deviation. The square of the standard deviation, i.e., σ², is called 'variance'.

Calculation of standard deviation in case of raw data

There are four ways of calculating standard deviation for raw data:

(i) When actual values are considered;

(ii) When deviations are taken from actual mean;

(iii) When deviations are taken from assumed mean; and

(iv) When ‘step deviations’ are taken from assumed mean.

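(i) When the actual values are considered:

$\sigma = \sqrt{\frac{\Sigma X^2}{N} - (\bar{X})^2}$ or $\sigma^2 = \frac{\Sigma X^2}{N} - (\bar{X})^2$

where, N = Number of the items,
X = Given values of the series,
$\bar{X}$ = Arithmetic mean of the series

We can also write the formula as follows :

$\sigma = \sqrt{\frac{\Sigma X^2}{N} - \left(\frac{\Sigma X}{N}\right)^2}$ where, $\bar{X} = \frac{\Sigma X}{N}$

Steps to calculate σ

(i) Compute the simple mean of the given values.

(ii) Square the given values and aggregate them.

(iii) Apply the formula to find the value of standard deviation.

Example: Suppose the values are given as 2, 4, 6, 8, 10. We want to apply the formula

$\sigma = \sqrt{\frac{\Sigma X^2}{N} - (\bar{X})^2}$

Solution: We are required to calculate the values of N, $\bar{X}$ and ΣX². They are calculated as follows :

X          X²
2          4
4          16
6          36
8          64
10         100
N = 5      ΣX² = 220

Here ΣX = 30, so $\bar{X} = \frac{30}{5} = 6$.

$\sigma = \sqrt{\frac{220}{5} - (6)^2} = \sqrt{44 - 36} = \sqrt{8} = 2.828$

Variance $\sigma^2 = \frac{\Sigma X^2}{N} - (\bar{X})^2$

$= \frac{220}{5} - (6)^2 = 44 - 36 = 8$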

(ii) When the deviations are taken from actual mean

$\sigma = \sqrt{\frac{\Sigma x^2}{N}}$ where, N = no. of items and x = (X – $\bar{X}$)

Steps to Calculate σ

(i) Compute the deviations of the given values from the actual mean, i.e., (X – $\bar{X}$), and represent them by x.

(ii) Square these deviations and aggregate them.

(iii) Use the formula, $\sigma = \sqrt{\frac{\Sigma x^2}{N}}$

Example : We are given values as 2, 4, 6, 8, 10. We want to find out standard deviation.

X        (X – $\bar{X}$) = x    x²
2        2 – 6 = – 4            (– 4)² = 16
4        4 – 6 = – 2            (– 2)² = 4
6        6 – 6 = 0              0
8        8 – 6 = + 2            (2)² = 4
10       10 – 6 = + 4           (4)² = 16
N = 5                           Σx² = 40

∴ $\bar{X} = \frac{\Sigma X}{N} = \frac{30}{5} = 6$ and

$\sigma = \sqrt{\frac{\Sigma x^2}{N}} = \sqrt{\frac{40}{5}} = \sqrt{8} = 2.828$

(iii) When the deviations are taken from assumed mean

$\sigma = \sqrt{\frac{\Sigma dx^2}{N} - \left(\frac{\Sigma dx}{N}\right)^2}$

where, N = no. of items,
dx = deviations from assumed mean, i.e., (X – A),
A = assumed mean

Steps to Calculate :

(i) We consider any value as the assumed mean. The value may or may not be one of the values given in the series.

(ii) We take deviations from the assumed value, i.e., (X – A), to obtain dx for the series and aggregate them to find Σdx.

(iii) We square these deviations to obtain dx² and aggregate them to find Σdx².

(iv) Apply the formula given above to find standard deviation.

Example : Suppose the values are given as 2, 4, 6, 8 and 10. We can obtain the standard deviation as:

X                        dx = (X – A)         dx²
2                        – 2 = (2 – 4)        4
4 (assumed mean A)       0 = (4 – 4)          0
6                        + 2 = (6 – 4)        4
8                        + 4 = (8 – 4)        16
10                       + 6 = (10 – 4)       36
N = 5                    Σdx = 10             Σdx² = 60

$\sigma = \sqrt{\frac{60}{5} - \left(\frac{10}{5}\right)^2} = \sqrt{12 - 4} = \sqrt{8} = 2.828$

(iv) When step deviations are taken from assumed mean

$\sigma = \sqrt{\frac{\Sigma dx^2}{N} - \left(\frac{\Sigma dx}{N}\right)^2} \times i$

where, i = common factor, N = number of items, dx (step deviations) = $\frac{X - A}{i}$

Steps to Calculate :

(i) We consider any value as the assumed mean, from the given values or from outside.

(ii) We take deviations from the assumed mean, i.e. (X – A).

(iii) We divide the deviations obtained in step (ii) by a common factor to find the step deviations, represent them as dx and aggregate them to obtain Σdx.

(iv) We square the step deviations to obtain dx² and aggregate them to find Σdx².

Example : We continue with the same example to understand the computation of Standard Deviation.

X          d = (X – A)    dx = d/i, with i = 2    dx²
2          – 2            – 1                     1
4 (A)      0              0                       0
6          + 2            + 1                     1
8          + 4            + 2                     4
10         + 6            + 3                     9
N = 5                     Σdx = 5                 Σdx² = 15

$\sigma = \sqrt{\frac{\Sigma dx^2}{N} - \left(\frac{\Sigma dx}{N}\right)^2} \times i$ where N = 5, i = 2, Σdx = 5, and Σdx² = 15

$\sigma = \sqrt{\frac{15}{5} - \left(\frac{5}{5}\right)^2} \times 2 = \sqrt{3 - 1} \times 2 = 1.414 \times 2 = 2.828$

Note : We can notice an important point that the standard deviation value is identical by four methods.

Therefore any of the four formulae can be applied to find the value of standard deviation. But the

suitability of a formula depends on the magnitude of items in a question.
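That the four formulae agree can be confirmed with a short Python sketch. The variable names and the helper `sd_from_assumed_mean` are illustrative assumptions; `statistics.pstdev` from the standard library (the population standard deviation, dividing by N) is used only as an independent check.

```python
import math
import statistics

values = [2, 4, 6, 8, 10]
n = len(values)
mean = sum(values) / n

# (i) actual values: sqrt(sum(X^2)/N - mean^2)
sd_actual = math.sqrt(sum(v * v for v in values) / n - mean ** 2)

# (ii) deviations from the actual mean: sqrt(sum(x^2)/N)
sd_deviation = math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def sd_from_assumed_mean(values, a, i=1):
    """Methods (iii) and (iv): deviations (X - A)/i from an assumed mean A."""
    n = len(values)
    dx = [(v - a) / i for v in values]
    return i * math.sqrt(sum(d * d for d in dx) / n - (sum(dx) / n) ** 2)


sd_assumed = sd_from_assumed_mean(values, a=4)    # (iii) i = 1
sd_step = sd_from_assumed_mean(values, a=4, i=2)  # (iv) step deviations

print(sd_actual, sd_deviation, sd_assumed, sd_step, statistics.pstdev(values))
```

All five figures come to about 2.828, whatever assumed mean or common factor is chosen.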

Coefficient of Standard Deviation = $\frac{\sigma}{\bar{X}}$

In the above given example, σ = 2.828 and $\bar{X}$ = 6

Therefore, coefficient of standard deviation = $\frac{2.828}{6} = 0.471$

Coefficient of Variation or C.V. = $\frac{\sigma}{\bar{X}} \times 100$

Generally, coefficient of variation is used to compare two or more series. If coefficient of variation

(C.V.) is more for one series as compared to the other, there will be more variations in that series, lesser

stability or consistency in its composition. If coefficient of variation is lesser as compared to other series, it

will be more stable or consistent. Moreover that series is always better where coefficient of variation or

coefficient of standard deviation is lesser.

Example : Suppose we want to compare two firms where the salaries of the employees are given as

follows:

                            Firm A    Firm B
No. of workers              100       100
Mean salary (Rs.)           100       80
Standard deviation (Rs.)    40        45

Solution : We can compare these firms either with the help of the coefficient of standard deviation or the coefficient of variation. If we use the coefficient of variation, then we shall apply the formula :

C.V. = $\frac{\sigma}{\bar{X}} \times 100$

Firm A: $\bar{X}$ = 100, σ = 40, so C.V. = $\frac{40}{100} \times 100 = 40\%$

Firm B: $\bar{X}$ = 80, σ = 45, so C.V. = $\frac{45}{80} \times 100 = 56.25\%$

Because the coefficient of variation is less for firm A than for firm B, firm A is less variable and more stable.
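The comparison of the two firms amounts to two divisions, sketched below in Python. The function name `coefficient_of_variation` and the dictionary layout are illustrative assumptions.

```python
def coefficient_of_variation(sd, mean):
    """C.V. = (standard deviation / mean) * 100, expressed in per cent."""
    return sd * 100 / mean


firms = {
    "A": {"mean": 100, "sd": 40},
    "B": {"mean": 80, "sd": 45},
}

cv = {name: coefficient_of_variation(d["sd"], d["mean"]) for name, d in firms.items()}
more_consistent = min(cv, key=cv.get)  # the firm with the lower C.V.

print(cv, more_consistent)
```

Because the C.V. is a pure percentage, the same comparison would work even if the two firms paid salaries in different units.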

Calculation of standard-deviation in discrete and continuous series

We use the same formula for calculating standard deviation for a discrete series and a continuous series. The only difference is that in a discrete series, values and frequencies are given, whereas in a continuous series, class intervals and frequencies are given. When the mid-points of these class intervals are obtained, a continuous series takes the shape of a discrete series. X denotes values in a discrete series and mid-points in a continuous series.

When the deviations are taken from actual mean

We use the following formula for calculating standard deviation for a discrete or a continuous series:

$\sigma = \sqrt{\frac{\Sigma fx^2}{N}}$

where N = Number of items (Σf)
f = Frequencies corresponding to different values or class intervals.
x = Deviations from actual mean (X – $\bar{X}$).
X = Values in a discrete series and mid-points in a continuous series.

Steps to calculate σ

(i) Compute the arithmetic mean by applying the required formula.

(ii) Take deviations from the arithmetic mean and represent these deviations by x.

(iii) Square the deviations to obtain values of x².

(iv) Multiply the frequencies of different class intervals with x² to find fx². Aggregate the fx² column to obtain Σfx².

(v) Apply the formula to obtain the value of standard deviation.

If we want to calculate variance, then we can compute $\sigma^2 = \frac{\Sigma fx^2}{N}$

Example : We can understand the procedure by taking an example :

Class Intervals    Frequency (f)    Mid-points (m)    fm
10 – 14            5                12                60
15 – 19            10               17                170
20 – 24            15               22                330
25 – 29            10               27                270
30 – 34            5                32                160
                   N = 45                             Σfm = 990

Therefore, $\bar{X} = \frac{\Sigma fm}{N} = \frac{990}{45} = 22$, where N = 45, Σfm = 990

Calculation of Standard Deviation

Class        Frequency    Mid points    Deviations from actual mean
Intervals    f            X             x = (X – 22)                   x²     fx²
10 – 14      5            12            – 10                           100    500
15 – 19      10           17            – 5                            25     250
20 – 24      15           22            0                              0      0
25 – 29      10           27            + 5                            25     250
30 – 34      5            32            + 10                           100    500
             N = 45                                                           Σfx² = 1500

$\sigma = \sqrt{\frac{\Sigma fx^2}{N}}$ where, N = 45, Σfx² = 1500

$\sigma = \sqrt{\frac{1500}{45}} = \sqrt{33.33} = 5.77$

When the deviations are taken from assumed mean

In some cases, the value of simple mean may be in fractions, them it becomes time consuming to

take deviations and square them. Alternatively, we can take deviations from the assumed mean.

σ =

where N = Number of the items,

dx = deviations from assumed mean (X – A),

f = frequencies of the different groups,

A = assumed mean and

X = Values or mid points.

Steps to calculate σ

(i) Take the assumed mean from the given values or mid-points.

(ii) Take deviations from the assumed mean and represent them by dx.

(iii) Square the deviations to get dx².

(iv) Multiply f by dx for the different groups to obtain fdx and add them up to get Σfdx.

(v) Multiply f by dx² for the different groups to obtain fdx² and add them up to get Σfdx².

(vi) Apply the formula to get the value of standard deviation.

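Class Frequency Mid-points Deviations from assumed mean

Intervals f X dx = (X – 17) dx² fdx fdx²

10 – 14 5 12 – 5 25 – 25 125

15 – 19 10 17 0 0 0 0

20 – 24 15 22 + 5 25 75 375

25 – 29 10 27 + 10 100 100 1000

30 – 34 5 32 + 15 225 75 1125

N = 45 Σfdx = 225 Σfdx² = 2625

$\sigma = \sqrt{\dfrac{\Sigma f dx^2}{N} - \left(\dfrac{\Sigma f dx}{N}\right)^2}$, where N = 45, Σfdx² = 2625, Σfdx = 225

∴ $\sigma = \sqrt{\dfrac{2625}{45} - \left(\dfrac{225}{45}\right)^2} = \sqrt{58.33 - 25} = \sqrt{33.33} = 5.77$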

When the step deviations are taken from the assumed mean

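$\sigma = \sqrt{\dfrac{\Sigma f dx^2}{N} - \left(\dfrac{\Sigma f dx}{N}\right)^2} \times i$

where N = Number of the items (Σf ),

i = common factor,

f = frequencies corresponding to different groups,

dx = step-deviations = $\dfrac{X - A}{i}$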

Steps to calculate σ

(i) Take deviations of the calculated mid-points from the assumed mean, divide all deviations by a common factor (i) and represent these values by dx.

(ii) Square these step deviations dx to obtain dx² for different groups.

(iii) Multiply f by dx for the different groups to find fdx and add them to obtain Σfdx.

(iv) Multiply f with dx2 of different groups to find fdx2 for different groups and add them to obtain

Σfdx2.

(v) Apply the formula to find standard deviation.

Example : Suppose we are given the series and we want to calculate standard deviation with the help of step

deviation method. According to the given formula, we are required to calculate the value of i, N, Σfdx and

Σfdx2.

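Class Frequency Mid-points Deviations from assumed mean (A = 22) Step deviations (i = 5)

Intervals f X x = (X – 22) dx = x/5 dx² fdx fdx²

10 – 14 5 12 – 10 – 2 4 – 10 20

15 – 19 10 17 – 5 – 1 1 – 10 10

20 – 24 15 22 0 0 0 0 0

25 – 29 10 27 + 5 + 1 1 10 10

30 – 34 5 32 + 10 + 2 4 10 20

N = 45 Σfdx = 0 Σfdx² = 60

$\sigma = \sqrt{\dfrac{\Sigma f dx^2}{N} - \left(\dfrac{\Sigma f dx}{N}\right)^2} \times i$, where N = 45, i = 5, Σfdx = 0, Σfdx² = 60

∴ $\sigma = \sqrt{\dfrac{60}{45} - \left(\dfrac{0}{45}\right)^2} \times 5 = \sqrt{1.333} \times 5 = 5.77$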
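The assumed-mean and step-deviation methods are the same formula with different choices of A and i, so one small function covers both. The sketch below is illustrative only (the name `sd_step_deviation` is not from the text); it shows that both shortcuts reproduce the value obtained from the actual mean.

```python
import math

midpoints = [12, 17, 22, 27, 32]
freqs = [5, 10, 15, 10, 5]

def sd_step_deviation(midpoints, freqs, assumed_mean, i=1):
    """Standard deviation using deviations from an assumed mean A,
    optionally divided by a common factor i (step deviations)."""
    n = sum(freqs)
    dx = [(m - assumed_mean) / i for m in midpoints]
    sum_fdx = sum(f * d for f, d in zip(freqs, dx))
    sum_fdx2 = sum(f * d * d for f, d in zip(freqs, dx))
    # Multiply back by i to undo the scaling of the step deviations.
    return math.sqrt(sum_fdx2 / n - (sum_fdx / n) ** 2) * i

sd_assumed = sd_step_deviation(midpoints, freqs, assumed_mean=17)    # A = 17, i = 1
sd_step = sd_step_deviation(midpoints, freqs, assumed_mean=22, i=5)  # A = 22, i = 5
print(round(sd_assumed, 2), round(sd_step, 2))
```

Whatever assumed mean is chosen, the correction term (Σfdx/N)² removes its effect, which is why the two calls agree.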

Advantages of Standard Deviation

(i) Standard deviation is the best measure of dispersion because it takes into account all the items and is capable of further algebraic treatment and statistical analysis.

(ii) It is possible to calculate a combined standard deviation for two or more series.

(iii) This measure is most suitable for comparing the variability of two or more series.

Disadvantages

(i) It is difficult to compute.

(ii) It assigns more weights to extreme items and less weights to items that are nearer to mean. It

is because of this fact that the squares of the deviations which are large in size would be

proportionately greater than the squares of those deviations which are comparatively small.

Mathematical properties of standard deviation (σ)

(i) If deviations of the given items are taken from the arithmetic mean and squared, the sum of the squared deviations is a minimum, i.e., $\Sigma (X - \bar{X})^2$ = Minimum.

(ii) If the different values are increased or decreased by a constant, the standard deviation will remain the same. If the different values are multiplied or divided by a constant, then the standard deviation will be multiplied or divided by that constant.

(iii) Combined standard deviation can be obtained for two or more series with the formula given below:

$\sigma_{12} = \sqrt{\dfrac{N_1\sigma_1^2 + N_2\sigma_2^2 + N_1 d_1^2 + N_2 d_2^2}{N_1 + N_2}}$

where: $N_1$ represents number of items in first series,

$N_2$ represents number of items in second series,

$\sigma_1^2$ represents variance of first series,

$\sigma_2^2$ represents variance of second series,

$d_1$ represents the difference between $\bar{X}_1$ and $\bar{X}_{12}$,

$d_2$ represents the difference between $\bar{X}_2$ and $\bar{X}_{12}$,

$\bar{X}_1$ represents arithmetic mean of first series,

$\bar{X}_2$ represents arithmetic mean of second series,

$\bar{X}_{12}$ represents combined arithmetic mean of both the series.

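Example : Find the combined standard deviation of two series from the information given below :

First Series Second Series

No. of items 10 15

Arithmetic means 15 20

Standard deviation 4 5

Solution : Since we are considering two series, the combined standard deviation is computed by the following formula :

$\sigma_{12} = \sqrt{\dfrac{N_1\sigma_1^2 + N_2\sigma_2^2 + N_1 d_1^2 + N_2 d_2^2}{N_1 + N_2}}$

where: $N_1 = 10$, $N_2 = 15$, $\bar{X}_1 = 15$, $\bar{X}_2 = 20$, $\sigma_1 = 4$, $\sigma_2 = 5$

$\bar{X}_{12} = \dfrac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2} = \dfrac{(10 \times 15) + (15 \times 20)}{10 + 15}$

or $\bar{X}_{12} = \dfrac{450}{25} = 18$

$d_1 = \bar{X}_1 - \bar{X}_{12} = 15 - 18 = -3$, $d_2 = \bar{X}_2 - \bar{X}_{12} = 20 - 18 = 2$

By applying the formula of combined standard deviation, we get :

$\sigma_{12} = \sqrt{\dfrac{10(4)^2 + 15(5)^2 + 10(-3)^2 + 15(2)^2}{10 + 15}}$

$= \sqrt{\dfrac{160 + 375 + 90 + 60}{25}}$

$= \sqrt{\dfrac{685}{25}} = \sqrt{27.4} = 5.23$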
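A short Python sketch of the combined standard deviation formula follows. The helper name `combined_sd` is hypothetical; the inputs are the two series from the example (10 items with mean 15 and σ = 4, and 15 items with mean 20 and σ = 5).

```python
import math

def combined_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and standard deviation of two series."""
    mean12 = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1 = mean1 - mean12  # deviation of first mean from combined mean
    d2 = mean2 - mean12  # deviation of second mean from combined mean
    var12 = (n1 * sd1**2 + n2 * sd2**2 + n1 * d1**2 + n2 * d2**2) / (n1 + n2)
    return mean12, math.sqrt(var12)

mean12, sd12 = combined_sd(10, 15, 4, 15, 20, 5)
print(mean12, round(sd12, 2))
```

The two d terms carry the between-series spread, so the combined σ can exceed both individual values when the means are far apart.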

(iv) Standard deviation of the first N natural numbers can be computed as :

$\sigma = \sqrt{\dfrac{N^2 - 1}{12}}$ where, N represents number of items.

(v) For a symmetrical (normal) distribution

$\bar{X} \pm \sigma$ covers 68.27% of items.

$\bar{X} \pm 2\sigma$ covers 95.45% of items.

$\bar{X} \pm 3\sigma$ covers 99.73% of items.

Example : You are heading a rationing department in a State affected by food shortage. Local investigators

submit the following report:

Daily calorie value of food available per adult during current period :

Area Mean Standard deviation

A 2,500 400

B 2,000 200

The estimated requirement of an adult is taken at 2,800 calories daily and the absolute minimum is 1,350. Comment on the reported figures, and determine which area, in your opinion, needs more urgent attention.

Solution : We know that $\bar{X} \pm \sigma$ covers 68.27% of items, $\bar{X} \pm 2\sigma$ covers 95.45% of items and $\bar{X} \pm 3\sigma$ covers 99.73%. In the given problem, if we take into consideration 99.73%, i.e., almost the whole population, the limits would be $\bar{X} \pm 3\sigma$.

For Area A these limits are :

$\bar{X} + 3\sigma$ = 2,500 + (3 × 400) = 3,700

$\bar{X} - 3\sigma$ = 2,500 – (3 × 400) = 1,300

For Area B these limits are :

$\bar{X} + 3\sigma$ = 2,000 + (3 × 200) = 2,600

$\bar{X} - 3\sigma$ = 2,000 – (3 × 200) = 1,400

It is clear from the above limits that in Area A there are some persons who are getting as little as 1,300 calories, i.e. below the minimum of 1,350. In Area B there is no one who is getting less than the minimum. Hence Area A needs more urgent attention.
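The three-sigma comparison above can be sketched in Python as follows. The dictionary layout and the name `three_sigma_limits` are illustrative assumptions; the figures are those of the investigators' report.

```python
# Reported mean and standard deviation of daily calories per adult.
areas = {"A": (2500, 400), "B": (2000, 200)}
ABSOLUTE_MINIMUM = 1350

def three_sigma_limits(mean, sd):
    """Lower and upper limits that cover about 99.73% of items."""
    return mean - 3 * sd, mean + 3 * sd

for name, (mean, sd) in areas.items():
    low, high = three_sigma_limits(mean, sd)
    # An area needs urgent attention if its lower limit falls below the minimum.
    print(name, low, high, low < ABSOLUTE_MINIMUM)
```

Only the lower limit matters for the decision, since the question is whether anyone falls below the absolute minimum.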

(vi) Relationship between quartile deviation, average deviation and standard deviation is given as:

Quartile deviation = 2/3 Standard deviation

Average deviation = 4/5 Standard deviation

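(vii) We can also compute corrected standard deviation by using the following formula :

Correct $\sigma = \sqrt{\dfrac{\text{Corrected } \Sigma X^2}{N} - (\text{Corrected } \bar{X})^2}$

(a) Compute corrected $\bar{X} = \dfrac{\text{Corrected } \Sigma X}{N}$

where, corrected ΣX = ΣX + correct items – wrong items

where, ΣX = N$\bar{X}$.

(b) Compute corrected ΣX² = ΣX² + (Each correct item)² – (Each wrong item)²

where, ΣX² = Nσ² + N$\bar{X}^2$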

Example : (a) Find out the coefficient of variation of a series for which the following results are given :

N = 50, ΣX’ = 25, ΣX’2 = 500 where: X’ = deviation from the assumed average 5.

(b) For a frequency distribution of marks in statistics of 100 candidates (grouped in class intervals of 0 – 10, 10 – 20, and so on), the mean and standard deviation were found to be 45 and 20. Later it was discovered that the score 54 was misread as 64 in obtaining the frequency distribution. Find out the correct mean and correct standard deviation of the frequency distribution.

(c) Can coefficient of variation be greater than 100%? If so, when?

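Solution : (a) We want to calculate the coefficient of variation, which is $\dfrac{\sigma}{\bar{X}} \times 100$.

Therefore, we are required to calculate mean and standard deviation.

Calculation of simple mean

$\bar{X} = A + \dfrac{\Sigma X'}{N}$ where, A = 5, N = 50, ΣX’ = 25

∴ $\bar{X} = 5 + \dfrac{25}{50} = 5.5$

Calculation of standard deviation

$\sigma = \sqrt{\dfrac{\Sigma X'^2}{N} - \left(\dfrac{\Sigma X'}{N}\right)^2} = \sqrt{\dfrac{500}{50} - \left(\dfrac{25}{50}\right)^2} = \sqrt{10 - 0.25} = \sqrt{9.75} = 3.1225$

Calculation of Coefficient of variation

C.V. = $\dfrac{\sigma}{\bar{X}} \times 100 = \dfrac{3.1225}{5.5} \times 100 = 56.77\%$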

(b) Given $\bar{X}$ = 45, σ = 20, N = 100, wrong value = 64, correct value = 54

Since this is a case of continuous series, we will apply the formulae for mean and standard deviation that are applicable in a continuous series.

Calculation of correct Mean

$\bar{X} = \dfrac{\Sigma fX}{N}$ or N$\bar{X}$ = ΣfX

By substituting the values, we get ΣfX = 100 × 45 = 4500

Correct ΣfX = 4500 – 64 + 54 = 4490

∴ Correct $\bar{X} = \dfrac{4490}{100} = 44.9$

Calculation of correct σ

$\sigma = \sqrt{\dfrac{\Sigma fX^2}{N} - \bar{X}^2}$ or $\sigma^2 = \dfrac{\Sigma fX^2}{N} - \bar{X}^2$

where, σ = 20, N = 100, $\bar{X}$ = 45

$(20)^2 = \dfrac{\Sigma fX^2}{100} - (45)^2$

or $400 = \dfrac{\Sigma fX^2}{100} - 2025$

or $400 + 2025 = \dfrac{\Sigma fX^2}{100}$

or 2425 × 100 = ΣfX² = 242500

∴ Correct ΣfX² = 242500 – (64)² + (54)² = 242500 – 4096 + 2916 = 242500 – 1180 = 241320

Correct $\sigma = \sqrt{\dfrac{241320}{100} - (44.9)^2}$

$= \sqrt{2413.2 - 2016.01} = \sqrt{397.19} = 19.93$

(c) The formula for the computation of coefficient of variation is $\dfrac{\sigma}{\bar{X}} \times 100$.

Hence, coefficient of variation can be greater than 100% only when the value of standard deviation is greater than the value of mean.

This will happen when data contains a large number of small items and a few items are quite large. In such a case the value of simple mean will be pulled down and the value of standard deviation will go up.

Similarly, if there are negative items in a series, the value of mean will come down and the value of standard deviation shall not be affected because of squaring the deviations.

Example : In a distribution of 10 observations, the value of mean and standard deviation are given as 20 and

8. By mistake, two values are taken as 2 and 6 instead of 4 and 8. Find out the value of correct mean and

variance.

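Solution : We are given: N = 10, $\bar{X}$ = 20, σ = 8

Wrong values = 2 and 6 and Correct values = 4 and 8

Calculation of correct Mean

$\bar{X} = \dfrac{\Sigma X}{N}$

ΣX = 10 × 20 = 200

But ΣX is incorrect. Therefore we shall find correct ΣX.

Correct ΣX = 200 – 2 – 6 + 4 + 8 = 204

Correct Mean = $\dfrac{204}{10} = 20.4$

Calculation of correct variance

$\sigma = \sqrt{\dfrac{\Sigma X^2}{N} - \bar{X}^2}$

or $\sigma^2 = \dfrac{\Sigma X^2}{N} - \bar{X}^2$

or $(8)^2 = \dfrac{\Sigma X^2}{10} - (20)^2$

or $64 + 400 = \dfrac{\Sigma X^2}{10}$

or ΣX² = 4640

But this is wrong and hence we shall compute correct ΣX²

Correct ΣX² = 4640 – 2² – 6² + 4² + 8²

= 4640 – 4 – 36 + 16 + 64 = 4680

Correct $\sigma^2 = \dfrac{4680}{10} - (20.4)^2$

$= 468 - 416.16 = 51.84$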
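The correction procedure can be written as a small Python function. This is a sketch under the assumption that only the summary figures (N, mean, σ) and the misrecorded values are known; the name `corrected_mean_variance` is illustrative.

```python
def corrected_mean_variance(n, mean, sd, wrong, correct):
    """Recover the totals from the reported mean and standard deviation,
    swap the wrong items for the correct ones, and recompute."""
    sum_x = n * mean                   # reported total of X
    sum_x2 = n * (sd**2 + mean**2)     # reported total of X squared
    sum_x += sum(correct) - sum(wrong)
    sum_x2 += sum(c * c for c in correct) - sum(w * w for w in wrong)
    new_mean = sum_x / n
    new_variance = sum_x2 / n - new_mean**2
    return new_mean, new_variance

# Ten observations, mean 20, sd 8; values 2 and 6 should have been 4 and 8.
new_mean, new_variance = corrected_mean_variance(10, 20, 8, [2, 6], [4, 8])
print(new_mean, round(new_variance, 2))
```

The same function handles the marks example (N = 100, mean 45, σ = 20, 64 recorded instead of 54) by passing `[64]` and `[54]`.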

Revisionary Problems

Example : Compute (a) Inter-quartile range. (b) Semi-quartile range, and

(c) Coefficient of quartile deviation from the following data :

Farm Size (acres) No. of farms

0 – 40 394

41 – 80 461

81 – 120 391

121 – 160 334

161 – 200 169

201 – 240 113

241 and over 148

Solution :

In this case, the real limits of the class intervals are obtained by subtracting 0.5 from the lower limit of each class and adding 0.5 to the upper limit of each class. This adjustment is necessary to calculate the median and quartiles of the series.

Farm Size (acres) No. of farms Cumulative frequency (c.f.)

– 0.5 – 40.5 394 394

40.5 – 80.5 461 855

80.5 – 120.5 391 1246

120.5 – 160.5 334 1580

160.5 – 200.5 169 1749

200.5 – 240.5 113 1862

240.5 and over 148 2010

N = 2010

$Q_1$ = size of the $\dfrac{N}{4}$th item = $\dfrac{2010}{4}$ = 502.5th item

$Q_1 = l_1 + \dfrac{\frac{N}{4} - cf_0}{f} \times i$

$Q_1$ lies in the cumulative frequency of the group 40.5 – 80.5, and $l_1$ = 40.5, f = 461, i = 40, $cf_0$ = 394, $\dfrac{N}{4}$ = 502.5

∴ $Q_1 = 40.5 + \dfrac{502.5 - 394}{461} \times 40 = 40.5 + 9.4 = 49.9$ acres

Similarly, $Q_3$ = size of the $\dfrac{3N}{4}$th item = $\dfrac{3 \times 2010}{4}$ = 1507.5th item

$Q_3$ lies in the cumulative frequency of the group 121 – 160, where the real limits of the class interval are 120.5 – 160.5 and $l_1$ = 120.5, i = 40, f = 334, $\dfrac{3N}{4}$ = 1507.5, c.f. = 1246

∴ $Q_3 = 120.5 + \dfrac{1507.5 - 1246}{334} \times 40 = 120.5 + 31.3 = 151.8$ acres

Inter-quartile range = $Q_3 - Q_1$ = 151.8 – 49.9 = 101.9 acres

Semi-quartile range = $\dfrac{Q_3 - Q_1}{2} = \dfrac{101.9}{2} = 50.95$ acres

Coefficient of quartile deviation = $\dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{101.9}{201.7} = 0.505$
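The interpolation used for the quartiles can be sketched in Python. The function name `grouped_quantile` is an illustrative assumption, and the sketch assumes equal class widths for the classes that are actually reached (the open-ended last class is never used here).

```python
# Real lower limits and frequencies of the farm-size classes.
lower_limits = [-0.5, 40.5, 80.5, 120.5, 160.5, 200.5, 240.5]
freqs = [394, 461, 391, 334, 169, 113, 148]
WIDTH = 40

def grouped_quantile(lower_limits, freqs, width, position):
    """Value below which `position` items lie, by linear interpolation
    inside the class that contains that position."""
    cum = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= position:
            return lower + (position - cum) / f * width
        cum += f
    raise ValueError("position exceeds the total frequency")

n = sum(freqs)
q1 = grouped_quantile(lower_limits, freqs, WIDTH, n / 4)
q3 = grouped_quantile(lower_limits, freqs, WIDTH, 3 * n / 4)
iqr = q3 - q1
coeff_qd = (q3 - q1) / (q3 + q1)
print(round(q1, 1), round(q3, 1), round(iqr, 1), round(coeff_qd, 3))
```

Passing `n / 2` as the position gives the median with the same function.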

Example : Calculate mean and coefficient of mean deviation about mean from the following data :

Marks less than No. of students

10 4

20 10

30 20

40 40

50 50

60 56

70 60

Solution :

In this question, we are given a less-than type series along with the cumulative frequencies. Therefore, we are required first of all to find out the class intervals and frequencies for calculating the mean and the coefficient of mean deviation about mean.

Marks No. of students Mid-points Deviations from assumed mean (A = 35) Step deviations (i = 10) Deviations from mean (35), ignoring signs

f X d = (X – 35) dx = d/10 |dx| fdx f|dx|

0 – 10 4 5 – 30 – 3 3 – 12 12

10 – 20 6 15 – 20 – 2 2 – 12 12

20 – 30 10 25 – 10 – 1 1 – 10 10

30 – 40 20 35 0 0 0 0 0

40 – 50 10 45 + 10 + 1 1 + 10 10

50 – 60 6 55 + 20 + 2 2 + 12 12

60 – 70 4 65 + 30 + 3 3 + 12 12

N = 60 Σfdx = 0 Σf|dx| = 68

$\bar{X} = A + \dfrac{\Sigma f dx}{N} \times i$

where, N = 60, A = 35, i = 10, Σfdx = 0

∴ $\bar{X} = 35 + \dfrac{0}{60} \times 10 = 35$

M.D. about mean = $\dfrac{\Sigma f|dx|}{N} \times i = \dfrac{68}{60} \times 10 = 11.33$

Coefficient of M.D. about mean = $\dfrac{\text{M.D.}}{\bar{X}} = \dfrac{11.33}{35} = 0.324$

Example : Calculate standard deviation from the following data :

Class Interval Frequency

– 30 to – 20 5

– 20 to – 10 10

– 10 to 0 15

0 to 10 10

10 to 20 5

N = 45

Solution: Calculation of Standard Deviation

Class Frequency Mid-points Deviations from assumed mean (A = – 5) Step deviations (i = 10)

Intervals f X X’ = (X + 5) dx = X’/10 dx² fdx fdx²

–30 to –20 5 – 25 – 20 – 2 4 – 10 20

–20 to –10 10 – 15 – 10 – 1 1 – 10 10

–10 to 0 15 – 5 0 0 0 0 0

0 to 10 10 5 + 10 1 1 10 10

10 to 20 5 15 + 20 2 4 10 20

N = 45 Σfdx = 0 Σfdx² = 60

$\sigma = \sqrt{\dfrac{\Sigma f dx^2}{N} - \left(\dfrac{\Sigma f dx}{N}\right)^2} \times i$

where, N = 45, i = 10, Σfdx = 0, Σfdx² = 60

∴ $\sigma = \sqrt{\dfrac{60}{45} - \left(\dfrac{0}{45}\right)^2} \times 10 = \sqrt{1.333} \times 10 = 11.55$

Example : For two firms A and B belonging to same industry, the following details are available :

Firm A Firm B

Number of Employees : 100 200

Average wage per month : Rs. 240 Rs. 170

Standard deviation of the wage per month : Rs. 6 Rs. 8

Find (i) Which firm pays out larger amount as monthly wages?

(ii) Which firm shows greater variability in the distribution of wages?

(iii) Find average monthly wages and the standard deviation of wages of all employees for both

the firms.

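Solution : (i) For finding out which firm pays the larger amount, we have to find out ΣX.

$\bar{X} = \dfrac{\Sigma X}{N}$ or ΣX = N$\bar{X}$

Firm A : N = 100, $\bar{X}$ = 240 ∴ ΣX = 100 × 240 = 24000

Firm B : N = 200, $\bar{X}$ = 170 ∴ ΣX = 200 × 170 = 34000

Hence firm B pays the larger amount as monthly wages.

(ii) For finding out which firm shows greater variability in the distribution of wages, we have to calculate the coefficient of variation.

Firm A : C.V. = $\dfrac{\sigma}{\bar{X}} \times 100 = \dfrac{6}{240} \times 100 = 2.5\%$

Firm B : C.V. = $\dfrac{\sigma}{\bar{X}} \times 100 = \dfrac{8}{170} \times 100 = 4.71\%$

Since the coefficient of variation is greater for firm B, it shows greater variability in the distribution of wages.

(iii) Combined wages : $\bar{X}_{12} = \dfrac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2}$

where, $N_1$ = 100, $\bar{X}_1$ = 240, $N_2$ = 200, $\bar{X}_2$ = 170

Hence $\bar{X}_{12} = \dfrac{(100 \times 240) + (200 \times 170)}{100 + 200} = \dfrac{58000}{300}$ = Rs. 193.3

Combined Standard Deviation :

$\sigma_{12} = \sqrt{\dfrac{N_1\sigma_1^2 + N_2\sigma_2^2 + N_1 d_1^2 + N_2 d_2^2}{N_1 + N_2}}$

where $N_1$ = 100, $N_2$ = 200, $\sigma_1$ = 6, $\sigma_2$ = 8, $d_1 = \bar{X}_1 - \bar{X}_{12}$ = 240 – 193.3 = 46.7

and $d_2 = \bar{X}_2 - \bar{X}_{12}$ = 170 – 193.3 = – 23.3

$\sigma_{12} = \sqrt{\dfrac{100(6)^2 + 200(8)^2 + 100(46.7)^2 + 200(-23.3)^2}{100 + 200}}$

$= \sqrt{\dfrac{3600 + 12800 + 218089 + 108578}{300}} = \sqrt{1143.56}$ = Rs. 33.82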

Example : From the following frequency distribution of heights of 360 boys in the age-group 10 – 20 years

calculate the :

(i) arithmetic mean;

(ii) coefficient of variation; and

(iii) quartile deviation

Height (cms) No. of boys Height (cms) No. of boys

126 – 130 31 146 – 150 60

131 – 135 44 151 – 155 55

136 – 140 48 156 – 160 43

141 – 145 51 161 – 165 28

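Solution : Calculation of $\bar{X}$, Q.D., and C.V.

Heights Mid-points X f dx = (X – 143)/5 fdx fdx² c.f.

126 – 130 128 31 – 3 – 93 279 31

131 – 135 133 44 – 2 – 88 176 75

136 – 140 138 48 – 1 – 48 48 123

141 – 145 143 51 0 0 0 174

146 – 150 148 60 + 1 + 60 60 234

151 – 155 153 55 + 2 + 110 220 289

156 – 160 158 43 + 3 + 129 387 332

161 – 165 163 28 + 4 + 112 448 360

N = 360 Σfdx = 182 Σfdx² = 1618

(i) $\bar{X} = A + \dfrac{\Sigma f dx}{N} \times i$ where, N = 360, A = 143, i = 5, Σfdx = 182

∴ $\bar{X} = 143 + \dfrac{182}{360} \times 5 = 143 + 2.53 = 145.53$ cms

(ii) C.V. = $\dfrac{\sigma}{\bar{X}} \times 100$

$\sigma = \sqrt{\dfrac{\Sigma f dx^2}{N} - \left(\dfrac{\Sigma f dx}{N}\right)^2} \times i = \sqrt{\dfrac{1618}{360} - \left(\dfrac{182}{360}\right)^2} \times 5$

$= \sqrt{4.4944 - 0.2556} \times 5 = 2.0588 \times 5 = 10.29$

C.V. = $\dfrac{10.29}{145.53} \times 100 = 7.07\%$

(iii) Q.D. = $\dfrac{Q_3 - Q_1}{2}$

$Q_1$ = Size of the $\dfrac{N}{4}$th observation = $\dfrac{360}{4}$ = 90th observation

$Q_1$ lies in the class 136 – 140. But the real limits of this class are 135.5 – 140.5

$Q_1 = 135.5 + \dfrac{90 - 75}{48} \times 5 = 135.5 + 1.56 = 137.06$

$Q_3$ = Size of the $\dfrac{3N}{4}$th observation = $\dfrac{3 \times 360}{4}$ = 270th observation

$Q_3$ lies in the class 151 – 155. But the real limits of this class are 150.5 – 155.5

$Q_3 = 150.5 + \dfrac{270 - 234}{55} \times 5 = 150.5 + 3.27 = 153.77$

Q.D. = $\dfrac{153.77 - 137.06}{2} = \dfrac{16.71}{2} = 8.36$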

LESSON 5

MEASURES OF SKEWNESS AND KURTOSIS

Measures of Skewness and Kurtosis, like measures of central tendency and dispersion, study the

characteristics of a frequency distribution. Averages tell us about the central value of the distribution and

measures of dispersion tell us about the concentration of the items around a central value. These measures

do not reveal whether the dispersal of value on either side of an average is symmetrical or not. If observa-

tions are arranged in a symmetrical manner around a measure of central tendency, we get a symmetrical

distribution, otherwise, it may be arranged in an asymmetrical order which gives asymmetrical distribution.

Thus, skewness is a measure that studies the degree and direction of departure from symmetry.

A symmetrical distribution, when presented on the graph paper, gives a ‘symmetrical curve’, where

the value of mean, median and mode are exactly equal. On the other hand, in an asymmetrical distribution, the

values of mean, median and mode are not equal.

When two or more symmetrical distributions are compared, the differences between them are studied with ‘Kurtosis’. On the other hand, when two or more asymmetrical distributions are compared, they will give different degrees of Skewness. These measures are mutually exclusive i.e. the presence of skewness implies absence of kurtosis and vice-versa.

Tests of Skewness

There are certain tests to know whether skewness does or does not exist in a frequency distribution.

They are :

1. In a skewed distribution, values of mean, median and mode would not coincide. The values of mean and mode are pulled away and the value of median will be at the centre. In this distribution, Median – Mode = 2/3 (Mean – Mode).

2. Quartiles will not be equidistant from median.

3. When the asymmetrical distribution is drawn on the graph paper, it will not give a bell shaped-

curve.

4. Sum of the positive deviations from the median is not equal to sum of negative deviations.

5. Frequencies are not equal at points of equal deviations from the mode.

Nature of Skewness

Skewness can be positive or negative or zero.

1. When the values of mean, median and mode are equal, there is no skewness.

2. When mean > median > mode, skewness will be positive.

3. When mean < median < mode, skewness will be negative.

Characteristic of a good measure of skewness

1. It should be a pure number in the sense that its value should be independent of the unit of the

series and also degree of variation in the series.

2. It should have zero-value, when the distribution is symmetrical.

3. It should have a meaningful scale of measurement so that we could easily interpret the measured value.

Methods of ascertaining Skewness

Skewness can be studied graphically and mathematically. When we study skewness graphically, we can find out whether skewness is positive or negative or zero. This can be shown with the help of a diagram.

Mathematically skewness can be studied as :

(a) Absolute Skewness

(b) Relative or coefficient of skewness

When the skewness is presented in absolute term i.e, in units, it is absolute skewness. If the value of

skewness is obtained in ratios or percentages, it is called relative or coefficient of skewness.

When skewness is measured in absolute terms, we can compare one distribution with the other only if the units of measurement are the same. When it is presented in ratios or percentages, comparison becomes easy.

Mathematical measures of skewness can be calculated by :

(a) Bowley’s Method

(b) Karl-Pearson’s Method

(c) Kelly’s Method

(a) Bowley’s Method

Bowley’s method of skewness is based on the values of median, lower and upper quartiles. This

method suffers from the same limitations which are in the case of median and quartiles.

Wherever positional measures are given, skewness should be measured by Bowley’s method. This

method is also used in case of ‘open-end series’, where the importance of extreme values is ignored.

Absolute skewness = $Q_3 + Q_1 - 2\,\text{Median}$

Coefficient of Skewness = $\dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$

Coefficient of skewness lies within the limit ± 1. This method is quite convenient for determining skewness where one has already calculated quartiles.

For example, if the class intervals and frequencies are given as follows :

Class Intervals Frequencies

Below 10 5

10 – 20 10

20 – 30 15

30 – 40 10

above 40 5

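To compute the coefficient of skewness we are required to calculate the values of Median, $Q_3$ and $Q_1$ :

Calculation of Coefficient of Skewness on the basis of Median and Quartiles

Class Intervals Frequencies Cumulative frequency

Below 10 5 5

10 – 20 10 15

20 – 30 15 30

30 – 40 10 40

40 and above 5 45

N = 45

Median = size of the $\dfrac{N}{2}$th item = $\dfrac{45}{2}$ = 22.5. It lies in the cumulative frequency 30, corresponding to class (20 – 30).

Median = $L_1 + \dfrac{\frac{N}{2} - c.f.}{f} \times i = 20 + \dfrac{22.5 - 15}{15} \times 10 = 25$

For $Q_1$, $\dfrac{N}{4} = \dfrac{45}{4}$ = 11.25, which lies in the cumulative frequency 15, corresponding to class interval (10 – 20).

$Q_1 = 10 + \dfrac{11.25 - 5}{10} \times 10 = 16.25$

For $Q_3$, $\dfrac{3N}{4}$ = 33.75, which lies in the cumulative frequency 40, corresponding to group (30 – 40).

$Q_3 = 30 + \dfrac{33.75 - 30}{10} \times 10 = 33.75$

Absolute Skewness = $Q_3 + Q_1 - 2\,\text{Median}$, where $Q_3$ = 33.75, $Q_1$ = 16.25, Median = 25

∴ Absolute Skewness = 33.75 + 16.25 – 2(25) = 50 – 50 = 0

Coefficient of Skewness = $\dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$

Now we have, $Q_3$ = 33.75, $Q_1$ = 16.25, Median = 25

∴ Coefficient of Skewness = $\dfrac{33.75 + 16.25 - 2(25)}{33.75 - 16.25} = \dfrac{0}{17.5} = 0$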
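Once the quartiles and the median are known, Bowley's measure is a one-line calculation. The sketch below uses a hypothetical helper name, `bowley_skewness`, and the values obtained in the example.

```python
def bowley_skewness(q1, median, q3):
    """Absolute and relative (coefficient) skewness by Bowley's method."""
    absolute = q3 + q1 - 2 * median
    coefficient = absolute / (q3 - q1)
    return absolute, coefficient

absolute, coefficient = bowley_skewness(q1=16.25, median=25, q3=33.75)
print(absolute, coefficient)
```

A zero result is expected here because the frequencies 5, 10, 15, 10, 5 are symmetrical about the middle class.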

(b) Karl-Pearson’s Method (Pearsonian Coefficient of Skewness)

Karl Pearson has suggested two formulae;

(i) where the relationship of mean and mode is established;

(ii) where the relationship of mean and mode is not established, so that mean and median are used instead.

When the values of Mean and Mode are related :

Absolute skewness = Mean – Mode

Coefficient of skewness = $\dfrac{\text{Mean} - \text{Mode}}{\sigma}$

Coefficient of skewness generally lies within ± 1

When the values of Mean and Median are related :

Absolute skewness = 3(Mean – Median)

Coefficient of skewness = $\dfrac{3(\text{Mean} - \text{Median})}{\sigma}$

Coefficient of skewness generally lies within ± 3

Calculation of Coefficient of skewness by using the following formula

Coefficient of skewness = $\dfrac{\text{Mean} - \text{Mode}}{\sigma}$

Given X values are = 12, 18, 18, 22, 35, and N = 5 ∴ $\bar{X} = \dfrac{\Sigma X}{N} = \dfrac{105}{5} = 21$

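X x = (X – 21) x²

12 – 9 81

18 – 3 9

18 – 3 9

22 + 1 1

35 + 14 196

N = 5 Σx² = 296

$\sigma = \sqrt{\dfrac{\Sigma x^2}{N}}$ ∴ $\sigma = \sqrt{\dfrac{296}{5}} = \sqrt{59.2} = 7.7$

∴ Coefficient of skewness = $\dfrac{\text{Mean} - \text{Mode}}{\sigma}$

Substitute Mean = 21, Mode = 18, Standard deviation = 7.7.

∴ SK = $\dfrac{21 - 18}{7.7} = 0.39$

Calculation of Karl-Pearson’s coefficient of skewness by using the following formula:

Coefficient of skewness = $\dfrac{3(\text{Mean} - \text{Median})}{\sigma}$

For the given data X = 12, 18, 18, 22, 35

Mean = 21, Median = 18, σ = 7.7

∴ Coefficient of skewness = $\dfrac{3(21 - 18)}{7.7} = \dfrac{9}{7.7} = 1.17$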
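Both Pearsonian coefficients can be reproduced with Python's standard `statistics` module. This is a sketch: `pstdev` is used because the text divides by N rather than N − 1, and the variable names are illustrative.

```python
import statistics

data = [12, 18, 18, 22, 35]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
sigma = statistics.pstdev(data)  # population standard deviation (divides by N)

sk_mode = (mean - mode) / sigma          # first formula: mean and mode
sk_median = 3 * (mean - median) / sigma  # second formula: mean and median
print(round(sk_mode, 2), round(sk_median, 2))
```

The second coefficient is three times the first here only because the median and the mode happen to coincide at 18.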

Revisionary Problems

Exercise 1. Calculate appropriate measure of skewness from the following income distribution:

Monthly income (Rs.) Frequency

upto – 100 9

101 – 150 51

151 – 200 120

201 – 300 240

301 – 500 136

501 – 750 33

751 – 1000 9

above 1000 2

N = 600

Solution : In this problem, an open-end series is given with inclusive class-intervals. Hence, Bowley’s measure of skewness is better because it is based on quartiles and is not affected by extreme class intervals.

Calculation of coefficient of skewness based on quartiles and median

Monthly income (Rs.) Frequency Cumulative Frequency

f c.f

Upto 100 9 9

101 – 150 51 60

151 – 200 120 180

201 – 300 240 420

301 – 500 136 556

501 – 750 33 589

751 – 1000 9 598

Above 1000 2 600

N = 600


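Median = $L_1 + \dfrac{\frac{N}{2} - c.f.}{f} \times i$

∴ N/2 = $\dfrac{600}{2}$ = 300. It lies in the cumulative frequency 420, which corresponds to group 201 – 300.

But the real limits of the class interval are 200.5 – 300.5

∴ Median = $200.5 + \dfrac{300 - 180}{240} \times 100 = 200.5 + 50 = 250.5$

$Q_1 = L_1 + \dfrac{\frac{N}{4} - c.f.}{f} \times i$

N/4 = $\dfrac{600}{4}$ = 150. It lies in the cumulative frequency 180, which corresponds to class-interval 151 – 200. But the real limits of this class-interval are 150.5 – 200.5.

∴ $Q_1 = 150.5 + \dfrac{150 - 60}{120} \times 50 = 150.5 + 37.5 = 188$

$Q_3 = L_1 + \dfrac{\frac{3N}{4} - c.f.}{f} \times i$

Where 3N/4 is used to find out the upper quartile group.

∴ 3N/4 = $\dfrac{3 \times 600}{4}$ = 450. It lies in the cumulative frequency 556, which corresponds to group 301 – 500.

The real limits of this class interval are 300.5 – 500.5

∴ $Q_3 = 300.5 + \dfrac{450 - 420}{136} \times 200 = 300.5 + 44.12 = 344.62$

Hence, Coefficient of skewness = $\dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1} = \dfrac{344.62 + 188 - 2(250.5)}{344.62 - 188} = \dfrac{31.62}{156.62} = 0.20$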

Exercise 2 : Calculate the appropriate measure of skewness from the following cumulative frequency distribution:

Age (under years) : 20 30 40 50 60 70

No. of persons : 12 29 48 75 94 106

Solution : In this problem, we are given the upper limits of classes along with the cumulative frequency. Therefore, we have to find out the lower limits and frequencies for the given data.

Age (years) Number of Persons Cumulative

Frequency (f) Frequency (c.f.)

Below 20 12 12

20 – 30 17 29

30 – 40 19 48

40 – 50 27 75

50 – 60 19 94

60 – 70 12 106

N = 106

Because the lower limit of the first group is not given, the appropriate measure of skewness is

Bowley’s method. It is based on quartiles and median and is not influenced by extreme class-intervals.

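Bowley’s coefficient of skewness = $\dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$

Thus, we have to calculate the values of $Q_3$, $Q_1$ and median.

Median = $L_1 + \dfrac{\frac{N}{2} - c.f.}{f} \times i$

Median has $\dfrac{N}{2}$ or $\dfrac{106}{2}$ or 53 items below it.

Therefore, it lies in the cumulative frequency 75, which corresponds to the class-interval (40 – 50).

Hence, median group is (40 – 50).

where $L_1$ = 40, i = 10, f = 27, $\dfrac{N}{2}$ = 53, c.f. = 48

∴ Median = $40 + \dfrac{53 - 48}{27} \times 10 = 40 + 1.85 = 41.85 \approx 41.9$

$Q_1 = L_1 + \dfrac{\frac{N}{4} - c.f.}{f} \times i$

$Q_1$ has $\dfrac{N}{4}$ or $\dfrac{106}{4}$ or 26.5 items below it.

It lies in the cumulative frequency 29, which corresponds to the class-interval 20 – 30.

where $L_1$ = 20, i = 10, f = 17, $\dfrac{N}{4}$ = 26.5, c.f. = 12

∴ $Q_1 = 20 + \dfrac{26.5 - 12}{17} \times 10 = 20 + 8.53 = 28.53$

$Q_3 = L_1 + \dfrac{\frac{3N}{4} - c.f.}{f} \times i$

$Q_3$ has $\dfrac{3N}{4}$ or $\dfrac{3 \times 106}{4}$ or 79.5 items below it.

It lies in the cumulative frequency 94, which corresponds to the group 50 – 60.

∴ $Q_3 = 50 + \dfrac{79.5 - 75}{19} \times 10 = 50 + 2.37 = 52.37$

Where $Q_3$ = 52.37, $Q_1$ = 28.53, median = 41.9

Coefficient of Skewness = $\dfrac{52.37 + 28.53 - 2(41.9)}{52.37 - 28.53} = \dfrac{-2.9}{23.84} = -0.12$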

Exercise 3 : Calculate the Karl-Pearson’s coefficient of skewness from the following data :

Marks (above) : 0 10 20 30 40 50 60 70 80

No. of Students: 150 140 100 80 80 70 30 14 0

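Solution : The formula of Karl-Pearson is applied to find out the coefficient of skewness.

S.K. = $\dfrac{3(\text{Mean} - \text{Median})}{\sigma}$

Median = $L_1 + \dfrac{\frac{N}{2} - c.f.}{f} \times i$

Median has $\dfrac{N}{2}$ or $\dfrac{150}{2}$ or 75 items below it. It lies in the cumulative frequency 80, which corresponds to the group (40 – 50). Therefore the median group is 40 – 50.

where, $L_1$ = 40, $\dfrac{N}{2}$ = 75, c.f. = 70, f = 10, i = 10

Median = $40 + \dfrac{75 - 70}{10} \times 10 = 45$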

Calculation of Mean and Standard Deviation

Marks Frequency Mid-points Deviations from assumed mean Step deviations (i = 10)

f X (X – 45) dx = (X – 45)/10 dx² fdx fdx²

0 – 10 10 5 – 40 – 4 16 –40 160

10 – 20 40 15 – 30 – 3 9 –120 360

20 – 30 20 25 – 20 – 2 4 –40 80

30 – 40 0 35 – 10 – 1 1 0 0

40 – 50 10 45 0 0 0 0 0

50 – 60 40 55 + 10 + 1 1 40 40

60 – 70 16 65 + 20 + 2 4 32 64

70 – 80 14 75 + 30 + 3 9 42 126

N = 150 Σfdx = – 86 Σfdx2 = 830

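$\bar{X} = A + \dfrac{\Sigma f dx}{N} \times i$

∴ $\bar{X} = 45 + \dfrac{-86}{150} \times 10 = 45 - 5.73 = 39.27$

$\sigma = \sqrt{\dfrac{\Sigma f dx^2}{N} - \left(\dfrac{\Sigma f dx}{N}\right)^2} \times i$

where, N = 150, i = 10, Σfdx = –86, Σfdx² = 830.

∴ $\sigma = \sqrt{\dfrac{830}{150} - \left(\dfrac{-86}{150}\right)^2} \times 10$

or $\sigma = \sqrt{5.533 - 0.329} \times 10 = \sqrt{5.204} \times 10 = 22.8$

Where, σ = 22.8, Mean = 39.27, Median = 45

∴ Coefficient of skewness = $\dfrac{3(39.27 - 45)}{22.8} = \dfrac{-17.19}{22.8} = -0.754$

Hence, the coefficient of skewness is – 0.754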

Exercise 4 : (a) In a frequency distribution the coefficient of skewness based on quartiles is 0.6. If the sum

of the upper and lower quartile is 100 and median is 38, find the values of lower and upper quartiles. Also find

out the value of middle 50% items.

(b) In a certain distribution, the following results were obtained :

Coefficient of variation = 40%

$\bar{X}$ = 25

Mode = 20

Find out the Coefficient of skewness, by applying

Solution : (a) Since Bowley’s method is based on quartiles, we shall use the following formula :

Coefficient of skewness = $\dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$

Where coeff. of sk. = +0.6, median = 38, ($Q_3 + Q_1$) = 100

By substituting the values in the formula, we get

$+0.6 = \dfrac{100 - 2(38)}{Q_3 - Q_1}$

By cross multiplying, we get :

0.6 ($Q_3 - Q_1$) = 100 – 76 = 24 or $Q_3 - Q_1 = \dfrac{24}{0.6} = 40$

We can solve the simultaneous equations given below :

$Q_3 + Q_1$ = 100 ...(i)

$Q_3 - Q_1$ = 40 ...(ii)

or $2Q_3$ = 140 (adding both equations)

$Q_3$ = 70

Since $Q_3 + Q_1$ = 100

$Q_1$ = 100 – 70 = 30.

Hence the lower and upper quartiles are 30 and 70.

The value of middle 50% items can be obtained with the help of ($Q_3 - Q_1$).

∴ The value of middle 50% items is (70 – 30) = 40.

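(b) In this problem, the value of standard deviation is missing. We can calculate σ by applying the following formula :

C.V. = $\dfrac{\sigma}{\bar{X}} \times 100$

We are given, C.V. = 40%, $\bar{X}$ = 25

∴ $40 = \dfrac{\sigma}{25} \times 100$, or $\sigma = \dfrac{40 \times 25}{100} = 10$

Coefficient of skewness = $\dfrac{\text{Mean} - \text{Mode}}{\sigma}$

We are given, Mean = 25, Mode = 20, σ = 10

∴ Coefficient of skewness = $\dfrac{25 - 20}{10} = 0.5$

Hence the coefficient of skewness is + 0.5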

Exercise 5. What is the relationship between Mean, Median and Mode in :

(a) a symmetrical curve,

(b) a negatively skewed curve,

(c) a positively skewed curve?

From the marks obtained by 120 students each in section A and B of a class, the following measures

are secured :

Section A Section B

Mean = 47 Marks Mean = 48

Standard deviation = 15 Marks Standard deviation = 15 marks

Mode = 52 Mode = 45.

Find out the coefficient of skewness and determine the degree of skewness and in which distribution,

the marks are more skewed.

Solution : The relationship between mean, median and mode, in different cases, can be established as :

(a) In a symmetrical curve, there is no skewness. Therefore the value of mean = median = mode.

(b) In a negatively skewed curve, the value of mean is less than median is less than mode. In other

words, mean < median < mode.

(c) In a positively skewed curve, the value of mean is greater than median is greater than mode. In

other words, mean > median > mode.

In the given problem, for finding out the degree of skewness, we have to compute the coefficient of

skewness.
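Since the mean, mode and standard deviation are given for both sections, the Pearsonian coefficient based on the mode can be computed directly. The following Python sketch is an illustration only; the function name `pearson_mode_skewness` is hypothetical and the figures are those stated in the exercise.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Karl Pearson's coefficient of skewness based on mean and mode."""
    return (mean - mode) / sd

sk_a = pearson_mode_skewness(mean=47, mode=52, sd=15)  # Section A
sk_b = pearson_mode_skewness(mean=48, mode=45, sd=15)  # Section B

# The degree of skewness is judged by the size of the coefficient,
# its direction by the sign.
more_skewed = "A" if abs(sk_a) > abs(sk_b) else "B"
print(round(sk_a, 2), round(sk_b, 2), more_skewed)
```

On these figures Section A is negatively skewed and Section B positively skewed, and the marks of Section A are the more skewed of the two.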

[Figure: mesokurtic curve ($\beta_2 = 3$), platykurtic curve ($\beta_2 < 3$) and leptokurtic curve ($\beta_2 > 3$)]

Measures of Kurtosis

Measure of kurtosis is denoted by $\beta_2$ and in a normal distribution $\beta_2 = 3$.

If $\beta_2$ is greater than 3, the curve is more peaked and is named as leptokurtic. If $\beta_2$ is less than 3, the curve is flatter at the top than the normal, and is named as platykurtic. Thus kurtosis is measured by

$\beta_2 = \dfrac{\mu_4}{\mu_2^2}$, with $\mu_4 = \dfrac{\Sigma x^4}{N}$ and $\mu_2 = \dfrac{\Sigma x^2}{N}$, where x = $(X - \bar{X})$

R.A. Fisher introduced another notation, the Greek letter gamma. Symbolically,

$\gamma_2 = \beta_2 - 3 = \dfrac{\mu_4}{\mu_2^2} - 3$.

In the case of a normal distribution, $\gamma_2$ is zero. If $\gamma_2$ is more than zero (positive), then the curve is leptokurtic and if $\gamma_2$ is less than 0 (negative) then the curve is platykurtic.

It may be noted that $\mu_4 = \dfrac{\Sigma x^4}{N}$ is an absolute measure of kurtosis but $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$ is a relative measure of kurtosis. The larger the magnitude of $\gamma_2$ in a frequency distribution, the greater is its departure from normality.

$\beta_1$ and $\beta_2$ are measures of symmetry and normality respectively. If $\beta_1 = 0$, the distribution is symmetrical and if $\beta_2 = 3$, the distribution curve is mesokurtic.

Comparison among dispersion, skewness and kurtosis

Dispersion, Skewness and Kurtosis are different characteristics of frequency distribution. Disper-

sion studies the scatter of the items round a central value or among themselves. It does not show the extent

to which deviations cluster below an average or above it. Skewness tells us about the cluster of the deviations

above and below a measure of central tendency. Kurtosis studies the concentration of the items at the central

part of a series. If items concentrate too much at the centre, the curve becomes ‘LEPTOKURTIC’ and if the

concentration at the centre is comparatively less, the curve becomes ‘PLATYKURTIC’.

Exercise : From the following data given below, calculate the value of kurtosis and find out the nature of

distribution:

X : 0 – 10 10 – 20 20 – 30 30 – 40 40 – 50

f : 5 10 15 10 5

Solution :

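Calculation of Mean = $\dfrac{\Sigma fX}{N} = \dfrac{1125}{45} = 25$

Calculation of $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$ where, $\mu_4 = \dfrac{\Sigma f x^4}{N} = \dfrac{1800000}{45} = 40000$

$\mu_2 = \dfrac{\Sigma f x^2}{N} = \dfrac{6000}{45} = 133.33$

∴ $\beta_2 = \dfrac{40000}{(133.33)^2} = \dfrac{40000}{17777.78} = 2.25$

Since the value of $\beta_2$ = 2.25 is less than 3, the distribution curve is platykurtic.

(CALCULATION OF $\beta_2$)

Classes Frequency f Mid-points X fX x = (X – 25) x² x³ x⁴ fx² fx³ fx⁴

0 – 10 5 5 25 – 20 400 – 8000 160000 2000 – 40000 800000

10 – 20 10 15 150 – 10 100 – 1000 10000 1000 – 10000 100000

20 – 30 15 25 375 0 0 0 0 0 0 0

30 – 40 10 35 350 + 10 100 1000 10000 1000 10000 100000

40 – 50 5 45 225 + 20 400 8000 160000 2000 40000 800000

N = 45 ΣfX = 1125 Σfx² = 6000 Σfx³ = 0 Σfx⁴ = 1800000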
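The totals in the table can be turned into β2 with a short Python sketch. The helper name `central_moment` is illustrative; the mid-points and frequencies are those of the exercise. Recomputing from Σfx² = 6000 and Σfx⁴ = 1800000 gives β2 = 2.25, which is below 3.

```python
midpoints = [5, 15, 25, 35, 45]
freqs = [5, 10, 15, 10, 5]

def central_moment(midpoints, freqs, r):
    """r-th moment about the mean of a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    return sum(f * (m - mean) ** r for f, m in zip(freqs, midpoints)) / n

mu2 = central_moment(midpoints, freqs, 2)
mu4 = central_moment(midpoints, freqs, 4)
beta2 = mu4 / mu2**2   # relative measure of kurtosis
gamma2 = beta2 - 3     # Fisher's notation: zero for a normal curve
print(round(mu2, 2), mu4, round(beta2, 2), round(gamma2, 2))
```

A negative `gamma2` indicates a curve flatter than the normal curve.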


LESSON 6

MOMENTS

The concept of moments has crept into the statistical literature from mechanics. In mechanics, this

concept refers to the turning or the rotating effect of a force whereas it is used to describe the peculiarities

of a frequency distribution in statistics. We can measure the central tendency of a set of observations by

using moments. Moments also help in measuring the scatteredness, asymmetry and peakedness of a curve

for a particular distribution.

Moments refer to the average of the deviations from the mean or some other value raised to a certain power. The arithmetic mean of the various powers of these deviations in any distribution is called the moments of the distribution about the mean. Moments about the mean are generally used in statistics. We use the Greek letter µ (read as mu) for these moments. We shall understand the first four moments about the mean in this lesson, i.e., $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$.

Calculation of Central Moments

We can compute central moments in the following ways :

1. Direct method

2. Short -cut method

3. Step-deviation method.

(1) Direct Method

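(i) Calculate arithmetic mean ($\bar{X}$)

(ii) Calculate the deviations x = $(X - \bar{X})$ from the arithmetic mean and their sum (Σx).

(iii) Calculate the sum of x², x³ and x⁴.

In case of frequency distributions, multiply the individual values of x², x³ and x⁴ by the corresponding frequencies and find out the sum of fx², fx³ and fx⁴.

(iv) Apply the following formulae.

$\mu_1 = \dfrac{\Sigma x}{N}$

$\mu_2 = \dfrac{\Sigma x^2}{N}$

$\mu_3 = \dfrac{\Sigma x^3}{N}$

$\mu_4 = \dfrac{\Sigma x^4}{N}$

In case of frequency distribution apply :

$\mu_1 = \dfrac{\Sigma f x}{N}$

$\mu_2 = \dfrac{\Sigma f x^2}{N}$

$\mu_3 = \dfrac{\Sigma f x^3}{N}$

$\mu_4 = \dfrac{\Sigma f x^4}{N}$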

Let us take an example to understand the computation of the moments about mean

Illustration : Calculate the first four moments about the mean from the following set of numbers ; 2, 3, 7,

8, 10

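Solution : Calculation of Moments

X x = (X – 6) x² x³ x⁴

2 – 4 16 – 64 256

3 – 3 9 – 27 81

7 1 1 1 1

8 2 4 8 16

10 4 16 64 256

ΣX = 30 Σx = 0 Σx² = 46 Σx³ = – 18 Σx⁴ = 610

$\bar{X} = \dfrac{\Sigma X}{N} = \dfrac{30}{5} = 6$

Moments of the data can be computed by using the values calculated above.

$\mu_1 = \dfrac{\Sigma x}{N} = \dfrac{0}{5} = 0$

$\mu_2 = \dfrac{\Sigma x^2}{N} = \dfrac{46}{5} = 9.2$

$\mu_3 = \dfrac{\Sigma x^3}{N} = \dfrac{-18}{5} = -3.6$

$\mu_4 = \dfrac{\Sigma x^4}{N} = \dfrac{610}{5} = 122$

Therefore, the first four central moments about the mean are : 0, 9.2, – 3.6 and 122 respectively.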
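The direct method for ungrouped data can be sketched in Python as below. The function name `central_moments` is an illustrative choice; the data are the five numbers of the illustration.

```python
data = [2, 3, 7, 8, 10]

def central_moments(values, orders=(1, 2, 3, 4)):
    """Moments about the arithmetic mean for ungrouped data."""
    n = len(values)
    mean = sum(values) / n
    # Average of the deviations from the mean raised to each power r.
    return [sum((x - mean) ** r for x in values) / n for r in orders]

mu1, mu2, mu3, mu4 = central_moments(data)
print(mu1, mu2, mu3, mu4)
```

The first central moment is always zero, because the deviations from the mean cancel out.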

Illustration : From the marks distribution of 100 candidates, compute the first four moments about the mean.

    Marks      Number of Candidates
    0 – 10     10
    10 – 20    15
    20 – 30    25
    30 – 40    25
    40 – 50    10
    50 – 60    10
    60 – 70    5

Solution : Calculation of Moments

    Marks      Mid Value X   f     fX      x = X – 31   fx      fx²      fx³          fx⁴
    0 – 10     5             10    50      – 26         – 260   6,760    – 1,75,760   45,69,760
    10 – 20    15            15    225     – 16         – 240   3,840    – 61,440     9,83,040
    20 – 30    25            25    625     – 6          – 150   900      – 5,400      32,400
    30 – 40    35            25    875     4            100     400      1,600        6,400
    40 – 50    45            10    450     14           140     1,960    27,440       3,84,160
    50 – 60    55            10    550     24           240     5,760    1,38,240     33,17,760
    60 – 70    65            5     325     34           170     5,780    1,96,520     66,81,680
    Total      N = 100             3,100                0       25,400   1,21,200     1,59,75,200

\[ \therefore \bar{X} = \frac{\Sigma fX}{N} = \frac{3100}{100} = 31 \]

Now, we can calculate the moments about the mean as follows :

\[ \mu_1 = \frac{\Sigma fx}{N} = \frac{0}{100} = 0, \qquad \mu_2 = \frac{\Sigma fx^2}{N} = \frac{25400}{100} = 254 \]

\[ \mu_3 = \frac{\Sigma fx^3}{N} = \frac{121200}{100} = 1212, \qquad \mu_4 = \frac{\Sigma fx^4}{N} = \frac{15975200}{100} = 159752 \]

Therefore, the central moments are 0, 254, 1,212 and 1,59,752 respectively.
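For a frequency distribution the same calculation weights each deviation by its frequency. The sketch below assumes the class mid-values represent every item of the class; the function name `grouped_central_moments` is illustrative only.

```python
def grouped_central_moments(mid_values, frequencies):
    """First four moments about the mean for a frequency distribution."""
    n = sum(frequencies)                                   # N = sum of f
    mean = sum(f * x for x, f in zip(mid_values, frequencies)) / n
    # mu_r = sum(f * (X - mean)**r) / N for r = 1, 2, 3, 4
    return [
        sum(f * (x - mean) ** r for x, f in zip(mid_values, frequencies)) / n
        for r in range(1, 5)
    ]


mids = [5, 15, 25, 35, 45, 55, 65]        # mid-values of the marks classes
freqs = [10, 15, 25, 25, 10, 10, 5]       # number of candidates
print(grouped_central_moments(mids, freqs))  # [0.0, 254.0, 1212.0, 159752.0]
```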

(2) Short-cut Method : If the arithmetic mean is a fraction, it is difficult to calculate deviations (x) from the arithmetic mean. The short-cut method is used in such cases.

(i) Take any value as an arbitrary mean (A).

(ii) Calculate deviations (d) from A and calculate the first four moments in the same way as in the direct method.

These moments are called moments about an arbitrary origin and are represented by the Greek letter ν (read as nu). The formulae for these moments are :

\[ \nu_1 = \frac{\Sigma d}{N}, \qquad \nu_2 = \frac{\Sigma d^2}{N}, \qquad \nu_3 = \frac{\Sigma d^3}{N}, \qquad \nu_4 = \frac{\Sigma d^4}{N}, \qquad \text{where } d = X - A \]

In case of frequency distribution,

\[ \nu_1 = \frac{\Sigma fd}{N}, \qquad \nu_2 = \frac{\Sigma fd^2}{N}, \qquad \nu_3 = \frac{\Sigma fd^3}{N}, \qquad \nu_4 = \frac{\Sigma fd^4}{N} \]

After calculating the moments about an arbitrary origin, convert them into moments about the mean by using the following equations :

\[ \mu_1 = \nu_1 - \nu_1 = 0 \]

\[ \mu_2 = \nu_2 - \nu_1^2 = \sigma^2 \]

\[ \mu_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 \]

\[ \mu_4 = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4 \]

We can also calculate the moments about an arbitrary origin from the moments about the mean by the following relationships, where d = X̄ – A :

\[ \nu_1 = \mu_1 + d \]

\[ \nu_2 = \mu_2 + d^2 \]

\[ \nu_3 = \mu_3 + 3\mu_2 d + d^3 \]

\[ \nu_4 = \mu_4 + 4\mu_3 d + 6\mu_2 d^2 + d^4 \]

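Illustration : We are given the following set of numbers : 1, 3, 7, 9, 10. Calculate the first four moments about the origin 4.

Solution :

Calculation of First Four Moments about A = 4

    X        d = X – A    d²          d³           d⁴
    1        – 3          9           – 27         81
    3        – 1          1           – 1          1
    7        3            9           27           81
    9        5            25          125          625
    10       6            36          216          1,296
    N = 5    Σd = 10      Σd² = 80    Σd³ = 340    Σd⁴ = 2,084

\[ \nu_1 = \frac{\Sigma d}{N} = \frac{10}{5} = 2, \qquad \nu_2 = \frac{\Sigma d^2}{N} = \frac{80}{5} = 16 \]

\[ \nu_3 = \frac{\Sigma d^3}{N} = \frac{340}{5} = 68, \qquad \nu_4 = \frac{\Sigma d^4}{N} = \frac{2084}{5} = 416.8 \]

Therefore, the moments about the arbitrary origin 4 are 2, 16, 68 and 416.8 respectively.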

Illustration : Calculate the first four moments about the mean for the distribution of heights of the following 100 students.

    Heights (Inches)      61    64    67    70    73
    Number of Students    5     18    42    27    8

Solution :

Calculation of Central Moments (short-cut method), A = 67

    Heights X   No. of Students f   d = X – 67   fd      fd²    fd³       fd⁴
    61          5                   – 6          – 30    180    – 1,080   6,480
    64          18                  – 3          – 54    162    – 486     1,458
    67          42                  0            0       0      0         0
    70          27                  + 3          81      243    729       2,187
    73          8                   + 6          48      288    1,728     10,368
    Total       N = 100                          45      873    891       20,493

Now we can substitute the calculated values in the formulae :

\[ \nu_1 = \frac{\Sigma fd}{N} = \frac{45}{100} = 0.45, \qquad \nu_2 = \frac{\Sigma fd^2}{N} = \frac{873}{100} = 8.73 \]

\[ \nu_3 = \frac{\Sigma fd^3}{N} = \frac{891}{100} = 8.91, \qquad \nu_4 = \frac{\Sigma fd^4}{N} = \frac{20493}{100} = 204.93 \]

Moments about the mean can be calculated as follows :

\[ \mu_1 = \nu_1 - \nu_1 = 0.45 - 0.45 = 0 \]

\[ \mu_2 = \nu_2 - \nu_1^2 = \sigma^2 = 8.73 - (0.45)^2 = 8.73 - 0.2025 = 8.5275 \]

\[ \mu_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 = 8.91 - 3(8.73 \times 0.45) + 2(0.45)^3 = 8.91 - 11.7855 + 0.18225 = -2.6932 \]

\[ \mu_4 = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4 = 204.93 - 4 \times 8.91 \times 0.45 + 6 \times 8.73 \times (0.45)^2 - 3 \times (0.45)^4 \]

\[ = 204.93 - 16.038 + 10.60695 - 0.1230 = 199.3759 \]

Hence the central moments are 0, 8.5275, – 2.6932 and 199.3759.

If we are given the values of the central moments and are interested in finding the moments about an arbitrary origin (A = 67), we can calculate them as follows :

\[ \bar{X} = A + \frac{\Sigma fd}{N} = 67 + \frac{45}{100} = 67.45 \]

\[ d = \bar{X} - A = 67.45 - 67 = 0.45 \]

\[ \nu_1 = \mu_1 + d = 0 + 0.45 = 0.45 \]

\[ \nu_2 = \mu_2 + d^2 = 8.5275 + (0.45)^2 = 8.5275 + 0.2025 = 8.73 \]

\[ \nu_3 = \mu_3 + 3\mu_2 d + d^3 = -2.6932 + 3 \times 8.5275 \times 0.45 + (0.45)^3 = -2.6932 + 11.512125 + 0.091125 = 8.91 \]

\[ \nu_4 = \mu_4 + 4\mu_3 d + 6\mu_2 d^2 + d^4 = 199.3759 + 4 \times (-2.69325) \times 0.45 + 6 \times 8.5275 \times (0.45)^2 + (0.45)^4 \]

\[ = 199.3759 - 4.84785 + 10.36091 + 0.041006 = 204.93 \]

∴ Moments about the arbitrary origin (67) are : 0.45, 8.73, 8.91 and 204.93.
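The two sets of conversion equations are inverses of each other, which the following sketch checks on the heights example. The function names `raw_to_central` and `central_to_raw` are illustrative assumptions, not notation from the lesson.

```python
def raw_to_central(v1, v2, v3, v4):
    """Moments about an arbitrary origin (nu) -> moments about the mean (mu)."""
    mu1 = 0.0                                   # nu1 - nu1
    mu2 = v2 - v1 ** 2
    mu3 = v3 - 3 * v2 * v1 + 2 * v1 ** 3
    mu4 = v4 - 4 * v3 * v1 + 6 * v2 * v1 ** 2 - 3 * v1 ** 4
    return mu1, mu2, mu3, mu4


def central_to_raw(mu2, mu3, mu4, d):
    """Moments about the mean -> moments about origin A, with d = mean - A."""
    v1 = d                                      # mu1 + d, and mu1 = 0
    v2 = mu2 + d ** 2
    v3 = mu3 + 3 * mu2 * d + d ** 3
    v4 = mu4 + 4 * mu3 * d + 6 * mu2 * d ** 2 + d ** 4
    return v1, v2, v3, v4


# Heights example: moments about A = 67
central = raw_to_central(0.45, 8.73, 8.91, 204.93)
print([round(m, 4) for m in central])

# Going back with d = 67.45 - 67 = 0.45 recovers the original values
_, mu2, mu3, mu4 = central
print([round(v, 2) for v in central_to_raw(mu2, mu3, mu4, d=0.45)])
```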

(3) Step-Deviation Method : It is the most appropriate method to calculate central moments in problems of continuous frequency distributions with equal class-intervals. The step-deviation method is similar to the short-cut method. The only difference is that in the step-deviation method, we take out a common factor (C) from the deviations (d) which are taken from the assumed mean (A).

(i) Calculate deviations (d) from the arbitrary origin (A).

(ii) Divide all the values of d by a common factor C (usually the class-interval) to get the step deviations d′ = d/C, and find the sum of fd′.

(iii) Find the values of fd′², fd′³ and fd′⁴ and their aggregates.

(iv) Calculate the values of ν1, ν2, ν3 and ν4 by using the formulae :

\[ \nu_1 = \frac{\Sigma fd'}{N} \times C, \qquad \nu_2 = \frac{\Sigma fd'^2}{N} \times C^2, \qquad \nu_3 = \frac{\Sigma fd'^3}{N} \times C^3, \qquad \nu_4 = \frac{\Sigma fd'^4}{N} \times C^4 \]

(v) Convert the calculated moments about an arbitrary origin into moments about the mean with the help of these relationships :

\[ \mu_1 = \nu_1 - \nu_1 = 0 \]

\[ \mu_2 = \nu_2 - \nu_1^2 = \sigma^2 \]

\[ \mu_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 \]

\[ \mu_4 = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4 \]

Let us understand the computation of the first four moments by the step-deviation method with the help of an illustration.

Illustration : Calculate the first four moments about the mean with the help of moments about an assumed mean 35 from the following data :

    Class      Frequency
    0 – 10     4
    10 – 20    10
    20 – 30    21
    30 – 40    32
    40 – 50    21
    50 – 60    7
    60 – 70    5

Solution : Calculation of Moments about Arbitrary Mean (A = 35, C = 10)

    Class      Mid Point X   f     d′ = (X – 35)/10   fd′     fd′²   fd′³    fd′⁴
    0 – 10     5             4     – 3                – 12    36     – 108   324
    10 – 20    15            10    – 2                – 20    40     – 80    160
    20 – 30    25            21    – 1                – 21    21     – 21    21
    30 – 40    35            32    0                  0       0      0       0
    40 – 50    45            21    1                  21      21     21      21
    50 – 60    55            7     2                  14      28     56      112
    60 – 70    65            5     3                  15      45     135     405
    Total      N = 100                                – 3     191    3       1,043

First of all, we shall calculate the first four moments about the arbitrary mean by substituting the values.

\[ \nu_1 = \frac{\Sigma fd'}{N} \times C = \frac{-3}{100} \times 10 = -0.3, \qquad \nu_2 = \frac{\Sigma fd'^2}{N} \times C^2 = \frac{191}{100} \times 100 = 191 \]

\[ \nu_3 = \frac{\Sigma fd'^3}{N} \times C^3 = \frac{3}{100} \times 1000 = 30, \qquad \nu_4 = \frac{\Sigma fd'^4}{N} \times C^4 = \frac{1043}{100} \times 10000 = 104300 \]

From these we get the central moments as below :

\[ \mu_1 = \nu_1 - \nu_1 = -0.3 - (-0.3) = 0 \]

\[ \mu_2 = \nu_2 - \nu_1^2 = \sigma^2 = 191 - (-0.3)^2 = 191 - 0.09 = 190.91 \]

\[ \mu_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 = 30 - 3 \times 191 \times (-0.3) + 2(-0.3)^3 = 30 + 171.9 - 0.054 = 201.846 \]

\[ \mu_4 = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4 = 104300 - 4 \times 30 \times (-0.3) + 6 \times 191 \times (-0.3)^2 - 3(-0.3)^4 \]

\[ = 104300 + 36 + 103.14 - 0.0243 = 104439.12 \]
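The step-deviation calculation above can be sketched in Python as follows. The function name and its arguments are illustrative assumptions; the sketch scales each moment of the step deviations back by the matching power of the class-interval C before converting to central moments.

```python
def step_deviation_moments(mid_values, frequencies, assumed_mean, class_width):
    """Return (moments about A, moments about the mean) via step deviations."""
    n = sum(frequencies)
    # d' = (X - A) / C
    steps = [(x - assumed_mean) / class_width for x in mid_values]
    # nu_r = C**r * sum(f * d'**r) / N
    v1, v2, v3, v4 = (
        class_width ** r * sum(f * s ** r for s, f in zip(steps, frequencies)) / n
        for r in range(1, 5)
    )
    mu2 = v2 - v1 ** 2
    mu3 = v3 - 3 * v2 * v1 + 2 * v1 ** 3
    mu4 = v4 - 4 * v3 * v1 + 6 * v2 * v1 ** 2 - 3 * v1 ** 4
    return (v1, v2, v3, v4), (0.0, mu2, mu3, mu4)


mids = [5, 15, 25, 35, 45, 55, 65]
freqs = [4, 10, 21, 32, 21, 7, 5]
raw, central = step_deviation_moments(mids, freqs, assumed_mean=35, class_width=10)
print([round(v, 4) for v in raw])
print([round(m, 4) for m in central])
```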

Sheppard’s Correction for Grouping Errors

When data are grouped into frequency distributions, the individual values lose their identity. While calculating moments, it is assumed that the frequencies are concentrated at the mid-points of the classes of a continuous frequency distribution.

Let us understand this by assuming a class of 10 – 20 whose frequency is 20. To compute moments or the arithmetic mean, we always take the mid-point of the class 10 – 20, which is (10 + 20)/2 = 15. But in reality it is quite possible that more than half of the values in the class 10 – 20 are greater than 15. Due to this assumption, grouping errors enter into the calculation of moments. To remove these errors, W.F. Sheppard introduced some corrections which are known as Sheppard’s corrections. These are :

The first and third moments need no correction.

\[ \mu_2 \text{ (corrected)} = \mu_2 \text{ (uncorrected)} - \frac{i^2}{12}, \qquad \text{where } i \text{ is the class-interval} \]

\[ \mu_4 \text{ (corrected)} = \mu_4 \text{ (uncorrected)} - \frac{i^2}{2}\,\mu_2 \text{ (uncorrected)} + \frac{7}{240}\, i^4 \]

Sheppard’s corrections are applied only under these conditions :

(i) When the frequency distribution is continuous.

(ii) When the frequency tapers off to zero in both directions.

(iii) When the total frequency is not less than 1000.

(iv) When the frequency distribution is not J-shaped, U-shaped or skewed.

(v) When the class-interval is uniform.

Let us understand with the help of an example.

Illustration : Applying Sheppard’s corrections, find the corrected values of the moments where the class-interval is 10 and µ1 = 0, µ2 = 254, µ3 = 1,212 and µ4 = 1,59,752.

Solution : We are given the values of all four moments and the class-interval. µ1 and µ3 need no correction.

\[ \mu_2 \text{ (corrected)} = 254 - \frac{(10)^2}{12} = 254 - 8.333 = 245.667 \]

\[ \mu_4 \text{ (corrected)} = 159752 - \frac{(10)^2}{2} \times 254 + \frac{7}{240} \times (10)^4 = 159752 - 12700 + 291.667 = 147343.667 \]

Therefore, the corrected values of the four moments are 0, 245.667, 1,212 and 1,47,343.667 respectively.
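Sheppard’s corrections are simple enough to express as a small helper. The sketch below uses the standard correction terms i²/12 for the second moment and (i²/2)µ2 – (7/240)i⁴ for the fourth; the function name is an illustrative choice.

```python
def sheppard_corrections(mu2, mu4, class_width):
    """Return Sheppard-corrected second and fourth central moments."""
    i = class_width
    mu2_corrected = mu2 - i ** 2 / 12
    # The uncorrected mu2 is used inside the correction of mu4
    mu4_corrected = mu4 - (i ** 2 / 2) * mu2 + (7 / 240) * i ** 4
    return mu2_corrected, mu4_corrected


mu2_c, mu4_c = sheppard_corrections(mu2=254, mu4=159752, class_width=10)
print(round(mu2_c, 3), round(mu4_c, 3))  # 245.667 147343.667
```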

Coefficients based on Moments

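There are three sets of coefficients which are used in practice. They are the α (Alpha), β (Beta) and γ (Gamma) coefficients. These coefficients are calculated from the relationships between the various moments. The formulae are :

Alpha-Coefficients

\[ \alpha_1 = \frac{\mu_1}{\sigma} = 0, \qquad \alpha_2 = \frac{\mu_2}{\sigma^2} = 1, \qquad \alpha_3 = \frac{\mu_3}{\sigma^3} \quad \text{and} \quad \alpha_4 = \frac{\mu_4}{\sigma^4} \]

Beta-Coefficients

\[ \beta_1 = \frac{\mu_3^2}{\mu_2^3} \quad \text{and} \quad \beta_2 = \frac{\mu_4}{\mu_2^2} \]

Gamma-Coefficients

\[ \gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}} \quad \text{and} \quad \gamma_2 = \beta_2 - 3 \]

The Beta-Coefficients (β1 and β2) are used to find the skewness and kurtosis of a distribution. Let us take an illustration to understand these coefficients.

Illustration : The values of µ1, µ2, µ3 and µ4 are 0, 9.2, 3.6 and 122 respectively. Find the skewness and kurtosis of the distribution.

Solution : To comment upon the skewness and kurtosis of the distribution, we shall calculate the values of β1 and β2.

\[ \beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(3.6)^2}{(9.2)^3} = \frac{12.96}{778.688} = 0.0166 \]

\[ \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{122}{(9.2)^2} = \frac{122}{84.64} = 1.44 \]

Since µ3 is positive, the distribution is positively skewed, and since β2 is less than 3, the curve is platykurtic or flat at the top.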
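A short sketch of the beta coefficients, using the standard definitions β1 = µ3²/µ2³ and β2 = µ4/µ2², is given below. The function name and the wording of the kurtosis labels are illustrative assumptions.

```python
def beta_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2) computed from central moments."""
    beta1 = mu3 ** 2 / mu2 ** 3     # measure of skewness
    beta2 = mu4 / mu2 ** 2          # measure of kurtosis
    return beta1, beta2


def kurtosis_type(beta2):
    """Classify the curve by comparing beta2 with 3 (the normal-curve value)."""
    if beta2 < 3:
        return "platykurtic"
    if beta2 > 3:
        return "leptokurtic"
    return "mesokurtic"


b1, b2 = beta_coefficients(mu2=9.2, mu3=3.6, mu4=122)
print(round(b1, 4), round(b2, 2), kurtosis_type(b2))
```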

Illustration : Calculate the first four moments about an arbitrary origin. Convert them into moments about the mean. Applying Sheppard’s corrections, calculate the corrected moments and the beta coefficients from the following data :

Experience No. of Employees

(years)

0 – 1 15

1 – 2 22

2 – 3 45

3 – 4 35

4 – 5 30

5 – 6 20

6 – 7 16

7 – 8 10

8 – 9 5

9 – 10 2


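Solution : Let A = 4.5

    Experience (Years)   Mid Point X   f     d = X – A   fd      fd²    fd³     fd⁴
    0 – 1                0.5           15    – 4         – 60    240    – 960   3,840
    1 – 2                1.5           22    – 3         – 66    198    – 594   1,782
    2 – 3                2.5           45    – 2         – 90    180    – 360   720
    3 – 4                3.5           35    – 1         – 35    35     – 35    35
    4 – 5                4.5           30    0           0       0      0       0
    5 – 6                5.5           20    1           20      20     20      20
    6 – 7                6.5           16    2           32      64     128     256
    7 – 8                7.5           10    3           30      90     270     810
    8 – 9                8.5           5     4           20      80     320     1,280
    9 – 10               9.5           2     5           10      50     250     1,250
    Total                              200               – 139   957    – 961   9,993

We can find out

\[ \nu_1 = \frac{\Sigma fd}{N} = \frac{-139}{200} = -0.695, \qquad \nu_2 = \frac{\Sigma fd^2}{N} = \frac{957}{200} = 4.785 \]

\[ \nu_3 = \frac{\Sigma fd^3}{N} = \frac{-961}{200} = -4.805, \qquad \nu_4 = \frac{\Sigma fd^4}{N} = \frac{9993}{200} = 49.965 \]

The computed moments are moments about the arbitrary point 4.5. The central moments are calculated below :

\[ \mu_1 = \nu_1 - \nu_1 = -0.695 - (-0.695) = 0 \]

\[ \mu_2 = \nu_2 - \nu_1^2 = 4.785 - (-0.695)^2 = 4.785 - 0.483 = 4.302 \]

\[ \mu_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 = -4.805 - 3(4.785)(-0.695) + 2(-0.695)^3 = -4.805 + 9.9767 - 0.6714 = 4.500 \]

\[ \mu_4 = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4 = 49.965 - 4(-4.805)(-0.695) + 6 \times 4.785 \times (-0.695)^2 - 3(-0.695)^4 \]

\[ = 49.965 - 13.3579 + 13.8676 - 0.6999 = 49.775 \]

Applying Sheppard’s corrections (class-interval i = 1), we have

\[ \mu_2 \text{ (corrected)} = \mu_2 - \frac{i^2}{12} = 4.302 - \frac{1}{12} = 4.302 - 0.083 = 4.219 \]

\[ \mu_4 \text{ (corrected)} = \mu_4 - \frac{i^2}{2}\mu_2 + \frac{7}{240} i^4 = 49.775 - \frac{1}{2}(4.302) + \frac{7}{240} = 49.775 - 2.151 + 0.029 = 47.653 \]

Beta Constants

\[ \beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(4.500)^2}{(4.219)^3} = \frac{20.25}{75.10} = 0.27 \]

\[ \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{47.653}{(4.219)^2} = \frac{47.653}{17.80} = 2.68 \]

Therefore the central moments after correction are 0, 4.219, 4.500 and 47.653; β1 = 0.27 and β2 = 2.68.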


LESSON 7

SIMPLE CORRELATION

In the earlier chapters we discussed univariate distributions and highlighted their important characteristics by different statistical techniques. A univariate distribution is the study of one variable only. We may, however, come across certain series where each item of the series assumes the values of two or more variables. A distribution in which each unit of the series assumes two values is called a bivariate distribution. In a bivariate distribution, we are interested in finding out whether there is any relationship between the two variables. Correlation is a statistical technique which studies the relationship between two or more variables, and correlation analysis involves the various methods and techniques used for studying and measuring the extent of that relationship. When two variables are related in such a way that a change in the value of one is accompanied either by a direct change or by an inverse change in the value of the other, the two variables are said to be correlated. In correlated variables, an increase in one variable is accompanied by an increase or a decrease in the other variable. For instance, a relationship exists between the price and demand of a commodity because, keeping other things equal, an increase in the price of a commodity causes a decrease in the demand for that commodity. A relationship might also exist between the heights and weights of students, and between the amount of rainfall in a city and the sales of raincoats in that city.

The following are some of the important definitions of correlation.

Croxton and Cowden say, “When the relationship is of a quantitative nature, the appropriate

statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as

correlation”.

A.M. Tuttle says, “Correlation is an analysis of the covariation between two or more variables.”

W.A. Neiswanger says, “Correlation analysis contributes to the understanding of economic behavior,

aids in locating the critically important variables on which others depend, may reveal to the economist the

connections by which disturbances spread and suggest to him the paths through which stabilizing forces may

become effective.”

L.R. Conner says, “If two or more quantities vary in sympathy so that movements in one tend to be accompanied by corresponding movements in the others, then they are said to be correlated.”

Utility of Correlation

The study of correlation is very useful in practical life as revealed by these points.

1. With the help of correlation analysis, we can measure in one figure, the degree of relationship

existing between variables like price, demand, supply, income, expenditure etc. Once we know

that two variables are correlated then we can easily estimate the value of one variable, given the

value of other.

2. Correlation analysis is of great use to economists and businessmen. It reveals to the economist the disturbing factors and suggests to him the stabilizing forces. In business, it enables the executive to estimate costs, sales etc. and plan accordingly.

3. Correlation analysis is helpful to scientists, since nature has been found to be a multiplicity of inter-related forces.

Difference between Correlation and Causation


The term correlation should not be misunderstood as causation. If correlation exists between two variables, it must not be assumed that a change in one variable is the cause of a change in the other variable. In simple words, a change in one variable may be associated with a change in another variable, but this change need not necessarily be the cause of the change in the other variable. When there is no cause and effect relationship between two variables but a correlation is found between them, such correlation is known as “spurious correlation” or “nonsense correlation”. Correlation may exist due to the following :

1. Pure chance correlation : This happens in a small sample. Correlation may exist between the incomes and weights of four persons although there may be no cause and effect relationship between the incomes and weights of people. This type of correlation may arise due to pure random sampling variation or because of the bias of the investigator in selecting the sample.

2. When the correlated variables are influenced by one or more other variables : A high degree of correlation between the variables may exist where the same cause is affecting each variable, or different causes are affecting each with the same effect. For instance, a degree of correlation may be found between the yield per acre of rice and of tea because both are related to the amount of rainfall, but neither of the two variables is the cause of the other.

3. When the variables mutually influence each other so that neither can be called the cause of the other : At times it may be difficult to say which of the two variables is the cause and which is the effect, because both may be reacting on each other.

Types of Correlation

Correlation can be categorised as one of the following :

(i) Positive and Negative,

(ii) Simple and Multiple.

(iii) Partial and Total.

(iv) Linear and Non-Linear (Curvilinear)

(i) Positive and Negative Correlation : Positive or direct correlation refers to the movement of variables in the same direction. The correlation is said to be positive when an increase (decrease) in the value of one variable is accompanied by an increase (decrease) in the value of the other variable. Negative or inverse correlation refers to the movement of the variables in opposite directions. Correlation is said to be negative if an increase (decrease) in the value of one variable is accompanied by a decrease (increase) in the value of the other.

(ii) Simple and Multiple Correlation : Under simple correlation, we study the relationship between two variables only, e.g., between the yield of wheat and the amount of rainfall, or between the demand and supply of a commodity. In case of multiple correlation, the relationship is studied among three or more variables. For example, the yield of wheat may be studied in relation to both chemical fertilizers and pesticides.

(iii) Partial and Total Correlation : These are the two categories of multiple correlation analysis. Under partial correlation, the relationship of two or more variables is studied in such a way that only one dependent variable and one independent variable are considered and all others are kept constant. For example, the coefficient of correlation between the yield of wheat and chemical fertilizers, excluding the effects of pesticides and manures, is called partial correlation. Total correlation is based upon all the variables.

(iv) Linear and Non-Linear Correlation : When the amount of change in one variable tends to keep a constant ratio to the amount of change in the other variable, the correlation is said to be linear. But if the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, the correlation is said to be non-linear. The distinction between linear and non-linear correlation is based upon the consistency of the ratio of change between the variables.

Methods of Studying Correlation

There are different methods which help us to find out whether the variables are related or not.

1. Scatter Diagram Method.

2. Graphic Method.

3. Karl Pearson’s Coefficient of correlation.

4. Rank Method.

We shall discuss these methods one by one.

(1) Scatter Diagram : A scatter diagram is drawn to visualise the relationship between two variables. The values of the independent variable are plotted on the X-axis while the values of the dependent variable are plotted on the Y-axis. On the graph, dots are plotted to represent the different pairs of data. When dots have been plotted to represent all the pairs, we get a scatter diagram. The way the dots scatter gives an indication of the kind of relationship which exists between the two variables. While drawing a scatter diagram, it is not necessary to take the zero values of the X and Y variables at the point of origin; the minimum values of the variables considered may be taken instead.

When there is a positive correlation between the variables, the dots on the scatter diagram run from

left hand bottom to the right hand upper corner. In case of perfect positive correlation all the dots will lie on

a straight line.

When a negative correlation exists between the variables, dots on the scatter diagram run from the

upper left hand corner to the bottom right hand corner. In case of perfect negative correlation, all the dots lie

on a straight line.

If a scatter diagram is drawn and no path is formed, there is no correlation. Students are advised to

prepare two scatter diagrams on the basis of the following data :

(i) Data for the first Scatter Diagram :

Demand Schedule

    Price (Rs.)    Commodity Demand (units)
    6              180
    7              150
    8              130
    9              120
    10             125

(ii) Data for the second Scatter Diagram :

Supply Schedule

    Price (Rs.)    Commodity Supply
    50             2,000
    51             2,100
    52             2,200
    53             2,500
    54             3,000
    55             3,800
    56             4,700

Students will find that the first diagram indicates a negative correlation, whereas the second diagram reveals a positive correlation.

(2) Graphic Method. In this method the individual values of the two variables are plotted on graph paper. Two curves are thus obtained, one for the X variable and another for the Y variable.

Interpreting Graph

The graph is interpreted as follows :

(i) If both the curves run parallel or nearly parallel, or move in the same direction, there is a positive correlation.

(ii) On the other hand, if both the curves move in opposite directions, there is a negative correlation.

Illustration : Show correlation from the following data by the graphic method :

    Year    Average Income (Rs.)    Average Expenditure (Rs.)
    1995    100                     90
    1996    110                     95
    1997    125                     100
    1998    140                     120
    1999    150                     120
    2000    180                     140
    2001    200                     150
    2002    220                     170
    2003    250                     200
    2004    360                     260

Solution :

The graph prepared shows that income and expenditure have a close positive correlation. As income

increases, the expenditure also increases.

(3) Karl Pearson’s Co-efficient of Correlation. Karl Pearson’s method, popularly known as the Pearsonian co-efficient of correlation, is the most widely applied in practice to measure correlation. The Pearsonian co-efficient of correlation is represented by the symbol r.

According to Karl Pearson’s method, the co-efficient of correlation between the variables is obtained by dividing the sum of the products of the corresponding deviations of the various items of the two series from their respective means by the product of their standard deviations and the number of pairs of observations. Symbolically,

\[ r = \frac{\Sigma xy}{N \sigma_x \sigma_y} \qquad \ldots \text{(i)} \]

where r stands for the coefficient of correlation; x1, x2, x3, ..., xn are the deviations of the various items of the first variable from its mean; y1, y2, y3, ..., yn are the deviations of the items of the second variable from its mean; Σxy is the sum of the products of these corresponding deviations; N stands for the number of pairs; σx stands for the standard deviation of the X variable and σy stands for the standard deviation of the Y variable.

\[ \sigma_x = \sqrt{\frac{\Sigma x^2}{N}} \quad \text{and} \quad \sigma_y = \sqrt{\frac{\Sigma y^2}{N}} \]

If we substitute the values of σx and σy in the above formula for computing r, we get

\[ r = \frac{\Sigma xy}{N \sqrt{\dfrac{\Sigma x^2}{N}} \sqrt{\dfrac{\Sigma y^2}{N}}} \quad \text{or} \quad r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}} \]

The degree of correlation varies between + 1 and – 1; the result will be + 1 in case of perfect positive correlation and – 1 in case of perfect negative correlation.

Computation of the correlation coefficient can be simplified by dividing the given data by a common factor. In such a case, the final result is not multiplied by the common factor because the coefficient of correlation is independent of change of scale and origin.

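Illustration : Calculate the co-efficient of correlation from the following data :

    X    50    100    150    200    250    300    350
    Y    10    20     30     40     50     60     70

Solution : Here X̄ = 200 and Ȳ = 40. The deviations of X are divided by the common factor 50 and those of Y by the common factor 10.

    X      X – 200   x        x²         Y     Y – 40   y        y²         xy
    50     – 150     – 3      9          10    – 30     – 3      9          9
    100    – 100     – 2      4          20    – 20     – 2      4          4
    150    – 50      – 1      1          30    – 10     – 1      1          1
    200    0         0        0          40    0        0        0          0
    250    + 50      + 1      1          50    + 10     + 1      1          1
    300    + 100     + 2      4          60    + 20     + 2      4          4
    350    + 150     + 3      9          70    + 30     + 3      9          9
                     Σx = 0   Σx² = 28                  Σy = 0   Σy² = 28   Σxy = 28

\[ r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}} \]

By substituting the values we get

\[ r = \frac{28}{\sqrt{28 \times 28}} = \frac{28}{28} = 1 \]

Hence there is perfect positive correlation.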

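Illustration : A sample of five items is taken from the production of a firm. The length and weight of the five items are given below :

    Length (inches)    3    4     6     7     10
    Weight (ounces)    9    11    14    15    16

Calculate Karl Pearson’s correlation co-efficient between length and weight and interpret the value of the correlation coefficient.

Solution :

\[ \bar{X} = \frac{\Sigma X}{N} = \frac{30}{5} = 6 \quad \text{and} \quad \bar{Y} = \frac{\Sigma Y}{N} = \frac{65}{5} = 13 \]

    X         x        x²         Y         y        y²         xy
    3         – 3      9          9         – 4      16         12
    4         – 2      4          11        – 2      4          4
    6         0        0          14        + 1      1          0
    7         + 1      1          15        + 2      4          2
    10        + 4      16         16        + 3      9          12
    ΣX = 30   Σx = 0   Σx² = 30   ΣY = 65   Σy = 0   Σy² = 34   Σxy = 30

\[ r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}}, \quad \text{where } \Sigma xy = 30,\ \Sigma x^2 = 30 \text{ and } \Sigma y^2 = 34 \]

\[ r = \frac{30}{\sqrt{30 \times 34}} = \frac{30}{\sqrt{1020}} = \frac{30}{31.937} = 0.94 \]

The value of r indicates that there exists a high degree of positive correlation between lengths and weights.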
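Karl Pearson’s formula in terms of deviations from the actual means, r = Σxy / √(Σx² × Σy²), can be sketched in Python as below. The function name `pearson_r` is an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviations taken about the actual means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]            # x = X - mean of X
    dev_y = [y - mean_y for y in ys]            # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_x2 = sum(a * a for a in dev_x)
    sum_y2 = sum(b * b for b in dev_y)
    return sum_xy / sqrt(sum_x2 * sum_y2)


lengths = [3, 4, 6, 7, 10]
weights = [9, 11, 14, 15, 16]
print(round(pearson_r(lengths, weights), 2))  # 0.94
```

Applied to the earlier series X = 50, 100, ..., 350 and Y = 10, 20, ..., 70, the same function returns 1, the perfect positive correlation found there.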

Illustration : From the following data, compute the co-efficient of correlation between X and Y :

                                          X Series    Y Series
    Number of items                       15          15
    Arithmetic Mean                       25          18
    Sum of squares of deviations          136         138
    from Mean

Summation of products of deviations of X and Y from their Arithmetic Means = 122

Solution : Denoting deviations of X and Y from their arithmetic means by x and y respectively, the given data are : Σx² = 136, Σxy = 122 and Σy² = 138.

\[ r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}} = \frac{122}{\sqrt{136 \times 138}} = \frac{122}{136.996} = 0.89 \]

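Short-cut Method : To avoid difficult calculations when the means are fractions, deviations are taken from assumed means while calculating the coefficient of correlation. The formula is also modified for the standard deviations because the deviations are taken from assumed means. Karl Pearson’s formula for the short-cut method is given below :

\[ r = \frac{\Sigma dxdy - \dfrac{\Sigma dx \times \Sigma dy}{N}}{\sqrt{\Sigma dx^2 - \dfrac{(\Sigma dx)^2}{N}} \sqrt{\Sigma dy^2 - \dfrac{(\Sigma dy)^2}{N}}} \]

or

\[ r = \frac{N\Sigma dxdy - \Sigma dx \times \Sigma dy}{\sqrt{N\Sigma dx^2 - (\Sigma dx)^2} \sqrt{N\Sigma dy^2 - (\Sigma dy)^2}} \]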

Illustration : Compute the coefficient of correlation from the following data :

    Marks in Statistics     20    30    28    17    19    23    35    13    16    38
    Marks in Mathematics    18    35    20    18    25    28    33    18    20    40

Solution :

    Marks in        dx = X – 30   dx²    Marks in    dy = Y – 30   dy²    dxdy
    Statistics X                         Maths Y
    20              – 10          100    18          – 12          144    + 120
    30              0             0      35          + 5           25     0
    28              – 2           4      20          – 10          100    + 20
    17              – 13          169    18          – 12          144    + 156
    19              – 11          121    25          – 5           25     + 55
    23              – 7           49     28          – 2           4      + 14
    35              + 5           25     33          + 3           9      + 15
    13              – 17          289    18          – 12          144    + 204
    16              – 14          196    20          – 10          100    + 140
    38              + 8           64     40          + 10          100    + 80
    N = 10          – 61          1,017              – 45          795    804

\[ r = \frac{N\Sigma dxdy - \Sigma dx \times \Sigma dy}{\sqrt{N\Sigma dx^2 - (\Sigma dx)^2} \sqrt{N\Sigma dy^2 - (\Sigma dy)^2}} \]

where dx = deviations of the X series from an assumed mean 30,

dy = deviations of the Y series from an assumed mean 30,

Σdx² = sum of the squares of the deviations of the X series from the assumed mean,

Σdy² = sum of the squares of the deviations of the Y series from the assumed mean,

Σdxdy = sum of the products of the deviations of the X and Y series from their assumed means.

\[ \therefore r = \frac{10 \times 804 - (-61)(-45)}{\sqrt{10 \times 1017 - (-61)^2} \sqrt{10 \times 795 - (-45)^2}} = \frac{8040 - 2745}{\sqrt{6449} \sqrt{5925}} \]

\[ \text{or } r = \frac{5295}{80.31 \times 76.97} = \frac{5295}{6181.45} = 0.857 \]

Direct Method of Computing Correlation Coefficient

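The correlation coefficient can also be computed directly from the given X and Y values by using the following formula :

\[ r = \frac{N\Sigma XY - \Sigma X \times \Sigma Y}{\sqrt{N\Sigma X^2 - (\Sigma X)^2} \sqrt{N\Sigma Y^2 - (\Sigma Y)^2}} \]

This formula gives the same answer as we get by taking deviations from the actual mean or from an arbitrary mean.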

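Illustration : Compute the coefficient of correlation from the following data :

    Marks in Statistics     20    30    28    17    19    23    35    13    16    38
    Marks in Mathematics    18    35    20    18    25    28    33    18    20    40

Solution :

    Marks in        Marks in
    Statistics X    Mathematics Y    X²            Y²            XY
    20              18               400           324           360
    30              35               900           1,225         1,050
    28              20               784           400           560
    17              18               289           324           306
    19              25               361           625           475
    23              28               529           784           644
    35              33               1,225         1,089         1,155
    13              18               169           324           234
    16              20               256           400           320
    38              40               1,444         1,600         1,520
    ΣX = 239        ΣY = 255         ΣX² = 6,357   ΣY² = 7,095   ΣXY = 6,624

Substitute the computed values in the formula given below :

\[ r = \frac{N\Sigma XY - \Sigma X \times \Sigma Y}{\sqrt{N\Sigma X^2 - (\Sigma X)^2} \sqrt{N\Sigma Y^2 - (\Sigma Y)^2}} \]

\[ = \frac{10 \times 6624 - 239 \times 255}{\sqrt{10 \times 6357 - (239)^2} \sqrt{10 \times 7095 - (255)^2}} = \frac{66240 - 60945}{\sqrt{6449} \sqrt{5925}} \]

\[ = \frac{5295}{6181.45} = 0.857 \]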
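The direct formula works on the raw sums ΣX, ΣY, ΣX², ΣY² and ΣXY. The sketch below (function name illustrative) also checks that shifting both series by an assumed mean of 30 leaves r unchanged, which is why the short-cut method and the direct method agree.

```python
from math import sqrt


def pearson_r_direct(xs, ys):
    """Pearson's r from raw sums, without computing the means first."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


statistics_marks = [20, 30, 28, 17, 19, 23, 35, 13, 16, 38]
maths_marks = [18, 35, 20, 18, 25, 28, 33, 18, 20, 40]

r_direct = pearson_r_direct(statistics_marks, maths_marks)
# Deviations from the assumed mean 30 (short-cut method) give the same r
r_shortcut = pearson_r_direct(
    [x - 30 for x in statistics_marks], [y - 30 for y in maths_marks]
)
print(round(r_direct, 3), round(r_shortcut, 3))
```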

Coefficient of Correlation in a Continuous Series

In the case of a continuous series, we assume that every item which falls within a given class-interval falls exactly at the middle of that class. Because of the presence of frequencies, the formula is modified as follows :

\[ r = \frac{N\Sigma fdxdy - \Sigma fdx \times \Sigma fdy}{\sqrt{N\Sigma fdx^2 - (\Sigma fdx)^2} \sqrt{N\Sigma fdy^2 - (\Sigma fdy)^2}} \]

The various values are calculated as follows :

(i) Take the step deviations of variable X and denote them as dx.

(ii) Take the step deviations of variable Y and denote them as dy.

(iii) Multiply dx, dy and the respective frequency of each cell and write the figure obtained in the right-hand upper corner of each cell.

(iv) Add all the cornered values calculated in step (iii) to get Σfdxdy.

(v) Multiply the frequencies of the variable X by the deviations of X and add to get Σfdx.

(vi) Take the squares of the deviations of the variable X and multiply them by the respective frequencies to get Σfdx².

(vii) Multiply the frequencies of the variable Y by the deviations of Y and add to get Σfdy.

(viii) Take the squares of the deviations of the variable Y and multiply them by the respective frequencies to get Σfdy².

(ix) Now substitute the values of Σfdxdy, Σfdx, Σfdx², Σfdy and Σfdy² in the formula to get the value of r.

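Illustration : The following table gives the ages of husbands and wives at the time of their marriages. Calculate the correlation coefficient between the ages of husbands and wives.

                                    Ages of Husbands
    Ages of Wives    20 – 30    30 – 40    40 – 50    50 – 60    60 – 70    Total
    15 – 25          5          9          3          -          -          17
    25 – 35          -          10         25         2          -          37
    35 – 45          -          1          12         2          -          15
    45 – 55          -          -          4          16         5          25
    55 – 65          -          -          -          4          2          6
    Total            5          20         44         24         7          100

Solution : Taking step deviations dx = (X – 45)/10 of the mid-values of husbands’ ages and dy = (Y – 40)/10 of the mid-values of wives’ ages, we get N = 100, Σfdx = 8, Σfdx² = 92, Σfdy = – 34, Σfdy² = 154 and Σfdxdy = 88.

\[ r = \frac{N\Sigma fdxdy - \Sigma fdx \times \Sigma fdy}{\sqrt{N\Sigma fdx^2 - (\Sigma fdx)^2} \sqrt{N\Sigma fdy^2 - (\Sigma fdy)^2}} = \frac{100 \times 88 - 8 \times (-34)}{\sqrt{100 \times 92 - (8)^2} \sqrt{100 \times 154 - (-34)^2}} \]

\[ = \frac{9072}{\sqrt{9136} \sqrt{14244}} = \frac{9072}{11407.6} = 0.795 \]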
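For a two-way frequency table, r can be computed by treating every cell as f pairs located at the class mid-values. The sketch below works with the mid-values themselves rather than step deviations, relying on the fact that r is independent of change of origin and scale; the function name and the table layout (rows for wives, columns for husbands, blank cells as 0) are illustrative assumptions.

```python
from math import sqrt


def grouped_pearson_r(x_mids, y_mids, table):
    """r for a bivariate frequency table.

    table[i][j] is the frequency of the cell whose Y mid-value is
    y_mids[i] and whose X mid-value is x_mids[j].
    """
    cells = [
        (x, y, f)
        for y, row in zip(y_mids, table)
        for x, f in zip(x_mids, row)
    ]
    n = sum(f for _, _, f in cells)
    sum_x = sum(f * x for x, _, f in cells)
    sum_y = sum(f * y for _, y, f in cells)
    sum_x2 = sum(f * x * x for x, _, f in cells)
    sum_y2 = sum(f * y * y for _, y, f in cells)
    sum_xy = sum(f * x * y for x, y, f in cells)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


husband_mids = [25, 35, 45, 55, 65]
wife_mids = [20, 30, 40, 50, 60]
ages = [
    [5, 9, 3, 0, 0],
    [0, 10, 25, 2, 0],
    [0, 1, 12, 2, 0],
    [0, 0, 4, 16, 5],
    [0, 0, 0, 4, 2],
]
print(round(grouped_pearson_r(husband_mids, wife_mids, ages), 3))
```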

Properties of Coefficient of Correlation

Following are some of the important properties of r :

(1) The coefficient of correlation lies between – 1 and + 1 (– 1 ≤ r ≤ + 1).

(2) The coefficient of correlation is independent of change of scale and origin of the variables X and Y.

(3) The coefficient of correlation is the geometric mean of the two regression coefficients :

\[ r = \pm\sqrt{b_{xy} \times b_{yx}} \]

Merits of Pearson’s coefficient of correlation : The coefficient of correlation summarizes in one figure the degree and direction of correlation. Its value varies between + 1 and – 1.

Demerits of Pearson’s coefficient of correlation : It always assumes a linear relationship between the variables; in fact the assumption may be wrong. Secondly, it is not easy to interpret the significance of the correlation coefficient. The method is also time consuming and is affected by extreme items.

Probable Error of the coefficient of correlation : It is calculated to find out how far Pearson’s coefficient of correlation is reliable in a particular case.

\[ \text{P.E. of coefficient of correlation} = 0.6745 \times \frac{1 - r^2}{\sqrt{N}} \]

where r = coefficient of correlation and N = number of pairs of items.

If the probable error is added to and subtracted from the coefficient of correlation, it gives the limits within which we can expect the value of the coefficient of correlation to vary.

If r is less than the probable error, there is no real evidence of correlation.

If r is more than 6 times the probable error, the coefficient of correlation is considered highly significant.

If r is more than 3 times the probable error but less than 6 times, the correlation is considered significant but not highly significant.

If r is more than the probable error but less than 3 times the probable error, nothing definite can be concluded.
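The probable error rules above can be sketched as follows. The values r = 0.8 and N = 25 are hypothetical and chosen only for illustration; comparing the absolute value of r with the probable error is also an assumption made here so that negative coefficients are handled in the same way.

```python
from math import sqrt


def probable_error(r, n):
    """P.E. = 0.6745 * (1 - r**2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


def interpret(r, n):
    """Apply the rules of thumb that compare r with its probable error."""
    pe = probable_error(r, n)
    size = abs(r)                       # assumption: the sign of r is ignored
    if size < pe:
        return "no real evidence of correlation"
    if size > 6 * pe:
        return "highly significant"
    if size > 3 * pe:
        return "significant but not highly significant"
    return "nothing definite can be concluded"


pe = probable_error(0.8, 25)            # hypothetical r and N
print(round(pe, 4), interpret(0.8, 25))
print("limits:", round(0.8 - pe, 4), "to", round(0.8 + pe, 4))
```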

(4) Rank Correlation : There are many problems of business and industry where it is not possible to measure the variable under consideration quantitatively, or where the statistical series is composed of items which cannot be exactly measured. For instance, it may be possible for two judges to rank six different brands of cigarettes in terms of taste, whereas it may be difficult to give them a numerical grade in terms of taste. In such problems, Spearman’s coefficient of rank correlation is used. The formula for rank correlation is :

\[ \rho = 1 - \frac{6\Sigma D^2}{N(N^2 - 1)} \quad \text{or} \quad \rho = 1 - \frac{6\Sigma D^2}{N^3 - N} \]

where ρ stands for the rank coefficient of correlation,

D refers to the difference of ranks between paired items,

N refers to the number of paired observations.

The value of the rank correlation coefficient varies between + 1 and – 1. When ρ = + 1, there is complete agreement in the order of ranks and the ranks are in the same order. When ρ = – 1, the ranks are in opposite directions, showing complete disagreement in the order of ranks. Let us understand this with the help of an illustration.

Illustration : The ranks of 10 individuals at the start and at the finish of a course of training are given :

    Individual  :   A    B    C    D    E    F    G    H     I    J
    Rank before :   1    6    3    9    5    2    7    10    8    4
    Rank after  :   6    8    3    7    2    1    5    9     4    10

Calculate the coefficient of correlation.

Solution :

    Individual    Rank before R1    Rank after R2    D = R1 – R2    D²
    A             1                 6                – 5            25
    B             6                 8                – 2            4
    C             3                 3                0              0
    D             9                 7                2              4
    E             5                 2                3              9
    F             2                 1                1              1
    G             7                 5                2              4
    H             10                9                1              1
    I             8                 4                4              16
    J             4                 10               – 6            36
    N = 10                                                          ΣD² = 100

By applying the formula,

\[ \rho = 1 - \frac{6\Sigma D^2}{N^3 - N} = 1 - \frac{6 \times 100}{10^3 - 10} = 1 - \frac{600}{990} = 1 - 0.606 = 0.394 \]

When we are given the actual data and not the ranks, it becomes necessary for us to assign the

ranks. Ranks can be assigned by taking either the highest value as one or the lowest value as one. But if we

start by taking the highest value or the lowest value we must follow the same order for both the variables to

assign ranks.

Illustration : Calculate rank correlation from the following data :

    X :   17    13    15    16    6     11    14    9     7    12
    Y :   36    46    35    24    12    18    27    22    2    8

Solution : Calculation of Rank Correlation

    X      R1 (Rank)    Y     R2 (Rank)    D = R1 – R2    D²
    17     1            36    2            – 1            1
    13     5            46    1            + 4            16
    15     3            35    3            0              0
    16     2            24    5            – 3            9
    6      10           12    8            + 2            4
    11     7            18    7            0              0
    14     4            27    4            0              0
    9      8            22    6            + 2            4
    7      9            2     10           – 1            1
    12     6            8     9            – 3            9
    N = 10                                                ΣD² = 44

The rank correlation coefficient is calculated as follows :

\[ \rho = 1 - \frac{6\Sigma D^2}{N^3 - N} = 1 - \frac{6 \times 44}{10^3 - 10} \]

\[ \rho = 1 - \frac{264}{990} = 1 - 0.267 = 0.733 \]
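When actual values are given instead of ranks, the ranking and the formula ρ = 1 – 6ΣD² / (N³ – N) can be combined as in the sketch below. It assumes there are no tied values and gives rank 1 to the highest value in both series; the function names are illustrative.

```python
def ranks_descending(values):
    """Rank 1 goes to the highest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rank correlation for data without tied ranks."""
    rank_x, rank_y = ranks_descending(xs), ranks_descending(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))   # sum of D squared
    return 1 - 6 * sum_d2 / (n ** 3 - n)


x = [17, 13, 15, 16, 6, 11, 14, 9, 7, 12]
y = [36, 46, 35, 24, 12, 18, 27, 22, 2, 8]
print(round(spearman_rho(x, y), 3))  # 0.733
```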

In some cases it becomes necessary to give two or more items an identical rank. In such cases, it is customary to give each item an average rank. Therefore, if two items are equal for the 4th and 5th ranks, each item shall be ranked 4.5, i.e., (4 + 5)/2. It means that where two or more items are to be ranked equal, the rank assigned for purposes of calculating the coefficient of correlation is the average of the ranks which these items would have got had they differed slightly from each other. When equal ranks are assigned to some items, the rank correlation formula is also adjusted. The adjustment consists of adding (1/12)(m³ – m) to the value of ΣD² for each group of tied items, where m stands for the number of items whose ranks are identical.

\[ \rho = 1 - \frac{6\left[\Sigma D^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m) + \ldots\right]}{N^3 - N} \]

Let us take an example to understand this.

Illustration : Compute the rank correlation coefficient from the following data:

Section A : 115 109 112 87 98 98 120 100 98 118

Section B : 75 73 85 70 76 65 82 73 68 80

Solution : Computation of Rank correlation coefficient

Series A   Ranks R1   Series B   Ranks R2   D = (R1 – R2)   D²

115         8          75         6           2              4

109         6          73         4.5         1.5            2.25

112         7          85        10         – 3              9

87          1          70         3         – 2              4

98          3          76         7         – 4             16

98          3          65         1           2              4

120        10          82         9           1              1

100         5          73         4.5         0.5            0.25

98          3          68         2           1              1

118         9          80         8           1              1

N = 10                                            ΣD² = 42.50

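Apply the formula to calculate the rank correlation :

$$\rho = 1 - \frac{6\left\{\sum D^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m)\right\}}{N^3 - N}$$

Item 98 is repeated three times in series A. Hence m = 3. In series B the item 73 is repeated two times and so m = 2.

$$\rho = 1 - \frac{6\left\{42.5 + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2)\right\}}{10^3 - 10} = 1 - \frac{6(42.5 + 2 + 0.5)}{990}$$

$$\rho = 1 - \frac{270}{990} = 1 - 0.273 = 0.727$$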
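The treatment of tied ranks can be sketched in Python. The sketch assumes the standard correction of (m³ – m)/12 added to ΣD² for every group of m tied values, with ranks assigned from the lowest value upwards as in the table above; all function names are illustrative.

```python
from collections import Counter


def average_ranks(values):
    """Rank ascending (lowest = 1); tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1          # first rank position occupied by v
        count = ordered.count(v)              # number of items tied at v
        rank_of[v] = first + (count - 1) / 2  # mean of first .. first+count-1
    return [rank_of[v] for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def rho_with_ties(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * adjusted / (n ** 3 - n)


section_a = [115, 109, 112, 87, 98, 98, 120, 100, 98, 118]
section_b = [75, 73, 85, 70, 76, 65, 82, 73, 68, 80]

rho = rho_with_ties(section_a, section_b)
print(round(rho, 3))  # 0.727
```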

REGRESSION ANALYSIS

The statistical technique of correlation establishes the degree and direction of relationship between two or more variables. But we may be interested in estimating the value of an unknown variable on the basis of a known variable. If we know the index of money supply and the price-level, we can find out the degree and direction of relationship between these indices with the help of the correlation technique. But the regression technique helps us in determining what the general price-level would be assuming a fixed supply of money. Similarly, if we know that the price and demand of a commodity are correlated, we can find out the demand for that commodity for a fixed price. Hence, the statistical tool with the help of which we can estimate or predict the unknown variable from the known variable is called regression. The meaning of the term “Regression” is the act of returning or going back. This term was first used by Sir Francis Galton in 1877 when he studied the relationship between the heights of fathers and sons. His study revealed a very interesting relationship. Tall fathers tend to have tall sons and short fathers short sons, but the average height of the sons of a group of tall fathers was less than that of the fathers and the average height of the sons of a group of short fathers was greater than that of the fathers. The line describing this tendency of going back is called the “Regression Line”. Modern writers have started to use the term estimating line instead of regression line because the expression estimating line is clearer in character. According to Morris Myers Blair, regression is the measure of the average relationship between two or more variables in terms of the original units of the data.

Regression analysis is a branch of statistical theory which is widely used in all the scientific disciplines. It is a basic technique for measuring or estimating the relationship among economic variables that constitute the essence of economic theory and economic life. The uses of regression analysis are not confined to economic and business activities. Its applications extend to almost all the natural, physical and social sciences. The regression technique can be extended to three or more variables but we shall limit ourselves to problems having two variables in this lesson.

Regression analysis is of great practical use, even more than correlation analysis. Some of the uses of regression analysis are given below :

(i) Regression analysis helps in establishing a functional relationship between two or more variables. Once this is established it can be used for various analytic purposes.

(ii) With the use of electronic machines and computers, the tedium of calculating regression equations, particularly those expressing multiple and non-linear relations, has been reduced considerably.

(iii) Since most of the problems of economic analysis are based on cause and effect relationships, regression analysis is a highly valuable tool in economic and business research.

(iv) Regression analysis is very useful for prediction purposes. Once a functional relationship is established, the value of the dependent variable can be estimated from the given value of the independent variables.

Difference between Correlation and Regression


Both the techniques are directed towards a common purpose of establishing the degree and direction of relationship between two or more variables, but the methods of doing so are different. The choice of one or the other will depend on the purpose. If the purpose is to know the degree and direction of relationship, correlation is an appropriate tool, but if the purpose is to estimate a dependent variable with the substitution of one or more independent variables, regression analysis shall be more helpful. The points of difference are discussed below :

(i) Degree and Nature of Relationship : The correlation coefficient is a measure of the degree of covariability between two variables, whereas regression analysis is used to study the nature of relationship between the variables so that we can predict the value of one on the basis of another. The reliance that can be placed on the estimates or predictions depends upon the closeness of relationship between the variables.

(ii) Cause and Effect Relationship : The cause and effect relationship is explained by regression analysis. Correlation is only a tool to ascertain the degree of relationship between two variables and we cannot say that one variable is the cause and the other the effect. A high degree of correlation between price and demand for a commodity at a particular point of time may not suggest which is the cause and which is the effect. However, in regression analysis the cause and effect relationship is clearly expressed: one variable is taken as dependent and the other as independent.

The variable which is the basis of prediction is called independent variable and the variable that is to

be predicted is called dependent variable. The independent variable is represented by X and the dependent

variable by Y.

Principle of Least Squares

Regression refers to an average relationship between a dependent variable and one or more independent variables. Such a relationship is generally expressed by a line of regression drawn by the method of “Least Squares”. This line of regression can be drawn graphically or derived algebraically with the help of regression equations. According to Tom Cars, before the equation of the least squares line can be determined, some criterion must be established as to what conditions the best line should satisfy. The condition usually stipulated in regression analysis is that the sum of the squares of the deviations of the observed Y values from the fitted line shall be minimum. This is known as the least squares or minimum squared error criterion.

A line fitted by the method of least squares is the line of best fit. The line satisfies the following conditions :

(i) The algebraic sum of the deviations of the points above and below the line is equal to zero.

Σ(X – Xc) = 0 and Σ(Y – Yc) = 0

where Xc and Yc are the values derived with the help of the regression technique.

(ii) The sum of the squares of all these deviations is less than the sum of the squares of the deviations from any other line. We can say

Σ(X – Xc)² is smaller than Σ(X – A)² and

Σ(Y – Yc)² is smaller than Σ(Y – A)²

where A is the corresponding value on any other straight line.

(iii) The lines of regression (best fit) intersect at the mean values of the variables, i.e., at X̄ and Ȳ.

(iv) When the data represent a sample from a larger population, the least squares line is the best estimate of the population line.
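Conditions (i) to (iii) can be checked numerically. The following Python sketch fits the line of Y on X to a small illustrative data set and compares it with an arbitrarily chosen rival line; the helper names and the rival line are assumptions made only for this illustration.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_sq_error(x, y, a, b):
    """Sum of squared vertical deviations of the points from y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


x = [6, 2, 10, 4, 8]   # illustrative values
y = [9, 11, 5, 8, 7]
a, b = fit_line(x, y)

# (i) deviations above and below the fitted line cancel out
residual_sum = sum(yi - (a + b * xi) for xi, yi in zip(x, y))

# (ii) any other line has a larger sum of squared deviations
best = sum_sq_error(x, y, a, b)
other = sum_sq_error(x, y, a + 0.5, b - 0.1)   # an arbitrary rival line

# (iii) the fitted line passes through the point of means
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
passes_through_means = abs((a + b * mean_x) - mean_y) < 1e-9

print(round(residual_sum, 9), round(best, 2), round(other, 2), passes_through_means)
```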

Methods of Regression Analysis

We can study regression by the following methods :

1. Graphic method (regression lines)

2. Algebraic method (regression equations)

We shall discuss these methods in detail.

1. Graphic Method : When we apply this method different points are plotted on a graph paper

representing different pairs of variables. These points give a picture of a scatter diagram with several

points spread over. A regression line may be drawn between these points either by free hand or by a scale

in such a way that the squares of the vertical or horizontal distances between the points and the line of

regression is minimum. It should be drawn in such a manner that the line leaves equal number of points on

both sides. However, to ensure this is rather difficult and the method only renders a rough estimate which

can not be completely free from subjectivity of person drawing it. Such a line can be a straight line or a

curved line depending upon the scatter of points and relationship to be established. A non-linear free hand

curve will have more element of subjectivity and a straight line is generally drawn. Let us understand it

with the help of an example:

Example :

Height of father (inches)   Height of son (inches)

65 68

63 66

67 68

64 65

68 69

62 66

70 68

66 65

68 71

67 67

69 68

71 70

Solution : The diagram given below shows the height of fathers on x-axis and the height of sons on y-axis.

The line of regression called the regression of y on x is drawn between the scatter dots.

Fig. 1

Another line of regression called the regression line of x on y is drawn amongst the same set of


scatter dots in such a way that the squares of the horizontal distances between dots are minimised.

Fig. 2 Fig. 3

It is clear that the position of the regression line of x on y is not exactly like that of the regression line of y on x. In the following figure both the regression of y on x and that of x on y are exhibited.

Fig. 4

When there is either perfect positive or perfect negative correlation between the two variables, the

two regression lines will coincide and we will have only one line. The farther the two regression lines from

each other, the lesser is the degree of correlation and vice-versa. If the variables are independent, correlation

is zero and the lines of regression will be at right angles. It should be noted that the regression lines cut each

other at the point of average of x and y, i.e., if from the point where both the regression lines cut each other

a perpendicular is drawn on the x-axis, we will get the mean value of x series and if from that point a

horizontal line is drawn on the y-axis we will get the mean of y series.

2. Algebraic Method : The algebraic method for simple linear regression can be understood by two

methods:

(i) Regression Equations

(ii) Regression Coefficients

Regression Equations : These equations are known as estimating equations. Regression equations are algebraic expressions of the regression lines. As there are two regression lines, there are two regression equations :

(i) x on y is used to describe the variations in the values of x for given changes in y.

(ii) y on x is used to describe the variations in the values of y for given changes in x.

The regression equation of y on x is expressed as

Yc = a + bX

The regression equation of x on y is expressed as

Xc = a + bY

In these equations a and b are constants which determine the position of the line completely. These constants are called the parameters of the line; they take different values in the two equations. If the value of any of these parameters is changed, another line is determined.

Parameter a refers to the intercept of the line and b to the slope of the line. The symbols Yc and Xc refer to the value of Y and the value of X computed on the basis of the independent variable in each case. If the values of both the parameters are obtained, the line is completely determined. The values of these two parameters a and b can be obtained by the method of least squares. With a little algebra and differential calculus it can be shown that the following two equations, if solved simultaneously, will give values of the parameters a and b such that the least squares requirement is fulfilled :

For regression equation Yc = a + bX

Σy = Na + bΣx

Σxy = aΣx + bΣx²

For regression equation Xc = a + bY

Σx = Na + bΣy

Σxy = aΣy + bΣy²

These equations are usually called the normal equations. In the equations Σx, Σy, Σxy, Σx², Σy² indicate totals which are computed from the observed pairs of values of the two variables x and y to which the least squares estimating line is to be fitted, and N is the number of observed pairs of values. Let us understand by an example.

Example : From the following data obtain the two regression equations :

x : 6 2 10 4 8

y : 9 11 5 8 7

Solution :

Computation of Regression Equations

x      y      xy     x²     y²

6      9      54     36     81

2      11     22     4      121

10     5      50     100    25

4      8      32     16     64

8      7      56     64     49

Σx = 30   Σy = 40   Σxy = 214   Σx² = 220   Σy² = 340

Regression line of Y on X is expressed by the equation of the form

Yc = a + bX

To determine the values of a and b, the following two normal equations are solved

Σ y = Na + bΣx

Σxy = aΣx + bΣx2

Substituting the values, we get

40 = 5a + 30b ...(i)

214 = 30a + 220b ...(ii)

Multiplying equation (i) by 6, we get

240 = 30a +180b ...(iii)

214 = 30a + 220b ...(iv)

Deduct equation (iv) from (iii)

– 40b = + 26

b = – 0.65

Substitute the value of b in equation (i)

40 = 5a + 30 (– 0.65)

5a = 40 + 19.5 or a = 11.9

Substitute the values of a and b in the equation


Regression line of Y on X is

yc = 11.9 – 0.65X

Regression line of X on Y is

Xc = a + bY

The corresponding normal equations are

Σ x = Na + bΣy

Σxy = aΣy + bΣy2

Substituting the values

30 = 5a + 40b ...(i)

214 = 40a + 340b ...(ii)

Multiply equation (i) by 8

240 = 40a + 320b ...(iii)

214 = 40a + 340b ...(iv)

Deduct equation (iv) from (iii)

– 20b = 26 or b = – 1.3

Substitute the value of b in equation (i)

30 = 5a + 40(–1.3)

5a = 30 + 52 or a =16.4

Substitute the values of a and b in the equation. Regression line of X on Y is

Xc = 16.4 – 1.3y
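The elimination carried out by hand above can be reproduced with a few lines of Python. This is only a sketch: the function `solve_normal_equations` is a hypothetical helper that solves the two normal equations in closed form, with u standing for the independent variable and v for the dependent one.

```python
def solve_normal_equations(n, sum_u, sum_v, sum_uv, sum_u2):
    """Solve  sum_v = n*a + b*sum_u  and  sum_uv = a*sum_u + b*sum_u2  for (a, b)."""
    det = n * sum_u2 - sum_u ** 2
    if det == 0:
        raise ValueError("the independent variable shows no variation")
    b = (n * sum_uv - sum_u * sum_v) / det
    a = (sum_v - b * sum_u) / n
    return a, b


x = [6, 2, 10, 4, 8]
y = [9, 11, 5, 8, 7]
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Y on X: X is the independent variable
a_yx, b_yx = solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_x2)
# X on Y: Y is the independent variable
a_xy, b_xy = solve_normal_equations(n, sum_y, sum_x, sum_xy, sum_y2)

print(round(a_yx, 2), round(b_yx, 2))  # 11.9 -0.65
print(round(a_xy, 2), round(b_xy, 2))  # 16.4 -1.3
```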

Regression Coefficients : In the regression equation, b is the regression coefficient, which indicates the degree and direction of change in the dependent variable with respect to a change in the independent variable. In the two regression equations :

Xc = a + bxy Y

Yc = a + byx X

bxy and byx are known as the regression coefficients of the two equations. These coefficients can be obtained independently, without using simultaneous normal equations, with these formulae :

Regression coefficient of x on y is

$$b_{xy} = r\frac{\sigma_x}{\sigma_y}$$

$$b_{xy} = \frac{\sum xy}{N\sigma_y^2}$$

$$b_{xy} = \frac{\sum xy}{\sum y^2} \quad \text{where } x = (X - \bar{X}) \text{ and } y = (Y - \bar{Y})$$

Regression coefficient of y on x is

$$b_{yx} = r\frac{\sigma_y}{\sigma_x}$$

$$b_{yx} = \frac{\sum xy}{N\sigma_x^2}$$

$$b_{yx} = \frac{\sum xy}{\sum x^2} \quad \text{where } x = (X - \bar{X}) \text{ and } y = (Y - \bar{Y})$$


Example : Calculate the regression coefficients from data given below :

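                       Series x     Series y

Average                25           22

Standard deviation     4            5

r = 0.8

Solution : The coefficient of regression of x on y is

$$b_{xy} = r\frac{\sigma_x}{\sigma_y} = 0.8 \times \frac{4}{5} = 0.64$$

The coefficient of regression of y on x is

$$b_{yx} = r\frac{\sigma_y}{\sigma_x} = 0.8 \times \frac{5}{4} = 1.0$$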

Properties of Regression Coefficients

(i) The coefficient of correlation is the geometric mean of the two regression coefficients,

$$r = \pm\sqrt{b_{xy} \times b_{yx}}$$

(ii) Both the regression coefficients are either positive or negative. It means that they always have identical signs, i.e., either both have a positive sign or both have a negative sign.

(iii) The coefficient of correlation and the regression coefficients will also have the same sign.

(iv) If one of the regression coefficients is more than unity in absolute value, the other must be less than unity, because the value of the coefficient of correlation cannot exceed one in absolute value (–1 ≤ r ≤ +1).

(v) Regression coefficients are independent of a change of origin but not of a change of scale.

(vi) The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient (comparing absolute values when the coefficients are negative).
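These properties can be verified numerically. The Python sketch below uses a small illustrative data set; the function name `regression_summary` and the particular shift and scale factors are assumptions chosen only for the demonstration.

```python
import math


def regression_summary(x, y):
    """Return (b_yx, b_xy, r) computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sxx, sxy / syy, sxy / math.sqrt(sxx * syy)


x = [6, 2, 10, 4, 8]
y = [9, 11, 5, 8, 7]
b_yx, b_xy, r = regression_summary(x, y)

# (i)-(iii): r is the geometric mean of the coefficients and carries their sign
geometric_mean = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# (vi): the arithmetic mean is at least as large as r in absolute value
arithmetic_mean = (b_yx + b_xy) / 2

# (v): a change of origin leaves both coefficients unchanged ...
shifted = regression_summary([v + 100 for v in x], [v - 50 for v in y])
# ... but a change of scale in x alters them
scaled = regression_summary([v * 10 for v in x], y)

print(round(b_yx, 2), round(b_xy, 2), round(r, 3))
print(round(geometric_mean, 3), round(arithmetic_mean, 3))
```

Multiplying x by 10 divides byx by 10 and multiplies bxy by 10, so their product, and hence r, is unaffected.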

We can compute the regression equations with the help of regression coefficients by the following equations :

1. Regression equation X on Y

$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$

where X̄ is the mean of the X series,

Ȳ is the mean of the Y series and

bxy is the regression coefficient of x on y.

2. Regression equation Y on X

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

We can explain this by taking an example :

Example : Calculate the following from the below given data :

(a) the two regression equations,

(b) the coefficient of correlation and

(c) the most likely marks in Statistics when the marks in Economics are 30

Marks in Economics : 25 28 35 32 31 36 29 38 34 32


Marks in Statistics : 43 46 49 41 36 32 31 30 33 39

Solution : Calculation of Regression Equations and Correlation Coefficient

Marks in Eco (X)   x     x²     Marks in Stats (Y)   y      y²     xy

25               – 7     49     43                  + 5     25    – 35

28               – 4     16     46                  + 8     64    – 32

35               + 3     9      49                 + 11    121    + 33

32                 0     0      41                  + 3     9       0

31               – 1     1      36                  – 2     4     + 2

36               + 4     16     32                  – 6     36    – 24

29               – 3     9      31                  – 7     49    + 21

38               + 6     36     30                  – 8     64    – 48

34               + 2     4      33                  – 5     25    – 10

32                 0     0      39                  + 1     1       0

ΣX = 320   Σx = 0   Σx² = 140   ΣY = 380   Σy = 0   Σy² = 398   Σxy = – 93

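Regression equation X on Y

$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$

$$b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{-93}{398} = -0.234$$

$$\bar{X} = \frac{\sum X}{N} = \frac{320}{10} = 32 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{380}{10} = 38$$

X – 32 = – 0.234 (Y – 38)

X – 32 = – 0.234Y + 8.892

or X = 40.892 – 0.234Y

Regression equation Y on X

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

$$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{-93}{140} = -0.664$$

X̄ = 32, Ȳ = 38, byx = – 0.664

∴ Y – 38 = – 0.664 (X – 32)

or Y – 38 = – 0.664X + 21.248

Y = 59.248 – 0.664X

$$\text{(b) Correlation Coefficient } (r) = \pm\sqrt{b_{xy} \times b_{yx}} = -\sqrt{0.234 \times 0.664} = -0.394$$

Since both the regression coefficients are negative, the value of r must also be negative.

(c) Likely marks in Statistics when marks in Economics are 30

Y = – 0.664 X + 59.248 where X = 30

Y = (– 0.664 × 30) + 59.248 = 39.328 or 39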
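The same working can be sketched in Python, taking deviations from the actual means as in the table above. The helper name `regression_from_deviations` is hypothetical, and unrounded coefficients are carried through, so the last digits can differ slightly from the hand calculation.

```python
import math


def regression_from_deviations(x_vals, y_vals):
    """Return (mean_x, mean_y, b_xy, b_yx) using deviations from the actual means."""
    n = len(x_vals)
    mean_x, mean_y = sum(x_vals) / n, sum(y_vals) / n
    dx = [v - mean_x for v in x_vals]
    dy = [v - mean_y for v in y_vals]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    b_xy = sum_xy / sum(b * b for b in dy)   # regression coefficient of X on Y
    b_yx = sum_xy / sum(a * a for a in dx)   # regression coefficient of Y on X
    return mean_x, mean_y, b_xy, b_yx


economics = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]     # X
stats_marks = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]   # Y

mean_x, mean_y, b_xy, b_yx = regression_from_deviations(economics, stats_marks)

# (b) r is the geometric mean of the coefficients, with their common sign
r = math.copysign(math.sqrt(b_xy * b_yx), b_yx)

# (c) estimate of Y from the line of Y on X when X = 30
predicted_stats = mean_y + b_yx * (30 - mean_x)

print(round(b_xy, 3), round(b_yx, 3), round(r, 3), round(predicted_stats, 2))
```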

Example: The following scores were worked out from a test in Mathematics and English in an annual

examination.

Scores in Mathematics (x) English (y)

Mean 39.5 47.5

Standard deviation 10.8 16.8 r = + 0.42

Find both the regression equations. Using these regression equations, estimate the value of Y for X = 50 and the value of X for Y = 30.

Solution : Regression of X on Y

$$X - \bar{X} = r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$$

where Ȳ = 47.5, X̄ = 39.5, r = 0.42, σx = 10.8, and σy = 16.8

By substituting values, we get

$$X - 39.5 = 0.42 \times \frac{10.8}{16.8}(Y - 47.5)$$

= 0.27 (Y – 47.5) = 0.27 Y – 12.82

or X = 0.27Y – 12.82 + 39.5 = 0.27Y + 26.68

when Y = 30

Value of X = (0.27 × 30 + 26.68) = 34.78

Regression equation of Y on X

$$Y - \bar{Y} = r\frac{\sigma_y}{\sigma_x}(X - \bar{X})$$

where X̄ = 39.5, Ȳ = 47.5, r = 0.42, σx = 10.8, and σy = 16.8

$$Y - 47.5 = 0.42 \times \frac{16.8}{10.8}(X - 39.5)$$

Y – 47.5 = 0.653 (X – 39.5) = 0.653 X – 25.79

or Y = 0.653 X – 25.79 + 47.5 = 0.653X + 21.71

When X = 50

Value of Y = (0.653 × 50 + 21.71) = 32.65 + 21.71 = 54.36

Thus the regression equations are :


Xc = 0.27y + 26.68

Yc = 0.653x + 21.71

Value of X when Y = 30 is 34.78

Value of Y when X = 50 is 54.36
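When only the means, the standard deviations and r are known, both lines follow directly from the summary figures. The Python sketch below illustrates this; `lines_from_summary` is a hypothetical helper, and because it does not round the coefficients its results may differ from the hand-worked figures in the third decimal place.

```python
def lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a, b) of X on Y, (a, b) of Y on X) from summary statistics."""
    b_xy = r * sd_x / sd_y            # slope of X on Y
    b_yx = r * sd_y / sd_x            # slope of Y on X
    a_xy = mean_x - b_xy * mean_y     # intercept of X = a + b*Y
    a_yx = mean_y - b_yx * mean_x     # intercept of Y = a + b*X
    return (a_xy, b_xy), (a_yx, b_yx)


x_on_y, y_on_x = lines_from_summary(39.5, 47.5, 10.8, 16.8, 0.42)

x_at_30 = x_on_y[0] + x_on_y[1] * 30   # estimate of X when Y = 30
y_at_50 = y_on_x[0] + y_on_x[1] * 50   # estimate of Y when X = 50

print(round(x_at_30, 2), round(y_at_50, 2))
```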

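When the actual means of the variables X and Y come out to be fractions, taking deviations from the actual means becomes cumbersome and it is advisable to take deviations from assumed means. When deviations are taken from assumed means, the values of bxy and byx are given by

$$b_{xy} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2}$$

where dx = (X – A) and dy = (Y – A), A being the assumed mean of the respective series.

The regression equation of X on Y is :

$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$

Similarly the regression equation of Y on X is

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

$$b_{yx} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2}$$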

Let us take an example to understand

Example: You are given the data relating to purchases and sales. Compute the two regression equations by

method of least squares and estimate the likely sales when the purchases are 100.

Purchases : 62 72 98 76 81 56 76 92 88 49

Sales : 112 124 131 117 132 96 120 136 97 85

Solution : Calculations of Regression Equations

Purchases X   dx = (X – 76)   dx²    Sales Y   dy = (Y – 120)   dy²     dxdy

62            – 14            196    112       – 8              64      + 112

72            – 4             16     124       + 4              16      – 16

98            + 22            484    131       + 11             121     + 242

76              0             0      117       – 3              9         0

81            + 5             25     132       + 12             144     + 60

56            – 20            400    96        – 24             576     + 480

76              0             0      120         0              0         0

92            + 16            256    136       + 16             256     + 256

88            + 12            144    97        – 23             529     – 276

49            – 27            729    85        – 35             1225    + 945

Σdx = – 10   Σdx² = 2250   Σdy = – 50   Σdy² = 2940   Σdxdy = 1803

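$$\bar{X} = A + \frac{\sum d_x}{N} = 76 + \frac{-10}{10} = 75 \quad \text{and} \quad \bar{Y} = A + \frac{\sum d_y}{N} = 120 + \frac{-50}{10} = 115$$

Regression Coefficients : X on Y

$$b_{xy} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2} = \frac{10 \times 1803 - (-10)(-50)}{10 \times 2940 - (-50)^2} = \frac{17530}{26900} = 0.652$$

Y on X

$$b_{yx} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2} = \frac{10 \times 1803 - (-10)(-50)}{10 \times 2250 - (-10)^2} = \frac{17530}{22400} = 0.78$$

Regression equation : X on Y

$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$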


X – 75 = 0.652 (Y – 115) = 0.652Y – 74.98

or X = 0.652Y + 0.02

Regression equation : Y on X

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

Y – 115 = 0.78(X – 75) = 0.78X – 58.5

Y = 0.78X + 56.5

when X = 100

then Y = 0.78 × 100 + 56.5 = 134.5
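The assumed-mean working can be sketched in Python as follows. The function name is hypothetical; the assumed means 76 and 120 are those used in the table above, and any other choice would give the same coefficients. Unrounded coefficients are used, so the estimate differs slightly from the hand-worked figure.

```python
def coefficients_from_assumed_means(x_vals, y_vals, assumed_x, assumed_y):
    """Return (mean_x, mean_y, b_xy, b_yx) using deviations from assumed means."""
    n = len(x_vals)
    dx = [v - assumed_x for v in x_vals]
    dy = [v - assumed_y for v in y_vals]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    b_xy = numerator / (n * sum_dy2 - sum_dy ** 2)
    b_yx = numerator / (n * sum_dx2 - sum_dx ** 2)
    # actual means recovered from the assumed means
    mean_x = assumed_x + sum_dx / n
    mean_y = assumed_y + sum_dy / n
    return mean_x, mean_y, b_xy, b_yx


purchases = [62, 72, 98, 76, 81, 56, 76, 92, 88, 49]
sales = [112, 124, 131, 117, 132, 96, 120, 136, 97, 85]

mean_x, mean_y, b_xy, b_yx = coefficients_from_assumed_means(purchases, sales, 76, 120)

# likely sales when purchases are 100, from the line of Y on X
sales_at_100 = mean_y + b_yx * (100 - mean_x)

print(mean_x, mean_y, round(b_xy, 3), round(b_yx, 3), round(sales_at_100, 1))
```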

Standard Error of Estimate

Standard error of an estimate is the measure of the spread of the observed values about the estimated ones, expressed by the regression line or equation. The concept of standard error of estimate is analogous to the standard deviation, which measures the variation or scatter of individual items about the arithmetic mean. Therefore, like the standard deviation, which is the square root of the average of the squared deviations about the arithmetic mean, the standard error of an estimate is the square root of the average of the squared deviations between the actual or observed values and the estimated values based on the regression equation. It can also be expressed as the root of the measure of unexplained variation divided by N – 2 :

$$S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N - 2}}$$

$$\text{and} \quad S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N - 2}}$$

where Syx refers to the standard error of estimate of Y values on X values.

Sxy refers to the standard error of estimate of X values on Y values.

Yc and Xc are the estimated values of the Y and X variables by means of their regression equations respectively.

N – 2 is used for getting an unbiased estimate of standard error. The usual explanation given for this division by N – 2 is that the two constants a and b were calculated on the basis of the original data and we lose two degrees of freedom. Degrees of freedom mean the number of classes to which values can be assigned at will without violating any restrictions.


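However, a simpler method of computing Syx and Sxy is to use the following formulae :

$$S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N - 2}}$$

$$\text{and} \quad S_{xy} = \sqrt{\frac{\sum X^2 - a\sum X - b\sum XY}{N - 2}}$$

where a and b are the constants of the corresponding regression equation.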

The standard error of estimate measures the accuracy of the estimated figures. The smaller the value of the standard error of estimate, the closer the dots will be to the regression line and the better the estimates based on the equation for this line. If the standard error of estimate is zero, then there is no variation about the line and the correlation will be perfect. Thus with the help of the standard error of estimate it is possible for us to ascertain how good and representative the regression line is as a description of the average relationship between two series.

Example : Given the following data :

X : 6 2 10 4 8

Y : 9 11 5 8 7

And two regression equations Y = 11.9 – 0.65 X and X = 16.4 – 1.3 Y. Calculate the standard error of estimate, i.e., Syx and Sxy.

Solution : We can calculate Xc and Yc values from these regression equations.

X      Y      Yc      Xc      (Y – Yc)²   (X – Xc)²

6      9      8.0     4.7     1.00        1.69

2      11     10.6    2.1     0.16        0.01

10     5      5.4     9.9     0.16        0.01

4      8      9.3     6.0     1.69        4.00

8      7      6.7     7.3     0.09        0.49

ΣX = 30   ΣY = 40   ΣYc = 40   ΣXc = 30   Σ(Y – Yc)² = 3.10   Σ(X – Xc)² = 6.20

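Thus we can calculate Syx and Sxy from the above calculated values :

$$S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N - 2}} = \sqrt{\frac{3.1}{5 - 2}} = \sqrt{1.033} = 1.017$$

$$S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N - 2}} = \sqrt{\frac{6.2}{5 - 2}} = \sqrt{2.067} = 1.438$$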
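A short Python sketch of this calculation follows. It uses the intercept 11.9 for the line of Y on X, which reproduces the Yc column of the table, and it also evaluates an algebraically equivalent shortcut, √[(ΣY² – aΣY – bΣXY)/(N – 2)], which is assumed here to be the kind of simpler method referred to earlier. Both function names are illustrative.

```python
import math


def standard_error(observed, estimated):
    """sqrt( sum((observed - estimated)^2) / (N - 2) )."""
    n = len(observed)
    unexplained = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    return math.sqrt(unexplained / (n - 2))


def standard_error_shortcut(dep, indep, a, b):
    """Shortcut: sqrt( (sum(dep^2) - a*sum(dep) - b*sum(dep*indep)) / (N - 2) )."""
    n = len(dep)
    total = (sum(d * d for d in dep) - a * sum(dep)
             - b * sum(d * i for d, i in zip(dep, indep)))
    return math.sqrt(total / (n - 2))


x = [6, 2, 10, 4, 8]
y = [9, 11, 5, 8, 7]

y_est = [11.9 - 0.65 * xi for xi in x]   # Yc from the line of Y on X
x_est = [16.4 - 1.3 * yi for yi in y]    # Xc from the line of X on Y

s_yx = standard_error(y, y_est)
s_xy = standard_error(x, x_est)

print(round(s_yx, 3), round(s_xy, 3))
print(round(standard_error_shortcut(y, x, 11.9, -0.65), 3),
      round(standard_error_shortcut(x, y, 16.4, -1.3), 3))
```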


LESSON 8

ANALYSIS OF TIME SERIES

When quantitative data are arranged in the order of their occurrence, the resulting statistical series is called a time series. The quantitative values are usually recorded over equal time intervals: daily, weekly, monthly, quarterly, half yearly, yearly, or any other time measure. Monthly statistics of Industrial Production in India, annual birth-rate figures for the entire world, yield on ordinary shares, weekly wholesale price of rice, daily records of tea sales or census data are some examples of time series. Each has the common characteristic of recording magnitudes that vary with the passage of time.

Time series are influenced by a variety of forces. Some are continuously effective, others make themselves felt at recurring time intervals, and still others are non-recurring or random in nature. Therefore, the first task is to break down the data and study each of these influences in isolation. This is known as decomposition of the time series. It enables us to understand fully the nature of the forces at work. We can then analyse their combined interactions. Such a study is known as time-series analysis.

Components of time series

A time series consists of the following four components or elements :

1. Basic or Secular or Long-time trend;

2. Seasonal variations;

3. Business cycles or cyclical movement; and

4. Erratic or Irregular fluctuations.

These components provide a basis for the explanation of past behaviour. They help us to predict future behaviour. The major tendency of each component or constituent is largely due to causal factors. Therefore a brief description of the components and the causal factors associated with each component should be given before proceeding further.

1. Basic or secular or long-time trend : Basic trend underlies the tendency to grow or decline over a period of years. It is the movement that the series would have taken, had there been no seasonal, cyclical or erratic factors. It is the effect of such factors as are more or less constant for a long time or change very gradually and slowly. Such factors are gradual growth in population, tastes and habits, or the effect on industrial output of improved methods. An increase in production of automobiles and a gradual decrease in production of foodgrains are examples of increasing and decreasing secular trend.

All basic trends are not of the same nature. Sometimes the predominating tendency will be a constant amount of growth. This type of trend movement takes the form of a straight line when the trend values are plotted on a graph paper. Sometimes the trend will be a constant percentage increase or decrease. This type takes the form of a straight line when the trend values are plotted on a semi-logarithmic chart. Other types of trend encountered are “logistic”, “S-curves”, etc.

Properly recognising and accurately measuring basic trends is one of the most important problems in time series analysis. Trend values are used as the base from which the other three movements are measured. Therefore, any inaccuracy in its measurement may vitiate the entire work. Fortunately, the causal elements controlling trend growth are relatively stable. Trends do not commonly change their nature quickly and without warning. It is therefore reasonable to assume that a representative trend, which has characterized the data for a past period, is prevailing at present, and that it may be projected into the future for a year or so.


2. Seasonal Variations : The two principal factors responsible for seasonal changes are the climate or weather and customs. Since the growth of all vegetation depends upon temperature and moisture, agricultural activity is confined largely to warm weather in the temperate zones and to the rainy or post-rainy season in the torrid zone (tropical countries or sub-tropical countries like India). Winter and the dry season make farming a highly seasonal business. This high irregularity of month to month agricultural production largely determines all harvesting, marketing, canning, preserving, storing, financing, and pricing of farm products. Manufacturers, bankers and merchants who deal with farmers find their business taking on the same seasonal pattern which characterises the agriculture of their area.

The second cause of seasonal variation is custom, education or tradition. Such traditional days as Diwali, Christmas, Id etc. produce marked variations in business activity, travel, sales, gifts, finance, accidents, and vacationing.

The successful operation of any business requires that its seasonal variations be known, measured

and exploited fully. Frequently, the purchase of seasonal item is made from six months to a year in advance.

Departments with opposite seasonal changes are frequently combined in the same firm to avoid dull seasons

and to keep sales or production up during the entire year.

Seasonal variations are measured as a percentage of the trend rather than in absolute quantities. The seasonal index for any month (week, quarter etc.) may be defined as the ratio of the normally expected value (excluding the business cycle and erratic movements) to the corresponding trend value. When cyclical movement and erratic fluctuations are absent in a time series, such a series is called normal. Normal values thus consist of trend and seasonal components. Thus when normal values are divided by the corresponding trend values, we obtain the seasonal component of the time series.

3. Business Cycle : Because of the persistent tendency for business to prosper, decline, stagnate, recover and prosper again, the third characteristic movement in economic time series is called the business cycle. The business cycle does not recur regularly like seasonal movement, but moves in response to causes which develop intermittently out of complex combinations of economic and other considerations.

When the business of a country or a community is above or below normal, the excess or deficiency is usually attributed to the business cycle. Its measurement becomes a process of contrasting occurrences with a normal estimate arrived at by combining the calculated trend and seasonal movements. The measurement of the variations from normal may be made in terms of actual quantities or it may be made in such terms as percentage deviations, which is generally the more satisfactory method as it places the measure of cyclical tendencies on a comparable base throughout the entire period under analysis.

4. Erratic or Irregular Component : These movements are exceedingly difficult to dissociate quantitatively from the business cycle. Their causes are such irregular and unpredictable happenings as wars, droughts, floods, fires, pestilence, fads and fashions, which operate as spurs or deterrents upon the progress of the cycle. Examples of such movements are : high activity in the middle forties due to the erratic effects of the 2nd World War, the depression of the thirties throughout the world, and the export boom associated with the Korean War in 1950. The common denominator of every random factor is that it does not come about as a result of the ordinary operation of the business system and does not recur in any meaningful manner.

Mathematical Statement of the Composition of Time Series

A time series may not be affected by all types of variations. Some of these types of variations may affect a few time series, while other series may be affected by all of them. Hence, in analysing time series, these effects are isolated. In classical time series analysis it is assumed that any given observation is made up of trend, seasonal, cyclical and irregular movements and that these four components have a multiplicative relationship.

Symbolically :

O = T × S × C × I

where O refers to original data,

T refers to trend,

S refers to seasonal variations,

C refers to cyclical variations and

I refers to irregular variations.

This is the most commonly used model in the decomposition of time series.

There is another model called Additive model in which a particular observation in a time series is the

sum of these four components.

O = T + S + C + I

To prevent confusion between the two models, it should be made clear that in the Multiplicative model S, C and I are indices expressed as decimals (for example, 1.4 for 140 per cent), whereas in the Additive model S, C and I are quantitative deviations about the trend that can be expressed as seasonal, cyclical and irregular in nature.

If in a multiplicative model T = 500, S = 1.4, C = 1.20 and I = 0.7, then

O = T × S × C × I

By substituting the values we get

O = 500 × 1.4 × 1.20 × 0.7 = 588

In an additive model, T = 500, S = 100, C = 25, I = –50

O = 500 + 100 + 25 – 50 = 575
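The two compositions can be written as a pair of small Python functions. This is only an illustrative sketch; the function names are not part of any standard library.

```python
def multiplicative_model(trend, seasonal, cyclical, irregular):
    """O = T x S x C x I, with S, C and I expressed as decimal indices."""
    return trend * seasonal * cyclical * irregular


def additive_model(trend, seasonal, cyclical, irregular):
    """O = T + S + C + I, with S, C and I expressed in the units of the data."""
    return trend + seasonal + cyclical + irregular


o_multiplicative = multiplicative_model(500, 1.4, 1.20, 0.7)
o_additive = additive_model(500, 100, 25, -50)

print(round(o_multiplicative, 2), o_additive)
```

Dividing an observed value by T, S and C in the multiplicative model, or subtracting them in the additive model, isolates the irregular component in the same way.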

The assumption underlying the two schemes of analysis is that whereas there is no interaction among the different constituents or components under the additive scheme, such interaction is very much present in the multiplicative scheme. Time series analysis generally proceeds on the assumption of the multiplicative formulation.

Methods of Measuring Trend

Trend can be determined by : (i) freehand curve method ; (ii) moving averages method ; (iii) semi-averages method ; and (iv) least-squares method. Each of these methods is described below :

(i) Freehand Curve Method : The term freehand is applied to any non-mathematical curve in statistical analysis even if it is drawn with the aid of drafting instruments. This is the simplest method of studying the trend of a time series. The procedure for drawing a freehand curve is as follows :

(i) The original data are first plotted on a graph paper.

(ii) The direction of the plotted data is carefully observed.

(iii) A smooth line is drawn through the plotted points.

While fitting a trend line by the freehand method, an attempt should be made that the fitted curve

conforms to these conditions.

(i) The curve should be smooth either a straight line or a combination of long gradual curves.

(ii) The trend line or curve should be drawn through the graph of the data in such a way that the areas below and above the trend line are equal to each other.

(iii) The vertical deviations of the data above the trend line must equal the deviations below the line.

(iv) Sum of the squares of the vertical deviations of the observations from the trend should be

minimum.

Illustration : Draw a time series graph relating to the following data and fit the trend by freehand method :

Year Production of Steel

(million tonnes)

1994 20

1995 22

1996 30

1997 28

1998 32

1999 25

2000 29

2001 35

2002 40

2003 32

The trend line drawn by the freehand method can be extended to project future values. However, the

freehand curve fitting is too subjective and should not be used as a basis for prediction.

Method of Moving Averages : The moving average is a simple and flexible process of trend

measurement which is quite accurate under certain conditions. This method establishes a trend by means of

a series of averages covering overlapping periods of the data.

The process of successively averaging, say, three years data, and establishing each average as the

moving-average value of the central year in the group, should be carried throughout the entire series. For a

five-item, seven-item or other moving averages, the same procedure is followed : the average obtained each

time being considered as representative of the middle period of the group.

The choice of a 5-year, 7-year, 9-year, or other moving average is determined by the length of period

necessary to eliminate the effects of the business cycle and erratic fluctuations. A good trend must be free

from such movements, and if there is any definite periodicity to the cycle, it is well to have the moving

average to cover one cycle period. Ordinarily, the necessary periods will range between three and ten years

for general business series but even longer periods are required for certain industries.

In the preceding discussion, the moving averages of odd number of years were representatives of

the middle years. If the moving average covers an even number of years, each average will still be representative of the midpoint of the period covered, but this mid-point will fall halfway between the two middle years. In the case of a four-year moving average, for instance, each average represents a point halfway between the second and third years. In such a case, a second moving average may be used to ‘recentre’ the averages. That is, if the first moving average gives averages centring halfway between the years, a further two-point moving average will recentre the data exactly on the years.

This method, however, is valuable in approximating trends in a period of transition when the mathematical lines or curves may be inadequate. This method provides a basis for testing other types of trends,

even though the data are not such as to justify its use otherwise.

Illustration : Calculate 5-yearly moving average trend for the time series given below.

Year : 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000

Quantity : 239 242 238 252 257 250 273 270 268 288 284

Year : 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

Quantity : 282 300 303 298 313 317 309 329 333 327

Solution :

Year Quantity 5-yearly moving total 5-yearly moving average

1990 239

1991 242

1992 238 1228 245.6

1993 252 1239 247.8

1994 257 1270 254.0

1995 250 1302 260.4

1996 273 1318 263.6

1997 270 1349 269.8

1998 268 1383 276.6

1999 288 1392 278.4

2000 284 1422 284.4

2001 282 1457 291.4

2002 300 1467 293.4

2003 303 1496 299.2

2004 298 1531 306.2

2005 313 1540 308.0

2006 317 1566 313.2

2007 309 1601 320.2

2008 329 1615 323.0

2009 333

2010 327

To simplify calculation work : Obtain the total of the first five years' data. Subtract the first term from the sixth term and add this difference to the total to obtain the total of the second to sixth terms; for example, 1228 + (250 – 239) = 1239. In this way, the difference between the term to be included and the term to be omitted is added to the preceding total in order to obtain the next successive total.
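The running-total shortcut described above can be sketched as follows. The function name moving_totals is illustrative; the data are those of the illustration.

```python
def moving_totals(values, k):
    """Return the k-period moving totals using the add-new, drop-old shortcut."""
    if k <= 0 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    total = sum(values[:k])          # total of the first k terms
    totals = [total]
    for j in range(k, len(values)):
        # add the entering term and drop the leaving term
        total += values[j] - values[j - k]
        totals.append(total)
    return totals


quantity = [239, 242, 238, 252, 257, 250, 273, 270, 268, 288, 284,
            282, 300, 303, 298, 313, 317, 309, 329, 333, 327]
years = list(range(1990, 2011))

totals = moving_totals(quantity, 5)
averages = [t / 5 for t in totals]
# Each average is placed against the middle year of its five-year window,
# so the first two and the last two years receive no trend value.
trend = dict(zip(years[2:-2], averages))
print(trend[1992], trend[2008])
```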

Illustration : Fit a trend line by the method of four-yearly moving average to the following time series data.

Year : 1995 1996 1997 1998 1999 2000 2001 2002

Sugar production (lakh tons) : 5 6 7 7 6 8 9 10

Year : 2003 2004 2005 2006

Sugar production (lakh tons) : 9 10 11 11

Solution :

Columns : (1) Year ; (2) sugar production (lakh tons) ; (3) 4-yearly moving total ; (4) 4-yearly moving average ; (5) 2-yearly moving total of Col. 4, to recentre ; (6) centred 2-yearly moving average (trend values)

1. 2. 3. 4. 5. 6.

1995 5

1996 6

1997 7 25 6.25 12.75 6.375

1998 7 26 6.50 13.50 6.75

1999 6 28 7.00 14.50 7.25

2000 8 30 7.50 15.75 7.875

2001 9 33 8.25 17.25 8.625

2002 10 36 9.00 18.50 9.25

2003 9 38 9.50 19.50 9.75

2004 10 40 10.00 20.25 10.125

2005 11 41 10.25

2006 11

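Remark : Observe carefully the placement of the totals and averages. Each 4-yearly total and average (Cols. 3 and 4) falls between two years; the value shown against a year here is the one centred between that year and the preceding year. The recentred values in Cols. 5 and 6 fall exactly on the year.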
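A minimal sketch of the recentring procedure, using the sugar production figures of the illustration. The function name is illustrative, and the sketch assumes an even period.

```python
def centred_moving_average(values, k):
    """Centred moving average for an even period k."""
    if k % 2 != 0:
        raise ValueError("use a plain moving average for odd k")
    # Step 1: k-period averages, each centred halfway between two periods.
    first = [sum(values[j:j + k]) / k for j in range(len(values) - k + 1)]
    # Step 2: a two-point average of neighbouring values recentres them on a period.
    return [(a + b) / 2 for a, b in zip(first, first[1:])]


sugar = [5, 6, 7, 7, 6, 8, 9, 10, 9, 10, 11, 11]
years = list(range(1995, 2007))

# With k = 4 the first two and last two years receive no trend value.
trend = dict(zip(years[2:-2], centred_moving_average(sugar, 4)))
print(trend)
```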

Merits

1. This is a very simple method.

2. The element of flexibility is always present in this method, as the calculations need not all be redone if some data are added. The new data only provide additional trend values.

3. If there is a coincidence of the period of moving averages and the period of cyclical fluctuations,

the fluctuations automatically disappear.

4. The pattern of the moving average is determined by the trend of the data and remains unaffected by the choice of method to be employed.

5. It can be put to utmost use in case of series having strikingly irregular trend.

Limitations

1. It is not possible to have a trend value for each and every year. As the period of moving average

increases, there is always an increase in the number of years for which trend values cannot be

calculated and known. For example, in a five yearly moving average, trend value cannot be

obtained for the first two years and last two years, in a seven yearly moving average for the first

three years and last three years and so on. But usually values of the extreme years are of great

interest.

2. There is no hard and fast rule for the selection of a period of moving average.

3. Forecasting is one of the leading objectives of trend analysis. But this objective remains unfulfilled because the moving average is not represented by a mathematical function.

4. Theoretically it is claimed that cyclical fluctuations are ironed out if the period of the moving average coincides with the period of the cycle, but in practice cycles are not perfectly periodic.

Trend by the Method of Semi-averages : This method can be used if a straight line trend is to be

obtained. Since the location of only two points is necessary to obtain a straight line equation, it is obvious that

we may select two representative points and connect them by a straight line. The data are divided into two halves and an average is obtained for each half. Each such average is plotted against the mid-point of its half period, which gives two points on a graph paper. By joining these points, a straight line trend is obtained.

The method is to be commended for its simplicity and is used to some extent in practical work. This method is also flexible, for it is permissible to select representative periods to determine the two points. Unrepresentative years may be ignored.
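The text gives no numerical example for semi-averages, so the sketch below reuses the steel production figures from the freehand illustration purely as sample input. The function name, and the convention of leaving out the middle year when the number of years is odd, are assumptions of this sketch.

```python
def semi_average_trend(years, values):
    """Return (a, b) of Y = a + b * (year - years[0]) through the two semi-averages."""
    n = len(values)
    half = n // 2
    # Assumption: with an odd number of years the middle year is left out.
    first_y, second_y = values[:half], values[n - half:]
    first_t, second_t = years[:half], years[n - half:]
    y1, y2 = sum(first_y) / half, sum(second_y) / half
    t1, t2 = sum(first_t) / half, sum(second_t) / half   # mid-points of each half
    b = (y2 - y1) / (t2 - t1)            # slope of the line joining the two points
    a = y1 - b * (t1 - years[0])         # trend value in the first year
    return a, b


years = list(range(1994, 2004))
steel = [20, 22, 30, 28, 32, 25, 29, 35, 40, 32]   # million tonnes

a, b = semi_average_trend(years, steel)
print(round(a, 2), round(b, 2))
```

Here the two points are (1996, 26.4) and (2001, 32.2), so the line rises by 1.16 million tonnes a year.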

Method of Least Squares : If a straight line fitted to the data will serve as a satisfactory trend, perhaps the most accurate method of fitting is that of least squares. This method is designed to accomplish two results.

(i) The sum of the vertical deviations from the straight line must equal zero.

(ii) The sum of the squares of all deviations must be less than the sum of the squares for any other

conceivable straight line.

There will be many straight lines which can meet the first condition. Among all different lines, only

one line will satisfy the second condition. It is because of this second condition that this method is known as

the method of least squares. It may be mentioned that a line fitted to satisfy the second condition, will

automatically satisfy the first condition.

The formula for a straight-line trend can most simply be expressed as

Yc = a + bX

where X represents the time variable, Yc is the dependent variable for which trend values are to be calculated, and a and b are the constants of the straight line to be found by the method of least squares.

Constant a is the Y-intercept. This is the distance between the origin (O) and the point where the trend line intersects the Y-axis. It shows the value of Y when X = 0. Constant b indicates the slope, which is the change in Y for each unit change in X.

Let us assume that we are given observations of Y for n years. We wish to find the values of constants a and b in such a manner that the two conditions laid down above are satisfied by the fitted equation.

Mathematical reasoning suggests that, to obtain the values of constants a and b according to the

Principle of Least Squares, we have to solve simultaneously the following two equations.

ΣY = na + bΣX ...(i)

ΣXY = aΣX + bΣX² ...(ii)

Solution of the two normal equations yields the following values for the constants a and b :

b = (nΣXY – ΣX ΣY) / (nΣX² – (ΣX)²)

and a = (ΣY – bΣX) / n

Least Squares Long Method : It makes use of the above-mentioned two normal equations without attempting to shift the time variable to a convenient mid-year. This method is illustrated by the following example.

Illustration : Fit a linear trend curve by the least-squares method to the following data :

Year Production (Kg.)

2001 3

2002 5

2003 6

2004 6

2005 8

2006 10

2007 11

2008 12

2009 13

2010 15

Solution : The first year 2001 is assumed to be 0, 2002 would become 1, 2003 would be 2 and so on. The

various steps are outlined in the following table.

Year Production (Y) X XY X²

1 2 3 4 5

2001 3 0 0 0

2002 5 1 5 1

2003 6 2 12 4

2004 6 3 18 9

2005 8 4 32 16

2006 10 5 50 25

2007 11 6 66 36

2008 12 7 84 49

2009 13 8 104 64

2010 15 9 135 81

Total 89 45 506 285

The above table yields the following values for various terms mentioned below :

n = 10, ΣX = 45, ΣX² = 285, ΣY = 89, and ΣXY = 506

Substituting these values in the two normal equations, we obtain

89 = 10a + 45b ...(i)

506 = 45a + 285b ...(ii)

Multiplying equation (i) by 9 and equation (ii) by 2, we obtain

801 = 90a + 405b ...(iii)

1012 = 90a + 570b ...(iv)

Subtracting equation (iii) from equation (iv), we obtain

211 = 165b or b = 211/165 = 1.28

Substituting the value of b in equation (i), we obtain

89 = 10a + 45 × 1.28

89 = 10a + 57.60

10a = 89 – 57.6

10a = 31.4

a = 31.4/10 = 3.14

Substituting these values of a and b in the linear equation, we obtain the following trend line

Yc = 3.14 + 1.28X

Inserting various values of X in this equation, we obtain the trend values as below :

Year Observed Y a b × X Yc (Col. 3 plus Col. 4)

1 2 3 4 5

2001 3 3.14 1.28 × 0 3.14

2002 5 3.14 1.28 × 1 4.42

2003 6 3.14 1.28 × 2 5.70

2004 6 3.14 1.28 × 3 6.98

2005 8 3.14 1.28 × 4 8.26

2006 10 3.14 1.28 × 5 9.54

2007 11 3.14 1.28 × 6 10.82

2008 12 3.14 1.28 × 7 12.10

2009 13 3.14 1.28 × 8 13.38

2010 15 3.14 1.28 × 9 14.66
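The long method can be reproduced with a few lines of code. This is a sketch: fit_line is an illustrative name, and the closed-form expressions are the solution of the two normal equations for a and b.

```python
def fit_line(x, y):
    """Solve the two normal equations for a and b in Yc = a + bX."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


production = [3, 5, 6, 6, 8, 10, 11, 12, 13, 15]   # Kg., 2001 to 2010
x = list(range(len(production)))                   # 2001 -> 0, 2002 -> 1, ...

a, b = fit_line(x, production)
trend = [a + b * xi for xi in x]
print(round(a, 3), round(b, 3))
```

Carried without rounding, b is 211/165 (about 1.279) and a is about 3.145. The hand calculation gives a = 3.14 because b is rounded to 1.28 before it is substituted.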

Least Squares Short-Cut Method : We can take any other year as the origin, and for that year X would be 0. Considerable saving of both time and effort is possible if the origin is taken in the middle of the whole time span covered by the entire series. The origin would then be located at the mean of the X values. Sum of the X values would then equal 0. The two normal equations would then be simplified to

ΣY = Na or a = ΣY / N ...(i)

and ΣXY = bΣX² or b = ΣXY / ΣX² ...(ii)

Two cases of the short-cut method are given below. In the first case there is an odd number of years, while in the second case the number of observations is even.

Illustration : Fit a straight line trend on the following data :

Year 1996 1997 1998 1999 2000 2001 2002 2003 2004

Y 4 7 7 8 9 11 13 14 17

Solution : Since we have 9 observations, therefore, the origin is taken at 2000 for which X is assumed to be 0.

Year Y X XY X²

1996 4 – 4 – 16 16

1997 7 – 3 – 21 9

1998 7 – 2 – 14 4

1999 8 – 1 – 8 1

2000 9 0 0 0

2001 11 1 11 1

2002 13 2 26 4

2003 14 3 42 9

2004 17 4 68 16

Total 90 0 88 60

Thus n = 9, ΣY = 90, ΣX = 0, ΣXY = 88, and ΣX² = 60

Substituting these values in the two normal equations, we get

90 = 9a or a = 90/9 or a = 10

88 = 60b or b = 88/60 or b = 1.47

∴ Trend equation is : Yc = 10 + 1.47 X

Inserting the various values of X, we obtain the trend values as below :

Years Observed Y X a b × X Yc (Col. 4 plus Col. 5)

1996 4 – 4 10 1.47 × – 4 = –5.88 4.12

1997 7 – 3 10 1.47 × – 3 = –4.41 5.59

1998 7 – 2 10 1.47 × – 2 = –2.94 7.06

1999 8 – 1 10 1.47 × – 1 = –1.47 8.53

2000 9 0 10 1.47 × 0 = 0 10.00

2001 11 1 10 1.47 × 1 = 1.47 11.47

2002 13 2 10 1.47 × 2 = 2.94 12.94

2003 14 3 10 1.47 × 3 = 4.41 14.41

2004 17 4 10 1.47 × 4 = 5.88 15.88

Illustration : Fit a straight line trend to the data which gives number of passenger cars sold (millions)

Year 2003 2004 2005 2006 2007 2008 2009 2010

No. of cars 6.7 5.3 4.3 6.1 5.6 7.9 5.8 6.1

(millions)

Solution : Here there are two mid-years, viz., 2006 and 2007. The mid-point of the two years is taken as the origin (X = 0) and a period of six months is treated as the unit of X. On this basis the calculations are as shown below :

Years Observed Y X XY X²

2003 6.7 – 7 – 46.9 49

2004 5.3 – 5 – 26.5 25

2005 4.3 – 3 – 12.9 9

2006 6.1 – 1 – 6.1 1

2007 5.6 1 5.6 1

2008 7.9 3 23.7 9

2009 5.8 5 29.0 25

2010 6.1 7 42.7 49

Total 47.8 0 8.6 168

From the above computations, we get the following values.

n = 8, ΣY = 47.8, ΣX = 0, ΣXY = 8.6, ΣX² = 168

Substituting these values in the two normal equations, we obtain

47.8 = 8a or a = 47.8/8 or a = 5.98

and 8.6 = 168b or b = 8.6/168 or b = 0.051

The equation for the trend line is : Yc = 5.98 + 0.051X

Trend values generated by this equation are below :

Years Observed Y X a b × X Yc (Col. 4 plus Col. 5)

2003 6.7 – 7 5.98 .051 × – 7 = –.357 5.623

2004 5.3 – 5 5.98 .051 × – 5 = –.255 5.725

2005 4.3 – 3 5.98 .051 × – 3 = –.153 5.827

2006 6.1 – 1 5.98 .051 × – 1 = –.051 5.929

2007 5.6 1 5.98 .051 × 1 = .051 6.031

2008 7.9 3 5.98 .051 × 3 = .153 6.133

2009 5.8 5 5.98 .051 × 5 = .255 6.235

2010 6.1 7 5.98 .051 × 7 = .357 6.337
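Both short-cut cases can be handled by one small routine. This is a sketch under the assumption that the years are consecutive; the function name is illustrative. X is coded in whole years for an odd number of observations and in half-years for an even number, as in the two illustrations.

```python
def fit_line_centred(y):
    """Least-squares line with the origin at the middle of the period.

    Odd number of years:  X = ..., -1, 0, 1, ...     (unit = one year)
    Even number of years: X = ..., -3, -1, 1, 3, ... (unit = half a year)
    """
    n = len(y)
    if n % 2 == 1:
        x = [j - n // 2 for j in range(n)]
    else:
        x = [2 * j - (n - 1) for j in range(n)]
    a = sum(y) / n                                    # from sum(Y) = N * a
    b = (sum(xi * yi for xi, yi in zip(x, y))
         / sum(xi * xi for xi in x))                  # from sum(XY) = b * sum(X^2)
    return a, b, x


# Odd case: 1996 to 2004
a1, b1, x1 = fit_line_centred([4, 7, 7, 8, 9, 11, 13, 14, 17])

# Even case: passenger cars sold (millions), 2003 to 2010
a2, b2, x2 = fit_line_centred([6.7, 5.3, 4.3, 6.1, 5.6, 7.9, 5.8, 6.1])

print(round(a1, 2), round(b1, 2), round(a2, 3), round(b2, 3))
```

Because the sum of X is zero, a is simply the mean of Y and b needs only two sums. In the even case b is the change per half-year, so the yearly change is 2b.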

Second Degree Parabola

The simplest example of a non-linear trend is the second degree parabola, the equation of which is written in the form :

Yc = a + bX + cX²

When numerical values for a, b and c have been derived, the trend value for any year may be computed by substituting in the equation the value of X for that year. The values of a, b and c can be determined by solving the following three normal equations simultaneously :

(i) ΣY = Na + bΣX + cΣX²

(ii) ΣXY = aΣX + bΣX² + cΣX³

(iii) ΣX²Y = aΣX² + bΣX³ + cΣX⁴

Note that the first equation is merely the summation of the given function, the second is the summation of X multiplied into the given function, and the third is the summation of X² multiplied into the given function.

When the time origin is taken at the middle of the period (the middle year, or midway between the two middle years), ΣX and ΣX³ would both be zero. In that case the equations are reduced to :

(i) ΣY = Na + cΣX²

(ii) ΣXY = bΣX²

(iii) ΣX²Y = aΣX² + cΣX⁴

The value of b can now be obtained directly from equation (ii), and the values of a and c by solving equations (i) and (iii) simultaneously. Thus,

b = ΣXY / ΣX²

c = (NΣX²Y – ΣX² ΣY) / (NΣX⁴ – (ΣX²)²)

a = (ΣY – cΣX²) / N

Illustration : The price of a commodity during 2000 – 2005 is given below. Fit a parabola Y = a + bX + cX² to this data. Estimate the price of the commodity for the year 2010 :

Year Price Year Price

2000 100 2003 140

2001 107 2004 181

2002 128 2005 192

Also plot the actual and trend values on graph.

Solution : To determine the value a, b and c, we solve the following normal equations:

ΣY = Na + bΣX + cΣX²

ΣXY = aΣX + bΣX² + cΣX³

ΣX²Y = aΣX² + bΣX³ + cΣX⁴

Taking 2002 as the origin (X = 0) :

Year Y X X² X³ X⁴ XY X²Y Yc

2000 100 – 2 4 – 8 16 – 200 400 97.724

2001 107 – 1 1 – 1 1 – 107 107 110.406

2002 128 0 0 0 0 0 0 126.660

2003 140 +1 1 +1 1 +140 140 146.486

2004 181 +2 4 +8 16 +362 724 169.884

2005 192 +3 9 +27 81 +576 1728 196.854

N = 6 ΣY = 848 ΣX = 3 ΣX² = 19 ΣX³ = 27 ΣX⁴ = 115 ΣXY = 771 ΣX²Y = 3099 ΣYc = 848.014

848 = 6a + 3b + 19c ...(i)

771 = 3a + 19b + 27c ...(ii)

3,099 = 19a + 27b + 115c ...(iii)

Multiplying Eqn. (ii) by 2 and subtracting Eqn. (i), we get

35b + 35c = 694 ...(iv)

Multiplying Eqn. (ii) by 19 and Eqn. (iii) by 3, and subtracting the latter from the former, we get

5352 = 280b + 168c ...(v)

Solving Eqns. (iv) and (v), we get

c = 1.786

Substituting the value of c in Eqn. (iv), we get

b = 18.04 [35b + (35 × 1.786) = 694]

Putting the values of b and c in Eqn. (i), we get

a = 126.66 [848 = 6a + (3 × 18.04) + (19 × 1.786)]

Thus a = 126.66, b = 18.04 and c = 1.786

Substituting the values in the equation

Yc = 126.66 + 18.04X + 1.786X²

When X = –2, Y = 126.66 + 18.04(–2) + 1.786(–2)²

= 126.66 – 36.08 + 7.144 = 97.724

When X = –1, Y = 126.66 + 18.04(–1) + 1.786(–1)²

= 126.66 – 18.04 + 1.786 = 110.406

When X = 0, Y = 126.66

When X = 1, Y = 126.66 + 18.04 + 1.786 = 146.486

When X = 2, Y = 126.66 + 18.04(2) + 1.786(2)²

= 126.66 + 36.08 + 7.144 = 169.884

When X = 3, Y = 126.66 + 18.04(3) + 1.786(3)²

= 126.66 + 54.12 + 16.074 = 196.854

Price for 2010 (X = 8) : Y = 126.66 + 18.04(8) + 1.786(8)²

= 126.66 + 144.32 + 114.304 = 385.284

Thus the likely price of the commodity for the year 2010 is Rs. 385.284.

The graph of the actual and trend values is given below :
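The three normal equations can also be solved mechanically. The sketch below uses Cramer's rule on the 3 × 3 system; the function names are illustrative, and any other method of solving simultaneous equations would serve equally well.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_parabola(x, y):
    """Solve the three normal equations for Yc = a + bX + cX^2 by Cramer's rule."""
    n = len(x)
    s1 = sum(x)
    s2 = sum(v ** 2 for v in x)
    s3 = sum(v ** 3 for v in x)
    s4 = sum(v ** 4 for v in x)
    sy = sum(y)
    sxy = sum(v * w for v, w in zip(x, y))
    sx2y = sum(v * v * w for v, w in zip(x, y))
    lhs = [[n, s1, s2], [s1, s2, s3], [s2, s3, s4]]
    rhs = [sy, sxy, sx2y]
    d = det3(lhs)
    coeffs = []
    for col in range(3):
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]       # replace one column by the right-hand side
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


price = [100, 107, 128, 140, 181, 192]   # 2000 to 2005
x = [-2, -1, 0, 1, 2, 3]                 # origin at 2002

a, b, c = fit_parabola(x, price)
forecast_2010 = a + b * 8 + c * 8 ** 2   # 2010 corresponds to X = 8
print(round(a, 3), round(b, 3), round(c, 3), round(forecast_2010, 2))
```

Without intermediate rounding the coefficients come out close to a = 126.66, b = 18.04 and c = 1.786, and the forecast for 2010 is about 385.29.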

Conversion of Annual Trend Equation to Monthly Trend Equation

Fitting a trend line by least squares to monthly data may be excessively time consuming. It is more

convenient to compute the trend equation from annual data and then convert this trend equation to a monthly

trend equation.

There are two possible situations: (i) the Y units are annual totals, for example, the total number of

passenger cars sold; (ii) the Y units are monthly averages, for example average monthly wholesale price

Index.

Where Data are Annual Totals

A trend equation operative on an annual level is to be reduced to a monthly level. Constant value,

a, is expressed in terms of annual Y values. To express it in terms of monthly values, we must divide it by

12. Similarly b is to be divided by 12 to convert the annual change to a monthly change. But this division

shows us only the change for any month of two consecutive years, whereas we want change for two

consecutive months. Therefore b is to be divided by 12 once again. Consequently, to convert annual trend

equation to a monthly trend equation, when the annual data are expressed as annual totals, we divide a by

12 and b by 144.

Where the Data are given as monthly averages per year

In this case, Y values are on a monthly level. Therefore, the a value remains unchanged in the conversion process. The b value in this case shows us the change on a monthly level, but from a month in one year to the corresponding month in the following year. Here, it is necessary only to convert the b value to make it measure the change between consecutive months, by dividing it by 12 only.
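The two conversion rules can be written as a small helper. This is a sketch: the function name and the annual equation Yc = 240 + 28.8X are hypothetical, chosen only to give round numbers, and the origin is left where it was, as in the text.

```python
def to_monthly(a, b, annual_totals=True):
    """Convert Yc = a + bX (X in years) to a monthly trend equation (X in months)."""
    if annual_totals:
        # Y values were yearly totals: divide a by 12 and b by 12 x 12 = 144.
        return a / 12, b / 144
    # Y values were monthly averages per year: a is unchanged, b is divided by 12.
    return a, b / 12


# Hypothetical annual equation, for illustration only: Yc = 240 + 28.8X
a_tot, b_tot = to_monthly(240, 28.8, annual_totals=True)
a_avg, b_avg = to_monthly(240, 28.8, annual_totals=False)
print(a_tot, round(b_tot, 4), a_avg, round(b_avg, 4))
```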

Merits

(i) This method has no place for subjectivity since it is a mathematical method of measuring trend.

(ii) This method gives the line of best fit because from this line the sum of the positive and negative

deviations is zero and the total of the squares of these deviations is minimum.

Limitations

The best practicable use of mathematical trends is for describing movements in time series. It does

not provide a clue to the causes of such movements. Therefore, forecasting on this basis may be quite risky.

Forecasting will be valid if there is a functional relationship between the variable under consideration and time

for a particular trend. But if trend describes the past behaviour, it hardly throws light on the causes which may

influence the future behaviour.

The other limitation is that if some items are added to the original data, a new equation has to be

obtained.

Curvilinear Trend

Sometimes, the time series may not be represented by a straight line trend. Such trends are known as curvilinear trends. If the curvilinear trend can be represented by a straight line on semi-log paper, or by polynomials of second or higher degree, or by a double logarithmic function, then the method of least squares is also applicable to such cases.

MEASUREMENT OF SEASONAL VARIATIONS

Seasonal variations are those rhythmic changes in the time series data that occur regularly each

year. They have their origin in climatic or institutional factors that affect either supply or demand or both. It

is important that these variations be measured accurately for three reasons. First, the investigator wants to eliminate seasonal variations from the data he is studying. Second, a precise knowledge of the seasonal pattern aids in planning future operations. Lastly, complete knowledge of seasonal variations is of use to those who are trying to remove the cause of seasonals or are attempting to mitigate the problem by diversification, offsetting opposing seasonal patterns, or some other means.

Since the number of calendar days and working days varies from month to month, it is essential to adjust the monthly figures if they are based on daily quantities. There is no need for such adjustment when we deal with either the volume of inventories or of bank deposits, because then the values are not influenced by the number of calendar days or working days.

Methods of Measuring Seasonal Variations

1. Method of Simple Averages (Weekly, Monthly or Quarterly).

2. Ratio-to-Trend Method.

3. Ratio-to-Moving Average Method.

4. Link Relatives Method.

Method of Simple Averages

This is the simplest method of obtaining a seasonal index. The following steps are necessary for

calculating the index :

(i) Arrange the unadjusted data by years and months, or by quarters if quarterly data are given.

(ii) Find totals of January, February etc.

(iii) Divide each total by the number of years for which data are given. For example, if we are given

monthly data for five years then we shall first obtain total for each month for five years and

divide each total by 5 to obtain an average.

(iv) Obtain an average of monthly averages by dividing the total of monthly averages by 12.

(v) Taking the average of monthly average as 100, compute the percentage of various monthly

averages as follows:

Seasonal Index for January = (Monthly average for January / Average of monthly averages) × 100

If, instead of the average of each month, the totals of each month are used, we will get the same result.

The following example shall illustrate the method.

Illustration : Monthly consumption of electric power, in millions of kilowatt-hours (kWh), for street lighting in

India during 1999-2003 is given below:

Year Jan. Feb. Mar. Apr. May June July Aug. Sept. Oct. Nov. Dec.

1999 318 281 278 250 231 216 223 245 269 302 325 347

2000 342 309 299 268 249 236 242 262 288 321 342 364

2001 367 328 320 287 269 251 259 284 309 345 367 394

2002 392 349 342 311 290 273 282 305 328 364 389 417

2003 420 378 370 334 314 296 305 330 356 396 422 452

Find out seasonal variation by the method of monthly averages.

Solution : COMPUTATION OF SEASONAL INDICES BY MONTHLY AVERAGES

Columns : (1) Month ; (2) to (6) consumption of electric power in 1999, 2000, 2001, 2002 and 2003 ; (7) monthly total for 5 years ; (8) five-yearly average ; (9) percentage

(1) (2) (3) (4) (5) (6) (7) (8) (9)

Jan. 318 342 367 392 420 1,839 367.8 116.1

Feb. 281 309 328 349 378 1,645 329.0 103.9

March 278 299 320 342 370 1,609 321.8 101.6

April 250 268 287 311 334 1,450 290.0 91.6

May 231 249 269 290 314 1,353 270.6 85.4

June 216 236 251 273 296 1,272 254.4 80.3

July 223 242 259 282 305 1,311 262.2 82.8

Aug. 245 262 284 305 330 1,426 285.2 90.1

Sept. 269 288 309 328 356 1,550 310.0 97.9

Oct. 302 321 345 364 396 1,728 345.6 109.1

Nov. 325 342 367 389 422 1,845 369.0 116.5

Dec. 347 364 394 417 452 1,974 394.8 124.7

Total 19,002 3,800.4 1,200

Average 1,583.5 316.7 100

The above calculations are explained below:

(i) Column No. 7 gives the total for each month for five years.

(ii) In column No. 8 each total of column No. 7 has been divided by 5 to obtain an average for each

month.

(iii) The average of monthly averages is obtained by dividing the total of monthly averages by 12.

(iv) In column No. 9 each monthly average has been expressed as percentage of the average of

monthly averages. Thus, the percentage for January = (367.8 / 316.7) × 100 = 116.1

Percentage for February = (329.0 / 316.7) × 100 = 103.9

If instead of monthly data, we are given weekly or quarterly data, we shall compute weekly or

quarterly averages by following the same procedure.
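The steps of the simple averages method can be sketched as below, using the street lighting figures. The function name is illustrative. The October 2001 value is taken as 345, which is what the monthly total of 1,728 implies.

```python
def seasonal_index_simple(data):
    """data: one list of period values per year. Returns one index per period."""
    n_years = len(data)
    periods = len(data[0])
    # Average of each month (or quarter) over the years.
    period_avg = [sum(year[p] for year in data) / n_years for p in range(periods)]
    # Average of the monthly averages, taken as 100.
    grand = sum(period_avg) / periods
    return [100 * v / grand for v in period_avg]


power = [
    [318, 281, 278, 250, 231, 216, 223, 245, 269, 302, 325, 347],  # 1999
    [342, 309, 299, 268, 249, 236, 242, 262, 288, 321, 342, 364],  # 2000
    [367, 328, 320, 287, 269, 251, 259, 284, 309, 345, 367, 394],  # 2001
    [392, 349, 342, 311, 290, 273, 282, 305, 328, 364, 389, 417],  # 2002
    [420, 378, 370, 334, 314, 296, 305, 330, 356, 396, 422, 452],  # 2003
]

index = seasonal_index_simple(power)
print([round(v, 1) for v in index])
```

The same function accepts quarterly or weekly data, since it only relies on each year having the same number of periods.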

Ratio-to-moving average method : The method of monthly totals or monthly averages does not

give any consideration to the trend which may be present in the data. The ratio-to-moving-average method is

one of the simplest of the commonly used devices for measuring seasonal variation which takes the trend into

consideration: The steps to compute seasonal variation are as follows :

(i) Arrange the unadjusted data by years and months.

(ii) Compute the trend values by the method of moving averages. For this purpose take 12 month

moving average followed by a two-month moving average to recentre the trend values.

(iii) Express the data for each month as a percentage ratio of the corresponding moving-average

trend value.

(iv) Arrange these ratios by months and years.

(v) Aggregate the ratios for January, February etc.

(vi) Find the average ratio for each month.

(vii) Adjust the average monthly ratios found in step (vi) so that they will themselves average 100

percent. These adjusted ratios will be the seasonal indices for various months.

A seasonal index computed by the ratios-to-moving-average method ordinarily does not fluctuate so

much as the index based on straight-line trends. This is because the 12-month moving average follows the

cyclical course of the actual data quite closely. Therefore the index ratios obtained by this method are often

more representative of the data from which they are obtained than is the case in the ratio-to-trend method

which will be discussed later on.

Illustration : Prepare a monthly seasonal index from the following data, using moving averages method :

Monthly Sales of XYZ Products Co. Ltd. (Rs.)

Month 2000 2001 2002

January 3,639 3,913 4,393

February 3,591 3,856 4,530

March 3,326 3,714 4,287

April 3,469 3,820 4,405

May 3,321 3,647 4,024

June 3,320 3,498 3,992

July 3,205 3,476 3,795

August 3,205 3,354 3,492

September 3,255 3,594 3,571

October 3,550 3,830 3,923

November 3,771 4,183 3,984

December 3,772 4,482 3,880

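Solution : Computation of Ratios to 12-month Centred Moving Averages for Sales (Rs.)

Columns : (1) Year and month ; (2) sales (Rs.) ; (3) 12-month moving total ; (4) 12-month moving average ; (5) centred 12-month moving average ; (6) ratio to moving average (per cent). Cols. 3 and 4 fall between two months and are printed on a line of their own.

1 2 3 4 5 6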

2000

Jan. 3,639

Feb. 3,591

March 3,326

April 3,469

May 3,321

June 3,320

41,424 3,452

July 3,205 3,463 92.55

41,698 3,475

Aug. 3,205 3,486 91.94

41,963 3,497

Sept. 3,255 3,513 92.66

42,351 3,529

Oct. 3,550 3,543 100.20

42,702 3,558

Nov. 3,771 3,572 105.57

43,028 3,586

Dec. 3,772 3,593 104.98

2001 43,206 3,601

Jan. 3,913 3,612 108.33

43,477 3,623

Feb. 3,856 3,630 106.23

43,626 3,636

March 3,714 3,650 101.75

43,965 3,664

April 3,820 3,675 103.95

44,245 3,687

May 3,647 3,704 98.46

44,657 3,721

June 3,498 3,751 93.26

45,367 3,781

July 3,476 3,801 91.45

45,847 3,821

Aug. 3,354 3,849 87.14

46,521 3,877

Sept. 3,594 3,901 92.13

47,094 3,925

Oct. 3,830 3,949 96.99

47,679 3,973

Nov. 4,183 3,989 104.86

48,056 4,005

Dec. 4,482 4,025 111.35

2002 48,550 4,046

Jan. 4,393 4,059 108.23

48,869 4,072

Feb. 4,530 4,078 111.08

49,007 4,084

March 4,287 4,083 105.00

48,984 4,082

April 4,405 4,086 107.81

49,077 4,090

May 4,024 4,081 98.60

48,878 4,073

June 3,992 4,048 98.62

48,276 4,023

July 3,795

Aug. 3,492

Sept. 3,571

Oct. 3,923

Nov. 3,984

Dec. 3,880

Arranging the ratios to moving average by months and years we obtain the following table from

which the seasonal index for each month is also obtained.

Computation of Seasonal Index by Ratios to Moving Averages of XYZ Products Co. Ltd.

Month 2000 2001 2002 Total Average Seasonal Index

January — 108.33 108.23 216.56 108.28 107.7

February — 106.23 111.08 217.31 108.65 108.1

March — 101.75 105.00 206.75 103.37 102.8

April — 103.95 107.81 211.76 105.88 105.3

May — 98.46 98.60 197.06 98.53 98.0

June — 93.26 98.62 191.88 95.94 95.4

July 92.55 91.45 — 184.00 92.00 91.5

August 91.94 87.14 — 179.08 89.54 89.0

September 92.66 92.13 — 184.79 92.40 91.9

October 100.20 96.99 — 197.19 98.60 98.1

November 105.57 104.86 — 210.43 105.21 104.6

December 104.98 111.35 — 216.33 108.16 107.6

Total of Monthly Averages 1206.56

Average of Monthly Averages 100.55

Putting the average of monthly averages as 100, the monthly averages have been adjusted to obtain the seasonal index for each month.

For example, Seasonal Index for January = (108.28 / 100.55) × 100 = 107.7

For February = (108.65 / 100.55) × 100 = 108.1
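The whole ratio-to-moving-average procedure can be sketched in code with the XYZ sales figures. The function name is illustrative, and the sketch assumes the series starts in the first month of a year. Because the moving averages are not rounded to whole rupees here, the indices may differ from the table in the first decimal place.

```python
def ratio_to_moving_average(series, period=12):
    """series: values in time order, starting at the first period of a year."""
    k, half = period, period // 2
    # k-period moving averages, each centred between two periods.
    ma = [sum(series[j:j + k]) / k for j in range(len(series) - k + 1)]
    # Two-point averages recentre them; centred[j] belongs to position j + half.
    centred = [(p + q) / 2 for p, q in zip(ma, ma[1:])]
    ratios = {}
    for j, trend in enumerate(centred):
        pos = j + half
        ratios.setdefault(pos % k, []).append(100 * series[pos] / trend)
    # Average ratio for each month, then adjust so the averages themselves average 100.
    averages = [sum(ratios[m]) / len(ratios[m]) for m in range(k)]
    scale = 100 * k / sum(averages)
    return [v * scale for v in averages]


sales = [
    3639, 3591, 3326, 3469, 3321, 3320, 3205, 3205, 3255, 3550, 3771, 3772,  # 2000
    3913, 3856, 3714, 3820, 3647, 3498, 3476, 3354, 3594, 3830, 4183, 4482,  # 2001
    4393, 4530, 4287, 4405, 4024, 3992, 3795, 3492, 3571, 3923, 3984, 3880,  # 2002
]

index = ratio_to_moving_average(sales)
print([round(v, 1) for v in index])
```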

Merits

This method is more widely used in practice than other methods. The index calculated by the ratio-

to-moving average method does not fluctuate very much. The 12-month moving average follows the cyclical

course of the actual data closely. So index ratios are the true representative of the data from which they have

been obtained.

Limitations

Seasonal ratios cannot be calculated for every month for which data are available. When a four-month moving average is taken, 2 months in the beginning and 2 months at the end are left out; with the 12-month moving average used above, the first six and the last six months are left out.

The ratio-to-trend method : The ratio-to-trend method is similar to ratio-to-moving-average method.

The only difference is the way of obtaining the trend values. Whereas in the ratio-to-moving-average method,

the trend values are obtained by the method of moving averages, in the ratio-to-trend method, the corresponding trend is obtained by the method of least squares.

The steps in the calculation of seasonal variation are as follows :

(i) Arrange the unadjusted data by years and months.

(ii) Compute the trend values for each month with the help of least squares equation.

(iii) Express the data for each month as a percentage ratio of the corresponding trend value.

(iv) Aggregate the January’s ratios, February’s ratios, etc., computed previously

(v) Find the average ratio for each month.

(vi) Adjust the average ratios found in step (v) so that they will themselves average 100 per cent.

The last step gives us the seasonal index for each month.

Sometimes the median is used in place of the arithmetic average of the ratios-to-trend. The choice

depends upon circumstances but there is a preference for the median if several erratic ratios are found. In

fact, if a fairly large number of years, say, 20 or 15, are used in the computation, it is not uncommon to omit

extremely erratic ratios from the computation of average of monthly ratios. Only the arithmetic average

should be used for small number of years.

This method has the advantage of simplicity and ease of interpretation. Although it makes allowance for the trend, it may be influenced by errors in the calculation of the trend. The method may also be influenced by cyclical and erratic influences. This source of possible error is eliminated by the selection of a period of time in which depression is offset by prosperity.

Illustration : Find seasonal variations by the ratio-to-trend method from the following data :

Year 1st Quarter 2nd Quarter 3rd Quarter 4th Quarter

2000 30 40 36 34

2001 34 52 50 44

2002 40 58 54 48

2003 54 76 68 62

2004 80 92 86 82

Solution : To find seasonal variations by the ratio-to-trend method, the trend is first fitted to the yearly data (the yearly averages of the quarterly values), and the yearly trend values are then converted into quarterly trend values.

Year     Yearly    Average of quarterly    X      XY      X²    Trend values
         totals    values (Y)                                   Y = 56 + 12X
2000     140       35                      –2     –70     4     56 + 12(–2) = 32
2001     180       45                      –1     –45     1     56 + 12(–1) = 44
2002     200       50                      0      0       0     56 + 12(0) = 56
2003     260       65                      1      65      1     56 + 12(1) = 68
2004     340       85                      2      170     4     56 + 12(2) = 80
Total    1120      280                     0      +120    10    280

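Y = a + bX, where

$a = \frac{\Sigma Y}{N} = \frac{280}{5} = 56$

$b = \frac{\Sigma XY}{\Sigma X^2} = \frac{120}{10} = 12$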

The trend value for the middle of 2000, i.e., the point that falls between the 2nd and 3rd quarters, is 32.

Quarterly increment is : $\frac{12}{4} = 3$

Therefore, the trend value for the 2nd quarter will be $32 - \frac{3}{2} = 30.5$

The trend value for the 3rd quarter is $32 + \frac{3}{2} = 33.5$

Similarly other values will be calculated.

Quarterly Trend Values

Year    1st     2nd     3rd     4th     Total
2000    27.5    30.5    33.5    36.5    128
2001    39.5    42.5    45.5    48.5    176
2002    51.5    54.5    57.5    60.5    224
2003    63.5    66.5    69.5    72.5    272
2004    75.5    78.5    81.5    84.5    320

Now we express each quarterly figure as a percentage of the corresponding quarterly trend value :

Year       1st      2nd       3rd       4th
2000       109.1    131.1     107.5     93.1
2001       86.1     122.4     109.9     90.7
2002       77.7     106.4     93.9      79.3
2003       85.0     114.3     97.8      85.5
2004       106.0    117.2     105.5     97.0
Total      463.9    591.4     514.6     445.6
Average    92.78    118.28    102.92    89.12

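The average of the quarterly averages of the percentages of trend : $\frac{92.78 + 118.28 + 102.92 + 89.12}{4} = \frac{403.10}{4} = 100.775$

Quarterly seasonal index for 1st Quarter : $\frac{92.78}{100.775} \times 100 = 92.07$

Quarterly seasonal index for 2nd Quarter : $\frac{118.28}{100.775} \times 100 = 117.37$

Quarterly seasonal index for 3rd Quarter : $\frac{102.92}{100.775} \times 100 = 102.13$

Quarterly seasonal index for 4th Quarter : $\frac{89.12}{100.775} \times 100 = 88.43$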

The total of seasonal indices should be equal to 400 and that for monthly indices should be 1200.
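The ratio-to-trend steps used in this illustration can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `ratio_to_trend_indices` is hypothetical, an odd number of years is assumed so that the origin falls on the middle year, and the 2001 third-quarter figure is taken as 50, the value consistent with the yearly total of 180 used in the solution. Because the code does not round intermediate ratios, its results may differ from the hand-computed indices in the second decimal place.

```python
# Quarterly data from the illustration. Assumption: 2001 Q3 is taken as 50,
# which agrees with the yearly total of 180 used in the worked solution.
data = {
    2000: [30, 40, 36, 34],
    2001: [34, 52, 50, 44],
    2002: [40, 58, 54, 48],
    2003: [54, 76, 68, 62],
    2004: [80, 92, 86, 82],
}


def ratio_to_trend_indices(data):
    """Seasonal indices by the ratio-to-trend method (quarterly data)."""
    years = sorted(data)
    n = len(years)                      # odd number of years assumed
    mid = years[n // 2]
    x = [yr - mid for yr in years]      # X measured from the middle year
    y_avg = [sum(data[yr]) / 4 for yr in years]

    # Least squares line Y = a + bX (sum of X is zero by construction).
    a = sum(y_avg) / n
    b = sum(xi * yi for xi, yi in zip(x, y_avg)) / sum(xi * xi for xi in x)
    step = b / 4                        # quarterly increment

    ratios = []
    for xi, yr in zip(x, years):
        centre = a + b * xi             # trend at the middle of the year
        trend = [centre + step * (q - 1.5) for q in range(4)]
        ratios.append([100 * obs / t for obs, t in zip(data[yr], trend)])

    averages = [sum(r[q] for r in ratios) / n for q in range(4)]
    grand = sum(averages) / 4
    # Adjust so that the four indices average 100 (total 400).
    return [100 * avg / grand for avg in averages]


indices = ratio_to_trend_indices(data)
print([round(v, 2) for v in indices])
```

The adjustment in the last step guarantees that the four indices total 400, whatever the data.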

Merits

(i) This method is based on a logical procedure for measuring seasonal variations. This procedure

has an advantage over the moving average method for it has a ratio to trend value for each

month for which data is available. So this method avoids loss of data which is inherent in the

case of moving averages. If the period of time series is very short then the advantage becomes

more prominent.

(ii) It is a simple method.

(iii) It is easy to understand.

Limitations :

If the cyclical changes in the time series are very wide, a least squares trend can never follow the actual data as closely as a 12-month moving average does. A seasonal index computed by the ratio-to-trend method is therefore likely to be more biased than one computed by the ratio-to-moving-average method.

4. Link Relatives Method

Among all the methods of measuring seasonal variation, link relatives method is the most difficult

one. When this method is adopted the following steps are taken to calculate the seasonal variation indices :

(i) Calculate the link relatives of the seasonal figures. Link relatives are calculated by dividing the

figure of each season* by the figure of immediately preceding season and multiplying it by 100.


These percentages are called link relatives since they link each month (or quarter or other time

period) to the preceding one.

(ii) Calculate the average of the link relatives for each season. While calculating the average we

might take arithmetic average but median is probably better. The arithmetic average would give

undue weight to extreme cases which were not primarily due to seasonal influences.

(iii) Convert these averages into chain relatives on the base of the first season.

(iv) Calculate the chain relatives of the first season on the basis of the last season. There will be

some difference between the chain relative of the first season and the chain relative calculated

by the previous method. This difference will be due to long-term changes. It is therefore neces-

sary to correct these chain relatives.

(v) For correction, the chain relative of the first season calculated by first method is deducted from

the chain relative (of the first season) calculated by the second method. The difference is

divided by the number of seasons. The resulting figure multiplied by 1,2,3 (and so on) is de-

ducted respectively from the chain relatives of the 2nd, 3rd, 4th (and so on) seasons. These are

the corrected chain relatives.

(vi) Express the corrected chain relatives as percentage of their averages. These provide the re-

quired seasonal indices by the method of link relatives.

The following example will illustrate the process.

Illustration : Apply method of link relatives to the following data and calculate seasonal indices.

QUARTERLY FIGURES

Quarter 1999 2000 2001 2002 2003

I 6.0 5.4 6.8 7.2 6.6

II 6.5 7.9 6.5 5.8 7.3

III 7.8 8.4 9.3 7.5 8.0

IV 8.7 7.3 6.4 8.5 7.1

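Solution : CALCULATION OF SEASONAL INDICES BY METHOD OF LINK RELATIVES

                                       Quarter
Year                         I         II         III        IV
1999                         -         108.3      120.0      111.5
2000                         62.1      146.3      106.3      86.9
2001                         93.2      95.6       143.1      68.8
2002                         112.5     80.6       129.3      113.3
2003                         77.6      110.6      109.6      88.8
Arithmetic average           86.35     108.28     121.66     93.86
Chain relative               100       108.28     131.73     123.64
Corrected chain relative     100       106.605    128.38     118.615
Seasonal indices             88.18     94.0       113.21     104.60

Arithmetic averages : 345.4/4 = 86.35, 541.4/5 = 108.28, 608.3/5 = 121.66, 469.3/5 = 93.86

Chain relatives : 100, (100 × 108.28)/100 = 108.28, (108.28 × 121.66)/100 = 131.73, (131.73 × 93.86)/100 = 123.64

Corrected chain relatives : 100, 108.28 – 1.675 = 106.605, 131.73 – 3.35 = 128.38, 123.64 – 5.025 = 118.615

Seasonal indices : (100/113.4) × 100 = 88.18, (106.605/113.4) × 100 = 94.0, (128.38/113.4) × 100 = 113.21, (118.615/113.4) × 100 = 104.60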

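The correction factor is calculated as follows :

Chain relative of the first quarter (on the basis of the first quarter) = 100

Chain relative of the first quarter (on the basis of the last quarter) = $\frac{86.35 \times 123.64}{100} \approx 106.7$

Difference between these chain relatives = 106.7 – 100 = 6.7

Difference per quarter = $\frac{6.7}{4} = 1.675$

Adjusted chain relatives are obtained by subtracting 1 × 1.675, 2 × 1.675, 3 × 1.675 from the chain relatives of the 2nd, 3rd and 4th quarters, respectively.

Seasonal variation indices are calculated as below :

Seasonal variation index = $\frac{\text{Corrected chain relative}}{\text{Average of corrected chain relatives}} \times 100$, where the average of the corrected chain relatives is (100 + 106.605 + 128.38 + 118.615)/4 = 113.4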
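The link relatives procedure can also be traced in code. The following is a minimal sketch rather than a definitive implementation: the function name `link_relative_indices` is hypothetical, the arithmetic average is used for step (ii), and the second-quarter figures for 2001 and 2003 are taken as 6.5 and 7.3, the values consistent with the link relatives shown in the solution. Intermediate values are not rounded, so the results agree with the hand calculation only to about one decimal place.

```python
# Quarterly figures, 1999 to 2003. Assumption: quarter II of 2001 and 2003
# is taken as 6.5 and 7.3, matching the link relatives in the solution.
quarters = {
    "I":   [6.0, 5.4, 6.8, 7.2, 6.6],
    "II":  [6.5, 7.9, 6.5, 5.8, 7.3],
    "III": [7.8, 8.4, 9.3, 7.5, 8.0],
    "IV":  [8.7, 7.3, 6.4, 8.5, 7.1],
}


def link_relative_indices(table):
    """Seasonal indices by the method of link relatives."""
    names = list(table)
    k = len(names)                          # number of seasons
    n_years = len(table[names[0]])

    # Put the figures in chronological order: year by year, season by season.
    series = [table[q][y] for y in range(n_years) for q in names]

    # Step (i): link relative = figure / preceding figure * 100.
    links = [[] for _ in range(k)]
    for i in range(1, len(series)):
        links[i % k].append(100 * series[i] / series[i - 1])

    # Step (ii): average link relative for each season.
    avg = [sum(values) / len(values) for values in links]

    # Step (iii): chain relatives on the base of the first season.
    chain = [100.0]
    for s in range(1, k):
        chain.append(avg[s] * chain[s - 1] / 100)

    # Steps (iv) and (v): second chain relative of the first season,
    # and the correction spread evenly over the seasons.
    second_first = avg[0] * chain[-1] / 100
    d = (second_first - 100) / k
    corrected = [chain[s] - s * d for s in range(k)]

    # Step (vi): express as percentages of their average.
    mean = sum(corrected) / k
    return [100 * c / mean for c in corrected]


seasonal = link_relative_indices(quarters)
print([round(v, 2) for v in seasonal])
```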

Meaning of “Normal” in Business Statistics

Business is often said to be “above normal” or “below normal”. When so used the term “normal” is

generally recognized to mean a level of activity which is characterized by the presence of basic trend and

seasonal variation. This implies that the influence of business cycles and erratic fluctuations on the level of

activity is assumed to be insignificant. Therefore, the product of trend value for any period when adjusted by

the seasonal index for that period gives us an estimate of the normal activity during that period.

Measuring Cycle as the residual

Business cyclical variations are measured as the difference between the observed value and

the “normal”. Whatever remains after elimination of secular trend and seasonal variations from the time

series, is said to be composed of cyclical variations and Irregular movements.

Second degree Parabola

The simplest form of the non-linear trend is the second degree parabola. It is used to find long term

trend. We use the following equation for finding second degree trend –

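$Y_c = a + bX + cX^2$

To know the value of a, b and c we use the following three normal equations –

$\Sigma Y = Na + b\Sigma X + c\Sigma X^2$

$\Sigma XY = a\Sigma X + b\Sigma X^2 + c\Sigma X^3$

$\Sigma X^2Y = a\Sigma X^2 + b\Sigma X^3 + c\Sigma X^4$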

A second degree trend equation is appropriate for the secular trend component of a time series when

the data do not fall in a straight line.

Illustration: Fit a parabola (Yc = a + bX + cX2) from the following

Years 1 2 3 4 5 6 7

Values 35 38 40 42 36 39 45


Solution :

Year    X     X²    X³     X⁴    Y     XY      X²Y    Trend value (Yc = a + bX + cX²)
1       –3    9     –27    81    35    –105    315    39.09 – 3 + 0.05 × 9 = 36.54
2       –2    4     –8     16    38    –76     152    39.09 – 2 + 0.05 × 4 = 37.29
3       –1    1     –1     1     40    –40     40     39.09 – 1 + 0.05 × 1 = 38.14
4       0     0     0      0     42    0       0      39.09 + 0 + 0.05 × 0 = 39.09
5       1     1     1      1     36    36      36     39.09 + 1 + 0.05 × 1 = 40.14
6       2     4     8      16    39    78      156    39.09 + 2 + 0.05 × 4 = 41.29
7       3     9     27     81    45    135     405    39.09 + 3 + 0.05 × 9 = 42.54

N = 7    ΣX = 0    ΣX² = 28    ΣX³ = 0    ΣX⁴ = 196    ΣY = 275    ΣXY = 28    ΣX²Y = 1104

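$Y_c = a + bX + cX^2$

Equations to compute the values of a, b and c :

$\Sigma Y = Na + b\Sigma X + c\Sigma X^2$

$\Sigma XY = a\Sigma X + b\Sigma X^2 + c\Sigma X^3$

$\Sigma X^2Y = a\Sigma X^2 + b\Sigma X^3 + c\Sigma X^4$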

Putting the value in the above equation we get

275 = 7a + b(0) + c(28)

28 = a(0) + b(28) + c(0)

1104 = a(28) + b(0) + c(196)

or 7a + 28c = 275 ...(i)

28 b = 28 ...(ii)

28a + 196c = 1104 ...(iii)

Multiplying equation (i) by 4 gives equation (iv). Subtracting equation (iii), restated as (v), from (iv) we get

28a + 112c = 1100 ...(iv)

28a + 196c = 1104 ...(v)

– 84c = – 4

c = 4/84 = 0.05 (approximately)

By substituting the value of c in equation (i) we get the value of a

7a + 28 × 4/84 = 275

7a = 275 – 1.33

a = 273.67/7 = 39.09


We may get the value of b with the help of equation (ii)

28b = 28

b = 1

The required equation would be:

$Y_c = 39.09 + 1X + 0.05X^2 = 39.09 + X + 0.05X^2$

With the help of the above equation we can estimate the value for year 8, where X = 4 :

$Y_c = 39.09 + 4 + 0.05(4)^2 = 39.09 + 4 + 0.8 = 43.89$
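The solution of the normal equations can be checked with a short Python sketch. It assumes, as in the illustration, an odd number of equally spaced years with X measured from the middle year, so that ΣX and ΣX³ vanish; the function name `fit_parabola` is hypothetical. The code keeps c unrounded (4/84 ≈ 0.0476), so its forecast for year 8 comes out slightly below the 43.89 obtained with c rounded to 0.05.

```python
def fit_parabola(ys):
    """Fit Yc = a + bX + cX^2 by least squares, X centred on the middle year."""
    n = len(ys)                                   # odd n assumed
    xs = [i - n // 2 for i in range(n)]           # ..., -1, 0, 1, ...
    s2 = sum(x ** 2 for x in xs)                  # sum of X^2
    s4 = sum(x ** 4 for x in xs)                  # sum of X^4
    sy = sum(ys)                                  # sum of Y
    sxy = sum(x * y for x, y in zip(xs, ys))      # sum of XY
    sx2y = sum(x * x * y for x, y in zip(xs, ys)) # sum of X^2 Y

    # With sum X = sum X^3 = 0 the normal equations reduce to:
    #   sy   = n*a  + c*s2
    #   sxy  = b*s2
    #   sx2y = a*s2 + c*s4
    b = sxy / s2
    c = (n * sx2y - s2 * sy) / (n * s4 - s2 * s2)
    a = (sy - c * s2) / n
    return a, b, c


values = [35, 38, 40, 42, 36, 39, 45]
a, b, c = fit_parabola(values)
estimate_year_8 = a + b * 4 + c * 4 ** 2
print(round(a, 2), round(b, 2), round(c, 4), round(estimate_year_8, 2))
```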

Exponential Trend

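The equation for exponential trend is of the form: $y = ab^x$

Taking log of both sides we get $\log y = \log a + x \log b$

To get the values of a and b we have the normal equations

$\Sigma \log y = N \log a + \log b \, \Sigma x$

$\Sigma (x \log y) = \log a \, \Sigma x + \log b \, \Sigma x^2$

When the origin is chosen so that Σx = 0, solving these equations gives –

$\log a = \frac{\Sigma \log y}{N}$ and $\log b = \frac{\Sigma (x \log y)}{\Sigma x^2}$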

Illustration : The production of certain raw material by a company in lakh tons for the years 1996 to 2002

are given below:

Year : 1996 1997 1998 1999 2000 2001 2002

Production : 32 47 65 92 132 190 275

Estimate the production figure for the year 2003 using an equation of the form $y = ab^x$, where x = years and y = production.

Solution :

Year    Production (y)    x     log y     x²    x log y
1996    32                –3    1.5051    9     –4.5153
1997    47                –2    1.6721    4     –3.3442
1998    65                –1    1.8129    1     –1.8129
1999    92                0     1.9638    0     0
2000    132               1     2.1206    1     2.1206
2001    190               2     2.2788    4     4.5576
2002    275               3     2.4393    9     7.3179

Σx = 0    Σ log y = 13.7926    Σx² = 28    Σ x log y = 4.3237

$\log a = \frac{\Sigma \log y}{N} = \frac{13.7926}{7} = 1.9704$

$\log b = \frac{\Sigma (x \log y)}{\Sigma x^2} = \frac{4.3237}{28} = 0.154$


log y = 1.9704 + .154 x

for 2003, x would be 4 and log y will be

log y = 1.9704 + .154(4) = 2.5864

y = antilog (2.5864) = 385.9

Thus estimated production for 2003 would be 385.9 lakh tons.
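The same fit can be reproduced with logarithms computed by machine instead of from log tables. The sketch below assumes an odd number of equally spaced years with x measured from the middle year (so Σx = 0); the function name `fit_exponential` is hypothetical. Since log b is not rounded to 0.154 here, the forecast comes out a little higher than the 385.9 obtained by hand.

```python
import math


def fit_exponential(ys):
    """Fit y = a * b**x through logs, with x centred on the middle year."""
    n = len(ys)                                  # odd n assumed
    xs = [i - n // 2 for i in range(n)]
    logs = [math.log10(y) for y in ys]
    # With sum x = 0 the normal equations give these two ratios directly.
    log_a = sum(logs) / n
    log_b = sum(x * l for x, l in zip(xs, logs)) / sum(x * x for x in xs)
    return log_a, log_b


production = [32, 47, 65, 92, 132, 190, 275]     # 1996 to 2002, lakh tons
log_a, log_b = fit_exponential(production)

# 2003 corresponds to x = 4; taking the antilog is raising 10 to the power.
forecast_2003 = 10 ** (log_a + log_b * 4)
print(round(log_a, 4), round(log_b, 4), round(forecast_2003, 1))
```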


LESSON 9

THEORY OF PROBABILITY

Most of the decision-making situations in business management involve uncertainty. Since uncer-

tainty is present and is an important aspect in determining the consequences of various alternative courses

of action, it is imperative to get proper appreciation of it, draw a mathematical picture of it and attempt to

measure it in numerical terms. There are many advantages in having a numerical measure for uncertainty.

Besides facilitating understanding and allowing analysis, it helps in communication between executives.

Verbally, a manager at a meeting might indicate that he is “fairly sure” about the success of a particular

project. This phrase might mean something quite different to the other executives at the meeting. ‘Fairly

sure’ might mean that success will occur 9 out of 10 times to one decision maker (implying that he is 90

percent sure) while the same phrase might indicate 7 times out of 10 to another. Numbers remove such

confusion. Besides this, an important advantage of a numerical measure is the ability to use mathematics

for analysis. Uncertainty is expressed in numerical terms by the theory of probability, as probability is at once the language and the measure of uncertainty. In this lesson, we are going to study how probability also provides a foundation for the whole of the analytical statistics that we are going to learn in the course of these lessons.

Probability Foundations

The theory of probability takes on practical value when it is defined in relation to an experiment. Such an experiment might be tossing a coin, drawing a card from a standard deck of playing cards, tossing a six-faced die, observing the number of defectives in a lot of electric bulbs, tossing a pair of dice, drawing a ball from an urn containing balls, and so on. Once the experiment has been defined, all possible outcomes from the experiment are identified. This exhaustive set of outcomes constitutes the sample space, S. The sample space is a key concept and an important base of probability theory.

One of the simplest sample spaces can be the set of outcomes when a pair of coins is tossed. It consists of four outcomes which can be conveniently represented as :

S = {HH, HT, TH, TT}

where H denotes a head and T denotes a tail.

Fig. 1 Sample space for pair of dice experiment

We can consider the case of a manufacturer who produces electric bulbs in large batches. From

each batch, a sample of 80 items is selected at random, and the number of defective items is recorded.

Although the number of defectives in any sample cannot be predicted with certainty, all of the possible

outcomes may be known. The number of defective items in a sample can be any integer from 0 to 80. Here

the sample space is:

S = {0,1, 2, 3,.......80}

In the same manner, when a pair of dice is tossed, a total of 36 outcomes are possible. This can be represented as shown in Figure 1.

In all the three examples, the number of outcomes from the experiment is known to be finite. This is so in most cases, but it is not a rule. The number of outcomes can be infinite as well. For example, if we

consider the experiment of observing the life-time of an electric bulb in hours, the outcome can be any real,

non-negative number. Thus, this sample space contains an infinite number of sample points.
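The three finite sample spaces described above are small enough to enumerate mechanically. The following Python sketch lists them using the standard `itertools.product` function; the variable names are illustrative only.

```python
from itertools import product

# Pair of coins: every ordered pair of H and T.
coins = ["".join(outcome) for outcome in product("HT", repeat=2)]

# Pair of dice: every ordered pair of faces 1 to 6.
dice = list(product(range(1, 7), repeat=2))

# Defective items in a sample of 80: any integer from 0 to 80.
defectives = list(range(0, 81))

print(coins)            # ['HH', 'HT', 'TH', 'TT']
print(len(dice))        # 36
print(len(defectives))  # 81
```

The life-time of a bulb cannot be listed in this way, because its sample space is the set of all non-negative real numbers.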


Events

An event refers to any set of possible outcomes in a sample space. If the sample space for an experiment has the elements $S_1, S_2, S_3, \ldots, S_n$, an event in the sample space S would be any one, or any collection, of $S_1, S_2, S_3, \ldots, S_n$. In a sample space, every combination of sample points may be defined as an event. In an experiment of counting defective items in a sample, the set of all possible outcomes having less than 10 defective items can be represented by the event A.

Each sample point does not have to identify a separate event. The faces of a die provide a sample

space of six outcomes. If the occurrence of each face identifies a different event, there are six possible

events. On the other hand, suppose that an even number represents a gain of Rs. 100 to a person X and an

odd number represents a loss of Rs. 100 to the person Y. In this case, there are six outcomes but only two

events– gain of Rs. 100 to X, and loss of Rs. 100 to Y.

Every event A is a subset of the sample space and every event is a collection of the elements of a

sample space. Events can be classified as being elementary or compound. Elementary events are said to be

those which have a single sample point whereas compound events are those which contain more than one

sample point. Thus, whereas the compound events can be decomposed, the elementary events cannot be.

The appearance of 3 on a die is an elementary event while the appearance of an even number on the die is

a compound event (as it contains three sample points).

Now, keeping in mind the definitions of experiment, sample space and events, we introduce some

more concepts.

(a) Mutually Exclusive Events and Overlapping Events

Two events are mutually exclusive if the occurrence of one event precludes the occurrence of the

other. For example, the events that (i) an employee would be late, and (ii) the employee would be absent, on

a particular day, are mutually exclusive since both cannot occur simultaneously. An employee cannot be both

late and absent on a particular day. Similarly, suppose we consider a box in which 20 cards, marked 1 to 20

are placed and a card is drawn at random. If A be the event that the number on the card is divisible by 3 and

B be the event that the number would be divisible by 7, then the events A and B are mutually exclusive. This

is because for A to occur, the number would be one of 3, 6, 9, 12, 15 or 18, and B to occur, it should be one of

7 or 14. Since no number is common to them, they are mutually exclusive.

On the other hand, two or more events which are not mutually exclusive are called overlapping

events. In the above example of cards, suppose A represents the event that the number on the card chosen

is divisible by 3 and B represents the event that the number is divisible by 5, then for A to occur the number

must be either 3,6,9,12,15 or 18, and for B to occur, it must be one of 5,10,15 and 20. Note that if the number

15 is obtained, it implies that both A and B have taken place. Thus A and B are not mutually exclusive.

We can use Venn diagram to depict mutually exclusive and overlapping events. This is shown in

figure 2. Part (a) of the figure shows the mutually exclusive events A and B, each of them defined over the

sample space S. Note that A and B have no sample points in common. On the other hand part (b) of the figure

shows overlapping events A and B, as they have some common sample points.
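Since events are subsets of the sample space, the card example can be checked with Python sets. This is a small sketch; the helper name `mutually_exclusive` is hypothetical and simply tests whether two events share no sample point.

```python
# Sample space: cards marked 1 to 20.
cards = set(range(1, 21))

div3 = {n for n in cards if n % 3 == 0}   # event: number divisible by 3
div5 = {n for n in cards if n % 5 == 0}   # event: number divisible by 5
div7 = {n for n in cards if n % 7 == 0}   # event: number divisible by 7


def mutually_exclusive(a, b):
    """Two events are mutually exclusive when they have no common sample point."""
    return a.isdisjoint(b)


print(mutually_exclusive(div3, div7))   # True: no number is common
print(mutually_exclusive(div3, div5))   # False: the events overlap
print(div3 & div5)                      # {15}, the common sample point
```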

(b) Complementary Events

Events are said to be complementary when the sample space is partitioned into the segment that

represents the occurrence of an event A, and the segment that is not a part of A. Thus, the complement of an

event A is the collection of all possible outcomes that are not contained in event A. For example, in the toss


of a coin, appearance of head and tail are complementary to each other. Complementary events are shown in figure 3. Here A and Ā are complementary to each other. Similarly, the events of a person being able to hit a target and not being able to hit the target are complementary.

(c) Independent and dependent Events

Two events are said to be independent if the occurrence of one event in no way influences the

occurrence of the other event. For example, we toss a six-faced die and call the event of appearance of an

even number as the event A and the appearance of an odd number as the event B. Now, suppose that in the

first toss we get an even number. If we toss the die the second time, we can still get an even or an odd

number and their chances are not influenced by the result of the first trial. Thus, the appearance of an even

number in the first trial and the appearance of an even number in the second trial is an example of indepen-

dent events. Similarly, if we pick a card at random from a deck of playing cards, note its suit and put it back

and then draw one card from the deck, then the chances of a king card, for example, in the second trial is not

at all affected by the card we had drawn at the first trial. But if we take out a card and do not replace it back,

then the chances of drawing a king card in the second trial are certainly affected by the card we had drawn

in the first trial. If it were a king card the first time, then only 3 king cards remain in the 51 cards while if a

non-king card was drawn then we would have 4 king cards in the lot of 51 cards. So the chances of a king

card in the second trial are dependent upon the results of the first trial. This time, the events of a king card in

the first trial and king card in the second trial are not independent because the outcome in one trial is in some

way influenced by the outcome of the previous trial.
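The card example can be put in numbers with exact fractions. The sketch below only restates the counts given in the paragraph above (4 kings in 52 cards, then 3 or 4 kings in the remaining 51); the variable names are illustrative.

```python
from fractions import Fraction

# With replacement: the deck is restored, so the second draw
# has the same chance of a king whatever the first card was.
p_king_second_with_replacement = Fraction(4, 52)

# Without replacement: the chance depends on the first card.
p_king_second_after_king = Fraction(3, 51)       # a king was removed
p_king_second_after_non_king = Fraction(4, 51)   # all four kings remain

print(p_king_second_with_replacement)   # 1/13
print(p_king_second_after_king)         # 1/17
print(p_king_second_after_non_king)     # 4/51
```

The two unequal values in the second case are what makes the events dependent.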

Methods of Assigning Probability

There are three methods of assigning probability to an event. They are :

(i) Classical approach,

(ii) Relative frequency approach, and

(iii) Personalistic approach.

We now discuss the three methods :

Classical Approach

The classical approach to determine probability is the oldest one. It originated with the games of

chance. According to this theory, if there are n outcomes of an experiment which are mutually exclusive and

equally likely to occur, then the probability of each sample point is 1/n. Thus, if a fair die is tossed, each of six

numbers 1, 2,... 6, is equally likely to occur and the probability that a given number, say 5, would occur is 1/6.

From this, the classical interpretation of probability is: if the sample space of an experiment has n(S) equally

likely outcomes and if an event A, defined on this sample space has n(A) sample points, then the probability

that event A would occur is the ratio of n(A) to n(S).

To illustrate, we consider the following examples.

Example : A six-faced die is tossed once. Find the probability that the number obtained on tossing is (i) an

odd number, (ii) a number greater than 2.

Solution : Let A : the event that the number is an odd number , and

B : the event that the number is greater than 2.

From the given information, n(S) = 6 (as there are six possible outcomes)


n(A) = 3 (being numbers 1 ,3 and 5), and

n(B) = 4 (being numbers 3, 4, 5 and 6)

∴ $P(A) = \frac{n(A)}{n(S)} = \frac{3}{6} = \frac{1}{2}$

$P(B) = \frac{n(B)}{n(S)} = \frac{4}{6} = \frac{2}{3}$

Example : A card is drawn from a deck of playing cards at random. Find the chance that (i) it is a face card,

(ii) it is a black ace card.

Solution : Let A : the event that the card is a face card

B : the event that the card is a black ace card

We have, n(S) = 52 (there being 52 cards)

n(A) = 12 (there being 4J, 4Q and 4K cards with faces)

n(B) = 2 (there being 2 black aces)

$P(A) = \frac{12}{52} = \frac{3}{13}$ and

$P(B) = \frac{2}{52} = \frac{1}{26}$

Example : Find the probability that a leap year selected at random shall contain 53 Sundays.

Solution : Like every year, a leap year would have 52 full weeks. The remaining two days of the years could

be:

Sunday and Monday, Monday and Tuesday, Tuesday and Wednesday, Wednesday and Thursday,

Thursday and Friday, Friday and Saturday, or Saturday and Sunday.

We observe here that n(S) = 7. Since two of the above combinations have a Sunday included, we

have n(A) = 2.

Therefore, $P(A) = \frac{2}{7}$

The classical theory, under the assumption of equally likely outcomes, depends on logical reasoning.

It does very well when we are concerned with balanced coins, perfect dice, well shuffled pack of cards and

all those situations where all outcomes are equally likely. However, problems are immediately encountered

when we have to deal with the unbalanced coins, loaded dice and so on. In such situations, we have to depend

on the relative frequency approach.
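Under the classical approach, finding a probability is a matter of counting sample points. The following sketch, with a hypothetical helper `classical_probability`, recomputes the die and leap-year examples as the ratio n(A)/n(S); it assumes, as the classical theory does, that all outcomes are equally likely.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """P(A) = n(A) / n(S), assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


# Six-faced die tossed once.
die = set(range(1, 7))
p_odd = classical_probability({n for n in die if n % 2 == 1}, die)
p_greater_than_2 = classical_probability({n for n in die if n > 2}, die)

# Leap year: the two extra days form one of seven consecutive-day pairs.
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
pairs = {(days[i], days[(i + 1) % 7]) for i in range(7)}
with_sunday = {pair for pair in pairs if "Sun" in pair}
p_53_sundays = classical_probability(with_sunday, pairs)

print(p_odd, p_greater_than_2, p_53_sundays)   # 1/2 2/3 2/7
```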

Relative Frequency Approach

It is based on the actual observation. For example, if we were interested in the probability of 50 or

fewer customers arriving at a super market before 10 a.m. we would pick a trial number of days (n) and

count how often 50 or fewer actually did arrive before 10 a.m. Our probability assessment would be the ratio

of days when 50 or fewer customers arrived to the total number of days observed. Similarly, to have an idea

of the probability that a head would appear on tossing a coin, we may actually toss the coin a number of times,

say 1000, and find the number of times a head appears. If 536 times a head has appeared, then the probability

of head to occur will be taken to be the ratio of the two : 536/1000. Naturally, if the coin is a fair one then the

ratio will approach to 0.5 as we continuously increase the number of trials. And if the coin is not a fair one

then the chances of a head would tend to approach the true probability of the head occurring on this coin

depending upon how biased is the coin.

Formally, the probability assessment for event A using the relative frequency approach is given by :


$P(A) = \frac{\text{Number of times event A occurs}}{\text{Total number of trials } (n)}$

It can be easily visualised that when the number of trials increases, we get better and better estimate

of the true probability of the event in question.
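The behaviour of the relative frequency as the number of trials grows can be illustrated by simulation. This is only a sketch: the coin is simulated with Python's pseudo-random generator, the function name `relative_frequency` is hypothetical, and the bias of 0.7 for the unfair coin is an assumed value chosen for illustration.

```python
import random


def relative_frequency(p_head, trials, seed=0):
    """Toss a simulated coin `trials` times and return heads / trials."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    heads = sum(rng.random() < p_head for _ in range(trials))
    return heads / trials


# Fair coin: the ratio settles near 0.5 as the number of trials increases.
for n in (10, 100, 1000, 100000):
    print(n, relative_frequency(0.5, n))

# Biased coin (assumed bias 0.7): the ratio settles near 0.7 instead.
print(relative_frequency(0.7, 100000, seed=1))
```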

Both the classical and the relative frequency approaches to probability are objective in nature. The

classical definition is objective in the sense that it is based on deduction from a set of assumptions while the

relative frequency approach is objective because the probability is derived from repeated empirical observations. However, both the theories fail when we are dealing with unique events. For example, to determine the

probability that a certain student will succeed in a particular examination, we can apply none of the two. This

is because it cannot be ruled that for every student the events of succeeding and not succeeding are equally

likely. Similarly, we cannot subject the candidate to appear in the examination several times to estimate his

probability of success. In such cases, we have the personalistic approach to probability.

Personalistic Approach to Probability

The approach views that the probability of an event is a measure of the degree of belief that an

investigator has in the happening of it. It grants that the probability of the same event may be assigned

differently by different investigators according to the confidence each one has in its happening. Thus, whereas

the chances of a candidate succeeding in an examination may be placed at 80 percent by one person, another

might estimate the chances to be 95 percent. Accordingly, the two would assign a probability 0.80 and 0.95

respectively for the event to happen.

It may be mentioned that the three approaches to probability are not competitive; rather

they are complementary in nature.

Calculation of Probability

As we have seen already, the probability of an event is defined as the ratio of the number of favor-

able outcomes (for the event) to the total number of possible outcomes. Little difficulty is experienced when

the total and favorable outcomes are small in number but when they are large, we may require the use of

counting techniques to identify their number. Therefore, we first state the method of obtaining permutations

and combinations.

(1) If a job can be done in m ways and another job can be done in n ways, then the total

number of ways in which both of them can be done is m × n. This is the fundamental

multiplication rule.

Example : A man can go from city A to city B by three routes and come back by any of four routes. In how many ways can he perform his to-and-fro journey ?

Solution : He can perform the journey in a total of 3 × 4= 12 different ways.

Example : Three balanced dice are tossed. Find the chance that the sum of the numbers on the three would be equal to 10.

Solution : Total number of ways in which three dice can fall = 6 × 6 × 6 = 216.

Total number of ways in which a total of 10 can appear = 27 (as shown below)

(1,3,6), (1,4,5), (1,5,4), (1,6,3), (2,2,6), (2,3,5), (2,4,4),

(2,5,3), (2,6,2), (3,1,6), (3,2,5), (3,3,4), (3,4,3), (3,5,2),

(3,6,1), (4,1,5), (4,2,4), (4,3,3), (4,4,2), (4,5,1), (5,1,4),

(5,2,3), (5,3,2), (5,4,1), (6,1,3), (6,2,2), (6,3,1)

Accordingly, $P(\text{total of 10}) = \frac{27}{216} = \frac{1}{8}$

(2) The total number of arrangements of n distinct objects considered all at a time is equal to n!

Thus, $^nP_n = n!$

Example : In how many ways can the letters in the word DELHI be arranged ?

Solution : Since all the 5 letters are different, they can be arranged in 5! = 120 ways.

(3) The total number of arrangements of n distinct objects taken r at a time is equal to

$^nP_r = \frac{n!}{(n-r)!}$

Example : A car dealer has 4 places in his showroom. He has just received a consignment of 10 cars of different shades. In how many ways can he arrange the cars in the showroom ?

Solution : $^nP_r = {}^{10}P_4 = \frac{10!}{6!} = 10 \times 9 \times 8 \times 7 = 5040$

(4) If out of n objects, $k_1$ are alike, $k_2$ are alike, $k_3$ are alike, and so on, such that $k_1 + k_2 + k_3 + \ldots = n$, the number of arrangements of the n objects would be equal to

$\frac{n!}{k_1! \, k_2! \, k_3! \cdots}$

Example : In how many ways can the letters in the word STATISTICS be arranged ?

Solution : Here n = 10, $k_1$(S) = 3, $k_2$(T) = 3, $k_3$(I) = 2, $k_4$(A) = 1 and $k_5$(C) = 1

Accordingly, the number of arrangements $= \frac{10!}{3! \, 3! \, 2! \, 1! \, 1!} = \frac{3628800}{72} = 50400$

(5) Out of a total of n distinct objects, the number of combinations of r objects can be obtained as follows :

$^nC_r = \frac{n!}{r!(n-r)!}$

Example : In how many ways can a committee of 3 persons be chosen out of a total of 10 persons ?

Solution : Here n = 10 and r = 3. The total number of committees would be :

$^nC_r = {}^{10}C_3 = \frac{10!}{3! \, 7!} = 120$

Example : A committee of four is to be selected randomly out of a total of 10 executives, 3 of which are

chartered accountants. Find the probability that the committee would include exactly 2 C.A.s.

Solution : The committee of 4 executives can be selected out of a total of 10 executives in $^{10}C_4$ ways. The number of ways in which 2 C.A.s can be selected out of 3 is equal to $^3C_2$, while the number of ways in which the other 2 members can be selected out of the remaining 7 executives is equal to $^7C_2$.

∴ P (committee includes exactly 2 CAs) $= \frac{^3C_2 \times {}^7C_2}{^{10}C_4} = \frac{3 \times 21}{210} = \frac{3}{10}$

Example : Two cards are drawn at random from a well-shuffled deck of cards. Find the probability that both

are ace cards.

Solution : No. of ways in which 2 cards can be selected out of 52 cards $= {}^{52}C_2 = \frac{52 \times 51}{2} = 1326$

No. of ways in which 2 aces can be selected out of 4 ace cards $= {}^4C_2 = 6$

∴ P(2 ace cards) $= \frac{6}{1326} = \frac{1}{221}$
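The counting rules above correspond to functions in Python's standard `math` module (`math.perm` and `math.comb` are available from Python 3.8). The sketch below reworks the examples of this section; the helper `arrangements` is a hypothetical name for rule (4).

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import product


def arrangements(word):
    """Rule (4): n! divided by k! for every group of alike letters."""
    total = math.factorial(len(word))
    for k in Counter(word).values():
        total //= math.factorial(k)
    return total


delhi = arrangements("DELHI")            # rule (2): 5! = 120
showroom = math.perm(10, 4)              # rule (3): 10P4 = 5040
statistics = arrangements("STATISTICS")  # rule (4): 50400
committees = math.comb(10, 3)            # rule (5): 10C3 = 120

# Committee of 4 with exactly 2 C.A.s out of 3, from 10 executives.
p_two_cas = Fraction(math.comb(3, 2) * math.comb(7, 2), math.comb(10, 4))

# Two cards drawn from 52, both aces.
p_two_aces = Fraction(math.comb(4, 2), math.comb(52, 2))

# Three dice: count the outcomes whose total is 10 by listing all 216.
ways_10 = sum(1 for roll in product(range(1, 7), repeat=3) if sum(roll) == 10)
p_total_10 = Fraction(ways_10, 6 ** 3)

print(delhi, showroom, statistics, committees)
print(p_two_cas, p_two_aces, ways_10, p_total_10)
```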

Probability Rules

The probability associated with any event represents the likelihood of that event occurring on a

particular trial of an experiment. This probability also measures the perceived uncertainty about whether the

event will occur. If we are not uncertain at all, we assign the event a probability of zero or one. If the event


be A, then P(A) = 0 means that event A would not occur, while P(A) = 1 indicates that event A would

definitely occur. Thus, for any event, the probability would range between zero and one. Probability is non-

negative concept. Symbolically,

Rule 1 : 0 ≤ P(A) ≤ 1

Example : (i) Determine the probability that a 7 would appear on a six-faced die tossed once.

(ii) Determine the probability that an even or an odd number would appear on tossing a die.

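Solution : (i) $P(7) = \frac{0}{6} = 0$ (since a 7 does not exist on the die, there is no question of its occurrence)

(ii) $P(\text{even or odd}) = \frac{6}{6} = 1$ (since every face shows either an even or an odd number)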

Rule 2 : The probability of the complement of event A is one minus the probability of event A.

Symbolically,

$P(\bar{A}) = 1 - P(A)$

To be able to hit and not be able to hit a target for example, are complementary events. If the

probability of a person to hit a target is given to be 3/5, then the probability that he would not be able to hit the

target would be :

$P(\text{not hitting the target}) = 1 - \frac{3}{5} = \frac{2}{5}$

Addition Rule : When making a decision involving probabilities, we often need to combine event

probabilities with some event of interest. Here we first consider the calculation of probability that event A or

B, each of them being defined on the sample space would occur. We use the addition rules of probability for

this purpose.

Rule 3 : When the events are mutually exclusive, the probability of occurrence of either of them is

given by the sum of their individual probabilities. For two events A and B which are mutually exclusive,

P(A or B) = P(A) + P(B)

Alternatively, P(A ∪ B) = P(A) + P(B)

where (A ∪ B) reads A union B and means A or B. Thus, for the mutually exclusive events, the

probability that either one of them would occur is given by the sum of their individual probabilities. This rule

is known as the special rule of addition. In general terms, the rule is

P(A ∪ B ∪ C ∪ ...K) = P(A) + P(B) + P(C) + ...+ P(K)

Example : A box contains 20 discs numbered 1 to 20. A disc is selected at random. Find the probability that

the number on the disc is divisible by 5 or 7.

Solution : Let A be the event that the number is divisible by 5, and B be the event that the number is divisible

by 7. Since there is no number which is common in these, the events A and B are mutually exclusive.

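Accordingly,

$P(A) = \frac{4}{20}$, $P(B) = \frac{2}{20}$ and $P(A \cup B) = \frac{4}{20} + \frac{2}{20} = \frac{6}{20} = \frac{3}{10}$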

Rule 4 : When the events are overlapping : when two events A and B are overlapping, then the

probability that either A or B or both of them would occur is given by the sum of individual probabilities of

events A and B to occur minus the probability of their joint occurrence. Symbolically,

P(A or B or Both) = P(A) + P(B) – P(A and B)

Alternatively, P (A ∪ B) = P(A) + P (B) – P (A ∩ B)

When there are three overlapping events A, B, and C, we have,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C)


Example : A box contains 20 discs numbered 1 through 20. A disc is selected at random. Find the chance that

its number is divisible by 3 or 5.

Solution : Let A be the event that the number is divisible by 3, and B the event that the number is divisible by

5. Here six numbers 3,6, 9,12, 15 and 18 are divisible by 3 and four numbers 5,10,15 and 20 are divisible by 5.

We notice that the number 15 is included in both the lists. Thus, we have,

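$P(A) = \frac{6}{20}$, $P(B) = \frac{4}{20}$ and $P(A \cap B) = \frac{1}{20}$

Accordingly, $P(A \cup B) = \frac{6}{20} + \frac{4}{20} - \frac{1}{20} = \frac{9}{20}$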

Example : A survey conducted to know the smoking habits of 500 persons yielded the following results :

Cigarette Brand No. of smokers

A 140

B 175

C 100

A and B 45

A and C 38

B and C 44

A and B and C 18

Find the probability that a person selected at random from the above group would be

(i) a smoker of brand A or B, (ii) a smoker of A or B or C,

(iii) a non-smoker.

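Solution : From the given information, P(A) = 140/500, P(B) = 175/500, P(C) = 100/500, P(A ∩ B) = 45/500,
P(A ∩ C) = 38/500, P(B ∩ C) = 44/500, and P(A ∩ B ∩ C) = 18/500.

Accordingly,

(i) P(A ∪ B) = P(A) + P(B) – P(A ∩ B) = 140/500 + 175/500 – 45/500 = 270/500 = 0.54

(ii) P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C)

= 140/500 + 175/500 + 100/500 – 45/500 – 38/500 – 44/500 + 18/500 = 306/500 = 0.612

(iii) P(non-smoker) = 1 – P(smoker) = 1 – 0.612 = 0.388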
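The survey counts above can be checked numerically. The following Python sketch is only an illustration of Rule 4 applied to the given counts; the function name `union_of_three` is an assumption made for this example, and exact fractions are used so that no rounding enters the calculation.

```python
from fractions import Fraction

def union_of_three(total, a, b, c, ab, ac, bc, abc):
    """Probability of A or B or C from raw counts, by the rule for overlapping events."""
    # Add the single counts, subtract the pairwise overlaps, add back the triple overlap.
    count = a + b + c - ab - ac - bc + abc
    return Fraction(count, total)

total = 500
# (i) smoker of brand A or B
p_a_or_b = Fraction(140 + 175 - 45, total)
# (ii) smoker of A or B or C
p_any = union_of_three(total, 140, 175, 100, 45, 38, 44, 18)
# (iii) non-smoker is the complement of "smokes at least one brand"
p_non_smoker = 1 - p_any

print(float(p_a_or_b), float(p_any), float(p_non_smoker))
```

Working with counts first and dividing once at the end keeps the arithmetic easy to audit against the table.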

Conditional Probability

In dealing with probability we often need to determine the chances of two or more events occurring either at the same time or in succession. For example, a quality control manager for a manufacturing company may be interested in the probability of selecting two successive defectives from an assembly line. In other instances, the decision maker may know that an event has occurred and may want to know the chances of a second event occurring. For example, the market research organization engaged by a company may give a favorable report predicting high sales for a new product to be introduced by the company. The managing director of the company might well be interested to know the probability of achieving high sales given the favorable report.

These situations require tools different from those presented above in the context of addition rules. Specifically, we need to understand the rules for conditional probability and multiplication of probabilities. To understand, suppose that the employees of an organization are cross-classified according to sex and rank as follows :

           Officer   Clerk   Total
Males        400      300     700
Females      200      100     300
Total        600      400    1000

If an employee is selected at random then the probability that the employee would be a male = 700/1000, since out of the total of 1000 employees, a total of 700 are males. This probability is unconditional in the sense that we are not given any information about the type of employee selected. Now, if it is given that an employee is selected at random and he is an officer, then the probability that he would be a male shall be equal to 400/600, because the focus would be only on the officers, who are 600 in all and of whom 400 are males. This probability is conditional. If we let event A represent that the employee would be a male and event B represent that the employee would be an officer, we can write the conditional probability as :

P(A/B) = 400/600

where (A/B) reads as event A given that event B has occurred. Upon a closer look we can represent the probability as:

P(A/B) = P(A ∩ B) / P(B)

In this, P(A ∩ B) represents the probability that both events A and B would occur and P(B) is the probability for the event B to occur. Thus, P(A/B) is the conditional probability of A given B and is defined provided P(B) > 0.

For the above example, P(A ∩ B) = 400/1000 and P(B) = 600/1000. As such,

P(A/B) = (400/1000) / (600/1000) = 400/600 = 2/3
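The employee table can also be worked through in code. The sketch below is illustrative only: the dictionary layout and the helper names `prob` and `conditional` are assumptions introduced here, not part of the text. It computes P(A/B) as P(A ∩ B) / P(B) from the cell counts.

```python
from fractions import Fraction

# Cell counts from the cross-classification of employees by sex and rank.
counts = {
    ("male", "officer"): 400,
    ("male", "clerk"): 300,
    ("female", "officer"): 200,
    ("female", "clerk"): 100,
}
total = sum(counts.values())

def prob(sex=None, rank=None):
    """Probability of the event described by the given sex and/or rank."""
    matching = sum(
        n for (s, r), n in counts.items()
        if (sex is None or s == sex) and (rank is None or r == rank)
    )
    return Fraction(matching, total)

def conditional(sex, given_rank):
    """P(sex / rank) = P(sex and rank) / P(rank), defined only when P(rank) > 0."""
    p_rank = prob(rank=given_rank)
    if p_rank == 0:
        raise ValueError("conditioning event has zero probability")
    return prob(sex=sex, rank=given_rank) / p_rank

p_male = prob(sex="male")                               # unconditional
p_male_given_officer = conditional("male", "officer")   # conditional
print(p_male, p_male_given_officer)
```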

Example : Consider an experiment in which two successive draws are to be made from an urn containing

three white balls and five black balls. Assume that the balls are drawn at random and that the ball chosen on

the first draw is not replaced. Find the probability that (i) the first ball drawn is white, and (ii) the second one

is black.

Solution : (i) Let A be the event that the first ball drawn is white and B be the event that the second ball drawn is black. From the given information, P(A) = 3/8, since there are three white balls in a total of eight balls.

(ii) To determine the conditional probability of B given A, P(B/A), which is the probability of drawing a black ball on the second draw after drawing a white ball on the first draw, it should be noted that if A has already occurred, then there is a total of seven balls remaining and five of them are black. Thus, P(B/A) = 5/7.

Multiplication Rules. From the conditional probability defined in the preceding paragraphs, since

P(A/B) = P(A ∩ B) / P(B), we have

Rule 5 : P(A ∩ B) = P(B) × P(A/B)

Also P(A ∩ B) = P(A) × P(B/A)

This is called the multiplication rule for the non-independent events A and B, and states that the joint probability of the events A and B is given by the probability of the event A multiplied by the probability for event B given that event A has occurred (or the probability of event B, multiplied by the probability for event A given that event B has occurred).

Similarly, for events A, B, and C which are not independent, we have

P(A ∩ B ∩ C) = P(A) × P(B/A) × P(C/A ∩ B)

Example : Two balls are selected one after the other from an urn containing 7 black and 8 green balls. The

first ball is not replaced before the second one is drawn. Find the probability that both would be green.

Solution : Let A be the event that the first ball drawn is green and B be the event that the second ball drawn

is green. From the given information,

P(A) = 8/15, P(B/A) = 7/14 (if a green ball is taken out there would be 7 green balls in a total of 14 balls)

∴ P(A ∩ B) = P(A) × P(B/A) = 8/15 × 7/14 = 4/15

Rule 6 : If the events A and B are independent, the probability that both events occur can be determined by using P(A) and P(B). As mentioned earlier, two events are independent if the occurrence of one has no effect upon the occurrence of the other. More formally, if A and B are independent,

P(A/B) = P(A), and P(B/A) = P(B).

If A and B are independent, the conditional probability of A, given B, is the same as P(A), since the occurrence of the event B does not affect the occurrence of the event A; P(A/B) = P(A).

The joint probability of independent events may be seen as the product of the probabilities of the events A and B, since

P(A/B) = P(A ∩ B) / P(B) = P(A), and therefore P(A ∩ B) = P(A) × P(B)

To generalize, for independent events A, B, C ... we have

P(A ∩ B ∩ C ∩ ...) = P(A) × P(B) × P(C) × ...

Example : Two balls are selected one after the other from an urn containing 7 black and 8 green balls. The

first ball is replaced before the second one is drawn. Find the probability that both would be green.

Solution : Let A and B be the events that the first and the second ball, respectively, would be green.

From the given information,

P(A) = 8/15 and P(B) = 8/15

Accordingly,

P(A ∩ B) = P(A) × P(B) = 8/15 × 8/15 = 64/225
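The two urn examples (7 black and 8 green balls, drawn without and with replacement) differ only in the second factor of the product. The following Python sketch, in which the function name `both_green` and its `replace` flag are assumptions made for illustration, places Rule 5 and Rule 6 side by side.

```python
from fractions import Fraction

def both_green(green, black, replace):
    """Probability that two successive draws are both green."""
    total = green + black
    p_first = Fraction(green, total)
    if replace:
        # Independent draws: the urn is restored, so P(B/A) = P(B).
        p_second = Fraction(green, total)
    else:
        # Dependent draws: one green ball is gone, so use P(B/A).
        p_second = Fraction(green - 1, total - 1)
    return p_first * p_second

without_replacement = both_green(8, 7, replace=False)
with_replacement = both_green(8, 7, replace=True)
print(without_replacement, with_replacement)
```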

It is significant to note that if the condition P(A ∩ B) = P(A) × P(B) is satisfied then the events A and B are said to be independent, just as when they are independent this relation is satisfied. This condition can be employed to determine whether the given events A and B are independent.

Example : For the data on employees given in the example above, test whether the events of an employee selected being male and an employee selected being clerk are independent.

Solution : Let A be the event that an employee selected is male and B be the event that an employee selected would be a clerk. From the given information,

the number of employees who are males = 700
the number of employees who are clerks = 400
the number of employees who are males and clerks = 300

Accordingly,

P(A) = 700/1000, P(B) = 400/1000, and P(A ∩ B) = 300/1000

Here since P(A) × P(B) = 0.70 × 0.40 = 0.28 ≠ 0.30 = P(A ∩ B), therefore the events A and B are not independent.
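The independence condition P(A ∩ B) = P(A) × P(B) is easy to test mechanically. The sketch below is a minimal illustration; the function name `are_independent` is an assumption, and counts are turned into exact fractions so that the equality test is not disturbed by floating-point rounding.

```python
from fractions import Fraction

def are_independent(n_a, n_b, n_both, n_total):
    """Check P(A and B) == P(A) * P(B) using exact arithmetic on counts."""
    p_a = Fraction(n_a, n_total)
    p_b = Fraction(n_b, n_total)
    p_both = Fraction(n_both, n_total)
    return p_a * p_b == p_both

# Males (700), clerks (400), male clerks (300) out of 1000 employees.
male_and_clerk = are_independent(700, 400, 300, 1000)
print(male_and_clerk)
```

With the employee data the product is 0.28 while the joint probability is 0.30, so the check reports that the events are not independent.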

Now we shall discuss the theorem of total probability, also called the theorem of elimination.

Rule 7 : If H1, H2, ..., Hn be n mutually exclusive events each with a non-zero probability, and E be an event defined on the same sample space that can be associated with any of them, the total probability of event E to occur is given by :

P(E) = P(H1) × P(E/H1) + P(H2) × P(E/H2) + ... + P(Hn) × P(E/Hn)

Alternatively, P(E) = Σ P(Hi) × P(E/Hi), the sum being taken over i = 1, 2, ..., n.

In this formulation, of course, P(Hi) and P(E/Hi) must all be given.

To illustrate the application of this theorem, consider the following example.

Example : Two sets of candidates are competing for the positions of board of directors of a company. The

chances for the first set to win are 60% while the chances for the second set are 40%. If the first set wins,

the probability that a product will be introduced is 0.80 while if the second set wins, the probability for the

product to be introduced is 0.30. Determine the probability that the product will be introduced.

Solution : If H1 is the event that the first set wins,
H2 is the event that the second set wins, and
E is the event that the product is introduced, then

P(H1) = 0.60, P(E/H1) = 0.80, P(H2) = 0.40, P(E/H2) = 0.30

Accordingly,

P(E) = P(H1) × P(E/H1) + P(H2) × P(E/H2)

= 0.6 × 0.8 + 0.4 × 0.3 = 0.48 + 0.12 = 0.60

Thus, there is a sixty per cent chance that the product shall be introduced.
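The theorem of total probability is a weighted sum, which the following Python sketch makes explicit. The function name `total_probability` and its list arguments are assumptions introduced for this illustration only.

```python
def total_probability(priors, likelihoods):
    """P(E) = sum of P(Hi) * P(E/Hi) over mutually exclusive events Hi."""
    if len(priors) != len(likelihoods):
        raise ValueError("each P(Hi) needs a matching P(E/Hi)")
    return sum(p_h * p_e_given_h for p_h, p_e_given_h in zip(priors, likelihoods))

# Board of directors example: P(H1) = 0.60, P(H2) = 0.40,
# P(E/H1) = 0.80, P(E/H2) = 0.30.
p_product_introduced = total_probability([0.60, 0.40], [0.80, 0.30])
print(round(p_product_introduced, 2))
```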

Bayes’ Theorem

Conditional probabilities provide a tool of good information for decision makers. For instance, say the medical researchers are interested in determining the probability of a person getting cancer supposing he was exposed to hazardous chemicals, that is, P(cancer/hazardous chemicals).

In such cases we use the conditional probability rule

P(A/B) = P(A ∩ B) / P(B)

However, in many practical applications, decision makers may know that an event has occurred but do not know what the chances were of that event before the fact. This cannot be known by the use of the conditional probability rule directly. In such cases we employ an extension of conditional probability called Bayes’ Theorem. This theorem deals with the conditional probability of an event Hi, given E, where E may have elements in each of the events H1, H2, H3, ..., Hn with no element of E in more than one Hi. The Venn diagram of the figure displays such a condition.

Fig. 4 Bayes’ Theorem

We shall illustrate the concept with an example and then make a generalization.

Example : Box 1 contains 5 white balls and 3 red balls. Box 2 contains 4 white balls and 4 red balls. A box

is selected at random and one ball is randomly taken from that box. If the ball is white, what is the probability

that it came from box 1 ? box 2 ?

Solution : Let H1 : the box 1 is selected,
H2 : the box 2 is selected, and
E : the ball is white.

From the given information, P(H1) = 1/2, P(H2) = 1/2, P(E/H1) = 5/8, and P(E/H2) = 4/8.

Here we wish to calculate P(H1/E) and P(H2/E). From the theorem of conditional probability,

P(H1/E) = P(H1 ∩ E) / P(E) and P(H2/E) = P(H2 ∩ E) / P(E)

P(H1 ∩ E) is the probability of first selecting box 1 and then selecting one white ball from it.

P(H1 ∩ E) = P(H1) × P(E/H1) = 1/2 × 5/8 = 5/16

P(H2 ∩ E) is the probability of first selecting box 2 and then selecting one white ball from it.

P(H2 ∩ E) = P(H2) × P(E/H2) = 1/2 × 4/8 = 4/16

Since the ball selected can be from box 1 or box 2, we have,

P(E) = P(H1 ∩ E) + P(H2 ∩ E)
= P(H1) × P(E/H1) + P(H2) × P(E/H2)
= 5/16 + 4/16 = 9/16

Accordingly :

P(H1/E) = (5/16) / (9/16) = 5/9

Also, P(H2/E) = (4/16) / (9/16) = 4/9

Notice here that naturally either box 1 or box 2 would have been selected. When no information about the colour of the ball is known, the probability that box 1 is selected is 1/2 and so is the probability that box 2 is selected. Thus, P(H1) = 1/2 and P(H2) = 1/2 are the prior probabilities. Having known later on that the ball selected is of the white colour, we have revised these probabilities to P(H1/E) = 5/9 and P(H2/E) = 4/9. These probabilities are known as posterior probabilities. Thus the prior probabilities are transformed into posterior probabilities by incorporating the additional information, with the help of conditional and joint probabilities. The information in the above stated example can be restated as follows :

Event    Prior Prob.    Conditional Prob.    Joint Prob.    Posterior Prob.
(Hi)     P(Hi)          P(E/Hi)              P(Hi ∩ E)      P(Hi/E)
H1       1/2            5/8                  5/16           5/9
H2       1/2            4/8                  4/16           4/9
Total, P(E) = 9/16

We can formally state the Bayes’ theorem now as follows : If H1, H2, ..., Hn be mutually exclusive and collectively exhaustive events and E be an event which is arbitrarily defined on this sample space such that P(E) > 0, then the Bayes’ Theorem states that

P(Hi/E) = P(Hi) × P(E/Hi) / P(E), wherein P(E) = Σ P(Hj) × P(E/Hj), the sum being taken over j = 1, 2, ..., n.

Example : A company has two suppliers of raw materials used in making cement. Vendor A supplies 30 per cent of raw materials while vendor B supplies 70 per cent. Tests have shown that 40 per cent of vendor A’s materials are of poor quality whereas 5 per cent of vendor B’s materials are of poor quality. The cement company’s manager has just found that there is a poor quality material in inventory. Which vendor most probably supplied the material ?

Solution : Let H1 be the event that the material is supplied by vendor A,
H2 be the event that the material is supplied by vendor B,
E be the event that the material is of poor quality.

Given:

Prior probabilities: P(H1) = 0.30, P(H2) = 0.70

Conditional probabilities: P(E/H1) = 0.40, P(E/H2) = 0.05

Joint probabilities: P(H1 ∩ E) = P(H1) × P(E/H1) = 0.30 × 0.40 = 0.120

P(H2 ∩ E) = P(H2) × P(E/H2) = 0.70 × 0.05 = 0.035

Total probability, P(E) = P(H1 ∩ E) + P(H2 ∩ E) = 0.120 + 0.035 = 0.155.

Posterior probabilities:

P(H1/E) = P(H1 ∩ E) / P(E) = 0.120/0.155 = 0.774

P(H2/E) = P(H2 ∩ E) / P(E) = 0.035/0.155 = 0.226

Thus, vendor A most likely supplied the poor quality material.
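The prior, joint and posterior columns used in the two Bayes examples follow the same three steps each time, so they can be captured in one small routine. The Python sketch below is an illustration under that reading; the function name `posteriors` is an assumption, and the inputs are simply the priors P(Hi) and conditional probabilities P(E/Hi) quoted in the examples.

```python
from fractions import Fraction

def posteriors(priors, likelihoods):
    """Return the posterior probabilities P(Hi/E) by Bayes' theorem."""
    if len(priors) != len(likelihoods):
        raise ValueError("each prior needs a matching conditional probability")
    # Joint probabilities P(Hi and E) = P(Hi) * P(E/Hi).
    joints = [p_h * p_e_h for p_h, p_e_h in zip(priors, likelihoods)]
    # Total probability P(E) is the sum of the joint probabilities.
    p_e = sum(joints)
    if p_e == 0:
        raise ValueError("P(E) must be greater than zero")
    return [joint / p_e for joint in joints]

# Vendor example: priors 0.30 and 0.70, poor-quality rates 0.40 and 0.05.
vendor = posteriors([0.30, 0.70], [0.40, 0.05])

# Two-box example with exact fractions: priors 1/2 each, white-ball rates 5/8 and 4/8.
boxes = posteriors([Fraction(1, 2), Fraction(1, 2)], [Fraction(5, 8), Fraction(4, 8)])

print([round(p, 3) for p in vendor], boxes)
```

The posteriors always sum to 1, which is a convenient check on hand calculations.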

Expected Value. An important concept, which has its origin in gambling and to which probability is applied, is the expected value. According to this, if an experiment has n outcomes that are assigned the payoffs x1, x2, ..., xn occurring with probabilities p1, p2, ..., pn respectively, then the expected value is given by

E(x) = x1 × p1 + x2 × p2 + ... + xn × pn

Example: A player is engaged in the experiment of rolling a fair die. The player receives an amount of rupees equal to the number of dots on the face that turns up, except when face 5 or 6 turns up in which case the player will lose Rs. 5 or Rs. 6 respectively. What is the expected value of the game to the player ?

Solution : From the given information we have :

Outcome:       1     2     3     4     5     6
Probability:   1/6   1/6   1/6   1/6   1/6   1/6
Payoff:        1     2     3     4     –5    –6

The expected value of the game is :

E(x) = 1 × 1/6 + 2 × 1/6 + 3 × 1/6 + 4 × 1/6 + (–5) × 1/6 + (–6) × 1/6 = –1/6

Thus, the player would expect to lose on an average Re 1/6 or about 17 p. on each throw.
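The expected value is a probability-weighted sum of payoffs, which a short Python sketch can confirm for the die game. The function name `expected_value` is an assumption made for this illustration; the payoffs and probabilities are those of the example.

```python
from fractions import Fraction

def expected_value(payoffs, probabilities):
    """E(x) = x1*p1 + x2*p2 + ... + xn*pn."""
    if len(payoffs) != len(probabilities):
        raise ValueError("each payoff needs a probability")
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(payoffs, probabilities))

# Faces 1 to 4 pay their face value; faces 5 and 6 cost Rs. 5 and Rs. 6.
die_payoffs = [1, 2, 3, 4, -5, -6]
die_probs = [Fraction(1, 6)] * 6
die_game = expected_value(die_payoffs, die_probs)
print(die_game)
```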

Example : An oil company may bid for only one of two contracts for oil drilling in two different areas A and B. It is estimated that a net profit of Rs. 4,00,000 would be realized from the first field and Rs. 5,00,000 from the second field. Legal and other costs of bidding for the first oil field are Rs. 1,02,500 and for the second one are Rs. 1,05,000. The probability of discovering oil in the first field is 0.60 and in the second is 0.70. The manager of the company wants to know for which oil field the company should bid.

Solution : The expected values for the two contracts are calculated below :

Calculation of Expected Value

Investment   Outcome   Amount        Probability   Expected Value
A            Success   4,00,000      0.6           2,40,000
             Failure   (1,02,500)    0.4           (41,000)
             Total                                 1,99,000
B            Success   5,00,000      0.7           3,50,000
             Failure   (1,05,000)    0.3           (31,500)
             Total                                 3,18,500

Since the expected value for field B (Rs. 3,18,500) exceeds that for field A (Rs. 1,99,000), the company should bid for the second field.

Some Additional Examples

Example. Given : P(A) = 0.60, P(B) = 0.30 and P(A and B) = 0.18

(i) Are A and B mutually exclusive ? Why or why not ?

(ii) Are A and B independent ? Why or why not ?

Solution : (i) For the mutually exclusive events A and B, the probability for joint occurrence must be equal to

0 since they cannot happen simultaneously. Here since P (A and B) = 0.18 and not equal to zero, the events

are not mutually exclusive.

(ii) Events A and B are independent if P (A) × P (B) = P (A and B). Here since 0.60 × 0.30 = 0.18

(given), therefore A and B are independent.

Example: A problem in statistics is given to three students who can solve it, independently of one another, with probabilities of 3/5, 4/5 and 1/3 respectively. Find the probability that the problem shall be solved.

Solution : If A, B, and C represent the events that the problem cannot be solved by the three students respectively, we have,

P(A) = 1 – 3/5 = 2/5, P(B) = 1 – 4/5 = 1/5, P(C) = 1 – 1/3 = 2/3

Thus P(problem is not solved) = P(A ∩ B ∩ C) = 2/5 × 1/5 × 2/3 = 4/75

∴ P(problem is solved) = 1 – 4/75 = 71/75

Example : A newly married couple is planning their family. They decide that they would have two children; one boy and one girl. Assume that male and female births are equally likely and successive births are independent.

(i) What is the probability that the couple shall have one child of each sex ?

(ii) They decide to have a third child if their plan does not work out. What is the probability that they will decide for a third child ?

(iii) Find the probability that the couple shall have four daughters in a row.

Solution : (i) The couple can have one male and one female child in the following ways : first male, second female, or first female and second male.

P(M ∩ F) = P(M) × P(F) = 1/2 × 1/2 = 1/4 (being independent)

and P(F ∩ M) = P(F) × P(M) = 1/2 × 1/2 = 1/4

∴ P(one child of each sex) = 1/4 + 1/4 = 1/2

(ii) A third child would be planned by the couple when they don’t get one male and one female child.

∴ P(couple plans for a third child) = 1 – 1/2 = 1/2

(iii) Probability that the couple shall have four daughters in a row :

P(F1 ∩ F2 ∩ F3 ∩ F4) = P(F1) × P(F2) × P(F3) × P(F4)

= 1/2 × 1/2 × 1/2 × 1/2 = 1/16

Example: The ABC scooter company currently has 20 per cent of the scooter market in a certain region. Its major competitor, XYZ scooter company, has the remaining 80 per cent. The research and development department of ABC estimates that there is an 80 per cent chance that it would be able to produce a much better model.

If the new model is developed and marketed by ABC, there is a 0.60 chance that XYZ will also develop a similar product. If this happens, the chances are 0.20 that ABC will have an 80 per cent market share, 0.30 that ABC will have a 60 per cent market share, and 0.50 that it will have a 40 per cent market share. If XYZ is not able to develop a new model, then ABC has a 0.70 chance at an 80 per cent market share, and a 0.30 chance at a 50 per cent market share. In the event ABC is not able to develop the new model, it will retain its current 20 per cent market share.

What is the probability that ABC will gain a 60 per cent or more market share ?

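Solution : The given information is depicted in Figure 5.

Figure 5 : Probability Tree

From the figure, a market share of at least 60% arises along three paths : ABC develops the model, XYZ follows, and ABC gets 80% (0.8 × 0.6 × 0.2 = 0.096); ABC develops the model, XYZ follows, and ABC gets 60% (0.8 × 0.6 × 0.3 = 0.144); ABC develops the model, XYZ does not, and ABC gets 80% (0.8 × 0.4 × 0.7 = 0.224). Hence :

P(at least 60% market share) = 0.096 + 0.144 + 0.224 = 0.464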
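A probability tree can be checked by listing every root-to-leaf path and multiplying the branch probabilities along it. The Python sketch below does this for the scooter problem using only the branch probabilities stated in the example; the `paths` layout and the function name `prob_share_at_least` are assumptions made for illustration.

```python
from math import prod

# Each entry: (branch probabilities along the path, resulting market share in per cent).
paths = [
    ((0.8, 0.6, 0.2), 80),  # ABC develops, XYZ follows, 80% share
    ((0.8, 0.6, 0.3), 60),  # ABC develops, XYZ follows, 60% share
    ((0.8, 0.6, 0.5), 40),  # ABC develops, XYZ follows, 40% share
    ((0.8, 0.4, 0.7), 80),  # ABC develops, XYZ does not, 80% share
    ((0.8, 0.4, 0.3), 50),  # ABC develops, XYZ does not, 50% share
    ((0.2,), 20),           # ABC does not develop, keeps 20% share
]

def prob_share_at_least(paths, threshold):
    """Sum the path probabilities whose final market share meets the threshold."""
    return sum(prod(branches) for branches, share in paths if share >= threshold)

total_of_all_paths = sum(prod(branches) for branches, _ in paths)
p_at_least_60 = prob_share_at_least(paths, 60)
print(round(total_of_all_paths, 3), round(p_at_least_60, 3))
```

Checking that all path probabilities add up to 1 is a useful safeguard before reading off any answer from the tree.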

Example : A salesman has a 60 per cent chance of making a sale to each customer. The behavior of successive customers is independent. If two customers A and B enter, what is the probability that he will make a sale to A or B?

Solution : If E1 and E2 be the respective events that he will make a sale to A and B, then, according to the given information, we have

P(E1) = 0.6, P(E2) = 0.6, P(E1 ∩ E2) = 0.6 × 0.6 = 0.36.

∴ P(E1 ∪ E2) = P(E1) + P(E2) – P(E1 ∩ E2) = 0.60 + 0.60 – 0.36 = 0.84

Example : Box 1 contains five white and three red marbles; Box 2 contains four white and four red marbles;

and Box 3 contains one white and five red marbles. One marble is to be selected at random from a box

chosen by the roll of a die. If a 1 is uppermost, then the marble is taken from Box 1. If either a 2 or a 3 is

uppermost, then the marble is taken from Box 2. If either a 4, 5, or 6 is uppermost, then the marble is taken

from Box 3. If the marble selected is white, what is the probability that it came from Box 2 ?

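Solution : Let H1 = the marble is taken from Box 1, H2 = the marble is taken from Box 2,
H3 = the marble is taken from Box 3, E = the marble is white.

From the given information,

P(H1) = 1/6, P(H2) = 2/6, P(H3) = 3/6, P(E/H1) = 5/8, P(E/H2) = 4/8, P(E/H3) = 1/6

P(E) = P(H1 ∩ E) + P(H2 ∩ E) + P(H3 ∩ E)
= P(H1) × P(E/H1) + P(H2) × P(E/H2) + P(H3) × P(E/H3)
= 1/6 × 5/8 + 2/6 × 4/8 + 3/6 × 1/6 = 5/48 + 8/48 + 4/48 = 17/48

Now we can obtain the required probability as :

P(H2/E) = P(H2 ∩ E) / P(E) = (8/48) / (17/48) = 8/17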

Example : A box contains 4 white, 8 green and 8 red marbles. A player selects one marble at random. The

player wins Rs. 6 if the marble he selects is white, Rs. 2 if it is green, but must pay if it is red. How much

should he pay for a red marble if this is to be a fair game ?

Solution : Probability of selecting a white ball = 4/20, Probability of selecting a green ball = 8/20,
Probability of selecting a red ball = 8/20

For a game to be fair, its net expected payoff must be equal to zero. We have,

Colour of Ball   Payoff           Probability   Expected Value
White            6                4/20          24/20
Green            2                8/20          16/20
Red              – x (suppose)    8/20          –8x/20
Total                                           0

∴ 24/20 + 16/20 – 8x/20 = 0 or 8x = 40

or x = 40/8 = Rs. 5
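The fair-game condition can be solved directly: the expected winnings must equal the expected loss, so the penalty is the expected winnings divided by the probability of losing. The Python sketch below illustrates this; the function name `fair_penalty` and its argument layout are assumptions introduced for the example.

```python
from fractions import Fraction

def fair_penalty(wins, loss_probability):
    """Penalty x that makes the expected value of the game zero.

    wins is a list of (payoff, probability) pairs for the winning outcomes.
    Solves: sum(payoff * probability) - x * loss_probability = 0.
    """
    if loss_probability <= 0:
        raise ValueError("the losing outcome must have positive probability")
    expected_winnings = sum(Fraction(payoff) * p for payoff, p in wins)
    return expected_winnings / loss_probability

# White pays Rs. 6 (4 of 20 marbles), green pays Rs. 2 (8 of 20), red loses (8 of 20).
penalty = fair_penalty(
    [(6, Fraction(4, 20)), (2, Fraction(8, 20))],
    Fraction(8, 20),
)
print(penalty)
```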

LESSON 10

PROBABILITY DISTRIBUTIONS

Theoretical distributions refer to a set of mathematical models of the relative frequencies of a finite number of observations of a variable. A theoretical distribution is a systematic arrangement of probabilities of mutually exclusive and collectively exhaustive elementary events of an experiment. Observed frequency distributions are based upon actual observation and experimentation. We can deduce mathematically a frequency distribution of a certain population based on the trend of the known values. This kind of distribution, based on experience or theoretical considerations, is known as a theoretical distribution or probability distribution. These distributions may not fully agree with actual observations or the empirical distributions based on sample observations. If the number of experiments is increased sufficiently the observed distributions may come closer to theoretical or probability distributions. Theoretical distributions are useful for situations where actual observations or experiments are not possible. Moreover, they can be used to test the goodness of fit. They provide decision makers with a logical basis for making decisions and are useful in making predictions on the basis of limited information or theoretical considerations.

There are broadly three theoretical distributions which are generally applied in practice. They are :

1. Binomial distribution

2. Poisson distribution

3. Normal distribution

Binomial Distribution

It is a discrete distribution. The binomial distribution was discovered by James Bernoulli in 1700 to

deal with dichotomous classification of events. It is a probability distribution expressing the probability of one

set of dichotomous alternatives, i.e., success or failure. The binomial probability distribution is developed

under some assumptions which are :

(i) An experiment is performed under similar conditions for a number of times.

(ii) Each trial has two possible outcomes, success or failure.

S = {failure, success}

(iii) The probability of a success denoted by p remains constant for all trials. The probability of a failure denoted by q is equal to (1 – p).

(iv) All trials for an experiment are independent.

If a trial of an experiment can result in success with probability p and failure with probability q = (1 – p), the probability of exactly x successes in n trials is given by

P(x) = nCx p^x q^(n–x) where x = 0, 1, 2, ..., n

where P(x) = Probability of x successes

nCx = n! / (x! (n – x)!) (! is termed factorial)

The entire probability distribution of x = 0, 1, 2, ..., n can be written as follows:

Binomial Probability Distribution

Number of successes    Probability
x                      P(x)
0                      nC0 p^0 q^n
1                      nC1 p^1 q^(n–1)
2                      nC2 p^2 q^(n–2)
:                      :
x                      nCx p^x q^(n–x)
:                      :
n                      nCn p^n q^(n–n)

We should note that the variable x (number of successes) is discrete. It can take integer values 0, 1, 2, ..., n. The probabilities specified in the above table are in fact successive terms of the Binomial Expansion of (q + p)^n, which is

(q + p)^n = nC0 q^n p^0 + nC1 q^(n–1) p^1 + nC2 q^(n–2) p^2 + nC3 q^(n–3) p^3 + ... + nCn q^(n–n) p^n

Properties of Binomial Distribution

(i) The shape and location of binomial distribution changes as p changes for a given n or n changes for given p. If n increases for a fixed p, the binomial distribution moves to the right, flattens and spreads out. The mean of the distribution (np) increases as n increases for a constant value of p.

(ii) The mode of the binomial distribution is equal to the value of x which has the largest probability.

(iii) If n is large and p and q are not close to zero, the binomial distribution can be approximated by a normal distribution with standardised variable

Z = (X – np) / √(npq)

(iv) The mean and the standard deviation of the Binomial distribution are np and √(npq) respectively.

(v) The other constants of the distribution can be calculated.

µ2 = npq
µ3 = npq (q – p)
µ4 = 3n^2 p^2 q^2 + npq(1 – 6pq)

We can calculate the value of β1 and β2 to measure the nature of the distribution.

β1 = µ3^2 / µ2^3 = (q – p)^2 / npq

β2 = µ4 / µ2^2 = 3 + (1 – 6pq) / npq

The binomial distribution is useful in describing variety of real life events. Binomial distribution is

useful to answer questions such as: If we conduct an experiment n times under the stated conditions, what is

the probability of obtaining exactly x successes? For example, if 10 coins are tossed simultaneously what is

the probability of getting 4 heads ? We shall explain the usefulness of binomial distribution with the help of

certain examples.

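Example : A coin is tossed eight times. What is the probability of obtaining 0, 1, 2, 3, 4, 5, 6, 7 and all heads?

Solution : Let us denote the occurrence of head as success, with probability p.

So that p = 1/2

∴ q = 1 – p = 1/2 and n = 8 (given)

We can calculate various probabilities by expanding the binomial (q + p)^8.

(q + p)^8 = 8C0 q^8 p^0 + 8C1 q^7 p^1 + 8C2 q^6 p^2 + 8C3 q^5 p^3 + 8C4 q^4 p^4 + 8C5 q^3 p^5 + 8C6 q^2 p^6 + 8C7 q^1 p^7 + 8C8 q^0 p^8

Since p = q = 1/2, each term equals 8Cx × (1/2)^8 = 8Cx / 256. Therefore

The probability of obtaining 0 heads = 8C0 q^8 p^0 = 1/256
The probability of obtaining 1 head = 8C1 q^7 p^1 = 8/256
The probability of getting 2 heads = 8C2 q^6 p^2 = 28/256
The probability of getting 3 heads = 8C3 q^5 p^3 = 56/256
The probability of getting 4 heads = 8C4 q^4 p^4 = 70/256
The probability of getting 5 heads = 8C5 q^3 p^5 = 56/256
The probability of getting 6 heads = 8C6 q^2 p^6 = 28/256
The probability of getting 7 heads = 8C7 q^1 p^7 = 8/256
The probability of getting 8 heads = 8C8 q^0 p^8 = 1/256

We can also calculate the probability of 4 or more heads, or of more than 6 heads. Probability of 4 or more heads =

8C4 q^4 p^4 + 8C5 q^3 p^5 + 8C6 q^2 p^6 + 8C7 q^1 p^7 + 8C8 q^0 p^8 = (70 + 56 + 28 + 8 + 1)/256 = 163/256

Probability of getting more than 6 heads = 8C7 q^1 p^7 + 8C8 q^0 p^8 = (8 + 1)/256 = 9/256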
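The terms of the binomial expansion for the eight-toss example can be generated programmatically. The following Python sketch is an illustration only; the function name `binomial_pmf` is an assumption, and `math.comb` supplies nCx. Exact fractions keep the results in the same form as a hand calculation.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = nCx * p^x * q^(n-x), with q = 1 - p."""
    p = Fraction(p)
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n = 8
p_head = Fraction(1, 2)

# Probabilities of 0, 1, ..., 8 heads in eight tosses of a fair coin.
distribution = [binomial_pmf(x, n, p_head) for x in range(n + 1)]

p_four_or_more = sum(distribution[4:])   # 4, 5, 6, 7 or 8 heads
p_more_than_six = sum(distribution[7:])  # 7 or 8 heads

for x, prob_x in enumerate(distribution):
    print(x, prob_x)
print(p_four_or_more, p_more_than_six)
```

Because the nine probabilities exhaust all outcomes, their sum must equal 1, which gives a quick check on the whole table.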

Example : A box contains 100 transistors, 20 of which are defective. 10 are selected at random. Calculate the probability that (i) all 10 are defective, (ii) all 10 are good.

Solution : Let x represent the number of defective transistors selected. Then the value of x would be, x = 0, 1, 2, ..., 10.

Let us put p as the probability of a defective transistor.

∴ p = 20/100 = 1/5 and q = 1 – p = 4/5

Using the formula for binomial expansion, the probability of x defective transistors is P(x) = nCx p^x q^(n–x)

(i) Probability that all 10 are defective = 10C10 p^10 q^(10–10) = (1/5)^10 = 1/9765625 ≈ 0.0000001

(ii) Probability that all 10 are good = 10C0 p^0 q^10 = (4/5)^10 = 0.1074

Poisson Distribution

This is also a discrete distribution. It was originated by a French mathematician Siméon Denis Poisson in 1837. The Poisson distribution is the limiting form of the binomial distribution as n becomes infinitely large and p approaches zero such that np = m remains fixed; in practice it is applied when n > 20 and p < 0.05. The Poisson distribution is useful for rare events. Suppose in the binomial distribution

(a) p is very small
(b) n is so large that np = m is constant

Then, we would get the following distribution

x              0        1             2               3               .......   x                Total
Probability    e^–m     m e^–m / 1!   m^2 e^–m / 2!   m^3 e^–m / 3!   .......   m^x e^–m / x!    1

It is a Poisson distribution. Under these conditions the probability of getting x successes is

p(x) = e^–m m^x / x!

Sum of the probabilities of 0, 1, 2, 3 ....... successes is

e^–m (1 + m + m^2/2! + m^3/3! + .......) = e^–m × e^m = 1

where e is a constant whose value is 2.7183 and m is the parameter of the distribution i.e., the average number of occurrences of an event.

A classical example of the Poisson distribution is given by road accidents. As we know the number of people travelling on the road is very large i.e., n is large. Probability that any specific individual runs into an accident is very small. However,

np = average number of road accidents is a finite constant on any particular day.

Therefore, x (number of road accidents on a particular day) follows a Poisson distribution.

The various parameters of Poisson Distribution are :

Mean (m) = np
σ^2 (variance) = np = m
σ = √m
µ2 = np = m
µ3 = m
µ4 = m + 3m^2

∴ β1 = µ3^2 / µ2^3 = m^2 / m^3 = 1/m

β2 = µ4 / µ2^2 = (m + 3m^2) / m^2 = 3 + 1/m

Example : One house in 1000 has a fire in a district per year. What is the probability that exactly 5 houses will have fire during the year if there are 2000 houses ?

Solution : We shall apply the Poisson distribution

m = np where n = 2000, p = 1/1000

∴ m = np = 2000 × 1/1000 = 2

P(x) = e^–m m^x / x! when x = 5 and e = 2.7183

P(5) = e^–2 × 2^5 / 5! = 0.1353 × 32 / 120 = 0.0361

where e^–2 = Reciprocal [AL (2 log 2.7183)] = Reciprocal of 7.389 = 0.1353 (AL denotes antilog)

Example : If 3% of the bulbs manufactured are defective, calculate the probability that a sample of 100 bulbs will contain (i) no defective bulb and (ii) one defective bulb, using the Poisson distribution.

Solution : Given that 3% of the bulbs are defective, p = 3/100 and n = 100.

∴ m = np = 100 × 3/100 = 3

Probability of no defective bulb in a sample of 100 is

P(x) = e^–m m^x / x! where m = 3 and e = 2.7183

P(0) = 2.7183^–3 = 0.05 Ans.

Probability of one defective bulb in a sample of 100 is

P(1) = e^–3 × 3^1 / 1! = 2.7183^–3 × 3 = 0.15 Ans.
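The Poisson probabilities in the fire and bulb examples can be computed without logarithm tables. The Python sketch below is illustrative; the function names `poisson_pmf` and `binomial_pmf` are assumptions made here. It also compares the Poisson value for the fire example with the corresponding binomial probability for n = 2000 and p = 1/1000, to show how close the limiting form is in that case.

```python
from math import comb, exp, factorial

def poisson_pmf(x, m):
    """P(x) = e^(-m) * m^x / x! for a Poisson distribution with mean m."""
    return exp(-m) * m**x / factorial(x)

def binomial_pmf(x, n, p):
    """P(x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Fire example: n = 2000 houses, p = 1/1000, so m = np = 2.
fire_poisson = poisson_pmf(5, 2000 * (1 / 1000))
fire_binomial = binomial_pmf(5, 2000, 1 / 1000)

# Bulb example: n = 100 bulbs, p = 3/100, so m = np = 3.
bulbs_none = poisson_pmf(0, 3.0)
bulbs_one = poisson_pmf(1, 3.0)

print(round(fire_poisson, 4), round(fire_binomial, 4))
print(round(bulbs_none, 4), round(bulbs_one, 4))
```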

Normal Distribution

The most important continuous probability distribution used in the entire field of statistics is the normal distribution. The normal curve is a bell-shaped curve that extends infinitely in both directions, coming closer and closer to the horizontal axis without touching it. The mathematical equation of the normal curve was developed by De Moivre in 1733. A continuous random variable x is said to be normally distributed if it has the probability density function represented by the equation of the normal curve

Y = [1 / (σ √(2π))] e^(–(x – µ)^2 / (2σ^2))

where µ and σ are the mean and standard deviation, which are the two parameters, and e = 2.7183, π = 3.1416 are constants.

It may be understood that the normal distributions can have different shapes depending upon values of µ and σ but there is one and only one normal distribution for any given pair of values of µ and σ.

Properties of Normal Distribution

1. If the parameters µ and σ of the normal curve are specified, the normal curve is fully determined and we can draw it by obtaining the value of y corresponding to different values of x (the abscissa).

2. The normal curve tends to touch the x-axis only at infinity i.e. the x-axis is an asymptote to the normal curve. It is a continuous curve stretching from – ∞ to + ∞.

3. The mean, median and mode of the normal distribution are equal.

4. The height of the normal curve is maximum at x = µ. Hence the mode of the normal curve is x = µ.

5. The two quartiles Q1 and Q3 are equidistant from the median.

Q1 = µ – 0.6745 σ
Q3 = µ + 0.6745 σ

Hence Quartile Deviation = (Q3 – Q1) / 2 = 0.6745 σ

6. Mean deviation about mean is σ √(2/π) or MD = 0.7979 σ

7. The points of inflexion of the normal curve occur at x = µ + σ and x = µ – σ

8. The tails of the curve extend to infinity on both sides of the mean. The maximum ordinate at X = µ is given by Y = 1 / (σ √(2π)).

9. Approximately 100% of the area under the curve is covered by µ ± 3σ.

Distance from the mean ordinate      % of total area under
in terms of ± σ                      the normal curve
Mean ± 1σ                            68.27
Mean ± 2σ                            95.45
Mean ± 3σ                            99.73

10. All odd moments are equal to zero.

µ1 = µ3 = 0

β1 = 0 and β2 = 3. Thus the curve is mesokurtic.

11. The normal distribution is formed with a continuous variable.

12. The fourth moment is equal to 3σ^4 for normal distribution.
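Several of the numerical constants quoted in these properties can be reproduced with the Python standard library. The sketch below is an illustration only: it uses `statistics.NormalDist` for the quartile of the standard normal curve and `math.erf` for the area within k standard deviations of the mean; the variable and function names are assumptions made for this example.

```python
from math import erf, pi, sqrt
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Property 5: the upper quartile lies about 0.6745 standard deviations above the mean.
quartile_multiplier = standard_normal.inv_cdf(0.75)

# Property 6: mean deviation about the mean is sigma * sqrt(2 / pi).
mean_deviation_multiplier = sqrt(2 / pi)

# Property 8: maximum ordinate of the standard normal curve, 1 / sqrt(2 * pi).
max_ordinate = standard_normal.pdf(0.0)

def percent_within(k):
    """Percentage of the total area lying within mean +/- k standard deviations."""
    return 100 * erf(k / sqrt(2))

areas = {k: round(percent_within(k), 2) for k in (1, 2, 3)}

print(round(quartile_multiplier, 4), round(mean_deviation_multiplier, 4))
print(round(max_ordinate, 4), areas)
```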

The equation of the normal curve gives the ordinate of the curve corresponding to any given value of x. But we are interested in finding out the area under the normal curve rather than its ordinate (y). A normal curve with 0 mean and unit standard deviation is known as the standard normal curve. Statistical tables give the areas and ordinates of the normal curve corresponding to the standard normal variate

z = (x – µ) / σ and not corresponding to x.

Let us see the normal curve area under x-scale and z-scale.

Calculation of Probabilities : Now we discuss the method of calculating probabilities where the distribution follows the normal pattern. In fact, the probability for the variable to assume a value within a given range, say between X1 and X2, is equal to the ratio of the area under the curve in that range to the total area under the curve. To obtain the relevant areas, we first transform a given value of the variable X into the standardized variate Z as follows :

Z = (X – µ) / σ

Then we consult the normal area table. This table is constructed in a manner such that the areas between the mean (µ) and the particular values of Z are given. The first column of this table contains values of Z from 0.0 to 3.0, while the top row of the table gives values 0.00, 0.01, 0.02, ... 0.09. To find the area (from the mean) to a specific value of Z, we look up in the first column for the Z-value up to its first decimal place while its second decimal place is read from the top row. To illustrate, if we want to find the area between the mean and Z = 1.42, then we look for 1.4 in the first column and 0.02 in the top row. Corresponding to these, the value in the table reads 0.4222. Similarly, it can be verified that the area up to Z = 0.10 is equal to 0.0398 while for Z = 2.59, it is 0.4952.

Let us understand a few more things :

(i) The area under the curve from Z = 0 (when X = µ) to a particular value of Z gives the proportion of the area under this part of the curve to the total area under the curve. Thus, the area from Z = 0 to Z = 1.42 has the value 0.4222. Naturally, this is taken as the probability that the variable in question will assume a value within these limits.

(ii) Since the normal curve is symmetrical with respect to the mean, the area between µ (Z = 0) and a particular value of Z to its right will be the same as the area between µ and the same value of Z to its left. Thus, the area between Z = 0 and Z = 1.5 is equal to the area between Z = 0 and Z = –1.5. Remember that for values of X greater than µ, the Z value will be positive while for X < µ, the value of Z would be negative.

(iii) The general procedure for calculating probabilities is like this :

(a) specify clearly the relevant area under the curve which is of interest.

(b) determine the Z value (s).

(c) obtain the required area (s) with reference to the normal area table.
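The three steps above can also be sketched in code. The following is only an illustration and not part of the original text: it replaces the printed normal area table with the error function from Python's standard library, and the helper names `cdf`, `area_from_mean` and `area_between` are assumptions introduced here.

```python
import math

def cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_from_mean(z):
    """Area between Z = 0 and Z = z, the quantity listed in a normal area table."""
    return abs(cdf(z) - 0.5)

def area_between(z1, z2):
    """Area under the standard normal curve between z1 and z2."""
    return abs(cdf(z2) - cdf(z1))

# Table look-ups mentioned in the text, rounded to four decimals like the table.
for z in (1.42, 0.10, 2.59):
    print(z, round(area_from_mean(z), 4))

# Symmetry: the area from 0 to 1.5 matches the area from 0 to -1.5.
print(math.isclose(area_from_mean(1.5), area_from_mean(-1.5)))
```

Working from the cumulative area keeps one rule for every case: an area between two Z values is a difference of two cumulative areas, and a tail area is the complement of one.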

Example : Find the area under the normal curve :

(i) between Z = 0 and Z = 1.20

(ii) between Z = 1.0 and Z = 2.43

(iii) to the right of Z = 1.37

(iv) between Z = – 1.3 and Z = 1.49

(v) to the right of Z = – 1.78

Solution : For each of these, the relevant portions under the normal curve are shown shaded and the areas

determined with reference to the normal area table.

Fig. 2

(i) Area between Z = 0 and Z = 1.20 is 0.3849.

Fig. 3

(ii) Area between Z = 0 and Z = 1.0 is 0.3413

Area between Z = 0 and Z = 2.43 is 0.4925

∴ Area between Z = 1.0 and Z = 2.43 is 0.4925 – 0.3413 = 0.1512.

Fig. 4

(iii) Area between Z = 0 and Z = 1.37 is 0.4147.

Total area under the curve being equal to 1, the area to the right of Z = 0 is 0.5, as is the area to the left of it.

∴ Area beyond Z = 1.37 is 0.5000 – 0.4147 = 0.0853.

Fig. 5

(iv) Area between Z = 0 and Z = –1.3 is 0.4032.

Area between Z = 0 and Z = 1.49 is 0.4319

∴ Area between Z = –1.3 and Z = 1.49 is 0.4032 + 0.4319 = 0.8351.

Fig. 6

(v) Area between Z = 0 and Z = –1.78 is 0.4625.

∴ Area to the right of Z = – 1.78 is 0.4625 + 0.5 = 0.9625.
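The five areas can be cross-checked numerically. The sketch below is an addition for illustration, assuming Python's standard `statistics.NormalDist` in place of the printed table; the dictionary name `areas` is hypothetical. Part (ii) uses Z = 1.0 as its lower limit, the value used in the worked solution.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

areas = {
    "(i)": Z.cdf(1.20) - Z.cdf(0),      # between Z = 0 and Z = 1.20
    "(ii)": Z.cdf(2.43) - Z.cdf(1.0),   # between Z = 1.0 and Z = 2.43
    "(iii)": 1 - Z.cdf(1.37),           # to the right of Z = 1.37
    "(iv)": Z.cdf(1.49) - Z.cdf(-1.3),  # between Z = -1.3 and Z = 1.49
    "(v)": 1 - Z.cdf(-1.78),            # to the right of Z = -1.78
}

for label, area in areas.items():
    print(label, round(area, 4))
```

Because the table entries are themselves rounded to four decimals, a result obtained by adding or subtracting two entries may differ from the direct computation in the fourth decimal place.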

Example : Balls are tested by dropping them from a certain height and observing the height of the bounce. A ball is said to be fast if it rises above 36 inches. The height of the bounce may be taken to be normally distributed with mean 33 inches and standard deviation of 1.2 inches. If a ball is drawn at random, what is the chance that it would be fast ?

Solution : The given information is depicted in figure 7. Here we have to calculate the probability that the height of the bounce, X, would be greater than 36. This is shown shaded in figure 7.

Fig. 7

We have, X = 36, µ = 33 and σ = 1.2

Z = (X – µ)/σ = (36 – 33)/1.2 = 2.5

From the normal area table, the area between Z = 0 and Z = 2.5 is equal to 0.4938. So the area beyond Z = 2.5 is 0.5 – 0.4938 = 0.0062. Therefore, P(X > 36) = 0.0062, the chance of getting a fast ball.
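The same calculation can be written as a short script. This is a sketch added for illustration, assuming Python's standard `statistics.NormalDist`; the names `bounce`, `z` and `p_fast` are hypothetical.

```python
from statistics import NormalDist

# Height of bounce in inches: mean 33, standard deviation 1.2 (from the example).
bounce = NormalDist(mu=33, sigma=1.2)

z = (36 - bounce.mean) / bounce.stdev  # standardized value of X = 36
p_fast = 1 - bounce.cdf(36)            # P(X > 36), the area to the right of X = 36

print(round(z, 2), round(p_fast, 4))
```

Standardizing by hand and working on the x-scale directly lead to the same tail area, since `cdf` performs the transformation to Z internally.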

Example. The mean of the inner diameters (in inches) of a sample of 200 tubes produced by a machine is 0.502 and the standard deviation is 0.005. The purposes for which these tubes are intended allow a maximum tolerance in the diameter of 0.496 to 0.508, and otherwise the tubes are considered defective. What percentage of tubes produced by the machine is expected to be defective if the diameters are found to be normally distributed ?

Solution : Given µ = 0.502, and σ = 0.005. Since the tolerance limits are 0.496 – 0.508, the tubes which fall outside these limits on either side shall be defective. So we have to determine here P(X < 0.496) + P(X > 0.508).

(i) For area below X = 0.496, we have,

Z = (0.496 – 0.502)/0.005 = –1.2

The area between Z = 0 and Z = –1.2 is found from the table to be equal to 0.3849.

∴ Area to the left of Z = –1.2 equals 0.5 – 0.3849 = 0.1151.

(ii) For area to the right of X = 0.508, we have,

Z = (0.508 – 0.502)/0.005 = 1.2

The area to the right of Z = 1.2 would also equal 0.1151, as above.

∴ P(X < 0.496) + P(X > 0.508) = 0.1151 + 0.1151 = 0.2302

Thus, 23.02 per cent of the tubes shall be defective.
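A two-tailed calculation of this kind can be checked as follows. The sketch is an illustrative addition, assuming Python's standard `statistics.NormalDist`; the variable names are hypothetical.

```python
from statistics import NormalDist

# Inner diameter in inches: mean 0.502, standard deviation 0.005 (from the example).
diameter = NormalDist(mu=0.502, sigma=0.005)
lower, upper = 0.496, 0.508            # tolerance limits

p_too_small = diameter.cdf(lower)      # area to the left of X = 0.496
p_too_large = 1 - diameter.cdf(upper)  # area to the right of X = 0.508
p_defective = p_too_small + p_too_large

print(round(100 * p_defective, 2), "per cent defective")
```

The two tails are equal here only because the tolerance limits sit symmetrically about the mean; with asymmetric limits each tail has to be computed separately, as the code does.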

Example. Find out the ordinate of Normal curve.

corresponding to x = 7

Solution : It is observed for this distribution

µ = 0 and σ² = 16 or σ = 4.

∴ Corresponding to x = 7, we obtain the standard normal variate

Z = (7 – 0)/4 = 1.75

and from the statistical table we can see that the value of the ordinate at Z = 1.75 is 0.0863.
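As a check on the table look-up, the ordinate can be computed directly. The sketch below is an illustration added here, not part of the original solution; it assumes Python's standard `statistics.NormalDist` with µ = 0 and σ = 4 as stated in the solution, and the variable names are hypothetical. It also shows that the tabulated value is the ordinate of the standard normal curve at Z, while the ordinate of the curve on the x-scale is that value divided by σ.

```python
from statistics import NormalDist

mu, sigma, x = 0, 4, 7
z = (x - mu) / sigma                         # standard normal variate, 1.75

ordinate_z = NormalDist().pdf(z)             # ordinate of the standard normal curve at Z
ordinate_x = NormalDist(mu, sigma).pdf(x)    # ordinate of the N(0, 16) curve at x = 7

print(round(ordinate_z, 4))
print(round(ordinate_x, 4), round(ordinate_z / sigma, 4))  # these two agree
```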

Example : The life (x) of electric bulbs in hours is supposed to be normally distributed with mean 155 hours and standard deviation 19 hours. What is the probability that the life of a bulb will be :

(i) less than 117 hours (ii) more than 193 hours (iii) between 117 and 193 hours.

Solution : Given µ = 155 and σ = 19

Therefore, corresponding to x = 117 the standard normal variate is z = (117 – 155)/19 = –2

Fig. 8

We have to obtain the area to the left of Z = –2 [Pr(Z < – 2)]

From the table we read the area between z = 0 and z = –2 and subtract it from 0.5

∴ 0.5 – .4772 = 0.0228

Hence the probability that the life of a bulb is less than 117 hours is 0.0228.

To obtain the probability that the life of the bulb is more than 193 hours, we obtain the corresponding standard normal variate

z = (193 – 155)/19 = +2

Fig. 9

By symmetry, the area to the right of Z = +2 is also 0.5 – .4772 = 0.0228, which is the probability that the life of a bulb is more than 193 hours.

The area between 117 hours and 193 hours is the area between Z = –2 and Z = +2.

Fig. 10

Hence Pr(– 2 < Z < +2) = Pr (117 < x < 193)

= .4772 + .4772 = .9544 Ans.

Example : The results of a particular examination are given below in a summary form :

Result Percentage of Candidates

Total passed 80

Passed with distinction 10

Failed 20

It is known that a candidate fails if he obtains less than 40 marks (out of 100), while he must obtain

at least 75 marks in order to pass with distinction. Determine the mean and the standard deviation of marks

assuming distribution of marks to be normal.

Solution : According to the given information,

Percentage of students getting marks less than 40 = 20,

Percentage of students getting marks between 40 and 75 = 70, and

Percentage of students getting marks above 75 = 10.

The relevant area is shown in figure 11.

Fig. 11

Here P(X < 40) = 0.20, P(40 < X < 75) = 0.70 and P(X > 75) = 0.10

Let µ and σ represent the mean and standard deviation of the distribution. We have, area between µ and X = 40 equal to 0.30, and area between µ and X = 75 equal to 0.40.

Now we have,

For X = 40, Z = (40 – µ)/σ

For X = 75, Z = (75 – µ)/σ

Corresponding to the area 0.30 in the normal area table, Z = 0.84. Thus, for X = 40, we have Z = – 0.84 (since the value of 40 lies to the left of µ). Similarly, for the area equal to 0.40, we have Z = 1.28.

We have, then

(40 – µ)/σ = – 0.84 ...(i)

and

(75 – µ)/σ = 1.28 ...(ii)

Rearranging the above equations, we get

µ – 0.84σ = 40 and ...(iii)

µ + 1.28σ = 75 ...(iv)

Subtracting equation (iii) from equation (iv), we get

2.12σ = 35

or σ = 35/2.12 = 16.51

Substituting the value of σ in equation (iii) and solving for µ, we get

µ – (0.84) (16.51) = 40

or µ = 40 + 13.87 = 53.87

Thus, Mean = 53.87 marks and standard deviation = 16.51 marks.
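The reverse problem, recovering µ and σ from two known areas, can be sketched in code as well. This is an illustrative addition that assumes Python's standard `statistics.NormalDist` and its `inv_cdf` method in place of a reverse table look-up; the variable names are hypothetical.

```python
from statistics import NormalDist

std = NormalDist()            # standard normal curve

z_fail = std.inv_cdf(0.20)    # Z with 20% of the area to its left  (X = 40)
z_dist = std.inv_cdf(0.90)    # Z with 90% of the area to its left  (X = 75)

# Two equations in two unknowns:
#   (40 - mu) / sigma = z_fail
#   (75 - mu) / sigma = z_dist
sigma = (75 - 40) / (z_dist - z_fail)
mu = 40 - z_fail * sigma

print(round(z_fail, 2), round(z_dist, 2))
print(round(mu, 2), round(sigma, 2))
```

Since `inv_cdf` returns Z values to full precision rather than the two-decimal values 0.84 and 1.28 read from the table, the computed standard deviation can differ slightly from the hand calculation in the second decimal place.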

Example : There are 900 students in B.Com (Hons.) course of a college and the probability of a student needing a particular book on a day is 0.10. How many copies of the book should be kept in the library so that there is at least 0.90 chance that a student needing that book will not go disappointed ? Assume normal approximation to the binomial distribution.

Solution : According to the given information, n = 900, p = 0.10, and q = 1 – p = 1 – 0.10 = 0.90. Therefore, mean = np = 900 × 0.10 = 90, and standard deviation = √(npq) = √(900 × 0.10 × 0.90) = √81 = 9. Here we are required to determine X, to the right of which 10 percent of the area under the curve lies.

Area between µ = 90 and X would be equal to 0.50 – 0.10 = 0.40. Now, Z value corresponding to the area 0.40 equals 1.28.

Thus, Z = (X – 90)/9 = 1.28

Solving for X, we get X = 1.28 × 9 + 90 = 11.52 + 90 = 101.52

Therefore 102 books should be kept in the library to meet the demand of students.
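The normal approximation used here can be sketched as follows. The code is an illustration added to the text, assuming Python's standard `math` and `statistics` modules; the variable names are hypothetical.

```python
import math
from statistics import NormalDist

n, p = 900, 0.10
q = 1 - p

mean = n * p                 # np
sd = math.sqrt(n * p * q)    # square root of npq

# Demand level X with 90% of the area to its left under the approximating normal curve.
x = NormalDist(mean, sd).inv_cdf(0.90)
copies = math.ceil(x)        # round up: a fraction of a book cannot be stocked

print(round(mean, 2), round(sd, 2), round(x, 2), copies)
```

Rounding up rather than to the nearest integer matters: stocking fewer copies than X would leave the chance of meeting demand below the required 0.90.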

B.A. Prog./III/NS/2008

Application Course – Basic Statistics

Time : 2 hours Maximum Marks : 35

(a) (i) Question no. 1 is compulsory.

(ii) Attempt four more questions from question numbers 2 to 7 selecting at least one from

each of the sections I, II and III. Full explanation is to be given for these questions.

(iii) Marks are indicated against each question.

(b) Use of calculator is not allowed.

1. Short answers with proper justification are expected in all the five parts of this question. Each

part is of 3 marks.

(i) The distribution of the total finance charges, to the nearest Rs, which 240 customers paid on their budget accounts at a departmental store is as follows :

Amount (in Rs.) : 0-19 20-39 40-59 60-79 80-89

Frequency : 16 78 77 54 15

Draw a histogram of this distribution.

(ii) Goals scored by two teams - Aryans and Blues - in a football season were as follows :

No. of goals scored        No. of matches
in a match                 Aryans    Blues

0                          5         6

1                          4         5

2                          3         4

3                          2         3

4                          1         2

Find which team is more consistent.

(iii) The probability density function of the random variable X is given by

f(x) =

Find the value of k.

(iv) The members of a consulting firm rent cars from three rental agencies: 60% from agency 1, 30% from agency 2 and 10% from agency 3. It is known that 9% of the cars from agency 1 need a tune up, 20% of the cars from agency 2 need a tune up and 6% of the cars from agency 3 need a tune up. If a rental car delivered to the firm needs a tune up, find the probability that it came from rental agency 2.

(v) Let the correlation coefficient between two variables X and Y be zero. Comment on the

independence of X and Y. (15)

SECTION I

2. We randomly select two shirts, one after the other, from a carton containing twelve shirts, three

of which have blemishes. Find the probability that

(i) both of them will have blemishes

(ii) both of them will be without blemishes

(iii) one will be with blemishes and one without blemishes (10)

3. From the following results, obtain the two regression lines and estimate the yield of crops when rainfall is 29 cm.

            Y                  X
            Yield (in kg.)     Rainfall (in cm.)

Mean        508.4              26.7

S.D.        36.8               4.6

Co-efficient of correlation between yield and rainfall = 0.52. (10)

SECTION II

4. The customer accounts of a certain departmental store have an average balance of Rs. 120 and

a standard deviation of Rs. 40. Assuming that the account balances are normally distributed,

find

(i) the proportion of the accounts which is over Rs. 150

(ii) the proportion of the accounts which is between Rs. 100 and Rs. 150.

It is given that the area under the standard normal curve between z = 0 and z = 0.75 is 0.2734 and that between z = 0 and z = 0.5 is 0.1915.

5. Records show that the probability is 0.00005 that a car will have a flat tyre while driving through a certain tunnel. Use the Poisson approximation to the binomial distribution to find the probability that at least 2 of 10,000 cars driving through this tunnel will have flat tyres. It is given that e^(–0.5) = 0.607. (10)

SECTION III

6. In an election, 132 of 200 male voters and 90 of 159 female voters favour a certain candidate.

Find a 99% confidence interval for the difference between the actual proportions of male and

female voters who favour the candidate. (10)

7. A machine is designed to produce insulating washers for electrical devices of average thickness

of 0.025 cm. A random sample of 10 washers was found to have an average thickness of 0.024

cm. with a standard deviation of 0.002 cm. Test the significance of the deviation.

It is given that value of t for 9 degrees of freedom at 5% level is 2.262. (10)

B.A. Prog./III/2009

Application Course – Basic Statistics

Time : 2 hours Maximum Marks : 35

(a) (i) Question no. 1 is compulsory.

(ii) Attempt four more questions from question numbers 2 to 7 selecting at least one from

each of the sections I, II and III. Full explanation is to be given for these questions.

(iii) Marks are indicated against each question.

(b) Use of calculator is not allowed.

Note : The maximum marks printed on the question paper are applicable for the students of the regular colleges (Cat. ‘A’). These marks will, however, be scaled up proportionately in respect of the students of NCWEB at the time of posting of awards for compilation of result.

1. Short answers with proper justification are expected in all the five parts of this question. Each

part is of 3 marks.

(i) The weights of the members of a football team (to the nearest pound) vary from 177 to 265

pounds. Indicate the limits of ten classes into which these weights might be grouped. Also

find the class-marks of these classes.

(ii) Determine the constant ‘C’ so that the following function can serve as the probability

distribution of a discrete random variable with the given range

f(x) = cx, x = 1, 2, 3, 4, 5.

(iii) If 40 percent of the mice used in an experiment will become very aggressive within one minute after having been administered an experimental drug, find the probability that exactly four of 12 mice which have been administered the drug become very aggressive within one minute.

(iv) How many different random samples of size 5 can be drawn from the finite population which contains all the letters of the English alphabet, A to Z ? List any 3 different such samples.

(v) For certain X and Y series, which are correlated, the two lines of regression Y on X and X

on Y are respectively

5x – 6y + 90 = 0

15x – 8y – 130 = 0

Calculate the correlation coefficient between X and Y. (3 × 5 = 15)

2. Determine Q1 and Q3 for the distribution of IQ levels of 65 students in a college :

IQ Frequency

less than 90 3

90 - 99 14

100 - 109 22

110 - 119 19

More than 119 7

Comment on the result. (10)

3. On four days it took a person 17, 12, 15 and 21 minutes to drive to work.

(a) Calculate the standard deviation of these data.

(b) Subtract 10 from each of the values and then calculate the standard deviation of the resulting data.

On comparing the two standard deviations obtained from (a) and (b), what do you conclude ? (10)

SECTION II

4. Assuming that half the population are consumers of rice, so that the chance of an individual being a consumer is 1/2, and assuming that 100 investigators each take 10 individuals to see whether they are consumers, how many investigators would you expect to report that three people or less were consumers ? (10)

5. The mean weight of 500 male students in a certain college is 151 lb. and the standard deviation is 15 lb. Assuming the weights are normally distributed, find how many students weigh between 120 and 155 lb.

(It is given that area under the standard normal curve between Z = 0 and Z = 0.27 is 0.1064, and that between Z = 0 and Z = 2.07 is 0.4808). (10)

SECTION III

6. In a survey of buying habits, 400 women shoppers are chosen at random in super market ‘A’ located in a certain section of the city. Their average weekly food expenditure is Rs. 250 with a standard deviation of Rs. 40. For 400 women shoppers chosen at random in super market ‘B’ in another section of the city, the average weekly food expenditure is Rs. 220 with a standard deviation of Rs. 55. Test at 1% level of significance whether the average weekly food expenditures of the two populations of shoppers are equal. (At 1% level of significance, the critical value of Z = ±2.58). (10)

7. Sixteen cartons are taken at random from an automatic filling machine. The mean net weight of the 16 cartons is 15.4 oz. and the standard deviation is 0.88 oz. Can we say that there is a significant difference of the sample mean from the intended weight of 16 oz. ?

(It is given that the values of t for 15, 16 and 17 degrees of freedom at 5% level of significance are 2.131, 2.120 and 2.110). (10)
