
Indira Gandhi National Open University
School of Social Sciences
REC-001: RESEARCH METHODOLOGY

BLOCK 1  Foundations of Research Methodology in Economics
BLOCK 2  Quantitative Methods: Data Collection
BLOCK 3  Quantitative Methods: Data Analysis
BLOCK 4  Qualitative Methods
BLOCK 5  Database of Indian Economy
BLOCK 6  Use of SPSS and EVIEWS Packages for Analysis and Presentation of Data


EXPERT COMMITTEE

Prof. Alakh N. Sharma, Director, Institute of Human Development, New Delhi – 110 002
Prof. B. Kamaiah, Professor of Economics, University of Hyderabad, Hyderabad
Prof. D.N. Rao, Rtd. Professor of Economics, CESP, School of Social Sciences, JNU, New Delhi – 110 067
Prof. Ila Patnaik, Professor of Economics, National Institute of Public Finance & Policy, New Delhi
Prof. Pami Dua, Professor of Economics, Delhi School of Economics (University of Delhi), Delhi
Prof. Romar Correa, Professor of Economics, University of Mumbai, Mumbai
Prof. Tapas Sen, Professor, National Institute of Public Finance & Policy, New Delhi – 110 067
Prof. Gopinath Pradhan, Professor of Economics, IGNOU, New Delhi
Prof. Anjila Gupta, Professor of Economics, IGNOU, New Delhi
Prof. Madhu Bala, Professor of Economics, IGNOU, New Delhi
Dr. K. Barik, Reader in Economics, IGNOU, New Delhi
Dr. B.S. Prakash, Reader in Economics, IGNOU, New Delhi
Sh. Saugato Sen, Lecturer (Selection Grade) in Economics, IGNOU, New Delhi
Prof. Narayan Prasad (Convenor), Professor of Economics, IGNOU, New Delhi

Programme Coordinator: Prof. Narayan Prasad
Course Coordinator: Prof. Narayan Prasad

Course Preparation Team
Block 1: Prof. D. Narsimha Reddy, Retd. Professor of Economics, University of Hyderabad, Hyderabad
Blocks 2 & 5: Sh. S.S. Suryanarayanan, Ex. Joint Advisor, Planning Commission, New Delhi
Block 3: Prof. Narayan Prasad, IGNOU, New Delhi
Block 4: Prof. Narayan Prasad, IGNOU, New Delhi
Content, Format and Language Editing (IGNOU Faculty): Prof. Narayan Prasad, Professor of Economics, School of Social Sciences, IGNOU, New Delhi
Secretarial Assistance: Mrs. Seema Bhatia

Prof. D.M. Diwakar, Giri Institute of Development Studies, Lucknow

Block 6: Matter related to SPSS adapted from Unit 14 of course MFN-009 (Research Methods in Bio-statistics), part of M.Sc. (DFSM), SOCE, IGNOU, New Delhi; Dr. Alok Mishra, Manager, Evalue service.com.pvt.ltd.

August 2009
© Indira Gandhi National Open University
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from the Indira Gandhi National Open University. Further information about the Indira Gandhi National Open University courses may be obtained from the University's Office at Maidan Garhi, New Delhi-110 068.


INTRODUCTION TO REC-001: RESEARCH METHODOLOGY

In order to pursue a research degree programme, you need to be equipped with the various constituents of research methodology and the different techniques applied in data collection and analysis. The present course aims to cater to this need. The theoretical perspectives that guide research, the tools and techniques of data collection and the methods of data analysis together constitute research methodology. This course deals with all these aspects. The course comprises six blocks.

Block 1, on the foundations of research methodology in economics, covers the entire breadth of the main trends of development in the philosophy of science and the main debates in the methodology of economics. Introducing three approaches to research methodology (scientific, historical and institutional), this block is devoted to the scientific methodology. The first part deals with the philosophical foundations, covering Positivism and Karl Popper's Critical Rationalism, followed by three important models of scientific explanation: the Hypothetico-Deductive (H-D) Model, the Deductive-Nomological (D-N) Model and the Inductive-Probabilistic (I-P) Model. The third part is devoted to the main debates in mainstream economic methodology from the classical to the contemporary period. Each section in the block, besides giving an outline of the subject matter, also provides a detailed reading guide, with some references to critical reading material included in boxes.

Block 2: Studies of the behaviour of variables and of the relationships amongst them constitute the essentials of empirical research in economics. This necessitates measurement of the variables involved. Hence, Block 2 addresses the question of how to assemble data on a scientific basis. Covering three methods of data collection (the census and survey method, the observation method, and the experimental method) and the tools used in data collection, the block deals with the methods of selecting random and non-random samples from a population in order to make judgments about the population. Eight methods of random sampling are discussed, along with (i) operational procedures for drawing samples and (ii) expressions for estimators of parameters and measures of their variation, including estimators of such variation where the population variation parameter is not known. The question of choosing the sampling method appropriate to a given research context is also dealt with. Like the first block, each section and sub-section ends with a box guiding you to relevant portions of one or more publications that give you more details on the topic(s) included.

Block 3: Various statistical and econometric techniques are applied in the analysis of data. Broadly, these techniques are of the descriptive and inferential types. Since descriptive techniques like measures of central tendency, measures of dispersion, skewness, one-way ANOVA, index numbers, time series, simple correlation and simple regression are covered in master's degree level courses on statistics/quantitative techniques, they have been skipped here, without undermining their significance and application in research. This block essentially deals with the basic steps involved in an empirical study, the estimation of parameters in two-variable and n-variable situations and their interpretation, the testing of hypotheses by applying parametric and non-parametric tests, and the handling of the problems of autocorrelation, heteroscedasticity and multicollinearity.
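In practice, the Block 3 exercises are carried out with statistical software (Block 6 introduces SPSS and EVIEWS for this purpose). Purely as an illustration, and not as part of the course material, the short sketch below assumes Python with the statsmodels library and uses simulated data: it estimates a two-variable regression by ordinary least squares and runs simple checks for two of the problems mentioned above, heteroscedasticity and autocorrelation.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(seed=0)
x = rng.normal(size=100)                    # hypothetical explanatory variable
y = 2.0 + 0.5 * x + rng.normal(size=100)    # hypothetical dependent variable

X = sm.add_constant(x)                      # add an intercept term
results = sm.OLS(y, X).fit()                # estimate the two-variable regression by OLS
print(results.summary())                    # coefficients, standard errors, t-tests, R-squared

# Diagnostic checks of the kind Block 3 discusses:
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print("Breusch-Pagan p-value (heteroscedasticity):", lm_pvalue)
print("Durbin-Watson statistic (autocorrelation):", durbin_watson(results.resid))
# Multicollinearity checks (e.g. variance inflation factors) become relevant
# only when the model has several explanatory variables.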


Block 4: Empirical evidence in research is captured and analysed by two approaches, quantitative and qualitative. In situations such as an in-depth scientific enquiry into complex events, their dimensions and variables, the cardinal or quantitative approach is of limited use because the exercise is time-consuming and costly. Hence, as an alternative, the qualitative approach to research methods is presented in Block 4. This block broadly deals with the philosophical foundations and research perspectives guiding qualitative research, the principles governing the participatory approach, and the process and stages involved in participatory and case study methods.

Block 5: For undertaking any meaningful research, whether situational assessment, testing of models, development of theory, evolving economic policy or assessing the impact of such policy, the availability of data is crucial and determines the scope of analysis. Hence, the present block deals with the different databases of the Indian economy. The block deals with the data available on major macro variables relating to the Indian economy, such as national income, saving and investment. It also throws light on the agricultural and industrial data base and on data relating to trade, finance and the social sectors (employment, unemployment, education, health, quality of life). Particular emphasis has been laid on the different concepts used in data collection, the data sources and the agencies involved in the compilation of data.

Block 6: The methodological advances in quantitative and qualitative analysis have been accompanied by a significant revolution in the computing power of desktop and laptop PCs. SPSS, EVIEWS, SAS and NUDIST are among the popular sophisticated statistical and econometric packages used for data analysis and data presentation in the social sciences in general and in economics in particular. Hence, the fundamentals of SPSS and EVIEWS and the use of their statistical components are covered in Block 6. The block aims to enable learners to use the SPSS and EVIEWS software to compute various statistical and econometric results and to analyse and present time series and cross-section data.


BLOCK 1 FOUNDATIONS OF RESEARCH METHODOLOGY IN ECONOMICS

Structure
1.0 Objectives
1.1 Introduction
1.2 An Overview of the Block
1.3 Approaches to Research Methodology: Scientific, Historical and Institutional
1.4 Philosophical Foundations
  1.4.1 Positivism
    1.4.1.1 Central Tenets of Positivist Philosophy
    1.4.1.2 Criticism of Positivism
  1.4.2 Post-Positivism
  1.4.3 Karl Popper and Critical Rationalism
  1.4.4 Thomas Kuhn and Growth of Knowledge
  1.4.5 Imre Lakatos: The Methodology of Scientific Research Programmes
  1.4.6 Paul Feyerabend: Methodological Dadaism and Anarchism
1.5 Models of Scientific Explanation
  1.5.1 Hypothetico-Deductive (H-D) Model
  1.5.2 Deductive-Nomological (D-N) Model
  1.5.3 Inductive-Probabilistic (I-P) Model
1.6 Debates on Models of Explanation in Economics
  1.6.1 Classical Political Economy and Ricardo's Method
  1.6.2 Hutchison and Logical Empiricism
  1.6.3 Milton Friedman and Instrumentalism
  1.6.4 Paul Samuelson and Operationalism
  1.6.5 Amartya Sen: Heterogeneity of Explanation in Economics
1.7 Research Problem and Research Design
  1.7.1 Research Problem
  1.7.2 Basic Steps in Framing a Research Proposal
1.8 Further Suggested Readings
1.9 Model Questions

1.0 Objectives

The main objectives of this block are to:

• introduce the basic outline of the philosophical foundations of the main strands of scientific methodology as it evolved in the philosophy of science,

• apprise the students of the evolution of the basic structure and method of scientific explanation,

• guide the students through the development of, and debates relating to, the methodological approaches of economics, and

• explain the limitations and strengths of the methodological foundations of economics so as to appreciate the process of growth of knowledge and effectively contribute to the same.


1.1 INTRODUCTION

Contemporary developments in research methodology have faced substantial challenges in social sciences like economics. While over the years there has been growing emphasis on rigour and methodological precision, there have also been serious reservations about the scientific basis of research, not only in economics but also in other sciences. There is growing interest in the history of the philosophy of science to understand what has been done and is being done, so that we can think about what ought to be done. Before the 1970s, the literature on the methodology of economics was meagre and mostly confined to the classics. But since the 1970s, interest in economic methodology has grown dramatically and, as Roger Backhouse observes, it is now possible to view economic methodology as a clearly identifiable sub-discipline within economics.

1.2 An Overview of the Block

The block is very ambitiously designed to cover the entire breadth of both the main trends of development in the philosophy of science and the main debates in the methodology of economics. The first part, beginning with the historical background of positivism, covers developments up to the contemporary trends in the philosophy of science. The second part is devoted to the main debates in mainstream economic methodology from the classical to the contemporary period. Since the focus in this block is substantially on the scientific method, the emphasis is primarily on the methodology of mainstream neo-classical economics. Each section in the block, besides an outline of the subject matter, provides a detailed reading guide, with some critical reading material included in boxes wherever it is felt necessary to draw your attention to a specific reading.

1.3 Approaches to Research Methodology: Scientific, Historical and Institutional

Unlike the natural sciences, the social sciences have had a long tradition of choice of methodology depending upon schools of thought, each of which had a distinct way of conceptualizing social relations and processes of development. For instance, the Institutional school in economics conceived of the economy as the system of related activities by which the people of any community get their living. This system embraces a body of knowledge and of skills and a stock of physical equipment. It also embraces a complex network of personal relations influenced by custom, ritual and dogma. The methodology focused on personal and institutional relations and processes, and was often descriptive, without involving any testing or verification. Similarly, the Historical school based its analysis of social and economic developments on historical data, and historical methods too have often been descriptive. There have been several developments in the Institutional approach with the emergence of New Institutional Economics, and similarly in the historical method with the emergence of cliometrics. The present block, however, is entirely devoted to 'scientific methodology', and hence the alternatives are not elaborated here. The alternative methods are discussed in Block 4 of this course.


1.4 PHILOSOPHICAL FOUNDATIONS

1.4.1 Positivism

The basic tenets of positivism had a relatively long history of evolution over the first half of the 20th century and went through different phases of development. A question is often asked: is it 'Positivism' or 'Positivisms'? Positivism in its evolved form is often called 'Logical Positivism' or 'Logical Empiricism'. Bruce Caldwell (1982), in the first chapter of his book, attempts to trace the development and basic aspects of Logical Empiricism. But for learners like you who intend to know the methodology of science, it would be helpful to familiarize yourself with I. Naletov's first chapter (esp. pp. 23-58), which distinguishes three phases in the development of Positivism and identifies the third phase, of the 1940s and 1950s, as Logical Positivism. Along with this, you should read Kolakowski's 'rules of Positivism', which are reproduced in the first chapter of C.G.A. Bryant's Positivism in Social Theory and Research (1985).

1.4.1.1 Central Tenets of Positivism

The central tenets of Positivism are summarized in Kolakowski's four rules of Positivism:

K1 The Rule of Phenomenalism: According to this rule, science is entitled to record only that which is actually manifested in experience. Science accepts that which is observable or experienced, i.e. the phenomenon, not the noumenon that is represented by the phenomenon. Science is based on that which exists, not on the essence of the existence.

K2 The Rule of Nominalism: Science involves recording experience and represents that which is experienced. Its terms are names for what is observed and give no extra, independent knowledge.

K3 The Rule of Value-free Statements: There is no place for value judgments and normative statements in science. Science is concerned with 'what is' and not 'what ought to be'.

K4 The Rule of the Unity of Method in Science: There is only one scientific method for all the sciences, irrespective of their subject matter. This is also known as the methodological monism of Positivism.

These propositions of Positivism emphasize 'verifiability' as the basic requirement for scientific pursuits and verification or 'testability' as the basic methodological requirement. These rules ensure a pursuit of scientific knowledge that leads to the unravelling of regularities in the occurrence of phenomena, which can then be explained in terms of universal laws. Testability and repeatability are important dimensions of the scientific method under Positivism. We shall return to the positivist structure of scientific explanation later; for the present, we turn to the limitations of Positivism and the resulting criticism.


1.4.1.2 Criticism of Positivism

By the 1960s, Positivism as a methodological approach to science had been subjected to extensive criticism, and its stature as the methodology of science declined over the years. One of the earliest critics of Positivism was Karl Popper, and much later the contemporary philosophers of science built alternative approaches to science. Some of the major criticisms of Positivism are listed below.

First, the positivist rule of beginning all scientific investigation with facts and facts alone was questioned by Popper. Knowledge does not start with nothing (tabula rasa) or out of nothing. "Before we can collect data, our interest in data of a certain kind must be aroused; the problem always comes first." There are no brute facts; all facts are theory-laden. Before one collects facts or observes things, one should have relevant questions, which arise from existing knowledge.

The second major criticism against Positivism relates to the problem of induction and the related problem associated with the test of verification. Popper gave the famous example of the colour of the swan. If one follows Positivism, then if one repeatedly observes, without exception and in a number of places, that the colour of the swan is white, one would generalize that 'all swans are white'. But Popper draws attention to the problem that any number of observations of swans in different locations does not mean that all swans are white. There could be one still to be observed which may turn out to be other than white. This is the typical problem of induction. One cannot verify the truth of a universal statement by any number of observations, but one non-white swan can falsify it. A universal theory can be shown to be false but never proven to be true. Therefore, verification is not the right test for universalization. Popper insisted on the falsification test to overcome the problem of induction.

Third, unlike the Positivist insistence on a universal method, the theoretical idiom of different sciences varies. Unless there is a specific theory relevant to the subject, facts cannot be expressed in a recognizable form. Otherwise, there will be mere description of instruments and activities rather than an underlying pursuit of knowledge. The famous example of Pierre Duhem is that one working with an oscillating iron bar with a mirror attached will see only objects and facts if one has no theory of the measurement of electrical resistance.

These criticisms of Positivism were mainly aimed at its empirical extremes, which claimed that observations are theory-independent.

1.4.2 Post-Positivism

As we have seen above, by the 1960s the limitations of Positivism were widely criticized. Much of the criticism took the form of exploring alternative approaches to scientific methodology, which emerged as the Post-Positivist philosophy of science. Post-Positivism consists of several approaches, of which Karl Popper's 'Critical Rationalism', Thomas Kuhn's 'Growth of Knowledge', Imre Lakatos's 'Scientific Research Programmes' and Paul Feyerabend's criticism 'against method' are the major contributions. We shall discuss each of these Post-Positivist approaches, their basic contents and the helpful reading material.


1.4.3 Karl Popper and 'Critical Rationalism'

Karl Popper, as we have seen above, was one of the earliest critics of Positivism and, over a period, his writings emerged as an alternative to Positivism. His approach to the philosophy of science is known as 'Critical Rationalism'. Since his writings evolved over a period of time, one has to be careful in choosing the reading material. His contribution may be broadly grouped into (i) criticism of Positivism, (ii) the basis of knowledge, (iii) the problem of induction and (iv) the methodology of falsificationism.

Growth of Knowledge
Since we are familiar with Popper's criticism of Positivism, we shall turn to the basics of his 'Critical Rationalism'. According to him, the basis for the growth of knowledge is the existence of a 'critical spirit in society'. He conceives of three autonomous worlds. The first world he terms 'physical reality'. The second is 'subjective knowledge', referring to consciousness. It is the third world, which he calls 'objective knowledge', that is the domain of the pursuit of science through theories, problems and arguments.

Fallibilism
All existing knowledge is full of errors and is fallible. Objective truth exists and there are ways to recognize it. The advance of knowledge consists merely in the modification of earlier knowledge. Real progress of knowledge involves the elimination of errors.

Induction and Falsificationism
Because of the problem of induction discussed earlier, verification as a method is not suitable for scientific investigation. Falsification is his critical rationalist alternative. Falsificationist methodology involves some rules for the behaviour of scientists, not merely logic. These rules are:

(i) propose and consider only testable or falsifiable theories,
(ii) seek only to falsify scientific theories, and
(iii) accept those that withstand attempts to falsify them as worthy of critical discussion.

He goes on to give reasons for these exhortations towards falsificationism. Popper is against 'verification' for the additional reason that it makes scientists look for facts which are likely to help in verifying their theories and is likely to result in commitment to their theories. For Popper, such commitment is a crime.
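The logical asymmetry behind these rules, illustrated earlier with the swan example, can be put compactly as follows (a formal restatement added here for clarity; it is not part of the original text):

    Universal claim:       for every x, if x is a swan then x is white.
    Confirming evidence:   no finite list of white swans logically entails the universal claim.
    Refuting evidence:     a single observation of a non-white swan logically entails its negation.

Hence a universal theory can be shown to be false but never proven true, which is why Popper replaces verification with attempted falsification.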

Criticism of Popper's Critical Rationalism

Some of the main criticisms against Popper's 'falsificationism' include the following. One, in the pursuit of scientific knowledge, logical falsifiability turns out to play only a minute role in the actual process of theory rejection or revision. Two, falsification no longer functions as a plausible criterion of demarcation between science and non-science. Three, since individual scientific theories need not be falsifiable, there is no logical asymmetry between the verifiability and falsifiability of particular scientific theories: they are neither verifiable nor falsifiable.


You may begin with Blaug (1980, pp. 10-17) to understand Popper's criticism of Positivism on the count of verification, and his alternative suggestion of 'falsification' as the criterion of demarcation of science from non-science and as an approach to overcoming the problems of induction. O'Hear's (1980) book will help you to appreciate the context of his writings on the philosophy of science. Popper's The Logic of Scientific Discovery will help you once you are familiar with an overview of his contributions. Naletov's (1984) chapter on Popper is especially useful for understanding his 'third world' of knowledge. The best source not only for a critical appraisal but also for an in-depth discussion of Popper's methodology is Hausman (1988).

1.4.4 Thomas Kuhn and Growth of Knowledge Philosophy of Science

"Philosophy of science without history is empty; history of science without philosophy is blind."
Immanuel Kant

Thomas Kuhn turns to the history of science to explain the methodology of science. He questions earlier approaches to the methodology of science and offers an alternative to both Positivism and Popper's Critical Rationalism. Kuhn's approach is variously known as the 'Growth of Knowledge' philosophy of science, 'contemporary philosophy of science' or 'trend in science'. All earlier explanations show the growth of knowledge as incremental, additive or cumulative. According to Kuhn, science did not develop by the accumulation of individual discoveries and inventions but by revolutionary breaks. Scientific progress can be understood by positive description and not by 'normative' prescription of rules, as was done by Positivism or Popper.

The major contribution of Kuhn is the explanation of scientific progress within a structure of revolutions. Scientists operate within a 'paradigm'. A paradigm refers to 'knowledge embedded in shared exemplars'. A paradigm involves the entire constellation of beliefs, values, techniques, etc. shared by the members of a given scientific community. Working within a paradigm involves 'normal science'. Normal science involves solving scientific puzzles, using tacit knowledge learned by doing science, not by acquiring rules for doing it. Under a paradigm, novices acquire the shared practices as part of their training and as part of a scientific group. In the course of the practice of 'normal science' within a paradigm, anomalies arise from unsolved puzzles. As the anomalies increase, a crisis arises which puts the paradigm on trial. With increasing unsolved puzzles, there is a paradigm change which brings in new ways of analysis, new approaches and new knowledge. The paradigm change is like a 'gestalt' switch, a total change of vision, and marks a revolutionary break from the past, something like shifting from the notion of a flat earth to a round earth.

Read:
Mark Blaug (1980) The Methodology of Economics, Cambridge University Press, Cambridge, Chapter 1, pp. 10-17.
Daniel M. Hausman (1988) "An Appraisal of Popperian Methodology" in Neil de Marchi (ed.), The Popperian Legacy in Economics, Cambridge University Press, Cambridge, pp. 65-76.


Kuhn's second edition (1970) is a lucid piece of writing and you should take it as the basic reading (see box). The first few pages give a detailed overview of the contents and will be of immense help in following the rest of the text. Blaug (1980, pp. 29-34) contains a good summary version, and its first part is a good introduction to both Kuhn and Lakatos. The first two sections of Aidan Foster-Carter (1976) are an excellent summary of Kuhn's 'structure of scientific revolutions'.

1.4.5 Imre Lakatos: The Methodology of Scientific Research Programmes (MSRP)

Imre Lakatos also belongs to the contemporary philosophy of science and, like Kuhn, believes in the history of science as a guide to explaining the methodology of science. His approach is called the 'Methodology of Scientific Research Programmes' (MSRP). He tries to bridge the contributions of Popper and Kuhn. MSRP is considered a compromise between the ahistorical, aggressive, rule-bound methodology of Popper on the one hand and the relativistic, defensive methodology of Kuhn on the other. According to MSRP, validation in science involves not individual theories but clusters of interconnected theories, which may be called scientific research programmes (SRPs). SRPs are not scientific once and for all; an SRP may experience 'progressive' or 'degenerative' phases, and these phases have 'theoretical' and 'empirical' components. If successive theories in a programme contain excess empirical content, and this excess content is empirically corroborated, the programme is progressive. An SRP may experience a problem-shift from a 'degenerating' to a 'progressive' phase, as in psychology. If a theory does not lead to much additional empirical content, there will be a 'degenerating' problem-shift, as in astrology. There has been extensive use of Lakatos's MSRP in theory appraisal, both in the sciences and in social sciences like economics.

Lakatos is a difficult writer, and yet Lakatos and Musgrave (1970), in parts, is useful reading. Lakatos's paper in the volume is lengthy and difficult, but particularly pp. 132-138 serve as a good introduction. Blaug's (1980) summary (pp. 34-41) is helpful. Caldwell (1982) contains a brief summary on Lakatos. For those interested in the application of Kuhn and Lakatos to theory appraisal in economics, Latsis (1976) is an important source.

Read: Thomas Kuhn (1970) The Structure of Scientific Revolutions, Reprint, International Encyclopedia of Unified Science, Vol. 2, No. 2.

Blaug, M (1980) The Methodology of Economics, Cambridge University Press, Cambridge, pp. 29-34.
Aidan Foster-Carter (1976) "From Rostow to Gunder Frank: Conflicting Paradigms in the Analysis of Underdevelopment", World Development, Vol. 4, No. 3.

Read: Imre Lakatos (1970) "Falsification and the Methodology of Scientific Research Programmes" in Imre Lakatos and A. Musgrave (eds) Criticism and the Growth of Knowledge, Cambridge University Press, Cambridge, (esp. pp. 132-138), pp. 132-194.

Blaug, M (1980) The Methodology of Economics, Cambridge University Press, Cambridge, pp. 35-41.


1.4.6 Paul Feyerabend: Methodological Dadaism or Anarchism

Feyerabend is highly critical of all prescriptive methodologies, particularly Positivism and Popper. He began as more Popperian than Popper, moved on to become more Kuhnian than Kuhn, but later turned against both in his philosophical approach. Apparently he appears to be preaching 'against method', but his observations are highly insightful and are aimed particularly against formalism, pretence and pose in the name of science. His major contributions are the 'theory-dependence thesis', the 'thesis of incommensurability', the 'interactivist view' and the plurality of scientific method. His anarchist epistemology, or Dadaism in science, emphasizes social purpose as the objective of science rather than being driven by method as the objective. His two important works are Against Method (1975) and Science in a Free Society (1978). While the former cautions against the pitfalls of a rigid method and pleads for breaking rules, the latter emphasizes the limitations of all methodologies and highlights the role of humility, tenacity, interactiveness and plurality.

Caldwell (1982) has a very useful summary of Feyerabend's contribution (pp. 79-85). Naletov (1984) provides a good summary account of Against Method. One caution: do not stop with reading Blaug (1980, pp. 40-44) on Feyerabend. Blaug gives the impression that Feyerabend is a non-serious and flippant 'methodologist'. This is only a caricature; the truth is that Feyerabend needs careful attention.

1.5 MODELS OF SCIENTIFIC EXPLANATION

Explanation assumes a central place in the pursuit of scientific knowledge. The search for making the testability criterion concrete has been a major problem in the philosophy of science. In principle, at least until Popper raised serious questions, complete verification by observational evidence was considered meaningful. But there was always a problem with strict verifiability, and a solution was sought in terms of the confirmation of some of the experimental propositions. Further developments in this direction resulted in rules of correspondence between theoretical terms and observation terms. Out of this emerged an explanatory system called the Hypothetico-Deductive (H-D) Model. Scientific theories have three components: an 'abstract calculus', a set of rules that assign empirical content to the 'abstract calculus', and a model for interpreting the abstract calculus. The H-D Model explicitly addresses the problems of a theory's structure. The H-D Model, by relaxing the strict Positivist correspondence between science and observable phenomena, allowed a substantial role for theories and theoretical terms. But theories in these models continued to be treated as

Read: Paul Feyerabend (1975) Against Method: Outline of an Anarchistic Theory of Knowledge, New Left Books.
Paul Feyerabend (1978) Science in a Free Society, New Left Books.
Caldwell, B (1982) Beyond Positivism: Economic Methodology in the Twentieth Century, Allen and Unwin, London, pp. 79-85.


eliminative fictions, and it was considered that establishing correlations among phenomena was all that science could and should do. In fact, the early positivists considered that explanations had no role in science. This counter-intuitive approach to scientific explanation was eventually replaced by the contribution of Hempel and Oppenheim, who developed the Deductive-Nomological (D-N) Model, or what are called the 'Covering Law Models'. However, it was realized that many explanations in science, because they make use of statistical laws, cannot be adequately accounted for by the D-N Model. Later, Hempel developed the Inductive-Probabilistic (I-P) Model, in which the explanation involves statistical laws. The Covering Law Models too came in for criticism, specifically over the 'symmetry thesis', i.e. the claimed symmetry between explanation and prediction, and over the claim that these models adequately explain almost all legitimate phenomena in the natural and social sciences.

1.5.1 Hypothetico-Deductive (H-D) Model

The basic developments leading to the H-D Model and its limitations are very well summarized in Caldwell (1982, pp. 23-32). This also provides an excellent summary of Carl Hempel's emphasis on the many positive functions of theories. For a more detailed discussion you may go through Hempel's collected essays (1965).

1.5.2 Deductive-Nomological (D-N) Model

The D-N Model is perhaps the most tenacious of all models of explanation and has survived well after the decline of Positivism. There is a brief summary of the D-N Model in Hausman (1984, pp. 6-10) and also in Caldwell (1982, pp. 28-32 and pp. 54-63). Blaug (1980, pp. 2-9) provides a summary critique of the Covering Law Models. But there is no substitute for the Hempel and Oppenheim paper in Brody (1970).
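For reference, the Hempel and Oppenheim schema for a D-N explanation can be set out as follows (a standard textbook rendering added here for clarity): the explanandum is deduced from general laws together with statements of antecedent conditions.

    L1, L2, ..., Lr    general laws                     }
    C1, C2, ..., Ck    antecedent (initial) conditions  }  explanans
    ------------------------------------------------------
    E                  explanandum: the statement of the phenomenon to be explained

The explanandum must follow deductively from the explanans. In the I-P variant taken up in the next sub-section, one or more of the laws are statistical, so the explanans confers only a high probability on the explanandum rather than entailing it.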

1.5.3 Inductive-Probabilistic (I-P) Model

This is an extension of the D-N Model by Hempel for applications where statistical laws are involved. The basic reading involves the sections referred to above in Caldwell (1982), Blaug (1980) and Hempel's collected papers (1965).

1.6 DEBATES ON MODELS OF EXPLANATION IN ECONOMICS

As Daniel Hausman observes, ever since its inception in the eighteenth century, the science of economics has been methodologically controversial. There has always been the

Carl G. Hempel and Paul Oppenheim (1948) "Studies in the Logic of Explanation" (pp. 9-20) in Baruch Brody (ed.) Readings in the Philosophy of Science, Prentice Hall, Englewood Cliffs, New Jersey, 1970.

Read: Caldwell, B (1982) Beyond Positivism …, George Allen & Unwin, London, pp. .


haunting question of whether economics is a science at all. Beginning with the early 1980s, there has been a resurgence of interest in philosophical and methodological questions concerning economics. When serious doubts are expressed about their scientific credibility, economists appear to turn to methodological reflection in the hope of finding some flaw in previous economic studies or a new methodological directive that will better guide their work in the future. We intend to trace the origins of methodological interest in political economy and the desire to model economics as a science by attempting to adopt the methods of science. In the process, it is hoped, you will be in a position to see the methodological concerns from classical economics to the present times. It will help you to understand the consequences of the obsession with adopting the methods of the natural sciences in a complex social science like economics. In the end, you will be in a position to appreciate the limitations of the present mainstream methodological approach, which, in spite of the decline of Positivism as a methodology of the natural sciences, retains an overwhelming Positivist influence.

Let us begin with the methodological position of classical political economy and David Ricardo's method. Though Ricardo did not himself write explicitly on methodology, his writings carried the seal of the abstract deductive method that was dealt with at length by his followers like N.W. Senior, J.S. Mill, J.E. Cairnes and J.N. Keynes. This is followed by the Neo-Classical school, especially Lionel Robbins, and the controversy it generated, which acted as a turning point in economic methodology. We shall then turn to the methodological contributions of prominent contemporary mainstream economists, including Milton Friedman and Paul Samuelson. The last sub-section refers to Amartya Sen's contribution on explanation in economics.

1.6.1 Classical Political Economy and Ricardo's Method

As pointed out earlier, Ricardo's methodological habit is described as the 'abstract deductive method'. Though Ricardo claimed that the laws of economics (political economy in his times) were as exact as the law of gravity, he did not explain the method of economics. On the contrary, his laws were abstract laws without any appeal to evidence or verification. It was left to his followers like Senior and Mill to defend him. N.W. Senior, in his Outline of the Science of Political Economy (1836), differentiated between a pure and strictly positive science on the one hand and an impure and inherently normative art on the other, and considered the Ricardian system of explanation a science. Senior identified a few general propositions of Ricardo's work to lay claim to scientific status. It was J.S. Mill, in his essay 'On the Definition of Political Economy and on the Method of Investigation Proper to It' (1836), who took upon himself the task of laying bare the nature of economics and the method adopted by Ricardo. Mill was the first economist to spell out the 'economic man' as conceived in classical economics. Mill maintained that economic science was 'hypothetical' and a science of 'tendencies', the laws of which were overwhelmed by various disturbances. Mill's view of the nature of the methodology of economics was influential throughout the nineteenth and even the early twentieth century. J.N. Keynes and J.E. Cairnes carried forward Mill's methodological views.
All this was a period when economics asserted scientific status on the basis of abstract deductive explanation without any appeal to testing or verification.


For an introductory discussion of Ricardo's classical methodology, chapter 3 of Blaug (1980) is very useful, but you have to ignore the title of the chapter, 'The Verificationists'; there are no verificationists there, only Senior, Mill, J.N. Keynes and Cairnes are discussed. There is an excellent discussion of J.S. Mill in Hausman (1981). J.S. Mill's essay is reproduced in Hausman (1984), and it is an insightful essay for realizing why, even at present, economics turns out to be an 'inexact science'. T. Hutchison (1988) also provides an overview of the early methodological approaches in economics.

Robbins, Positivism and Apriorism in Economics

Lionel Robbins' An Essay on the Nature and Significance of Economic Science (1935) is a path-breaking methodological contribution to economics that held sway for a substantial part of the first half of the twentieth century and even today serves as the work that defines the nature of the subject matter of economics. The major objective of the essay was to rid economics of ethics and normative welfare considerations and to approach it as a 'pure science'. His contention is that economics is a pure theory based on prior experience and hence no testing is needed. He therefore conceives of economics as an 'a priori science', and his approach has been described as 'apriorism'. At the same time, he claimed the status of a positive science for economics because of his insistence on ridding economic analysis of all normative considerations. His claim of a positive science of pure theory without any need for testing was subjected to extensive criticism, and Hutchison was foremost among the critics. In fact, the latter's criticism severely undermined Robbins' scientific claims for economics. Chapter 4 of Blaug (1980) will be a useful introduction to more of Hutchison's criticism of Robbins. The sixth chapter of Caldwell (1982) is very useful for both Robbins and Hutchison. But there is no substitute for reading Robbins' original essay, reproduced in Hausman (1984). There is a brief but succinct survey of the methodological position of Robbins in D.P. O'Brien (1988).

1.6.2 Hutchison and Logical Empiricism

Terence Hutchison was instrumental in turning economics towards logical empiricism and testability. His first book, The Significance and Basic Postulates of Economic Theory (1938), was the first systematic attempt to apply logical positivism to economics. He termed claims for economics as a pure a priori science bogus and insisted that if economics should stake a claim as a science, its propositions should be in testable form. He insisted that not only the theories but also the assumptions of economics should be

Read: Blaug, M (1980) The Methodology of Economics, Cambridge University Press, Chapter 3 “The Verificationists”. Daniel M. Hausman (1981) “John Stuart Mill’s Philosophy of Economics”, Philosophy of Science, Vol. 48, pp. 363-385.

Read: Lionel Robbins (1935) "The Nature and Significance of Economic Science", reproduced in Daniel Hausman (1984) The Philosophy of Economics: An Anthology, Cambridge University Press, Chapter 3, pp. 83-110.
B. Caldwell (1982) Beyond Positivism …, George Allen & Unwin, Part II, Chapter 6, "Robbins vs Hutchison", pp. 99-128.


subjected to testing, and this earned him the description 'radical empiricist'. It also led to a debate on the testability of assumptions in economics. Aided by other developments in improved sources and methods of data collection, it certainly turned economics more towards empirical research.

Blaug (1980) discusses Hutchison's contribution in the fourth chapter. A more detailed discussion of Hutchison's contribution and its significance is found in the sixth chapter of Caldwell (1982), referred to above. But it is essential to read Hutchison's essay reproduced in Hausman (1984).

1.6.3 Milton Friedman and Instrumentalism

Of all the methodological debates in modern economics, the one that revolves around Milton Friedman's contribution has far-reaching significance, because it not only tries to establish formal foundations for research in economics but also brings into wide view the limitations and difficulties involved in putting up a scientific façade for economics. His methodology is known as 'instrumentalism' since he considers that the function of assumptions and theory is to yield predictions, nothing more. Milton Friedman's paper (1953) is known as a remarkable masterpiece of marketing of what he presents as the methodology of positive economics. For him, the ultimate goal of science is the development of theories or hypotheses that can provide valid and meaningful predictions about phenomena. The criteria for acceptability of a theory or a hypothesis are:

i) logically consistent statements with meaningful counterparts,
ii) a testable substantive hypothesis, and
iii) the only test of the validity of a hypothesis is the correspondence between prediction and experience.

Since there are many competing hypotheses, simplicity and fruitfulness are important requirements. For testing, predictability is the main criterion and the realism of assumptions does not matter. And he offers effective arguments for why the realism of assumptions does not matter. For him, if the predictions come true, then one can proceed as if the assumptions were true. This earned his methodology the name 'as if' methodology. He goes on to say that "…truly important and significant hypotheses will be found to have assumptions that are wildly inaccurate descriptive representations of reality and, in general", he concluded, "the more significant the theory, the more unrealistic the assumptions". This last claim has become the notorious 'F-Twist' in his methodology. He went on to argue that mainstream neo-classical economics has an excellent predictive track record, as if there had been no methodological problem at all in economics. Such a provocative contribution did invite extensive criticism, though the comfortable abstraction from reality did bring a large following to Friedman's methodology. The reading guide that follows is deliberately kept elaborate to capture the range of theoretical and methodological difficulties that are often glossed over.

Friedman's paper (1953) is now reproduced in many readings on methodology. Bruce Caldwell (1984) is very helpful because it also reproduces the debate that followed in the American Economic Review. To follow the criticism that Friedman used the expression "assumptions" in a limited sense and does not distinguish between the different

Read: Terence W. Hutchison (1956) “On Verification in Economics” in Daniel Hausman (1984), The Philosophy of Economics: An Anthology, Cambridge University Press, Chapter 7, pp. 158-167.


senses in which 'unrealism' is used, one may begin with Nagel in Caldwell (1984). For an elaboration on this, see Boland (1979). Melitz (1965) provides a good summary of the reasons advanced by Friedman as to why the search for realistic assumptions is futile. Besides, Melitz also helps us to locate Friedman in the appropriate historical context in the evolution of economic methodology. For critical remarks, and specifically for the characterization of Friedmanian 'instrumentalism' as an 'F-Twist', Samuelson in Caldwell (1984) is very useful. Mason (1980) is a drastic criticism of Friedman, whose work he calls "a mythology resulting in methodology"; this critique is with particular reference to monetary theory. There is a good summary of the whole debate in Blaug (1980). There is a good discussion of Friedman's method, along with other shades of empiricism, in Eugene Rotwein (1980). For a short but stimulating contrast of the positivist 'predictive' approach of Friedman with the anti-positivist 'assumptive' approach of F. Knight, see Abraham and Eva Hirsh (1980); they trace the Friedmanian approach to Senior and Cairnes. For an excellent analysis of the Chicago School with Friedman at the centre, tracing the origins of logical positivism and the insularity of positivism giving rise to a kind of 'ideal type', see C.K. Wilber and Jon D. Wisman (1980). Bruce Caldwell (1982) provides a brief but succinct summary of Friedman's essay, Boland's restatement and the philosophical rejection of instrumentalism. Rotwein (1959) contains a good summary of Friedman's methodology, followed by a critical appraisal. Melitz (1965) gives an account of the debate on the realism of assumptions and the significance of testing assumptions. Boland (1979) provides a valiant defence of Friedman, attempting to answer every point of criticism.

1.6.4 Paul Samuelson and Operationalism

Paul Samuelson's two major theses on methodology are: (i) that economists should seek to discover 'operationally meaningful theorems' and (ii) that there is no explanation in science but only description. For this reason, his methodological approach is described as 'operationalism' or 'descriptivism'. Samuelson comes close to Popper's 'rational reconstruction', and there is an affinity with Hutchison's insistence on testability, though he is not as radical in insisting on the testing of assumptions as well. He was critical of Milton Friedman on assumptions. Samuelson himself has been criticized for not practising what he proposed as methodological tenets.

Read: Milton Friedman (1953) "The Methodology of Positive Economics", reproduced in Bruce Caldwell (1984).
Jack Melitz (1965) "Friedman and Machlup on the Significance of Testing Economic Assumptions", Journal of Political Economy, February 1965, pp. 37-60.

Read: Bruce Caldwell (1982) Beyond Positivism …, George Allen & Unwin, Chapter 9, pp. 189-200.


Bruce Caldwell (1982) devotes a chapter (No. 9) to Samuelson. Samuelson's writings on methodology are limited and are reproduced in Caldwell (1984). There is substantial work on Samuelson's methodology in Stanley Wong (1973) and (1978).

1.6.5 Amartya Sen: Heterogeneity of Explanations in Economics

Perhaps one of the most reflective contributions on the methodology of contemporary economics is found in Amartya Sen (1989). It is a broad-based critique of contemporary methodology based on the deep-seated heterogeneity of the subject. He observes a good deal of discontent with the methods and traditions that are in vogue in economics. He comes to the conclusion that, given the diversity of the subject matter of economics, there is a need for plurality in methodological approaches. He observes that the diversity in economics can be seen in terms of three broad exercises, viz.

(i) predicting the future and causally explaining the past,
(ii) choosing appropriate descriptions of states and events in the past and present, and
(iii) providing normative evaluations of states, institutions and policies.

He feels the 'methodology of economics' should admit enough diversity to deal with all these three exercises. Once that is done, there is a place for all activities, ranging from verifiability and testing, static general equilibrium analysis and value-based welfare evaluations to the application of formalism and mathematics to work that assumes rationality.

1.7 RESEARCH PROBLEM AND RESEARCH DESIGN

1.7.1 Research Problem

For any successful and fruitful research work, the basic requirement is clarity and simplicity in formulating the research problem. A problem may be defined as an issue that exists in the literature, theory or policy and that leads to a need for the study. A problem should be stated within a context, and the context should be provided and briefly explained in terms of the conceptual or theoretical framework in which it is embedded. There is extensive literature, both in print and in electronic form, on research problems, proposals and the various dimensions of research design. Much of this literature is designed for the behavioural sciences and is more often addressed to problems in psychology and education, since dissertation work is common in these areas; see Creswell (1994) and Kerlinger (1979).

1.7.2 Basic Steps in Framing a Research Proposal

Read: Amartya Sen (1989) “Economic Methodology: Heterogeneity and Relevance”, Social Research, Vol. 56, No. 2, Summer 1989, pp. 299-329.

Read: Creswell, J.W (1994) Research Design: Qualitative and Quantitative Approaches, Sage Publications.

Kerlinger, F.N. (1979) Behavioural Research: A Conceptual Approach, Holt, Rinehart and Winston, New York.


Once there is basic clarity on the research problem to be pursued, you may get down to preparing a research proposal. There are certain basic steps involved in preparing a research proposal, and this preparation will facilitate smooth sailing in carrying out the research work. The following are the rudimentary steps in preparing a research design.

1. Introduction
"The introduction is the part of the paper that provides readers with the background information for the research reported in the paper. Its purpose is to establish a framework for the research, so that readers can understand how it is related to other research."

2. Statement of the Problem
Effective problem statements answer the question "Why does this research need to be conducted?" If a researcher is unable to answer this question clearly and succinctly, and without resorting to hyperspeaking (i.e. focusing on problems of macro or global proportions that certainly will not be informed or alleviated by the study), then the statement of the problem will come off as ambiguous and diffuse.

3. Objectives of the Study
Objectives should be stated clearly and should be kept in view throughout the investigation and analysis. One of the important characteristics of a good piece of research work is that the findings are sharply linked to the stated objectives. Since the objectives provide direction to the entire research work, they should be limited in number and focused; too many objectives are likely to be a hindrance to analysis and interpretation.

4. Review of Literature
The review of literature is meant to gain insight into the topic and knowledge of the availability of data and other materials on the theme of the proposed area of research. The literature reviewed may be classified into two types, viz. (i) literature relating to concepts and theory and (ii) empirical literature consisting of findings, in quantitative terms, from studies conducted in the area. This will help in framing the research questions to be investigated. Academic journals, conference proceedings, government reports, books, etc. are the main sources of literature. With the spread of IT, one can access a large volume of literature through the internet.

5. Questions or Hypotheses
Questions are relevant to normative or census-type research (How many of them are there? Is there a relationship between them?). They are most often used in qualitative inquiry, although their use in quantitative inquiry is becoming more prominent. Hypotheses are relevant to theoretical research and are typically used only in quantitative inquiry. When a writer states hypotheses, the reader is entitled to an exposition of the theory that led to them (and of the assumptions underlying the theory). Just as conclusions must be grounded in the data, hypotheses must be grounded in the theoretical framework. A hypothesis can be formulated as a proposition or set of propositions providing the most probable explanation for the occurrence of some specified phenomenon.

Hypotheses, when empirically tested, may either be accepted or rejected. A hypothesis must, therefore, be capable of being tested. A hypothesis stated in terms of a relationship between the dependent and independent variables is suitable for econometric treatment. The manner in which a hypothesis is formulated is important as it provides the required focus for the research. It also helps in identifying the method of analysis to be used.

6. Methods of Analysis / Methodology
The methods or procedures section is really the heart of the research proposal. The activities should be described with as much detail as possible, and the continuity between them should be apparent. There is a need to indicate the methodological steps to be taken to answer every question or to test every hypothesis. The issues relating to sources of data, nature of data, sampling design, methods of data collection, methods of analysis, etc. should all be clearly discussed in this section.

7. Limitations of the Study
No research proposal or project is likely to be totally perfect. There will always be weaknesses and limitations. It is always desirable to spell out these limitations so as to keep the work within feasible limits and to make them known to the readers.

8. Significance of the Study
There is always the question of whether the proposed research leads to 'value addition', in this case an addition to knowledge in the domain of the proposed research. It is important to indicate how the proposed research will refine, revise or extend existing knowledge in the area of investigation.

9. References
Proper documentation is an essential part of any research work. There are a number of style sheets which will help with proper referencing in the text and in the reference list.

1.8 Further Suggested Readings

Abraham and Eva Hirsh (1980) "The Heterodox Methodology of Two Chicago Economists" in W.J. Samuels (ed.).

Bruce Caldwell (1982) Beyond Positivism: Economic Methodology in the Twentieth Century, George Allen & Unwin, London.

Bruce Caldwell (1984) Appraisal and Criticism in Economics, Allen and Unwin, Boston.

Carl Hempel (1965) Aspects of Scientific Explanation and Other Essays in the Philosophy of Science, Free Press, New York.

C.K. Wilber and Jon D. Wisman (1980) "The Chicago School: Positivism or Ideal Type" in W.J. Samuels (ed.).


D.M. Hausman (1988) "Economic Methodology and Philosophy of Science" in Gordon C. Winston and Richard F. Teichgraeber II (eds) The Boundaries of Economics.

D.P. O'Brien (1988) Lionel Robbins, Macmillan, London, pp. 23-40.

Daniel M. Hausman (ed.) (1984) The Philosophy of Economics, Cambridge University Press, Cambridge.

Duncan Hodge (2008) "Economics, realism and reality: a comparison of Maki and Lawson", Cambridge Journal of Economics, Vol. 32, No. 2, March 2008, Oxford University Press.

E. Nagel (1963) "Assumptions in Economic Theory" in Caldwell (1984).

Eugene Rotwein (1959) "On the Methodology of Positive Economics", Quarterly Journal of Economics.

Eugene Rotwein (1980) "Empiricism and Economic Method" in Warren J. Samuels (1980).

Igor Naletov (1984) Alternatives to Positivism, Progress Publishers, Moscow.

IGNOU (2006) Research Methodology: Issues and Perspectives, Block 1 of the MEC-009 course on Research Methods in Economics.

L. Boland (1979) "A Critique of Friedman's Critics", Journal of Economic Literature, June 1979; also in Caldwell (1984).

M. Blaug (1980) The Methodology of Economics, Cambridge University Press.

Neil de Marchi (ed.) (1988) The Popperian Legacy in Economics, Cambridge University Press, Cambridge.

P.A. Samuelson (1963 & 1964) on Friedman, in Caldwell (1984).

Spiro Latsis (ed.) (1976) Method and Appraisal in Economics, Cambridge University Press, Cambridge.

Stanley Wong (1978) The Foundations of Paul Samuelson's Revealed Preference Theory.

Stanley Wong (1973) "The 'F-Twist' and the Methodology of Paul Samuelson", American Economic Review, June 1973.

W. Salmon (1990) Four Decades of Scientific Explanation, University of Minnesota Press, Minneapolis.

W.J. Samuels (ed.) (1980) The Methodology of Economic Thought, Transaction Books, New Brunswick.


William E. Mason (1980) "Some Negative Thoughts on Friedman's Positive Economics", Journal of Post Keynesian Economics, Vol. III, No. 2, pp. 235-255.

1.9 Model Questions
1. Why does the question Positivism or Positivism arise? How does one get round this problem?
2. According to Popper, what are the limitations of Positivism? Critically examine Popper's philosophy of science.
3. How does Kuhn explain the growth of knowledge?
4. Why is Lakatos' contribution called a bridge between Popper and Kuhn?
5. What is the significance of Feyerabend's tirade 'against method'?
6. Discuss the evolution of explanatory structures within Positivism.
7. Discuss the contribution of Hempel and Oppenheim to explanatory models.
8. Critically examine the methodological contention of the Classical School.
9. What is apriorism? How does Robbins defend it?
10. Evaluate the methodological contribution of Hutchison.
11. How does one explain the all-pervasive appreciation as well as criticism of Milton Friedman's methodology of positive economics?
12. Is there room for methodological heterodoxy in economics? What is the significance of Sen's methodological contribution?


Block 2 QUANTITATIVE METHODS: DATA COLLECTION

Structure
2.1 Introduction
2.2 Objectives
2.3 An Overview of the Block
2.4 Method of Data Collection
2.5 Tools of Data Collection
2.6 Sampling Design
 2.6.1 Population and Sample Aggregates and Inference
 2.6.2 Non-Random Sampling
 2.6.3 Random or Probability Sampling
 2.6.4 Methods of Random Sampling
  2.6.4.1 Simple Random Sampling with Replacement (SRSWR)
  2.6.4.2 Simple Random Sampling without Replacement (SRSWOR)
  2.6.4.3 Interpenetrating Sub-Samples (I-PSS)
  2.6.4.4 Systematic Sampling
  2.6.4.5 Sampling with Probability Proportional to Size (PPS)
  2.6.4.6 Stratified Sampling
  2.6.4.7 Cluster Sampling
  2.6.4.8 Multi-Stage Sampling
2.7 The Choice of an Appropriate Sampling Method
2.8 Let Us Sum Up
2.9 Further Suggested Readings
2.10 Some Useful Books
2.11 Model Questions

2.1 INTRODUCTION Research is the objective and systematic search for knowledge to enhance our understanding of the complex physical, social and economic phenomena that surround us. It involves a scientific study of the variety of factors or variables that shape such phenomena, the interrelationships amongst them and how these impact on our lives. The results of such studies give rise to more questions for us to find answers and egg us on to further research, resulting in the extension of the frontiers of knowledge. Studies of the behaviour of variables and of relationships amongst them necessitate measurement of the variables involved. Variables can be quantitative variables like GNP or qualitative variables like opinions of individuals on, say, ban on smoking in public places. The former set assumes quantitative values. The latter set does not admit of easy quantification, though some of these can be categorised into groups that can then be assigned quantitative values. Research strategies thus adopt two approaches, quantitative and qualitative. We shall deal with the quantitative approach in this Block. The basic ingredient of quantitative research is the measurement, in quantitative terms, of the variables involved or the collection of the data relevant for the analytical and interpretative processes that constitute research. The quality of the data utilised in research is important because the use of faulty data in such endeavour results in misleading conclusions, however sophisticated may be the analytical tools used for analysis. Research processes, be it testing of hypotheses and models or providing the


theoretical basis for policy or review of policy, call for objectivity, integrity and analytical rigour in order to ensure academic and professional acceptability and, above all, an effective tool to tackle the problem at hand. Data used for research should, therefore, reflect, as accurately as possible, the phenomena these seek to measure and be free from errors, bias and subjectivity. Collection of data has thus to be made on a scientific basis. 2.2 OBJECTIVES After going through this Block you will be able to

• appreciate different methods of collecting data;
• acquire knowledge of different tools of data collection;
• define the key terms commonly used in quantitative analysis, like parameter, statistic, estimator, estimate, inference, standard error, confidence intervals, etc.;
• distinguish between random and non-random sampling procedures for data collection;
• appreciate the advantages of random sampling in the assessment of the "precision" of the estimates of population parameters;
• acquire knowledge of the procedure for drawing samples by different methods;
• develop the ability to obtain estimates of key parameters like population total, proportion, mean, etc., and of the "precision" of such estimates under different sampling methods; and
• appreciate the feasibility/appropriateness of applying different sampling methods in different research contexts.

2.3 AN OVERVIEW OF THE BLOCK How to assemble data on a scientific basis? There are broadly three different methods of collecting data. These are dealt with in Section 2.4. The tools that one can use for collecting data – the formats and the devices that modern technology has provided – are enumerated in section 2.5. There are situations where it is considered desirable to gather data from only a part of the universe, or a sample selected from the universe of interest to the study at hand, rather than a complete coverage of the universe, for reasons of cost, convenience, expediency, speed and effort. Questions then arise as to the manner in which such a sample should be chosen – the sampling design. This question is examined in detail in Section 2.6. The discussion is divided into a number of sub-topics. Concepts relating to population aggregates like mean and variance and similar aggregates from the sample and the use of the latter as estimates of population aggregates have been introduced in sub-section 2.6.1. There are two types of sampling: random and non random. Non random sampling methods and the contexts in which these are used are described in sub-section 2.6.2. A random sample has certain advantages over a non random sample - it provides a basis for drawing valid conclusions from the sample about the parent population. It enables us to state the precision of the estimates of population parameters in terms of (a) the extent of their variation or (b) an interval within which the value of the population parameter is likely to lie with a given degree of certainty. Further, it even helps the researcher to determine the size of the sample to be drawn if his project is subject to a sanctioned budget and permissible limits of error in the estimate of the


population parameter. These principles are explained in sub-section 2.6.3. Eight methods of random sampling are then detailed in sub-section 2.6.4. These details relate to(i) operational procedures for drawing samples, and (ii) expressions for (a) estimators of parameters and measures of their variation and b) estimators of such variation where the population variation parameter is not known. Different sampling procedures are also compared, as we go along, in terms of the relative precision of the estimates, they generate. Finally, the question of choosing the sampling method that is appropriate to a given research context is addressed in Section 2.7. A short summing up of the Block is given in Section 2.8. Each Section/subsection ends with a box guiding you to relevant portions of one or more publications that give you more details on the topic(s) handled in it. Fuller details of these publications are indicated in Section 2.10. Section 2.9 is meant to kindle your interest and appetite for recent developments in the subject and Section 2.11 for evaluation of your knowledge of the subject matter covered in this Block. 2.4 METHOD OF DATA COLLECTION There are three methods of data collection – the Census and Survey Method, the Observation Method and the Experimental Method. The first is a carefully planned and organised study or enquiry to collect data on the subject of the study/enquiry. We might for instance organise a study on the prevalence of the smoking habit among high school children – those aged 14 to 17 - in a certain city. One approach is to collect data of the kind we wish to collect on the subject matter of the study from all such children in all the schools in the city. In other words, we have a complete enumeration or census of the population or universe relevant to the enquiry, namely, the city’s high school children (called the respondent units or informants of the Study) to collect the data we desire. The other is to confine our attention to a suitably selected part of the population of high school children of the city, or a sample, for gathering the data needed. We are then conducting a sample survey. A well known example of Census enquiry is the Census of Population conducted in the year 2001, where data on the demographic, economic, social and cultural characteristics of all persons residing in India were collected. Among sample surveys of note are the household surveys conducted by the National Sample Survey Organisation (NSSO) of the Government of India that collect data on the socio-economic characteristics of a sample of households spread across the country. The Observation Method records data as things occur, making use of an appropriate and accepted method of measurement. An example is to record the body temperature of a patient every hour or a patient’s blood pressure, pulse rate, blood sugar levels or the lipid profile at specified intervals. Other examples are the daily recording of a location’s maximum and minimum temperatures, rainfall during the South West / North East monsoon every year in an area, etc. The Experimental Method collects data through well designed and controlled statistical experiments. Suppose for example, we wish to know the rate at which manure is to be applied to crops to maximise yield. 
This calls for an experiment, in which all variables other than manure that affect yield, like water, quality of soil, quality of seed, use of insecticides and so on, need to be controlled so as to evaluate the effect of different levels of manure on the yield. Other methods of conducting the experiment to achieve the same


objective without controlling “all other factors” also exist. Two branches of statistics - The Design and Analysis of Experiments and Analysis of Variance - deal with these. Read Sections 1.2 to 1.7, Chapter 1, M.N.Murthy (1967), pp. 3 – 20. 2.5. TOOLS OF DATA COLLECTION How do we collect data? We translate the data requirements of the proposed Study into items of information to be collected from the respondent units to be covered by the study and organise the items into a logical format. Such a format, setting out the items of information to be collected from the respondent units, is called the questionnaire or schedule of the study. The questionnaire has a set of pre-specified questions and the replies to these are recorded either by the respondents themselves or by the investigators. The questionnaire approach assumes that the respondent is capable of understanding and answering the questions all by himself/herself, as the investigator is not supposed, in this approach, to influence the response in any manner by interpreting the terms used in the questions. Respondent-bias will have to be minimised by keeping the questions simple and direct. Often the responses are sought in the form of “yes”, “no” or “can’t say” or the judgment of the respondent with reference to the perceived quality of a service is graded, like, “good”, “satisfactory” or “unsatisfactory”. In the schedule approach on the other hand, the questions are detailed. The exact form of the question to be asked of the respondent is not given to the respondent and the task of asking and eliciting the information required in the schedule is left to the investigator. Backed by his training and the instructions given to him, the investigator uses his ingenuity in explaining the concepts and definitions to respondents to obtain reliable information. This does not mean that investigator-bias is more in the schedule approach than in the questionnaire approach. Intensive training of investigators is necessary to ensure that such a bias does not affect the responses from respondents. Schedules and questionnaires are used for collecting data in a number of ways. Data may be collected by personally contacting the respondents of the survey. Interviews can also be conducted over the telephone and the responses of the respondent recorded by the investigator. The advent of modern electronic and telecommunications technology enables interviews being done through e mails or by ‘chatting’ over the internet. The mail method is one where (usually) questionnaires are mailed to the respondents of the survey and replies received by mail through (postage pre-paid) business-reply envelopes. The respondents can also be asked (usually by radio or television channels or even print media) to send their replies by SMS to a mobile telephone number or to an e-mail address. Collection of data can also be done through mechanical, electro-mechanical or electronic devices. Data on arrival and departure times of workers are obtained through a mechanical device. The time taken by a product to roll off the assembly line and the time taken by it to pass through different work stations are recorded by timers. A large number of instruments are used for collecting data on weather conditions by meteorological centres across the country that help assessing current and emerging weather conditions. Electronic Data Transfers (EDT) can also be the means through which source agencies like ports and customs houses, where export and import data originate, supply data to a


central agency like the Directorate General of Commercial Intelligence and Statistics (DGCI&S) for consolidation.

The above methods enable us to collect primary data, that is, data collected afresh by the agency conducting the enquiry or study. The agency concerned can also make use of data on the subject already collected by another agency or other agencies – secondary data. Secondary data are published by several agencies, mostly Government agencies, at regular intervals. These can be collected from the publications, compact discs or websites of the agencies concerned. But such data have to be examined carefully to see whether they are suitable or not for the study at hand before deciding to collect new data.

Errors in data constitute an important area of concern to data users. Errors can arise because data collection is confined to a sample (sampling errors). They can also be due to faulty measurement arising out of a lack of clarity about what is to be measured and how it is measured. Even when these are clear, errors can creep in due to inaccurate measurement. Investigator bias also leads to errors in data. Failure to collect data from respondent units of the population or the sample, due to omission by the investigator or due to non-response (respondents not furnishing the required information), also results in errors (non-sampling errors). The total survey error, made up of these two types of errors, needs to be minimised to ensure the quality of the data.

Read Chapter 3, p. 69, Kultar Singh (2007).

2.6 SAMPLING DESIGN

We have looked at methods and tools of data collection, chief among which is the sample survey. How do we select a sample for the survey to be conducted? There are a number of methods of choosing a sample from a universe. These fall into two categories, random sampling and non-random sampling. Let us turn to these methods and see how well the results from the sample can be utilised to draw conclusions about the parent universe. But first let us turn to some notations, concepts and definitions.

2.6.1 Population and Sample Aggregates and Inference

Let us denote population characteristics by upper case (capital) letters in English or Greek and sample characteristics by lower case (small) letters in English. Let us consider a (finite) population consisting of N units Ui (i = 1, 2, …, N). Let Yi (i = 1, 2, …, N) be the value of the variable y, the characteristic under study, for the ith unit Ui. For instance, the units may be the students of a university and y may be their weight in kilograms. Any function of the population values Yi is called a parameter. An example is the population mean 'μ' or 'M' given by (1/N)∑iYi, where ∑i stands for summation over i = 1 to N. Let us now draw a sample of 'n' units ui (i = 1, 2, …, n)1 from the above

1 The sample units are being referred to as ui (i = 1,2,…..n) and not in terms of Ui as we do not know which of the population units have got included in the sample. Each ui in the sample is some population unit


population and let the value of the ith sample unit be yi (i = 1, 2, …, n)2. In other words, yi (i = 1, 2, …, n) are the sample observations. A function of the sample observations is referred to as a statistic. The sample mean 'm' given by (1/n)∑iyi, ∑i (i = 1 to n), is an example of a statistic. Let us note the formulae for some important parameters and statistics.

Population total Y = ∑iYi, ∑i stands for summation over i = 1 to N (2.1)
Population mean 'μ' or 'M' = (1/N)∑iYi, ∑i, i = 1 to N (2.2)
Population variance σ² = (1/N)∑iYi² – M², ∑i, i = 1 to N (2.3)
Population SD σ = +√[(1/N)∑iYi² – M²], ∑i, i = 1 to N (2.4)
Sample mean m = (1/n)∑iyi, ∑i, i = 1 to n (2.5)
Sample variance s² = (1/n)∑iyi² – m², ∑i, i = 1 to n (2.6)
 = [ss]²/n, where [ss]² = ∑(yi – m)² = ∑yi² – nm² = sum of squares of the sample observations about their mean 'm' (2.7)
Sample standard deviation s = +√[(1/n)∑iyi² – m²], ∑i, i = 1 to n (2.8)
Population proportion P = (1/N)∑iYi = N1/N (where N1 is the number of units in the population possessing a specified characteristic) (2.9)
σ² = (1/N)∑iYi² – M² = P – P² = P(1 – P) = PQ, where Q = (N – N1)/N = 1 – P (2.10)
m = p, the proportion of units in the sample with the specified characteristic (2.11)
s² = p(1 – p) = pq, where p is the sample proportion and p + q = 1 (2.12)
[ss]² = npq (2.13)

The purpose of drawing a sample from a population is to arrive at some conclusions about the parent population from the results of the sample. This process of drawing conclusions or making inferences about the population from the information contained in a sample chosen from the population is called inference. Let us see how this process works and what its components are. The sample mean 'm', for example, can serve as an estimate of the value of the population mean 'μ'. The statistic 'm' is called an estimator (point estimator) of the population mean 'μ'. The value of 'm' calculated from a specific sample is called an estimate (point estimate) of the population mean 'μ'. In general, a function of sample observations, that is, a statistic, which can be used to estimate the unknown value of a population parameter, is an estimator of the population parameter. The value of the estimator calculated from a specific sample is an estimate of the population parameter.

2 The same reasons apply for referring to the sample values or observations as yi (i = 1, 2, …, n) and not in terms of the population values Yi. Each yi will be some Yi.
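Formulae (2.1) to (2.13) translate directly into computation. The following is a minimal Python sketch, in which the population values and the sample drawn from them are invented purely for illustration, computing the population parameters and the corresponding sample statistics:

import random

# Hypothetical population of N = 8 values of the study variable y
Y = [12.0, 15.5, 9.8, 11.2, 14.1, 10.6, 13.3, 12.7]
N = len(Y)

pop_total = sum(Y)                                   # (2.1)
pop_mean = pop_total / N                             # (2.2)
pop_var = sum(v * v for v in Y) / N - pop_mean ** 2  # (2.3)
pop_sd = pop_var ** 0.5                              # (2.4)

# A sample of n = 4 observations drawn at random (with replacement)
random.seed(1)
y = [random.choice(Y) for _ in range(4)]
n = len(y)

m = sum(y) / n                                       # sample mean (2.5)
s2 = sum(v * v for v in y) / n - m ** 2              # sample variance (2.6)
ss2 = sum((v - m) ** 2 for v in y)                   # [ss]^2 in (2.7)
s = s2 ** 0.5                                        # sample SD (2.8)

print(pop_mean, pop_var, m, s2, ss2, s)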


The estimate ‘m1’ of the population parameter ‘μ’, computed from a sample, will most likely be different from ‘μ’. There is thus an error in using ‘m1’ as an estimate of ‘μ’. This error is the sampling error, assuming that all measurement errors, biases etc., are absent, that is, there are no non-sampling errors. Let us draw another sample from the population and compute the estimate ‘m2‘ of ‘μ’. ‘m2‘ may be different from ‘m1’ and also from ‘μ’. Supposing we generate in this manner a number of estimates mi (i = 1,2,3,…….) of ‘μ’ by drawing repeated samples from the population. All these mi (i = 1,2,3,….) would be different from each other and from ‘μ’. What is the extent of the variability in the mi (i = 1,2,3,….), or, the variability of the error in the estimate of ‘μ’ computed from different samples? How will these values be spread or scattered around the value of ‘μ’ or the errors be scattered around zero? What can we say about the estimate of the parameter obtained from the specific sample that we have drawn from the population as a means of measuring the parameter, without actually drawing repeated samples? How well do non-random and random samples answer these questions? The answers to these questions are important from the point of view of inference. Let us first look at the different methods of non-random sampling and then move on to random sampling. Read Sections 3.1 to 3.3, pp. 66 – 77 and Section 3.9, pp. 97 – 107, Chapter 3, Richard I Levin and David S Rubin (1991). 2.6.2 Non-Random Sampling There are several kinds of non-random sampling. A judgment sample is a sample that has been selected by making use of one’s expert knowledge of the population or the universe under consideration. It can be useful in some circumstances. An auditor for example could decide, on the basis of his experience, on what kind of transactions of an institution he would examine so as to draw conclusions about the quality of financial management of an institution. Convenience Sampling is used in exploratory research to get a broad idea of the characteristic under investigation. An example is one that consists of some of those coming out of a movie theatre; and these persons may be asked to give their opinion of the movie they had just seen. Another example is one consisting of those passers by in a shopping mall whom the investigator is able to meet. They may be asked to give their opinion on a certain television programme. The point here is the convenience of the researcher in choosing the sample. Purposive Sampling is much similar to judgement sampling and is also made use of in preliminary research. Such a sample is one that is made up of a group of people specially picked up for a given purpose. In Quota Sampling, subgroups or strata of the universe (and their shares in the universe) are identified. A convenience or a judgement sample is then selected from each stratum. No effort is made in these types of sampling to contact members of the universe who are difficult to reach. In Heterogeneity Sampling units are chosen to include all opinions or views. Snowball Sampling is used when dealing with a rare characteristic. In such cases, contacting respondent units would be difficult and costly. This method relies on referrals from initial respondents to generate additional respondents. This technique enables one to access social groups that are relatively invisible and vulnerable. 
This method can lower search costs substantially but this saving in cost is at the expense of the representative character of the sample. An example of this method of sampling is to find


a rare genetic trait in a person and to start tracing his lineage to understand the origin, inheritance and etiology of the disease.

It would be evident from the description of the methods given above that the relationship between the sample and the parent universe is not clear. The selection of specific units for inclusion in the sample seems to be subjective and discretionary in nature and, therefore, may well reflect the researcher's or the investigator's attitudes and bias with reference to the subject of the enquiry. A sample has to be representative of the population from which it has been selected, if it is to be useful in arriving at conclusions about the parent population. A representative sample is one that contains the relevant characteristics of the population in the same proportion as in the population. Seen from this angle, the non-random sampling methods described above do not yield representative samples. Such samples are, therefore, not helpful in drawing valid conclusions about the parent population, or in knowing how these conclusions would change when another sample is chosen from the population. Non-random sampling is, however, useful in certain circumstances. For instance, it is an inexpensive and quick way to get a preliminary idea of the variable under study or a rough preliminary estimate of the characteristics of the universe that helps us to design a scientific enquiry into the problem later. It is thus useful in exploratory research.
-------------------------------------------------------------------------------------------------------------
Read Sections "Non-Probability Sampling" and "Other Sampling Designs", Chapter 5, Royce A. Singleton (2005), pp. 132 – 138. Section on "Non-Probability Sampling", Chapter 4, Kultar Singh (2007), pp. 107 – 108.
------------------------------------------------------------------------------------------------------------

2.6.3 Random or Probability Sampling

Random sampling methods, on the other hand, yield samples that are representative of the parent universe. The selection process in random sampling is free from the bias of the individuals involved in drawing the sample, as the units of the population are selected at random for inclusion in the sample. Random sampling is a method of sampling in which each unit in the population has a predetermined chance (probability) of being included in the sample. A sampling design is a clear specification of all possible samples of a given type with their corresponding probabilities. This property of random sampling helps us to answer the questions we raised at the end of sub-section 2.6.1 above. That is, we can make estimates of the characteristics of the parent population from the results of a sample and also indicate the extent of error to which such estimates are subject, or the precision of the estimate. This is better than not knowing anything at all about the magnitude of the error in our statements regarding the parent population. Let us see how random sampling helps in this regard.

A. Precision of Estimates – Standard Errors and Confidence Intervals

We noted earlier (the last paragraph of sub-section 2.6.1) that the sample mean (an estimate of the population mean 'μ') will have different values in repeated samples drawn from the population and none of these may be equal to 'μ'. Suppose that the repeated


samples drawn from the population are random samples. The sample mean computed from a random sample is a random variable. So is the sampling error, that is, the difference between‘μ’ and the sample mean. The values of the sample means (and the corresponding errors in the estimate of ‘μ’) computed from the repeated random samples drawn from the population are the values assumed by this random variable with probabilities associated with drawing the corresponding samples. These will trace out a frequency distribution that will approach a probability distribution when the number of random samples drawn increases indefinitely. The probability distribution of sample means computed from all possible random samples from the population is called the sampling distribution of the sample mean. The sampling distribution of the sample mean has a mean and a standard deviation. The sample mean is said to be an unbiased estimator of the population mean if the mean of the sampling distribution of the sample mean is equal to the mean of the parent population, say, μ. In general, an estimator “t” of a population parameter “θ” is an unbiased estimator of “θ” if the mean of the sampling distribution of “t”, or the expected value of the random variable “t”, is equal to “θ”. In other words, the mean of the estimates of the parameter made from all possible samples drawn from the population will be equal to the value of the parameter. Otherwise, it is said to be a biased estimate. Supposing the mean of the sampling distribution of sample mean is Kμ or K+μ, where K is a constant. The bias in the estimate can be easily corrected in such cases by adopting m/K or (m – K) as the estimator of the population mean. The variance of the sampling distribution of the sample mean is called the sampling variance of the sample mean. The standard deviation of the sampling distribution of sample means is called the standard error (SE) of the sample mean. It is also called the standard error of the estimator (of the population mean), as the sample mean is an estimator of the population mean. The standard error of the sample mean is a measure of the variability of the sample mean about the population mean or a measure of the precision of the sample mean as an estimator of the population mean. The ratio of the standard deviation of the sampling distribution of sample means and the mean of the sampling distribution is called the coefficient of variation (CV) of the sample mean or the relative standard error (RSE) of the sample mean. That is,

CV or RSE = C = standard deviation / mean (2.14) CV (or RSE) is a free number or is dimension-less, while the mean and the standard deviation are in the same units as the variable ‘y’. (These definitions can easily be generalised to the sampling distribution of any sample statistic and its SE and RSE.) We have talked about the unbiasedness and precision of the estimate made from the sample. What more can we say about the precision of the estimate and other characteristics of the estimate? This is possible if we know the nature of the sampling distribution of the estimate. The nature of the sampling distribution of, say, the sample mean, or for that matter any statistic, depends on the nature of the population from which the random sample is drawn. If the parent population has a normal distribution with mean μ and variance σ2 or, in short notation, N (μ, σ2), the sampling distribution of the sample mean, based on a random sample drawn from this, is N (μ, σ2/n). In other words, the variability of the


sample mean is much smaller than that of the variable of the population and it also decreases as the sample size increases. Thus, the precision of the sample mean as an estimate of the population mean increases as the sample size increases. As we know, the normal distribution N (μ, σ2) has the following properties:

(i) Approximately 68% of all the values in a normally distributed population lie within a distance of one standard deviation (plus and minus) from the mean,

(ii) Approximately 95% of all the values in a normally distributed population lie within a distance of 1.96 standard deviation (plus and minus) of the mean,

(iii) Approximately 99% of all the values in a normally distributed population lie within a distance of 2.576 standard deviation (plus and minus) of the mean.

The statement at (iii) above, for instance, is equivalent to saying that the population mean μ will lie between the observed values (y – 2.576 σ) and (y + 2.576 σ) in 99% of the random samples drawn from the population N(μ, σ2). Applying this to the sampling distribution of the sample mean, which is N(μ, σ2/n), we can say that

Pr.[(m – 2.576 σ/√n) < μ < (m + 2.576 σ/√n)] = 0.99 (2.15)

or that the population mean μ will lie between the limits computed from the sample, namely, (m – 2.576 σ/√n) and (m + 2.576 σ/√n), in 99% of the samples drawn from the population. This is an interval estimate, or a confidence interval, for the parameter with a confidence coefficient of 99% derived from the sample. The general rule for constructing a confidence interval for the population mean with a confidence coefficient of 99% is: the lower limit of the confidence interval is given by the "estimate of the population mean minus 2.576 times the standard error of the estimate" and the upper limit of the interval by the "estimate plus 2.576 times the standard error of the estimate". (2.16)

B. Assessment of Precision – Unknown Population Variance

If the parent population is distributed as N(M, σ²) and σ² is not known, we make use of an estimate of σ². The statistic s² given in formula (2.6) can be one such, but this is not an unbiased estimate of σ², as E(s²) = [(n – 1)/n] σ². We, therefore, by using (2.6) and (2.7), have

v(y) = ns²/(n – 1) as an unbiased estimate of σ², (2.17)
or, v(y) = [1/(n – 1)][ss]² = [1/(n – 1)][∑i yi² – nm²] (2.18)

As the sampling variance of the sample mean 'm' is σ²/n, an unbiased estimate v(m) of the sampling variance will be v(y)/n. Let us now consider the statistic defined by the ratio

t = (m – M) / [√v(y) / √n] (2.19)


The numerator is a random variable distributed as N(0, σ²/n) and the denominator is the square root of the unbiased estimate of its variance. The sampling distribution of the statistic t is the Student's t-distribution with (n – 1) degrees of freedom. It is a symmetric distribution. A confidence interval can now be constructed for the population mean M from the selected random sample, say with a confidence coefficient of (1 – α). The values of tα for different values of α = Pr.[t > tα] + Pr.[(–t) < (–tα)] = 2 Pr.[t > tα] and different degrees of freedom have been tabulated in, for instance, Rao, C.R. and Others (1966). The confidence interval with a confidence coefficient (1 – α) for the population mean M would be as in (2.20) below – easily computed from the sample observations.

[m – tα √v(m) < M < m + tα √v(m)] (2.20)

We note that the rule (2.16) applies here also, except that we use (i) the square root of the unbiased estimate of the sampling variance of the estimate of the population mean in place of the standard error of the estimate of the population mean, and (ii) the relevant value of the t distribution instead of the normal distribution. (2.21)

We have so far dealt with parent populations that are normally distributed. What will be the nature of the sampling distribution of the sample mean when the parent population is not normally distributed? We examine this question in sub-section C below.

C. Assessment of Precision – Parent Population has a Non-Normal Distribution

The Central Limit Theorem ensures that, even if the population distribution is not normal,

♦ the sampling distribution of the sample mean will have a mean equal to the population mean regardless of the sample size and

♦ as the sample size increases, the sampling distribution of the sample mean approaches the normal distribution.

Thus for large ‘n’ (sample size), say 30 or more, we can proceed with the steps mentioned in sub-section A above. Further, the Student’s t-distribution also approaches the normal distribution as ‘n’ becomes large so that we can use the statistic ‘t’ in sub-section B as a normally distributed variable with mean 0 and unit variance for samples of size 30 or more. We may then adopt the procedure outlined in sub-section A. Read

(1) Sections 2.6 to 2., pp. 38 – 45, Chapter 2 and Sections 3.9a to 3.9d, pp.81 – 84, Chapter 3, M.N.Murthy (1967) ,

(2) Section 3.10, Chapter 3, pp. 108 –110, Section 5.6, pp. 219 – 230, Chapter 5, Sections 6.3 and 6.4, pp. 266 - 278, Chapter 6 and Sections 7.1, pp. 300 - 304 and Sections 7.3 to 7.7, pp. 307 - 325, Chapter 7, Richard Levin & David S. Rubin (1991) and

(3) Chapters 6 & 7, pp. 113 – 151, P.K. Viswanathan (2007).

D. Determination of Sample Size


Random sampling methods also help in determining the sample size that is required to attain a desired level of precision. This is possible because the standard error and the coefficient of variation (C.V.) of the estimate, say the sample mean 'm', are functions of 'n', the sample size. The C.V. is usually very stable over the years and its value available from past data can be used for determining the sample size. We can specify the value of the C.V. of the sample mean that we desire as, say, C(m) and calculate the sample size with the help of prior knowledge of the population C.V., namely, C. That is,

C(m) = C/√n; so that √n = C/C(m), or, n = [C/C(m)]² (2.22)

Or we can define the desired precision in terms of the error that we can tolerate in our estimate of 'M' (permissible error) and link it with the desired value of C(m). Then,

n = [2.576 C/e]², where the permissible error e = ⏐(m – M)⏐/M (2.23)

If the sanctioned budget for the survey is F: let the cost function be of the form F0 + F1n, consisting of two components – overhead cost and cost per unit to be surveyed. As this is fixed at F, F = F0 + F1n, and the sample size becomes n = (F – F0)/F1. The coefficient of variation C(m) is not at our choice in this situation, since it gets fixed once 'n' is determined. We can, however, determine the error in the estimate of 'm' from this sample (in terms of the RSE of m) if the population CV, C, is known. If we further suppose that the loss in terms of money is proportional to the value of the RSE of m, say Rs. 'l' per 1% of RSE of 'm', the total cost of the survey becomes L(n) = F0 + F1n + l(C/√n). We can then determine the sample size that minimises this new cost (which includes the cost arising out of the loss). Differentiating L(n) w.r.t. n, equating to zero and simplifying,

n = [(l/2)(C/F1)]^(2/3) (2.24)

See also the sub-section below on stratified sampling.
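The three routes to fixing the sample size, (2.22) to (2.24), can be tried out numerically. A minimal Python sketch follows; the population CV, the permissible error, the budget figures and the loss rate are all assumed values, not figures from any actual survey:

import math

# (2.22): sample size for a desired CV of the sample mean, given the population CV
C = 0.8          # assumed population coefficient of variation
C_m = 0.05       # desired CV (RSE) of the sample mean
n_cv = math.ceil((C / C_m) ** 2)

# (2.23): sample size for a permissible relative error e at 99% confidence
e = 0.10
n_err = math.ceil((2.576 * C / e) ** 2)

# Fixed budget F with cost function F0 + F1*n
F, F0, F1 = 50000.0, 8000.0, 60.0
n_budget = int((F - F0) / F1)

# (2.24): size minimising cost plus an assumed loss of l per unit of RSE of m
l = 500.0
n_loss = math.ceil(((l / 2) * (C / F1)) ** (2 / 3))

print(n_cv, n_err, n_budget, n_loss)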

Read Section 1.6, Chapter 1, pp. 13 – 16; Section 2.13, Chapter 2, p. 48; and Sections 4.2 to 4.9, Chapter 4, pp. 96 – 123, M.N. Murthy (1967).

2.6.4 Methods of Random Sampling

We have so far dealt with random samples drawn from a population. We did not specify the size of the population. We had assumed that the population is infinite in size. In practice, a population may have a size N, however large. Let us, therefore, consider drawing random samples of size 'n' from a population of size 'N'. We shall consider the following methods of random sampling:

A. Simple Random Sampling with Replacement (SRSWR),
B. Simple Random Sampling without Replacement (SRSWOR),
C. Interpenetrating Sub-Samples (I-PSS),
D. Systematic Sampling (sys),
E. Sampling with Probability Proportional to Size (pps),
F. Stratified Sampling (sts),
G. Cluster Sampling (cs) and


H. Multi-Stage Sampling (mss)

We shall indicate in the following sections a description of the above methods, the relevant operational procedure for drawing a sample, and the expressions/formulae for (a) the estimator of the population mean/total/proportion, (b) the sampling variance of the sample mean/total/proportion and (c) the unbiased estimate of the sampling variance.

2.6.4.1 Simple Random Sampling with Replacement (SRSWR)

The method: This method of drawing samples at random ensures that (i) each item in the population has an equal chance of being included in the sample and (ii) each possible sample has an equal probability of getting selected. Let us select a sample of 'n' units from a population of 'N' units by simple random sampling with replacement (SRSWR). We select the first unit at random, note its identity particulars for collection of data and place it back in the population. We choose at random another unit – this could turn out to be the same unit selected earlier or a different one – note its identity particulars and place it back. We repeat this process 'n' times to get an SRSWR sample of size 'n'. In such a sample one or more units may occur more than once. A sample of 'n' distinct units is also possible. It can be shown that the number of possible samples that can be selected by the SRSWR method is Nⁿ and that the probability of any one sample being chosen is 1/Nⁿ.

Operational procedure for selection of the sample by the SRSWR method: Tables of Random Numbers are used for drawing random samples. These tables contain a series of four-digit (or five-digit or ten-digit) random numbers. Supposing a sample of 30 units is to be selected out of a population of 3000 units. First allot one number from the set 0001 to 3000 as the identification number to each one of the population units. The problem of drawing the sample of size 30 then reduces to that of selecting 30 random numbers, one after another, from the random number tables. Turn to a page of the Tables at random and start noting down, from the first left-most column of (four or five or ten-digit) random numbers, the first four digits of the numbers from the top of the column downwards. Continue this operation on to the second column till the required sample size of 30 is selected. If any of the random numbers that comes up is more than 3000, reject it. If some numbers (≤ 3000) get repeated in the process, it means that the corresponding units of the population would be selected more than once, this being sampling with replacement.

Estimators from SRSWR samples (using the notations set down earlier):

msrswr = (1/n)∑i yi (∑i, i = 1 to n) is an unbiased estimator of M (2.25)
Note: If a unit gets selected in the sample more than once, the corresponding value of yi will also have to be repeated as many times in the summation for calculating msrswr.
Sampling variance of msrswr: V(msrswr) = σ²/n = [1/n][E(yi²) – M²] (2.26)
Standard error of msrswr: SE(msrswr) = σ/√n (2.27)
CV or RSE of msrswr: C(msrswr) = (1/√n)(σ/M) = C(y)/√n (2.28)
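A minimal Python sketch of the SRSWR procedure and of estimators (2.25) to (2.28) is given below; the population values are generated artificially so that the sample estimates can be compared with the known parameters, an SRSWOR draw is shown alongside for comparison, and the unbiased variance estimate and the 99% interval anticipate (2.29), (2.30) and the rule (2.16):

import random

random.seed(42)

# Hypothetical population frame of N = 3000 units with known y-values
N = 3000
Y = [random.gauss(50, 10) for _ in range(N)]

n = 30
# SRSWR: draw n identification numbers between 0 and N-1, repeats allowed
# (the random-number-table step, done here by a pseudo-random generator)
ids_wr = [random.randrange(N) for _ in range(n)]
y_wr = [Y[i] for i in ids_wr]            # a repeated unit keeps its repeated value

m = sum(y_wr) / n                                   # (2.25), unbiased for M
v_y = sum((v - m) ** 2 for v in y_wr) / (n - 1)     # unbiased for sigma^2, see (2.29)
v_m = v_y / n                                       # estimated sampling variance of m
total_hat = N * m                                   # estimate of the population total
ci_99 = (m - 2.576 * v_m ** 0.5, m + 2.576 * v_m ** 0.5)   # rule (2.16)

# SRSWOR for comparison: n distinct units, as in sub-section 2.6.4.2
ids_wor = random.sample(range(N), n)
m_wor = sum(Y[i] for i in ids_wor) / n

print(m, v_m, total_hat, ci_99, m_wor)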


Note that the sampling variance, SE and CV (RSE) of the sample mean in SRSWR are much smaller than the corresponding measures for the variable y, and that these decrease as the sample size increases. The precision of the sample mean in SRSWR as an estimator of M thus increases with the sample size. However, the extent of decrease in the standard error will not be commensurate with the size of the increase in the sample size. We would need an unbiased estimator of σ², as σ² may not be known. This is

v(y) = [1/(n – 1)][ss]² (2.29); therefore, v(msrswr) = v(y)/n (2.30)
An unbiased estimate of the population total Y is Y*srswr = Nmsrswr (2.31)
V(Y*srswr) = N²(σ²/n) (2.32)
v(Y*srswr) = N²(1/n)[1/(n – 1)]∑i(yi – msrswr)² (2.33)
The sample proportion psrswr is an unbiased estimate of P (2.34)
V(psrswr) = PQ/n (2.35); and v(psrswr) = pq/(n – 1) (2.36)
C(psrswr) = √[PQ/n]/P = (1/√n)√[Q/P] (2.37)

Confidence intervals for the population mean/proportion, and the sample size for a given level of precision and/or permissible error, can now be derived easily.

Read Sections 3.1 to 3.4, pp. 55 – 66 and Sections 3.7 to 3.8a, pp. 76 – 79, Chapter 3, ibid.

2.6.4.2 Simple Random Sampling Without Replacement (SRSWOR)

The Method: This method of sampling is the same as SRSWR but for one difference. If a unit is selected, it is not placed back before the next one is selected. This means that no unit gets repeated in a sample. Operationally, we draw random numbers between 1 and N and if a random number comes up again, it is rejected and another random number is selected. This process is repeated till 'n' distinct units are selected. It can be shown that the number of samples of size n that may be selected from a population of N units by this method is NCn = N!/[(N – n)! n!] = [N(N – 1)(N – 2) … (N – n + 1)] / [n(n – 1)(n – 2) … 1]. The probability Psrswor(S) of any one of these samples being chosen is 1/NCn.

Estimators from SRSWOR samples:
msrswor = (1/n)∑i yi, ∑i, i = 1 to n, is an unbiased estimator of M (2.38)

V(msrswor) = [(N – n)/ (N – 1)] [σ2/n] = [(N – n)/(N – 1)][1/n][(1/N) ∑i(Yi – M)2], ∑i, i=1 to N (2.39)


V(msrswor) < V(msrswr), since (N – n)/(N – 1) is less than 1 for n > 1. (2.40)

Both msrswor and msrswr are unbiased estimators of M, but msrswor is a more efficient estimator of M than msrswr. The factor [(N – n)/(N – 1)] in (2.39) is called the finite population correction or finite population multiplier. The finite population correction need not, however, be used when the sampling fraction (n/N) is less than 0.05.

v(msrswor) = [(N – n)/N][1/n][1/(n – 1)][∑i (yi – m)²], ∑i, i = 1 to n
 = [(N – n)/N][1/n][1/(n – 1)][ss]² (2.41)

An unbiased estimate of the population total Y is Y*srswor = Nmsrswor (2.42)
V(Y*srswor) = N² V(msrswor) (2.43)
An unbiased estimate of V(Y*srswor) is v(Y*srswor) = N² v(msrswor) (2.44)
C(Y*srswor) = C(msrswor) (2.45)
The sample proportion p is an unbiased estimate of P in SRSWOR also. (2.46)
V(p) = [(N – n)/(N – 1)][PQ/n], where P + Q = 1 (2.47)
v(p) = [(N – n)/N][pq/(n – 1)], where p + q = 1 (2.48)
C(p) = √[(N – n)/(N – 1)] (1/√n) √[Q/P] (2.49)

Read Sections 3.5 to 3.7, pp. 67 – 78 and Sections 3.8 b & c, pp. 80 – 81, Chapter 3, ibid.

2.6.4.3 Interpenetrating Sub-Samples (I-PSS)

Suppose a sample is selected in the form of two or more sub-samples drawn according to the same sampling method so that each such sub-sample provides a valid estimate of the population parameter. The sub-samples drawn in this way are called interpenetrating sub-samples (I-PSS). This is operationally convenient, as the different sub-samples can be allotted to different investigators. The sub-samples need not be independently selected. There is, however, an important advantage in selecting independent interpenetrating sub-samples. It is then possible to arrive easily at an unbiased estimate of the variance of the estimator even in cases where the sampling method/design is complex and the formula for the variance of the estimator is complicated. Let {ti}, i = 1, 2, …, h be unbiased estimates of a parameter θ based on 'h' independent interpenetrating sub-samples. Then,


t = (1/h)∑i ti, (∑i, i = 1 to h) is an unbiased estimate of θ (2.50)
v(t) = [1/{h(h – 1)}][∑i (ti – t)²], (∑i, i = 1 to h) is an unbiased estimate of V(t) (2.51)

If the unbiased estimator t of the parameter θ is symmetrically distributed (for example, normally distributed), the probability of the parameter θ lying between the maximum and the minimum of the 'h' estimates of θ obtained from the 'h' sub-samples is given by:

Prob.[Min of {t1, t2, …, th} < θ < Max of {t1, t2, …, th}] = 1 – (1/2)^(h – 1) (2.52)

This is a confidence interval for θ from the sample. The probability increases rapidly with the number of I-P sub-samples – from 0.5 (two sub-samples) to 0.875 (four sub-samples). The I-PSS technique is also useful in assessing non-sampling errors. (See Box below.)

Read Section 2.12, Chapter 2, ibid., p. 47.

2.6.4.4 Systematic Sampling

The Method: Let {Ui}, i = 1, 2, …, N be the units in a population. Let 'n' be the size of the sample to be selected. Let 'k' be the integer nearest to N/n – denoted usually as [N/n] – the reciprocal of the sampling fraction. Let us choose a random number from 1 to k, say 'r'. We then choose the rth unit, that is, Ur. Thereafter, we select every kth unit. In other words, we select the units Ur, Ur+k, Ur+2k, … This method of sampling is called systematic sampling with a random start. 'r' is known as the random start and 'k' the sampling interval. There would thus be 'k' possible systematic samples, each corresponding to one random start from 1 to k. The sample corresponding to the random start 'r' will be {Ur+jk}, j = 0, 1, 2, …, with r + jk ≤ N. The sample size of all the 'k' systematic samples will be 'n' if N = nk. All the 'k' systematic samples will not have a sample size 'n' if N ≠ nk. For example, if we have a population of 100 units and we wish to select systematic samples of size 15, the sampling interval is k = [100/15], or 7. The samples with the random starts 1 and 2 will be of size 15, while the other 5 systematic samples (with random starts 3 to 7) will be of size 14. In systematic sampling, units of a population could thus be selected at a uniform interval that is measured in time, order or space. We can, for instance, choose a sample of nails produced by a machine for five minutes at the interval of every two hours to test whether the machine is turning out nails as per the desired specifications. Or, we could arrange the income tax returns relating to an area in the order of increasing gross income returned and select every fiftieth tax return for a detailed examination of the income of assessees of the area. Systematic samples are thus operationally easier to draw than SRSWR or SRSWOR samples. Only one random number needs to be chosen for selecting a systematic sample.
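The selection step for (linear) systematic sampling can be sketched in a few lines of Python; N and n are illustrative values, and the single random start is the only random number needed:

import random

random.seed(7)

N = 100          # population size (illustrative)
n = 15           # desired sample size
k = round(N / n) # sampling interval: the integer nearest to N/n, here 7

r = random.randint(1, k)                # the random start, between 1 and k
sample_ids = list(range(r, N + 1, k))   # units r, r + k, r + 2k, ..., not exceeding N

print(r, sample_ids, len(sample_ids))   # the size is 15 for r = 1 or 2 and 14 otherwise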


Estimators from Systematic Samples: An unbiased estimator of the population mean M based on a systematic sample is given by a slight variant of the sample mean, namely,

msys* = (k/N)∑i yi, ∑i, i = 1 to n*, where n* is the size of the selected sample and k the sampling interval (2.53)

If N = nk, msys* = m, the sample mean. If N ≠ nk, there is a bias in using the sample mean as the estimator for M; this bias is likely to be small in the case of systematic samples selected from a large population. (2.54)

The disadvantages, referred to above, in systematic sampling, namely, N not being a multiple of the sample size n and the sample mean not being an unbiased estimator of the population mean, can be overcome by adopting a procedure called Circular Systematic Sampling (CSS). If 'r' is the random start and k the integer nearest to N/n, we choose the units {Ur+jk} if r + jk ≤ N and {Ur+jk–N} if r + jk > N; j = 0, 1, 2, …, (n – 1). Taking the earlier example of selecting a systematic sample of size 15 from a population of 100 units (N = 100, k = 7 and n = 15), all the samples can be made to have a size of 15 by adopting CSS. A random start of 5 will lead to the selection of a sample of the 15 units 5, 12, 19, 26, 33, 40, 47, 54, 61, 68, 75, 82, 89, 96 and 3 (96 + 7 – 100). This procedure ensures equal probability of selection to every unit in the population. Besides constancy of the sample size from sample to sample, the CSS procedure ensures that the sample mean mr is an unbiased estimate of the population mean. (2.55)
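A minimal Python sketch of circular systematic sampling follows, assuming the random start is drawn from 1 to N; with N = 100, n = 15 and k = 7, a start of r = 5 reproduces the worked list in the text:

import random

def css_sample(N, n, rng=random):
    """Circular systematic sample of exactly n units from {1, ..., N}."""
    k = round(N / n)                  # integer nearest to N/n
    r = rng.randint(1, N)             # random start (assumed here to range over 1..N)
    ids = []
    for j in range(n):
        u = r + j * k
        ids.append(u if u <= N else u - N)   # wrap around once past N
    return ids

# With N = 100, n = 15 and k = 7, a start of r = 5 gives
# 5, 12, 19, ..., 96 and 3, as in the worked example above.
print(css_sample(100, 15))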

Let nk = N. Then msys* = m. There are k possible samples, each with a probability of 1/k. Let the sample mean of the r-th systematic sample be mr = (1/n)∑i yir, where yir is the value of the characteristic under study for the i-th unit in the r-th systematic sample, the summation being over i = 1 to n. As already noted, mr is an unbiased estimator of M, or E(mr) = M. We thus have k possible unbiased estimates of M. Denoting the sample mean in systematic sampling as msys, the sampling variance of msys and related results of interest are:

V(msys) = σb² (the variance between sample means) (2.56)
V(msys) = V(y) – σw², where σw² is the within-sample variance (2.57)

Equation (2.57) shows that (i) V(msys) is less than the variance of the variable under study, or the population variance, since σw² > 0, and (ii) V(msys) can be reduced by increasing σw², that is, by increasing the within-sample variance. (ii) would happen if the units within


each systematic sample are as heterogeneous as possible. Since we select a sample of 'n' units from the population of N units by selecting every k-th element from the random start 'r', the population is divided into 'n' groups and we select one unit from each of these 'n' groups of population units. Units within a sample would be heterogeneous if there is heterogeneity between the 'n' groups. This would imply that units within each of the n groups would have to be as homogeneous as possible. All these suggest that the sampling variance of the sample mean is related to the arrangement of the units in the population. This is both an advantage and a disadvantage of systematic sampling. An arrangement that conforms to the conditions mentioned above would lead to a smaller sampling variance, or an efficient estimate of the population mean, while a 'bad' arrangement would lead to estimates that are not as efficient. The aspects of systematic sampling listed below are important. Find out about these from the suggested readings (Box below).

(i) When is V(msys) < V(msrswr) or V(msrswor)? (ii) It is not possible to get v(msys). Why? How can this problem be overcome? (iii) Systematic sampling is not recommended when there is a periodic or cyclic

variation in the population. What is the solution?

Read Sections 5.1 and 5.2, pp. 133 – 141, Sections 5.4 to 5.10, pp. 142 – 171, Chapter 5, ibid.

2.6.4.5 Sampling with Probability Proportional to Size (pps)

The Sampling Method: We have so far considered sampling methods in which the probability of each unit in the population getting selected in the sample was equal. There are also methods of sampling in which the probability of any unit in the population getting included in the sample varies from unit to unit. One such method is sampling with probability proportional to size (pps), in which the probability of selection of a unit is proportional to a given measure of its size. This measure may be a characteristic related to the variable under study. One example may be the employment size of a factory in the past year, the variable under study being the current year's output. Does this method lead to a bias in our results, as units with smaller sizes would be under-represented in the sample and those with larger sizes over-represented? It is true that if the sample mean 'm' were to be used to estimate the population mean M, m would be a biased estimator of M. However, what is done in this method of sampling is to weight the sample observations with suitable weights at the estimation stage to obtain unbiased estimates of population parameters, the weights being the probabilities of selection of the units.

Estimates from a pps sample of size 1: Let the population units be {U1, U2, …, UN}. Let the main variable Y and the related size variable X associated with these units be {Y1, X1; Y2, X2; …; YN, XN}. The probability of selecting any unit, say Ui, in the sample will be Pi = (Xi/X), where ∑iXi = X, ∑i, i = 1 to N. Let us select one unit by the pps method. Let the unit so selected have the values y1 and x1 for the variables y and x. The variables y and x are random variables assuming values Yi and Xi respectively with probabilities Pi, i = 1, 2, …, N. The following results based on the sample of size 1 can be derived easily:


An unbiased estimator of the population total Y is Y*(1)pps = y1/p1 (2.58)
An unbiased estimator of M is m*(1)pps = (1/N) Y*(1)pps = (1/N)(y1/p1) (2.59)
V[Y*(1)pps] = ∑i (Yi²/Pi) – Y², ∑i, i = 1 to N (2.60)
V[m*(1)pps] = (1/N²) V[Y*(1)pps] = (1/N²)[∑i (Yi²/Pi) – Y²] (2.61)

These show that the variance of the estimate will be small if the Pi are proportional to Yi.

Estimators from a pps sample of size > 1 [pps with replacement (pps-wr)]: A sample of n (> 1) units with pps can be drawn with or without replacement. Let us consider a pps-wr sample. Let {yi, pi} be respectively the sample observation on the selected unit and the initial probability of selection at the i-th draw, i = 1, 2, …, n. Each (yi/pi), i = 1, 2, …, n in the sample is an unbiased estimate [Y*(i)pps-wr] of the population total Y, and V(Y*(i)pps-wr) = ∑r (Yr²/Pr) – Y², ∑r, r = 1 to N (see 2.60). Estimates from pps-wr samples are:

Y*pps-wr = (1/n)∑i (yi/pi) = (1/n)∑i Y*(i)pps-wr ; ∑i, i = 1 to n (2.62)
V(Y*pps-wr) = (1/n)[∑r (Yr²/Pr) – Y²] ; ∑r, r = 1 to N (2.63)
V(mpps-wr) = (1/N²)(1/n)[∑r (Yr²/Pr) – Y²] ; ∑r, r = 1 to N (2.64)
v(Y*pps-wr) = [1/{n(n – 1)}][∑r (yr²/pr²) – n(Y*pps-wr)²] ; ∑r, r = 1 to n (using 2.51) (2.65)

Operational procedure for drawing a pps-wr sample: The steps are:
(1) Cumulate the sizes of the units to arrive at the cumulative totals of the unit sizes.

Thus, Ti-1 = X1 + X2 + … + Xi-1 ; Ti = X1 + X2 + … + Xi-1 + Xi = Ti-1 + Xi ; i = 1, 2, …, N.
(2) Then choose a random number R between 1 and TN = X1 + X2 + … + XN = X.
(3) Choose the unit Ui if R lies between Ti-1 and Ti, that is, if Ti-1 < R ≤ Ti. The probability P(Ui) of selecting the i-th unit will thus be P(Ui) = (Ti – Ti-1)/TN = Xi/X = Pi.
(4) Repeat the operation 'n' times for selecting a sample of size n with pps-wr.
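The cumulative-total procedure in steps (1) to (4), together with the estimators (2.62) and (2.65), can be sketched in Python as follows; the size measures X and the survey values Y are hypothetical:

import random

random.seed(3)

# Hypothetical size measures X_i (e.g., last year's employment) for N = 6 units
X = [120, 45, 300, 80, 150, 55]
N = len(X)
T = []                       # cumulative totals T_1, ..., T_N
running = 0
for x in X:
    running += x
    T.append(running)
X_total = T[-1]

def pps_draw():
    """One pps draw: pick R in 1..X_total and return the index of the selected unit."""
    R = random.randint(1, X_total)
    for i, t in enumerate(T):
        if R <= t:
            return i

# pps with replacement, n draws; the y-values of the selected units become known on survey
Y = [30.0, 12.5, 70.2, 21.0, 40.3, 15.8]
n = 4
draws = [pps_draw() for _ in range(n)]
p = [X[i] / X_total for i in draws]            # initial selection probabilities
y = [Y[i] for i in draws]

Y_hat = sum(yi / pi for yi, pi in zip(y, p)) / n            # (2.62)
v_Y_hat = sum((yi / pi - Y_hat) ** 2
              for yi, pi in zip(y, p)) / (n * (n - 1))      # (2.65), via (2.51)

print(draws, Y_hat, v_Y_hat)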


There are other sampling methods under pps, like pps without replacement and pps systematic sampling. (Readings – Box below)
------------------------------------------------------------------------------------------------------------
Read Sections 6.1 to 6.4, pp. 183 – 197 and Section 6.10, pp. 200 – 202; Chapter 6, Section 6.10a to 6.10c, pp. 201 – 208; Section 6.11 a to c, pp. 209 – 215, ibid.
------------------------------------------------------------------------------------------------------------

2.6.4.6 Stratified Sampling

The Method: We might sometimes find it useful to classify the universe into a number of groups and treat each of these groups as a separate universe for purposes of sampling. Each of these groups is called a stratum and the process of grouping is called stratification. Estimates obtained from each stratum can then be combined to arrive at estimates for the entire universe. This method is very useful as (i) it gives estimates not only for the whole universe but also for the sub-universes and (ii) it affords the choice of different sampling methods for different strata as appropriate. It is particularly useful when a survey organisation has regional field offices. This method is called Stratified Sampling.

Let us divide the population (universe) of N units into k strata. Let Ns be the number of units in the s-th stratum and Ysi be the value of the i-th unit in the s-th stratum. Let the population mean of the s-th stratum be Ms. Then Ms = (1/Ns) ∑i Ysi , ∑i , i = 1, 2, ....., Ns (that is, over the units within the s-th stratum), and the population mean M = (1/N) ∑s Ns Ms = ∑s Ws Ms , where Ws = (Ns/N), ∑s being over the strata s = 1, 2, ....., k.

Suppose that we select random samples from each stratum and that the sampling methods for different strata are different. Let the unbiased estimate of the population mean Ms of the s-th stratum be ms. Denoting 'st' for stratified sampling, an unbiased estimator of M is given by

mst = ∑s Ws ms = (1/N) ∑s Ns ms , ∑s , s = 1 to k.   (2.66)

V(mst) = ∑s Ws² V(ms) = (1/N²) ∑s Ns² V(ms), ∑s , s = 1 to k   (2.67)

Cov(ms, mr) = 0 for s ≠ r (samples from different strata are independently chosen)   (2.68)

Yst* = ∑s Ys* ; ∑s , s = 1 to k.   (2.69)

V(Yst*) = ∑s V(Ys*), ∑s , s = 1 to k.   (2.70)

Thus estimators with smaller variance (efficient estimators) can be obtained in stratified sampling if we form the strata in such a way as to minimise intra-strata or within-strata variation, that is, variance within strata. This would mean maximising between-strata or inter-strata variation, since the total variation is made up of within-strata and between-strata variation. In other words, units in a stratum should be homogeneous. Stratified sampling enables us to choose the sample we wish to select by drawing independent samples from each of the different strata into which we have grouped the universe.

How do we allocate the total sample size 'n' among the different strata? One way is to allocate the sample size to different strata in proportion to the size of the individual strata measured by the number of units in these strata, namely, Ns [ ∑s Ns = N, s = 1 to k ]. This method is especially appropriate in situations where no information is available except the sizes of the strata. The sample size for the sample from the s-th stratum would then be ns = n(Ns/N), and ∑s ns can easily be seen to be equal to 'n'. There are other methods, like allocation of the sample size among strata in proportion to the stratum totals of the variable under study, that is, Ys , the stratum total of the s-th stratum, s = 1 to k. We shall not go into the details of other methods here except one situation, namely, when we have a fixed budget F sanctioned for the survey. Let the cost function be of the form F = F0 + ∑s ns Fs , (∑s , s = 1 to k), where F0 , ns and Fs are respectively the overhead cost, the sample size in stratum 's' and the per unit cost of surveying a unit in stratum 's' (s = 1, 2, ....., k). We can determine the optimum stratum-wise sample size by minimising the sampling variance of the sample mean (2.67) subject to the constraint that the cost of the survey is fixed. The stratum-wise optimum sample size is given by

ns = (F – F0) [ Ws√(Vs/Fs) ] / [ ∑s Ws√(VsFs) ] , s = 1 to k.   (2.71)

The stratum sample size should, therefore, be proportional to Ws√(Vs/Fs). The minimum variance with the ns so determined is

Min. V(mst) = [ ∑s Ws√(VsFs) ]² / (F – F0)   (2.72)
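The two allocation rules discussed above — proportional allocation ns = n(Ns/N) and the cost-constrained optimum allocation of (2.71) — can be illustrated with a small Python sketch; the stratum sizes, variances and unit costs below are hypothetical numbers chosen only for the example.

```python
import math

def proportional_allocation(N_s, n):
    """n_s = n * (N_s / N): allocate the total sample in proportion to stratum size."""
    N = sum(N_s)
    return [n * Ns / N for Ns in N_s]

def optimum_allocation(W_s, V_s, F_s, F, F0):
    """Cost-constrained optimum allocation (2.71):
    n_s = (F - F0) * W_s*sqrt(V_s/F_s) / sum_s W_s*sqrt(V_s*F_s)."""
    denom = sum(W * math.sqrt(V * f) for W, V, f in zip(W_s, V_s, F_s))
    return [(F - F0) * W * math.sqrt(V / f) / denom
            for W, V, f in zip(W_s, V_s, F_s)]

# Hypothetical figures for three strata.
N_s = [500, 300, 200]                   # stratum sizes N_s
W_s = [Ns / sum(N_s) for Ns in N_s]     # stratum weights W_s = N_s / N
V_s = [4.0, 9.0, 16.0]                  # within-stratum variances V_s
F_s = [10.0, 20.0, 40.0]                # per-unit survey cost F_s in each stratum
print(proportional_allocation(N_s, n=100))
print(optimum_allocation(W_s, V_s, F_s, F=3000.0, F0=500.0))
```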

------------------------------------------------------------------------------------------------------------
Read Section 7.1, pp. 232 – 233, Section 7.2, pp. 235 – 236 and Section 7.4, pp. 239 – 243 (especially Section 7.4b, pp. 241 – 243), Chapter 7, ibid.
------------------------------------------------------------------------------------------------------------

2.6.4.7 Cluster Sampling

The method: Suppose we are interested in studying certain characteristics of individuals in an area. We would naturally select a random sample of individuals from all the individuals residing in the area and collect the required information from the selected individuals. We might also think of selecting a sample of households out of all the households in the area and collecting the required details from all the individuals in the selected households. The households in the area are clusters of individuals, and what we have done is to select a sample of such clusters and to collect the information needed from all the individuals in the selected clusters, instead of selecting a random sample of individuals from all persons in the area. What we have done is cluster sampling. Cluster Sampling is a process of forming suitable clusters of units and surveying all the units in a sample of clusters selected according to an appropriate sampling method. The clusters of units are formed by grouping neighbouring units or units that can be conveniently surveyed together. Sampling methods like srswr, srswor, systematic sampling, pps and stratified sampling discussed earlier can be applied to sampling of clusters by treating clusters themselves as sampling units. The clusters can all be of equal size or of varying sizes, that is, the number of units can be the same, or vary from cluster to cluster. Clusters can be mutually exclusive, that is, a unit belonging to one cluster will not belong to any other cluster. They could also be overlapping.

Estimates from cluster sampling: Let us consider a population of NK units divided into N mutually exclusive clusters of K units each – a case of clusters of equal size. The population mean M and the cluster means are given respectively by M = (1/N) ∑s ms , ∑s being over the clusters s = 1 to N, and ms = (1/K) ∑i Ysi , ∑i being from i = 1 to K within the s-th cluster. Let us draw a sample of one cluster by srs. The cluster mean mc-srs (the subscript c-srs denotes cluster sampling with srs) is an unbiased estimate of M. The sampling variance of the sample cluster mean is

V(mc-srs) = (1/N) ∑s (ms – M)² = σb² = variance between clusters; ∑s , s = 1 to N.   (2.73)

Let us compare V(mc-srs) with the sampling variance of the sample mean when K units are drawn from the NK units by the SRSWR method. How does the "sampling efficiency" of cluster sampling compare with that of SRSWR? The sampling efficiency of cluster sampling compared to that of SRSWR, Ec/srswr , is defined as the ratio of the reciprocals of the sampling variances of the unbiased estimators of the population mean obtained from the two sampling methods. The sampling variances and sampling efficiency are

V(msrswr) = (1/K) [ (1/NK) ∑s∑i Ysi² – M² ] = σ²/K   (2.74)

σ² = σw² + σb² = within-cluster variance + between-cluster variance.   (2.75)

Ec/srswr > 1 if σw² > (K – 1) σb²   (2.76)

Thus, cluster sampling is more efficient than SRSWR if the within-cluster variance is larger than (K – 1) times the between-cluster variance. Is this likely? This is not likely, as the between-cluster variance will usually be larger than the within-cluster variance due to within-cluster homogeneity. Cluster sampling is in general less efficient than sampling of individual units from the point of view of sampling variance. Sampling of individuals could provide a better cross section of the population than a sample of clusters, since units in a cluster tend to be similar.

------------------------------------------------------------------------------------------------------------
Read Sections 8.1, 8.2 and 8.2a, Chapter 8, ibid., pp. 293 – 297.
------------------------------------------------------------------------------------------------------------

2.6.4.8 Multi-Stage Sampling

We noted in the sub-section on cluster sampling that random sampling of units directly is more efficient than random sampling of clusters of units. But cluster sampling is operationally convenient. How to get over this dilemma? We may first select a random sample of clusters of units and thereafter select a random sample of individual units from the selected clusters. We are thus selecting a sample of units, but from selected clusters of units. What we are attempting is a two-stage sampling. This can thus be a compromise between the efficiency of direct sampling of units and the relatively less efficient sampling of clusters of units. This type of sampling would be more efficient than cluster sampling but less efficient than direct sampling of individual units. In the sampling procedure now proposed, the clusters of units are the first stage units (fsu) or the primary stage units (psu). The individual units constitute the second stage units (ssu) or the ultimate stage units (usu).


This procedure of sampling can also be generalised to multi-stage sampling. Take for instance a rural household survey. The fsu’s in such a survey may consist of districts, the ssu’s may be the tehsils or taluks chosen from the districts selected in the first stage, the third stage units could be the villages selected from the tehsils or taluks selected in the second stage and the fourth and the ultimate stage units (usu’s) may be the households selected from the villages selected in the third stage. Such multi-stage sampling procedures help in utilising such information related to the variable under study as may be available in choosing the sampling method appropriate at different stages of sampling. In a multi-stage sampling, estimates of parameters are built up stage by stage. For instance, in two-stage sampling, estimates of the sample aggregates relating to the fsu’s are built up from the ssu’s using the sampling method adopted for selecting the ssu’s. These estimates are then used with the sample probabilities of selection of fsu’s to build up estimates of the relevant population parameters. Read Sections 9.1 and 9.2, Chapter 9, ibid. , pp. 317 – 322. Chapter 3, NSSO (1997), pp. 12 – 15. 2.7 THE CHOICE OF AN APPROPRIATE SAMPLING METHOD We have considered a number of random sampling methods in the foregoing sub-sections. A natural question that arises now is – which method is to be adopted in a given situation? Let us consider this question, although the answer to it lies scattered across the foregoing sub-sections. The choice of a sampling design depends on considerations like a priori information available about the population, the precision of the estimates that a sampling design can give, operational convenience and cost considerations. 1. When we do not have any a priori information about the nature of the population

variable under study, SRSWR and SRSWOR would be appropriate. Both are operationally simple. However, SRSWOR is to be preferred, since V(msrswor) < V(msrswr). This advantage holds only when the sampling fraction is not small, or N and n are not large.

2. Systematic sampling is operationally even simpler than SRSWR and SRSWOR, but it should not be used for sampling from populations where periodic or cyclic trends/variations exist, though this difficulty can be overcome if the period of the cycle is known. V(msys) can be reduced if the units chosen in the sample are as heterogeneous as possible. But this will call for a rearrangement of the population units before sampling.

3. When additional information is available about the variable 'y' under study, say, on a variable (size variable) 'x' related to 'y', the pps method should be preferred. The sampling variance of Y* (or m) gets reduced when the probabilities of selection of units Pi = (Xi/X) are proportional to Yi , that is, the size Xi is proportional to Yi , or the variables x and y are linearly related to each other and the regression line passes through the origin. In such cases pps is more efficient than SRSWR. Further, this method can be combined with other sampling methods to enhance their relative efficiencies. pps is operationally simple. pps-wor combines the efficiency of SRSWOR and the efficiency-enhancing capacity of pps. However, most of the available selection procedures, estimators and their variances for pps-wor are complicated and are not commonly used in practice. This is particularly so in large-scale sample surveys with a small sampling fraction, as in such cases sampling without replacement does not result in much gain in efficiency. Hence, unless the sample size is small, we should prefer pps-wr.

4. Stratified sampling comes in handy when we wish to get estimates at the level of sub-

populations or regions or groups. This method also gives us the freedom to choose different sampling methods/designs in different strata as appropriate to the group (stratum) of the population and the opportunity to utilise available additional information relating to the stratum. The sampling variance of estimators can also be brought down by forming the strata in such a way as to ensure homogeneity of units within individual strata. In fact, the stratum sizes can be so chosen as to minimise the variance of estimators, when there is a ceiling on the cost of the survey. Stratified sampling with SRS, SRSWOR or pps-wr presents a set of efficient sampling designs.

5. Sometimes, sampling of groups of individual units rather than direct sampling of units might be found to be operationally convenient. Suppose it is easier to get a complete frame of clusters of individual units than a frame of the units themselves, or only such a frame, and not that of the units, is available (e.g., households are clusters of individuals). In such circumstances, cluster sampling is adopted. This is in general less efficient than direct sampling of individual units, as clusters usually consist of homogeneous units. A compromise between operational convenience and efficiency could be made by adopting a two-stage sampling design, by selecting a sample of individual units (second stage units) from sampled clusters (the first stage units). A multi-stage design would be useful in cases where clusters have to be selected at more than one stage of sampling.

6. Finally, we can use the technique of independent I-PSS in conjunction with the

chosen sampling design to get at (i) an unbiased estimate of V(m) for any sampling design or estimator of V(m), however complicated, (ii) a confidence interval for ‘M’ based only on the I-PSS estimates (when the population distribution is symmetrical) and (iii) a tool for monitoring the quality of work of the field staff and agencies.

7. SRSWOR, stratified sampling with SRSWOR and, when available information

permits, pps-wr and stratified sampling with pps-wr, turn out to be a set of the more efficient and operationally easy designs to choose from. I-PSS can also be used in these designs where possible and necessary.

Read also

Sections 14.8 and 14.9, Chapter 14, M.N.Murthy (1967) pp.493 – 497. ------------------------------------------------------------------------------------------------------------ 2.8 LET US SUM UP There are broadly three methods of collecting data. The array of tools used for data collection by such methods has expanded over time with the advent of modern technology. Confining data collection efforts to a sample from the population of interest to the study, inevitably leads to questions like the use of random and non-random samples. Judgment sampling, convenience sampling, purposive sampling, quota sampling


and snowball sampling all belong to the latter group. The absence of a clear relationship between a non-random sample and the parent universe, and the presence of the researcher's bias in the selection of the sample, render such samples useless for drawing valid conclusions about the parent population. But these methods are inexpensive and quick ways of getting a preliminary idea of the universe for use in designing a detailed enquiry and in exploratory research. Random samples, on the other hand, are free from such drawbacks and have properties that help in arriving at valid conclusions about the parent population.

The simplest of the sampling methods – SRSWR – ensures an equal chance of selection to every unit of the population and yields a sample in which one or more units may occur more than once. 'msrswr' is an unbiased estimator of M. Its precision as an estimator of M increases as the sample size increases. SRSWOR yields a sample of distinct units. 'msrswor' is also unbiased for 'M'. SRSWOR is more efficient than SRSWR as V(msrswor) < V(msrswr). But this advantage disappears when the sampling fraction is small (< 0.05). Both provide an unbiased estimator of V(m). An operationally convenient procedure – interpenetrating sub-samples (I-PSS) – also provides an unbiased estimator of V(m) for any sampling design and any estimator, however complicated.

Systematic sampling is a simple and operationally convenient method used in large-scale surveys that requires only a random start and the sampling interval k = [N/n] for drawing the sample. A slight variant of 'm' is unbiased for 'M'. Circular systematic sampling takes care of problems that arise when N/n is not an integer. An unbiased estimate of V(m) is not possible, but this problem can be tackled easily. Systematic sampling is not recommended when there is a periodic or cyclic variation in the population. This problem too can be overcome if the period of the cycle is known.

An example of methods where the probability of selection varies from unit to unit is pps. The "size" could be the value of a variable related to the study variable. In pps, each yi/pi , where yi is the value of the study variable associated with the selected unit and pi the probability of selection of the unit, is an unbiased estimate (Y*) of the population total Y, and [(1/N) Y*] an unbiased estimator of M. As V(Y*) is small if the probabilities Pi are roughly proportional to Yi , pps sampling is more efficient than SRS if the size variable x is proportional to y, that is, x and y are linearly related and the regression line passes through the origin. pps sampling can be done with SRSWR, SRSWOR or systematic sampling. In pps-srswr, [(1/n) ∑i(yi/pi)], (∑i , i = 1 to n), is an unbiased estimator of Y. This being the mean of n independent unbiased estimates with the same variance V(Y*), v(Y*) can be derived using the I-PSS technique.

Stratified Sampling is used when (i) estimates are needed for subgroups of a universe or (ii) the subgroups could be treated as sub-universes. It gives us the freedom to choose the sampling method as appropriate to each stratum. Estimates of parameters are available for the sub-universes (strata) and these can then be combined over the strata to get estimates for the entire universe. The SE of estimates based on stratified sampling can be small if we form the strata in such a way as to minimise intra-strata variance. Each stratum should thus consist of homogeneous units, as far as possible. Stratum-wise sample sizes can also be so chosen as to minimise the variance of estimators.

Another operationally convenient sampling method, cluster sampling, is to sample groups of units or clusters of units at random and collect data from all the units of the


selected clusters. For example, the household is a cluster of individuals. SRSWR, SRSWOR, pps or systematic sampling can be used for sampling clusters. Cluster sampling is, in general, less efficient than direct sampling of units from the point of view of sampling variance. The question here is one of striking a balance between operational convenience and cost reduction on the one hand and efficiency of the sampling design on the other. We could improve the efficiency of cluster sampling by selecting a random sample of units from each of the selected clusters - introduce another stage of sampling. This is two-stage sampling. This would be more efficient than cluster sampling but less efficient than direct sampling of units. Multi-stage sampling can also be done. Such designs are commonly used in large-scale surveys as these facilitate the utilisation of information available and the choice of appropriate sampling designs at different stages. Thus while non-random sampling methods are useful in exploratory research and preliminary work on planning of enquiries, random sampling techniques lead to valid judgments regarding the universe. Among random sampling methods, SRSWOR, stratified sampling with SRSWOR and, when available information permits, pps-wr and stratified sampling with pps-wr, turn out to be a set of the more efficient and practically useful designs to choose from. I-PSS can also be used in these designs where possible and necessary. 2.9 FURTHER SUGGESTED READINGS. Current developments in sample survey theory and methods touch upon all the sub-areas of the subject, namely, data collection and processing, survey design and estimation or inference. Use of telephones for surveys, where tele-density is high, for selection of samples and data collection (Random Digit Dialing), tackling non-response through techniques like split-questionnaires, ordering of questions on sensitive information in the questionnaire, application of artificial neural network for editing data and imputation, total survey design approaches that tackle total survey error, the Dual Frame Methodology to enable small area estimation that is so necessary for regional planning, sampling on more than one occasion and related issues, replication methods to tackle measurement error, post stratification (stratification after collection of data), use of auxiliary information at the time of estimation, resampling methods like jacknife and bootstrap, methods of estimation of complex functions of parameters like distribution functions, quantiles, poverty proportion and ordinates of the Lorenz Curve, especially in the presence of measurement error, and their variances and the related computer packages are receiving increasing attention of researchers and survey practitioners. The following references for further reading may be useful for appreciation of these developments. Sankhya : The Indian Journal of Statistics Rao J.N.K. (1999): Some Current Trends in Sample Survey Theory and Methods, (Special Issue on Sample Surveys), Vol. 61, Series B, Part 1, pp. 1 – 57. Fuller W. A. & Jay Breidt F.(1999):Estimation for Supplemented Panels, ibid, pp.58 – 70. Shao, Jun & Chen, Yinzhong (1999): Approximate Balanced Half Sample and Related Replication Methods for Imputed Survey Data, ibid, pp. 197 – 201.


Godambe, V.P. (2002): Utilising Survey Framework in Scientific Investigations, (Special Issue on San Antonio Conference – Selected Papers), Vol. 64, Series A, Part 2, pp. 268 – 289. Rao, J.N.K; Yung, W; Hidiroglou (2002): Estimating Equations for the Analysis of Survey Data Using Poststratification Information, ibid, pp.364 –378. Chaudhuri, Arijit & Saha, Amitava (2004): Extending Sitter’s Mirror-Match Bootstrap to Cover Rao-Hartley-Cochran Sampling in Two Stages with Simulated Illustration, Vol. 66, Part 4, pp.791 – 802. Zhu, Min & Wang, You-Gan (2005): Quantile Estimation from Ranked Set Sampling Data, (Special Issue on Quantile Regression and Related Methods), Vol. 67, Part 2, pp. 295 – 304. Matei, Alima & Tille, Yves (2005): Maximal and Minimal Sample Coordination, Vol. 67, part 3, pp. 590 – 612. Journal of the American Statistical Association Pfeffermann, Danny & Tiller, Richard (2006): Small – Area Estimation with State – Space Models Subject to Benchmark Constraints, Vol.101, No. 476, pp. 1387 – 1397. Mach, Lenka; Reiss, Philip T. & Schiopu-Kratina, Ioana (2006): Optimizing the Expected Overlap of Survey Samples via the North West corner Rule, Vol. 101, No. 476, pp.1671 – 1679. Qin, Jing; Shao, Jun & Zhang, Bio (2008): Efficient and Doubly Robust Imputation for Covariate-Dependent Missing Responses, Vol. 103, No. 482, pp. 797 – 810. The Canadian Journal of Statistics Kim J.K. & Park H (2006): Imputation Using Response Probability, Vol. 34, No. 1, pp. 171 – 182. Kim J.F. & Kim J.J. (2007): Non Response Weighting Adjustment Using Estimated Response Probability, Vol. 35, No.4, pp. 501 – 514. : Books Krishniah P.R. & Rao C.R. (Eds) (1988) Skinner, C.J. ; Holt, D. & Smith, T.M.F. (1989): Analysis of Complex Surveys, New York Wiley. Ghosh, Malay & Meeden, Glen (1997): Bayesian Methods for Finite Population Sampling, Chapman and Hall, London. Mukhopadhyay, Parimal (1998) Indian Statistical Institute (2003): Report on Audit Sampling, Applied Statistics Unit, Indian Statistical Institute, Kolkota. 2.10 SOME USEFUL BOOKS & REFERENCES Burgess, R.G.(ed) (1982): Field Research: A Sourcebook and Field Manual.

(Contemporary Social Research 4), George Allen and Unwin, London.

Des Raj & Chandok, P. (1998): Sampling Theory, Narosa Publishing House, New Delhi.
Levin, Richard I. & Rubin, David S. (1991): Statistics for Management, Fifth Edition, Prentice-Hall of India (Private) Limited, M-97 Connaught Circus, New Delhi – 110001.

Krishniah, P.R. & Rao, C.R. (Eds.) (1988): Handbook of Statistics – Vol. 6: Sampling, North Holland, Amsterdam.
Mukhopadhyay, Parimal (1998): Theory & Methods of Survey Sampling, Prentice-Hall of India Pvt. Ltd., New Delhi.
Murthy, M.N. (1967): Sampling Theory and Methods, Statistical Publishing Society, 204, Barrackpore Trunk Road, Kolkata – 700108.
NSSO (1997): Employment and Unemployment in India, 1993-94, National Sample Survey Organisation, Ministry of Statistics and Programme Implementation, Sardar Patel Bhavan, New Delhi – 110001.
Rao, C.R., Mitra, S.K. and Matthai, A. (1966): Formulae and Tables for Statistical Work, Statistical Publishing Society, Kolkata – 700108.
Sampath, S. (2005): Sampling Theory & Methods, Second Edition, Narosa Publishing House, New Delhi, Chennai, Mumbai, Kolkata.
Singh, Kultar (2007): Quantitative Social Research Methods, Sage Publications (Pvt.) Limited, New Delhi.
Singleton Jr., Royce A. & Straits, Bruce C. (2005): Approaches to Social Research, 4th Edition, Oxford University Press, New York – Oxford.
Viswanathan, P.K. (2007): Business Statistics – An Applied Orientation, Dorling Kindersley (India) Pvt. Ltd. – licensees of Pearson Education in South Asia.

2.11 MODEL QUESTIONS

1. You have been asked to conduct a study to determine the literacy rate in a district.

The choice before you is to adopt a census approach or a random sample survey. How would you make a choice between the two? What considerations would lead you to a choice?

2. What tools of data collection would you make use of in the above enquiry and why? 3. What is meant by a ‘parameter’? Define the term ‘statistic’. Give the expressions for

population mean, sample mean and population variance and sample standard deviation.

4. What is the most important purpose of studying the population on the basis of a

sample? In this context, define the terms ‘estimator’ and ‘estimate’ with a suitable example.


5. Define the term ‘representative sample’. How is ‘random sampling’ principally different from that of ‘non-random sampling’? What could be the use of the latter despite its major drawback vis-à-vis the former?

6. Explain what is meant by ‘the sampling distribution of a statistic’ and ‘sampling

variance of a statistic’. What do you mean by an unbiased estimator of a parameter? 7. How does the random sampling procedure help in correcting for the bias of an

estimate? Illustrate this with the help of an example. 8. The sample proportion ‘p’ calculated from a random sample of size ‘n’ may be

considered as normally distributed with mean P and standard deviation √(PQ/n), when n is sufficiently large. Construct a confidence interval for ‘P’ with a confidence coefficient 0.99, when n is large.

9. Explain the terms ‘coefficient of variation’ and ‘relative standard error’. 10. A population has 80 units. The relevant variable has a population mean of 8.2 and a

variance of 4.41. Three SRSWR samples of size (i) 16, (ii) 25 and (iii) 49 are drawn from the population. What is the standard error (SE) of the sample means in the three samples? Is the extent of reduction in SE commensurate with that of the increase in sample size?

11. What are the results when the sampling method in drawing the three samples in

problem 10 above is changed to SRSWOR? What is your advice regarding the choice between increasing the sample size and changing the sampling method from srswr to srswor?

12. Why is it said that SRSWOR is not a more efficient sampling design from the point of

view of the precision of the sample mean as an estimator of the population mean for sampling fractions of less than 0.05?

13. Indicate whether the following statements are true (T) or false (F). If false, what is the

correct position?

(i) The standard error of the sample mean decreases in direct proportion to the sample size.

(ii) SRSWOR method of sampling is more advantageous than srswr for a sampling fraction of 0.02.

(iii) If Y* = Nm and Variance of m is V(m), the variance of Y* is NV(m). 14. What should be the size of the SRSWR sample to be selected if the coefficient of

variation of the sample mean should be 0.2? The population coefficient of variation is known to be 0.8. What will be the sample size if we decide to adopt SRSWOR sampling method?

15. Why would you recommend the use of the technique of interpenetrating sub-samples

in random sampling?


16. Four estimates of the population mean M obtained from four independent I-PSS samples of equal size are 20, 18, 23 and 28. Obtain an unbiased estimate of the sampling variance of the sample mean. Assuming that the population is normally distributed, compute a confidence interval for M. What is the confidence coefficient for this confidence interval? Do these results depend on the sample size?

17. A systematic sample of size 18 has to be selected from a population of 124. What

problems do you face in selecting the sample? Is the sample mean the unbiased estimator of the population mean M? How do you overcome these problems?

18. Is the sample mean ‘m’ always an unbiased estimator of the population mean ‘M’ in

systematic sampling? If not when? What then is an unbiased estimator of M in cases where ‘m’ is not an unbiased estimator of ‘M’? What is the bias in using the sample mean ‘m’ as the estimator of ‘M’ in these situations? Show that this bias is likely to be small for systematic samples from large populations.

19. What is the sampling variance of msys? What steps can be taken to reduce V(msys)? 20. When will systematic sampling be more efficient than (i) SRSWR; (ii) SRSWOR? 22. "It is not possible to get v(msys)." What are the reasons for this situation in the case of

systematic sampling? How is this problem overcome and how then can we get v(msys)?

23. What are the situations in which systematic sampling should not be adopted? What

information is needed in such situations to use systematic sampling and how would you use such information?

24. When is pps method adopted? 25. When will the sampling variance of Ypps* be small? 26. Say true (T) or false (F): (a) pps and stratified sampling can be combined with other sampling methods. (b) V(mst) is reduced by ensuring that units within individual strata are heterogeneous. (c) The size of a stratified sample can be allocated among the strata in proportion to the size of the strata, the size being the number of population units in a stratum. (d) Systematic sampling is a kind of stratified sampling. 27. We wish to study the wage levels of factory labour. What type of sampling method

would you adopt for the study and why if (a) just a list of factories is available with the Chief Inspector of Factories of different State Governments, (b) if the list in (a) above also gives the total number of employees in the individual factories at the end of last year and (c) the list also indicates both the kind of product manufactured in the factory along with the information specified in (b) above.


28. Show that cluster sampling is less efficient than direct sampling of individuals. Why does this happen?

29. How does two-stage sampling improve upon cluster sampling? 30. Is it correct to say that stratified sampling is a kind of multistage sampling? Why? 31. Why are multi-stage designs useful in large scale surveys?


BLOCK 3 QUANTITATIVE METHODS: DATA ANALYSIS

Structure
3.0 Objectives
3.1 Introduction
3.2 An Overview of the Block
3.3 Important Steps Involved in an Econometric Study
3.4 Two Variable Regression Model
  3.4.1 Estimation of Parameters
  3.4.2 Goodness of Fit
  3.4.3 Functional Forms of Regression Model
  3.4.4 Hypothesis Testing
3.5 Multi-Variable Regression Model
  3.5.1 Regression Model with Two Explanatory Variables
    3.5.1.1 Estimation of Parameters: Ordinary Least Squares Approach
    3.5.1.2 Variance and Standard Errors
    3.5.1.3 Interpretation of Regression Coefficients
  3.5.2 Goodness of Fit: Multiple Coefficient of Determination (R²)
  3.5.3 Analysis of Variance
  3.5.4 Inclusion and Exclusion of Explanatory Variables
  3.5.5 Generalisation to N-Explanatory Variables
  3.5.6 Problem of Multicollinearity
  3.5.7 Problem of Heteroscedasticity
  3.5.8 Problem of Autocorrelation
3.6 Further Suggested Readings
3.7 Model Questions

3.0 OBJECTIVES

The main objectives of this block are to:

• have an overview of the basic steps involved in conducting an empirical study,
• know the issue of linearity in the regression model and appreciate its probabilistic nature,
• estimate the unknown population regression parameters with the help of the sample information, explain the concept of goodness of fit and use the various functional forms in estimating the regression model,
• conduct some tests of hypothesis regarding the unknown population regression parameters, estimate and interpret the regression model by introducing first one additional explanatory variable and then extending further to n explanatory variables, and
• tackle the problems of autocorrelation, heteroscedasticity and multicollinearity in multiple regression analysis.


3.1 INTRODUCTION

As a researcher, you may be tempted to examine whether the economic laws hold good in the real world situation reflected in the displayed pattern of the relevant data. Similarly, as a professional economist in government or the private sector, you may be interested in estimating the demand or supply of various products and services, or in knowing the effect of various levels of advertisement expenditure on sales and profits. As a macroeconomist, you may like to measure and evaluate the impact of various government policies, say monetary and fiscal policies, on important variables such as employment, unemployment, income, imports and exports, interest rates, inflation rates, etc. As a stock market analyst, you may seek to relate the prices of a stock to the characteristics of the company issuing the stock and the overall state of the economy. Such types of issues are investigated by employing various statistical techniques. Regression modelling is one of the primary statistical tools employed for conducting such research studies.

3.2 AN OVERVIEW OF THE BLOCK

Empirical research in economics is concerned with the statistical analysis of economic relations. Often, these relations are expressed in the form of regression equations involving a dependent variable and independent variables. The formulation of an economic relation in the form of a regression equation is called a regression model in econometrics. The major purpose of a regression model is the estimation of its parameters in two-variable and multi-variable situations and the testing of hypotheses. In this block you will find an overview of the basic steps involved in conducting an empirical study, the estimation of parameters in two-variable and multi-variable situations by applying the ordinary least squares approach, the various functional forms of regression models, hypothesis testing, the consequences of the violation of the basic assumptions in brief (i.e. multicollinearity, heteroscedasticity and autocorrelation) and how to tackle these problems.

3.3 IMPORTANT STEPS INVOLVED IN AN ECONOMETRIC STUDY

The basic steps followed in conducting a regression-model-based empirical study are: (1) specification of a model, based on the knowledge of economic theory, past experience or other studies, (2) gathering the data, (3) estimating the model, (4) subjecting the model to hypothesis testing, and interpreting the results.

Specifying a Model: In economics, as in the physical sciences, the model (logical structure of the system) is set up in the form of equations which precisely describe the behaviour of economic and related variables. The model may consist of a single equation or several equations. In the specification of a single equation, the behaviour of a single variable (denoted by Y) is explained. Placed on the left hand side, Y is referred to by a number of names like dependent variable, regressand or explained variable. On the right hand side a number of variables that influence the dependent variable are identified (denoted by Xs). These are referred to as independent variables, exogenous variables, explanatory variables or regressors. If a single equation model consists of one independent variable, it is referred to as a two variable regression model. In case the investigator identifies more than one independent variable to explain the behaviour of Y, it is referred to as a multi-variable regression model. In simultaneous equation models, the behaviour of more than one dependent variable is studied and accordingly several equations are specified together.

Gathering the data: In order to estimate the econometric model, data on the dependent and independent variables are needed. As we have studied in the last block (Block 2) of this course, data can be collected by way of the experimental, sample survey or observational method. If you aim to explain the variation of the dependent variable over a period of time, you will be required to obtain observations at different points of time (referred to as time series data). The periodicity may be annual, quarterly, monthly or weekly depending on the need and requirement. If we want to analyse the characteristics of a dependent variable at a given point of time, we need cross section data, i.e. the observations for a variable for different units at the same point of time. In pooled data, we have time series observations for various cross sectional units. Here we combine the element of time series with that of cross section.

Estimating the Model: After formulation of the model and gathering of the data, the next step is to estimate the unknown parameters of the model like the intercept term α, the slope term β, etc. We shall discuss this issue in the next section. We have already seen that the formulation of the basic model is guided by economic theory, the investigator's perception of the underlying behaviour and past experience or studies. Consequently, the specified model is subjected to a variety of diagnostic tests for ascertaining whether the underlying assumptions and estimation methods are appropriate for the data. The final stage of the empirical investigation is to interpret the results. If the chosen model does not refute the hypothesis or theory under consideration, we may use it to predict the future value of the dependent variable Y on the basis of known or expected future values of the explanatory variables.

Read Introductory Econometrics with Applications, Fifth Edition (2002) by Ramu Ramanathan, Chapter 1 (pp. 2–15) and Chapter 14 (pp. 568–579), --------- Learning India (P) Ltd., New Delhi.

3.4 TWO VARIABLE REGRESSION MODEL

3.4.1 Estimation of Parameters: The Ordinary Least Squares Approach

(i) The first step is to specify the relationship between the X and Y variables. Assuming that there is a linear relationship between the two variables, we can specify the model as Y = α + βX + U. By linearity, we often mean a relationship in which the dependent variable is a linear function of the independent variable. However, linearity in the context of regression analysis can be interpreted in two different ways: linearity in the variables, and linearity in the parameters. Since the main purpose of regression analysis is the estimation of its parameters, we shall consider only those models which are linear in parameters, no matter whether they are linear in the variables or not. In fact, models that are non-linear in variables but linear in parameters can easily be estimated by extending the basic procedure that we are discussing here.

By the very nature of social science, the relationships among different variables cannot be expected to be exact or deterministic. Hence the dependent variable Y tends to be probabilistic or stochastic in nature. That is why we specify the regression model by incorporating a random or stochastic variable. The random variable U is also called the disturbance or error term and is a sort of catch-all variable that represents all kinds of indeterminacies of an inexact relationship.

Y = α + βX + U   (3.1)

The above equation is called the population regression function. In this formulation, Y is stochastic or random, but X is non-stochastic or deterministic in nature. This asymmetry in the treatment of the dependent and the independent variable can be removed by making both Y and X stochastic in nature. However, such a model is beyond the scope of the present discussion. It should be clear that a random variable like U is introduced in the population regression function to incorporate the element of randomness of a statistical relationship. We have an unknown bivariate population and hence, to estimate the population regression function, we require the sample observations. Accordingly, the concept of the sample regression function comes into the picture. The sample regression function is quite similar to the population regression function and can be presented as

Y = α̂ + β̂X + Û   (3.2)

Here α̂, β̂ and Û are interpreted as the sample estimates of their corresponding unknown population counterparts. Thus, we hypothesize that corresponding to the linear population regression function Y = α + βX + U, there is a linear sample regression function given by Y = α̂ + β̂X + Û.

Assumptions of the Classical Regression Model:

i) The disturbance term U has a zero mean for all the values of X, i.e. E(U) = 0
ii) The variance of U is constant for all the values of X, i.e. V(U) = σ²
iii) The disturbance terms for two different values of X are independent, i.e. Cov(Ui, Uj) = 0 for i ≠ j
iv) X is non-stochastic
v) The model is linear


Estimation of the Sample Regression Function: The philosophy behind the least squares method is that we should fit a straight line through the scatter plot in such a manner that the vertical differences between the observed values of Y and the corresponding values obtained from the straight line, called errors, are minimum. In other words, we should choose our α and β in such a manner that the sum of the squares of the vertical differences between the actual or observed values of Y and the ones obtained from the straight line is minimum. The straight line that we obtain is called the line of best fit. Mathematically:

Minimize ∑Û² = ∑(Y – Ŷ)² = ∑(Y – α̂ – β̂X)² with respect to α̂ and β̂.

By following the usual minimization procedure, we obtain the so-called normal equations. The two normal equations are then simultaneously solved for

β̂ = ∑xy / ∑x²   (3.3)

and

α̂ = Ȳ – β̂X̄   (3.4)

(here the lower case x and y denote the deviations of X and Y from their respective means).

The least squares estimators α̂ and β̂ are taken as the estimators of the unknown population parameters α and β because they satisfy the following desirable properties:

(1) Least squares estimators are linear.
(2) Least squares estimators are unbiased, i.e. E(α̂) = α and E(β̂) = β.
(3) Among all the linear unbiased estimators, least squares estimators have the minimum variance and are therefore termed efficient estimators.

All these properties of the least squares estimators lead to the Gauss-Markov theorem. In other words, the least squares estimators are the best linear unbiased estimators, i.e. BLUE.

Standard Errors of the Regression Estimates:

The standard errors of the estimates, also known as the standard deviations of the sampling distributions of the least squares estimates, are taken as a measure of the precision of these estimates. These are obtained by taking the positive square root of the variances of α̂ and β̂. The expressions for both the variance and the standard error of the least squares estimators are given below:

Var(α̂) = σ² ∑X² / [ n ∑(X – X̄)² ]   (3.5)

Se(α̂) = σ √[ ∑X² / {n ∑(X – X̄)²} ]   (3.6)

Var(β̂) = σ² / ∑(X – X̄)²   (3.7)

Se(β̂) = σ / √[ ∑(X – X̄)² ]   (3.8)

Here an unbiased estimator of σ² is

σ̂² = ∑Û² / (n – 2)   (3.9)

(n – 2 is the degrees of freedom). Replacing σ by this estimator, we can compute the standard errors of both α̂ and β̂, writing the estimator of σ as

σ̂ = √[ ∑Û² / (n – 2) ] = √[ ∑(Y – Ŷ)² / (n – 2) ]   (3.10)
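Although the course relies on packages such as SPSS and EVIEWS for estimation (Block 6), formulas (3.3)–(3.10) are easy to verify in a few lines of Python. The data below are made up purely for illustration and are not from the text.

```python
import math

X = [2.0, 3.0, 5.0, 7.0, 9.0, 10.0]       # illustrative values only
Y = [4.1, 5.0, 7.2, 8.9, 11.1, 12.0]
n = len(X)
Xbar, Ybar = sum(X) / n, sum(Y) / n

Sxy = sum((x - Xbar) * (y - Ybar) for x, y in zip(X, Y))   # ∑xy (in deviations)
Sxx = sum((x - Xbar) ** 2 for x in X)                      # ∑x²

beta_hat = Sxy / Sxx                     # (3.3)
alpha_hat = Ybar - beta_hat * Xbar       # (3.4)

resid = [y - alpha_hat - beta_hat * x for x, y in zip(X, Y)]
sigma2_hat = sum(u ** 2 for u in resid) / (n - 2)          # (3.9)

se_beta = math.sqrt(sigma2_hat / Sxx)                                   # (3.8)
se_alpha = math.sqrt(sigma2_hat * sum(x ** 2 for x in X) / (n * Sxx))   # (3.6)

print(alpha_hat, beta_hat, se_alpha, se_beta)
```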

3.4.2 Goodness of Fit

Once the regression line is fitted, we may be interested to know how faithfully the sample regression line describes the unknown population regression line. The regression error term or residual U plays an important role in this regard. Small residuals imply that a large proportion of the variation in the dependent variable has been explained by the regression equation and hence the fit is good. Similarly, large residuals obviously point to a poor fit. The coefficient of determination (the square of the correlation coefficient, i.e. R²) acts as a measure of goodness of fit.

Example: Given the following estimated regression model, interpret the results:

Ŷ = –14.0217 + 0.965217 X ,   R² = 0.989345   (3.11)
Se = (7.9382644) (0.0354118) ,   Degrees of freedom = 8

where Y = average employment and X = level of labour force.

The slope coefficient β̂ = 0.965217 estimates the rate of change of employment with respect to the labour force: if 100 more persons start searching for jobs, about 97 of them actually get employed. The intercept α̂ = –14.0217 indicates the average combined effect of all those variables that might affect employment but have been omitted from the above regression. The coefficient of determination R² = 0.989345 indicates that about 99 per cent of the variation in employment can be explained by the variation in the labour force, which is indeed very high and is a good fit for the given sample.


3.4.3 Functional Forms of Regression Model

As we have discussed above, regression models that are linear in parameters are the relevant ones for us. Linear-in-parameter regression models have the following functional forms (slope = dY/dX; elasticity = (dY/dX)·(X/Y)):

1. Linear: Y = β1 + β2X ; slope = β2 ; elasticity = β2 (X/Y)
2. Log-Linear: ln Y = β1 + β2 ln X ; slope = β2 (Y/X) ; elasticity = β2
3. Log-Lin: ln Y = β1 + β2X ; slope = β2 Y ; elasticity = β2 X
4. Lin-Log: Y = β1 + β2 ln X ; slope = β2 (1/X) ; elasticity = β2 (1/Y)
5. Reciprocal: Y = β1 + β2 (1/X) ; slope = –β2 (1/X²) ; elasticity = –β2 (1/XY)
6. Log Reciprocal: ln Y = β1 – β2 (1/X) ; slope = β2 (Y/X²) ; elasticity = β2 (1/X)

Choice of Functional Form

A great deal of skill and experience is required in choosing an appropriate model for empirical estimation. However, the following guidelines can be helpful in this regard:

(i) The underlying theory may suggest a particular functional form.
(ii) The knowledge of the above formulae will be helpful in comparing the various models.
(iii) The coefficients of the model chosen should satisfy certain a priori expectations.
(iv) One should not overemphasise the r² measure in the sense that the higher the r², the better the model. The theoretical underpinnings of the chosen model, the signs of the estimated coefficients and their statistical significance are of more importance in this regard.

Read Basic Econometrics (Fourth Edition) by Damodar N. Gujarati and Sangeeta, Tata McGraw Hill Publishing Company Ltd., Delhi (2007 edition), Chapters 3, 5 and 6 (pp. 60–108, 169–196).
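As an illustration of the log-linear form in the list above (where the slope coefficient is itself the elasticity), here is a small Python sketch on simulated data; the "true" elasticity of –0.8 is an assumption of the simulation, not a result from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=50)                            # e.g., price
Y = 3.0 * X ** (-0.8) * np.exp(rng.normal(0, 0.05, size=50))   # assumed elasticity = -0.8

# Estimate ln Y = b1 + b2 ln X by OLS; b2 is the elasticity.
lnX, lnY = np.log(X), np.log(Y)
b2, b1 = np.polyfit(lnX, lnY, deg=1)   # polyfit returns slope first, then intercept
print("estimated elasticity:", b2)
```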

3.4.4 Hypothesis Testing

To examine whether the unknown parameter α or β assumes a particular value or not is known as hypothesis testing in statistics. Although we may test some hypotheses about the intercept α, our main concern in the regression model is the slope coefficient β. Hypothesis testing consists of three basic steps:

(i) Formulating two opposing hypotheses:
H0: β = 0
H1: β ≠ 0

(ii) Deriving a test statistic and its statistical distribution under the null hypothesis, conventionally denoted by t. Thus

t = (β̂ – β) / s.e.(β̂)

The t statistic obtained above has n – 2 degrees of freedom because we are estimating two parameters, α and β.

(iii) Deriving a decision rule for rejecting or accepting the null hypothesis. The following steps are involved in this process:

(a) H0: β = β0 , H1: β ≠ β0
(b) The test statistic is t = (β̂ – β0) / s.e.(β̂) and can be calculated from the sample information. Under the null hypothesis, it has the t distribution with n – 2 degrees of freedom. If the modulus of t is large, we would suspect that β is probably not equal to β0.
(c) In the t table, trace the critical value of t for n – 2 d.f. at the desired level of significance (say a).
(d) Reject H0 if the computed t value tc exceeds, in absolute value, the critical t value recorded in the t table; otherwise accept H0.

Example: Given the GDP at factor cost and the final consumption expenditure (FCE) for the Indian economy during the period 1980-2001 at 1993-94 prices, we run the regression of final consumption expenditure on GDP and obtain the following results through the SPSS software:

FCE = α + β GDP
FCE = 108206.4 + 0.719674 GDP
s.e. = (233.203) (0.007865)
t = (17.35968) (91.50314)
R² = 0.997617 , d.f. = 20

Null hypothesis: H0: β = 0.80 ; H1: β ≠ 0.80

The t statistic in this case is given by

t = (β̂ – β0) / s.e.(β̂) = (0.719674 – 0.80) / 0.007865 ≈ –10.21, i.e. |t| ≈ 10.21.

This computed value of the t statistic exceeds, in absolute value, the critical values of 2.845 and 2.086 for 20 degrees of freedom at the 1% and 5% levels of significance respectively. Thus, on the basis of the sample information, the difference between the estimated value of β and the hypothesised value is so large that it would arise by chance in fewer than 1 out of 100 (or 5 out of 100) samples. Hence, on the basis of the sample information, we are not in a position to accept the null hypothesis: in all probability, during the sample period 1980-2001, India's marginal propensity to consume has not been as high as 80 per cent.

In this example, we considered a two-tailed test. Similarly, a one-tailed test can also be conducted; it all depends upon the type of enquiry that we intend to conduct. The t test discussed above is an example of a small-sample test. However, if the sample is sufficiently large, then by virtue of the central limit theorem the distribution of the test statistic discussed above approximately follows the standard normal distribution. Accordingly, the entire test can be conducted by consulting the standard normal table instead of the t table, and in this case one need not bother about the degrees of freedom. For deciding whether a sample is sufficiently large or not, one has to consider the size of the sample (n). A rule of thumb is that if n is 30 or more, the sample can be considered a large sample; otherwise, it is to be taken as a small sample.
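The t test in the example can be reproduced from the reported coefficient, standard error and degrees of freedom alone. A small Python sketch follows (assuming the scipy package is available); the numbers are those reported above.

```python
from scipy import stats

beta_hat, se_beta, df = 0.719674, 0.007865, 20
beta_0 = 0.80                                    # H0: beta = 0.80

t_stat = (beta_hat - beta_0) / se_beta           # about -10.21
p_value = 2 * stats.t.sf(abs(t_stat), df)        # two-tailed p value
t_crit_5 = stats.t.ppf(0.975, df)                # about 2.086
t_crit_1 = stats.t.ppf(0.995, df)                # about 2.845

print(t_stat, p_value)
print("reject H0 at 5%:", abs(t_stat) > t_crit_5, "; at 1%:", abs(t_stat) > t_crit_1)
```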

3.5 MULTI-VARIABLE REGRESSION MODELS

We shall extend the regression analysis further to make it more realistic and comprehensive by

(i) introducing one more explanatory variable and re-examining the model,
(ii) interpreting the partial regression coefficients,
(iii) considering how many explanatory variables must be included in the model and what should be the touchstone for arriving at such a decision,
(iv) discussing the conditions or assumptions which make these extensions and generalisations possible, and
(v) examining the possible effects of violations of one or more assumptions, particularly multicollinearity, heteroscedasticity and autocorrelation.

3.5.1 Regression Model with Two Explanatory Variables

(The following matter has been adapted from Unit 10, Block 3 of the MEC-009 course.) For simplicity and better comprehension, we shall write the model (3.1) in a different form, Y = β0 + β1X1, so as to add more explanatory variables like X2, X3 with their respective coefficients β2, β3, etc. Thus, a model with two explanatory variables in stochastic form can be written as:

Y = β0 + β1X1 + β2X2 + U   (3.12)
  = E(Y) + U   (3.13)

Using the subscript t with Y, X1, X2 and U to denote the t-th observation on these variables, the above equation can be written as Yt = β0 + β1X1t + β2X2t + Ut.

3.5.1.1 Estimation of Parameters: Ordinary Least Squares Approach

We collect the sample observations on Y, X1 and X2 and write down the sample regression function as

Yt = b0 + b1X1t + b2X2t + et   (3.14)

where b0, b1 and b2 replace the corresponding population parameters β0, β1 and β2, and the random population component Ut is replaced by the sample error term et. By applying the principle of ordinary least squares, we work out the values of b0, b1 and b2 such that the residual sum of squares ∑et² is minimum. Here

et = Yt – b0 – b1X1t – b2X2t   (3.15)

te is minimum. Here tttt XbXbbYe 22110 −−−= (3.15)

[ ]222110

2∑ −−−∑= tttt XbXbbYe (3.16) Differenting 3.16 w.r.t. b0, b1, b2 and equating to 0 gives us the three normal equations.

These three equations give us the following expressions for b0, b1, and b2 respectively:

( )( ) ( )( )( )( ) ( )

(3.21)

(3.20)

221

22

21

212221

1

22110

tttt

ttttttt

xxxxxxxyxxyb

XbXbYb

∑−∑∑∑∑−∑∑

=

−−=

( )( ) ( )( )( )( ) ( )2

2122

21

211212

2tttt

ttttttt

xxxxxxxyxxy

b∑−∑∑

∑∑−∑∑= (3.22)

The lower case letters denote, as usual, the deviations from the respective means:

)( xand ),(),( 22211 XXXXxYyy tttttt −=−=−=

(3.19) Y

(3.18) Y

(3.17)

222211202t

2122

11101t

22110

∑∑ ∑∑∑∑ ∑∑++=

++=

++=

ttttt

ttttt

ttt

XbXXbXbX

XXbXbXbX

XbXbbY
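Formulas (3.20)–(3.22) can be checked numerically. The sketch below simulates data with known coefficients (an assumption made only for illustration) and recovers b0, b1 and b2 from the deviation sums.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(10, 2, n)
X2 = rng.normal(5, 1, n)
Y = 2.0 + 1.5 * X1 - 0.7 * X2 + rng.normal(0, 1, n)   # assumed b0=2, b1=1.5, b2=-0.7

y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()   # deviations from means
S11, S22, S12 = (x1 * x1).sum(), (x2 * x2).sum(), (x1 * x2).sum()
S1y, S2y = (x1 * y).sum(), (x2 * y).sum()

D = S11 * S22 - S12 ** 2
b1 = (S1y * S22 - S2y * S12) / D                       # (3.21)
b2 = (S2y * S11 - S1y * S12) / D                       # (3.22)
b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()        # (3.20)
print(b0, b1, b2)
```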


3.5.1.2 Variance and Standard Errors

Var(b0) = [ 1/n + ( X̄1² ∑x2t² + X̄2² ∑x1t² – 2 X̄1X̄2 ∑x1tx2t ) / ( ∑x1t² ∑x2t² – (∑x1tx2t)² ) ] σ²   (3.23)

SE(b0) = √Var(b0)   (3.24)

Var(b1) = [ ∑x2t² / ( ∑x1t² ∑x2t² – (∑x1tx2t)² ) ] σ² ;  SE(b1) = √Var(b1)   (3.25)

Var(b2) = [ ∑x1t² / ( ∑x1t² ∑x2t² – (∑x1tx2t)² ) ] σ² ;  SE(b2) = √Var(b2)

Here σ² is unknown and its unbiased OLS estimator σ̂² is used in its place. σ̂² is worked out as

σ̂² = ∑et² / (n – 3)   (3.26)

where n – 3 stands for the degrees of freedom.
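Continuing in the same spirit, the variances and standard errors in (3.23)–(3.26) can be computed from the same deviation sums; the data are again simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X1, X2 = rng.normal(10, 2, n), rng.normal(5, 1, n)
Y = 2.0 + 1.5 * X1 - 0.7 * X2 + rng.normal(0, 1, n)    # assumed coefficients

y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()
S11, S22, S12 = (x1 * x1).sum(), (x2 * x2).sum(), (x1 * x2).sum()
D = S11 * S22 - S12 ** 2
b1 = ((x1 * y).sum() * S22 - (x2 * y).sum() * S12) / D
b2 = ((x2 * y).sum() * S11 - (x1 * y).sum() * S12) / D
b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()

e = Y - (b0 + b1 * X1 + b2 * X2)               # residuals e_t
sigma2_hat = (e ** 2).sum() / (n - 3)          # (3.26)

var_b1 = sigma2_hat * S22 / D                  # (3.25)
var_b2 = sigma2_hat * S11 / D
var_b0 = sigma2_hat * (1 / n +                 # (3.23)
         (X1.mean() ** 2 * S22 + X2.mean() ** 2 * S11
          - 2 * X1.mean() * X2.mean() * S12) / D)
print(np.sqrt([var_b0, var_b1, var_b2]))       # (3.24): standard errors
```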

3.5.1.3 Interpretation of Regression Co-efficients

Mathematically, b1 and b2 represent the partial slopes of the regression plane with respect to X1 and X2 respectively. In other words, b1 shows the rate of change in Y as X1 alone undergoes a unit change, keeping all other things constant. Similarly, b2 represents the rate of change of Y as X2 alone changes by a unit while other things are held constant.

3.5.2 Goodness of Fit: Multiple Coefficient of Determination (R²)

We have seen above that, in the case of a single independent variable, r² measures the goodness of fit of the fitted sample regression line. When we have two explanatory variables X1 and X2, we might be interested in the proportion of the total variation in Y (i.e. ∑yt²) explained by X1 and X2 jointly. This information is conveyed by the multiple coefficient of determination, denoted by R². It can be computed by the following formula:

R² = ( b1 ∑ytx1t + b2 ∑ytx2t ) / ∑yt²   (3.27)

R² lies between 0 and 1, and the closer it is to 1, the better is the fit, which implies that the estimated regression line is capable of explaining a greater proportion of the variation in Y. The positive square root of R² is called the coefficient of multiple correlation.

3.5.3 Analysis of Variance (ANOVA)

In the context of regression, a study of the components of the total sum of squares (TSS) is called analysis of variance. We know the relationship: TSS = ESS + RSS


where ESS = explained sum of squares and RSS = residual sum of squares. This is equivalent to saying:

∑yt² (TSS) = [ b1 ∑ytx1t + b2 ∑ytx2t ] (ESS) + ∑et² (RSS)   (3.28)

It should be noted that every sum of squares has some degrees of freedom (df) associated with it. Accordingly, in our 2-explanatory-variable case, the degrees of freedom will be: TSS = n – 1, RSS = n – 3, ESS = 2. One may be interested in testing the null hypothesis H0: β1 = β2 = 0. In such a case, we find that (ESS/df)/(RSS/df) is the ratio of the variance explained by X1 and X2 to the unexplained variance, and it follows the F distribution with 2 and n – 3 degrees of freedom. In general, if a regression equation estimates K parameters including the intercept, then F has (K – 1) df in the numerator and (n – K) df in the denominator. F values can be expressed in terms of R² as under:

F = [ R² / (K – 1) ] / [ (1 – R²) / (n – K) ]   (3.29)

Interpretation: the larger the variance explained by the fitted regression line, the larger the numerator will be in relation to the denominator. Thus, a large F value is evidence against the truthfulness of H0: β1 = β2 = 0. Accordingly, when the computed F value exceeds the critical F value (for K – 1 and n – K degrees of freedom), one cannot accept the hypothesis that the variables X1 and X2, taken together, do not have any effect on Y.

Read Basic Econometrics (Fourth Edition) by Damodar N. Gujarati and Sangeeta, Chapter 8, pp. 253–265.
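A short Python sketch illustrating (3.27)–(3.29) — and, anticipating Section 3.5.4, the adjusted R² — on simulated data; the coefficients used to generate the data are assumptions for the example, not results from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 60, 3                                       # K parameters incl. the intercept
X1, X2 = rng.normal(size=n), rng.normal(size=n)
Y = 1.0 + 0.8 * X1 + 0.5 * X2 + rng.normal(scale=0.7, size=n)

X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)          # OLS fit
e = Y - X @ b                                      # residuals

TSS = ((Y - Y.mean()) ** 2).sum()
RSS = (e ** 2).sum()
ESS = TSS - RSS                                    # TSS = ESS + RSS  (3.28)

R2 = ESS / TSS                                     # (3.27)
F = (R2 / (K - 1)) / ((1 - R2) / (n - K))          # (3.29)
R2_adj = 1 - (1 - R2) * (n - 1) / (n - K)          # adjusted R², see Section 3.5.4
print(R2, F, R2_adj)
```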

3.5.4 Inclusion and Exclusion of Explanatory Variables

As we add more and more explanatory variables Xs, the explained sum of squares (ESS) keeps on rising and, consequently, R2 goes on rising. However, each additional variable that is added eats up one degree of freedom and our definition of R2 makes no allowance for this loss of degree of freedom. Thus, the philosophy of improving the goodness of fit by sensibly increasing the number of explanatory variables may not be justified. We know that TSS always has (n-1) degree of freedom. Therefore, comparing two regression models with same dependent variable but different number of independent variables will not be justified also. Hence we must adjust our measure of goodness of fit for degrees of freedom. This measure is called adjusted 2R , denoted by 2R . It can be derived from R2 in the following manner:


adjusted R2 = 1 - (1 - R2) (n - 1)/(n - k)     (3.30)
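As an illustration of (3.30) only, not part of the original unit, this Python/NumPy sketch (simulated data, with a deliberately irrelevant extra regressor) shows that R2 always rises when a variable is added, while adjusted R2 need not.

```python
import numpy as np

def r2_and_adjusted(y, X):
    """Return (R2, adjusted R2) for an OLS fit of y on X (X includes the intercept column)."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    R2 = 1 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - k)        # equation (3.30)
    return R2, R2_adj

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                            # irrelevant variable, for illustration
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, junk])
print(r2_and_adjusted(y, X_small))
print(r2_and_adjusted(y, X_big))                     # R2 rises, adjusted R2 may fall
```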

Therefore, it is recommended that we include new variables only if (upon inclusion) adjusted R2 increases, and not otherwise. A general guide is provided by the 't' statistic: if the absolute t value of the coefficient of the added variable is greater than one, retain it. (Let us remember that the 't' value is calculated under the hypothesis that the population value of that coefficient is zero.) We should note here that besides R2 and adjusted R2, there are other criteria for judging the goodness of fit, such as Akaike's information criterion and Amemiya's prediction criterion. However, a description of such criteria is beyond the scope of the present discussion.

3.5.5 Generalisation to n Explanatory Variables

In general, our regression model may have a large number of independent variables. Each of those variables can, on a priori grounds, be expected to have some influence over the 'dependent' or 'explained' variable. Consider a very simple example. What can be the possible determinants of the demand for potatoes in a vegetable market? One obvious choice will be the price of potatoes. What else can affect the quantity demanded? Could it be the availability of vegetables which can be paired off with potatoes? In that case, the prices of a large number of vegetables which are cooked along with potatoes become 'relevant explanatory variables'. You cannot ignore the income of the community that patronizes the particular market. Needless to say, the dietary preferences of the members of the households can also affect the demand, and so on.

In the next part, we shall discuss techniques which help us restrict the analysis to a selected few variables, though theoretical considerations may find a huge number of them to be 'useful' and 'powerful' determinants. In fact, in economic theory, we usually append the phrase ceteris paribus to many a statement. This phrase means keeping all other things constant. That means we may focus on the impact of only a few selected variables on the dependent variable while assuming that all other variables remain 'unchanged' during the period of analysis. However, before taking recourse to this assumption, we have to weigh the need to include more and more variables in our model against the 'gains' in the explanatory power of the model. We have developed, in sub-section 3.5.4, a working touchstone for the inclusion of more variables in terms of the improvement in adjusted R2 and have tried to give it a 'practical' shape in the form of the magnitude of the 't' values of the relevant slope parameters.

With these considerations in mind, we can generalise the linear regression model as follows. We hypothesize that in the population the dependent variable Y depends upon k explanatory variables, X1, X2, ..., Xk. We also assume that the relationship is linear in parameters. Three more assumptions are made, and they have a very significant bearing on the analysis. These are:

a) Absence of multicollinearity;

b) Absence of heteroscedasticity; and

c) Absence of autocorrelation


We will discuss the complications which arise because of violations of these assumptions in sections 3.5.6, 3.5.7 and 3.5.8 respectively. We can present the Classical Linear General Regression Model in k explanatory variables in the following fashion:

Yt = β1X1t + β2X2t + ... + βkXkt + Ut,   t = 1, 2, ..., n     (3.31)

In this model, we have omitted the constant intercept term to facilitate the exposition. From the model it is clear that Y and each X have n values (t = 1, ..., n), forming n (k+1)-tuples like (Y1, X11, X21, ..., Xk1) and so on, of one dependent variable and k explanatory variables. We can write this elaborate system of n equations for the n values of the dependent variable Y in terms of the k explanatory Xs very conveniently in matrix form:

Y = Xβ + U     (3.32)

where

Y = [Y1, Y2, ..., Yn]′ is the n×1 vector of observations on the dependent variable,
β = [β1, β2, ..., βk]′ is the k×1 vector of parameters,
U = [U1, U2, ..., Un]′ is the n×1 vector of error terms, and
X is the n×k matrix of observations on the explanatory variables:

X = [ X11  X21  X31  ...  Xk1
      X12  X22  X32  ...  Xk2
      ...
      X1n  X2n  X3n  ...  Xkn ]

We assume that:

1) The expected value of each error term is zero, that is, E(ui) = 0 for all i. In matrix notation, E(U) = 0, where 0 is the n×1 null vector.

2) The error terms are not correlated with one another and they all have the same variance σ² for all sets of values of the X variables. That is,

E(ui uj) = 0 for all i ≠ j, and E(ui²) = σ² for all i.

In matrix notation, E(UU′) = σ² In, where In is the n×n identity matrix: the diagonal elements E(ui²) are all equal to σ² and the off-diagonal elements E(ui uj) are all zero.


3) The explanatory variables X1, ..., Xk are non-random (i.e., non-stochastic) variables.

4) The matrix X has full column rank equal to k. This means it has k linearly independent columns. It implies that the number of observations exceeds the number of coefficients to be estimated (n > k). It also implies that there is no exact linear relationship among the X variables. This, in fact, is the assumption of the absence of multicollinearity.

Note: The assumption that E(ui uj) = 0 means that the error terms are not correlated. The implication of the diagonal terms of the matrix E(UU′) all being equal to σ² is that all error terms have the same variance σ². This is also called the assumption of homoscedasticity.

We can write the regression relation for the sample as: e = Y - Xb

where e, Y, X and b are appropriate matrices.

Sum of squared residuals will be

φ = ∑et² = ∑(Yt - b1X1t - ... - bkXkt)²,   t = 1, ..., n     (3.33)

In matrix form:

φ = e′e = (Y - Xb)′(Y - Xb) = Y′Y - 2b′X′Y + b′X′Xb

Note: b′ X'Y is a scalar and therefore equal to its transpose Y′ Xb.

By equating the first-order partial derivatives of φ with respect to each bi (i = 1, ..., k) to zero, we get k normal equations. This set of equations in matrix form is:

∂φ/∂b = -2X′Y + 2X′Xb = 0     (3.34)

X′Xb = X′Y     (3.35)

When X has rank equal to k, the normal equations (3.35) have a unique solution and the least squares estimator b is equal to:

b = (X′X)⁻¹X′Y     (3.36)

We have taken b to be the estimator of β. To see that it is unbiased, that is E(b) = β, substitute Y = Xβ + U into (3.36):

b = (X′X)⁻¹X′(Xβ + U)
  = (X′X)⁻¹(X′X)β + (X′X)⁻¹X′U
  = β + (X′X)⁻¹X′U

Therefore E(b) = β + (X′X)⁻¹X′E(U) = β, since E(U) = 0.


The variance-covariance matrix of b is Var(b) = σ²(X′X)⁻¹.
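A minimal Python/NumPy sketch, not part of the original unit and using simulated data, mirrors the matrix formulae above: b = (X′X)⁻¹X′Y and the estimated Var(b) = σ²(X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = rng.normal(size=(n, k))                     # n observations on k regressors (simulated)
beta = np.array([1.0, -2.0, 0.5])               # 'true' parameters, for the simulation only
y = X @ beta + rng.normal(size=n)               # Y = X beta + U

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                           # b = (X'X)^(-1) X'Y, equation (3.36)

e = y - X @ b                                   # residuals
sigma2_hat = e @ e / (n - k)                    # estimator of sigma^2 using degrees of freedom
var_b = sigma2_hat * XtX_inv                    # estimated Var(b) = sigma^2 (X'X)^(-1)
se_b = np.sqrt(np.diag(var_b))                  # standard errors of the coefficients
print(b, se_b)
```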

Notes:

1. In this course our objective is simply to introduce the concepts. You will find these concepts treated at a much more rigorous level in the course on Basic Econometrics (REC-003), included as a compulsory course of the M.Phil/Ph.D. programme in Economics.

2. The other ideas regarding the coefficient of determination R2 and adjusted R2 remain the same as they were developed for the two-explanatory-variable case.

Now we can safely turn to discussions of the non-satisfaction or violation of the assumptions.

3.5.6 Problem of Multicollinearity

Many a time the X variables may be found to have some other linear relationships among themselves. This vitiates our classical regression model. Let us illustrate it with the help of our 2-explanatory-variable model:

Yi = β0 + β1X1i + β2X2i + Ui

Let us give specific names to the variables: say, X1 is the price of commodity Y and X2 is family income. We expect β1 to be negative and β2 to be positive. Now we go one step further. Let Y be the demand for milk, X1 the price of milk, and suppose the family-wise demand for milk is being estimated for a family which also produces and sells milk. Clearly, the larger the value of X1, the higher the magnitude of X2 will be. In such situations, the estimation of the price and income coefficients will not be possible. Recall that we wanted the X variables in our matrix equations to be linearly independent. If that condition is not satisfied, the matrix X′X becomes singular, that is, its determinant is equal to zero. Thus, there will be no solution to the normal equations 3.34 (or 3.35). However, if the collinearity is not perfect, we can still get OLS estimates and they remain the best linear unbiased estimators (BLUE), though one or more partial regression coefficients may turn out to be individually insignificant; the OLS estimates still retain the property of minimum variance. Further, it is found that multicollinearity is essentially a sample problem: the X variables may not be linearly related in the population, but some of our suppositions while drawing a sample may create a situation of multiple linear relations in the sample.

The practical consequences of multicollinearity: Gujarati (D.N.) has listed the following consequences of such multiplicity of linear relationships:

1. Large variances/SEs of the OLS estimates
2. Wider confidence intervals
3. Insignificant 't' ratios for the β parameters
4. A high R2 despite few significant t values
5. Instability of the OLS estimators: the estimators and their standard errors (SEs) become very sensitive to small changes in the data.
6. Sometimes, even the signs of some of the regression coefficients may turn out to be theoretically unacceptable, like a rise in income having a negative impact on the demand for milk.


7. When many regressors have insignificant coefficients, their individual contributions to the explained sum of squares cannot be assessed properly.

Multicollinearity can be detected by: (1) a high R2 but few significant 't' ratios, and (2) high pairwise correlations between explanatory variables. One can also try partial correlations and subsidiary or auxiliary regressions. But each such technique increases the burden of calculations.
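A small Python/NumPy sketch, not part of the original unit and using simulated data, illustrates the two rules of thumb above: inspecting pairwise correlations among regressors and running an auxiliary regression of one regressor on the others (the basis of the variance inflation factor).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)      # x2 almost collinear with x1 (by construction)
x3 = rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False))              # high pairwise correlation between x1 and x2

# Auxiliary regression: x1 on the remaining regressors (with an intercept)
Z = np.column_stack([np.ones(n), x2, x3])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ g
R2_aux = 1 - resid @ resid / np.sum((x1 - x1.mean()) ** 2)
vif = 1 / (1 - R2_aux)                           # variance inflation factor for x1
print(R2_aux, vif)                               # a large VIF signals multicollinearity
```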

3.5.7 Problem of Heteroscedasticity

The Classical Linear Regression Model has a significant underlying assumption that all the error terms are identically distributed with zero mean and the same standard deviation σ (or variance σ²). The second part of the assumption, that the errors have a constant standard deviation or variance, is known as the assumption of homoscedasticity. What happens when this assumption does not hold? In symbolic terms, E(ui²) = σi², i = 1, ..., n; that is, the expectation of the squared errors is no longer equal to a common σ²: each error term has its own variance σi², which varies from observation to observation. It has been observed that time series data usually do not suffer from this problem of heteroscedasticity, but in cross-section data the problem may assume serious dimensions.

The consequences of heteroscedasticity: if the assumption of homoscedasticity does not hold, we observe the following impact on the OLS estimators.

1. They are still linear.
2. They are still unbiased.
3. But they no longer have minimum variance; that is, we cannot call them BLUE, the Best Linear Unbiased Estimators. In fact, this point is relevant both for small and for large samples.
4. The reason for the problem hinted at in (3) above is that, generally, some bias is built into the estimating formulae, which we try to rectify by making use of degrees of freedom. For instance, the estimator of the true population σ², given by ∑ei²/df, no longer remains unbiased; and this very estimator enters into the calculation of the standard errors of the OLS estimates.
5. Since the estimates of the standard errors are themselves no longer reliable, we may end up drawing wrong conclusions while using conventional reasoning in procedures for testing hypotheses.

How to detect heteroscedasticity: In applied regression analysis, plotting the residual terms can give us important clues about whether or not one or more assumptions underlying our regression model hold. The pattern exhibited by ei² plotted against the values of the concerned variable can provide an important clue. If no pattern is detected, homoscedasticity holds, i.e., heteroscedasticity is absent. On the other hand, if the errors form a pattern with the values of the variable, such as expanding, increasing linearly or changing in some non-linear manner, there is a distinct possibility of the presence of heteroscedasticity. Some statistical tests have been designed to detect the presence of heteroscedasticity. Some of the prominent ones are the Park test, the Glejser test, White's general test, Spearman's rank correlation test and the Goldfeld-Quandt test. But here the limitation of space does not permit us to go into their details; we are forced to refer the learner again to the course on Basic Econometrics.

How to tackle heteroscedasticity? Our ability to tackle the problem will depend upon the assumptions that we can make about the error variance. Thus, the following situations may emerge:

i) When σi² is known: Here the CLRM Yi = β0 + β1Xi + ui can be transformed by dividing each term by the corresponding σi. Thus,

Yi/σi = β0(1/σi) + β1(Xi/σi) + ui/σi

This effectively transforms the error term to ui/σi, which can be shown to be homoscedastic; therefore the OLS estimators applied to the transformed model are free of the disability caused by heteroscedasticity. The estimates of β0 and β1 obtained in this situation are called Weighted Least Squares Estimators (WLSEs).

ii) When σi² is unknown: we make some further assumptions about the error variance.

(a) Error variance proportional to Xi: here the square-root transformation is enough. We divide both sides by √Xi, so the regression line looks like:

Yi/√Xi = β0(1/√Xi) + β1√Xi + ui/√Xi = β0(1/√Xi) + β1√Xi + νi,   where νi = ui/√Xi,

and this is sufficient to address the problem.

(b) Error variance proportional to Xi²: here, instead of dividing by √Xi, we divide both sides by Xi and estimate


Yi/Xi = β0(1/Xi) + β1 + ui/Xi = β1 + β0(1/Xi) + ηi

The error term will be ηi = ui/Xi, and this will be free of heteroscedasticity, thus facilitating the use of classical least squares techniques.

iii) Respecification of the model: Assigning a different functional form to the model, in place of speculating about the nature of the variance, may prove expedient. For example, instead of the original model, we can estimate the model:

ln Yi = β0 + β1 ln Xi + ui

This log-linear model is usually adequate to address our concerns.
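A minimal Python/NumPy sketch, not part of the original unit and using simulated data, illustrates remedy (ii)(a): when the error variance is taken to be proportional to Xi, dividing the model through by √Xi and applying OLS to the transformed variables gives the weighted least squares estimates.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
u = rng.normal(scale=np.sqrt(x))                 # error variance proportional to x (heteroscedastic)
y = 3.0 + 0.5 * x + u                            # simulated data, illustrative only

# Transform: divide both sides by sqrt(x); the new error u/sqrt(x) is homoscedastic
w = np.sqrt(x)
y_star = y / w
X_star = np.column_stack([1.0 / w, x / w])       # columns correspond to beta0 and beta1

b_wls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
print(b_wls)                                     # weighted least squares estimates of (beta0, beta1)
```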

Read: Introduction to Econometrics by Christopher Dougherty (2002), Chapter 8, pp. 220-230, Oxford University Press.

3.5.8 Problem of Autocorrelation

The classical regression model also assumes that the disturbance terms ui do not have any serial correlation. But in many situations this assumption may not hold. The consequences of the presence of serial or auto-correlation are similar to those of heteroscedasticity: the OLS estimators are no longer BLUE. Symbolically, no autocorrelation means E(ui uj) = 0 when i ≠ j. Autocorrelation can arise in economic data on account of many factors:

i) Inertia is a major reason for the presence of autocorrelation. An economic time series generally displays a cyclical pattern of upswings and downswings for various reasons, and these swings have a tendency to continue. This tendency is called inertia.

ii) Specification bias is an important source of autocorrelation. Specification bias may be due to under-specification or to the use of an incorrect functional form of the model. For example, one might use only a few explanatory variables and thereby exclude rather large systematic components, which get clubbed with the errors. Similarly, one might use a linear form instead of a non-linear form for the model.

iii) Cobweb Phenomenon is another factor which may also give rise to the problem

of autocorrelation in certain types of economic time series (especially agricultural output and the like).

iv) Polishing (smoothing) of data also sometimes results in the presence of autocorrelation. We sometimes manipulate monthly data to obtain quarterly data, or similarly manipulate quarterly data to derive a half-yearly series, and the like. Such manipulations usually involve some averaging procedure. This may also be responsible for autocorrelation, because the averaging process dampens the fluctuations of the original data.

The consequences of autocorrelation are not different from those of heteroscedasticity listed in sub-section 3.5.7 above. Here too the OLS estimators, though still linear and unbiased, no longer have minimum variance and consequently are not BLUE. In fact, the t and F tests cease to be reliable. As a result, one can no longer depend upon the computed value of R2 as a true indicator of goodness of fit. There are many tests for detecting autocorrelation, such as visual inspection of the error plots, the Runs test and the Swed-Eisenhart critical runs test. But the test most commonly used is the Durbin-Watson d test. This is defined as:

d = ∑t=2..n (et - e(t-1))² / ∑t=1..n et²

However, again, we hold back the details of the practical detection and avoidance of the problem of autocorrelation for reasons of limitation of space here.

Read: Introduction to Econometrics by Christopher Dougherty (2002), Chapter 13, pp. 337-358, Oxford University Press.
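A short Python/NumPy sketch, not part of the original unit, of the Durbin-Watson d statistic defined above, computed from a residual series; the residuals here are simulated with positive serial correlation purely for illustration. Values of d near 2 suggest no first-order autocorrelation.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Illustrative residual series with positive autocorrelation (AR(1)-type, simulated)
rng = np.random.default_rng(5)
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.7 * e[t - 1] + rng.normal()

print(durbin_watson(e))   # well below 2, indicating positive autocorrelation
```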

3.6 FURTHER SUGGESTED READINGS

1. Maddala, G.S. (2002), Introduction to Econometrics, Third Edition, Chapter 3 and Chapter 4, John Wiley & Sons Ltd., West Sussex.

2. Pindyck, Robert S. and Rubinfeld, Daniel L. (1991), Econometric Models and Economic Forecasts, third edition, Chapter 1, McGraw-Hill, New York, U.S.A.

3. Ramanathan, Ramu (2002), Introductory Econometrics, fifth edition, Chapters 8, 9, 10 and 14, Cengage Learning Private Limited, New Delhi.

4. Karmel, P.H. and Polasek, M. (1986), Applied Statistics for Economists, fourth edition, Chapter 8, Khasala Publishing House, New Delhi.

3.7 MODEL QUESTIONS

1. State the various forms of regression models. When will you use a log linear regression model? Give an illustration in support of your answer.

2. How do you interpret the estimated slope coefficient of a log linear regression model?

3. If you want to estimate India’s rate of growth of per capita income during the period 1990-2008, what should be the functional form of your regression model?

4. How do you interpret the coefficients of a multiple regression model? Give an example in support of your answer.



5. What is multicollinearity? What are its consequences?

6. What are the consequences of heteroscedasticity? How will you tackle the problem of heteroscedasticity?

7. "Inclusion of more variables always increases R2, the goodness of fit. So, to make a regression model 'good', what we need to do is simply increase the number of explanatory variables." Do you agree or disagree with this statement? Give reasons.

8. How do you interpret the coefficients of a multiple regression model?

9. From a sample of 209 firms, the following regression results are given:

   log(salary) = 4.32 + 0.280 log(sales) + 0.0174 roe + 0.00024 ros
   se = (0.32) (0.035) (0.0041) (0.00054)
   R2 = 0.283

   where salary = salary of the CEO, sales = annual firm sales, roe = return on equity in per cent, ros = return on the firm's stock, and the figures in parentheses are the estimated standard errors.

   a. Interpret the preceding regression, taking into account any prior expectations that you may have about the signs of the various coefficients.
   b. Which of the coefficients are individually statistically significant at the 5 per cent level?
   c. What is the overall significance of the regression? Which test do you use, and why?


BLOCK 04 QUALITATIVE METHODS

Structure

4.0 Objectives
4.1 Introduction
4.2 An Overview of the Block
4.3 Research Approaches
 4.3.1 Philosophical Foundation
 4.3.2 Frameworks for Qualitative Research
 4.3.3 Research Strategies
 4.3.4 Methods of Qualitative Research
4.4 Participatory Rural Appraisal (PRA) Approach
 4.4.1 Rapid Rural Appraisal and Participatory Rural Appraisal
 4.4.2 Other Streams of PRA
 4.4.3 Principles of PRA
 4.4.4 Organizing PRA
 4.4.5 Methods and Techniques of PRA
 4.4.6 Sequence of Techniques
 4.4.7 Practical Applications
 4.4.8 Validity and Reliability
 4.4.9 Vulnerability and Risks
 4.4.10 Challenges
4.5 Case Study Method
 4.5.1 Types of Case Studies
 4.5.2 Case Study Design
 4.5.3 Components of Case Studies
 4.5.4 Sources of Evidence
 4.5.5 Principles of Case Studies
 4.5.6 Steps of Case Studies
4.6 Further Suggested Readings
4.7 Model Questions

4.0 OBJECTIVES

The main objectives of this block are to:

• apprise you of the philosophical foundations and research perspectives guiding qualitative research,
• explain the various principles governing the participatory method of the qualitative approach,
• discuss the process and stages involved in the participatory method,
• apply the various tools and techniques of the PRA approach in research,
• appreciate the limitations and challenges faced in the participatory method, and
• explain the principles, research design and the steps involved in conducting studies by applying the case study method.


4.1 INTRODUCTION

Research methodology deals with the branch of philosophy that analyses the principles and procedures of scientific inquiry in a particular discipline, with a set of pedagogy for understanding complex reality. Principles and procedures of scientific enquiry tend to unfold the causality of factors so as to understand complex phenomena through empirical evidence and its validation. Empirical evidence is captured through quantitative and qualitative approaches and variables. The first three blocks of this course cover the different aspects of the quantitative approach, namely foundations of research methods, data collection, and analysis of data through quantitative methods. The quantitative approach broadly deals with data and sampling errors, but the reliability of its data still suffers on account of non-sampling errors, which it is ill-equipped to handle. The qualitative approach, on the other hand, is an in-depth scientific enquiry into complex events, their dimensions and variables, which are difficult to capture through a cardinal or quantitative approach. For example, it is easy to collect data on the income and expenditure of households by canvassing simple structured questions in the Keynesian framework of psychological laws, whereas complex events and their qualitative dimensions are difficult to capture through such structured questions. Moreover, traditional quantitative research methods are considered time-consuming exercises whose results are often too late to be relevant for a particular time-bound policy drive. These methods also involve the high cost of formal surveys. Keeping in view all these limitations of the quantitative approach, this block covers the qualitative approach to research methods, tools and techniques of data collection, formatting, processing and analysis of data, report writing, etc. The block focuses on a few important methods of the qualitative approach: participatory rural appraisal (PRA) and the case study method (CSM).

4.2 AN OVERVIEW OF THE BLOCK

The major difference between the quantitative and qualitative approaches lies in the underlying beliefs and assumptions, the framework guiding research and the methodological prescriptions. Critical theory and interpretivism have emerged as alternative paradigms to positivism and post-positivism in the context of the qualitative approach. The qualitative approach is an in-depth scientific enquiry into complex events, their dimensions and variables, which are difficult to capture through the quantitative approach. Qualitative research methods can be put broadly under two categories: (i) traditional established methods like ethnography (including case studies), interviewing, history and historiography; and (ii) emerging qualitative methods, especially participatory methods such as Participatory Rural Appraisal (PRA). These methods are used in conducting research in the areas of agriculture, rural development, health, nutrition, agro-forestry, natural resource assessment, emergencies and disasters, etc. Hence, the principles guiding the PRA approach and the methods used for data collection and analysis are discussed in this block. The case study is another important method for probing issues; the underlying principles and steps involved in conducting case studies have therefore also been taken up in this block.

4.3 RESEARCH APPROACHES: QUANTITATIVE AND QUALITATIVE

The major difference between the quantitative approach and the qualitative approach is not the type of data used or preferred but is much broader and deeper. It lies in the underlying beliefs and foundational assumptions (i.e., the paradigm) that guide the use of a particular research method and are assumed to be true.
A paradigm is a comprehensive belief system, world view or framework that guides research and practice in the field. It consists of:

• At the basic or fundamental level, a philosophy of science that makes a number of assumptions about fundamental issues relating to the nature and characteristics of truth or reality (ontology) and a theory of knowledge dealing with how we can know the things that exist (epistemology).

• A world view or framework that guides research and practice in the field, and
• General methodological prescriptions, including instrumental techniques, about how to conduct work within the paradigm.

Since knowledge of philosophical foundations, frameworks and paradigms enables us to understand when, how and where a particular method will be appropriate, a brief discussion of these three components is desirable.

4.3.1 Philosophical Foundation

Ontology and Epistemology: Ontology and epistemology are the two major aspects of the branch of philosophy called metaphysics. Ontology is concerned with the nature of reality, whereas epistemology refers to a theory of knowledge: how human beings come to have knowledge of the world around them. Two theories of knowledge have predominated in philosophical discourse: rationalism and empiricism. Rationalism is based on the idea that reliable knowledge is derived from the use of pure reason, establishing indisputable axioms and then using formal logic to arrive at conclusions. Empiricism, on the other hand, relies on the use of the human senses to produce reliable knowledge. These philosophical positions can be further elaborated in terms of two dominant epistemological positions and their associated ontological positions, materialism and idealism, thus generating a four-way classification scheme.

(i) Empiricism: Materialist ontology and nominalist epistemology together constitute empiricism. Under this position reality is viewed as being constituted of material things to be observed by the human senses.
(ii) Substantialism: Again the view that 'matter constitutes reality' is adopted, but here people in different times and places can interpret reality differently.
(iii) Subjectivism: Since reality here is viewed as socially constructed and interpreted, knowledge of this reality is available from the accounts that social actors provide.
(iv) Rationalism: Under this position 'reality is made up of ideas'. It is believed to exist independently of people and their consciousness. Knowledge can be obtained only by examining thought processes.

These four positions are associated with the major paradigms (philosophies of science) in the following manner:

Empiricism – Positivism and Post-positivism (or falsificationism)
Substantialism – Critical Realism, Critical Theory
Subjectivism – Interpretivism

The exact number of world views and the names associated with a particular paradigm vary from author to author, but three paradigms are important in the context of the qualitative approach to research:

• Positivism and Post-positivism
• Critical Theory
• Interpretivism

Qualitative research is sometimes described as interpretive, critical or postmodern research, whereas quantitative research is often called empirical, positivist, post-positivist or objectivist. There are important differences between positivism and post-positivism on the one hand and postmodernism and interpretivism on the other; however, these differences are less important than the similarities. Critical theory and interpretivism are the most important paradigms in qualitative research. The distinctive features of these paradigms are:

• They differ on the question of reality.
• They offer different reasons or purposes for doing research.
• They point us to quite different types of data and methods as being valuable and worthwhile.
• They have different ways of deriving meaning from the collected data.
• They vary in the relationship between research and practice.

The above three paradigms have been the dominant guiding frameworks in research in the social sciences.

Differences between Post-positivism and Critical Theory on the Five Major Issues
• Nature of reality: Post-positivism – material and external to the human mind; Critical Theory – material and external to the human mind
• Purpose of research: Post-positivism – find universals; Critical Theory – uncover local instances of universal power relationships and empower the oppressed
• Acceptable methods and data: Post-positivism – scientific method, objective data; Critical Theory – subjective inquiry based on ideology and values; both quantitative and qualitative data are acceptable
• Meaning of data: Post-positivism – falsification, used to test theory; Critical Theory – interpreted through ideology, used to enlighten and emancipate
• Relationship of research to practice: Post-positivism – separate activities, research guides practice; Critical Theory – integrated activities, research guides practice
(Source: Foundations of Qualitative Research by Jerry W. Willis (2007), p. 83)

Differences between Post-positivism and Interpretivism on the Five Major Issues
• Nature of reality: Post-positivism – external to the human mind; Interpretivism – socially constructed
• Purpose of research: Post-positivism – find universals; Interpretivism – reflect understanding
• Acceptable methods and data: Post-positivism – scientific method; Interpretivism – subjective and objective research methods are acceptable
• Meaning of data: Post-positivism – falsification, used to test theory; Interpretivism – understanding is contextual, universals are de-emphasized
• Relationship of research to practice: Post-positivism – separate activities, research guides practice; Interpretivism – integrated activities, both guide and become the other
(Source: Foundations of Qualitative Research by Jerry W. Willis (2007), p. 95)

4.3.2 Frameworks for Qualitative Research


Qualitative researchers have the option of choosing conceptual frameworks from various traditions. A framework is a set of broad concepts that guide research. Researchers working within the interpretive and critical theory paradigms have a number of frameworks to choose from. There are many commonalities among these frameworks, but there are many differences too, and these differences often lead to the development and use of different research methods. The important frameworks that appeal to a number of researchers today include:

• Analytic realism
• The interpretive perspective
• Eisner's connoisseurship model of inquiry
• Semiotics
• Structuralism
• Post-structuralism and postmodernism

All these frameworks put forward different options before a qualitative researcher and point towards certain research methods, goals and topics. Under the positivist or post-positivist paradigm, a researcher undertakes research by stressing the variables, hypotheses and propositions derived from a particular theory that sees the world in terms of cause and effect. Under the interpretive paradigm, on the other hand, emphasis is laid on socially constructed realities, inter-subjectivity, local generalisations, practical reasoning and ordinary talk. Critical researchers underline the importance of terms like action, structure, culture and power, fitted into a general model of society. Under the feminist perspective, research focuses on gender, reflexivity, emotion and an action orientation. All these frameworks have different applications in different disciplines of the social sciences and provide options as to research methods, goals and topics. Two characteristics, the search for contextual understanding rather than universal laws and the correspondingly different research design, make the qualitative approach distinct from the post-positivist quantitative approach.

Read: Chapter 5, Frameworks for Qualitative Research, in Foundations of Qualitative Research by Jerry W. Willis (2007), Sage Publications, pp. 147-181.

4.3.3 Research Strategies

The four research strategies work with different ontological assumptions:

Positivism – Induction
Falsification – Deduction
Critical Realism – Retroduction
Interpretivism – Abduction

Each strategy has a different starting point:

1. The inductive strategy begins with the collection of data, from which a generalisation is made that can be used as an elementary explanation.

2. The deductive strategy starts with theory that provides a possible answer. The theory is tested in the context of a research problem by collection of relevant data.

3. The retroductive strategy starts out with a hypothetical model of a mechanism that could explain the occurrence of the phenomenon under investigation.


4. The abductive strategy starts by laying out the concepts and meanings that are contained in social actors' accounts of activities related to a research problem.

Read: Philosophy of Social Research (pp.816 to 820) in The Sage Encyclopedia of Social Science Research Methods by Michael S. Lewis-Beck, Alan Bryman, Tim Futing Liao (ed.), Sage Publications (2004)

4.3.4 Methods of Qualitative Research

Methods of Qualitative Research

I. Established Qualitative Research Methods
• Ethnography
• Interviewing (structured, semi-structured and open)
• History and historiography

II. Emerging Qualitative Methods
A. Participatory methods
B. Emancipatory methods
 (i) Critical emancipatory research
 (ii) Feminist and standpoint research
 (iii) Critical action research

The traditional qualitative research methods under category I have been used mostly in different disciplines of the social sciences like anthropology, sociology, psychology, education and history. Similarly, the historiography method is useful for investigating issues in history. Emancipatory research, developed from the critical theory perspective, is useful in studying, for example, the worker-management relationship in a large factory, and is based on the assumption that research should lead to greater freedom and control on the part of the participants. With the emergence of the interpretive and critical theory approaches as alternatives to positivism and post-positivism, the participative approach is being increasingly used to conduct evaluative research studies in economics, involving in the research process the people who have been the subjects of the research. Rapid Rural Appraisal and Participatory Rural Appraisal are alternatives to the sample survey approach for collecting data in village-level studies, particularly when results are needed in emergent situations like floods and earthquakes. Hence, in the further discussion of this block, we shall take up the two important methods – PRA and the case study method.

Read: Chapter 7, Methods of Qualitative Research, in Foundations of Qualitative Research by Jerry W. Willis (2007), Sage Publications, pp. 260-278.


4.4 PARTICIPATORY RURAL APPRAISAL (PRA) METHOD

Participation is now widely accepted as a philosophy and mode of development research. One practical set of approaches which evolved and spread in the early 1990s is termed Participatory Rural Appraisal (PRA). The term describes a growing family of approaches and methods to enable local people to share, enhance and analyze their knowledge of life and conditions, to plan and to act. PRA flows from and owes much to the traditions and methods of participatory research. PRA has many sources; important among them are:

- Rapid Rural Appraisal
- Activist participatory research
- Agroecosystem analysis
- Applied anthropology
- Field research on farming systems

Read: Participatory Rural Appraisal (PRA): Analysis of Experience by Robert Chambers, World Development, vol. 22, no. 9, pp. 1253-1268, 1994.

4.4.1 Rapid Rural Appraisal (RRA) and Participatory Rural Appraisal (PRA)

PRA has evolved from RRA. RRA itself began in the late 1970s and early 1980s as a response to the biased perceptions derived from rural development tourism and to the many defects and high costs of large-scale questionnaire surveys. The basic distinction between PRA and RRA is that in RRA information is elicited and extracted by outsiders, while in PRA it is shared and owned by local people, who conduct their own analysis and often plan and take action. In this sense, PRA often implies radical personal and institutional change. The comparison between the two approaches is summarised in the following table.

Table 4.1: RRA and PRA compared
• Period of major development: RRA – late 1970s, 1980s; PRA – late 1980s, 1990s
• Major innovators based in: RRA – universities; PRA – NGOs
• Main users at first: RRA – aid agencies, universities; PRA – NGOs, government field organisations
• Key resource earlier undervalued: RRA – local people's knowledge; PRA – local people's analytical capabilities
• Main innovations: RRA – methods; PRA – behaviour, team management, experiential training
• Predominant mode: RRA – elicitive, extractive; PRA – facilitating, participatory
• Ideal objectives: RRA – learning by outsiders; PRA – empowerment of local people
• Longer-term outcomes: RRA – plans, projects, publications; PRA – sustainable local action and institutions
Source: (Chambers 1994a: 958)

In short, RRA methods are more verbal, with outsiders more active, while PRA methods are more visual, with local people more active. The methods of the two approaches are broadly shared. Thus:

(i) The RRA approach is extractive-elicitive in nature, wherein data are collected by outsiders.

(ii) PRA is a sharing-empowering approach, where the main objectives are variously investigation, analysis, learning, planning, action, monitoring and evaluation by insiders.

In practice, there is a continuum between RRA and PRA, in the following manner:

Table 4.2: RRA and PRA continuum
• Mode: RRA – extractive, elicitive; PRA – sharing, empowering
• Outsider's role: RRA – investigator; PRA – facilitator
• Information owned, analysed and used by: RRA – outsiders; PRA – local people
• Methods used: RRA – mainly RRA, sometimes PRA; PRA – mainly PRA, sometimes RRA
Source: (Chambers 1994a: 959)

RRA has its own advantage for macro policy decisions, but PRA not only takes variations into account but also ensures ownership of the findings and analysis by local people as the end users. In RRA, outsiders remain the moving force, and the ownership of the findings therefore may not rest with local people in the same way. However, there is also a continuum between RRA and PRA, as shown in Table 4.2.

4.4.2 Other Streams of PRA

(a) Activist participatory research

The term "activist participatory research" refers to a family of approaches and methods which use dialogue and participatory research to enhance people's awareness and confidence, and to empower their action. The contributions of the activist participatory research stream to PRA have been more through concepts than methods. They have in common three prescriptive ideas:
- that poor people are creative and capable, and can and should do much of their own investigation, analysis and planning;
- that outsiders have roles as conveners, catalysts and facilitators;
- that the weak and marginalized can and should be empowered.

(b) Agroecosystem analysis


It was developed in Thailand from 1978 onwards and has combined analysis of ecology and system properties with pattern analysis of space, time, flows and relationships, relative values and decisions. It has contributed significantly to RRA and PRA, particularly transects, informal mapping, diagramming, scoring and ranking. Some of the major contributions of agroecosystem analysis to current RRA and PRA have been:
- transects (systematic walks and observation);
- informal mapping (sketch maps drawn on site);
- diagramming (seasonal calendars, flow and causal diagrams, bar charts, Venn or chapati diagrams);
- innovation assessment (scoring and ranking of different actions).

(c) Applied anthropology

Applied anthropology took centre stage in social anthropology in the 1980s. Rapid assessment procedures (RAP) and rapid ethnographic assessment (REA) were adopted in the field of health and nutrition. In these exercises conversation, observation, informal interviews, focus groups, etc., were used for data collection. The ideas of field learning, participant observation, the importance of attitudes, behaviour and rapport, and the validity of local knowledge are its major contributions to RRA and PRA. Some of the many insights and contributions coming from and shared with social anthropology have been:
- the idea of field learning as a flexible art rather than a rigid science;
- the value of field residence, unhurried participant observation, and conversations;
- the importance of attitudes, behaviour and rapport;
- the emic-etic distinction;
- the validity of indigenous technical knowledge.

(d) Field research on farming systems

This is a multi-disciplinary approach to complex and diversified problems, with systematized methods for investigating, understanding, and prescribing for farming system complexity. In this method, farmers' capabilities for experimentation are recognized. Field research on farming systems has contributed the appreciation and understanding of:
- the complexity, diversity and risk-proneness of many farming systems;
- the knowledge, professionalism and rationality of small and poor farmers;
- their experimental mindset and behaviour;
- their ability to conduct their own analyses.

4.4.3 Principles of PRA

The principles of PRA evolved in the course of experiments and of results drawn from experience with development tourism. The list of principles therefore varies from practitioner to practitioner and over time as it evolved. However, there are certain commonalities shared by most of them.

(i) Principles shared by RRA and PRA:

(a) Reversal of learning: This is a departure from the dominant paradigm of learning from formal institutions and from consolidated published information. In this approach, face-to-face learning from the people takes place on site, drawing on their local physical, technical and social knowledge and analysis.

(b) Learning rapidly and progressively: PRA has an inbuilt, adaptable, rapid learning process with conscious exploration, flexible use of methods, improvisation, iteration and cross-checking, rather than strictly following a blueprint programme.


(c) Offsetting biases: This is a paradigm of offsetting the biases of development tourism by being relaxed and not rushing, listening instead of lecturing, probing intensively, not being arrogant, and listening to poor and common people instead of being influenced by the powerful and the rich.

(d) Optimizing trade-offs: This approach is based on trade-offs between the cost of learning and the usefulness of information, quantity, relevance, accuracy, and timeliness. It also includes the principles of optimal ignorance and appropriate imprecision, i.e., it is better to be approximately right than precisely wrong.

(e) Triangulation: PRA is based on a triangulated framework of cross-checking, progressive learning and approximation through plural investigation: of sets of conditions, of distributions, and of individuals or groups.

(f) Seeking diversity: This approach is based on maximising variability rather than averages. It applies purposive sampling, not in a strictly statistical sense, but looking for contradictions, anomalies and differences.

(ii) Additionally evolved and stressed principles in PRA: In addition to the above, PRA places special emphasis on the following four principles:

(a) Self-analysis by the subjects: The investigation, its results and the analysis are carried out by the local people, who own them. This assumes, and reposes confidence in, their capabilities of doing these exercises. Facilitators simply initiate the process and then become passive, sitting back or walking away from the scene, ensuring the least interruption.

(b) Self-critical awareness: Facilitators remain self-critical, examining and correcting their failures and dominant behaviour.

(c) Personal responsibility: PRA is not a manual-based approach but is based on personal responsibility and best judgement at all times, and

(d) Sharing: It is based on the sharing of information and ideas between local people and facilitators, and also between different practitioners of different regions and countries.

4.4.4 Organizing PRA

A PRA activity involves a team of people working for two to three weeks on workshop discussions, analyses, and fieldwork. Several organizational aspects should be considered:

• Logistical arrangements should consider nearby accommodations, arrangements for lunch for fieldwork days, sufficient vehicles, portable computers, funds to purchase refreshments for community meetings during the PRA, and supplies such as flip chart paper and markers.

• Training of team members may be required, particularly if the PRA has the second objective of training in addition to data collection.

• PRA results are influenced by the length of time allowed to conduct the exercise, scheduling and assignment of report writing, and critical analysis of all data, conclusions, and recommendations.

• A PRA covering relatively few topics in a small area (perhaps two to four communities) should take ten days to four weeks, but one with a wider scope over a larger area can take several months. Allow five days for an introductory workshop if training is involved.


• Reports are best written immediately after the fieldwork period, based on notes from PRA team members. A preliminary field report should be available within a week or so. The final report should be made available to all the participants and the local institutions involved.

4.4.5 Methods and Techniques of PRA

The more developed and tested methods of PRA include participatory mapping and modelling, transect walks, matrix scoring, well-being grouping and ranking, seasonal calendars, institutional diagramming, trend and change analysis, and analytical diagramming, all undertaken by local people. Broadly, four categories of techniques are applied in PRA. Hundreds of participatory techniques and tools have been developed for a variety of occasions and are taught in training courses around the world. These techniques are divided into four categories:

• Group dynamics, e.g. learning contracts, role reversals, feedback sessions
• Sampling, e.g. transect walks, wealth ranking, social mapping
• Interviewing, e.g. focus group discussions, semi-structured interviews, triangulation
• Visualization, e.g. Venn diagrams, matrix scoring, time lines

In order to ensure that people are not excluded from participation, these techniques generally avoid writing wherever possible, relying on the tools of oral communication like pictures, symbols, physical objects and group memory. Efforts are made in many projects, however, to build a bridge to formal literacy, for example by teaching people how to sign their names or recognize their signatures. The tools of data collection for RRA and PRA often overlap and are supplementary in nature, but the basic difference lies in the forms of ownership and the end user.

Team contracts and interactions: It is always better to decide collectively and unanimously, with permissible variation, about certain norms and behaviour before the team proceeds for PRA. The team may agree on how to interact on issues, with what distance or closeness, the mode of discussion and its consolidation, the division of labour, etc.

Role reversal: It is very important that, unlike in development tourism, the catalysts in PRA are always in learning mode. They are merely facilitators. Therefore, dominant behaviour and the arrogance of being the custodian of solutions and knowledge are not acceptable. A deliberate effort to change behaviour and attitude is essential for a successful PRA. Facilitators may initiate a discussion and leave the site, or sit back passively but observe carefully. No manual is strictly followed; the group may decide collectively whatever it finds effective.

Feedback sessions: In order to review and consolidate progress, feedback sessions are an essential element of PRA.

Transect walks: In this exercise, facilitators create an environment in which the surroundings are discussed and learnt about with local people while walking through the village with people of the area. This gives empowerment and an opportunity to local people to get involved in sharing their observations in discussions about the features of the village: the quality of its resources, infrastructure, technological levels in production, patterns of use, production and productivity, life styles, customs and festivals, public sharing events, leadership qualities, problems of the locality, solutions, hurdles, etc. Facilitators have to be patient


listeners during transect walks, providing opportunities to local people to present information in various forms: mapping, modelling, using symbols, etc. Thus, this exercise empowers local people to consolidate information with collective wisdom and to initiate data collection themselves.

Identification of key informants: Identifying articulate experts of the area is an important task on which the success and quality of a PRA depend. Key informants are identified through participatory social mapping of a village.

Social mapping: This provides a basis for household listings, and for indicating population, social group, health and other household characteristics. It can lead to the identification of key informants and discussions with them. A village social map provides an updated household listing to be used for well-being or wealth ranking of households; based on these lists, focus groups consisting of different categories of people are formed. These groups express their different preferences, leading to discussion, negotiation and reconciliation of priorities. Resource maps help in understanding the natural and environmental setting of a particular village.

Well-being or wealth grouping and ranking: In this exercise people participate in identifying households with different levels of well-being or wealth. Local wisdom is needed to identify whether a household is landless or has a farm of a particular size, is poor or poorest of the poor, etc. This task may also identify the basis and indicators of well-being. The value of a particular activity or item is ranked according to a range of criteria; for example, a range of different land care group activities could be assessed against a set of criteria such as attendance rate, cost and value to members.

Group participation: Participation of a heterogeneous cross-section of people from the community, with a mix of different levels of seriousness and variety, is essential for participatory collective feedback. This method has been used by many streams of research.

Do it yourself: In order to make the exercise participatory, the catalyst needs to ask to be taught, to be taught, and to perform village tasks: agricultural operations, hut thatching and dressing, fetching water, collecting fuel, stitching and washing clothes, etc.

Local people do it: Villagers perform as investigators and researchers. They transect, interview, observe, analyse data, and present results.

Semi-structured interviews: This is one of the most important techniques, generally based on a visual and oral framework of participation. A tentative checklist with categorical but open-ended options is the core of participation, allowing space and flexibility for narratives and symbols.

Participatory analysis of secondary sources: Files, reports, maps, aerial photographs, satellite imagery, articles and books are used for analysis. Mostly, data in pictorial form are analysed by the community, in which the illiterate can participate equally.

Participatory mapping and modelling: In this exercise local people use the ground, the floor or paper to map social, demographic, health, education, natural resource, service and opportunity information, using locally available symbols, units, etc.


Oral history and ethno-biography: Narratives by local people about an event of change in the village can be one of the best sources for recording collective wisdom. These may include any dimension of local opportunities and resources.

Livelihood analysis: This is one of the core exercises for understanding local realities. Local people generally have a rhythm of life with its opportunities for livelihood. They need to reflect on resources, income, expenditure, credit, difficulties, potential opportunities, possibilities of overcoming crises, their own imagined framework of solutions, etc., in their own convenient and conversational language and units, which can later be standardised if needed for planning and policy.

Participatory linkage diagramming: It is always necessary to coordinate the discussion within a participatory framework with causal connections. Facilitators need to streamline discussions to achieve the goal with permissible variations, but not at the cost of participation.

Venn diagramming: This is an exercise to identify the individuals and institutions which are considered important in carrying out tasks, resolving problems, creating hurdles, etc., and their relations with other individuals and institutions.

Matrix scoring and ranking: Preparing matrices with convenient local symbols and units does not require any formal education or literacy. Information can be symbolised by drawing trees, people, huts, crops, livestock, water, forest, quality of soils, etc. Matrix scoring or ranking elicits villagers' criteria of the value of a class of items (trees, vegetables, fodder grasses, varieties of a crop or animal, sources of credit, market outlets, fuel types), which leads into a discussion of preferences and actions by the implementers and the local community.

Time lines, trend and change analysis: This is an exercise in local people's account of past events, listing the changes they remember, initiated with examples such as changes in the use of technology, production, cropping pattern, infrastructure, demography, values of life, customs, capabilities, employment, consumption pattern, livestock, living conditions, institutions, causes of change, etc. The exercise can be as intensive as the data requirements of the study permit. Hence a tentative checklist with open-ended options is always helpful for consolidating participation effectively.

Seasonal calendars: Information on practices and changes recorded through local calendars is always better in terms of reliability. It may later be converted into standard units, with the precaution that the basic features of the variations are not lost.

Daily time use analysis: Even the village's routine practices of daily life may help in understanding its potential and uses.

Analysis of difference: While conducting well-being ranking or any other PRA exercise, analysis of differences in terms of social groups, gender, farm size, and poor and non-poor households provides insights for understanding problems and deciding priorities, planning and actions.


Contrast and comparison along various dimensions empower local people to understand their problems in those dimensions.

Estimates and quantification: Collecting data in local units by the local people is easier. Even the illiterate can use small pebbles, brick chips, seeds or sticks, or draw lines and tallies on the ground or on walls, for counting data. They manage their accounts with their phenomenal memories of events. Combined with mapping, modelling and matrices, excellent quantification can be achieved by the local people.

Key probes: Direct questions relating to the set objectives of the investigation lead to focused discussion on non-controversial issues. Indirect narratives help in initiating discussion on controversial issues.

Stories, portraits and case studies: Narratives provide excellent insights, for which local people are an inexhaustible treasure. They can reflect, through memory, on events, on what was or was not resolved, and on the outcomes. Case studies help in unfolding and correcting general perceptions.

Presentation and analysis: It is always better that presentation and cross-checking be done by the local people. If they need a little direction, the facilitator can help, but the participatory spirit of the presentation must not be distorted. After the analysis, participatory action planning, budgeting, implementation and monitoring need to be decided with a time frame.

Report writing: Report writing is to be done without delay, in the field itself, collectively dividing assignments among the designated people involved in the process of learning through the PRA sequences. Feedback from the groups and local people is always the correct approach to validate the understanding of problems and solutions.

4.4.6 Sequence of Techniques

PRA techniques can be combined in a number of different orders and ways depending upon the topic focus, goal and objectives under investigation. Some general rules of thumb, however, may be useful. Rapport building is the core of the success of any PRA. Mapping and modelling are good techniques to start with because they involve several people, stimulate much discussion and enthusiasm, provide the PRA team with an overview of the area, and deal with non-controversial information. Maps and models may lead to the identification of key informants, transect walks (perhaps accompanied by some of the people who have constructed the map), and the listing of households. Wealth ranking is best done later in a PRA, once a degree of rapport has been established, given the relative sensitivity of this information; it may be followed by focus group discussion, matrix scoring, and preference ranking. However, the sequence of techniques should be decided by the groups through brainstorming. This exercise of group discussion may take place more than once depending upon the felt need. The group may decide at which stages to include or exclude outsiders while conducting group discussions. The current situation can be shown using maps and models, but subsequent seasonal and historical diagramming exercises can reveal changes and trends throughout a single year or over several years. Preference ranking is a good ice-breaker at the beginning of a group interview and helps focus the discussion. Later, individual interviews can follow up on the different preferences among the group members and the reasons for these differences.

4.4.7 Practical Applications


This approach has been popular in natural resource management, agriculture, the implementation of rural development programmes, poverty eradication and social development: health, education, food security, etc. It has now spread to the creation and management of self-help groups and to the marketing and commercial sectors as well.

Read: The Origins and Practice of Participatory Rural Appraisal, by Robert Chambers, World Development, Vol. 22, No. 7, pp. 953-963, 1994.

4.4.8 Validity and Reliability

Since RRA and PRA are considered closer to reality and draw direct responses from local people, the validity and reliability of data obtained through these approaches are expected to be better. Robert Chambers has reviewed the findings of practitioners and reached the conclusion that the findings of PRA do not differ significantly from those of farm and household surveys conducted through long questionnaires. Even for ranking, participatory village censuses and rainfall data, the results are not significantly different. It is argued that measurement data allow rigorous statistical tools and techniques, which claim better precision than the comparison of preferences through these approaches. However, it is also argued that large-scale data collection by outsiders has been found to carry discrepancies and inherent biases at various levels. Practitioners in these areas have been experimenting and evolving better ways of learning by doing what worked while moving closer to reality. Moreover, these approaches represent reversals of mainstream professional practice towards its opposites. Four clusters of reversals intertwine and mutually reinforce one another: (a) reversal of frames, from etic to emic, i.e., from outsiders' categories to local ones; (b) reversal of modes, from individual to group, from verbal to visual, from measuring to comparing; (c) reversal of relations, from reserve to rapport, from frustration to fun; and (d) reversal of power, from extracting to empowering.

4.4.9 Vulnerability and Risks

PRA and RRA have spread very fast as bottom-up approaches after a brief hesitation in acceptance by academic professionals. However, rapid and rigid adoption carries vulnerabilities and risks. These approaches have become an instant fashion, which makes them vulnerable to being discredited if not applied properly. The word 'rapid' injects an element of rushing towards misleading conclusions. Standardisation carries the risk of formalising codes, methods, checklists and manuals, which would defeat the purpose of letting those on the spot be the best judges.

4.4.10 Challenges

The real challenge is to make the exercise participatory in the true sense. Rural society is so complex and bound up in layers of identity such as caste, religion, power groups and class that it is not easy to make a group truly participatory. Rapport building and allowing space for the poor to participate with their experience and wisdom still have a long way to go. Robert Chambers has considered and listed seven challenges, given below:

(a) Beyond farming system research
(b) Participatory alternatives to questionnaire surveys
(c) Issues of empowerment and equity
(d) Local people as facilitators and trainers
(e) Policy research and change
(f) Personal behaviour, attitudes and learning
(g) PRA in organisations


These approaches may have many shortcomings, but their elements of creative learning, empowerment and ownership of results have made them distinctly different from others.

Read: Participatory Rural Appraisal (PRA): Challenges, Potentials and Paradigm, by Robert Chambers, World Development, Vol. 22, No. 10, pp. 1437-1454, 1994.

4.5 CASE STUDY METHOD

The term 'case study' refers to research that studies a small number of cases, possibly even just one, in considerable depth. Frequently, but not always, case study implies the collection of unstructured data and qualitative analysis of such data. Generally, a case study aims to capture cases in their uniqueness, rather than to use them as a basis for wider empirical or theoretical conclusions. The case study is distinct from the other two types of research design, i.e., surveys and experiments:

- In a case study the number of cases is small, but a large amount of information is collected about the one or few cases, across a wide range. In surveys, on the other hand, a large number of cases are studied but a relatively small amount of data is gathered about each.

- In experimental research, a small number of cases are investigated compared to survey work, but it involves direct control of variables. Here the researcher creates the case(s) studied, whereas in a case study the researcher identifies cases out of naturally occurring social phenomena.

The term case study is used to refer to a variety of different approaches, and it raises some fundamental methodological issues:

- Does the case study aim to produce an account of each case from an external or research point of view, which may contradict the views of the people involved? Or is it solely to portray the character of each case 'in its own terms'?

- Is the case study a method, with advantages and disadvantages, to be used depending on the problem under investigation, or a paradigmatic approach that one simply chooses or rejects on philosophical or political grounds?

Viewed as a method, the specific form of a case study can vary, depending on the purpose it is intended to serve:

• in the number of cases studied;
• in whether there is comparison and, if there is, in the role it plays;
• in how detailed the case studies are;
• in the size of the case(s) dealt with;
• in what researchers treat as the context of the case, how they identify it, and how much they seek to document it;
• in the extent to which case study researchers restrict themselves to description, explanation and/or theory, or engage in evaluation and/or prescription.

When a case study is designed to test or illustrate a theoretical point, it will deal with the case as an instance of a type, describing it in terms of a particular theoretical framework (implicit or explicit). When it is exploratory or concerned with developing theoretical ideas, it is likely to be more detailed and open-ended in character. The same is true when the concern is with describing and/or explaining what is going on in a particular situation for its own sake. When the interest is in some problem in the situation investigated, the discussion will be geared to diagnosing that problem, identifying its sources, and perhaps outlining what can be done about it. Variation in purpose may also inform the selection of cases for investigation.


Read: 'Philosophy of Social Research' (pp. 92-94) in The Sage Encyclopedia of Social Science Research Methods, Vol. 1, Michael S. Lewis-Beck, Alan Bryman and Tim Futing Liao (eds.), Sage Publications (2004).

4.5.1 Types of Case Studies

Broadly, there are three types of case studies: exploratory, explanatory and descriptive. Each of these can be designed as a single-case or a multiple-case study, where multiple cases serve as replications, not as sampled cases.

(a) Exploratory: In this type of case study, fieldwork and data collection may be undertaken prior to defining the research questions and hypotheses. In view of time constraints, a willing and accessible case needs to be identified.

(b) Explanatory: This type of case study is suitable for causal studies.

(c) Descriptive: In this type of case study, the investigator begins with a descriptive theory, or faces the possibility that problems will occur during the project.

Case studies have been widely used in education, law and medicine. Schools of business have been the most aggressive in adopting case-based learning. The method has also been used in the IT sector. Recently, cases of farmers' suicides have been studied to understand the agrarian crisis.

4.5.2 Case Study Design

In the case study method (CSM) there is no rigid requirement on the number of cases to be undertaken. Case studies can follow single- or multiple-case designs, where a multiple-case design must follow a replication rather than a sampling logic. When no other cases are available for replication, the researcher is limited to a single-case design. Unlike survey-based measurement, in CSM the generalisation of results, from either single or multiple designs, is made to theory and not to populations.

4.5.3 Components of Case Studies

R. Yin identified five components of research design that are important for case studies:

• A study's questions
• Its propositions, if any
• Its unit(s) of analysis
• The logic linking the data to the propositions
• The criteria for interpreting the findings

4.5.4 Sources of Evidence

R. Stake and R. Yin identified the following six sources of evidence in case studies:

• Documents
• Archival records
• Interviews
• Direct observation
• Participant observation
• Physical artefacts

4.5.5 Principles of Case Studies

R. Yin emphasised three principles for the case study researcher:


• Show that the analysis relied on all the relevant evidence
• Include all major rival interpretations in the analysis
• Address the most significant aspect of the case study

4.5.6 Steps of Case Studies

Case study researchers have proposed six steps for a case study:

(a) Determine and define the research questions: In this step the researcher identifies the phenomenon and object of study, establishes the focus, and formulates the purposes and questions in the light of the literature review, taking into account historical, social, economic and political contexts, linkages and interrelations.

(b) Select the cases and determine data gathering and analysis techniques: The researcher must determine whether to study cases which are unique in some way or cases which are considered typical, and may also select cases to represent a variety of geographic regions, size parameters or other parameters. The specific case is identified and, where there are multiple cases, each case is treated as a single unit. The researcher must use the designated data gathering tools systematically and properly in collecting the evidence. Throughout the design phase, researchers must ensure that the study is well constructed to ensure construct validity, internal validity, external validity and reliability.

(c) Prepare to collect the data: Case study research generates a large amount of data from multiple sources, so systematic organisation of the data is important to prevent the researcher from becoming overwhelmed by its volume and from losing sight of the original research purpose and questions. Advance preparation assists in handling large amounts of data in a documented and systematic fashion. Researchers prepare databases to assist with categorising, sorting, storing and retrieving data for analysis.

(d) Collect data in the field: Researchers carefully observe the object of the case study and identify causal factors associated with the observed phenomenon. Renegotiation of arrangements with the objects of the study or the addition of questions to interviews may be necessary as the study progresses. Case study research is flexible, but when changes are made, they are documented systematically.

(e) Evaluate and analyse the data: The case study method, with its use of multiple data collection methods and analysis techniques, provides researchers with opportunities to triangulate data in order to strengthen the research findings and conclusions. Researchers categorise, tabulate and recombine data to address the initial propositions or purpose of the study, and conduct cross-checks of facts and of discrepancies in accounts. Focused, short, repeat interviews may be necessary to gather additional data to verify key observations or check a fact. Specific techniques include placing information into arrays, creating matrices of categories, creating flow charts or other displays, and tabulating the frequency of events. Researchers use the quantitative data that have been collected to corroborate and support the qualitative data, which are most useful for understanding the rationale or theory underlying relationships.


(f) Prepare the report: Techniques for composing the report include handling each case as a separate chapter or treating the case as a chronological recounting. Some researchers report the case study as a story. During report preparation, researchers critically examine the document, looking for ways in which the report is incomplete.

Thus, CSM is a useful qualitative method for handling complex phenomena and objects in depth, with their specific variations and multiple dimensions of data, staying close to real life.

4.6 FURTHER SUGGESTED READINGS

• Chambers, Robert (1995): Rural Appraisal: Rapid, Relaxed and Participatory, in Mukherjee, Amitava (ed.), Participatory Rural Appraisal, Vikas Publishing House Pvt. Ltd., New Delhi.

• Crawford, I.M. (1997): Marketing Research and Information Systems (Marketing and Agribusiness Texts - 4), FAO, Rome, Chapter 8: Rapid Rural Appraisal, http://www.fao.org/docrep/W3241E/w3241e08.htm#TopOfPage, accessed on 12.02.2009.

• Gibbs, Graham (2007): Analysing Qualitative Data, Sage Publications.

• Bergman, Manfred Max (ed.) (2008): Advances in Mixed Methods Research, Sage Publications.

• Mukherjee, Amitava (ed.) (1995): Participatory Rural Appraisal, Vikas Publishing House Pvt. Ltd., New Delhi.

• Piore, Michael J.: Qualitative Research: Does It Fit in Economics?, Massachusetts Institute of Technology, http://econ-www.mit.edu/files/1125, accessed on 18.02.2009.

• Soy, Susan K. (1997): The Case Study as a Research Method. Unpublished paper, University of Texas at Austin, [email protected], last updated 02/12/2006.

• Tellis, W. (1997, July): Introduction to Case Study [68 paragraphs]. The Qualitative Report [on-line serial], 3(2). Available: http://www.nova.edu/ssss/QR/QR3-2/tellis1.html

• Flick, Uwe (1998): An Introduction to Qualitative Research, Sage Publications.

• Flick, Uwe (2008): Designing Qualitative Research, Sage Publications.

• Flick, Uwe (2008): Managing Quality in Qualitative Research, Sage Publications.

• Wignaraja, Ponna, Akmal Hussain, Harsh Sethi and Ganeshan Wignaraja (1991): Participatory Development: Learning from South Asia, United Nations University Press.

4.7 MODEL QUESTIONS

1. "Ontological assumptions guide the research strategy to be followed in conducting the research study". Explain.

2. Discuss the different frameworks of research. Which one seems most problematic to you? Give reasons in support of your answer.


3. What is the distinction between the PRA and RRA approaches to qualitative research? Discuss the various methods and techniques of PRA with illustrations.

4. Develop a one- or two-page plan for a research study on a topic of your choice, involving a semi-structured interview as a major source of data.

5. Do you think that a researcher would make more progress using different frameworks for different studies in the field? Give reasons.

6. How do the strategies of qualitative enquiry affect the method of data/material collection?

7. Explain how the interpretivist philosophy of science is a significant departure from post-positivism.

8. Frame a research proposal of your own choice, specifying the purpose of the research, to conduct the study from a critical theory perspective.

9. Frame a research proposal of your own choice, specifying the purpose of the research, to conduct the study from an interpretive perspective.

10. Make a distinction between the case study method and the experimental method. Explain the different steps involved in the case study method.

11. What are the main sources of participatory rural appraisal?


Block 5 DATABASE OF INDIAN ECONOMY

Structure
5.1 Introduction
5.2 Objectives
5.3 An Overview of the Theme
5.4 Macro Variable Data
    5.4.1 The Indian Statistical System
    5.4.2 National Income & Related Macroeconomic Aggregates
    5.4.3 National Income & Levels of Living
    5.4.4 Saving
    5.4.5 Investment
5.5 Agricultural Data
    5.5.1 Introduction
    5.5.2 Agricultural Census
    5.5.3 Studies on Cost of Cultivation
    5.5.4 Annual Estimates of Crop Production
    5.5.5 Livestock Census
    5.5.6 Data on Production of Major Livestock Products
    5.5.7 Agricultural Statistics at a Glance (ASG)
    5.5.8 Another Source of Data on Irrigation
    5.5.9 Other Data on the Agricultural Sector
5.6 Industrial Data
    5.6.1 Introduction
    5.6.2 Data Sources Covering the Entire Industrial Sector
    5.6.3 Factory (Registered) Sector – Annual Survey of Industries (ASI)
    5.6.4 Monthly Production of Selected Industries and Index of Industrial Production (IIP)
    5.6.5 Industrial Credit and Finance
    5.6.6 Contribution to GDP
5.7 Trade
    5.7.1 Introduction
    5.7.2 Merchandise Trade
    5.7.3 Services Trade
    5.7.4 E-Commerce
5.8 Finance
    5.8.1 Introduction
    5.8.2 Public Finances
    5.8.3 Currency, Coinage, Money and Banking
    5.8.4 Financial Markets
5.9 Social Sectors
    5.9.1 Introduction
    5.9.2 Employment, Unemployment & Labour Force
    5.9.3 Education
    5.9.4 Health
    5.9.5 Environment
    5.9.6 Quality of Life
5.10 Let Us Sum Up
5.11 Further Suggested Reading
5.12 References
5.13 Model Questions


5.1 INTRODUCTION

We noted in Block 2 that statistical data constitute an essential input to the research process and discussed the methods and tools of data collection. One of the options, we noted, is to assemble secondary data, that is, data already collected, compiled and published by other agencies, and make use of them if they meet the requirements of the proposed research. A large number of Government agencies and several non-Government agencies collect, compile, analyse and publish data on various aspects of the Indian economy and society. Such data cover the performance of the economy in different directions, socio-cultural trends and the impact of such performance on the levels of living of different sections of society. Let us look in this Block at the kind of data available and at their quality, reliability and timeliness for the purposes for which they are collected.

5.2 OBJECTIVES

After going through this Block, you will be able to:

• know the manner in which the Indian statistical system is organized;
• describe the data on national income and related macro aggregates, saving and investment that are useful for analyzing various aspects of the Indian economy;
• state the use of the input-output transaction table compiled by CSO in economic and econometric analysis;
• appreciate the limitations of estimates of the corresponding state and district level aggregates for assessing regional progress;
• know the different sources of agricultural data and the limitations of their data;
• explain the multivariable industrial data and their reliability;
• describe the kind of data available on trade;
• explain the divergence between RBI's BOP data and DGCI&S data;
• discuss the agencies involved in the compilation of data on finance;
• know the sources of data on various aspects of employment and unemployment, labour welfare, education, health, levels of consumption and environment that determine the quality of life of people; and
• describe the concepts used for collecting data on different variables of the social sector.

5.3 AN OVERVIEW OF THE THEME

We shall look at the database of the Indian economy in this Block. Section 5.4 starts with a short description of the Indian statistical system and then moves on to discuss the kind of data compiled and disseminated on the overall performance of the economy: the macro variables depicting it, like national income and state income, the national and regional accounts, the input-output transaction table depicting inter-relationships between economic activities, instruments facilitating growth like saving and investment, and finally the real test of economic performance, namely the standard of living of the people. The databases of important individual sectors of the economy are dealt with in the subsequent sections. Section 5.5 discusses available data on the production of agricultural
crops, cost of cultivation of crops, agricultural holdings and inputs to agriculture including irrigation, livestock and livestock products, and agricultural credit. Section 5.6 deals with the kind of data available on industrial production and employment and related technical ratios in the organised and unorganised sectors, the index of industrial production and industrial credit. Section 5.7 looks at data on trade – merchandise and services – and at quantum and unit value indices on merchandise trade as well as different measures of the terms of trade. Section 5.8 dwells upon data on the lifeline of the economy – finance. It discusses the availability of data on Central and State Government finances, transactions with the rest of the world, currency, coinage, money, banking and financial markets. The discussion then focuses on the databases of several sub-sectors of the social sector (Section 5.9). First comes the means of participating in, and benefiting from, the development process – employment. Sub-section 5.9.2 discusses data available on employment, unemployment and the quality and adequacy of employment. Sub-section 5.9.3 looks at data on efforts to develop human capabilities – the educational infrastructure, the extent of its utilisation and the impact of this on the educational profile of the population. Sub-section 5.9.4 deals with data on health infrastructure and its impact on the health status of the population. Sub-section 5.9.5 looks at the database of an area that is crucial for the future of human existence itself – environment – and the steps taken to monitor and control pollution. Finally, the information available for an assessment of progress towards the ultimate objective of development planning – the quality of life of the country's people – is discussed in Sub-section 5.9.6. Section 5.10 is a short summing up of the Block. Each section/sub-section ends with a box guiding the reader to relevant portions of one or more publications that contain more details on the subject handled in it. Full details of these publications are indicated in Section 5.12. Section 5.11 is meant to enable the reader to keep in touch with emerging developments relating to the review, refinement and expansion of the database in different aspects/sectors of the economy. Section 5.13 is for evaluation of the reader's knowledge of the subject matter covered in this Block.

5.4 MACRO VARIABLE DATA

5.4.1 The Indian Statistical System

The Indian Statistical System generates data generally through large scale enquiries like the Census or sample surveys of the kind conducted by the National Sample Survey Organisation (NSSO), periodic statutory returns received by Government Departments/organisations and as a by-product of administration at different levels. Someone has to take the lead, in such a situation, to ensure adoption of appropriate standards, concepts and definitions for the phenomena on which statistical data are collected. The necessary institutional structures were created in India in the early Fifties and strengthened over the years. Most recently, the National Statistics Commission (NSC) made detailed recommendations to revamp the Indian statistical system to ensure the quality, reliability and timeliness of data generated by the system.

At the apex of the Indian Statistical System is the permanent National Commission on Statistics (NCS), meant to serve as a nodal body for all core statistical activities of the country,
evolve, monitor and enforce statistical priorities and standards, and to ensure statistical co-ordination among the different agencies involved. The Chief Statistician of India functions as the Secretary to the NCS and also as the ex-officio Secretary, Ministry of Statistics & Programme Implementation (MOSPI). The Statistics Wing of MOSPI, functioning under the guidance of the NCS, is the nodal Ministry in the Government of India for the integrated development of the statistical system in the country, for coordination of the work of the statistical directorates/divisions in the Central Ministries and the State Governments, and for all policy matters relating to the Indian Statistical Institute (ISI). It has under it three organisations: the Central Statistical Organisation (CSO), the National Sample Survey Organisation (NSSO) and the Computer Centre (CC). The State Directorate of Economics & Statistics (SDES) is at the apex of the system at the State level, responsible for coordination of the statistical activities carried on by statistical cells/divisions/directorates in different departments. SDESs have statistical offices in the districts and, in some cases, also in the regions. CSO has revived the Conference of Central and State Statistical Organisations (COCSSO), which is held annually to deliberate on matters relating to the development of statistical data on aspects of the socio-economic life of the country. The agencies concerned disseminate the data they collect, process and analyse to data users in print or in electronic formats. CSO disseminates not only its own data but also those relating to different sectors and aspects of the economy and society published by other Government agencies. So do the Reserve Bank of India (RBI) and several non-Government sources. SDESs provide a similar service in the States.

It would be instructive to judge the Indian situation in an international setting. The United Nations (UN) and its agencies, the International Monetary Fund (IMF), the World Bank and regional agencies publish data on the economies of member countries. The IMF has formulated the "Special Data Dissemination Standards" (SDDS), covering the real sector (national accounts, production index, price indices, etc.), the fiscal sector, the financial sector, the external sector and socio-demographic data, to facilitate transparency in the compilation, dissemination and cross-country comparison of data on important aspects of the economy. Countries under the SDDS provide to the IMF a National Summary Data Page for each area/sub-area listed in the SDDS and the relevant metadata as per a Dissemination Format, and disseminate an advance release calendar on the IMF's Data Dissemination Bulletin Board (DSBB) on the internet. CSO, the Registrar General & Census Commissioner (RGI&CC) and RBI furnish the data required and an advance release calendar for such data. The information provided by any SDDS country can be accessed on the internet (search parameter: Special Data Dissemination Standards IMF).

5.4.2 National Income and Related Macroeconomic Aggregates

(1) System of National Accounts (SNA)

How do we assess the performance of an economy? Trends in the production of a specific product, or trends in the overall index of industrial production, assess performance only in the output of that specific product or of the industrial sector. But we would like to go beyond levels of output or production and look at performance in terms of the incomes flowing from output in the form of rent, wages, interest and profit to those participating in the creation of the output, namely the factors of production – land, labour, capital and entrepreneurship. Alternatively, we would like to base our judgement of performance on the value addition made by the production system, namely the value of output net of the (intermediate) costs incurred in creating it. It is (i) this overall
value addition computed for all sectors/activities of the economy that is referred to as the National Product, (ii) the macro-aggregates related to it and (iii) trends in (i) and (ii) that can help us in analysing the performance of an economy. National Income (NI) is the Net National Product (NNP). The term is also used to refer to the group of macroeconomic aggregates like Gross National Product (GNP), Gross Domestic Product (GDP) and Net Domestic Product (NDP). All of these, of course, refer to the total value (in the sense mentioned above) of the goods and services produced during a period of time, the only differences between the aggregates being depreciation and/or net factor income from abroad (a small numerical sketch of these relationships is given at the end of this sub-section). There are other related macroeconomic aggregates that are of importance in relation to an economy. What data would you, as a researcher or an analyst, like to have about the health of an economy? Besides a measure of the National Product every year or at smaller intervals of time, you would like to know how fast it is growing over time. What are the shares of the national product that flow to labour and the other factors of production? How much of the national income goes to current consumption, how much to saving and how much to building up the capital needed to facilitate future economic growth? What is the role of the different sectors and economic activities – the public and private sectors, the organised and unorganised activities, the households – in the processes that lead to economic growth? How does the level and pattern of economic growth affect or benefit different sections of society? How much money remains in the hands of households for consumption and saving after they have paid their taxes (Personal Disposable Income) – an important indicator of the economic health of households? What is the contribution of different institutions to saving? How is capital formation financed? Such a list of data requirements for analysing trends in the magnitude, quality and prospects of a nation's efforts at economic expansion can be very long. Such data, that is, estimates of national income and related macroeconomic aggregates, form part of a system of National Accounts that gives a comprehensive view of the internal and external transactions of an economy over a period, say a financial year, and of the interrelationships among the macroeconomic aggregates. National Accounts thus constitute an important tool of analysis for judging the performance of an economy vis-à-vis the aims of economic and development policy. The UN has recommended a System of National Accounts (SNA) to promote international standards for compiling national accounts, as an analytical tool, and for international reporting of comparable national accounting data. The 1968 SNA and the 1993 SNA are the second and third (latest) versions. The 15th International Conference of Labour Statisticians (ICLS) (January, 1993) adopted a resolution on statistics of the informal sector, a sector that makes a sizeable contribution to national income, to help member countries of the International Labour Organisation (ILO) in reporting comparable statistics of employment in the informal sector. The UN Statistical Commission (UNSC) and the 1993 SNA endorsed it. India uses a mix of the 1968 SNA and the 1993 SNA in compiling national accounts and is moving towards full implementation of the SNA methodology.

Read Section 13.8, NSC Report (2001), pp. 535 – 543
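The relations among these aggregates reduce to two adjustments: depreciation (consumption of fixed capital) and net factor income from abroad. A minimal Python sketch with purely hypothetical figures (not CSO estimates) is given below to show how the aggregates are derived from one another; the figures and variable names are illustrative assumptions only.

    # Hypothetical figures in Rs. crore; not CSO estimates.
    gdp_mp = 5000.0   # Gross Domestic Product at market prices
    nfia = -50.0      # net factor income from abroad (negative when outflows exceed inflows)
    cfc = 450.0       # consumption of fixed capital (depreciation)
    nit = 300.0       # net indirect taxes (indirect taxes less subsidies)

    gnp_mp = gdp_mp + nfia   # Gross National Product at market prices
    ndp_mp = gdp_mp - cfc    # Net Domestic Product at market prices
    nnp_mp = gnp_mp - cfc    # Net National Product at market prices
    nnp_fc = nnp_mp - nit    # NNP at factor cost, i.e. National Income
    print(gnp_mp, ndp_mp, nnp_mp, nnp_fc)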


(2) Estimates of National Income and Related Macroeconomic Aggregates

(a) Estimates Prepared by CSO

The CSO of MOSPI compiles and publishes the National Accounts, which include estimates of National Income and related macroeconomic aggregates like NNP, GNP, GDP and NDP, consumption expenditure, saving, capital formation and so on, for the country and for the public sector, for every financial year. Quarterly Estimates (Qtly.Es) of GDP are also made. Estimates are prepared for any year at the prices prevailing in that year, that is, estimates at current prices, and also at constant prices, that is, at the prices of a selected year (called the base year); a small illustration of the distinction follows below. CSO changes the base year from time to time to take into account the structural changes in the economy and depict a true picture of the economy. The base year from January 2006 is 1999-2000. Estimates of national accounts aggregates are published in considerable detail in CSO's annual publication National Accounts Statistics (NAS), the latest being NAS 2008. CSO releases through Press Notes every January (on the 30th this year) Quick Estimates (QEs) of GDP, National Income, per capita National Income and Consumption Expenditure by broad economic sectors for the financial year that ended in March of the preceding year (time lag: ten months), and Revised Estimates (REs) of national accounts aggregates for earlier financial years. Further, Advance Estimates (AEs) of GDP, GNP, NNP and per capita NNP at factor cost for the current financial year are released in February, two months before the close of the financial year. (AEs for 2008-09 were released on 9/2/2009.) These AEs are revised thereafter and the updated AEs are released by the end of June, three months after the close of the financial year. Meanwhile, by the end of March, Qtly.Es of GDP for the quarter ending December of the preceding year are also released. Thus by the end of every financial year (31st March), AEs for that financial year, QEs for the preceding financial year and Qtly.Es up to the quarter ending December of the financial year become available. In fact, CSO sets before itself an advance release calendar for the release of national accounts statistics over a period of two years, in line with SDDS requirements. NAS 2008 presents QEs of macroeconomic aggregates for 2006-07, AEs for 2007-08 and Qtly.Es of GDP for 1999-00 to 2007-08, summary statements of GNP, NNP, GDP and NDP at factor cost at constant (1999-00) prices and at market prices, estimates of the components of GDP, aggregates like Government Final Consumption Expenditure (GFCE), Private Final Consumption Expenditure (PFCE) in the domestic market, Exports, Imports, the share of the public sector in GDP, industry-wise GDP and NNP, GDP at crop/item/category level and the consolidated accounts of the nation. CSO's estimates of NDP for rural and urban areas by economic activity at current prices for 1970-71, 1980-81 and 1993-94 are published in NAS 2000. The list of publications of the National Accounts Division (NAD) of CSO can be seen on the MOSPI website.

See Tables in Parts I, II & V, NAS 2008.
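As a simple illustration of the difference between current-price and constant-price estimates, the hedged sketch below deflates a hypothetical current-price GDP series by a price index with 1999-2000 = 100. The figures are invented for illustration and are not CSO data.

    # Hypothetical current-price GDP (Rs. crore) and price index (1999-2000 = 100); not CSO data.
    gdp_current = {"1999-00": 1000000, "2000-01": 1090000, "2001-02": 1180000}
    price_index = {"1999-00": 100.0, "2000-01": 104.5, "2001-02": 108.9}

    # GDP at constant (1999-2000) prices = current-price GDP deflated by the price index.
    gdp_constant = {yr: v * 100.0 / price_index[yr] for yr, v in gdp_current.items()}

    for yr in sorted(gdp_constant):
        print(yr, round(gdp_constant[yr]))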


CSO’s Monthly Abstract of Statistics (MAS) and the annual Statistical Abstract of India (SAI), RBI’s Monthly Bulletin and the Handbook of Statistics on the Indian Economy (2008), RBI website http://www.rbi.org.in), Centre for Monitoring Indian Economy (CMIE) (Economic Intelligence Unit – EIU), Mumbai publication National Income Statistics and the publication of Economic and Political Weekly Research Foundation - EPWRF (EPWRF, December, 2004)”, and www.epwrf.res.in also give time series estimates of national income and related macro aggregates. (c) Limitations of the Estimates:

The concepts and methodology used and the data sources utilised for making these estimates are set out in two publications, namely (CSO 2007) and (CSO, 1999a). The methodology for (i) the new series of National Accounts Statistics with base year 1999-2000 is given in the Brochure on the New Series on NAS (Base Year 1999-2000), (ii) for AEs in NAS 1994, (iii) for estimates of factor incomes in NAS – Factor Incomes (March, 1994) and (iv) for Qtly.Es of GDP in a note in NAS 1999. Besides, the NAS publication of every year has a chapter "Notes on Methodology and Revision in the Estimates". Sections 13.2 & 13.3, Chapter 13, of the NSC Report (pp. 436 to 492) also contain methodological and conceptual details, the data sources utilised in estimating National Income and related macroeconomic aggregates, data gaps and measures to overcome these. Changes in and adoption of improved methodology, expansion of the coverage of the estimates, changes in the base year, improvements in the quality of data and the use of new data sources over the last 50 years have all had a beneficial impact on the quality of national income estimates, that is, on estimating the "true values" of the aggregates as correctly as possible. These efforts can, however, also affect the comparability of estimates over time, although CSO makes every effort to minimise the level of non-comparability.

Read the chapter "Notes on Methodology and Revisions in the Estimates" in CSO (2005), pp. 220 – 228; and also the same chapter in CSO (2008).

(3) The Input–Output Table

Any economic activity depends on inputs from other economic activities for generating its output, and the output from that activity in turn serves as an input for producing the output of other activities. Data on such interrelationships among different sectors of the economy and among different economic activities are thus important for analysing the behaviour of the economy and, therefore, for the formulation of development plans and the setting of targets for macro variables like output, investment and employment. Such an input-output table is also useful for analysing the impact of changes in one sector of the economy, or in one economic activity, on other sectors and indeed on the entire economy. CSO has published an Input-Output Transaction Table (I-OTT) every five years since 1968, the latest being the one relating to 2003-04. It gives, besides the complete table, the methodology adopted, the database used, an analysis of the results and the supplementary tables derived from the I-OTT giving the input structure and the commodity composition of output. The Planning Commission updates and recalibrates the I-OTT, prepares Input-Output Tables (I-OTs) for the base and terminal years of a Five Year Plan and publishes the results of such an exercise as the Technical Note to the Five Year Plan. (The latest is the one for the Tenth Plan.) It contains the relevant I-OT, the methodology adopted and related material. The two I-OTs are useful in economic and econometric analysis.
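To illustrate the kind of analysis an I-OTT supports, the sketch below sets up a tiny two-sector input-coefficient matrix A and uses the standard Leontief relation x = (I - A)^(-1) f to find the gross outputs needed to meet a given final demand, and the change in outputs when final demand changes. The coefficients are invented and bear no relation to CSO's published tables.

    import numpy as np

    # Hypothetical input-coefficient matrix: A[i, j] is the input from sector i
    # required per unit of gross output of sector j (invented numbers).
    A = np.array([[0.20, 0.30],
                  [0.10, 0.25]])
    f = np.array([100.0, 200.0])   # final demand vector (hypothetical units)

    # Leontief relation: gross output x satisfies x = A @ x + f, i.e. x = (I - A)^(-1) f.
    x = np.linalg.solve(np.eye(2) - A, f)

    # Impact analysis: effect on gross outputs of raising sector 2's final demand by 10 units.
    x_new = np.linalg.solve(np.eye(2) - A, f + np.array([0.0, 10.0]))
    print("gross outputs:", x.round(2), "change:", (x_new - x).round(2))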


(4) Regional Accounts – Estimates of State Income and Related Aggregates

(a) Estimates of State Domestic Product (SDP) Prepared and Released by State Governments and Union Territory Administrations

State Accounts Statistics (SAS) consist of various accounts showing the flows of all transactions between the economic agents constituting the State economy, and their stocks. The most important aggregate of SAS is the State Domestic Product (SDP) (State Income). Estimates of GSDP and NSDP at constant and current prices are prepared and published by all SDESs except those of Dadra & Nagar Haveli, Daman & Diu and Lakshadweep. These estimates are also available on the CSO website, in the publications of the preceding section, and in EPWRF (June, 2003) and its CD-ROM.

[Read Section 13.7, Chapter 13, NSC Report, pp. 528 – 535 and Annexures 13.8 to 13.10]

(c) Limitations of Estimates of SDP

The preparation of estimates of SDP calls for more detailed data than the preparation of national level estimates, especially on flows of goods and services and incomes across the geographical boundaries of States/Union Territories. Conceptually, estimates of SDP can be prepared by two approaches: the income originating approach and the income accruing approach. In the former case, the measurement relates to the income originating to the factors of production physically located within the area of a State; in other words, it is the net value of goods and services produced within the State. In the latter case, the measurement relates to the income accruing to the normal residents of a State. The income accruing approach provides a better measure of the welfare of the residents of the State and is also better suited for preparing Human Development Indices (HDI), but it calls for data on inter-State flows of goods and services and incomes, which are not available. Thus only the income originating approach is used in preparing estimates of SDP. This has to be kept in mind while using estimates of SDP. Although efforts have been made by the CSO over the years to bring about a good degree of uniformity across States and Union Territories in SDP concepts and methodology, the SDP estimates of different States are not comparable. The successive Finance Commissions have had comparable estimates of NSDP and per capita NSDP made by CSO for their work (available in the Reports of the successive Finance Commissions). EPWRF (June, 2003) also provides comparable estimates of SDP and compares these with those made for the Finance Commissions. The question of comparability of estimates of SDP is important for econometric work involving inter-State or regional comparisons.

[Read 1. Sections 13.7.1 & 13.7.2, Chapter 13, pp. 528 – 535, NSC Report; 2. Preface and Chapters 5, 7, 8 & 10 in EPWRF (2003); 3. CSO (1974); 4. CSO (1976); 5. CSO (1979); 6. CSO (1980)]


(5) Regional Accounts – Estimates of District Income

The need for preparing estimates of district income has become urgent in the context of decentralisation of governance and the importance of, and the emphasis on, decentralised planning. Estimates of District Domestic Product (DDP) are being prepared by ten SDESs using the income originating approach and are published in State Statistical Handbooks/Abstracts/Economic Surveys and also posted on their websites. (Another State is preparing estimates only for the commodity producing sectors.) It is necessary to adjust these estimates for the flow of incomes across the territories of districts (or States) that are rich in resources like minerals and forests and where there is a daily flow of commuters.

[Read 1. Sub-section 13.7.7, Chapter 13, NSC Report, p. 532; 2. Paper on Methodology for DDP in 1996 by the SDESs of Uttar Pradesh & Karnataka, CSO website; 3. Katyal, R.P., Sardana, M.G., Satyanarayana, J. (2001).]

5.4.3 National Income and Levels of Living

What do trends in macroeconomic aggregates say about the welfare of different sections of society? Precious little, perhaps, especially when these are considered without information on the distribution of these aggregates among those sections. Per capita national income, or even per capita personal disposable income, can only indicate overall (national) averages. The distribution of population by levels of income would be a big step forward in understanding how well the performance in the growth of GDP has translated, or has not translated, into improvements in the levels of living of sections of society below levels considered the minimum desirable. It would also help us analyse trends in inequalities in living standards, unemployment and employment, the quality of employment, the health status of people and the status of women. Or, to consider all these together, what are the levels of human development and gender discrimination? Such lines of analysis, and the data required for the purpose, are important from the point of view of planning for a strategy of growth with equity. The quinquennial Consumer Expenditure Surveys of the NSSO, the latest being the 61st Round (2004-05), provide the distribution of households by monthly per capita consumption expenditure (MPCE) classes. Data on trends in the growth rate of employment are available from the quinquennial employment and unemployment surveys of the NSSO (the latest being the 61st Round). These and the GDP data enable us to look at trends in employment elasticity (a small illustrative computation follows at the end of this sub-section). Comprehensive indicators like the HDI and the Gender Discrimination Index (GDI) have been prepared for the country and the States by the Planning Commission (Human Development Report – HDR – 2001) and for individual States and districts by several State Governments. These contain detailed data on different facets of levels of living. All the reports are available in print and in electronic form on the respective websites.

Read 1. Chapter 1 (pp. 1 – 6) and Technical Appendix (pp. 132 – 133), HDR 2001 of the Planning Commission; 2. Sub-sections 9.8.6 to 9.8.21, Chapter 9, NSC Report, pp. 333 – 336.
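Employment elasticity is conventionally computed as the ratio of the growth rate of employment to the growth rate of GDP between two survey years. The sketch below works this out for purely hypothetical figures; the numbers are not NSSO or CSO estimates.

    # Hypothetical figures for two survey years; not NSSO/CSO estimates.
    emp_0, emp_1 = 397.0, 425.0            # employment (million persons) in year 0 and year 1
    gdp_0, gdp_1 = 2000000.0, 2600000.0    # GDP at constant prices (Rs. crore)
    years = 5                              # gap between the two survey rounds

    # Compound annual growth rates of employment and of GDP.
    g_emp = (emp_1 / emp_0) ** (1.0 / years) - 1
    g_gdp = (gdp_1 / gdp_0) ** (1.0 / years) - 1

    elasticity = g_emp / g_gdp             # employment elasticity of growth
    print(round(g_emp, 4), round(g_gdp, 4), round(elasticity, 2))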


5.4.4 Saving

As you are aware, broadly speaking, GNP is made up of consumption and saving, together with exports net of imports and net factor income from abroad. Saving is important inasmuch as it goes to finance investment, which in turn brings about the growth of GNP. What is the volume of saving relative to GNP? How much of it is absorbed by the needs of depreciation? Who contributes, and how much, to the total volume of saving? Let us see what kind of data is available on such questions. Estimates of Gross Domestic Saving (GDS) and Net Domestic Saving (NDS) at current prices and of the Rate of Saving are made by CSO and published in the National Accounts Statistics (NAS) and in the Press Note of January of every year releasing the Quick Estimates. These are first made for any year along with the QEs of GDP, etc., and revised and finalised along with the subsequent revisions of the QEs. The structure of saving, that is, the distribution of GDS and NDS by type of institution – household sector, private corporate sector and public sector – is also available in NAS. Part III of NAS 2008 also presents the time series of estimates of GDS and NDS at current prices from 1950-51. Statistics on saving are also published in the publications mentioned under national accounts. Estimates of Gross and Net Domestic Saving at the State and Union Territory levels are not being made at present by SDESs (as per the NSC Report). The limitations that estimates of saving suffer from are indicated in the NSC Report. A high level Committee on Savings under the Chairmanship of Dr. Rangarajan (2007) is making a critical review of the estimates of savings and investment in the economy.

[Read Sub-sections 13.6.1 to 13.6.6 (pp. 508 – 509) & 13.6.10 to 13.6.16 (pp. 511 – 528), NSC Report.]

5.4.5 Investment

Investment is Capital Formation (CF). Investment of money in the shares of a company is not investment in this sense, but buying a house or machinery is. In other words, investment is the creation of physical assets like machinery, equipment, buildings and so on; it adds to the capital stock of such assets in the economy and enhances its productive capacity. Investment or CF is another important component of GNP, and the rate of investment – expressed as a proportion of GNP – largely determines the rate of growth of the economy. How is capital formation financed by the economy? What is the contribution of different sectors to capital formation, or how much is used up by different sectors? What is the capital stock available in the economy? These are all questions that arise in one's mind when considering strategies for economic growth. What kind of data is available? The annual NAS documents of CSO present such data. NAS 2008 presents estimates of Gross Domestic Capital Formation (GDCF), Gross Domestic Fixed Capital Formation (GDFCF), Change in Stocks, Consumption of Fixed Capital (CFC), Net Domestic Fixed Capital Formation (NDFCF) and Net Domestic Capital Formation (NDCF) at current prices and at constant (1999-00) prices. These estimates are made along with the QEs of National Income every January (as in 2009) and their revision proceeds along with that of the estimates of the national income aggregates. Thus NAS 2008 and the MOSPI Press Note of 30/1/09 also present estimates of the distribution of these aggregates at current and constant prices by type of institution and by economic activity, the manner in which CF is financed, external (current and capital) transactions and so on. (A small illustrative computation of the rates of saving and investment follows below.)
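The rates referred to above are simple ratios to GDP. The sketch below, with invented figures, computes a rate of saving, a rate of investment and an incremental capital-output ratio (ICOR) of the kind often examined alongside the NAS and EPWRF series; none of the numbers are official estimates.

    # Hypothetical aggregates in Rs. crore; not official estimates.
    gdp_current = 5000000.0      # GDP at current market prices
    gds_current = 1600000.0      # Gross Domestic Saving at current prices
    gdcf_current = 1700000.0     # Gross Domestic Capital Formation at current prices
    gdp_const_prev = 4200000.0   # GDP at constant prices, previous year
    gdp_const_curr = 4500000.0   # GDP at constant prices, current year
    gdcf_const_curr = 1400000.0  # GDCF at constant prices, current year

    saving_rate = gds_current / gdp_current          # rate of saving
    investment_rate = gdcf_current / gdp_current     # rate of investment (capital formation)
    icor = gdcf_const_curr / (gdp_const_curr - gdp_const_prev)   # incremental capital-output ratio
    print(round(saving_rate, 3), round(investment_rate, 3), round(icor, 2))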


Publications referred to in the sub-section on GDP, etc., also present time series of such data, but the EPWRF publication additionally provides capital-output ratios and average net fixed capital stock (NFCS) to output ratios (ACOR). CSO publications and the NSC Report contain the relevant methodological details. Estimates of GFCF at the State level are being prepared in 14 States. See also the EPWRF publication on NAS. Gaps in the data required for the estimation of capital formation exist in a number of relevant areas, as indicated in the NSC Report.

See Parts III & V, NAS 2008.

[Read 1. Sections 13.6.7 to 13.6.16, Chapter 13, NSC Report, pp. 509 – 528; 2. Chapters 6, 12, Exhibit E-C.4, Statistical Annexures VI & VII, EPWRF (June, 2003).]

5.5 AGRICULTURAL DATA

5.5.1 Introduction

You are aware of the importance of agriculture to the Indian economy and indeed to the Indian way of life. You would, therefore, like to examine several aspects of agriculture, like the level of production of different crops and commodities, the availability and utilisation of important inputs for agricultural production, incentives, the availability of post-harvest services and the role of agriculture in development. Similarly, you would like to know about livestock and their products, fisheries and forestry, the people engaged in these activities and so on. All these analyses require an enormous amount of data over time and space. Let us have a look at what kind of data are available and where. The Directorate of Economics and Statistics, Ministry of Agriculture and Cooperation (DESMOA) and the Animal Husbandry Statistics Division (AHSD) of the Department of Animal Husbandry, Dairying and Fisheries (DAHDF) of the same ministry are the major sources of data on agriculture and allied activities. Some of the major data collection efforts mounted at regular intervals are (i) the quinquennial agricultural census and input survey, (ii) the cost of cultivation studies, (iii) the annual estimates of crop production, (iv) the quinquennial livestock census and (v) the integrated sample survey to estimate the production of major livestock products. The major publications containing statistics flowing from these activities are DESMOA's Agricultural Statistics at a Glance (ASG) (annual, also accessible at www.dacnet.nic.in), Cost of Cultivation in India and the monthly bulletin Agricultural Situation in India, and AHSD's biennial publication Basic Animal Husbandry Statistics (BAHS).

5.5.2 Agricultural Census

Started from 1970-71, the seventh census related to 2001. The census collects data on holdings, like their area, the gender and social group of the holder, irrigation status, tenancy particulars, the cropping pattern and the number of crops cultivated. The Input Survey, conducted in the following year, gathers data on the pattern of input use across crops, regions and size-groups of holdings, covering infrastructural facilities, chemical fertilizers, organic manures, pesticides, agricultural implements and machinery, livestock,
agricultural credit and seeds. The results of the 2001 Agricultural Census and those of the Input Survey, 1996-97, are on the census website at the national and State levels and also in ASG 2008. The results of the next census (2005-06) and Input Survey (2006-07) are awaited. Those of the Input Survey 2001-02 are being finalised.

[Read 1. Sections 4.9 (pp. 136 – 139), 4.14 (pp. 146 – 147) & 4.22 (pp. 159 – 161), NSC Report; 2. http://www.agcensus.nic.in ; 3. Table Set 16, ASG (2008).]

5.5.3 Studies on Cost of Cultivation

DESMOA implements a comprehensive scheme for studying the cost of cultivation of principal crops in India. This results in the collection and compilation of field data on the cost of cultivation and production in respect of 29 principal crops, leading to estimates of crop-wise and State-wise costs of cultivation and also to the computation of the index of the terms of trade between the agricultural and non-agricultural sectors (ITT). The scheme covers 16 States and foodgrain crops, oilseeds, commercial crops and selected vegetables. These are published in Cost of Cultivation in India and in ASG. The Commission for Agricultural Costs and Prices (CACP) makes use of these estimates and of data on a number of variables and makes recommendations to Government on Minimum Support Prices (MSPs). MSPs for different commodities and the ITT are also published in ASG.

[Read 1. http://dacnet.nic.in/cacp ; 2. Table Set 8, ASG (2008); 3. Section 4.12 (pp. 142 – 144) and Sub-sections 4.20.3 to 4.20.6 and 4.20.8 (pp. 156 – 157), Chapter 4, NSC Report.]

5.5.4 Annual Estimates of Crop Production

DESMOA makes annual estimates of the area, production and yield of the principal crops among foodgrains, oilseeds, sugarcane, fibres and important commercial and horticultural crops. These crops account for about 87% of the total agricultural output. Estimates of area and yield form the basis of the production estimates. While estimates of area are based on a reporting system that is a mix of complete coverage and coverage by a sample, those of yield are based on a system of crop cutting experiments and General Crop Estimation Surveys. Advance estimates of crop production are also required, even before the crops are harvested, for policy purposes. The first such assessment of the kharif crop is made in the middle of September; the second – a second assessment of the kharif crop and the first assessment of the rabi crop – in January; the third at the end of March or early April; and the fourth in June. Time series of the final estimates of annual production[3], the gross area under different crops and the yield per hectare of these crops, and Index Numbers of these variables [base year: the triennium ending (TE) 1993-94 = 100], are published in ASG, as are estimates of the production of crops by States. (A small illustration of constructing an index with a TE base follows the footnote below.)

[Read 1. pp. 1 – 4 and Table Set 4, ASG 2008; 2. Sections 4.2 to 4.4, Chapter 4, NSC Report, pp. 118 – 128.]

[3] Crop area forecasts and final area estimates are now sample-based, as suggested by the NSC.
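As an illustration of how such index numbers are constructed, the sketch below converts a hypothetical production series into an index with the triennium ending (TE) 1993-94 as base (= 100), i.e. the base is the average of production in 1991-92, 1992-93 and 1993-94. The figures are invented and are not DESMOA estimates.

    # Hypothetical production of a crop (million tonnes); not DESMOA data.
    production = {
        "1991-92": 168.4, "1992-93": 179.5, "1993-94": 184.3,
        "2006-07": 217.3, "2007-08": 230.8,
    }

    # Base = average production over the triennium ending 1993-94.
    base = sum(production[yr] for yr in ("1991-92", "1992-93", "1993-94")) / 3.0

    index = {yr: 100.0 * value / base for yr, value in production.items()}
    for yr in production:
        print(yr, round(index[yr], 1))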


5.5.5 Livestock Census

The latest (17th) quinquennial livestock census for which results are available, conducted in October 2003, collected information, district-wise, on livestock, poultry, fishery and also agricultural implements. Livestock covers cattle, buffaloes, sheep, goats, pigs, horses and ponies, mules, donkeys, camels, yak and mithun, and also dogs and rabbits; these are classified by age, sex, breed and function. Poultry covers cock, hen, duck and drake, classified as desi and 'improved' varieties. Fishery covers fishing activity (inland capture, inland culture, marine capture and marine culture), persons engaged in fishing, craft/gear by type, size and horsepower, agricultural implements/equipment, equipment for livestock and poultry, and horticulture tools. The results are available on the DAHDF website up to the district level. ASG and BAHS also present some census data. The fieldwork for the 18th Livestock Census was completed in October 2007. Quick results of the Census, based on village/ward data, are expected by March 2009.

[Read Sections 4.13 & 4.14, Chapter 4, NSC Report, pp. 144 – 147; & www.dahd.nic.in ]

5.5.6 Data on Production of Major Livestock Products

AHSD is responsible for the collection of statistics on animal husbandry, dairying and fisheries. These are published in BAHS. The latest relates to 2006 and presents data on the production of milk, eggs, meat and wool, the per capita availability of milk and eggs, the contribution of cows, buffaloes and goats to milk production and of fowls and ducks to egg production, imports/exports of livestock and livestock products, the area under fodder crops, pastures and grazing, dry and green fodder production, artificial inseminations performed, achievements in key components of dairy development, and livestock and poultry. State-wise and time series data are presented in most cases.

[Read 1. Section 4.15, Chapter 4, NSC Report, pp. 147 – 148; 2. www.dahd.nic.in ; 3. Table Sets 19 & ...]

5.5.7 Agricultural Statistics at a Glance (ASG)

The total geographical area of the country is made up of land and water bodies like rivers and lakes. Land in turn consists of forests, barren and uncultivable land, land used for non-agricultural purposes, pastures, fallows, cultivable land and so on. What is the pattern of utilisation of land and how has this pattern been changing over time? How much is used for agriculture? Land utilisation statistics are available in ASG. ASG also provides information on the size distribution of operational holdings, cropping intensity, irrigation status, irrigation sources, consumption of fertilizer and farmyard manure by size classes of operational holdings and by crops, soil conservation, utilisation of inputs and so on.

Subsidies are a much-debated subject nowadays, and agricultural subsidies in developing countries and in the developed countries of Europe and the USA are also much in the news. ASG provides a time series of the amount of subsidy given to agriculture, with its break-up into subsidies for (i) fertilizers, (ii) electricity and (iii) irrigation (the excess of
operating costs of the Government irrigation system over gross revenue is treated as the imputed irrigation subsidy) and (iv) other subsidies given to marginal farmers and Farmers' Cooperative Societies in the form of seeds, development of oilseeds, pulses, etc. ASG also presents the share of agricultural subsidies in selected OECD countries and, in particular, a table that shows the amount of support to farmers irrespective of the sectoral structure of a given country. Other kinds of data on the agricultural sector presented in ASG are the procurement of food and non-food grains, marketed surplus ratios of important agricultural commodities, per capita availability of important articles of consumption, stocks of cereals, imports and exports of agricultural commodities and so on.

[Read Table Sets 9 to 16, ASG (2008)]

5.5.8 Another Source of Data on Irrigation

Besides DESMOA, data on irrigation are collected by the Central Water Commission (CWC) under the Ministry of Water Resources (MOWR). CWC collects hydrological data on all the important river systems in the country through 877 hydrological observation sites. The Ministry conducts periodic Censuses of Minor Irrigation Works, along with a sample check to correct the Census data. The latest Census related to 2000-01. The report, which can be seen at www.wrmin.nic.in, provides information on minor irrigation works, like the type of works, crop-wise utilisation of the potential created and the manner of distribution. The NSC has stressed the need for statistical analysis of the data with the CWC and the MOWR, for making users aware of the reasons for the variation between MOWR data and DESMOA data, and for reducing the time lag of both sets of data.

[Read Section 4.8, Chapter 4, NSC Report, pp. 134 – 136, & Annexure 4.7]

5.5.9 Other Data on the Agricultural Sector

Data on forest cover form part of the land-use statistics presented on the basis of a nine-fold land-use classification in ASG. The Forest Survey of India (FSI) also collects data on forest cover through a biennial survey using Remote Sensing (RS) technology since 1987. Digital interpretation has reduced the time lag in the availability of such data, obtained earlier through periodic reports from field formations. There are discrepancies between ASG and FSI data on forest area due to differences in concepts and definitions. Data on the production of industrial wood, minor forest produce and fuel wood are available with the Principal Chief Conservator of Forests in the Ministry of Environment & Forests. The annual reports of the National Bank for Agriculture and Rural Development (NABARD) and its other publications, like Statistical Statements Relating to the Cooperative Movement in India and Key Statistics on Cooperative Banks, besides its website and the RBI Handbook, are useful sources of information on agricultural credit. NAS provides data on the contribution of agriculture and its sub-sectors to GDP and other measures of national/domestic product, the value of output of various agricultural crops, livestock products, forestry products, inland fish and marine fish, and capital formation in agriculture and animal husbandry, forestry and logging, and fishing.


[Read Sections 4.5 (pp. 129 – 130) and 4.17 (pp. 150 – 152), Chapter 4, NSC Report.]

5.6 INDUSTRIAL DATA
5.6.1 Introduction
The industrial sector can be divided into a number of subgroups on the basis of framework factors like coverage of certain laws, employment size of establishments or criteria for promotional support by Government. Such groupings are the organised and unorganised sectors, the factory sector (covered by the Factories Act, 1948), small-scale industries, cottage industries, handicrafts, khadi and village industries (KVI), directory establishments (DE) (those employing six or more persons), non-directory establishments (NDE) (employing at least one person) and own account enterprises (OAE) (self-employed). Attempts have been made to obtain a detailed look at the characteristics of some of these sub-sectors of the industrial sector, as the data sources covering the whole sector often do not provide information in such detail. Let us turn to the kind of data available for individual subgroups and those that cover the entire industrial sector.

5.6.2 Data Sources Covering the Entire Industrial Sector
These sources provide levels of industrial employment. The first is the decennial Population Census (2001 is the latest) providing data, up to the district level, on levels of employment (i) by economic activities and broad occupational divisions and (ii) by economic sectors, age groups and education. The time lag in availability is large. The second is the quinquennial sample surveys relating to employment and unemployment conducted by the National Sample Survey Organisation (NSSO), the latest being for 2004-05. These also provide a similar type of data on industrial employment up to State levels within a year or two. Estimates for the 72 NSS regions are also possible with the unit record data available on floppies from NSSO. The third is the Employment Market Information Programme (EMIP) of the Directorate General of Employment & Training (DGE&T), Ministry of Labour & Employment and the State Directorates of Employment (SDEs), based on statutory quarterly employment returns from non-agricultural establishments in the private sector employing 10 or more persons and all public sector establishments (i.e., the organised sector). It provides data on employment in the organised sector at quarterly intervals down to district levels (Quarterly Reviews) in about a year's time. Detailed data by economic activity are available in the Annual Employment Reviews of the DGE&T and SDEs after a large time lag.

Economic Census (EC): Conducted by CSO since 1977, the latest (the fifth) EC was in 2005. It covers all economic enterprises in the country except those engaged in crop production and plantation and provides data on employment in these enterprises, besides providing a frame for the conduct of more detailed follow-up (enterprise) surveys (FuS) covering different segments of the unorganised non-agricultural sector. EC gathers basic information on the number of enterprises and their employment by location, type of activity and nature of operation. The all-India Report for EC 2005 (accessible on the MOSPI website) and most of the State reports have been published.

5.6.3 Factory (Registered) Sector – Annual Survey of Industries (ASI)


The Annual Survey of Industries (ASI), launched in 1960, collects detailed industrial statistics relating to industrial units in the country, like capital, output, input, value added, employment and factor shares; the survey has been conducted every year since 1960 except in 1972. The frame for the survey since 1998-99 consists of (i) all factories registered under Sections 2m(i) and 2m(ii) of the Factories Act, 1948, i.e., those employing 10 or more workers and using power as well as those employing 20 or more workers without using power, and (ii) biri and cigar manufacturing establishments registered under the Biri and Cigar Workers (Conditions of Employment) Act, 1966, with coverage of units as in (i) above.

The reference period for the survey is the accounting year April to March preceding the date of the survey. The sampling design and the schedules for the survey were revised in 1997-98, keeping in view the need to reduce the time lag in the availability of the results of the survey. The survey does not attempt estimates at the district level. NIC 04 is used for classifying economic activities from ASI 2004-05. Final results of ASI 2004-05 have been released. Results relating to selected characteristics at various levels of aggregation, available in the ASI section of the MOSPI website, are: (i) all industries by States, (ii) all India by 2-digit level of NIC 04 with rural-urban break-up, (iii) all India by 2/3/4-digit level of NIC 04, (iv) States by 2/3/4-digit level of NIC 04 and (v) unit level data with suppressed identification, etc. Data for the past surveys are also available on the website. CSO has also released time series data on ASI in 5 parts, each volume covering parts of the period 1959 to 1997-98, which present data on important characteristics for all-India at two-digit and three-digit NIC code levels and for the States at two-digit NIC code levels. These publications are also available in electronic media on payment. EPWRF (April, 2002) also provides time series ASI data on the principal characteristics of the factory sector along with the concepts and definitions used. These are also on the website of EPWRF and on interactive CD-ROMs.

The data available from ASI can be used to derive estimates of important technical ratios like the capital-output ratio, labour-output ratio, capital–labour ratio, labour cost per unit of output, factor shares in net value added and productivity measures for different industries, as also trends in these parameters. The most important use of the detailed results arises from the fact that these enable derivation of estimates of (i) the input structure per unit of output at the individual industry level and (ii) the proportions of the output of each industry that are used as inputs in other industries, enabling us to use the technique of input-output analysis to evaluate the impact of a change effected in (say) the output of an industry on the rest of the economy. The construction of the input-output transactions table (I-OTT) for the Indian economy is largely based on ASI data.
[Read 1. ASI section of MOSPI website; 2. EPWRF (April, 2002); 3. Section 5.1, Chapter 5, NSC Report, pp. 162 – 173.]
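To see what input-output analysis does with such data, here is a minimal sketch in Python (using the numpy library); the two-industry input coefficients and the change in final demand are purely hypothetical illustrations, not figures from ASI. The Leontief inverse (I - A)^-1 translates a change in final demand into the gross outputs required from every industry.

import numpy as np

# Hypothetical input coefficients: A[i, j] = input from industry i needed per unit of output of industry j
A = np.array([[0.20, 0.35],
              [0.10, 0.15]])

# Leontief inverse (I - A)^-1
leontief_inverse = np.linalg.inv(np.eye(2) - A)

# Additional gross output required from each industry if final demand for industry 1's output rises by 100
delta_final_demand = np.array([100.0, 0.0])
delta_gross_output = leontief_inverse @ delta_final_demand
print(delta_gross_output)

Simple technical ratios are obtained even more directly: for instance, the capital-output ratio of an industry is its fixed capital divided by its value of output, both of which ASI reports.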

5.6.4 Monthly Production of Selected Industries and Index of Industrial Production (IIP)


CSO prepares and releases monthly indices of industrial production (IIP) and the monthly use-based index of industrial production (base year 1993-94). The present IIP with base year 1993-94 is a quantitative index based on production data received from 14 source agencies covering 543 items clubbed into 285 groups in the basket of items of the index. The SDESs had been preparing IIPs for their respective areas, but these were not comparable with each other or with CSO's national IIP because of differences in the base year, basket of items, data and methodology used for constructing the indices. The work of preparing State-wise IIPs comparable with the national IIP is at different stages in different States and Union Territories. CSO releases Quick Estimates of IIP within six weeks of the close of the reference month, in line with SDDS requirements. CSO has released the IIP for December, 2007, the first revision of IIP for November 2007 and the final IIP for September, 2007 through the Press Release of 12/2/09 (see MOSPI website). NSC has made a number of recommendations to improve the quality of IIP. The CSO monthly publication Monthly Production of Selected Industries in India provides monthly data regarding production in individual industries covered in IIP along with the monthly IIP at 2-digit level and the monthly use-based IIP. These are also published in the Monthly Abstract of Statistics and the RBI Handbook. The latter also gives index numbers of Infrastructure Industries. The Indian Bureau of Mines (IBM) publishes Monthly Statistics of Mineral Production. The CSO publication Energy Statistics (the latest is 2008) provides at one place data on different sources of energy – time series on production, availability, consumption and price indices of major sources of conventional energy; these data are also available on floppies. The Ministries of Petroleum and Gas, Power and Non-Conventional Energy provide information in their respective spheres of activity. EIS of CMIE publishes volumes on Energy and Infrastructure that present detailed data on the trends in these sectors, and the EPWRF website gives time series data on industrial production.
[Read MOSPI & EPWRF websites; Section 5.4, Chapter 5, NSC Report, pp. 187 – 200.]

5.6.5 Data on the Unorganised Industrial Sector
The Development Commissioner for Small Scale Industries (DCSSI) in the Central Ministry of Small Scale Industries and the State Directorates of Industries provide data on small-scale industrial units registered with the latter set of agencies. The DCSSI has conducted a census of small-scale industrial units thrice – the latest in November, 2002 (reference year 2001-02). The results of the third census are in the publication Final Results: Third All India Census of SSI – 2001-02 of the Ministry of Small Scale Industries. Broad details of the performance of small-scale industries are available in the Annual Reports of the Ministry of Small Scale Industries. Time series data on employment, production, labour productivity in small-scale industries (SSI) and value of exports of the products of small-scale industry are also available in the RBI Handbook. Data on some parts of the Khadi and Village Industries Commission (KVIC), handloom and handicrafts sub-sectors do get included in ASI, but data relating exclusively to these sub-sectors are available in the Annual Reports of these organisations or in the Annual Reports of the Ministries under which these Boards/Commissions function.


FuSs, as a follow-up of EC, have covered unorganised manufacturing at quinquennial intervals from 1978-79 to 2005-06 (and during 1999-00 and 2000-01). The results of the 2005-06 survey (62nd Round) have been published in NSSO reports numbered 524 to 526 on the unorganised manufacturing sector: (i) operational characteristics; (ii) employment, assets and borrowings; and (iii) input, output and value added. NSC has made a number of recommendations for enhancing the quality of data on this sector. How is the capital employed in industry financed? The next sub-section looks at some of the sources that throw light on this.
[Read Sections 5.2 & 5.3, Chapter 5, NSC Report, pp. 173 – 187.]

5.6.6 Industrial Credit and Finance
The RBI Handbook provides time series data on the sectoral deployment of non-food gross bank credit provided by Scheduled Commercial Banks to different sectors of the economy and also on the health of SSI and non-SSI units. The last category gives data on sick and weak units for SSI and non-SSI sectors and the amounts outstanding (loans) from each of these categories of units. The ASI provides some data on financial aspects of industries – fixed capital, working capital, invested capital, loans outstanding and also the interest burden of industrial units (up to the 4-digit NIC code level). From where and how have the industries raised the capital needed by them? We have looked at one source of capital or working capital, namely, bank credit. Time series data on new capital issues and the kinds of shares/instruments issued (ordinary, preference or rights shares or debentures, etc.) and the composition of those contributing to capital (like promoters, financial institutions, insurance companies, Government, underwriters and the public) are also presented in the RBI Handbook. Also available are data on assistance sanctioned and disbursed by financial institutions like the Industrial Development Bank of India (IDBI), etc., and financing of project costs of companies. The publication of the Securities Exchange Board of India (SEBI), Handbook of Statistics on the Indian Securities Market - 2008, provides annual and monthly time series data on industry-wise classification of capital raised through the securities market. A reference to the two volumes of CMIE (EIS), Industry: Financial Aggregates and Industry: Market Size and Shares, would be rewarding. Section 5.8 deals with data on foreign direct investment (FDI), another source of capital finance.

5.6.7 Contribution to GDP
The National Accounts Statistics (NAS) presents a short time series of estimates of (i) value of output and GDP of each two-digit NIC code level industry in the registered and the unregistered sub-sector of the manufacturing sector, (ii) value of output of major and minor minerals and GDP and NDP of the mining & quarrying sector, and (iii) GDP and NDP of the sub-sectors electricity, gas and water supply.


5.7 TRADE 5.7.1 Introduction Trade is the means of building up an enduring relationship between countries and the means available to any country for accessing goods and services not available locally for various reasons like the lack of technical know-how. It is also the means of earning foreign exchange through exports so that such foreign exchange could be utilised to finance essential imports and to seek the much-needed technical know-how from outside the country for the development of industrial and technical infrastructure to strengthen its production capabilities. Trade pacts or agreements between countries or groups of countries constitute one way of developing and expanding trade, as these provide easier and tariff-free access to goods from member countries. While efforts towards such an objective will be of help in expanding our trade, globalisation and the emergence of World Trade Organisation (WTO) have only sharpened the need to ensure efficiency in the production of goods and services to compete in international markets to improve our share of world merchandise trade and trade in services. Trade is also closely tied up with our development objectives since trade deficit or surplus, made up of deficit/surplus in merchandise trade and trade in services, contributes to current account deficit or surplus. Data on trade in merchandise and services would enable us to appreciate the trends and structure of trade and identify areas of strength and those with promise but need sustained attention. 5.7.2 Merchandise Trade The Directorate General of Commercial Intelligence and Statistics (DGCI&S) collects and compiles statistics on imports and exports. It releases these data at regular intervals through their publications and through CDs. It prepares “Quick Estimates” on aggregate data of exports and imports and principal commodities within two weeks of the reference month and releases these in the monthly press release. It publishes a monthly brochure (i) Foreign Trade Statistics of India (Principal Commodities and Countries) containing provisional data issued to meet the urgent needs of the Ministry of Commerce, other government organisations, Commodity Boards (CBs), Export Promotion Councils (EPCs) and research organisations. It contains commodity-wise, country-wise and port-wise foreign trade information, (ii) Monthly Statistics of Foreign Trade of India, Volume I (Exports) & Volume II (Imports) containing detailed data on foreign trade at the 8-digit level codes of the ITS(HTS) (iii) Foreign Trade Statistics of India (Principal Commodities and Countries), and (iv) Inland and Coastal Trade Statistics and Shipping & Air Cargo Statistics. The DGCI&S website (www.dgciskol.nic.in) provides summary data on principal commodities and countries, access on free and payment basis to final foreign trade data at 8-digit commodity and principal commodity level, a Priced Information Service System (PISS) to private parties, EPCs, CBs, Foreign Embassies etc., and aggregate and detailed data to Centre for Monitoring Indian Economy (CMIE), Mumbai for an efficient trade intelligence service. The DGCI&S data are also presented in the RBI Handbook 2008, CSO’s Monthly Abstract of Statistics, CMIE’s volume on Foreign Trade and BoP and EPWRF website.

What would we like to know about foreign trade? The volume of trade, that is, the volume of exports and imports, the size of export earnings, the expenditure on imports,


the size of exports relative to imports, earnings from exports compared to expenses incurred in imports since exports earn foreign exchange while imports imply outflow of foreign exchange. We should like to know about the trends in these variables. Besides looking at the trends in the quantum and value of imports and exports, it is important to analyse the growth in foreign trade both in terms of value and volume, since both are subject to changes over time. Exports and imports are made up of a large number of commodities and fluctuations in the export and imports of individual commodities contribute to overall fluctuations in the volume and value of exports and imports. We, therefore, need a composite indicator of the trends in trade. The index number of foreign trade of a country is a useful indicator of the temporal fluctuations in exports and imports of the country in terms of value, quantum and unit price and so on. Similarly, measures of the terms of trade could be derived from such indices relating to imports and exports. The existing index numbers have the base year 1978-79. RBI Handbook 2008 publishes time series data (DGCI&S data) on value (in US $ and Indian Rs.) of exports and imports and trade balance, value of exports of selected commodities to principal countries, Direction of Foreign Trade (in US $ Indian Rs.) by trade areas, groups of countries and countries, year-wise Unit Value Indices (UVI) and Quantum Indices (QI) for imports and exports and for each product and the three terms of trade measures, Gross Terms of Trade (GTT), Net Terms of Trade (NTT) and Income Terms of Trade (ITT).
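For illustration, using hypothetical figures and the standard textbook definitions of these measures (not figures from the Handbook): the Net Terms of Trade is the ratio of the unit value index of exports to that of imports, so with a UVI of exports of 250 and a UVI of imports of 200 (base 1978-79 = 100), NTT = (250/200) x 100 = 125, i.e., a unit of exports buys 25 per cent more imports than in the base year. The Income Terms of Trade scales NTT by the quantum index of exports: with a QI of exports of 180, ITT = 125 x 180/100 = 225, a gauge of the import-purchasing power of total export earnings. The Gross Terms of Trade is conventionally the ratio of the quantum index of imports to the quantum index of exports, expressed with reference to the base year.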

RBI also generates data on merchandise imports and exports, or trade data. The Balance of Payments (BoP) data reported by RBI (published in the RBI Handbook) show the value of merchandise imports on the debit side and that of exports on the credit side and also the trade balance, all in the balance of payments format as part of the current account, which also shows another entry, 'invisibles'. However, there is a divergence between the trade deficit/surplus in merchandise trade shown by DGCI&S data and that shown by RBI's BoP data. This discrepancy also affects data on current account deficit (CAD) or surplus (CDS), since CAD/CDS is the total of the trade deficit/surplus and net invisibles (see later for 'invisibles'). There are three reasons for the divergence between the two sources. First, DGCI&S tracks physical imports and exports while BoP data track payment transactions relating to merchandise trade. Second, DGCI&S data fail to capture Government imports, which are exempted from customs duty (e.g. Defence imports). Finally, DGCI&S data do not capture imports that do not cross the customs boundary (e.g. oil rigs and some aircraft), even though these are paid for and hence get captured in BoP data.
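A simple hypothetical illustration of the relationship just mentioned (the figures are invented for the arithmetic only): if merchandise exports are US $160 billion and imports US $250 billion, the trade deficit is US $90 billion; if net invisibles amount to a surplus of US $75 billion, the current account deficit works out to 90 - 75 = US $15 billion. Whichever of the two sources is used for the trade deficit, the derived CAD will differ by the full amount of the divergence between them.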

5.7.3 Services Trade

Besides export and import of merchandise, a number of services, like transportation services, travel services, software, Information Technology-Enabled Services (ITES), business services and professional services are exported and imported. These are captured by “non-factor services” included in the entry “Invisibles” in the Tables on India’s Overall BoP and on Key Components of India’s BoP in the RBI Handbook 2008. Table 146 of the Handbook gives the distribution of ‘net invisibles’ by four types of transactions – ‘net non-factor services’, etc. ‘Net non-factor services’ are further divided into five classes, ‘travel – net’, ‘miscellaneous – net’, etc. and under each of these classes are shown ‘Receipts’ and ‘Payments’, which respectively correspond to export and import of these services. But Table 42 on India’s Overall BoP in the RBI Monthly Bulletin, Feb. 2009 gives more detailed data. Naming the item “non-factor services” in the BoP table of the Handbook as “services”, it gives (i) figures for ‘credit’, ‘debit’ and


‘net’ against each of the items in the BoP table and (ii) separate figures for “software services” under the category “ miscellaneous “.

5.7.4 E - Commerce

E-Commerce, defined as the production, distribution, marketing, sale or delivery of goods and services by electronic means (telephone, fax, internet), is a newly emerging and fast-growing way of conducting business. It is necessary to organise the collection of data relating to this area. The NSC has made some recommendations in this regard.

[Read 1. Notes on/Footnotes to Tables on Foreign Trade and BOP, RBI Handbook (2008) and RBI Monthly Bulletin Feb., 2009; 2. Section 10.9 pp. (383 – 391) & 10.12 (pp. 396 – 398), Chapter 10, NSC Report.]. 5.8 FINANCE 5.8.1 Introduction While trade and finance have been closely bound up with each other ever since the time money replaced barter as the means of exchange, finance is the lifeline of all activities. It flows from the public as taxes to Government, as savings to banking and financial institutions and as share capital or bonds or debentures to the entrepreneur. It then gets used for a variety of development and non-development activities through Government and other agencies and flows back to members of the public as incomes in various ways, as factor incomes. It would, therefore, be of interest to know how funds get mobilised for various purposes and get used. This section looks at the kind of data available that could enable us to analyse this mobilisation process and the flows of funds to different areas of activity. . The finance sector consists of public finances, the central bank (the RBI), the scheduled banks, urban and rural cooperative banks and related institutions. The financial market consists of the stock exchanges dealing with scrips like shares, bonds and other debt instruments, the primary and secondary markets, the foreign exchange market, the treasury bills market and the securities market where financial institutions, mutual funds, foreign institutional investors, market intermediaries, the market regulator the Securities Exchange Board of India (SEBI), the banking sector and the RBI all play important roles. There is also the unorganised sector made up of financial operators like money-lenders and pawn brokers. Insurance is another area of finance. 5.8.2 Public Finances What would we like to know about public finances? We would like to know how they are managed. What are the sources of such finances and how and on what are they spent? Does the Government restrict its expenditure within its means or does it spend beyond the resources available to it? Does it, in the process, borrow heavily to finance its expenditure? The Budget documents of the Central and State Governments, the pre-Budget Economic Survey and the publication Indian Public Finance Statistics of the Ministry of Finance, the Planning Commission’s Five Year Plan Documents and RBI


Handbook 2008 and the RBI Monthly Bulletins and the EPWRF website provide a variety of data on public finances. The Economic Survey, for instance, gives an overall summary of the budgetary transactions of the Central and State governments and Union Territory Administrations. This includes the internal and extra-budgetary resources of the public sector undertakings for their plans and indicates the total outlay, the current revenues, the gap between the two, the manner in which the gap is financed by net internal and external capital receipts and, finally, the overall budgetary deficit. It gives the break-up of the outlay into developmental and non-developmental outlays and the components of these and those of current revenues. The RBI Handbook 2008 presents time series data in respect of public finances in four groups – (i) Central Govt. Finances, (ii) Finances of the State Govts., (iii) Combined Finances of Central and State Govts., and (iv) Transactions with the Rest of the World. Besides covering data areas like Govt. receipts and expenditure, the first two groups cover key deficit indicators, the financing of the Gross Fiscal Deficit (GFD) and outstanding liabilities. The third group covers in addition the range and weighted averages of Central and State Govt. dated securities and the shares of categories of holders of Central and State Govt. securities. We have looked at one area of the fourth group, namely, trade in merchandise and services, in the section on Trade. There are other areas in which India interacts with the rest of the world. Foreign exchange flows into the country as a result of exports from India, external assistance/aid/loans/borrowings, returns from Indian investments abroad, remittances and deposits from NRIs and foreign investment (FDI and portfolio investment) in India. Foreign exchange reserves are used up for purposes like financing imports, retiring foreign debts and investment abroad. What is the net result of these transactions on the foreign exchange reserves? What are the trends in these flows and their components? What is the size of the current account imbalance relative to GDP and what is its composition? If it is a deficit, is it investment-driven? What is the size of foreign exchange reserves relative to macro-aggregates like GDP, the size of imports and the size of the short-term external debt? While the Weekly Statistical Supplement and the RBI Monthly Bulletin give data on forex reserves and related data, the RBI Handbook gives time series data (in US$ and in Indian Rupees) on these parameters. As for FDI, the coverage of data compiled by RBI and the Department of Industrial Policy & Promotion (DIPP) in the Ministry of Commerce & Industry has been in accordance with the best international practices since 2000-01. The RBI Handbook 2008, the SEBI Handbook 2008, the RBI Monthly Bulletin, Feb. 2009 and the websites www.rbi.org.in and www.dipp.nic.in provide time series data on FDI.
In addition, it issues currency notes and takes steps to regulate the money supply in the economy so as to achieve simultaneously the objectives of ensuring adequate credit to development activities and maintaining stability in prices. We should, therefore, be interested in data on money supply or the stock of money and its structure


and the factors that bring about changes in these, the kind of aggregates that need monitoring, the transactions in the banking system in pursuance of the nation's development objectives, the flow of credit to different activities, and indicators of the health and efficiency of banks, which are the custodians of the savings of the public. We should also be interested in data on prices, as the price level affects the purchasing power of money, and in indices of prices appropriate for the purpose/group in question – prices for producers and consumer prices for different groups of consumers. Most of these data are compiled by the RBI on the basis of its records and those of NABARD and returns that it receives from banks and can be found in RBI Bulletins and the RBI Handbook. These are also published in the Monthly Abstract of Statistics of CSO. The Wholesale Price Index (WPI) (base year 1993-94) is compiled by the Economic Adviser's Office in the Ministry of Industry, the Consumer Price Index for Industrial Workers (CPI - IW) (base year 2000) and the CPI for Agricultural Labour (CPI - AL) (base year 1986-87) by the Labour Bureau, Shimla, and the CPI for Urban Non-Manual Employees (CPI - UNME) (base year 1984-85) by CSO; these are published by the agencies concerned (also posted on their websites) and are available in the RBI and CSO publications mentioned above. The Monthly Abstract of Statistics also gives monthly data on average rural retail prices of (i) selected commodities/services and (ii) controlled/rationed items collected by NSSO. Two other reports of the Reserve Bank of India published every year, the Report on Currency and Finance and the Report on Trends in Banking, provide a wealth of information of use to analysts. The EPWRF website provides data on Banking, Money and Finance.

5.8.4 Financial Markets
What would we like to know about financial markets and their functioning? We would like to know about the ways in which financial resources can be accessed and at what cost. What are the prevailing interest rates payable for funds to meet short-term or long-term requirements? How do new ventures access the large amount of resources that they need? How do term lending institutions access funds required for their operations? What are the sources of funds? RBI, which regulates banking operations and the operations of NBFCs and FIs, SEBI, which regulates the capital market, and the Department of Company Affairs, which administers the Companies Act, are the major sources of data on financial markets. The RBI Handbook 2008 (also the RBI website) and SEBI's Handbook of Statistics on the Indian Securities Market 2008 (also www.sebi.gov.in) contain comprehensive data on financial markets. The two together provide time series data on several aspects of financial markets. Examples are the structure of interest rates, resource mobilisation in the Private Placement Market, net resources mobilised by mutual funds (MFs), new capital issues by non-govt. public ltd. companies, absorption of private capital issues – the number of issuing companies, the number of shares and the amount subscribed by various categories of subscribers, annual averages of share price indices, resources raised by the corporate sector through equity/debt issues, the share of private placement in total debt and total resource mobilisation, the share of debt in total resource mobilisation; pattern of funding for non-govt.
non-financial public limited companies, capital raised by economic activity/size of capital raised/region, trends on trading on stock exchanges, indicators of liquidity - market capitalisation-GDP ratio (BSE & NSE), turnover ratio (BSE) and


traded value ratio (BSE & NSE) and comparative evaluation of indices (BSE SENSEX etc.) through Price to Earnings Ratio and Price to Book Ratio.
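A brief note on how these indicators are usually defined (standard definitions, not specific to the SEBI Handbook): the market capitalisation-GDP ratio is total market capitalisation divided by GDP; the turnover ratio is the value of shares traded over a period divided by market capitalisation; the traded value ratio is the value of shares traded divided by GDP; the Price to Earnings Ratio is the market price of a share divided by earnings per share; and the Price to Book Ratio is the market price divided by the book value per share.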

[Read Sections 10.1 to 10.11, Chapter 10, NSC Report, pp.337 – 396; Article “New Monetary and Liquidity Aggregates”, RBI Bulletin of November, 2000.] 5.9 SOCIAL SECTORS 5.9.1 Introduction Social Sector consists of education, health, employment, environment and levels of living or quality of life in general. Investments in this sector pay rich dividends in terms of rising productivity, distributed growth, reduction in social and economic inequalities and levels of poverty, though after a relatively longer time span than in the case of investment in physical sectors. Let us look at the kind of data available in this sector.

5.9.2 Employment, Unemployment and Labour Force
Employment is the means to participate in the development process and also to benefit from it. Creation of employment opportunities is an important instrument for tackling poverty and empowering people, especially women. We should, therefore, know how many are employed and how many are ready to work but are unable to gain access to employment opportunities. How do women fare in these matters? Or, for that matter, what is the experience of men and women belonging to different social/religious/disadvantaged groups? Are children employed in economic activities that are not only hazardous to their health but that also undermine efforts to ensure their mental and physical well-being? What is the quality of employment opportunities available to the work force? What are the conditions in which people work?

(a) Magnitude of Employment & Unemployment

Data on employment in selected sectors are available from several sources. We have already looked at the EMIP of DGE&T. This is the only source of employment data on the organised sector of the economy that is available at quarterly intervals, but it is subject to the limitations arising out of non-response in the submission of returns and incompleteness of the employers' register (the frame). EMIP also produces biennial reports on the occupational and educational pattern of employment in the public and private sectors, based on another return, but these have lost their utility over the years due to a high level of non-response. The DGE&T and the SDEs also provide data on the number of jobseekers by age, sex, educational qualifications and type of social/physical handicap on the live register of employment exchanges. Not all registered jobseekers are unemployed, and not all the unemployed register with the employment exchanges, registration being voluntary. Some register at more than one exchange. The size of the live register cannot, therefore, be an accurate estimate of the level of unemployment. It does represent the extent of pressure in the job market, especially for Govt. and public sector jobs.


The second and third sources, EC and ASI, have also been discussed earlier. The quality of ASI data is tied to the completeness of the frame of factories, which, in turn depends on the quality of the enforcement of the two relevant Acts. Data on employment in the Railways and the Banking sector are available from Railway Board and RBI respectively. The Indian Labour Statistics (ILS) (the latest is for 2006) published by the Labour Bureau (LB), Ministry of Labour & Employment, Shimla presents data on employment in a number of sectors that are covered by different labour legislations, but all these suffer from inadequacy of response in the submission of statutory returns and partial coverage of the relevant Act. Comprehensive data on employment and unemployment covering the entire country at regular intervals are available from two sources, namely, the decennial Population Census and the quinquennial sample surveys on employment and unemployment (EUS) of the National Sample Survey Organisation (NSSO). Workers in the Census are enumerated as main workers and marginal workers. The Census 2001 presents data up to the district level on the male and female workforce (main workers and marginal workers) by detailed economic activity categories and the employed and unemployed by age and education and social groups. (B & C Series Tables). These are available on CDs and floppies for users on payment and on the Census website http://www.census.nic.in.

The latest NSSO quinquennial EUS relates to 2004-05 (61st Round). Its results are published in NSSO reports numbered 515 to 521. Report no. 515, Parts I & II, "Employment & Unemployment Situation in India", is the overall report. These provide data on employment and unemployment in greater detail than the Census up to the State level. Analysis at the level of the 72 NSS regions is possible with the unit record data obtainable from NSSO. EUSs provide data on the distribution of the employed/unemployed by a number of characteristics like sex, rural/urban residence, age, education, employment status, economic activity (of the employed) and the monthly per capita expenditure (MPCE) class of the household concerned and also data on the incidence of underemployment, average daily wage levels of workers, etc.
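A brief illustration of how such survey data are used (a standard computation with hypothetical figures, not results from the 61st Round): the unemployment rate is the number of unemployed persons divided by the labour force, i.e., the employed plus the unemployed. If, in a State, 400 per 1,000 persons are employed and 25 per 1,000 are unemployed by the usual status, the labour force participation rate is 42.5 per cent and the unemployment rate is 25/425, or about 5.9 per cent.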

(b) Quality/Adequacy of Employment EUS data throws light on aspects of quality and adequacy of employment, like underemployment of the employed, the share of casual employment in employment, male-female/regular/casual worker wage differentials and employment status of the workers from poor households. ILS of LB provide data on wage/earnings in the organised sector based on statutory returns and the distribution of workers in different occupations by level of earnings in selected industries through Occupational Wage Surveys (OWS). (Some of the sixth round reports have come out in 2008). ASI gives data on wages/ emoluments in different industries in the factory sector. DE, NDE and OAE and other Establishment Surveys provide data on average annual earnings for men, women and children in the unorganised sector. Wage Rates in Rural India for 2005-2006 and the 2005 Report on the Working of the Minimum Wages Act, 1948 give rural/unorganised sector wage levels vis-à-vis statutory minimum wages. Data on child labour and bonded labour from the Census and NSSO do not fully reflect the ground level realities.


ILS also presents data on several aspects of labour welfare like industrial injuries, compensation to workers for injuries and death, industrial disputes and access to health insurance and provident fund. These are incomplete, being based on statutory returns. ILS also gives statistics relating to welfare funds set up in different industries. LB's reports on its ongoing programme of surveys throw light on the working and living conditions of Scheduled Caste/Tribe workers, unorganised workers and contract labour.
[Read 1. http://www.census.nic.in (especially the write-ups on metadata); 2. Sub-Paper 4.1 of Paper on Agenda Item 4, 15th COCSSO, CSO; 3. Section 9.4, NSC Report, pp. 292 – 302; 4. Chapter 2 of NSSO (1997), pp. 3 – 7; 5. Sections on Labour Bureau and DGE&T of http://labour.nic.in; 6. Sections on EC and FuS of www.mospi.gov.in]

5.9.3 Education
(a) Introduction
Education is an important instrument for empowering people, especially women. Education nurtures and develops their innate talents and capabilities and enables them to contribute effectively to the development process and reap its benefits. It is also an effective instrument for reducing social and economic inequalities. We have built up over the years a vast educational system in an attempt to provide education for all, to ensure that the skill and expertise needs of a developing economy are met effectively and, at the same time, to monitor the functioning of the educational system as an effective instrument for tackling inequalities. We would, therefore, like to look at data on different aspects of the educational and training system such as its size and structure, the physical inputs available to it for its effective functioning, its geographical spread, the type of skills and expertise it seeks to generate, the access of different sections of society and areas of the country to it and the progress made towards goals like 'total literacy', 'universalisation of secondary education', 'education for all' and 'removal of inequalities'. The Department of School Education and the Department of Higher Education of the Ministry of Human Resource Development (MHRD), the National Council for Educational Research and Training (NCERT) and the University Grants Commission (UGC) collect and publish educational statistics and conduct research studies and surveys in the area of education. Selected Educational Statistics (2004-05), Education in India - Vols. I & IV (1998-99) and Annual Financial Statistics of Educational Sector (2005-06) published annually by MHRD, the All India Educational Survey (Seventh AIES) (2002-03) of the NCERT, the Annual Report of the DGE&T, the National Technical Manpower Information System (NTMIS) and the annual Manpower Profile of India of the Institute of Applied Manpower Research (IAMR), the 52nd, 53rd, 55th and 61st Round surveys of NSSO (1995-96, 1998-99, 1999-00 and 2004-05), the B and C series tables of Census 2001 and the Planning Commission's National Human Development Report (NHDR), 2001 and the State HDR reports are the major sources of data on education.

(b) Educational Infrastructure: These volumes together give data on the number of institutions established at various levels – schools, colleges, polytechnics, industrial training institutes and facilities for apprenticeship training, their intake capacity, teaching positions created and filled, training facilities for the physically disadvantaged, special


campaigns like Sarva Shiksha Abhiyan and the Total Literacy Campaign, adult education, availability and adequacy of physical facilities in schools – type of buildings, number of rooms for teaching vis-a-vis the number of pupils, access to drinking water and to toilets and urinals (and separately for girls), distance of the school from the pupil's residence and direct and indirect expenditure on education.

(c) Infrastructure Utilisation and Access to Educational Opportunities: These volumes provide data on literacy rates – overall and for different groups, enrolment, enrolment ratios and drop-out rates at different levels of education and courses, teacher-pupil ratios, output from various professional and non-professional courses, utilisation patterns of professional manpower, stocks of different categories of manpower, the educational profile of the population, incidence of disability in the population, and the output of the training facilities created for the vocational rehabilitation of the physically challenged.

[Read . 1. Sub-Paper 4.2 of Paper on Agenda Item 4, 15th COCSSO, pp.18 – 35;

2. Section 9.5, Chapter 9, NSC Report, pp. 306 – 323.]

5.9.4 Health (a) Introduction: One of the important dimensions of quality of life is health. A healthy individual can contribute effectively to production of goods and services. Investment in health is, therefore, an essential instrument of raising the quality of life of people and the productivity of the labour force. What is the health status of the population? What are the challenges to the health of the population and how are these being tackled? What kind of data is available about these aspects of the population, the health infrastructure and the efforts being made to deal with problems of health? What is the impact of these on the health situation, especially of women and children? The annual publication National Health Profile of India (NHPI) (2007 is the latest) of the Central Bureau of Health Intelligence (CBHI) of the Ministry of Health & Family Welfare (MHFW), the publications Sample Registration System (SRS): Statistical Reports, SRS Compendium of India’s Fertility & Mortality Indicators, 1971-1997, Mortality Statistics and Cause of Death and SRS Bulletin (half yearly) and the Social & Cultural Tables (C Series Tables) of Census 2001 of the Registrar General India & Census Commissioner (RGI&CC), the Planning Commission’s NHDR 2001 and the HDRs of the State Governments and the Report on the National Family Health Survey (NFHS – 3): 2004-05 of the International Institute of Population Sciences contain a large amount of information on these aspects of health. (b) Health Infrastructure: NHPI provides data on the number of public and private hospitals, dispensaries and the number of beds in these in rural/urban areas, similar data on various health insurance schemes of the Government in different sectors/for different sections of population, facilities for Indian Systems of Medicine, facilities for training medical and health manpower, manning of medical and health positions in the health system, stocks of medical and health manpower, programmes for controlling specific communicable diseases and expenditure on health and family welfare. (c) Public Health, Morbidity and Mortality Indicators: NHPI presents data on programme for vaccination of children and pregnant women, incidence of communicable


and other diseases and mortality due to these, incidence of leprosy and tuberculosis, the National AIDS Control Programme and other national control/eradication programmes, levels of utilisation of different health insurance schemes, infant mortality, maternal mortality, birth and death rates, fertility rates, incidence of disability and expectation of life.

(d) National Family Health Survey (NFHS)-3: NFHS – 1, 2 & 3, conducted in 1992-93, 1998-99 and 2005-06, succeeded in building up an important demographic and health database in India. These provide State-level estimates of demographic and health parameters and also data on various socio-economic and programmatic factors that are crucial for bringing about desired changes in India's demographic and health situation. NFHS – 3 covers all States. Some of the types of data provided (see the website http://www.nfhsindia.org) are data on age at first marriage of women, current fertility, median age of women at the first and last birth of a child, knowledge and practice of contraception, estimates of age-specific death rates, crude death rates, infant/child mortality rates, morbidity of selected diseases, immunization of children, vitamin A supplementation of children, nutritional status of children, anaemia among them and indicators of acute and chronic malnutrition among children – weight for age index, height for age index and weight for height index, health status of women (Body Mass Index and prevalence of anaemia), health problems of pregnancy and so on.
[Read 1. Section 9.3, Chapter 9, NSC Report, pp. 275 - 292.]

5.9.5 Environment
The process of development adversely affects the environment and, through it, the quality of life of society. For instance, the excessive use of fertilizers and pesticides robs the soil of its nutrients. Letting sewage, drainage and industrial effluents into rivers and water bodies without prior treatment pollutes these, causing destruction of aquatic life and endangering the health of people using such polluted water. The recent outcry in Tirupur, near Coimbatore, a place well known for garment exports, against untreated effluents from garment factories being let into the river used for drinking purposes is a case in point. The exhaust fumes containing carbon monoxide (CO) and lead (Pb) particles let into the air we breathe by vehicles using petrol or diesel are an example of air pollution. The best example of industrial pollution through insufficient safety measures is the Bhopal gas disaster, where lethal gases leaking from a factory's storage cylinder killed many people immediately and maimed many others for life. The forest cover of the country is continuously getting reduced due to indiscriminate felling of trees, leading to reduction in rainfall and changes in rainfall pattern, besides climatic changes. The destruction of mangroves along seacoasts for housing/tourism development often leads to soil erosion along the coast by the sea. The adverse effects of current models of development on the environment and the realisation of the need to take note of the cost to development represented by such effects have now led to the development of environmental economics as a new discipline in economics. The Central and State Pollution Control Boards and the Ministry of Environment and Forests (MOEF) evolve and monitor the implementation of policies to protect the environment. Statistics on environment are collected through this process by these agencies and the CSO.


The annual reports of the MOEF and the Compendium on Environment Statistics, India 2007, published by the CSO from time to time, are excellent sources of data on environment. The latter especially is very comprehensive and includes a very informative write-up. The Compendium (and the annual report of MOEF) can be accessed on the websites of the two organisations. Illustrative types of data on environment available from these publications are Ambient Air Quality Status [concentration of sulphur dioxide, nitrogen dioxide and Solid Particulate Matter (SPM) in air] in major cities of India, percentage of petrol-driven two-wheelers, three-wheelers and four-wheelers meeting CO emission standards, and water quality of the Yamuna river (in the Delhi stretch) in respect of selected physico-chemical parameters during a year – dissolved oxygen (milligrams/litre), Biological Oxygen Demand (BOD) (mg/l), faecal coliforms (number/100 ml), total coliforms (number/100 ml) and ammoniacal nitrogen (mg/l). (SPM consists of metallic oxides of silicon, calcium and other deleterious metals. The most common contamination in water is from disease-bearing human wastes, which are usually detected by measuring faecal coliform levels.)
[Read 1. Section 9.7, Chapter 9, NSC Report, pp. 327 – 331.]

5.9.6 Quality of Life
We have already looked at several of the factors determining the quality of life of the people – education, health, employment and environment. Shelter and amenities is another. Ironically, development projects also displace people from their normal way of life. One other factor, an important one, is the level of income or consumption. The relevant data are available from the quinquennial surveys of the NSSO on consumer expenditure - those on levels of consumption for different MPCE classes. These and others considered in the earlier sections lead to measures of the dimensions of poverty and inequalities in income (consumption) and non-income aspects of life, HDIs, Gender Development Indices (GDI) measuring gender discrimination, BMIs evaluating the health status of women and the measures Weight for Age Index, Height for Age Index and Weight for Height Index gauging the nutritional status of children. All these measures are also available from NSSO, the NHDR 2001 and the State HDI Reports for judging the quality of life of the population and of the Scheduled Castes and Tribes. The Social Statistics Division of CSO has a number of regular publications presenting data on the elderly, gender differentials in different areas, progress towards the millennium development goals and a report on home-based workers (see the MOSPI website). The Sachar Commission Report on Minorities and the reports of the Commissions for (i) Scheduled Castes and Tribes, (ii) Backward Classes, (iii) Minorities and (iv) Women at the national and state levels review improvements in the quality of life of these sections of society and their mainstreaming, empowerment and physical and emotional safety, and in so doing assemble an enormous amount of data from various sources. Likewise, the Commissions for the Aged and for Children are sources of data on these groups gathered at one place from various primary sources. The Annual Reports of the Ministry of Social Justice and Empowerment, the nodal agency for the welfare/development/empowerment of all these groups and the physically and mentally challenged, are another source of data on the status of these vulnerable groups in society.
[Read 1. Sections 9.6 (pp. 323 – 327) & 9.8 (pp. 331 – 336), Chapter 9, NSC Report.]
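Two of the measures mentioned above can be briefly illustrated (standard formulations, not specific to the reports cited): the Body Mass Index is weight in kilograms divided by the square of height in metres, so an adult weighing 55 kg and 1.6 m tall has a BMI of 55/(1.6 x 1.6), or about 21.5. Indices such as weight-for-age are generally expressed relative to an international reference population, children falling well below the reference median (commonly more than two standard deviations) being classified as underweight.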


5.10 LET US SUM UP
The Indian Statistical System, with the National Commission on Statistics at the apex, collects, compiles and disseminates an enormous amount of data on diverse aspects of the Indian economy. Sections 5.4 to 5.9 have highlighted the characteristics of the database on different aspects of the Indian economy, looking specifically at certain major ones. National Accounts Statistics, consisting of estimates of national income, saving and investment and related macroeconomic aggregates, are compiled by CSO and published annually in its publication NAS; CSO also releases quick estimates and advance estimates of national income through press notes. Estimates of SDP are similarly made by SDESs. These are also available in a number of other publications like the RBI Handbook and the websites of MOSPI, RBI and the States. CSO also compiles an I-OTT, which is useful in analysing the impact of changes in a sector of the economy on the others and on the entire economy. Estimates of agricultural crop production and inputs and cost of cultivation, livestock and livestock products and data on land utilisation, irrigation, size distribution of operational holdings and their contribution to production and access to agricultural inputs are compiled by the Ministry of Agriculture and mainly disseminated through the publication ASG of DESMOA. Data on industrial production and related data flow through ASI, the IIP compiled by CSO and SDESs, the CSO publication Monthly Production of Selected Industries, EC and FuS, the Census of Small Scale Industries and the RBI and SEBI Handbooks, besides publications of EPWRF and CMIE (EIS). DGCI&S and the RBI Handbook and Bulletins are the repositories of data on trade in merchandise and services. Statistical data on government finances and India's transactions with the rest of the world are available in the RBI Handbook and Bulletins, the budget documents and the Economic Survey. The SEBI and RBI Handbooks and RBI Bulletins are fairly comprehensive sources of information on financial markets. Substantial data on different aspects of the employment and unemployment situations flow from the decennial population census and the NSSO's quinquennial EU surveys. ASI, EC and EMIP also generate data on employment, though only for parts of the economy. The educational administrative system generates a large amount of data on the educational infrastructure built up over the years and the extent of its utilisation, while the census, surveys and other regular data collection mechanisms make sizable additions to it. The health system likewise generates a lot of data that is supplemented by the census and other regular efforts, and in particular the NFHS. Data on environment are compiled and disseminated by the MOEF and CSO. Those on quality of life consist of a large body of data covering levels of consumption by MPCE classes, estimates of poverty and inequality, indices like HDI and GDI, and the status of the socially and physically challenged, minorities and the elderly. The observations of NSC on these data areas and their suggestions can be seen in the reading material cited under each Section/Sub-section. The entire NSC Report should in fact serve as a guide to anyone wishing to know about data in any area and their quality.

5.11 FURTHER SUGGESTED READING
Improvement of the database is a continuous process.
The National Statistical Commission is the most recent effort to take a look at the Indian database and identify deficiencies in it and to suggest steps to overcome these and develop new initiatives to


build a data system that delivers quality data that is reliable and timely. The National Commission on Statistics and the Indian statistical system working under its guidance are already taking steps in this direction. As a researcher and analyst requiring statistical data for your work it would be useful to be in touch with developments relating to the review, refinement and expansion of the database in different aspects/sectors of the economy. • The Journal of Income and Wealth of the Indian Association for Research in

National Income and Wealth is useful for those interested in methodological developments in the field of National Accounts and in the examination of questions of adequacy or suitability of available data for use in National Accounts work. For instance, the Journal's recent issues (Issue No. 24 – 1&2) have a paper, A Case Study on Estimation of Green GDP of Manufacturing Sector in India by S.K. Nath & Samiram Mallick, that would be relevant in the context of the emerging emphasis on environment-friendly industrial development. Another paper, Services Sector in the Indian Growth Process: Myths and Reality by Sanjay Kumar Hansda, would be useful in the current context of the perceived dominance of the services sector in GDP.

• The discrepancy between estimates of PFCE made by CSO and those made from

NSS household surveys has been the subject of discussion for a long time. The Report on the Cross Validation Study on Estimates of Private Consumption Expenditure Available from Household Surveys and National Accounts prepared by CSO and NSSO for the Study Group on Non - Sampling Errors is published in Sarvekshana, (Issue 88, Vol. XXV XXVI, No. 41, pp. 1 – 69.). Also see in this connection Section 13.4 (pp. 492 to 506) and in particular sub-sections 13.4.7 (about the above Study)(pp. 503 – 506), NSC Report.

• Report of the Committee on Capital Formation in Agriculture. • Environmental Accounting/Natural Resource Accounting are areas of interest to

economists. A look at Natural Resource Accounts of Air & Water Pollution – Case Studies of Andhra Pradesh and Himachal Pradesh and Environmental Accounting of Land and Forestry Sector – Madhya Pradesh and Himachal Pradesh would be rewarding. These are on the MOSPI website.

• A clear understanding of the concepts used in the surveys of NSSO would be very

useful for an analyst/researcher. Explanations of technical terms, their definitions and the underlying concepts in NSS socioeconomic surveys up to the 55th Round (excluding the terms used in ASI, price collection work and crop surveys) are given in Concepts and Definitions Used in NSS (May, 2001). Modifications made in definitions etc., in recent Rounds (60 onwards) are available Round wise. See NSSO/SDRD section of the MOSPI website.

• National Seminar on NSS 61st Round Results (October, 2007) Report on the

Seminar and Papers are on the MOSPI website.
• Official Statistics Seminar Series organised by MOSPI to promote new ideas.

Papers of the Second Seminar (Nov., 2004) contain suggestions to overcome deficiencies in the data in various areas and suggestions for improvement, etc. For


instance, a paper, "Reengineering ASI to improve the database of the Organised Sector" by G.C. Manna, makes a number of suggestions to tackle the NSC's observations and suggestions. See the MOSPI website for the seminar papers.

• The National Commission on Statistics Workshop on Conceptual Issues relating

to Measurement of Employment and Unemployment (Dec., 2008). Papers and the proceedings can be seen on the Commission’s section of the MOSPI website.

5.12 REFERENCES
Chandrasekhar, C.P. (Ed.) (2001): India's Socio-economic Data Base, Tulika, New Delhi.
CSO (1979): Article "Comparable Estimates of SDP – 1970-71 to 1975-76" in NAS (January, 1979), CSO, MOSPI, Govt. of India, New Delhi.
------ (1980): Article "The Status of State Income Estimates" appearing in the Monthly Abstract of Statistics (October, 1980), CSO.

------(March, 1994): National Accounts Statistics (NAS) – Factor Incomes, CSO, MOSPI, Govt. of India, New Delhi.

------(1994): NAS, 1994, CSO, MOSPI, Govt. of India.

------(1999 a): New Series on NAS (Base Year 1993-94), CSO, MOSPI, Govt. of India, New Delhi.

------(1999 b): NAS 1999, CSO, MOSPI, Govt. of India.

------(2005): NAS 2005, CSO, MOSPI, Govt. of India.

------(Feb. 2006): New Series of NAS (1999-00 prices), CSO.

------(2007): National Accounts Statistics – Sources and Methods, CSO, MOSPI , Govt. of India, New Delhi. ------(2008): NAS, 2008, CSO, MOSPI, Govt. of India.

Department of Agriculture (several years): Cost of Cultivation in India, DESMOA, Ministry of Agriculture, Govt. of India, New Delhi.
EPWRF (April, 2002): Annual Survey of Industries, 1973-74 to 1997-98 – A Database on the Industrial Sector in India, EPWRF, Mumbai.
------ (June, 2003): Domestic Products of States of India: 1960-61 to 2000-01, EPWRF, Mumbai.
------ (Dec., 2004): National Accounts Statistics of India, 1950-51 to 2002-03 – Linked Series with 1993-94 as the Base Year, EPWRF, Mumbai.

International Labour Organisation (1993): Resolution concerning Statistics of Employment in the Informal Sector, 15th International Conference of Labour Statisticians (ICLS), Geneva.

Indian Association for Research in National Income and Wealth & Institute of Economic Growth (1998): Golden Jubilee Seminar on Data Base of the Indian Economy, Delhi.

Journal of Income & Wealth (1976): Article “Mahabaleshwar Accounts of States” in the October, 1976 issue.


Katyal, R.P., Sardana, M.G., and Satyanarayana, J. (2001): Estimates of DDP, Discussion Paper 2, National Workshop on State HDRs and the Estimation of District Income Poverty Under the State HDR Project Executed by the Planning Commission (GOI) with UNDP Support held in Bangalore in July, 2001, UNDP, New Delhi.

MOSPI (2001): Report of the National Statistical Commission, MOSPI, Govt. of India, New Delhi.

NSSO (1997): Report No. 409 – Employment and Unemployment in India, 1993-94 (50th Round).
System of National Accounts 1993 (SNA 1993): Commission of the European Communities, International Monetary Fund, Organisation for Economic Cooperation and Development, United Nations, World Bank; Brussels/Luxembourg, New York, Paris, Washington, D.C.
World Trade Organisation (2007): International Trade Statistics 2007, WTO, Geneva.
5.13 MODEL QUESTIONS
1. Describe the organisational structure of the Indian Statistical System.
2. What do you understand by the term Special Data Dissemination Standards?

What data are supplied by India under SDDS? What are the release calendars for our data?

3. Why do we need to make estimates of National Income and related macro aggregates? List the various components of the system of National Accounts.

4. What is the need to make estimates of macro aggregates at current as well as at constant prices? Why is the base year changed from time to time? Which have been the base years so far for national income and related aggregates?

5. Are estimates of SDP prepared by SDESs comparable and why? What problems arise as a result? How have these been tackled?

6. What are the approaches available for preparing SDP and district domestic product? Which approach is being adopted and with what consequences?

7. What are the relationships depicted in the Input-Output Tables? How are these tables useful in research work and economic policy decisions?

8. What kind of saving data are published by CSO in National Accounts Statistics? 9. With the help of NAS 2008, explain the structure of savings in India. 10. What are the gaps/deficiencies in data that come in the way of better estimates of

saving? What steps can and should be taken to fill these gaps/deficiencies? 11. What do you mean by the term capital formation? 12. What estimates regarding capital formation are presented in the national

accounts? What are their limitations? What data gaps come in the way of compiling better estimates? What should be done to overcome these problems?

13. What are the data sources on contribution of different sectors to capital formation?

14. Explain, using national accounts, how capital formation is financed. 15. What type of data will you need to assess the performance of the Indian economy?

Explain the various sources of such data and their quality. 16. How are annual estimates of (i) crop production and (ii) crop forecast made by

DESMOA? What are NSC’s views on the procedures/methodology adopted for these?

17. Give two major sources of data on Agriculture and allied activities like dairy and fisheries and comment on their quality and time lag in availability.


18. From where and how can you get data on per-capita availability of milk, egg, wool, cow and buffaloes? What do you think of its timeliness and reliability?

19. What kind of data on inequalities is available from the Agricultural Census? Comment on the uses to which such data can be put.

20. What do you mean by the term 'cropping intensity'? 21. What data is available on subsidy given to agriculture? Comment on its accuracy.

Attempt to work out measures like those relevant to subsidy worked out in OECD countries and presented in ASG 2004. (Producer Support Estimate – PSE and %PSE).

22. What are the major efforts made to collect data on different aspects of the agricultural sector? 23. Discuss the characteristics of data flowing from the agencies involved in

compilation of agricultural data. 24. Indicate the major sources of data on levels of industrial employment. Comment

on their scope, coverage, reliability and timeliness.

25. Discuss the role of the Economic Census in the industrial database.

26. Discuss the kinds of data that ASI provides to an understanding of different aspects of the factory sector. What are its contributions to economic analysis?

27. Discuss the adequacy, quality and the representative character of the Index of Industrial Production (IIP).

28. Discuss the limitations of the time series data on small scale industries available from DCSSI.

29. Enumerate the data sources on flow of credit to different sub-sectors of industry. 30. "The detailed data available in the industrial sphere relates to the factory sector" –

explain. 31. Do you think that data available on the unorganised sector is inadequate? What

suggestion would you like to make in this regard? 32. Which are the major two sources of data on merchandise trade? What kind of

Trade Data is compiled by the two sources?

33. How are the measures Gross Terms of Trade, Net Terms of Trade and Income Terms of Trade obtained?

34. What are the reasons for divergence between the two sources of trade data? 35. Explain the terms ‘net invisibles' and ‘non factor services’. To what detail is data

on trade in services available and where? 36. Indicate the documents that provide different kinds of data on public finances 37. Indicate the various measures of deficit and the relationship between them. 38. Identify the transactions other than trade in merchandise India has with the rest of

world? 39. Which document contains the methodology for compiling liquidity

aggregates? 40. List the kind of time series data available in RBI Handbook 2008. 41. Identify the major sources of data on financial markets. 42. Name the sources that provide data on employment and unemployment. 43. Discuss the scope, reliability and utility of EMIP and live register data as a source

of data on levels of employment and levels of unemployment respectively. 44. Explain the kind of data on employment and unemployment available from the

population census 2001.


45. Explain the different measures of employment/unemployment/labour force used by NSSO in different quinquennial surveys and their use in judging quality and adequacy of employment.

46. Discuss the availability of data on adequacy and quality of employment and the utility of those that are available.

47. Identify the different aspects of labour welfare on which data are compiled by the Labour Bureau.

48. Name the organisations/institutions that collect and publish data on different aspects of education, their timeliness and quality.

49. What is the contribution of the AIES of NCERT to the education database? 50. How has the National Technical Manpower Information System (NTMIS)

improved data availability on development and utilisation of technical manpower? 51. What kind of data are published in the NHPI? Comment on their quality and

timeliness. 52. What is the Sample Registration System? What information does it supply on a

regular basis? Comment on the quality and timeliness of the data generated. 53. How have the series of NFHSs helped in developing the health information

system, especially on the health status of women and children? 54. Examine the availability of data on air, water and soil pollution in India. 55. What measures are available for measuring quality of life? Examine the adequacy

of the Indian database for developing reliable estimates of these measures. 56. Inclusive growth is very much in the news. Examine the adequacy of the database

for measuring the flow of the benefits of growth to (a) women and (b) the bottom half of the population (in the distribution by MPCE classes).


BLOCK 6 USE OF SPSS AND EVIEWS PACKAGES FOR

ANALYSIS AND PRESENTATION OF DATA
Structure

6.0 Objectives
6.1 Introduction
6.2 An Overview of the Block
6.3 SPSS Package
 6.3.1 Features of SPSS for Windows
 6.3.2 Getting Acquaintance with SPSS
 6.3.3 Menu Commands and Sub-commands
 6.3.4 Basic Steps in Data Analysis
 6.3.5 Defining, Editing and Entering Data
 6.3.6 Data File Management Functions
 6.3.7 Running a Preliminary Analysis
 6.3.8 Understanding Relationships between Variables: Data Analysis
 6.3.9 SPSS Production Facility
6.4 Statistical Analysis System (SAS)
6.5 NUDIST
6.6 EVIEWS Package
 6.6.1 EVIEWS Files and Data
  6.6.1.1 Creating a Work File
  6.6.1.2 Importing Time Series Data from Excel
  6.6.1.3 Transforming the Data
  6.6.1.4 Copying Output
  6.6.1.5 Examining the Data
  6.6.1.6 Displaying Correlation and Covariance Matrices
  6.6.1.7 Seasonality of the Series
  6.6.1.8 Estimating Equations
  6.6.1.9 Testing for Unit Roots
  6.6.1.10 ARIMA/ARMA Identification and Estimation
  6.6.1.11 Granger Causality Test
 6.6.2 Vector Auto Regression (VAR)
6.0 OBJECTIVES
After studying this block you will be able to

• describe the main features of SPSS,
• develop skill in the use of SPSS for basic statistical analysis with a special focus on the measures of central tendency, dispersion, correlation, and regression analysis,
• present the data and SPSS results graphically,
• explain the basic features of the EVIEWS package, and
• acquire skills to analyze time series data through application of various sophisticated time series methods in EVIEWS.


6.1 INTRODUCTION
We have learned in Blocks 2 to 5 that statistical analysis and interpretation of data constitute an integral part of research in economics. The methodological advances in quantitative analysis have been accompanied by a significant revolution in the computing power of desktops/laptops, which are often called PCs. Software that earlier could only be run on large mainframe computers can now be run with considerable ease on PCs. One such package is MS-Excel. You would certainly be familiar with this software and would know that it provides statistical, analytical and scientific functions. It has various features to offer, namely fast calculations, what-if analysis, charts (graphs), automatic re-calculation and many more. Besides this, the software packages used for handling large data sets, data analysis, presentation of results, etc. for policy decisions and research include RATS, SPSS, NCSS, MATLAB, LIMDEP, OX, STATA and EVIEWS. Among these sophisticated econometrics packages, SPSS and EVIEWS are popular software packages used in data analysis and data presentation in the social sciences in general and economics in particular. Hence, in this block, we shall focus on the fundamentals of SPSS and EVIEWS and the use of their statistical components. We shall also look at several statistical techniques (for quantitative and qualitative data analysis) and discuss situations in which you would use each of these techniques. Light will also be thrown on the assumptions made by each method, how to set up the analysis using SPSS/EVIEWS, as well as how to interpret the results. The Statistical Analysis System (SAS) software for quantitative data analysis will also be discussed. For qualitative data analysis we will introduce a software package called NUDIST.
6.2 AN OVERVIEW OF THE BLOCK
SPSS (Statistical Package for Social Sciences) is one of the packages often preferred by researchers and analysts for data management and detailed statistical analysis. It provides techniques for data processing and procedures for logistic regression, log-linear analysis, multivariate analysis and analysis of variance. It also equips us with procedures for constrained non-linear regression, probit, Cox and actuarial survival analysis. Further, it also creates high-quality presentations of data. SPSS also performs comprehensive forecasting and time series analysis with multiple curve-fitting models, smoothing models and methods for estimation of autoregressive functions. In the first part of this block, efforts have been made to introduce the various operational procedures relating to data processing, statistical analysis and presentation of data.
6.3 STATISTICAL PACKAGE FOR SOCIAL SCIENCES (SPSS)
Once the data has been collected, the first step is to look at it in a variety of ways. While there are many specialized software application packages for different types of data analysis (relating to scientific, commercial and financial problems), a researcher is often faced with a situation where general treatment and standard statistical analysis of quantitative data is required. SPSS (Statistical Package for Social Sciences) is one such package that is often used by researchers and analysts for data management and for exploring the data before attempting a detailed statistical analysis. It is a preferred choice for research analysis due to its easy-to-use interface and comprehensive range of data manipulation and analytical tools.


Suppose you are interested in knowing the attitude of students towards distance education and, for that, you have administered a data collection instrument (commonly known as a questionnaire) to some students. Now you want to process and analyze the data. Till recently, data were processed manually and it was indeed a cumbersome process. Fortunately, we now live in an age when high-speed computers can do the job of processing and analysis of data in a very short period of time and, of course, without errors. What you have to do is to learn some fundamental concepts used in this programme. You can then sit at the computer and process and analyze the data that you have collected by administering a questionnaire. In fact, you will find it helpful and interesting to keep the SPSS Application Guide nearby while you process and analyze your data. To help you work with SPSS, some general features are highlighted next.
6.3.1 Features of SPSS for Windows
SPSS is one of the leading desktop statistical packages. It is an ideal companion to the database and the spreadsheet, combining many of their features as well as adding its own specialized functions. SPSS for Windows is available as a base module, and a number of optional add-on enhancements are also available. Some versions present SPSS as an integrated package including the base and some important add-on modules. SPSS Professional Statistics provides techniques to examine similarities and dissimilarities in data, to classify data and to identify underlying dimensions in a data set. It includes procedures for cluster, k-means cluster, discriminant, factor, multidimensional scaling, and proximity and reliability analysis. SPSS Advanced Statistics includes procedures for logistic regression, log-linear analysis, multivariate analysis and analysis of variance. This module also includes procedures for constrained non-linear regression, probit, Cox and actuarial survival analysis. SPSS Tables creates high-quality, presentation-quality tabular reports, including stub-and-banner tables and display of multiple response data sets. The new features include pivot tables, a valuable tool for the presentation of selected analytical output tables. SPSS Trends performs comprehensive forecasting and time series analysis with multiple curve-fitting models, smoothing models and methods for estimation of autoregressive functions. SPSS Categories performs conjoint analysis and optimal scaling procedures, including correspondence analysis. SPSS also provides simplified tabular analysis of categorical data, develops predictive models, screens out extraneous predictor variables, and produces easy-to-read tree diagrams that segment a population into sub-groups that share similar characteristics. Recently, the SPSS Corporation announced the release of SPSS version 15.0. Many new add-on products have also been launched in recent months. You can consult the SPSS World Wide Web site for the latest developments and additions to the computing power of SPSS. Technical support is also available to registered users at the SPSS site. The SPSS Web site is http://www.spss.com. Select white papers on SPSS applications in major disciplines are also available on this site. The present unit discusses some of the commonly used data management techniques and statistical procedures using SPSS version 11.5. Since new features are added almost


daily, you are advised to check for these details on the currently installed version of SPSS on your computer and also consult the user manuals before undertaking complex type of data analysis. The on-line help is also available. There may be some procedures and syntax-related changes from one version to another. In case these are not available on your version of SPSS, please consult the relevant SPSS authorized representative or the WWW site of the SPSS corporation. The most recent version of SPSS is now called PASW. With this basic knowledge let us get acquainted with the SPSS.

6.3.2 Getting Acquaintance with SPSS
SPSS for Windows can be run on Windows 3.x, Windows 95/98 or later operating systems; versions of the SPSS software are also available for UNIX, Mac and mainframe environments. The illustrations in this unit are based on the SPSS version for the Windows 95/98/NT operating systems. We are assuming that SPSS is installed on your machine.
Starting SPSS
SPSS for Windows uses a graphical environment, descriptive menus and simple dialog boxes to do most of the work. It produces three types of files, namely data files, chart files and text files. To start SPSS, click the Start button on your computer. On the start menu that appears, click Programs. Another menu appears on the right of the start menu. If there is an entry marked SPSS, that's the one you want to click. If there isn't, click the program group where SPSS was installed and an entry marked SPSS will appear. Click the SPSS 11.4 entry (or whichever version is installed). You will know that SPSS has started when an SPSS Data Editor window appears. To begin with, the SPSS Data Editor window will be empty and a number of menus will appear at the top of the window. We can start the operations by loading a data set or by creating a new file for which data is to be entered from the Data Editor window. The data can also be imported from other programs like Dbase, ASCII, Excel and Lotus; we will learn about this in a little while from now.
Exiting SPSS
Make sure that all SPSS and other files are saved before quitting the program. You should exit the software by selecting the Exit SPSS command from the File menu of the SPSS Data Editor window. In case of unsaved files, SPSS will prompt you to save or discard the changes in the file.
Saving data and other files
Many types of files can be saved using the 'Save' or 'Save As' command. The various types of files used in SPSS are: Data, Syntax, Chart or Output. Files from spreadsheets or other databases can also be imported by following the appropriate procedure. Similarly, an SPSS file can be saved as a spreadsheet or in Dbase format. Select the appropriate save type command and save the file. SPSS data files are saved with .sav as the extension. Though SPSS files can be given any name, the use of reserved words and symbols is to be avoided in all types of file names.
Printing of data and output files
The contents of SPSS data files, Output Navigator files and Syntax files can be printed using the standard 'Print' command. SPSS uses the default printer for printing. In the case of network printers, an appropriate printer should be selected for printing the


output. It is suggested that inkjet or laser printers be used for printing graphs and charts. Tabular data can be easily printed using a dot matrix printer.
Operating Windows in SPSS
There are seven types of windows in SPSS which are frequently referred to during the data management and analysis stages. These are:
Data Editor
As mentioned earlier, the Data Editor window opens automatically as soon as SPSS gets loaded. To begin with, the Data Editor does not contain any data. The file containing the data for analysis has to be loaded with the help of the 'File' menu sub-commands by using the various options available for this purpose. The contents of the active data file are displayed in the Data Editor window. Only one Data Editor window will be active at a time. No statistical operations can be performed until some data is loaded into the Data Editor.

Output Navigator

All SPSS messages, statistical results, tables and charts are displayed in the output navigator. The output navigator can be opened/closed using the File Open/New Command. The output in the navigator window can be edited and saved for future reference. The Output Navigator opens automatically, the first time some output is generated. The user can customize the presentation of reports and tables displayed in the Output Navigator. The output can be directly imported into reports prepared under word processing packages, and the output files are saved with an extension xxxx.spo.

Pivot Tables

The output shown in the Output Navigator can be modified in many ways using the Edit and Pivot Table Option, which can be used to edit text, swap rows and columns, add colour, prepare custom made reports/output, create and display selectively multi-dimensional tables. The results can be selectively hidden and shown using features available in Pivot Tables.

Graphics

The Chart Editor helps in switching between various types of charts, in swapping of X - Y axis, changing colour and providing facilities for presenting data and results through various type of graphical presentations. It is useful for customizing the charts to highlight specific features of the charts and maps.

Text Editor

The text output not displayed in the Pivot Tables can be modified with the help of Text Editor. It works like an ordinary Text Editor. The output can be saved for future reference or sharing purposes.

Syntax Editor

The Syntax Editor can be opened and closed like any other file using the File Open/New command. The use of a Syntax file is recommended when the same type of analysis is to be performed at frequent intervals of time or on a large number of data files. Using a Syntax file for such purposes automates complex analysis and also avoids errors due to frequent typing of the same commands. Commands can be pasted into a Syntax file using the Paste button available in the dialog boxes; experienced users can also type the commands directly in the Syntax window. To run the syntax, select the commands to be executed and click on the Run button at the top of the Syntax window. All or some selected commands from the Syntax file will be executed. The Syntax file is saved with the extension .sps.
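For illustration only, a small syntax file of this kind might contain lines such as the following (the file name and variable names here are hypothetical and only sketch the idea):

    * Open the saved data file (hypothetical name).
    GET FILE='attitude.sav'.
    * Basic summary statistics for two assumed numeric variables.
    DESCRIPTIVES VARIABLES=age attitude /STATISTICS=MEAN STDDEV MIN MAX.
    * Frequency table and bar chart for an assumed categorical variable.
    FREQUENCIES VARIABLES=gender /BARCHART FREQ.

Selecting these lines and clicking Run executes them together, which is what makes syntax files convenient for repetitive or batch analysis.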


Script Editor

This facility is normally used by advanced users. It offers a fully featured programming environment that uses the Sax BASIC language and includes a Script Editor, Object Browser, debugging features and context-sensitive help. Scripting allows you to automate tasks in SPSS, including:

• automatically customizing output
• opening and saving data files
• displaying and manipulating SPSS dialog boxes
• running data transformations and statistical procedures using SPSS command syntax
• exporting charts as graphic files in a variety of formats.

The present module will not go into the details of the advanced features of SPSS, including scripting.

6.3.3 MENU COMMANDS AND SUB-COMMANDS
Most of the commands can be executed by making appropriate selections from the menu bar. Some additional commands and procedures are available only through the Syntax window. The SPSS user manuals provide a comprehensive list of commands which are not available through the menu-driven options. If you want a comprehensive overview of the basics of SPSS, there is an on-line tutorial, and extensive help on SPSS is available by using the 'Help' menu command. The CD version of the software contains an additional demo module.

Since SPSS is menu driven, each window has its own menu bar. While some of the menu bars are common, others are specific to a particular type of window. Table 6.1 below presents the menus and sub-menus of the Data Editor window (refer to Figure 6.1, which shows the SPSS Data Editor).

Figure 6.1: SPSS data editor
The figure shows the Data Editor menus. Each command in the main menu has a number of sub-commands.
Table 6.1: Components of the data editor menu

Menu Function/sub-commands

File Open and Save data file, to import data created in other formats like Lotus, Excel, Dbf etc. Print control functions like page setup, printer setup and associated functions. ASCII data can also be read into SPSS.

Edit These functions are similar to those available in general packages. These include undo, redo, cut, copy, paste, paste variable, find, find and replace. Option setting for the SPSS is controlled through Edit menu.


View Customize tool bars, Fonts, grid and display of data, displays option for showing value labels.

Data This is a very important menu as far as management of the data is concerned. Variable definition, inserting new variables, transposing data files, applying templates, aggregating and merging of data files, and splitting data files for specific analysis are some important commands in the Data menu.

Transform Compute new variables, recode, random number generation, ranking, time series data transformation, count and missing value analysis are undertaken using the Transform command.

Analyze As the name implies, the Analyze menu incorporates the statistical procedures. Frequency distributions, cross-tabulations, comparison of means, correlation, simple and multiple regression, ANOVA, log-linear regression, discriminant analysis, factor analysis, non-parametric tests and time series analysis are undertaken using this menu.

Graphs Includes options for generating various types of custom-made graphics like bar, pie, area, X-Y and high-low charts, Pareto charts, control charts, box-plots, histograms, P-P and Q-Q charts and time series representations of data.

Utilities Information about variables, information on the working data file, running scripts and defining sets are some of the important functions carried out through the Utilities menu.

Window The Window menu is used to switch between SPSS windows.
Help Context-specific help through dialog boxes, a demo of the software, and

information about the software are some of the important options under the Help command. It provides a connection to the SPSS home page. The Statistics Coach included in the help module is very useful in understanding the various stages of executing a procedure.

Setting the Options

SPSS provides a facility for setting up user-defined options. Use the Edit menu and then select Options. The types of option settings allowed in SPSS are illustrated in Figure 6.2. Make the appropriate changes to set the options according to your choice.

Figure 6.2: SPSS options


With this basic knowledge about commands and sub-commands, let us now learn about the basic steps in data analysis.

6.3.4 Basic Steps in Data Analysis
There are five basic steps involved in data analysis using SPSS. These are shown in Figure 6.3.

Let us review these steps.
Bring your data into SPSS: You can bring your data into SPSS in the following ways:
• Enter data directly into the SPSS Data Editor
• Open a previously saved SPSS data file
• Read spreadsheet data into the SPSS Data Editor
• Import data from DBF files
• Import data from RDBMS packages like Access, Oracle, PowerBuilder, etc.


Figure 6.3: Steps in data analysis
Select the Variables: The variables in the active file are listed each time a dialog box is opened. Select the appropriate variables for the selected procedure. Selection of at least one variable is necessary to run a statistical procedure. The variables may be numeric, string, date or logical. You should be aware that string variables cannot be manipulated to the same extent as numeric variables.

Select a Procedure from Menus: Before embarking on a statistical analysis, it is advised that you are clear as to what analysis is to be performed. Select the corresponding procedure to work on the data or create charts or tables using the selected procedure.

The command can either be directly executed or pasted into a Syntax window. As mentioned earlier, pasting the command into the Syntax window is useful for undertaking batch processing or for subsequent use, especially where the same type of repetitive analysis is required. Pasting the command will not lead to its execution; the command has to be selected and executed using the Run command.

(Figure 6.3 depicts the five steps: Step 1 – define your data; Step 2 – get your data entered into the SPSS Data Editor; Step 3 – select the variables for the analysis; Step 4 – select a procedure from the menus to calculate statistics; Step 5 – run the procedure and look at the results.)


Figure 6.4 : Variables in the active file

Run the Procedure and Examine the Output: After completing the selection process for the procedure and the variables, execute the SPSS command. Most of the commands are executed by clicking OK on the dialog box. The processor will execute the procedures and produce a report in the Output Navigator.

So then the basic steps involved in data analysis are clear. Before analysis we need to define, edit and enter data. Let us get to know about this next.

6.3.5 DEFINING, EDITING AND ENTERING DATA
As mentioned earlier, there are many options for creating SPSS data files. The data can either be entered directly through the Data Editor or imported from spreadsheets, ASCII files and RDBMS packages like Oracle and Access. Let us understand how to start SPSS and how to define, edit and enter data.

Starting the SPSS Session

Click the Start button and select SPSS 11.4 from the Programs menu, or double-click the SPSS 11.4 icon. When you start an SPSS session, the Data Editor opens automatically. The Data Editor provides a spreadsheet for entering and editing data and creating data files (see Figure 6.5).

Figure 6.5: Data editor


Important features of the Data Editor include:
• The data is arranged in the form of rows and columns in the Data Editor window.
• Rows represent cases. For example, each respondent (student/subject) is a case.
• The first column represents case numbers.
• Columns represent variables. For example, each question is a variable (sometimes it may represent more than one variable).
• Cells represent values. Each cell is defined as the intersection of a row and a column and refers to the value of a particular variable for a specified case.

Coding of Data

Before we enter data, we assign codes to the values of variables to make data entry easier. For example, in the “attitude study” gender is a variable that can take on two values. These have been coded so that “1” represents “Male” and “2” represents “Female”.

A sample Code Book is illustrated here:
Variable Name: V1
Variable Label: Gender
Value Labels: Male = 1, Female = 2
Variable Name: V2
Variable Label: Level of Education
Value Labels: Literate = 1, Primary = 2, Secondary = 3, Graduate = 4
Define Variable

Once you prepare your Code Book, you need to include it in the programme for further action to be taken. The process of including the code book is known as "Define Variable". Defining a variable involves providing:

• a name for the variable (up to 8 characters only)
• a description (label)
• a series of labels which explain the values entered (value labels)
• a declaration as to which values are non-valid and should be excluded from the statistical analysis and other operations (missing values). This information is important for understanding the non-response pattern and also for specifying the observations which should be excluded from the analysis.

Table 6.2: Variable definition table (an example of the above description)

Variable Name | Variable Label | Value Labels | Missing Values | Variable Type
STID | Student identification number | None | None | Number, 6 digits, no decimal place
Name | None | None | None | String, 24 characters long
Gender | Sex of respondent | M = Male, F = Female, X = Unknown | X | String, 1 character long
MTL | Marital status | 1 = Married, 2 = Widowed, 3 = Divorced, 4 = Separated, 5 = Never married, 9 = Missing | 9 | Number, 1 character
DOB | Date of birth | None | None | Date, dd mm yy

To define a variable, click the Variable View tab (see Figure 6.6). In the bottom left corner of the Data Editor there are two tabs, namely Data View and Variable View.

Figure 6.6: Variable view

Next, click in the first column and type the variable name, label and values. Enter the name you wish to use for the variable. In the example we have chosen the name 'V1' to stand for Gender. Next, click on the label cell and type "Gender" in the variable label cell. Then click on the values cell; a dialog box will appear (refer to Figure 6.7) in which you type the value and value label and then click the Add button. For example, type "1" in the value box, then click the value label box and type "Male", and finally click the "Add" button. Repeat for the other value of the variable. Once you have finished assigning value labels, click the Continue button.
• In case you need to change the labels you can always return to this dialog box. The Change

button can be used to change a value label. • The remove button can be used to remove a value label. • The cancel button can be used to cancel your labeling work and help button can be used to

access the SPSS on line help. Now go ahead and define the remaining variables.


Figure 6.7: Value labels dialog box

So you have seen that defining variables is easy in SPSS. Variable names and labels can be changed and altered with ease even during analysis. Any change made to the working file becomes permanent only when the data file is saved using the 'Save' or 'Save As' command.
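If you prefer working from the Syntax window, the same definitions can be written as commands. The sketch below only illustrates the idea, using the hypothetical code book given earlier (V1 = Gender, V2 = Level of Education) and assuming, purely for illustration, that 9 is used as a non-response code for V2:

    * Attach variable labels (assumed names V1 and V2).
    VARIABLE LABELS V1 'Gender' V2 'Level of Education'.
    * Attach value labels as per the code book.
    VALUE LABELS V1 1 'Male' 2 'Female'
      /V2 1 'Literate' 2 'Primary' 3 'Secondary' 4 'Graduate'.
    * Declare 9 as a missing (non-response) value for V2 - an assumption for illustration.
    MISSING VALUES V2 (9).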

Data can be entered directly using SPSS Data Editor window. However, if the data is large, you are advised to use a data entry package. The data can also be edited / changed in the data editor window. To change the value in any cell, bring the cursor to the particular cell, enter the new value and press enter. New variables can also be added and the existing variables can be deleted in the Data Editor Window. Let us learn how to enter data using SPSS data editor.

Entering Data

1. Select a cell in the Data Editor 2. Enter the data value. The value is displayed at the top of the Data Editor 3. Press Enter or select another cell to store the value.

Example

To enter the data for the “Attitude Study”, simply move the cursor to the upper-left-hand corner and enter 1 for the first respondent’s gender (male), then move the cursor one cell to the right and enter 1 for the level of education (literate), and so on. On the screen you will see like Figure 6.8.

Figure 6.8: Data file

Go ahead and enter the first 10 cases. Now save the data.
Saving the data


To save the data, from the menus choose: File → Save. Because these data have not been saved previously, you will see a dialogue box prompting you to enter a file name. Type in the name "attitude" and click the OK button. SPSS will then save the data to this file. (SPSS will automatically attach the '.sav' extension if you do not type it in.)
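The equivalent syntax, shown only as a sketch with the same hypothetical file name, would be:

    * Save the working data file.
    SAVE OUTFILE='attitude.sav'.
    * Reopen it in a later session.
    GET FILE='attitude.sav'.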

6.3.6 DATA FILE MANAGEMENT FUNCTIONS
SPSS is very flexible as far as the management of data files is concerned. While only one file can be opened for analysis at a time, SPSS provides flexibility in merging multiple data files with the same structure into one single data file, merging files to add new variables, partially selecting cases for analysis, making groups of data based on certain characteristics and using different weights for different variables. Some of these functions are discussed below. Groups of data can also be defined to facilitate the analysis of the most commonly referred-to variables (see the Utilities and Data commands).

Merging Data Files
Researchers are often faced with situations where data from different files are to be merged or where a limited number of variables from large and complex data files are required. The following types of facility are available for merging files using SPSS.

Adding variables: Refer to Figure 6.9. Adding variables is useful when two data files contain information about the same cases but on different variables. For example, the teachers' database may contain two files, one having the educational qualifications and the other having the names of the courses taught. Both files could be combined to analyze the variables available in them. Data can be combined easily on a key and unique variable present in both files. The key variable must have the same name in both data files, and both data files should be sorted on the common key variable.

Adding cases: Refer to Figure 6.9. This option is used when the data from two files having the same variables are to be combined. For example, you may record the same information for students in different study centers in India and abroad. The data can be merged to create a centralized database by using Add cases command.
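As a rough syntax sketch (the file names, the key variable teacher_id and the second study-centre file are all hypothetical), the two kinds of merge look like this; for adding variables, both files must already be sorted on the key variable:

    * Add variables: attach course information to the open qualifications file.
    SORT CASES BY teacher_id.
    MATCH FILES /FILE=* /FILE='courses.sav' /BY teacher_id.
    EXECUTE.
    * Add cases: append records collected at another study centre.
    ADD FILES /FILE=* /FILE='centre2.sav'.
    EXECUTE.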

Figure 6.9 : Merge files


Aggregate Data
The Aggregate Data command combines groups of cases into a single summary case and creates a new aggregated data file. Refer to Figure 6.10. Cases are aggregated based on the value of one or more grouping variables. The new (aggregated) file contains one record for each group. The aggregate file can be saved with a specific name to be provided by you, the user; otherwise, the default name is aggregate. For example, the data on learners' achievement could be aggregated by sex, state and region.

A number of aggregate functions are available in the SPSS. These include sum, mean, number of cases, maximum value, minimum value, standard deviation, and first and the last value. Other Summary functions include percentage and fractions below and above a particular cut-off user-defined value.
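A hedged syntax sketch of the learners'-achievement example, with assumed variable and file names, would be:

    * One summary record per sex-state-region group.
    AGGREGATE
      /OUTFILE='achieve_agg.sav'
      /BREAK=sex state region
      /mean_score=MEAN(score)
      /n_students=N.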

Figure 6.10: Aggregate data file

Split File
The researcher is often interested in the comparison of summary and other statistics based on certain group behavior. For example, in a study of learning achievement, the researcher may be interested in comparing the mean scores for students belonging to different sex groups. Sex is then taken as the grouping variable. Multiple grouping variables can also be selected; a maximum of eight grouping variables can be defined. Cases need to be sorted by the grouping variables. Two options are available for comparative analysis: compare groups, and organize output by groups. The Split File command is available under the Data menu for making such comparisons. Refer to Figure 6.10 above.
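In syntax terms, and assuming a grouping variable named sex, the same comparison can be sketched as:

    * Cases must be sorted on the grouping variable first.
    SORT CASES BY sex.
    SPLIT FILE LAYERED BY sex.
    * Run the required analyses here; output is organised by group.
    SPLIT FILE OFF.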

Select Cases
The Select Cases command can be used for selecting a random sub-sample or a sub-group of cases based on specified criteria that include variables and complex expressions. The following criteria are used for the Select Cases command:
• Select if (a condition is satisfied): variable values and their ranges
• Date and time ranges
• Arithmetic expressions
• Logical expressions
• Functions
• Row numbers

Following the Select Cases command, the unselected cases can either be deleted or temporarily filtered. Deleted cases are removed from the active file and cannot be recovered, so you should be careful while selecting the Delete option. With the Filter option, the unselected cases are only excluded temporarily. When the Select Cases option is on, this is indicated in the Data Editor window.
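A minimal sketch of a temporary selection, assuming a variable named gender coded 1 for males:

    * TEMPORARY restricts the selection to the next procedure only.
    TEMPORARY.
    SELECT IF (gender = 1).
    FREQUENCIES VARIABLES=attitude.
    * Without TEMPORARY, SELECT IF permanently drops the unselected cases from the working file.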

Next, let us review the aspects linked with running a preliminary analysis.


6.3.7 RUNNING A PRELIMINARY ANALYSIS
Before running advanced statistical analysis, it is important that you understand the salient features of your data. Use of statistical applications on a data set whose behavior is not known can give misleading conclusions. The following section explains the six characteristics which must be examined for a given data set before attempting an advanced analysis.

Six Characteristics of a Dataset
One strong argument for using computers and graphical presentation of the data is the advantage of viewing the data in a variety of ways. Preliminary exploration of data and its graphical presentation help attain these objectives. The following characteristics will help you in deciding on the best plan for data management, analysis and presentation. SPSS includes commands for analyzing data along the following lines.

Shape: The shape of the data will be the main factor in determining what set of summary statistics best explains the data. Shape is commonly categorized as symmetric, left-skewed or right-skewed, and as uni-modal, bi-modal or multi-modal. Frequency distribution, plots and graphical presentation of data, histogram, P-P, Q-Q, scatter, Box-Plot are illustrative of the techniques that can be used for determining the shape of a data set. It is important that the user should have enough knowledge of the properties of various statistical distributions, their graphical presentations, characteristics and limitations.

Location: 'Location' is a simpler, more descriptive term for the measures of central tendency. Common measures of location are the mean and the median. Measures of central tendency can also be calculated for various sub-groups of a data set.

Spread: This measure describes the amount of variation in the data. Again, an approximate value is initially sufficient, with the choice of the measure of spread being informed by the shape of the data and its intended use. Common measures of spread are the variance, standard deviation and inter-quartile range. The percentile range is another measure which is used for the measurement of dispersion.

Outliers: Outliers are data values that lie away from the general cluster of values. Each outlier needs to be examined to determine if it represents a possible value from the population being studied, in which case it should be retained, or if it is non-representative (or an error), in which case it should be excluded. You should properly weigh and carefully examine the behavior of outliers before accepting or rejecting an observation/case. The best display choice when looking for outliers is the box-plot. The range, i.e., the maximum and minimum values, can also be used to examine the behavior of outliers.

Clustering: Clustering implies that data tend to bunch around certain values. Clustering shows most clearly on a dot-plot. Histogram, stem and leaf analysis are also important procedures to examine the clustering pattern of a data set.

Association and relationship: Researchers often look for associative characteristics or similarities and dissimilarities in the behavior of some variables. For example, achievement scores and hours of study may be positively correlated whereas the teacher motivation and drop-out rate may be negatively associated with each other. Correlation coefficient is the most commonly used measure for understanding the nature and magnitude of association between two variables.

You should be clear that association does not imply relationship. A relationship is defined by a cause-and-effect type of link. Normally, there is one dependent variable and one or more independent variables in a cause-and-effect relationship. A cause-and-effect relationship is captured through regression analysis.

The analysis of data along the above lines provides considerable insight into the nature of data and also helps researchers in understanding key relationships between variables. It is assumed that the relationships are of linear type. Non-linear relationships can also be examined using non-linear techniques of analysis and also by using data transformation techniques as described next.


Data Transformation
Data transformation is a very useful aspect of SPSS. Using data transformation, you can collapse categories, recode the data and create new variables based on complex equations and conditional statements. Some of the functions are detailed below:

Compute variable:
• Compute values for numeric or string variables.
• Create new variables or replace the values of existing variables. For the new variables, you can specify the variable type and label.
• Compute values selectively for sub-sets of data based on logical conditions.
• Use built-in functions, statistical functions, distribution functions and string functions (a syntax sketch is given below).
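A minimal sketch with assumed variable names:

    * Create a total score from two assumed component scores.
    COMPUTE total_score = lang_score + maths_score.
    * Compute a new variable conditionally, using a built-in function.
    IF (mpce > 0) log_mpce = LN(mpce).
    EXECUTE.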

Recode variables
Recoding of variables is an important part of data management using SPSS. Many continuous and discrete variables need to be recoded for meaningful analysis. Recoding can be done either within the same variable or into a newly generated variable. Recoding within the same variable replaces the original values. Recoding into a new variable keeps the original variable intact and stores the recoded values in the new variable. The following example illustrates the need for and use of recoding variables.

A survey of the primary schools was conducted in Delhi. Along with other variables, information on the type of management was also collected. The management code was designed as follows:

1) Government 2) Local bodies 3) Private aided 4) Private unaided 5) Others

Let us assume that a comparative analysis of the government and the private management schools is to be undertaken. This can be done by combining categories 1 and 2 on the one hand and categories 3 and 4 on the other. This can be achieved by recoding the management code into a new variable as 1 (for categories 1 and 2) and 2 (for categories 3 and 4).

Assuming that a database on primary schools in Delhi is available, the enrolment analysis could be attempted by making suitable categories, i.e. schools with less than 50 students, 51 - 150, 151 - 250 and more than 250 students. This could be achieved by recoding the enrolment variable into a new variable ‘category’. The analysis could be attempted by changing the class range for category. If at a later stage in the analysis, it is found that a new category is to be introduced, it can again be achieved by recoding the enrolment data.
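Both recodings in the school example could be written in syntax roughly as follows (the variable names mgmt and enrol are assumed purely for illustration):

    * Government (codes 1, 2) versus private (codes 3, 4) management.
    RECODE mgmt (1,2=1) (3,4=2) INTO mgmt_grp.
    * Enrolment size classes.
    RECODE enrol (LO THRU 50=1) (51 THRU 150=2) (151 THRU 250=3) (251 THRU HI=4) INTO enrol_cat.
    EXECUTE.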

Count

Count is an important command available in SPSS and is used for counting occurrences of the same value(s) in a list of variables within the same case. For example, a survey might contain a list of books purchased (yes/no) by the students. You could count the number of 'yes' responses, or a new variable can be generated which gives the value of the count, indicating the number of books bought.

Procedure to run the Count command:
• Choose Transform from the main menu.
• Choose Count.
• Enter the name of a target variable (the variable where the count value will be stored).
• Select two or more variables of the same type (numeric or string).
• Click Define Values and specify which value(s) are to be counted.
• Click OK after the selection has been made.
In a survey on learners' achievement, the answer code to each question in language and mathematics could be recorded for each student. The codes could be '1' for a correct answer, '2' for a wrong answer and '3' for no reply. The Count command can then be used to count the number of correct answers.
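The corresponding syntax might look like this, assuming the answer codes are stored in consecutive variables q1 to q20 (hypothetical names):

    * Count how many of the twenty items carry the 'correct' code 1.
    COUNT correct = q1 TO q20 (1).
    EXECUTE.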

Rank Cases

Rank Cases command can be used to rank observations in ascending or descending order. Other options available for ranking cases are shown in the right hand panel of the Figure 6.11.
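A one-line syntax sketch, assuming a variable named score and ranking in descending order with ties given the mean rank, would be:

    RANK VARIABLES=score (D) /TIES=MEAN.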

Figure 6.11: Rank case file

Next, we shall review the graphical presentation of data.

Graphical Presentation of Data
SPSS offers extensive facilities for viewing the data and its key features in high-resolution charts and plots. From the main menu, select Graphs and the screen shown in Figure 6.12 appears. The various types of graphs that can be drawn using SPSS are indicated in the sub-commands.

Figure 6.12: Graphics command

Select a chart type from the Graphics menu. This opens a chart dialog box as shown in Figure 6.13.

Figure 6.13: Bar chart windows


After the appropriate selections have been made, the output is displayed in the output Navigator window. The chart can be modified by a double click on any part of the chart. Some typical modifications include the following:

• Edit axis titles, labels and footnotes
• Change the scale (X and Y axes)
• Edit the legend
• Add or modify a title
• Add annotations
• Add an outer frame
Another important category of charts is the High-Low chart, which is often used to represent variables like the maximum and minimum temperature in a day, stock market behavior or other similar variables. Box-plot and Error Bar charts help you to visualize distribution and dispersion. A box plot displays the median and quartiles, and special symbols are used to identify outliers, if any. An Error Bar chart displays the mean and confidence intervals or standard errors. To obtain a box-plot, choose Boxplot from the Graphs menu. A simple box plot for the mean scores obtained in English and Hindi is shown in Figure 6.14.

Figure 6.14: Box plot

The above figure shows that there were a large number of outliers in the case of the Hindi scores as compared to English. The outliers were on the higher side, which shows that many students were scoring very high marks. The sizes (numbers) of cases are shown along the X-axis. The boxes show the median and the quartile values for both tests. Scatter plots highlight the relationship between two quantitative variables by plotting the actual values along the X and Y axes. Scatter plots are useful for examining the actual nature of the relationship between these variables, which could be either linear or non-linear in form. To help visualize the relationship, you can add a simple linear or a quadratic regression line. A 3-D scatter plot adds a third variable to the relationship; you can rotate the two-dimensional projection of the three dimensions to delineate the underlying patterns. In order to obtain a scatter plot, select Scatter from the Graphs option. A histogram is obtained by selecting the Histogram option from the Graphs menu; the variable for which a histogram is to be obtained should be selected from the dialog box. The normal curve can also be displayed along with the histogram to see visually the extent of similarity between the actual distribution of values and the normal curve. Pareto and control charts are used to analyze and improve the quality of an ongoing process. You may refer to the SPSS manuals for the use of these techniques.
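The common chart types discussed above can also be requested through syntax. The following sketch assumes variables named gender, score, hours, hindi and english, all hypothetical:

    * Simple bar chart of counts by a categorical variable.
    GRAPH /BAR(SIMPLE)=COUNT BY gender.
    * Histogram with a superimposed normal curve.
    GRAPH /HISTOGRAM(NORMAL)=score.
    * Bivariate scatter plot of study hours against scores.
    GRAPH /SCATTERPLOT(BIVAR)=hours WITH score.
    * Box plots for the two test-score variables.
    EXAMINE VARIABLES=hindi english /PLOT=BOXPLOT /STATISTICS=NONE.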


6.3.8 UNDERSTANDING RELATIONSHIPS BETWEEN VARIABLES: DATA ANALYSIS

The foregoing details focused on the techniques of analysis describing the behavior of individual variables. However, most of the research studies require relationships between two or more variables to be examined. For example, one may be interested in questions like, “do the achievement scores of boys and girls differ in the same class?” Now, how to analyze this? Next, we shall review some of the various data analysis features specific to parametric and non-parametric tests.

Parametric Tests
Under this sub-section we shall review frequency tables, cross-tabulations, correlations, the t-test, ANOVA and simple regression. We begin with frequency distributions.

A. Frequency Tables
To analyze the data, from the menu choose Analyze → Descriptive Statistics → Frequencies (refer to Figure 6.15).

Figure 6.15: Frequencies window

Click the button with the picture of the right arrow. This will move the selected variables to the list on the right (see Figure 6.16).

Figure 6.16: Frequencies dialog box


Select one or more variables: To do this click the variable “V1” to select it for analysis, then optionally, you can:

• Click Statistics for descriptive statistics for quantitative variables.
• Click Charts for bar charts, pie charts and histograms.
If you click Statistics you will get a dialog box as shown in Figure 6.17.

Figure 6.17: Frequencies statistics dialog box

Now click the boxes for the statistics you wish to apply to your data, like the mean, standard deviation, etc. Then click the Continue button. Now click the OK button; it automatically opens an Output – SPSS Viewer window showing the frequency tables, as illustrated in Figure 6.18.

Figure 6.18: Showing frequency table

In case you wish to have charts, then select Graphs from the menu; in it you will see the types of charts available. Select the chart you wish to have and then the variable.
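The dialog-box choices above paste into syntax of roughly this form (the variable names V1 and V2 follow the hypothetical attitude-study code book):

    FREQUENCIES VARIABLES=V1 V2
      /STATISTICS=MEAN MEDIAN STDDEV
      /BARCHART FREQ.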

Next, let us focus on the cross-tabulation feature.


B. Cross-tabulations Cross-tabulation is the simplest procedure to describe a relationship between two or more categories of variables.

Suppose you are interested in knowing whether there is an association between two categorical variables such as "gender" and "attitude towards infant feeding". You have to cross-tabulate the two variables and use some statistical tests.

To cross-tabulate, from the menus (refer to Figure 6.15) choose: Analyze → Descriptive Statistics → Crosstabs...

You will get a dialog box as shown in Figure 6.19.

Select “Gender” for the row variable and “Attitude” for the column variable. Select one or more row variables and one or more column variables. Optionally you can:

Figure 6.19: Crosstabs dialog box

• Click Statistics for statistical tests
• Click Cells for percentages

If you click ‘statistics’ you will get statistics dialog box as shown in Figure 6.20.

Figure 6.20: Statistics dialog box


In this dialog box, click the box next to the statistical tests you wish to apply. For example, if you wish to apply the chi-square test, click the box next to Chi-square. Then click the Continue button. Next, click the Cells button. You will get the Crosstabs: Cell Display dialog box as shown in Figure 6.21. Click on the row or column (or both) percentage box. Click the Continue button, then click the OK button. The table shown in Figure 6.21 will appear in the output navigator window.
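A syntax sketch of the same cross-tabulation, assuming variables named gender and attitude:

    CROSSTABS
      /TABLES=gender BY attitude
      /STATISTICS=CHISQ
      /CELLS=COUNT ROW COLUMN.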

Figure 6.21: Showing crosstab table
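As a cross-check outside SPSS, the same cross-tabulation and chi-square test of association can be computed in Python; the data below (gender against a two-category attitude variable) are hypothetical.

# Illustrative only: cross-tabulation with a chi-square test of association,
# computed with pandas and scipy rather than SPSS. The data are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "gender":   ["M", "M", "F", "F", "M", "F", "F", "M", "F", "M"],
    "attitude": ["Favourable", "Unfavourable", "Favourable", "Favourable",
                 "Unfavourable", "Favourable", "Unfavourable", "Favourable",
                 "Favourable", "Unfavourable"],
})

table = pd.crosstab(df["gender"], df["attitude"])   # the cross-tabulation
chi2, p, dof, expected = chi2_contingency(table)    # chi-square test of association
print(table)
print(f"Chi-square = {chi2:.3f}, df = {dof}, p-value = {p:.3f}")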

Next, let us review bivariate correlation.

C. Bivariate Correlations

To obtain bivariate correlations, from the menus (refer to Figure 6.5) choose: Analyze → Correlate → Bivariate

Figure 6.22: Bivariate Correlations

Select two or more numeric variables. The following options are also available:

• Correlation coefficients
• Tests of significance
• Flag significant correlations
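For readers who wish to see the underlying computation, a Pearson correlation with its two-tailed significance level can be obtained outside SPSS as sketched below; the two score series are hypothetical.

# Illustrative only: Pearson correlation with a two-tailed significance test,
# computed with scipy rather than SPSS. The scores are hypothetical.
from scipy.stats import pearsonr

hindi   = [45, 52, 61, 70, 58, 66, 73, 80]
english = [40, 50, 55, 68, 52, 60, 71, 75]

r, p_value = pearsonr(hindi, english)
print(f"Pearson r = {r:.3f}, two-tailed p = {p_value:.3f}")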

Next, we shall review how to obtain the independent samples t-test.

D. Independent Samples T-test

To obtain an independent samples t-test, from the menus (refer to Figure 6.23) choose: Analyze → Compare Means → Independent-Samples T Test

Figure 6.23: Independent samples T-test dialog box

First select a quantitative test variable. Then select a single grouping variable and click Define Groups to specify two codes for the groups you want to compare. For example, suppose you wish to test the hypothesis "Do males and females have similar mean attitude scores?" To test the hypothesis of equality of means for two groups, we can use the t-test statistic. Figure 6.24 displays the Define Groups dialog box.

• Select "Attitude" as the test variable.
• Select "Gender" as the grouping variable.
• Click the Define Groups button.
• Enter "1" for group 1 and "2" for group 2 (as shown in Figure 6.24).
• Click the Continue button to return to the previous dialog box.
• Click the OK button to run the procedure.

Figure 6.24: Define groups dialog box
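The same comparison of two group means can be checked outside SPSS; the sketch below uses scipy with hypothetical attitude scores for the two groups.

# Illustrative only: independent samples t-test comparing mean attitude scores of
# two groups, computed with scipy rather than SPSS. The scores are hypothetical.
from scipy.stats import ttest_ind

attitude_male   = [32, 35, 30, 38, 33, 36]
attitude_female = [34, 39, 37, 40, 36, 41]

t_stat, p_value = ttest_ind(attitude_male, attitude_female)  # equal variances assumed
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.3f}")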

Comparing several means (ANOVA)

When we are interested in an independent variable that has more than two groups, we need to use analysis of variance (ANOVA).


Suppose you are interested in testing the hypothesis: "Do students in each of the three groups of religious affiliation have similar mean attitude scores?" From the menus choose Analyze → Compare Means → One-Way ANOVA. You will see the dialog box shown in Figure 6.25.

Figure 6.25: One-way ANOVA dialog box

• Select attitude as the dependent variable and religious affiliation as the factor (i.e., the independent variable).
• Click the OK button to run the procedure.
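A one-way ANOVA for three such groups can also be verified outside SPSS, as in the brief sketch below; the three sets of attitude scores are hypothetical.

# Illustrative only: one-way ANOVA comparing mean attitude scores across three
# groups of religious affiliation, computed with scipy rather than SPSS.
# The scores are hypothetical.
from scipy.stats import f_oneway

group1 = [32, 35, 30, 38, 33]
group2 = [34, 39, 37, 40, 36]
group3 = [29, 31, 28, 33, 30]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")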

Next, we shall review the simple regression procedure.

E. Linear Regression

Suppose you have a question: "How well can we predict the 'attitude' of students if we know something about their levels of education?" We need to conduct a simple regression analysis to answer this question. To conduct this analysis, from the Analyze pull-down menu, select Regression, then choose Linear... The dialog box shown in Figure 6.26 will appear.

Figure 6.26: Linear regression dialog box

Select 'attitude' as the dependent variable and students' level of education as the independent variable. Note that SPSS provides many important options that are useful in conducting regression analysis. These are available via the Statistics..., Plots..., Save..., and Options... buttons. Readers interested in learning more about regression analysis are encouraged to review


Schroeder, Sjoquist, and Stephan (1986), as well as the chapter on regression in the SPSS manuals (which details these analysis options). Click the OK button to run the procedure. The results of the regression analysis will appear in the output navigator window.

Linear regression is the most commonly used procedure for analyzing a cause-and-effect relationship between one dependent variable and a number of independent variables. The dependent and independent variables should be quantitative. Categorical variables like sex and religion should be recoded into dummy (binary) variables or other types of contrast variables. An important assumption of regression analysis is that the distribution of the dependent variable is normal. Moreover, the relationship between the dependent variable and all the independent variables should be linear, and all observations should be independent of each other.
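The recoding and estimation described above can be illustrated outside SPSS with a short Python sketch using statsmodels; the data, the education variable and the sex dummy are all hypothetical.

# Illustrative only: regression of attitude on education with a dummy variable for
# sex, estimated with statsmodels rather than SPSS. The data are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "attitude":  [30, 34, 36, 40, 42, 38, 45, 47],
    "education": [8, 10, 10, 12, 14, 12, 16, 16],
    "sex":       ["M", "F", "M", "F", "F", "M", "F", "M"],
})
df["female"] = (df["sex"] == "F").astype(int)      # recode the categorical variable as a dummy

X = sm.add_constant(df[["education", "female"]])   # add the intercept term
model = sm.OLS(df["attitude"], X).fit()
print(model.summary())                             # coefficients, R-square, ANOVA table, etc.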

SPSS provides extensive scope for regression analysis using various types of selection processes.

The method of selecting independent variables for linear regression analysis is an important choice which the researcher should consider before running the analysis. You can construct a variety of regression models from the same set of variables by using different methods.

You can enter all the variables in a single step or enter the independent variables selectively. Variable selection method is shown in Figure 6.27.

Figure 6.27: Variable selection method

This option allows you to specify how independent variables are entered into the regression analysis. The following options are available:

• Enter: enters all the variables in a single step.
• Remove: removes the variables in a block in a single step.
• Forward: enters one variable at a time based on the selected criterion.
• Backward: all variables are entered in the first instance and then one variable is removed at a time on the selected criterion.
• Stepwise: stepwise variable entry and removal examines the variables in the block at each step for entry and removal. This is a forward stepping procedure.

All the variables must pass the tolerance criterion to be entered in the equation, regardless of the entry method specified. The default tolerance limit is 0.0001. A new variable is not entered if it would cause the tolerance of another variable already in the equation to drop below the tolerance limit.


Linear Regression Statistics

The following statistics (refer to Figure 6.28) are available on linear regression models. Estimates and Model Fit are the two options which are selected by default.

Figure 6.28: Linear regression statistics

Regression coefficients: The Estimates option displays the regression coefficient β, its standard error, the standardized coefficient beta, the t-value, and the two-tailed significance level of t. The Covariance matrix option displays a variance-covariance matrix of the regression coefficients, with covariances off the diagonal and variances on the diagonal. A correlation matrix is also displayed.

Model fit: The variables entered into and removed from the model are displayed. Goodness-of-fit statistics (multiple R, R-square and adjusted R-square), the standard error of the estimate and an analysis-of-variance table are displayed.

If other options are ticked, the statistics corresponding to each of those options are also displayed in the Output Navigator. If the data do not show a linear relationship and the transformation procedure does not help, try using the Curve Estimation procedure.

Non-Parametric Tests

The non-parametric test procedures provide several tests that do not require assumptions about the shape of the underlying distribution. These include the following commonly used tests:

• Chi-square test
• Binomial test
• Runs test
• One-sample Kolmogorov-Smirnov test
• Two independent samples tests
• Tests for several independent samples
• Two related samples tests
• Tests for several related samples

Here, we shall discuss the procedure for Chi-square test only. You are advised to consult the SPSS’ users’ manual and other statistical books for detailed discussion on the other tests.

Chi-Square

The chi-square test (refer to Figure 6.29) is the most commonly used test in social science research. The goodness-of-fit test compares the observed and the expected frequencies in each cell/category to test either that all categories contain the same proportion of values or that each category contains a user-specified proportion of values.


Figure 6.29: Chi-square test

Consider that a bag contains red, white and yellow balls. You want to test the hypothesis that the bag contains all types of balls in equal proportions. To obtain the chi-square test, choose Chi-Square from Nonparametric Tests in the Analyze menu. Select one or more variables; each variable produces a separate output.

By default, all categories have equal expected values, as shown in the figure above. Categories can also have user-specified proportions. To provide user-specified expected values, select the Values option and add the expected values. The sequence in which the values are entered is very important in this case: it corresponds to the ascending order of the category values of the test variable.
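For the ball example above, the goodness-of-fit computation can be verified outside SPSS as sketched below; the observed counts are hypothetical.

# Illustrative only: chi-square goodness-of-fit test for the ball-colour example,
# testing equal proportions, computed with scipy rather than SPSS.
from scipy.stats import chisquare

observed = [18, 25, 17]               # hypothetical counts of red, white and yellow balls
chi2, p_value = chisquare(observed)   # expected frequencies are equal by default
print(f"Chi-square = {chi2:.3f}, p = {p_value:.3f}")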

6.3.9 SPSS PRODUCTION FACILITY

The SPSS Production facility provides the ability to run SPSS in an automated mode: SPSS runs unattended and uninterrupted and terminates after executing the last command. Production mode is useful if you run the same set of time-consuming analyses periodically. The SPSS Production facility uses a command syntax file to tell SPSS which commands to execute. We have already discussed the important features of the command syntax. The command syntax file can be edited in a standard text editor. To run the SPSS Production facility, quit SPSS if it is already running; the Production facility cannot be run while SPSS is running. Start the SPSS Production program from the Start menu of Windows 95/98 or a later version. Specify the syntax file that you want to use in the production job, clicking Browse to select the syntax file. Save the production job file; you can then run the production job at any time.

Next, we shall review the SAS and NUDIST packages, which are other software packages available for the analysis of quantitative and qualitative data, respectively.

6.4 STATISTICAL ANALYSIS SYSTEM (SAS)

Like SPSS, the Statistical Analysis System (SAS) package calculates descriptive statistics of your choice, e.g., mean, standard deviation, etc. SAS is available for both mainframe and personal computers. It is strong in its treatment of data, in the clarity of its graphics and in certain business applications. The various statistical procedures carried out by SAS are always preceded by the word PROC, which stands for procedure. The most commonly used SAS statistical procedures are as follows (Sprinthall et al., 1991):

• PROC MEANS: Descriptive statistics (mean, standard deviation, maximum and minimum values and so on).

• PROC CORR: Pearson correlation between two or more variables.
• PROC TTEST: t-test for a significant difference between the means of two groups.


• PROC ANOVA: Analysis of variance for all types of designs (one way, two-way and others).

• PROC FREQ: Frequency distribution for one or more variables.

As pointed out by Klieger (1984), the SAS package is comparatively more difficult to use due to its procedural complexities. For greater detail on the SAS package, you are advised to consult the books by Klieger and Sprinthall.

6.5 NUDIST

Computer programmes help in the analysis of qualitative data, especially in understanding a large (say, 500 or more pages) text database. In studies using large databases, such as ethnographies with extensive interviews, computer programmes provide an invaluable aid to research.

The NUDIST (Non-numerical Unstructured Data Indexing, Searching and Theorizing) programme was developed in Australia in 1991. This package is used for the qualitative analysis of data. Here we briefly present its main features. The software requires 4 megabytes of RAM and at least 2 megabytes of space for data files on your PC or Mac. On a PC it operates under Windows (Creswell, 1998).

As a researcher this software will help you to provide the following:

1) Storing and organizing files: First establish document files and store information with the NUDIST programme. Document files consist of transcript from an interview, notes of observation or any article scanned from a newspaper.

2) Searching for themes: Tag segments of text from all the documents that relate to a single idea or theme. For example, in a study on the effectiveness of distance education, distance learners talk about the role of academic counselors. The researcher can create a node in NUDIST called 'Role of Academic Counselors', select text in the transcripts where learners have talked about this role, and merge it into that node. The information retained in this node can then be printed to show the different ways in which learners talk about the role of academic counselors.

3) Crossing themes: Taking the same example of the role of counselors, the researcher can relate this node to other nodes. Suppose the other node is qualifications of counselors, with two categories such as graduate and postgraduate. The researcher can ask NUDIST to cross the two nodes, role of counselors and qualifications of counselors, to see, for example, whether graduate counselors and postgraduate counselors differ in the roles they play. NUDIST generates a matrix, with the information in the cells reflecting the different perspectives.

4) Diagramming: In this package, once the information is categorized, the categories are identified and developed into a visual picture that displays their interconnectedness. This is called a tree diagram in the NUDIST software. A tree diagram is a hierarchical tree of categories, with the root node at the top and parent and sibling nodes below it. The tree diagram is a useful device for discussing the data analysis of qualitative research at conferences.

5) Creating a template: In qualitative research, at the beginning of data analysis the researcher creates a template, which is an a priori code book for organizing information.

For further details on the NUDIST software you may like to consult the following:

Kelle, E. (ed.), Computer Aided Qualitative Data Analysis, Thousand Oaks, CA: Sage, 1995.
Tesch, R., Qualitative Research: Analysis Types and Software Tools, Bristol, PA: Falmer, 1990.


6.6 EVIEWS PACKAGE

EVIEWS stands for Econometric Views. It is a newer version of a statistical package for manipulating time series data, originally the Time Series Processor (TSP) software for large mainframe computers. As an econometric package, EVIEWS provides data analysis, regression and forecasting tools. EVIEWS can be useful for multipurpose analytics, but this introduction will focus on financial time series econometric analysis. Once you get familiar with EVIEWS, the program is very user friendly.

6.6.1 EVIEWS Files and Data: In this section, we will describe how to create a new workfile and import data into EVIEWS. The various ways of handling data in the workfile are as follows.

6.6.1.1 Creating a workfile: Before working on any analysis, one must first create a so-called workfile, which must be of exactly the same size and type as the data you would like to work with. After the workfile is created, EVIEWS will let you import data into it from Excel, Lotus, ASCII (text) files, etc. Data from other software packages such as SAS, SPSS, M-FIT, RATS, etc. cannot be directly imported into EVIEWS. To create a workfile click File → New → Workfile and the following dialog box will appear.

If one is working with time series data, then one needs to know the frequency of the data (daily, weekly, monthly, annual, etc.) as well as the start and end dates for the data. In the case of cross-sectional data, one needs the total number of observations; one should choose undated or irregular and enter the start and end observations in the appropriate text boxes. Let us take an example where time series data are imported from an Excel file using the import function.


6.6.1.2 Importing Time Series Data from Excel

We have created a data file in Excel. The Excel file has been saved in the path….. The screenshot of the Excel file is as follows:

Now the following five-step procedure should be used for importing time series data into the EVIEWS software.

1. Examine the contents of the Excel file and note:
o the start date and end date of the observations
o the total number of observations
o the cell where the data start (usually B2)
o the names of the variables and the order in which they appear
o the sheet name and the path where the file has been saved.

The example has daily (5-day week) data with a start date of April 1991 and an end date of December 2008. The data start at B2 in the Excel sheet called Sheet 1. There are 16 variables. Some of them have very long names, and it is good to make them shorter, which makes the data easier to import into the EVIEWS workfile and easier to work with later.

2. Create a new workfile as per the above instructions. One should choose daily (5-day week) and enter the start date and the end date. For daily frequency data one would enter dates in MM/DD/YYYY format. For quarterly data, one would enter, e.g., 1999:3 for the third quarter of 1999. Monthly data follow the same pattern, i.e., 1999:3 means March 1999. In the case of data with irregular frequency, one would enter the total number of observations. After clicking OK you should end up with the workfile shown below.


As noted in the above workfile, the range as well as the sample is the period between 1st Jan 1998 and 7th July 2009. There are always two default series, C and RESID. C is the series that will contain the coefficients from the last regression equation that one has estimated. RESID is the series that will contain the residuals from your last estimated model.

3. Click Procs → Import → Read Text-Lotus-Excel. In the Open dialog box choose the Excel format and browse for the file. Select the file and open it. It is important to mention here that one should close the Excel file before trying to import it into EVIEWS; otherwise there will be an error message.

4. A dialog box now appears in which it is very crucial to enter the correct information. Any mistake could result in an incomplete or even wrong dataset. This is where our earlier check of the Excel file becomes very important. In this example the dialog box should be filled out as follows:


The order of the data is by observation, with series in columns. The upper-left data cell is B2 and the sheet name is Sheet 1. If there is only a single sheet (as in this example) it is not necessary to name it. The names of the series/variables have been changed (notice that no spaces are allowed in the names) in order to make them easier to work with. However, if you would like to import the names given in the Excel file, you simply enter the number of variables (in this case 8). These names can then be changed in EVIEWS using the Rename function. However, using this method can cause problems if, for example, the names start with a number and are very similar (e.g. names such as 30 day return, 5 day price change, etc.). The sample to import is taken from the workfile. Here it is possible to exclude periods, which can be useful in case you would like to get rid of any outliers. The workfile that you should have by now is as follows:


It contains a list of the 9 imported variables in alphabetical order, as well as the two series for the estimated coefficients and the residuals. It is always better to check whether all the variables have been imported correctly, because common import errors include rows with "NA" and numbers that are too high or too low. Another useful check is to open a set of variables, or all the variables, as a group. One can do this with the following steps:

o clicking the first variable (of your choice)
o holding down the [Ctrl] key and clicking the other variables (in any order)
o clicking View → Open as one window → Open Group

or simply right-clicking or double-clicking on either of the selected series and then clicking Open Group.

5. If you are certain that you have imported the data correctly into the EVIEWS workfile, you can now save this workfile by clicking File → Save As. The workfile will be saved in EVIEWS' own wf1 format. A saved workfile can be opened later by selecting File → Open → Workfile from the main menu.

6.6.1.3 Transforming the Data: While working on time series data, it is often very useful to transform the existing variables to take care of scale and size and for other purposes. This can be done in EVIEWS using the [Genr] button in the top right-hand corner of your workfile. For example, suppose one would like to work with stock returns and has imported stock price data into the EVIEWS workfile. The stock return is defined as follows:

RSt = ln(St) – ln(St-1)

where RSt represents the stock price return and St and St-1 are the stock prices in periods t and t-1. This continuously compounded return can quickly be calculated by means of the DLOG function. Simply click the [Genr] button and enter the equation below, followed by OK.

This will create the variable RET_STOCKA and include it in the workfile. You can view the returns by double clicking the variable.


Apart from DLOG, a number of other mathematical functions, as well as simple addition, subtraction, division and multiplication, are naturally available. Frequently, one also needs the price differential (first difference) and the lagged value of a series, for example for unit root testing. These can be generated by entering the following expressions via the [Genr] button:

Price differential = Stock_A – Stock_A(-1)
Lag = Stock_A(-1)
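Outside EVIEWS, the same transformations can be reproduced with a few lines of Python; the series stock_a and its values below are hypothetical.

# Illustrative only: log returns, first differences and lags computed with
# pandas/numpy rather than EVIEWS' Genr. The series and its values are hypothetical.
import numpy as np
import pandas as pd

stock_a = pd.Series([100.0, 102.5, 101.8, 104.2, 103.6])

ret_stocka   = np.log(stock_a).diff()   # continuous return, analogous to EVIEWS' DLOG
price_diff   = stock_a.diff()           # first difference: Stock_A - Stock_A(-1)
stock_a_lag1 = stock_a.shift(1)         # one-period lag: Stock_A(-1)

print(pd.DataFrame({"return": ret_stocka, "diff": price_diff, "lag": stock_a_lag1}))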

6.6.1.4 Copying Output: Any graph or equation output can easily be copied into a Word document. To copy a table, simply select the area you want to copy and click Edit → Copy. A dialog box should appear, where you would usually select the first option, Formatted – copy numbers as they appear in the table; then you go to Word/Excel, paste the selected area and change the size of the output until it suits your document. To copy a graph, click on it and a blue border should appear; then click Edit → Copy. In the dialog box that appears, click copy to clipboard and then paste into Word/Excel. Again the output can be adjusted to a suitable size.

6.6.1.5 Examining the Data: EVIEWS can be used for examining the data in a variety of ways. This is demonstrated as follows.

Displaying Line Graphs: If you want to select a few variables and display a line graph of each of the series, you can follow the example below, based on the previously mentioned EVIEWS workfile. In this example, we want to view four time series in the workfile: BSE 100, REER, NEER and SENSEX. The procedure is to highlight the four variables (using the mouse and


[Ctrl] key), followed by a double or right click. Then you click Open Group and click the [View] button in the spreadsheet that appears. From this menu, you can click Multiple Graphs → Line, and the four line graphs depicted below will appear. As one can see, there are other choices of graphs as well. In general, clicking the [View] button mentioned above offers you many options for viewing your selected data.

If you would like to save the output in your workfile for later use, you first click the [Freeze] button. In the new window which appears, click the [Name] button; in the dialog box, enter a name for the output and click OK. The output now appears with a graph icon in your workfile.

Drawing a Scatter Plot: The procedure is similar to the one just mentioned, but with a scatter plot one can use other options as well, for example scatter plots in connection with a regression model. We can show an example from our previously mentioned workfile. This is as follows:


Obtaining Descriptive Statistics and Histogram: One can obtain the descriptive statistics and histogram of a series by double clicking the series in the workfile. In the spreadsheet that appears, click the [View] button and choose Descriptive Statistics → Histogram and Stats. If you want to obtain descriptive statistics for several series at a time instead, highlight the relevant series (using the mouse and the [Ctrl] key), double or right click and choose Open Group. In the spreadsheet that appears, click the [View] button and choose Descriptive Statistics → Individual Samples. This procedure will not give you the histograms, however. The descriptive statistics and histogram of one time series variable (SENSEX) are as follows:


6.6.1.6 Displaying Correlation and Covariance Matrices: The easiest way to display correlation and covariance matrices is to highlight the relevant series (using the mouse and [Ctrl] key) and then click Quick → Group Statistics → Correlations (or Covariances if you want the covariance matrix). This creates a new group and produces a common sample correlation/covariance matrix. If a pairwise correlation/covariance matrix is more suitable, it is produced by clicking the [View] button and choosing Correlations (or Covariances) → Pairwise Samples. An example of this is given below.

6.6.1.7 Seasonality of the Series: Before undertaking any time series econometric analysis, it is of the utmost importance to deseasonalize the data, i.e., to remove the seasonal fluctuations, if the frequency of the time series is quarterly, monthly, etc. Seasonality is one of the major features of such time series. Several methods are available for removing the seasonal fluctuations or deseasonalizing the data; these are listed below (a simple illustration follows the list).

• Census X12
• X11 (Historical) method
• Moving average method
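As a rough outside-EVIEWS illustration of the moving average idea, the sketch below decomposes a hypothetical monthly series with statsmodels and removes the estimated seasonal component; Census X12 and X11 are more elaborate procedures than this.

# Illustrative only: a simple moving-average (classical) decomposition of a
# hypothetical monthly series, followed by removal of the seasonal component.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2000-01", periods=48, freq="M")
y = pd.Series(100 + 0.5 * np.arange(48)
              + 10 * np.sin(2 * np.pi * np.arange(48) / 12), index=idx)

result = seasonal_decompose(y, model="additive", period=12)  # moving-average based
deseasonalized = y - result.seasonal                         # remove seasonal fluctuations
print(deseasonalized.head())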


To perform the seasonal test, you can select any variable from the workfile and, by double clicking on it, open the data for that variable. Then, by clicking on [Procs], you can find all the above-mentioned methods to check and remove the seasonality of the series. An example of the seasonality test is as follows:

6.6.1.8 Estimating Equations: In the following illustration, we will demonstrate how to estimate a regression model in EVIEWS. When you have opened your workfile, click on the [Objects] button, select New Object → Equation, and the following dialog box will appear.

Alternatively, you could have clicked Quick → Estimate Equation. Say we want to estimate a regression equation relating stock prices and exchange rates. In the example below, SENSEX is the dependent variable and NEER and REER are the independent variables. You can enter the model in two ways. In the first, you list the dependent variable, followed by C for the intercept term and then the independent variable(s). There must be a single space between variables, so we enter the following regression into the equation window.


SENSEX C NEER REER

After you have entered your equation, select the estimation method and your sample period. Then just click OK to get the following output.

6.6.1.9 Testing for Unit Roots: The non-stationary nature of most time series data, and the need to avoid the problem of spurious or nonsense regression, call for an examination of their stationarity properties. In brief, variables whose mean, variance and autocovariances (at various lags) change over time are said to be non-stationary time series, or unit root4 variables. Alternatively, a time series is stationary if its mean, variance and autocovariances (at various lags) are time-independent. Dickey and Fuller (1979) consider three different regression equations that can be used to test for the presence of a unit root:

ΔYt = γ Yt-1 + εt                          … (1) (None)
ΔYt = a0 + γ Yt-1 + εt                     … (2) (Intercept)
ΔYt = a0 + γ Yt-1 + a2t + εt               … (3) (Trend and Intercept)

In the above specifications, the difference among the three regressions concerns the presence of the deterministic elements a0 and a2t. The first is a pure random walk model; in the second an intercept or drift term has been added; and the third equation includes both a drift and a linear time trend. The parameter of interest in all the regression equations is γ; if γ = 0, the {Yt} sequence contains a unit root. The test involves estimating one or more of the equations above using OLS in order to obtain the estimated value of γ and its associated standard error. Comparing the resulting t-statistic with the appropriate value reported in the Dickey-Fuller tables allows us to determine whether to accept or reject the null hypothesis γ = 0. In conducting the Dickey-Fuller test as in equations (1), (2) and (3), it was assumed that the error term εt was uncorrelated. In case the error term εt is autocorrelated, Dickey and Fuller have developed a test known as the Augmented Dickey-Fuller (ADF) test. The ADF test may be specified as follows:

4 The term unit root refers to the root of the polynomial in the lag operator.


ΔYt = a0 + a1t + γ Yt-1 + Σ(i=1 to k) βi ΔYt-i + εt        …. (3.1)

where εt is a pure white noise error term, Δ is the difference operator, and γ and β are the parameters. In the ADF test, we still test whether γ = 0, and the ADF test follows the same asymptotic distribution as the DF statistic, so the same critical values can be used. This test is quickly done in EVIEWS by double clicking the relevant time series to go to the spreadsheet view. Here, click the [View] button and select Unit Root Test. Alternatively, you can click the [Quick] button and select the series and Unit Root Test. The following dialog box should appear.


In the above dialog box, you first have to select the test type. Next, select whether you want to test for a unit root at the level, first difference or second difference. Finally, you can choose whether to perform the unit root test with none, with intercept, or with trend and intercept (the equations are explained above). Now we will examine whether a time series is stationary or not. In the following example we have examined the unit root of the stock index (SENSEX), both at the level and at the first difference, using the Augmented Dickey-Fuller test. The results are as follows:


From the above result, the ADF test statistic (-4.083999) exceeds the critical values in absolute terms at all levels of statistical significance. Hence the null hypothesis of a unit root is rejected and SENSEX is stationary at its first difference.
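The same test can be run outside EVIEWS; the sketch below applies statsmodels' ADF test to a hypothetical random-walk series at its level and first difference.

# Illustrative only: Augmented Dickey-Fuller test at the level and the first
# difference, computed with statsmodels rather than EVIEWS. The series is a
# hypothetical random walk standing in for SENSEX.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

np.random.seed(0)
sensex = pd.Series(np.cumsum(np.random.normal(size=500)) + 10000)

for name, series in [("level", sensex), ("first difference", sensex.diff().dropna())]:
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(series, regression="c")
    print(f"{name}: ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}, "
          f"5% critical value = {crit['5%']:.3f}")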

6.6.1.10 ARIMA / ARMA Identification and Estimation: The identification of an ARIMA model is done by examining a correlogram. In EVIEWS you obtain a correlogram for a variable by double clicking the variable to open the spreadsheet view. Here, click the [View] button and choose Correlogram. In the dialog box that appears, choose between level, first difference or second difference and then enter the desired number of lags to include. An example of the correlogram for the variable SENSEX is shown below.

ARIMA estimation, or the so-called Box-Jenkins methodology (Madsen, 1992; Maddala, 1992), consists of four steps: identification, estimation, diagnosis and forecasting. For more details, the reader may refer to any standard textbook.

6.6.1.11 Granger Causality Test: Granger causality may be defined as a forecasting relationship between two variables, proposed by Granger (1969) and popularised by Sims (1972). In brief, the Granger causality test states that if S and E are two time series variables and past values of the variable S significantly contribute to forecasting the value of the other variable E, then S is said to Granger-cause E, and vice versa. The test involves the following two regression equations:


St = γ0 + Σ(i=1 to n) αi St-i + Σ(j=1 to n) βj Et-j + u1t        … (4)

Et = γ1 + Σ(i=1 to m) λi Et-i + Σ(j=1 to m) δj St-j + u2t        … (5)

where St and Et are the stock price and exchange rate to be tested, u1t and u2t are mutually uncorrelated white noise errors, and t denotes the time period. Equation (4) postulates that current S is related to past values of S as well as of past E. Similarly, equation (5) postulates that E is related to past values of E and S. The null hypothesis for equation (4) is that there is no causation from E to S, i.e., that the coefficients on the lagged E terms are jointly insignificant, Σ βj = 0. Similarly, the null hypothesis for equation (5) is that there is no causation from S to E, i.e., that the coefficients on the lagged S terms are jointly insignificant, Σ δj = 0. Three possible conclusions can emerge from such an analysis: unidirectional causality, bi-directional causality, or independence of the two variables. The Granger causality test can easily be performed in EVIEWS. In the workfile, you can select the group of variables of your choice, click the [View] menu and select Granger Causality. The result of the Granger causality test should appear as follows:
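Outside EVIEWS, pairwise Granger causality tests can be run as in the sketch below; the two series standing in for the stock price S and the exchange rate E are simulated, purely for illustration.

# Illustrative only: pairwise Granger causality tests with statsmodels rather than
# EVIEWS. The series are simulated; S is constructed to depend on lagged E.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

np.random.seed(1)
e = np.random.normal(size=300)
s = 0.5 * np.concatenate(([0.0], e[:-1])) + np.random.normal(size=300)
data = pd.DataFrame({"S": s, "E": e})

# Tests whether the second column (E) Granger-causes the first column (S), up to 2 lags.
grangercausalitytests(data[["S", "E"]], maxlag=2)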

6.6.2 Vector Auto Regression (VAR): By its very construction, a VAR system consists of a set of variables, each of which is related to lags of itself and of all the other variables in the system. In other words, a VAR system consists of a set of regression equations, each of which has an adjustment mechanism such that even small changes in one variable in the system may be accounted for automatically by adjustments in the rest of the variables in the system. Thus, a VAR provides a fairly unrestricted approximation to a reduced-form structural model without assuming beforehand that any of the variables is exogenous. By avoiding the imposition of a priori restrictions on the model, the VAR adds significantly to its flexibility.


A VAR in the standard form is represented as:

St = a10 + a11 St-1 + a12 Et-1 + e1t
Et = a20 + a21 St-1 + a22 Et-1 + e2t

where

• St is the stock price at time period t,
• Et is the exchange rate at time period t,
• ai0 is element i of the vector A0,
• aij is the element in row i and column j of the matrix A1, and
• eit is element i of the vector et; e1t and e2t in the equations above are white noise error terms with zero mean and constant variance, and each is individually serially uncorrelated.

Steps of VAR:

• To start with, the VAR estimation procedure requires the selection of the variables to be included in the system. The variables included in the VAR are selected according to the relevant economic model.
• The next step is to verify the stationarity of the variables.
• The last step is to select the appropriate lag length. The lag length of each of the variables in the system has to be fixed; for this we use the Likelihood Ratio (LR) test.

After setting the lag length, we are now in a position to estimate the model. It may be noted, however, that the coefficients obtained from the estimation of a VAR model cannot be interpreted directly. To overcome this problem, Litterman (1979) suggested the use of innovation accounting techniques, which consist of impulse response functions (IRFs) and variance decompositions (VDs).

Impulse Response Function: The impulse response function is used to trace out the dynamic interaction among variables. It shows the dynamic response of all the variables in the system to a shock or innovation. For computing the IRFs, it is essential that the variables in the system are ordered.

Variance Decomposition: Variance decomposition is used to detect the causal relations among the variables. It explains the extent to which a variable is explained by the shocks to all the variables in the system. The forecast error variance decomposition gives the proportion of the movements in a sequence that is due to its own shocks versus shocks to the other variables.

Now we can take an example of VAR modelling between stock prices and exchange rates. Let us say we have considered SENSEX and BSE 100 to represent the stock market, and NEER and REER to represent the effective exchange rates. In the EVIEWS workfile, you can select the group of variables and open them as a group. Then you click [Quick] and select Estimate VAR. The following dialog box should appear. Enter all the variables


under endogenous variables. Choose the optimum lag length as per the lag length criteria (discussed below) and click OK.

After clicking OK, the following output will be generated. As mentioned above, after generating this output, click [View] and select Lag Structure → Lag Length Criteria. This is given as follows:

Then the following output on lag length will be generated for the various statistical lag selection criteria. One can choose the optimum lag length as per any given criterion.


For the impulse response function, you can click [View] and select Impulse Response... Then the following dialog box should appear.

Now you can select the display format of the output, either Table or Graph. Then select the impulse and response variables, along with the number of periods ahead for the forecast. While doing so, the following dialog box should appear.

In the above dialog box, we have selected the output in multiple-graphs format. By clicking OK, the following output will be generated.


In a similar manner, you can generate the variance decomposition output, which is shown below.
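For readers working outside EVIEWS, the whole sequence (estimating a small VAR, choosing the lag length, and obtaining impulse responses and variance decompositions) can be sketched with statsmodels as below. The four series merely stand in for SENSEX, BSE 100, NEER and REER and are simulated, stationary data; the lag length of 2 is chosen only for illustration.

# Illustrative only: a VAR with lag selection, impulse responses and forecast error
# variance decomposition, computed with statsmodels rather than EVIEWS.
# The data are simulated stand-ins for SENSEX, BSE 100, NEER and REER.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

np.random.seed(2)
data = pd.DataFrame(np.random.normal(size=(300, 4)),
                    columns=["SENSEX", "BSE100", "NEER", "REER"])

model = VAR(data)
print(model.select_order(maxlags=8).summary())   # AIC, BIC, FPE, HQIC lag length criteria

results = model.fit(2)                           # estimate the VAR with the chosen lag length
irf = results.irf(10)                            # impulse responses, 10 periods ahead
irf.plot()                                       # multiple-graph display (requires matplotlib)

fevd = results.fevd(10)                          # forecast error variance decomposition
fevd.summary()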