Introduction to elementary quantitative concepts and methods Guest lecture Carl Henrik Knutsen, 14/5-2008


Motivation

• Social sciences, and science in general: We are generally interested in:

– “How” questions

– “Why” questions.

• Social scientists seek descriptions of empirical phenomena and try to come up with causal explanations. Both quantitative and qualitative methodology try to respond to such questions.

• The nature of the research question is important for the choice of methodology, even if in the real world of social science researchers often choose a method according to their knowledge and “taste”.

• Knowledge of different methodologies allows researchers and students to fit the methodology to the research question, which improves the analysis.

• Triangulation can often be a good idea: Usage of different methodologies to illuminate a problem in a more comprehensive fashion.

• Knowledge of elementary quantitative methods enables you to read different types of research.

Causality and the control problem

• Independent of choice of methodology

• Theory and clever design needed

• Three causal structures that might lead to correlation:

– X → Y (X causes Y)

– X ← Y (Y causes X)

– X ← Z → Y (Z causes both X and Y)
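The case where a third variable Z drives both X and Y can be illustrated with a small simulation. This is only a sketch under assumed conditions: the function names, the sample size and the noise model are invented for illustration and are not part of the lecture.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_confounding(n=5000, seed=1):
    """Hypothetical setup: Z causes both X and Y; X has no effect on Y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]  # X depends only on Z and noise
    y = [zi + rng.gauss(0, 1) for zi in z]  # Y depends only on Z and noise
    return pearson_r(x, y)

r_xy = simulate_confounding()
print(round(r_xy, 2))  # clearly positive although X does not cause Y
```

In this setup the correlation between X and Y comes entirely from Z, which is why a correlation alone cannot tell the three structures apart.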

Generalization

• The big advantage of quantitative methods

• Provides stringent criteria for when we can be relatively certain that our generalizations hold true and are not driven by coincidences.

• Remember that in the social sciences, we do not face deterministic relationships between factors. Quantitative methods take into account the stochastic structure of social life.

Data

• There exists a vast number of data sources constructed by different agencies or researchers: You do not need to construct your own data for many purposes. But: Know the data you use in order to avoid pitfalls.

• Sources on the web: World Development Indicators, Penn World Tables, World Governance Indicators, Polity, Freedom House, OECD, UNESCO, UNCTAD etc!

Descriptive statistics

• Descriptive vs inferential statistics

• Descriptive statistics: Draw out comprehensible information about the structure of your data

• 1) Central tendencies, 2) variation, 3) correlation

Central tendency of variable

• Mean

• Median

• Mode
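A minimal sketch of the three measures, using Python's standard `statistics` module. The income values are made up for illustration.

```python
import statistics

# Made-up incomes; the single large value pulls the mean upwards.
incomes = [100, 150, 150, 250, 300, 850]

mean = statistics.mean(incomes)      # arithmetic average
median = statistics.median(incomes)  # middle value of the sorted data
mode = statistics.mode(incomes)      # most frequent value

print(mean, median, mode)
```

With skewed data like these, the mean lies well above the median, which is one reason to report more than one measure of central tendency.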

Variation

• Range

• Variance: S^2 = Σ(X - M)^2 / (N - 1)

• Standard deviation: S = √(S^2)
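The measures of variation can be computed directly from their definitions. The sketch below uses the N - 1 denominator from the variance formula above; the data values and function name are assumptions for illustration.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by N - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

values = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up observations
value_range = max(values) - min(values)  # largest minus smallest value
s2 = sample_variance(values)             # variance
s = math.sqrt(s2)                        # standard deviation

print(value_range, round(s2, 3), round(s, 3))
```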

Correlation

• Covariance: cov(x,y) = Σ((X - Xm)(Y - Ym)) / (N - 1)

• Correlation coefficients

• Pearson’s r = cov(x,y) / (S(x) * S(y)): Always between -1 and 1. NB: Gives only the degree of linear relationship.
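The point that Pearson’s r only captures linear relationships can be checked with a short sketch. The function names and the two toy data sets are assumptions for illustration: one relationship is exactly linear, the other is a perfect U-shape.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the N - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def std_dev(xs):
    """Sample standard deviation (square root of the variance)."""
    return math.sqrt(covariance(xs, xs))

def pearson_r(xs, ys):
    """Covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

x = [1, 2, 3, 4, 5]
y_linear = [2 * v + 1 for v in x]     # exact linear relationship
y_curved = [(v - 3) ** 2 for v in x]  # exact but non-linear (U-shaped)

print(pearson_r(x, y_linear))  # approximately 1
print(pearson_r(x, y_curved))  # approximately 0
```

In the second case Y is fully determined by X, yet r is zero, so a low r does not by itself show that two variables are unrelated.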

Presentation of data

• Tables

• Histogram

• Bar- and pie-charts

• Scatter plots

• Important to think about the reader: Charts should be comprehensible and informative. Need to strike a balance on the amount of information presented in a chart. Label charts.

Table

                   Male                                Female
                   No higher education   University    No higher education   University
Mean income (N)    150 (2000)            300 (1000)    100 (2500)            250 (700)

Scatter plot

[Figure: scatter plot of countries, each point labelled with the country name. One axis is labelled “Inverted and normalized FHI”; both axes run from 0 to 100.]

Inferential statistics

• The aim is solid inference from an observed sample to a larger (unobserved) universe. Generalization about populations or about effects.

• For effects: Can we say that trajectories we observe are due to “real” effects or are they likely only a product of chance?

• Law of large numbers...

– Population, samples

– Estimates and underlying mean.

• Random selection? Selection bias ALWAYS a possibility.

• Sampling techniques:

– Experiment

– Random draws

– Stratification
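The law of large numbers mentioned above can be illustrated by drawing random samples of increasing size and comparing the sample mean (the estimate) with the known underlying mean. This is a sketch under assumed conditions: the uniform distribution, the sample sizes and the names are chosen only for illustration.

```python
import random

def sample_mean(n, rng):
    """Mean of n independent draws from Uniform(0, 1); the true mean is 0.5."""
    return sum(rng.random() for _ in range(n)) / n

rng = random.Random(42)

# Absolute error of the estimate for small, medium and large random samples.
errors = {n: abs(sample_mean(n, rng) - 0.5) for n in (10, 1000, 100000)}
for n, err in errors.items():
    print(n, round(err, 4))
```

Large random samples tend to give estimates close to the underlying mean, although any single small sample can happen to be close or far off.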

Hypothesis test

• Democracy and economic growth as example.

– H0: Democracy has no effect on growth

– Halt (the alternative hypothesis): Democracy has an effect on growth

• In general, H0 is often a hypothesis which claims that there is no effect. We often want to investigate whether we can claim with relative certainty that Halt is valid.

• Burden of proof is on the alternative hypothesis. Conservative bias: we need relatively strong results to claim that a relationship is not due to pure chance.

• The central limit theorem underlies the test. How do we know the distribution given H0? We use a given distribution, the normal distribution, to find out what one is likely to arrive at by pure chance.
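The question of what one is likely to arrive at by pure chance if H0 is true can also be approached by simulation. The sketch below uses a permutation test, which is a different technique from the normal-distribution approach named above; it is shown only because it makes the logic visible. The growth rates are invented for illustration and are not real data.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Share of random relabellings whose absolute mean difference is at
    least as large as the observed one (a simulated two-sided p-value)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # under H0 the group labels carry no information
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            hits += 1
    return hits / n_perm

# Hypothetical annual growth rates in percent, invented for illustration only.
democracies = [3.1, 2.4, 4.0, 3.6, 2.9, 3.8, 4.2, 3.3]
non_democracies = [1.2, 0.4, 2.0, 1.5, 0.9, 1.8, 0.2, 1.1]

p = permutation_p_value(democracies, non_democracies)
print(p)
```

A small value means that a difference this large rarely arises when group membership is assigned by chance, which is the kind of evidence needed to reject H0.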

Central limit theorem

• “The central limit theorem is one of the most remarkable results of the theory of probability. In its simplest form, the theorem states that the sum of a large number of independent observations from the same distribution has, under certain general conditions, an approximate normal distribution. Moreover, the approximation steadily improves as the number of observations increases. The theorem is considered the heart of probability theory, although a better name would be normal convergence theorem.” http://davidmlane.com/hyperstat/normal_distribution.html (Berrie Zielman)
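The statement in the quote can be checked with a small simulation: sums of many independent draws from a non-normal distribution should behave roughly like a normal variable. This is a sketch under assumed conditions; the uniform distribution, the number of terms and the function name are chosen only for illustration.

```python
import random
import statistics

def simulate_sums(n_terms=30, n_sums=5000, seed=7):
    """Each entry is the sum of n_terms independent Uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_terms)) for _ in range(n_sums)]

sums = simulate_sums()
m = statistics.mean(sums)
s = statistics.stdev(sums)

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
share_within_1sd = sum(abs(v - m) <= s for v in sums) / len(sums)
print(round(share_within_1sd, 2))
```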


Significance levels and p-values

• Significance level: If we take H0 as true, we want a critical level beyond which it is unlikely that we will see results, for example 5%. There is then only a 5% probability that we will see a relationship this strong if H0 is true. It is important to have a large sample.

• P-value: The lowest significance level at which H0 is rejected. If H0 is true: What is the probability that we will see a result at least this extreme?
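As a sketch of how a p-value relates to a significance level, the code below computes the two-sided p-value for a test statistic that is assumed to follow a standard normal distribution under H0. The function names and the choice of a two-sided test are assumptions for illustration.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

def reject_h0(z, alpha=0.05):
    """Reject H0 when the p-value is below the significance level alpha."""
    return two_sided_p_value(z) < alpha

print(round(two_sided_p_value(1.96), 3))  # 0.05
print(reject_h0(2.5), reject_h0(1.0))     # True False
```

The familiar critical value of 1.96 corresponds to the 5% level: a statistic further from zero than that gives a p-value below 0.05.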

Models

• Stockburger: “A model is a representation containing the essential structure of some object or event in the real world.”

– 1. Models are necessarily incomplete

– (2. The model may be changed or manipulated with relative ease.)

Regression analysis

• How to fit a straight line through a scatterplot!

• Best fit: one criterion is to minimize the sum of squared residuals: Ordinary Least Squares (OLS)

• Bivariate regression equation: Y = a + bX + ε

• Regression analysis recognizes that the world is not deterministic. The role of the error term: ε. Large error terms in general imply large uncertainty.

• Interpretation of a: Mean value of Y when X is equal to zero. Often no substantive interpretation, and not so interesting.

• Interpretation of b: Increase in the mean of Y when X increases by one unit. Effect of X on Y?
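For the bivariate case, the values of a and b that minimize the sum of squared residuals have a simple closed form: b is the covariance of X and Y divided by the variance of X, and a follows from the two means. The sketch below applies these formulas to made-up data; the function names are assumptions for illustration.

```python
def ols_bivariate(xs, ys):
    """Return (a, b) that minimize the sum of squared residuals of Y = a + bX."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx    # slope
    a = my - b * mx  # intercept: the line passes through the point of means
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]  # made-up observations

a, b = ols_bivariate(x, y)
print(round(a, 2), round(b, 2))
```

Any other line through the same data, for example one with a slightly different slope, gives a larger sum of squared residuals.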

Assumptions about the distribution of the error term when using OLS:

• Homoskedastic

• No autocorrelation

• Normally distributed

Multivariate regression

• Y = a + b1X1 + b2X2 + b3X3 + ε

• New interpretation of b: The mean increase in Y when the relevant X increases by one unit, given that all other variables are held constant.

• R-squared: How much of the variation in the data is “explained by the model” (a very imprecise interpretation). Goes from 0 to 1.

• “Control variables”

• Extensions of regression analysis: Generalized Least Squares, systems of equations, Instrumental Variables, Logit and Probit models and many more.
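A multivariate OLS fit and its R-squared can be sketched without external libraries by solving the normal equations. This is one possible implementation under stated assumptions, not the method of any particular statistics package; the function names and the toy data (constructed to fit exactly) are invented for illustration.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(rows, ys):
    """rows: list of [x1, x2, ...]; returns [a, b1, b2, ...]."""
    X = [[1.0] + list(r) for r in rows]  # add a constant for the intercept
    k = len(X[0])
    XtX = [[sum(xi[i] * xi[j] for xi in X) for j in range(k)] for i in range(k)]
    Xty = [sum(xi[i] * y for xi, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

def r_squared(rows, ys, coefs):
    """1 minus (residual sum of squares / total sum of squares)."""
    fitted = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in rows]
    my = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]  # exact by construction, no error term
coefs = ols(rows, ys)
print([round(c, 6) for c in coefs], round(r_squared(rows, ys, coefs), 6))
```

Because the toy data contain no error term, the fit recovers the coefficients used to build them and R-squared is essentially 1; with real data the error term lowers it.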

Extensions

• Dummy variable

• Squared X

• Logarithmic specifications

• Splitting the sample
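These extensions mostly amount to transforming or partitioning the data before running the regression. A minimal sketch, with hypothetical variable names and made-up values:

```python
import math

def design_row(income, is_democracy):
    """Build regressors from raw values (all names are hypothetical)."""
    return {
        "democracy_dummy": 1 if is_democracy else 0,  # dummy variable
        "income": income,
        "income_squared": income ** 2,   # squared X: allows a curved relationship
        "log_income": math.log(income),  # logarithmic specification
    }

def split_sample(rows, key):
    """Split observations into two sub-samples by a 0/1 variable."""
    ones = [r for r in rows if r[key] == 1]
    zeros = [r for r in rows if r[key] == 0]
    return ones, zeros

rows = [design_row(1000, True), design_row(4000, False), design_row(2500, True)]
democracies, others = split_sample(rows, "democracy_dummy")
print(len(democracies), len(others))
```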

Problems

• 1) “Simultaneity bias”: Reverse causation. Exogeneity vs endogeneity of X-variables.

• 2) “Omitted variable bias”

• 3) Measurement error.

– Reliability. Where does the data come from? GDP in developing countries.

– Validity (TFP and technological change)

• Operationalization of variables: They have to be observable, quantifiable and measurable.
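Omitted variable bias can be made concrete with a simulation in which the true effect of X on Y is known. In the sketch below, which rests on invented parameters and names, Z affects both X and Y but is left out of the regression.

```python
import random

def slope(xs, ys):
    """OLS slope from a bivariate regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_omitted_variable(n=5000, seed=3):
    """Hypothetical setup: the true effect of X on Y is 1, but Z also
    affects both X and Y and is omitted from the regression."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return slope(x, y)

b_biased = simulate_omitted_variable()
print(round(b_biased, 2))  # noticeably above the true effect of 1
```

In this setup the estimated slope is pushed towards 1.5 because X picks up part of the effect of the omitted Z; including Z as a control variable would remove that bias.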
