ECON 620 - Lecture 1 - Introduction to Econometrics

Lecture 1: Introduction to Econometrics
University of San Francisco, Department of Economics
Prof. Jesse K. Anttila-Hughes
January 21st, 2014

NYTimes, Jan 29th, 2013

Today we'll cover:
  Introduction to econometrics
  Review of probability

Reading: Wooldridge Appendixes B and C

Office Hours: Tuesdays 1-4pm or by appt

Problem set #1
  On Blackboard
  Due next Monday

Computer assignment #1
  Get Stata
  Download the data files from Blackboard
  Bring your laptop, with Stata installed and the data files

Goals for this class

Emphasis on:

Formal econometrics:
  Basic statistics and probability
  Fundamentals of multivariate OLS regression
  Regression execution and interpretation
  Hypothesis testing and standard errors
  Simple connections between research questions and econometrics, including some common basic problems and their solutions (or lack thereof)
  Formal foundation for Econometrics II and III

Tacit knowledge:
  Basics of performing econometric analysis in Stata
  Basic good habits for handling and managing data

By the end of class you should have:
  A simple, intuitive understanding of what econometrics is and what it can and can't do
  A beginner's proficiency with Stata
  Enough econometrics to start appreciating how much there is to learn

Econometrics involves a lot of tacit learning:

Administraterrata

Class:
  6:30-9:15 pm on Tuesdays
  14 lectures total, one midterm

Blackboard is the primary resource:
  Syllabus, lecture notes, homework, readings, etc.

Grades:
  12 problem sets (drop lowest 2): 35%
  Class participation: 10%
  Midterm exam: 25%
  Final exam: 30%

Book: Wooldridge, 5th Edition

What we're aiming for

Hsiang et al. 2013

The Nature of Econometrics and Economic Data

Econometrics

What is econometrics?
  The statistics used by economists

Why do we do econometrics?
  Estimating relationships between economic variables
  Testing economic theories and hypotheses
  Forecasting economic variables
  Evaluating or implementing policies
  Etc.

In general, econometrics starts with an economic model (i.e., something you'd learn in a theory class) and then generates testable predictions from that model
  The step of explicitly defining a model is often skipped

Economic model of crime (Becker (1968))
  Derives an equation for criminal activity based on utility maximization
  The functional form of the relationship is not specified
  The equation could have been postulated without economic modeling

  y = f(x1, x2, x3, x4, x5, x6, x7)

  y: hours spent in criminal activities
  x1: returns to criminal activities
  x2: wage for legal employment
  x3: other income
  x4: probability of getting caught
  x5: probability of conviction if caught
  x6: expected sentence
  x7: age

What kind of theory models?

Model of job training and worker productivity
  What is the effect of additional training on worker productivity?
  Formal economic theory is not really needed to derive the equation (but it may be used):

  wage = f(educ, exper, training)

  wage: hourly wage
  educ: years of formal education
  exper: years of workforce experience
  training: weeks spent in job training

  Other factors may be relevant, but these are the most important (?)


Econometric model of criminal activity
  The functional form has to be specified
  Variables may have to be approximated by other quantities

  crime = β0 + β1*wage + β2*othinc + β3*freqarr + β4*freqconv + β5*avgsen + β6*age + u

  crime: measure of criminal activity
  wage: wage for legal employment
  othinc: other income
  freqarr: frequency of prior arrests
  freqconv: frequency of conviction
  avgsen: average sentence length after conviction
  age: age
  u: unobserved determinants of criminal activity, e.g. moral character, wage in criminal activity, family background

What kind of econometric models?

Econometric model of job training and worker productivity

  wage = β0 + β1*educ + β2*exper + β3*training + u

  wage: hourly wage
  educ: years of formal education
  exper: years of workforce experience
  training: weeks spent in job training
  u: unobserved determinants of the wage, e.g. innate ability, quality of education, family background

Most of econometrics deals with the specification of the error u
Econometric models may be used for hypothesis testing
  For example, the parameter β3 represents the effect of training on the wage
  How large is this effect? Is it different from zero?
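The two questions above (how large is the training effect, and is it different from zero?) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the model is written as wage = b0 + b1*educ + b2*exper + b3*training + u, the variable and parameter names follow common textbook notation rather than anything fixed by these slides, and every number is made up. It uses NumPy rather than Stata purely for illustration.

```python
import numpy as np

# Hypothetical simulated data: every number below is made up for illustration.
rng = np.random.default_rng(0)
n = 2000
educ = rng.integers(8, 21, size=n).astype(float)      # years of formal education
exper = rng.integers(0, 31, size=n).astype(float)     # years of workforce experience
training = rng.integers(0, 11, size=n).astype(float)  # weeks spent in job training
u = rng.normal(0.0, 2.0, size=n)                      # unobserved determinants of the wage

# Assumed "true" parameters (b0, b1, b2, b3), chosen arbitrarily.
true_beta = np.array([2.0, 0.8, 0.2, 0.3])
X = np.column_stack([np.ones(n), educ, exper, training])
wage = X @ true_beta + u

# OLS estimate: the coefficient vector that minimizes the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, wage, rcond=None)

# Classical (homoskedastic) standard errors from sigma^2 * (X'X)^{-1}.
residuals = wage - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])
std_err = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

# t statistic for the null hypothesis that training has no effect on the wage.
t_training = beta_hat[3] / std_err[3]
print("estimated training effect:", round(beta_hat[3], 3))
print("t statistic:", round(t_training, 2))
```

Because the data are simulated with an error term that is independent of the regressors, the estimate should land close to the assumed value of 0.3; with real data, whether that independence holds is exactly what is at stake.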


Types of data

Econometric analysis requires data
  And there are many, many different things that count as data

In general, we distinguish between four major kinds of economic data according to how they combine units of observation and time:
  Cross-sectional data: multiple units of observation, single time
  Time series data: single unit of observation, multiple times
  Pooled cross sections: multiple units of observation, multiple times, but different units observed each time
  Panel/longitudinal data: multiple units of observation, with multiple time observations for each

Econometric methods depend on the nature of the data used
  Use of inappropriate methods may lead to misleading results
  This will be a big theme in later classes
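One way to make the four data types concrete is to look only at which (unit, time) pairs a data set contains. The helper below is a hypothetical sketch, not a standard function: it assumes each row is identified by a unit ID and a time period, and it treats a data set as a panel only when exactly the same units appear in every period (so an unbalanced panel would be labeled as pooled cross sections).

```python
def classify_dataset(observations):
    """Classify a data set from its (unit_id, time_period) pairs, one per row."""
    units = {unit for unit, _ in observations}
    periods = {period for _, period in observations}
    if len(periods) == 1:
        return "cross-sectional"
    if len(units) == 1:
        return "time series"
    # Collect the set of units observed in each period.
    units_by_period = {}
    for unit, period in observations:
        units_by_period.setdefault(period, set()).add(unit)
    unit_sets = list(units_by_period.values())
    # Panel: the same units are followed through every period.
    if all(unit_set == unit_sets[0] for unit_set in unit_sets):
        return "panel"
    return "pooled cross sections"


# Made-up examples of each layout.
print(classify_dataset([("A", 2013), ("B", 2013), ("C", 2013)]))
print(classify_dataset([("US", 2010), ("US", 2011), ("US", 2012)]))
print(classify_dataset([("h1", 1993), ("h2", 1993), ("h3", 1995), ("h4", 1995)]))
print(classify_dataset([("c1", 1986), ("c2", 1986), ("c1", 1990), ("c2", 1990)]))
```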

Cross-sectional data
  Sample of individuals, households, firms, cities, states, countries, or other units of interest at a given point in time or in a given period
  Cross-sectional observations must be more or less independent
  For example, pure random sampling from a population
  Sometimes pure random sampling is violated, e.g. when units refuse to respond in surveys, or when sampling is characterized by clustering

[Table: Cross-sectional data set on wages and other characteristics. Columns include observation number, hourly wage, and indicator variables (1 = yes, 0 = no).]

Cross-sectional data: one observation per unit of observation

[Table: Cross-sectional data on growth rates and country characteristics. Columns include adult secondary education rates, government consumption as a percentage of GDP, and growth rate of real per capita GDP.]

Time series data
  Observations of a variable or several variables over time
  For example, stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, automobile sales, etc.
  Time series observations are typically serially correlated
  Ordering of observations conveys important information
  Data frequency: daily, weekly, monthly, quarterly, annually, etc.
  Typical features of time series: trends and seasonality
  Typical applications: applied macroeconomics and finance
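To see what "serially correlated" and "ordering conveys important information" mean in practice, the sketch below simulates a hypothetical series in which each value carries over part of the previous one, then compares its lag-1 autocorrelation with that of the same values in shuffled order. The persistence of 0.8 and the series length are arbitrary assumptions.

```python
import random


def lag1_autocorrelation(series):
    """Sample correlation between each observation and the one before it."""
    mean = sum(series) / len(series)
    numerator = sum((series[t] - mean) * (series[t - 1] - mean)
                    for t in range(1, len(series)))
    denominator = sum((value - mean) ** 2 for value in series)
    return numerator / denominator


random.seed(0)

# Hypothetical series: each value keeps 0.8 of the previous value plus a new shock.
series = [0.0]
for _ in range(999):
    series.append(0.8 * series[-1] + random.gauss(0.0, 1.0))

# Shuffling keeps the same values but destroys the time ordering.
shuffled = series[:]
random.shuffle(shuffled)

print("ordered series: ", round(lag1_autocorrelation(series), 2))
print("shuffled series:", round(lag1_autocorrelation(shuffled), 2))
```

The ordered series shows strong dependence between neighboring observations, while the shuffled one shows almost none, which is why methods that assume independent observations can mislead when applied to time series.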

[Table: Time series data on minimum wages and related variables. Columns include unemployment rate, average coverage rate, average minimum wage for the given year, and gross national product.]

Pooled cross sections
  Two or more cross sections are combined in one data set
  Cross sections are drawn independently of each other
  Pooled cross sections are often used to evaluate policy changes
  Example: evaluate the effect of a change in property taxes on house prices
    Random sample of house prices for the year 1993
    A new random sample of house prices for the year 1995
    Compare before/after (1993: before reform, 1995: after reform)
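The before/after comparison in the house price example can be sketched as follows. Everything here is hypothetical: the price levels, the spread, and the assumed 10-unit drop after the reform are invented, and the two samples are drawn independently, so no house is tracked across years.

```python
import random
import statistics

random.seed(1)


def draw_sample(year, n):
    """Draw a fresh random sample of hypothetical house prices (in $1000s)."""
    # Assumed for illustration: average prices are 10 lower after the reform.
    mean_price = 200.0 if year == 1993 else 190.0
    return [{"year": year, "price": random.gauss(mean_price, 30.0)} for _ in range(n)]


# Two independently drawn cross sections, stacked into one pooled data set.
pooled = draw_sample(1993, 500) + draw_sample(1995, 500)

before = [row["price"] for row in pooled if row["year"] == 1993]
after = [row["price"] for row in pooled if row["year"] == 1995]

difference = statistics.mean(after) - statistics.mean(before)
print("after minus before:", round(difference, 1))
```

Note that a raw before/after difference attributes everything that changed between the two years to the reform, which is a strong assumption in real data.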

[Table: Pooled cross sections on housing prices. Columns include number of bathrooms, size of house in square feet, and property tax; rows are grouped into before-reform and after-reform observations.]

Panel or longitudinal data
  The same cross-sectional units are followed over time
  Panel data have both a cross-sectional and a time series dimension
  Hence, panel data can be used to account for time-invariant unobservables
  Panel data can be used to model lagged responses
  Example: city crime statistics; each city is observed in two years
    Time-invariant unobserved city characteristics may be modeled
    Effect of police on crime rates may exhibit a time lag
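The claim that panel data can account for time-invariant unobservables can be illustrated with first differences. The sketch below is a simulation under invented assumptions: each city has a fixed unobserved "city effect" that raises both crime and police numbers, the assumed effect of police on crime is -0.5, and all magnitudes are arbitrary. Differencing each city's two observations removes the city effect.

```python
import random


def slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    return covariance / variance


random.seed(2)
TRUE_EFFECT = -0.5  # assumed effect of police on crime, made up for illustration

police_86, police_90, crime_86, crime_90 = [], [], [], []
for _ in range(2000):
    city_effect = random.gauss(0.0, 1.0)  # time-invariant and unobserved
    # Cities with a high unobserved crime propensity also employ more police.
    p86 = city_effect + random.gauss(0.0, 1.0)
    p90 = city_effect + random.gauss(0.0, 1.0)
    police_86.append(p86)
    police_90.append(p90)
    crime_86.append(TRUE_EFFECT * p86 + 2.0 * city_effect + random.gauss(0.0, 1.0))
    crime_90.append(TRUE_EFFECT * p90 + 2.0 * city_effect + random.gauss(0.0, 1.0))

# Single cross section (1990 only): the city effect is confounded with police.
cross_section_slope = slope(police_90, crime_90)

# First differences within each city: the time-invariant city effect cancels out.
d_police = [late - early for early, late in zip(police_86, police_90)]
d_crime = [late - early for early, late in zip(crime_86, crime_90)]
first_difference_slope = slope(d_police, d_crime)

print("cross-section slope:   ", round(cross_section_slope, 2))
print("first-difference slope:", round(first_difference_slope, 2))
```

In this setup the single cross section even gets the sign wrong (more police appears to go with more crime), while the first-difference estimate is close to the assumed effect.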

[Table: Two-year panel data on city crime statistics. Each city has two time series observations; columns include number of police in 1986 and number of police in 1990.]

Causal Inference vs. Association
  One of the major aspects of econometrics that distinguishes it from statistics is a very strong emphasis on understanding causal inference
  Causal inference: evaluating whether a change in one variable (x) will lead to a change in another variable (y), assuming nothing else changes (ceteris paribus)

Why do we care about causality?
  A lot of the time, as econometricians, we're explicitly trying to evaluate a policy to enact
  Thus we want to know what will happen if we change one variable
  More generally, the statistical tools we have can tell us a lot about how two variables covary
  But correlation doesn't imply causation, and to get to causal inference we generally need to know how the problem works in real life
  A particular concern is when our two variables x and y are endogenous, or jointly determined
  That is, x and y influence each other, or there's a third variable Z that affects both
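The third-variable case can be sketched with a simulation in which x has no effect on y at all, yet the two are clearly correlated because both depend on Z. The data-generating process and its coefficients are invented for illustration, and removing Z's contribution exactly is only possible here because the simulation knows how the data were built.

```python
import random


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


random.seed(3)

# Hypothetical setup: z drives both x and y; y does not depend on x at all.
z = [random.gauss(0.0, 1.0) for _ in range(5000)]
x = [zi + random.gauss(0.0, 1.0) for zi in z]
y = [zi + random.gauss(0.0, 1.0) for zi in z]

raw = correlation(x, y)

# Holding z fixed: strip z's (known, simulated) contribution from both variables.
x_net = [xi - zi for xi, zi in zip(x, z)]
y_net = [yi - zi for yi, zi in zip(y, z)]
net_of_z = correlation(x_net, y_net)

print("correlation of x and y:       ", round(raw, 2))
print("correlation after removing z: ", round(net_of_z, 2))
```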

Example: Medicine (Medieval)

Four Humors Theory (not falsifiable)

Four Humors Empirics (sub-optimal outcomes)

Medicine (Victorian)

Miasma Theory (not falsifiable)

Miasma Empirics (sub-optimal outcomes)

Medicine (Early Modern, 1850s)

Theory: Cholera is a waterborne disease (falsifiable)

Disease Theory Empirics (decent outcomes)

Randomization
  Ironically, given its late arrival to the scientific method, medicine first developed what we may now consider to be the ultimate tool in causal inference in the sciences: the randomized control trial (or RCT)

Endogeneity and random assignment

Why is random assignment important?
  Consider

  health_outcome = A*(took_drug) + ε

  where ε is our error term, took_drug is a binary variable indicating that a patient took a drug, and health_outcome is some measurement of health
  Why do we care that took_drug is randomly assigned?
  If took_drug is randomly assigned, then is it correlated with anything? If not, who cares?
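The role of random assignment in the health_outcome equation can be sketched by simulating it two ways. The error term is called epsilon here (an assumed name), the true effect A = 2.0 is invented, and the self-selection rule (patients with worse unobserved health take the drug) is just one hypothetical way that took_drug could end up correlated with the error.

```python
import random
import statistics

random.seed(4)
A = 2.0  # assumed true effect of the drug, made up for illustration


def difference_in_means(assign):
    """Simulate patients and compare mean outcomes of takers and non-takers."""
    treated, untreated = [], []
    for _ in range(5000):
        epsilon = random.gauss(0.0, 1.0)  # unobserved determinants of health
        took_drug = assign(epsilon)       # 1 if the patient took the drug, else 0
        health_outcome = A * took_drug + epsilon
        (treated if took_drug else untreated).append(health_outcome)
    return statistics.mean(treated) - statistics.mean(untreated)


# Random assignment: a coin flip that is unrelated to epsilon.
randomized = difference_in_means(lambda epsilon: int(random.random() < 0.5))

# Self-selection: patients with worse unobserved health choose to take the drug.
self_selected = difference_in_means(lambda epsilon: int(epsilon < 0.0))

print("randomized estimate:   ", round(randomized, 2))
print("self-selected estimate:", round(self_selected, 2))
```

Under the coin flip the simple comparison of means recovers something close to A; under self-selection the same comparison is far too small, because took_drug is now correlated with the error term.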

Why is endogeneity an issue?
  Randomization allows a researcher to eliminate the possibility that they are arguing for a causal, exogenous interpretation of an endogenous system
  Endogenous means originating from inside the system, in this case taken to mean co-influential
    Education and earnings
    Prices of substitute or complementary goods
    Development and the environment
  Exogenous means originating outside the system
  Interpreting an endogenous relationship as exogenous means risking interpreting a system with reverse causality as strictly causal

Endogeneity by example: Classroom size and educational achievement

How is the relationship endogenous?

Class size: ways of attacking endogeneity
  Randomly assign students to large or small classes
    Tennessee STAR
  Find a natural experiment that produces something akin to randomization in class size
    Maimonides' Rule in Israel
  But these don't always work the way we think
    Discontinuous class size cutoffs in Chile
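To show why a class size cap can act like a natural experiment, the sketch below computes the average class size implied by a cap rule. The slides do not state the cap; the value of 40 used here is an assumption based on how Maimonides' Rule is usually described, and the function name is hypothetical.

```python
def maimonides_class_size(enrollment, cap=40):
    """Average class size when a grade is split so that no class exceeds `cap`."""
    if enrollment < 1:
        raise ValueError("enrollment must be at least 1")
    # One class for up to `cap` students, two for cap+1 to 2*cap, and so on.
    n_classes = (enrollment - 1) // cap + 1
    return enrollment / n_classes


for enrollment in (40, 41, 80, 81):
    print(enrollment, maimonides_class_size(enrollment))
# 40 -> 40.0, 41 -> 20.5, 80 -> 40.0, 81 -> 27.0
```

A grade with 40 students and a grade with 41 are likely to be similar in most respects, yet the rule gives them very different class sizes, which is what makes the cutoff useful for studying class size effects.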

Why all this concern about endogeneity?
  Endogeneity is particularly troublesome in the social sciences because humans are self-aware
    Humans might sort on / select into treatment
  Our understanding of how to deal with endogeneity is relatively new
    Natural scientists don't normally have intelligent, reactive data points
    Medicine, which one might argue is halfway between the natural and social sciences, needed to be concerned with endogeneity early on
  Endogeneity is still a concern in the natural sciences, though: correlation does not imply causation
