Markov chain Monte Carlo with people Tom Griffiths Department of Psychology Cognitive Science...

Preview:

Citation preview

Markov chain Monte Carlowith people

Tom GriffithsDepartment of Psychology

Cognitive Science Program

UC Berkeley

with Mike Kalish, Stephan Lewandowsky, and Adam Sanborn

Inductive problems

blicket toma

dax wug

blicket wug

S X Y

X {blicket,dax}

Y {toma, wug}

Learning languages from utterances

Learning functions from (x,y) pairs

Learning categories from instances of their members

Computational cognitive science

Identify the underlying computational problem

Find the optimal solution to that problem

Compare human cognition to that solution

For inductive problems, solutions come from statistics

Statistics and inductive problems

Cognitive science

Categorization

Causal learning

Function learning

Language

Statistics

Density estimation

Graphical models

Regression

Probabilistic grammars

Statistics and human cognition

• How can we use statistics to understand cognition?

• How can cognition inspire new statistical models?– applications of Dirichlet process and Pitman-Yor process

models to natural language– exchangeable distributions on infinite binary matrices via

the Indian buffet process (priors on causal structure)– nonparametric Bayesian models for relational data

Statistics and human cognition

• How can we use statistics to understand cognition?

• How can cognition inspire new statistical models?– applications of Dirichlet process and Pitman-Yor process

models to natural language– exchangeable distributions on infinite binary matrices via

the Indian buffet process (priors on causal structure)– nonparametric Bayesian models for relational data

Statistics and human cognition

• How can we use statistics to understand cognition?

• How can cognition inspire new statistical models?– applications of Dirichlet process and Pitman-Yor process

models to natural language– exchangeable distributions on infinite binary matrices via

the Indian buffet process– nonparametric Bayesian models for relational data

Are people Bayesian?

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Reverend Thomas Bayes

Bayes’ theorem

P(h | d) =P(d | h)P(h)

P(d | ′ h )P( ′ h )′ h ∈H

Posteriorprobability

Likelihood Priorprobability

Sum over space of hypothesesh: hypothesis

d: data

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

People are stupid

Predicting the future

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

How often is Google News updated?

t = time since last update

ttotal = time between updates

What should we guess for ttotal given t?

The effects of priors

Evaluating human predictions

• Different domains with different priors:– a movie has made $60 million [power-law]

– your friend quotes from line 17 of a poem [power-law]

– you meet a 78 year old man [Gaussian]

– a movie has been running for 55 minutes [Gaussian]

– a U.S. congressman has served for 11 years [Erlang]

• Prior distributions derived from actual data

• Use 5 values of t for each

• People predict ttotal

peopleparametric priorempirical prior

Gott’s rule

A different approach…

Instead of asking whether people are rational, use assumption of rationality to investigate cognition

If we can predict people’s responses, we can design experiments that measure psychological variables

Two deep questions

• What are the biases that guide human learning?– prior probability distribution P(h)

• What do mental representations look like?– category distribution P(x|c)

limt →∞

P(x(t ) = i | x(0)) = π i

Two deep questions

• What are the biases that guide human learning?– prior probability distribution on hypotheses, P(h)

• What do mental representations look like?– distribution over objects x in category c, P(x|c)

Develop ways to sample from these distributions

Outline

Markov chain Monte Carlo

Sampling from the prior

Sampling from category distributions

Outline

Markov chain Monte Carlo

Sampling from the prior

Sampling from category distributions

• Variables x(t+1) independent of history given x(t)

• Converges to a stationary distribution under easily checked conditions (i.e., if it is ergodic)

x x x x x x x x

Transition matrixT = P(x(t+1)|x(t))

Markov chains

Markov chain Monte Carlo

• Sample from a target distribution P(x) by constructing Markov chain for which P(x) is the stationary distribution

• Two main schemes:– Gibbs sampling– Metropolis-Hastings algorithm

Gibbs sampling

For variables x = x1, x2, …, xn and target P(x)

Draw xi(t+1) from P(xi|x-i)

x-i = x1(t+1), x2

(t+1),…, xi-1(t+1)

, xi+1(t)

, …, xn(t)

Gibbs sampling

(MacKay, 2002)

Metropolis-Hastings algorithm(Metropolis et al., 1953; Hastings, 1970)

Step 1: propose a state (we assume symmetrically)

Q(x(t+1)|x(t)) = Q(x(t))|x(t+1))

Step 2: decide whether to accept, with probability

Metropolis acceptance function

Barker acceptance function

Metropolis-Hastings algorithm

p(x)

Metropolis-Hastings algorithm

p(x)

Metropolis-Hastings algorithm

p(x)

Metropolis-Hastings algorithm

A(x(t), x(t+1)) = 0.5

p(x)

Metropolis-Hastings algorithm

p(x)

Metropolis-Hastings algorithm

A(x(t), x(t+1)) = 1

p(x)

Outline

Markov chain Monte Carlo

Sampling from the prior

Sampling from category distributions

Iterated learning(Kirby, 2001)

What are the consequences of learners learning from other learners?

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Analyzing iterated learning

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

PL(h|d): probability of inferring hypothesis h from data d

PP(d|h): probability of generating data d from hypothesis h

PL(h|d)

PP(d|h)

PL(h|d)

PP(d|h)

Iterated Bayesian learning

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

PL(h|d)

PP(d|h)

PL(h|d)

PP(d|h)

PL (h | d) =PP (d | h)P(h)

PP (d | ′ h )P( ′ h )′ h ∈H

Assume learners sample from their posterior distribution:

Analyzing iterated learning

d0 h1 d1 h2PL(h|d) PP(d|h) PL(h|d)

d2 h3PP(d|h) PL(h|d)

d PP(d|h)PL(h|d)h1 h2d PP(d|h)PL(h|d)

h3

A Markov chain on hypotheses

d0 d1h PL(h|d) PP(d|h)d2h PL(h|d) PP(d|h) h PL(h|d) PP(d|h)

A Markov chain on data

Stationary distributions

• Markov chain on h converges to the prior, P(h)

• Markov chain on d converges to the “prior predictive distribution”

P(d) = P(d | h)h

∑ P(h)

(Griffiths & Kalish, 2005)

Explaining convergence to the prior

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

PL(h|d)

PP(d|h)

PL(h|d)

PP(d|h)

• Intuitively: data acts once, prior many times

• Formally: iterated learning with Bayesian agents is a Gibbs sampler on P(d,h)

(Griffiths & Kalish, in press)

Revealing inductive biases

• Many problems in cognitive science can be formulated as problems of induction– learning languages, concepts, and causal relations

• Such problems are not solvable without bias(e.g., Goodman, 1955; Kearns & Vazirani, 1994; Vapnik, 1995)

• What biases guide human inductive inferences?

If iterated learning converges to the prior, then it may provide a method for investigating biases

Serial reproduction(Bartlett, 1932)

• Participants see stimuli, then reproduce them from memory

• Reproductions of one participant are stimuli for the next

• Stimuli were interesting, rather than controlled– e.g., “War of the Ghosts”

General strategy

• Use well-studied and simple stimuli for which people’s inductive biases are known– function learning– concept learning– color words

• Examine dynamics of iterated learning– convergence to state reflecting biases– predictable path to convergence

Iterated function learning

• Each learner sees a set of (x,y) pairs

• Makes predictions of y for new x values

• Predictions are data for the next learner

data hypotheses

(Kalish, Griffiths, & Lewandowsky, in press)

Function learning experiments

Stimulus

Response

Slider

Feedback

Examine iterated learning with different initial data

1 2 3 4 5 6 7 8 9

IterationInitialdata

Identifying inductive biases

• Formal analysis suggests that iterated learning provides a way to determine inductive biases

• Experiments with human learners support this idea– when stimuli for which biases are well understood are used,

those biases are revealed by iterated learning

• What do inductive biases look like in other cases?– continuous categories– causal structure– word learning– language learning

• Iterated learning for MAP learners reduces to a form of the stochastic EM algorithm– Monte Carlo EM with a single sample

• Provides connections between cultural evolution and classic models used in population genetics– MAP learning of multinomials = Wright-Fisher

• More generally, an account of how products of cultural evolution relate to the biases of learners

Statistics and cultural evolution

Outline

Markov chain Monte Carlo

Sampling from the prior

Sampling from category distributions

Categories are central to cognition

Sampling from categories

Frog distribution

P(x|c)

A task

Ask subjects which of two alternatives comes from a target category

Which animal is a frog?

A Bayesian analysis of the task

Assume:

Response probabilities

If people probability match to the posterior, response probability is equivalent to the Barker acceptance function for target distribution p(x|c)

Collecting the samplesWhich is the frog? Which is the frog? Which is the frog?

Trial 1 Trial 2 Trial 3

Verifying the method

Training

Subjects were shown schematic fish of different sizes and trained on whether they came from the ocean

(uniform) or a fish farm (Gaussian)

Between-subject conditions

Choice task

Subjects judged which of the two fish came from the fish farm (Gaussian) distribution

Examples of subject MCMC chains

Estimates from all subjects

• Estimated means and standard deviations are significantly different across groups

• Estimated means are accurate, but standard deviation estimates are high– result could be due to perceptual noise or response gain

Sampling from natural categories

Examined distributions for four natural categories: giraffes, horses, cats, and dogs

Presented stimuli with nine-parameter stick figures (Olman & Kersten, 2004)

Choice task

Samples from Subject 3(projected onto plane from LDA)

Mean animals by subject

giraffe

horse

cat

dog

S1 S2 S3 S4 S5 S6 S7 S8

Marginal densities (aggregated across subjects)

Giraffes are distinguished by neck length, body height and body tilt

Horses are like giraffes, but with shorter bodies and nearly uniform necks

Cats have longer tails than dogs

Relative volume of categoriesMinimum Enclosing Hypercube

Giraffe Horse Cat Dog

0.00004 0.00006 0.00003 0.00002

Convex hull content divided by enclosing hypercube content

Convex Hull

Discrimination method(Olman & Kersten, 2004)

Parameter space for discrimination

Restricted so that most random draws were animal-like

MCMC and discrimination means

Conclusion

• Markov chain Monte Carlo provides a way to sample from subjective probability distributions

• Many interesting questions can be framed in terms of subjective probability distributions– inductive biases (priors)

– mental representations (category distributions)

• Other MCMC methods may provide further empirical methods…– Gibbs for categories, adaptive MCMC, …

A different approach…

Instead of asking whether people are rational, use assumption of rationality to investigate cognition

If we can predict people’s responses, we can design experiments that measure psychological variables

Randomized algorithms Psychological experiments

PL (h | d)∝PP (d | h)P(h)

PP (d | ′ h )P( ′ h )′ h ∈H

⎢ ⎢ ⎢

⎥ ⎥ ⎥

r

r = 1 r = 2 r =

From sampling to maximizing

• General analytic results are hard to obtain– (r = is Monte Carlo EM with a single sample)

• For certain classes of languages, it is possible to show that the stationary distribution gives each hypothesis h probability proportional to P(h)r

– the ordering identified by the prior is preserved, but not the corresponding probabilities

(Kirby, Dowman, & Griffiths, in press)

From sampling to maximizing

Implications for linguistic universals

• When learners sample from P(h|d), the distribution over languages converges to the prior– identifies a one-to-one correspondence between inductive

biases and linguistic universals

• As learners move towards maximizing, the influence of the prior is exaggerated– weak biases can produce strong universals– cultural evolution is a viable alternative to traditional

explanations for linguistic universals

Iterated concept learning

• Each learner sees examples from a species

• Identifies species of four amoebae

• Iterated learning is run within-subjects

data hypotheses

(Griffiths, Christian, & Kalish, in press)

Two positive examples

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

data (d)

hypotheses (h)

Bayesian model(Tenenbaum, 1999; Tenenbaum & Griffiths, 2001)

P(h | d) =P(d | h)P(h)

P(d | ′ h )P( ′ h )′ h ∈H

∑d: 2 amoebaeh: set of 4 amoebae

P(d | h) =1/ h

m

0

⎧ ⎨ ⎩

d ∈ h

otherwise

m: # of amoebae in the set d (= 2)|h|: # of amoebae in the set h (= 4)

P(h | d) =P(h)

P( ′ h )h '|d ∈h'

∑Posterior is renormalized prior

What is the prior?

Classes of concepts(Shepard, Hovland, & Jenkins, 1958)

Class 1

Class 2

Class 3

Class 4

Class 5

Class 6

shape

size

color

Experiment design (for each subject)Class 1Class 2Class 3Class 4Class 5Class 6Class 1Class 2Class 3Class 4Class 5Class 6

6 iterated learning chains

6 independent

learning “chains”

Estimating the prior

data (d)hy

poth

eses

(h)

Estimating the prior

Class 1Class 2

Class 3

Class 4

Class 5

Class 6

0.8610.087

0.009

0.002

0.013

0.028

Prior

r = 0.952

Bayesian modelHuman subjects

Two positive examples(n = 20)

Prob

abil

ity

Iteration

Prob

abil

ity

Iteration

Human learners Bayesian model

Two positive examples(n = 20)

Prob

abil

ity

Bayesian model

Human learners

Three positive examples

data (d)

hypotheses (h)

Three positive examples(n = 20)

Prob

abil

ity

Iteration

Prob

abil

ity

Iteration

Human learners Bayesian model

Three positive examples(n = 20)

Bayesian model

Human learners

Classification objects

Parameter space for discrimination

Restricted so that most random draws were animal-like

MCMC and discrimination means

Problems with classification objects

Category 1

Category 2

Category 1

Category 2

Problems with classification objectsMinimum Enclosing Hypercube

Giraffe Horse Cat Dog

0.00004 0.00006 0.00003 0.00002

Convex hull content divided by enclosing hypercube content

Convex Hull

Allowing a Wider Range of Behavior

An exponentiated choice rule results in a Markov chain with stationary distribution corresponding to an exponentiated version of the category distribution, proportional to p(x|c)

Category drift

• For fragile categories, the MCMC procedure could influence the category representation

• Interleaved training and test blocks in the training experiments