Learning Programs: A Hierarchical Bayesian Approachpliang/papers/programs-icml2010-talk.pdf · would just program themselves since I don't like programming

Learning Programs:

A Hierarchical Bayesian Approach

ICML - Haifa, Israel June 24, 2010

Percy Liang Michael I. Jordan Dan Klein

Motivating Application: Repetitive Text Editing

I like programs, but I wish programswould just program themselves sinceI don’t like programming. =⇒

I like programs, but I wish programswould just program themselves sinceI don’t like programming.

2




Goal: Programming by Demonstration

If the user demonstrates italicizing the first occurrence,can we generalize to the remaining?

2






Solution: represent task by a program to be learned

1. Move to next occurrence of word with prefix program2. Insert 3. Move to end of word4. Insert 

2






Solution: represent task by a program to be learned

1. Move to next occurrence of word with prefix program2. Insert 3. Move to end of word4. Insert 

Challenge: learn from very few examples2

General Setup

Goal:

(X1, Y1)· · ·

(Xn, Yn)

Training data

3

General Setup

Goal:

(X1, Y1)· · ·

(Xn, Yn)=⇒ Z such that (Z Xj) = Yj

Training data Consistent program

3

General Setup

Goal:

(X1, Y1)· · ·



Challenge:

When n small, many programs consistent with training data.

I like programs, but I wish programswould just program themselves sinceI don’t like programming.

Move to beginning of third word, ...Move to beginning of word after like, ...Move 7 spaces to the right, ...Move to word with prefix program, ...· · ·

3

General Setup

Goal:

(X1, Y1)· · ·



Challenge:

When n small, many programs consistent with training data.

I like programs, but I wish programswould just program themselves sinceI don’t like programming.

Move to beginning of third word, ...Move to beginning of word after like, ...Move 7 spaces to the right, ...Move to word with prefix program, ...· · ·

Which program to choose?3

Key Intuition

One task:

Want to choose a program which is simple (Occam’s razor).

Examples =⇒ Z

4

Key Intuition

One task:


Examples =⇒ Z

What’s the right complexity metric (prior)?

4

Key Intuition

One task:


Examples =⇒ Z

What’s the right complexity metric (prior)? No general answer.

4

Key Intuition

One task:


Examples =⇒ Z


Multiple tasks:

Task 1 examples =⇒ Z1

· · · · · ·Task K examples =⇒ ZK

4

Key Intuition

One task:


Examples =⇒ Z


Multiple tasks:



Find programs that share common subprograms.

4

Key Intuition

One task:


Examples =⇒ Z


Multiple tasks:




• Programs do tend to share common components.

4

Key Intuition

One task:


Examples =⇒ Z


Multiple tasks:




• Programs do tend to share common components.

• Penalize joint complexity of all K programs.

4

Outline of Proposed Solution

Program representation: What are subprograms?

Combinatory logic +

∗ I

B I

S

B 1C

5



Combinatory logic +

∗ I

B I

S

B 1C

Probabilistic model: Which programs are favorable?

Nonparametric Bayesα0

p0

{Gt} Zi Yij Xij

nK

5



Combinatory logic +

∗ I

B I

S

B 1C



p0

{Gt} Zi Yij Xij

nK

Statistical inference: How do we search for good programs?

MCMCx y

r1Br z

r0

⇒ x

y z

r′1

r′0r

5



Combinatory logic +

∗ I

B I

S

B 1C



p0

{Gt} Zi Yij Xij

nK


MCMCx y

r1Br z

r0

⇒ x

y z

r′1

r′0r

6

Representation: What Language?

Goal: allow sharing of subprograms

7



Our language:

Combinatory logic [Schonfinkel, 1924]

7



Our language:


+ higher-order combinators (new)

+ routing intuition, visual representation (new)

7



Our language:




Properties: no mutation, no variables ⇒ simple semantics

7



Our language:




Properties: no mutation, no variables ⇒ simple semantics

Result:

• Programs are trees

• Subprograms are subtrees

7

Programs with No Arguments

Example: compute min(3, 4)

8


Example: compute min(3, 4)(if (< 3 4) 3 4)

8



if

< 34

34

8



if

< 34

34

General:

x y⇒ result of applying function x to argument y

8



if

< 34

34

General:


Arguments are curried

8


Example: compute min(3, 4)(if (< 3 4) 3 4) (if true 3 4)

if

< 34

34⇒

if true3

4

General:



8


Example: compute min(3, 4)(if (< 3 4) 3 4) (if true 3 4)

if

< 34

34⇒

if true3

4 ⇒ 3

General:



8

Programs with One Argument

Example: x 7→ x2 + 1

9



λx . +

∗ xx

1

Lambda calculus

9



λx . +

∗ xx

1

+

∗ I

B I

S

B 1C

Lambda calculus Combinatory logic

9



λx . +

∗ xx

1

+

∗ I

B I

S

B 1C


Intuition:

Combinators {B,C,S, I} encode placement of arguments

9



λx . +

∗ xx

1

+

∗ I

B I

S

B 1C


Intuition:


Semantics:

x y

r⇔ (r x y)

r ∈ {B,C,S, I} 9



λx . +

∗ xx

1

+

∗ I

B I

S

B 1C


Intuition:


Semantics:

x y

r⇔ (r x y)

r ∈ {B,C,S, I}

Rules:(B f g x) = (f (g x))... 9


Example: Apply x 7→ x2 + 1 to 5

10



+

∗ I

B I

S

B 1C 5

10



+

∗ I

B I

S

B 1C 5

x yC z ⇔

x zy

route left

10



+

∗ I

B I

S

B 51 x y

C z ⇔x z

y

route left

10



+

∗ I

B I

S

B 51 x y

C z ⇔x z

y

route left

x yB z ⇔ x

y z

route right

10



+

∗ I

B I

S 5

1 x yC z ⇔

x zy

route left

x yB z ⇔ x

y z

route right

10



+

∗ I

B I

S 5

1 x yC z ⇔

x zy

route left

x yB z ⇔ x

y z

route right

x yS z ⇔

x z y z

route left and right10



+

∗ I

B 5 I 5

1 x yC z ⇔

x zy

route left

x yB z ⇔ x

y z

route right

x yS z ⇔

x z y z

route left and right10



+

∗ I

B 5 I 5

1 x yC z ⇔

x zy

route left

x yB z ⇔ x

y z

route right

x yS z ⇔

x z y z

route left and right

I x ⇔ x

stop10



+

∗ I

B 55

1 x yC z ⇔

x zy

route left

x yB z ⇔ x

y z

route right

x yS z ⇔

x z y z


I x ⇔ x

stop10



+

∗ 55

1

x yC z ⇔

x zy

route left

x yB z ⇔ x

y z

route right

x yS z ⇔

x z y z


I x ⇔ x

stop10

Programs with Multiple Arguments

Example: (x, y) 7→ min(x, y)

11


Example: (x, y) 7→ min(x, y)Classical: first-order combinators {B,C,S, I}

Complete basis, so can implement min, but cumbersome

11




New: higher-order combinators {B,C,S, I}∗

Infinite basis, but resulting programs are more intuitive

e.g., CS routes 1st arg. left, 2nd arg. left and right

11







if <

BB I

SC I

CS

11







if <

BB I

SC I

CS 34

11







if <

BB I

SC I

CS 34

11







if< 3

B 3C I

S 4

11







if< 3

B 3C I

S 4

11







if

< 34

34

11

Using Combinators for Refactoring

min max

if <

BB I

SC I

CS

if >

BB I

SC I

CS

12


min max

if <

BB I

SC I

CS

if >

BB I

SC I

CS

No significant sharing of subtrees (subprograms)

12


min max

if <

BB I

SC I

CS

if >

BB I

SC I

CS


Refactored:

if I

BBB I

CSC I

CCS <

if I

BBB I

CSC I

CCS >

12


min max

if <

BB I

SC I

CS

if >

BB I

SC I

CS


Refactored:

if I

BBB I

CSC I

CCS <

if I

BBB I

CSC I

CCS >

Fruitful sharing of subtrees (subprograms)12

Summary

Introduced new combinatory logic basis (intuition: routing)

13

Summary


Purpose of these combinators:

• Represent multi-argument functions

• Allow refactoring to expose common substructures

13

Summary


Purpose of these combinators:

• Represent multi-argument functions

• Allow refactoring to expose common substructures

Achieved uniformity: Every subtree is a subprogram

13



Combinatory logic +

∗ I

B I

S

B 1C



p0

{Gt} Zi Yij Xij

nK


MCMCx y

r1Br z

r0

⇒ x

y z

r′1

r′0r

14

Probabilistic Context-Free Grammars

GenIndep(t): [returns a combinator of type t]

15


GenIndep(t): [returns a combinator of type t]With probability λ0:

15



Return a random primitive combinator (e.g., +, 3, I)

15



Return a random primitive combinator (e.g., +, 3, I)Else:

Choose a type sx← GenIndep(s→ t)

15




Choose a type sx← GenIndep(s→ t)y ← GenIndep(s)

15




Choose a type sx← GenIndep(s→ t)y ← GenIndep(s)return (x, y)

15





Example:

GenIndep(int→ int) =⇒ + 1

15





Example:

GenIndep(int→ int) =⇒ ∗

− 31

15





Example:

GenIndep(int→ int) =⇒ ∗

− 31

Problem: No encouragement to share subprograms

15

Adaptor Grammars [Johnson, 2007]

Ct ← [ ] for each type t [cached list of combinators]

16


Ct ← [ ] for each type t [cached list of combinators](notation: return∗ c adds c to Ct and returns c)

16



GenCache(t): [returns a combinator of type t]

With probability α0+Ntdα0+|Ct|

:

16





:

With probability λ0:Return∗ a random primitive combinator (e.g., +, 3, I)

Else:Choose a type sx← GenCache(s→ t)y ← GenCache(s)Return∗ (x, y)

Else:

16





:



Else:

Return∗ z ∈ Ct with probability Mz−d|Ct|−Ntd

16





:



Else:

Return∗ z ∈ Ct with probability Mz−d|Ct|−Ntd

Interpretation of cache Ct: library of generally useful(unnamed) subroutines which are reused.

16



Combinatory logic +

∗ I

B I

S

B 1C



p0

{Gt} Zi Yij Xij

nK


MCMCx y

r1Br z

r0

⇒ x

y z

r′1

r′0r

17

Inference via MCMC

User provides tree structure that encodes set of programs U

Objective: sample from posterior given program in U

18

Inference via MCMC



Use Metropolis-Hastings

Proposal: sample a random program transformation

18

Inference via MCMC





Program transformations maintain invariant that

program is correct (likelihood is 1)

18

Inference via MCMC





Program transformations maintain invariant that

program is correct (likelihood is 1)

Two types of transformations:

1. Switching

2. Refactoring

18

Program transformations (MCMC moves)

Switching: Change content, preserve empirical semantics

19



Data: {(2, 8)}

19



Data: {(2, 8)} ∗+ 2

S

[x 7→ x(x + 2)]

19



Data: {(2, 8)} ∗+ 2

S

[x 7→ x(x + 2)]

⇔∗∗ I

S

S

[x 7→ x3]

19



Data: {(2, 8)} ∗+ 2

S

[x 7→ x(x + 2)]

⇔∗∗ I

S

S

[x 7→ x3]

Purpose: change generalization

19



Data: {(2, 8)} ∗+ 2

S

[x 7→ x(x + 2)]

⇔∗∗ I

S

S

[x 7→ x3]


Refactoring: Change form, preserve total semantics

19



Data: {(2, 8)} ∗+ 2

S

[x 7→ x(x + 2)]

⇔∗∗ I

S

S

[x 7→ x3]



∗+ 2

S

[x 7→ x(x + 2)]

19



Data: {(2, 8)} ∗+ 2

S

[x 7→ x(x + 2)]

⇔∗∗ I

S

S

[x 7→ x3]



∗+ 2

S

[x 7→ x(x + 2)]

⇔ ∗ +BS 2

[x 7→ x(x + 2)]

19



Data: {(2, 8)} ∗+ 2

S

[x 7→ x(x + 2)]

⇔∗∗ I

S

S

[x 7→ x3]



∗+ 2

S

[x 7→ x(x + 2)]

⇔ ∗ +BS 2

[x 7→ x(x + 2)]

Purpose: expose different subprograms for sharing19

Text Editing Experiments

Setup:

Dataset of [Lau et al., 2003]K = 24 tasksEach task: train on 2–5 examples, test on u 13 examples10 random trials

20

Text Editing Experiments

Setup:

Dataset of [Lau et al., 2003]K = 24 tasksEach task: train on 2–5 examples, test on u 13 examples10 random trials

Example task:

Cardinals 5, Pirates 2.⇒

GameScore[ winner ’Cardinals’; loser ’Pirates’; scores [5, 2]].

20

Experimental Results

Uniform prior

Independent prior

Joint prior

21


2 3 4 5

5

10

15

20

25

erro

r

Uniform prior

Independent prior

Joint prior

21


2 3 4 5

5

10

15

20

25

erro

r

Uniform prior

Independent prior

Joint prior

Observations:

• Independent prior is even worse than uniform prior

21


2 3 4 5

5

10

15

20

25

erro

r

Uniform prior

Independent prior

Joint prior

Observations:

• Independent prior is even worse than uniform prior

• Joint prior (multi-task learning) is effective

21

Summary

X ⇒ ⇒ Y

22

Summary

X ⇒ program ⇒ Y

22

Summary

X ⇒ program ⇒ Y

Key challenge: learn programs from few examples

22

Summary

X ⇒ program ⇒ Y


Main idea: share subprograms across multiple tasks

22

Summary

X ⇒ program ⇒ Y


Main idea: share subprograms across multiple tasks

Tools:

• Combinatory logic: expose subprograms to be shared

• Adaptor grammars: encourage sharing of subprograms

•Metropolis-Hastings: proposals are program transformations

22

Documents

Learning Programs: A Hierarchical Bayesian Approachpliang/papers/programs-icml2010-talk.pdf · would just program themselves since I don't like programming