Page 1

Gaussian Processes for NLP

Trevor Cohn & Daniel Beck

CIS, University of Melbourne
DCS, University of Sheffield

ALTA Workshop, 26 November 2014

with slides from Daniel Preotiuc-Pietro, Neil Lawrence, Richard Turner

Page 2

Gaussian Processes

Brings together several key ideas in one framework:

- Bayesian
- kernelised
- non-parametric
- non-linear
- modelling uncertainty

Elegant and powerful framework, with growing popularity in machine learning and application domains.

Page 3

Gaussian Processes

State of the art for regression:

- exact posterior inference
- supports very complex non-linear functions
- elegant model selection

Page 4

Gaussian Processes

Now mature enough for use in NLP:

- support for classification, ranking, etc.
- fancy kernels, e.g., over text
- sparse approximations for large scale inference
- probabilistic formulation allows incorporation into larger graphical models
- models the prediction uncertainty so that it's propagated through pipelines of probabilistic components

Page 5

Tutorial Scope

Covers:

1. GP fundamentals (1 hour)
   - focus on regression
   - weight space vs. function space view
   - squared exponential kernel
2. NLP applications (1 hour 30)
   - sparse GPs
   - periodic kernels and model selection
   - multi-output GPs
3. Further topics (30 mins)
   - classification and other likelihoods
   - structured kernels

Page 6

Outline

GP fundamentals

NLP Applications
- Sparse GPs: Characterising user impact
- Model selection and Kernels: Identifying temporal patterns in word frequencies
- Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
- Classification
- Structured prediction
- Structured kernels

Page 7

Gaussian Processes

Regression and classification are fundamental techniques in NLP.

Linear models:
- simple and fast to run
- provide straightforward interpretability
- but very limited in the type of relations they can model

Non-linear models (e.g. SVM):
- better results because they allow modelling other relationships
- but high cost of hyperparameter optimisation
- lack of interoperability with downstream processing

Page 8

Regression

Page 9

. . . versus classification

Page 10

The Gaussian (normal) distribution

The Gaussian is probably the most widely used probability density:

p(y | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) \triangleq N(\mu, \sigma^2)

Multivariate formulation:

p(\mathbf{y} | \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu})\right)

where \mu is now a mean vector, and \Sigma a covariance matrix.


Page 12

The Gaussian distribution

[Figure: Gaussian density p(h | \mu, \sigma^2) over h, height/m]

The Gaussian PDF with \mu = 1.7 and variance \sigma^2 = 0.0225. Mean shown as a red line. It could represent the heights of a population of students.

Page 13

Bivariate Gaussian distribution: Sampling Two Dimensional Variables

[Figure: joint distribution over height h/m and weight w/kg, with marginal distributions p(h) and p(w)]

Page 14

Gaussian distribution properties

Several very convenient properties:

- Scaling a Gaussian results in a Gaussian
- Sum of Gaussian variables is also Gaussian (more generally, the central limit theorem)
- Product of two Gaussians is a (scaled) Gaussian

For multivariate Gaussian distributed variables:

- Marginal distribution is also Gaussian
- Conditional distribution is also Gaussian

Page 15

Gaussian Processes

The Gaussian distribution:
- is a distribution over scalars or vectors

The Gaussian process:
- is a distribution over functions


Page 17

Being Bayesian

Revisiting our regression example:

- what class of functions should we allow?
- given the class, what parameter values are sensible?
- e.g., linear: should we let m, c \to \pm\infty?
- e.g., non-linear: should it be jagged or smooth?
- e.g., periodic . . .

Usually we have some intuitions.

Page 18

Being Bayesian

Define a prior over the unknowns, p(\theta).
- encodes our initial intuitions about reasonable values

Observing data gives rise to a likelihood, p(y | \theta).
- how well the model fits the training sample

Seek to reason over the posterior:
- incorporating the above to form our updated beliefs about the unknowns
- formulated using Bayes' rule

p(\theta | y) = \frac{p(y | \theta)\, p(\theta)}{p(y)}

Page 19

Bayesian update illustrated

p(c) = N(c \mid 0, \alpha_1)
p(y \mid m, c, x, \sigma^2) = N(y \mid mx + c, \sigma^2)
p(c \mid y, m, x, \sigma^2) = N\left(c \,\middle|\, \frac{y - mx}{1 + \sigma^2/\alpha_1}, (\sigma^{-2} + \alpha_1^{-1})^{-1}\right)

Figure: A Gaussian prior combines with a Gaussian likelihood for a Gaussian posterior. Example of linear regression using a prior over the intercept parameter, c.

Page 20

Gaussian Processes prior

- the prior represents our prior beliefs over the functions we expect to have generated our (regression) data
- here, standard choices of 0 mean and 1 variance, and a smoothly varying function

Page 21

Gaussian Process posterior

- after observing data points, the posterior now incorporates the data likelihood
- the functions must pass close to the observations

Page 22

Relationship to linear regression

Linear regression:

y = \mathbf{w}^\top \mathbf{x} + \epsilon

where \mathbf{w} are the learned model parameters.

Gaussian process regression:

y = f(\mathbf{x}) + \epsilon

where the function f is drawn from a Gaussian process prior.

- the \mathbf{w} parameters are gone (in fact, integrated out)
- f is not limited to a parametric form

Page 23

Graphical model view

f \sim GP(m, k)
y \sim N(f(x), \sigma^2)

- f : R^D \to R is a latent function
- y is a noisy realisation of f(x)
- k is the covariance function or kernel
- \sigma^2 is a learned parameter

[Plate diagram: kernel k and noise \sigma govern the latent f; together with input x this generates observed y, repeated over N data points]

Page 24

Formal definition

Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is fully specified by a mean function m(x) and a covariance function k(x, x'):

- m(x) = E[f(x)]
- k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]

Without loss of generality we can assume m(x) = 0.

Page 25

Compared to a Gaussian distribution

Gaussian distribution:
- specified by a mean vector and covariance matrix: x \sim N(\mu, \Sigma)
- the vector index is the position of the random variable in x

Gaussian process:
- specified by a mean and covariance function: f \sim GP(m, k)
- the index is the argument x of f

The Gaussian process is an infinite extension of the multivariate Gaussian distribution.

Page 26

Gaussian distribution

p(\mathbf{y} | \Sigma) \propto \exp\left(-\frac{1}{2}\mathbf{y}^\top \Sigma^{-1} \mathbf{y}\right), \quad \Sigma = \begin{bmatrix} 1 & .7 \\ .7 & 1 \end{bmatrix}

[Contour plot over (y_1, y_2)]

Page 27

Gaussian distribution

p(\mathbf{y} | \Sigma) \propto \exp\left(-\frac{1}{2}\mathbf{y}^\top \Sigma^{-1} \mathbf{y}\right), \quad \Sigma = \begin{bmatrix} 1 & .4 \\ .4 & 1 \end{bmatrix}

[Contour plot over (y_1, y_2)]

Page 28

Gaussian distribution

p(\mathbf{y} | \Sigma) \propto \exp\left(-\frac{1}{2}\mathbf{y}^\top \Sigma^{-1} \mathbf{y}\right), \quad \Sigma = \begin{bmatrix} 1 & .0 \\ .0 & 1 \end{bmatrix}

[Contour plot over (y_1, y_2)]


Page 31

Gaussian distribution

p(y_2 | y_1, \Sigma) \propto \exp\left(-\frac{1}{2}(y_2 - \mu^*)\, {\Sigma^*}^{-1} (y_2 - \mu^*)\right)

[Contour plot over (y_1, y_2), conditioning on an observed value of y_1]


Page 34

New visualisation

[Figure: left, samples from the bivariate Gaussian over (y_1, y_2); right, the same sample values plotted against the variable index]

\Sigma = \begin{bmatrix} 1 & .9 \\ .9 & 1 \end{bmatrix}


Page 38

New visualisation

[Figure: left, samples over (y_1, y_5); right, the sample values plotted against variable index 1..5]

\Sigma = \begin{bmatrix} 1 & .9 & .8 & .6 & .4 \\ .9 & 1 & .9 & .8 & .6 \\ .8 & .9 & 1 & .9 & .8 \\ .6 & .8 & .9 & 1 & .9 \\ .4 & .6 & .8 & .9 & 1 \end{bmatrix}


Page 47

New visualisation

[Figure: left, samples over (y_1, y_20); right, the sample values plotted against variable index 1..20]

\Sigma is now a 20 x 20 covariance matrix (shown as an image in the slide).


Page 53

Regression using Gaussians

[Figure: sampled functions plotted against variable index 1..20, with the 20 x 20 covariance matrix \Sigma shown as an image]

Page 54

Regression using Gaussians

\Sigma(x_1, x_2) = K(x_1, x_2) + I\sigma_y^2

K(x_1, x_2) = \sigma^2 \exp\left(-\frac{1}{2l^2}(x_1 - x_2)^2\right)

[Figure: sampled functions plotted against variable index 1..20, with the covariance matrix shown as an image]
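
The figure on this slide can be reproduced directly from these two formulas: build \Sigma from the SE kernel plus noise, then draw vectors from N(0, \Sigma). A minimal Python/numpy sketch (not part of the original slides; the lengthscale, signal variance and noise values are illustrative assumptions):

    import numpy as np

    def se_kernel(x1, x2, lengthscale=3.0, variance=1.0):
        # squared exponential covariance between two sets of 1-D inputs
        sqdist = (x1[:, None] - x2[None, :]) ** 2
        return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

    x = np.arange(1.0, 21.0)                      # the 20 "variable index" positions
    noise_var = 0.05                              # assumed sigma_y^2
    Sigma = se_kernel(x, x) + noise_var * np.eye(len(x))

    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(len(x)), Sigma, size=3)  # three function draws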

Page 55

Regression: probabilistic inference in function space

Non-parametric (∞-parametric) model:

p(\mathbf{y} | \theta) = N(\mathbf{0}, \Sigma)
\Sigma(x_1, x_2) = K(x_1, x_2) + I\sigma_y^2
K(x_1, x_2) = \sigma^2 \exp\left(-\frac{1}{2l^2}(x_1 - x_2)^2\right)

[Figure: function estimate with uncertainty, with the covariance matrix shown as an image]

Parametric model:

y(x) = f(x; \theta) + \sigma_y \epsilon, \quad \epsilon \sim N(0, 1)

Page 56

Regression: probabilistic inference in function space

The same model, with the roles of the hyperparameters annotated:

\Sigma(x_1, x_2) = K(x_1, x_2) + I\sigma_y^2, where \sigma_y^2 is the noise

K(x_1, x_2) = \sigma^2 \exp\left(-\frac{1}{2l^2}(x_1 - x_2)^2\right), where \sigma^2 is the vertical scale and l the horizontal scale

p(\mathbf{y} | \theta) = N(\mathbf{0}, \Sigma)

Parametric model: y(x) = f(x; \theta) + \sigma_y \epsilon, \quad \epsilon \sim N(0, 1)

Page 57

Squared Exponential (SE) kernel

The usual choice for the covariance is the Squared Exponential (SE) kernel (a.k.a. RBF, exponentiated quadratic, Gaussian):

k(x, x') = \sigma_d^2 \exp\left(-\frac{(x - x')^2}{2l^2}\right)

The kernel's parameters \{l, \sigma_d\} are called the GP's hyperparameters.

When l is very large, points remain highly correlated even when far apart, i.e. that feature is less relevant.

Page 58

How to select the hyperparameters?

[Figure] Varying the lengthscale from very low (0.04) to high (7.4), with the noise hyperparameter fixed.

Page 59

Hyperparameters

Training a GP means:
- finding the right kernel
- finding values of its hyperparameters

The covariance function controls properties of the GP:
- smoothness
- stationarity
- periodicity
- etc.

Page 60

Other kernels

Page 61

Inference

Although the GP is an infinite dimensional object, the marginalisation property means we only need to work with finite dimensions.

For prediction we need the conditional distribution over test points y_* given observed points y:

p(y_* | \mathbf{y}, x_*, X) = \frac{p(\mathbf{y}, y_* | x_*, X)}{p(\mathbf{y} | X)}

We'll omit the conditioning on x_*, X for brevity henceforth.

Page 62

Inference

Consider the GP over a single extra point:

p(\mathbf{y}) = N(\mathbf{0}, K + \sigma_n^2 I)

p(\mathbf{y}, y_*) = N\left(\begin{bmatrix} \mathbf{0} \\ 0 \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^\top & K_{**} + \sigma_n^2 \end{bmatrix}\right)

where the K are covariance matrices, i.e.,

- K \in R^{N \times N} are the covariances between training instances
- K_* \in R^N are the covariances between the test and training instances
- K_{**} \in R is the test variance

Page 63

Inference

Recall the property of the multivariate Gaussian: the conditional distribution is Gaussian.

This leads to the following predictive posterior:

p(y_* | \mathbf{y}) = N(\mu_*, \Sigma_*)
\mu_* = K_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}
\Sigma_* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*

- Note that the predictive mean is linear in the data; and
- the predictive covariance is the prior covariance reduced by the information the observations give about the function
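
These two expressions are all that is needed for exact GP regression. A minimal numpy sketch of the predictive equations (an illustrative implementation, not code from the tutorial), reusing a kernel function such as the se_kernel above and a Cholesky factorisation rather than an explicit inverse:

    import numpy as np

    def gp_predict(X, y, X_star, kern, noise_var):
        # predictive mean and covariance, following the formulas above
        K = kern(X, X) + noise_var * np.eye(len(X))          # K + sigma_n^2 I
        K_star = kern(X, X_star)                             # covariances between train and test
        K_ss = kern(X_star, X_star)                          # test covariance K_**
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma_n^2 I)^{-1} y
        v = np.linalg.solve(L, K_star)
        mu_star = K_star.T @ alpha                           # K_*^T (K + sigma_n^2 I)^{-1} y
        Sigma_star = K_ss - v.T @ v                          # K_** - K_*^T (K + sigma_n^2 I)^{-1} K_*
        return mu_star, Sigma_star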

Page 64

Model selection

The Bayesian formulation seemingly removes the need for training, as the parameters are integrated out.

However, we still need to select hyperparameter values, e.g.,
- \sigma_n, the white noise
- l, the length (horizontal) scale
- \sigma_d, the vertical scale
- . . .

How to select appropriate values?

Consider the marginal likelihood, p(y | X).

Page 65

Model selection

The marginal likelihood is defined as

p(\mathbf{y} | X) = \int p(\mathbf{y}, \mathbf{f} | X)\, d\mathbf{f} = \int p(\mathbf{y} | \mathbf{f})\, p(\mathbf{f} | X)\, d\mathbf{f}

Observe that:
- as the space of possible functions grows, p(\mathbf{f} | X) diminishes
- poorly fitting functions will score poorly on p(\mathbf{y} | \mathbf{f})
- both are affected by the type of kernel and the hyperparameter values

The marginal likelihood balances these opposing forces.

Page 66

Model selection for training

The marginal likelihood permits an analytical solution due to conjugacy:

p(\mathbf{y} | X) = \int \underbrace{p(\mathbf{y} | \mathbf{f})}_{\text{Gaussian}} \underbrace{p(\mathbf{f} | X)}_{\text{Gaussian}} d\mathbf{f} = N(\mathbf{0}, K + \sigma_n^2 I)

arising from the property that the product of two Gaussians is an unnormalised Gaussian.

This permits the analytical solution

\log p(\mathbf{y} | X) = -\frac{1}{2}\mathbf{y}^\top (K + \sigma_n^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log|K + \sigma_n^2 I| - \frac{n}{2}\log 2\pi

We can optimise the above with respect to the model hyperparameters using gradient ascent, a.k.a. Type II maximum likelihood.
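
The log marginal likelihood is equally direct to compute. A numpy sketch (illustrative, not from the slides); toolkits optimise it by gradient ascent, but even a crude grid over the lengthscale exposes the trade-off described on the previous slide:

    import numpy as np

    def log_marginal_likelihood(X, y, kern, noise_var):
        n = len(X)
        Ky = kern(X, X) + noise_var * np.eye(n)
        L = np.linalg.cholesky(Ky)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return (-0.5 * y @ alpha                # -1/2 y^T (K + sigma_n^2 I)^{-1} y
                - np.sum(np.log(np.diag(L)))    # -1/2 log |K + sigma_n^2 I|
                - 0.5 * n * np.log(2 * np.pi))

    # Type II ML by grid search over the lengthscale (gradient ascent in practice):
    # best_l = max(np.logspace(-1, 1, 25), key=lambda l: log_marginal_likelihood(
    #     x, y_obs, lambda a, b: se_kernel(a, b, lengthscale=l), noise_var=0.05))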

Page 67

Learning Covariance Parameters

Can we determine length scales and noise levels from the data?

E(l) = \frac{1}{2}\log|K| + \frac{\mathbf{y}^\top K^{-1}\mathbf{y}}{2}

[Figure: left, data y(x) with the GP fit; right, the objective E(l) as a function of the length scale l]


Page 76

Beyond GP regression

- Classification
- Ordinal regression
- Count data
- Model selection: kernel and hyperparameter optimisation
- Kernel design
- Extrapolation cf. interpolation
- Multi-output GPs for multi-task learning
- Sparse GPs for scaling to large feature and input spaces
- Latent variable models, non-linear probabilistic variants of PCA/CCA et al.
- And many others ...

Page 77

GP limitations

The non-parametric formulation complicates scaling:
- O(N^3) time complexity and O(N^2) space complexity
- but ongoing work to bring this down, e.g., O(NM^2) time complexity or even constant in N

Brilliant for regression, but more difficult for other likelihoods:
- a suite of approximation approaches deals with inference for non-conjugate configurations

Not as mature as many other frameworks:
- but 'coming of age', many big issues have been addressed
- even application to 'big data' scenarios

Page 78

Resources

Free book: http://www.gaussianprocess.org/gpml/chapters/

Page 79

Tutorials

- GPs for Natural Language Processing tutorial (ACL 2014): http://goo.gl/18heUk
- GP Schools in Sheffield and roadshows in Kampala, Pereira, and (future) Nyeri, Melbourne: http://ml.dcs.shef.ac.uk/gpss/
- GP Regression demo: http://www.tmpl.fi/gp/
- Annotated bibliography and other materials: http://www.gaussianprocess.org

Page 80

Toolkits

- GPML (Matlab): http://www.gaussianprocess.org/gpml/code
- GPy (Python): https://github.com/SheffieldML/GPy
- GPstuff (R, Matlab, Octave): http://becs.aalto.fi/en/research/bayes/gpstuff/
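
As a concrete starting point, a minimal GPy example covering everything so far — SE kernel, exact regression, Type II ML — is sketched below (API as in recent GPy releases; the toy data are made up for illustration):

    import numpy as np
    import GPy

    X = np.random.rand(50, 1)                         # toy 1-D inputs
    y = np.sin(6 * X) + 0.1 * np.random.randn(50, 1)  # noisy observations

    kernel = GPy.kern.RBF(input_dim=1)                # squared exponential / RBF kernel
    model = GPy.models.GPRegression(X, y, kernel)     # exact GP regression
    model.optimize()                                  # Type II maximum likelihood
    mean, var = model.predict(np.linspace(0, 1, 100)[:, None])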

Page 81

Outline

GP fundamentals

NLP Applications
- Sparse GPs: Characterising user impact
- Model selection and Kernels: Identifying temporal patterns in word frequencies
- Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
- Classification
- Structured prediction
- Structured kernels


Page 83

Case study: User impact on Twitter

Predicting and characterising user impact on Twitter:

- define a user-level impact score
- use the user's text and profile information as features to predict the score
- analyse the features which best predict the score
- provide users with 'guidelines' for improving their score

An instance of a text prediction problem:

- emphasis on feature analysis and interpretability (specific to social science applications)
- non-linear variation

See Lampos et al. (2014), EACL.

Page 84

Sparse GPs

Exact inference in a GP:
- Memory: O(n^2)
- Time: O(n^3)

Sparse GP approximation:
- Memory: O(n \cdot m)
- Time: O(n \cdot m^2)

where m is selected at runtime.

Typically required when n > 1000.
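
In GPy (listed in the toolkits above) this trade-off is roughly a one-line change: swap the exact model for a sparse one with m inducing points. A hedged sketch — the exact approximation used (e.g. FITC vs variational) depends on the GPy version and options, and the data here are synthetic placeholders:

    import numpy as np
    import GPy

    n = 5000
    X = np.random.rand(n, 1)
    y = np.sin(6 * X) + 0.1 * np.random.randn(n, 1)

    kernel = GPy.kern.RBF(input_dim=1)
    # m = 50 inducing points replace the full n x n covariance in the expensive terms
    model = GPy.models.SparseGPRegression(X, y, kernel=kernel, num_inducing=50)
    model.optimize()   # hyperparameters and inducing inputs optimised together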

Page 85

Sparse GPs

Many options for sparse approximations:

- Based on inducing variables:
  - Subset of Data (SoD)
  - Subset of Regressors (SoR)
  - Deterministic Training Conditional (DTC)
  - Partially Independent Training Conditional (PITC)
  - Fully Independent Training Conditional (FITC)
- Fast Matrix Vector Multiplication (MVM)
- Variational methods

See Quinonero-Candela and Rasmussen (2005) for an overview.

Page 86

Sparse GPs

Sparse approximations where the f are treated as latent variables:

- a subset u, with |u| = m, is treated exactly
- the others are given a computationally cheaper treatment
- avoids large matrix inversions

[Figure from Hensman, GPSS '13: what does the penalty term do?]

Inducing points are fixed or learned using greedy search / optimisation.

Page 87

Predicting and characterising user impact

500 million tweets a day on Twitter:

- important and some not so important information
- breaking news from media
- friends
- celebrity self promotion
- marketing
- spam

Can we automatically predict the impact of a user?

Can we automatically identify factors which influence user impact?

Page 88

Defining user impact

Define impact as a function of network connections:
- no. of followers
- no. of followees
- no. of times the account is listed by others

Impact = \ln\left(\frac{\text{listings} \cdot \text{followers}^2}{\text{followees}}\right)

Dataset:
- 38,000 UK users
- all tweets from one year
- 48 million deduplicated messages
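
The impact score is a simple deterministic function of the three counts; a one-line Python helper (illustrative, with made-up example numbers):

    import math

    def impact_score(listings, followers, followees):
        # Impact = ln(listings * followers^2 / followees), as defined above
        return math.log(listings * followers ** 2 / followees)

    # e.g. impact_score(listings=120, followers=5000, followees=300) ~= 16.1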

Page 89

User controlled features

Only features under the user's control (e.g. not the no. of retweets):

- User features (18): extracted from the account profile; aggregated text features
- Text features (100): the user's topic distribution; topics computed using spectral clustering on the word co-occurrence (NPMI) matrix

Page 90

Models

Regression task:

- Gaussian Process regression model
- n = 38000 \cdot 9/10; use sparse GPs with FITC
- Squared Exponential kernel (k-dimensional):

k(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x}_p - \mathbf{x}_q)^\top D (\mathbf{x}_p - \mathbf{x}_q)\right) + \sigma_n^2 \delta_{pq}

where D \in R^{k \times k} is a symmetric matrix.

- if D_{ARD} = \mathrm{diag}(\mathbf{l})^{-2}:

k(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{k}\frac{(x_{pd} - x_{qd})^2}{l_d^2}\right) + \sigma_n^2 \delta_{pq}

Page 91

Automatic Relevance Determination (ARD)

- with D = D_{ARD} the kernel is the SE kernel with automatic relevance determination (ARD), with the vector \mathbf{l} denoting the characteristic length-scale of each feature
- l_d measures the distance for being uncorrelated along x_d
- 1/l_d^2 is proportional to how relevant a feature is: a large length-scale means the covariance becomes independent of that feature's value
- sorting by length-scales indicates which features impact the prediction the most
- tuning these parameters is done via Bayesian model selection (a short sketch follows below)
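
A sketch of this recipe with GPy's ARD option (the feature matrix and targets below are placeholders, not the Twitter data): fit a GP with one lengthscale per feature, then rank features by 1/l_d^2.

    import numpy as np
    import GPy

    X = np.random.randn(200, 5)    # placeholder n x k feature matrix
    y = np.random.randn(200, 1)    # placeholder impact scores

    kernel = GPy.kern.RBF(input_dim=X.shape[1], ARD=True)   # one lengthscale l_d per feature
    model = GPy.models.GPRegression(X, y, kernel)
    model.optimize()

    relevance = 1.0 / np.asarray(kernel.lengthscale) ** 2    # small l_d => relevant feature
    ranking = np.argsort(-relevance)                         # most relevant features first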

Page 92

Prediction results

Experiments:
- 10-fold cross validation
- using the predictive mean
- baseline model is ridge regression (LIN)
- Profile features
- Text features

[Bar chart: Pearson correlation (axis 0.6-0.8) comparing LIN and GP]

Conclusions:
- GPs substantially better than ridge regression
- non-linear GPs with only profile features perform better than linear methods with all features
- GPs outperform SVR
- adding topic features improves all models

Page 93

Selected features

Feature | Importance
Using default profile image | 0.73
Total number of tweets (entire history) | 1.32
Number of unique @-mentions in tweets | 2.31
Number of tweets (in dataset) | 3.47
Links ratio in tweets | 3.57
T1 (Weather): mph, humidity, barometer, gust, winds | 3.73
T2 (Healthcare, Housing): nursing, nurse, rn, registered, bedroom, clinical, #news, estate, #hospital | 5.44
T3 (Politics): senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections | 6.07
Proportion of days with non-zero tweets | 6.96
Proportion of tweets with @-replies | 7.10

Page 94

Feature analysis

[Histograms: impact for users with low (L) vs high (H) values of "No. of unique @-mentions" and "No. of tweets"]

Impact histogram for users with high (H) values of this feature as opposed to low (L). The red line is the mean impact score.

Page 95

Feature analysis

τ3: damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer

τ4: senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections

Impact histogram for users with high (H) values of this feature. The red line is the mean impact score.

Page 96

Conclusions

User impact is highly predictable:
- user behaviour is very informative
- "tips" for improving your impact

The GP framework is suitable:
- non-linear modelling
- ARD feature selection
- sparse GPs allow large scale experiments
- empirical improvements over linear models & SVR


Page 98

Case study: Temporal patterns of words

Categorising temporal patterns of hashtags on Twitter:

- collect hashtag normalised frequency time series for months
- use models learnt on past frequencies to forecast future frequencies
- identify and group similar temporal patterns
- emphasise periodicities in word frequencies

An instance of a forecasting problem:

- emphasis on forecasting (extrapolation)
- different effects modelled by specific kernels

See Preotiuc-Pietro and Cohn (2013), EMNLP.

Page 99

Model selection

Although parameter free, we still need to specify for a GP:

- the kernel parameters, a.k.a. hyper-parameters \theta
- the kernel definition H_i \in \mathcal{H}

Training a GP = selecting the kernel and its parameters.

Can use only the training data (and no validation set).

Use the marginal likelihood to optimise the hyperparameters, and use this value to select the kernel.


Page 101

Identifying temporal patterns in word frequencies

Word/hashtag frequencies on Twitter:

- very time dependent
- many 'live' only for hours, reflecting timely events or memes
- some hashtags are constant over time
- some experience bursts at regular time intervals
- some follow human activity cycles

Can we automatically forecast future hashtag frequencies?

Can we automatically categorise temporal patterns?

Page 102

Twitter hashtag temporal patterns

Regression task:
- Extrapolation: forecast future frequencies
- using the predictive mean

Dataset:
- two months of the Twitter Gardenhose (10%)
- first month for training, second month for testing
- 1176 hashtags occurring in both splits
- ~6.5 million tweets
- 5456 tweets/hashtag

Page 103

Kernels

The kernel:

- induces the covariance in the response between pairs of data points
- encodes the prior belief on the type of function we aim to learn
- for extrapolation, kernel choice is paramount
- different kernels are suitable for each specific category of temporal patterns: isotropic, smooth, periodic, non-stationary, etc.

Page 104

Kernels

[Figure: normalised frequency of #goodmorning over time, with Gold, Const, Linear, SE, Per(168) and PS(168) fits]

#goodmorning:

         Const   Linear   SE      Per     PS
NLML     -41     -34      -176    -180    -192
NRMSE    0.213   0.214    0.262   0.119   0.107

Lower is better.

Use Bayesian model selection techniques to choose between kernels.

Page 105

Kernels: Constant

k_C(x, x') = c

- constant relationship between outputs
- the predictive mean is the value c
- assumes the signal is modelled by Gaussian noise centred around the value c

[Figure: constant kernel samples with c = 0.6 and c = 0.4]

Page 106

Kernels: Squared exponential

k_{SE}(x, x') = s^2 \cdot \exp\left(-\frac{(x - x')^2}{2l^2}\right)

- smooth transition between neighbouring points
- best describes time series with a smooth shape, e.g. a uni-modal burst with a steady decrease
- predictive variance increases exponentially with distance

[Figure: SE kernel samples with (l=1, s=1), (l=10, s=1), (l=10, s=0.5)]

Page 107

Kernels: Linear

k_{Lin}(x, x') = \frac{|x \cdot x'| + 1}{s^2}

- non-stationary kernel: the covariance depends on the data point values, not only on their difference |t - t'|
- equivalent to Bayesian linear regression with N(0, 1) priors on the regression weights and a prior of N(0, s^2) on the bias

[Figure: linear kernel samples with s = 50 and s = 70]

Page 108

Kernels: Periodic

k_{PER}(x, x') = s^2 \cdot \exp\left(-\frac{2\sin^2(2\pi(x - x')/p)}{l^2}\right)

- s and l are characteristic length-scales
- p is the period (the distance between consecutive peaks)
- best describes periodic patterns that oscillate smoothly between high and low values

[Figure: periodic kernel samples with (l=1, p=50) and (l=0.75, p=25)]

Page 109

Kernels: Periodic Spikes

k_{PS}(x, x') = \cos\left(\sin\left(\frac{2\pi(x - x')}{p}\right)\right) \cdot \exp\left(s\cos\left(\frac{2\pi(x - x')}{p}\right) - s\right)

- p is the period
- s is a shape parameter controlling the width of the spike
- best describes time series with constant low values, followed by an abrupt periodic rise

[Figure: periodic spike kernel samples with s = 1, s = 5, s = 50]
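
To see the shapes these kernels impose, they can be evaluated as functions of the lag x - x'. A numpy sketch of the PER and PS formulas as transcribed above (parameter values follow the slide figures; this is an illustration, not tutorial code):

    import numpy as np

    def k_per(delta, s=1.0, l=1.0, p=50.0):
        # periodic kernel as a function of the lag delta = x - x'
        return s ** 2 * np.exp(-2.0 * np.sin(2 * np.pi * delta / p) ** 2 / l ** 2)

    def k_ps(delta, s=5.0, p=50.0):
        # periodic spikes kernel as a function of the lag delta = x - x'
        phase = 2 * np.pi * delta / p
        return np.cos(np.sin(phase)) * np.exp(s * np.cos(phase) - s)

    # plotting k_per / k_ps over delta = np.linspace(0, 125, 500) reproduces the slide figures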

Page 110

Results: Examples

[Example fits: #fyi (Const), #fail (Per), #snow (SE), #raw (PS)]

Page 111

Results: Categories

Const         SE                    PER         PS
#funny        #2011                 #brb        #ff
#lego         #backintheday         #coffee     #followfriday
#likeaboss    #confessionhour       #facebook   #goodnight
#money        #februarywish         #facepalm   #jobs
#nbd          #haiti                #fail       #news
#nf           #makeachange          #love       #nowplaying
#notetoself   #questionsidontlike   #rock       #tgif
#priorities   #savelibraries        #running    #twitterafterdark
#social       #snow                 #xbox       #twitteroff
#true         #snowday              #youtube    #ww

Totals: Const 49, SE 268, PER 493, PS 366

Page 112

Results: Forecasting

[Figure: distribution of forecasting performance for Lag+, GP-Lin, GP-SE, GP-PER, GP-PS and GP+]

Page 113

Application: Text classification

Task:
- assign the hashtag of a given tweet based on its text

Methods:
- Most frequent (MF)
- Naive Bayes model with empirical prior (NB-E)
- Naive Bayes with GP forecast as prior (NB-P)

            MF       NB-E     NB-P
Match@1     7.28%    16.04%   17.39%
Match@5     19.90%   29.51%   31.91%
Match@50    44.92%   59.17%   60.85%
MRR         0.144    0.237    0.252

Higher is better.


Page 115

Case study 1: MT Quality Estimation

Manual assessment of translation quality given source and translated texts.

'Quality' can be measured in many ways (see Specia et al. (2009)):

- subjective scoring (1-5) for fluency, adequacy, perceived effort to correct
- post-editing effort: HTER or time taken
- binary judgements, ranking, . . .

Human judgements are highly subjective, biased, noisy:

- typing speed
- experience levels
- expectations from MT

See Cohn and Specia (2013), ACL.

Page 116

General annotation problem

MT Quality Estimation is an instance of a general annotation problem:

- can't rely on a single individual
- but many annotators produce different results
- how can we resolve conflicts?

Previous work in MT QE:

- averaged responses from several annotators (Callison-Burch et al., 2012); or
- used the output of a single annotator (Koponen et al., 2012)

More appropriate to model as a multi-task problem.

Page 117

Multi-task learning

Multi-task learning:
- a form of transfer learning
- several related tasks sharing the same input data representation
- learn the types and extent of correlations

Compared to domain adaptation:
- tasks need not be identical (even regression vs classification)
- no explicit 'target' domain
- several sources of variation besides domain
- no assumptions of data asymmetry

Page 118

Multi-task learning for MT Quality Estimation

Modelling individual annotators:
- each brings their own biases
- but decisions are correlated with others' annotations
- could even find clusters of common solutions

Here we use multi-output GP regression:
- joint inference over several translators
- learn the degree of inter-task transfer
- learn per-translator noise
- incorporate task meta-data


Page 120

Review: GP Regression

[Plate diagram: hyperparameters \theta and noise \sigma govern the latent f; with input x it generates observed y, repeated over N data points]

f \sim GP(0, \theta)
y_i \sim N(f(x_i), \sigma^2)

Page 121

Multi-task GP Regression

[Plate diagram: as before, with a coregionalisation matrix B shared across the M tasks and per-task noise \sigma_m]

f \sim GP(0, (B, \theta))
y_{im} \sim N(f_m(x_i), \sigma_m^2)

See Alvarez et al. (2011).

Page 122

Multi-task Covariance Kernels

Represent data as (x, t, y) tuples, where t is a task identifier. Define a separable covariance kernel,

K(x, x')_{t,t'} = B_{t,t'} \, k_\theta(x, x') + \text{noise}

- effectively each input is augmented with t, indexing the task of interest
- the coregionalisation matrix B \in R^{M \times M} weights the inter-task covariance
- the data kernel k_\theta takes data points x as input, e.g., squared exponential

Page 123

Coregionalisation Kernels

Generally B can be any symmetric positive semi-definite matrix. Some interesting choices:

- B = I encodes independent learning
- B = 1 encodes pooled learning
- interpolating between the above
- full rank B = WW^\top, or low rank variants

Known as the Intrinsic model of coregionalisation (IMC).

See Alvarez et al. (2011); Bonilla et al. (2008).


Page 129: Gaussian Processes for NLP - people.eng.unimelb.edu.au · Gaussian Processes Brings together several key ideas in one framework I Bayesian I kernelised I non-parametric I non-linear

Stacking and Kronecker products

Response variables are a matrix

Y ∈ R^{N×M}

Represent data in ‘stacked’ form

X = [x_1, x_2, ..., x_N, ..., x_1, x_2, ..., x_N]ᵀ    (the N inputs repeated for each of the M tasks)

y = [y_11, y_21, ..., y_N1, ..., y_1M, y_2M, ..., y_NM]ᵀ

Kernel is a Kronecker product: K(X, X) = B ⊗ k_data(X_o, X_o), where X_o are the original (unstacked) inputs.


Kronecker Product

[ a  b ]           [ aK  bK ]
[ c  d ]  ⊗  K  =  [ cK  dK ]
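In code, the full multi-task covariance over the stacked data is simply a Kronecker product of B with the data kernel matrix; a small numpy sketch (illustrative variable names, squared exponential data kernel assumed):

    import numpy as np

    N, M = 4, 2
    Xo = np.random.randn(N, 3)                       # original (unstacked) inputs
    sq = np.sum((Xo[:, None, :] - Xo[None, :, :]) ** 2, axis=-1)
    K_data = np.exp(-0.5 * sq)                       # N x N squared exponential kernel
    B = np.array([[1.0, 0.8],
                  [0.8, 1.2]])                       # M x M coregionalisation matrix

    K_full = np.kron(B, K_data)                      # (MN) x (MN), matching the stacked y
    print(K_full.shape)                              # (8, 8)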


Choices for B: Independent learning

B = I


Choices for B: Pooled learning

B = 1


Choices for B: Interpolating independent and pooled learning

B = 1 + αI


Choices for B: Interpolating independent and pooled learning II

B = 1 + diag(α)
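These choices of B are straightforward to construct; a small numpy sketch with illustrative values for M and α:

    import numpy as np

    M, alpha = 4, 0.3
    alphas = np.array([0.1, 0.3, 0.2, 0.4])                    # per-task interpolation weights
    B_independent = np.eye(M)                                  # B = I
    B_pooled      = np.ones((M, M))                            # B = 1
    B_interp      = np.ones((M, M)) + alpha * np.eye(M)        # B = 1 + alpha*I
    B_interp_task = np.ones((M, M)) + np.diag(alphas)          # B = 1 + diag(alpha)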


Compared to Daume III (2007)

Feature augmentation approach to multi-task learning. Uses horizontal data stacking:

X = [ X(1)  X(1)  0    ]        y = [ y(1) ]
    [ X(2)  0     X(2) ]            [ y(2) ]

where (X(i), y(i)) are the training data for task i. This expands the feature space by a factor of M.

Equivalent to a multitask kernel

k(x, x')_{t,t'} = (1 + δ(t, t')) xᵀx'

K(X, X) = (1 + I) ⊗ k_linear(X, X)

⇒ A specific choice of B with a linear data kernel
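A quick numerical check of this equivalence (a sketch with hypothetical toy data; augment() mirrors the shared-plus-task-specific copies above):

    import numpy as np

    def augment(x, t, M=2):
        # Daume-style augmentation: a shared copy of x plus one task-specific copy
        d = len(x)
        z = np.zeros((M + 1) * d)
        z[:d] = x                          # shared block
        z[(t + 1) * d:(t + 2) * d] = x     # block for task t
        return z

    x1, x2 = np.random.randn(5), np.random.randn(5)
    for t1, t2 in [(0, 0), (0, 1)]:
        lhs = augment(x1, t1) @ augment(x2, t2)
        rhs = (1 + (t1 == t2)) * (x1 @ x2)     # (1 + delta(t, t')) x^T x'
        print(np.isclose(lhs, rhs))            # True in both cases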


Compared to Evgeniou et al. (2006)

In the regularisation setting, Evgeniou et al. (2006) show that the kernel

K(x, x')_{t,t'} = (1 − λ + λM δ(t, t')) xᵀx'

is equivalent to a linear model with regularisation term

J(Θ) = (1/M) Σ_t [ ||θ_t||² + ((1 − λ)/λ) ||θ_t − (1/M) Σ_{t'} θ_{t'}||² ]

This regularises each task's parameters θ_t towards the mean parameters over all tasks, (1/M) Σ_{t'} θ_{t'}.

A form of the interpolation method from before.


ICM samples: Intrinsic Coregionalization Model

K(X, X) = wwᵀ ⊗ k(X, X)

w = [1, 5]ᵀ      B = wwᵀ = [ 1   5 ]
                            [ 5  25 ]


ICM samples: Intrinsic Coregionalization Model

K(X, X) = B ⊗ k(X, X)

B = [ 1    0.5 ]
    [ 0.5  1.5 ]
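The sample plots on these slides did not survive extraction, but drawing such samples is a one-liner once the Kronecker covariance is built; a numpy sketch (assuming a squared exponential data kernel on a 1-D input grid):

    import numpy as np

    n = 100
    X = np.linspace(0, 10, n)
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)    # data kernel, lengthscale 1
    B = np.array([[1.0, 0.5],
                  [0.5, 1.5]])                           # coregionalisation matrix

    cov = np.kron(B, K) + 1e-8 * np.eye(2 * n)           # jitter for numerical stability
    f = np.random.multivariate_normal(np.zeros(2 * n), cov)
    f1, f2 = f[:n], f[n:]                                # correlated sample functions, one per task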


Experimental setup

Quality Estimation data

- 2k examples of source sentence and MT output
- measuring subjective post-editing (1-5), WMT12
- post-editing time per word, in log seconds, WPTP12
- 17 dense features extracted using the Quest toolkit (Specia et al., 2013)
- using official train/test split, or random assignment

Gaussian Process models

- squared exponential data kernel (RBF)
- hyper-parameter values trained using type II MLE
- consider simple interpolation coregionalisation kernels
- include per-task noise or global tied noise
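A sketch of this setup using the GPy toolkit (assumed here; the exact multi-output API differs slightly across GPy versions, and the data below is a random stand-in rather than the WMT12/WPTP12 sets):

    import numpy as np
    import GPy

    M, D = 3, 17                                         # tasks (annotators) and QuEst-style features
    X_list = [np.random.randn(50, D) for _ in range(M)]  # toy stand-in data
    Y_list = [np.random.randn(50, 1) for _ in range(M)]

    # ICM kernel: RBF data kernel combined with a low-rank coregionalisation matrix
    kern = GPy.util.multioutput.ICM(input_dim=D, num_outputs=M,
                                    kernel=GPy.kern.RBF(D), W_rank=1)

    # coregionalised regression with per-task Gaussian noise
    model = GPy.models.GPCoregionalizedRegression(X_list, Y_list, kernel=kern)
    model.optimize()                                     # type II MLE of all hyper-parameters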


Results: WMT12 RMSE for 1-5 ratings

[Learning curve: RMSE (y-axis, roughly 0.70-0.82) against number of training examples (x-axis, 50-500) for STL, MTL and Pooled models]


Incorporating layers of task metadata

Annotator · System · Source senTence


Results: WPTP12 RMSE post-editing time

[Bar chart: RMSE values (roughly 0.585-0.748) comparing constant Ann, Ind SVM Ann, Pooled GP, MTL GP Ann, MTL GP Sys, MTL GP senT, MTL GP A+S, and MTL GP A+S+T]


Case Study 2: Emotion Analysis

- automatically detect emotions in a text
- fine-grained
- presence of anti-correlations (sadness and joy)

Headline                                        Fear   Joy   Sadness
Storms kill, knock out power, cancel flights      82     0        60
Panda cub makes her debut                          0    59         0

See Beck et al. (2014), EMNLP


Modelling anti-correlations

- The coregionalization settings used in Quality Estimation are not suitable for this problem: they assume positive covariances between tasks.
- We need a parameterization that allows B to have negative covariances: we use the incomplete-Cholesky decomposition.

B = WWᵀ + diag(α)
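A minimal sketch of this parameterization (numpy; W and α are free hyper-parameters, and the unconstrained low-rank term lets off-diagonal entries of B go negative):

    import numpy as np

    M, rank = 6, 2                       # six emotions, rank-2 coregionalisation
    W = np.random.randn(M, rank)         # unconstrained, so B can have negative covariances
    alpha = np.random.rand(M)            # per-task diagonal boost (kept positive)

    B = W @ W.T + np.diag(alpha)         # symmetric positive definite by construction
    print(np.linalg.eigvalsh(B) > 0)     # all True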


Incomplete-Cholesky model

W (6×1) × Wᵀ + diag(α) = B

12 hyperparameters (rank 1: the six entries of W plus the six of α)


Incomplete-Cholesky model

W (6×2) × Wᵀ + diag(α) = B

18 hyperparameters (rank 2)


Incomplete-Cholesky model

W (6×3) × Wᵀ + diag(α) = B

24 hyperparameters (rank 3)


Experimental Setup

- Dataset: SemEval-2007 "Affective Text", Strapparava and Mihalcea (2007);
- 1000 news headlines, each annotated with 6 scores [0-100], one per emotion;
- 100 sentences for training, 900 for testing;
- Bag-of-words representation as features.


Learned Task Covariances


Prediction Results


Outline

GP fundamentals

NLP Applications

Sparse GPs: Characterising user impact

Model selection and Kernels: Identifying temporal patterns in word frequencies

Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics

Classification

Structured prediction

Structured kernels


Recap: Regression

Observations y_i are a noisy version of a latent process f:

y_i = f(x_i) + ε_i,   with ε_i ~ N(0, σ²)


Likelihood models

Analytic solution for a Gaussian likelihood (aka noise model)

Gaussian (process) prior × Gaussian likelihood = Gaussian posterior

But what about other likelihoods?

- Counts: y ∈ ℕ
- Classification: y ∈ {C_1, C_2, ..., C_k}
- Ordinal regression (ranking): C_1 < C_2 < ... < C_k
- ...


Classification

Binary classification, y_i ∈ {0, 1}.

Two popular choices for the likelihood

- Logistic sigmoid: p(y_i = 1 | f_i) = σ(f_i) = 1 / (1 + exp(−f_i))

- Probit function: p(y_i = 1 | f_i) = Φ(f_i) = ∫_{−∞}^{f_i} N(z | 0, 1) dz

"Squashing" the input from (−∞, ∞) into the range [0, 1]
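Both squashing functions are available off the shelf; a small sketch (Python, using scipy for the standard normal CDF):

    import numpy as np
    from scipy.stats import norm

    def logistic(f):
        # logistic sigmoid: squashes (-inf, inf) into [0, 1]
        return 1.0 / (1.0 + np.exp(-f))

    def probit(f):
        # probit: standard normal CDF
        return norm.cdf(f)

    f = np.linspace(-4, 4, 9)
    print(logistic(f))
    print(probit(f))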


Squashing function

Pass the latent function through the logistic function to obtain a probability, π(x) = p(y_i = 1 | f_i)

[Figure 3.2: panel (a) shows a sample latent function f(x) drawn from a GP; panel (b) shows the class probability π(x) = σ(f(x)) obtained by squashing the sample through the logistic function]

Figure from Rasmussen and Williams (2006)


Inference Challenges: for test case x*

Distribution over latent function

p(f* | X, y, x*) = ∫ p(f* | X, x*, f) p(f | X, y) df,   where p(f | X, y) is the posterior

Distribution over classification output

p(y* = 1 | X, y, x*) = ∫ σ(f*) p(f* | X, y, x*) df*

Problem: the likelihood is no longer conjugate with the prior, so there is no analytic solution.


Approximate inference

Several inference techniques proposed for non-conjugate likelihoods:

- Laplace approximation (Williams and Barber, 1998)
- Expectation propagation (Minka, 2001)
- Variational inference (Gibbs and MacKay, 2000)
- MCMC (Neal, 1999)

And more, including sparse approaches for large-scale application.


Laplace approximation

Approximate the non-Gaussian posterior by a Gaussian, centred at the mode

Figure from Rogers and Girolami (2012)


Laplace approximation

Log posterior

Ψ(f) = log p(y|f) + log p(f|X) + const

Find the posterior mode, f̂, i.e., MAP estimation, O(n³).

Then take a second-order Taylor series expansion about the mode, and fit with a Gaussian

- with mean μ = f̂
- and covariance Σ = (K⁻¹ + W)⁻¹, where W = −∇∇ log p(y|f̂)

Allows for computation of the posterior and marginal likelihood, but predictions may still be intractable.
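For concreteness, a minimal numpy sketch of the Newton iteration for the mode, assuming a logistic likelihood with labels y ∈ {0, 1} (this follows the generic recipe rather than the numerically safer B = I + W^{1/2} K W^{1/2} formulation used in practice):

    import numpy as np

    def laplace_mode(K, y, n_iters=20):
        # Newton iteration for the MAP estimate f_hat of the latent values
        n = K.shape[0]
        f = np.zeros(n)
        K_inv = np.linalg.inv(K)                 # O(n^3); computed once for this sketch
        for _ in range(n_iters):
            pi = 1.0 / (1.0 + np.exp(-f))        # sigmoid(f)
            W = pi * (1.0 - pi)                  # -Hessian of log likelihood (diagonal)
            grad = y - pi                        # gradient of log likelihood
            # Newton update: f <- (K^-1 + W)^-1 (W f + grad)
            f = np.linalg.solve(K_inv + np.diag(W), W * f + grad)
        return f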


Expectation propagation

Take intractable posterior:

p(f|X, y) = (1/Z) p(f|X) ∏_{i=1..n} p(y_i | f_i)

Z = ∫ p(f|X) ∏_{i=1..n} p(y_i | f_i) df

Approximation with fully factorised distribution

q(f|X, y) = (1/Z_EP) p(f|X) ∏_{i=1..n} t_i(f_i)


Expectation propagation

Approximate posterior defined as

q(f|y) = (1/Z_EP) p(f) ∏_{i=1..n} t_i(f_i)

where each component is assumed to be Gaussian

- p(y_i | f_i) ≈ t_i(f_i) = Z_i N(f_i | μ_i, σ_i²)
- p(f|X) ~ N(f | 0, K_nn)

Results in Gaussian formulation for q(f|y)

- allows for tractable multiplication and division of Gaussians
- and marginalisation, expectations, etc.


Expectation propagation

EP algorithm aims to fit t_i(f_i) to the posterior, starting with a guess for q, then iteratively refining as follows

- minimise the KL divergence between the true posterior for f_i and the approximation, t_i

  min_{t_i} KL( p(y_i | f_i) q_{-i}(f_i)  ||  t_i(f_i) q_{-i}(f_i) )

  where q_{-i}(f_i) is the cavity distribution formed by marginalising q(f) over f_j, j ≠ i, then dividing by t_i(f_i).

- key idea: only need an accurate approximation for globally feasible f_i
- match moments to update t_i, then update q(f)


Expectation propagation: site approximation example


Expectation propagation

No proof of convergence

- but empirically works well
- often more accurate than the Laplace approximation

Formulated for many different likelihoods

- complexity O(n³), dominated by matrix inversion
- sparse EP approximations can reduce this to O(nm²)

See Minka (2001) and Rasmussen and Williams (2006) for further details.


Multi-class classification

Consider multi-class classification, y ∈ {C_1, C_2, ..., C_k}.

Draw vector of k latent function values for each input

f = ( f_1^1, ..., f_n^1, f_1^2, ..., f_n^2, ..., f_1^k, ..., f_n^k )

Formulate the classification probability using the soft-max

p(y_i = c | f_i) = exp(f_i^c) / Σ_{c'} exp(f_i^{c'})


Multi-class classification

Assume the k latent processes are uncorrelated, leading to prior covariance f ~ N(0, K) where

K = [ K_1  0    ...  0   ]
    [ 0    K_2  ...  0   ]
    [ ...  ...  ...  ... ]
    [ 0    0    ...  K_k ]

is block diagonal, of size kn × kn, with each K_j of size n × n.

Various approximation methods for inference, e.g., Laplace (Williams and Barber, 1998), EP (Kim and Ghahramani, 2006), MCMC (Neal, 1999).


GPs for Structured Prediction

- GPSC (Altun et al., 2004):
  - defines a likelihood over label sequences, p(y|x), with a latent variable over full sequences y
  - HMM-inspired kernel, combining features from each observed symbol x_i and label pairs
  - MAP inference for hidden function values, f, and a sparsification trick for tractable inference
- GPstruct (Bratieres et al., 2013):
  - base model is a CRF:

    p(y|x, f) = exp( Σ_c f(c, x_c, y_c) ) / Σ_{y' ∈ Y} exp( Σ_c f(c, x_c, y'_c) )

  - assumes that each potential f(c, x_c, y_c) is drawn from a GP
  - Bayesian inference using MCMC (Murray et al., 2010)


String Kernels

k(x, x') = Σ_{s ∈ Σ*} w_s φ_s(x) φ_s(x')

- φ_s(x): counts of substring s inside x;
- 0 ≤ w_s ≤ 1: weight of substring s;
- s can also be a subsequence (containing gaps);
- s = character sequences → n-gram kernels (Lodhi et al., 2002), useful for stems;
  k(bar, bat) = 3   (b, a, ba)
- s = word sequences → Word Sequence kernels (Cancedda et al., 2003);
  k(gas only injection, gas assisted plastic injection) = 3
- Soft matching:
  k(battle, battles) ≠ 0
  k(battle, combat) ≠ 0
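A toy sketch of an unweighted, contiguous n-gram string kernel in this spirit (no gaps or decay weights, unlike the full subsequence kernel of Lodhi et al.):

    from collections import Counter

    def ngram_counts(s, nmax=2):
        # counts of all contiguous substrings of length 1..nmax
        counts = Counter()
        for n in range(1, nmax + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
        return counts

    def ngram_kernel(x, x2, nmax=2):
        # k(x, x') = sum_s phi_s(x) * phi_s(x') over shared substrings
        cx, cx2 = ngram_counts(x, nmax), ngram_counts(x2, nmax)
        return sum(c * cx2[s] for s, c in cx.items() if s in cx2)

    print(ngram_kernel("bar", "bat"))   # shared substrings b, a, ba -> 3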


Tree Kernels

- Subset Tree Kernels (Collins and Duffy, 2001)

[Example constituency trees: (S (NP Mary) (VP loves John)) and (S (NP Mary) (VP had dinner))]

- Partial Tree Kernels (Moschitti, 2006): allow "broken" rules, useful for dependency trees;
- Soft matching can also be applied.


Conclusions

Gaussian Processes are a powerful framework

- elegant kernel machines
- probabilistic, quantify uncertainty
- closed-form Bayesian inference & model selection

Mature enough for widespread application

- frontier challenges in NLP
- scaling improvements make these applications practical


References I

Altun, Y., Hofmann, T., and Smola, A. J. (2004). Gaussian Process Classification for Segmenting and Annotating Sequences. In Proceedings of ICML, page 8, New York, New York, USA. ACM Press.

Alvarez, M. A., Rosasco, L., and Lawrence, N. D. (2011). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266.

Beck, D., Cohn, T., and Specia, L. (2014). Joint Emotion Analysis via Multi-task Gaussian Processes. In Proceedings of EMNLP.

Bonilla, E., Chai, K. M., and Williams, C. (2008). Multi-task Gaussian process prediction. NIPS.

Bratieres, S., Quadrianto, N., and Ghahramani, Z. (2013). Bayesian Structured Prediction using Gaussian Processes. arXiv:1307.3846, pages 1–17.

Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012). Findings of the 2012 workshop on statistical machine translation.

Cancedda, N., Gaussier, E., Goutte, C., and Renders, J.-M. (2003). Word-Sequence Kernels. The Journal of Machine Learning Research, 3:1059–1082.

Cohn, T. and Specia, L. (2013). Modelling annotator bias with multi-task Gaussian processes: An application to machine translation quality estimation. ACL.

Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural Language. In Advances in Neural Information Processing Systems.

Daume III, H. (2007). Frustratingly easy domain adaptation. ACL.


References II

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2006). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(1):615.

Gibbs, M. N. and MacKay, D. J. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464.

Kim, H.-C. and Ghahramani, Z. (2006). Bayesian Gaussian process classification with the EM-EP algorithm. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):1948–1959.

Koponen, M., Aziz, W., Ramos, L., and Specia, L. (2012). Post-editing time as a measure of cognitive effort. AMTA 2012 Workshop on Post-editing Technology and Practice.

Lampos, V., Aletras, N., Preotiuc-Pietro, D., and Cohn, T. (2014). Predicting and Characterising User Impact on Twitter. EACL.

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text Classification using String Kernels. The Journal of Machine Learning Research, 2:419–444.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc.

Moschitti, A. (2006). Making Tree Kernels practical for Natural Language Learning. In EACL, pages 113–120.

Murray, I., Adams, R. P., and Mackay, D. (2010). Elliptical slice sampling. In International Conference on Artificial Intelligence and Statistics, pages 541–548.


References III

Neal, R. (1999). Regression and classification using Gaussian process priors. Bayesian Statistics, 6.

Preotiuc-Pietro, D. and Cohn, T. (2013). A temporal model of text periodicities using Gaussian Processes. EMNLP.

Quinonero Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, MA.

Rogers, S. and Girolami, M. (2012). A First Course in Machine Learning. Chapman & Hall/CRC.

Specia, L., Shah, K., De Souza, J. G., and Cohn, T. (2013). QuEst - a translation quality estimation framework. Citeseer.

Specia, L., Turchi, M., Cancedda, N., Dymetman, M., and Cristianini, N. (2009). Estimating the Sentence-Level Quality of Machine Translation Systems. Pages 28–35, Barcelona, Spain.

Strapparava, C. and Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text. In Proceedings of SEMEVAL.

Strapparava, C. and Mihalcea, R. (2008). Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing.

Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(12):1342–1351.