Gaussian Processes for NLP
Trevor Cohn & Daniel Beck
CIS, University of Melbourne / DCS, University of Sheffield
ALTA Workshop, 26 November 2014
with slides from Daniel Preotiuc-Pietro, Neil Lawrence, Richard Turner
Gaussian Processes
Brings together several key ideas in one framework
- Bayesian
- kernelised
- non-parametric
- non-linear
- modelling uncertainty

Elegant and powerful framework, with growing popularity in machine learning and application domains.
Gaussian Processes
State of the art for regression
- exact posterior inference
- supports very complex non-linear functions
- elegant model selection
Gaussian Processes
Now mature enough for use in NLP
- support for classification, ranking, etc.
- fancy kernels, e.g., for text
- sparse approximations for large scale inference
- probabilistic formulation allows incorporation into larger graphical models
- models the prediction uncertainty, so that it is propagated through pipelines of probabilistic components
Tutorial Scope
Covers
1. GP fundamentals (1 hour)
   - focus on regression
   - weight space vs. function space view
   - squared exponential kernel
2. NLP applications (1 hour 30)
   - sparse GPs
   - periodic kernels and model selection
   - multi-output GPs
3. Further topics (30 mins)
   - classification and other likelihoods
   - structured kernels
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Gaussian Processes
Regression and classification are fundamental techniques in NLP.

Linear models:

- simple and fast to run
- provide straightforward interpretability
- but very limited in the type of relations they can model

Non-linear models (e.g. SVM):

- better results because they can model other kinds of relationships
- but high cost of hyperparameter optimisation
- lack of interoperability with downstream processing
Regression
. . . versus classification
The Gaussian (normal) distribution
The Gaussian is probably the most widely used probability density:

    p(y | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right) \triangleq N(\mu, \sigma^2)

Multivariate formulation:

    p(y | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (y - \mu)^\top \Sigma^{-1} (y - \mu) \right)

where \mu is now a mean vector, and \Sigma a covariance matrix.
The Gaussian distribution

[Figure: Gaussian density p(h | \mu, \sigma^2) over height h (in metres), with \mu = 1.7 and variance \sigma^2 = 0.0225; the mean is shown as a red line. It could represent the heights of a population of students.]
Bivariate Gaussian distribution

[Figure: samples from a joint distribution over weight w (kg) and height h (m), shown alongside the marginal distributions p(h) and p(w).]
Gaussian distribution properties
Several very convenient properties:
- Scaling a Gaussian results in a Gaussian
- Sum of Gaussian variables is also Gaussian (more generally, the central limit theorem)
- Product of two Gaussians is a (scaled) Gaussian

For multivariate Gaussian distributed variables:

- Marginal distribution is also Gaussian
- Conditional distribution is also Gaussian (see the numpy sketch below)
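To make the last two properties concrete, here is a minimal numpy sketch (not part of the original slides) computing the marginal and conditional of a bivariate Gaussian, using the covariance matrix that reappears in the upcoming illustrations:

    import numpy as np

    # Joint Gaussian over (y1, y2) with covariance [[1, .7], [.7, 1]]
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.7],
                      [0.7, 1.0]])

    # Marginal of y1: simply drop the other variable's rows/columns -> N(0, 1)
    mu1, var1 = mu[0], Sigma[0, 0]

    # Conditional of y2 given an observed y1: also Gaussian
    y1 = 1.0
    mu_cond = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (y1 - mu[0])   # 0.7
    var_cond = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]      # 0.51
    print(mu_cond, var_cond)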
Gaussian Processes
The Gaussian distribution:

- is a distribution over scalars or vectors

The Gaussian process:

- is a distribution over functions
Being Bayesian
Revisiting our regression example:
- what class of functions should we allow?
- given the class, what parameter values are sensible?
- e.g., linear: should we let m, c → ±∞?
- e.g., non-linear: should it be jagged or smooth?
- e.g., periodic . . .

Usually we have some intuitions.
Being Bayesian
Define a prior over the unknowns, p(θ):

- encodes our initial intuitions about reasonable values

Observing data gives rise to a likelihood, p(y|θ):

- how well the model fits the training sample

We seek to reason over the posterior:

- incorporating the above to form our updated beliefs about the unknowns
- formulated using Bayes' rule

    p(\theta | y) = \frac{p(y | \theta) \, p(\theta)}{p(y)}
Bayesian update illustrated

Example of linear regression using a prior over the intercept parameter, c:

    p(c) = N(c | 0, \alpha_1)
    p(y | m, c, x, \sigma^2) = N(y | mx + c, \sigma^2)
    p(c | y, m, x, \sigma^2) = N\left( c \,\Big|\, \frac{y - mx}{1 + \sigma^2/\alpha_1}, \, (\sigma^{-2} + \alpha_1^{-1})^{-1} \right)

[Figure: a Gaussian prior combines with a Gaussian likelihood to give a Gaussian posterior.]
Gaussian Processes prior
- the prior represents our prior beliefs over the functions we expect to have generated our (regression) data
- here, standard choices of 0 mean and 1 variance, and a smoothly varying function
Gaussian Process posterior
- after observing data points, the posterior incorporates the data likelihood
- the functions must pass close to the observations
Relationship to Linear regression
Linear regression:

    y = w^\top x + \epsilon

where w are the learned model parameters.

Gaussian process regression:

    y = f(x) + \epsilon

where the function f is drawn from a Gaussian process prior.

- the w parameters are gone (in fact, integrated out)
- f is not limited to a parametric form
Graphical model view
    f \sim GP(m, k)
    y \sim N(f(x), \sigma^2)

- f : R^D → R is a latent function
- y is a noisy realisation of f(x)
- k is the covariance function or kernel
- \sigma^2 is a learned parameter

[Graphical model: x and the latent f (with kernel k) generate y, with noise \sigma, repeated over the N data points.]
Formal definition
Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is fully specified by mean m(x) and covariance k(x, x') functions:

- m(x) = E[f(x)]
- k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]

Without loss of generality we can assume m(x) = 0.
Compared to a Gaussian distribution
Gaussian distribution:

- specified by a mean and covariance matrix: x ∼ N(\mu, \Sigma)
- the index is the position of each random variable within the vector x

Gaussian process:

- specified by a mean and covariance function: f ∼ GP(m, k)
- the index is the argument x of f

The Gaussian process is an infinite-dimensional extension of the multivariate Gaussian distribution.
Gaussian distribution

    p(y | \Sigma) \propto \exp\left( -\tfrac{1}{2} y^\top \Sigma^{-1} y \right)

[Figures: samples from a bivariate Gaussian over (y1, y2), shown for the covariance matrices
    \Sigma = [1 .7; .7 1],   \Sigma = [1 .4; .4 1],   \Sigma = [1 .0; .0 1]
The stronger the off-diagonal covariance, the more the samples concentrate around the y1 = y2 diagonal.]

Gaussian distribution: conditioning

    p(y_2 | y_1, \Sigma) \propto \exp\left( -\tfrac{1}{2} (y_2 - \mu^*) \, {\Sigma^*}^{-1} (y_2 - \mu^*) \right)

[Figures: observing y1 slices the joint density, leaving a Gaussian conditional over y2 with mean \mu^* and variance \Sigma^*.]
New visualisation

[Figures: instead of a scatter plot over (y1, y2), plot each sampled value y against its variable index (1, 2). Samples from the bivariate Gaussian with
    \Sigma = [1 .9; .9 1]
give pairs of points at indices 1 and 2; the strong covariance makes the two values nearly equal.]
New visualisation

[Figures: the same index-vs-value plot for samples from a 5-dimensional Gaussian with covariance

    \Sigma = [ 1  .9  .8  .6  .4
              .9   1  .9  .8  .6
              .8  .9   1  .9  .8
              .6  .8  .9   1  .9
              .4  .6  .8  .9   1 ]

Nearby indices are highly correlated, so each sample traces out a smooth sequence of values over indices 1..5.]
New visualisation

[Figures: extending to 20 variables, with a 20x20 covariance matrix (shown as an image) whose entries decay smoothly away from the diagonal; samples now look like smooth random functions evaluated at indices 1..20.]
Regression using Gaussians

[Figure: samples from the 20-dimensional Gaussian plotted against the variable index, with the covariance matrix shown as an image.]
Regression using Gaussians

The covariance is built from the input locations using a kernel plus observation noise:

    \Sigma(x_1, x_2) = K(x_1, x_2) + I \sigma_y^2
    K(x_1, x_2) = \sigma^2 \exp\left( -\tfrac{1}{2 l^2} (x_1 - x_2)^2 \right)

[Figure: samples plotted against the variable index, with the resulting covariance matrix shown as an image.]
Regression: probabilistic inference in function space

Parametric model:

    y(x) = f(x; \theta) + \sigma_y \epsilon,    \epsilon \sim N(0, 1)

Non-parametric (\infty-parametric) model:

    p(y | \theta) = N(0, \Sigma)
    \Sigma(x_1, x_2) = K(x_1, x_2) + I \sigma_y^2      (the I \sigma_y^2 term is the noise)
    K(x_1, x_2) = \sigma^2 \exp\left( -\tfrac{1}{2 l^2} (x_1 - x_2)^2 \right)

where \sigma is the vertical scale and l the horizontal scale of the kernel.

[Figure: the GP yields a function estimate with uncertainty; the covariance matrix is shown as an image.]
Squared Exponential (SE) kernel

The usual choice of covariance is the Squared Exponential (SE) kernel (a.k.a. RBF, exponentiated quadratic, Gaussian):

    k(x, x') = \sigma_d^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right)

The kernel's parameters {l, \sigma_d} are called the GP's hyperparameters.

When l is very large, points remain correlated even at large distances in that input dimension, i.e., that feature is less relevant.

How to select the hyperparameters?

[Figures: varying the lengthscale from very low (0.04) to high (7.4), with the noise hyperparameter fixed.]
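As a concrete illustration (not part of the original slides), a minimal numpy sketch of the SE kernel and of drawing sample functions from the GP prior:

    import numpy as np

    def sq_exp_kernel(x1, x2, lengthscale=1.0, sigma_d=1.0):
        # k(x, x') = sigma_d^2 * exp( -(x - x')^2 / (2 l^2) )
        sqdist = (x1[:, None] - x2[None, :]) ** 2
        return sigma_d ** 2 * np.exp(-0.5 * sqdist / lengthscale ** 2)

    # Draw three functions from the GP prior f ~ N(0, K) over a dense grid
    x = np.linspace(0, 10, 100)
    K = sq_exp_kernel(x, x, lengthscale=1.0)
    jitter = 1e-8 * np.eye(len(x))                      # numerical stability
    f_samples = np.random.multivariate_normal(np.zeros(len(x)), K + jitter, size=3)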
Hyperparameters
Training a GP means:

- finding the right kernel
- finding values of the hyperparameters

The covariance function controls properties of the GP:

- smoothness
- stationarity
- periodicity
- etc.
Other kernels
Inference
Although the GP is an infinite-dimensional object, the marginalisation property means we only need to work with finite dimensions.

For prediction we need the conditional distribution over test points y* given the observed points y:

    p(y_* | y, x_*, X) = \frac{p(y, y_* | x_*, X)}{p(y | X)}

We omit the conditioning on x_*, X for brevity henceforth.
Inference
Consider the GP over a single extra test point:

    p(y) = N(0, K + \sigma_n^2 I)

    p(y, y_*) = N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^\top & K_{**} + \sigma_n^2 \end{bmatrix} \right)

where the K are covariance matrices, i.e.,

- K ∈ R^{N×N} is the covariance between training instances
- K_* ∈ R^N is the covariance between test and training instances
- K_{**} ∈ R is the test variance
Inference
Recall the property of the multivariate Gaussian: the conditional distribution is Gaussian. This leads to the following predictive posterior:

    p(y_* | y) = N(\mu_*, \Sigma_*)
    \mu_* = K_*^\top (K + \sigma_n^2 I)^{-1} y
    \Sigma_* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*

- Note that the predictive mean is linear in the data; and
- the predictive covariance is the prior covariance reduced by the information the observations give about the function (see the numpy sketch below).
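A minimal numpy sketch of these predictive equations (not from the slides), using a Cholesky factorisation for the matrix inverse and the sq_exp_kernel defined in the earlier sketch:

    import numpy as np

    def gp_predict(X_train, y_train, X_test, kernel, noise_var):
        # Posterior mean and covariance of a GP regression model at the test inputs
        K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
        K_star = kernel(X_train, X_test)            # N x N* cross-covariances
        K_star_star = kernel(X_test, X_test)        # N* x N* test covariances
        L = np.linalg.cholesky(K)                   # K + sigma_n^2 I = L L^T
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
        mean = K_star.T @ alpha                     # K_*^T (K + sigma_n^2 I)^-1 y
        v = np.linalg.solve(L, K_star)
        cov = K_star_star - v.T @ v                 # K_** - K_*^T (K + sigma_n^2 I)^-1 K_*
        return mean, cov

    # Example usage with the sq_exp_kernel sketch from above:
    # mean, cov = gp_predict(x_train, y_train, x_test, sq_exp_kernel, noise_var=0.1)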
Model selection
The Bayesian formulation seemingly removes the need for training, as the parameters are integrated out.

However, we still need to select hyperparameter values, e.g.,

- \sigma_n, the white noise
- l, the length (horizontal) scale
- \sigma_d, the vertical scale
- . . .

How to select appropriate values? Consider the marginal likelihood, p(y|X).
Model selection
The marginal likelihood is defined as

    p(y | X) = \int p(y, f | X) \, df = \int p(y | f) \, p(f | X) \, df

Observe that

- as the space of possible functions grows, p(f|X) diminishes
- poorly fitting functions will score poorly on p(y|f)
- both are affected by the type of kernel and the hyperparameter values

The marginal likelihood balances these opposing forces.
Model selection for training
The marginal likelihood permits an analytical solution due to conjugacy:

    p(y | X) = \int \underbrace{p(y | f)}_{\text{Gaussian}} \, \underbrace{p(f | X)}_{\text{Gaussian}} \, df = N(0, K + \sigma_n^2 I)

arising from the property that the product of two Gaussians is an unnormalised Gaussian.

This permits an analytical solution:

    \log p(y | X) = -\frac{1}{2} y^\top (K + \sigma_n^2 I)^{-1} y - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log 2\pi

We can optimise the above with respect to the model hyperparameters using gradient ascent, a.k.a. Type II maximum likelihood (see the sketch below).
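A minimal numpy sketch (not from the slides) of the log marginal likelihood above; in practice its value and gradients would be passed to an optimiser, or a toolkit such as GPy would handle this:

    import numpy as np

    def log_marginal_likelihood(K, y, noise_var):
        # log p(y|X) = -1/2 y^T (K + s^2 I)^-1 y - 1/2 log|K + s^2 I| - n/2 log(2 pi)
        n = len(y)
        Ky = K + noise_var * np.eye(n)
        L = np.linalg.cholesky(Ky)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        log_det = 2.0 * np.sum(np.log(np.diag(L)))       # log|K + s^2 I|
        return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2 * np.pi)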
Learning Covariance Parameters

Can we determine length scales and noise levels from the data?

    E(\theta) = \tfrac{1}{2} \log |K| + \tfrac{1}{2} y^\top K^{-1} y

[Figures: a sequence of GP fits y(x) for different length scales, alongside the objective E(\theta) plotted over length scales from 10^{-1} to 10^{1}; the best length scale is the one minimising E(\theta).]
Beyond GP regression
- Classification
- Ordinal regression
- Count data
- Model selection: kernel and hyperparameter optimisation
- Kernel design
- Extrapolation cf. interpolation
- Multi-output GPs for multi-task learning
- Sparse GPs for scaling to large feature and input spaces
- Latent variable models: non-linear probabilistic variants of PCA/CCA et al.
- And many others ...
GP limitations
The non-parametric formulation complicates scaling:

- O(N^3) time complexity and O(N^2) space complexity
- but ongoing work to bring this down, e.g., O(NM^2) time complexity or even constant in N

Brilliant for regression, but more difficult for other likelihoods:

- a suite of approximation approaches to deal with inference for non-conjugate configurations

Not as mature as many other frameworks:

- but 'coming of age', many big issues have been addressed
- even application to 'big data' scenarios
Resources
Free book: http://www.gaussianprocess.org/gpml/chapters/

Tutorials

- GPs for Natural Language Processing tutorial (ACL 2014): http://goo.gl/18heUk
- GP Schools in Sheffield and roadshows in Kampala, Pereira, and (future) Nyeri, Melbourne: http://ml.dcs.shef.ac.uk/gpss/
- GP Regression demo: http://www.tmpl.fi/gp/
- Annotated bibliography and other materials: http://www.gaussianprocess.org

Toolkits

- GPML (Matlab): http://www.gaussianprocess.org/gpml/code
- GPy (Python): https://github.com/SheffieldML/GPy
- GPstuff (R, Matlab, Octave): http://becs.aalto.fi/en/research/bayes/gpstuff/
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Case study: User impact on Twitter
Predicting and characterising user impact on Twitter:

- define a user-level impact score
- use the user's text and profile information as features to predict the score
- analyse the features which better predict the score
- provide users with 'guidelines' for improving their score

An instance of a text prediction problem:

- emphasis on feature analysis and interpretability (specific to social science applications)
- non-linear variation

See Lampos et al. (2014), EACL.
Sparse GPs
Exact inference in a GP:

- Memory: O(n^2)
- Time: O(n^3)

Sparse GP approximation:

- Memory: O(n·m)
- Time: O(n·m^2)

where m is selected at runtime. Typically required when n > 1000.
Sparse GPs
Many options for sparse approximations:

- Based on inducing variables:
  - Subset of Data (SoD)
  - Subset of Regressors (SoR)
  - Deterministic Training Conditional (DTC)
  - Partially Independent Training Conditional (PITC)
  - Fully Independent Training Conditional (FITC)
- Fast Matrix Vector Multiplication (MVM)
- Variational methods

See Quinonero Candela and Rasmussen (2005) for an overview.
Sparse GPs
Sparse approximations where the f are treated as latent variables:

- a subset u, with |u| = m, is treated exactly
- the others are given a computationally cheaper treatment
- avoids large matrix inversions

What does the penalty term do?

[Figure from Hensman, GPSS '13]

Inducing points are fixed, or learned using greedy search / optimisation.
Predicting and characterising user impact
500 million tweets a day on Twitter:

- important and some not-so-important information
- breaking news from media
- friends
- celebrity self-promotion
- marketing
- spam

Can we automatically predict the impact of a user?

Can we automatically identify factors which influence user impact?
Defining user impact
Define impact as a function of network connections:

- no. of followers
- no. of followees
- no. of times the account is listed by others

    Impact = \ln\left( \frac{listings \cdot followers^2}{followees} \right)

(see the worked example below)

Dataset:

- 38,000 UK users
- all tweets from one year
- 48 million deduplicated messages
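As a trivial worked example (not from the slides), the impact score for a hypothetical user with 500 listings, 20,000 followers and 1,000 followees:

    import math

    def impact(listings, followers, followees):
        # Impact = ln( listings * followers^2 / followees )
        return math.log(listings * followers ** 2 / followees)

    print(impact(500, 20_000, 1_000))   # ~ 19.1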
User controlled features
Only features under the user's control (e.g. not the no. of retweets):

- User features (18): extracted from the account profile, plus aggregated text features
- Text features (100): the user's topic distribution, with topics computed using spectral clustering on the word co-occurrence (NPMI) matrix
Models
Regression task:

- Gaussian Process regression model
- n = 38000 · 9/10, using sparse GPs with FITC
- Squared Exponential kernel (k-dimensional):

    k(x_p, x_q) = \sigma_f^2 \exp\left( -\tfrac{1}{2} (x_p - x_q)^\top D (x_p - x_q) \right) + \sigma_n^2 \delta_{pq}

  where D ∈ R^{k×k} is a symmetric matrix.

- if D = D_ARD = diag(l)^{-2}:

    k(x_p, x_q) = \sigma_f^2 \exp\left( -\tfrac{1}{2} \sum_{d=1}^{k} \frac{(x_{pd} - x_{qd})^2}{l_d^2} \right) + \sigma_n^2 \delta_{pq}
Automatic Relevance Determination (ARD)
- with D = D_ARD the kernel is the SE kernel with automatic relevance determination (ARD), the vector l denoting the characteristic length-scales of each feature
- l_d measures the distance along x_d over which values become uncorrelated
- 1/l_d^2 is proportional to how relevant a feature is: a large length-scale means the covariance becomes independent of that feature's value
- sorting by length-scale indicates which features impact the prediction the most
- tuning these parameters is done via Bayesian model selection (see the GPy sketch below)
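A rough sketch (not the paper's code) of an ARD model along these lines using the GPy toolkit listed later under Toolkits; X and y are hypothetical placeholders, and GPy's default sparse approximation stands in for the FITC setup used in the paper:

    import numpy as np
    import GPy

    # Placeholder data: n instances with k features, and an impact score per user
    X = np.random.randn(500, 10)
    y = np.random.randn(500, 1)

    kern = GPy.kern.RBF(input_dim=X.shape[1], ARD=True)     # one lengthscale per feature
    model = GPy.models.SparseGPRegression(X, y, kernel=kern, num_inducing=50)
    model.optimize()

    # Small lengthscale -> relevant feature; large lengthscale -> feature largely ignored
    relevance = 1.0 / np.asarray(kern.lengthscale) ** 2
    print(sorted(zip(relevance, range(X.shape[1])), reverse=True))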
Prediction results
Experiments:

- 10-fold cross validation
- using the predictive mean
- baseline model is ridge regression (LIN)
- Profile features
- Text features

[Figure: bar chart of Pearson correlation (roughly 0.6 to 0.8) for the LIN and GP models.]
Conclusions
- GPs substantially better than ridge regression
- non-linear GPs with only profile features perform better than linear methods with all features
- GPs outperform SVR
- adding topic features improves all models
Selected features
Feature                                                           Importance
Using default profile image                                       0.73
Total number of tweets (entire history)                           1.32
Number of unique @-mentions in tweets                             2.31
Number of tweets (in dataset)                                     3.47
Links ratio in tweets                                             3.57
T1 (Weather): mph, humidity, barometer, gust, winds               3.73
T2 (Healthcare, Housing): nursing, nurse, rn, registered,
   bedroom, clinical, #news, estate, #hospital                    5.44
T3 (Politics): senate, republican, gop, police, arrested,
   voters, robbery, democrats, presidential, elections            6.07
Proportion of days with non-zero tweets                           6.96
Proportion of tweets with @-replies                               7.10
Feature analysis
[Figures: impact histograms for users with high (H) versus low (L) values of two features, "No. of unique @-mentions" and "No. of tweets"; the red line is the mean impact score.]
Feature analysis
- τ3: damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer
- τ4: senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections

[Figures: impact histograms for users with high (H) values of each topic; the red line is the mean impact score.]
Conclusions
User impact is highly predictable:

- user behaviour is very informative
- "tips" for improving your impact

The GP framework is suitable:

- non-linear modelling
- ARD feature selection
- sparse GPs allow large scale experiments
- empirical improvements over linear models & SVR
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Case study: Temporal patterns of words
Categorising temporal patterns of hashtags in Twitter:

- collect hashtag normalised frequency time series for months
- use models learnt on past frequencies to forecast future frequencies
- identify and group similar temporal patterns
- emphasise periodicities in word frequencies

An instance of a forecasting problem:

- emphasis on forecasting (extrapolation)
- different effects modelled by specific kernels

See Preotiuc-Pietro and Cohn (2013), EMNLP.
Model selection
Although parameter free, we still need to specify for a GP:

- the kernel parameters, a.k.a. hyperparameters θ
- the kernel definition H_i ∈ H

Training a GP = selecting the kernel and its parameters.

This can use only training data (no validation set): use the marginal likelihood to optimise the hyperparameters, and use this value to select the kernel.
Identifying temporal patterns in word frequencies
Word/hashtag frequencies in Twitter:

- are very time dependent
- many 'live' only for hours, reflecting timely events or memes
- some hashtags are constant over time
- some experience bursts at regular time intervals
- some follow human activity cycles

Can we automatically forecast future hashtag frequencies?

Can we automatically categorise temporal patterns?
Twitter hashtag temporal patterns
Regression task:

- Extrapolation: forecast future frequencies
- using the predictive mean

Dataset:

- two months of the Twitter Gardenhose (10% sample)
- first month for training, second month for testing
- 1176 hashtags occurring in both splits
- ~6.5 million tweets
- 5456 tweets/hashtag
Kernels
The kernel:

- induces the covariance in the response between pairs of data points
- encodes our prior belief about the type of function we aim to learn
- for extrapolation, kernel choice is paramount
- different kernels are suitable for specific categories of temporal patterns: isotropic, smooth, periodic, non-stationary, etc.
Kernels
[Figure: gold frequency series for #goodmorning (3-17 January), with fits from the Const, Linear, SE, Per(168) and PS(168) kernels.]

#goodmorning    Const    Linear    SE       Per      PS
NLML            -41      -34       -176     -180     -192
NRMSE           0.213    0.214     0.262    0.119    0.107

Lower is better.

Use Bayesian model selection techniques to choose between kernels.
Kernels: Constant
    k_C(x, x') = c

- constant relationship between outputs
- the predictive mean is the value c
- assumes the signal is modelled by Gaussian noise centred around the value c

[Figure: constant predictions at c = 0.6 and c = 0.4.]
Kernels: Squared exponential
    k_{SE}(x, x') = s^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right)

- smooth transitions between neighbouring points
- best describes time series with a smooth shape, e.g. a uni-modal burst with a steady decrease
- the predictive variance increases exponentially with distance

[Figure: sample functions for (l=1, s=1), (l=10, s=1), (l=10, s=0.5).]
Kernels: Linear
    k_{Lin}(x, x') = \frac{|x \cdot x'| + 1}{s^2}

- a non-stationary kernel: the covariance depends on the data point values, not only on their difference |t - t'|
- equivalent to Bayesian linear regression with N(0, 1) priors on the regression weights and a N(0, s^2) prior on the bias

[Figure: sample linear fits for s = 50 and s = 70.]
Kernels: Periodic
    k_{PER}(x, x') = s^2 \exp\left( -\frac{2 \sin^2( 2\pi (x - x') / p )}{l^2} \right)

- s and l are characteristic length-scales
- p is the period (the distance between consecutive peaks)
- best describes periodic patterns that oscillate smoothly between high and low values

[Figure: sample periodic functions for (l=1, p=50) and (l=0.75, p=25).]
Kernels: Periodic Spikes
    k_{PS}(x, x') = \cos\left( \sin\left( \frac{2\pi (x - x')}{p} \right) \right) \cdot \exp\left( s \cos\left( \frac{2\pi (x - x')}{p} \right) - s \right)

- p is the period
- s is a shape parameter controlling the width of the spike
- best describes time series with constant low values, followed by abrupt periodic rises

[Figure: sample functions for s = 1, s = 5 and s = 50.]
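A rough sketch of the kernel selection procedure described above, using GPy's built-in kernels (the Periodic Spikes kernel is not built in and would need a custom kernel class); the time indices and frequencies are hypothetical placeholders:

    import numpy as np
    import GPy

    # Hypothetical placeholders: hourly time indices and normalised hashtag frequencies
    X = np.arange(24 * 28, dtype=float)[:, None]
    y = np.random.rand(24 * 28, 1)

    candidates = {
        'Const':  GPy.kern.Bias(1),
        'Linear': GPy.kern.Linear(1) + GPy.kern.Bias(1),
        'SE':     GPy.kern.RBF(1),
        'PER':    GPy.kern.StdPeriodic(1),
    }

    results = {}
    for name, kern in candidates.items():
        m = GPy.models.GPRegression(X, y, kern)
        m.optimize()                              # Type II ML over the hyperparameters
        results[name] = m.log_likelihood()        # log marginal likelihood

    best = max(results, key=results.get)          # kernel with the highest evidence
    print(results, best)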
Results: Examples

[Figures: example hashtag series and forecasts for #fyi (Const), #fail (Per), #snow (SE), #raw (PS).]
Results: Categories
Const           SE                      PER            PS
#funny          #2011                   #brb           #ff
#lego           #backintheday           #coffee        #followfriday
#likeaboss      #confessionhour         #facebook      #goodnight
#money          #februarywish           #facepalm      #jobs
#nbd            #haiti                  #fail          #news
#nf             #makeachange            #love          #nowplaying
#notetoself     #questionsidontlike     #rock          #tgif
#priorities     #savelibraries          #running       #twitterafterdark
#social         #snow                   #xbox          #twitteroff
#true           #snowday                #youtube       #ww
49              268                     493            366
Results: Forecasting
[Figure: forecasting performance comparison for the Lag+, GP-Lin, GP-SE, GP-PER, GP-PS and GP+ models.]
Application: Text classification
Task:

- assign the hashtag of a given tweet based on its text

Methods:

- Most frequent (MF)
- Naive Bayes model with empirical prior (NB-E)
- Naive Bayes with GP forecast as prior (NB-P)

             MF        NB-E      NB-P
Match@1      7.28%     16.04%    17.39%
Match@5      19.90%    29.51%    31.91%
Match@50     44.92%    59.17%    60.85%
MRR          0.144     0.237     0.252

Higher is better.
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Case study 1: MT Quality Estimation
Manual assessment of translation quality given the source and translated texts.

'Quality' can be measured in many ways (see Specia et al. (2009)):

- subjective scoring (1-5) for fluency, adequacy, perceived effort to correct
- post-editing effort: HTER or time taken
- binary judgements, ranking, . . .

Human judgements are highly subjective, biased and noisy:

- typing speed
- experience levels
- expectations from MT

See Cohn and Specia (2013), ACL.
General annotation problem
MT Quality Estimation is an instance of a general annotation problem:

- we can't rely on a single individual
- but many annotators produce different results
- how can we resolve conflicts?

Previous work in MT QE:

- averaged responses from several annotators (Callison-Burch et al., 2012); or
- used the output of a single annotator (Koponen et al., 2012)

It is more appropriate to model this as a multi-task problem.
Multi-task learning
Multi-task learning:

- a form of transfer learning
- several related tasks sharing the same input data representation
- learn the types and extent of correlations between tasks

Compared to domain adaptation:

- tasks need not be identical (even regression vs. classification)
- no explicit 'target' domain
- several sources of variation besides domain
- no assumptions of data asymmetry
Multi-task learning for MT Quality Estimation
Modelling individual annotators:

- each brings their own biases
- but their decisions correlate with others' annotations
- we could even find clusters of common behaviour

Here we use multi-output GP regression:

- joint inference over several translators
- learn the degree of inter-task transfer
- learn per-translator noise
- incorporate task meta-data
Review: GP Regression
    f \sim GP(0, k_\theta)
    y_i \sim N(f(x_i), \sigma^2)

[Graphical model: x and the latent f (with kernel hyperparameters \theta) generate y, with noise \sigma, repeated over the N data points.]
Multi-task GP Regression
    f \sim GP(0, k_{B,\theta})
    y_{im} \sim N(f_m(x_i), \sigma_m^2)

[Graphical model: as before, but the kernel now combines a coregionalisation matrix B over the M tasks with the data kernel hyperparameters \theta, and each task m has its own noise \sigma_m.]

See Alvarez et al. (2011).
Multi-task Covariance Kernels
Represent the data as (x, t, y) tuples, where t is a task identifier. Define a separable covariance kernel,

    K(x, x')_{t,t'} = B_{t,t'} \, k_\theta(x, x') + \text{noise}

- effectively each input is augmented with t, indexing the task of interest
- the coregionalisation matrix B ∈ R^{M×M} weights the inter-task covariance
- the data kernel k_\theta takes data points x as input, e.g., squared exponential
Coregionalisation Kernels
Generally B can be any symmetric positive semi-definite matrix. Some interesting choices:

- B = I encodes independent learning
- B = 1 encodes pooled learning
- interpolating the above
- full rank B = W W^\top, or low rank variants

Known as the Intrinsic Model of Coregionalisation (IMC).

See Alvarez et al. (2011); Bonilla et al. (2008).
Stacking and Kronecker products

The response variables form a matrix Y ∈ R^{N×M}.

Represent the data in 'stacked' form: X repeats the N inputs once per task,

    X = [x_1; x_2; ...; x_N; ...; x_1; x_2; ...; x_N]

and y stacks the responses task by task,

    y = [y_{11}; y_{21}; ...; y_{N1}; ...; y_{1M}; y_{2M}; ...; y_{NM}]

The kernel is then a Kronecker product, K(X, X) = B ⊗ k_data(X_o, X_o).
Kronecker product

    [a b; c d] ⊗ K = [aK bK; cK dK]

[Figure: Kronecker product of a 2x2 coregionalisation matrix with a data kernel matrix, shown as images.]
Choices for B: Independent learning

    B = I

Choices for B: Pooled learning

    B = 1

Choices for B: Interpolating independent and pooled learning

    B = 1 + \alpha I

Choices for B: Interpolating independent and pooled learning II

    B = 1 + diag(\alpha)

[Figures: each choice of B shown as an image, Kronecker-multiplied with the data kernel; see the numpy sketch below.]
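A minimal numpy sketch (not from the slides) of the separable multi-task kernel and the choices of B above, reusing the sq_exp_kernel from the earlier sketch:

    import numpy as np

    def multitask_kernel(X, B, data_kernel):
        # Separable multi-task kernel over stacked data: K = B kron k_data(X, X)
        return np.kron(B, data_kernel(X, X))

    M = 3                                           # number of tasks / annotators
    B_independent = np.eye(M)                       # B = I
    B_pooled = np.ones((M, M))                      # B = 1
    B_interp = np.ones((M, M)) + 0.5 * np.eye(M)    # B = 1 + alpha * I, alpha = 0.5

    # K = multitask_kernel(x_train, B_interp, sq_exp_kernel)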
Compared to Daume III (2007)
A feature augmentation approach to multi-task learning. It uses horizontal data stacking:

    X = \begin{bmatrix} X^{(1)} & X^{(1)} & 0 \\ X^{(2)} & 0 & X^{(2)} \end{bmatrix}    y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \end{bmatrix}

where (X^{(i)}, y^{(i)}) are the training data for task i. This expands the feature space by a factor of M.

It is equivalent to a multitask kernel:

    k(x, x')_{t,t'} = (1 + \delta(t, t')) \, x^\top x'
    K(X, X) = (1 + I) ⊗ k_{linear}(X, X)

⇒ a specific choice of B with a linear data kernel.
Compared to Evgeniou et al. (2006)
In the regularisation setting, Evgeniou et al. (2006) show that the kernel

    K(x, x')_{t,t'} = (1 - \lambda + \lambda M \delta(t, t')) \, x^\top x'

is equivalent to a linear model with regularisation term

    J(\Theta) = \frac{1}{M} \left( \sum_t ||\theta_t||^2 + \frac{1 - \lambda}{\lambda} \sum_t \Big|\Big| \theta_t - \frac{1}{M} \sum_{t'} \theta_{t'} \Big|\Big|^2 \right)

This regularises each task's parameters \theta_t towards the mean parameters over all tasks, \frac{1}{M} \sum_{t'} \theta_{t'}.

A form of the interpolation method from before.
ICM samples

Intrinsic Coregionalisation Model:

    K(X, X) = w w^\top ⊗ k(X, X),    w = [1; 5],    B = w w^\top = [1 5; 5 25]

[Figures: samples from the ICM prior; with a rank-1 B the second output is a scaled version of the first, with 25 times the variance.]

ICM samples

    K(X, X) = B ⊗ k(X, X),    B = [1 0.5; 0.5 1.5]

[Figures: samples from the ICM prior; the two outputs are positively correlated but no longer simple rescalings of one another.]
Experimental setup
Quality Estimation data:

- 2k examples of source sentence and MT output
- measuring subjective post-editing effort (1-5): WMT12
- post-editing time per word, in log seconds: WPTP12
- 17 dense features extracted using the Quest toolkit (Specia et al., 2013)
- using the official train/test split, or random assignment

Gaussian Process models:

- squared exponential data kernel (RBF)
- hyperparameter values trained using Type II MLE
- consider simple interpolation coregionalisation kernels
- include per-task noise or globally tied noise
Results: WMT12 RMSE for 1-5 ratings
[Figure: RMSE (roughly 0.70-0.82) against the number of training examples (50-500) for the STL, MTL and Pooled models.]
Incorporating layers of task metadata
[Figure: the coregionalisation kernel is built from Kronecker factors, one per metadata layer: Annotator ⊗ System ⊗ Source senTence.]
Results: WPTP12 RMSE post-editing time
Model             RMSE
constant Ann      0.637
Ind SVM Ann       0.635
Pooled GP         0.628
MTL GP Ann        0.617
MTL GP Sys        0.748
MTL GP senT       0.741
MTL GP A+S        0.600
MTL GP A+S+T      0.585

Lower is better.
Case Study 2: Emotion Analysis
- automatically detect emotions in a text
- fine-grained scores per emotion
- presence of anti-correlations (e.g. sadness and joy)

Headline                                         Fear    Joy    Sadness
Storms kill, knock out power, cancel flights     82      0      60
Panda cub makes her debut                        0       59     0

See Beck et al. (2014), EMNLP.
Modelling anti-correlations
- The coregionalisation settings used in Quality Estimation are not suitable for this problem: they assume positive covariances between tasks.
- We need a parameterisation that allows B to have negative covariances: we use the incomplete-Cholesky decomposition,

    B = W W^\top + diag(\alpha)
Incomplete-Cholesky model

    W \cdot W^\top + diag(\alpha) = B

where W is an M × r matrix (here M = 6 emotions) and \alpha an M-vector of per-task variances (see the numpy sketch below):

- rank r = 1: 12 hyperparameters
- rank r = 2: 18 hyperparameters
- rank r = 3: 24 hyperparameters
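A minimal numpy sketch (not the paper's code) of constructing such a B, showing that its off-diagonal entries can be negative and so can capture anti-correlated emotions:

    import numpy as np

    M, rank = 6, 1                               # 6 emotions, rank-1 factor
    rng = np.random.default_rng(0)
    W = rng.standard_normal((M, rank))           # M * rank entries
    alpha = np.abs(rng.standard_normal(M))       # plus M per-task variances -> 12 hyperparameters

    B = W @ W.T + np.diag(alpha)                 # symmetric positive semi-definite
    print(B)                                     # off-diagonals W_i * W_j may be negative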
Experimental Setup
- Dataset: SemEval-2007 "Affective Text" (Strapparava and Mihalcea, 2007)
- 1000 news headlines, each annotated with 6 scores [0-100], one per emotion
- 100 sentences for training, 900 for testing
- bag-of-words representation as features
Learned Task Covariances
Prediction Results
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Recap: GP Regression

Observations y_i are a noisy version of a latent process f:

    y_i = f(x_i) + \epsilon_i,    with \epsilon_i \sim N(0, \sigma^2)
Likelihood models
There is an analytic solution for a Gaussian likelihood (i.e. Gaussian noise):

    Gaussian (process) prior × Gaussian likelihood = Gaussian posterior

But what about other likelihoods?

- Counts: y ∈ N
- Classification: y ∈ {C_1, C_2, ..., C_k}
- Ordinal regression (ranking): C_1 < C_2 < ... < C_k
- . . .
Classification
Binary classification, y_i ∈ {0, 1}.

Two popular choices for the likelihood:

- Logistic sigmoid: p(y_i = 1 | f_i) = \sigma(f_i) = \frac{1}{1 + \exp(-f_i)}
- Probit function: p(y_i = 1 | f_i) = \Phi(f_i) = \int_{-\infty}^{f_i} N(z | 0, 1) \, dz

Both "squash" the input from (-∞, ∞) into the range [0, 1].
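As an illustration (not from the slides), a tiny numpy sketch of squashing a latent GP sample through the logistic sigmoid:

    import numpy as np

    def sq_exp(x1, x2, l=1.0):
        return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l ** 2)

    def sigmoid(f):
        return 1.0 / (1.0 + np.exp(-f))

    x = np.linspace(-5, 5, 200)
    K = sq_exp(x, x) + 1e-8 * np.eye(len(x))
    f = np.random.multivariate_normal(np.zeros(len(x)), K)   # latent function sample
    pi = sigmoid(f)                                           # p(y = 1 | f(x)), values in (0, 1)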
Squashing function

Pass the latent function through the logistic function to obtain a probability, \pi(x) = p(y_i = 1 | f_i).

[Figure 3.2 of Rasmussen and Williams (2006): panel (a) shows a sample latent function f(x) drawn from a Gaussian process; panel (b) shows the result of squashing this sample through the logistic function \sigma(z) = (1 + \exp(-z))^{-1} to obtain the class probability \pi(x) = \sigma(f(x)).]
Inference Challenges: for a test case x_*

Distribution over the latent function:

    p(f_* | X, y, x_*) = \int p(f_* | X, x_*, f) \, \underbrace{p(f | X, y)}_{\text{posterior}} \, df

Distribution over the classification output:

    p(y_* = 1 | X, y, x_*) = \int \sigma(f_*) \, p(f_* | X, y, x_*) \, df_*

Problem: the likelihood is no longer conjugate with the prior, so there is no analytic solution.
Approximate inference
Several inference techniques have been proposed for non-conjugate likelihoods:

- Laplace approximation (Williams and Barber, 1998)
- Expectation propagation (Minka, 2001)
- Variational inference (Gibbs and MacKay, 2000)
- MCMC (Neal, 1999)

And more, including sparse approaches for large scale application.
Laplace approximation
Approximate the non-Gaussian posterior by a Gaussian, centred at the mode.
Figure from Rogers and Girolami (2012)
Laplace approximation
Log posterior:

    \Psi(f) = \log p(y | f) + \log p(f | X) + \text{const}

Find the posterior mode \hat{f}, i.e., MAP estimation, O(n^3).

Then take a second order Taylor series expansion about the mode, and fit a Gaussian

- with mean \mu = \hat{f}
- and covariance \Sigma = (K^{-1} + W)^{-1}, where W = -\nabla\nabla \log p(y | \hat{f})

This allows computation of the posterior and marginal likelihood, but predictions may still be intractable.
Expectation propagation
Take the intractable posterior:

    p(f | X, y) = \frac{1}{Z} \, p(f | X) \prod_{i=1}^{n} p(y_i | f_i)

    Z = \int p(f | X) \prod_{i=1}^{n} p(y_i | f_i) \, df

Approximate it with a fully factorised distribution:

    q(f | X, y) = \frac{1}{Z_{EP}} \, p(f | X) \prod_{i=1}^{n} \tilde{t}(f_i)
Expectation propagation
The approximate posterior is defined as

    q(f | y) = \frac{1}{Z_{EP}} \, p(f) \prod_{i=1}^{n} \tilde{t}(f_i)

where each component is assumed to be Gaussian:

- p(y_i | f_i) \approx \tilde{t}_i(f_i) = \tilde{Z}_i \, N(f_i | \tilde{\mu}_i, \tilde{\sigma}_i^2)
- p(f | X) = N(f | 0, K_{nn})

This results in a Gaussian formulation for q(f | y):

- allows tractable multiplication and division with Gaussians
- and marginalisation, expectations, etc.
Expectation propagation
The EP algorithm aims to fit the \tilde{t}_i(f_i) to the posterior, starting with a guess for q, then iteratively refining as follows:

- minimise the KL divergence between the true posterior for f_i and the approximation \tilde{t}_i:

    \min_{\tilde{t}_i} \; KL\left( p(y_i | f_i) \, q_{-i}(f_i) \;\|\; \tilde{t}_i(f_i) \, q_{-i}(f_i) \right)

  where q_{-i}(f_i) is the cavity distribution formed by marginalising q(f) over f_j, j ≠ i, then dividing by \tilde{t}_i(f_i).

- key idea: we only need an accurate approximation for globally feasible f_i
- match moments to update \tilde{t}_i, then update q(f)
Expectation propagation: site approximation example

[Figure: site approximation example.]
Expectation propagation
There is no proof of convergence:

- but empirically it works well
- often more accurate than the Laplace approximation

Formulated for many different likelihoods:

- complexity O(n^3), dominated by matrix inversion
- sparse EP approximations can reduce this to O(nm^2)

See Minka (2001) and Rasmussen and Williams (2006) for further details.
Multi-class classification
Consider multi-class classification, y ∈ {C_1, C_2, ..., C_k}.

Draw a vector of k latent function values for each input:

    f = \left( f_1^1, \ldots, f_n^1, \; f_1^2, \ldots, f_n^2, \; \ldots, \; f_1^k, \ldots, f_n^k \right)

Formulate the classification probability using the soft-max (see the sketch below):

    p(y_i = c | \mathbf{f}_i) = \frac{\exp(f_i^c)}{\sum_{c'} \exp(f_i^{c'})}
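A tiny numpy sketch (not from the slides) of the soft-max over one input's vector of k latent values:

    import numpy as np

    def softmax_class_probs(f_i):
        # p(y_i = c | f_i) for f_i = (f_i^1, ..., f_i^k)
        e = np.exp(f_i - np.max(f_i))    # subtract the max for numerical stability
        return e / e.sum()

    print(softmax_class_probs(np.array([2.0, 0.5, -1.0])))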
Multi-class classification
Assume the k latent processes are uncorrelated, leading to the prior covariance f ∼ N(0, K), where

    K = blockdiag(K_1, K_2, \ldots, K_k)

is block diagonal of size kn × kn, with each K_j of size n × n.

Various approximation methods exist for inference, e.g., Laplace (Williams and Barber, 1998), EP (Kim and Ghahramani, 2006), MCMC (Neal, 1999).
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
GPs for Structured Prediction
GPSC (Altun et al., 2004):

- defines a likelihood over label sequences, p(y|x), with the latent variable over full sequences y
- HMM-inspired kernel, combining features from each observed symbol x_i and label pairs
- MAP inference for the hidden function values f, and a sparsification trick for tractable inference

GPstruct (Bratieres et al., 2013):

- the base model is a CRF:

    p(y | x, f) = \frac{\exp\left( \sum_c f(c, x_c, y_c) \right)}{\sum_{y' \in Y} \exp\left( \sum_c f(c, x_c, y'_c) \right)}

- assumes that each potential f(c, x_c, y_c) is drawn from a GP
- Bayesian inference using MCMC (Murray et al., 2010)
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
String Kernels
    k(x, x') = \sum_{s \in \Sigma^*} w_s \, \phi_s(x) \, \phi_s(x')

- \phi_s(x): count of substring s inside x
- 0 ≤ w_s ≤ 1: weight of substring s
- s can also be a subsequence (containing gaps)
- s = character sequences → n-gram kernels (Lodhi et al., 2002), useful for stems; e.g. k(bar, bat) = 3 (shared: b, a, ba) — see the sketch below
- s = word sequences → Word Sequence kernels (Cancedda et al., 2003); e.g. k(gas only injection, gas assisted plastic injection) = 3
- soft matching: k(battle, battles) ≠ 0, k(battle, combat) ≠ 0
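A minimal Python sketch (not from the slides) of an n-gram string kernel with uniform weights, reproducing k(bar, bat) = 3:

    def ngram_features(s, n_max=2):
        # Count all contiguous substrings of s up to length n_max
        counts = {}
        for n in range(1, n_max + 1):
            for i in range(len(s) - n + 1):
                sub = s[i:i + n]
                counts[sub] = counts.get(sub, 0) + 1
        return counts

    def ngram_kernel(x, x2, n_max=2):
        # k(x, x') = sum_s phi_s(x) * phi_s(x') with uniform weights w_s = 1
        fx, fx2 = ngram_features(x, n_max), ngram_features(x2, n_max)
        return sum(c * fx2.get(s, 0) for s, c in fx.items())

    print(ngram_kernel("bar", "bat"))   # shared substrings b, a, ba -> 3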
Tree Kernels

- Subset Tree Kernels (Collins and Duffy, 2001)

  [Figure: two parse trees sharing fragments, e.g. (S (NP Mary) (VP loves John)) and (S (NP Mary) (VP had dinner)).]

- Partial Tree Kernels (Moschitti, 2006): allow "broken" rules, useful for dependency trees
- Soft matching can also be applied.
Conclusions
Gaussian Processes are a powerful framework:

- elegant kernel machines
- probabilistic, quantifying uncertainty
- closed form Bayesian inference & model selection

Mature enough for widespread application:

- frontier challenges in NLP
- scaling improvements make these applications practical
References I

Altun, Y., Hofmann, T., and Smola, A. J. (2004). Gaussian Process Classification for Segmenting and Annotating Sequences. In Proceedings of ICML, page 8, New York, New York, USA. ACM Press.
Alvarez, M. A., Rosasco, L., and Lawrence, N. D. (2011). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266.
Beck, D., Cohn, T., and Specia, L. (2014). Joint Emotion Analysis via Multi-task Gaussian Processes. In Proceedings of EMNLP.
Bonilla, E., Chai, K. M., and Williams, C. (2008). Multi-task Gaussian process prediction. NIPS.
Bratieres, S., Quadrianto, N., and Ghahramani, Z. (2013). Bayesian Structured Prediction using Gaussian Processes. arXiv:1307.3846, pages 1-17.
Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012). Findings of the 2012 workshop on statistical machine translation.
Cancedda, N., Gaussier, E., Goutte, C., and Renders, J.-M. (2003). Word-Sequence Kernels. The Journal of Machine Learning Research, 3:1059-1082.
Cohn, T. and Specia, L. (2013). Modelling annotator bias with multi-task Gaussian processes: An application to machine translation quality estimation. ACL.
Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural Language. In Advances in Neural Information Processing Systems.
Daume III, H. (2007). Frustratingly easy domain adaptation. ACL.

References II

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2006). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(1):615.
Gibbs, M. N. and MacKay, D. J. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458-1464.
Kim, H.-C. and Ghahramani, Z. (2006). Bayesian Gaussian process classification with the EM-EP algorithm. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):1948-1959.
Koponen, M., Aziz, W., Ramos, L., and Specia, L. (2012). Post-editing time as a measure of cognitive effort. AMTA 2012 Workshop on Post-editing Technology and Practice.
Lampos, V., Aletras, N., Preotiuc-Pietro, D., and Cohn, T. (2014). Predicting and Characterising User Impact on Twitter. EACL.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text Classification using String Kernels. The Journal of Machine Learning Research, 2:419-444.
Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362-369. Morgan Kaufmann Publishers Inc.
Moschitti, A. (2006). Making Tree Kernels practical for Natural Language Learning. In EACL, pages 113-120.
Murray, I., Adams, R. P., and MacKay, D. (2010). Elliptical slice sampling. In International Conference on Artificial Intelligence and Statistics, pages 541-548.

References III

Neal, R. (1999). Regression and classification using Gaussian process priors. Bayesian Statistics, 6.
Preotiuc-Pietro, D. and Cohn, T. (2013). A temporal model of text periodicities using Gaussian Processes. EMNLP.
Quinonero Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959.
Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, MA.
Rogers, S. and Girolami, M. (2012). A First Course in Machine Learning. Chapman & Hall/CRC.
Specia, L., Shah, K., De Souza, J. G., and Cohn, T. (2013). QuEst - a translation quality estimation framework. Citeseer.
Specia, L., Turchi, M., Cancedda, N., Dymetman, M., and Cristianini, N. (2009). Estimating the Sentence-Level Quality of Machine Translation Systems. Pages 28-35, Barcelona, Spain.
Strapparava, C. and Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text. In Proceedings of SEMEVAL.
Strapparava, C. and Mihalcea, R. (2008). Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing.
Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(12):1342-1351.