Gaussian Processes for NLP
Trevor Cohn & Daniel Beck
CIS, University of Melbourne / DCS, University of Sheffield
ALTA Workshop, 26 November 2014
with slides from Daniel Preotiuc-Pietro, Neil Lawrence, Richard Turner
Gaussian Processes
Brings together several key ideas in one framework
- Bayesian
- kernelised
- non-parametric
- non-linear
- modelling uncertainty

Elegant and powerful framework, with growing popularity in machine learning and application domains.
Gaussian Processes
State of the art for regression
- exact posterior inference
- supports very complex non-linear functions
- elegant model selection
Gaussian Processes
Now mature enough for use in NLP
- support for classification, ranking, etc.
- fancy kernels, e.g., for text
- sparse approximations for large scale inference
- probabilistic formulation allows incorporation into larger graphical models
- models the prediction uncertainty, so that it is propagated through pipelines of probabilistic components
Tutorial Scope
Covers
1. GP fundamentals (1 hour)
   - focus on regression
   - weight space vs. function space view
   - squared exponential kernel
2. NLP applications (1 hour 30)
   - sparse GPs
   - periodic kernels and model selection
   - multi-output GPs
3. Further topics (30 mins)
   - classification and other likelihoods
   - structured kernels
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Gaussian Processes
Regression and classification are fundamental techniques in NLP.

Linear models:

- simple and fast to run
- provide straightforward interpretability
- but very limited in the type of relations they can model

Non-linear models (e.g. SVM):

- better results because they can model other kinds of relationships
- but high cost of hyperparameter optimisation
- lack of interoperability with downstream processing
Regression
. . . versus classification
The Gaussian (normal) distribution
The Gaussian is probably the most widely used probability density:

    p(y | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right) \triangleq N(\mu, \sigma^2)

Multivariate formulation:

    p(y | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (y - \mu)^\top \Sigma^{-1} (y - \mu) \right)

where \mu is now a mean vector, and \Sigma a covariance matrix.
The Gaussian distribution

[Figure: Gaussian density p(h | \mu, \sigma^2) over height h (in metres), with \mu = 1.7 and variance \sigma^2 = 0.0225; the mean is shown as a red line. It could represent the heights of a population of students.]
Bivariate Gaussian distribution

[Figure: samples from a joint distribution over weight w (kg) and height h (m), shown alongside the marginal distributions p(h) and p(w).]
Gaussian distribution properties
Several very convenient properties:
- Scaling a Gaussian results in a Gaussian
- Sum of Gaussian variables is also Gaussian (more generally, the central limit theorem)
- Product of two Gaussians is a (scaled) Gaussian

For multivariate Gaussian distributed variables:

- Marginal distribution is also Gaussian
- Conditional distribution is also Gaussian (see the numpy sketch below)
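To make the last two properties concrete, here is a minimal numpy sketch (not part of the original slides) computing the marginal and conditional of a bivariate Gaussian, using the covariance matrix that reappears in the upcoming illustrations:

    import numpy as np

    # Joint Gaussian over (y1, y2) with covariance [[1, .7], [.7, 1]]
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.7],
                      [0.7, 1.0]])

    # Marginal of y1: simply drop the other variable's rows/columns -> N(0, 1)
    mu1, var1 = mu[0], Sigma[0, 0]

    # Conditional of y2 given an observed y1: also Gaussian
    y1 = 1.0
    mu_cond = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (y1 - mu[0])   # 0.7
    var_cond = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]      # 0.51
    print(mu_cond, var_cond)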
Gaussian Processes
The Gaussian distribution:

- is a distribution over scalars or vectors

The Gaussian process:

- is a distribution over functions
Being Bayesian
Revisiting our regression example:
- what class of functions should we allow?
- given the class, what parameter values are sensible?
- e.g., linear: should we let m, c → ±∞?
- e.g., non-linear: should it be jagged or smooth?
- e.g., periodic . . .

Usually we have some intuitions.
Being Bayesian
Define a prior over the unknowns, p(θ):

- encodes our initial intuitions about reasonable values

Observing data gives rise to a likelihood, p(y|θ):

- how well the model fits the training sample

We seek to reason over the posterior:

- incorporating the above to form our updated beliefs about the unknowns
- formulated using Bayes' rule

    p(\theta | y) = \frac{p(y | \theta) \, p(\theta)}{p(y)}
Bayesian update illustrated

Example of linear regression using a prior over the intercept parameter, c:

    p(c) = N(c | 0, \alpha_1)
    p(y | m, c, x, \sigma^2) = N(y | mx + c, \sigma^2)
    p(c | y, m, x, \sigma^2) = N\left( c \,\Big|\, \frac{y - mx}{1 + \sigma^2/\alpha_1}, \, (\sigma^{-2} + \alpha_1^{-1})^{-1} \right)

[Figure: a Gaussian prior combines with a Gaussian likelihood to give a Gaussian posterior.]
Gaussian Processes prior
- the prior represents our prior beliefs over the functions we expect to have generated our (regression) data
- here, standard choices of 0 mean and 1 variance, and a smoothly varying function
Gaussian Process posterior
- after observing data points, the posterior incorporates the data likelihood
- the functions must pass close to the observations
Relationship to Linear regression
Linear regression:

    y = w^\top x + \epsilon

where w are the learned model parameters.

Gaussian process regression:

    y = f(x) + \epsilon

where the function f is drawn from a Gaussian process prior.

- the w parameters are gone (in fact, integrated out)
- f is not limited to a parametric form
Graphical model view
    f \sim GP(m, k)
    y \sim N(f(x), \sigma^2)

- f : R^D → R is a latent function
- y is a noisy realisation of f(x)
- k is the covariance function or kernel
- \sigma^2 is a learned parameter

[Graphical model: x and the latent f (with kernel k) generate y, with noise \sigma, repeated over the N data points.]
Formal definition
Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is fully specified by mean m(x) and covariance k(x, x') functions:

- m(x) = E[f(x)]
- k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]

Without loss of generality we can assume m(x) = 0.
Compared to a Gaussian distribution
Gaussian distribution:

- specified by a mean and covariance matrix: x ∼ N(\mu, \Sigma)
- the index is the position of each random variable within the vector x

Gaussian process:

- specified by a mean and covariance function: f ∼ GP(m, k)
- the index is the argument x of f

The Gaussian process is an infinite-dimensional extension of the multivariate Gaussian distribution.
Gaussian distribution

    p(y | \Sigma) \propto \exp\left( -\tfrac{1}{2} y^\top \Sigma^{-1} y \right)

[Figures: samples from a bivariate Gaussian over (y1, y2), shown for the covariance matrices
    \Sigma = [1 .7; .7 1],   \Sigma = [1 .4; .4 1],   \Sigma = [1 .0; .0 1]
The stronger the off-diagonal covariance, the more the samples concentrate around the y1 = y2 diagonal.]

Gaussian distribution: conditioning

    p(y_2 | y_1, \Sigma) \propto \exp\left( -\tfrac{1}{2} (y_2 - \mu^*) \, {\Sigma^*}^{-1} (y_2 - \mu^*) \right)

[Figures: observing y1 slices the joint density, leaving a Gaussian conditional over y2 with mean \mu^* and variance \Sigma^*.]
New visualisation

[Figures: instead of a scatter plot over (y1, y2), plot each sampled value y against its variable index (1, 2). Samples from the bivariate Gaussian with
    \Sigma = [1 .9; .9 1]
give pairs of points at indices 1 and 2; the strong covariance makes the two values nearly equal.]
New visualisation

[Figures: the same index-vs-value plot for samples from a 5-dimensional Gaussian with covariance

    \Sigma = [ 1  .9  .8  .6  .4
              .9   1  .9  .8  .6
              .8  .9   1  .9  .8
              .6  .8  .9   1  .9
              .4  .6  .8  .9   1 ]

Nearby indices are highly correlated, so each sample traces out a smooth sequence of values over indices 1..5.]
New visualisation

[Figures: extending to 20 variables, with a 20x20 covariance matrix (shown as an image) whose entries decay smoothly away from the diagonal; samples now look like smooth random functions evaluated at indices 1..20.]
Regression using Gaussians

[Figure: samples from the 20-dimensional Gaussian plotted against the variable index, with the covariance matrix shown as an image.]
Regression using Gaussians

The covariance is built from the input locations using a kernel plus observation noise:

    \Sigma(x_1, x_2) = K(x_1, x_2) + I \sigma_y^2
    K(x_1, x_2) = \sigma^2 \exp\left( -\tfrac{1}{2 l^2} (x_1 - x_2)^2 \right)

[Figure: samples plotted against the variable index, with the resulting covariance matrix shown as an image.]
Regression: probabilistic inference in function space

Parametric model:

    y(x) = f(x; \theta) + \sigma_y \epsilon,    \epsilon \sim N(0, 1)

Non-parametric (\infty-parametric) model:

    p(y | \theta) = N(0, \Sigma)
    \Sigma(x_1, x_2) = K(x_1, x_2) + I \sigma_y^2      (the I \sigma_y^2 term is the noise)
    K(x_1, x_2) = \sigma^2 \exp\left( -\tfrac{1}{2 l^2} (x_1 - x_2)^2 \right)

where \sigma is the vertical scale and l the horizontal scale of the kernel.

[Figure: the GP yields a function estimate with uncertainty; the covariance matrix is shown as an image.]
Squared Exponential (SE) kernel

The usual choice of covariance is the Squared Exponential (SE) kernel (a.k.a. RBF, exponentiated quadratic, Gaussian):

    k(x, x') = \sigma_d^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right)

The kernel's parameters {l, \sigma_d} are called the GP's hyperparameters.

When l is very large, points remain correlated even at large distances in that input dimension, i.e., that feature is less relevant.

How to select the hyperparameters?

[Figures: varying the lengthscale from very low (0.04) to high (7.4), with the noise hyperparameter fixed.]
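As a concrete illustration (not part of the original slides), a minimal numpy sketch of the SE kernel and of drawing sample functions from the GP prior:

    import numpy as np

    def sq_exp_kernel(x1, x2, lengthscale=1.0, sigma_d=1.0):
        # k(x, x') = sigma_d^2 * exp( -(x - x')^2 / (2 l^2) )
        sqdist = (x1[:, None] - x2[None, :]) ** 2
        return sigma_d ** 2 * np.exp(-0.5 * sqdist / lengthscale ** 2)

    # Draw three functions from the GP prior f ~ N(0, K) over a dense grid
    x = np.linspace(0, 10, 100)
    K = sq_exp_kernel(x, x, lengthscale=1.0)
    jitter = 1e-8 * np.eye(len(x))                      # numerical stability
    f_samples = np.random.multivariate_normal(np.zeros(len(x)), K + jitter, size=3)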
Hyperparameters
Training a GP means:

- finding the right kernel
- finding values of the hyperparameters

The covariance function controls properties of the GP:

- smoothness
- stationarity
- periodicity
- etc.
Other kernels
Inference
Although the GP is an infinite-dimensional object, the marginalisation property means we only need to work with finite dimensions.

For prediction we need the conditional distribution over test points y* given the observed points y:

    p(y_* | y, x_*, X) = \frac{p(y, y_* | x_*, X)}{p(y | X)}

We omit the conditioning on x_*, X for brevity henceforth.
Inference
Consider the GP over a single extra test point:

    p(y) = N(0, K + \sigma_n^2 I)

    p(y, y_*) = N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^\top & K_{**} + \sigma_n^2 \end{bmatrix} \right)

where the K are covariance matrices, i.e.,

- K ∈ R^{N×N} is the covariance between training instances
- K_* ∈ R^N is the covariance between test and training instances
- K_{**} ∈ R is the test variance
Inference
Recall the property of the multivariate Gaussian: the conditional distribution is Gaussian. This leads to the following predictive posterior:

    p(y_* | y) = N(\mu_*, \Sigma_*)
    \mu_* = K_*^\top (K + \sigma_n^2 I)^{-1} y
    \Sigma_* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*

- Note that the predictive mean is linear in the data; and
- the predictive covariance is the prior covariance reduced by the information the observations give about the function (see the numpy sketch below).
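A minimal numpy sketch of these predictive equations (not from the slides), using a Cholesky factorisation for the matrix inverse and the sq_exp_kernel defined in the earlier sketch:

    import numpy as np

    def gp_predict(X_train, y_train, X_test, kernel, noise_var):
        # Posterior mean and covariance of a GP regression model at the test inputs
        K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
        K_star = kernel(X_train, X_test)            # N x N* cross-covariances
        K_star_star = kernel(X_test, X_test)        # N* x N* test covariances
        L = np.linalg.cholesky(K)                   # K + sigma_n^2 I = L L^T
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
        mean = K_star.T @ alpha                     # K_*^T (K + sigma_n^2 I)^-1 y
        v = np.linalg.solve(L, K_star)
        cov = K_star_star - v.T @ v                 # K_** - K_*^T (K + sigma_n^2 I)^-1 K_*
        return mean, cov

    # Example usage with the sq_exp_kernel sketch from above:
    # mean, cov = gp_predict(x_train, y_train, x_test, sq_exp_kernel, noise_var=0.1)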
Model selection
The Bayesian formulation seemingly removes the need for training, as the parameters are integrated out.

However, we still need to select hyperparameter values, e.g.,

- \sigma_n, the white noise
- l, the length (horizontal) scale
- \sigma_d, the vertical scale
- . . .

How to select appropriate values? Consider the marginal likelihood, p(y|X).
Model selection
The marginal likelihood is defined as

    p(y | X) = \int p(y, f | X) \, df = \int p(y | f) \, p(f | X) \, df

Observe that

- as the space of possible functions grows, p(f|X) diminishes
- poorly fitting functions will score poorly on p(y|f)
- both are affected by the type of kernel and the hyperparameter values

The marginal likelihood balances these opposing forces.
Model selection for training
The marginal likelihood permits an analytical solution due to conjugacy:

    p(y | X) = \int \underbrace{p(y | f)}_{\text{Gaussian}} \, \underbrace{p(f | X)}_{\text{Gaussian}} \, df = N(0, K + \sigma_n^2 I)

arising from the property that the product of two Gaussians is an unnormalised Gaussian.

This permits an analytical solution:

    \log p(y | X) = -\frac{1}{2} y^\top (K + \sigma_n^2 I)^{-1} y - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log 2\pi

We can optimise the above with respect to the model hyperparameters using gradient ascent, a.k.a. Type II maximum likelihood (see the sketch below).
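A minimal numpy sketch (not from the slides) of the log marginal likelihood above; in practice its value and gradients would be passed to an optimiser, or a toolkit such as GPy would handle this:

    import numpy as np

    def log_marginal_likelihood(K, y, noise_var):
        # log p(y|X) = -1/2 y^T (K + s^2 I)^-1 y - 1/2 log|K + s^2 I| - n/2 log(2 pi)
        n = len(y)
        Ky = K + noise_var * np.eye(n)
        L = np.linalg.cholesky(Ky)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        log_det = 2.0 * np.sum(np.log(np.diag(L)))       # log|K + s^2 I|
        return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2 * np.pi)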
Learning Covariance Parameters

Can we determine length scales and noise levels from the data?

    E(\theta) = \tfrac{1}{2} \log |K| + \tfrac{1}{2} y^\top K^{-1} y

[Figures: a sequence of GP fits y(x) for different length scales, alongside the objective E(\theta) plotted over length scales from 10^{-1} to 10^{1}; the best length scale is the one minimising E(\theta).]
Beyond GP regression
- Classification
- Ordinal regression
- Count data
- Model selection: kernel and hyperparameter optimisation
- Kernel design
- Extrapolation cf. interpolation
- Multi-output GPs for multi-task learning
- Sparse GPs for scaling to large feature and input spaces
- Latent variable models: non-linear probabilistic variants of PCA/CCA et al.
- And many others ...
GP limitations
The non-parametric formulation complicates scaling:

- O(N^3) time complexity and O(N^2) space complexity
- but ongoing work to bring this down, e.g., O(NM^2) time complexity or even constant in N

Brilliant for regression, but more difficult for other likelihoods:

- a suite of approximation approaches to deal with inference for non-conjugate configurations

Not as mature as many other frameworks:

- but 'coming of age', many big issues have been addressed
- even application to 'big data' scenarios
Resources
Free book: http://www.gaussianprocess.org/gpml/chapters/

Tutorials

- GPs for Natural Language Processing tutorial (ACL 2014): http://goo.gl/18heUk
- GP Schools in Sheffield and roadshows in Kampala, Pereira, and (future) Nyeri, Melbourne: http://ml.dcs.shef.ac.uk/gpss/
- GP Regression demo: http://www.tmpl.fi/gp/
- Annotated bibliography and other materials: http://www.gaussianprocess.org

Toolkits

- GPML (Matlab): http://www.gaussianprocess.org/gpml/code
- GPy (Python): https://github.com/SheffieldML/GPy
- GPstuff (R, Matlab, Octave): http://becs.aalto.fi/en/research/bayes/gpstuff/
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Case study: User impact on Twitter
Predicting and characterising user impact on Twitter:

- define a user-level impact score
- use the user's text and profile information as features to predict the score
- analyse the features which better predict the score
- provide users with 'guidelines' for improving their score

An instance of a text prediction problem:

- emphasis on feature analysis and interpretability (specific to social science applications)
- non-linear variation

See Lampos et al. (2014), EACL.
Sparse GPs
Exact inference in a GP:

- Memory: O(n^2)
- Time: O(n^3)

Sparse GP approximation:

- Memory: O(n·m)
- Time: O(n·m^2)

where m is selected at runtime. Typically required when n > 1000.
Sparse GPs
Many options for sparse approximations:

- Based on inducing variables:
  - Subset of Data (SoD)
  - Subset of Regressors (SoR)
  - Deterministic Training Conditional (DTC)
  - Partially Independent Training Conditional (PITC)
  - Fully Independent Training Conditional (FITC)
- Fast Matrix Vector Multiplication (MVM)
- Variational methods

See Quinonero Candela and Rasmussen (2005) for an overview.
Sparse GPs
Sparse approximations where the f are treated as latent variables:

- a subset u, with |u| = m, is treated exactly
- the others are given a computationally cheaper treatment
- avoids large matrix inversions

What does the penalty term do?

[Figure from Hensman, GPSS '13]

Inducing points are fixed, or learned using greedy search / optimisation.
Predicting and characterising user impact
500 million tweets a day on Twitter:

- important and some not-so-important information
- breaking news from media
- friends
- celebrity self-promotion
- marketing
- spam

Can we automatically predict the impact of a user?

Can we automatically identify factors which influence user impact?
Defining user impact
Define impact as a function of network connections:

- no. of followers
- no. of followees
- no. of times the account is listed by others

    Impact = \ln\left( \frac{listings \cdot followers^2}{followees} \right)

(see the worked example below)

Dataset:

- 38,000 UK users
- all tweets from one year
- 48 million deduplicated messages
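As a trivial worked example (not from the slides), the impact score for a hypothetical user with 500 listings, 20,000 followers and 1,000 followees:

    import math

    def impact(listings, followers, followees):
        # Impact = ln( listings * followers^2 / followees )
        return math.log(listings * followers ** 2 / followees)

    print(impact(500, 20_000, 1_000))   # ~ 19.1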
User controlled features
Only features under the user's control (e.g. not the no. of retweets):

- User features (18): extracted from the account profile, plus aggregated text features
- Text features (100): the user's topic distribution, with topics computed using spectral clustering on the word co-occurrence (NPMI) matrix
Models
Regression task:

- Gaussian Process regression model
- n = 38000 · 9/10, using sparse GPs with FITC
- Squared Exponential kernel (k-dimensional):

    k(x_p, x_q) = \sigma_f^2 \exp\left( -\tfrac{1}{2} (x_p - x_q)^\top D (x_p - x_q) \right) + \sigma_n^2 \delta_{pq}

  where D ∈ R^{k×k} is a symmetric matrix.

- if D = D_ARD = diag(l)^{-2}:

    k(x_p, x_q) = \sigma_f^2 \exp\left( -\tfrac{1}{2} \sum_{d=1}^{k} \frac{(x_{pd} - x_{qd})^2}{l_d^2} \right) + \sigma_n^2 \delta_{pq}
Automatic Relevance Determination (ARD)
- with D = D_ARD the kernel is the SE kernel with automatic relevance determination (ARD), the vector l denoting the characteristic length-scales of each feature
- l_d measures the distance along x_d over which values become uncorrelated
- 1/l_d^2 is proportional to how relevant a feature is: a large length-scale means the covariance becomes independent of that feature's value
- sorting by length-scale indicates which features impact the prediction the most
- tuning these parameters is done via Bayesian model selection (see the GPy sketch below)
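A rough sketch (not the paper's code) of an ARD model along these lines using the GPy toolkit listed later under Toolkits; X and y are hypothetical placeholders, and GPy's default sparse approximation stands in for the FITC setup used in the paper:

    import numpy as np
    import GPy

    # Placeholder data: n instances with k features, and an impact score per user
    X = np.random.randn(500, 10)
    y = np.random.randn(500, 1)

    kern = GPy.kern.RBF(input_dim=X.shape[1], ARD=True)     # one lengthscale per feature
    model = GPy.models.SparseGPRegression(X, y, kernel=kern, num_inducing=50)
    model.optimize()

    # Small lengthscale -> relevant feature; large lengthscale -> feature largely ignored
    relevance = 1.0 / np.asarray(kern.lengthscale) ** 2
    print(sorted(zip(relevance, range(X.shape[1])), reverse=True))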
Prediction results
Experiments:

- 10-fold cross validation
- using the predictive mean
- baseline model is ridge regression (LIN)
- Profile features
- Text features

[Figure: bar chart of Pearson correlation (roughly 0.6 to 0.8) for the LIN and GP models.]
Conclusions
- GPs substantially better than ridge regression
- non-linear GPs with only profile features perform better than linear methods with all features
- GPs outperform SVR
- adding topic features improves all models
Selected features
Feature                                                           Importance
Using default profile image                                       0.73
Total number of tweets (entire history)                           1.32
Number of unique @-mentions in tweets                             2.31
Number of tweets (in dataset)                                     3.47
Links ratio in tweets                                             3.57
T1 (Weather): mph, humidity, barometer, gust, winds               3.73
T2 (Healthcare, Housing): nursing, nurse, rn, registered,
   bedroom, clinical, #news, estate, #hospital                    5.44
T3 (Politics): senate, republican, gop, police, arrested,
   voters, robbery, democrats, presidential, elections            6.07
Proportion of days with non-zero tweets                           6.96
Proportion of tweets with @-replies                               7.10
Feature analysis
[Figures: impact histograms for users with high (H) versus low (L) values of two features, "No. of unique @-mentions" and "No. of tweets"; the red line is the mean impact score.]
Feature analysis
- τ3: damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer
- τ4: senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections

[Figures: impact histograms for users with high (H) values of each topic; the red line is the mean impact score.]
Conclusions
User impact is highly predictable:

- user behaviour is very informative
- "tips" for improving your impact

The GP framework is suitable:

- non-linear modelling
- ARD feature selection
- sparse GPs allow large scale experiments
- empirical improvements over linear models & SVR
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Case study: Temporal patterns of words
Categorising temporal patterns of hashtags in Twitter:

- collect hashtag normalised frequency time series for months
- use models learnt on past frequencies to forecast future frequencies
- identify and group similar temporal patterns
- emphasise periodicities in word frequencies

An instance of a forecasting problem:

- emphasis on forecasting (extrapolation)
- different effects modelled by specific kernels

See Preotiuc-Pietro and Cohn (2013), EMNLP.
Model selection
Although parameter free, we still need to specify for a GP:

- the kernel parameters, a.k.a. hyperparameters θ
- the kernel definition H_i ∈ H

Training a GP = selecting the kernel and its parameters.

This can use only training data (no validation set): use the marginal likelihood to optimise the hyperparameters, and use this value to select the kernel.
Identifying temporal patterns in word frequencies
Word/hashtag frequencies in Twitter:

- are very time dependent
- many 'live' only for hours, reflecting timely events or memes
- some hashtags are constant over time
- some experience bursts at regular time intervals
- some follow human activity cycles

Can we automatically forecast future hashtag frequencies?

Can we automatically categorise temporal patterns?
Twitter hashtag temporal patterns
Regression task:

- Extrapolation: forecast future frequencies
- using the predictive mean

Dataset:

- two months of the Twitter Gardenhose (10% sample)
- first month for training, second month for testing
- 1176 hashtags occurring in both splits
- ~6.5 million tweets
- 5456 tweets/hashtag
Kernels
The kernel:

- induces the covariance in the response between pairs of data points
- encodes our prior belief about the type of function we aim to learn
- for extrapolation, kernel choice is paramount
- different kernels are suitable for specific categories of temporal patterns: isotropic, smooth, periodic, non-stationary, etc.
Kernels
[Figure: gold frequency series for #goodmorning (3-17 January), with fits from the Const, Linear, SE, Per(168) and PS(168) kernels.]

#goodmorning    Const    Linear    SE       Per      PS
NLML            -41      -34       -176     -180     -192
NRMSE           0.213    0.214     0.262    0.119    0.107

Lower is better.

Use Bayesian model selection techniques to choose between kernels.
Kernels: Constant
    k_C(x, x') = c

- constant relationship between outputs
- the predictive mean is the value c
- assumes the signal is modelled by Gaussian noise centred around the value c

[Figure: constant predictions at c = 0.6 and c = 0.4.]
Kernels: Squared exponential
    k_{SE}(x, x') = s^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right)

- smooth transitions between neighbouring points
- best describes time series with a smooth shape, e.g. a uni-modal burst with a steady decrease
- the predictive variance increases exponentially with distance

[Figure: sample functions for (l=1, s=1), (l=10, s=1), (l=10, s=0.5).]
Kernels: Linear
    k_{Lin}(x, x') = \frac{|x \cdot x'| + 1}{s^2}

- a non-stationary kernel: the covariance depends on the data point values, not only on their difference |t - t'|
- equivalent to Bayesian linear regression with N(0, 1) priors on the regression weights and a N(0, s^2) prior on the bias

[Figure: sample linear fits for s = 50 and s = 70.]
Kernels: Periodic
    k_{PER}(x, x') = s^2 \exp\left( -\frac{2 \sin^2( 2\pi (x - x') / p )}{l^2} \right)

- s and l are characteristic length-scales
- p is the period (the distance between consecutive peaks)
- best describes periodic patterns that oscillate smoothly between high and low values

[Figure: sample periodic functions for (l=1, p=50) and (l=0.75, p=25).]
Kernels: Periodic Spikes
    k_{PS}(x, x') = \cos\left( \sin\left( \frac{2\pi (x - x')}{p} \right) \right) \cdot \exp\left( s \cos\left( \frac{2\pi (x - x')}{p} \right) - s \right)

- p is the period
- s is a shape parameter controlling the width of the spike
- best describes time series with constant low values, followed by abrupt periodic rises

[Figure: sample functions for s = 1, s = 5 and s = 50.]
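A rough sketch of the kernel selection procedure described above, using GPy's built-in kernels (the Periodic Spikes kernel is not built in and would need a custom kernel class); the time indices and frequencies are hypothetical placeholders:

    import numpy as np
    import GPy

    # Hypothetical placeholders: hourly time indices and normalised hashtag frequencies
    X = np.arange(24 * 28, dtype=float)[:, None]
    y = np.random.rand(24 * 28, 1)

    candidates = {
        'Const':  GPy.kern.Bias(1),
        'Linear': GPy.kern.Linear(1) + GPy.kern.Bias(1),
        'SE':     GPy.kern.RBF(1),
        'PER':    GPy.kern.StdPeriodic(1),
    }

    results = {}
    for name, kern in candidates.items():
        m = GPy.models.GPRegression(X, y, kern)
        m.optimize()                              # Type II ML over the hyperparameters
        results[name] = m.log_likelihood()        # log marginal likelihood

    best = max(results, key=results.get)          # kernel with the highest evidence
    print(results, best)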
Results: Examples

[Figures: example hashtag series and forecasts for #fyi (Const), #fail (Per), #snow (SE), #raw (PS).]
Results: Categories
Const           SE                      PER            PS
#funny          #2011                   #brb           #ff
#lego           #backintheday           #coffee        #followfriday
#likeaboss      #confessionhour         #facebook      #goodnight
#money          #februarywish           #facepalm      #jobs
#nbd            #haiti                  #fail          #news
#nf             #makeachange            #love          #nowplaying
#notetoself     #questionsidontlike     #rock          #tgif
#priorities     #savelibraries          #running       #twitterafterdark
#social         #snow                   #xbox          #twitteroff
#true           #snowday                #youtube       #ww
49              268                     493            366
Results: Forecasting
[Figure: forecasting performance comparison for the Lag+, GP-Lin, GP-SE, GP-PER, GP-PS and GP+ models.]
Application: Text classification
Task:

- assign the hashtag of a given tweet based on its text

Methods:

- Most frequent (MF)
- Naive Bayes model with empirical prior (NB-E)
- Naive Bayes with GP forecast as prior (NB-P)

             MF        NB-E      NB-P
Match@1      7.28%     16.04%    17.39%
Match@5      19.90%    29.51%    31.91%
Match@50     44.92%    59.17%    60.85%
MRR          0.144     0.237     0.252

Higher is better.
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Case study 1: MT Quality Estimation
Manual assessment of translation quality given the source and translated texts.

'Quality' can be measured in many ways (see Specia et al. (2009)):

- subjective scoring (1-5) for fluency, adequacy, perceived effort to correct
- post-editing effort: HTER or time taken
- binary judgements, ranking, . . .

Human judgements are highly subjective, biased and noisy:

- typing speed
- experience levels
- expectations from MT

See Cohn and Specia (2013), ACL.
General annotation problem
MT Quality Estimation is an instance of a general annotation problem:

- we can't rely on a single individual
- but many annotators produce different results
- how can we resolve conflicts?

Previous work in MT QE:

- averaged responses from several annotators (Callison-Burch et al., 2012); or
- used the output of a single annotator (Koponen et al., 2012)

It is more appropriate to model this as a multi-task problem.
Multi-task learning
Multi-task learning:

- a form of transfer learning
- several related tasks sharing the same input data representation
- learn the types and extent of correlations between tasks

Compared to domain adaptation:

- tasks need not be identical (even regression vs. classification)
- no explicit 'target' domain
- several sources of variation besides domain
- no assumptions of data asymmetry
Multi-task learning for MT Quality Estimation
Modelling individual annotators:

- each brings their own biases
- but their decisions correlate with others' annotations
- we could even find clusters of common behaviour

Here we use multi-output GP regression:

- joint inference over several translators
- learn the degree of inter-task transfer
- learn per-translator noise
- incorporate task meta-data
Review: GP Regression
    f \sim GP(0, k_\theta)
    y_i \sim N(f(x_i), \sigma^2)

[Graphical model: x and the latent f (with kernel hyperparameters \theta) generate y, with noise \sigma, repeated over the N data points.]
Multi-task GP Regression
    f \sim GP(0, k_{B,\theta})
    y_{im} \sim N(f_m(x_i), \sigma_m^2)

[Graphical model: as before, but the kernel now combines a coregionalisation matrix B over the M tasks with the data kernel hyperparameters \theta, and each task m has its own noise \sigma_m.]

See Alvarez et al. (2011).
Multi-task Covariance Kernels
Represent the data as (x, t, y) tuples, where t is a task identifier. Define a separable covariance kernel,

    K(x, x')_{t,t'} = B_{t,t'} \, k_\theta(x, x') + \text{noise}

- effectively each input is augmented with t, indexing the task of interest
- the coregionalisation matrix B ∈ R^{M×M} weights the inter-task covariance
- the data kernel k_\theta takes data points x as input, e.g., squared exponential
Coregionalisation Kernels
Generally B can be any symmetric positive semi-definite matrix. Some interesting choices:

- B = I encodes independent learning
- B = 1 encodes pooled learning
- interpolating the above
- full rank B = W W^\top, or low rank variants

Known as the Intrinsic Model of Coregionalisation (IMC).

See Alvarez et al. (2011); Bonilla et al. (2008).
Stacking and Kronecker products

The response variables form a matrix Y ∈ R^{N×M}.

Represent the data in 'stacked' form: X repeats the N inputs once per task,

    X = [x_1; x_2; ...; x_N; ...; x_1; x_2; ...; x_N]

and y stacks the responses task by task,

    y = [y_{11}; y_{21}; ...; y_{N1}; ...; y_{1M}; y_{2M}; ...; y_{NM}]

The kernel is then a Kronecker product, K(X, X) = B ⊗ k_data(X_o, X_o).
Kronecker product

    [a b; c d] ⊗ K = [aK bK; cK dK]

[Figure: Kronecker product of a 2x2 coregionalisation matrix with a data kernel matrix, shown as images.]
Choices for B: Independent learning

    B = I

Choices for B: Pooled learning

    B = 1

Choices for B: Interpolating independent and pooled learning

    B = 1 + \alpha I

Choices for B: Interpolating independent and pooled learning II

    B = 1 + diag(\alpha)

[Figures: each choice of B shown as an image, Kronecker-multiplied with the data kernel; see the numpy sketch below.]
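A minimal numpy sketch (not from the slides) of the separable multi-task kernel and the choices of B above, reusing the sq_exp_kernel from the earlier sketch:

    import numpy as np

    def multitask_kernel(X, B, data_kernel):
        # Separable multi-task kernel over stacked data: K = B kron k_data(X, X)
        return np.kron(B, data_kernel(X, X))

    M = 3                                           # number of tasks / annotators
    B_independent = np.eye(M)                       # B = I
    B_pooled = np.ones((M, M))                      # B = 1
    B_interp = np.ones((M, M)) + 0.5 * np.eye(M)    # B = 1 + alpha * I, alpha = 0.5

    # K = multitask_kernel(x_train, B_interp, sq_exp_kernel)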
Compared to Daume III (2007)
A feature augmentation approach to multi-task learning. It uses horizontal data stacking:

    X = \begin{bmatrix} X^{(1)} & X^{(1)} & 0 \\ X^{(2)} & 0 & X^{(2)} \end{bmatrix}    y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \end{bmatrix}

where (X^{(i)}, y^{(i)}) are the training data for task i. This expands the feature space by a factor of M.

It is equivalent to a multitask kernel:

    k(x, x')_{t,t'} = (1 + \delta(t, t')) \, x^\top x'
    K(X, X) = (1 + I) ⊗ k_{linear}(X, X)

⇒ a specific choice of B with a linear data kernel.
Compared to Evgeniou et al. (2006)
In the regularisation setting, Evgeniou et al. (2006) show that the kernel

    K(x, x')_{t,t'} = (1 - \lambda + \lambda M \delta(t, t')) \, x^\top x'

is equivalent to a linear model with regularisation term

    J(\Theta) = \frac{1}{M} \left( \sum_t ||\theta_t||^2 + \frac{1 - \lambda}{\lambda} \sum_t \Big|\Big| \theta_t - \frac{1}{M} \sum_{t'} \theta_{t'} \Big|\Big|^2 \right)

This regularises each task's parameters \theta_t towards the mean parameters over all tasks, \frac{1}{M} \sum_{t'} \theta_{t'}.

A form of the interpolation method from before.
ICM samples

Intrinsic Coregionalisation Model:

    K(X, X) = w w^\top ⊗ k(X, X),    w = [1; 5],    B = w w^\top = [1 5; 5 25]

[Figures: samples from the ICM prior; with a rank-1 B the second output is a scaled version of the first, with 25 times the variance.]

ICM samples

    K(X, X) = B ⊗ k(X, X),    B = [1 0.5; 0.5 1.5]

[Figures: samples from the ICM prior; the two outputs are positively correlated but no longer simple rescalings of one another.]
Experimental setup
Quality Estimation data:

- 2k examples of source sentence and MT output
- measuring subjective post-editing effort (1-5): WMT12
- post-editing time per word, in log seconds: WPTP12
- 17 dense features extracted using the Quest toolkit (Specia et al., 2013)
- using the official train/test split, or random assignment

Gaussian Process models:

- squared exponential data kernel (RBF)
- hyperparameter values trained using Type II MLE
- consider simple interpolation coregionalisation kernels
- include per-task noise or globally tied noise
Results: WMT12 RMSE for 1-5 ratings
[Figure: RMSE (roughly 0.70-0.82) against the number of training examples (50-500) for the STL, MTL and Pooled models.]
Incorporating layers of task metadata
[Figure: the coregionalisation kernel is built from Kronecker factors, one per metadata layer: Annotator ⊗ System ⊗ Source senTence.]
Results: WPTP12 RMSE post-editing time
Model             RMSE
constant Ann      0.637
Ind SVM Ann       0.635
Pooled GP         0.628
MTL GP Ann        0.617
MTL GP Sys        0.748
MTL GP senT       0.741
MTL GP A+S        0.600
MTL GP A+S+T      0.585

Lower is better.
Case Study 2: Emotion Analysis
- automatically detect emotions in a text
- fine-grained scores per emotion
- presence of anti-correlations (e.g. sadness and joy)

Headline                                         Fear    Joy    Sadness
Storms kill, knock out power, cancel flights     82      0      60
Panda cub makes her debut                        0       59     0

See Beck et al. (2014), EMNLP.
Modelling anti-correlations
- The coregionalisation settings used in Quality Estimation are not suitable for this problem: they assume positive covariances between tasks.
- We need a parameterisation that allows B to have negative covariances: we use the incomplete-Cholesky decomposition,

    B = W W^\top + diag(\alpha)
Incomplete-Cholesky model

    W \cdot W^\top + diag(\alpha) = B

where W is an M × r matrix (here M = 6 emotions) and \alpha an M-vector of per-task variances (see the numpy sketch below):

- rank r = 1: 12 hyperparameters
- rank r = 2: 18 hyperparameters
- rank r = 3: 24 hyperparameters
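A minimal numpy sketch (not the paper's code) of constructing such a B, showing that its off-diagonal entries can be negative and so can capture anti-correlated emotions:

    import numpy as np

    M, rank = 6, 1                               # 6 emotions, rank-1 factor
    rng = np.random.default_rng(0)
    W = rng.standard_normal((M, rank))           # M * rank entries
    alpha = np.abs(rng.standard_normal(M))       # plus M per-task variances -> 12 hyperparameters

    B = W @ W.T + np.diag(alpha)                 # symmetric positive semi-definite
    print(B)                                     # off-diagonals W_i * W_j may be negative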
Experimental Setup
- Dataset: SemEval-2007 "Affective Text" (Strapparava and Mihalcea, 2007)
- 1000 news headlines, each annotated with 6 scores [0-100], one per emotion
- 100 sentences for training, 900 for testing
- bag-of-words representation as features
Learned Task Covariances
Prediction Results
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
Recap: GP Regression

Observations y_i are a noisy version of a latent process f:

    y_i = f(x_i) + \epsilon_i,    with \epsilon_i \sim N(0, \sigma^2)
Likelihood models
There is an analytic solution for a Gaussian likelihood (i.e. Gaussian noise):

    Gaussian (process) prior × Gaussian likelihood = Gaussian posterior

But what about other likelihoods?

- Counts: y ∈ N
- Classification: y ∈ {C_1, C_2, ..., C_k}
- Ordinal regression (ranking): C_1 < C_2 < ... < C_k
- . . .
Classification
Binary classification, y_i ∈ {0, 1}.

Two popular choices for the likelihood:

- Logistic sigmoid: p(y_i = 1 | f_i) = \sigma(f_i) = \frac{1}{1 + \exp(-f_i)}
- Probit function: p(y_i = 1 | f_i) = \Phi(f_i) = \int_{-\infty}^{f_i} N(z | 0, 1) \, dz

Both "squash" the input from (-∞, ∞) into the range [0, 1].
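As an illustration (not from the slides), a tiny numpy sketch of squashing a latent GP sample through the logistic sigmoid:

    import numpy as np

    def sq_exp(x1, x2, l=1.0):
        return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l ** 2)

    def sigmoid(f):
        return 1.0 / (1.0 + np.exp(-f))

    x = np.linspace(-5, 5, 200)
    K = sq_exp(x, x) + 1e-8 * np.eye(len(x))
    f = np.random.multivariate_normal(np.zeros(len(x)), K)   # latent function sample
    pi = sigmoid(f)                                           # p(y = 1 | f(x)), values in (0, 1)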
Squashing function

Pass the latent function through the logistic function to obtain a probability, \pi(x) = p(y_i = 1 | f_i).

[Figure 3.2 of Rasmussen and Williams (2006): panel (a) shows a sample latent function f(x) drawn from a Gaussian process; panel (b) shows the result of squashing this sample through the logistic function \sigma(z) = (1 + \exp(-z))^{-1} to obtain the class probability \pi(x) = \sigma(f(x)).]
Inference Challenges: for a test case x_*

Distribution over the latent function:

    p(f_* | X, y, x_*) = \int p(f_* | X, x_*, f) \, \underbrace{p(f | X, y)}_{\text{posterior}} \, df

Distribution over the classification output:

    p(y_* = 1 | X, y, x_*) = \int \sigma(f_*) \, p(f_* | X, y, x_*) \, df_*

Problem: the likelihood is no longer conjugate with the prior, so there is no analytic solution.
Approximate inference
Several inference techniques have been proposed for non-conjugate likelihoods:

- Laplace approximation (Williams and Barber, 1998)
- Expectation propagation (Minka, 2001)
- Variational inference (Gibbs and MacKay, 2000)
- MCMC (Neal, 1999)

And more, including sparse approaches for large scale application.
Laplace approximation
Approximate the non-Gaussian posterior by a Gaussian, centred at the mode.
Figure from Rogers and Girolami (2012)
Laplace approximation
Log posterior:

    \Psi(f) = \log p(y | f) + \log p(f | X) + \text{const}

Find the posterior mode \hat{f}, i.e., MAP estimation, O(n^3).

Then take a second order Taylor series expansion about the mode, and fit a Gaussian

- with mean \mu = \hat{f}
- and covariance \Sigma = (K^{-1} + W)^{-1}, where W = -\nabla\nabla \log p(y | \hat{f})

This allows computation of the posterior and marginal likelihood, but predictions may still be intractable.
Expectation propagation
Take the intractable posterior:

    p(f | X, y) = \frac{1}{Z} \, p(f | X) \prod_{i=1}^{n} p(y_i | f_i)

    Z = \int p(f | X) \prod_{i=1}^{n} p(y_i | f_i) \, df

Approximate it with a fully factorised distribution:

    q(f | X, y) = \frac{1}{Z_{EP}} \, p(f | X) \prod_{i=1}^{n} \tilde{t}(f_i)
Expectation propagation
The approximate posterior is defined as

    q(f | y) = \frac{1}{Z_{EP}} \, p(f) \prod_{i=1}^{n} \tilde{t}(f_i)

where each component is assumed to be Gaussian:

- p(y_i | f_i) \approx \tilde{t}_i(f_i) = \tilde{Z}_i \, N(f_i | \tilde{\mu}_i, \tilde{\sigma}_i^2)
- p(f | X) = N(f | 0, K_{nn})

This results in a Gaussian formulation for q(f | y):

- allows tractable multiplication and division with Gaussians
- and marginalisation, expectations, etc.
Expectation propagation
The EP algorithm aims to fit the \tilde{t}_i(f_i) to the posterior, starting with a guess for q, then iteratively refining as follows:

- minimise the KL divergence between the true posterior for f_i and the approximation \tilde{t}_i:

    \min_{\tilde{t}_i} \; KL\left( p(y_i | f_i) \, q_{-i}(f_i) \;\|\; \tilde{t}_i(f_i) \, q_{-i}(f_i) \right)

  where q_{-i}(f_i) is the cavity distribution formed by marginalising q(f) over f_j, j ≠ i, then dividing by \tilde{t}_i(f_i).

- key idea: we only need an accurate approximation for globally feasible f_i
- match moments to update \tilde{t}_i, then update q(f)
Expectation propagation: site approximation example

[Figure: site approximation example.]
Expectation propagation
There is no proof of convergence:

- but empirically it works well
- often more accurate than the Laplace approximation

Formulated for many different likelihoods:

- complexity O(n^3), dominated by matrix inversion
- sparse EP approximations can reduce this to O(nm^2)

See Minka (2001) and Rasmussen and Williams (2006) for further details.
Multi-class classification
Consider multi-class classification, y ∈ {C_1, C_2, ..., C_k}.

Draw a vector of k latent function values for each input:

    f = \left( f_1^1, \ldots, f_n^1, \; f_1^2, \ldots, f_n^2, \; \ldots, \; f_1^k, \ldots, f_n^k \right)

Formulate the classification probability using the soft-max (see the sketch below):

    p(y_i = c | \mathbf{f}_i) = \frac{\exp(f_i^c)}{\sum_{c'} \exp(f_i^{c'})}
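A tiny numpy sketch (not from the slides) of the soft-max over one input's vector of k latent values:

    import numpy as np

    def softmax_class_probs(f_i):
        # p(y_i = c | f_i) for f_i = (f_i^1, ..., f_i^k)
        e = np.exp(f_i - np.max(f_i))    # subtract the max for numerical stability
        return e / e.sum()

    print(softmax_class_probs(np.array([2.0, 0.5, -1.0])))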
Multi-class classification
Assume the k latent processes are uncorrelated, leading to the prior covariance f ∼ N(0, K), where

    K = blockdiag(K_1, K_2, \ldots, K_k)

is block diagonal of size kn × kn, with each K_j of size n × n.

Various approximation methods exist for inference, e.g., Laplace (Williams and Barber, 1998), EP (Kim and Ghahramani, 2006), MCMC (Neal, 1999).
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
GPs for Structured Prediction
GPSC (Altun et al., 2004):

- defines a likelihood over label sequences, p(y|x), with the latent variable over full sequences y
- HMM-inspired kernel, combining features from each observed symbol x_i and label pairs
- MAP inference for the hidden function values f, and a sparsification trick for tractable inference

GPstruct (Bratieres et al., 2013):

- the base model is a CRF:

    p(y | x, f) = \frac{\exp\left( \sum_c f(c, x_c, y_c) \right)}{\sum_{y' \in Y} \exp\left( \sum_c f(c, x_c, y'_c) \right)}

- assumes that each potential f(c, x_c, y_c) is drawn from a GP
- Bayesian inference using MCMC (Murray et al., 2010)
Outline

GP fundamentals

NLP Applications
  Sparse GPs: Characterising user impact
  Model selection and Kernels: Identifying temporal patterns in word frequencies
  Multi-task learning with GPs: Machine Translation evaluation & Sentiment analysis

Advanced Topics
  Classification
  Structured prediction
  Structured kernels
String Kernels
    k(x, x') = \sum_{s \in \Sigma^*} w_s \, \phi_s(x) \, \phi_s(x')

- \phi_s(x): count of substring s inside x
- 0 ≤ w_s ≤ 1: weight of substring s
- s can also be a subsequence (containing gaps)
- s = character sequences → n-gram kernels (Lodhi et al., 2002), useful for stems; e.g. k(bar, bat) = 3 (shared: b, a, ba) — see the sketch below
- s = word sequences → Word Sequence kernels (Cancedda et al., 2003); e.g. k(gas only injection, gas assisted plastic injection) = 3
- soft matching: k(battle, battles) ≠ 0, k(battle, combat) ≠ 0
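A minimal Python sketch (not from the slides) of an n-gram string kernel with uniform weights, reproducing k(bar, bat) = 3:

    def ngram_features(s, n_max=2):
        # Count all contiguous substrings of s up to length n_max
        counts = {}
        for n in range(1, n_max + 1):
            for i in range(len(s) - n + 1):
                sub = s[i:i + n]
                counts[sub] = counts.get(sub, 0) + 1
        return counts

    def ngram_kernel(x, x2, n_max=2):
        # k(x, x') = sum_s phi_s(x) * phi_s(x') with uniform weights w_s = 1
        fx, fx2 = ngram_features(x, n_max), ngram_features(x2, n_max)
        return sum(c * fx2.get(s, 0) for s, c in fx.items())

    print(ngram_kernel("bar", "bat"))   # shared substrings b, a, ba -> 3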
Tree Kernels

- Subset Tree Kernels (Collins and Duffy, 2001)

  [Figure: two parse trees sharing fragments, e.g. (S (NP Mary) (VP loves John)) and (S (NP Mary) (VP had dinner)).]

- Partial Tree Kernels (Moschitti, 2006): allow "broken" rules, useful for dependency trees
- Soft matching can also be applied.
Conclusions
Gaussian Processes are a powerful framework:

- elegant kernel machines
- probabilistic, quantifying uncertainty
- closed form Bayesian inference & model selection

Mature enough for widespread application:

- frontier challenges in NLP
- scaling improvements make these applications practical
References I

Altun, Y., Hofmann, T., and Smola, A. J. (2004). Gaussian Process Classification for Segmenting and Annotating Sequences. In Proceedings of ICML, page 8, New York, New York, USA. ACM Press.
Alvarez, M. A., Rosasco, L., and Lawrence, N. D. (2011). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266.
Beck, D., Cohn, T., and Specia, L. (2014). Joint Emotion Analysis via Multi-task Gaussian Processes. In Proceedings of EMNLP.
Bonilla, E., Chai, K. M., and Williams, C. (2008). Multi-task Gaussian process prediction. NIPS.
Bratieres, S., Quadrianto, N., and Ghahramani, Z. (2013). Bayesian Structured Prediction using Gaussian Processes. arXiv:1307.3846, pages 1-17.
Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012). Findings of the 2012 workshop on statistical machine translation.
Cancedda, N., Gaussier, E., Goutte, C., and Renders, J.-M. (2003). Word-Sequence Kernels. The Journal of Machine Learning Research, 3:1059-1082.
Cohn, T. and Specia, L. (2013). Modelling annotator bias with multi-task Gaussian processes: An application to machine translation quality estimation. ACL.
Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural Language. In Advances in Neural Information Processing Systems.
Daume III, H. (2007). Frustratingly easy domain adaptation. ACL.

References II

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2006). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(1):615.
Gibbs, M. N. and MacKay, D. J. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458-1464.
Kim, H.-C. and Ghahramani, Z. (2006). Bayesian Gaussian process classification with the EM-EP algorithm. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):1948-1959.
Koponen, M., Aziz, W., Ramos, L., and Specia, L. (2012). Post-editing time as a measure of cognitive effort. AMTA 2012 Workshop on Post-editing Technology and Practice.
Lampos, V., Aletras, N., Preotiuc-Pietro, D., and Cohn, T. (2014). Predicting and Characterising User Impact on Twitter. EACL.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2002). Text Classification using String Kernels. The Journal of Machine Learning Research, 2:419-444.
Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362-369. Morgan Kaufmann Publishers Inc.
Moschitti, A. (2006). Making Tree Kernels practical for Natural Language Learning. In EACL, pages 113-120.
Murray, I., Adams, R. P., and MacKay, D. (2010). Elliptical slice sampling. In International Conference on Artificial Intelligence and Statistics, pages 541-548.

References III

Neal, R. (1999). Regression and classification using Gaussian process priors. Bayesian Statistics, 6.
Preotiuc-Pietro, D. and Cohn, T. (2013). A temporal model of text periodicities using Gaussian Processes. EMNLP.
Quinonero Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959.
Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, MA.
Rogers, S. and Girolami, M. (2012). A First Course in Machine Learning. Chapman & Hall/CRC.
Specia, L., Shah, K., De Souza, J. G., and Cohn, T. (2013). QuEst - a translation quality estimation framework. Citeseer.
Specia, L., Turchi, M., Cancedda, N., Dymetman, M., and Cristianini, N. (2009). Estimating the Sentence-Level Quality of Machine Translation Systems. Pages 28-35, Barcelona, Spain.
Strapparava, C. and Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text. In Proceedings of SEMEVAL.
Strapparava, C. and Mihalcea, R. (2008). Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing.
Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(12):1342-1351.