Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing
Genevieve Gorrell
5th June 2007
Introduction
Think of datapoints plotted in hyperspace. Imagine a space in which each word has its own dimension:
              big bad
"big big bad" [ 2 1 ]
"big bad"     [ 1 1 ]
"bad"         [ 0 1 ]
We can compare these passages using vector representations in this space.
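As a minimal sketch (not from the slides), one common way to make that comparison is cosine similarity over the count vectors above; the numbers are the same as in the toy example:

import numpy as np

# One dimension per word: [count of "big", count of "bad"]
passages = {
    "big big bad": np.array([2.0, 1.0]),
    "big bad":     np.array([1.0, 1.0]),
    "bad":         np.array([0.0, 1.0]),
}

def cosine(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(passages["big big bad"], passages["big bad"]))  # very similar
print(cosine(passages["big bad"], passages["bad"]))          # less similar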
[Figure: the three passages plotted against an axis of bigness and an axis of badness]
Dimensionality Reduction
Do we really need two dimensions to describe the relationship between these datapoints?
[Figure: the same datapoints on the axes of bigness and badness]
Rotation
Imagine the data look like this ...
More Rotation
Or even like this ... because if these were the dimensions, we would know which were the most important.
We could describe as much of the data as possible using a smaller number of dimensions: approximation, compression, generalisation.
Eigen Decomposition
The key lies in rotating the data into the most efficient orientation. Eigen decomposition will give us a set of axes (eigenvectors) of a new space in which our data might more efficiently be represented.
Eigen Decomposition
Eigen decomposition is a vector space technique that provides a useful way to automatically reduce data dimensionality. This technique is of interest in natural language processing.
Latent Semantic Indexing
Given a dataset in a given space, eigen decomposition can be used to create a nearest approximation in a space with fewer dimensions.
For example, document vectors as bags of words in a space with one dimension per word can be mapped to a space with fewer dimensions than one per word
Mv = λv
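As an illustrative sketch (not from the slides), the same idea on the toy passage data: numpy's eigh solves Mv = λv for a symmetric M, and projecting onto the leading eigenvector gives the reduced representation:

import numpy as np

# Passage-by-word counts from the earlier example (rows: passages, columns: [big, bad])
A = np.array([[2.0, 1.0],
              [1.0, 1.0],
              [0.0, 1.0]])

M = A.T @ A                      # symmetric word co-occurrence matrix
vals, vecs = np.linalg.eigh(M)   # solves M v = lambda v
order = np.argsort(vals)[::-1]   # largest eigenvalue first
vals, vecs = vals[order], vecs[:, order]

reduced = A @ vecs[:, 0]         # each passage described by one coordinate instead of two
print(vals)
print(reduced)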
A real world example—eigenfaces
Each new dimension captures something important about the data. The original observation can be recreated from a combination of these components.
Eigen Faces 2
Each eigen face captures as much information in the dataset as possible (eigenvectors are orthogonal to each other).
This is much more efficient than the original representation
More Eigen Face Convergence
Eigen faces with high eigenvalues capture important generalisations in the corpus. These generalisations might well apply to unseen data ...
We have been using this in natural language processing ...
Corpus-driven language modelling suffers from problems with data sparsity
We can use eigen decomposition to make generalisations that might apply to unseen data
But language corpora are very large ...
Problems with eigen decomposition
Existing algorithms often:
• require all the data to be available at once (batch processing)
• produce all the component vectors simultaneously, even though they may not all be necessary and it takes longer to compute all of them
• are very computationally expensive, and so may exceed the capabilities of the computer for larger corpora
  - large RAM requirement
  - exponential relationship between time/RAM requirement and dataset size
Generalized Hebbian Algorithm (Sanger 1989)
Based on Hebbian learning
A simple, localised technique for deriving an eigen decomposition
Requires very little memory
Learns from single observations (for example, document vectors) presented serially, so there is no problem adding more data
In fact, the entire matrix need never be simultaneously available
The eigenvectors with the greatest eigenvalues are produced first
GHA Algorithm
c += (c . x) x
c is the eigenvector, x is the training datum

Initialise the eigenvector randomly
While the eigenvector is not converged {
    Dot-product each training vector with the eigenvector
    Multiply the result by the training vector
    Add the resulting vector to the eigenvector
}
Dot-product is a measure of the similarity of direction of one vector with another, and produces a scalar. There are various ways in which one might assess convergence.
GHA Algorithm Continued
Or in other words: train by adding each datum to the eigenvector in proportion to the extent to which it already resembles it. Train subsequent eigenvectors by removing the stronger eigenvectors from the data before training, so that the algorithm doesn't find those ones again.
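A minimal sketch of this in Python; the learning rate, the explicit renormalisation, and the fixed number of passes (standing in for a real convergence test) are my assumptions, not details from the slides:

import numpy as np

def gha(data, n_vectors, passes=200, rate=0.01):
    """Estimate the leading eigenvectors from serially presented data (GHA, simplified)."""
    dim = data.shape[1]
    found = []
    for _ in range(n_vectors):
        c = np.random.randn(dim)                 # initialise the eigenvector randomly
        for _ in range(passes):                  # stand-in for "while not converged"
            for x in data:
                for prev in found:               # remove stronger eigenvectors already found
                    x = x - np.dot(prev, x) * prev
                c += rate * np.dot(c, x) * x     # c += (c . x) x
                c /= np.linalg.norm(c)           # keep the eigenvector at unit length
        found.append(c)
    return np.array(found)

# Toy usage on the "big bad" passage vectors
data = np.array([[2.0, 1.0], [1.0, 1.0], [0.0, 1.0]])
print(gha(data, n_vectors=2))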
GHA as a neural net
[Figure: a single-layer linear network with inputs Input_1 ... Input_n and weights Weight_1 ... Weight_n]
dp = sum over x = 1 ... n of (Input_x * Weight_x)
Weight_i += dp * Input_i, for each input i
• Can be extended to learn many eigenvectors
Singular Value Decomposition
Extends eigen decomposition to paired data
Word co-occurrence
      big  bad
big    5    3
bad    3    3
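These counts are consistent with forming M = AᵀA from the earlier passage-by-word matrix A = [[2,1],[1,1],[0,1]]: big·big = 2·2 + 1·1 + 0·0 = 5, big·bad = 2·1 + 1·1 + 0·1 = 3, and bad·bad = 1 + 1 + 1 = 3 (a reading of the slide's figures, not something it states explicitly).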
Word bigrams ("bad", "big bad", "big big bad")
        big:2  bad:2
big:1     1      2
bad:1     0      0
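For a small matrix like this, a batch SVD shows what the singular vector pairs look like; this uses numpy directly rather than anything from the talk, and assumes the toy bigram counts above:

import numpy as np

# Bigram counts: rows are first words [big, bad], columns are second words [big, bad]
B = np.array([[1.0, 2.0],
              [0.0, 0.0]])

U, S, Vt = np.linalg.svd(B)
print(S)        # singular values, largest first
print(U[:, 0])  # left singular vector, over first-word positions
print(Vt[0])    # right singular vector, over second-word positions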
Asymmetrical GHA (Gorrell 2006)
Extends GHA to asymmetrical datasets, allowing us to work with n-grams, for example
Retains the features of GHA
Asymmetrical GHA Algorithm
ca += (cb.xb) xa
cb += (ca.xa) xb
Train singular vectors on data presented as a series of vector pairs: dot the left training datum with the left singular vector, scale the right training datum by the resulting scalar and add it to the right singular vector, and vice versa.
For example, the first word in a bigram might be vector xa and the second, xb.
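A minimal sketch of this paired update for the first singular vector pair; the learning rate and the per-step normalisation are my assumptions, not details from the slides:

import numpy as np

def agha_first_pair(pairs, dim_a, dim_b, passes=200, rate=0.05):
    """Estimate the first singular vector pair from (xa, xb) training pairs."""
    ca = np.random.randn(dim_a)               # left singular vector (e.g. over first words)
    cb = np.random.randn(dim_b)               # right singular vector (e.g. over second words)
    for _ in range(passes):                   # stand-in for a convergence test
        for xa, xb in pairs:
            ca += rate * np.dot(cb, xb) * xa  # ca += (cb . xb) xa
            cb += rate * np.dot(ca, xa) * xb  # cb += (ca . xa) xb
            ca /= np.linalg.norm(ca)
            cb /= np.linalg.norm(cb)
    return ca, cb

# Toy usage: the bigrams "big big" (once) and "big bad" (twice) as one-hot vector pairs
big, bad = np.array([1.0, 0.0]), np.array([0.0, 1.0])
pairs = [(big, big), (big, bad), (big, bad)]
print(agha_first_pair(pairs, dim_a=2, dim_b=2))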
Asymmetrical GHA Performance (20,000 NL bigrams)
RAM requirement linear with dimensionality and number of singular vectors required
Time per training step linear with dimensionality
This is a big improvement on conventional approaches for larger corpora/dimensionalities ...
But don't forget, the algorithm needs to be allowed to converge
N-Gram Language Model Smoothing
Modelling language as a string of n-grams:
• a highly successful approach
• but we will always have problems with data sparsity
• zero probabilities are bad news
A Zipf Curve
N-gram Language Modelling—An Example Corpus
A man hits the ball at the dog. The man hits the ball at the house. The man takes the dog to the ball. A man takes the ball to the house. The dog takes the ball to the house. The dog takes the ball to the man. The man hits the ball to the dog. The man walks the dog to the house. The man walks the dog. The dog walks to the man. A dog hits a ball. The man walks in the house. The man hits the dog. A ball hits the dog. The man walks. A ball hits. Every ball hits. Every dog walks. Every man walks. A man walks. A small man walks. Every nice dog barks.
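A small sketch of how a normalised bigram matrix like the one on the next slide can be built; the assumption that bigrams are counted within sentences and divided by the total bigram count matches the figures shown, but is my reading rather than something the slides state:

import re
from collections import Counter

corpus = (
    "A man hits the ball at the dog. The man hits the ball at the house. "
    "The man takes the dog to the ball. A man takes the ball to the house. "
    "The dog takes the ball to the house. The dog takes the ball to the man. "
    "The man hits the ball to the dog. The man walks the dog to the house. "
    "The man walks the dog. The dog walks to the man. A dog hits a ball. "
    "The man walks in the house. The man hits the dog. A ball hits the dog. "
    "The man walks. A ball hits. Every ball hits. Every dog walks. "
    "Every man walks. A man walks. A small man walks. Every nice dog barks."
)

bigrams = Counter()
for sentence in corpus.lower().split("."):
    words = re.findall(r"[a-z]+", sentence)
    bigrams.update(zip(words, words[1:]))   # adjacent word pairs within one sentence

total = sum(bigrams.values())
matrix = {pair: count / total for pair, count in bigrams.items()}
print(matrix[("the", "man")])               # about 0.1, as in the slide's matrix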
An Example Corpus as Normalised Bigram Matrix
(rows are first words, columns are second words)
       man hits the ball at dog house takes to walks a in small nice barks
a      0.03 0.0 0.0 0.03 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0
man    0.0 0.04 0.0 0.0 0.0 0.0 0.0 0.02 0.0 0.07 0.0 0.0 0.0 0.0 0.0
hits   0.0 0.0 0.05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0
the    0.1 0.0 0.0 0.07 0.0 0.1 0.05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ball   0.0 0.03 0.0 0.0 0.02 0.0 0.0 0.0 0.04 0.0 0.0 0.0 0.0 0.0 0.0
at     0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
takes  0.0 0.0 0.04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
dog    0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.02 0.02 0.02 0.0 0.0 0.0 0.0 0.01
to     0.0 0.0 0.07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
walks  0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.01 0.0 0.0 0.0
in     0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
every  0.01 0.0 0.0 0.01 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0
small  0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
nice   0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
First Singular Vector Pair
       man hits the ball at dog house takes to walks a in small nice barks
a      0.02 0.00 0.00 0.02 0.00 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.10 0.00 0.00 0.07 0.00 0.10 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.01 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Second Singular Vector Pair
       man hits the ball at dog house takes to walks a in small nice barks
a      0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Third Singular Vector Pair
       man hits the ball at dog house takes to walks a in small nice barks
a      0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.04 0.00 0.00 0.01 0.00 0.00 0.02 0.02 0.06 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Language Models from Eigen N-Grams
Add k singular vector pairs (“eigen n-grams”) together
Remove all the negative cell values
Normalise row-wise to get probabilities
Include a smoothing approach to remove zeros
       man hits the ball at dog house takes to walks a in small nice barks
a      0.02 0.00 0.00 0.02 0.00 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.04 0.00 0.00 0.01 0.00 0.00 0.02 0.02 0.06 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.10 0.00 0.00 0.07 0.00 0.10 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.01 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
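A sketch of the four steps above in Python, applied to the singular vector pairs of a normalised bigram matrix; the add-epsilon step is only a placeholder for whatever smoothing approach is used, and it is applied before normalisation so that the rows still sum to one:

import numpy as np

def svdlm(B, k, epsilon=1e-3):
    """Turn the first k singular vector pairs of bigram matrix B into row-wise probabilities."""
    U, S, Vt = np.linalg.svd(B)
    # 1. Add k singular vector pairs ("eigen n-grams") together
    R = sum(S[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
    # 2. Remove all the negative cell values
    R = np.clip(R, 0.0, None)
    # 3. Smooth away the remaining zeros (placeholder: add a small constant)
    R = R + epsilon
    # 4. Normalise row-wise to get P(second word | first word)
    return R / R.sum(axis=1, keepdims=True)

In the experiments that follow, k and the smoothing would be tuned rather than fixed like this.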
What do we hope to see?
The theory is that the reduced dimensionality representation describes the unseen test corpus better than the original representation.
As k increases, perplexity should decrease until the optimum is reached.
Perplexity should then begin to increase as the optimum is passed and too much data is included.
We hope for a U-shaped curve.
Some results ...
Perplexity is a measure of the quality of the language model
k is number of dimensions (eigen n-grams)
Times are how long it took to calculate the dimensions
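For reference (assuming the standard definition, which the slides do not spell out), for a bigram model evaluated on an unseen test corpus of N words:

perplexity = 2 ^ ( -(1/N) * sum over i of log2 P(w_i | w_i-1) )

Lower perplexity means the model is less surprised by the test corpus, so lower is better.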
Some specifics about this experiment
The corpus comprises five newsgroups from CMU's newsgroup corpus
The training corpus contains over a million items
The unseen test corpus comprises over 100,000 items
I used AGHA to calculate the decomposition
I used simple heuristically-chosen smoothing constants and single-order language models
Maybe k is too low?
200,000 trigrams, LAS2 algorithm
Full rank decomposition
20,000 bigrams
Furthermore, perplexity in each case never reaches the baseline perplexity of the original n-gram model
Linear interpolation may generate an interesting result
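Presumably this is the standard combination of the two models (my reading; the slides do not say exactly how the "Weight" column in the tables below maps onto the interpolation weight):

P_combined(w | h) = λ · P_SVDLM(w | h) + (1 − λ) · P_n-gram(w | h),  with 0 ≤ λ ≤ 1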
k   Weight  SVDLM perp.   N-gram perp.  Comb. perp.
25  1       7.071990e+02  4.647004e+02  3.891952e+02
10  1       8.884950e+02  4.647004e+02  3.695157e+02
10  0.7     8.884950e+02  4.647004e+02  3.705559e+02
5   1       1.156845e+03  4.647004e+02  3.788119e+02
The best result is 370: an overall improvement of 20% is demonstrated (however, this involved tuning on the test corpus)
200,000 Trigram Corpus
k    Weight  SVDLM perp.   N-gram perp.  Comb. perp.
100  1       1.003399e+03  4.057236e+02  3.196404e+02
50   1       1.220449e+03  4.057236e+02  3.008804e+02
25   1       1.508873e+03  4.057236e+02  2.834632e+02
10   1       2.188041e+03  4.057236e+02  2.898518e+02
Improvement on the baseline n-gram is even greater on the medium-sized corpus (30%)
1 Million Trigram Corpus
k   Weight    SVDLM perp.   N-gram perp.  Comb. perp.
25  1         4.237069e+04  3.730947e+02  3.729931e+02
25  2         4.237069e+04  3.730947e+02  3.728907e+02
25  10        4.237069e+04  3.730947e+02  3.721338e+02
25  100       4.237069e+04  3.730947e+02  3.663666e+02
25  1000      4.237069e+04  3.730947e+02  3.442525e+02
25  10000     4.237069e+04  3.730947e+02  2.980755e+02
25  100000    4.237069e+04  3.730947e+02  2.422045e+02
25  1000000   4.237069e+04  3.730947e+02  2.187968e+02
25  10000000  4.237069e+04  3.730947e+02  2.741027e+02
This is a big dataset for SVD! Needed to increase the weighting on the SVDLM a lot to get a good result
Fine-Tuning k
k   Weight   SVDLM perp.   N-gram perp.  Comb. perp.
25  1000000  4.237069e+04  3.730947e+02  2.192305e+02
20  1000000  4.249082e+04  3.730947e+02  2.174188e+02
15  1000000  4.266386e+04  3.730947e+02  2.100715e+02
10  1000000  4.290579e+04  3.730947e+02  2.102029e+02
Tuning k results in a best perplexity improvement of over 40%. A low optimal k is a good thing, because many algorithms for calculating SVD produce singular vectors one at a time, starting with the largest.
Tractability
The biggest challenge with the SVDLM is tractability. Calculating the SVD is computationally demanding.
But the optimal k is low. I have also developed an algorithm that helps with tractability.
Usability of the resulting SVDLM is also an issue. The SVDLM is much larger than a regular n-gram model. But the size can be minimised by discarding low values with minimal impact on performance.
Backoff SVDLM
Improving on n-gram language modelling is interesting work. However, no improvement on the state of the art has been demonstrated yet! Next steps involve the creation of a backoff SVDLM.
Interpolating with lower-order n-grams is standard. Backoff models have much superior performance.
Similar Work
Jerome Bellegarda developed the LSA language model
Uses longer-span eigen decomposition information to access semantic information. Others have since developed the work.
Saul and Pereira demonstrated an approach based on Markov models
Again demonstrates that some form of dimensionality reduction is beneficial
Summary
The GHA-based algorithm allows large datasets to be decomposed.
The asymmetrical formulation allows data such as n-grams to be decomposed.
Promising initial results in n-gram language model smoothing have been presented.
Thanks!
Gorrell, 2006. "Generalized Hebbian Algorithm for Incremental Singular Value Decomposition." Proceedings of EACL 2006.
Gorrell and Webb, 2005. "Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis." Proceedings of Interspeech 2005.
Sanger, T., 1989. "Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Network." Neural Networks, 2, 459-473.