Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing
Genevieve Gorrell
5th June 2007
Introduction
Think of datapoints plotted in hyperspace. Imagine a space in which each word has its own dimension:
              big bad
"big big bad" [ 2 1 ]
"big bad"     [ 1 1 ]
"bad"         [ 0 1 ]
We can compare these passages using vector representations in this space.
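As a minimal sketch (not from the slides), one common way to make that comparison is cosine similarity over the count vectors above; the numbers are the same as in the toy example:

import numpy as np

# One dimension per word: [count of "big", count of "bad"]
passages = {
    "big big bad": np.array([2.0, 1.0]),
    "big bad":     np.array([1.0, 1.0]),
    "bad":         np.array([0.0, 1.0]),
}

def cosine(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(passages["big big bad"], passages["big bad"]))  # very similar
print(cosine(passages["big bad"], passages["bad"]))          # less similar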
[Figure: the three passages plotted against an axis of bigness and an axis of badness]
Dimensionality Reduction
Do we really need two dimensions to describe the relationship between these datapoints?
[Figure: the same datapoints on the axes of bigness and badness]
Rotation
Imagine the data look like this ...
More Rotation
Or even like this ... because if these were the dimensions, we would know which were the most important.
We could describe as much of the data as possible using a smaller number of dimensions: approximation, compression, generalisation.
Eigen Decomposition
The key lies in rotating the data into the most efficient orientation. Eigen decomposition will give us a set of axes (eigenvectors) of a new space in which our data might more efficiently be represented.
Eigen Decomposition
Eigen decomposition is a vector space technique that provides a useful way to automatically reduce data dimensionality. This technique is of interest in natural language processing.
Latent Semantic Indexing
Given a dataset in a given space, eigen decomposition can be used to create a nearest approximation in a space with fewer dimensions.
For example, document vectors as bags of words in a space with one dimension per word can be mapped to a space with fewer dimensions than one per word
Mv = λv
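As an illustrative sketch (not from the slides), the same idea on the toy passage data: numpy's eigh solves Mv = λv for a symmetric M, and projecting onto the leading eigenvector gives the reduced representation:

import numpy as np

# Passage-by-word counts from the earlier example (rows: passages, columns: [big, bad])
A = np.array([[2.0, 1.0],
              [1.0, 1.0],
              [0.0, 1.0]])

M = A.T @ A                      # symmetric word co-occurrence matrix
vals, vecs = np.linalg.eigh(M)   # solves M v = lambda v
order = np.argsort(vals)[::-1]   # largest eigenvalue first
vals, vecs = vals[order], vecs[:, order]

reduced = A @ vecs[:, 0]         # each passage described by one coordinate instead of two
print(vals)
print(reduced)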
A real world example—eigenfaces
Each new dimension captures something important about the data. The original observation can be recreated from a combination of these components.
Eigen Faces 2
Each eigen face captures as much information in the dataset as possible (eigenvectors are orthogonal to each other).
This is much more efficient than the original representation
More Eigen Face Convergence
Eigen faces with high eigenvalues capture important generalisations in the corpus. These generalisations might well apply to unseen data ...
We have been using this in natural language processing ...
Corpus-driven language modelling suffers from problems with data sparsity
We can use eigen decomposition to make generalisations that might apply to unseen data
But language corpora are very large ...
Problems with eigen decomposition
Existing algorithms often:
• require all the data to be available at once (batch processing)
• produce all the component vectors simultaneously, even though they may not all be necessary and it takes longer to compute all of them
• are very computationally expensive, and so may exceed the capabilities of the computer for larger corpora
  - large RAM requirement
  - exponential relationship between time/RAM requirement and dataset size
Generalized Hebbian Algorithm (Sanger 1989)
Based on Hebbian learning
A simple, localised technique for deriving an eigen decomposition
Requires very little memory
Learns from single observations (for example, document vectors) presented serially, so there is no problem adding more data
In fact, the entire matrix need never be simultaneously available
The eigenvectors with the greatest eigenvalues are produced first
GHA Algorithm
c += (c . x) x
c is the eigenvector, x is the training datum

Initialise the eigenvector randomly
While the eigenvector is not converged {
    Dot-product each training vector with the eigenvector
    Multiply the result by the training vector
    Add the resulting vector to the eigenvector
}
Dot-product is a measure of the similarity of direction of one vector with another, and produces a scalar. There are various ways in which one might assess convergence.
GHA Algorithm Continued
Or in other words: train by adding each datum to the eigenvector in proportion to the extent to which it already resembles it. Train subsequent eigenvectors by removing the stronger eigenvectors from the data before training, so that the algorithm doesn't find those ones again.
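A minimal sketch of this in Python; the learning rate, the explicit renormalisation, and the fixed number of passes (standing in for a real convergence test) are my assumptions, not details from the slides:

import numpy as np

def gha(data, n_vectors, passes=200, rate=0.01):
    """Estimate the leading eigenvectors from serially presented data (GHA, simplified)."""
    dim = data.shape[1]
    found = []
    for _ in range(n_vectors):
        c = np.random.randn(dim)                 # initialise the eigenvector randomly
        for _ in range(passes):                  # stand-in for "while not converged"
            for x in data:
                for prev in found:               # remove stronger eigenvectors already found
                    x = x - np.dot(prev, x) * prev
                c += rate * np.dot(c, x) * x     # c += (c . x) x
                c /= np.linalg.norm(c)           # keep the eigenvector at unit length
        found.append(c)
    return np.array(found)

# Toy usage on the "big bad" passage vectors
data = np.array([[2.0, 1.0], [1.0, 1.0], [0.0, 1.0]])
print(gha(data, n_vectors=2))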
GHA as a neural net
[Figure: a single-layer linear network with inputs Input_1 ... Input_n and weights Weight_1 ... Weight_n]
dp = sum over x = 1 ... n of (Input_x * Weight_x)
Weight_i += dp * Input_i, for each input i
• Can be extended to learn many eigenvectors
Singular Value Decomposition
Extends eigen decomposition to paired data
Word co-occurrence
      big  bad
big    5    3
bad    3    3
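These counts are consistent with forming M = AᵀA from the earlier passage-by-word matrix A = [[2,1],[1,1],[0,1]]: big·big = 2·2 + 1·1 + 0·0 = 5, big·bad = 2·1 + 1·1 + 0·1 = 3, and bad·bad = 1 + 1 + 1 = 3 (a reading of the slide's figures, not something it states explicitly).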
Word bigrams ("bad", "big bad", "big big bad")
        big:2  bad:2
big:1     1      2
bad:1     0      0
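For a small matrix like this, a batch SVD shows what the singular vector pairs look like; this uses numpy directly rather than anything from the talk, and assumes the toy bigram counts above:

import numpy as np

# Bigram counts: rows are first words [big, bad], columns are second words [big, bad]
B = np.array([[1.0, 2.0],
              [0.0, 0.0]])

U, S, Vt = np.linalg.svd(B)
print(S)        # singular values, largest first
print(U[:, 0])  # left singular vector, over first-word positions
print(Vt[0])    # right singular vector, over second-word positions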
Asymmetrical GHA (Gorrell 2006)
Extends GHA to asymmetrical datasets, allowing us to work with n-grams, for example
Retains the features of GHA
Asymmetrical GHA Algorithm
ca += (cb.xb) xa
cb += (ca.xa) xb
Train singular vectors on data presented as a series of vector pairs: dot the left training datum with the left singular vector, scale the right training datum by the resulting scalar and add it to the right singular vector, and vice versa.
For example, the first word in a bigram might be vector xa and the second, xb.
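A minimal sketch of this paired update for the first singular vector pair; the learning rate and the per-step normalisation are my assumptions, not details from the slides:

import numpy as np

def agha_first_pair(pairs, dim_a, dim_b, passes=200, rate=0.05):
    """Estimate the first singular vector pair from (xa, xb) training pairs."""
    ca = np.random.randn(dim_a)               # left singular vector (e.g. over first words)
    cb = np.random.randn(dim_b)               # right singular vector (e.g. over second words)
    for _ in range(passes):                   # stand-in for a convergence test
        for xa, xb in pairs:
            ca += rate * np.dot(cb, xb) * xa  # ca += (cb . xb) xa
            cb += rate * np.dot(ca, xa) * xb  # cb += (ca . xa) xb
            ca /= np.linalg.norm(ca)
            cb /= np.linalg.norm(cb)
    return ca, cb

# Toy usage: the bigrams "big big" (once) and "big bad" (twice) as one-hot vector pairs
big, bad = np.array([1.0, 0.0]), np.array([0.0, 1.0])
pairs = [(big, big), (big, bad), (big, bad)]
print(agha_first_pair(pairs, dim_a=2, dim_b=2))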
Asymmetrical GHA Performance (20,000 NL bigrams)
RAM requirement linear with dimensionality and number of singular vectors required
Time per training step linear with dimensionality
This is a big improvement on conventional approaches for larger corpora/dimensionalities ...
But don't forget, the algorithm needs to be allowed to converge
N-Gram Language Model Smoothing
Modelling language as a string of n-grams:
• a highly successful approach
• but we will always have problems with data sparsity
• zero probabilities are bad news
A Zipf Curve
N-gram Language Modelling—An Example Corpus
A man hits the ball at the dog. The man hits the ball at the house. The man takes the dog to the ball. A man takes the ball to the house. The dog takes the ball to the house. The dog takes the ball to the man. The man hits the ball to the dog. The man walks the dog to the house. The man walks the dog. The dog walks to the man. A dog hits a ball. The man walks in the house. The man hits the dog. A ball hits the dog. The man walks. A ball hits. Every ball hits. Every dog walks. Every man walks. A man walks. A small man walks. Every nice dog barks.
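A small sketch of how a normalised bigram matrix like the one on the next slide can be built; the assumption that bigrams are counted within sentences and divided by the total bigram count matches the figures shown, but is my reading rather than something the slides state:

import re
from collections import Counter

corpus = (
    "A man hits the ball at the dog. The man hits the ball at the house. "
    "The man takes the dog to the ball. A man takes the ball to the house. "
    "The dog takes the ball to the house. The dog takes the ball to the man. "
    "The man hits the ball to the dog. The man walks the dog to the house. "
    "The man walks the dog. The dog walks to the man. A dog hits a ball. "
    "The man walks in the house. The man hits the dog. A ball hits the dog. "
    "The man walks. A ball hits. Every ball hits. Every dog walks. "
    "Every man walks. A man walks. A small man walks. Every nice dog barks."
)

bigrams = Counter()
for sentence in corpus.lower().split("."):
    words = re.findall(r"[a-z]+", sentence)
    bigrams.update(zip(words, words[1:]))   # adjacent word pairs within one sentence

total = sum(bigrams.values())
matrix = {pair: count / total for pair, count in bigrams.items()}
print(matrix[("the", "man")])               # about 0.1, as in the slide's matrix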
An Example Corpus as Normalised Bigram Matrix
(rows are first words, columns are second words)
       man hits the ball at dog house takes to walks a in small nice barks
a      0.03 0.0 0.0 0.03 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0
man    0.0 0.04 0.0 0.0 0.0 0.0 0.0 0.02 0.0 0.07 0.0 0.0 0.0 0.0 0.0
hits   0.0 0.0 0.05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0
the    0.1 0.0 0.0 0.07 0.0 0.1 0.05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ball   0.0 0.03 0.0 0.0 0.02 0.0 0.0 0.0 0.04 0.0 0.0 0.0 0.0 0.0 0.0
at     0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
takes  0.0 0.0 0.04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
dog    0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.02 0.02 0.02 0.0 0.0 0.0 0.0 0.01
to     0.0 0.0 0.07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
walks  0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.01 0.0 0.0 0.0
in     0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
every  0.01 0.0 0.0 0.01 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0
small  0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
nice   0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
First Singular Vector Pair
       man hits the ball at dog house takes to walks a in small nice barks
a      0.02 0.00 0.00 0.02 0.00 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.10 0.00 0.00 0.07 0.00 0.10 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.01 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Second Singular Vector Pair
       man hits the ball at dog house takes to walks a in small nice barks
a      0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Third Singular Vector Pair
       man hits the ball at dog house takes to walks a in small nice barks
a      0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.04 0.00 0.00 0.01 0.00 0.00 0.02 0.02 0.06 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Language Models from Eigen N-Grams
Add k singular vector pairs (“eigen n-grams”) together
Remove all the negative cell values
Normalise row-wise to get probabilities
Include a smoothing approach to remove zeros
       man hits the ball at dog house takes to walks a in small nice barks
a      0.02 0.00 0.00 0.02 0.00 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
man    0.00 0.04 0.00 0.00 0.01 0.00 0.00 0.02 0.02 0.06 0.00 0.00 0.00 0.00 0.00
hits   0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
the    0.10 0.00 0.00 0.07 0.00 0.10 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ball   0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
at     0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
takes  0.00 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dog    0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.00 0.00 0.00 0.00 0.00
to     0.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
walks  0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
in     0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
every  0.01 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
small  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nice   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
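A sketch of the four steps above in Python, applied to the singular vector pairs of a normalised bigram matrix; the add-epsilon step is only a placeholder for whatever smoothing approach is used, and it is applied before normalisation so that the rows still sum to one:

import numpy as np

def svdlm(B, k, epsilon=1e-3):
    """Turn the first k singular vector pairs of bigram matrix B into row-wise probabilities."""
    U, S, Vt = np.linalg.svd(B)
    # 1. Add k singular vector pairs ("eigen n-grams") together
    R = sum(S[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
    # 2. Remove all the negative cell values
    R = np.clip(R, 0.0, None)
    # 3. Smooth away the remaining zeros (placeholder: add a small constant)
    R = R + epsilon
    # 4. Normalise row-wise to get P(second word | first word)
    return R / R.sum(axis=1, keepdims=True)

In the experiments that follow, k and the smoothing would be tuned rather than fixed like this.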
What do we hope to see?
The theory is that the reduced dimensionality representation describes the unseen test corpus better than the original representation.
As k increases, perplexity should decrease until the optimum is reached.
Perplexity should then begin to increase as the optimum is passed and too much data is included.
We hope for a U-shaped curve.
Some results ...
Perplexity is a measure of the quality of the language model
k is number of dimensions (eigen n-grams)
Times are how long it took to calculate the dimensions
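For reference (assuming the standard definition, which the slides do not spell out), for a bigram model evaluated on an unseen test corpus of N words:

perplexity = 2 ^ ( -(1/N) * sum over i of log2 P(w_i | w_i-1) )

Lower perplexity means the model is less surprised by the test corpus, so lower is better.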
Some specifics about this experiment
The corpus comprises five newsgroups from CMU's newsgroup corpus
The training corpus contains over a million items
The unseen test corpus comprises over 100,000 items
I used AGHA to calculate the decomposition
I used simple heuristically-chosen smoothing constants and single-order language models
Maybe k is too low?
200,000 trigrams, LAS2 algorithm
Full rank decomposition
20,000 bigrams
Furthermore, perplexity in each case never reaches the baseline perplexity of the original n-gram model
Linear interpolation may generate an interesting result
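Presumably this is the standard combination of the two models (my reading; the slides do not say exactly how the "Weight" column in the tables below maps onto the interpolation weight):

P_combined(w | h) = λ · P_SVDLM(w | h) + (1 − λ) · P_n-gram(w | h),  with 0 ≤ λ ≤ 1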
k   Weight  SVDLM perp.   N-gram perp.  Comb. perp.
25  1       7.071990e+02  4.647004e+02  3.891952e+02
10  1       8.884950e+02  4.647004e+02  3.695157e+02
10  0.7     8.884950e+02  4.647004e+02  3.705559e+02
5   1       1.156845e+03  4.647004e+02  3.788119e+02
The best result is 370: an overall improvement of 20% is demonstrated (however, this involved tuning on the test corpus)
200,000 Trigram Corpus
k    Weight  SVDLM perp.   N-gram perp.  Comb. perp.
100  1       1.003399e+03  4.057236e+02  3.196404e+02
50   1       1.220449e+03  4.057236e+02  3.008804e+02
25   1       1.508873e+03  4.057236e+02  2.834632e+02
10   1       2.188041e+03  4.057236e+02  2.898518e+02
Improvement on the baseline n-gram is even greater on the medium-sized corpus (30%)
1 Million Trigram Corpus
k   Weight    SVDLM perp.   N-gram perp.  Comb. perp.
25  1         4.237069e+04  3.730947e+02  3.729931e+02
25  2         4.237069e+04  3.730947e+02  3.728907e+02
25  10        4.237069e+04  3.730947e+02  3.721338e+02
25  100       4.237069e+04  3.730947e+02  3.663666e+02
25  1000      4.237069e+04  3.730947e+02  3.442525e+02
25  10000     4.237069e+04  3.730947e+02  2.980755e+02
25  100000    4.237069e+04  3.730947e+02  2.422045e+02
25  1000000   4.237069e+04  3.730947e+02  2.187968e+02
25  10000000  4.237069e+04  3.730947e+02  2.741027e+02
This is a big dataset for SVD! Needed to increase the weighting on the SVDLM a lot to get a good result
Fine-Tuning k
k   Weight   SVDLM perp.   N-gram perp.  Comb. perp.
25  1000000  4.237069e+04  3.730947e+02  2.192305e+02
20  1000000  4.249082e+04  3.730947e+02  2.174188e+02
15  1000000  4.266386e+04  3.730947e+02  2.100715e+02
10  1000000  4.290579e+04  3.730947e+02  2.102029e+02
Tuning k results in a best perplexity improvement of over 40%. A low optimal k is a good thing, because many algorithms for calculating SVD produce singular vectors one at a time, starting with the largest.
Tractability
The biggest challenge with the SVDLM is tractability. Calculating the SVD is computationally demanding.
But the optimal k is low. I have also developed an algorithm that helps with tractability.
Usability of the resulting SVDLM is also an issue. The SVDLM is much larger than a regular n-gram model. But the size can be minimised by discarding low values with minimal impact on performance.
Backoff SVDLM
Improving on n-gram language modelling is interesting work. However, no improvement on the state of the art has been demonstrated yet! Next steps involve the creation of a backoff SVDLM.
Interpolating with lower-order n-grams is standard. Backoff models have much superior performance.
Similar Work
Jerome Bellegarda developed the LSA language model
Uses longer-span eigen decomposition information to access semantic information. Others have since developed the work.
Saul and Pereira demonstrated an approach based on Markov models
Again demonstrates that some form of dimensionality reduction is beneficial
Summary
The GHA-based algorithm allows large datasets to be decomposed.
The asymmetrical formulation allows data such as n-grams to be decomposed.
Promising initial results in n-gram language model smoothing have been presented.
Thanks!
Gorrell, 2006. "Generalized Hebbian Algorithm for Incremental Singular Value Decomposition." Proceedings of EACL 2006.
Gorrell and Webb, 2005. "Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis." Proceedings of Interspeech 2005.
Sanger, T., 1989. "Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Network." Neural Networks, 2, 459-473.