Friday 30. May 2003 NoDaLiDa 2003: Leif Grönqvist 1
Latent Semantic Indexing and Beyond
Leif Grönqvist ([email protected])
School of Mathematics and Systems Engineering
The Swedish Graduate School of Language Technology
What is Latent Semantic Indexing?
• LSI uses a kind of vector model
• The classical IR vector model groups documents with many terms in common
• But
– Documents could have a very similar content, using different vocabularies
– The terms used in the document may not be the most representative
• LSI uses the distribution of all terms in all documents when comparing two documents!
A traditional vector model for IR
• The starting point is a term-document-matrix, both for the traditional vector model and LSI
• We can calculate similarities between terms or documents using the cosine
• We can also (trivially) find relevant terms for a document
• Problems:
– The term “trees” seems relevant to the m-documents, but is not present in m4
– cos(c1,c5)=0 just as cos(c1,m3)=0
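A minimal sketch of the cosine computation behind the two problems above. The toy matrix itself is not reproduced in the text, so the counts below are an assumption: they are the well-known nine-title example from Deerwester et al. (1990), which the term and document labels on these slides suggest. The helper names are hypothetical.

```python
import math

# ASSUMPTION: term-document counts from the nine-title example in
# Deerwester et al. (1990); the slide's own matrix is not in the text.
TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]
DOCS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
]


def cosine(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def doc_vector(name):
    """Column of X for one document (a 12-dimensional vector)."""
    j = DOCS.index(name)
    return [row[j] for row in X]


def term_vector(name):
    """Row of X for one term (a 9-dimensional vector)."""
    return X[TERMS.index(name)]


# c1 shares no terms with c5 or with m3, so both cosines are exactly zero.
print(cosine(doc_vector("c1"), doc_vector("c5")))
print(cosine(doc_vector("c1"), doc_vector("m3")))
```

In this model c1 is as unrelated to c5 (another c-document) as it is to m3, which is the second problem listed above.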
A toy example
How does LSI work?
• The idea is to try to use latent information like:
– word1 and word2 are often found together, so maybe doc1 (containing word1) and doc2 (containing word2) are related?
– doc3 and doc4 have many words in common so maybe the words they don’t have in common are related?
How does LSI work? cont’d
• In the classical vector model, a document vector (from our toy example) is 12-dimensional and the term vectors are 9-dimensional
• What we want to do is to project these vectors into a vector space with lower dimensionality
• One way is to use Singular Value Decomposition (SVD)
• We decompose the original matrix into three new matrices
What SVD gives us
X = T0 S0 D0′, where X, T0, S0 and D0 are matrices and D0′ is the transpose of D0
Using the SVD
• The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra
• We can select m easily just by using as many rows/columns of T0, S0, D0 as we want
• To get an idea, let’s use m=2 and recalculate a new (approximated) X – it will still be a t × d matrix
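The m=2 recalculation described above can be sketched as follows. Two things are assumptions rather than part of the talk: the toy matrix (taken to be the nine-title example from Deerwester et al. 1990), and the use of plain power iteration with deflation in place of a library SVD routine, chosen only to keep the sketch free of dependencies. All function names are hypothetical.

```python
import math

# ASSUMPTION: the toy matrix is the nine-title example from Deerwester et al.
# (1990). Rows are the 12 terms, columns the documents c1..c5, m1..m4.
X = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
]


def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def truncated_svd(A, m, iters=2000):
    """Top-m singular triplets (sigma, t, d) by power iteration on A'A."""
    At = [list(col) for col in zip(*A)]
    triplets = []
    for k in range(m):
        # Deterministic, non-constant start vector for component k.
        d = [1.0 + 0.1 * ((i + k) % 3) for i in range(len(At))]
        for _ in range(iters):
            # Deflate: stay orthogonal to the document vectors already found.
            for _, _, prev in triplets:
                overlap = sum(a * b for a, b in zip(d, prev))
                d = [a - overlap * b for a, b in zip(d, prev)]
            d = matvec(At, matvec(A, d))
            norm = math.sqrt(sum(a * a for a in d))
            d = [a / norm for a in d]
        Ad = matvec(A, d)
        sigma = math.sqrt(sum(a * a for a in Ad))
        triplets.append((sigma, [a / sigma for a in Ad], d))
    return triplets


def reconstruct(triplets, n_terms, n_docs):
    """Rebuild the t x d approximation T S D' from the kept triplets."""
    return [[sum(s * t[i] * d[j] for s, t, d in triplets)
             for j in range(n_docs)] for i in range(n_terms)]


triplets = truncated_svd(X, m=2)
X_hat = reconstruct(triplets, len(X), len(X[0]))
```

Each triplet holds one singular value (a diagonal entry of S0), one column of T0 and one column of D0, so keeping m triplets is the same as keeping m columns of each matrix. Under the stated assumption about the toy matrix, the entries of `X_hat` can be compared directly with the recalculated X for m=2.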
We can recalculate X with m=2
              C1    C2    C3    C4    C5    M1    M2    M3    M4
Human        .16   .40   .38   .47   .18  -.05  -.12  -.16  -.09
Interface    .14   .37   .33   .40   .16  -.03  -.07  -.10  -.04
Computer     .15   .51   .36   .41   .24   .02   .06   .09   .12
User         .26   .84   .61   .70   .39   .03   .08   .12   .19
System       .45  1.23  1.05  1.27   .56  -.07  -.15  -.21  -.05
Response     .16   .58   .38   .42   .28   .06   .13   .19   .22
Time         .16   .58   .38   .42   .28   .06   .13   .19   .22
EPS          .22   .55   .51   .63   .24  -.07  -.14  -.20  -.11
Survey       .10   .53   .23   .21   .27   .14   .44   .44   .42
Trees       -.06   .23  -.14  -.27   .14   .24   .77   .77   .66
Graph       -.06   .34  -.15  -.30   .20   .31   .98   .98   .85
Minors      -.04   .25  -.10  -.21   .15   .22   .71   .71   .62
What does the SVD give?
• Susan Dumais 1995: “The SVD program takes the ltc transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix.”
• Michael W Berry 1992: “This important result indicates that Ak is the best k-rank approximation (in a least squares sense) to the matrix A.”
• Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.
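The result Berry refers to is usually stated as the Eckart–Young theorem. Written out, with notation that is an assumption here (U, Σ, V for the SVD factors, σ_i for the singular values in decreasing order, r for the rank of A, and the Frobenius norm as the least squares measure):

```latex
A = U \Sigma V^{\mathsf T}, \qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\mathsf T}, \qquad
\lVert A - A_k \rVert_F
  = \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F
  = \sqrt{\sigma_{k+1}^2 + \dots + \sigma_r^2}
```

So the error of the rank-k approximation is determined entirely by the singular values that are dropped.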
Algorithms for dimensional reduction
• Singular Value Decomposition (SVD)
– This is a mathematically complicated (based on eigenvalues) way to find an optimal vector space in a specific number of dimensions
– Computationally heavy: maybe 20 hours for a newspaper corpus of one million documents
– Often uses the entire document as context
• Random Indexing (RI)
– Select some dimensions randomly
– Not as heavy to calculate, but more unclear (for me) why it works
– Uses a small context, typically 1+1 to 5+5 words
• Neural nets, Hyperspace Analogue to Language, etc.
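A rough sketch of how Random Indexing with a small word window might look. The slide only says that some dimensions are selected randomly; the sparse ±1 index vectors below follow the usual description of the method and are an assumption about what is meant. The function names and the default values for `dim`, `nonzeros` and `window` are hypothetical.

```python
import random


def index_vector(rng, dim, nonzeros):
    """Sparse random vector: a few randomly chosen dimensions set to +1 or -1."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(tokens, dim=64, nonzeros=4, window=2, seed=0):
    """Return (index, context) vectors for every word in a token list.

    Each word gets a fixed random index vector. A word's context vector is
    the sum of the index vectors of all words found within `window`
    positions on either side of it (window=2 corresponds to a 2+2 context).
    """
    rng = random.Random(seed)
    index = {w: index_vector(rng, dim, nonzeros) for w in sorted(set(tokens))}
    context = {w: [0] * dim for w in index}
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                neighbour = index[tokens[j]]
                context[word] = [c + x for c, x in zip(context[word], neighbour)]
    return index, context
```

The dimensionality is fixed in advance by `dim`, so no decomposition of a large matrix is needed; words that occur in similar windows end up with similar context vectors.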
Some applications
• Automatic generation of a domain specific thesaurus
• Keyword extraction from documents
• Find sets of similar documents in a collection
• Find documents related to a given document or a set of terms
Problems and questions
• How can we interpret the similarities as different kinds of relations?
• How can we include document structure and phrases in the model?
• Terms are not really terms, but just words
• Ambiguous terms pollute the vector space
• How could we find the optimal number of dimensions for the vector space?
An example based on 50 000 newspaper articles
stefan edberg
edberg 0.918
cincinnatis 0.887
edbergs 0.883
världsfemman 0.883
stefans 0.883
tennisspelarna 0.863
stefan 0.861
turneringsseger 0.859
queensturneringen 0.858
växjöspelaren 0.852
grästurnering 0.847

bengt johansson
johansson 0.852
johanssons 0.704
bengt 0.678
centerledare 0.674
miljöcentern 0.667
landsbygdscentern 0.667
implikationer 0.645
ickesocialistisk 0.643
centerledaren 0.627
regeringsalternativet 0.620
vagare 0.616
Bengt Johansson is just Bengt + Johansson – something is missing!
bengt 1.000
westerberg 0.912
folkpartiledaren 0.899
westerbergs 0.893
fpledaren 0.864
socialminister 0.862
försvarsfrågorna 0.860
socialministern 0.841
måndagsresor 0.840
bulldozer 0.838
skattesubventionerade 0.833
barnomsorgsgaranti 0.829

johansson 1.000
johanssons 0.800
olof 0.684
centerledaren 0.673
valperiod 0.668
centerledarens 0.654
betongpolitiken 0.650
downhill 0.640
centerfamiljen 0.635
centerinflytande 0.634
brokrisen 0.632
gödslet 0.628
A small experiment
• I want the model to know the difference between Bengt and Bengt
1. Make a frequency list for all n-tuples up to n=5 with a frequency>1
2. Keep all words in the bags, but add the tuples, with space replaced by -, as words
3. Run the LSI again
• Now bengt-johansson is a word, and bengt-johansson is NOT Bengt + Johansson
Number of terms grows a lot!
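Steps 1 and 2 above might be sketched like this. It is an illustration under assumptions, not the code used for the experiment: documents are taken to be lists of lower-cased tokens, n-tuples are taken to mean contiguous sequences of 2 to 5 words, and frequency is counted over the whole corpus. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_tuples(docs, max_n=5, min_freq=2):
    """Step 1: n-tuples (2 <= n <= max_n) seen at least min_freq times."""
    counts = Counter()
    for tokens in docs:
        for n in range(2, max_n + 1):
            counts.update(ngrams(tokens, n))
    return {t for t, c in counts.items() if c >= min_freq}


def add_tuples(docs, max_n=5, min_freq=2):
    """Step 2: keep every word and append each frequent tuple as a new 'word'."""
    keep = frequent_tuples(docs, max_n, min_freq)
    bags = []
    for tokens in docs:
        bag = list(tokens)
        for n in range(2, max_n + 1):
            bag.extend("-".join(t) for t in ngrams(tokens, n) if t in keep)
        bags.append(bag)
    return bags
```

The resulting bags can be fed to the same term-document pipeline as before (step 3); every frequent tuple simply becomes one more row in the matrix, which is why the number of terms grows so much.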
And the top list for Bengt-Johansson
bengt-johansson 1.000
dubbellandskamperna 0.954
pettersson-sävehof 0.952
kristina-jönsson 0.950
fanns-svenska-glädjeämnen 0.945
johan-pettersson-sävehof 0.942
martinsson-karlskrona 0.938
förbundskaptenen-bengt-bengan-johansson 0.932
förbundskaptenen-bengt-bengan 0.932
sjumålsskytt 0.931
svenska-damhandbollslandslaget 0.928
stankiewicz 0.926
em-par 0.925
västeråslaget 0.923
jan-stankiewicz 0.923
handbollslandslag 0.922
bengt-johansson-tt 0.921
st-petersburg-sverige 0.921
petersburg-sverige 0.921
sjuklistan 0.920
olsson-givetvis 0.920
emtruppen 0.919
…
johansson 0.567
bengt 0.354
olof 0.181
centerledaren 0.146
westerberg 0.061
folkpartiledaren 0.052
The new vector space model
• It is clear that it is now possible to find terms closely related to Bengt Johansson – the handball coach
• But is the model better for single words and for document comparison as well? What do you think?
• More “words” than before – hopefully it improves the result just as more data does
• At least no reason for a worse result... Or?
An example document
REGERINGSKRIS ELLER INTE PARTILEDARNA I SISTAMINUTEN ÖVERLÄGGNINGAR OM BRON
Under onsdagskvällen satt partiledarna i regeringen i sista minutenöverläggningar om Öresundsbron Centerledaren Olof Johansson var den förste som lämnade överläggningarna På torsdagen ska regeringen ge ett besked Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron …
Closest terms in each model
0.986 underkänner
0.982 irhammar
0.977 partiledarna
0.970 godkände
0.962 delade-meningar
0.960 regeringssammanträde
0.957 riksdagsledamot
0.957 bengt-westerberg
0.954 materialet
0.952 diskuterade
0.950 folkpartiledaren
0.949 medierna
0.947 motsättningarna
0.946 vilar
0.944 socialminister-bengt-westerberg

0.967 partiledarna
0.921 miljökrav
0.921 underkänner
0.918 tolkar
0.897 meningar
0.888 centerledaren
0.886 regeringssammanträde
0.880 slottet
0.880 rosenbad
0.877 planminister
0.866 folkpartiledaren
0.855 thurdin
0.845 brokonsortiet
0.839 görel
0.826 irhammar
Closest document in both models
BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR
Det blir ett besked under torsdagen men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna Men nu menar Bildt att centern lämnar sned information utåt Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen …
Doc     Basic model      Tuples added
        Score   Rank     Score   Rank
2126    1.000      1     1.000      1
2127     .996      2      .999      2
2128     .848      5      .677      3
3767     .849      3      .534      7
211      .805      8      .526      8
156      .844      6      .525      9
215      .805      9      .522     10
2602     .848      4      .492     12
2367     .804     10      .434     19
2360     .838      7      .402     23
3481     .527     53      .673      4
1567     .456     73      .601      5
1371     .456     73      .601      5
Documents with better ranking in the basic model
2602 .848 4 .492 12
BRON KAN BLI VALFRÅGA SÄGER JOHANSSON
Om det lutar åt ett ja i regeringen av politiska skäl då är naturligtvis den här frågan en viktig valfråga …
2367 .804 10 .434 19
INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET
En etappseger för miljön och centern En eloge till Olof Johansson Görel Thurdin och Carl Bildt …
Documents with better ranking in the phrase model
1567 .456 73 .601 5
ALF SVENSSON TOPPNAMN I STOCKHOLM
Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad och Michael Stjernström sakkunnig i statsrådsberedningen har en valbar andra plats …
1371 .456 74 .601 6
BENGT WESTERBERG BARNPORREN MÅSTE STOPPAS
Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …
Hmm, adding n-grams was maybe too simple...
1. If the bad result is due to overtraining, it could help to remove the words I build phrases from…
2. Another way to try is to use a dependency parser to find more meaningful phrases, not just n-grams
A new test following 1 above:
Ok, the words inside tuples are now removed
bengt-johansson 1.000
tomas-svensson 0.931
sveriges-handbollslandslag 0.912
förbundskapten-bengt-johansson 0.898
handboll 0.897
svensk-handboll 0.896
handbollsem 0.894
carlen 0.883
lagkaptenen-carlen 0.869
förbundskapten-johansson 0.863
ola-lindgren 0.863
bengan-johansson 0.862
erik-hajas 0.854
mats-olsson 0.854
carlen-magnus-wislander 0.852
handbollens 0.851
magnus-andersson 0.851
halvlek-svenskarna 0.849
teka-santander 0.849
storskyttarna 0.849
förbundskaptenen-bengt-johansson 0.845
målvakten-mats-olsson 0.845
danmark-tvåa 0.843
handbollsspelare 0.839
sveriges-handbollsherrar 0.836
lag-ibland 0.835
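One way the "words inside tuples removed" variant might be built is sketched below. How the removal was actually done is not stated, so the rule used here is an assumption: a word occurrence is dropped whenever it is covered by at least one frequent tuple, while all frequent tuples (including overlapping ones) are kept. The function names are hypothetical.

```python
from collections import Counter


def ngrams_with_pos(tokens, n):
    """Contiguous n-tuples together with their start position."""
    return [(i, tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]


def bags_without_tuple_words(docs, max_n=5, min_freq=2):
    """Replace words covered by a frequent tuple with the tuple itself."""
    counts = Counter()
    for tokens in docs:
        for n in range(2, max_n + 1):
            counts.update(t for _, t in ngrams_with_pos(tokens, n))
    keep = {t for t, c in counts.items() if c >= min_freq}

    bags = []
    for tokens in docs:
        covered = set()   # positions that fall inside some frequent tuple
        tuples = []
        for n in range(2, max_n + 1):
            for i, t in ngrams_with_pos(tokens, n):
                if t in keep:
                    tuples.append("-".join(t))
                    covered.update(range(i, i + n))
        singles = [w for i, w in enumerate(tokens) if i not in covered]
        bags.append(singles + tuples)
    return bags
```

Under this rule a single word such as handboll survives only where it occurs outside every frequent tuple.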
And now pseudo documents are added for each tuple
bengt-johansson 1.000
förbundskapten-bengt-johansson 0.907
förbundskaptenen-bengt-johansson 0.835
jonas-johansson 0.816
förbundskapten-johansson 0.799
johanssons 0.795
svenske-förbundskaptenen-bengt-johansson 0.792
bengan 0.786
carlen 0.777
bengan-johansson 0.767
johansson-andreas-dackell 0.765
förlorat-matcherna 0.750
ck-bure 0.748
daniel-johansson 0.748
målvakten-mats-olsson 0.747
jörgen-jönsson-mikael-johansson 0.744
kicki-johansson 0.744
mattias-johansson-aik 0.741
thomas-johansson 0.739
handbollsnation 0.738
mikael-johansson 0.737
förbundskaptenen-bengt-johansson-valde 0.736
johansson-mats-olsson 0.736
sveriges-handbollslandslag 0.736
ställningen-33-matcher 0.736
What I still have to do something about
• Find a better LSI/SVD package than the one I have (old C-code from 1990), or maybe write it myself...
• Get the phrases into the model in some way
When these things are done I could:
• Try to interpret various relations from similarities in a vector space model
• Try to solve the “number of optimal dimensions” problem
• Explore what the length of the vectors means