Latent Semantic Indexing and Beyond

Friday 30 May 2003, NoDaLiDa 2003: Leif Grönqvist


Leif Grönqvist ([email protected])

School of Mathematics and Systems Engineering

The Swedish Graduate School of Language Technology


What is Latent Semantic Indexing?

• LSI uses a kind of vector model

• The classical IR vector model groups documents with many terms in common

• But

– Documents could have a very similar content, using different vocabularies

– The terms used in the document may not be the most representative

• LSI uses the distribution of all terms in all documents when comparing two documents!


A traditional vector model for IR

• The starting point is a term-document-matrix, both for the traditional vector model and LSI

• We can calculate similarities between terms or documents using the cosine

• We can also (trivially) find relevant terms for a document

• Problems:

– The term “trees” seems relevant to the m-documents, but is not present in m4

– cos(c1,c5)=0 just as cos(c1,m3)=0
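The two problems can be checked numerically. The sketch below assumes that the toy example is the well-known 12-term, 9-document matrix from Deerwester et al. (1990); the slide with the original matrix is not reproduced in this text, so the counts in `X` and the helper names `cosine` and `doc` are assumptions made for illustration.

```python
import numpy as np

# ASSUMPTION: term-document counts of the Deerwester et al. (1990) toy example.
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0


def doc(name):
    """Column of X for the named document."""
    return X[:, docs.index(name)]


print(cosine(doc("c1"), doc("c5")))  # 0.0: no shared terms
print(cosine(doc("c1"), doc("m3")))  # 0.0: no shared terms either
print(cosine(doc("m2"), doc("m3")))  # high: both contain "trees" and "graph"
print(X[terms.index("trees"), docs.index("m4")])  # 0.0: "trees" is absent from m4
```

In the raw term space the model cannot tell the related pair (c1, c5) from the unrelated pair (c1, m3), which is the gap LSI tries to close.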


A toy example


How does LSI work?

• The idea is to try to use latent information like:

– word1 and word2 are often found together, so maybe doc1 (containing word1) and doc2 (containing word2) are related?

– doc3 and doc4 have many words in common, so maybe the words they don’t have in common are related?


How does LSI work? cont’d

• In the classical vector model, a document vector (from our toy example) is 12-dimensional and the term vectors are 9-dimensional

• What we want to do is to project these vectors into a vector space with lower dimensionality

• One way is to use Singular Value Decomposition (SVD)

• We decompose the original matrix into three new matrices


What SVD gives us

X = T0 S0 D0', where D0' is the transpose of D0; X, T0, S0 and D0 are matrices
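A minimal sketch of the decomposition using NumPy, assuming a hypothetical 12 x 9 count matrix in place of the real X. Note that `numpy.linalg.svd` returns the singular values as a vector and the document matrix already transposed; the variable names `T0`, `s0` and `D0t` are chosen here to mirror the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 12 x 9 term-document counts standing in for X.
X = rng.integers(0, 3, size=(12, 9)).astype(float)

# Thin SVD: T0 is terms x r, s0 holds r singular values, D0t is r x docs.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

print(T0.shape, S0.shape, D0t.shape)    # (12, 9) (9, 9) (9, 9)
print(np.allclose(X, T0 @ S0 @ D0t))    # the three factors reproduce X
print(np.all(np.diff(s0) <= 0))         # singular values come sorted, largest first
```

The columns of `T0` and the rows of `D0t` are orthonormal, and the sorted singular values are what makes truncation to the first m dimensions meaningful.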


Using the SVD

• The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra

• We can select m easily just by using as many rows/columns of T0, S0, D0 as we want

• To get an idea, let’s use m=2 and recalculate a new (approximated) X – it will still be a t x d matrix
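The truncation step can be sketched as follows, again assuming that the toy example is the Deerwester et al. (1990) matrix (the original slide is not reproduced in this text). The names `X2`, `doc_vecs` and `cosine` are illustrative only.

```python
import numpy as np

# ASSUMPTION: term-document counts of the Deerwester et al. (1990) toy example.
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

m = 2
# Keep only the first m columns of T0, singular values, and rows of D0'.
X2 = T0[:, :m] @ np.diag(s0[:m]) @ D0t[:m, :]   # still terms x docs
doc_vecs = D0t[:m, :].T * s0[:m]                # one m-dimensional row per document


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0


np.set_printoptions(precision=2, suppress=True)
print(X2[terms.index("trees")])   # "trees" now gets a clearly positive weight in m4
print(cosine(doc_vecs[docs.index("c1")], doc_vecs[docs.index("c5")]))  # no longer 0
```

Comparing documents through `doc_vecs` gives the same cosines as comparing the columns of `X2`, because the columns of `T0` are orthonormal.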


We can recalculate X with m=2

              C1    C2    C3    C4    C5    M1    M2    M3    M4
Human        .16   .40   .38   .47   .18  -.05  -.12  -.16  -.09
Interface    .14   .37   .33   .40   .16  -.03  -.07  -.10  -.04
Computer     .15   .51   .36   .41   .24   .02   .06   .09   .12
User         .26   .84   .61   .70   .39   .03   .08   .12   .19
System       .45  1.23  1.05  1.27   .56  -.07  -.15  -.21  -.05
Response     .16   .58   .38   .42   .28   .06   .13   .19   .22
Time         .16   .58   .38   .42   .28   .06   .13   .19   .22
EPS          .22   .55   .51   .63   .24  -.07  -.14  -.20  -.11
Survey       .10   .53   .23   .21   .27   .14   .44   .44   .42
Trees       -.06   .23  -.14  -.27   .14   .24   .77   .77   .66
Graph       -.06   .34  -.15  -.30   .20   .31   .98   .98   .85
Minors      -.04   .25  -.10  -.21   .15   .22   .71   .71   .62


What does the SVD give?

• Susan Dumais 1995: “The SVD program takes the ltc transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix.”

• Michael W Berry 1992: “This important result indicates that Ak is the best k-rank approximation (in a least squares sense) to the matrix A.”

• Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.
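One standard way to write down the "best k-rank approximation" result that Berry refers to is the Eckart–Young theorem. The symbols below (singular values σ_i, singular vectors u_i and v_i, the rank r of A, the candidate matrix B, and the Frobenius norm) are notation introduced here for illustration, not taken from the slides.

```latex
A = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T}, \qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

\| A - A_k \|_F
  = \min_{\operatorname{rank}(B) \le k} \| A - B \|_F
  = \sqrt{\sigma_{k+1}^{2} + \dots + \sigma_r^{2}}
```

So "best" here means that no other matrix of rank at most k has a smaller sum of squared element-wise differences from A.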


Algorithms for dimensional reduction

• Singular Value Decomposition (SVD)

– This is a mathematically complicated (based on eigenvalues) way to find an optimal vector space in a specific number of dimensions

– Computationally heavy: maybe 20 hours for a newspaper corpus of one million documents

– Often uses the entire document as context

• Random Indexing (RI)

– Select some dimensions randomly

– Not as heavy to calculate, but more unclear (for me) why it works

– Uses a small context, typically 1+1 to 5+5 words

• Neural nets, Hyperspace Analogue to Language, etc.
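For comparison with SVD, here is a minimal sketch of one common formulation of Random Indexing, assuming sparse ternary index vectors and a symmetric context window. It is not necessarily the variant the slide has in mind; the function name `random_indexing` and the parameters `dim`, `nonzeros` and `window` are hypothetical.

```python
import numpy as np


def random_indexing(tokens, dim=50, nonzeros=4, window=2, seed=0):
    """Return {word: context vector} built with a simple Random Indexing scheme."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))

    # Each word gets a sparse random index vector: a few +1/-1 entries, rest 0.
    index = {}
    for word in vocab:
        vec = np.zeros(dim)
        positions = rng.choice(dim, size=nonzeros, replace=False)
        vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
        index[word] = vec

    # A word's context vector is the sum of the index vectors of the words
    # found within `window` positions to its left and right (window=2 is 2+2).
    context = {word: np.zeros(dim) for word in vocab}
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                context[word] += index[tokens[j]]
    return context


tokens = "the coach praised the team and the trainer praised the team".split()
ctx = random_indexing(tokens)
print(len(ctx), ctx["coach"].shape)
```

The dimensionality is fixed in advance by `dim`, so no decomposition of a large matrix is needed; words that occur in similar contexts accumulate similar sums.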


Some applications

• Automatic generation of a domain specific thesaurus

• Keyword extraction from documents

• Find sets of similar documents in a collection

• Find documents related to a given document or a set of terms
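As an illustration of the thesaurus-style application, the sketch below ranks terms by cosine similarity in the reduced space. The function name `nearest_terms`, the scaling of term vectors by the singular values, and the tiny 4 x 3 matrix are assumptions for illustration, not the setup used for the newspaper corpus.

```python
import numpy as np


def nearest_terms(X, terms, query, m=2, top=3):
    """Rank terms by cosine similarity to `query` in an m-dimensional LSI space."""
    T, s, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    vecs = T[:, :m] * s[:m]                 # term vectors scaled by singular values
    norms = np.linalg.norm(vecs, axis=1)
    norms[norms == 0] = 1.0                 # avoid division by zero for empty rows
    unit = vecs / norms[:, None]
    sims = unit @ unit[terms.index(query)]
    order = np.argsort(-sims)               # most similar first
    return [(terms[i], float(sims[i])) for i in order if terms[i] != query][:top]


# Hypothetical term-document counts; terms "a" and "b" have identical rows.
terms = ["a", "b", "c", "d"]
X = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1],
     [0, 1, 1]]
result = nearest_terms(X, terms, "a")
print(result)
```

Terms with identical distributions over the documents end up with identical vectors, so "b" is ranked first for the query "a".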


Problems and questions

• How can we interpret the similarities as different kinds of relations?

• How can we include document structure and phrases in the model?

• Terms are not really terms, but just words

• Ambiguous terms pollute the vector space

• How could we find the optimal number of dimensions for the vector space?


An example based on 50 000 newspaper articles

stefan edberg
edberg 0.918
cincinnatis 0.887
edbergs 0.883
världsfemman 0.883
stefans 0.883
tennisspelarna 0.863
stefan 0.861
turneringsseger 0.859
queensturneringen 0.858
växjöspelaren 0.852
grästurnering 0.847

bengt johansson
johansson 0.852
johanssons 0.704
bengt 0.678
centerledare 0.674
miljöcentern 0.667
landsbygdscentern 0.667
implikationer 0.645
ickesocialistisk 0.643
centerledaren 0.627
regeringsalternativet 0.620
vagare 0.616


Bengt Johansson is just Bengt + Johansson – something is missing!

bengt 1.000
westerberg 0.912
folkpartiledaren 0.899
westerbergs 0.893
fpledaren 0.864
socialminister 0.862
försvarsfrågorna 0.860
socialministern 0.841
måndagsresor 0.840
bulldozer 0.838
skattesubventionerade 0.833
barnomsorgsgaranti 0.829

johansson 1.000
johanssons 0.800
olof 0.684
centerledaren 0.673
valperiod 0.668
centerledarens 0.654
betongpolitiken 0.650
downhill 0.640
centerfamiljen 0.635
centerinflytande 0.634
brokrisen 0.632
gödslet 0.628


A small experiment

• I want the model to know the difference between Bengt and Bengt (for example Bengt Johansson versus Bengt Westerberg)

1. Make a frequency list for all n-tuples up to n=5 with a frequency>1

2. Keep all words in the bags, but add the tuples, with space replaced by -, as words

3. Run the LSI again

• Now bengt-johansson is a word, and bengt-johansson is NOT Bengt + Johansson

• Number of terms grows a lot!
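Steps 1 and 2 can be sketched as below. This is an illustration under stated assumptions: tuples are contiguous word sequences of length 2 to 5, the frequency is counted over the whole collection, and the function names `ngrams` and `add_tuples` are hypothetical.

```python
from collections import Counter


def ngrams(words, max_n=5):
    """Yield all contiguous word tuples of length 2..max_n."""
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def add_tuples(documents, max_n=5, min_freq=2):
    """Keep every word and append each frequent tuple as a hyphen-joined word."""
    # Step 1: frequency list over the whole collection (ASSUMPTION: corpus-wide counts).
    freq = Counter(g for doc in documents for g in ngrams(doc, max_n))
    keep = {g for g, count in freq.items() if count >= min_freq}
    # Step 2: original words stay in the bag; frequent tuples are added as new words.
    return [doc + ["-".join(g) for g in ngrams(doc, max_n) if g in keep]
            for doc in documents]


docs = [["bengt", "johansson", "coach"],
        ["bengt", "johansson", "won"],
        ["bengt", "westerberg"]]
augmented = add_tuples(docs)
print(augmented[0])  # ['bengt', 'johansson', 'coach', 'bengt-johansson']
```

Step 3 is then an ordinary LSI run on the augmented bags, where `bengt-johansson` gets its own row in the term-document matrix.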


And the top list for Bengt-Johansson

bengt-johansson 1.000
dubbellandskamperna 0.954
pettersson-sävehof 0.952
kristina-jönsson 0.950
fanns-svenska-glädjeämnen 0.945
johan-pettersson-sävehof 0.942
martinsson-karlskrona 0.938
förbundskaptenen-bengt-bengan-johansson 0.932
förbundskaptenen-bengt-bengan 0.932
sjumålsskytt 0.931
svenska-damhandbollslandslaget 0.928
stankiewicz 0.926
em-par 0.925
västeråslaget 0.923
jan-stankiewicz 0.923
handbollslandslag 0.922
bengt-johansson-tt 0.921
st-petersburg-sverige 0.921
petersburg-sverige 0.921
sjuklistan 0.920
olsson-givetvis 0.920
emtruppen 0.919
…
johansson 0.567
bengt 0.354
olof 0.181
centerledaren 0.146
westerberg 0.061
folkpartiledaren 0.052


The new vector space model

• It is clear that it is now possible to find terms closely related to Bengt Johansson – the handball coach

• But is the model better for single words and for document comparison as well?

What do you think?

• More “words” than before – hopefully it improves the result just as more data does

• At least no reason for a worse result... Or?


An example document

REGERINGSKRIS ELLER INTE PARTILEDARNA I SISTAMINUTEN ÖVERLÄGGNINGAR OM BRON Under onsdagskvällen satt partiledarna i regeringen i sista minutenöverläggningar om Öresundsbron Centerledaren Olof Johansson var den förste som lämnade överläggningarna På torsdagen ska regeringen ge ett besked Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron …


Closest terms in each model

Tuples added:
0.986 underkänner
0.982 irhammar
0.977 partiledarna
0.970 godkände
0.962 delade-meningar
0.960 regeringssammanträde
0.957 riksdagsledamot
0.957 bengt-westerberg
0.954 materialet
0.952 diskuterade
0.950 folkpartiledaren
0.949 medierna
0.947 motsättningarna
0.946 vilar
0.944 socialminister-bengt-westerberg

Basic model:
0.967 partiledarna
0.921 miljökrav
0.921 underkänner
0.918 tolkar
0.897 meningar
0.888 centerledaren
0.886 regeringssammanträde
0.880 slottet
0.880 rosenbad
0.877 planminister
0.866 folkpartiledaren
0.855 thurdin
0.845 brokonsortiet
0.839 görel
0.826 irhammar


Closest document in both models

BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR Det blir ett besked under torsdagen men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna Men nu menar Bildt att centern lämnar sned information utåt Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen …


Doc     Basic model     Tuples added
        Score  Rank     Score  Rank
2126    1.000     1     1.000     1
2127     .996     2      .999     2
2128     .848     5      .677     3
3767     .849     3      .534     7
211      .805     8      .526     8
156      .844     6      .525     9
215      .805     9      .522    10
2602     .848     4      .492    12
2367     .804    10      .434    19
2360     .838     7      .402    23
3481     .527    53      .673     4
1567     .456    73      .601     5
1371     .456    73      .601     5


Documents with better ranking in the basic model

Doc 2602: basic model .848 (rank 4), tuples added .492 (rank 12)

BRON KAN BLI VALFRÅGA SÄGER JOHANSSON Om det lutar åt ett ja i regeringen av politiska skäl då är naturligtvis den här frågan en viktig valfråga …

Doc 2367: basic model .804 (rank 10), tuples added .434 (rank 19)

INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET En etappseger för miljön och centern En eloge till Olof Johansson Görel Thurdin och Carl Bildt …


Documents with better ranking in the phrase (tuple) model

Doc 1567: basic model .456 (rank 73), tuples added .601 (rank 5)

ALF SVENSSON TOPPNAMN I STOCKHOLM Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad och Michael Stjernström sakkunnig i statsrådsberedningen har en valbar andra plats …

Doc 1371: basic model .456 (rank 74), tuples added .601 (rank 6)

BENGT WESTERBERG BARNPORREN MÅSTE STOPPAS Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …


Hmm, adding n-grams was maybe too simple...

1. If the bad result is due to overtraining, it could help to remove the words I build phrases from…

2. Another way to try is to use a dependency parser to find more meaningful phrases, not just n-grams

A new test following 1 above:


Ok, the words inside tuples are now removed

bengt-johansson 1.000

tomas-svensson 0.931

sveriges-handbollslandslag 0.912

förbundskapten-bengt-johansson 0.898

handboll 0.897

svensk-handboll 0.896

handbollsem 0.894

carlen 0.883

lagkaptenen-carlen 0.869

förbundskapten-johansson 0.863

ola-lindgren 0.863

bengan-johansson 0.862

erik-hajas 0.854

mats-olsson 0.854

carlen-magnus-wislander 0.852

handbollens 0.851

magnus-andersson 0.851

halvlek-svenskarna 0.849

teka-santander 0.849

storskyttarna 0.849

förbundskaptenen-bengt-johansson 0.845

målvakten-mats-olsson 0.845

danmark-tvåa 0.843

handbollsspelare 0.839

sveriges-handbollsherrar 0.836

lag-ibland 0.835


And now pseudo documents are added for each tuple

bengt-johansson 1.000
förbundskapten-bengt-johansson 0.907
förbundskaptenen-bengt-johansson 0.835
jonas-johansson 0.816
förbundskapten-johansson 0.799
johanssons 0.795
svenske-förbundskaptenen-bengt-johansson 0.792
bengan 0.786
carlen 0.777
bengan-johansson 0.767
johansson-andreas-dackell 0.765
förlorat-matcherna 0.750
ck-bure 0.748
daniel-johansson 0.748
målvakten-mats-olsson 0.747
jörgen-jönsson-mikael-johansson 0.744
kicki-johansson 0.744
mattias-johansson-aik 0.741
thomas-johansson 0.739
handbollsnation 0.738
mikael-johansson 0.737
förbundskaptenen-bengt-johansson-valde 0.736
johansson-mats-olsson 0.736
sveriges-handbollslandslag 0.736
ställningen-33-matcher 0.736


What I still have to do something about

• Find a better LSI/SVD package than the one I have (old C code from 1990), or maybe write it myself...

• Get the phrases into the model in some way

When these things are done I could:

• Try to interpret various relations from similarities in a vector space model

• Try to solve the “optimal number of dimensions” problem

• Explore what the length of the vectors means