


UNIVERSITY OF PATRAS

TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections

D. Zeimpekis and E. Gallopoulos

January 2005

Technical Report HPCLAB-SCG 1/01-05
LABORATORY: High Performance Information Systems Laboratory
GRANTS: University of Patras "Karatheodori" Grant; Bodossaki Foundation Graduate Fellowship.
REPOSITORY: http://scgroup.hpclab.ceid.upatras.gr
REF: Final version of paper will appear in Grouping Multidimensional Data: Recent Advances in Clustering, J. Kogan and C. Nicholas, eds., Springer.

COMPUTER ENGINEERING & INFORMATICS DEPARTMENT, UNIVERSITY OF PATRAS, GR-26500, PATRAS, GREECE. www.ceid.upatras.gr

TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections

Dimitrios Zeimpekis1 and Efstratios Gallopoulos2

1 Department of Computer Engineering and Informatics, University of Patras, [email protected]

2 Department of Computer Engineering and Informatics, University of Patras, [email protected]

Summary. A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that is presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching toolbox for the generation of sparse tdm's from text collections and for the incremental modification of these tdm's by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and prototyping of algorithms for information retrieval. Several design issues that concern the use of MATLAB sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.

1 Introduction

Much of the knowledge available today is stored as text. It is not surprising, therefore, that data mining (DM) and information retrieval (IR) from text collections (text mining) has become an active and exciting research area; see for example [53]. As the vector space model (VSM) and matrix and vector representations are routinely used in DM and IR, it turns out that several performance critical kernels in these areas originate from computational linear algebra (CLA). Consider, for example, two typical operations: clustering and querying in the context of the VSM. Algorithms that implement them rely on modules standing at various levels in the hierarchy of linear algebra computations, from inner products to eigenvalue and singular value decompositions (SVD). As a result, the fields of DM and IR have been providing the ground for synergistic efforts between application specialists and researchers in CLA. The latter researchers understand the intricacies of designing effective matrix computations on the modern computer system platforms commonly used in DM and IR, and contribute to the design of performance critical kernels for DM and IR algorithms; see for example [14, 12, 16, 19, 26, 38, 39].


This chapter presents TMG, a toolbox that helps the user in the two major phases of the VSM: the preprocessing, "indexing" phase, in which the index of terms is built, and the "search" phase, during which the index is used in the course of queries and other operations. In particular, TMG preprocesses documents to construct an index in the form of a sparse "term-document matrix", hereafter abbreviated by "tdm", and preprocesses user queries so as to make them ready for the application of an IR model. TMG is specifically oriented to the application of vector space techniques (see e.g. [51, 24, 45]) that model documents as term vectors so that many IR tasks can be cast in terms of CLA. We will use the convention that m×n matrices represent tdm's of n documents over an index of m terms. In view of the significant presence of CLA kernels in vector space techniques for IR, we felt that there was a "market need" for a MATLAB-based tdm generation system, as MATLAB is a highly popular problem solving environment for CLA that enables the rapid prototyping of novel IR algorithms [5]. Therefore, TMG is written entirely in MATLAB and runs on any computer system that supports that environment. Even though MATLAB started as a "Matrix Laboratory", it is now equipped with a large number of facilities including data structures, functions, visualization and interface building tools that make possible the rapid synthesis of entire suites of special purpose algorithms. Furthermore, it claims a very large user base that continuously contributes new software (such as TMG) on the Web(3). See [29, 34], for example, for toolboxes that specialize in operations related to IR, e.g. algorithms based on spectral analysis of sparse matrices encapsulating graph structures.

TMG parses single files or entire directories of multiple files containing text, performs the necessary preprocessing and constructs a tdm according to parameters set by the user. It is also able to renew existing tdm's by performing efficient updates or downdates corresponding to the incorporation of new or deletion of existing documents.

We must emphasize that two critical components for TMG's operations are MATLAB's sparse matrix infrastructure and visualization tools. TMG can be used to complement algorithms and tools that work with tdm's, e.g. [8, 14, 12, 19, 26, 41]. For example, TMG was used in recent experiments included in [56, 57, 40, 48] and has already been requested by many researchers. We also expect TMG to be useful in instructional settings, by helping to create motivating examples in CLA and IR courses.

TMG is not tied to specific algorithms of the vector space model but includes, for convenience, MATLAB code for querying and clustering that can be used as a template by users who want to perform such tasks with TMG. Interfacing TMG with codes in other languages is quite straightforward as there are several utilities for converting the objects in the underlying MATLAB storage class to other formats. The CLUTO clustering toolkit ([37]), for example, inputs ASCII files containing the compressed sparse row (CSR) representation of matrices that can be obtained from TMG, while SDDPACK ([41]) provides MATLAB routines for converting to and from the Matrix Market format [18].
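As an illustration of such interfacing, here is a minimal sketch that assumes the Matrix Market m-files mmwrite/mmread (distributed with the Matrix Market repository and with SDDPACK) are on the path; the filename is illustrative:

% Export the sparse tdm A for use by non-MATLAB tools, then read it back.
mmwrite('tdm.mtx', A);   % write A in Matrix Market coordinate format
B = mmread('tdm.mtx');   % recover the matrix; B should equal A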

3 See, for example, the Link Exchange Center, www.mathworks.com/matlabcentral/fileexchange


TMG was designed to provide several term-weighting options to the user [11, 42], as well as the possibility of stemming. In addition to describing the design of the tool, we also report herein on the application of TMG as a preprocessor for IR tasks and its combination with a variety of term-weighting functions and stemming.

This chapter is organized as follows. In the rest of this section we briefly review some related efforts. Section 2 presents TMG, describing all its core functions, including the graphical user interface (GUI), and analyzing the various options that are provided in the package. Section 3 describes implementation issues, in particular the utilization of MATLAB's sparse matrix technology. Section 4 demonstrates the use of TMG on a public dataset we call BIBBENCH, and compares the performance of some query answering and clustering algorithms based on vector space models using various term-weighting schemes and stemming for data from the MEDLINE, CRANFIELD, CISI(4) and REUTERS-21578(5) collections. Section 5 provides concluding remarks. All numerical experiments were conducted on a 3GHz Pentium 4 PC with 512MB RAM running Windows XP and MATLAB 7.0. Runtimes were measured using MATLAB infrastructure for performance analysis, specifically profile and the timing functions tic, toc.

Related work

There exist already several tools for constructing tdm's, since IR systems that are based on vector space techniques (e.g. Latent Semantic Indexing, hereafter abbreviated as LSI) typically operate on rows and columns of such matrices; see e.g. [3, 45, 49]. The Telcordia LSI Engine, for example, is a production-level IR architecture that contains components for generating sparse tdm's [22, 7] from text collections. Lemur [4] is a popular language modeling and IR toolkit written in C++. A recent powerful system that we have found to be particularly effective is the General Text Parser (GTP) ([2, 30]). GTP is a complete IR package written in C++ and Java and employing LSI; we used it to evaluate the results obtained from TMG. PGTP is an MPI-based parallel version of GTP. Other tools that one can find in the open literature are DOC2MAT [1], written in Perl and developed in the context of the CLUTO IR package [37]; MC [6], written in C++ [25, 26]; and the Unix shell script utility countallwords included in the PDDP package [20]. The above tools are implemented in high level or scripting languages (e.g. C, C++, Java, Perl). It is fair to say at the outset, and will become clear from our description, that TMG's current design is best suited for datasets of moderate size. For very large datasets, one would be better served by systems such as GTP.

4 Available from ftp://ftp.cs.cornell.edu/pub/smart/
5 Available from http://kdd.ics.uci.edu/


Table 1. Steps in function tmg.

Function tmg
Input:  filename, OPTIONS
Output: tdm, dictionary, and several optional outputs
  parse files or input directory;
  read the stoplist;
  for each input file,
      parse the file (construct dictionary);
  end
  normalize the dictionary (remove stopwords and too long or too short terms; stemming);
  construct tdm;
  remove terms as per frequency parameters;
  compute global weights;
  apply local weighting function;
  form final tdm;

2 The Text to Matrix Generator

2.1 Outline

TMG is constructed to perform preprocessing and filtering steps that are typically performed in textual IR [10] (in parentheses are the names of the relevant MATLAB m-functions):

- Creation of the tdm corresponding to a set of documents (tmg);
- creation of query vectors from user input (tmg_query);
- update of an existing tdm by incorporation of new documents (tdm_update);
- downdate of an existing tdm by deletion of specified documents (tdm_downdate).
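For orientation, a minimal usage sketch of the first two calls (the file and directory names are illustrative, and most output arguments are omitted; Tables 3 and 4 give the full argument lists):

% Build a tdm from a directory of text files and vectorize a query file.
OPTIONS.stoplist = 'common_words';                  % illustrative stoplist file
[A, dictionary] = tmg('docs_dir', OPTIONS);         % sparse tdm and term list
Q = tmg_query('queries.txt', dictionary, OPTIONS);  % sparse query vectors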

The document preprocessing steps encoded by TMG are the following: i) lexical analysis; ii) stopword elimination; iii) stemming; iv) index-term selection; v) index construction. These steps are tabulated in Table 1.

Each element, $\alpha_{ij}$, of a tdm can be expressed as

$$\alpha_{ij} = l_{ij}\, g_i\, n_{ij}, \qquad (1)$$

where $l_{ij}$ is a local factor that measures the importance of term $i$ in document $j$, $g_i$ is a global factor that measures the importance of term $i$ in the entire collection, and $n_{ij}$ is a normalization factor [50]. This latter is used to moderate bias towards longer documents [52]. The local and global term weighting and normalization options available in TMG are listed in Table 2. Symbol $f_{ij}$ denotes term frequency, i.e. the number of times term $i$ appears in document $j$; also,

$$p_{ij} = \frac{f_{ij}}{\sum_k f_{ik}}, \qquad b(f_{ij}) = \begin{cases} 1, & \text{if } f_{ij} \neq 0, \\ 0, & \text{if } f_{ij} = 0. \end{cases}$$

It must be noted that TMG does not restrict the separating delimiter to be an end-of-file character; hence the number of documents corresponding to the collection would be at least as large as the actual number of (valid) files processed by TMG.


Table 2. Term-weighting and normalization schemes [11, 23, 42, 50].

Symbol  Name                                  Type

Local term-weighting ($l_{ij}$)
t       Term frequency                        $f_{ij}$
b       Binary                                $b(f_{ij})$
l       Logarithmic                           $\log_2(1 + f_{ij})$
a       Alternate log [42]                    $b(f_{ij})(1 + \log_2 f_{ij})$
n       Augmented normalized term frequency   $(b(f_{ij}) + (f_{ij}/\max_k f_{kj}))/2$

Global term-weighting ($g_i$)
x       None                                  $1$
e       Entropy                               $1 + \sum_j (p_{ij} \log_2 p_{ij}) / \log_2 n$
f       Inverse document frequency (IDF)      $\log_2(n / \sum_j b(f_{ij}))$
g       GfIdf                                 $(\sum_j f_{ij}) / (\sum_j b(f_{ij}))$
n       Normal                                $1 / \sqrt{\sum_j f_{ij}^2}$
p       Probabilistic inverse                 $\log_2((n - \sum_j b(f_{ij})) / \sum_j b(f_{ij}))$

Normalization factor ($n_{ij}$)
x       None                                  $1$
c       Cosine                                $(\sum_i (g_i l_{ij})^2)^{-1/2}$


2.2 User interface

The user interacts with TMG by means of any of the aforementioned MATLAB functions or via a graphical interface (GUI), implemented as function tmg_gui. The GUI facilitates user selection of the appropriate options among the many alternatives available at the command-line level. A user that desires to construct a tdm from text will use either tmg or tmg_gui. The specific invocation of the former is of the form:

outargs = tmg('fname', OPTIONS);

whereoutargs stands for the output list:

[A, dictionary, global wts, norml factors, words per doc, titles, files,

update struct] .

The tdm is stored as a MATLAB sparse double array A, while dictionary is a char array containing the collection's distinct words, and update_struct contains the essential information for the collection's renewal (see Section 3.3). The other output arguments store statistics for the collection. The full list of output arguments is tabulated in Table 4.

Argument fname specifies the individual file(s) to be processed or the directory name that contains them. In the latter case, TMG recursively processes included subdirectories and files. It is assumed that all files contain valid data.


Table 3. OPTIONS fields.

delimiter        String specifying the "end-of-document" marker for tmg. Possible values are emptyline (default), none_delimiter (treats each file as a single document) or any other string
line_delimiter   Variable specifying if the delimiter takes a whole line of text (default, 1)
stoplist         Filename for stopwords (default no name, meaning no stopword removal)
stemming         A flag that indicates if stemming is to be applied (1) or not (0) (default stemming = 0)
min_length       Minimum term length (default 3)
max_length       Maximum term length (default 30)
min_local_freq   Minimum term local frequency (default 1)
max_local_freq   Maximum term local frequency (default Inf)
min_global_freq  Minimum number of documents in which a term must appear for it to be inserted in the dictionary (default 1)
max_global_freq  Maximum number of documents in which a term may appear for it to be inserted in the dictionary (default Inf)
local_weight     Local term weighting function (default 't'; possible values 't', 'b', 'l', 'a', 'n')
global_weight    Global term weighting function (default no global weighting used, 'x'; possible values 'x', 'e', 'f', 'g', 'n', 'p')
normalization    Flag specifying if document vectors are to be normalized ('c') or not ('x', default)
dsp              Flag specifying if results are to be printed in the command window (1, default) or not (other)

In particular, files are assumed to contain plain ASCII text, or a special filter that can convert them to such a format must be available and properly linked to TMG. Currently, TMG can process Adobe Acrobat PDF and POSTSCRIPT documents provided Ghostscript's ps2ascii utility is available. Filenames suffixed with html or htm are assumed to be ASCII files with HTML markup; TMG processes them by stripping the corresponding tags using the strip_html function.

The options available at the command line to the user of tmg are set via the fields of the MATLAB OPTIONS structure tabulated in Table 3. Field delimiter specifies the delimiter that separates individual documents within the same file. The default delimiter is a blank line, in which case TMG is likely to generate more "documents" than the number of files given as input. Field line_delimiter specifies if the delimiter takes a whole line of text. Field stoplist specifies the file containing the stopwords, i.e. the terms excluded from the collection's dictionary [11]. The current release of TMG contains a stoplist obtained from GTP [2]. Field stemming indicates whether stemming is to be used; this is performed by stemmer, our MATLAB implementation of a modified version of Porter's algorithm [47, 46]. stemmer can also be called directly from the command line.


Table 4. TMG outputs.

A              resulting tdm;
dictionary     collection's dictionary (char array);
global_wts     vector of global weights;
norml_factors  vector of document norms prior to normalization;
words_per_doc  vector containing statistics for each document;
titles         titles of each document (cell array);
files          processed filenames with set title and document's first line (cell array);
update_struct  structure containing necessary data for renewal;

To validate our implementation, we compared our results and verified that they coincided with the word list and corresponding stems listed in [46].

In the current version of TMG, stopword removal takes place before stemming. Therefore, care is required when adding terms to the stoplist, as we might also need to provide their variants as well. In particular, if in a bibliography file we wish to dispose of the words "author" and "authors", we would need to add both to the stoplist. It is easy to alter this in TMG so as to apply stemming to the stoplist as well as to the dictionary. One disadvantage is that this could lead to the removal of terms that share the same stem with a stopword. Another option that would be easy to incorporate in TMG is to use two stoplists: one containing basic stopwords and the other, a stoplist generator, for domain-specific terms that would be useful to preprocess by stemming and then use as a stoplist. Overall, given the current options in TMG, it is not difficult to enrich the current filtering steps to help process "dirty text" containing typos, ad hoc abbreviations, special symbols, etc. [21].

Parameters min_length and max_length are thresholds used to exclude terms that are out of range; e.g. terms that are too short are likely to be of little value in indexing, while very long ones are likely to be misprints. Parameters min_local_freq, max_local_freq, min_global_freq and max_global_freq are also filtering parameters, thresholding based on frequency of occurrence. The last OPTIONS field, dsp, indicates if the intermediate results are printed on the MATLAB command window.

Function tmg_query uses the dictionary returned by tmg and constructs, using the same processing steps as TMG, a "term-query" array whose columns are the (sparse) query vectors for the text collection. The function is invoked as follows:

[Q, wds_per_query, titles, files] = tmg_query('fname', dictionary, OPTIONS);

Here, OPTIONS contains fields that are a subset of those used in tmg; for details see the code documentation.
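Under the VSM, with cosine-normalized document and query vectors, retrieval then reduces to a sparse matrix-vector product. A hedged sketch (the ranking code below is ours, not a TMG routine):

% Rank documents by similarity to the first query vector in Q.
q = Q(:, 1);                          % one sparse query vector
scores = full(A' * q);                % inner products with all documents
[s, rank] = sort(scores, 'descend');  % rank(1) is the best-matching document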

Graphical User Interface

As described thus far, the main toolbox functions tmg and tmg_query offer a large number of options. Moreover, it is anticipated that future releases will further increase this number to allow for additional flexibility in operations and filetypes handled by TMG. In view of this, a GUI, which we call TMG_GUI and which is depicted in Fig. 1, was created to facilitate interaction. It is instantiated by means of function tmg_gui. The GUI design was facilitated by the interactive MATLAB tool GUIDE. TMG_GUI consists of two frames: one provides a set of four mutually exclusive radio buttons, corresponding to the basic functions of TMG, along with a set of radio buttons, edit boxes, lists and toggle buttons for all required input arguments; the other provides a set of items for the optional arguments of tmg, tmg_query and the update routines. After all necessary parameters have been specified and the Continue button is clicked, TMG_GUI invokes the appropriate function. The progress of the program is shown on the screen; upon finishing, the user is asked if and where the results should be saved; results are saved in MATLAB mat-file(s), i.e. the file format used by MATLAB for saving and exchanging data.

Fig. 1. TMG_GUI.


3 Implementation issues

We next address some issues that relate to design choices made regarding the algorithms and data structures used in the tool. Overall, TMG's efficiency is greatly aided by the use of MATLAB's sparse matrix infrastructure and an effective implementation of inverted indexes.

3.1 Sparse matrix technology

One important goal in the design of TMG was to employ data structures that would be efficient regarding i) the costs of creating and updating them, ii) the overall storage requirements, and iii) the processing of the kernel IR operations. Tdm's are usually extremely sparse; e.g. see Table 10, which tabulates the statistics for some well-known collections used to benchmark IR algorithms, and Table 7 for the statistics for our BIBBENCH collection: approximately 98% or more of the entries of the corresponding tdm's are zero. Therefore, a natural object for representing tdm's is the sparse matrix. Indeed, with the current popularity of VSM-based techniques, sparse matrix representations have become popular in IR and are the subject of investigation; see e.g. [31, 36]. It is worth noting that recent studies suggest that sparse matrices are preferable for IR over other implementations, such as inverted indexes [31, 32]. Inverted indexes, for example, complicate the implementation of non-Boolean searches and dimensionality reduction transformations that are at the core of LSI [9]. Nonetheless, TMG employs an inverted index as an intermediate data structure to aid in the assembly of the sparse tdm.

After parsing the collection (cf. Section 3.2) and cleaning and stemming the dictionary, each cell array for the posting list is copied to another, each element of which is a MATLAB sparse column vector of size n. This latter array is finally converted to the sparse tdm using function cell2mat.

MATLAB provides an effective environment for sparse computations built around the concept of a "sparse array", a special MATLAB class that economizes storage and operations by utilizing mature technology; see [27, 28] for an excellent, early, technical description. Sparse matrices in MATLAB are stored internally in the well-known compressed sparse column (CSC) format, which formally consists of two arrays of length equal to the number of nonzero entries, nnz (one of reals, containing the values of the matrix elements in column major order, and one of integers, containing the corresponding row indices), and an array of size n+1 containing an integer index into the previous two arrays indicating the location of the leading nonzero entry of each column, with the value of nnz at the last position. Actually, upon the creation of a sparse matrix, MATLAB uses an estimate, nzmax(A), for the number of its nonzeros (equal to or larger than the actual value of nnz) and allocates enough storage to store the matrix in the above format [27]. Current versions of MATLAB use 8 byte reals and 4 byte integers, so that the total workspace occupied by a sparse non-square tdm A is Mem(A) = 12 nzmax(A) + 4(n+1) bytes. Therefore, for non-square matrices, the space requirements are asymmetric, in the sense that Mem(A) ≠ Mem(A^T), though the storage difference is 4|m−n|, which is small relative to the total storage required.
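A small sketch that checks this estimate; it assumes the 8-byte-real/4-byte-integer sparse layout of the MATLAB versions contemporary with this report (builds with wider integer indices will report larger sizes):

% Compare whos-reported bytes with Mem(A) = 12*nzmax(A) + 4*(n+1).
A = sprand(2000, 500, 0.01);              % a random sparse stand-in for a tdm
s = whos('A');
est = 12*nzmax(A) + 4*(size(A, 2) + 1);   % the estimate derived above
fprintf('reported: %d bytes, estimated: %d bytes\n', s.bytes, est);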


Table 5. MATLAB commands to build the tdm A for scheme lnc from the frequency table F = (f_ij).

[i, j, L] = find(F); L = log2(L+1);
A = sparse(i, j, L, size(F,1), size(F,2));
A = spdiags(1./sqrt(sum(F.^2,2)), 0, size(F,1), size(F,1))*A;
A = A*spdiags(1./sqrt(sum(A.^2,1))', 0, size(A,2), size(A,2));

By expressing and coding the more intensive manipulations in the TMG toolbox in terms of MATLAB sparse operations, the cost of operating on tdm's becomes proportional to the number of real arithmetic operations on nonzero elements or to the data size of the tdm (that is, the size of the output and of the input participating non-trivially in the computation of the output), whichever is larger. Formula (1), for example, implies that the tdm can be obtained from the application of element-by-element operations on the sparse matrix containing the frequencies f_ij to obtain the local weights, followed by left multiplication with the diagonal matrix (in sparse format) containing the global weights g_i, and by the normalization. Table 5, for example, shows MATLAB statements for building the tdm for scheme lnc. New term weighting formulas (e.g. [23]) can easily be programmed in the system.
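For instance, a hedged sketch in the style of Table 5 for the tfc scheme (term frequency local weight, IDF global weight, cosine normalization), starting again from the frequency table F = (f_ij):

% Build the tfc tdm from the raw frequency matrix F.
[m, n] = size(F);
g = log2(n ./ full(sum(F ~= 0, 2)));   % IDF global weights g_i
A = spdiags(g, 0, m, m) * F;           % scale row i by g_i
d = full(sqrt(sum(A.^2, 1)))';         % document (column) norms
A = A * spdiags(1./d, 0, n, n);        % cosine-normalize the columns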

It is worth noting here that had we opted to build the target tdm directly as a sparse matrix in the course of the reading phase, it would have necessitated fast updates (creating new rows and columns, changing individual elements), which would have been inefficient, especially in the absence of a good a priori estimate of the matrix size and total number of nonzeros.

As already mentioned, sparse representations are employed by other systems as well. GTP and the Telcordia LSI Engine, for example, use the Harwell-Boeing format ([2, 30, 22]), while the MC toolkit ([6]) also uses the CSC format. The authors of [31] use the compressed sparse row (CSR) format to store "document-term" matrices instead; this, of course, is equivalent to our approach. On the other hand, the experiments in [36] assume a CSR representation for term-document matrices.

We next experimentally illustrate the dependence of TMG's runtime on aspects of the dataset size. In this section as well as in Section 4, we experimented with datasets created from the REUTERS-21578 collection. We kept only those texts that contained non-empty text bodies and called the resulting set, consisting of 19,042 documents, REUT-ALL. We then organized the collection in 22 files that we labeled REUTi, where i = 1, ..., 22. In the sequel, we use the notation REUT[i:j] to denote the dataset consisting of files REUTi up to and including REUTj.

Fig. 2 shows the runtimes of TMG to build tdm's from each of the 22 file collections REUT1, REUT[1:2], up to REUT[1:22], vs. the number of nonzero elements and the number of documents. The figure suggests that the time taken by TMG depends linearly on the number of nonzeros of the tdm. The dependence also appears to be linear in the number of documents. We also illustrate the performance of two kernel CLA operations for IR, specifically matrix-vector multiplication and the computation of the largest singular value and corresponding singular vectors using the native MATLAB function for sparse SVD (svds); the latter is based on the implicitly restarted Arnoldi method [44]. Results are shown in Fig. 3.


Fig. 2. TMG runtimes (sec) vs. the number of nonzeros in the tdm (left); vs. the number of documents (right).


Fig. 3. Runtimes (sec) of matrix-vector multiplication (left) and SVD (right), where the matrix is the tdm constructed by TMG, vs. the number of nonzeros in the tdm.


3.2 Dictionary construction

Central to the operation of TMG are the steps of document parsing and dictionary construction. TMG reads each document using function strread. This returns the tokens present in its input char array in a cell array of chars. All distinct terms present in the document are then obtained in sorted order via function unique. At the same time, the procedure creates a (local) posting list for these terms, that is, pairs containing the number of occurrences and the document identifier for each term. Assuming that we keep a "running inverted index" of all documents processed up to step i−1, we can apply the procedure iteratively as follows: at steps i = 2, ..., we first create the local term vector and posting list and then use it to update the running inverted index. One weakness of this approach is that it requires as many calls to functions unique and ismember as there are documents, something that we found to be very time consuming.


Fig. 4. TMG runtimes (sec) over the update_step parameter N.

Another approach would be to proceed by appending to the dictionary's cell array the new terms in each document and to keep track of the document indices containing each word. This would necessitate only one call to unique to form the inverted index, but at a high cost in memory, since we would first need to store all tokens in the collection. The memory penalty is further accentuated by the fact that cell data structures have a higher memory overhead than sparse numeric arrays. Based on the above observations, we designed a simple but effective scheme to construct the inverted index. In particular, we still use a running inverted index but update it using a block inverted index consisting of data from N documents at a time. The functions that implement the above operations are called unique_words and merge_dictionary, while we use update_step to designate the block size N. Selecting N = n or N = 1, the above approach reduces to the first and second of the aforementioned methods, respectively. Fig. 4 depicts the runtimes of TMG for the REUT-ALL collection, for N = [10, 20, 50, 100, 200, 500, 1000:1000:10000]. As we see, TMG's performance peaks at intermediate values of N. Finally, although it is not clear from this figure, larger values of N can further increase runtime because of disk accesses. The effects of this blocked approach to building the dictionary, together with the incremental approach to constructing the tdm presented in Section 3.3, are the subject of our current investigations towards better tuning of TMG's performance. Overall, however, because TMG does not currently implement text or index compression (see e.g. [10, 54]), it is better suited for datasets of moderate size.
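The following toy sketch conveys the idea of the blocked merge; the loop body is illustrative rather than TMG's actual unique_words/merge_dictionary code, and posting lists are omitted for brevity:

% Merge the dictionaries of successive blocks of N documents into a running one.
docs = {'sparse matrix toolbox', 'matrix computation kernels', ...
        'text to matrix generator'};              % toy "documents"
dict = cell(0, 1);                                % running dictionary
N = 2;                                            % the update_step block size
for s = 1:N:numel(docs)
    blk = docs(s:min(s+N-1, numel(docs)));        % current block
    tokens = cell(0, 1);
    for d = 1:numel(blk)
        tokens = [tokens; strread(blk{d}, '%s')]; % tokenize one document
    end
    blk_dict = unique(tokens);                    % block dictionary
    dict = sort([dict; blk_dict(~ismember(blk_dict, dict))]);
end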

3.3 TMG for document renewal

The efficient updating or downdating of tdm's is of importance as it is key to the maintenance of document collections. It can also lead to important CLA issues related to the design of effective algorithms for fast SVD updates; see for example [45, 55, 58]. In order to retain independence from the underlying VSM, we are concerned here with simple tdm updates that result in a matrix that is identical with the one that would have been created were all documents available from the beginning and TMG applied to all. In other words, we designed updating operations that maintain the integrity of the resulting tdm.


Table 6. GTP and TMG runtimes (sec).

Toolbox  REUT-ALL  MEDLINE  CRANFIELD  CISI
TMG      169.52    8.27     10.05      14.28
GTP      96.84     4.52     8.2        4.96

To this end, TMG includes functions tdm_update and tdm_downdate for modifying an existing tdm so as to take into account document arrival and/or deletion. Any document arrival or deletion is likely to change the size of the tdm as well as specific entries: non-trivial document arrival will certainly change the number of tdm columns and the number and labeling of the rows, because terms satisfying the filtering requirements are encountered or removed, and/or terms in the original dictionary that were excluded by filtering become valid entries. Hence, in order to update correctly, the entire dictionary of terms encountered during parsing (before filtering) must be maintained together with the corresponding term frequencies. This information is also sufficient for the proper update of the tdm when parameters such as the maximum and/or minimum global frequencies change. Therefore, as long as updates are anticipated, when TMG is run the user must select to save these items (TMG_GUI prompts the user accordingly). TMG saves them in a MATLAB structure array, denoted by update_struct. We avoid using one full and one normalized (post-filtering) dictionary by working only with the full dictionary and a vector of indices indicating those terms active in the normalized dictionary, stored in the same structure array. For tdm_downdate, the user specifies the relevant update_struct and a vector of integer indices identifying the documents to remove. We evaluate the performance of renewal in Section 4.2.
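In outline, renewal is then a matter of two calls; a hedged usage sketch (the directory name is illustrative and the exact output lists are abbreviated here; the renewed update_struct is also produced, see the code documentation):

% Incorporate the documents in directory 'new_docs', then remove two documents.
A = tdm_update('new_docs', update_struct);   % updated tdm
A = tdm_downdate(update_struct, [5 17]);     % tdm without documents 5 and 17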

4 Experimental Results

To check the results obtained from TMG, we first used it to build the tdm's from the MEDLINE, CRANFIELD and CISI collections and confirmed that they were the same as those obtained using GTP, except for expected differences due to the fact that the two packages follow contrasting approaches in handling terms containing digits: TMG (resp. GTP) excludes (resp. includes) terms that are solely composed of numeric characters but keeps (resp. drops) words combining letters and numeric characters. We also show, in Table 6, the runtimes of TMG and GTP for the aforementioned datasets as well as for the set REUT-ALL, described in Section 3.

Results with GTP were obtained on a system running Linux with the GCC 2.95 compiler. In view of the fact that tmg consists of MATLAB code, it is quite efficient, albeit slower than GTP. Furthermore, as mentioned earlier, GTP's lead is expected to increase for very large datasets.


Table 7. BIBBENCH dataset (scheme txx_s).

feature                         BEC    BKN    GVL   BIBBENCH
documents                       1,590  651    860   3,101
terms (indexing)                1,712  1,159  720   3,154
stemmed terms                   372    389    221   964
avg. terms/document             41     38     28    37
avg. terms/document (indexing)  13     13     8.40  12
tdm nonzeros (%)                0.74   1.00   1.15  0.36

4.1 TheBIBBENCH dataset

To illustrate the use of TMG we created a new dataset, which we call BIBBENCH, consisting of three source files from publicly accessible bibliographies(6) in BIBTEX (the bibliography format for LaTeX documents), with characteristics shown in Table 7. The first, which we call BKN, is a 651-entry bibliography contained in this book, though loaded sometime before printing and therefore not corresponding exactly to the final edition. The major theme is, of course, clustering. The second bibliography, BEC, is from http://jilawww.colorado.edu/bec/bib/, a repository of BIBTEX references from topics in Atomic and Condensed Matter Physics on the topic of Bose-Einstein condensation. When downloaded (Jan. 24, 2005), the bibliography contained 1,590 references. The last bibliography, GVL, was downloaded from http://www.netlib.org/bibnet/subjects/ and contains the full 861-item bibliography of the 2nd edition (1989) of a well-known treatise on Matrix Computations [33]. The file was edited to remove the first 350 lines of text, which consisted of irrelevant header information. All files were stored in a directory named BibBench. It is worth noting that, at first approximation, the articles in BEC could be thought of as belonging in one cluster ("physics") whereas those in BKN and GVL belong in another ("linear algebra and information retrieval").

We first used TMG to assemble the aforementioned bibliographies using term weighting, no global weighting, no normalization and stemming (txx_s), thus setting as non-default OPTIONS:

OPTIONS.delimiter = '@'; OPTIONS.line_delimiter = 0;
OPTIONS.stoplist = 'bibcommon_words'; OPTIONS.stemming = 1;
OPTIONS.min_global_freq = 2; OPTIONS.dsp = 0;

Therefore, any words that appeared only once globally were eliminated (this had the effect of eliminating one document from GVL). The remaining packet had 3,101 bibliographical entries across three plain ASCII files: KNB.bib, BEC.bib, GVL.bib. The stoplist file was selected to consist of the same terms found in [30], augmented by keywords utilized in BIBTEX and referring to items that are not useful for indexing, such as author, title, editor, year, abstract, keywords, etc.

The execution of the following commands

6 The bibliographies are directly accessible from the TMG web site.


[A, dictionary, global_wts, norml_factors, words_per_doc, ...
 titles, files, update_struct] = tmg('BibBench', OPTIONS);

has the following effect (output compacted here for pretty-printing):

==================================================================
Applying TMG for file/directory C:\TMG\BibBench...
==================================================================
Results:
==================================================================
Number of documents = 3101
Number of terms = 3154
Avg number terms per document (before normalization) = 36.9987
Avg number of indexing terms per document = 11.9074
Sparsity = 0.362853%
Removed 302 stopwords...
Removed 964 terms using the stemming algorithm...
Removed 2210 numbers...
Removed 288 terms using the term-length thresholds...
Removed 6196 terms using the global thresholds...
Removed 0 elements using the local thresholds...
Removed 0 empty terms...
Removed 1 empty documents...

A simple combination of commands depicts the frequencies of the most frequently occurring terms. After running TMG as above, the commands

f = sum(A, 2); plot(f, '.'); [F, I] = sort(f);
t = 20; dictionary(I(end:-1:end-t+1), :)

plot the frequencies of each term (Fig. 5) and return the top t = 20 terms of highest frequency in the set, listed in decreasing order of occurrence below:

phy rev os condens instein lett atom trap comput algorithm cluster method data ga usa system matrix linear matric mar

We next use TMG to modify the tdm so that it uses a different weighting scheme, specifically tnc, and stemming. This can be done economically with the update_struct computed earlier, as follows:

update_struct.normalization = 'c'; update_struct.global_weight = 'n';
A = tdm_update([], update_struct);

Using MATLAB's spy command we visualize the sparsity structure of the tdm A in Fig. 7 (left). In the sequel we apply two MATLAB functions produced in-house, namely pddp and block_diagonalize. The former implements the PDDP(l) algorithm for clustering of term-document matrices [56]. We used l = 1 and partitioned in two clusters only, so that results are identical with those of the original PDDP algorithm [19]. In particular, classification of each document into one of the two clusters is performed on the basis of the sign of the corresponding element in the maximum right singular vector of the matrix $A - Aee^\top/n$, where $e$ is the vector of all 1's.


Fig. 5. BIBBENCH term frequencies (left); tdm sparsity structure using spy (right).

The MATLAB commands and results are as follows:

>> clusters = pddp(A, 'svds', 'normal', '1', 2);
Running PDDP(1) for k=2...
Using svds method for the SVD...
Splitting node #2 with 3101 documents and 55.558 scatter value
  Leaf 3 with 1583 documents
  Leaf 4 with 1518 documents
Number of empty clusters = 0
PDDP(1) terminated with 2 clusters

Fig. 6 plots the maximum singular vector vmax corresponding to the BIBBENCH dataset. Even though our goal here is not to evaluate clustering algorithms (there is plenty on this matter in other chapters of this volume!), it is worth noting that PDDP was quite good at revealing the two "natural clusters". Fig. 6 shows that there are some documents from BEC (marked with '+') that were classified in the "clustering and matrix computations" cluster and very few documents from BKN and GVL that were classified in the "physics" cluster.
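A hedged sketch of this split criterion, forming the centered matrix densely (sensible only for small examples; PDDP itself avoids this):

% Split documents by the sign of the leading right singular vector
% of the centered tdm A - A*e*e'/n, with e the vector of all 1's.
[m, n] = size(A);
e = ones(n, 1);
C = A - (A * e) * (e' / n);   % centered tdm (dense)
[u, s, v] = svds(C, 1);       % leading singular triplet
clusters = 1 + (v > 0);       % label 1 or 2 for each document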

Finally, function block_diagonalize implements and plots the results of a simple heuristic for row reordering of the term-document matrix based on pddp. In particular, running

>> block_diagonalize(A, clusters);

we obtain Fig. 7 (right). This illustrates the improvement made by the clustering procedure. We note here that experiments of this nature, in the spirit of work described in [17], are expected to be useful for instruction and research, e.g. to visualize the effect of novel reordering schemes. Finally, Table 8 shows the size and the ten most frequent terms (after stemming) for each of the four clusters obtained using PDDP(1).


Fig. 6. Values of each of the 3101 components of the maximum right singular vector vmax of the BIBBENCH dataset vs. their location in the set. The vertical lines that separate the three BIBTEX files and the labels were inserted manually.


Fig. 7. spy view of BIBBENCH tdm's for k = 2 (left) and k = 4 (right) clusters.

There were two "physics" clusters, the theme of another appears to be "linear algebra", while the theme of the last one is "data mining". The terms also reveal the need for better data cleaning [21], e.g. by normalizing or eliminating journal names, restoring terms, etc.: for instance, numermath and siamnum were generated because of non-standard abbreviations of the journals "Numerische Mathematik" and "SIAM Journal of Numerical Analysis".


Table 8. Ten most frequent terms for each of the four clusters of BIBBENCH using PDDP(1). In parentheses are the cluster sizes. We applied stemming but only minimal data cleaning.

I (1,033)   II (553)   III (633)   IV (885)
phy         phy        matric      cluster
rev         instein    numermath   usa
os          rev        matrix      data
condens     condens    eigenvalu   comput
trap        os         siamnuman   mine
instein     lett       symmetr     algorithm
ga          ketterl    linalgapp   york
atom        atom       problem     analysi
lett        optic      linear      parallel
interact    mar        solut       siam

Terms instein and os were generated because of entries such as {E}instein and {B}ose, where the brackets were used in the BIBTEX source to avoid automatic conversion to lower case.

4.2 Performance Evaluation

Renewal experiments

We next evaluate experimentally the performance of TMG when renewing existing tdm's. We first ran TMG on the collection of 19,042 REUT-ALL documents and recorded the total runtime (169.52 sec) for tdm creation. We consider this to be one-pass tdm creation. We then separated the documents in b = 2 groups formed by REUT[1:j] and REUT[(j+1):22], j = 1:21, an "original group" of K documents and an "update group" with the remaining ones. We consider this to be tdm creation in b = 2 passes. We then ran TMG twice, first using tmg to create the tdm for the K documents and then tdm_update. We performed a similar experiment for downdating, removing in each step the second part from the complete collection. Runtimes are summarized in Table 9. We observe that renewal is quite efficient, and in some cases approaches the one-pass creation. In any case, it clearly shows that renewing is much more efficient than recreating the tdm from scratch. Also, the gains from downdating (vs. rebuilding) are even larger. These experiments also suggest that for large datasets, even if the entire document collection is readily available and no further modifications are anticipated, it might be cost effective to build the tdm in multiple (b ≥ 2) passes.

4.3 Evaluating stemming and term-weighting

We next take advantage of the flexibility of TMG to evaluate the effect of different term weighting and normalization schemes and stemming in the context of query answering and clustering with VSM and LSI.


Table 9. Runtimes (sec) for document renewal. Building the collection in one pass took 205.89 sec.

K      tmg    up      Total   down      K      tmg     up     Total   down
925    6.00   271.70  277.70  0.11      10963  88.88   82.52  171.39  0.34
2761   18.00  224.05  242.05  0.13      11893  97.92   70.55  168.47  0.38
3687   24.75  203.24  227.99  0.16      12529  104.64  63.63  168.27  0.44
4584   31.83  185.09  216.92  0.19      13185  110.67  56.14  166.81  0.39
5508   39.70  167.25  206.95  0.17      14109  119.92  46.77  166.69  0.45
6429   47.19  151.63  198.81  0.20      15056  128.92  37.58  166.50  0.48
7319   54.69  137.28  191.97  0.22      15996  139.86  27.16  167.02  0.52
8236   63.22  121.78  185.00  0.25      16903  149.66  19.23  168.89  0.56
9140   71.47  108.53  180.00  0.31      17805  159.58  11.66  171.24  0.61
10051  80.39  93.91   174.30  0.31      18582  166.61  6.78   173.39  0.66

Query answering

Our methodology is similar to that used by several other researchers in CLA methods for IR; see for example [42]. We experimented with all possible schemes available in TMG on standard data collections. In the case of LSI, we used as computational kernel the sparse singular value decomposition (SVD) algorithm implemented by MATLAB's svds function. We note that this is just one of several alternative approaches for the kernel SVD in LSI (cf. [15, 13, 35, 43]) and that TMG facilitates setting up experiments seeking to evaluate their performance. A common metric for the effectiveness of IR models is the N-point interpolated average precision, defined by

$$p = \frac{1}{N} \sum_{i=0}^{N-1} \hat{p}\left(\frac{i}{N-1}\right), \qquad (2)$$

where $\hat{p}(x) = \max\{p_i \mid n_i \geq xr,\ i = 1:r\}$ is the "precision" at "recall" level $x$, $x \in [0,1]$. Precision and recall after $i$ documents have been examined are $p_i = \frac{n_i}{i}$ and $r_i = \frac{n_i}{r}$, respectively, where, for a given query, $n_i$ is the number of relevant documents up to the $i$-th document, and $r$ is the total number of relevant documents. We used this measure with N = 11 (a common choice in IR experiments) for three standard document collections, MEDLINE, CRANFIELD and CISI, whose features, as reported by TMG, are tabulated in Table 10. The stoplist file was the default obtained from GTP. Parameter min_global_freq was set to 2, so terms appearing only once were excluded, and stemming was enabled. As shown in Table 10, stemming causes a significant reduction, up to 36%, in dictionary size.
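A self-contained sketch of (2), assuming the full ranking of the collection is supplied so that every recall level is attainable:

function p = interp_avg_precision(rel, r, N)
% rel : logical vector; rel(i) is true if the i-th retrieved doc is relevant
% r   : total number of relevant documents for the query
% N   : number of recall levels (N = 11 in the experiments here)
nrel = cumsum(rel(:));               % n_i: relevant docs among the first i
prec = nrel ./ (1:numel(rel))';      % p_i = n_i / i
p = 0;
for k = 0:N-1
    x = k / (N-1);                   % recall level
    p = p + max(prec(nrel >= x*r));  % p_hat(x)
end
p = p / N;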

For LSI, the matrix was approximated with the leading 100 singular triplets. As described in Section 2.2, there are 60 possible combinations of term weighting and normalization in constructing the term-document matrix and 30 possible combinations in constructing the query vector. Taking into account the stemming option, there are 3,600 possible parameter combinations. We ran all of them on the aforementioned data collections and recorded the results. It must be noted that this is an exhaustive experiment of considerable magnitude, taking approximately 10 hours of computation.


Table 10. Dataset statistics for "query answering" experiments.

feature                        MEDLINE  CRANFIELD  CISI
documents                      1,033    1,398      1,460
terms (indexing)               5,735    4,563      5,544
terms/document                 157      189        302
terms/document (indexing)      71       92         60
tdm nonzeros (%)               0.86     1.27       0.84
# queries                      30       225        35
terms/query                    22       19         16
terms/query (indexing)         11       9          8
terms (after stemming)         4,241    3,091      3,557
dictionary size reduction (%)  26       32         36

Table 11. VSM precision.

MEDLINE        CRANFIELD      CISI           MEDLINE        CRANFIELD      CISI
ngc.bp_s 58.45 lgc.nf_s 43.27 lpx.lg_s 24.13 ngc.nf_s 58.03 lxc.ne_s 42.93 apx.ag_s 23.69
lgc.bp_s 58.41 lgc.be_s 43.25 lgx.lp_s 24.13 lfc.bg_s 58.01 lxc.bf_s 42.92 agx.ap_s 23.69
ngc.bf_s 58.36 lgc.ne_s 43.23 lpx.tg_s 23.96 ngc.ne_s 58.00 lxc.nf_s 42.87 apx.ng_s 23.59
npc.bg_s 58.35 lgc.bf_s 43.20 lgx.tp_s 23.96 lgc.be_s 57.97 lgc.np_s 42.87 agx.np_s 23.59
ngc.be_s 58.35 ngc.be_s 43.16 lpx.ag_s 23.93 npc.ng_s 57.81 ngc.le_s 42.82 npx.tg_s 23.44
lgc.bf_s 58.20 ngc.bf_s 43.12 lgx.ap_s 23.93 npx.bg_s 57.81 lxc.lf_s 42.82 ngx.tp_s 23.44
lpc.bg_s 58.17 lxc.be_s 43.03 lpx.ng_s 23.81 ngx.bp_s 57.81 agc.nf_s 42.81 ngx.tf_s 23.41
nec.bg_s 58.17 ngc.nf_s 43.03 lgx.np_s 23.81 agc.bp_s 57.73 lgc.bp_s 42.81 nfx.tg_s 23.41
ngc.np_s 58.15 ngc.ne_s 42.98 apx.lg_s 23.79 lec.bg_s 57.70 ngc.lf_s 42.81 npx.ag_s 23.36
nfc.bg_s 58.15 lgc.le_s 42.96 agx.lp_s 23.79 lgc.ne_s 57.69 lgc.af_s 42.79 ngx.ap_s 23.36
lgc.np_s 58.12 lxc.le_s 42.96 apx.tg_s 23.69 agc.bf_s 57.67 lgc.tf_s 42.79 ngx.lf_s 23.36
lgc.nf_s 58.08 lgc.lf_s 42.95 agx.tp_s 23.69 nec.ng_s 57.67 agc.ne_s 42.76 nfx.lg_s 23.36

Tables 11 and 12 list the means of the 25 best precision values obtained amongst all weighting and normalization schemes used for query answering with VSM and LSI, respectively; suffixes "s" and "ns" indicate the presence or absence of stemming, and the entries shown correspond to the best weighting and normalization options. First, note that LSI returns good precision, about 19% better than VSM for MEDLINE. The performance of each weighting scheme does not seem to vary across collections. For example, the 'logarithmic' local and the 'gfidf' global term weighting schemes appear to return the best precision values for VSM. In the case of LSI, it appears that 'logarithmic' local term weighting gives similar results, while 'IDF' and 'probabilistic inverse' global term weighting return the best performance. Furthermore, precision is generally better with stemming. In view of this and the reduction in dictionary size, stemming appears to be a desirable feature in both VSM and LSI.


Table 12. LSI precision.

MEDLINE        CRANFIELD      CISI           MEDLINE        CRANFIELD      CISI
lfc.bp_s 69.51 aec.bn_s 46.23 aec.lp_s 24.79 lec.np_s 69.07 lpc.bf_s 45.95 lfc.le_s 24.24
lec.bp_s 69.39 lec.bn_s 46.18 aec.np_s 24.66 lfc.nf_s 69.06 lpc.be_s 45.92 lfc.lp_s 24.22
lec.bf_s 69.38 lec.nn_s 46.13 lfc.tf_s 24.45 lpc.nf_s 69.05 lfc.bp_s 45.89 afc.tf_s 24.22
lfc.bf_s 69.33 aec.nn_s 46.09 lfc.lf_s 24.40 lpc.ne_s 69.03 lec.bf_s 45.87 lfc.tp_s 24.19
lpc.bp_s 69.31 lec.ln_s 46.07 lfc.nf_s 24.35 lec.nf_s 69.00 lfc.be_s 45.87 aec.bp_s 24.19
lpc.be_s 69.31 afc.bn_s 46.06 lec.lf_s 24.35 lpx.bp_s 68.99 lfc.nf_s 45.87 lec.bp_s 24.17
lpc.bf_s 69.27 aec.ln_s 46.05 aec.le_s 24.33 aec.bp_s 68.99 aec.an_s 45.82 lec.nf_s 24.16
lfc.be_s 69.27 lec.an_s 46.04 lfc.af_s 24.32 lfx.be_s 68.93 lpc.ne_s 45.82 afc.af_s 24.16
lfc.np_s 69.25 lec.tn_s 46.04 lec.np_s 24.32 lfc.ne_s 68.92 aec.tn_s 45.82 lfc.ne_s 24.16
lpc.np_s 69.16 lfc.bf_s 46.02 lfc.te_s 24.31 aec.bf_s 68.92 afc.an_s 45.79 aec.ne_s 24.14
lec.be_s 69.13 afc.nn_s 45.99 aec.lf_s 24.25 lpx.bf_s 68.92 afc.tn_s 45.79 afc.lf_s 24.14
afc.bp_s 69.09 afc.ln_s 45.96 lfc.ae_s 24.24 afc.bf_s 68.91 lpc.nf_s 45.79 lec.ne_s 24.13

Table 13. Document collections used in clustering experiments.

feature                        REUT-ALL  REUTC1  REUTC2  REUTC3  REUTC4
documents                      19,042    880     900     1,200   2,936
terms (indexing)               21,698    4,522   4,734   5,279   8,846
terms/document                 145       179     180     175     180
terms/document (indexing)      69        81      83      81      85
tdm nonzeros (%)               0.22      0.20    0.20    1.06    0.66
terms (after stemming)         15,295    3,228   3,393   3,691   6,068
dictionary size reduction (%)  30        29      28      30      31

Clustering

We next present results concerning the effects of term weighting and stemming on clustering. For our experiments, we used parts of REUT-ALL. We remind the reader that the latter consists of 19,042 documents, 8,654 of which belong to a single topic. We applied TMG to four parts of REUT-ALL, labeled REUTC1, REUTC2, REUTC3 and REUTC4, respectively. These consist of documents from 22, 9, 6 and 25 classes, respectively. REUTC1 up to REUTC3 contain an equal number of documents from each class (40, 100 and 200, respectively). REUTC4, on the other hand, consists of documents with varying class sizes, ranging from 30 to 300. Table 13 summarizes the features of our datasets. As before, stemming causes a significant reduction, up to 31%, in dictionary size. As in the previous section, we tried all possible weighting and normalization options available in TMG and recorded the resulting entropy values for two clustering schemes: PDDP [19], as a representative hierarchical algorithm based on spectral information, and Spherical k-means (Skmeans) ([26]), as an interesting partitioning algorithm. Tables 14 and 15 summarize the entropy values for the combinations of the ten weighting and normalization schemes that returned the best results. Skmeans entropy values are about 45% better than PDDP's for REUTC2.

Stemming and cosine normalization appear to improve the quality of clustering in most cases.


Table 14. Entropy values for PDDP.

REUTC1       REUTC2       REUTC3       REUTC4
tpc_s  1.46  lec_ns 1.11  aec_s  0.85  afc_ns 1.63
tec_s  1.54  afc_ns 1.13  lec_s  0.90  tfc_s  1.64
tfc_s  1.58  aec_ns 1.15  tec_s  0.92  aec_ns 1.67
tec_ns 1.59  lec_s  1.17  tec_ns 0.93  afc_s  1.68
lec_s  1.61  lfc_ns 1.18  bxc_ns 0.96  lfc_ns 1.68
aec_s  1.61  lfc_s  1.19  tfc_s  0.96  tec_ns 1.69
tpc_ns 1.63  aec_s  1.20  lec_ns 0.97  tec_s  1.69
aec_ns 1.66  afc_s  1.24  afc_s  0.98  aec_s  1.72
afc_s  1.67  tfc_ns 1.26  aec_ns 1.01  tpc_ns 1.72
afc_ns 1.67  lgc_ns 1.29  afc_ns 1.01  lec_s  1.73

Table 15. Entropy values for Skmeans.

REUTC1       REUTC2       REUTC3       REUTC4
tpc_s  1.18  axc_s  0.61  bxc_ns 0.66  lec_s  0.96
tpc_ns 1.23  aec_s  0.73  lec_ns 0.67  tfc_s  0.98
tfc_s  1.28  lec_ns 0.73  lxc_ns 0.73  tec_s  0.99
tec_s  1.30  lxc_s  0.73  axc_s  0.74  afc_ns 1.03
afc_s  1.31  tfc_s  0.73  bxc_s  0.74  aec_s  1.03
tec_ns 1.31  nxc_s  0.74  bgc_s  0.74  lec_ns 1.04
lec_ns 1.33  lxc_ns 0.75  tec_ns 0.75  afc_s  1.04
axc_s  1.35  axc_ns 0.76  nxc_ns 0.78  tec_ns 1.06
afc_ns 1.35  tec_ns 0.76  bgc_ns 0.78  apc_s  1.06
ngc_s  1.36  bgc_s  0.76  tpc_ns 0.79  tfc_ns 1.07

Tables 14 and 15 do not identify a specific weighting scheme as best, though 'logarithmic' and 'alternate log' local and 'entropy' and 'IDF' global weighting appear to return good results. Moreover, the simple 'term frequency' local function appears to return good clustering performance, whereas global weighting does not seem to improve matters.
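For reference, a sketch of one common definition of such an entropy quality measure (the weighted average, over clusters, of the entropy of each cluster's class distribution; lower is better). The exact formula used in the experiments is not spelled out above, so this sketch is stated only to fix ideas:

function H = cluster_entropy(labels, clusters)
% labels   : true class of each document (integers 1..c)
% clusters : cluster assigned to each document (integers 1..k)
n = numel(labels);
H = 0;
for j = 1:max(clusters)
    idx = (clusters == j);
    nj = sum(idx);
    if nj == 0, continue; end
    p = histc(labels(idx), 1:max(labels)) / nj;  % class distribution in cluster j
    p = p(p > 0);
    H = H - (nj / n) * sum(p .* log2(p));
end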

5 Conclusions

We have outlined the design and implementation of TMG, a novel MATLAB toolbox for the construction of tdm's from text collections presented in the form of ASCII text files and directories. Our motivation was to facilitate users, such as researchers and educators in computational linear algebra who use MATLAB to build algorithms for textual information retrieval and are interested in the rapid preparation of test data. Using TMG one avoids the extra steps necessary to convert or interface with data produced by other systems. TMG returns results comparable with (albeit slower than) those produced by GTP, a popular C++ package for IR using LSI. TMG also allows one to conduct stemming by means of a well-known variation of Porter's algorithm


and provides facilities for the maintenance and incremental construction of term-document collections. We presented examples of the use of TMG in various settings and data collections, including BIBBENCH, a new dataset consisting of data in BIBTEX format. The flexibility of TMG allowed us extensive experimentation with various combinations of term weighting and normalization schemes and stemming. The tool is publicly available via a simple request. We are currently working on enabling the tool to process a variety of other document types as well as on distributed implementations. We intend to exploit the facilities for integer and single-precision arithmetic of MATLAB 7.0 as well as compression techniques to produce a more efficient implementation.

Acknowledgments

We thank Jacob Kogan and Charles Nicholas for inviting us to contribute to this volume. TMG was conceived after a motivating discussion with Andrew Knyazev regarding a collection of MATLAB tools we had put together to aid in our clustering experiments. We thank Michael Berry for discussions and for including the software in the LSI web site [3], Efi Kokiopoulou and Constantine Bekas for many helpful suggestions, and Dan Boley for his help regarding preprocessing in PDDP. We thank Inderjit Dhillon, Pavel Berkhin and the editors for letting us access an early version of this volume's BIBTEX source to use in our experiments, and Michael Velgakis for assistance regarding the BEC dataset. We thank Elias Houstis for his help in the initial phases of this research and for providing us access to MATLAB 7.0. Special thanks are due to many of the users for their constructive comments regarding TMG. This research was supported in part by a University of Patras "Karatheodori" grant. The first author was also supported by a Bodossaki Foundation graduate fellowship.

A Appendix: Availability

TMG and its documentation are available from the following URL:

http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/

Assuming that the file has been downloaded and saved into a directory that is already in the MATLAB path, or made to be so by executing the MATLAB command addpath('directory'), TMG is ready for use. To process Adobe PDF and POSTSCRIPT files, the Ghostscript utility ps2ascii and Ghostscript's compiler must be made available in the path. TMG checks the availability of this utility and uses it if available; otherwise it proceeds to the next document. It is also recommended, before running TMG, to use ps_pdf2ascii, our MATLAB interface to ps2ascii. This allows checking the resulting ASCII file, as the conversion results are not always as desired. The user can also easily edit ps_pdf2ascii to insert additional filters so that the system can process additional file formats (e.g. detex for TeX files). MATLAB version 6.5 or higher is assumed.

DRAFT: Please do not distribute
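For concreteness, a typical setup session might look as follows; the directory and file names are hypothetical placeholders, and the exact calling convention of ps_pdf2ascii should be verified against the distributed documentation.

   % Hypothetical setup session; paths are placeholders.
   addpath('/home/user/TMG');         % make the toolbox visible to MATLAB
   % Optionally pre-convert a PDF and inspect the resulting ASCII file
   % before indexing; ps_pdf2ascii wraps Ghostscript's ps2ascii.
   ps_pdf2ascii('papers/report.pdf');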

References

1. Doc2mat. Available from www-users.cs.umn.edu/~karypis/cluto/files/doc2mat-1.0.tar.gz.

2. General Text Parser. Available from www.cs.utk.edu/~lsi/soft.html.

3. Latent Semantic Indexing Web Site. Maintained by M.W. Berry and S. Dumais at www.cs.utk.edu/~lsi/.

4. The Lemur Toolkit. Available from http://www-2.cs.cmu.edu/~lemur/.

5. MATLAB: The Language of Technical Computing. At http://www.mathworks.com/products/matlab/.

6. Mc Toolkit. Available from www.cs.utexas.edu/users/dml/software/mc/.

7. Telcordia Latent Semantic Indexing (LSI) Demo Machine. At http://lsi.research.telcordia.com/.

8. R. Ando. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. In Proc. 23rd ACM Conf. SIGIR, pages 216-223, 2000.

9. R. Baeza-Yates, A. Moffat, and G. Navarro. Searching large text collections. In J. Abello, P. Pardalos, and M. Resende, editors, Handbook of Massive Data Sets, pages 195-244. Kluwer Academic Publishers, 2002.

10. R.A. Baeza-Yates and B.A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.

11. M. Berry and M. Browne. Understanding Search Engines. SIAM, 1999.

12. M.W. Berry, S.T. Dumais, and G.W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995.

13. Michael Berry, Theresa Do, Gavin O'Brien, Vijay Krishna, and Sowmini Varadhan. SVDPACKC (Version 1.0) User's Guide. Technical Report CS-93-194, Computer Science Dept., University of Tennessee, Knoxville, April 1993.

14. Michael W. Berry, Zlatko Drmac, and Elizabeth R. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41(2):335-362, June 1999.

15. M.W. Berry. Large scale singular value decomposition. Int. J. Supercomp. Appl., 6:13-49, 1992.

16. M.W. Berry, editor. Survey of Text Mining: Clustering, Classification, and Retrieval. Springer Verlag, New York, 2004.

17. M.W. Berry, B. Hendrickson, and P. Raghavan. Sparse matrix reordering schemes for browsing hypertext. In J. Renegar, M. Shub, and S. Smale, editors, The Mathematics of Numerical Analysis, volume 32 of Lectures in Applied Mathematics (LAM), pages 99-123. American Mathematical Society, 1996.

18. R.F. Boisvert, R. Pozo, K. Remington, R. Barrett, and J. Dongarra. The Matrix Market: A Web repository for test matrix data. In R.F. Boisvert, editor, The Quality of Numerical Software, Assessment and Enhancement, pages 125-137. Chapman and Hall, London, 1997.

19. D.L. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4):325-344, 1998.

20. D.L. Boley. Principal direction divisive partitioning software (experimental software, version 2-beta), Feb. 2003. Available from www-users.cs.umn.edu/~boley/Distribution/PDDP2.html.

21. M. Castellanos. Hot-Miner: Discovering hot topics from dirty text. In Berry [16], pages 123-157.

22. C. Chen, N. Stoffel, M. Post, C. Basu, D. Bassu, and C. Behrens. Telcordia LSI engine: Implementation and scalability issues. In Proc. 11th Workshop on Research Issues in Data Engineering (RIDE 2001): Doc. Management for Data Intensive Business and Scientific Applications, Heidelberg, Germany, Apr. 2001.

23. E. Chisholm and T. Kolda. New term weighting formulas for the vector space method in information retrieval. Report ORNL/TM-13756, Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.

24. S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

25. I.S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications, pages 357-381. Kluwer Academic Publishers, 2001.

26. I.S. Dhillon and D.S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, January 2001. Also appears as IBM Research Report RJ 10147, July 1999.

27. J.R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in MATLAB: Design and implementation. Technical Report CSL 91-4, Xerox Palo Alto Research Center, 1991.

28. J.R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in MATLAB: Design and implementation. SIAM J. Matrix Anal. Appl., 13(1):333-356, 1992.

29. J.R. Gilbert and S.-H. Teng. MATLAB mesh partitioning and graph separator toolbox, Feb. 2002. Available from ftp://parcftp.xerox.com/pub/gilbert/meshpartdist.zip.

30. J.T. Giles, L. Wo, and M. Berry. GTP (General Text Parser) software for text mining. Statistical Data Mining and Knowledge Discovery, pages 455-471, 2003.

31. N. Goharian, A. Chowdhury, D. Grossman, and T. El-Ghazawi. Efficiency enhancements for information retrieval using sparse matrix approach. In Proc. 2000 Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, June 2000.

32. N. Goharian, T. El-Ghazawi, and D. Grossman. Enterprise text processing: A sparse matrix approach. In Proc. IEEE International Conference on Information Technology: Coding and Computing (ITCC 2001), Las Vegas, 2001.

33. G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 2nd edition, 1989.

34. L. Grady and E.L. Schwartz. The graph analysis toolbox: Image processing on arbitrary graphs, Aug. 2003. Available from http://eslab.bu.edu/software/graphanalysis/graphanalysis.html.

35. M. Hochstenbach. A Jacobi-Davidson type SVD method. SIAM J. Sci. Comput., 23(2):606-628, 2001.

36. E.-J. Im and K. Yelick. Optimization of sparse matrix kernels for data mining. University of California, Berkeley, unpublished manuscript, 2001.

37. G. Karypis. CLUTO: A clustering toolkit. Technical Report 02-017, University of Minnesota, Department of Computer Science, Minneapolis, MN 55455, Aug. 2002.

38. J. Kleinberg and A. Tomkins. Applications of linear algebra in information retrieval and hypertext analysis. In Proc. 18th ACM SIGMOD-SIGACT-SIGART Symp. Princ. Datab. Sys., pages 185-193. ACM Press, 1999.

39. M. Kobayashi, M. Aono, H. Takeuchi, and H. Samukawa. Matrix computations for information retrieval and major and minor outlier cluster detection. J. Comput. Appl. Math., 149(1):119-129, 2002.

40. E. Kokiopoulou and Y. Saad. Polynomial filtering in latent semantic indexing for information retrieval. In Proc. 27th ACM SIGIR, pages 104-111, New York, 2004. ACM.

41. T. Kolda and D. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing information retrieval. ACM Trans. Inf. Sys., 16(4):322-346, 1998.

42. T.G. Kolda. Limited-Memory Matrix Methods with Applications. PhD thesis, The Applied Mathematics Program, University of Maryland, College Park, Maryland, 1997.

43. R.M. Larsen. PROPACK: A software package for the symmetric eigenvalue problem and singular value problems, based on Lanczos and Lanczos bidiagonalization with partial reorthogonalization. From http://soi.stanford.edu/~rmunk/PROPACK/.

44. R. Lehoucq, D.C. Sorensen, and C. Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia, 1998.

45. T.A. Letsche and M.W. Berry. Large-scale information retrieval with latent semantic indexing. Information Sciences, 100(1-4):105-137, 1997.

46. M.F. Porter. The Porter stemming algorithm. See www.tartarus.org/~martin/PorterStemmer.

47. M.F. Porter. An algorithm for suffix stripping. Program, 14:130-137, 1980.

48. J. Quesada. Creating your own LSA space. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Erlbaum. In press.

49. G. Salton, J. Allan, and C. Buckley. Automatic structuring and retrieval of large text files. Communications of the ACM, 37(2):97-108, 1994.

50. G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.

51. G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975.

52. A. Singhal, C. Buckley, M. Mitra, and G. Salton. Pivoted document length normalization. In ACM SIGIR, 1996.

53. S. Sirmakessis, editor. Text Mining and its Applications (Results of the NEMIS Launch Conference). Springer, Berlin, 2004.

54. I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.

55. D.I. Witter and M.W. Berry. Downdating the latent semantic indexing model for conceptual information retrieval. The Computer Journal, 41(8):589-601, 1998.

56. D. Zeimpekis and E. Gallopoulos. PDDP(l): Towards a flexible principal direction divisive partitioning clustering algorithm. In D. Boley et al., editors, Proceedings of the Workshop on Clustering Large Data Sets (held in conjunction with the Third IEEE International Conference on Data Mining), pages 26-35, 2003.

57. D. Zeimpekis and E. Gallopoulos. CLSI: A flexible approximation scheme from clustered term-document matrices. In Proc. 5th SIAM International Conference on Data Mining, Philadelphia, 2005 (to appear). SIAM.

58. Hongyuan Zha and Horst D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782-791, March 2000.

Index

BIBBENCH, 3
CISI, 3, 14, 20, 21
clustering, 1, 22, 23
CLUTO, 3
compressed sparse column format (CSC), 10
compressed sparse row (CSR), 3
CRANFIELD, 3, 14, 20, 21
dirty text, 6
DOC2MAT, 3
eigenvalue decomposition, 1
General Text Parser (GTP), 3
GTP, 3, 6, 11, 14, 20, 24
Harwell-Boeing, 11
implicitly restarted Arnoldi, 11
inverted index, 9
Lemur Toolkit, 3
MATLAB, 2
Matrix Market, 3
MC, 3
MEDLINE, 3, 14, 20, 21
PDDP, 23
precision, 20-22
Principal Direction Divisive Partitioning (PDDP), 3
querying, 1, 2
recall, 20
REUTERS-21578, 3, 11
SDDPACK, 3
singular value decomposition (SVD), 1
Skmeans, 23
sparse matrix, 2, 8
Spherical k-means (Skmeans), 22
stemming, 3
SVD, 11, 12, 20
Telcordia LSI Engine, 3
term-document matrix (tdm), 2
text mining, 1
Text to Matrix Generator (TMG), 1
TMG, 2-16, 19, 20, 22-24
vector space model (VSM), 1