THE BASICS OF THE GALLITO 2.0 PROGRAM
INTRODUCTION
The Gallito 2.0 software (currently the latest version) is a program for the processing of
a large number of linguistic documents to obtain a mathematical representation of
language. In addition to processing a large volume of language (as we shall see, the
term "processing" is usually replaced by the term "training"), it includes multiple
features to semantically process linguistic information. Thus, the program is able to
quantify the semantic relationship between texts, measure the cohesiveness between
paragraphs in a text, extract the key words that summarize a document, serve as the
basis to obtain conceptual network graphs, enable the analysis of term types by means
of K-means cluster analysis, serve to assess text quality, and change the basis to obtain
a new semantic representation of language.
The purpose of this document is to serve as a roadmap for effective, simple use of the
Gallito program. It is written in plain language, avoiding unnecessary technical detail
about the machinery that underlies the tool, and features are always illustrated with
examples so that users can grasp them more easily.
Before discussing use of the program, it should be pointed out that Gallito is essentially
a program that represents language mathematically. When working with this software,
users should bear in mind that each word is a numerical vector with a manageable
number of coordinates (approximately 300). Likewise, every text, sentence, or term is
a numerical vector with some 300 coordinates. So this is not a
program for qualitative discourse processing. Rather, it falls within the framework of
artificial intelligence theories. Gallito is based on latent semantic analysis (LSA)
technology and philosophy.
WHAT IS LATENT SEMANTIC ANALYSIS (LSA)?
How is this mathematical representation of language obtained? How is each word
represented by means of a number of numerical coordinates? To obtain a
mathematical representation, a technique known as LSA is applied. Much research has
been done since the 1990s on this technique and its possibilities for treatment of
language semantics (the classic paper is Landauer and Dumais' 1997 "A Solution to
Plato's Problem", which can be easily downloaded).
Basically, the procedure is as follows. The user must have a set of texts to train the
program. This is what is known as the linguistic corpus. This handbook provides
multiple examples taken from a linguistic corpus of some 46,000 press documents
from the Spain section in the El País and El Mundo newspapers, collected between
2002 and 2009. In this case, we wanted Gallito to obtain a statistical or mathematical
representation of the terms in those press documents, as well as of the documents in
that corpus.
The first step towards this goal consists in uploading the linguistic corpus stored in a
single (plain text) file to the tool. A character must be specified to let the program
know when a new document begins and ends. Documents are usually separated by
means of the hash character (#).
What does the LSA (and thus Gallito) do with the file that contains the linguistic
corpus? It generates a frequency matrix, where all the different terms that appear in
the file are entered into rows and each of the documents is entered into a column.
Thus, the cells in that great matrix provide the number of times a given word occurs in
a document. In our newspaper corpus there are a total of 10,686 terms (rows) and
45,886 documents (columns).
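The first steps of this process (splitting the file on the separator and counting occurrences) can be sketched in plain Python. The miniature corpus below is illustrative, not taken from the real training file:

```python
from collections import Counter

# A miniature corpus: documents separated by the hash character (#),
# following the Gallito convention described above.
corpus = "el congreso aprueba la ley # la crisis golpea la economia # el rio ebro crece"

# Split the file contents into documents on the separator, then tokenize.
documents = [doc.split() for doc in corpus.split("#")]

# Rows: every distinct term; columns: one per document.
terms = sorted({t for doc in documents for t in doc})
counts = [Counter(doc) for doc in documents]

# X[i][j] = number of times term i occurs in document j.
X = [[c[t] for c in counts] for t in terms]
```

The real matrix is of course far larger (10,686 rows by 45,886 columns), but the construction is the same.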
How much space does a document take up? When we talk about a document, we
usually refer to a paragraph (generally between 50 and 300 words). The natural unit of
the document for the LSA is the paragraph, although you can also work in such a way
that a document is a sentence or a couple of sentences, e.g. if you want a smaller
level of analysis. The opposite is also possible: you can turn a book chapter, an entire
book, etc. into a document, although this is more unusual.
This great frequency matrix is called X. Does Gallito work with this matrix? The matrix
can be thought of as a numerical (mathematical) representation of the terms and
documents. In principle, to find the semantic similarity between two terms, it would
suffice to compute the correlation or degree of resemblance between their two rows.
To find the semantic similarity between Congreso (Congress) and Economía (Economy),
it would be enough to pick those two rows in the matrix and calculate the correlation
between them. In practice, however, this matrix is not very useful: studies show that it
does not represent the semantic relationships between words well. When working with
this matrix, a lot of noise appears. By noise we mean the large extent to which the
matrix depends on idiosyncrasies in language use by the different authors. This matrix
is usually known as the raw matrix because, among other things, it has not been purged.
To begin with, Gallito removes syntactic words from this matrix. It carries out a purge
to remove the most frequent words in the language, those which do not provide
semantic information, such as prepositions, articles, and pronouns. In addition, word
frequencies follow a greatly asymmetric distribution. Some words appear much more
frequently than others, but that frequency ratio is not related to the semantic
significance of each of the words. Put otherwise, the raw matrix X must be modified.
The second step consists in applying a weighting function, usually involving logarithms,
to dampen this great asymmetry in word frequency (log-entropy and log-IDF are the
most usual weighting procedures; both are included in Gallito, as we will see). After this
weighting, matrix X no longer displays such an asymmetrical word frequency distribution.
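A common formulation of log-entropy weighting can be sketched as follows. This is a sketch of the standard textbook formula; Gallito's exact implementation may differ in details:

```python
import math

def log_entropy(X):
    """Weight a term-by-document count matrix X with log-entropy.

    Local weight: log(1 + f_ij).  Global weight for term i:
    1 + sum_j p_ij * log(p_ij) / log(n_docs), where p_ij = f_ij / gf_i
    and gf_i is the term's total frequency.  Frequent, evenly spread
    terms get a global weight near 0; informative terms get a weight
    near 1.
    """
    n_docs = len(X[0])
    weighted = []
    for row in X:
        gf = sum(row)  # global frequency of the term
        entropy = sum((f / gf) * math.log(f / gf) for f in row if f > 0)
        g = 1 + entropy / math.log(n_docs)
        weighted.append([g * math.log(1 + f) for f in row])
    return weighted
```

A term that appears evenly in every document (like a preposition) ends up with a global weight near zero, which is exactly the purging effect described above.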
The result is a modified matrix X. Can we work with this 10,686 x 45,886 dimension
matrix? It's still not useful. What is most subtle about LSA and has made it famous is
the next step: applying a dimension reduction technique on that modified matrix X so
that the words (or documents) are represented not by 45,886 columns, but by a much
smaller number of dimensions, usually about 300. The reason why 300 is chosen is
more empirical than theoretical. No research has been done (as far as we know)
showing that the brain represents all words in a 300-dimension semantic space
because this is an adaptive or reasonable figure that correctly encompasses the
essence of concepts. Rather, this approximate number of dimensions is the result of
empirical studies asking how well human semantics are simulated when the dimension
reduction algorithm yields a solution of 100, 200, 300, 400, etc. dimensions. The
results indicate that language can be mathematically represented in a way that
simulates human semantics satisfactorily if the dimension reduction technique
decomposes the weighted matrix X into about 300 dimensions. Too few dimensions
provide an excessively poor solution: semantics is not well captured, and essential
aspects are lost if we try to account for them with so little information. Too many
dimensions produce the opposite problem: an excess of spurious dimensions which
are useless and distort the semantic representation.
The dimension reduction algorithm is known as the Singular Value Decomposition
algorithm. Techniques such as Principal Component Analysis on the correlation matrix
or Correspondence Analysis on contingency tables are very similar to this
decomposition.
The final result of the linguistic training of LSA (or, in this case, Gallito) is the creation
of a new 300-dimension matrix (if a k of 300 is selected) where each of the 10,686
words is represented by means of a vector of 300 numbers or coordinates.
This matrix is typically known as US.
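The decomposition and truncation steps can be sketched with NumPy's SVD routine. The matrix and k below are toy-sized; the real training decomposes a 10,686 x 45,886 matrix with k of about 300:

```python
import numpy as np

def latent_space(X, k):
    """Truncated SVD of the weighted term-by-document matrix X.

    X = U S V^T; keeping only the k largest singular values yields
    the term representation US = U_k * S_k, one k-dimensional vector
    per term (i.e. per row of X).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]  # the "US" matrix: terms x k

# Toy example: 4 terms x 3 documents, reduced to k = 2 dimensions.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
US = latent_space(X, 2)
```

With the full rank kept, the decomposition reproduces the original term-by-term inner products exactly; truncating to k dimensions keeps the regularities and discards the noise, as described in the text.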
Gallito is then ready to be used, as a trained, interesting, and useful mathematical
representation of the language is available.
The matrix US is usually known in LSA as the latent semantic space. This space is the
vectorial space generated from the original linguistic corpus. It is semantic because the
matrix US captures the semantic relationships between words well. It is latent because,
by reducing to approximately 300 dimensions, the noise of the original matrix X has
been removed, so that the essence of semantics, the regularities between words, is
captured rather than the idiosyncrasies. In this way, the latent aspects that underlie
semantics are captured.
The most basic possibility afforded by this representation is the verification of the
semantic similarity between two words. This amounts to obtaining the correlation
between two rows in the matrix US (to be more precise, their cosine). The cosine ranges
between -1 and 1. Leaving negative figures aside, as they are highly infrequent, values
range between 0 and 1. The semantic relationship between two words can be assessed
by means of the cosine. When the value of the cosine is about 0, the two words are
independent, orthogonal, or semantically different. The more a cosine tends towards
1, however, the greater the semantic similarity between two words.
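The cosine between two term vectors is the dot product divided by the product of their lengths; a minimal sketch (the two-dimensional vectors are illustrative, real Gallito vectors have some 300 coordinates):

```python
import numpy as np

def cosine(u, v):
    """Cosine between two term vectors (rows of the US matrix):
    dot product divided by the product of the vector lengths."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Orthogonal vectors give a cosine of 0 (semantically independent words); vectors pointing in the same direction give a cosine of 1.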
For example, if we obtain the semantic relationship between the terms puñetazo
(punch) and pelea (fight) in Gallito, the value of the cosine is 0.497, a value that shows
a close semantic relationship between the terms (values higher than 0.50 are very rare,
as users will find out as they work with the program).
If we try the terms puñetazo (punch) and Ebro (the river), the semantic relationship is
-0.021, practically nil. Puñetazo and Ebro are two words with practically no semantic
relationship.
Finally, if the semantic relationship between two rivers, Ebro and Duero, is tested, the
cosine has a value of 0.533, again a very close semantic relationship.
As well as semantic similarity, another basic piece of information is the vector length.
The vector length can be defined as the amount of information which the LSA has
about a word. A high vector length means that the LSA has a high degree of knowledge
about that word. By contrast, a low vector length means that the LSA has not had
much exposure to that word, and thus its knowledge of it is not very deep. These are
the vector lengths of the four words that served as examples in the journalistic
corpus:
The LSA trained with the journalistic or press corpus has greater knowledge about the
word Ebro (vector length 2.141) than about the other three. The word Ebro has a
vector length almost three times that of the word Pelea (vector length 0.716).
Duero and Puñetazo are words which are less represented in the journalistic corpus, as
their vector length is rather smaller. The result of our analysis of the cosines and vector
length can be graphically summarized as follows:
Puñetazo (Punch) Vector length = 0.373
Pelea (Fight) Vector length = 0.716
Ebro Vector length = 2.141
Duero Vector length = 0.326
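Numerically, the vector length reported here is just the Euclidean norm of the term's row in the US matrix; a minimal sketch (the example vector is illustrative, real vectors have some 300 coordinates):

```python
import numpy as np

def vector_length(v):
    """Euclidean norm of a term vector: a proxy for how much
    information the trained space holds about that term."""
    return float(np.linalg.norm(v))
```

For instance, a hypothetical term vector (3, 4) has length 5; the real lengths above (0.373, 0.716, 2.141, 0.326) are computed the same way over about 300 coordinates.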
If we represent the semantic space in two dimensions (a simplification, given that we
usually need about 300 dimensions), Pelea and Puñetazo (Fight and Punch) are
represented by vectors with more or less similar directions (they are semantically
related), but are largely independent from the words Duero and Ebro (which in turn
are also closely related, as their vectors also have more or less similar directions). The
longest vector is that of the term Ebro, as it has the largest vector length. The LSA
therefore has more information about the word Ebro than about any of the other
three.
INSTALLATION OF THE GALLITO PROGRAM
REQUIREMENTS
For Gallito 2.0 to work correctly, the following components must be installed (they are
included in the download package and can also be installed from the official Microsoft
website):
- A 64-bit Windows operating system (Windows 7 or Windows Server).
- Microsoft SDK 4 (included in the download package).
- Microsoft Visual C++ 2010.
- Write permission for the installation directory.
DOWNLOADING THE PROGRAM
Publications, projects, demos, and interesting links related to the LSA technology can
be downloaded from the www.elsemantico.com website.
From the Download tab you can access the page
http://www.elsemantico.com/gallito20/download-es.html where the program can be
downloaded.
The download links for the program (version 2.1.3) and for three manuals that serve as
a guide to the program are located in the setup and manuals box.
Clicking on Versión 2.x.x setup (30 days free) [zip] opens the following dialog box:
Check the Save option and click Accept.
A .zip file will be saved from which the program will be installed.
Unzip the file and double-click the Release directory:
Double-click the icon again. The program installation wizard will start:
Click Next.
This step in the wizard shows the path where the program will be installed. Click Next.
In the next step, just confirm where you want to install the program, then click Next.
The installation process will take a few minutes.
Once installed, close the wizard by clicking Close. Don't forget to grant write
permission to the installation directory (C:\Program Files\elsemantico.com\).
The program should appear on the Start menu:
The program icon appears. Click on it. The program should open after loading the files
and display the following interface.
It warns you that you have a 30-day trial period.
Licensed users are provided an executable that extends the user license for an
indefinite period.
The program has been installed and is ready for use. The following sections in the
manual will explain how to train the program by means of a linguistic corpus, how to
load a semantic space, and all the working possibilities provided by the software.
TRAINING A CORPUS
Training a linguistic corpus in Gallito firstly entails compiling all the linguistic
documents to be trained into a single file (preferably a *.txt file).
Given that the program starts by placing the documents in columns and the different
terms in rows in a large matrix, the program must be told how the documents are
separated from each other.
To do this, go to the Corpus tab.
CORPUS TAB
Locate the path to the *.txt file that includes the documents to be trained by means of
the Reference Corpus button.
Note that the character separation option is checked, with the symbol # separating
documents.
The file to be read has this structure:
If the file containing the texts to be trained uses a different separating symbol, the
program must be told so.
If the documents are not separated by a special symbol (hash, ampersand, at sign,
etc.) but the period is used instead, the sentences separation option can be chosen.
The number of period-separated sentences that constitute a document (1 by default)
can be specified in the box A document is ___ sentences.
The A document minimum are 2 words option serves to specify the minimum number
of words for a document. Given that documents are usually paragraphs, a higher
number can be chosen in this option, e.g. 10.
Finally, the Remove words that do not appear at least 1 document option makes it
possible to specify how many times a term must appear in the training corpus for it to
be mathematically represented by the program. The question here is the minimum
number of occurrences a word needs in order to be included in the semantic space
generated by the program. A value of 5 or higher is usually chosen, as an
underrepresented word may be counterproductive and display highly random
semantic relationships.
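Taken together, the Corpus-tab options amount to a filtering pass over the corpus file; a minimal sketch, where the function name and default thresholds are illustrative (the defaults mirror the values suggested in the text):

```python
from collections import Counter

def prepare_corpus(raw, sep="#", min_words=10, min_docs=5):
    """Split a corpus file on `sep`, drop documents shorter than
    `min_words`, and drop terms that appear in fewer than
    `min_docs` documents."""
    docs = [d.split() for d in raw.split(sep)]
    docs = [d for d in docs if len(d) >= min_words]
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    vocab = {t for t, n in df.items() if n >= min_docs}
    return [[t for t in d if t in vocab] for d in docs], sorted(vocab)
```

Only the surviving documents and vocabulary would then go into the frequency matrix.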
SAVE TAB
There are more options to be specified in the training process. We have already
completed the Corpus tab. Now we will specify a path and name for the seven files
that are automatically generated after the training.
Access the Save tab.
As can be seen, there are seven buttons associated with seven text boxes. This means
that a path must be specified and names must be given to the seven different files
created after training Gallito with the linguistic corpus described previously.
Term list is a file that, after the training ends, includes all the terms mathematically
represented by the program. It is a sort of dictionary of the terms available after the
training. That file should be given the name TERMLIST for easy identification
afterwards (although any name can be used, of course).
Doc list is a file that includes all the documents used to train the program. It is a sort of
dictionary of documents. The file name can be DOCLIST.
Doc matrix is the numerical matrix where all the documents are mathematically
represented. Again, any name can be given, but DOCMATRIX is useful so that, when
this file is retrieved, we know exactly what it contains.
Term matrix is the file that contains the vectorial representation of the terms. This
matrix is, so to speak, the final essence of the training, the core from which the
semantic relationships between the texts can be worked on. The assigned name could
be TERMMATRIX.
Global weight is a file that contains the weights assigned to each of the words
analyzed. Not all words carry the same amount of semantic information, hence they
receive different weights depending on their importance.
Diagonal matrix is a matrix that also contains weights. In this case, the diagonal matrix
describes the importance of each of the dimensions by means of which the latent
semantic space was finally represented. For users who are familiar with factorial
analysis, the diagonal matrix contains the percentage of variance accounted for by
each of the dimensions that describe the semantic relationships between words.
Space features is a file that includes certain features of the training process.
Once all the files that will be generated by the training have been named, the way in
which those files are to be saved must be chosen: binary or serialized. The binary
mode is more efficient than the serialized one, so, technical differences aside, the
binary mode is the better option.
Finally, the Save Project automatically checkbox saves these files, once the training is
complete, in the path previously used to name them.
IT IS IMPORTANT TO CHECK THIS BOX TO AVOID LOSING THE SEVEN FILES DESCRIBED.
ENSURE THAT THE SEVEN FILES ARE AUTOMATICALLY SAVED IN THE SPECIFIED PATH
AFTER THE TRAINING IS COMPLETED.
MATRIX TAB
We now continue with the specifications to be made before the training starts, taking
a look at the Matrix tab, which establishes some interesting features of this stage of
the training.
Dimensions. This option box, checked by default, allows us to specify how many
dimensions will be used to mathematically represent the words. That is to say, the
number of dimensions which will describe the semantic relationships between the
words. For novel users, this option can be somewhat disconcerting if they do not know
the number of dimensions that is generally used in LSA.
Experts like Landauer and Dumais, together with other collaborators, as well as the
extensive experience of the tests made, suggest that the adequate number of
dimensions for generalist linguistic corpora (corpora that contain a varied language,
such as novels, newspapers, essays, poetry, technical documents, etc.) ranges between
250 and 350. This means that the best way to semantically represent relationships
between words requires a number of dimensions that ranges between 250 and 350.
For corpora within a more specific domain, a smaller number of dimensions can be
specified, e.g. 150.
In any case, this issue can be settled empirically, not theoretically. We therefore
recommend that users establish a number of dimensions that ranges between 150 and
200 for specific-domain corpora and 250 and 350 for generalist corpora.
Accumulated singular value is an interesting option, although we do not recommend
using it with large linguistic corpora. It makes it possible to tell the program something
like 'choose the number of dimensions such that those dimensions account for 40% of
the original variability of relationships between words and documents'. This option is
useful for two reasons. Firstly, it makes it unnecessary to decide the number of
dimensions to specify if it is unclear what this number must be. Secondly, some tests
carried out by experts suggest that 40% of the total variance provides a very
interesting semantic space which correctly simulates what humans do. The drawback
of this option is that it requires the program to calculate all possible options, a task
that is extremely costly is the number of documents to train the program is slightly
high (e.g. more than 5,000).
Linguistics adjustment is related to the way in which the original matrix of term
frequency per document will be transformed - the original matrix with which the
program works before obtaining the latent semantic space. Word frequencies do not
follow a uniform distribution: studies have shown that the most frequent word is used
about twice as frequently as the second most frequent word, the second about twice
as frequently as the third, and so on (a pattern known as Zipf's law), so that some
words appear much more frequently than others. When the original word frequency matrix
per document is obtained, some words will appear much more often than others. If
the original frequency distribution is to be preserved, the option nothing must be
chosen. If this large asymmetry in use of frequencies is to be modified, one of the
other two options must be chosen: Log*entropy or Log*IDF. Both ways of weighting
the frequency matrix have strong empirical support. The most usual one is
Log*Entropy. As can be seen, both weighting methods involve the logarithm, a method
usually employed in statistics to assign a greater weight to infrequent items and a
much lower weight to highly frequent items.
Normalization. This option forces the normalization by rows of the matrix U (see the
section on LSA above). That is to say, it forces the length of the vectors that represent
all the words to be equal to one.
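This row normalization can be sketched with NumPy (a sketch; the matrix is illustrative):

```python
import numpy as np

def normalize_rows(U):
    """Divide each row of U by its Euclidean norm so that every
    term vector has length one."""
    U = np.asarray(U, float)
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```

After normalization, vector length no longer distinguishes terms; only the direction of each vector (i.e. pure similarity) remains.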
REM/ADD TAB
Finally, certain features must be specified by means of this tab before finally starting
the training of the corpus. This tab basically establishes issues such as which words are
to be processed by the program and how they are to be processed.
The options under the Lexicon box are all linguistic forms which can be selected if we
prefer Gallito to ignore them in the training. The program provides files with adverbs,
prepositions, pronouns etc. in Spanish, which will make it possible to ignore these
typically syntactic words during the training stage. Thus, choosing the option Adverbs
will make the program ignore adverbs. This means that the latent semantic space will
not include adverbs. Conjunctions, modifiers, interjections, prepositions, and pronouns
are some of the options that are typically chosen (and which we recommend choosing)
so that they are ignored during the training. These are syntactic rather than semantic
words, so they possibly only add noise to the final semantic space. It is advisable to
choose all these options. It is also possible to ignore verbs, although verbs are usually
included in the training (this option is usually not chosen).
The Additional option makes it possible to avoid training words or short phrases
specified by the user. If the language with which the program will be trained includes
complex terms or structures to be rejected, this option should be chosen. This option
should be selected if, for example, a PDF file is copied which has the heading "Políticas
contra la desertización” (“Anti-desertification policies") repeated and we do not want
the program to train this phrase. In addition, Structures should be selected in the
drop-down menu:
Add the phrase to be discarded in structures with more than one term (the box on the
left) and click the button Add.
In the Action box we can choose Remove if we want to ignore (remove) the words
whose linguistic category falls under one of the categories previously selected, or the
opposite (Add exclusively): specifying that those words will be the only ones taking
part in the training. This latter option will rarely be used other than for research
purposes.
Finally, for Lemmatization, the option Spanish can be selected in the drop-down menu
and the checkbox to the left can be checked:
This option makes it possible to group words under a single form. It is mainly applied
to verbs, nouns, and adjectives. For example, if we want the verb forms
“abandonaba”, “abandonaron”, “abandonaré”, “abandonaría”, etc., to all appear
under the infinitive “abandonar”, this option should be chosen. Lemmatization is a
very good option for handling the large number of verbal forms in Spanish. If this
option is not chosen, all the different forms of the verb “abandonar” will be seen as
different terms and each will take up a row in the term matrix.
Obviously, lemmatization makes it possible (1) to handle many different terms which
should fall under the same semantic category, and (2) to better represent words with a
smaller number of training texts. Just imagine how many documents would have to be
trained to properly represent the verb “abandonar” semantically with and without
lemmatization. Given that with no lemmatization the verb forms of the verb
“abandonar” multiply, we would need a huge number of texts.
For example, after lemmatization, this list of words appears among the terms (left-
hand list), as opposed to an identical training following the same parameters except
that this time the corpus is not lemmatized (the right-hand list):
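The effect of lemmatization can be sketched with a hypothetical miniature lemma dictionary. Real lemmatizers rely on full morphological dictionaries for Spanish; the mapping below is purely illustrative (and omits accents for simplicity):

```python
# Hypothetical miniature lemma dictionary; real lemmatizers use
# full morphological dictionaries for Spanish.
LEMMAS = {
    "abandonaba": "abandonar",
    "abandonaron": "abandonar",
    "abandonare": "abandonar",
    "abandonaria": "abandonar",
}

def lemmatize(tokens):
    """Collapse inflected forms onto their lemma so that they share
    a single row in the term matrix."""
    return [LEMMAS.get(t, t) for t in tokens]
```

All four inflected forms then contribute their occurrences to one row, "abandonar", instead of four separate, sparsely trained rows.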
PROCESS TAB
In this tab we click the START button to carry out the training. There is only one
option, Ocurrence matrix to, with which we can specify a path and name to save the
original term-frequency-per-document matrix - the matrix with which the program
starts to operate. This matrix is usually ignored, among other things because it can
take up a lot of memory. Ignoring this option, click the START button and wait for the
program to notify us that the training is complete.
This message states that the program has finished its training with the linguistic corpus
provided and has saved the seven files with which we can start to work.
We look for the seven files generated by the program in the specified directory. It
should be noted that the names of the files correspond to the names assigned to
Gallito in the Save tab. This will later facilitate work when we start a new session with
the program and want to load these files to work with the program.
LOADING THE FILES
It's not necessary to train a linguistic corpus every time we start a session with Gallito
in order to be able to work with the program. Once the program has been trained and
the seven files required to work with the program have been saved in the hard disc, we
can simply open them and start to work with the program. This section briefly
describes how to load the working files in Gallito.
Launch the program as usual.
Access the Load tab.
As we already know, seven files must be opened to be able to work with Gallito. These
are the files saved earlier, when the program was trained using a linguistic corpus.
They have the *.bnl extension.
We will now explain how to access the files and briefly describe each one:
By clicking on the button associated with the Term list box, we can look for the file that
contains all the stored terms. Then we will do the same with the Doc list button, in
order to find a file that specifies all the documents which have been used to train the
program. The Doc matrix file stores the latent semantic space for the documents, i.e. it
is the vectorial representation of the Doc list file. This file is not frequently used unless
we want to build a search engine. Continuing with the file opening process,
Term Matrix is without a doubt the most interesting file of all those opened. It stores
the latent semantic space for the terms. This is the file that contains the vectorial
representation of all the words in the training documents. Global weight is a file that
contains the weights of the terms. Not all terms have the same weight. Some terms
define the contexts (documents) in which they appear better than other terms do
(think of prepositions: terms which do not add any information to the context and
which would have practically no weight). Diagonal matrix is a file that also contains
weights, but in
this case it is the weights granted to each dimension in the latent semantic space. We
know from the LSA that a term is vectorially represented, e.g. in 300 dimensions. The
Diagonal matrix file specifies the respective importance of each of those 300
dimensions. For users who are familiar with principal component analysis, the
Diagonal matrix file would contain the variance ratio for each dimension. Finally, the
Space features file contains specific information about the space generated.
Once all the files have been selected, click the Load button.
FEATURES OF MY TRAINING CORPUS AND MY SEMANTIC SPACE
A good way to take a look at the features of the opened space is to go to the Spaces
Properties menu.
This specifies that our semantic space has 10,686 terms in a linguistic corpus of 45,886
documents. The dimensions by means of which the semantic space has been
represented are 250 (remember that the number of dimensions usually ranges
between 200 and 350). The average similarity between the terms in the corpus (the
average cosine) is 0.0466, and the standard deviation of those similarities is 0.0732.
Given that it can sometimes be hard to assess in absolute terms whether two terms
are semantically related by examining a single cosine, a good option is to express the
cosine as a Z score. To do so, we subtract the average similarity of the space (0.0466
in this case) from the cosine and divide the difference by the standard deviation
(0.0732 in this case). In this way we obtain the similarity as a standard score.
Similarities higher than 3 standard scores represent a strong semantic relationship
between a pair of terms.
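This conversion can be sketched as a one-line function; the default mean and standard deviation are the values reported in Space Properties for this particular corpus:

```python
def cosine_z(cos, mean=0.0466, sd=0.0732):
    """Convert a raw cosine into a Z score using the space's
    average similarity and its standard deviation."""
    return (cos - mean) / sd
```

For instance, the cosine of 0.497 between puñetazo and pelea seen earlier corresponds to a Z score of about 6.15, well above the 3-standard-score threshold for a strong semantic relationship.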
WORKING WITH GALLITO. BASIC ISSUES
QUERIES MENU
SEMANTIC NEIGHBORS
From the Queries Semantic neighbors menu we have the option of seeing which terms
are most closely related to a word of interest. For example, if we are interested in
finding the closest semantic neighbors of the word Crisis, we enter Crisis in the Term
box. By means of Measures we can select among four measures of semantic
relationship.
Cosines: It shows the cosine between the word chosen and its closest neighbors.
Corrected cosines: It shows the cosine between the word chosen and its closest
neighbors, weighted by the neighbors' vector lengths. By means of this option, the list
of neighbors can be restricted to words with a high or relatively high vector length.
More familiar words, with a higher number of occurrences in the training linguistic
corpus, are thus rewarded.
Predication: This option makes it possible to obtain semantic neighbors not for words,
but for pairs of words or predicative structures. Instead of entering the word Crisis we
may be interested in the list of closest neighbors for the words “Crisis mundial” (world
crisis).
Corrected predication: It is also a method to obtain neighbors from two-word
predicative structures, but by means of a more sophisticated algorithm put forward by
Kintsch which can provide more interesting results from an intuitive point of view.
For example, if we select Cosines, we enter the word crisis and ask for 10 semantic
neighbors:
The list given is:
The first semantic neighbor for the term Crisis is the term itself: “Crisis”. The first
information provided is the vector length for the word: 6.205. Given that Crisis is a
term often used in journalism, this vector length is very high. In order to find whether
a vector length is high or low, the average vector length for the corpus words can be
calculated in order to have a reference (see Export files later on). The closest semantic
neighbor to Crisis is “económica/co” (this space is lemmatized and the term that
appears is the masculine singular, which comprises “económica”, “económicas”,
“económico” and “económicos”).
If we want to see the semantic relationship between Crisis and “Económico” on the 0-1
scale, we click the + button to display it.
The Activation label displays the semantic similarity between the two terms: 0.76. This
similarity is very high (close to one), some 10 typical deviations above the average
similarity in the corpus (which, recall, was 0.0466). The vector length for the term
“Económico”, shown under the Norm label, is even higher than that of Crisis: 6.57. It is
thus also a word with a high level of representation in the press corpus (a very high
vector length).
If we again select the word Crisis with 10 semantic neighbors, but now choose the
Corrected cosines method, these are the terms that appear:
Once again, the term “Económico” appears, but now new terms such as “Economía” or
the verb “haber” appear as well. As noted above, this option provides the closest
neighbors while requiring them to be sufficiently represented, with a high vector
length. For example, the term “Economía” has a vector length above three, whereas
the term “Recesión”, which under the previous method was the third closest semantic
neighbor, has a vector length of 0.65, five times smaller.
If we now use a two-term structure such as “Crisis mundial” (world crisis) and select
the Predication method, we obtain the following list:
The closest semantic neighbor of this two-word structure is the term “Crecimiento”
(growth), followed by PIB (GDP), Mundial (world), etc.
COMPARING A TERM TO ANOTHER TERM
The semantic neighbors menu lets us take a quick look at the words most closely
related to a specific term. This provides information about the semantic field in which
the term is found, the words to which it is most related, etc. However, it does not let
us assess the semantic relationship between two arbitrary terms, since one of them
may not be among, say, the first 100 neighbors in the list.
The Queries > Term-Term option makes it possible to assess the semantic relationship
between any two terms, as long as both are found within the semantic space
generated during training.
For example, to see how the word Crisis is associated with the term PSOE (the Spanish
Socialist Party), we enter Crisis as T1 (term 1) and PSOE as T2 (term 2).
Clicking the Compare button shows that the vector length for the first term is 6.205
(the units digit may be cut off in the display, giving the impression that the figure is
0.205). The vector length for the second term is even higher: 9.97. The semantic
relationship is 0.11. If, instead of “PSOE”, we enter the term “Banca” (the banking
sector), the semantic relationship is stronger: 0.18. “Banca” has a much lower vector
length than the other two terms: 0.50. If we choose a term with presumably no
semantic relationship, such as “muerte” (death) (which is more associated with news
about murders or war), we see that the semantic relationship is practically nil: 0.006.
COMPARING A DOCUMENT TO ANOTHER DOCUMENT
This option makes it possible to compare one corpus document to another. Let us
suppose that we have trained the program on 10,000 abstracts taken from scientific
journals. These abstracts are also numbered in a database, have an associated
scientific field, etc. With this option, we can see the semantic relationship between,
for example, documents 1,005 and 2,198:
Go to the Queries > Doc-Doc menu and enter one document number in the Doc1 box
and the other in the Doc2 box.
In this case, the semantic relationship between both documents is very tenuous (0.03).
Both in this option and in the previous one, we have the possibility of calculating the
Euclidean distance between two terms, or between two documents. The Euclidean
distance is not a measure of similarity like the cosine, but rather a measure of
dissimilarity. Its use in LSA is less common, but it has proven to be useful to assess text
quality, among other things.
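The two measures run in opposite directions, as a minimal sketch makes clear: identical vectors give a cosine of 1 but a distance of 0.

```python
import numpy as np

def cosine(u, v):
    """Similarity measure: 1 means identical directions."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    """Dissimilarity measure: 0 means identical vectors; larger means less alike."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))
```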
COMPARING TWO TEXTS
Another common option is comparing the semantic relationship between two free
texts. By free texts we mean texts that are not documents in the training corpus. For
example, this option can be used to compare two pieces of news that talk about
terrorism.
Queries > Free texts
The semantic similarity between the texts is 0.33. Both texts have a similar vector
length, so LSA is equally familiar with both of them.
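Gallito's exact projection of a free text into the space is not described here, but a common LSA approximation is to sum the vectors of the text's known words and then compare the resulting text vectors with the cosine. A sketch under that assumption (the toy vocabulary is illustrative only):

```python
import numpy as np

def text_vector(text, term_vectors):
    """Fold a free text into the semantic space by summing the vectors of
    its known words (a common LSA approximation; the actual projection may
    differ, e.g. by applying term weights)."""
    dims = len(next(iter(term_vectors.values())))
    v = np.zeros(dims)
    for w in text.lower().split():
        if w in term_vectors:
            v += term_vectors[w]
    return v

def text_similarity(t1, t2, term_vectors):
    """Cosine between the folded-in vectors of two free texts."""
    v1 = text_vector(t1, term_vectors)
    v2 = text_vector(t2, term_vectors)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```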
MOST REPRESENTATIVE TERMS
The Queries > Most representative terms option makes it possible to obtain the list of
the k terms with the highest vector length, and thus those with which LSA is most
familiar.
We just have to specify the number of terms that we want to obtain and click the
Extract button.
For example, if we want the 100 terms with the highest vector length:
First comes the auxiliary verb “Haber”, which has the highest vector length (12.92),
followed by “Año” (year) (12.10), the verb “Ir” (to go) (12.00), the verb “Ser” (to be)
(11.94), etc.
MOST REPRESENTATIVE DOCUMENTS
In the same way, we can access the Queries > Most representative docs menu to list
the k documents with the highest average semantic relationship to all the trained
documents.
CI SUMMARIZER
This procedure can help to categorize a text by summarizing it in a few terms.
For example, we enter the following document in the text box:
In Neigh. per word, we specify the number of neighbors the procedure obtains for
each of the words in the text entered. We specify two.
Final list shows which of the neighbors obtained in the previous stage are most
representative (those which have the highest semantic similarity on average with the
rest). Therefore, it displays a short list summarizing the text entered.
By checking Also corrected we can remove terms that have a high vector length but do
not contribute much to the meaning of the texts. If this option is chosen (as
recommended), we can remove highly common verbs such as haber (auxiliary verb), or
ser and estar (to be).
Final neighbors provides a final list summarizing the text. If the procedure is successful,
this list and the previous one provide a brief summary of the document entered in the
text box.
In this example, with Neigh. per word = 2, Final list = 10, Also corrected checked, and
Final neighbors = 10, the final result is the following:
The original document is basically summarized by words such as “propinar”, “ocurrir”,
“detener”, “haber”, “golpe”, “agresión”, “ser”, “paliza”, “agredir”, “agredido” and
“patada”, all of which relate to aggression and hitting. Even though some auxiliary
verbs such as “ser” and “haber” appear, many of the terms sum up the essence of the
document well.
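The logic described above (collect a few neighbors per text word, then keep the candidates that are on average most similar to the other candidates) can be sketched as follows. This is an interpretation of the procedure, not Gallito's actual code, and the Also corrected filter is omitted:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ci_summarize(words, term_vectors, neigh_per_word=2, final_k=10):
    """Sketch of the CI summarizer: 1) pool the nearest neighbors of each
    text word, 2) keep the pool members with the highest average similarity
    to the rest of the pool. Self-neighbors are excluded here for clarity."""
    pool = set()
    for w in words:
        if w not in term_vectors:
            continue
        scores = {t: cosine(term_vectors[w], v)
                  for t, v in term_vectors.items() if t != w}
        pool.update(sorted(scores, key=scores.get, reverse=True)[:neigh_per_word])
    pool = list(pool)

    def centrality(t):
        others = [o for o in pool if o != t]
        if not others:
            return 0.0
        return float(np.mean([cosine(term_vectors[t], term_vectors[o]) for o in others]))

    return sorted(pool, key=centrality, reverse=True)[:final_k]
```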
FILE EXPORT
EXPORT OF MATRICES AS TXT FILES TO BE READ BY SPSS OR OTHER STATISTICS
SOFTWARE
From Export > Matrices to .txt we can generate the working matrices for the latent
semantic space as *.txt files in a hard disk directory of our choice.
Click on the button to tell the program in which directory you want it to save the
matrices. Then click the Generate button.
The matrices saved in their respective files are the following:
Modulos.txt contains the vector lengths for all the trained terms. If you open this file
in SPSS and generate some descriptives, you will find that the vector lengths of the
10,685 words are distributed with positive skew (skewness = 4.365), with a mean of
0.76 and a median of 0.32.
The pesos.txt file contains the weights assigned to the 10,685 terms in this corpus,
i.e. the weights applied to the raw frequencies in the original term × document matrix.
The S.txt file contains a square n × n matrix, n being the number of dimensions
specified to train the corpus, with the singular value associated with each dimension
(representing the proportion of variance associated with each dimension).
The US.txt file contains the matrix of terms by dimensions. In this case, as the
linguistic corpus was trained using 250 dimensions and there are 10,685 terms, the file
contains 251 columns (the first column holds the list of terms) and 10,685 rows, one
per term. This matrix is extremely interesting, as it is the latent semantic space proper
generated by the training.
Estadísticos (SPSS descriptive statistics for the variable Modulo):
N Válidos (valid): 10685
N Perdidos (missing): 0
Media (mean): 0.7627
Mediana (median): 0.3204
Desv. típ. (standard deviation): 1.35637
Asimetría (skewness): 4.365
Error típ. de asimetría (std. error of skewness): 0.024
Curtosis (kurtosis): 22.053
Error típ. de curtosis (std. error of kurtosis): 0.047
Finally, the SV.txt matrix contains the vectorial representation not of the terms but of
the trained documents.
Importing the Modulo.txt, Pesos.txt, and US.txt into a single SPSS file is very easy:
You can use all the analysis techniques available in this software.
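The same exported files can also be read into Python instead of SPSS. A sketch for US.txt, assuming one term per row followed by its coordinates; the tab separator and the comma decimal mark (common in Spanish-locale exports) are assumptions, so adjust `sep` and `decimal` to match your files:

```python
import numpy as np
import pandas as pd

def load_us(path, n_dims):
    """Load an exported US.txt matrix: first column = term, remaining
    columns = the term's coordinates in the latent semantic space."""
    df = pd.read_csv(path, sep="\t", header=None, decimal=",",
                     names=["term"] + [f"dim{i + 1}" for i in range(n_dims)])
    terms = df["term"].tolist()
    vectors = df.iloc[:, 1:].to_numpy(dtype=float)  # one row per term
    return terms, vectors
```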
EXTRACTING CLUSTERS FROM A SEMANTIC SPACE AND REPRESENTING THEM IN
PAJEK
This is a very good option for seeing a graph representation of the main concepts in
the training linguistic corpus. We start by accessing the Export to Pajek > Clusters to
Pajek menu.
This procedure generates three files. One of them is a large matrix of correlations
between the terms that the cluster extraction procedure identifies as most relevant.
Another file is generated so that the Pajek program can directly draw a conceptual
network diagram. The parameters must be specified:
First of all, in Directory we must specify the path where we want Gallito to generate
the three files.
Once the path has been established, we must tell the program whether we want it to
generate word clusters or to choose the most representative word in each generated
cluster. We will use words (Words) for the example.
We also choose the Normalizing US matrix option, which prevents the terms in the US
matrix with the highest vector lengths from carrying greater weight when the clusters
are generated. The cluster analysis procedure is the K-means algorithm, which
requires specifying the number of clusters to work with. In the Cluster num box we
choose, for example, 35 clusters. In Cluster cycles we decide how many iterations the
procedure makes. If we choose 2 iterations, words are assigned to their closest cluster
in the first cycle, and in the second cycle they are reassigned to another cluster if the
distances have changed. The more iterations, the more stable the solution, but the
longer the procedure will take.
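A sketch of this clustering step (row-normalization followed by a fixed number of K-means assignment/update cycles; Gallito's internal implementation may differ in initialization and stopping details):

```python
import numpy as np

def kmeans_terms(vectors, n_clusters=35, cycles=2, seed=0):
    """K-means over row-normalized term vectors: normalizing removes the
    influence of vector length, and each cycle reassigns terms to their
    nearest centroid. More cycles give a more stable (but slower) result."""
    rng = np.random.default_rng(seed)
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # "Normalizing US matrix"
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(cycles):
        # assign each term to the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each non-empty cluster's centroid
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels
```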
After the clusters are processed, the following files are generated in the specified
directory:
The files are ready to be used in the Pajek program. Pajek is a program that represents
network graphs, giving us a quick, useful view of the general groupings of semantic
concepts in the corpus used to train the tool. Here is an example of a representation
of association networks using the press corpus used to train Gallito.
The cluster.mat file is the input file for Pajek. Just open it with Pajek, remove
connections below a threshold value in the Transform tab (for example, 0.35), and
draw it. Then use the Kamada-Kawai algorithm to quickly generate a reasonable layout.
[Figure: Pajek network of the press corpus clusters. Labels include: Al Qaeda;
Inmigración (Immigration); Embarcación, petrolero, fuel… (Vessel, oil tanker, fuel…);
Bush, Azores, Blair…; Guerra, Militar, Afganistán (War, Military, Afghanistan); Justicia
(Justice); Manifestación (Demonstration); Fraude inmobiliario (Real estate fraud);
Batasuna (Basque pro-independence party).]
CONVERTING TEXT FILES TO VECTOR AND REPRESENTING THEM IN PAJEK
This is a very good option for seeing a graph representation of text files (essays,
documents) in Pajek. We start by accessing the Export to Pajek > Documents to Pajek
menu.
This procedure converts text files (essays, documents, e-mails, etc.) into vectors and
generates a .mat file (for Pajek) to draw them, as in the clusters option. Just specify
the working directory in the Directory textbox, and specify the file that contains the
list of text files to be processed in File. These files can be in .doc, .docx and .txt
format. The list format will look like this:
Prueba.docx
prueba2.docx
34361692.txt
27597718.doc
40035398.txt
242310602.txt
33882636.txt
32906839.docx
65982026.txt
39970932.txt
15871651.txt
The file is ready to be used in the Pajek program. Just open it with Pajek, remove
connections below a threshold value in the Transform tab (for example, 0.35), and
draw it. Then use the Kamada-Kawai algorithm to quickly generate a reasonable layout.
CHANGE OF BASIS
IDEA
The change of basis is a procedure that makes it possible to interpret each of the
coordinates representing a word in a much more specific way.
When we choose to represent the latent semantic space with 250 dimensions, the
mathematical representation of each term is expressed in dimensions that have no
familiar meaning to us. They are abstract dimensions in which we know the words are
well defined, but which provide no idea or insight to users.
As seen earlier, the program can export the latent semantic space to a file named
US.txt, which can be easily imported into Excel or SPSS.
We previously imported the data into SPSS, obtaining this:
The first word that appears is “congreso” (conference). This word has a vector length,
a weight, and 250 coordinates in 250 dimensions. The figure above shows that the
coordinate of the term “congreso” in dimension 1 is 2.039, its coordinate in
dimension 2 is 1.951, etc. What do these dimensions mean? That is, the word
“congreso” has a coordinate of 2.039 in the first dimension; what does that mean?
Unfortunately, the first dimension is abstract and does not correlate semantically
with anything.
The idea of the change of basis is to transform that abstract space into another one
that users can interpret more easily. Besides making the semantic space more
specific, this makes it possible to use it more efficiently, as we shall see.
SPACE CHANGE OF BASIS
In order to carry out a change of basis, we must first access the Space > Change of
basis menu. There are two options:
By clusters turns the abstract dimensions into types of words generated by a k-means
cluster analysis.
By predefined words turns the abstract dimensions into new dimensions defined by
users according to their interests.
We will work with this second option:
Space > Change of basis > By predefined words
When accessing this menu, the following options are available:
In Reference folder we specify the path where the new semantic space with
meaningful coordinates will be saved (together with other files described later). This
folder also contains a text file with the list of words chosen by users to replace the
abstract dimensions with new, meaningful dimensions.
In the Words file box we specify the name of the file containing the words that will
serve as the new dimensions. With the help of the conceptual network graph
previously generated with Pajek, together with some notions about the main concepts
in the press corpus, we propose the following dimensions:
The first dimension is related to the descriptors “Guerra”, “Militar” and “Afganistán”
(War, Military, and Afghanistan). The second one is related to the maritime terms
“Embarcación”, “Petrolero” and Fuel (Vessel, Oil tanker, and Fuel). Note that the file is
called Nuevas dimensiones.txt, which must be specified in the Words file dialog box.
Gram-Schmidt orthogonalization is an algebraic procedure required to preserve the
orthogonality of the new meaningful dimensions. If this option is not chosen, we risk
obtaining a new semantic space that is meaningful but highly oblique, which distorts
neighbor comparisons and basic processes: the gain in meaningfulness entails a loss
in practical usefulness. Activating this option is therefore recommended.
Normalized basis is another recommended option, which not only orthogonalizes the
basis (Gram-Schmidt) but also gives each basis vector unit length. Choosing both
options provides a new, meaningful orthonormal basis.
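As a sketch of what these two options compute together, here is a classical Gram-Schmidt orthonormalization applied to the word vectors chosen as new dimensions (Gallito performs this internally; this is only an illustration of the algebra):

```python
import numpy as np

def gram_schmidt(vectors):
    """Gram-Schmidt orthonormalization: each vector has its projections on
    the earlier vectors removed (orthogonal basis) and is then scaled to
    unit length (normalized basis), yielding an orthonormal basis."""
    basis = []
    for v in np.asarray(vectors, dtype=float):
        for b in basis:
            v = v - (v @ b) * b          # remove the component along b
        norm = np.linalg.norm(v)
        if norm > 1e-10:                 # skip (near-)linearly dependent vectors
            basis.append(v / norm)
    return np.array(basis)
```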
The words you want to use to define the new dimensions must be among the trained
terms, so you should first check that they exist in the space.
Once the relevant changes to this dialog box have been made, click the Change
button. Calculating the new semantic space usually does not take long.
The figure above shows the files generated by the procedure to change basis.
basisMatrix.txt is a file including the new basis (the new, meaningful basis).
basisMatrixBeforeGsOrtog.txt contains the original basis (the abstract basis).
GSreliability.txt is a file that shows the reliability of the dimensions (values must be
higher than 0.70 for the chosen words to lend the dimensions their meaning). A quick
look at this file shows that the reliabilities associated with the dimensions are all,
without exception, higher than 0.70:
Note that, after the first 14 dimensions there appear the dimensions ABSTRACT15,
ABSTRACT16, etc. This is because, once 14 meaningful dimensions have been specified,
the rest, up to 250, must be abstract dimensions.
newTermMatrix.bnl contains the new semantic space that includes the first 14
meaningful dimensions. This file can be loaded (as it has the extension *.bnl) in the
Load tab in the Term matrix box (see the section Loading the files).
newTermMatrix.txt contains the new semantic space including the first 14 meaningful
dimensions. Unlike the previous file, this file can be easily exported to Excel or SPSS.
oldTermMatrix.txt is the file where the former semantic space (including the 250
abstract dimensions) is preserved.
Let us see some examples of how to make use of the new basis.
The term “Congresos” (Conferences) clearly stands out in the “Política_PSOE_PP”
dimension over the rest. The difference with respect to the former semantic space is
that we can now say in which dimension(s) a specific word has significant weight.
The term Rey (King) clearly saturates the “Monarquía_Rey_Princesa” dimension.
[Figure: bar chart of the coordinates of the term “congreso” in the new dimensions.]
[Figure: bar chart of the coordinates of the term “rey” in the new dimensions.]
The term Islam clearly saturates the “Al_Qaeda_Bin_Laden” dimension.
And the term “Garzón” saturates the “Justicia_Juez” dimension.
This message tells us that the program has finished training the linguistic corpus we
provided and has saved the seven files with which we can start working.
We can look in the specified directory for the seven files the program should have
created. Note that the file names correspond to the names Gallito assigns in the Save
tab. This will make things easier later, when we open a new session with the program
and want to load these files to work with it.
[Figure: bar chart of the coordinates of the term “islam” in the new dimensions.]
[Figure: bar chart of the coordinates of the term “garzón” in the new dimensions.]
BATCHES
You will often have to analyze a group of words or files. To this end, Gallito offers the possibility of carrying out actions in batches.
SEMANTIC NEIGHBORS
Access the Batches > Neighbors menu.
Through this menu you can extract the first n semantic neighbors of a series of terms. In the dialog box you can specify the number of neighbors, the directory containing the file where the terms are listed, and the file itself.
The content of the file will look like this:
This process will also generate, in the working directory, one file per term specifying the neighbor, the cosine, and the vector length (norm). The computation time for this procedure is quite short.
SIMILARITY MEASURES
Access the Batches > Similarity matrices menu. A matrix will be extracted in which the neighbors of a term are compared to one another. This square matrix has ones on its diagonal, with each cell representing the cosine between one neighbor and another. The working directory and the file name must be specified; the similarity matrices will be generated in that directory. The form can be accessed via Batches > Similarity Batches.
The file contents will look like this:
Clicking Extract generates a matrix of dimensions 200x200 in the first case, 300x300 in the second, etc.
SIMILARITY PAIRS
Access the Batches > Similarity Pairs menu.
This will generate the similarities for a series of pairs of terms. The reference file and the file name must be specified in the directory.
The file contents will look like this:
INTRATEXT ANALYSIS
Sometimes you will need to analyze the internal properties of a text, such as the number of paragraphs, the number of words per paragraph and per sentence, its coherence, its synthetic and informational capacity, etc. To this end, Gallito 2.0 can measure a text on the basis of the various indices described below.
First of all, it offers coherence measures. Textual coherence has been measured with LSA on the basis of similarity scores between successive parts of the text, whether sentences or paragraphs (Foltz, 2007). Moreover, behavioral correlates of these effects have been found. For example, Wolfe, Magliano, & Larsen (2005) found that LSA-measured similarity between sentences influenced reading processing times. Bellissens, Jeuniaux, Duran, & McNamara (2010) have also shown that students with a low level of prior knowledge of a topic, in contrast to those with a high level, are more sensitive to the withdrawal of information overlap between sentences, operationalized by means of LSA similarities, as well as to the number of causal clauses measured with Coh-Metrix (Graesser et al., 2004). When textual coherence is not maintained, readers are forced to make elaborative inferences, which are more costly and require prior knowledge. The authors argue that this semantic overlap promotes the integration of what is being read with what was previously encoded, facilitating the reading flow.
Gallito 2.0 measures three types of coherence. Firstly, it measures Paragraph-Paragraph coherence. A paragraph is defined by a line break. Paragraphs with fewer than 10 words (words vectorized by LSA) are not taken into account for the analysis. The procedure extracts the cosine between each paragraph and the next, and in the end all the similarities are averaged to yield a single measure.
Secondly, Gallito 2.0 measures Sentence-Sentence coherence. This coherence is measured within each paragraph, which means that the similarity between the final sentence of a paragraph and the initial sentence of the following paragraph is not measured. Gallito 2.0 acts in this way under the assumption that paragraphs reflect thematic units. Obviously, one-sentence paragraphs are not included in this analysis, nor are sentences with fewer than 4 words. At the end, the sentence-sentence coherences within each paragraph are averaged, yielding a single sentence-sentence coherence.
In addition, a third type of coherence is calculated which, although usually measured as such, is also used to obtain the sentence that best represents a paragraph or text (Kintsch, 2002). This is Sentence-Paragraph coherence, which measures the similarity between every sentence and the paragraph that contains it. We should warn that this measure may be problematic, as we believe it depends too heavily on the number of sentences in a paragraph.
In addition to coherences, Gallito obtains the following surface measures: number of paragraphs, number of words, average number of words per paragraph, average number of words per sentence, and average number of sentences per paragraph. Finally, Gallito 2.0 provides an estimate of the average amount of information provided by the words in a given text.
This index gives an idea of the domain-specificity of the words used and of the degree of synthesis, that is, the extent to which examples or uninformative words have been used. We introduced this latter measure, the average global weight in each paragraph, to measure the informativeness of the language employed in texts. Global weight is the opposite of entropy and is in fact part of its formula. The higher the global weight, the more informative a word is with regard to the contexts in which it appears. To use this batch process, access Batches > Intratext Analysis. The following form will appear.
Specify the working directory in Directory and the file that contains the list of text files to be processed in File. These files can be in .doc, .docx and .txt format. The list format will look like this:
Prueba.docx
prueba2.docx
34361692.txt
27597718.doc
40035398.txt
242310602.txt
33882636.txt
32906839.docx
65982026.txt
39970932.txt
15871651.txt
The results will be given in a file called resul.txt, which will include the following columns:
ID (file name)
numParagraphs (number of paragraphs)
AverageSentencesPerParagraph (average number of sentences per paragraph)
AverageWordsPerParagraph (average number of words per paragraph)
AverageWordsPerSentence (average number of words per sentence)
AverageCohesionSenSen (average sentence-sentence coherence)
AverageCohesionParSen (average paragraph-sentence coherence)
CohesionParPar (average paragraph-paragraph coherence)
AverageGlobalWeight (average global weight)
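The sentence-sentence coherence column can be sketched as follows: consecutive sentences are compared within each paragraph only, short sentences are skipped, and the similarities are averaged. This is an interpretation of the rules described above (the sentence vectors themselves are assumed to come from the LSA space):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sentence_sentence_coherence(paragraphs, sent_vector, min_words=4):
    """Average cosine between consecutive sentences, computed within each
    paragraph only (never across a paragraph break). `paragraphs` is a list
    of lists of sentences; `sent_vector` maps a sentence to its LSA vector.
    Sentences under `min_words` words are skipped, so one-sentence
    paragraphs contribute nothing."""
    sims = []
    for sentences in paragraphs:
        kept = [s for s in sentences if len(s.split()) >= min_words]
        for s1, s2 in zip(kept, kept[1:]):
            sims.append(cosine(sent_vector(s1), sent_vector(s2)))
    return float(np.mean(sims)) if sims else float("nan")
```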
ESSAY FILE EVALUATION
Evaluating the content of essay files is relatively simple using LSA. Once you have the semantic vector space, you compare each student's essay file with what are known as gold essay files. Gold essay files serve as a reference, as they have been written by an expert who has optimally summarized the topic, or who has correctly answered the question. There may be a single gold essay file or several (Rehder et al., 1998).
The procedure is quite similar to the intra-textual analysis, except that the gold essay list must be specified in addition to the student essay list. Access Batches > Essay Evaluation to find this form.
You must specify both the essay files list and the gold essay files list and, as always, a working directory. The results will be written to a file called resul.txt in the working directory. Both student and gold files can be in any of the following formats: .pdf, .docx, .doc, .txt. Both the list of essay files to be evaluated and the list of gold essay files will be in the following format:
The results appear in the following format:
Similarity is measured in terms of distances, so higher figures will indicate a lower score with respect to gold essay files.
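A sketch of this scoring rule, assuming each essay has already been folded into the space as a vector and that, with several gold essays, the distance to the closest one is taken (the aggregation over multiple gold essays is an assumption, not Gallito's documented behavior):

```python
import numpy as np

def essay_scores(student_vectors, gold_vectors):
    """Score each student essay by its Euclidean distance to the nearest
    gold essay vector. Because the output is a distance, higher figures
    mean a LOWER score with respect to the gold essays."""
    golds = [np.asarray(g, dtype=float) for g in gold_vectors]
    return [float(min(np.linalg.norm(np.asarray(s, dtype=float) - g) for g in golds))
            for s in student_vectors]
```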
DOCS TO VECTORS
This procedure converts text files (essays, documents, e-mails, etc.) into vectors. Specify the working directory in the Directory textbox and the file that contains the list of text files to be processed in File. These files can be in .doc, .docx and .txt format.
The list formats will look like this:
Prueba.docx
prueba2.docx
34361692.txt
27597718.doc
40035398.txt
242310602.txt
33882636.txt
32906839.docx
65982026.txt
39970932.txt
15871651.txt
The output: a .txt file with a matrix whose rows are the vectors of the files in
list.txt.