THE BASICS OF THE GALLITO 2.0 PROGRAM
INTRODUCTION
The Gallito 2.0 software (currently the latest version) is a program for the processing of
a large number of linguistic documents to obtain a mathematical representation of
language. In addition to processing a large volume of language (as we shall see, the
term "processing" is usually replaced by the term "training"), it includes multiple
features to semantically process linguistic information. Thus, the program is able to
quantify the semantic relationship between texts, measure the cohesiveness between
paragraphs in a text, extract the key words that summarize a document, serve as the
basis to obtain conceptual network graphs, enable the analysis of term types by means
of K-means cluster analysis, serve to assess text quality, and change the basis to obtain
a new semantic representation of language.
The purpose of this document is to serve as a roadmap for effective, simple use of the
Gallito program. It is written in plain language, avoiding unnecessary technical detail
about the machinery that underlies the tool, and features are always illustrated with
examples so that users can grasp them more easily.
Before discussing use of the program, it should be pointed out that Gallito is essentially
a program that represents language mathematically. When working with this software,
users should bear in mind that each word is a numerical vector with a manageable
number of coordinates (approximately 300). Likewise, every text, sentence, or term is
a numerical vector with some 300 coordinates. So this is not a
program for qualitative discourse processing. Rather, it falls within the framework of
artificial intelligence theories. Gallito is based on latent semantic analysis (LSA)
technology and philosophy.
WHAT IS LATENT SEMANTIC ANALYSIS (LSA)?
How is this mathematical representation of language obtained? How is each word
represented by means of a number of numerical coordinates? To obtain a
mathematical representation, a technique known as LSA is applied. Much research has
been done since the 1990s on this technique and its possibilities for treatment of
language semantics (the classic paper is Landauer and Dumais' 1997 "A Solution to
Plato's Problem", which can be easily downloaded).
Basically, the procedure is as follows. The user must have a set of texts to train the
program. This is what is known as the linguistic corpus. This handbook provides
multiple examples taken from a linguistic corpus of some 46,000 press documents
from the Spain section in the El País and El Mundo newspapers, collected between
2002 and 2009. In this case, we wanted Gallito to obtain a statistical or mathematical
representation of the terms in those press documents, as well as of the documents in
that corpus.
The first step towards this goal consists in uploading the linguistic corpus stored in a
single (plain text) file to the tool. A character must be specified to let the program
know when a new document begins and ends. Documents are usually separated by
means of the hash character (#).
What does the LSA (and thus Gallito) do with the file that contains the linguistic
corpus? It generates a frequency matrix, where all the different terms that appear in
the file are entered into rows and each of the documents is entered into a column.
Thus, the cells in that great matrix provide the number of times a given word occurs in
a document. In our newspaper corpus there are a total of 10,686 terms (rows) and
45,886 documents (columns).
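The first steps of this process (splitting the file on the separator and counting occurrences) can be sketched in plain Python. The miniature corpus below is illustrative, not taken from the real training file:

```python
from collections import Counter

# A miniature corpus: documents separated by the hash character (#),
# following the Gallito convention described above.
corpus = "el congreso aprueba la ley # la crisis golpea la economia # el rio ebro crece"

# Split the file contents into documents on the separator, then tokenize.
documents = [doc.split() for doc in corpus.split("#")]

# Rows: every distinct term; columns: one per document.
terms = sorted({t for doc in documents for t in doc})
counts = [Counter(doc) for doc in documents]

# X[i][j] = number of times term i occurs in document j.
X = [[c[t] for c in counts] for t in terms]
```

The real matrix is of course far larger (10,686 rows by 45,886 columns), but the construction is the same.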
How much space does a document take up? When we talk about a document, we
usually refer to a paragraph (generally between 50 and 300 words). The natural unit of
the document for the LSA is the paragraph, although you can also work in such a way
that a document is a sentence or a couple of sentences, e.g. if you want a smaller
level of analysis. The opposite is also possible: you can turn a book chapter, an entire
book, etc. into a document, although this is more unusual.
This great frequency matrix is called X. Does Gallito work with this matrix? The matrix
can be thought of as a numerical (mathematical) representation of the terms and
documents. In principle, to find the semantic similarity between two terms, it would
suffice to compute the correlation or degree of resemblance between their two rows.
To find the semantic similarity between Congreso (Congress) and Economía (Economy),
it would be enough to pick those two rows in the matrix and calculate the correlation
between them. In practice, however, this matrix is not very useful: studies show that it
does not represent the semantic relationships between words well. When working with
this matrix, a lot of noise appears. By noise we mean the large extent to which the
matrix depends on idiosyncrasies in language use by the different authors. This matrix
is usually known as the raw matrix because, among other things, it has not been purged.
To begin with, Gallito removes syntactic words from this matrix. It carries out a purge
to remove the most frequent words in the language, those which do not provide
semantic information, such as prepositions, articles, and pronouns. In addition, word
frequencies follow a greatly asymmetric distribution. Some words appear much more
frequently than others, but that frequency ratio is not related to the semantic
significance of each of the words. Put otherwise, the raw matrix X must be modified.
The second step consists in applying a weighting function, usually involving logarithms,
to dampen this great asymmetry in word frequency (log-entropy and log-IDF are the
most usual weighting procedures; both are included in Gallito, as we will see). After this
weighting, matrix X no longer displays such an asymmetrical word frequency distribution.
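A common formulation of log-entropy weighting can be sketched as follows. This is a sketch of the standard textbook formula; Gallito's exact implementation may differ in details:

```python
import math

def log_entropy(X):
    """Weight a term-by-document count matrix X with log-entropy.

    Local weight: log(1 + f_ij).  Global weight for term i:
    1 + sum_j p_ij * log(p_ij) / log(n_docs), where p_ij = f_ij / gf_i
    and gf_i is the term's total frequency.  Frequent, evenly spread
    terms get a global weight near 0; informative terms get a weight
    near 1.
    """
    n_docs = len(X[0])
    weighted = []
    for row in X:
        gf = sum(row)  # global frequency of the term
        entropy = sum((f / gf) * math.log(f / gf) for f in row if f > 0)
        g = 1 + entropy / math.log(n_docs)
        weighted.append([g * math.log(1 + f) for f in row])
    return weighted
```

A term that appears evenly in every document (like a preposition) ends up with a global weight near zero, which is exactly the purging effect described above.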
The result is a modified matrix X. Can we work with this 10,686 x 45,886 dimension
matrix? It's still not useful. What is most subtle about LSA and has made it famous is
the next step: applying a dimension reduction technique on that modified matrix X so
that the words (or documents) are represented not by 45,886 columns, but by a much
smaller number of dimensions, usually about 300. The reason why 300 is chosen is
more empirical than theoretical. No research has been done (as far as we know)
showing that the brain represents all words in a 300-dimension semantic space
because this is an adaptive or reasonable figure that correctly encompasses the
essence of concepts. Rather, this approximate number of dimensions is the result of
empirical studies asking how well human semantics are simulated when the dimension
reduction algorithm yields a solution of 100, 200, 300, 400, etc. dimensions. The
results indicate that language can be mathematically represented in a way that
simulates human semantics satisfactorily if the dimension reduction technique
decomposes the weighted matrix X into about 300 dimensions. Too few dimensions
provide an excessively poor solution: semantics is not well captured, and essential
aspects are lost if we try to account for them with so little information. Too many
dimensions produce the opposite problem: an excess of spurious dimensions which
are useless and distort the semantic representation.
The dimension reduction algorithm is known as the Singular Value Decomposition
algorithm. Techniques such as Principal Component Analysis on the correlation matrix
or Correspondence Analysis on contingency tables are very similar to this
decomposition.
The final result of the linguistic training of LSA (or, in this case, Gallito) is the creation
of a new 300-dimension matrix (if a k of 300 is selected) where each of the 10,686
words is represented by means of a vector of 300 numbers or coordinates.
This matrix is typically known as US.
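The decomposition and truncation steps can be sketched with NumPy's SVD routine. The matrix and k below are toy-sized; the real training decomposes a 10,686 x 45,886 matrix with k of about 300:

```python
import numpy as np

def latent_space(X, k):
    """Truncated SVD of the weighted term-by-document matrix X.

    X = U S V^T; keeping only the k largest singular values yields
    the term representation US = U_k * S_k, one k-dimensional vector
    per term (i.e. per row of X).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]  # the "US" matrix: terms x k

# Toy example: 4 terms x 3 documents, reduced to k = 2 dimensions.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
US = latent_space(X, 2)
```

With the full rank kept, the decomposition reproduces the original term-by-term inner products exactly; truncating to k dimensions keeps the regularities and discards the noise, as described in the text.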
Gallito is then ready to be used, as a trained, interesting, and useful mathematical
representation of the language is available.
The matrix US is usually known in LSA as the latent semantic space. This space is the
vectorial space generated from the original linguistic corpus. It is semantic because the
matrix US captures the semantic relationships between words well. It is latent because,
by reducing to approximately 300 dimensions, the noise of the original matrix X has
been removed, so that the essence of semantics, the regularities between words, is
captured rather than the idiosyncrasies. In this way, the latent aspects that underlie
semantics are captured.
The most basic possibility afforded by this representation is the verification of the
semantic similarity between two words. This amounts to obtaining the correlation
between two rows in the matrix US (to be more precise, their cosine). The cosine ranges
between -1 and 1. Leaving negative figures aside, as they are highly infrequent, values
range between 0 and 1. The semantic relationship between two words can be assessed
by means of the cosine. When the value of the cosine is about 0, the two words are
independent, orthogonal, or semantically different. The more a cosine tends towards
1, however, the greater the semantic similarity between two words.
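The cosine between two term vectors is the dot product divided by the product of their lengths; a minimal sketch (the two-dimensional vectors are illustrative, real Gallito vectors have some 300 coordinates):

```python
import numpy as np

def cosine(u, v):
    """Cosine between two term vectors (rows of the US matrix):
    dot product divided by the product of the vector lengths."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Orthogonal vectors give a cosine of 0 (semantically independent words); vectors pointing in the same direction give a cosine of 1.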
For example, if we obtain the semantic relationship between the terms puñetazo
(punch) and pelea (fight) in Gallito, the value of the cosine is 0.497, a value that shows
a close semantic relationship between the terms (values higher than 0.50 are very rare,
as users will find out as they work with the program).
If we try the terms puñetazo (punch) and Ebro (the river), the semantic relationship is
-0.021, practically nil. Puñetazo and Ebro are two words with practically no semantic
relationship.
Finally, if the semantic relationship between two rivers, Ebro and Duero, is tested, the
cosine has a value of 0.533, again a very close semantic relationship.
As well as semantic similarity, another basic piece of information is the vector length.
The vector length can be defined as the amount of information which the LSA has
about a word. A high vector length means that the LSA has a high degree of knowledge
about that word. By contrast, a low vector length means that the LSA has not had
much exposure to that word, and thus its knowledge of it is not very deep. These are
the vector lengths of the four words that served as examples in the journalistic
corpus:
The LSA trained with the journalistic or press corpus has greater knowledge about the
word Ebro (vector length 2.141) than about the other three. The word Ebro has a
vector length almost three times that of the word Pelea (vector length 0.716).
Duero and Puñetazo are words which are less represented in the journalistic corpus, as
their vector length is rather smaller. The result of our analysis of the cosines and vector
length can be graphically summarized as follows:
Puñetazo (Punch) Vector length = 0.373
Pelea (Fight) Vector length = 0.716
Ebro Vector length = 2.141
Duero Vector length = 0.326
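Numerically, the vector length reported here is just the Euclidean norm of the term's row in the US matrix; a minimal sketch (the example vector is illustrative, real vectors have some 300 coordinates):

```python
import numpy as np

def vector_length(v):
    """Euclidean norm of a term vector: a proxy for how much
    information the trained space holds about that term."""
    return float(np.linalg.norm(v))
```

For instance, a hypothetical term vector (3, 4) has length 5; the real lengths above (0.373, 0.716, 2.141, 0.326) are computed the same way over about 300 coordinates.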
If we represent the semantic space in two dimensions (a simplification, given that we
usually need about 300 dimensions), Pelea and Puñetazo (Fight and Punch) are
represented by vectors with more or less similar directions (they are semantically
related), but are largely independent from the words Duero and Ebro (which in turn
are also closely related, as their vectors also have more or less similar directions). The
longest vector is that of the term Ebro, as it has the largest vector length. The LSA
therefore has more information about the word Ebro than about any of the other
three.
INSTALLATION OF THE GALLITO PROGRAM
REQUIREMENTS
For Gallito 2.0 to work correctly, the following components must be installed (they are
included in the download package and can also be installed from the official Microsoft
website):
- A 64-bit Windows operating system (Windows 7 or Windows Server).
- Microsoft SDK 4 (included in the download package).
- Microsoft Visual C++ 2010.
- Write permission for the installation directory.
DOWNLOADING THE PROGRAM
Publications, projects, demos, and interesting links related to the LSA technology can
be downloaded from the www.elsemantico.com website.
From the Download tab you can access the page
http://www.elsemantico.com/gallito20/download-es.html where the program can be
downloaded.
The download links for the program (version 2.1.3) and for three manuals that serve as
a guide to the program are located in the setup and manuals box.
Clicking on Versión 2.x.x setup (30 days free) [zip] opens the following dialog box:
Check the Save option and click Accept.
A .zip file will be saved from which the program will be installed.
Unzip the file and double-click the Release directory:
Double-click the icon again. The program installation wizard will start:
Click Next.
This step in the wizard shows the path where the program will be installed. Click Next.
In the next step, just confirm where you want to install the program, then click Next.
The installation process will take a few minutes.
Once installed, close the wizard by clicking Close. Don't forget to grant write
permission to the installation directory (C:\Program Files\elsemantico.com\).
The program should appear on the Start menu:
The program icon appears. Click on it. The program should open after loading the files
and display the following interface.
It warns you that you have a 30-day trial period.
Licensed users are provided an executable that extends the user license for an
indefinite period.
The program has been installed and is ready for use. The following sections in the
manual will explain how to train the program by means of a linguistic corpus, how to
load a semantic space, and all the working possibilities provided by the software.
TRAINING A CORPUS
Training a linguistic corpus in Gallito firstly entails compiling all the linguistic
documents to be trained into a single file (preferably a *.txt file).
Given that the program starts by placing the documents in columns and the different
terms in rows in a large matrix, the program must be told how the documents are
separated from each other.
To do this, go to the Corpus tab.
CORPUS TAB
Locate the path to the *.txt file that includes the documents to be trained by means of
the Reference Corpus button.
Note that the character separation option is checked, with the symbol # separating
documents.
The file to be read has this structure:
If the file containing the texts to be trained uses a different separating symbol, the
program must be told so.
If the documents are not separated by a special symbol (hash, ampersand, at sign,
etc.) but the period is used instead, the sentences separation option can be chosen.
The number of period-separated sentences that constitute a document (1 by default)
can be specified in the box A document is ___ sentences.
The A document minimum are 2 words option serves to specify the minimum number
of words for a document. Given that documents are usually paragraphs, a higher
number can be chosen in this option, e.g. 10.
Finally, the Remove words that do not appear at least 1 document option makes it
possible to specify how many times a term must appear in the training corpus for it to
be mathematically represented by the program. The question here is the minimum
number of occurrences a word needs in order to be included in the semantic space
generated by the program. A value of 5 or higher is usually chosen, as an
underrepresented word may be counterproductive and display highly random
semantic relationships.
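Taken together, the Corpus-tab options amount to a filtering pass over the corpus file; a minimal sketch, where the function name and default thresholds are illustrative (the defaults mirror the values suggested in the text):

```python
from collections import Counter

def prepare_corpus(raw, sep="#", min_words=10, min_docs=5):
    """Split a corpus file on `sep`, drop documents shorter than
    `min_words`, and drop terms that appear in fewer than
    `min_docs` documents."""
    docs = [d.split() for d in raw.split(sep)]
    docs = [d for d in docs if len(d) >= min_words]
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    vocab = {t for t, n in df.items() if n >= min_docs}
    return [[t for t in d if t in vocab] for d in docs], sorted(vocab)
```

Only the surviving documents and vocabulary would then go into the frequency matrix.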
SAVE TAB
There are more options to be specified in the training process. We have already
completed the Corpus tab. Now we will specify a path and name for the seven files
that are automatically generated after the training.
Access the Save tab.
As can be seen, there are seven buttons associated with seven text boxes. This means
that a path must be specified and names must be given to the seven different files
created after training Gallito with the linguistic corpus described previously.
Term list is a file that, after the training ends, includes all the terms mathematically
represented by the program. It is a sort of dictionary of the terms available after the
training. That file should be given the name TERMLIST for easy identification
afterwards (although any name can be used, of course).
Doc list is a file that includes all the documents used to train the program. It is a sort of
dictionary of documents. The file name can be DOCLIST.
Doc matrix is the numerical matrix where all the documents are mathematically
represented. Again, any name can be given, but DOCMATRIX is useful so that, when
this file is retrieved, we know exactly what it contains.
Term matrix is the file that contains the vectorial representation of the terms. This
matrix is, so to speak, the final essence of the training, the core from which the
semantic relationships between the texts can be worked on. The assigned name could
be TERMMATRIX.
Global weight is a file that contains the weights assigned to each of the words
analyzed. Not all words carry the same amount of semantic information, hence they
receive different weights depending on their importance.
Diagonal matrix is a matrix that also contains weights. In this case, the diagonal matrix
describes the importance of each of the dimensions by means of which the latent
semantic space was finally represented. For users who are familiar with factorial
analysis, the diagonal matrix contains the percentage of variance accounted for by
each of the dimensions that describe the semantic relationships between words.
Space features is a file that includes certain features of the training process.
Once all the files that will be generated by the training have been named, the way in
which those files are to be saved must be chosen: binary or serialized. The binary
mode is more efficient than the serialized one, so, technical differences aside, the
binary mode is the better option.
Finally, the Save Project automatically checkbox saves these files, once the training is
complete, in the path previously used to name them.
IT IS IMPORTANT TO CHECK THIS BOX TO AVOID LOSING THE SEVEN FILES DESCRIBED.
ENSURE THAT THE SEVEN FILES ARE AUTOMATICALLY SAVED IN THE SPECIFIED PATH
AFTER THE TRAINING IS COMPLETED.
MATRIX TAB
We now continue with the specifications to be made before the training starts, taking
a look at the Matrix tab, which establishes some interesting features of this stage of
the training.
Dimensions. This option box, checked by default, allows us to specify how many
dimensions will be used to mathematically represent the words. That is to say, the
number of dimensions which will describe the semantic relationships between the
words. For novel users, this option can be somewhat disconcerting if they do not know
the number of dimensions that is generally used in LSA.
Experts like Landauer and Dumais, together with other collaborators, as well as the
extensive experience of the tests made, suggest that the adequate number of
dimensions for generalist linguistic corpora (corpora that contain a varied language,
such as novels, newspapers, essays, poetry, technical documents, etc.) ranges between
250 and 350. This means that the best way to semantically represent relationships
between words requires a number of dimensions that ranges between 250 and 350.
For corpora within a more specific domain, a smaller number of dimensions can be
specified, e.g. 150.
In any case, this issue can be settled empirically, not theoretically. We therefore
recommend that users establish a number of dimensions that ranges between 150 and
200 for specific-domain corpora and 250 and 350 for generalist corpora.
Accumulated singular value is an interesting option, although we do not recommend
using it with large linguistic corpora. It makes it possible to tell the program something
like 'choose the number of dimensions such that those dimensions account for 40% of
the original variability of relationships between words and documents'. This option is
useful for two reasons. Firstly, it makes it unnecessary to decide the number of
dimensions to specify if it is unclear what this number must be. Secondly, some tests
carried out by experts suggest that 40% of the total variance provides a very
interesting semantic space which correctly simulates what humans do. The drawback
of this option is that it requires the program to calculate all possible options, a task
that is extremely costly is the number of documents to train the program is slightly
high (e.g. more than 5,000).
Linguistics adjustment is related to the way in which the original matrix of term
frequency per document will be transformed - the original matrix with which the
program works before obtaining the latent semantic space. Word frequencies do not
follow a uniform distribution: studies have shown that the most frequent word is used
about twice as frequently as the second most frequent word, the second about twice
as frequently as the third, and so on (a pattern known as Zipf's law), so that some
words appear much more frequently than others. When the original word frequency matrix
per document is obtained, some words will appear much more often than others. If
the original frequency distribution is to be preserved, the option nothing must be
chosen. If this large asymmetry in use of frequencies is to be modified, one of the
other two options must be chosen: Log*entropy or Log*IDF. Both ways of weighting
the frequency matrix have strong empirical support. The most usual one is
Log*Entropy. As can be seen, both weighting methods involve the logarithm, a method
usually employed in statistics to assign a greater weight to infrequent items and a
much lower weight to highly frequent items.
Normalization. This option forces the normalization by rows of the matrix U (see the
section on LSA above). That is to say, it forces the length of the vectors that represent
all the words to be equal to one.
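This row normalization can be sketched with NumPy (a sketch; the matrix is illustrative):

```python
import numpy as np

def normalize_rows(U):
    """Divide each row of U by its Euclidean norm so that every
    term vector has length one."""
    U = np.asarray(U, float)
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```

After normalization, vector length no longer distinguishes terms; only the direction of each vector (i.e. pure similarity) remains.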
REM/ADD TAB
Finally, certain features must be specified by means of this tab before finally starting
the training of the corpus. This tab basically establishes issues such as which words are
to be processed by the program and how they are to be processed.
The options under the Lexicon box are all linguistic forms which can be selected if we
prefer Gallito to ignore them in the training. The program provides files with adverbs,
prepositions, pronouns etc. in Spanish, which will make it possible to ignore these
typically syntactic words during the training stage. Thus, choosing the option Adverbs
will make the program ignore adverbs. This means that the latent semantic space will
not include adverbs. Conjunctions, modifiers, interjections, prepositions, and pronouns
are some of the options that are typically chosen (and which we recommend choosing)
so that they are ignored during the training. These are syntactic rather than semantic
words, so they possibly only add noise to the final semantic space. It is advisable to
choose all these options. It is also possible to ignore verbs, although verbs are usually
included in the training (this option is usually not chosen).
The Additional option makes it possible to avoid training words or short phrases
specified by the user. If the language with which the program will be trained includes
complex terms or structures to be rejected, this option should be chosen. This option
should be selected if, for example, a PDF file is copied which has the heading "Políticas
contra la desertización” (“Anti-desertification policies") repeated and we do not want
the program to train this phrase. In addition, Structures should be selected in the
drop-down menu:
Add the phrase to be discarded in structures with more than one term (the box on the
left) and click the button Add.
In the Action box we can choose Remove if we want to ignore (remove) the words
whose linguistic category falls under one of the categories previously selected, or the
opposite (Add exclusively): specifying that those words will be the only ones taking
part in the training. This latter option will rarely be used other than for research
purposes.
Finally, for Lemmatization, the option Spanish can be selected in the drop-down menu
and the checkbox to the left can be checked:
This option makes it possible to group words under a single form. It is mainly applied
to verbs, nouns, and adjectives. For example, if we want the verb forms
“abandonaba”, “abandonaron”, “abandonaré”, “abandonaría”, etc., to all appear
under the infinitive “abandonar”, this option should be chosen. Lemmatization is a
very good option for handling the large number of verbal forms in Spanish. If this
option is not chosen, all the different forms of the verb “abandonar” will be seen as
different terms and each will take up a row in the term matrix.
Obviously, lemmatization makes it possible (1) to handle many different terms which
should fall under the same semantic category, and (2) to better represent words with a
smaller number of training texts. Just imagine how many documents would have to be
trained to properly represent the verb “abandonar” semantically with and without
lemmatization. Given that with no lemmatization the verb forms of the verb
“abandonar” multiply, we would need a huge number of texts.
For example, after lemmatization, this list of words appears among the terms (left-
hand list), as opposed to an identical training following the same parameters except
that this time the corpus is not lemmatized (the right-hand list):
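The effect of lemmatization can be sketched with a hypothetical miniature lemma dictionary. Real lemmatizers rely on full morphological dictionaries for Spanish; the mapping below is purely illustrative (and omits accents for simplicity):

```python
# Hypothetical miniature lemma dictionary; real lemmatizers use
# full morphological dictionaries for Spanish.
LEMMAS = {
    "abandonaba": "abandonar",
    "abandonaron": "abandonar",
    "abandonare": "abandonar",
    "abandonaria": "abandonar",
}

def lemmatize(tokens):
    """Collapse inflected forms onto their lemma so that they share
    a single row in the term matrix."""
    return [LEMMAS.get(t, t) for t in tokens]
```

All four inflected forms then contribute their occurrences to one row, "abandonar", instead of four separate, sparsely trained rows.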
PROCESS TAB
In this tab we click the START button to carry out the training. There is only one
option, Ocurrence matrix to, with which we can specify a path and name to save the
original term-frequency-per-document matrix - the matrix with which the program
starts to operate. This matrix is usually ignored, among other things because it can
take up a lot of memory. Ignoring this option, click the START button and wait for the
program to notify us that the training is complete.
This message states that the program has finished its training with the linguistic corpus
provided and has saved the seven files with which we can start to work.
We look for the seven files generated by the program in the specified directory. It
should be noted that the names of the files correspond to the names assigned to
Gallito in the Save tab. This will later facilitate work when we start a new session with
the program and want to load these files to work with the program.
LOADING THE FILES
It's not necessary to train a linguistic corpus every time we start a session with Gallito
in order to be able to work with the program. Once the program has been trained and
the seven files required to work with the program have been saved in the hard disc, we
can simply open them and start to work with the program. This section briefly
describes how to load the working files in Gallito.
Launch the program as usual.
Access the Load tab.
As we already know, seven files must be opened to be able to work with Gallito. These
are the files saved earlier, when the program was trained using a linguistic corpus.
They have the *.bnl extension.
We will now explain how to access the files and briefly describe each one:
By clicking on the button associated with the Term list box, we can look for the file that
contains all the stored terms. Then we will do the same with the Doc list button, in
order to find a file that specifies all the documents which have been used to train the
program. The Doc matrix file stores the latent semantic space for the documents, i.e. it
is the vectorial representation of the Doc list file. This file is not frequently used unless
we want to build a search engine. Continuing with the file opening process,
Term Matrix is without a doubt the most interesting file of all those opened. It stores
the latent semantic space for the terms. This is the file that contains the vectorial
representation of all the words in the training documents. Global weight is a file that
contains the weights of the terms. Not all terms have the same weight. Some terms
define the contexts (documents) in which they appear better than other terms do
(think of prepositions: terms which do not add any information to the context and
which would have practically no weight). Diagonal matrix is a file that also contains
weights, but in
this case it is the weights granted to each dimension in the latent semantic space. We
know from the LSA that a term is vectorially represented, e.g. in 300 dimensions. The
Diagonal matrix file specifies the respective importance of each of those 300
dimensions. For users who are familiar with principal component analysis, the
Diagonal matrix file would contain the variance ratio for each dimension. Finally, the
Space features file contains specific information about the space generated.
Once all the files have been selected, click the Load button.
FEATURES OF MY TRAINING CORPUS AND MY SEMANTIC SPACE
A good way to take a look at the features of the opened space is to go to the Spaces
Properties menu.
This specifies that our semantic space has 10,686 terms in a linguistic corpus of 45,886
documents. The dimensions by means of which the semantic space has been
represented are 250 (remember that the number of dimensions usually ranges
between 200 and 350). The average similarity between the terms in the corpus (the
average cosine) is 0.0466, and the standard deviation of those similarities is 0.0732.
Given that it can sometimes be hard to assess in absolute terms whether two terms
are semantically related by examining a single cosine, a good option is to express the
cosine as a Z score. To do so, we subtract the average similarity of the space (0.0466
in this case) from the cosine and divide the difference by the standard deviation
(0.0732 in this case). In this way we obtain the similarity as a standard score.
Similarities higher than 3 standard scores represent a strong semantic relationship
between a pair of terms.
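This conversion can be sketched as a one-line function; the default mean and standard deviation are the values reported in Space Properties for this particular corpus:

```python
def cosine_z(cos, mean=0.0466, sd=0.0732):
    """Convert a raw cosine into a Z score using the space's
    average similarity and its standard deviation."""
    return (cos - mean) / sd
```

For instance, the cosine of 0.497 between puñetazo and pelea seen earlier corresponds to a Z score of about 6.15, well above the 3-standard-score threshold for a strong semantic relationship.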
WORKING WITH GALLITO. BASIC ISSUES
QUERIES MENU
SEMANTIC NEIGHBORS
From the Queries Semantic neighbors menu we have the option of seeing which terms
are most closely related to a word of interest. For example, if we are interested in
finding the closest semantic neighbors of the word Crisis, we enter Crisis in the Term
box. By means of Measures we can select among four measures of semantic
relationship.
Cosines: It shows the cosine between the word chosen and its closest neighbors.
Corrected cosines: It shows the cosine between the word chosen and its closest
neighbors, weighted by the neighbors' vector lengths. By means of this option, the list
of neighbors can be restricted to words with a high or relatively high vector length.
More familiar words, with a higher number of occurrences in the training linguistic
corpus, are thus rewarded.
Predication: This option makes it possible to obtain semantic neighbors not for words,
but for pairs of words or predicative structures. Instead of entering the word Crisis we
may be interested in the list of closest neighbors for the words “Crisis mundial” (world
crisis).
Corrected predication: It is also a method to obtain neighbors from two-word
predicative structures, but by means of a more sophisticated algorithm put forward by
Kintsch which can provide more interesting results from an intuitive point of view.
For example, if we select Cosines, we enter the word crisis and ask for 10 semantic
neighbors:
The list given is:
The first semantic neighbor for the term Crisis is the term itself: “Crisis”. The first
information provided is the vector length for the word: 6.205. Given that Crisis is a
term often used in journalism, this vector length is very high. In order to find whether
a vector length is high or low, the average vector length for the corpus words can be
calculated in order to have a reference (see Export files later on). The closest semantic
neighbor to Crisis is “económica/co” (this space is lemmatized and the term that
appears is the masculine singular, which comprises “económica”, “económicas”,
“económico” and “económicos”).
If we want to see the semantic relationship between Crisis and “Económico” on the 0-1
scale, we click the + button to display it.
The Activation label displays the semantic similarity between the two terms: 0.76. This
similarity is very high (close to one), some 10 typical deviations above the average
similarity in the corpus (which, recall, was 0.0466). The vector length for the term
“Económico”, shown under the Norm label, is even higher than that of Crisis: 6.57. It is
thus also a word with a high level of representation in the press corpus (a very high
vector length).
If we again select the word Crisis with 10 semantic neighbors, but now choose the
Corrected cosines method, these are the terms that appear:
Once again, the term “Económico” appears, but now new terms such as “Economía” or
the verb “haber” appear as well. As noted above, this option provides the closest
neighbors while requiring them to be sufficiently represented, with a high vector
length. For example, the term “Economía” has a vector length above three, whereas
the term “Recesión”, which under the previous method was the third closest semantic
neighbor, has a vector length of 0.65, five times smaller.
If we now use a two-term structure such as “Crisis mundial” (world crisis) and select
the Predication method, we obtain the following list:
The closest semantic neighbor of this two-word structure is the term “Crecimiento”
(growth), followed by PIB (GDP), Mundial (world), etc.
COMPARING A TERM TO ANOTHER TERM
The semantic neighbors menu lets us take a quick look at the words most closely
related to a specific term. This provides information about the semantic field in which
the term is found, the words to which it is most related, etc. However, it does not let
us assess the semantic relationship between two arbitrary terms, since one of them
may not be among, say, the first 100 neighbors in the list.
The Queries > Term-Term option makes it possible to assess the semantic relationship
between any two terms, as long as both are found within the semantic space
generated during training.
For example, to see how the word Crisis is associated with the term PSOE (the Spanish
Socialist Party), we enter Crisis as T1 (term 1) and PSOE as T2 (term 2).
Clicking the Compare button shows that the vector length for the first term is 6.205
(the units digit may be cut off in the display, giving the impression that the figure is
0.205). The vector length for the second term is even higher: 9.97. The semantic
relationship is 0.11. If, instead of “PSOE”, we enter the term “Banca” (the banking
sector), the semantic relationship is stronger: 0.18. “Banca” has a much lower vector
length than the other two terms: 0.50. If we choose a term with presumably no
semantic relationship, such as “muerte” (death) (which is more associated with news
about murders or war), we see that the semantic relationship is practically nil: 0.006.
COMPARING A DOCUMENT TO ANOTHER DOCUMENT
This option makes it possible to compare one corpus document to another. Let us
suppose that we have trained the program on 10,000 abstracts taken from scientific
journals. These abstracts are also numbered in a database, have an associated
scientific field, etc. With this option, we can see the semantic relationship between,
for example, documents 1,005 and 2,198:
Go to the Queries > Doc-Doc menu and enter one document number in the Doc1 box
and the other in the Doc2 box.
In this case, the semantic relationship between both documents is very tenuous (0.03).
Both in this option and in the previous one, we have the possibility of calculating the
Euclidean distance between two terms, or between two documents. The Euclidean
distance is not a measure of similarity like the cosine, but rather a measure of
dissimilarity. Its use in LSA is less common, but it has proven to be useful to assess text
quality, among other things.
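The two measures run in opposite directions, as a minimal sketch makes clear: identical vectors give a cosine of 1 but a distance of 0.

```python
import numpy as np

def cosine(u, v):
    """Similarity measure: 1 means identical directions."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    """Dissimilarity measure: 0 means identical vectors; larger means less alike."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))
```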
COMPARING TWO TEXTS
Another common option is comparing the semantic relationship between two free
texts. By free texts we mean texts that are not documents in the training corpus. For
example, this option can be used to compare two pieces of news that talk about
terrorism.
Queries > Free texts
The semantic similarity between the texts is 0.33. Both texts have a similar vector
length, so LSA is equally familiar with both of them.
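Gallito's exact projection of a free text into the space is not described here, but a common LSA approximation is to sum the vectors of the text's known words and then compare the resulting text vectors with the cosine. A sketch under that assumption (the toy vocabulary is illustrative only):

```python
import numpy as np

def text_vector(text, term_vectors):
    """Fold a free text into the semantic space by summing the vectors of
    its known words (a common LSA approximation; the actual projection may
    differ, e.g. by applying term weights)."""
    dims = len(next(iter(term_vectors.values())))
    v = np.zeros(dims)
    for w in text.lower().split():
        if w in term_vectors:
            v += term_vectors[w]
    return v

def text_similarity(t1, t2, term_vectors):
    """Cosine between the folded-in vectors of two free texts."""
    v1 = text_vector(t1, term_vectors)
    v2 = text_vector(t2, term_vectors)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```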
MOST REPRESENTATIVE TERMS
The Queries > Most representative terms option makes it possible to obtain the list of
the k terms with the highest vector length, and thus those with which LSA is most
familiar.
We just have to specify the number of terms that we want to obtain and click the
Extract button.
For example, if we want the 100 terms with the highest vector length:
First comes the auxiliary verb “Haber”, which has the highest vector length (12.92),
followed by “Año” (year) (12.10), the verb “Ir” (to go) (12.00), the verb “Ser” (to be)
(11.94), etc.
MOST REPRESENTATIVE DOCUMENTS
In the same way, we can access the Queries > Most representative docs menu to list
the k documents with the highest average semantic relationship to all the trained
documents.
CI SUMMARIZER
This procedure can help to categorize a text by summarizing it in a few terms.
For example, we enter the following document in the text box:
In Neigh. per word, we specify the number of neighbors the procedure obtains for
each of the words in the text entered. We specify two.
Final list shows which of the neighbors obtained in the previous stage are most
representative (those which have the highest semantic similarity on average with the
rest). Therefore, it displays a short list summarizing the text entered.
By checking Also corrected we can remove terms that have a high vector length but do
not contribute much to the meaning of the texts. If this option is chosen (as
recommended), we can remove highly common verbs such as haber (auxiliary verb), or
ser and estar (to be).
Final neighbors provides a final list summarizing the text. If the procedure is successful,
this list and the previous one provide a brief summary of the document entered in the
text box.
In this example, with Neigh. per word = 2, Final list = 10, Also corrected checked, and
Final neighbors = 10, the final result is the following:
The original document is basically summarized by words such as “propinar”, “ocurrir”,
“detener”, “haber”, “golpe”, “agresión”, “ser”, “paliza”, “agredir”, “agredido” and
“patada”, all of which relate to aggression and hitting. Even though some auxiliary
verbs such as “ser” and “haber” appear, many of the terms sum up the essence of the
document well.
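The logic described above (collect a few neighbors per text word, then keep the candidates that are on average most similar to the other candidates) can be sketched as follows. This is an interpretation of the procedure, not Gallito's actual code, and the Also corrected filter is omitted:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ci_summarize(words, term_vectors, neigh_per_word=2, final_k=10):
    """Sketch of the CI summarizer: 1) pool the nearest neighbors of each
    text word, 2) keep the pool members with the highest average similarity
    to the rest of the pool. Self-neighbors are excluded here for clarity."""
    pool = set()
    for w in words:
        if w not in term_vectors:
            continue
        scores = {t: cosine(term_vectors[w], v)
                  for t, v in term_vectors.items() if t != w}
        pool.update(sorted(scores, key=scores.get, reverse=True)[:neigh_per_word])
    pool = list(pool)

    def centrality(t):
        others = [o for o in pool if o != t]
        if not others:
            return 0.0
        return float(np.mean([cosine(term_vectors[t], term_vectors[o]) for o in others]))

    return sorted(pool, key=centrality, reverse=True)[:final_k]
```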
FILE EXPORT
EXPORT OF MATRICES AS TXT FILES TO BE READ BY SPSS OR OTHER STATISTICS
SOFTWARE
From Export > Matrices to .txt we can generate the working matrices for the latent
semantic space as *.txt files in a hard disk directory of our choice.
Click on the button to tell the program in which directory you want it to save the
matrices. Then click the Generate button.
The matrices saved in their respective files are the following:
Modulos.txt contains the vector lengths for all the trained terms. If you open this file
in SPSS and generate some descriptives, you will find that the vector lengths of the
10,685 words are distributed with positive skew (skewness = 4.365), with a mean of
0.76 and a median of 0.32.
The pesos.txt file contains the weights assigned to the 10,685 terms in this corpus,
i.e. the weights applied to the raw frequencies in the original term × document matrix.
The S.txt file contains a square n × n matrix, n being the number of dimensions
specified to train the corpus, with the singular value associated with each dimension
(representing the proportion of variance associated with each dimension).
The US.txt file contains the matrix of terms by dimensions. In this case, as the
linguistic corpus was trained using 250 dimensions and there are 10,685 terms, the file
contains 251 columns (the first column holds the list of terms) and 10,685 rows, one
per term. This matrix is extremely interesting, as it is the latent semantic space proper
generated by the training.
Estadísticos (SPSS descriptive statistics for the variable Modulo):
N Válidos (valid): 10685
N Perdidos (missing): 0
Media (mean): 0.7627
Mediana (median): 0.3204
Desv. típ. (standard deviation): 1.35637
Asimetría (skewness): 4.365
Error típ. de asimetría (std. error of skewness): 0.024
Curtosis (kurtosis): 22.053
Error típ. de curtosis (std. error of kurtosis): 0.047
Finally, the SV.txt matrix contains the vectorial representation not of the terms but of
the trained documents.
Importing the Modulo.txt, Pesos.txt, and US.txt into a single SPSS file is very easy:
You can use all the analysis techniques available in this software.
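The same exported files can also be read into Python instead of SPSS. A sketch for US.txt, assuming one term per row followed by its coordinates; the tab separator and the comma decimal mark (common in Spanish-locale exports) are assumptions, so adjust `sep` and `decimal` to match your files:

```python
import numpy as np
import pandas as pd

def load_us(path, n_dims):
    """Load an exported US.txt matrix: first column = term, remaining
    columns = the term's coordinates in the latent semantic space."""
    df = pd.read_csv(path, sep="\t", header=None, decimal=",",
                     names=["term"] + [f"dim{i + 1}" for i in range(n_dims)])
    terms = df["term"].tolist()
    vectors = df.iloc[:, 1:].to_numpy(dtype=float)  # one row per term
    return terms, vectors
```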
EXTRACTING CLUSTERS FROM A SEMANTIC SPACE AND REPRESENTING THEM IN
PAJEK
This is a very good option for seeing a graph representation of the main concepts in
the training linguistic corpus. We start by accessing the Export to Pajek > Clusters to
Pajek menu.
This procedure generates three files. One of them is a large matrix of correlations
between the terms that the cluster extraction procedure identifies as most relevant.
Another file is generated so that the Pajek program can directly draw a conceptual
network diagram. The parameters must be specified:
First of all, in Directory we must specify the path where we want Gallito to generate
the three files.
Once the path has been established, we must tell the program whether we want it to
generate word clusters or to choose the most representative word in each generated
cluster. We will use words (Words) for the example.
We also choose the Normalizing US matrix option, which prevents the terms in the US
matrix with the highest vector lengths from carrying greater weight when the clusters
are generated. The cluster analysis procedure is the K-means algorithm, which
requires specifying the number of clusters to work with. In the Cluster num box we
choose, for example, 35 clusters. In Cluster cycles we decide how many iterations the
procedure makes. If we choose 2 iterations, words are assigned to their closest cluster
in the first cycle, and in the second cycle they are reassigned to another cluster if the
distances have changed. The more iterations, the more stable the solution, but the
longer the procedure will take.
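A sketch of this clustering step (row-normalization followed by a fixed number of K-means assignment/update cycles; Gallito's internal implementation may differ in initialization and stopping details):

```python
import numpy as np

def kmeans_terms(vectors, n_clusters=35, cycles=2, seed=0):
    """K-means over row-normalized term vectors: normalizing removes the
    influence of vector length, and each cycle reassigns terms to their
    nearest centroid. More cycles give a more stable (but slower) result."""
    rng = np.random.default_rng(seed)
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # "Normalizing US matrix"
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(cycles):
        # assign each term to the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each non-empty cluster's centroid
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels
```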
After the clusters are processed, the following files are generated in the specified
directory:
The files are ready to be used in the Pajek program. Pajek is a program that represents
network graphs, giving us a quick, useful view of the general groupings of semantic
concepts in the corpus used to train the tool. Here is an example of a representation
of association networks using the press corpus used to train Gallito.
The cluster.mat file is the input file for Pajek. Just open it with Pajek, remove
connections below a threshold value in the Transform tab (for example, 0.35), and
draw it. Then use the Kamada-Kawai algorithm to quickly generate a reasonable layout.
[Figure: Pajek network of the press corpus clusters. Labels include: Al Qaeda;
Inmigración (Immigration); Embarcación, petrolero, fuel… (Vessel, oil tanker, fuel…);
Bush, Azores, Blair…; Guerra, Militar, Afganistán (War, Military, Afghanistan); Justicia
(Justice); Manifestación (Demonstration); Fraude inmobiliario (Real estate fraud);
Batasuna (Basque pro-independence party).]
CONVERTING TEXT FILES TO VECTOR AND REPRESENTING THEM IN PAJEK
This is a very good option for seeing a graph representation of text files (essays,
documents) in Pajek. We start by accessing the Export to Pajek > Documents to Pajek
menu.
This procedure converts text files (essays, documents, e-mails, etc.) into vectors and
generates a .mat file (for Pajek) to draw them, as in the clusters option. Just specify
the working directory in the Directory textbox, and specify the file that contains the
list of text files to be processed in File. These files can be in .doc, .docx and .txt
format. The list format will look like this:
Prueba.docx
prueba2.docx
34361692.txt
27597718.doc
40035398.txt
242310602.txt
33882636.txt
32906839.docx
65982026.txt
39970932.txt
15871651.txt
The file is ready to be used in the Pajek program. Just open it with Pajek, remove
connections below a threshold value in the Transform tab (for example, 0.35), and
draw it. Then use the Kamada-Kawai algorithm to quickly generate a reasonable layout.
CHANGE OF BASIS
IDEA
The change of basis is a procedure that makes it possible to interpret each of the
coordinates representing a word in a much more specific way.
When we choose to represent the latent semantic space with 250 dimensions, the
mathematical representation of each term is expressed in dimensions that have no
familiar meaning to us. They are abstract dimensions in which we know the words are
well defined, but which provide no idea or insight to users.
As seen earlier, the program can export the latent semantic space to a file named
US.txt, which can be easily imported into Excel or SPSS.
We previously imported the data into SPSS, obtaining this:
The first word that appears is “congreso” (conference). This word has a vector length,
a weight, and 250 coordinates in 250 dimensions. The figure above shows that the
coordinate of the term “congreso” in dimension 1 is 2.039, its coordinate in
dimension 2 is 1.951, etc. What do these dimensions mean? That is, the word
“congreso” has a coordinate of 2.039 in the first dimension; what does that mean?
Unfortunately, the first dimension is abstract and does not correlate semantically
with anything.
The idea of the change of basis is to transform that abstract space into another one
that users can interpret more easily. Besides making the semantic space more
specific, this makes it possible to use it more efficiently, as we shall see.
SPACE CHANGE OF BASIS
In order to carry out a change of basis, we must first access the Space > Change of
basis menu. There are two options:
By clusters turns the abstract dimensions into types of words generated by a k-means
cluster analysis.
By predefined words turns the abstract dimensions into new dimensions defined by
users according to their interests.
We will work with this second option:
Space > Change of basis > By predefined words
When accessing this menu, the following options are available:
In Reference folder we specify the path where the new semantic space with
meaningful coordinates will be saved (together with other files described later). This
folder also contains a text file with the list of words chosen by users to replace the
abstract dimensions with new, meaningful dimensions.
In the Words file box we specify the name of the file containing the words that will
serve as the new dimensions. With the help of the conceptual network graph
previously generated with Pajek, together with some notions about the main concepts
in the press corpus, we propose the following dimensions:
The first dimension is related to the descriptors “Guerra”, “Militar” and “Afganistán”
(War, Military, and Afghanistan). The second one is related to the maritime terms
“Embarcación”, “Petrolero” and Fuel (Vessel, Oil tanker, and Fuel). Note that the file is
called Nuevas dimensiones.txt, which must be specified in the Words file dialog box.
Gram-Schmidt orthogonalization is an algebraic procedure required to preserve the
orthogonality of the new meaningful dimensions. If this option is not chosen, we risk
obtaining a new semantic space that is meaningful but highly oblique, which distorts
neighbor comparisons and basic processes: the gain in meaningfulness entails a loss
in practical usefulness. Activating this option is therefore recommended.
Normalized basis is another recommended option, which not only orthogonalizes the
basis (Gram-Schmidt) but also gives each basis vector unit length. Choosing both
options provides a new, meaningful orthonormal basis.
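As a sketch of what these two options compute together, here is a classical Gram-Schmidt orthonormalization applied to the word vectors chosen as new dimensions (Gallito performs this internally; this is only an illustration of the algebra):

```python
import numpy as np

def gram_schmidt(vectors):
    """Gram-Schmidt orthonormalization: each vector has its projections on
    the earlier vectors removed (orthogonal basis) and is then scaled to
    unit length (normalized basis), yielding an orthonormal basis."""
    basis = []
    for v in np.asarray(vectors, dtype=float):
        for b in basis:
            v = v - (v @ b) * b          # remove the component along b
        norm = np.linalg.norm(v)
        if norm > 1e-10:                 # skip (near-)linearly dependent vectors
            basis.append(v / norm)
    return np.array(basis)
```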
The words you want to use to define the new dimensions must be among the trained
terms, so you should first check that they exist in the space.
Once the relevant changes to this dialog box have been made, click the Change
button. Calculating the new semantic space usually does not take long.
The figure above shows the files generated by the procedure to change basis.
basisMatrix.txt is a file including the new basis (the new, meaningful basis).
basisMatrixBeforeGsOrtog.txt contains the original basis (the abstract basis).
GSreliability.txt is a file that shows the reliability of the dimensions (values must be
higher than 0.70 for the chosen words to lend the dimensions their meaning). A quick
look at this file shows that the reliabilities associated with the dimensions are all,
without exception, higher than 0.70:
Note that, after the first 14 dimensions there appear the dimensions ABSTRACT15,
ABSTRACT16, etc. This is because, once 14 meaningful dimensions have been specified,
the rest, up to 250, must be abstract dimensions.
newTermMatrix.bnl contains the new semantic space that includes the first 14
meaningful dimensions. This file can be loaded (as it has the extension *.bnl) in the
Load tab in the Term matrix box (see the section Loading the files).
newTermMatrix.txt contains the new semantic space including the first 14 meaningful
dimensions. Unlike the previous file, this file can be easily exported to Excel or SPSS.
oldTermMatrix.txt is the file where the former semantic space (including the 250
abstract dimensions) is preserved.
Let us see some examples of how to make use of the new basis.
The term “Congresos” (Conferences) clearly stands out in the “Política_PSOE_PP”
dimension over the rest. The difference with respect to the former semantic space is
that we can now say in which dimension(s) a specific word has significant weight.
The term Rey (King) clearly saturates the “Monarquía_Rey_Princesa” dimension.
[Figure: bar chart of the coordinates of the term “congreso” in the new dimensions.]
[Figure: bar chart of the coordinates of the term “rey” in the new dimensions.]
The term Islam clearly saturates the “Al_Qaeda_Bin_Laden” dimension.
And the term “Garzón” saturates the “Justicia_Juez” dimension.
This message tells us that the program has finished training the linguistic corpus we
provided and has saved the seven files with which we can start working.
We can look in the specified directory for the seven files the program should have
created. Note that the file names correspond to the names Gallito assigns in the Save
tab. This will make things easier later, when we open a new session with the program
and want to load these files to work with it.
[Figure: bar chart of the coordinates of the term “islam” in the new dimensions.]
[Figure: bar chart of the coordinates of the term “garzón” in the new dimensions.]
BATCHES
You will often have to analyze a group of words or files. To this end, Gallito offers the possibility of carrying out actions in batches.
SEMANTIC NEIGHBORS
Access the Batches > Neighbors menu.
Through this menu you can extract the first n semantic neighbors of a series of terms. In the dialog box you can specify the number of neighbors, the directory containing the file where the terms are listed, and the file itself.
The content of the file will look like this:
This process will also generate, in the working directory, one file per term specifying the neighbor, the cosine, and the vector length (norm). The computation time for this procedure is quite short.
SIMILARITY MEASURES
Access the Batches > Similarity matrices menu. A matrix will be extracted in which the neighbors of a term are compared to one another. This square matrix has ones on its diagonal, with each cell representing the cosine between one neighbor and another. The working directory and the file name must be specified; the similarity matrices will be generated in that directory. The form can be accessed via Batches > Similarity Batches.
The file contents will look like this:
Clicking Extract generates a matrix of dimensions 200x200 in the first case, 300x300 in the second, etc.
SIMILARITY PAIRS
Access the Batches > Similarity Pairs menu.
This will generate the similarities for a series of pairs of terms. The reference file and the file name must be specified in the directory.
The file contents will look like this:
INTRATEXT ANALYSIS
Sometimes you will need to analyze the internal properties of a text, such as the number of paragraphs, the number of words per paragraph and per sentence, its coherence, its synthetic and informational capacity, etc. To this end, Gallito 2.0 can measure a text on the basis of the various indices described below.
First of all, it offers coherence measures. Textual coherence has been measured with LSA on the basis of similarity scores between successive parts of the text, whether sentences or paragraphs (Foltz, 2007). Moreover, behavioral correlates of these effects have been found. For example, Wolfe, Magliano, & Larsen (2005) found that LSA-measured similarity between sentences influenced reading processing times. Bellissens, Jeuniaux, Duran, & McNamara (2010) have also shown that students with a low level of prior knowledge of a topic, in contrast to those with a high level, are more sensitive to the withdrawal of information overlap between sentences, operationalized by means of LSA similarities, as well as to the number of causal clauses measured with Coh-Metrix (Graesser et al., 2004). When textual coherence is not maintained, readers are forced to make elaborative inferences, which are more costly and require prior knowledge. The authors argue that this semantic overlap promotes the integration of what is being read with what was previously encoded, facilitating the reading flow.
Gallito 2.0 measures three types of coherence. Firstly, it measures Paragraph-Paragraph coherence. A paragraph is defined by a line break. Paragraphs with fewer than 10 words (words vectorized by LSA) are not taken into account for the analysis. The procedure extracts the cosine between each paragraph and the next, and in the end all the similarities are averaged to yield a single measure.
Secondly, Gallito 2.0 measures Sentence-Sentence coherence. This coherence is measured within each paragraph, which means that the similarity between the final sentence of a paragraph and the initial sentence of the following paragraph is not measured. Gallito 2.0 acts in this way under the assumption that paragraphs reflect thematic units. Obviously, one-sentence paragraphs are not included in this analysis, nor are sentences with fewer than 4 words. At the end, the sentence-sentence coherences within each paragraph are averaged, yielding a single sentence-sentence coherence.
In addition, a third type of coherence is calculated which, although usually measured as such, is also used to obtain the sentence that best represents a paragraph or text (Kintsch, 2002). This is Sentence-Paragraph coherence, which measures the similarity between every sentence and the paragraph that contains it. We should warn that this measure may be problematic, as we believe it depends too heavily on the number of sentences in a paragraph.
In addition to coherences, Gallito obtains the following surface measures: number of paragraphs, number of words, average number of words per paragraph, average number of words per sentence, and average number of sentences per paragraph. Finally, Gallito 2.0 provides an estimate of the average amount of information provided by the words in a given text.
This index gives an idea of the domain-specificity of the words used and of the degree of synthesis, that is, the extent to which examples or uninformative words have been used. We introduced this latter measure, the average global weight in each paragraph, to measure the informativeness of the language employed in texts. Global weight is the opposite of entropy and is in fact part of its formula. The higher the global weight, the more informative a word is with regard to the contexts in which it appears. To use this batch process, access Batches > Intratext Analysis. The following form will appear.
Specify the working directory in Directory and the file that contains the list of text files to be processed in File. These files can be in .doc, .docx and .txt format. The list format will look like this:
Prueba.docx
prueba2.docx
34361692.txt
27597718.doc
40035398.txt
242310602.txt
33882636.txt
32906839.docx
65982026.txt
39970932.txt
15871651.txt
The results will be given in a file called resul.txt, which will include the following columns:
ID (file name)
numParagraphs (number of paragraphs)
AverageSentencesPerParagraph (average number of sentences per paragraph)
AverageWordsPerParagraph (average number of words per paragraph)
AverageWordsPerSentence (average number of words per sentence)
AverageCohesionSenSen (average sentence-sentence coherence)
AverageCohesionParSen (average paragraph-sentence coherence)
CohesionParPar (average paragraph-paragraph coherence)
AverageGlobalWeight (average global weight)
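The sentence-sentence coherence column can be sketched as follows: consecutive sentences are compared within each paragraph only, short sentences are skipped, and the similarities are averaged. This is an interpretation of the rules described above (the sentence vectors themselves are assumed to come from the LSA space):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sentence_sentence_coherence(paragraphs, sent_vector, min_words=4):
    """Average cosine between consecutive sentences, computed within each
    paragraph only (never across a paragraph break). `paragraphs` is a list
    of lists of sentences; `sent_vector` maps a sentence to its LSA vector.
    Sentences under `min_words` words are skipped, so one-sentence
    paragraphs contribute nothing."""
    sims = []
    for sentences in paragraphs:
        kept = [s for s in sentences if len(s.split()) >= min_words]
        for s1, s2 in zip(kept, kept[1:]):
            sims.append(cosine(sent_vector(s1), sent_vector(s2)))
    return float(np.mean(sims)) if sims else float("nan")
```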
ESSAY FILE EVALUATION
Evaluating the content of essay files is relatively simple using LSA. Once you have the semantic vector space, you compare each student's essay file with what are known as gold essay files. Gold essay files serve as a reference, as they have been written by an expert who has optimally summarized the topic, or who has correctly answered the question. There may be a single gold essay file or several (Rehder et al., 1998).
The procedure is quite similar to the intra-textual analysis, except that the gold essay list must be specified in addition to the student essay list. Access Batches > Essay Evaluation to find this form.
You must specify both the essay files list and the gold essay files list and, as always, a working directory. The results will be written to a file called resul.txt in the working directory. Both student and gold files can be in any of the following formats: .pdf, .docx, .doc, .txt. Both the list of essay files to be evaluated and the list of gold essay files will be in the following format:
The results appear in the following format:
Similarity is measured in terms of distances, so higher figures will indicate a lower score with respect to gold essay files.
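A sketch of this scoring rule, assuming each essay has already been folded into the space as a vector and that, with several gold essays, the distance to the closest one is taken (the aggregation over multiple gold essays is an assumption, not Gallito's documented behavior):

```python
import numpy as np

def essay_scores(student_vectors, gold_vectors):
    """Score each student essay by its Euclidean distance to the nearest
    gold essay vector. Because the output is a distance, higher figures
    mean a LOWER score with respect to the gold essays."""
    golds = [np.asarray(g, dtype=float) for g in gold_vectors]
    return [float(min(np.linalg.norm(np.asarray(s, dtype=float) - g) for g in golds))
            for s in student_vectors]
```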
DOCS TO VECTORS
This procedure converts text files (essays, documents, e-mails, etc.) into vectors. Specify the working directory in the Directory textbox and the file that contains the list of text files to be processed in File. These files can be in .doc, .docx and .txt format.
The list formats will look like this:
Prueba.docx
prueba2.docx
34361692.txt
27597718.doc
40035398.txt
242310602.txt
33882636.txt
32906839.docx
65982026.txt
39970932.txt
15871651.txt
The output: a .txt file with a matrix whose rows are the vectors of the files in
list.txt.