
  • THE BASICS OF THE GALLITO 2.0 PROGRAM

    INTRODUCTION

    The Gallito 2.0 software (currently the latest version) is a program for the processing of

    a large number of linguistic documents to obtain a mathematical representation of

    language. In addition to processing a large volume of language (as we shall see, the

    term "processing" is usually replaced by the term "training"), it includes multiple

    features to semantically process linguistic information. Thus, the program is able to

    quantify the semantic relationship between texts, measure the cohesiveness between

    paragraphs in a text, extract the key words that summarize a document, serve as the

    basis to obtain conceptual network graphs, enable the analysis of term types by means

    of K-means cluster analysis, serve to assess text quality, and change the basis to obtain

    a new semantic representation of language.

    The purpose of this document is to serve as a roadmap for effective, simple use of the

    Gallito program. It is written in plain language, avoiding unnecessary technical terms

    that underlie the tool, and features are always illustrated by examples so that users

    can grasp them more easily.

    Before discussing use of the program, it should be pointed out that Gallito is essentially

    a program that represents language mathematically. When working with this software,

    users should bear in mind that each word is a numerical vector with a very

    manageable number of coordinates (approximately 300). In this way, every text,

    sentence, or term is a numerical vector with some 300 coordinates. So this is not a

    program for qualitative discourse processing. Rather, it falls within the framework of

    artificial intelligence theories. Gallito is based on latent semantic analysis (LSA)

    technology and philosophy.

    WHAT IS LATENT SEMANTIC ANALYSIS (LSA)?

  • How is this mathematical representation of language obtained? How is each word

    represented by means of a number of numerical coordinates? To obtain a

    mathematical representation, a technique known as LSA is applied. Much research has

    been done since the 1990s on this technique and its possibilities for treatment of

    language semantics (the classic paper is Landauer and Dumais' 1997 "A Solution to

    Plato's Problem", which can be easily downloaded).

    Basically, the procedure is as follows. The user must have a set of texts to train the

    program. This is what is known as the linguistic corpus. This handbook provides

    multiple examples taken from a linguistic corpus of some 46,000 press documents

    from the Spain section in the El País and El Mundo newspapers, collected between

    2002 and 2009. From this corpus, we wanted Gallito to obtain a statistical or mathematical

    representation of the terms of those press documents, as well as of the documents in

    that corpus.

    The first step towards this goal consists in uploading the linguistic corpus stored in a

    single (plain text) file to the tool. A character must be specified to let the program

    know when a new document begins and ends. Documents are usually separated by

    means of the hash character (#).

    What does the LSA (and thus Gallito) do with the file that contains the linguistic

    corpus? It generates a frequency matrix, where all the different terms that appear in

    the file are entered into rows and each of the documents is entered into a column.

    Thus, the cells in that great matrix provide the number of times a given word occurs in

    a document. In our newspaper corpus there is a total of 10,686 terms (rows) and

    45,886 documents.
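
    To make this concrete, here is a minimal sketch in Python of how such a term-by-document frequency matrix is built. The mini-corpus, variable names, and whitespace tokenization are purely illustrative assumptions and are independent of Gallito itself:

    import numpy as np

    # Hypothetical mini-corpus: each string plays the role of one "document" (a paragraph).
    documents = [
        "el congreso debate la crisis economica",
        "la crisis economica afecta a la banca",
        "el rio ebro y el rio duero crecen",
    ]

    # Every distinct term becomes a row; every document becomes a column.
    vocab = sorted({term for doc in documents for term in doc.split()})
    row = {term: i for i, term in enumerate(vocab)}

    X = np.zeros((len(vocab), len(documents)))      # the frequency matrix
    for j, doc in enumerate(documents):
        for term in doc.split():
            X[row[term], j] += 1                    # X[i, j] = occurrences of term i in document j

    print(X.shape)                                  # (number of terms, number of documents)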

  • How much space does a document take up? When we talk about a document, we

    usually refer to a paragraph (generally between 50 and 300 words). The natural unit of

    the document for the LSA is the paragraph, although you can also work in such a way

    that a document is a single sentence or a couple of sentences, e.g. if you want a smaller

    analysis level. The opposite is also possible. You can turn a book chapter, an entire

    book, etc. into a document, although this is more unusual.

    The great frequency matrix is called X. Does Gallito work with this matrix? This matrix

    can be thought of as the numerical (mathematical) representation of the terms and

    documents. Actually, this matrix is not very useful. To find the semantic similarity

    between two terms, it would suffice to find the correlation or degree of resemblance

    between two rows. To find the semantic similarity between Congreso (Congress)

    and Economía (Economy) it would be enough to pick the two rows in that matrix and

    calculate the correlation between them. Studies show that this data matrix does not

    represent well the semantic relationships between words. When working with this

    matrix, a lot of noise appears. Noise means the large extent to which this matrix

    depends on the idiosyncrasy in language use by the different authors. This matrix is

    usually known as the raw matrix because, among other things, it has not been purged.

    To begin with, Gallito removes syntactic words from this matrix. It carries out a purge

    to remove the most frequent words in the language, those which do not provide

    semantic information, such as prepositions, articles, and pronouns. In addition, word

    frequencies follow a greatly asymmetric distribution. Some words appear much more

  • frequently than others, but that frequency ratio is not related to the semantic

    significance of each of the words. Put otherwise, the raw matrix X must be modified.

    The second step consists in applying a weight function, usually involving logarithms, to

    correct this great asymmetry in word frequency (log-entropy and log-IDF are the most

    usual modification procedures. Both are included in Gallito, as we will see). Because of

    this, matrix X does not display such an asymmetrical word frequency distribution.
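
    As a rough sketch of what log-entropy weighting does, each cell log(1 + f_ij) is multiplied by a global weight that is low for terms spread evenly over many documents and high for terms concentrated in a few. This is the standard textbook formula; Gallito's exact implementation may differ:

    import numpy as np

    def log_entropy(X):
        """Log-entropy weighting of a term-by-document count matrix X (rows = terms)."""
        n_docs = X.shape[1]
        totals = X.sum(axis=1, keepdims=True)                 # total frequency of each term
        p = np.divide(X, totals, out=np.zeros_like(X, dtype=float), where=totals > 0)
        with np.errstate(divide="ignore", invalid="ignore"):  # 0 * log(0) is treated as 0 below
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)   # entropy-based global weight per term
        return global_w[:, None] * np.log1p(X)                # local weight: log(1 + frequency)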

    The result is a modified matrix X. Can we work with this 10,686 x 45,886 dimension

    matrix? It's still not useful. What is most subtle about LSA and has made it famous is

    the next step: applying a dimension reduction technique on that modified matrix X so

    that the words (or documents) are represented not by 45,886 columns, but by a much

    smaller number of dimensions. This number is usually about 300. The reason why the

    number 300 is chosen is more empirical than theoretical. No research has been done

    (as far as we know) that found that the brain tends to represent all words in a 300-

    dimension semantic space because this is an adaptive or reasonable figure which

    correctly encompasses the essence of concepts. Actually, this approximate number of

    dimensions is the result of empirical studies, where the issue is how well human

    semantics are represented if the dimension reduction algorithm yields a solution of

    100, 200, 300, 400, etc. dimensions. The results show that language can be

    mathematically represented so as to simulate human semantics satisfactorily if

    the dimension reduction technique extracts or breaks down the modified matrix X into

    about 300 dimensions. It seems that too few dimensions provide an excessively scarce

    solution. Semantics is not well captured with too few dimensions.

    Essential aspects are lost if we try to account for the semantics with so little

    information. If too many dimensions are used, the opposite is the case. There is an

    excess of dimensions, spurious dimensions which are useless and distort the semantic

    representation.

    The dimension reduction algorithm is known as the Singular Value Decomposition

    algorithm. Techniques such as Principal Component Analysis on the correlation matrix or

    Correspondence Analysis on contingency tables are very similar to such

    decomposition.

    The final result of the linguistic training of LSA (or, in this case, Gallito) is the creation

    of a new 300-dimension matrix (if a k of 300 is selected) where each of the 10,686

    words is represented by means of a vector of 300 numbers or coordinates.

    This matrix is typically known as US.
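
    In code, the dimension reduction is the standard truncated singular value decomposition. The sketch below (using SciPy, and assuming Xw is the weighted matrix from the previous step) shows how the US term matrix and the document matrix are typically derived; the actual implementation inside Gallito may of course differ:

    import numpy as np
    from scipy.sparse.linalg import svds

    def latent_semantic_space(Xw, k=300):
        """Truncated SVD of the weighted term-by-document matrix Xw: Xw ~ U S V^T."""
        k = min(k, min(Xw.shape) - 1)        # svds requires k < min(n_terms, n_docs)
        U, s, Vt = svds(Xw, k=k)
        order = np.argsort(s)[::-1]          # svds returns singular values in ascending order
        U, s, Vt = U[:, order], s[order], Vt[order, :]
        US = U * s                           # one k-dimensional vector per term
        SV = (s[:, None] * Vt).T             # one k-dimensional vector per document
        return US, s, SV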

  • Gallito is then ready to be used, as an interesting and useful mathematical

    representation of the language, obtained from the training, is now available.

    The matrix US is usually known in LSA as the latent semantic space. This space is the

    vectorial space generated from the original linguistic corpus. It is semantic because the

    matrix US captures well the semantic relationships between words. It is latent because

    by using approximately 300 dimensions, the noise of the original matrix X has been

    removed, so that the essence of semantics, the regularities between words, is

    captured, rather than the idiosyncrasies. In this way, the latent aspects that underlie

    semantics are captured.

    The most basic possibility afforded by this representation is the verification of the

    semantic similarity between two words. This amounts to obtaining the correlation

    between two rows in the matrix US (to be more precise, the cosine). The cosine ranges

    between -1 and 1. Leaving negative figures aside, as they are highly infrequent, values

    range between 0 and 1. The semantic relationship between two words can be assessed

    by means of the cosine. When the value of the cosine is about 0, the two words are

    independent, orthogonal, or semantically different. The more a cosine tends towards

    1, however, the greater the semantic similarity between two words.

    For example, if we obtain the semantic relationship between the term puñetazo (blow)

    and pelea (fight) in Gallito, the value of the cosine is 0.497, a value that shows a close

    semantic relationship between the terms (values higher than 0.50 are very rare, as

    users will find out as they work with the program).

  • If we try the terms puñetazo (blow) and Ebro (the river), the semantic relationship is -

    0.021, practically none. Puñetazo and Ebro are two words which have practically no

    semantic relationship.

    Finally, if the semantic relationship between two rivers, Ebro and Duero, is tested, the

    cosine has a value of 0.533, again a very close semantic relationship.
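
    In vector terms, the cosine is just the normalized dot product of two rows of the US matrix. A minimal sketch (US and row are the illustrative names from the sketches above, not part of Gallito's interface):

    import numpy as np

    def cosine(v1, v2):
        """Cosine between two term vectors (rows of the US matrix)."""
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    # e.g. cosine(US[row["puñetazo"]], US[row["pelea"]]) would give roughly 0.5 in the press
    # space described above, while cosine(US[row["puñetazo"]], US[row["ebro"]]) would be near 0.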

    As well as semantic similarity, another basic piece of information is the vector length. A

    vector length can be defined as the amount of information which the LSA has about a

    word. A word with a high vector length means that the LSA has a high degree of

    knowledge about that word. By contrast, a word with a low vector length means that

    the LSA has not had much exposure to that word, and thus its knowledge of it is not

    very dep. These are the vector lengths of the four words that served as examples in a

    journalistic corpus.

    The journalistic or press corpus used to train the LSA has greater knowledge about the

    word Ebro (vector length 2.141) than about the other three. The word Ebro has a

    vector length that is three times longer than the word Pelea (vector length 0.716).

    Duero and Puñetazo are words which are less represented in the journalistic corpus, as

    their vector length is rather smaller. The result of our analysis of the cosines and vector

    length can be graphically summarized as follows:

    Puñetazo (Blow) Vector length = 0.373

    Pelea (Fight) Vector length = 0.716

    Ebro Vector length = 2.141

    Duero Vector length = 0.326
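
    The vector length, in turn, is simply the Euclidean norm of a term's row in US; a tiny sketch with the same illustrative names:

    import numpy as np

    def vector_length(v):
        """Vector length ("norm") of a term vector: a proxy for how much the space knows about the term."""
        return float(np.linalg.norm(v))

    # With the hypothetical US and row from the earlier sketches:
    #   vector_length(US[row["ebro"]])  -> about 2.14 in the press space
    #   vector_length(US[row["duero"]]) -> about 0.33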

  • If we represent the semantic space in two dimensions (a simplification, given that we

    will usually need about 300 dimensions), Pelea and Puñetazo (Fight and Blow) are

    represented by vectors with more or less a similar direction (they are semantically

    related), but are highly independent from the words Duero and Ebro (which in turn are

    also closely related as their vectors also have more or less a similar direction). The

    largest vector is that for the term Ebro, as it has the largest vector length. The LSA

    therefore has more information about the word Ebro than about any of the other

    three.

    INSTALLATION OF THE GALLITO PROGRAM

    REQUIREMENTS

  • For Gallito 2.0 to work correctly, the following components must be installed (they are

    included in the download website and can also be installed from the official Microsoft

    website).

    - A 64-bit Windows operating system (Windows 7 or Windows Server).

    - Microsoft SDK 4 (included in the download package).

    - Microsoft Visual C++2010

    - Write permission for the installation directory

    DOWNLOADING THE PROGRAM

    Publications, projects, demos, and interesting links related to the LSA technology can

    be downloaded from the www.elsemantico.com website.

    From the Download tab you can access the page

    http://www.elsemantico.com/gallito20/download-es.html where the program can be

    downloaded.

    The download link for the program (version 2.1.3) and three manuals that serve as a

    guide to learn about the program are located in the setup and manuals box.

    Clicking on Versión 2.x.x setup (30 days free) [zip] opens the following dialog box:


  • Check the Save option and click Accept.

    A .zip file will be saved from which the program will be installed.

    Unzip the file and double-click the Release directory:

  • Double-click the icon again. The program installation wizard will

    start:

    Click Next.

  • This step in the wizard shows the path where the program will be installed. Click Next.

    In the next step, just confirm where you want to install the program, then click Next.

    The installation process will take a few minutes.

  • Once installed, close the wizard by clicking Close. Don't forget to grant write permission

    to the installation directory (C:\Program Files\elsemantico.com\).

    The program should appear on the Start menu:

  • The program icon appears. Click on it. The program should open after loading the files

    and display the following interface.

    It warns you that you have a 30-day trial period.

    Licensed users are provided an executable that extends the user license for an

    indefinite period.

  • The program has been installed and is ready for use. The following sections in the

    manual will explain how to train the program by means of a linguistic corpus, how to

    load a semantic space, and all the working possibilities provided by the software.

  • TRAINING A CORPUS

    Training a linguistic corpus in Gallito firstly entails compiling all the linguistic

    documents to be trained into a single file (preferably a *.txt file).

    Given that the program starts by placing the documents in columns and the different

    terms in rows in a large matrix, the program must be told how the documents are

    separated from each other.

    To do this, go to the Corpus tab.

    CORPUS TAB

    Locate the path for the *.txt file that includes the documents to be trained using the

    Reference Corpus button.

    Note that the character separation option is checked, with the symbol # separating

    documents.

    The file to be read has this structure:

    If the file containing the texts to be trained uses a different separating

    symbol, the program must be told so.

    If the documents are not separated by a special symbol (hash, ampersand, at sign,

    etc.), but rather a period is used to separate documents, the sentences separation

    option can be chosen. The number of sentences separated by a period that constitute a

    document (by default 1) can be specified in the box A document is ___ sentences.

  • The A document minimum are 2 words option serves to specify the minimum number

    of words for a document. Given that documents are usually paragraphs, a higher

    number can be chosen in this option, e.g. 10.

    Finally, the Remove words that do not appear at least 1 document option makes it

    possible to specify how many times a term must appear in the linguistic training corpus

    for it to be mathematically represented by the program. So the question here is

    specifying the minimum number of times a word must appear to be included in the

    semantic space generated by the program. A number of 5 or higher is usually chosen,

    as underrepresenting a word may be counterproductive and yield highly random

    semantic relationships.
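
    As a rough sketch of what the Corpus tab options amount to in practice: split the corpus file on the separator character, drop very short documents, and drop rare terms. The file name, separator, and thresholds below are simply the example values used in this manual, and whether the minimum-occurrence filter counts documents or total occurrences is an assumption:

    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:           # the single file with all documents
        raw = f.read()

    docs = [d.split() for d in raw.split("#") if d.strip()]   # "#" separates documents
    docs = [d for d in docs if len(d) >= 10]                  # minimum words per document

    doc_freq = Counter(term for d in docs for term in set(d))
    vocab = {t for t, df in doc_freq.items() if df >= 5}      # keep terms appearing in >= 5 documents
    docs = [[t for t in d if t in vocab] for d in docs]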

    SAVE TAB

    There are more options to be specified in the training process. We have already

    completed the Corpus tab. Now we will specify a path and name for the seven files

    that are automatically generated after the training.

    Access the Save tab.

    As can be seen, there are seven buttons associated with seven text boxes. This

    means that a path must be specified and names must be given to the seven different

    files created after training Gallito using the linguistic corpus which we saw previously.

    Term list is a file that, after the training ends, includes all the terms mathematically

    represented by the program. It is sort of dictionary of the terms available after the

    training. That file should be given the name TERMLIST for easy use afterwards

    (although any name can be used, of course).

    Doc list is a file that includes a list of the documents used to train the program. It is a sort of

    dictionary of documents. The file name can be DOCLIST.

    Doc matrix is the numerical matrix where all the documents are mathematically

    represented. Again, any name can be given, but DOCMATRIX can be useful so that

    when this file is retrieved we know exactly what it contains.

    Term matrix is the file that contains the vectorial representation of the terms. This

    matrix is, so to speak, the final essence of the training, the core from which the

    semantic relationships between the texts can be worked on. The assigned name could

    be TERMMATRIX.

  • Global weight is a file that contains the weights assigned to each of the words

    analyzed. Not all words carry the same amount of semantic information, hence they

    receive different weights depending on their importance.

    Diagonal matrix is a matrix that also contains weights. In this case, the diagonal matrix

    describes the importance of each of the dimensions by means of which the latent

    semantic space was finally represented. For users who are familiar with factorial

    analysis, the diagonal matrix contains the percentage of variance explained by

    each of the dimensions that account for the semantic relationships between words.

    Space features is a file that includes certain features of the training process.

    Once all the files that will be generated by the training have been named, the way in which

    those files must be saved is chosen: binary or serialized. The binary mode is more

    efficient than the serialized one, so that, regardless of the technical aspects involved in

    both types of files, the binary mode is a better option.

    Finally, the Save Project automatically checkbox saves these files in the path previously

    used to name the files, once the training is complete.

    IT IS IMPORTANT TO CHECK THIS BOX TO AVOID LOSING THE SEVEN FILES DESCRIBED.

    ENSURE THAT THE SEVEN FILES ARE AUTOMATICALLY SAVED IN THE SPECIFIED PATH

    AFTER THE TRAINING IS COMPLETED.

    MATRIX TAB

    We will now follow with the specifications to be made before the training starts. Now

    we will take a look at the matrix tab, which establishes some interesting features in

    this stage of the training.

    Dimensions. This option box, checked by default, allows us to specify how many

    dimensions will be used to mathematically represent the words. That is to say, the

    number of dimensions which will describe the semantic relationships between the

    words. For new users, this option can be somewhat disconcerting if they do not know

    the number of dimensions that is generally used in LSA.

    Experts like Landauer and Dumais, together with other collaborators, as well as the

    extensive experience of the tests made, suggest that the adequate number of

    dimensions for generalist linguistic corpora (corpora that contain a varied language,

    such as novels, newspapers, essays, poetry, technical documents, etc.) ranges between

    250 and 350. This means that the best way to semantically represent relationships

    between words requires a number of dimensions that ranges between 250 and 350.

    For corpora within a more specific domain a smaller number of dimensions can

    be specified, e.g. 150.

  • In any case, this issue can be settled empirically, not theoretically. We therefore

    recommend that users establish a number of dimensions that ranges between 150 and

    200 for specific-domain corpora and 250 and 350 for generalist corpora.

    Accumulated singular value is an interesting option, although we do not recommend

    using it with large linguistic corpora. It makes it possible to tell the program something

    like 'choose the number of dimensions such that said dimensions account for 40% of

    the original variability of relationships between words and documents'. This option is

    useful for two reasons. Firstly, it makes it unnecessary to decide the number of

    dimensions to specify if it is unclear what this number must be. Secondly, some tests

    carried out by experts suggest that 40% of the total variance provides a very

    interesting semantic space which correctly simulates what humans do. The drawback

    of this option is that it requires the program to calculate all possible options, a task

    that is extremely costly if the number of documents used to train the program is even moderately

    high (e.g. more than 5,000).

    Linguistics adjustment is related to the way in which the original matrix of term

    frequency per document will be transformed - the original matrix with which the

    program works before obtaining the latent semantic space. Word frequencies follow a

    highly skewed distribution: studies have shown that the most frequent word is used

    twice as frequently as the second most frequent word, the second most frequent word

    is used twice as frequently as the third most frequent word, etc. so that some words

    appear much more frequently than others. When the original word frequency matrix

    per document is obtained, some words will appear much more often than others. If

    the original frequency distribution is to be preserved, the option nothing must be

    chosen. If this large asymmetry in use of frequencies is to be modified, one of the

    other two options must be chosen: Log*entropy or Log*IDF. Both ways of weighting

    the frequency matrix have strong empirical support. The most usual one is

    Log*Entropy. As can be seen, both weighting methods involve the logarithm, a method

    usually employed in statistics to assign a greater weight to infrequent items and a

    much lower weight to highly frequent items.

    Normalization. This option forces the normalization by rows of the matrix U (see the

    section on LSA above). That is to say, it forces the length of the vectors that represent

    all the words to be equal to one.

  • REM/ADD TAB

    Finally, certain features must be specified by means of this tab before finally starting

    the training of the corpus. This tab basically establishes issues such as which words are

    to be processed by the program and how they are to be processed.

    The options under the Lexicon box are all linguistic forms which can be selected if we

    prefer Gallito to ignore them in the training. The program provides files with adverbs,

    prepositions, pronouns etc. in Spanish, which will make it possible to ignore these

    typically syntactic words during the training stage. Thus, choosing the option Adverbs

    will make the program ignore adverbs. This means that the latent semantic space will

    not include adverbs. Conjunctions, modifiers, interjections, prepositions, and pronouns

    are some of the options that are typically chosen (and which we recommend choosing)

    so that they are ignored during the training. These are syntactic rather than semantic

    words, so that they possibly only add noise to the final semantic space. It is

    advisable to choose all these options. There is also the possibility of choosing to

    ignore verbs, although they are usually included in the training (this option is usually

    not chosen).

    The Additional option makes it possible to avoid training words or short phrases

    specified by the user. If the language with which the program will be trained includes

    complex terms or structures to be rejected, this option should be chosen. This option

    should be selected if, for example, a copied PDF file has the heading "Políticas

    contra la desertización" ("Anti-desertification policies") repeated and we do not want

    the program to train this phrase. In addition, Structures should be chosen in the

    Select drop-down list:

  • Add the phrase to be discarded in the structures with more than one term box (the box on the

    left) and click the Add button.

    In the Action box we can choose Remove if we want to ignore (remove) the

    words whose linguistic category falls under one of the categories previously selected,

    or do the opposite (add exclusively): specify that those words will be

    the only ones that take part in the training. The latter option will rarely be used other

    than for research purposes.

    Finally, when choosing Lemmatization, the option Spanish can be selected in the drop-

    down menu and the check box to the left can be checked:

    This option makes it possible to group words under a single form. It is mainly applied

    to verbs, nouns, and adjectives. For example, if we want the verb forms

    “abandonaba”, “abandonaron”, “abandonaré”, “abandonaría”, etc., to all appear

  • under the infinitive “abandonar”, this option should be chosen. Lemmatization is a

    very good option to handle the large number of verbal forms in Spanish. If this option

    is not chosen, all the different forms of the verb “abandonar” will be seen as

    different terms and each will take up its own row in the term matrix.

    Obviously, lemmatization makes it possible (1) to handle many different terms which

    should fall under the same semantic category, and (2) to better represent words with a

    smaller number of training texts. Just imagine how many documents would have to be

    trained to properly represent the verb “abandonar” semantically with and without

    lemmatization. Given that with no lemmatization the verb forms of the verb

    “abandonar” multiply, we would need a huge number of texts.

    For example, after lemmatization, this list of words appears among the terms (left-

    hand list) as opposed to an identical training following the same parameters except that

    this time the training is not lemmatized (the right-hand list):

    PROCESS TAB

    In this tab we click on the START button to carry out the training. There is only one

    option: Ocurrence matrix to, with which we can specify a path and name to save the

    original term frequency matrix per document - the matrix with which the program

    starts to operate. This matrix is usually ignored, among other things because it can take

    up a lot of memory. Ignoring this option, click the START button and wait for the program

    to notify that the training is complete.

  • This warning states that the program has finished its training with the linguistic corpus

    provided and saved the seven files with which we can start to work.

    We look for the seven files generated by the program in the specified directory. It

    should be noted that the names of the files correspond to the names assigned to

    Gallito in the Save tab. This will later facilitate work when we start a new session with

    the program and want to load these files to work with the program.

  • LOADING THE FILES

    It's not necessary to train a linguistic corpus every time we start a session with Gallito in

    order to be able to work with the program. Once the program has been trained and

    the seven files required to work with the program have been saved on the hard disk, we

    can simply open them and start to work with the program. This section briefly

    describes how to load the working files in Gallito.

    Launch the program as usual.

    Access the Load tab.

    As we already know, seven files must be opened to be able to work with Gallito. These

    are the files which were saved at an earlier stage, when the program was

    trained using a linguistic corpus. These files have the *.bnl extension.

    We will now explain how to access the files and briefly describe each one:

  • By clicking on the button associated with the Term list box, we can look for the file that

    contains all the stored terms. Then we will do the same with the Doc list button, in

    order to find a file that specifies all the documents which have been used to train the

    program. The Doc matrix file stores the latent semantic space for the documents, i.e. it

    is the vectorial representation of the Doc list file. This file is not frequently

    used, unless we want to build a data search engine. Following the file opening process,

    Term Matrix is without a doubt the most interesting file of all those opened. It stores

    the latent semantic space for the terms. This is the file that contains the vectorial

    representation of all the words in the training documents. Global weight is a file that

    contains the weights of the terms. Not all terms have the same weights. Some terms

    define the contexts (documents) in which they appear better than other terms do (think about

    prepositions, terms which do not add any information to the context and which would

    have practically no weight). Diagonal matrix is a file that also contains weights, but in

    this case it is the weights granted to each dimension in the latent semantic space. We

    know from the LSA that a term is vectorially represented, e.g. in 300 dimensions. The

    Diagonal matrix file specifies the respective importance of each of those 300

    dimensions. For users who are familiar with principal component analysis, the

    Diagonal matrix file would contain the variance ratio for each dimension. Finally, the

    Space features file contains specific information about the space generated.

    Once all the files have been selected, click the Load button.

    FEATURES OF MY TRAINING CORPUS AND MY SEMANTIC SPACE

    A good way to take a look at the features of the opened space is to go to the Spaces >

    Properties menu.

  • This specifies that our semantic space has 10,686 terms in a linguistic corpus of 45,886

    documents. The dimensions by means of which the semantic space has been

    represented are 250 (remember that the number of dimensions usually ranges

    between 200 and 350). The average similarity between the terms in the corpus (the

    average cosine) is 0.0466, and the standard deviation of those similarities is 0.0732.

    Given that it sometimes can be hard to assess in absolute terms whether two terms

    are semantically related or not by examining exclusively one cosine, a good option is to

    examine the cosine in Z scores. To do so, we must subtract the average similarity in the

    space (0.0466 in this case) from the cosine and divide the difference by the

    standard deviation (0.0732 in this case). In this way we obtain the similarity as a

    standard (Z) score. Similarities higher than 3 standard scores represent a high semantic

    relationship between a pair of terms.
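
    Put as a formula, the standardized similarity is simply z = (cosine - mean) / standard deviation; a one-line sketch using the values of this example space:

    mean_cos, sd_cos = 0.0466, 0.0732        # average cosine and standard deviation of this space

    def z_score(cos):
        return (cos - mean_cos) / sd_cos

    print(round(z_score(0.497), 1))          # the puñetazo-pelea cosine lies about 6 sd above the mean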

  • WORKING WITH GALLITO. BASIC ISSUES

    QUERIES MENU

    SEMANTIC NEIGHBORS

    From the Queries > Semantic neighbors menu we have the option of seeing which are

    the terms most closely related to a word of our interest. For example, if we are

    interested in finding the closest semantic neighbors of the word Crisis, we enter it in

    the Term box. By means of Measures we can select four

    measures for semantic relationships.

    Cosines: It shows the cosine between the word chosen and its closest neighbors.

    Corrected cosines: It shows the cosine between the word chosen and its closest

    neighbors, weighting by the neighboring vectors’ lengths. By means of this option, it

    can be specified whether the list of neighbors is to be created using a high or relatively

    high vector length. The occurrence of more familiar words with a higher number of

    occurrences in the training linguistic corpus is thus rewarded.

    Predication: This option makes it possible to obtain semantic neighbors not for words,

    but for pairs of words or predicative structures. Instead of entering the word Crisis we

    may be interested in the list of closest neighbors for the words “Crisis mundial” (world

    crisis).

    Corrected predication: It is also a method to obtain neighbors from two-word

    predicative structures, but by means of a more sophisticated algorithm put forward by

    Kintsch which can provide more interesting results from an intuitive point of view.

    For example, if we select Cosines, we enter the word crisis and ask for 10 semantic

    neighbors:

  • The list given is:

    The first semantic neighbor for the term Crisis is the term itself: “Crisis”. The first

    piece of information provided is the vector length for the word: 6.205. Given that Crisis is a

    term often used in journalism, this vector length is very high. In order to find whether

    a vector length is high or low, the average vector length for the corpus words can be

    calculated in order to have a reference (see Export files later on). The closest semantic

    neighbor to Crisis is “económica/co” (this space is lemmatized and the term that

    appears is the masculine singular, which comprises “económica”, “económicas”,

    “económico” and “económicos”).

    If we want to find the semantic relationship between Crisis and “Económico” in the 0-1

    scale, we click on the + button to display it.

    The Activation label displays the semantic similarity between both terms: 0.76. This

    similarity is very high (close to one) and lies about 10 standard deviations above the

    average similarity in the corpus (which, as a reminder, was 0.0466). The vector

    length for the term “Económico” is even higher than that for Crisis: 6.57, and it appears

    under the label Norm. It is also a word with a high level of representation in the press

    corpus (with a very high vector length).

    If we select the word Crisis again and request 10 semantic neighbors, but now choose

    the Corrected cosines method, these are the terms that appear:

  • Once again, the term “Económico” appears, but now new terms such as “Economía” or

    the verb “haber” appear. As was previously said, this option provides the closest

    neighbors, but we require semantic neighbors to be sufficiently represented, with a

    high vector length. For example, the term "Economía" has a vector length higher than

    three, whereas the term "Recesión", which under the previous method was the third

    closest semantic neighbor, has a vector length of 0.65, five times smaller.

    If we now use a two-term structure such as “Crisis mundial” (world crisis), we select

    the Predication method, and obtain the following list:

    The closest semantic neighbor of this two-word structure is the term "Crecimiento" (growth),

    followed by PIB (GDP), Mundial (world), etc.

    COMPARING A TERM TO ANOTHER TERM

    The semantic neighbors menu enables us to take a quick look at the words that are

    most closely related to a specific term. This can provide information about the

    semantic field in which said term is found, the words to which it is most related, etc.

    However, this does not ensure that we will be able to assess the semantic relationship

    between two specific terms of our choice, since one of them may not be among the, e.g.,

    100 first neighbors in the list.

  • The Queries > Term-Term option makes it possible to assess the semantic relationship

    between any two terms as long as they are found within the semantic space generated

    during the training.

    For example, if we want to see how the word Crisis is associated with the term PSOE

    (the Spanish Socialist Party), we can enter the word Crisis as T1 (term 1) and the words

    PSOE as T2 (term 2).

    Clicking the Compare button shows that the vector length for the first term is 6.205

    (the units digit may be cut off in the display, giving the impression that the value is

    0.205). The vector length for the second term is even higher: 9.97. The semantic

    relationship is 0.11. If, instead of “PSOE”, we enter the term “Banca” (the banking

    sector), the semantic relationship is even stronger: 0.18. “Banca” has a rather lower

    vector length than the two other terms: 0.50. If we choose a term that presumably has

    no semantic relationship such as “muerte” (death) (which is more associated with

    news related to murders or war), we will see that the semantic relationship is

    practically none: 0.006.

    COMPARING A DOCUMENT TO ANOTHER DOCUMENT

    This option makes it possible to compare a corpus document to another corpus

    document. Let us suppose that we have trained the program by means of 10,000

    abstracts taken from scientific journals. These abstracts are also numbered in a

    database, have an associated scientific field, etc. By means of this option, we would be

  • able to see the semantic relationship between document 1,005 and 2,198, for

    example:

    Go to the Queries > Doc-Doc menu.

    Enter the document number in the Doc1 box and the number for the other document

    in the Doc2 box.

    In this case, the semantic relationship between both documents is very tenuous (0.03).

    Both in this option and in the previous one, we have the possibility of calculating the

    Euclidean distance between two terms, or between two documents. The Euclidean

    distance is not a measure of similarity like the cosine, but rather a measure of

    dissimilarity. Its use in LSA is less common, but it has proven to be useful to assess text

    quality, among other things.

    COMPARING TWO TEXTS

    Another usual option is comparing the semantic relationship between two free texts.

    By free texts we mean that they are not documents that are part of the training

    corpus. For example, if we want to compare semantically two pieces of news that talk

    about terrorism, this option can be used.

  • Queries > Free texts

    The semantic similarity between the texts is 0.33. Both texts have a similar vector

    length, so LSA is equally familiar with both texts and their respective lengths are

    similar.
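
    Gallito handles the projection of free texts internally; conceptually, a common LSA approach (sketched below with the hypothetical US, row, and weights names used in earlier sketches; Gallito's exact folding-in formula is not documented here) is to represent each text as the weighted sum of its term vectors and then take the cosine:

    import numpy as np

    def text_vector(text, US, row, weights=None):
        """Represent a free text as the (optionally weighted) sum of its term vectors."""
        vec = np.zeros(US.shape[1])
        for term in text.lower().split():
            if term in row:                                   # terms unknown to the space are ignored
                w = 1.0 if weights is None else weights[row[term]]
                vec += w * US[row[term]]
        return vec

    def cosine(v1, v2):
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    # cosine(text_vector(news1, US, row), text_vector(news2, US, row)) -> similarity of two free texts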

    MOST REPRESENTATIVE TERMS

    The Queries > Most representative terms option makes it possible to obtain the list of

    the k terms with the highest vector length, and thus those with which the LSA is most

    familiar.

    We just have to specify the number of terms that we want to obtain and click the

    Extract button.

    For example, if we want the 100 terms with the highest vector length:

  • First comes the auxiliary verb “Haber”, which has the highest vector length (12.92),

    followed by “Año” (year) (12.10), the verb “Ir” (12.00), the verb “Ser” (to be) (11.94),

    etc.

    MOST REPRESENTATIVE DOCUMENTS

    In the same way, we can access the Queries > Most representative docs option to

    list the k documents with the highest average semantic relationship to all the

    trained documents.

  • CI SUMMARIZER

    This procedure can help to categorize a text by summarizing it in a few terms.

    For example, we enter the following document in the text box:

    In Neigh. per word, we specify the number of neighbors obtained by the procedure for

    each of the words in the text entered. We specify two.

    Final list shows which of the neighbors obtained in the previous stage are most

    representative (those which have the highest semantic similarity on average with the

    rest). Therefore, it displays a short list summarizing the text entered.

    By checking Also corrected we can remove terms that have a high vector length but do

    not contribute much to the meaning of the texts. If this option is chosen (as

    recommended), we can remove highly common verbs such as haber (auxiliary verb), or

    ser and estar (to be).

    Final neighbors provides a final list summarizing the text. If the procedure is successful,

    this list and the previous one provide a brief summary of the document entered in the

    text box.

    In this example, the final result after specifying Neigh. per word = 2, Final list = 10,

    Also corrected = Checked and Final neighbors = 10 is the following:

  • The original document is basically summarized by words such as “propinar”, “ocurrir”,

    “detener”, “haber”, “golpe”, “agresión”, “ser”, “paliza”, “agredir”, “agredido” and

    “patada” - all of which are words related to aggression and hitting. Even though there

    are some auxiliary verbs such as “ser” and “haber”, many of the terms sum up well the

    essence of the document.

  • FILE EXPORT

    EXPORT OF MATRICES AS TXT FILES TO BE READ BY SPSS OR OTHER STATISTICS

    SOFTWARE

    From Export > Matrices to .txt we can generate the working matrices for the latent

    semantic space with the *.txt extension in a hard disk directory of our choice.

    Click on the button to tell the program in which directory you want it to save the

    matrices. Then click the Generate button.

    The matrices saved in their respective files are the following:

    Modulos.txt contains the vector lengths for all the terms trained. If you open this file in

    SPSS and generate a few descriptive statistics, you will find that the vector lengths for the

    10,685 words are distributed in a positively skewed way (skewness = 4.365), with a mean of

    0.76 and a median of 0.32.

  • The pesos.txt file contains the weights assigned to the 10,685 terms in this corpus. It is

    the weight applied to the raw frequencies in the original term × document matrix.

    The S.txt file contains a square matrix of n x n, n being the number of dimensions

    specified to train the corpus, with the singular value associated with each of the

    dimensions (it represents the variance percentage associated with each dimension).

    The US.txt file contains the matrix of terms by dimensions. In this case, as the linguistic

    corpus has been trained using 250 dimensions and we have 10,685 terms, the matrix

    for this file contains 251 columns (the first column contains the list of terms) and

    10,685 rows, each of which represents a term. This matrix is extremely interesting, as

    it provides the latent semantic space proper that was generated after the training.

    SPSS descriptive statistics for Modulo (vector length):

    N valid = 10,685; missing = 0; mean = 0.7627; median = 0.3204; std. deviation = 1.35637;

    skewness = 4.365 (std. error 0.024); kurtosis = 22.053 (std. error 0.047).

  • Finally, the SV.txt matrix contains the vectorial representation not of the terms but of

    the trained documents.

    Importing the Modulos.txt, pesos.txt, and US.txt files into a single SPSS file is very easy:

    You can use all the analysis techniques available in this software.
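
    The same exported files can also be read in Python with pandas. The sketch below assumes whitespace-separated columns, a period as decimal separator, and the file names used in this manual, so inspect the first lines of your actual exported files before relying on it:

    import pandas as pd

    modulos = pd.read_csv("Modulos.txt", sep=r"\s+", header=None, names=["term", "length"])
    pesos = pd.read_csv("pesos.txt", sep=r"\s+", header=None, names=["term", "weight"])
    us = pd.read_csv("US.txt", sep=r"\s+", header=None)      # first column: term; remaining: coordinates

    merged = modulos.merge(pesos, on="term").merge(us, left_on="term", right_on=0)
    print(merged.shape)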

  • EXTRACTING CLUSTERS FROM A SEMANTIC SPACE AND REPRESENTING THEM IN

    PAJEK

    This is a very good option to see a graph representation of the main concepts in the

    training linguistic corpus. We start by accessing the Export to Pajek > Clusters to

    pajek menu.

    This procedure generates three files. One of them is a large matrix of correlations

    between the terms which the cluster extraction procedure identifies as most relevant.

    In turn, another file is generated so that the Pajek program can directly generate a

    conceptual network diagram. The parameters must be specified:

    First of all, the path where we want Gallito to generate the three files must be

    specified in Directory.

  • Once the path has been established, we must tell the program whether we want it to

    generate word clusters or choose the most representative word in the cluster

    generated. We will use words (Words) for this example.

  • We also choose the Normalizing US matrix option, preventing the terms in the US

    matrix which have the highest vector lengths from having a greater weight when the

    clusters are generated. The cluster analysis procedure is the K-means algorithm, a

    procedure which requires specifying the number of clusters to work with. In the

    Cluster num box we choose, for example, 35 clusters. In Cluster cycles we decide how

    many iterations the procedure is to make. If we choose 2 iterations, after the first

    cycle, in which words are assigned to their closest cluster, in the second cycle we

    reassign words to another cluster if distances have changed. The more iterations, the

    more stable the solution. However, the more iterations, the longer the procedure will take.
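
    For reference, a bare-bones version of what this describes (normalize the term vectors, then run a few k-means cycles) might look like the sketch below; the real Gallito procedure may differ in details such as initialization:

    import numpy as np

    def kmeans_terms(US, k=35, cycles=2, seed=0):
        """Plain k-means over row-normalized term vectors ("Normalizing US matrix")."""
        rng = np.random.default_rng(seed)
        V = US / np.linalg.norm(US, axis=1, keepdims=True)
        centroids = V[rng.choice(len(V), size=k, replace=False)].copy()
        for _ in range(cycles):
            # squared distances between every term vector and every centroid
            d2 = (V**2).sum(1)[:, None] - 2 * V @ centroids.T + (centroids**2).sum(1)[None, :]
            labels = d2.argmin(axis=1)
            for c in range(k):
                if np.any(labels == c):
                    centroids[c] = V[labels == c].mean(axis=0)
        return labels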

  • After the clusters are processed, the following files are generated in the directory

    specified:

    The files are ready to be used in the Pajek program. Pajek is a program

    that represents network graphs, so that we can have a quick, useful view of the

    general groupings of semantic concepts used to train the tool. Here is an example of a

    representation of association networks using the press corpus used to train Gallito.

    The cluster.mat file is the input for Pajek. Just open it with Pajek, remove connections

    below a given value in the Transform tab (for example, 0.35), and draw it. Then use the

    Kamada–Kawai algorithm to quickly generate a reasonable layout.

  • [Network graph of the press corpus clusters, with groups such as Al Qaeda; Inmigración (Immigration); Embarcación, petrolero, fuel (Vessel, oil tanker, fuel); Bush, Azores, Blair; Guerra, Militar, Afganistán (War, Military, Afghanistan); Justicia (Justice); Manifestación (Demonstration); Fraude inmobiliario (Real estate fraud); and Batasuna (Basque independentist party).]

  • CONVERTING TEXT FILES TO VECTORS AND REPRESENTING THEM IN PAJEK

    This is a very good option to see a graph representation of text files (essays,

    documents) in Pajek. We start by accessing the Export to Pajek > Documents to

    pajek menu.

    This procedure converts text files (essays, documents, e-mails, etc.) into vectors and generates a .mat file (for Pajek) to draw them as in the clusters example. Just specify the working directory in the Directory textbox. Specify the file that includes the list of text files to be processed in File. These files can be in .doc, .docx, and .txt format. The list format will look like this:

    Prueba.docx

    prueba2.docx

    34361692.txt

    27597718.doc

    40035398.txt

    242310602.txt

    33882636.txt

    32906839.docx

    65982026.txt

    39970932.txt

    15871651.txt

  • The files are ready to be used in the Pajek program.

  • Just open the .mat file with Pajek, remove connections below a given value in the Transform tab (for

    example, 0.35) and draw it. Then use the Kamada–Kawai algorithm to quickly generate a

    reasonable layout.

    CHANGE OF BASIS

    IDEA

    The change of basis is a procedure that basically makes it possible to interpret each of

    the coordinates that represent a word in a much more specific way.

    When we choose to represent the latent semantic space by means of 250 dimensions,

    the mathematical representation of each term is expressed by means of dimensions

    which have no familiar meaning to us. These are abstract dimensions in which we

    know that the words are well defined, but which do not provide any idea or

    insight to users.

    As was previously seen, the latent semantic space can be exported in a file named

    US.txt by the program, which can be easily exported to Excel or SPSS.

    We previously exported the data to SPSS, obtaining this:

  • The first word that appears is “congreso” (congress). This word has a vector length, a

    weight, and 250 coordinates in 250 dimensions. The figure above shows that the

    coordinate in dimension 1 of the term “congreso” is worth 2.039, coordinate 2 is worth

    1.951, etc. What do these dimensions mean? That is to say, the word “congreso” has a

    coordinate of 2.039 in the first dimension. What does this mean? Unfortunately, the

    first dimension is abstract and does not correspond semantically to anything.

    The idea of the change of basis involves transforming that abstract space into another

    one which can be more easily interpreted by users. This, in addition to making the

    semantic space more specific, makes it possible to use the semantic space in a more

    efficient way, as we shall see.

    SPACE CHANGE OF BASIS

    In order to carry out a change of basis, we must first access the Space > Change of

    basis menu. There are two options:

  • By clusters turns abstract dimensions into types of words generated by a k-means

    cluster analysis.

    By predefined words turns abstract dimensions into new dimensions defined by users

    depending on their interests.

    We will work with this second option:

    Space > Change of basis > By predefined words

    When accessing this menu, the following options are available:

  • In Reference folder we specify the path where the new semantic space with

    meaningful coordinates will be saved (together with other files which will be described

    later); this folder must also include a text file containing the word list chosen by users

    to replace the abstract dimensions with new, meaningful dimensions.

    In the Words file box, the name of the file containing the words that will become the new

    dimensions is specified. With the help of the conceptual network graph previously

    generated by the Pajek software, together with some notions about the main concepts

    in the press corpus, we put the following dimensions forward:

    The first dimension is related to the descriptors “Guerra”, “Militar” and “Afganistán”

    (War, Military, and Afghanistan). The second one is related to the maritime terms

    “Embarcación”, “Petrolero” and Fuel (Vessel, Oil tanker, and Fuel). Note that the file is

    called Nuevas dimensiones.txt, which must be specified in the Words file dialog box.

    Gram-Schmidt orthogonalization is an algebraic procedure required to preserve the

    orthogonality of the new meaningful dimensions. If this option is not chosen, there is

    the risk of obtaining a new semantic space which is meaningful but whose dimensions are

    highly oblique, which then distorts neighbor comparisons and other basic

    operations. The gain in meaningfulness would entail a loss in practical usefulness. Thus

    activating this option is recommended.

    Normalized basis is another recommended option: the basis is not only orthogonalized

    (Gram-Schmidt), but each of its vectors is also given unit length. Choosing both

    options will provide a new, meaningful orthonormal basis.
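
    To give an idea of what these two options do mathematically, here is a small sketch (with the hypothetical US and row names from earlier sketches; Gallito performs all of this internally) of Gram-Schmidt orthonormalization of the chosen term vectors, followed by projection of the whole space onto the new basis:

    import numpy as np

    def gram_schmidt(vectors):
        """Orthonormalize a list of vectors (the term vectors chosen as new dimensions)."""
        basis = []
        for v in vectors:
            w = v - sum(np.dot(v, b) * b for b in basis)   # remove the components along previous axes
            norm = np.linalg.norm(w)
            if norm > 1e-10:                               # skip vectors that are (nearly) dependent
                basis.append(w / norm)                     # unit length ("Normalized basis")
        return np.array(basis)

    # Hypothetical usage (each new axis could also be built as the sum of several descriptor
    # vectors, e.g. guerra + militar + afganistán):
    #   B = gram_schmidt([US[row["guerra"]], US[row["embarcación"]], US[row["justicia"]]])
    #   new_US = US @ B.T   # coordinates of every term on the new, meaningful dimensions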

  • The words which you want to use to define the new dimensions should be among the

    trained terms. Before running the change of basis, check that each of them exists in

    the space.

    Once the relevant changes to this dialog box have been made, click the Change button.

    The process to calculate the new semantic space usually does not take long.

    The figure above shows the files generated by the procedure to change basis.

    basisMatrix.txt is a file including the new basis (the new, meaningful basis).

    basisMatrixBeforeGsOrtog.txt contains the original basis (the abstract basis).

  • GSreliability.txt is a file that shows the reliability of the dimensions (they must be

    higher than 0.70 to provide the words chosen with meaning). Taking a quick look at

    this file, we can see that the reliabilities associated with the dimensions are, without

    exception, higher than 0.70:

    Note that, after the first 14 dimensions there appear the dimensions ABSTRACT15,

    ABSTRACT16, etc. This is because, once 14 meaningful dimensions have been specified,

    the rest, up to 250, must be abstract dimensions.

    newTermMatrix.bnl contains the new semantic space that includes the first 14

    meaningful dimensions. This file can be loaded (as it has the extension *.bnl) in the

    Load tab in the Term matrix box (see the section Loading the files).

    newTermMatrix.txt contains the new semantic space including the first 14 meaningful

    dimensions. Unlike the previous file, this file can be easily exported to Excel or SPSS.

    oldTermMatrix.txt is the file where the former semantic space (including the 250

    abstract dimensions) is preserved.

    Let us see some examples of how to make use of the new basis.

  • The term “Congresos” (Congresses) clearly stands out in the “Política_PSOE_PP”

    dimension over the rest. The difference with respect to the former semantic space is that

    we can now say in which dimension(s) a specific word has significant weight.

    The term Rey (King) clearly saturates the “Monarquía_Rey_Princesa” dimension.

    [Bar charts of the coordinates of the terms "congreso" and "rey" across the new dimensions.]

  • The term Islam clearly saturates the “Al_Qaeda_Bin_Laden” dimension.

    And the term “Garzón” saturates the “Justicia_Juez” dimension.

This notice tells us that the program has finished training the linguistic corpus we supplied and has saved the seven files with which we can begin to work. We can look in the specified directory for the seven files the program should have created. Note that the file names correspond to the names Gallito assigns in the Save tab; this will make things easier later on, when we open a new session with the program and want to load these files in order to work with it.

(Figure: bar charts of dimension weights for “islam” and “garzón”.)

BATCHES

You will often have to analyze a group of words or files. To this end, Gallito offers the possibility of carrying out actions in batches.

    SEMANTIC NEIGHBORS

Access the batches Neighbors menu.

Through this menu you can extract the first n semantic neighbors of a series of terms. In the dialog box you can specify the number of neighbors, the directory containing the file in which the terms are listed, and the file itself.

The content of the file will look like this:

This process will also generate in the working directory one file per term, listing each neighbor, its cosine with the term, and the length (norm) of its vector. The computation time for this procedure is quite short.
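Conceptually, the batch does for each term in the list what the following minimal sketch does (an illustration with assumed in-memory data, not Gallito's code): rank every trained term by its cosine with the target vector and keep the top n.

import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_neighbors(term, vectors, n=10):
    """Return the n terms whose vectors have the highest cosine with `term`.

    `vectors` is assumed to be a dict mapping each trained term to its
    (e.g. 300-dimensional) numpy vector.
    """
    target = vectors[term]
    scored = [(other, cosine(target, v), float(np.linalg.norm(v)))
              for other, v in vectors.items() if other != term]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]   # (neighbor, cosine, vector norm), as in the output files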

SIMILARITY MEASURES

Access the batches Similarity matrices menu. This extracts, for each term, a matrix in which its neighbors are compared with one another. The square matrix has ones in its diagonal, and each cell contains the cosine between one neighbor and another. Specify the working directory and the file name; the similarity matrices will be generated in that directory. The form can also be accessed via batches Similarity Batches.

    The file contents will look like this:

Clicking Extract generates a 200×200 matrix in the first case, a 300×300 matrix in the second, and so on.
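For illustration only (a sketch assuming the neighbor vectors are stacked as rows of a numpy array), a square matrix of the kind described above can be computed as:

import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine matrix for row vectors; ones appear on the diagonal."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

# e.g. 200 neighbors in a 300-dimensional space -> a 200x200 matrix
neighbors = np.random.default_rng(1).normal(size=(200, 300))
sim = cosine_similarity_matrix(neighbors)
assert np.allclose(np.diag(sim), 1.0)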

SIMILARITY PAIRS

Access the batches Similarity Pairs menu.

This will generate the similarities for a series of pairs of terms. The reference file and the output file name must be specified, along with the working directory.

    The file contents will look like this:
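As an illustration (the exact layout is an assumption: one pair of terms per line, separated by whitespace), such a list could be processed with the cosine helper and the vectors dictionary from the neighbors sketch above:

# pairs.txt is an illustrative file name; one pair of terms per line, e.g.:
#   rey princesa
#   islam guerra
with open("pairs.txt", encoding="utf-8") as handle:
    pairs = [line.split() for line in handle if line.strip()]

for a, b in pairs:
    if a in vectors and b in vectors:   # skip pairs containing untrained terms
        print(a, b, round(cosine(vectors[a], vectors[b]), 3))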

    INTRATEXT ANALYSIS

Sometimes you will need to analyze the internal properties of a text, such as the number of paragraphs, the number of words per paragraph and per sentence, its coherence, its synthetic and informational capacity, etc. To this end, Gallito 2.0 can measure a text on the basis of the various indices described below. First of all, it offers coherence measures. Textual coherence has been measured by means of LSA on the basis of similarity scores between successive parts of a text, whether sentences or paragraphs (Foltz, 2007). Moreover, behavioral correlates of its effects have been found. For example, Wolfe, Magliano, & Larsen (2005) found that the LSA similarity between sentences had an influence on reading processing times. Bellissens, Jeuniaux, Duran, & McNamara (2010) have also shown that students with a low level of previous knowledge of a topic, in contrast to those with a high degree of knowledge, are more sensitive to the removal of information overlap between sentences, operationalized by means of LSA similarities, as well as to the number of causal clauses measured with Coh-Metrix (Graesser et al., 2004). When textual coherence is not maintained, readers are forced to make elaborative inferences, which are more costly and require previous knowledge. The authors argue that this semantic overlap promotes the integration of what is being read with what was previously encoded, facilitating the reading flow.

Gallito 2.0 measures three types of coherence. First, it measures Paragraph-Paragraph coherence. A paragraph is defined by a break, and paragraphs with fewer than 10 words (words vectorized by LSA) are not taken into account for the analysis. The procedure extracts the cosine between each paragraph and the next one, and in the end all the similarities are averaged to yield a single measure. Second, Gallito 2.0 measures Sentence-Sentence coherence. This coherence is measured within each paragraph, which means that the similarity between the final sentence of one paragraph and the initial sentence of the following paragraph is not measured. Gallito 2.0 acts in this way under the assumption that each paragraph reflects a thematic unit. Obviously, one-sentence paragraphs are not included in this analysis, nor are sentences with fewer than 4 words. At the end, the sentence-sentence coherences within each paragraph are averaged, yielding a single sentence-sentence coherence. In addition, a third type of coherence is calculated which, although usually measured as such, is also used to obtain the sentence that best represents a paragraph or text (Kintsch, 2002). This is Sentence-Paragraph coherence, which measures the similarity between every sentence and the paragraph that includes it. We should warn that this measure may be problematic, as we believe it is too highly dependent on the number of sentences in a paragraph.

In addition to coherences, Gallito obtains the following surface measures: number of paragraphs, number of words, average number of words per paragraph, average number of words per sentence, and average number of sentences per paragraph.

Finally, Gallito 2.0 provides an estimate of the average amount of information conveyed by the words in a given text. This index gives an idea of the domain-specificity of the words used and of the degree of synthesis achieved, that is to say, the extent to which examples or uninformative words have been used. We introduced this latter measure, namely the average global weight in each paragraph, to measure the informativeness of the language employed in texts. Global weight is the opposite of entropy, and in fact entropy is part of its formula. The higher the global weight, the more informative a word is with regard to the contexts in which it appears.

To use this batch process, access the batches Intratext Analysis menu. The following form will appear.
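Two illustrations may help make these measures concrete. In standard LSA practice, the global weight of a term is the log-entropy weight (whether Gallito uses exactly this variant is an assumption, not something stated here):

$$ g_i \;=\; 1 + \sum_{j=1}^{n} \frac{p_{ij}\,\log p_{ij}}{\log n}, \qquad p_{ij} \;=\; \frac{tf_{ij}}{gf_i} $$

where $tf_{ij}$ is the frequency of term $i$ in document $j$, $gf_i$ its total frequency in the corpus, and $n$ the number of documents; the entropy term is negative, so terms scattered evenly across contexts (less informative ones) receive lower weights. The coherence measures, in turn, reduce to averaging cosines between consecutive text units, as in this minimal sketch (assumed in-memory paragraph or sentence vectors, not Gallito's code):

import numpy as np

def average_consecutive_cosine(unit_vectors):
    """Average cosine between each text unit and the next one.

    `unit_vectors` is a list of LSA vectors, one per paragraph (or sentence),
    with units that are too short assumed to be filtered out beforehand.
    """
    cosines = [
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in zip(unit_vectors, unit_vectors[1:])
    ]
    return float(np.mean(cosines)) if cosines else float("nan")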

Specify the working directory in the Directory textbox and the file that contains the list of text files to be processed in File. These files can be in .doc, .docx, and .txt format. The list will look like this:

    Prueba.docx

    prueba2.docx

    34361692.txt

    27597718.doc

    40035398.txt

    242310602.txt

    33882636.txt

    32906839.docx

    65982026.txt

    39970932.txt

    15871651.txt

The results will be given in a file called resul.txt, which will include the following columns:

ID (file name)

    numParagraphs (number of paragraphs)

    AverageSentencesPerParagraph (average number of sentences per paragraph)

AverageWordsPerParagraph (average number of words per paragraph)

    AverageWordsPerSentence (average number of words per sentence)

    AverageCohesionSenSen (average sentence-sentence coherence)

AverageCohesionParSen (average paragraph-sentence coherence)

    CohesionParPar (average paragraph-paragraph coherence)

    AverageGlobalWeight (average global weight)

    ESSAY FILE EVALUATION

Evaluation of the content of essay files is relatively simple using LSA. Once you have the semantic-vectorial space, you must compare each student's essay file with what are known as gold essay files. Gold essay files serve as a reference, as they have been written by an expert who has optimally summarized the topic or correctly answered the question. There may be a single gold essay file or several gold essay files (Rehder et al., 1998).

The procedure is quite similar to intra-textual analysis, with the exception that the gold essay list must be specified in addition to the student essay list. Access batches Essay Evaluation to find this form.

Here you can specify both the essay files list and the gold essay files list. In addition, as always, a working directory must be specified. The results will be included in a file called resul.txt in the working directory. Both student and gold files can be in any of the following formats: .pdf, .docx, .doc, .txt. Both the list of essay files to be evaluated and the list of gold essay files will be in the following format:

    The results appear in the following format:

    Similarity is measured in terms of distances, so higher figures will indicate a lower score with respect to gold essay files.
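As a rough sketch of the underlying comparison (the names are assumptions, and Euclidean distance is used purely for illustration, since the manual only states that the score is a distance, lower being better):

import numpy as np

def essay_distance(student_vec, gold_vecs):
    """Distance between a student essay vector and the closest gold essay vector.

    Lower values indicate an essay closer to the gold reference(s).
    """
    return min(float(np.linalg.norm(student_vec - g)) for g in gold_vecs)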

DOCS TO VECTORS

This procedure converts text files (essays, documents, e-mails, etc.) into vectors. Specify the working directory in the Directory textbox and the file that contains the list of text files to be processed in File. These files can be in .doc, .docx, and .txt format.

    The list formats will look like this:

    Prueba.docx

    prueba2.docx

    34361692.txt

    27597718.doc

    40035398.txt

    242310602.txt

    33882636.txt

    32906839.docx

    65982026.txt

    39970932.txt

    15871651.txt

The output: a .txt file containing a matrix whose rows are the vectors of the files named in the list.
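If that output is a plain whitespace-separated numeric matrix (an assumption about the format; the file name below is also illustrative), it can be loaded for further analysis like this:

import numpy as np

# Assumed layout: one row per processed document, one column per semantic dimension
doc_vectors = np.loadtxt("docsToVectorsOutput.txt")
print(doc_vectors.shape)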