Extract from the Revue Informatique et Statistique dans les Sciences humaines XXV, 1 to 4, 1989. C.I.P.L. - Université de Liège - All rights reserved.

Recent results of automation projects in Prague

Eva HAJIČOVÁ, Jarmila PANEVOVÁ and Petr SGALL

1. The character of the present volume shows that mechanographical and automatic treatment of texts and of language data has a long and important tradition in Czechoslovakia. Since the end of the 1950's the Mechanographical Laboratory, which worked in the Institute of the Czech Language of the Academy of Sciences of Czechoslovakia under the leadership of Jitka Štindlová, processed data from large corpora with the help of punched-card machines, classifying the data according to sets of required criteria. Along with a reverse dictionary including a morphemic classification of words (which, unfortunately, was never published), this research resulted in various lists of combinations of phonemes and graphemes, concordances and other lists, see Mater and Štindlová (1968, esp. the bibliography, pp. 313-336), also e.g. Panevová (1968).

In parallel, since the beginning of the 1960's, punched-card machines were also used in the research group of algebraic linguistics at Charles University, Prague, for the aims of a target corpus of Czech technical texts from electrotechnics, see Panevová (1965). Along with this corpus, useful for any empirical inquiry into the phenomena of morphemics and of surface as well as underlying syntax of technical texts, an English-to-Czech dictionary for the domain of electronics was also elaborated by means of these machines; see Hajičová and Panevová (1968).

When the use of computers (of the 2nd and 3rd generations) became possible, the development of automation in linguistics was divided into two main lines. The first of them, partly continuing the research of Štindlová's Laboratory, aims at assisting the linguist's research work; automation means help in compiling, classifying and evaluating empirical material and in the building of frequency and reverse dictionaries, as well as other kinds of classified lists. The other line is oriented towards practical applications requiring an automatic processing of texts and of the data contained in them. This concerns such domains of use outside linguistics itself as information storage and retrieval or translation. In both lines, not only the observational, but also the descriptive level of adequacy is required.

Among the results of the first line of research, there are those of the Department of Quantitative Linguistics in the Institute of the Czech Language, especially the evaluation of lexical and grammatical phenomena in individual functional styles of Czech, with a linguistic interpretation of the quantified phenomena; see Těšitelová (1985). Here also belongs the reverse dictionary of Czech, Těšitelová, Petr and Králík (1986), where not only words, but also tagged word forms are included, accompanied by data on their frequency.

2. Results of the other line of research, applicable outside of linguistics, constitute the proper topic of the present paper. We would like to characterize here briefly the projects carried out at the Faculty of Mathematics and Physics, Charles University, starting with those concerning information retrieval (textual information and automatic indexing) and then passing over to an interface to databases.

2.1. The system ASIMUT (Automatic Selection of Information by the Method of Full Text) is based on Horty's Full Text method (see Kehl et al., 1961), an application of which to a language with such a rich set of inflectional forms as Czech encounters various difficulties. Its advantages are (i) that the full technical texts are treated, (ii) the system works without a dictionary (which would require frequent readjustments, connected with updating the linguistic processor), and (iii) the system is so constructed as to restrict the user by no linguistic requirements, i.e. without any serious limitations on the form of the queries. Point (i) is important especially in what concerns texts from the domain of law (including most different prescriptions, resolutions, etc.). Points (ii) and (iii) are fulfilled thanks to a detailed linguistically based algorithm handling inflectional and derivational morphemics of Czech, including an automatic assignment of word classes, inflectional paradigms and noun genders. The algorithm (completed by a list of exceptions, reflecting the non-productive kinds of inflection) assigns every basic word form (nominative, infinitive) an index specifying the set of the corresponding oblique forms, and also the most productively formed derived words (postverbal nouns and adjectives, adverbs derived from adjectives, etc.). If e.g. the basic form ends in -a, it is identified (if not belonging to the set of exceptions) as a feminine noun of the paradigm žena ('woman'); however, it is also checked whether (according to the typical end-segment, consisting of several graphemes) the basic form does not belong to the neuter paradigm schéma ('scheme'). The list of exceptions helps to distinguish between the quoted paradigms and those of předseda ('chairman'), or of kamna ('oven'), a plurale tantum. With 4 paradigms for masculine animate nouns, 3 for inanimate masculines, 3 for feminines, 7 for neuter nouns, and moreover, with several paradigms of adjectives and verbs, which are also handled, this task certainly is not trivial.
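The decision described above for basic forms ending in -a can be pictured as a short cascade: exceptions first, then the neuter end-segment test, then the productive feminine paradigm. The following Python sketch is only an illustration under stated assumptions: the function name, the dictionary layout and in particular the end-segments "éma" and "gma" are hypothetical, since the text does not list the segments actually used by ASIMUT.

```python
from typing import Optional

# Non-productive cases listed explicitly; both words are named in the text.
EXCEPTIONS = {
    "předseda": "předseda",  # masculine animate noun in -a
    "kamna": "kamna",        # plurale tantum
}

# ASSUMPTION: illustrative end-segments for the neuter paradigm "schéma".
# The text only says that a typical end-segment of several graphemes is checked.
NEUTER_END_SEGMENTS = ("éma", "gma")


def assign_paradigm(basic_form: str) -> Optional[str]:
    """Return the model word of the paradigm, or None for forms not ending in -a."""
    word = basic_form.lower()
    if not word.endswith("a"):
        return None  # other endings are outside this sketch
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith(NEUTER_END_SEGMENTS):
        return "schéma"
    return "žena"  # productive feminine paradigm
```

For instance, `assign_paradigm("studna")` falls through to the productive paradigm, while `assign_paradigm("kamna")` is caught by the exception list before any ending test is made.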

After the automatic assignment of these data the stem is adjusted by the algorithm to the requirements of stem alternations, i.e. secondary stems are derived, exhibiting an inserted -e-, a lengthened or shortened stem vowel, as well as the consonant alternations h - z, ch - š, k - c, r - ř. The next step is the derivation of oblique forms, displaying each of the appropriate inflectional endings (and of the productive derivational suffixes of the types mentioned above); on the average there are 10 different endings attached to a regular stem, exhibiting no alternations.
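A rough picture of these two steps, a secondary stem obtained by consonant alternation and the attachment of endings to a regular stem, is given below. It is a simplified sketch, not the ASIMUT algorithm: the function names are hypothetical, the inserted -e- and the vowel length changes are left out, and the sketch does not decide which case forms take the secondary stem. The ten endings are those of the feminine paradigm žena.

```python
# Consonant alternations h - z, ch - š, k - c, r - ř.
# "ch" is tested before "h" because it is a single grapheme in Czech.
ALTERNATIONS = (("ch", "š"), ("h", "z"), ("k", "c"), ("r", "ř"))

# The ten distinct endings of the paradigm "žena" (singular and plural together);
# the empty string is the zero ending of the genitive plural.
ZENA_ENDINGS = ("a", "y", "ě", "u", "o", "ou", "", "ám", "ách", "ami")


def secondary_stem(stem):
    """Return the stem with its final consonant alternated, if it has one."""
    for hard, soft in ALTERNATIONS:
        if stem.endswith(hard):
            return stem[: -len(hard)] + soft
    return stem


def oblique_forms(basic_form, endings=ZENA_ENDINGS):
    """Attach every ending to a regular stem (basic form minus its final -a)."""
    stem = basic_form[:-1]
    return [stem + ending for ending in endings]
```

With these definitions `secondary_stem("ruk")` gives `"ruc"` (as in ruka, ruce), and `oblique_forms("žena")` gives the ten forms from `"žena"` to `"ženami"`. Stems that need an inserted -e- (such as that of stavba, genitive plural staveb) are deliberately not covered.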

The user's query contains a term, consisting possibly of more than one word form, or a set of cooccurrent or alternative terms (the latter are indicated by a comma). The requirement for all grammatical forms of a given word to be identified in the corpus of texts is indicated by a single symbol following that word in the query (e.g. by "!"). Between any two adjacent word forms in the query, a distance operator (DO) can be placed; the set of DO's used is -1-, -3- and -4-; the operands (i.e. the words standing directly to the left and to the right of the DO, or the strings of words put into parentheses in a way suitable to the Boolean functions applied) are looked up by the system as (a) adjacent to each other if -1- is applied, (b) belonging to the same sentence with the DO -3-, and (c) occurring in the same paragraph in case -4- is present. If no DO was applied by the user, the system supplies the default operator looking up all the cases where the operands are present at a distance not larger than +3 from each other.
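The effect of the distance operators can be sketched as a test on two text positions. The sketch below is hypothetical in several respects that the text leaves open: the `Position` record and the function name are invented for illustration, adjacency is taken to be independent of word order, and the default distance of 3 is assumed to be counted in word forms within one sentence.

```python
from typing import NamedTuple, Optional


class Position(NamedTuple):
    document: str
    paragraph: int  # paragraph number within the document
    sentence: int   # sentence number within the paragraph
    word: int       # word-form number within the sentence


def satisfies(operator: Optional[str], left: Position, right: Position) -> bool:
    """Check whether two occurrences stand in the relation required by the operator."""
    if left.document != right.document:
        return False
    same_paragraph = left.paragraph == right.paragraph
    same_sentence = same_paragraph and left.sentence == right.sentence
    gap = abs(left.word - right.word)
    if operator == "-4-":
        return same_paragraph
    if operator == "-3-":
        return same_sentence
    if operator == "-1-":
        return same_sentence and gap == 1
    if operator is None:  # default operator supplied by the system
        return same_sentence and gap <= 3  # ASSUMPTION: counted in word forms
    raise ValueError(f"unknown distance operator: {operator!r}")
```

Two occurrences in the same sentence at word positions 2 and 5 would thus satisfy -3-, -4- and the default operator, but not -1-.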

The identification of the relevant positions in the texts takes place with the use of a concordance list of the word forms occurring there (minimized by means of a negative dictionary, which excludes some of the most frequent words: auxiliary and other general verbs, some conjunctions, prepositions, etc.). The word forms taken from the query are first completed by forms derived by the algorithm and then confronted with the concordance, where every word form occurrence is provided with symbols corresponding to the names of the relevant document and subdocument, and the numbers of the paragraph, sentence and word form in the sentence. Alternatively, it is possible to check the oblique word forms occurring in the concordance first, and not to derive those which are not contained there.

The text positions relevant for the given query are thus identified and can be displayed as answers to the user's query. Such a query as e.g. STUDNA ('well')! -4- DROBNA ('small')! STAVBA ('construction')! is answered by sequences such as (1) or (2), if they are encountered in the texts treated:
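A concordance of this kind can be sketched as a mapping from word forms to lists of positions. The code below is a minimal illustration, not the ASIMUT implementation: the function names, the two-word negative dictionary, the sentence splitting by punctuation and the omission of subdocuments are all assumptions made for the sake of a short example.

```python
import re
from collections import defaultdict

# ASSUMPTION: a tiny negative dictionary (an auxiliary verb and a preposition).
NEGATIVE_DICTIONARY = {"je", "k"}


def build_concordance(documents):
    """Map each word form to its (document, paragraph, sentence, word) positions."""
    concordance = defaultdict(list)
    for name, text in documents.items():
        for p, paragraph in enumerate(text.split("\n\n"), start=1):
            sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
            for s, sentence in enumerate(sentences, start=1):
                words = re.findall(r"\w+", sentence.lower())
                for w, form in enumerate(words, start=1):
                    if form not in NEGATIVE_DICTIONARY:
                        concordance[form].append((name, p, s, w))
    return dict(concordance)


def look_up(concordance, forms):
    """Collect the positions of all given word forms, in text order."""
    hits = []
    for form in forms:
        hits.extend(concordance.get(form, []))
    return sorted(hits)


def present_forms(concordance, candidate_forms):
    """The alternative order: keep only the forms that occur in the concordance."""
    return [form for form in candidate_forms if form in concordance]


EXAMPLE = {"doc1": "Stavba studny je stavbou drobnou.\n\n"
                   "K druhům staveb drobných patří studna."}
EXAMPLE_CONCORDANCE = build_concordance(EXAMPLE)
```

Excluded words still count when the word-form numbers are assigned, so that distances between the remaining forms are preserved; this, too, is a design choice of the sketch rather than something stated in the text.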

(1) Stavba studny je stavbou drobnou. ('The construction of a well is understood as a small construction').

(2) K druhům staveb drobných patří studna. ('Among constructions of the small type there belongs a well').

This example stems from the domain of investment constructions, in which the system ASIMUT has been prepared for practical application, see Cejpek, Panevová and Zvelebil (1988).

Since ASIMUT is a system for retrieval, rather than for comprehension, it can be appropriately used to find orientation in prescriptions and laws, but it is no "expert" in a given domain. An improvement of the system can be achieved by a set of so-called helps for the user (e.g. a thesaurus of the given domain, a list of synonyms and other semantically related terms). A list of synonyms (in a large sense) can even be included in the implemented system. Due to the universal validity of the linguistic processor, deriving oblique word forms on the basis of a relatively complete handling of Czech morphemics, the system is very effective for general aims of retrieval, being independent of the domain of texts.

2.2. As was stated above, the system ASIMUT, looking for every occurrence of a term in the corpus of texts, is useful first of all for queries concerning texts from the domain of law (including administrative prescriptions). In polytechnical and similar domains it is important to identify those documents that are directly relevant for the user, i.e. those in which the given term belongs to the centre of attention, rather than being only marginally mentioned. Therefore, in such domains a system based on automatic indexing of texts and working with weights assigned to terms according to the frequency and positions of their occurrences, to their mutual relationship, etc., will be given preference. For a language with rich inflection, it is possible to use the very forms of words, especially the end segments of adjacent words in a text, to identify technical terms and to handle them by an automatic system. This is the background of the method MOSAIC, elaborated by Kirschner (1983), which can be used both for automatic indexing of technical and other texts and for text condensation (automatic abstracting). The method is suitable especially for Slavonic languages, where not only the (often international) affixes characteristic of technical terms can be used as an important source of relevant information, but also the case endings, including the syntactic agreement of an adjective with its governing noun.

2.3. A natural language interface to a database was implemented and tested, based on the method KODAS (see Hajič, 1983). The system, implemented (in PL/1-F) on SIEMENS 7755, is called KODAS (contact with database on Siemens).

There are no specific linguistic restrictions on the form of the questions used as queries; the form of the question should obey the general grammatical rules of formulating questions in Czech and it is assumed that the user will ask "reasonable" questions, will use common Czech words and constructions and will formulate his questions directly, without polite and other phrases. It is possible to ask for individual items of a record (of a row) or for the whole record, or, as the case may be, for several items or records; it is also possible to ask for derived information, for maximal, minimal or average values of some items, for total numbers of records with the given characteristics and to a restricted extent also for a certain type of percental enumerations.

The function of the system can be divided into two parts:

(1) the analysis of the query, i.e. its transduction into a special formal language;

(2) the interpretation of this formal language, i.e. the search for the required records (items) in the database.

The query is processed word by word from the left to the right (in the query there may be up to 79 words or 480 characters) and the words are searched for in the lexicon.

The lexicon of the KODAS system contains relevant parts of words (called lexical segments in the sequel) rather than all forms of the given word; in cases where the stem of the word is changed under declension or conjugation, the lexicon includes all these variants.

For synonymous lexical items there is only one lexical entry as basic and the other expressions are only quoted in the lexicon with a reference to the basic one; in the rules then, only the basic form is referred to.

The lexical entry consists of the following parts:

- the graphic form of the given lexical segment
- the symbol indicating whether for the given segment there exist transduction rules that are to be applied
- the number of the word class (not only the classical word classes are used, but also several specific classes, a class for punctuation marks, etc.)
- an indication whether the given lexical segment refers to the name of some item, or, as the case may be, whether the given segment refers to more than one item
- the identification numbers of transduction (transformation) rules that are (under certain conditions) used in the analysis of the query.
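The parts of a lexical entry listed above map naturally onto a small record type. The following sketch is a hypothetical Python rendering, not the PL/1 data layout of KODAS: the field names, the sample entries with their rule numbers, the synonym segment MZD and the longest-prefix lookup are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class LexicalEntry:
    segment: str                    # graphic form of the lexical segment
    has_rules: bool                 # do transduction rules exist for it?
    word_class: int                 # number of the word class
    item_reference: int             # names one item, several items, or none
    rule_ids: Tuple[int, ...] = ()  # identification numbers of the rules
    basic: Optional[str] = None     # synonyms only: segment of the basic entry


# ASSUMPTION: illustrative entries; all numeric values are invented.
LEXICON: Dict[str, LexicalEntry] = {
    "PRACOV": LexicalEntry("PRACOV", True, 10, 0, (1,)),
    "PLAT": LexicalEntry("PLAT", True, 10, 0, (7,)),
    "MZD": LexicalEntry("MZD", False, 10, 0, (), basic="PLAT"),
}


def find_entry(word: str, lexicon: Dict[str, LexicalEntry] = LEXICON):
    """Return the entry of the longest segment the word starts with, or None."""
    candidates = [segment for segment in lexicon if word.startswith(segment)]
    if not candidates:
        return None  # the system would ask the user to rephrase the question
    entry = lexicon[max(candidates, key=len)]
    return lexicon[entry.basic] if entry.basic else entry
```

Here `find_entry("MZDU")` follows the synonym reference and returns the basic entry PLAT, so that later rules only ever need to mention the basic form.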

If no lexical segment was found for some word contained in the query, the system warns the user that the word has not been identified; the message "PROSI'M, POLOZ*TE OTA'ZKU JINAK. NEROZUMI'M VY'RAZU ..." (Please, formulate the question in a different way. I do not understand the expression ...) appears on the screen and it is up to the user to find another formulation.

The first step of the analysis is finished when all words included in the query have been identified and assigned a lexical entry given in the lexicon.

The second step of the analysis of the query consists in a certain kind of semantico-syntactic analysis of the questions. The query, transcribed at this point as a sequence of lexical segments (together with other parts of the respective lexical entries), is processed again from the left to the right, and those transduction rules are applied whose identification numbers are included as a part of the given lexical entry. The set of rules is the second part of the file, the first part of which is the lexicon. If the condition for the use of the particular rule is fulfilled, i.e. if the left-hand side of the rule agrees with the corresponding part of the query rewritten as a sequence of lexical entries, this part of the query is replaced by the right-hand side of the rule. This procedure is repeated as long as at least one rule was applied in the last processing; if in the course of the last processing no rule has been applied, it is tested whether the resulting parts of the string (corresponding to the original lexical entries) can be a part of a formula for the interpretation. If this is not the case, the user is warned by the message "I am not able to answer your question. You might have formulated it in a wrong way. Please try and put your question differently." At this point, the user already knows that all the words used in the question have been identified by the system but that the formulation (structure) itself was not identified in the course of the application of the transduction rules.
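The repeated left-to-right application of rules until a pass brings no change is a fixed-point rewriting loop, which can be sketched as follows. This is a simplified illustration with invented rules: in KODAS the rules are selected through the identification numbers stored in the lexical entries and operate on whole entries, whereas the sketch works on bare segment names and tries every rule at every position. The `max_passes` guard is likewise an addition of the sketch.

```python
def apply_rules(sequence, rules, max_passes=100):
    """Rewrite the sequence left to right until one full pass applies no rule.

    Each rule is a pair (lhs, rhs) of tuples; lhs must not be empty.
    """
    seq = list(sequence)
    for _ in range(max_passes):
        changed = False
        i = 0
        while i < len(seq):
            for lhs, rhs in rules:
                if tuple(seq[i:i + len(lhs)]) == tuple(lhs):
                    seq[i:i + len(lhs)] = rhs  # replace by the right-hand side
                    changed = True
                    break
            i += 1
        if not changed:
            return seq
    raise RuntimeError("rules did not reach a fixed point")


# ASSUMPTION: two invented rules, loosely modelled on "older than 35 years".
EXAMPLE_RULES = [
    (("STARS*I'",), ("VE*K", "GT")),  # "older" becomes an age comparison
    (("LET",), ()),                   # "years" carries no further information
]
```

Applied to the segments `["PRACOV", "STARS*I'", "35", "LET"]`, the two rules fire in the first pass, the second pass changes nothing, and the loop returns `["PRACOV", "VE*K", "GT", "35"]`. The subsequent test of whether the result can form part of a formula is not shown.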

The resulting strings can be illustrated by the following examples of questions and answers:

Vypiš hodnosti pracovníků z katedry fyziky pevných látek. (List the degrees of workers at the department of physics of solid materials.)
HODNOST (P, 10, 0) HIE' NO (P, 10, 0) AND (L, 0, 0) PRACOV (P, 10, 0), EQ (LR, 0, 0) KATEDRA (C, 1, 0) FYZIKY (C, 1, 0) PEVNY'CH (C, 1, 0) LA'TEK (K, 1, 0) ## ENDL (EL, 0, 0)

Napiš průměrný plat pracovníků děkanátu starších 35 let. (List the average wages of the workers of the dean's office older than 35 years of age.)
PRU'ME*R (E, 0, 2) PLAT (P, 10, 0) PRACOV (P, 10, 0) EQ (LR, 0, 0) DE*KANA'T (K, 1, 0) AND (L, 0, 0) VE*K (PE, 10, 0) GT (LR, 0, 0) 35 (K, 4, 0) ## ENDL (EL, 0, 0)

Kolik lidí má úvazek větší než průměrný? (How many people have more hours of duty than is the average?)
POC*ET (E, 0, 1) U'VAZEK (P, 10, 0) GT (LR, 0, 0) PRU'ME*R (E, 0, 2) U'VAZEK (P, 10, 0) ## ALL (LT, 0, 0) ## ENDL (EL, 0, 0) ## ENDL (EL, 0, 0)

The indications in the parentheses after each lexical segment are written in the order given above in their description; the first of them, formerly indicating the necessity of applying the rules, was changed during transformations to serve now as a "type of element" in the procedure of interpretation as follows:

P - name of an item in the database
L - logical operator
LR - name of a relation
K - constant for comparing with the contents of the database
C - lexical segment will continue in the next element, with the appropriate type given in the last element
EL - end of logical expression
E - evaluating procedure
PE - name of a pseudoitem (which must be compiled from the values of other item(s) in the database)
LT - logical value "true".

In the procedure of interpretation the component parts of word complexes are combined together in one whole; during the analysis the lexical segments that can be a part of such a word complex are specifically marked for that purpose. Moreover, all required mathematical operations are carried out and the values of their results are substituted as constants into the formulas.

The database (one record after another) is then searched through and it is tested which items fulfil the conditions of the formula. If such a record is found, all the relevant items are printed (and appear on the screen).
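The interpretation step, a sequential scan that tests each record against the conditions of the formula after evaluated sub-results have been substituted as constants, can be sketched as below. Everything concrete here is an assumption for illustration: the record layout, the sample values, the function names, and the treatment of a formula as a plain list of (item, relation, constant) conditions joined by AND. The relation names EQ and GT and the item names follow the example strings of the text.

```python
import operator

RELATIONS = {"EQ": operator.eq, "GT": operator.gt}

# ASSUMPTION: an invented four-record database keyed by item names.
DATABASE = [
    {"PRACOV": "DE*KANA'T", "VE*K": 41, "PLAT": 3000, "U'VAZEK": 40},
    {"PRACOV": "DE*KANA'T", "VE*K": 52, "PLAT": 4000, "U'VAZEK": 20},
    {"PRACOV": "DE*KANA'T", "VE*K": 29, "PLAT": 2500, "U'VAZEK": 40},
    {"PRACOV": "KATEDRA", "VE*K": 47, "PLAT": 5000, "U'VAZEK": 30},
]


def select(database, conditions):
    """Scan record by record and keep those fulfilling every condition."""
    return [
        record for record in database
        if all(RELATIONS[rel](record[item], const) for item, rel, const in conditions)
    ]


def average(database, item, conditions=()):
    """Evaluating procedure: mean of an item over the selected records."""
    values = [record[item] for record in select(database, conditions)]
    return sum(values) / len(values) if values else None


def count(database, conditions=()):
    """Evaluating procedure: number of the selected records."""
    return len(select(database, conditions))


# "Average wage of dean's office workers older than 35."
dean_average = average(DATABASE, "PLAT",
                       [("PRACOV", "EQ", "DE*KANA'T"), ("VE*K", "GT", 35)])

# "How many people have more hours than the average?": the inner average is
# computed first and then substituted into the condition as a constant.
mean_hours = average(DATABASE, "U'VAZEK")
above_average = count(DATABASE, [("U'VAZEK", "GT", mean_hours)])
```

In the second query the nested evaluation mirrors the substitution of computed values as constants: the comparison against the database only ever sees a number, not a nested expression.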

3. With a longer perspective in view, systems requiring a much more complex linguistic analysis are also being prepared by the research group of mathematical linguistics at Charles University. On the basis of a carefully chosen and enriched framework for linguistic description (Sgall, Hajičová and Panevová, 1986), a system of natural language understanding has been implemented using the method TIBAQ (Hajičová and Sgall, 1984). The sentences of the input text are transferred by a syntactico-semantic analysis into their disambiguated underlying forms, which are combined into a semantic network. Rules of linguistic inferencing operate on this network to find further assertions, which have not been contained directly in the input text, but follow from (pairs of) sentences contained there. The resulting enriched set of assertions may then be used for automatic question answering, or as a knowledge base for an expert system.

Furthermore, two large experiments with machine translation are being carried out at Charles University, one of which tests the system APAC3 (see Kirschner, 1988) translating English texts on water pumps into Czech; the other one translates texts from a subdomain of electronics from Czech into Russian. The grammatical ingredients of the two systems should handle most different syntactic, semantic and pragmatic phenomena present in the given languages; besides the classical features, also those concerning the topic-focus articulation and the hierarchy of communicative dynamism are accounted for in the theoretical framework of linguistic description as well as in the applications.

References

CEJPEK J., PANEVOVÁ J. and ZVELEBIL V., Dosavadní zkušenosti s využitím ASIMUT 2 ve výstavbě, in "Lingvistické metody a automatizované informační systémy". Dům techniky ČSVTS, Prague, 1988.

HAJIČ J., Kodas - A Natural Language Interface to a Simple Database, in "Prague Bulletin of Mathematical Linguistics" 39, 1983, pp. 65-76.

HAJIČOVÁ E. and PANEVOVÁ J., Some Experience with the Use of Punched Card Machines for Linguistic Analysis, in "Les Machines dans la Linguistique" (ed. by E. Mater and J. Štindlová), Prague, 1968, pp. 109-115.

HAJIČOVÁ E. and SGALL P., Text-and-Inference based answering of questions, in "Contributions to Functional Syntax, Semantics, and Language Comprehension" (ed. by P. Sgall), Prague and Amsterdam, 1984, pp. 291-320.

KEHL W. et al., An information retrieval language for legal studies, Communications of the Assoc. for Computing Machinery 4, 1961, pp. 380-389.

KIRSCHNER Z., Mosaic - A method of automatic extraction of significant terms from texts. Explizite Beschreibung der Sprache und automatische Textbearbeitung X, Prague, 1983.

PANEVOVÁ J., From the activities of Mechanographical Laboratory, Institute of Czech Language, PBML 12, 1968, pp. 73-74.

PANEVOVÁ J., Razbor elektrotechničeskich tekstov, PBML 4, 1965, pp. 3-25.

TĚŠITELOVÁ M., Kvantitativní charakteristiky současné češtiny, Praha, 1985.

TĚŠITELOVÁ M., PETR J. and KRÁLÍK J., Retrográdní slovník současné češtiny, Praha, 1986.