
Duplication of this document or parts thereof is permitted only under the LE-PAROLE partners' written permission.


Exhibit A

Dutch PAROLE Distributable Corpus Documentation

Institute for Dutch Lexicology (INL)

P.O. Box 9515
2300 RA Leiden
The Netherlands
www.inl.nl
[email protected]

1 Introduction

The PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference Corpus, one of the results of the large European corpus harmonisation effort called PAROLE (MLAP/LE2-4017). Apart from the Dutch corpus, thirteen other written-language corpora were built according to the same design and composition principles in the period 1996-1998. The languages involved in PAROLE were: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish.

The harmonisation of corpus composition (the selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and for classification according to publication medium. No texts older than 1970 were allowed. As for publication medium, the corpus had to include specific proportions of texts from the categories Book, Newspaper, Periodical and Miscellaneous, within a settled range.

The harmonisation effort also applied to the textual and linguistic encoding of the corpora involved. With respect to the markup of text structure and primary data, every corpus text was to be encoded according to the PAROLE DTD, which is compatible with the DTD of the Text Encoding Initiative (TEI) and with that of the Corpus Encoding Standard (CES). The level of encoding was set to Level 1 of the CES, implying the encoding of text structure and textual features up to paragraph level, with the additional constraint, however, that all legacy data were kept.

As for linguistic annotation, an equal proportion of the corpus texts (up to 250,000 running words) was to be morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language-specific features. The checking of the tags was split in two: 50,000 words had to be checked for maximum granularity and 200,000 for part of speech (PoS) only. For the latter portion, the Dutch data were also checked for Type.

1 For questions and remarks about the Dutch PAROLE Distributable Corpus or its documentation, please contact [email protected].


Section 2 below presents some further information about the selection of the sources of the Dutch Distributable Corpus. The classification, size and date of the sources can be found in section 2.1, Table of Contents, whereas their bibliographical characteristics are listed in section 2.2. The choice between full texts and other text-organizing principles is accounted for in section 3, followed by the explanatory table of encoding elements used in the Distributable Corpus in section 4. Finally, facts about the linguistic annotation and checking are presented in section 5.

2 Sources of the Dutch Distributable Corpus

Criteria for selection and legal restrictions

The choice of sources to be included in the Dutch Distributable Corpus is primarily determined by the availability status of the texts (availability depending on the willingness of copyright holders to give permission for PAROLE use). According to Dutch law, permission by the copyright holder is needed for all types of use, except for citation of small text excerpts, provided that the source is properly mentioned. This, among other things, explains the presence of short text files in three Medium categories of the Dutch Distributable Corpus.

2.1 Table of Contents

MEDIUM          SOURCE                                      TIMESPAN     TOTAL NUMBER OF WORDS

BOOKS           Van Sterkenburg:
                  Wdlijst tot wdboek                        1984                        65,344
                  Taal vt Journaal                          1989                        56,215
                  WNT-portret                               1992                        60,133

NEWSPAPERS      * Short Newspaper texts:
                  MN_Collection                             1986-1988                   19,537
                  CVNP(S)-Collection                        1983-1990                  179,220

PERIODICAL      * Short texts from:
                  Local Papers                              1985-1988                   47,019
                  Magazines                                 1985-1989                  164,589

MISCELLANEOUS   * Texts to be read out in TV-news broadcasts for:
                  General audience                          1992-1995                1,285,824
                  Youth                                     1991-1995                1,008,658
                * Short texts from Ephemera                 1985-1986                  131,692

TOTAL                                                                                3,018,231

* These texts contain words between "distinct-tags". These words are not accounted for in these figures.

2.2 Bibliographical listing of sources per Medium

2.2.1 BOOKS

Sterkenburg, P.G.J. van

Van woordenlijst tot woordenboek. Inleiding tot de geschiedenis van woordenboeken van het Nederlands. E.J. Brill, Leiden, 1984


Sterkenburg, P.G.J. van

Taal van het Journaal. Een momentopname van hedendaags Nederlands. Sdu Uitgeverij, ’s-Gravenhage, 1989

Sterkenburg, P.G.J. van

Het Woordenboek der Nederlandsche Taal. Portret van een Taalmonument. Sdu Uitgeverij, ’s-Gravenhage, 1992

2.2.2 NEWSPAPERS

Short texts (among which advertisements) from 6 different Dutch newspapers, from the collection Materiaalverzameling Noord, 1986 through 1988. The newspapers are: Leidsch Dagblad, De Leidenaar, De Telegraaf, Volkskrant, Het Vrije Volk, De Waarheid.

Short texts from 19 different newspapers selected from the collection “Collectie van der Veen” (CVNPS), 1983 through 1990. Short texts from NRC (Handelsblad) are included for 1983 only. The newspapers are: Algemeen Dagblad, Alphens Dagblad, Brabants Dagblad, Dagblad voor West-Friesland, Eindhovens Dagblad, De Gelderlander, De Gooi- en Eemlander, Haagsche Courant, Haarlems Dagblad, Leidsch Dagblad, NRC (Handelsblad), Het Parool, De Standaard, De Stem, De Telegraaf, Trouw, Utrechts Nieuwsblad, Volkskrant, Het Vrije Volk.

2.2.3 PERIODICALS

Magazines

Short texts from 14 different magazines, from the collection Materiaalverzameling Noord (MNM), 1985 through 1989. The magazines are: Arts en Auto, Bloembollencultuur, Chemisch Magazine (CM), Consumentenbond, Intermagazine (IM), Intermediair (IN), Kampioen, Legerkoerier, Natuurmonumenten, Op pad, Op Stap, Toeristen, Tussen de Rails (TR), Vinyl.

Local Papers

Short texts from 8 different local papers, from the collection Materiaalverzameling Noord, Part 1 files (containing short texts of 109 through 453 words), 1986 through 1988. The local papers are: Bollenstreek, De Leidse Post, Groot Voorschoten, Nieuwsweek, Oegstgeester Courant, Stadsblad, Warmonder, Zuilen.

Short texts from 21 different local papers, from the collection Materiaalverzameling Noord, Part 2 files (containing short texts of 15 through 108 words), 1985 through 1989.


The local papers are: Bloembollenkrant, Bollenstreek, De Echo, De Leidse Post, De Stem van Dordt, De Warmonder, Groot Leiderdorp, Grens en Maas, Groot Utrecht, Groot Voorschoten, Leiderdorps Weekblad, Leids Nieuwsblad, Merwekoerier, Merwesteyn, Nieuwsweek, Oegstgeester Courant, Stadsblad, Warmonder, Warmonder Courant, Woensdag, Zuid-Holland Post, Zuilen.

2.2.4 MISCELLANEOUS

Ephemera

Short texts from Ephemera, from the collection Materiaalverzameling Noord. The short texts have been compiled into files covering one year each. The corpus comprises only the files of 1985 and 1986. The complete collection of leaflets (ephemera) counts over 200 different sources.

Other

The Eight-o’clock news, NOS, Hilversum. 30 monthly files of texts to be read out in TV-news broadcasts for a general audience:

• 9 files of 1992 (all months except June, August and November)
• 7 files of 1993 (April, May, June, July, August, September, November)
• 7 files of 1994 (January, February, March, June, September, October, November)
• 7 files of 1995 (June, July, August, September, October, November, December)

Jeugdjournaal, NOS, Hilversum. 44 monthly files of texts to be read out in TV-news broadcasts for Youth:

• 10 files of 1991 (January - June and September - December)
• 10 files of 1992 (January - June and September - December)
• 10 files of 1993 (January - June and September - December)
• 9 files of 1994 (January - June and September - November)
• 5 files of 1995 (July - November)
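The monthly file counts listed above for the two NOS broadcast collections can be cross-checked with a small inventory. The following Python sketch is only an illustration: the dictionary layout and the helper names (`month_range`, `all_except`, `count_files`) are assumptions made for this example and are not part of the corpus distribution. The months themselves are transcribed from the listing.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def month_range(first, last):
    """Return the calendar months from `first` through `last`, inclusive."""
    return MONTHS[MONTHS.index(first):MONTHS.index(last) + 1]


def all_except(*excluded):
    """Return all twelve months except the ones named."""
    return [month for month in MONTHS if month not in excluded]


# Eight-o'clock news (general audience), as listed per year.
GENERAL_AUDIENCE = {
    1992: all_except("June", "August", "November"),
    1993: ["April", "May", "June", "July", "August", "September", "November"],
    1994: ["January", "February", "March", "June", "September", "October", "November"],
    1995: month_range("June", "December"),
}

# Jeugdjournaal (Youth), as listed per year.
JAN_JUN_SEP_DEC = month_range("January", "June") + month_range("September", "December")
YOUTH = {
    1991: JAN_JUN_SEP_DEC,
    1992: JAN_JUN_SEP_DEC,
    1993: JAN_JUN_SEP_DEC,
    1994: month_range("January", "June") + month_range("September", "November"),
    1995: month_range("July", "November"),
}


def count_files(inventory):
    """Total number of monthly files in an inventory (one file per month)."""
    return sum(len(months) for months in inventory.values())


print(count_files(GENERAL_AUDIENCE))  # stated total: 30
print(count_files(YOUTH))             # stated total: 44
```

Under these assumptions, the per-year month lists add up to the stated totals of 30 and 44 files.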

3 Text-organizing principles

The choice between samples and full texts was left to the PAROLE partners. Apart from that, reusability and conservation of legacy data were two of PAROLE’s leading principles. These facts account for the actual text characteristics of the sources.

For the Book category, full texts were available. Apart from the ‘body’, relevant front and back matter such as prefaces and lists of notes have been included.

As for Newspaper, the corpus includes two collections of quotations (short texts) from a variety of newspapers. The quotations from the six different newspapers of the MN_collection have been listed in files per newspaper per year. The CVNP(S)-collection, however, contains, among others, quotations from nearly twenty different newspapers. Due to this large number of newspapers and the rather unbalanced coverage per newspaper, the newspaper citations of this collection have only been listed in files per year.

Periodical - Local Papers. The short text material from local papers has been split in two: large quote-files (counting 109-453 words) from 8 different local papers and small quote-files (counting 15-108 words) from 22 different local papers. The quotations have been listed per local paper per year.

Periodical - Magazines. Short texts from 14 different magazines are listed together in files per year.

For Miscellaneous - Other (television broadcast texts), each news text file covers a month of daily issues grouped together. Due to proportional restrictions, the available amount of news texts for a general audience has been reduced to files covering 7 or 8 months per year instead of twelve. Apart from these, the Miscellaneous section contains Ephemera quotations taken from over 200 different leaflets, listed in year files. Due to the large amount of material, only two year files have been included in the corpus: 1985 and 1986.

4 Use of the Corpus DTD

The following encoding elements figure in the PAROLE Distributable Corpus:

ELEMENT                             DESCRIPTION

<bibl>                              Bibliographical reference
<bibl rend=*>                       Idem, with rendition
<body>                              Text body
<caption rend='foot centered'>      Caption, rendition = foot centered
<cit>                               Citation containing a <bibl> and a <quote>
<date>                              Date
<distinct time='17th century'>      Text division different from surrounding context in time
<distinct type=editorial>           Editorial text element (legacy data), marked in order to be excluded from word counting
<div1 n=*>                          Highest text division
<div1 type=L-1 n=*>                 Highest text division on text level 1
<div1 type=chapter n=*>             Chapter on text level 1
<div1 type=list n=*>                List on text level 1
<div1 type=text n=*>                Text division on text level 1
<div2 type=L-2 n=*>                 Embedded text division on text level 2
<div2 type=list>                    Embedded list on text level 2
<div2 type=subchapter n=*>          Subchapter on text level 2
<div3 type=L-3 n=*>                 Embedded text division on text level 3
idem 4-5                            Idem for text levels 4-5
<div6 type=L-6 n=*>                 Embedded text division on text level 6
<front>                             Front matter before start of body
<gap descr=diagram>                 Point where text has been deleted for technical or editorial reasons, with a description of the deleted division
<gap reason=missing>                Idem, with the reason for deletion of a text division
<group>                             Composite text division grouping together a sequence of distinct texts
<head>                              Text or division title
<head rend=*>                       Idem, with rendition = bold capitals, large, very large centered; * = 'BO-CA', 'LA', 'V-LA-BX'
<hi rend='*indent'>                 Highlighting of layout characteristics: 1-5 indents; * = 1, 2, 3, 4, 5
<hi rend=*>                         Idem for centered - small capitals, italics, small, and small capitals; * = 'BO', 'BO-SM-CA', 'IT', 'SM', 'SM-CA'
<item>                              Item of list
<item n=*>                          Idem, with number of item
<l>                                 Line
<l n=*>                             Idem, with number of line
<label>                             Label associated with an item in a list, or editorial element in short text files indicating the keyword(s) of a quotation
<lb>                                Line break
<list n=*>                          Any sequence of items organised as a list, with number of list
<list type=ordered>                 Idem, of type ordered (numbered or lettered)
<note>                              Note reference
<note n=*>                          Idem, with number of note
<num>                               Number
<num rend='RO-CA' value=*>          Idem, with rendition 'roman capitals' and a value translating the roman number into an arabic number
<p>                                 Paragraph
<p rend='*indent'>                  Idem, with indentation of 1 or 2 indents; * = 1, 2
<p rend='1 lrindent'>               Idem, with indentation of 1 left-right indent
<p rend='*tab'>                     Idem, starting with 2 or 3 tabs; * = 2, 3
<quote>                             Quotation
<sic>                               Text element foreign to the actual text
<text>                              Contains a single text of any kind
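The role of <distinct> in word counting can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in this documentation: that a corpus file has first been normalised to well-formed XML with quoted attribute values (the attribute notation in the list above is unquoted, SGML-style), that a word is any run of word characters, and that the content of every <distinct> element is skipped. The sample text and the function name `count_words` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample using elements from the list above.
SAMPLE = """<text>
  <body>
    <div1 type="text" n="1">
      <head rend="BO-CA">Voorbeeld</head>
      <p><distinct type="editorial">trefwoord</distinct> De taxi rijdt voor.</p>
      <p>Wij zien zijn benen.</p>
    </div1>
  </body>
</text>"""

# Assumption: a "word" is a maximal run of word characters.
WORD = re.compile(r"\w+")


def count_words(element, skip_tag="distinct"):
    """Count words under `element`, ignoring the content of `skip_tag` elements."""
    if element.tag == skip_tag:
        return 0
    total = len(WORD.findall(element.text or ""))
    for child in element:
        total += count_words(child, skip_tag)
        # Text following a child element belongs to the parent, so it is
        # counted even when the child itself is skipped.
        total += len(WORD.findall(child.tail or ""))
    return total


root = ET.fromstring(SAMPLE)
print(count_words(root))                 # 9: the editorial keyword is excluded
print(count_words(root, skip_tag=None))  # 10: nothing is excluded
```

The difference between the two counts corresponds to the words between "distinct-tags", which some of the word-count tables in this documentation exclude and others include.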

5 Linguistic Annotation

5.1 PAROLE Distributable Corpus: numbers of words tagged and checked

BOOKS
  Source files:                  Van Sterkenburg: * Wdltwdb., ch. 1-3; idem, ch. 4-10.3; WNT-portret, ch. 1-4; idem, ch. 5-26
  Timespan:                      1984; 1984; 1992; 1992
  PoS/Type tagged and checked:   49,787; 55,231
  Maximum granularity:           9,752; 4,983

NEWSPAPERS
  Source files:                  ** Short Newspaper texts: MN newspaper collection
  Timespan:                      1986-1988
  PoS/Type tagged and checked:   -
  Maximum granularity:           22,226

PERIODICAL
  Source files:                  ** Short texts from Local Papers: Part1-files; Part2-files
  Timespan:                      1986-1987; 1985-1989
  PoS/Type tagged and checked:   47,072
  Maximum granularity:           5,169

MISCELLANEOUS
  Source files:                  ** Texts to be read out in TV-news broadcasts for general audience: 1-6 Dec.; 24-30 June/1-7 July; idem for Youth: 1-9 November; 1-28 Sept./2-13 Oct.
  Timespan:                      1995; 1995; 1995; 1995
  PoS/Type tagged and checked:   20,920; 36,353
  Maximum granularity:           8,788; 8,889

TOTAL
  PoS/Type tagged and checked:   209,363
  Maximum granularity:           59,807

* In this book, 5,936 tokens (among which the content of notes) have not been tagged at all.
** The figures include words between "distinct-tags".
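The two column totals of the table above can be checked arithmetically. The Python sketch below is an illustration only. The grouping of figures per medium follows the order in which they appear in the table. Placing the newspaper figure (22,226) in the maximum-granularity column is an assumption, adopted because it is the only placement under which both columns sum to the stated totals.

```python
# Word counts checked for PoS/Type, per medium, in table order.
POS_TYPE_CHECKED = {
    "Books": [49_787, 55_231],
    "Periodical": [47_072],
    "Miscellaneous": [20_920, 36_353],
}

# Word counts checked for maximum granularity, per medium, in table order.
MAX_GRANULARITY = {
    "Books": [9_752, 4_983],
    "Newspapers": [22_226],  # assumption: this figure belongs in this column
    "Periodical": [5_169],
    "Miscellaneous": [8_788, 8_889],
}

# Totals as stated in the TOTAL row of the table.
STATED_TOTALS = {"pos_type": 209_363, "max_granularity": 59_807}


def column_total(column):
    """Sum all word counts in one column of the table."""
    return sum(sum(counts) for counts in column.values())


print(column_total(POS_TYPE_CHECKED) == STATED_TOTALS["pos_type"])        # True
print(column_total(MAX_GRANULARITY) == STATED_TOTALS["max_granularity"])  # True
```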

5.2 Procedure

Over 250,000 words of corpus texts (with TEI markup suppressed) have been PoS-tagged automatically.

A total of 59,798 running words has been manually corrected and checked at least twice with respect to maximal granularity, according to a lexicographer’s manual. The extra 9,000 words over the required 50,000 words compensate for the occurrence of ca. 5,300 ‘keywords’ in the original texts. The fully corrected material has been subjected to an automated post-control operation, which checked the pertinence relations between the various feature values and instantiated default values wherever a mismatch (indicating a correction error) was found. This material has been pasted back into the PAROLE/TEI-tagged text files.

Ca. 200,000 words have been checked once for PoS and type. In addition to the required PoS, type was checked for reasons of quality. This material has been subjected to an automated correction procedure addressing the feature slots (positions) beyond the first two (PoS and type), so as to resolve discrepancies between the manually corrected PoS and type and the possibly erroneous, automatically assigned values of the remaining slots.

5.3 Manual Checking of PoS Tagging

The manual check of the automatic PoS tagging has been guided by three principles.

The main principle is the ‘functional approach’ in PoS assignment. This implies transcategorisation of a number of parts of speech such as verbs (infinitives and past participles) and adjectives. For example, whenever an infinitive functions as a noun, it has been tagged as a noun:

• Kinderen hielpen schrijver bij maken van boek (Jeugdjournaal: headline; ‘Children helped writer with making of book’)

Here, ‘maken’ is an infinitive, but functions as a noun and is tagged ‘noun’.


Another principle is the ‘non-inheritance of gender and number features inside Noun Phrases’ for Articles, Determiners and Adjectives, because of their different anaphoric behaviour and/or because these features are not morphologically expressed:

• De laatste kans (‘The last chance’)

In this example, ‘de’ is a definite article and has a multi-label for Gender and no label for Number; ‘laatste’ is an adjective and is only tagged with the label ‘inflected’; ‘kans’ is a noun and is tagged for masculine+feminine and singular.

• Wij zien zijn benen (‘We see his legs’)

Here, ‘zijn’ is a possessive determiner referring to a man and has a hyphen for Gender and the feature singular for Number; ‘benen’ is a noun and is tagged for neuter and plural.

The functional approach cannot be applied in cases falling under the third principle, which is the ‘separate tagging’ of words in multi-word units (like ‘ter sprake’, ‘Bill Clinton’) and of words which are ‘part of separable verbs’. From a practical point of view, our tagset has no provisions for discontinuous tagging (e.g. the two parts of a separable verb can be separated from each other by an indefinite number of words):

• het ter sprake komen van de plannen (‘the coming up for discussion of the plans’)

The members of the multi-word unit ‘ter sprake’ (‘ter’ and ‘sprake’) are not tagged as being part of a collocation, but as separate words (as adposition and noun, respectively); ‘komen’, which is the ‘head of the NP’, is tagged for the function it has: ‘noun’ (cf. above).

• de taxi rijdt voor (‘the taxi pulls up’)

Here, ‘voor’ is part of the verb ‘voorrijden’ and is tagged ‘adposition’.

• hij gaat vanmiddag met de laatste mensen …. mee (‘he is going along with the last people …. this afternoon’)

Here, ‘mee’ is tagged ‘adverb’.