Leuven English Old to New (LEON): Some ideas on a new corpus for longitudinal diachronic studies

Peter Petré
Functional Linguistics Leuven (FLL)
Research Foundation Flanders (FWO Vlaanderen)

MMECL, 6-9 July 2009, Innsbruck

Topic

• The last decades have witnessed an explosion of diachronic corpora of English.
• And yet a comprehensive corpus covering the entire documented history of English is still lacking.
• In this paper I discuss the architecture of one possible such corpus, provisionally called Leuven English Old to New (LEON).
• The primary goal of LEON is to facilitate the study of longitudinal diachronic developments, such as grammaticalization, lexicalization and constructional change.
• Note: I started compiling LEON because, as a (naive) user, I (like others) did not see why the gap was there (I do now) and was frustrated by it. I am not a specialist in corpus compilation.

Introduction

Overview

• The present situation: a big evil
  ◊ A plethora of corpora, not very compatible with one another;
  ◊ Combining existing corpora: the only option, but not a very good one.
• LEON: an attempt at a lesser evil
  ◊ Basic architecture: 400k + 600k per period;
  ◊ Text make-up;
  ◊ Part A: collects material comparable to reference period 1251-1350;
  ◊ Part B: aims at increasing overall representativeness;
  ◊ Meta-corpus & new material;
  ◊ Reference period 1251-1350: overview;
  ◊ Present and future;
  ◊ Problems to be overcome.
• Conclusion

A plethora of corpora

• A selection of major corpora used in diachronic studies
  ◊ HC (730-1710; Rissanen & Kytö 1993): set the standard. It covers nearly a millennium of English, is compiled along sound principles, and contains useful meta-information (especially for the sociolinguist). A particularly interesting feature of the HC is its introduction of the ‘prototypical text type’.
  ◊ ARCHER (1650-1990)
  ◊ CLMETEV (1710-1920): based on texts in the public domain (many novels)
  ◊ LAEME (1150-1325): lemmatized MS samples, intended for dialectology
  ◊ ICAMET-letters (1386-1688); Lampeter tracts (1640-1740); MEMT (1330-1500)
  ◊ HC of Older Scots (1450-1700): covers a single dialect
  ◊ BLOB (1931), LOB (1960) & FLOB (1990): cover one year each
  ◊ Parsed corpus series (see next slide)
  ◊ ...

(For citation info, see CoRD: http://www.helsinki.fi/varieng/CoRD/index.html)

Present situation

The gap

• The situation for English is better than for most other languages.
• Nevertheless, a principled corpus covering the entire documented history of English is still lacking.
  ◊ [Some attempts have been made to remedy this, e.g. CONCE (1700-1900; Kytö & Rudanko 2000) as a follow-up to the HC.]
• Such a corpus is sometimes described as one of the most attractive aims of corpus linguistics (Rissanen 2000: 13).
• It seems that corpus compilers are
  ◊ unwilling to consider compiling such a ‘long and fat’, all-genre corpus;
  ◊ highly (?too) aware of the problems involved.

Problems

The parsed corpus series

• A comprehensive corpus in the making?
  ◊ YCOE (730-1150): prose
  ◊ YPC (730-1150): verse
  ◊ PPCME2 (1150-1500): prose
  ◊ PPCEME (1500-1710): prose
  ◊ PCEEC (1410-1700): letters
  ◊ PPCMBE (1710-1910): prose

Problems with the parsed corpus series

• While the HC carefully distinguishes different MSs of the same text and puts them in separate files, YCOE, for example, is largely a full-text corpus whose text division is based on editions rather than MSs.
  ◊ cobede.o2 is based on Miller, Bede’s Ecclesiastical History of the English People, an edition that ‘mixes’ Bodleian MS Tanner 10 (‘T’), from ct. 10a, with Cambridge University Library MS Kk 3. 18 (‘Ca’), dated ct. 11b.
• The various parsed corpora are self-contained and were not made with a single view in mind.
  ◊ For instance, the OE YCOE corpus is heavily dominated by West Saxon (particularly Ælfric), while Southern texts are hardly present at all in the early ME part of the PPCME2;
  ◊ The same problem holds, to a lesser extent, for the HC.

Unwelcome consequences of the gap (1)

• The existing corpora allow for high-quality studies whose scope is limited in time or genre (OE, ME, letters).
• However:
  ◊ comparisons between periods and distinct genres (such as Middle English verse romances and Late Modern English novels) are made all the time;
  ◊ many linguists dealing with longitudinal developments such as grammaticalization need to cover very long time spans, and are forced to combine several widely different corpora (e.g. Hilpert 2008, Los 2005, Van linden 2009, and many others);
  ◊ this situation sometimes leads to problematic analyses.

Unwelcome consequences of the gap (2)

• Gries and Hilpert (2008) try to determine qualitative stages in the development of linguistic phenomena in a data-driven, bottom-up way.
• They observe that describing change within the a priori periodization/genre division provided by the corpus is risky.
• In themselves, their arguments are very valuable (see their talks here at MMECL).
• However, at least one of their case studies is problematic:
  ◊ Hilpert (2008) claims that shall goes through a process of subjectification, evidenced by the increased distinctiveness in late ModE of verbs such as suppose, mention, inquire, explain, and observe.
  ◊ Cluster analysis of shall's collocations over time shows that a major shift in this development takes place at about 1710.
  ◊ However, 1710 is precisely the date at which one corpus they use (PPCEME) ends and a second, rather different one (CLMET) begins.
  ◊ This fact compromises their results (even if they might still hold).
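The concern about 1710 can be turned into a routine sanity check: for every boundary between consecutive periods, record whether the source corpus changes there, and see whether the largest shift in the measure falls on such a boundary. The following is a minimal Python sketch of that idea. The function names, the period layout and all numeric values are invented for illustration; they are not data from Hilpert (2008) or Gries and Hilpert (2008), and the check is far simpler than their clustering method.

```python
from typing import List, Tuple

# Each entry: (period start year, source corpus, some per-period measure).
# All values below are hypothetical placeholders, not real corpus results.
Series = List[Tuple[int, str, float]]


def boundary_shifts(series: Series) -> List[Tuple[int, float, bool]]:
    """For each boundary between consecutive periods, return
    (start year of the later period, absolute shift, corpus changed?)."""
    out = []
    for (_, corpus_a, value_a), (year_b, corpus_b, value_b) in zip(series, series[1:]):
        out.append((year_b, abs(value_b - value_a), corpus_a != corpus_b))
    return out


def largest_shift_is_confounded(series: Series) -> bool:
    """True if the biggest shift falls exactly where the source corpus changes."""
    _, _, changed = max(boundary_shifts(series), key=lambda b: b[1])
    return changed


example: Series = [
    (1500, "PPCEME", 0.10),
    (1570, "PPCEME", 0.12),
    (1640, "PPCEME", 0.13),
    (1710, "CLMET", 0.30),  # corpus switch and large jump coincide
    (1780, "CLMET", 0.32),
]

for year, shift, changed in boundary_shifts(example):
    print(year, round(shift, 2), "corpus changes" if changed else "same corpus")
print("largest shift confounded:", largest_shift_is_confounded(example))
```

A positive result does not show that the change is an artefact; it only flags that the data cannot distinguish a real change from a difference between corpora.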

LEON: an alternative

• With the available material, the results of longitudinal studies are easily compromised.
• The problem will never disappear entirely, but some improvements could readily be made if a single corpus, made with a single view in mind, existed.
• So I have (naively) started compiling such a corpus, provisionally called Leuven English Old to New (LEON).

A proposal from a corpus user

LEON: basic architecture

Period       Part A        Part B
-950         200k          100k
951-1050     400k          600k
1051-1150    200k          100k
1151-1250    400k          100k
1251-1350    400k          100k
1351-1420    400k          600k
1421-1500    400k          600k
1501-1570    400k          600k
1571-1640    400k          600k
1641-1710    400k          600k
1711-1780    400k          600k
1781-1850    400k          600k
1851-1920    400k          600k
1921-1990    400k          600k
1991-        400k          600k
Total        5.6 million   7 million

Whole LEON: 12.6 million words
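The totals in the architecture table can be checked with a few lines of Python. The per-period figures are copied from the table; only the variable names are my own.

```python
# Planned sizes in thousands of words per period: (Part A, Part B),
# copied from the architecture table.
PLAN = {
    "-950": (200, 100),
    "951-1050": (400, 600),
    "1051-1150": (200, 100),
    "1151-1250": (400, 100),
    "1251-1350": (400, 100),
    "1351-1420": (400, 600),
    "1421-1500": (400, 600),
    "1501-1570": (400, 600),
    "1571-1640": (400, 600),
    "1641-1710": (400, 600),
    "1711-1780": (400, 600),
    "1781-1850": (400, 600),
    "1851-1920": (400, 600),
    "1921-1990": (400, 600),
    "1991-": (400, 600),
}

total_a = sum(a for a, _ in PLAN.values())
total_b = sum(b for _, b in PLAN.values())

# Totals in thousands of words: Part A, Part B, whole corpus.
print(total_a, total_b, total_a + total_b)  # 5600 7000 12600
```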

LEON: Text make-up

• Text files will include (at least) the following meta-information:

<Author : X>
<Contemporaneity : Contemp>
<Date of original : c1330>
<Dialect : East Midlands>
<DOE/MED stencil : not in MED>
<Edition : Hunt and Benskin. 2001. [...]>
<File name : cmcompen.txt>
<Foreign original : X>
<Genre : Handb medicine>
<Manuscript : Corpus Christi College Cambridge, 388>
<Manuscript date : c1330>
<Period HC: m2>
<Period LEON : 1251-1350>
<Place : Oxborough/Ely>
<Prototypical text category : Instr sec>
<Relationship to foreign original : X>
<Sample : All English parts>
<Text name : First and second corpus compendium>
<Verse or prose : Prose>
<Word count : 14243>
<Word count foreign passages : 306>
<Source : MEMT>

• Emendations are complemented with MS readings wherever possible (scribal information will be added too).
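A header of this kind is easy to read with a simple script. The Python sketch below assumes that every tag has the shape `<Name : value>` as on the slide, and that angle brackets do not occur inside values; both are assumptions about the file format, which is not specified further here. The function name `parse_header` is hypothetical.

```python
import re
from typing import Dict

# One tag looks like <Name : value>. The space before the colon is treated
# as optional because the slide shows both "<Author : X>" and "<Period HC: m2>".
TAG = re.compile(r"<\s*([^:<>]+?)\s*:\s*([^<>]*?)\s*>")


def parse_header(text: str) -> Dict[str, str]:
    """Collect all <Name : value> tags in `text` into a dictionary."""
    return {name: value for name, value in TAG.findall(text)}


# A few of the tags from the slide, run together as they might be in a file.
sample = (
    "<Author : X> <Period HC: m2>"
    "<Period LEON : 1251-1350> <Dialect : East Midlands>"
    "<File name : cmcompen.txt> <Word count : 14243>"
    "<Word count foreign passages : 306>"
)

meta = parse_header(sample)
print(meta["Period LEON"], meta["Dialect"], int(meta["Word count"]))
```

Keeping the header this flat means that selecting files by period, dialect or text category needs nothing more than a dictionary lookup.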

LEON: Part A (1)

• Primary focus on comparability of genre distribution between periods;
• A selection of the available material from the less well-documented period 1251-1350 will serve as a template;
• The template is structured along the following lines:
  ◊ the corpus will be strictly balanced between past tense and present tense (copulas can serve as an indicator);
  ◊ genre: wherever possible, texts from genres represented in this period, such as ‘sermon’ or ‘handbook, medicine’, will be matched with parallel texts in the other periods;
  ◊ prototypical text type: where an exact genre match is not possible, the parallelism between periods will be based on the concept of ‘prototypical text type’, as conceived in the HC;
  ◊ genre division will necessarily be rough; a detailed feature division (‘public’ <> ‘private’, ‘formal’ <> ‘informal’ ..., see e.g. Wright 1994) is not realistic.
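The two-step matching described above (exact genre first, prototypical text type as a fallback) could be sketched as follows in Python. This is only an illustration of the selection logic, not the procedure actually used for LEON; the class, the function name and the two candidate texts are hypothetical, and in practice the choice also involves dialect, size and availability.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Text:
    name: str
    genre: str       # e.g. "Handb medicine"
    text_type: str   # prototypical text type, e.g. "Instr sec"


def find_parallel(template: Text, candidates: List[Text]) -> Optional[Text]:
    """Prefer a candidate of the same genre; otherwise fall back to
    one with the same prototypical text type; otherwise give up."""
    for cand in candidates:
        if cand.genre == template.genre:
            return cand
    for cand in candidates:
        if cand.text_type == template.text_type:
            return cand
    return None


# Template text from the reference period (labels as on the make-up slide).
template = Text("cmcompen.txt", "Handb medicine", "Instr sec")

# Hypothetical candidates from another period; names and labels are invented.
candidates = [
    Text("hypothetical_sermon.txt", "Sermon", "Instr rel"),
    Text("hypothetical_handbook.txt", "Handb other", "Instr sec"),
]

match = find_parallel(template, candidates)
print(match.name if match else "no parallel text found")
```

Here no candidate shares the genre, so the handbook is selected on the strength of its prototypical text type.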

LEON: Part A (2)

• Secondary focus on comparability of dialect distribution between periods.
• This requires a selection of material that differs from the usual one.
• Comparing OE and ME is problematic because of dialect differences (Milroy 1996: 167):
  ◊ YCOE (& OE HC) is heavily West Saxon <> PPCME2/ME HC is mostly Midlands.
  ◊ To maximize comparability, LEON (see Petré & Cuyckens 2008):
    includes all smaller non-WS, non-literary texts;
    excludes much of Ælfric for the OE part;
    includes lesser-known (trin323)/non-literary Southern texts for the period 1151-1250.

LEON: Part A (3)

• That such a pragmatic strategy improves the material is suggested when the loss of worth ‘become’ is charted in LEON as against YCOE+PPCME2 (see Petré & Cuyckens 2008).

LEON: Part A (4)

• A different problem in (E/L)ModE (1500-1800):
  ◊ Standard English dominates, and the south of England in particular is underrepresented.
  ◊ The sparse material that exists is often omitted altogether from the corpora.
  ◊ For southern (and other) dialects, LEON could make use of
    (i) books printed by local printers, e.g. in Worcester or Tavistock (Boethius’ Consolation was printed there in 1525 and is available through EEBO; see Donaghey et al. 1999);
    (ii) bibliographies & samplers of historical dialect material (e.g. Wakelyn 1986);
    (iii) the corpus of Witness Depositions (underway, see Kytö, Walker & Grund 2007);
    (iv) [additional archival material (from local archives).]
  ◊ After 1800, dialect writing can be found more easily, and the problem is less pressing:
    (i) pauper letters (e.g. Fairman 2000).
• Not related to dialect: in order to align later stages with earlier ones, some archaic texts will have to be included in the ModE periods (e.g. modernized editions of works originally written a century earlier).

LEON: Part B

• A self-contained Part B is envisaged at a later stage:
  ◊ For the OE and early ME periods, Part B will vary in size and is meant primarily to increase the size of the corpus, making use of whatever material there is.
  ◊ After 1350, Part B will consist of a self-sufficient 600,000-word corpus.
  ◊ It will make up for the lack of some genres (e.g. (news)letters, diaries, drama, ...) and ideally will provide social stratification/gender/age features.

LEON: a meta-corpus

• LEON is only possible because it can build on existing corpora.
  ◊ This is evident for the earlier stages, where most of the texts are already included in at least one corpus;
  ◊ Although corpora of more recent language stages are larger, many genres represented in them have no ready counterpart in earlier stages and cannot be used. However, the little that is useful might well be enough, e.g. the sermon section in the BNC.
  ◊ LEON still differs considerably from a combination of existing corpora: the idea is to mine them selectively in order to compile a ‘principled’ list of texts to be included (allowing for all kinds of unavoidable compromises).

LEON: new material

• Not everything that is to be included is already part of a corpus:
  ◊ For early ME, at least one major prose MS, as well as a briefer narrative text and many prose fragments (for which see Laing 1993), have never been edited, a highly undesirable situation given the scarcity of prose data for this period.
    Unedited: Bodleian, Rawlinson B.520, dated a1325, ?ca. 25000 words;
    Unedited: *Anc.Pet.(PRO) SC 8-192, dated (1344), ca. 1000 words;
  ◊ Several edited texts are not (or only partially) available in a corpus in the strict sense, but are stored in a text database such as DOEC, MEC, EEBO, TEAMS, ...;
  ◊ Some have not previously been made available electronically:
    Paper edition: Fridner, E. 1961. An English Fourteenth Century Apocalypse Version with a Prose Commentary, LuSE 29, dated c1350, ca. 25000 words;

Reference period 1251-1350: overview (1)

• The reference period is at present in a beta stage; some changes still have to be made (e.g. some of the southern material will be replaced by northern material).
• Total number of texts at present: 52
• Sources of the texts:
  ◊ Existing corpora
    HC                          14 texts
    LAEME                       11 texts
    PPCME2                       3 texts
    MEMT                         1 text
  ◊ Existing text databases
    DOEC                         1 text
    MEC                          6 texts
    MED                          5 texts
  ◊ Other
    Paper edition                7 texts
    MS/Digital facsimile         2 texts
    Digitized journals/books     2 texts

Reference period 1251-1350: overview (2)

• Division between prose and verse:
  ◊ For the reference period, this will be 200,000 words each;
  ◊ For the other ME periods & OE3, the division will be 300,000 words of prose to 100,000 of verse;
  ◊ For the periods after 1500, there will be even less verse (probably only lyrics).
• Why?
  ◊ The ME verse that will be included consists of verse chronicles & legends;
  ◊ It is mostly narrative and prosaic in character (e.g. Robert of Gloucester’s Chronicle);
  ◊ This genre becomes less frequent from late ME onwards, prose being preferred instead;
  ◊ ME lyrics are included if they fill important dialect gaps or form a considerable part of the available corpus (as for the period 1151-1250); they can be matched by popular music in ModE (or even nursery rhymes).

Reference period 1251-1350: overview (3)

• These verse texts are sufficiently comparable to later prose texts.
  ◊ This is illustrated by the minimal differences between two 14th-century psalters, one in verse, one in prose (from psalm 101):

Prose (a1425(c1340) Rolle Psalter)
3. In what day that i hafe inkald the: swiftly thou here me.
4. For my dayes failyd as reke: and my banys as kraghan dryid.
5. Smytyn i am as hay and my hert dryed: for i forgat to ete my brede.

Verse (a1400 NVPsalter)
3. In whatkin dai i kalle þe, Swithlike þan here þou me.
4. For waned als reke mi daies swa, And mi banes als krawkan dried þa.
5. I am smiten als hai, dried mi herte, For i forgate to ete mi brede in querte.

“In what day soever I shall call upon thee, hear me speedily. For my days are vanished like smoke, and my bones are grown dry like fuel for the fire. I am smitten as grass, and my heart is withered: because I forgot to eat my bread.”

Reference period 1251-1350: overview (4)

Dialect                              Words     Share
Northern                               719      0.2%
Midlands (and standard/London)      221266     55.3%
  (West Midlands)                    29809      7.5%
  (East Midlands)                   181362     45.3%
Southern                            178015     44.5%
  (Southern)                         97467     24.4%
  (Kent)                             80548     20.1%

Prototypical text type               Words     Share
Exp                                    461      0.1%
Instr Rel                           131902     33.0%
Instr Sec                            19116      4.8%
Narr Imag                            57435     14.4%
Narr Non-imag                       118207     29.6%
Stat                                 26990      6.7%
X                                    45889     11.5%

Total                               400000    100.0%

Note: some classifications are disputable (in particular, the distinction between Narr Imag and Narr Non-imag is not always very clear).
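The shares in the two breakdowns can be recomputed from the word counts. The short Python check below uses the figures from the slide; the dictionary and function names are my own.

```python
from typing import Dict

# Word counts copied from the slide (reference period 1251-1350, Part A).
DIALECTS = {
    "Northern": 719,
    "Midlands (and standard/London)": 221266,
    "Southern": 178015,
}
TEXT_TYPES = {
    "Exp": 461,
    "Instr Rel": 131902,
    "Instr Sec": 19116,
    "Narr Imag": 57435,
    "Narr Non-imag": 118207,
    "Stat": 26990,
    "X": 45889,
}


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage of the total for each category, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


print(sum(DIALECTS.values()), shares(DIALECTS))
print(sum(TEXT_TYPES.values()), shares(TEXT_TYPES))

# Midlands words not assigned to the West or East Midlands sub-rows.
print(221266 - (29809 + 181362))
```

Both breakdowns add up to the planned 400,000 words. The West and East Midlands sub-rows together fall 10,095 words short of the Midlands total; the slide does not say so, but this remainder may correspond to the standard/London material.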

Present and future

• At present, the corpus exists in an advanced beta stage for the periods -950 through 1251-1350;
• Other periods are being explored, but most of the work remains to be done there.
• There is no intention to tag/parse the corpus; however,
  ◊ large parts of the corpus are already tagged in other corpora (parsed corpus series; LAEME; MEG-C; maybe texts from BNC, LOB, FLOB etc.);
  ◊ it would be great if the remaining parts existed in tagged format too;
  ◊ one way of achieving this would be through a WikiCorpus (or a similar interactive environment), inviting users of the corpus to parse any amount of text they want to.
• The corpus is conceived as dynamic: if interesting new material becomes available, it can replace other material in a new version of the corpus (with its own version number).

Problems to be overcome

• Funding
• The format of the various sources needs to be made uniform:
  ◊ the primary goal is to make searching the corpus as easy as possible without oversimplifying the textual situation;
  ◊ files could be Unicode-based, but with a minimum of special characters;
  ◊ as little markup/syntax as possible will be added to the file format;
  ◊ simple scripts as well as various search engines should be able to search it.
• The corpus should ideally be available to everyone interested;
• Many texts are in the public domain;
• However, copyright issues still have to be tackled:
  ◊ clearing copyright issues of paper editions;
  ◊ finding an agreement with the compilers of those corpora that are used in LEON (you).
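To give an idea of what "searchable by simple scripts" could mean in practice, here is a minimal Python sketch. It assumes a file that consists of angle-bracket header tags followed by plain Unicode text, with no other markup; that layout is an assumption for illustration, since the final file format has not been fixed. The function name `search_text` is hypothetical, and the sample text reuses two lines from the Rolle Psalter quoted earlier.

```python
import re
from typing import List

# Assumption: meta-information sits in <...> tags and the text itself
# contains no angle brackets.
HEADER_TAG = re.compile(r"<[^<>]*>")


def search_text(raw: str, pattern: str) -> List[str]:
    """Strip header tags, then return every match of `pattern` in the text."""
    body = HEADER_TAG.sub(" ", raw)
    return [m.group(0) for m in re.finditer(pattern, body, flags=re.IGNORECASE)]


sample = (
    "<Period LEON : 1251-1350> <Verse or prose : Prose>\n"
    "In what day that i hafe inkald the: swiftly thou here me.\n"
    "For my dayes failyd as reke: and my banys as kraghan dryid.\n"
)

# Spelling variants can be covered with an ordinary regular expression.
hits = search_text(sample, r"\bday(?:es)?\b")
print(len(hits), hits)
```

The less markup the files contain, the less such a script has to strip before searching, which is the point of keeping the format minimal.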

In sum

• Advantages of LEON:
  ◊ Language data from different periods can be compared in a more reliable way;
  ◊ Equal size of corpus periods makes comparison of data (frequencies) easy;
  ◊ Its structure and size are suitable for research in the following areas:
    (?phonology/morphophonology); morphology/morphosyntax; lexical change (lexicalization, semantic change); grammaticalization processes; constructional change in general.
• Limitations of LEON:
  ◊ LEON is Realpolitik: we use what we have, but in a sensible way;
  ◊ The language material is largely limited to certain written genres;
  ◊ Consequently, change may show up later in the corpus than it did in reality;
  ◊ The structure and size of LEON are less suitable for:
    sociolinguistics (at least up to ME); the study of low-frequency items; word-order change; discourse/text analysis.
• For longitudinal studies, LEON might become, I hope, the lesser evil.
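Where periods are of equal size, raw counts can be compared directly; for the two 200k periods in Part A (-950 and 1051-1150) a simple normalization is still needed. The Python sketch below shows the usual per-million-words conversion. The Part A sizes are those given in the architecture table; the hit counts are invented purely for illustration.

```python
# Part A sizes in words for the earliest periods (from the architecture table).
PART_A_WORDS = {
    "-950": 200_000,
    "951-1050": 400_000,
    "1051-1150": 200_000,
    "1151-1250": 400_000,
    "1251-1350": 400_000,
}

# Hypothetical raw counts of some construction; not real corpus results.
raw_hits = {
    "-950": 50,
    "951-1050": 100,
    "1051-1150": 40,
    "1151-1250": 60,
    "1251-1350": 20,
}


def per_million(hits: int, words: int) -> float:
    """Frequency normalized to occurrences per million words."""
    return hits * 1_000_000 / words


for period, words in PART_A_WORDS.items():
    print(period, raw_hits[period], per_million(raw_hits[period], words))
```

With these invented figures, 50 hits in the 200k period and 100 hits in the 400k period come out at the same normalized frequency, which the raw counts alone would hide.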

References

Donaghey, Brian, Irma Taavitsainen and Erik Miller. 1999. Walton’s Boethius: From manuscript to print. English Studies 80. 398-407.

Fairman, T. 2000. English pauper letters 1830–34, and the English language. In D. Barton and N. Hall (eds.). Letter Writing as a Social Practice. Amsterdam: John Benjamins. 63-82.

Gries, Stefan Th. and Martin Hilpert. 2008. The identification of stages in diachronic data: Variability-based neighbour clustering. Corpora 3(1). 59-81.

Hilpert, Martin. 2008. Germanic future constructions: A usage-based approach to language change. Amsterdam & Philadelphia: John Benjamins.

Kytö, Merja and Juhani Rudanko. 2000. Corpus of 19th Century English (CONCE). University of Uppsala / University of Tampere.

Kytö, Merja, Terry Walker & Peter Grund. 2007. English witness depositions 1560-1760: An electronic text edition. ICAME Journal 31. 65-85.

Laing, Margaret. 1993. Catalogue of sources for a linguistic atlas of early medieval English. Woodbridge (UK): Boydell & Brewer.

Los, Bettelou. 2005. The rise of the to-infinitive. Oxford: Oxford University Press.

Milroy, James. 1996. “Middle English Dialects”. The Cambridge history of the English language ed. by Richard Blake, vol. 2, 156-206. Cambridge: Cambridge University Press.

Petré, Peter & Hubert Cuyckens. 2008. The Old English copula weorðan and its replacement in Middle English. In M. Gotti, M. Dossena and R. Dury (eds.) English historical linguistics 2006. Volume I: Historical syntax and morphology. Selected papers from the fourteenth International Conference on English Historical Linguistics (ICEHL 14), Bergamo, 21-25 August 2006. Amsterdam: Benjamins. 23-48.

Rissanen, Matti & Merja Kytö. 1993. General introduction. In Rissanen, Matti, Merja Kytö & Minna Palander-Collin, eds. 1993. Early English in the computer age: Explorations through the Helsinki Corpus. Berlin: Mouton de Gruyter. 1-17.

Rissanen, Matti. 2000. The world of English historical corpora: From Cædmon to computer age. Journal of English Linguistics 28: 7-20.

Van linden, An. 2009. Dynamic, deontic and evaluative adjectives and their clausal complement patterns: A synchronic-diachronic account. PhD dissertation, University of Leuven.

Wakelyn, Martin F. 1986. The Southwest of England (Varieties of English around the world. Text series, 5). Amsterdam: John Benjamins.

Wright, Susan. 1994. The place of genre in the corpus. Merja Kytö, Matti Rissanen, Susan Wright (eds.). Corpora across the centuries: proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine’s College Cambridge, 25-27 March 1993. Amsterdam: Rodopi. 101-110.

Contact information

Peter Petré
Department of Linguistics
University of Leuven
Blijde-Inkomststraat 21
B-3000 Leuven, Belgium
Email: [email protected]
http://wwwling.arts.kuleuven.be/fll
Link to presentation: http://perswww.kuleuven.be/~u0050685/2009_LEON_MMECL.ppt