Digital Editions for Corpus Linguistics
A new approach to creating editions of historical manuscripts

Alpo Honkapohja, Samuli Kaislaniemi, Ville Marttila
University of Helsinki
www.helsinki.fi/varieng/domains/DECL.html

Digital Humanities conference, Oulu, 25-29 June 2008

-Apologies for Alpo & Ville not making it to DH 2008
-Properly, our project could be called: creating digital editions of historical manuscripts which are suitable for corpus linguistic enquiry
-The idea is for linguistically oriented online digital editions of historical manuscripts
-It’s worth emphasising that these editions are not only meant for corpus linguistics!!



Outline of this talk

• Background: historical corpus linguistics
• DECL: theoretical principles
• Current status and the first DECL editions
• A look at a DECL edition

DECL Digital Humanities 2008 26 June 2008

-Before describing the project as it stands now, this presentation will spend some time discussing the rationale behind the project in order to give a better picture of what it is DECL is after

Background 1: Historical corpus linguistics

• Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki
• Aim: to study variation and long-term change in English with the help of electronic corpora
• The Helsinki Corpus of English Texts (1991)
  – multi-genre corpus of English
  – c. 450 texts (c. 1.6m words)
  – spans the years 730-1710
  – available from OTA

-This is where DECL springs from

Background 1: Historical corpus linguistics cont’d

• The Corpus of Early English Correspondence (CEEC; in progress)
  – currently c. 12,000 personal letters (c. 5.1m words)
  – spans the years 1403-1800
  – published subcorpora available from OTA
• The Corpus of Early English Medical Writing (CEEM; in progress)
  – c. 3.75m words (estimate)
  – spans the years c. 1375-1800
  – published subcorpus available from Benjamins (CD-ROM)

-We’ve been involved with these projects, helping compile corpora: Sam with CEEC, Alpo and Ville with CEEM
-These are “2nd gen. corpora”, designed to answer specific research questions
  -CEEC is created for historical sociolinguistics: the influence of social factors on linguistic use
  -CEEM for the scientific thought-styles project: stylistic changes in medical texts
-Both corpora are based on historical manuscripts - in edited, printed form (for the most part)

Background 2: Conventional historical corpora

• Based on printed editions of historical texts
  + Compilation easier and faster than from manuscript sources
  – Link with manuscript originals lost
    • Editorial principles vary
    • Orthography unreliable
    • Manuscript features rarely marked
  – Copyright issues
  – Duplication of effort and errors

-Historical corpora have usually been compiled from a group of sources
-The creation of historical corpora from printed editions has been called “philological outsourcing” (Dollinger, Stefan. 2004. “‘Philological computing’ vs. ‘philological outsourcing’ and the compilation of historical corpora: A Late Modern English test case”. Vienna English Working Papers (VIEWS) 13 (2), 3–23.).
-Contextual information is also not necessarily (or usually) contained in the corpus or in the manual of the corpus (should one exist)

Background 3: Historical corpora based on manuscripts

• Middle English Grammar Corpus (MEG-C)
• Salem Witch Trials Corpus
• Corpus of Scottish Correspondence (CSC)
• A Corpus of Late Eighteenth-Century Prose (L18C)

-MEG-C & Salem were created from edited material, but checked against the original manuscripts
-CSC & L18C were created from mss
-MEG-C: c. 3m words
-Salem: 1,000 records
-CSC: 256,593
-L18C: 300,000 words
-Some problems:
  -CSC and L18C are not editions, which is a shame
  -MEG-C & Salem are great multidisciplinary projects, but like CSC & L18C, they don’t use XML, which arguably would enable easy conversion to other formats

Background 4: Digital editions

• Could be more versatile and user-friendly overall
  – e.g. restricted text searches, hard to manipulate texts
• Have yet to become a norm for publishing historical texts
  – Labour-intensive

-Compared to corpora, editions are, on the other hand, not usually suited for linguistic enquiry - which is also a shame
-Arguably, though, they are becoming a norm – albeit sloooowly
-Lack of publishing infrastructure and user-friendly tools limits individual scholars and small-scale projects
-Cf. use of online text databases for non-intended purposes: e.g. historian John Styles using the Old Bailey Online to study 18th-century interiors
-..In any case, it is to bring the requirements of users of digital editions and linguistic corpora closer together that we have initiated..

The Digital Editions for Corpus Linguistics project (DECL)

• Aims of the project:
  1. To create editions that function as both editions and corpora, allowing equally:
     • comparison of manuscript image and diplomatic transcript
     • textual searches of transcripts and linguistic tags
  2. To create a framework which makes the creation of such editions easy and is readily adaptable to different types of historical texts

-DECL was started in January 2008 by Alpo, Ville and Sam
-Comparison of image and transcript is a typical feature of editions
-Sophisticated text searches, on the other hand, are a requisite of corpus functionality
-Editing historical manuscripts is not easy – and electronic editing even less so, with technical features to learn
-Further, editions usually fall short of expectations
-DECL hopes to help!

The Digital Editions for Corpus Linguistics project (DECL)

• Features:
  – Historical manuscripts
  – Meant for individual scholars and small-scale research projects
  – Using existing standards and tools
  – Open Source principle

-DECL is intended to feed the need for further edited historical material
-Our focus is not on literary material, although we do not exclude it
-DECL is not intended for large-scale, well-funded projects; instead, our concern is for scholars without infinite time and money
-For instance, “PhD corpora” tend to be compiled, then forgotten/lost
-DECL aims to help the creation of versatile digital editions which are compatible with other similar resources
-We do not want to reinvent the wheel - instead, we want to bolster existing (widely approved) standards and good tools
  -e.g. TEI
-We also embrace free software and resources, and will publish DECL editions under a suitable Creative Commons license

The Digital Editions for Corpus Linguistics project (DECL)

Theoretical principles

Theoretical background

• Retaining manuscript reality
• The focus must be equally on artefact, text and context
  Artefact = the physical manuscript
  Text = the linguistic contents of the artefact
  Context = the cultural circumstances relating to the text and the artefact

-The challenge is to represent manuscript reality in a form that suits the needs of historians as well as corpus linguists
  -This includes fields such as historical pragmatics, sociolinguistics and discourse analysis
  -These all need evidence of language production and the social context of the texts
-Note: instead of “document” used at this point in our abstract, we now feel that “artefact” lessens chances of confusion
  -We're still defining and refining terms and concepts, and these are subject to change
-All of these levels are considered equally important facets of the manuscript reality and should be represented in the edition
-We aim at "documentary editing"; our primary concern is to retain authenticity: the link to the original manuscripts

Theoretical background cont'd

• Roger Lass (2004): A historical corpus should:
  – preserve the text as accurately and faithfully as possible
  – convey it in as flexible a form as possible
  – ensure that any editorial intervention remains visible and reversible

-Our starting point is Lass's 2004 article (Lass, Roger. 2004. “Ut custodiant litteras: Editions, Corpora and Witnesshood”. In Methods and Data in English Historical Dialectology, ed. by Marina Dossena and Roger Lass. Bern: Peter Lang, 21–48. [Linguistic Insights 16]), where he argued that historical corpora compiled from editions fail to represent linguistic reality
-Others have voiced the same concerns (for references, see handout)
-However, DECL does not follow Lass's model in transcribing or encoding texts
-DECL is concerned with creating editions from historical manuscripts, and not corpora


Three guiding principles:

flexibility, transparency and expandability


-Based on Lass (2004), we have derived some guiding theoretical principles

Three guiding principles: 1. Flexibility

• XML
  – Based on and compatible with TEI P5 guidelines.
  – Easily convertible into various database and document formats.
  – Text and metadata present in tagging, and combinable into new documents (e.g. subcorpora)
• Layered structure with customisable online interface
  – Allows viewing, searching and downloading relevant aspects of the edition, while leaving others out.
• Platform-independent solutions based on the Open Source principle
  – Tools, texts and tagging can be freely downloaded and modified by users

Three guiding principles: 2. Transparency

• All editorial intervention indicated by markup
  – Explicitly distinguished from the unemended transcription
  – Reversible
• All layers of the edition accessible
  – Manuscript images, raw transcript and annotation
  – Allows users to (re-)evaluate editorial decisions
• Detailed documentation

-Ambiguities are marked as such
-Transcription principles, editorial policies and encoding practices will be thoroughly documented
-DECL guidelines will be a subset of the TEI P5 guidelines
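
To make "visible and reversible" intervention concrete, here is a minimal sketch in Python. It assumes the TEI P5 `choice` element with `abbr`/`expan` and `sic`/`corr` pairs; the sample line, the `VIEWS` table and the `render` function are invented for illustration and are not DECL's actual encoding or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample line; element names follow TEI P5 (choice, abbr, expan, sic, corr).
SAMPLE = (
    "<ab>Take <choice><abbr>þt</abbr><expan>that</expan></choice> "
    "<choice><sic>watre</sic><corr>water</corr></choice> and boyle it</ab>"
)

# Which children of <choice> each view keeps.
VIEWS = {
    "diplomatic": {"abbr", "sic"},   # what the manuscript actually has
    "edited": {"expan", "corr"},     # what the editor proposes
}

def render(elem, view):
    """Return the text of elem, keeping only the readings that belong to view."""
    keep = VIEWS[view]
    parts = [elem.text or ""]
    for child in elem:
        # Inside <choice>, drop the alternative reading; everywhere else recurse.
        if not (elem.tag == "choice" and child.tag not in keep):
            parts.append(render(child, view))
        parts.append(child.tail or "")
    return "".join(parts)

line = ET.fromstring(SAMPLE)
print(render(line, "diplomatic"))
print(render(line, "edited"))
```

Because both readings stay in the markup, either view can be regenerated at any time, and a user who disagrees with an emendation can still see what the editor changed.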

Three guiding principles: 3. Expandability

• Uniform editorial and encoding practices
  – Ensure comparability of DECL editions
  – Allow editions to be combined into corpora
• Modular architecture
  – Allows new documents to be added to editions
• Layered structure
  – Allows new layers of annotation to be added to an edition

DECL editions underway

Alpo Honkapohja: A Digital Edition of MS O.1.77 Trinity College Cambridge
• late Middle English pocket-sized medical handbook
• produced ca. 1460 in London or Westminster
• c. 15 texts, in English and Latin
• will be the first bilingual digital edition of early scientific writing in England

Samuli Kaislaniemi: The Life and Early Letters of Richard Cocks, English Merchant (1600-1610)
• Early Modern English intelligence letters
• written between 1600 and 1610 from Bayonne in France
• c. 125 letters, with some abstracts and other documents

Ville Marttila: Potage Dyvers: A digital edition of a family of late medieval culinary recipe collections
• six closely related Middle English culinary recipe collections
• dating from the 15th century
• over 200 recipes

-These are all PhD theses, or material compiled for our PhD theses
-We would welcome other projects, even at this early stage: more text types would/will lead to better general DECL guidelines
-The first edition (Sam’s) should be done in 2010; all three within 4-5 years
-The DECL framework should be available 2010/2011 (v.1.0); a ‘finished’ version c. 2012

Current state of the project

• Working on:
  – Transcriptions of manuscripts
  – DECL guidelines (from TEI P5)
  – PR
  – Looking for funding

-Transcriptions: Sam has finished round one, Alpo & Ville should finish round one by the end of 2008
-Naturally the texts will need another two rounds of proofreading later on

Things on the To Do list

• Repository for DECL editions
  – University of Helsinki (CARHU – Campus Repository of Humanities and Social Sciences)?
• User interface
  – Online editions, so work with web browsers
• ..things we haven’t even thought of yet – please tell us!!

-The University of Helsinki has recently been very active in developing the sustainability of electronic resources created by the University members
-The University, together with the Finnish National Archives and the National Library, is also actively involved in pan-European developments in the field
-For the user interface - as with nearly everything - we will use existing solutions
-It’s too early to say anything concrete

Copyright issues

• Texts
  – Transcriber holds copyright
• Images
  – Archives generally lenient (?)
  – Libraries tend to demand money
• Software / tools / solutions
  – Open Source, free software, Creative Commons licenses

-We have given initial thought to copyright issues
-The DECL framework and editions will be (inshallah) released under a suitable Creative Commons license
-This can be seen as passing the issue of sustainability to the public domain..
-Yet using existing and emerging standards will help strengthen them

A look at a DECL edition

It’s only a model…

-Ok, let’s have a look at a DECL edition.
-..Having said that, this is very much a mock-up, and technical details are all work-in-progress

[Diagram: the modular and layered structure of a DECL edition, leading from the Original Manuscript through Manuscript Editing to Linguistic Analysis]

Components shown in the diagram:
• Manuscript images: high-resolution digital images
• Manuscript description: catalogue information, documenting hands & abbreviations
• Transcription: diplomatic, graphemic, unemended, original punctuation & word division
• Manuscript structure: pagination, lineation, layout
• Manuscript features: hands, abbreviation, decoration, emendation, annotation
• Textual structure: parts, chapters, paragraphs
• Textual content: text tokenised into uniquely identified word-units
• Editorial notes: textual & explanatory notes
• Links: parallel versions, intertextuality, related texts, glossaries
• Normalised textual content: spelling variants normalised to standard forms
• Lemmatisation
• POS tagging
• Syntactic parsing
• Semantic annotation
• Pragmatic annotation
• Discourse annotation

-Diagram of the modular and layered structure of a DECL edition
  1. Normalisation is part of creating the edition (spelling variation)
  2. Thus, it becomes easy to use automated tools for linguistic annotation
  3. And all this without losing the link to the mss sources

Technical principles

• TEI-compliant XML (P5)
  – Text-type specific features as additional modules to the TEI schema
• Modular structure
  – Layers of annotation
  – Standoff markup

-The XML-based encoding will
  -allow the contents of the editions to be used with any XML-aware tools
  -allow easy conversion to other document or database standards
  -allow easy addition of further layers of annotation

Demo 1: Manuscript facsimile

-A linguistically oriented edition needs to be based on the original manuscript:
  -Images – preferably of better quality than the microfilm reproduction shown here – should be obtained
  -To serve both as an aid to preliminary transcription and as a component of the finished edition
  -This will ensure complete transparency, as the user can verify any editorial readings against the image of the original
-The process begins with producing a raw transcription of the manuscript text

Demo 2: Raw transcript

-The raw transcription contains all of the textual features (abbreviations, superscripts, additions & deletions, etc.) that are to be included in the edition
  -Using shorthand notation (representative symbols) at this stage speeds up keying and proofing
  -Notation should be unambiguous and as far as possible automatically convertible to the final XML
-In this example abbreviations have been indicated by various special characters and their expansion given in parentheses
  -Additions by a later hand have been indicated by boldface
  -Underlinings and strikethroughs indicated by corresponding formatting
-Features that cannot be automatically processed should be indicated clearly so that they can be searched for and marked up manually
-The raw transcript made from a manuscript image should then be proofed against the manuscript, noting down features not apparent in the image

Demo 3: XML encoding

-After the raw transcription has been checked, it will be converted into XML following the DECL guidelines
  -Explicit tagging for lineation, paragraphs and other structural features added
  -Some textual features can be replaced automatically using search and replace
    -Abbreviations, emendations and other features with formal markup
    -The guidelines could even describe a recommended shorthand, for which macros could be provided to automate conversions further
  -Special characters (yogh, ampersand, etc.) replaced by entities or elements (still not decided)
-More complex textual features will be marked up by hand
-Features not expressible in formal markup will be described in textual notes linked to the appropriate locus
-When the textual and visual features of the manuscript originals have been encoded in XML according to the guidelines, the finished XML document can be tokenised and separated into standoff format using a series of XSL Transformations
  -Content, structure and various annotation layers separated either into individual documents or into separate sections within a single document
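
The automatic search-and-replace step described above could look roughly like the following Python sketch. The shorthand itself is invented here for illustration (`form(expansion)` for abbreviations, `[+text+]` for additions, `[-text-]` for deletions) and is not the notation used in the demo. The output elements (`choice`, `abbr`, `expan`, `add`, `del`, `lb`) are standard TEI P5 names, but how DECL will actually use them is an assumption.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical shorthand, invented for this sketch:
#   form(expansion)  abbreviation with its expansion, e.g. wt(with)
#   [+text+]         addition
#   [-text-]         deletion
RULES = [
    (re.compile(r"(\w+)\((\w+)\)"),
     r"<choice><abbr>\1</abbr><expan>\2</expan></choice>"),
    (re.compile(r"\[\+(.+?)\+\]"), r"<add>\1</add>"),
    (re.compile(r"\[-(.+?)-\]"), r"<del>\1</del>"),
]

def encode_line(raw, n):
    """Convert one raw transcript line into an XML fragment with a line-break marker."""
    text = escape(raw)  # protect &, < and > before any tags are introduced
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return f'<lb n="{n}"/>{text}'

def encode(raw_transcript):
    """Wrap all converted lines of a raw transcript in a single paragraph element."""
    lines = raw_transcript.splitlines()
    body = "\n".join(encode_line(line, i) for i, line in enumerate(lines, start=1))
    return f"<p>\n{body}\n</p>"

raw = "Take wt(with) salt & [-peper-] [+safroun+]"
xml_text = encode(raw)
print(xml_text)
ET.fromstring(xml_text)  # raises ParseError if the result is not well-formed
```

Anything the rules do not recognise passes through unchanged, which fits the advice that features needing manual markup should stay clearly searchable in the raw transcript.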

Demo 4: Stand-off markup

-Once the textual content has been separated from the markup and tokenised into explicitly identified word units, a normalised version will be created
  -A semi-automatic process using a tool developed for the purpose
  -A high priority for the project
  -Allows searches to be made using normalised spellings and results returned with original spellings
  -Also facilitates the application of many automatic annotation tools
  -New layers of linguistic analysis can be created using the normalised spellings
  -Automatically become linked with the underlying original spellings and the encoded manuscript features
-From the standoff documents, new documents containing only the desired annotation layers can be created on the fly
-The material can also be converted into database form if needed
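
As a rough illustration of the standoff idea, the Python sketch below gives each word-unit an identifier, keeps normalised spellings in a separate layer keyed by that identifier, and searches by normalised form while reporting the original spelling. The example line, the `LOOKUP` table (standing in for the semi-automatic normalisation tool) and all function names are assumptions made for this sketch, not part of the DECL framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    id: str    # unique identifier that every annotation layer points to
    form: str  # original spelling as transcribed

def tokenise(text, prefix="w"):
    """Split on whitespace and give every word-unit a unique identifier."""
    return [Token(f"{prefix}{i}", form) for i, form in enumerate(text.split(), start=1)]

# Base layer: the transcribed text (an invented example line).
tokens = tokenise("Take fayre watur and boyle hit wyth the watur")

# Standoff layer: normalised spellings keyed by token id, kept apart from the text.
# A hand-made table stands in here for the semi-automatic normalisation tool.
LOOKUP = {"fayre": "fair", "watur": "water", "boyle": "boil", "hit": "it", "wyth": "with"}
normalised = {t.id: LOOKUP.get(t.form.lower(), t.form.lower()) for t in tokens}

def search(query):
    """Find tokens by normalised spelling; report them in their original spelling."""
    return [(t.id, t.form) for t in tokens if normalised[t.id] == query]

def view(layers):
    """Assemble a new document containing only the requested annotation layers."""
    return [
        {"id": t.id, "form": t.form, **{name: layer[t.id] for name, layer in layers.items()}}
        for t in tokens
    ]

print(search("water"))
print(view({"norm": normalised})[:2])
```

A further layer (for example part-of-speech tags) would be one more dictionary keyed by the same identifiers, so it links to the original spellings without touching the base text.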

Demo 5: Interface mock-up

-A browser-based online interface will be provided for end-users to access DECL editions
  -Customisable interface – user can not only navigate the text but also define what aspects of it are displayed and how they will be presented
  -All texts will also be downloadable, both in the original XML and various other formats
    -With selected annotation layers
-A full-fledged search engine and corpus interface will also be presented
  -Possibly based on the CQP-edition of BNCWeb
  -Will enable the combination of different DECL editions into corpora
  -Powerful search facilities and the ability to analyse, manipulate and download search results

To summarize –

• Things to remember about DECL:
  – Small-scale projects
  – Retaining manuscript reality
  – Using existing tools and standards
  – Usability, accessibility, transparency
  – “Remember the linguist”
• “Very simple, but desperately needed”

-The main assets of DECL editions are keeping the editorial chain intact, and working from documented editorial principles, so that manuscript reality gets carried into the text and code of the editions. It is easier for editors of the manuscripts to tag manuscript features and normalise the text than it is for corpus compilers working from edited material.
-Increased workload for editor, but greatly increased benefits too
-DECL does not compile corpora! We help the process, but DECL aims at making digital editions of historical manuscripts
-Multi-purpose resources are what we want to have
-The quote is from Eero Hyvönen in his opening plenary, in describing a software solution his team had created. The same can be said about DECL.

What we hope to achieve

• Increase the usefulness of digital editions
  – Versatile digital resources
  – Better manuscript-based historical corpora
• Better tools and standards
• Increase interdisciplinary cooperation


Thank You!

• www.helsinki.fi/varieng/domains/DECL.html

[email protected]