
The Construction of Anglo-Norman Text Corpus

• Joint Project of the University of Wales, Swansea and the University of Wales, Aberystwyth.

• AHRC-funded.

• Anglo-Norman Online Dictionary

• Anglo-Norman Text Corpus

• http://www.anglo-norman.net

Goal of the Anglo-Norman Hub Text Digitisation Project

• To provide mediaeval linguists and historians with a set of digitised texts and articles that is searchable and fully cross-referenced, both internally and to and from the Anglo-Norman Online Dictionary

Main Challenges facing the Anglo-Norman Hub Project

• Image to text migration for maximum throughput at minimum cost

• Application of markup suitable for rendering and full cross-referencing

• Handling of non-standard character sets (mediaeval abbreviations)

Image to Text Migration Strategies

• Optical Character Recognition

• Re-keying

• Both require subsequent proofreading

• Both allow insertion of appearance metadata as provisional markup

Advantages of Alternative Image to Text Migration Strategies

• OCR
– Rapid processing
– Can be performed by students on-site and can be supervised

• Rekeying
– Less error-prone
– Cheap if outsourced
– Non-standard characters can be represented by combinations
– Image quality less critical
– More consistent output quality

Economic Image to Text Migration: Conclusions

• Re-keying is more economic for the bulk of the mediaeval-language material

• OCR is competitive for modern languages (critical material)

• OCR can also be used for mediaeval language material when required by workflows, provided that
– good image quality can be easily achieved
– the material consists of standard characters

Markup requirements: must

• Conform to widely-accepted standards

• Be capable of encapsulating diverse document structures

• Allow for automation

• Enable internal and external referencing

• Preserve as much appearance metadata as possible

• Not be tied to any one approach to rendering

Document types requiring a variety of XML Structures

• Texts
– Verse
– Prose
– Lists & Tables

• Critical material
– Introductions (conform to prose structures)
– Notes (do not conform to any of the above structures)

Cross-referencing of Critical Matter

• Need to navigate from pointer to note

• Need to navigate cross-references from critical material to specific points in the text or elsewhere in critical material

• Achieved by use of target-id pairs
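A minimal sketch of how target-id pairs might be resolved, assuming the `id`, `target` and `targetEnd` attributes seen in the project's XML extracts. The sample is built from the verse extract in this deck; the `<text>` wrapper, the shortened note wording and the function names are assumptions made for illustration, not the project's actual tooling.

```python
import xml.etree.ElementTree as ET

# Sample assembled for illustration; the <text> wrapper and the shortened
# note wording are assumptions, not part of the project's files.
SAMPLE = """<text>
  <lg n="316">
    <l id="L1261">A Deu del cel ad graciéd</l>
    <l id="L1262">E al martir suvent a voéd</l>
    <l id="L1263">Que si bel l'at delivréd</l>
    <l id="L1264">De ço qu'esteit ainz encumbrét.</l>
  </lg>
  <note id="N1261-4" target="L1261" targetEnd="L1264">See
    <ref target="L1261 L1263">ll. 1261, 1263</ref>.</note>
</text>"""


def build_id_index(root):
    """Map every id attribute value to the element that carries it."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve_links(root):
    """List (tag, attribute, target id, target element or None) for each pointer."""
    index = build_id_index(root)
    links = []
    for el in root.iter():
        for attr in ("target", "targetEnd"):
            # A target attribute may hold several space-separated ids.
            for target_id in (el.get(attr) or "").split():
                links.append((el.tag, attr, target_id, index.get(target_id)))
    return links


root = ET.fromstring(SAMPLE)
links = resolve_links(root)
# Pointers whose id does not exist anywhere in the document.
dangling = [target_id for _, _, target_id, el in links if el is None]

for tag, attr, target_id, el in links:
    print(f"{tag}@{attr} -> {target_id}:", el.text if el is not None else "MISSING")
```

A check of this kind would also flag dangling pointers before rendering, which matters when critical material carries many cross-references.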

Markup Density and Automation

• Verse: medium density; can be automated

• Prose: variable density; can be automated if footnote pointers present

• Lists & tables: medium density; can be automated

• Critical material: high-density; many cross-references; limited scope for automation
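To illustrate why verse lends itself to automation, here is a hedged sketch that wraps plain verse lines in `<lg>` and `<l>` elements with sequential ids. It assumes fixed four-line stanzas and a starting line number that falls on a stanza boundary; the function name and parameters are hypothetical.

```python
from xml.sax.saxutils import escape


def mark_up_verse(lines, first_line_no=1, stanza_size=4):
    """Wrap verse lines in <lg>/<l> markup with ids of the form L<number>.

    Assumes stanzas of a fixed size and that first_line_no starts a stanza.
    """
    out = []
    for start in range(0, len(lines), stanza_size):
        stanza_no = (first_line_no + start - 1) // stanza_size + 1
        out.append(f'<lg n="{stanza_no}">')
        for offset, text in enumerate(lines[start:start + stanza_size]):
            line_no = first_line_no + start + offset
            out.append(f'  <l id="L{line_no}">{escape(text)}</l>')
        out.append("</lg>")
    return "\n".join(out)


verse = [
    "A Deu del cel ad graciéd",
    "E al martir suvent a voéd",
    "Que si bel l'at delivréd",
    "De ço qu'esteit ainz encumbrét.",
]
verse_xml = mark_up_verse(verse, first_line_no=1261)
print(verse_xml)
```

Page breaks, irregular stanzas and footnote pointers would still need handling on top of this, which is where the provisional markup supplied by the digitisers comes in.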

Extract from XML version of “La Passiun de St. Edmund”

<lg n="316">
  <l id="L1261">A Deu del cel ad graciéd</l>
  <l id="L1262">E al martir suvent a voéd</l>
  <l id="L1263">Que si bel l'at delivréd</l>
  <pb ed="folio" n="123a"/><l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l>
</lg>

Extract from XML version of “La Passiun de St. Edmund”

<note id="N1261-4" target="L1261" targetEnd="L1264">These lines present several problems: (a) <q lang="AN" rend="b">A Deu. . .ad graciéd</q> <ref target="L1261">1261</ref>. The verb <term lang="AN" rend="i">gracier</term>, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: <ref target="L826 L943 L1132">ll. 826, 943, 1132</ref>.

Additional Markup for Critical Material

– <term>: Terms discussed may need to be linked to the Anglo-Norman Dictionary

– <q>: Citations may need to be linked to their sources within the text base

– <bibl>, <title> etc.: Bibliographical information needs to be encoded to link citations with their sources

• Much of the above can be extrapolated from the appearance metadata embedded in the provisional markup

• <hi>: to encode embedded appearance metadata whose significance is not apparent
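As a rough sketch of the fallback case, the following converts provisional appearance tags into `<hi>` elements with a `rend` attribute. The provisional tags `<i>` and `<b>` are an assumption made for illustration (the deck does not specify the provisional scheme); the `rend` values follow the `rend="i"` and `rend="b"` attributes seen in the note extract.

```python
import re

# Hypothetical provisional tags: <i>...</i> for italic, <b>...</b> for bold.
PROVISIONAL = re.compile(r"<(i|b)>(.*?)</\1>", re.DOTALL)


def promote_appearance(text):
    """Rewrite provisional <i>/<b> spans as <hi rend="i"> / <hi rend="b">."""
    return PROVISIONAL.sub(
        lambda m: f'<hi rend="{m.group(1)}">{m.group(2)}</hi>', text
    )


sample = "The verb <i>gracier</i> normally takes a direct object."
promoted = promote_appearance(sample)
print(promoted)
```

An editor could later upgrade such `<hi>` elements to `<term>`, `<q>` or `<title>` once the significance of the typography is known. Nested provisional tags are not handled by this simple pattern.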

“La Passiun de St. Edmund” Rendered for a Web Browser

• These lines present several problems: (a) A Deu. . .ad graciéd 1261. The verb gracier, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: ll. 826, 943, 1132. T.-L. 4,502 cites one instance of gracier with indirect object, but in the construction gracier. qc. a qn. If this construction were applied here, ll. 1263-4 would have to be taken as the direct object of gracier and also, presumably, of voer 1262. The use here of gracier with indirect object may have been influenced by the construction rendre graces a qn. employed at ll. 995, 1046, 1512.


Markup Requirements: Application

• 1,000 to 100,000 XML tags per document

• Automation essential for high throughput

• Digitisers can embed appearance metadata in provisional markup

• Well-designed provisional markup schemes facilitate automation

Facsimile of part of the Statute Roll

The same passage in the 1810 printed edition

Extract from the explanation published with the Statutes, exemplifying the two forms resembling 9s.

"rum"-abbreviation and flourishes

Handling of Non-Unicode Characters: 1) Transcription

• Transcription is the one-to-one encapsulation of character appearance metadata

• Transliteration is the expansion of abbreviated characters into an intelligible sequence of letters

• Transliteration requires transcription as a starting point

• Transcription codes must resemble originals to facilitate re-keying

P-contractions

Examples of the "per", "pro" and "pre" contractions as represented by the agency

Signifies   Keyed as   Expanded example   Rekeyed example
per         p!!        ceperit            cep!!it
pro         $p$        propria            $p$p<sup>i</sup>a
pro         $p$        probum             $p$bū
per         p!!        persone            p!!sone
per         p!!        apertement         ap!!tement
pro         $p$        profit             $p$fit
per         p!!        permisit           p!!misit
pro         $p$        promisit           $p$misit
pro         $p$        prochein           $p$chein
per         p!!        persona            p!!<sup>a</sup>
par         p!!        paratus            p!!atus
par         p!!        parceles           p!!celes
por         p!!        tempore            temp!!e
por         p!!        corporum           corp!!um
pre         p?~        presentem          p?~sentem
pre         p?~        prelatz            p?~laz!!
pre         p?~        predictum          p?~d!!c~m
pre         p?~        prendront          p?~ndront
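The table shows that one keyed code can stand for more than one reading. A small sketch, using the signification and code columns copied from the table (the variable and function names are illustrative only), groups the readings by code to make that ambiguity explicit.

```python
from collections import defaultdict

# (signifies, keyed code, expanded example) rows copied from the table.
ROWS = [
    ("per", "p!!", "ceperit"), ("pro", "$p$", "propria"),
    ("pro", "$p$", "probum"), ("per", "p!!", "persone"),
    ("per", "p!!", "apertement"), ("pro", "$p$", "profit"),
    ("per", "p!!", "permisit"), ("pro", "$p$", "promisit"),
    ("pro", "$p$", "prochein"), ("per", "p!!", "persona"),
    ("par", "p!!", "paratus"), ("par", "p!!", "parceles"),
    ("por", "p!!", "tempore"), ("por", "p!!", "corporum"),
    ("pre", "p?~", "presentem"), ("pre", "p?~", "prelatz"),
    ("pre", "p?~", "predictum"), ("pre", "p?~", "prendront"),
]


def readings_by_code(rows):
    """Collect the distinct readings attested for each keyed code."""
    readings = defaultdict(set)
    for signifies, code, _example in rows:
        readings[code].add(signifies)
    return {code: sorted(values) for code, values in readings.items()}


readings = readings_by_code(ROWS)
# Codes with more than one attested reading cannot be expanded blindly.
ambiguous = [code for code, values in readings.items() if len(values) > 1]
print(readings)
print("ambiguous codes:", ambiguous)
```

Here `p!!` covers "per", "par" and "por", so a character-level search-and-replace would produce wrong expansions; the decision has to be taken at word level.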

Transcription: 1810 Edition and Rekeyed Version

<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos &amp; consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum &amp; stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>

Transcription to Transliteration: Rekeyed Version & XML File

Handling of Non-Unicode Characters: 2) Transliteration

• Manual transliteration would take too long

• Blanket replacement is not possible because of ambiguous abbreviations

• Semi-automated transliteration can be achieved using a list of words for block-replacement, derived from a concordance

• The appearance metadata from the transcription should remain embedded
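The semi-automated approach could be sketched as follows, under stated assumptions: words are split on whitespace, a word counts as contracted if it contains one of the code characters `!`, `?`, `~` or `$`, and whole words are replaced from an expansion list. The sample sentence is invented for the example, the expansion entries are of the kind listed in the project's expansion tables, and wrapping whole words in `<expan abbr="…">` is a simplification of the sub-word markup used in the project's XML.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape, quoteattr

# Characters assumed to mark a keyed abbreviation code.
CODE_CHARS = re.compile(r"[!?~$]")

# Word-level expansions; entries like these would come from the concordance.
EXPANSIONS = {"p!!lement": "parlement", "t?~re": "terre", "ap?~s": "apres"}

# Invented sample text, for illustration only.
SAMPLE = "ap?~s le p!!lement la t?~re ten!!"


def concordance(text):
    """Count each contracted word so frequent forms can be expanded first."""
    return Counter(w for w in text.split() if CODE_CHARS.search(w))


def transliterate(text, expansions):
    """Expand known words, keep the keyed form in abbr, report the rest."""
    out, unresolved = [], []
    for word in text.split():
        if word in expansions:
            out.append(
                f"<expan abbr={quoteattr(word)}>{escape(expansions[word])}</expan>"
            )
        else:
            if CODE_CHARS.search(word):
                unresolved.append(word)  # left for manual review
            out.append(escape(word))
    return " ".join(out), unresolved


counts = concordance(SAMPLE)
xml, unresolved = transliterate(SAMPLE, EXPANSIONS)
print(xml)
print("needs review:", unresolved)
```

Words with no entry in the list are left untouched and reported, so ambiguous abbreviations are never expanded silently. Punctuation attached to a word would need stripping before lookup in real text.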

Extract from Concordance

Table of expansions, example 1

Contracted word     Occurrences   Expansion
&                   6264          &
q~                  2989          q'
p<sup>r</sup>       803           p'r
seign<sup>r</sup>   325           seignour
aut?~s              289           autres
man?~e              250           manere
s<sup>r</sup>       224           sur
p!!lement           215           parlement
t?~re               199           terre
t?~res              196           terres
denglet?~re         191           dengleterre
g<sup>a</sup>nt     181           grant
lo<sup>r</sup>      167           lour
p!!tie              152           partie
ap?~s               142           apres
s?~ront             139           serront
h<bar>o</bar>me     137           homme

Table of expansions, example 2

Contracted word     Occurrences   Expansion
memorand!!          8             memorandum
mest?~              8             mestre
p!!dre              8             perdre
p?~dc~m             8
p?~mer              8             premer
p?~scheins          8             proscheins
p?~sentz            8             presentz
pasch!!             8             Pasche
t?~minez            8             terminez
t?~ra               8             terra
ten!!               8
ten~tz              8             tenementz
v?~ge               8             verge
v?~roie             8             verroie
v?~tue              8             vertue
$p$pres             7             propres
$q$                 7

Transliteration: XML File & Rendered Output

<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos &amp; consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum &amp; stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>
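Because each `<expan>` keeps the keyed code in its `abbr` attribute, both views can be recovered from the same file. The sketch below is illustrative only: it uses a regular expression rather than an XML parser, works on a short fragment of the extract above, and the function names are hypothetical.

```python
import re

# Matches <expan abbr="CODE">EXPANSION</expan> as used in the extract.
EXPAN = re.compile(r'<expan abbr="([^"]*)">(.*?)</expan>')

# Fragment taken from the extract above.
FRAGMENT = (
    'Collectorib<expan abbr="z$">us</expan> custume sue '
    'lana<expan abbr="z£">rum</expan>'
)


def expanded(xml):
    """Transliterated view: keep the expansion, drop the markup."""
    return EXPAN.sub(lambda m: m.group(2), xml)


def transcribed(xml):
    """Transcription view: restore the keyed abbreviation codes."""
    return EXPAN.sub(lambda m: m.group(1), xml)


print(expanded(FRAGMENT))
print(transcribed(FRAGMENT))
```

The first view is what a reader would see in the rendered output; the second shows that the transcription remains recoverable from the transliterated file.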

Main Challenges facing the Anglo-Norman Hub Project

• Image to text migration for maximum throughput at minimum cost

• Application of markup suitable for rendering and full cross-referencing

• Handling of non-standard character sets (mediaeval abbreviations)