The Construction of Anglo-Norman Text Corpus
• Joint Project of the University of Wales, Swansea and the University of Wales, Aberystwyth.
• AHRC-funded.
• Anglo-Norman Online Dictionary
• Anglo-Norman Text Corpus
• http://www.anglo-norman.net
Goal of the Anglo-Norman Hub Text Digitisation Project
• To provide mediaeval linguists and historians with a set of digitised texts and articles that is searchable and fully cross-referenced, both internally and to and from the Anglo-Norman Online Dictionary
Main Challenges facing the Anglo-Norman Hub Project
• Image to text migration for maximum throughput at minimum cost
• Application of markup suitable for rendering and full cross-referencing
• Handling of non-standard character sets (mediaeval abbreviations)
Image to Text Migration Strategies
• Optical Character Recognition
• Re-keying
• Both require subsequent proofreading
• Both allow insertion of appearance metadata as provisional markup
Advantages of Alternative Image to Text Migration Strategies
• OCR
– Rapid processing
– Can be performed by students on-site and can be supervised
• Re-keying
– Less error-prone
– Cheap if outsourced
– Non-standard characters can be represented by combinations of standard characters
– More consistent output quality
– Image quality less critical
Economic Image to Text Migration: Conclusions
• Re-keying is more economic for the bulk of the mediaeval-language material
• OCR is competitive for modern languages (critical material)
• OCR can also be used for mediaeval-language material when required by workflows, provided that
– good image quality can be easily achieved
– the material consists of standard characters
Markup requirements: must
• Conform to widely-accepted standards
• Be capable of encapsulating diverse document structures
• Allow for automation
• Enable internal and external referencing
• Preserve as much appearance metadata as possible
• Not be tied to any one approach to rendering
Document types requiring a variety of XML Structures
• Texts
– Verse
– Prose
– Lists & Tables
• Critical material
– Introductions (conform to prose structures)
– Notes (do not conform to any of the above structures)
Cross-referencing of Critical Matter
• Need to navigate from pointer to note
• Need to navigate cross-references from critical material to specific points in the text or elsewhere in critical material
• Achieved by use of target-id pairs
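A minimal sketch of how target-id pairs could be checked automatically, assuming the pointing attributes are `target` and `targetEnd` and the anchors carry `id`, as in the extracts from "La Passiun de St. Edmund". The `<text>` wrapper, the shortened note text and the function names are illustrative assumptions, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the "La Passiun de St. Edmund" extract. The <text>
# wrapper and the shortened note are assumptions so the sample parses alone.
SAMPLE = """<text>
<lg n="316"><l id="L1261">A Deu del cel ad graciéd</l>
<l id="L1262">E al martir suvent a voéd</l>
<l id="L1263">Que si bel l'at delivréd</l>
<l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l></lg>
<note id="N1261-4" target="L1261" targetEnd="L1264">See
<ref target="L1261">1261</ref> and <ref target="L826 L943">ll. 826, 943</ref>.</note>
</text>"""


def build_id_index(root):
    """Map every id attribute value to the element that carries it."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve_targets(root):
    """Return (resolved, dangling) lists of (source tag, target id) pairs."""
    index = build_id_index(root)
    resolved, dangling = [], []
    for el in root.iter():
        for attr in ("target", "targetEnd"):
            # A target attribute may hold several space-separated ids.
            for target_id in (el.get(attr) or "").split():
                bucket = resolved if target_id in index else dangling
                bucket.append((el.tag, target_id))
    return resolved, dangling


root = ET.fromstring(SAMPLE)
resolved, dangling = resolve_targets(root)
```

Here the pointers to lines 826 and 943 are reported as dangling only because those lines are absent from the sample; in a full text they would resolve.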
Markup Density and Automation
• Verse: medium density; can be automated
• Prose: variable density; can be automated if footnote pointers present
• Lists & tables: medium density; can be automated
• Critical material: high-density; many cross-references; limited scope for automation
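As an illustration of why verse lends itself to automation, the sketch below wraps plain verse lines in `<l>` elements with sequential ids inside one `<lg>` group, following the element and attribute names seen in the "La Passiun de St. Edmund" extract. The function name and its parameters are hypothetical, and page breaks, notes and other features are deliberately ignored.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_up_verse(lines, first_line_no, stanza_no):
    """Wrap plain verse lines in <l> elements inside a single <lg> group."""
    parts = []
    for offset, text in enumerate(lines):
        # Line ids follow the "L" + line number pattern used in the extract.
        line_id = "L%d" % (first_line_no + offset)
        parts.append("<l id=%s>%s</l>" % (quoteattr(line_id), escape(text.strip())))
    return "<lg n=%s>%s</lg>" % (quoteattr(str(stanza_no)), "\n".join(parts))


verse_xml = mark_up_verse(
    [
        "A Deu del cel ad graciéd",
        "E al martir suvent a voéd",
        "Que si bel l'at delivréd",
        "De ço qu'esteit ainz encumbrét.",
    ],
    first_line_no=1261,
    stanza_no=316,
)
```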
Extract from XML version of “La Passiun de St. Edmund”
<lg n="316"><l id="L1261">A Deu del cel ad graciéd</l>
<l id="L1262">E al martir suvent a voéd</l>
<l id="L1263">Que si bel l'at delivréd</l>
<pb ed="folio" n="123a"/><l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l></lg>
Extract from XML version of “La Passiun de St. Edmund”
<note id="N1261-4" target="L1261" targetEnd="L1264">These lines present several problems: (a) <q lang="AN" rend="b">A Deu. . .ad graciéd</q> <ref target="L1261">1261</ref>. The verb <term lang="AN" rend="i">gracier</term>, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: <ref target="L826 L943 L1132">ll. 826, 943, 1132</ref>.
Additional Markup for Critical Material
– <term>: Terms discussed may need to be linked to the Anglo-Norman Dictionary
– <q>: Citations: may need to be linked to their sources within the text base
– <bibl>, <title> etc.: Bibliographical information needs to be encoded to link citations with their sources
• Much of the above can be extrapolated from the appearance metadata embedded in the provisional markup
• <hi>: to encode embedded appearance metadata whose significance is not apparent
“La Passiun de St. Edmund” Rendered for a Web Browser
• These lines present several problems: (a) A Deu. . .ad graciéd 1261. The verb gracier, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: ll. 826, 943, 1132. T.-L. 4,502 cites one instance of gracier with indirect object, but in the construction gracier. qc. a qn. If this construction were applied here, ll. 1263-4 would have to be taken as the direct object of gracier and also, presumably, of voer 1262. The use here of gracier with indirect object may have been influenced by the construction rendre graces a qn. employed at ll. 995, 1046, 1512.
Markup Requirements: Application
• 1,000 to 100,000 XML tags per document
• Automation essential for high throughput
• Digitisers can embed appearance metadata in provisional markup
• Well-designed provisional markup schemes facilitate automation
Handling of Non-Unicode Characters: 1) Transcription
• Transcription is the one-to-one encapsulation of character appearance metadata
• Transliteration is the expansion of abbreviated characters into an intelligible sequence of letters
• Transliteration requires transcription as a starting point
• Transcription codes must resemble originals to facilitate re-keying
Examples of the "per", "pro" and "pre" contractions as represented by the agency
Signifies  Keyed as  Expanded example  Rekeyed example
per        p!!       ceperit           cep!!it
pro        $p$       propria           $p$p<sup>i</sup>a
pro        $p$       probum            $p$bū
per        p!!       persone           p!!sone
per        p!!       apertement        ap!!tement
pro        $p$       profit            $p$fit
per        p!!       permisit          p!!misit
pro        $p$       promisit          $p$misit
pro        $p$       prochein          $p$chein
per        p!!       persona           p!!<sup>a</sup>
par        p!!       paratus           p!!atus
par        p!!       parceles          p!!celes
por        p!!       tempore           temp!!e
por        p!!       corporum          corp!!um
pre        p?~       presentem         p?~sentem
pre        p?~       prelatz           p?~laz!!
pre        p?~       predictum         p?~d!!c~m
pre        p?~       prendront         p?~ndront
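The table shows that one keyed code can stand for more than one reading. A small sketch, using only the distinct (signifies, keyed as) pairs from the table, groups the readings by code to make that ambiguity explicit; the function name is an illustrative assumption.

```python
from collections import defaultdict

# Distinct (signifies, keyed as) pairs taken from the table of examples.
ROWS = [
    ("per", "p!!"),
    ("pro", "$p$"),
    ("par", "p!!"),
    ("por", "p!!"),
    ("pre", "p?~"),
]


def readings_by_code(rows):
    """Group the possible readings under each keyed transcription code."""
    readings = defaultdict(set)
    for signifies, keyed in rows:
        readings[keyed].add(signifies)
    return {code: sorted(values) for code, values in readings.items()}


# Codes with more than one reading cannot be expanded by blanket replacement.
ambiguous = {c: r for c, r in readings_by_code(ROWS).items() if len(r) > 1}
```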
Transcription to Transliteration: Rekeyed Version & XML File
<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos & consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum & stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>
Handling of Non-Unicode Characters: 2) Transliteration
• Manual transliteration would take too long
• Blanket replacement is not possible because of ambiguous abbreviations
• Semi-automated transliteration can be achieved using a list of words for block-replacement, derived from a concordance
• The appearance metadata from the transcription should remain embedded
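The semi-automated approach can be sketched roughly as follows: count the contracted words to obtain a concordance, have an editor supply expansions for the unambiguous ones, then block-replace whole words while keeping the keyed form in an `abbr` attribute. This is a simplified illustration, not the project's actual tooling: the set of code characters, the word-level `<expan>` wrapping, the function names and the sample sentence are all assumptions, and whitespace is normalised to single spaces.

```python
from collections import Counter
from xml.sax.saxutils import escape, quoteattr

# Assumption: a token counts as contracted if it contains one of these characters.
CODE_CHARS = set("!?~$<&")


def split_word(token):
    """Separate trailing punctuation from a whitespace-delimited token."""
    core = token.rstrip(".,;:")
    return core, token[len(core):]


def concordance(text):
    """Count occurrences of each contracted word in the rekeyed text."""
    counts = Counter()
    for token in text.split():
        core, _ = split_word(token)
        if core and CODE_CHARS & set(core):
            counts[core] += 1
    return counts


def transliterate(text, expansions):
    """Block-replace listed words; leave unlisted (e.g. ambiguous) ones as keyed."""
    out = []
    for token in text.split():
        core, tail = split_word(token)
        if core in expansions:
            # The keyed form stays embedded as appearance metadata.
            out.append("<expan abbr=%s>%s</expan>%s"
                       % (quoteattr(core), escape(expansions[core]), escape(tail)))
        else:
            out.append(token)  # untouched: needs manual review
    return " ".join(out)


# Made-up sample sentence built from words in the expansion tables.
sample = "le p!!lement de la t?~re et les t?~res, ten!! en la t?~re"
expansions = {"p!!lement": "parlement", "t?~re": "terre", "t?~res": "terres"}
counts = concordance(sample)
result = transliterate(sample, expansions)
```

Because `ten!!` has no entry in the expansion list, it passes through unchanged, which mirrors the blank expansions in the tables for forms that cannot be resolved without context.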
Table of expansions, example 1
Contracted word    Occurrences  Expansion
&                  6264         &
q~                 2989         q'
p<sup>r</sup>      803          p'r
seign<sup>r</sup>  325          seignour
aut?~s             289          autres
man?~e             250          manere
s<sup>r</sup>      224          sur
p!!lement          215          parlement
t?~re              199          terre
t?~res             196          terres
denglet?~re        191          dengleterre
g<sup>a</sup>nt    181          grant
lo<sup>r</sup>     167          lour
p!!tie             152          partie
ap?~s              142          apres
s?~ront            139          serront
h<bar>o</bar>me    137          homme
Table of expansions, example 2
Contracted word    Occurrences  Expansion
memorand!!         8            memorandum
mest?~             8            mestre
p!!dre             8            perdre
p?~dc~m            8
p?~mer             8            premer
p?~scheins         8            proscheins
p?~sentz           8            presentz
pasch!!            8            Pasche
t?~minez           8            terminez
t?~ra              8            terra
ten!!              8
ten~tz             8            tenementz
v?~ge              8            verge
v?~roie            8            verroie
v?~tue             8            vertue
$p$pres            7            propres
$q$                7
Transliteration: XML File & Rendered Output
<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos & consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum & stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>