CMD and TEI
CMDI interoperability workshop, 2013-06-04, Utrecht
Matej Ďurčo, ICLTT, Vienna
TEI at ICLTT
• AAC – Austrian Academy Corpus
  – diachronic corpus, ~500 million tokens
  – being converted into TEI
• C4 – distributed corpus of 20th-century German
  – Basel, Berlin, Bozen, Wien
  – harmonized format (TEI/teiHeader)
• Dict-Gate – TEI-encoded multilingual lexicons (Persian, Arabic, German, English)
  – however, described with LexicalResourceProfile
• Abacus – Austrian Baroque Corpus
  – 3 (5) historical texts encoded in TEI
  – elaborate teiHeader
TEI (and friends?) in CMD
Project | Author, Year | Profile | Comp/Elem/Datcats | Instances
Deutsches Text Archiv | ? | teiHeader #clarin.eu:cr1:p_1345180279115 (NOT in CompReg!) | 56/82/10 | 857
ICLTT | Ďurčo, 2010 | teiHeader #clarin.eu:cr1:p_1282306194508 | 16/35/13 (7 dublincore, 6 isocat) | 467
Leipzig Corpora | Eckart, 2012 | TEIDocumentDescription #clarin.eu:cr1:p_1337778924992 | 4/17/17 (isocat) | ?
Nederlab | Zhang, 2013 ? | DBNL_Tekst #clarin.eu:cr1:p_1361876010678; DBNL_Tekst_Onzelfstandig #clarin.eu:cr1:p_1366279029218 (private) | 20/38/15; 20/47/21 | ?
• overview of currently existing TEI-ish CMD profiles
teiHeader (ICLTT)
size = reuse in other profiles
teiHeader (DTA)
size = count of elements in instance data
datcats in teiHeader (DTA)
TEI and ISOcat
• a special DCS: TEI Header (2.1.0) – Windhouwer, 2012
  – a datcat for every element of the teiHeader (135 datcats)
  – based on an ODD file (ODD2DCIF.xsl and DCIF2ODD.xsl available)
  – owed to CLARIN-NL projects using the TEI header
• an enriched schema was generated, i.e. annotated with these new data categories (dcr:datcat attribute), and put in SCHEMAcat: http://lux13.mpi.nl/schemacat/schema/teiHeader
• define relations between TEI and other data categories in RELcat (the relation registry)
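A minimal sketch of how the dcr:datcat annotations in such an enriched schema could be read back out. It assumes an XSD-style schema and the namespace URI http://www.isocat.org/ns/dcr for the dcr prefix; the schema fragment, the element names and the datcat URIs are hypothetical placeholders, not taken from the actual SCHEMAcat file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (the dcr one is an assumption about the enriched schema).
DCR_NS = "http://www.isocat.org/ns/dcr"
XS_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical toy fragment; the datcat URIs are placeholders.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <xs:element name="title" dcr:datcat="http://example.org/datcat/DC-0001"/>
  <xs:element name="author" dcr:datcat="http://example.org/datcat/DC-0002"/>
  <xs:element name="note"/>
</xs:schema>
"""


def datcats_by_element(schema_xml):
    """Map each annotated element name to its data category URI."""
    root = ET.fromstring(schema_xml)
    mapping = {}
    for el in root.iter(f"{{{XS_NS}}}element"):
        datcat = el.get(f"{{{DCR_NS}}}datcat")
        if datcat is not None:  # skip elements without an annotation
            mapping[el.get("name")] = datcat
    return mapping


if __name__ == "__main__":
    for name, uri in datcats_by_element(SCHEMA).items():
        print(name, "->", uri)
```

Elements without a dcr:datcat attribute (here "note") are simply left out of the mapping, which makes it easy to spot what is still unannotated.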
Next Step(s)?
• create (or adapt an existing) teiHeader profile
  – as a union of the existing profiles?
  – based on the enriched schema
  – i.e. linking to the new TEI data categories
  – define a relation set in RELcat between TEI and ISOcat (and dublincore) data categories
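The idea of a union profile plus a relation set can be sketched as follows. Everything here is hypothetical: the two profiles, their element names, the datcat identifiers and the relations are invented for illustration and do not reflect the real ICLTT or DTA profiles or the RELcat data model.

```python
# Hypothetical profiles: element name -> data category identifier.
PROFILE_A = {
    "title": "dc:title",
    "author": "dc:creator",
    "language": "isocat:languageId",
}
PROFILE_B = {
    "docTitle": "tei:title",
    "docAuthor": "tei:author",
    "publisher": "tei:publisher",
}

# Hypothetical relation set: TEI datcat -> equivalent non-TEI datcat.
RELATIONS = {
    "tei:title": "dc:title",
    "tei:author": "dc:creator",
}


def normalize(datcat, relations):
    """Replace a datcat by its related datcat, if a relation is defined."""
    return relations.get(datcat, datcat)


def union_profile(profiles, relations):
    """Group the elements of all profiles by their normalized datcat."""
    merged = {}
    for profile in profiles:
        for element, datcat in profile.items():
            merged.setdefault(normalize(datcat, relations), set()).add(element)
    return merged


if __name__ == "__main__":
    merged = union_profile([PROFILE_A, PROFILE_B], RELATIONS)
    for datcat, elements in sorted(merged.items()):
        print(datcat, sorted(elements))
```

With the relations applied, "title" and "docTitle" end up under the same data category, while "publisher" stays on its own because no relation is defined for it.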
profile: data (LINDAT)
dublincore + metashare
profile: data (LINDAT)
resourceInfo-component
dublincore I
• 2 profiles with dc-terms (55 data categories)
• 2 profiles with dc-elements (called "dc-terms")
as of 2013-01
dublincore II
currently (2013-06): 4 DCMI-terms profiles
dublincore III
(almost) all datcats shared by all
dublincore IV
1 profile has an extra component: DANS-DC-metadata
example: language
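The comparison behind the dublincore slides (which datcats are shared by all profiles, and which profile carries something extra) amounts to a set intersection and difference. A minimal sketch, assuming each profile is reduced to a set of datcat names; the profile names and their contents are hypothetical.

```python
# Hypothetical DCMI-terms profiles, each reduced to a set of datcat names.
PROFILES = {
    "profile_1": {"title", "creator", "language"},
    "profile_2": {"title", "creator", "language"},
    "profile_3": {"title", "creator", "language"},
    "profile_4": {"title", "creator", "language", "extra"},
}


def shared_datcats(profiles):
    """Datcats that occur in every profile."""
    return set.intersection(*profiles.values())


def extras(profiles):
    """Per profile, the datcats that are not shared by all profiles."""
    shared = shared_datcats(profiles)
    return {name: cats - shared for name, cats in profiles.items()}


if __name__ == "__main__":
    print(sorted(shared_datcats(PROFILES)))
    print({name: sorted(cats) for name, cats in extras(PROFILES).items() if cats})
```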