Wikis, Standards and Everything

Lee Gillam (University of Surrey), Laurent Romary (Max-Planck Digital Library)



Foreword

• Wikification and standards: is this the wrong talk?

– Wiki: Open + free interaction on-line

– ISO: Dusty documents imposing ways of thinking and working

• Still, reusability and preservation of data

– Requires some minimal principles about data representation

• Interoperability

– And there are quite a few practical standards (e.g. ISO 10646)

• Background (outline)

– The demonstrators: OmegaWiki

– The police: ISO (International Organization for Standardization)

– The topic at hand: language descriptions

• Highly complementary to work done here at MPI-EVA (eWALS)

ISO standards

| Title of Standard | Status | Registration Authority | Number of identifiers (approx) |
| --- | --- | --- | --- |
| ISO 639-1: Part 1: Alpha-2 code | Published (2002) | InfoTerm | 150 |
| ISO 639-2: Part 2: Alpha-3 code | Published (1998) | Library of Congress (LoC) | 400 |
| ISO 639-3: Part 3: Alpha-3 code for comprehensive coverage of languages | Published (2007) | Summer Institute of Linguistics (SIL) | 7000 |
| ISO 639-4: Part 4: Implementation guidelines and general principles for language coding | Expected late 2007 | n/a | n/a |
| ISO 639-5: Part 5: Alpha-3 code for language families and groups | Expected late 2007 | TBC | 100 |
| ISO 639-6: Part 6: Alpha-4 representation for comprehensive coverage of language variation | Expected early 2008 | GeoLang | 25000 |
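
The identifier shapes named in the table above (alpha-2, alpha-3, alpha-4) can be illustrated with a small shape check. This is only a sketch: the function name `candidate_parts` and the lookup table are hypothetical, codes are assumed to be lowercase ASCII letters, and the check says nothing about whether a code is actually assigned in any registry.

```python
import re

# Identifier lengths per part, taken from the table:
# alpha-2 (639-1), alpha-3 (639-2, 639-3, 639-5), alpha-4 (639-6).
PARTS_BY_LENGTH = {
    2: ["ISO 639-1"],
    3: ["ISO 639-2", "ISO 639-3", "ISO 639-5"],
    4: ["ISO 639-6"],
}

# Assumption: codes consist of two to four lowercase ASCII letters.
_CODE_RE = re.compile(r"[a-z]{2,4}")


def candidate_parts(code: str) -> list[str]:
    """Return the ISO 639 parts whose identifier shape matches `code`.

    This checks shape only; it does not confirm the code is assigned.
    """
    if not _CODE_RE.fullmatch(code):
        return []
    return list(PARTS_BY_LENGTH[len(code)])


print(candidate_parts("en"))
print(candidate_parts("eng"))
```

A three-letter code is ambiguous between several parts on shape alone, which is one reason a shared registry of identifiers and metadata is useful.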

Wikis for Languages

• Some possible motivations:

– 50% of languages are endangered (UNESCO);

– large proportion of languages have no “resources” and no web presence;

– discontinuity and fragmentation of research;

– sustainability and curation issues

• And yet…

– Capability for capturing data like never before;

– Expansion of capacity of the Internet and growing pressure for an inclusive multilingual internet;

– OLPC programme;

– Language experts and non-experts are prepared to contribute time and resources

• So, how about a Wiki-based infrastructure that allows us to form communities around languages and harmonize results?

Wikis for Languages

• OmegaWiki, a collaborative project to produce a free, multilingual resource in every language, with lexicological, terminological and thesaurus information

• World Language Documentation Centre (WLDC), currently comprising 22 experts in language technologies, linguistics, terminology standardisation, and localisation

• ISO, provision of the ISO 639 series of standards; focus here on 639-4 and 639-6

Wikis for Languages

[Diagram: proposed infrastructure. Labels: ISO 639-6 data, ISO 639-X data; ISO 639-6 standard, ISO 639-X standard; Expert review; Community review & infrastructure; “Auditors”; ISO 639-4 “standards as databases”; ISO 11179, ISO 12620; Data categories, Metadata registries; Co-ordination: SIL, LoC, Infoterm]

Wikis for Languages

• Language Documentation via ISO 639-4: association of metadata descriptors with a model interoperable with DCIF (ISO 12620); see ISO 639-4, section 9

[Diagram: metadata model. Labels: Language Section; Name Section; Representation Section; Geographical Info; Societal Info; Linguistic Info; Diachronic Info; Temporal Info; Cultural / Religious Info; Documentation; Description (#). Note on slide: attribution information missing here]
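
The metadata model on this slide could be represented as a simple record type. The sketch below is an illustration only: the class `LanguageDescription`, its field names, and the flat arrangement of the information categories are assumptions made here, not the schema defined in ISO 639-4.

```python
from dataclasses import dataclass, field

# Information categories read off the slide; how they nest is an assumption.
INFO_CATEGORIES = (
    "geographical",
    "societal",
    "linguistic",
    "diachronic",
    "temporal",
    "cultural/religious",
)


@dataclass
class LanguageDescription:
    """Hypothetical record for one language; not the ISO 639-4 schema itself."""

    identifier: str                                     # e.g. an ISO 639 code
    names: list[str] = field(default_factory=list)      # "Name Section"
    info: dict[str, str] = field(default_factory=dict)  # one entry per category
    attribution: str = ""                               # noted as missing on the slide

    def add_info(self, category: str, text: str) -> None:
        """Attach a descriptor, rejecting categories not in the model."""
        if category not in INFO_CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.info[category] = text

    def missing_categories(self) -> list[str]:
        """List the categories that have no descriptor yet."""
        return [c for c in INFO_CATEGORIES if c not in self.info]


record = LanguageDescription(identifier="eng", names=["English"])
record.add_info("linguistic", "West Germanic language")
print(record.missing_categories())
```

Listing the missing categories gives community contributors and expert reviewers a shared view of what remains to be documented for a language.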

Wikis for Languages

• Eventual inclusion of all “available” metadata

ISO standards

• Language Codes Standards are growing in number and complexity

– From 2 to 6 – eventually back to 1?

– From 400 identifiers to upwards of 30000 – plus supporting metadata

– From lists to databases – multiple metadata registers

– From tables to metadata registries – registers + policies + “auditors”

– From published text documents to “published” databases – “SAD”

– From IETF RFC to RFCs to RFCs – consume, consume, consume

– From a closed membership committee to an open Community initiative (OmegaWiki) – supporting infrastructure, expert review of community contributions (e-Voting?)

– …. with accompanying (web) services and products – Open Source and bespoke, and secured funding as necessary


Wikis for Language Resources?

Next steps

• Data and models for wiki

– Structured data is necessary in scientific domains

– Registering descriptors and schemas is an essential component of long-term management of such data

• New types of standards

– Stabilisation of knowledge

– Dynamic platforms for describing knowledge

– Complementary to rocket science

• Back to WALS

– MPI EVA and MPDL => eWALS

• Generic environment for managing and linking 639-4 compliant data

• Connecting the whole thing…

Further Sources

• Gillam, L. (2007) "A metadata infrastructure using ISO standards". We Have to Talk about Metadata Workshop at UK e-Science Programme All Hands Meeting 2007 (AHM 2007), Nottingham, 10-13 September. Accepted.

• Gillam, L., Garside, D., Cox, C. (2007) "Developments in Language Codes standards". In Rehm, Witt and Lemnitzer (eds.): Datenstrukturen für linguistische Ressourcen und ihre Anwendungen / Data Structures for Linguistic Resources and Applications. Proc. of GLDV 2007, 11-13 April 2007, Tübingen, Germany: Gunter Narr Verlag.

• Gillam, L., Garside, D., Cox, C. (2006) "Information volumes and linguistic diversity: meeting the challenges for content management". 3rd International Conference on Terminology, Standardization and Technology Transfer, 25-26 August, Beijing, PRC.