Creating the CoMeRe corpus bank: a partnership between Corpus-écrits, ORTOLANG and TEI-CMC

Corpus-écrits General Assembly, 21 November

Consortium Corpus-écrits

SIG TEI-CMC

Open Resources and TOols for LANGuage

http://comere.org

http://hdl.handle.net/11403/comere

Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin

http://www.tei-c.org/Activities/SIG/CMC/

http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication

Our subject and goals

Our subject:

building and annotating corpora of computer-mediated

communication (CMC) – as resources for empirical research on

CMC phenomena in the Humanities (linguistics, communication

science, language technology, …)

This resource must therefore be freely accessible (open-access research data) so that research communities can reuse it.

We will come back to this point later.

Computer-mediated communication (CMC):

All genres of interpersonal communication mediated through computer networks (the internet) and used via personal computers and/or mobile devices: chats, online forums, instant messaging, tweets, comments on weblogs, discussions in wikis and on “social network” sites, interactions in multimodal communication environments such as Skype, MMORPGs or “virtual worlds” (e.g., SecondLife), SMS, WhatsApp, ...


Our vision: These corpora shall be …

interoperable (i) with each other and (ii) with other types of

linguistic corpora (text corpora, speech corpora)

represented conformant to established encoding standards in

the field of Digital Humanities

linguistically annotated in order to allow for sophisticated

queries and language-focused research

The problem / challenge:

So far, there are no established standards for the representation of CMC genres.

Established standards for the representation of text genres do not include models for the representation of the peculiarities of CMC.

“Off the shelf” NLP tools for automatic linguistic analysis and annotation (tokenizers, part-of-speech taggers, lemmatizers, normalizers, parsers) do not perform well on CMC data, because they have usually been trained on edited text and therefore cannot handle “non-standard” phenomena and multimodal elements in CMC discourse.
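The tokenization problem can be illustrated with a minimal sketch. The two regular expressions below are illustrative assumptions, not tools used by the project: one mimics a tokenizer designed for edited text, the other tries a few CMC-specific patterns (URLs, mentions, hashtags, a handful of emoticons) before falling back to ordinary words. The sample message is invented.

```python
import re

# Naive pattern: runs of word characters, or any single symbol.
# This is roughly what a tokenizer tuned to edited text would do.
NAIVE = re.compile(r"\w+|[^\w\s]")

# CMC-aware pattern (illustrative only): specific CMC items are tried first.
CMC = re.compile(
    r"https?://\S+"           # URLs
    r"|[@#]\w+"               # mentions and hashtags
    r"|[:;=][-']?[)(DPp]"     # a few western-style emoticons
    r"|\w+"                   # ordinary words, including SMS spellings like "2m1"
    r"|[^\w\s]"               # any other single symbol
)


def tokenize(pattern, text):
    """Return all non-overlapping matches of the pattern, left to right."""
    return pattern.findall(text)


# Invented example message in SMS-style French.
message = "slt @marie :-) on se voit 2m1 ? #rdv"

print(tokenize(NAIVE, message))
print(tokenize(CMC, message))
```

With the naive pattern the emoticon is split into three separate symbols and the mention and hashtag lose their prefix characters; the CMC-aware pattern keeps each of them as a single token. A real pipeline would need far broader coverage than these few patterns.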

Our goals:

work on solutions for these desiderata

develop suggestions for standards for

- packaging and sharing (mono- and multimodal) CMC corpora,

- modeling these types of “texts” within a framework which is conformant with the encoding framework of the Text Encoding Initiative (TEI) and thus with a widely accepted de facto standard in the field of Digital Humanities,

- processing and annotating these corpora (part-of-speech, normalization, ...) with NLP tools.
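To give an idea of what modeling a CMC message in a TEI-like framework might look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element name `post`, the attributes `who` and `when`, the `type` value and all identifiers and values are illustrative assumptions; the SIG's draft schema should be consulted for the actual model.

```python
import xml.etree.ElementTree as ET

# Namespace of the predefined "xml" prefix, needed for xml:id attributes.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_post(post_id, author_ref, timestamp, text):
    """Build one hypothetical <post> element holding a single CMC message."""
    post = ET.Element("post")
    post.set(f"{{{XML_NS}}}id", post_id)  # serialized as xml:id
    post.set("who", author_ref)           # pointer to a participant description
    post.set("when", timestamp)           # ISO 8601 timestamp of the message
    paragraph = ET.SubElement(post, "p")
    paragraph.text = text
    return post


# A hypothetical division grouping the messages of one SMS exchange.
body = ET.Element("body")
div = ET.SubElement(body, "div", {"type": "sms"})
div.append(make_post("sms-0001", "#p01", "2014-11-21T10:15:00", "slt on se voit 2m1 ?"))

print(ET.tostring(body, encoding="unicode"))
```

Keeping author and time as attributes on each message, rather than inside the text, is what makes corpora of different genres queryable in the same way.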

Who belongs to our community (so far)?

French CMC corpora

Infrastructure for languages

National consortium on corpora

National infrastructure for Digital Humanities

Our kernel projects and founding members

http://hdl.handle.net/11403/comere

Dortmund Chat Corpus

http://www.chatkorpus.tu-dortmund.de

German Reference Corpus of CMC

http://www.tinyurl.com/derik-llc

Wikipedia corpus in DeReKo

(Mannheim)

Scientific network

„Empirical research of CMC“

http://www.empirikom.net

German CMC corpora

Dutch CMC corpora

(Stevin Nederlandstalig Referentiecorpus)

Italian CMC pilot corpus

http://glottoweb.org/web2corpus/

2013, 2014: European workshops on CMC corpora (Dortmund); special journal issue (JLCL)

Activities and initiatives (past and future)

Our pathway

2013: creation of the TEI-CMC SIG

End of 2014: publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC

2015: application to CLARIN-DE; transform existing German corpora into TEI-CMC

October 2015: international CMC conference, Rennes (Ledegen)

2015: submission of the TEI-CMC model

2015: launch of a larger CMC-corpora community

2016: common system of basic CMC annotations (POS tagging)

Objective: a kernel corpus assembling existing corpora of different CMC genres and new corpora built from data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way and complemented with metadata. CoMeRe will be released as open data through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming “Corpus de Référence du Français”.

Project supported by the national

consortium Corpus-écrits, sub-part of

Huma-Num, and Ortolang

Variety + Standards + Open Access


Local LRL server

Individual depositor

Engineer: Kun Jin

Quality group

Discussion with the depositor

NLP tagging group: TEI-v2

TEI-v1

Funding: ORTOLANG > Corpus-écrits > LRL

Ref                               Tokens    Participants     Posts                   Environment
(Antoniadis, 2014)                449 313   359              22 052                  SMS
(Falaise, 2014)                   35 M      25 000           3 M                     textchat
(Ledegen, 2014)                   357 000   850              22 000                  SMS
(Reffay et al., 2014)             600 000   67 + 4 groups    6 790 / 2 030 / 2 686   textchat / emails / forums
(Yun, Chanier, 2014)              77 605    31 + 2 courses   7 750                   textchat
(Abendroth-Timmer et al., 2014)   273 546   26 + 4 groups    1 200                   blog
(Longhi, Marinica, 2014)          567 851   205              34 273                  tweet

Contexts labelled on the slide: informal, business, education, politics

Mono-mode, mono-modality: textchat, forum, SMS, tweets, email, blogs (images are not a means of interaction)

Multiple modalities: LMS (email, forum, chat)

Multiple modes: conference system (audiochat, textchat)

Verbal / verbal & non-verbal

Conference system, 3D environment, etc.: audiochat, textchat, icons, collective production (whiteboard, word processor, semantic maps), avatars, …

Interaction space

Time(s)

Locations

Participants, environments

Author, addressee(s), group, network

Course, session, channel, simultaneity

New macro-level elements

http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI
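As a rough illustration of how the macro-level notions listed above (time, locations, participants, environments) could be held together in one structure, here is a minimal sketch with Python dataclasses. All class names, field names and sample values are hypothetical and do not reproduce the TEI-CMC draft; they only show the kind of information an interaction space gathers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Participant:
    ident: str            # hypothetical identifier, e.g. "p01"
    role: str = "author"  # e.g. author or addressee


@dataclass
class Environment:
    name: str                                          # e.g. a conference system
    channels: List[str] = field(default_factory=list)  # e.g. audiochat, textchat


@dataclass
class InteractionSpace:
    start: str                 # ISO 8601 start of the time frame
    end: str                   # ISO 8601 end of the time frame
    locations: List[str]
    participants: List[Participant]
    environments: List[Environment]

    def all_channels(self) -> List[str]:
        """Sorted list of distinct channels across all environments."""
        return sorted({c for env in self.environments for c in env.channels})


# Invented example: one session in a conference system with two channels.
space = InteractionSpace(
    start="2013-01-10T14:00:00",
    end="2013-01-10T15:00:00",
    locations=["online"],
    participants=[Participant("p01"), Participant("p02", role="addressee")],
    environments=[Environment("conference system", ["audiochat", "textchat"])],
)

print(space.all_channels())
```

Describing the environment and its channels once, at the level of the interaction space, avoids repeating that information on every individual message.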

Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)

1.5 min video

* Paper: (Wigham & Chanier, 2013), CALL journal

* Data: (Wigham, 2013) LETEC corpus

Modality interplay


Multimodality: verbal and non-verbal

(Wigham & Chanier, 2013)

Collaborative word processor

Audio: clarification

Textchat: correction (with error)

Textchat: request for confirmation

Context: Lyceum conferencing environment, 3 learners (English L2) working in a word processor: one writing, the others helping

Now in TEI-speech

http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication

the user is authorised to download a copy of the corpus […]

• reuse (reproduction, distribution) of non-substantial parts of corpus XXX is authorised […]

• reuse is subject to the condition of citing in full, as credits: […]

• reuse (reproduction, distribution) of substantial parts of corpus XXX is not permitted under this licence of use.

I agree to these terms of use (mandatory to gain access to the corpus)

Example of a corpus licence displayed on the National Infrastructure for Digital Humanities and considered to be "open access"

Viewing but not re-using: is that OA?
