SPOKEN LANGUAGE CORPUS PROJECT: SPOKEN CORPORA FOR THE 9
OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES
Slide 2
Workshop Overview
The Asmara Declaration - Rusandre
What's the point of spoken language corpora? - Jens
Overview of the project and its phases - Rusandre
The recording phase - Jens/Mmem
The transcription phase - Jens
The checking phase - Jens
The tagging phase - Leif/Rusandre
Research output - Jens
Slide 3
THE ASMARA DECLARATION - 2000
Dialogue among African languages
is essential: African languages must use the instrument of
translation to advance communication among all people, including
the disabled. All African children have the inalienable right to
attend school and learn in their mother tongues. All effort should
be made to develop African languages at all levels of
education.
Slide 4
ASMARA DECLARATION - CNTD
Promoting research on African
languages is vital for their development, while the advancement of
African research and documentation will be best served by the use
of African languages. The effective and rapid development of
science and technology in Africa depends on the use of African
languages and modern technology must be used for the development of
African languages.
Slide 5
What's the point of spoken language corpora?
Jens Allwood
Corpus linguistics / Armchair linguistics
Slide 6
PROJECT MANAGEMENT
Goteborg/Unisa
Nguni: Rhodes, Fort Hare, UPE/Vista, Natal, Unizul
Sotho (N-Sotho, Tswana): Univ of North, Northwest Univ, Venda Univ., Univ. Botswana
Venda: Venda Univ
Tsonga
Slide 7
OBJECTIVES
To develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of SA.
The resources will be in the form of:
archived audio-visual recordings of activity-based natural language use;
machine-readable transcriptions of recordings for corpus-driven searches;
morphologically tagged corpora for corpus-based searches.
Slide 8
PROJECT PHASES 2002 - 2004
1. Ongoing audio-video recordings of activity-based spoken language use (min. 200 hrs p/l).
2. Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
3. Checking and editing of transcriptions.
4. Manual morphological tagging of corpora.
5. Automated tagging of corpora.
6. Research outputs.
Slide 9
The recording phase
What to record
Activity types
What to think about when recording natural language dialogues
Keep it natural
The video camera, microphone, etc.
Keep the camera fixed!
Slide 10
Recording and transcription
Practical exercise!
1. A short recording
2. Transcribe together
Slide 11
Transcription Structure
Header (background information about transcription and recorded activity)
Body (the actual transcription, consisting of two kinds of elements)
Contributions (transcribed utterances of participants in the recorded activity)
Information lines (mark various peculiar aspects in the contributions and recorded activity)
Slide 12
Example of a header
@ Recorded activity ID: V010501
@ Activity type: Informal conversation
@ Recorded activity title: Getting to know each other
@ Recorded activity date: 20020725
@ Recorder: Britta Zawada
@ Participant: A = F2 (Lunga)
@ Participant: B = F1 (Bukiwe)
@ Transcriber: Mvuyisi Siwisa
@ Transcription date: 20020805
@ Checker: Rusandre Hendrikse
@ Checking date: 20020912
@ Anonymised: No
@ Activity Medium: face-to-face
@ Activity duration: 00:44:30
@ Other time coding: Each section
@ Tape: V0105
@ Section: Family affairs
@ Section: Crime
@ Section: Unemployment
@ Section: Closing
@ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
Slide 13
Transcription header
@ Recorded activity ID: V010501
V = Video, 01 = project number, 05 = tape number within this project, 01 = recording number
@ Activity type: Informal conversation
@ Recorded activity title: Getting to know each other
@ Recorded activity date: 20020725
@ Recorder: Britta Zawada
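The recorded activity ID scheme on this slide can be unpacked mechanically. The following is a minimal sketch in Python, assuming every ID is one medium letter followed by three two-digit fields, as in V010501. The names `ActivityId`, `parse_activity_id` and `tape_id` are illustrative only and are not part of the project's tooling; V (video) is the only medium code the slides define.

```python
import re
from typing import NamedTuple


class ActivityId(NamedTuple):
    medium: str      # "V" = video (the only medium code given on the slide)
    project: str     # project number, e.g. "01"
    tape: str        # tape number within the project, e.g. "05"
    recording: str   # recording number, e.g. "01"


# Assumed layout: one capital letter followed by three two-digit fields.
_ID_PATTERN = re.compile(r"([A-Z])(\d{2})(\d{2})(\d{2})")


def parse_activity_id(activity_id: str) -> ActivityId:
    """Split an ID such as 'V010501' into its four components."""
    match = _ID_PATTERN.fullmatch(activity_id.strip())
    if match is None:
        raise ValueError(f"unrecognised recorded activity ID: {activity_id!r}")
    return ActivityId(*match.groups())


def tape_id(parsed: ActivityId) -> str:
    """Rebuild the tape label (e.g. 'V0105'), which is a prefix of the full ID."""
    return parsed.medium + parsed.project + parsed.tape
```

Keeping the fields as strings preserves the leading zeros, so the tape label can be rebuilt exactly as it appears in the `@ Tape:` header field.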
Slide 14
Transcription header, cont
@ Participant: A = F2 (Lunga)
@ Participant: B = F1 (Bukiwe)
F stands for female
F1 is unique for Bukiwe in the entire corpus
A and B are IDs for the participants
Transcription header, cont
@ Anonymised: No
Indicates whether personal names, etc. have been changed to pseudonyms (Yes) or not (No), both in the header and in the conversation
@ Activity Medium: face-to-face
Normally spoken, face to face, but could also have other values, like telephone conversations.
Slide 17
Transcription header, cont
@ Activity duration: 00:44:30
Duration in hours, minutes and seconds
@ Other time coding: Each section
There is a time line for each section
@ Tape: V0105
This is a part of the recorded activity ID
Slide 18
Transcription header, cont
@ Section: Family affairs
@ Section: Crime
@ Section: Unemployment
@ Section: Closing
@ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
Any relevant information that is not covered by any of the required headings
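Because every header field has the shape `@ Key: value`, a header can be read into a simple lookup structure. The sketch below is one possible way to do this in Python, assuming one field per line; `parse_header` and `duration_seconds` are hypothetical helper names, not project software. Repeated fields such as Participant and Section are collected into lists.

```python
from collections import defaultdict


def parse_header(lines):
    """Collect '@ Key: value' header lines into a dict mapping keys to lists of values."""
    fields = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # not a header line
        # Split on the first colon only, so values like 00:44:30 stay intact.
        key, sep, value = line[1:].partition(":")
        if not sep:
            continue  # an '@' line without the 'Key: value' shape
        fields[key.strip()].append(value.strip())
    return dict(fields)


def duration_seconds(hhmmss):
    """Convert an 'Activity duration' value (hours:minutes:seconds) to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds
```

Storing every value in a list keeps the treatment of single and repeated fields uniform, at the cost of an extra `[0]` when reading a field that occurs once.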
Slide 19
The body
This is the actual transcription - the background information is in the header
Four kinds of lines:
$A: uyakhonza kanene    Contribution
@                       Information line
At office               Section line
# 00:10:00              Time line
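The four line types can be told apart by their first character. A minimal Python sketch follows, assuming contributions start with `$` plus a participant ID and a colon, information lines with `@`, and time lines with `#`. The section example on the slide shows no marker of its own, so treating every other line as a section line is an assumption of this sketch; `classify_line` is a hypothetical name.

```python
import re

# '$' + participant ID + ':' + transcribed utterance
_CONTRIBUTION = re.compile(r"\$(\w+):\s*(.*)")


def classify_line(line):
    """Return (kind, speaker, content) for one line of a transcription body."""
    text = line.strip()
    match = _CONTRIBUTION.fullmatch(text)
    if match:
        return ("contribution", match.group(1), match.group(2))
    if text.startswith("@"):
        return ("information", None, text[1:].strip())
    if text.startswith("#"):
        return ("time", None, text[1:].strip())
    # Assumption: anything without a recognised marker is a section line.
    return ("section", None, text)
```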
Slide 20
Sections
Family affairs
$B: sibabini kuphela esibabalwe sada safunda ke noko sakwazi ukuphangela sikwazi ke noko kuba ndinobhuti wam osebenzayo...
Religion
$B: uyakhonza kanene
$A: ndiyakhonza owu ndiyamthand{a} [4 uthixo ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
$B: [4 nantso ke sisi e: e: ]4
$B: nantso ke into efunekayo uthixo ulithemba lethu [5 uthixo ulithemba lethu ulixhadi lethu ]5 uligwiba
$A: [5 ulixhadi lethu ulixhadi lethu]5
$B: [6 uligwiba andazi ukuba ndingangendithini ngendiphi na xa uthixo heyi ]6
Situation on their arrival at Medunsa
$A: [6 ucinga ukuba ngesiphi na ngesisemedunsa ]6
$B: uye wasithatha khona waza kusibeka kule ndawo...
Overlaps
Religion
$B: uyakhonza kanene
$A: ndiyakhonza owu ndiyamthand{a} [4 ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
$B: [4 nantso ke sisi // e: e: ]4
@
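In the example, the overlapped stretches of speech are enclosed in numbered brackets, `[4 ... ]4`, and the same number appears in both speakers' contributions. The Python sketch below pairs these stretches up, assuming each overlap opens with `[n` and closes with `]n` within a single contribution. The function name `overlaps` and the input shape (a list of speaker and utterance pairs) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# '[n', the overlapped text, then ']n' with the same number n.
_OVERLAP = re.compile(r"\[(?P<n>\d+)\s+(?P<text>.*?)\s*\](?P=n)")


def overlaps(contributions):
    """Map each overlap number to the (speaker, overlapped text) pairs that share it."""
    found = defaultdict(list)
    for speaker, utterance in contributions:
        for match in _OVERLAP.finditer(utterance):
            found[int(match.group("n"))].append((speaker, match.group("text")))
    return dict(found)
```

An overlap number that ends up with only one speaker attached is a useful signal for the checking phase, since it suggests a missing or mistyped bracket.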
Slide 23
Contrastive stress, pauses and lengthening
$B: abanye ke bazihlalele nje: / abanye ABAZANGE bafune sikolo // uyayiqonda ke la meko yokungabikho mzali uqhubayo / uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi // kodwa ke // andigxeki nto kuba ke / ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2
$A: [2 ya / m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata
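The markers in this example can be counted automatically. The sketch below rests on a reading of the example that the slide itself does not spell out: capitals are taken as contrastive stress, a trailing colon as lengthening, and `/` and `//` as pauses, with `//` assumed to be the longer one. `prosody_summary` is a hypothetical helper name.

```python
def prosody_summary(utterance):
    """Count pause marks and list stressed and lengthened tokens in one contribution."""
    summary = {"short_pauses": 0, "long_pauses": 0, "stressed": [], "lengthened": []}
    for token in utterance.split():
        if token == "/":
            summary["short_pauses"] += 1      # assumed: single slash = shorter pause
        elif token == "//":
            summary["long_pauses"] += 1       # assumed: double slash = longer pause
        elif token.endswith(":"):
            summary["lengthened"].append(token)
        elif token.isalpha() and token.isupper():
            summary["stressed"].append(token)
    return summary
```

Overlap brackets such as `[2` and `]2` fall through every branch and are ignored.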
Slide 24
Unclear speech and glottal stop
$M: loo nto ke njengo{ku}ba sekunyanzeleke ukuba ndiye phaya nje (...) ndikwazi ukuncedisa phaya ndiyiphushile ukwenzela ukuba ndibe neclaim endizakuba nayo that is why ndithole because ndiyaclaimer so that at least uba ndiclayimile ndikwazi ukuhamba
$T: ke ngoku ke yenye yezinto endifuna ukuyoyenza
$M: ngolwesithathu (what she said to me ngoku bendiphaya ngecawe) besingcwaba umfazi kasicaka jama
$T: ee andekufuni ukutya
Research output
Jens Allwood
A distributed database (corpus)
Networks (homepages)
Spoken language corpus activities (seminars, workshops)
Slide 27
TAGGING SPOKEN LANGUAGE SAMPLES
PROBLEMATIC ISSUES, CONVENTIONS & STANDARDS
A P Hendrikse 16/03/04
Slide 28
PROBLEMATIC ISSUES
Loans and codeswitching
Fixed expressions
Spoken language reductions
Morphophonological issues
Designing a tag set
Manual tagging
A drag-and-drop tagger
Automated tagging
Slide 29
Loans and Codeswitching
Non-indigenised codeswitching
ndifuna
Indigenised but non-standardised codeswitching loans
>ndiyakleyimisha? ndiyaklayimisha? ndiyafonisha? ndiyafowunisha?
Slide 30
Fixed Expressions
A continuum: idioms/proverbs, prefabricated expressions, collocations
How fixed is fixed?
Into yokuba (*izinto zokuba)
Nantso ke (*nantsi ke?)
(Ke) kaloku (ke)
Bafondini/mfondini
Undincedile
Ungadinwa nangomso
Slide 31
Fixed Expressions cntd
Flagging fixed phrases
Into_yokuba
Ke_kaloku_ke
Morphosyntactic tagging or not?
Ke >_kaloku >_ke > >
Or
Ke_kaloku_ke >
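Flagging a fixed phrase with underscores, as in Into_yokuba and Ke_kaloku_ke, amounts to joining a known word sequence into one token before tagging. The Python sketch below shows one way to do this, assuming the fixed phrases are supplied as a plain list of strings and that longer phrases take priority over shorter ones. `flag_fixed_phrases` and its matching strategy are illustrative assumptions, not the project's procedure.

```python
def flag_fixed_phrases(tokens, phrases):
    """Join known multi-word fixed expressions into single underscore-linked tokens."""
    # Try longer phrases first so that 'ke kaloku ke' wins over 'ke kaloku'.
    ordered = sorted(
        (tuple(p.lower().split()) for p in phrases if p.split()),
        key=len,
        reverse=True,
    )
    result = []
    i = 0
    while i < len(tokens):
        for phrase in ordered:
            window = tuple(t.lower() for t in tokens[i:i + len(phrase)])
            if window == phrase:
                # Keep the original spelling of the tokens; only add the underscores.
                result.append("_".join(tokens[i:i + len(phrase)]))
                i += len(phrase)
                break
        else:
            result.append(tokens[i])
            i += 1
    return result
```

Whether the joined token then receives one tag for the whole phrase or one tag per component is the open question the slide raises; this sketch only performs the flagging step.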
Slide 32
Spoken language reductions
Standardised reductions
Ngokuba > ngoba
Written standard reduction: reconstruction convention {} not used, i.e. *ngo{ku}ba
Non-standardised reductions
Musa ukuhamba > sukuhamba (wsr) > Suhamba (non-standardised)
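The reconstruction convention {} appears in the transcription examples in forms such as njengo{ku}ba and ndiyamthand{a}. Assuming the braces enclose material that the speaker dropped and the transcriber restored, both the spoken and the reconstructed form can be derived from one transcribed token. The Python sketch below illustrates this; the function names are hypothetical.

```python
import re

# A brace pair and the restored material inside it.
_RECONSTRUCTED = re.compile(r"\{([^{}]*)\}")


def spoken_form(token):
    """Form as pronounced: drop the braces together with the restored material."""
    return _RECONSTRUCTED.sub("", token)


def reconstructed_form(token):
    """Full form: keep the restored material and drop only the braces."""
    return _RECONSTRUCTED.sub(r"\1", token)
```

For a standardised reduction such as ngoba, the convention is not used, so both functions return the token unchanged.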
Morphophonological cntd
Elision
Andinamoto > andi >na > >m oto > >
Stem modifications
Emlanjeni > e >m >lanj >en i > >
Slide 36
Designing a tag set
Granularity
Lexical categories N, V (Tagging lexical categories is problematic in an agglutinating language)
Syntagmatic morphological slots
amadodana > a >ma >dod >ana >
Slide 37
Designing cntd
Paradigmatic instantiations within a syntagmatic slot
gnp = >--- >
Word categories
nje (wenjenje) > nje >; njalo >; njeya >
ke > ke > kaloku > ke > ke_kaloku_ke >
e >m >lanj >eni >??
Slide 38
Designing cntd
Spoken language expressions
Non-word like expressions
2 problems
1. Standardising orthographic representation
2. Tags
e: > mh: > uh_uh_uh >
Manual tagging
Manual tagging necessary for 3 reasons
Identifying tagging problems and problematic phenomena and revising the tag set
Developing a training corpus
Correcting automated tagging errors
Manual (typing) tagging not ideal
Tedious
Error-prone
Solution: Drag-and-drop tagger
Slide 41
Drag-and-drop tagger
Demonstration of drag-and-drop tagger