Upload
arva
View
42
Download
1
Tags:
Embed Size (px)
DESCRIPTION
Progress in Developing Spoken Language Corpus of Indigenous Languages in South Africa. Mtholeni N. Ngcobo and Nozibele Nomdebevana University of South Africa [email protected] [email protected]. Outline. Introduction The importance of the spoken corpus approach The description of SLCP - PowerPoint PPT Presentation
Citation preview
Progress in Developing Progress in Developing Spoken Language Corpus Spoken Language Corpus
of Indigenous Languages in of Indigenous Languages in South AfricaSouth Africa
Mtholeni N. Ngcobo and Nozibele Mtholeni N. Ngcobo and Nozibele NomdebevanaNomdebevana
University of South AfricaUniversity of South [email protected]@unisa.ac.za [email protected]@unisa.ac.za
OutlineOutline
1.1. IntroductionIntroduction
2.2. The importance of the spoken The importance of the spoken corpus approachcorpus approach
3.3. The description of SLCPThe description of SLCP
4.4. Progress and Problems in the Progress and Problems in the development of Spoken Language development of Spoken Language corpora of indigenous languagescorpora of indigenous languages
5.5. Recommended solutionsRecommended solutions
INTRODUCTIONINTRODUCTION Multilingual language policy – 11 official Multilingual language policy – 11 official
languages – 2 developed – 9 underdevelopedlanguages – 2 developed – 9 underdeveloped An urgent need to develop spoken language An urgent need to develop spoken language
corpus for indigenous languages: explained by corpus for indigenous languages: explained by Allwood and Hendrikse (2003)Allwood and Hendrikse (2003)
Corpus provides empirical research as opposed to Corpus provides empirical research as opposed to Chomskyan intuitive approachChomskyan intuitive approach
Written language corpus has already been done Written language corpus has already been done for some languages – i.e. Zulu and Sepedi (NS).for some languages – i.e. Zulu and Sepedi (NS).
But a good start in spoken language corpus – But a good start in spoken language corpus – Spoken Language Corpus Project (SLCP) – an Spoken Language Corpus Project (SLCP) – an open-ended corpus projectopen-ended corpus project
Started in 2000Started in 2000
Introd…Introd…
Collaboration – UNISA and Collaboration – UNISA and Gothenburg UniversityGothenburg University
Initially funded by NRF and SidaInitially funded by NRF and Sida UNISA is the host institutionUNISA is the host institution UNISA has now approved funds for UNISA has now approved funds for
the project as it falls under the the project as it falls under the strategic projectsstrategic projects
Goal: 1M words/tokens per languageGoal: 1M words/tokens per language
AIMSAIMS
Establish a corpus research centreEstablish a corpus research centre To adapt and develop Computational To adapt and develop Computational
linguistic software suitable for linguistic software suitable for agglutinating languages of South agglutinating languages of South AfricaAfrica
Develop indigenous languages of Develop indigenous languages of South AfricaSouth Africa
Understand the role of language and Understand the role of language and communication in real life situationscommunication in real life situations
The Importance of Spoken CorpusThe Importance of Spoken Corpus
Allwood and Hagman (1994) - Allwood and Hagman (1994) - Spoken languageSpoken language
- Fundamental trait of the human Fundamental trait of the human speciesspecies
- Integrated with the human brain and Integrated with the human brain and human societyhuman society
- There is a limited knowledge of There is a limited knowledge of spoken language as opposed to spoken language as opposed to written languagewritten language
The importance…The importance…
Corpus linguistics approach allows the use Corpus linguistics approach allows the use of statistical performance measures and of statistical performance measures and observation of language use in real life.observation of language use in real life.
This approach is in contrast with earlier This approach is in contrast with earlier Chomskyan linguistics which focused on Chomskyan linguistics which focused on ideal written language.ideal written language.
Allwood and Hagman 1994:1- Allwood and Hagman 1994:1- “ the progress in “ the progress in audio, video and computer technology enables us to record and audio, video and computer technology enables us to record and analyse spoken language without having to rely on either memory analyse spoken language without having to rely on either memory or written language.”or written language.”
Contrast at a glanceContrast at a glance Chomskyan linguistics focused on language competence Chomskyan linguistics focused on language competence
((languelangue) while corpus linguistics also considers language ) while corpus linguistics also considers language performance (performance (paroleparole) as important) as important
Chomskyan linguistics is unable to cope with many areas in Chomskyan linguistics is unable to cope with many areas in linguistic study, since the emphasis is put on the ideal linguistic study, since the emphasis is put on the ideal speaker/hearer to the exclusion of complexity/variationspeaker/hearer to the exclusion of complexity/variation
Chomskyan linguistics views language as an innate mental Chomskyan linguistics views language as an innate mental faculty while corpus linguistics views language as a social faculty while corpus linguistics views language as a social phenomenonphenomenon
Chomskyan linguistics relies on intuitive evidence whereas Chomskyan linguistics relies on intuitive evidence whereas corpus linguistics relies on empirical evidencecorpus linguistics relies on empirical evidence
Corpus linguistics looks at differences in languages while Corpus linguistics looks at differences in languages while Chomskyan linguistics concentrates on universalsChomskyan linguistics concentrates on universals
The focus of Chomskyan linguistics is on grammar (form) The focus of Chomskyan linguistics is on grammar (form) while corpus linguistics focuses also on meaning while corpus linguistics focuses also on meaning (semantics). (semantics).
The Description of SLCPThe Description of SLCP
First task: compilation of a body of First task: compilation of a body of texts (a corpus)texts (a corpus)
Computer: stores large quantities of Computer: stores large quantities of data and allows statistical data and allows statistical performance measures performance measures
Research potential: covers linguistic, Research potential: covers linguistic, social, cultural, educational, social, cultural, educational, technological, inter-lingual and inter-technological, inter-lingual and inter-communicational aspectscommunicational aspects
Descript…Descript… SLCP has chosen video recordings. Why? SLCP has chosen video recordings. Why? Allwood and Hendrikse (2003, 191) have Allwood and Hendrikse (2003, 191) have
mentioned the following reason: mentioned the following reason: “…face-to-face “…face-to-face spoken language is interactive, multimodal and context-spoken language is interactive, multimodal and context-dependent.” dependent.” So we want to capture all the So we want to capture all the dynamics of language in communicationdynamics of language in communication
Compilation is a process with four major phases:Compilation is a process with four major phases:1.1. collecting video recorded spoken language collecting video recorded spoken language
activitiesactivities2.2. Transcribing video recordingsTranscribing video recordings3.3. Quality control (checking and editing)Quality control (checking and editing)4.4. Annotation of raw dataAnnotation of raw data
Process diagramProcess diagram
CORPUS BUILDING
RECORDING TRANSCRIBINGQUALITY CONTROL
TAGGING
1. Recording phase1. Recording phase Biber et al. (1998:246) state that Biber et al. (1998:246) state that ““......a corpus is not simply a collection of texts. Rather, a a corpus is not simply a collection of texts. Rather, a
corpus seeks to represent a language or some part of a corpus seeks to represent a language or some part of a language. The appropriate design for corpus therefore language. The appropriate design for corpus therefore depends upon what it is meant to representdepends upon what it is meant to represent.” .”
Parameters: representativity of the corpus, Parameters: representativity of the corpus, control of variables in language varieties, control of variables in language varieties, recording, volume or size of the corpus and recording, volume or size of the corpus and length of each samplelength of each sample
In SLCP we use socio-economic activities as a In SLCP we use socio-economic activities as a representativity measure, e.g. meetings, representativity measure, e.g. meetings, sermons, interviews, etc.sermons, interviews, etc.
2. Transcription phase2. Transcription phase
Most crucialMost crucial Allwood and Hendrikse 2003:195 - Allwood and Hendrikse 2003:195 -
without transcriptions there will be without transcriptions there will be no computer readable corpusno computer readable corpus
Two components of a transcript: Two components of a transcript: Header and BodyHeader and Body
The Header E.G.The Header E.G.@ Recorded activity Identity code (ID): U-ZV-01-01-01@ Recorded activity Identity code (ID): U-ZV-01-01-01@ Name of recorder: Magda Altman, Brenda Gonzales@ Name of recorder: Magda Altman, Brenda Gonzales@ Duration of recorded activity: 3 hours @ Duration of recorded activity: 3 hours @ Recorded activity date: 2006-08-04@ Recorded activity date: 2006-08-04@ Recorded activity type: Interview with Traditional Healers@ Recorded activity type: Interview with Traditional Healers@ Recorded activity title: Interview with Traditional Healers@ Recorded activity title: Interview with Traditional Healers@ Short name: TH Interview 1@ Short name: TH Interview 1@ Recorded activity location: Queen Ntuli’s home, Folweni, Umbumbulu @ Recorded activity location: Queen Ntuli’s home, Folweni, Umbumbulu @Activity mode:@Activity mode: Face to face InterviewFace to face Interview@ Participant: B=F1 (Makhosi Queen Ntuli)@ Participant: B=F1 (Makhosi Queen Ntuli)@ Participant: M=F8 (Philisiwe Mkhize)@ Participant: M=F8 (Philisiwe Mkhize)@ Participant: K=F3 (Jabu Eunice Ncikazi)@ Participant: K=F3 (Jabu Eunice Ncikazi)@ Participant: J=F2 (Thokozile Shezi)@ Participant: J=F2 (Thokozile Shezi)@ Participant: G=F4 (Thembeni Roge Magubane)@ Participant: G=F4 (Thembeni Roge Magubane)@ @ Participant: H=GR (All participants)@ @ Participant: H=GR (All participants)@ Tape ID code: U-ZV-01-01@ Tape ID code: U-ZV-01-01@ Transcription name: U-ZV-01-01-T1@ Transcription name: U-ZV-01-01-T1@ Transcriber: Mtholeni@ Transcriber: Mtholeni@ Transcription date: @ Transcription date: @ Transcription system:@ Transcription system:@ Electronic checking@ Electronic checking@ Editor:@ Editor:@ Checker: @ Checker: @ Checking dates:@ Checking dates:@ Section: @ Section: @ Section:@ Section:@ Time coding@ Time coding@ Comment(s):@ Comment(s):
The BodyThe Body
3 types of lines in the transcription 3 types of lines in the transcription body:body:
§ - section line (the topic of discussion)§ - section line (the topic of discussion)
$ - contribution (interlocutor’s speech)$ - contribution (interlocutor’s speech)
@ - information line (comment)@ - information line (comment)
The Body…The Body…
Standardised orthography is used, Standardised orthography is used, but no capital letters or punctuation but no capital letters or punctuation marksmarks
Plain text format is used to make Plain text format is used to make transcription machine readabletranscription machine readable
Own communication management Own communication management (i.e. hesitations) and interactive (i.e. hesitations) and interactive communication management (i.e. communication management (i.e. feedback) are indicatedfeedback) are indicated
The Body...The Body...
Certain symbols are used to transcribe the Certain symbols are used to transcribe the following:following:
Elisions { } – curly bracketsElisions { } – curly brackets
Overlaps [ ] – square bracketsOverlaps [ ] – square brackets
Comments < > - angle bracketsComments < > - angle brackets
Pauses / or // or /// - slashes Pauses / or // or /// - slashes
Lengthening : - colonLengthening : - colon
Unclear speech (. . .) three bracketed dotsUnclear speech (. . .) three bracketed dots
The Body E.G.The Body E.G.
Example:Example: Elisions, overlaps, comments, Elisions, overlaps, comments, pauses, lengtheningpauses, lengthening
§ Religion§ Religion$A: uyakhonza konje$A: uyakhonza konje$B: ngiyakhonza ngiyamthand{a} <1 unkulunkulu>1 [ ] $B: ngiyakhonza ngiyamthand{a} <1 unkulunkulu>1 [ ]
ngiyamthanda angisoze ngimlahlengiyamthanda angisoze ngimlahle@ <name: person>@ <name: person>$A: [nanso_ke <1 sisi>1 // e: e:]$A: [nanso_ke <1 sisi>1 // e: e:]@ <adoptive: English: sister>@ <adoptive: English: sister>
Example: Example: Unclear speech and code-Unclear speech and code-switchingswitching
$Z: sekuphoqelekile ukuba (. . .) <1 neclaim>1 futhi (. . .) <2 that is $Z: sekuphoqelekile ukuba (. . .) <1 neclaim>1 futhi (. . .) <2 that is why>2 <3 ngiclaimile>3why>2 <3 ngiclaimile>3
@ <1 code-mix: English>@ <1 code-mix: English>@ <2 code-switch: English>@ <2 code-switch: English>@ <3 code-mix: English>@ <3 code-mix: English>
3. The checking phase3. The checking phase
The transcription is manually checked by The transcription is manually checked by another person than the transcriber to another person than the transcriber to ensure quality control and reliabilityensure quality control and reliability
The transcription is also checked The transcription is also checked electronically for correctness of format electronically for correctness of format before it is inserted into the corpusbefore it is inserted into the corpus
We currently use a GTS checking tool to We currently use a GTS checking tool to monitor compliance with the transcription monitor compliance with the transcription standards.standards.
4. The tagging phase4. The tagging phase
The process whereby the corpus is The process whereby the corpus is annotated by means of various tags -annotated by means of various tags -enriching a raw corpus with grammatical enriching a raw corpus with grammatical tagstags..
E.G. E.G. abantwana - abantwana - aa«prepref»«prepref»baba«pref»«pref»ntwantwa«nstem»«nstem»anaana«dimsuf»«dimsuf»
Corpus driven approach (information Corpus driven approach (information retrieved from raw data) retrieved from raw data) vs.vs. corpus based corpus based approach (information retrieved from an approach (information retrieved from an annotated corpus)annotated corpus)
4. Tagging …4. Tagging … Allwood and Hendrikse (2003, 199) argue Allwood and Hendrikse (2003, 199) argue
that while the corpus driven approach that while the corpus driven approach works well with isolating languages, in works well with isolating languages, in agglutinating languages the corpus based agglutinating languages the corpus based approach may be used. They also note approach may be used. They also note that Leech (1991) has warned against the that Leech (1991) has warned against the danger of bias underlying any form of danger of bias underlying any form of annotation. However, they argue that the annotation. However, they argue that the tagging of corpora is now fairly general tagging of corpora is now fairly general practice (Allwood and Hendrikse 2003, practice (Allwood and Hendrikse 2003, 199). The tagging set for the agglutinating 199). The tagging set for the agglutinating languages has been discussed in detail by languages has been discussed in detail by Allwood et al (2003).Allwood et al (2003).
Progress and success in SLCPProgress and success in SLCP
Only Xhosa out of the nine languages Only Xhosa out of the nine languages has been able to show greater has been able to show greater progressprogress
Why? It was used for piloting the Why? It was used for piloting the project and has a consistent project and has a consistent transcribertranscriber
Zulu is following behind with almost Zulu is following behind with almost 20 000 transcribed tokens so far20 000 transcribed tokens so far
Progress…Progress…
Number of Tokens per language
0
50000
100000
150000
200000
250000
300000
350000
Ndebele
N.Sot
ho
S.Sot
hoSwat
i
Tsong
a
Tswan
a
Venda
Xhosa
Zulu
Progress…Progress…
RecordingsRecordings Audio Audio VideoVideo N. of recordingsN. of recordings 3333 112112 HoursHours 3838 128128 Un-transcribedUn-transcribed 22 1717 TranscribedTranscribed 3131 6868 CheckedChecked 3131 5454 TokensTokens 45 72345 723 201 201
292292
Progress…Progress…
We also have some un-transcribed We also have some un-transcribed recordings for Tsonga, especially for recordings for Tsonga, especially for children speechchildren speech
People behind the progress - the People behind the progress - the corpus group - share on issues of corpus group - share on issues of progress, motivating one another progress, motivating one another and presenting on key research and presenting on key research aspects of the projectaspects of the project
ProblemsProblems Little or nothing is currently happening in the Little or nothing is currently happening in the
development of corpora for the remaining official development of corpora for the remaining official languageslanguages
Lack of appropriate monitoring - some of the Lack of appropriate monitoring - some of the video recordings get damaged and some of the video recordings get damaged and some of the digitized recordings have been lostdigitized recordings have been lost
Poor quality of recordingsPoor quality of recordings Lost dataLost data Uncoordinated individual activities Uncoordinated individual activities Insufficient tools, financial and human resourcesInsufficient tools, financial and human resources ……etc.etc.
Recommendations and solutions: Recommendations and solutions: hope for the futurehope for the future
Ultimately: A fully developed spoken Ultimately: A fully developed spoken corpus resource centre – corpus resource centre –
the establishment of a resource the establishment of a resource centre will not only be a sign of centre will not only be a sign of growth and prosperity, but also a growth and prosperity, but also a sign of an investment in the future of sign of an investment in the future of languages and their speakers.languages and their speakers.
Recommend…Recommend… To this end, the following recommendations need to be considered:To this end, the following recommendations need to be considered:1.1. More recordings and more trained transcribers are required in order to More recordings and more trained transcribers are required in order to
expedite the process.expedite the process.2.2. Recorders and transcribers for all the languages should be remunerated to Recorders and transcribers for all the languages should be remunerated to
encourage them to do more in their work.encourage them to do more in their work.3.3. A network with other institutions, such as universities, should be created.A network with other institutions, such as universities, should be created.4.4. Short and medium term corpus development targets should be set up.Short and medium term corpus development targets should be set up.5.5. A server dedicated to corpus must be established as a matter of urgency.A server dedicated to corpus must be established as a matter of urgency.6.6. A properly structured corpus archive must be set up and maintained by a A properly structured corpus archive must be set up and maintained by a
web master.web master.7.7. All the various corpus-related projects should be re-organised under one All the various corpus-related projects should be re-organised under one
corpus management structure.corpus management structure.8.8. Corpus maintenance, tagging and mining tools designed for the Corpus maintenance, tagging and mining tools designed for the
agglutinating languages and other peculiar searches (e.g. communicative agglutinating languages and other peculiar searches (e.g. communicative gestures) must be developed.gestures) must be developed.
9.9. Preliminary corpus mining should begin for the benefit of the tool Preliminary corpus mining should begin for the benefit of the tool development enterprise and to encourage the use of corpora for language development enterprise and to encourage the use of corpora for language research and development.research and development.
This will lead to the establishment of a dedicated corpus publication series This will lead to the establishment of a dedicated corpus publication series for the indigenous languages of South Africa.for the indigenous languages of South Africa.
ReferencesReferences Allwood J, Grönqvist L, and Hendrikse AP. 2003. Developing Allwood J, Grönqvist L, and Hendrikse AP. 2003. Developing
a tag set and tagger for the African Languages of South a tag set and tagger for the African Languages of South Africa with special reference to Xhosa. Africa with special reference to Xhosa. Southern African Southern African Linguistics and Applied Language StudiesLinguistics and Applied Language Studies 21 (4) 221-235. 21 (4) 221-235.
Alwood J and Hagman J. 1994. Some simple automatic Alwood J and Hagman J. 1994. Some simple automatic measures of spoken interaction. measures of spoken interaction. Proc. Of the 14th Scand. Proc. Of the 14th Scand. Conf. of Linguistics & 8th Conf. of Nordic and Gen. Conf. of Linguistics & 8th Conf. of Nordic and Gen. Linguistics, Vol. 72, Univ. of Göteborg.Linguistics, Vol. 72, Univ. of Göteborg.
Allwood J and Hendrikse AP. 2003. Spoken language Allwood J and Hendrikse AP. 2003. Spoken language corpora for the nine official languages of South Africa. corpora for the nine official languages of South Africa. Southern African Linguistics and Applied Language StudiesSouthern African Linguistics and Applied Language Studies 21 (4) 187-199.21 (4) 187-199.
Biber D, Condrad S, and Repen R. 1998. Biber D, Condrad S, and Repen R. 1998. Corpus Linguistics: Corpus Linguistics: Investigating Language Structure and UseInvestigating Language Structure and Use. Cambridge: . Cambridge: Cambridge University Press.Cambridge University Press.