Progress in Developing Spoken Language Corpus of Indigenous Languages in South Africa

Progress in Developing Progress in Developing Spoken Language Corpus Spoken Language Corpus

of Indigenous Languages in of Indigenous Languages in South AfricaSouth Africa

Mtholeni N. Ngcobo and Nozibele Mtholeni N. Ngcobo and Nozibele NomdebevanaNomdebevana

University of South AfricaUniversity of South [email protected]@unisa.ac.za [email protected]@unisa.ac.za

OutlineOutline

1.1. IntroductionIntroduction

2.2. The importance of the spoken The importance of the spoken corpus approachcorpus approach

3.3. The description of SLCPThe description of SLCP

4.4. Progress and Problems in the Progress and Problems in the development of Spoken Language development of Spoken Language corpora of indigenous languagescorpora of indigenous languages

5.5. Recommended solutionsRecommended solutions

INTRODUCTIONINTRODUCTION Multilingual language policy – 11 official Multilingual language policy – 11 official

languages – 2 developed – 9 underdevelopedlanguages – 2 developed – 9 underdeveloped An urgent need to develop spoken language An urgent need to develop spoken language

corpus for indigenous languages: explained by corpus for indigenous languages: explained by Allwood and Hendrikse (2003)Allwood and Hendrikse (2003)

Corpus provides empirical research as opposed to Corpus provides empirical research as opposed to Chomskyan intuitive approachChomskyan intuitive approach

Written language corpus has already been done Written language corpus has already been done for some languages – i.e. Zulu and Sepedi (NS).for some languages – i.e. Zulu and Sepedi (NS).

But a good start in spoken language corpus – But a good start in spoken language corpus – Spoken Language Corpus Project (SLCP) – an Spoken Language Corpus Project (SLCP) – an open-ended corpus projectopen-ended corpus project

Started in 2000Started in 2000

Introd…Introd…

Collaboration – UNISA and Collaboration – UNISA and Gothenburg UniversityGothenburg University

Initially funded by NRF and SidaInitially funded by NRF and Sida UNISA is the host institutionUNISA is the host institution UNISA has now approved funds for UNISA has now approved funds for

the project as it falls under the the project as it falls under the strategic projectsstrategic projects

Goal: 1M words/tokens per languageGoal: 1M words/tokens per language

AIMSAIMS

Establish a corpus research centreEstablish a corpus research centre To adapt and develop Computational To adapt and develop Computational

linguistic software suitable for linguistic software suitable for agglutinating languages of South agglutinating languages of South AfricaAfrica

Develop indigenous languages of Develop indigenous languages of South AfricaSouth Africa

Understand the role of language and Understand the role of language and communication in real life situationscommunication in real life situations

The Importance of Spoken CorpusThe Importance of Spoken Corpus

Allwood and Hagman (1994) - Allwood and Hagman (1994) - Spoken languageSpoken language

- Fundamental trait of the human Fundamental trait of the human speciesspecies

- Integrated with the human brain and Integrated with the human brain and human societyhuman society

- There is a limited knowledge of There is a limited knowledge of spoken language as opposed to spoken language as opposed to written languagewritten language

The importance…The importance…

Corpus linguistics approach allows the use Corpus linguistics approach allows the use of statistical performance measures and of statistical performance measures and observation of language use in real life.observation of language use in real life.

This approach is in contrast with earlier This approach is in contrast with earlier Chomskyan linguistics which focused on Chomskyan linguistics which focused on ideal written language.ideal written language.

Allwood and Hagman 1994:1- Allwood and Hagman 1994:1- “ the progress in “ the progress in audio, video and computer technology enables us to record and audio, video and computer technology enables us to record and analyse spoken language without having to rely on either memory analyse spoken language without having to rely on either memory or written language.”or written language.”

Contrast at a glanceContrast at a glance Chomskyan linguistics focused on language competence Chomskyan linguistics focused on language competence

((languelangue) while corpus linguistics also considers language ) while corpus linguistics also considers language performance (performance (paroleparole) as important) as important

Chomskyan linguistics is unable to cope with many areas in Chomskyan linguistics is unable to cope with many areas in linguistic study, since the emphasis is put on the ideal linguistic study, since the emphasis is put on the ideal speaker/hearer to the exclusion of complexity/variationspeaker/hearer to the exclusion of complexity/variation

Chomskyan linguistics views language as an innate mental Chomskyan linguistics views language as an innate mental faculty while corpus linguistics views language as a social faculty while corpus linguistics views language as a social phenomenonphenomenon

Chomskyan linguistics relies on intuitive evidence whereas Chomskyan linguistics relies on intuitive evidence whereas corpus linguistics relies on empirical evidencecorpus linguistics relies on empirical evidence

Corpus linguistics looks at differences in languages while Corpus linguistics looks at differences in languages while Chomskyan linguistics concentrates on universalsChomskyan linguistics concentrates on universals

The focus of Chomskyan linguistics is on grammar (form) The focus of Chomskyan linguistics is on grammar (form) while corpus linguistics focuses also on meaning while corpus linguistics focuses also on meaning (semantics). (semantics).

The Description of SLCPThe Description of SLCP

First task: compilation of a body of First task: compilation of a body of texts (a corpus)texts (a corpus)

Computer: stores large quantities of Computer: stores large quantities of data and allows statistical data and allows statistical performance measures performance measures

Research potential: covers linguistic, Research potential: covers linguistic, social, cultural, educational, social, cultural, educational, technological, inter-lingual and inter-technological, inter-lingual and inter-communicational aspectscommunicational aspects

Descript…Descript… SLCP has chosen video recordings. Why? SLCP has chosen video recordings. Why? Allwood and Hendrikse (2003, 191) have Allwood and Hendrikse (2003, 191) have

mentioned the following reason: mentioned the following reason: “…face-to-face “…face-to-face spoken language is interactive, multimodal and context-spoken language is interactive, multimodal and context-dependent.” dependent.” So we want to capture all the So we want to capture all the dynamics of language in communicationdynamics of language in communication

Compilation is a process with four major phases:Compilation is a process with four major phases:1.1. collecting video recorded spoken language collecting video recorded spoken language

activitiesactivities2.2. Transcribing video recordingsTranscribing video recordings3.3. Quality control (checking and editing)Quality control (checking and editing)4.4. Annotation of raw dataAnnotation of raw data

Process diagramProcess diagram

CORPUS BUILDING

RECORDING TRANSCRIBINGQUALITY CONTROL

TAGGING

1. Recording phase1. Recording phase Biber et al. (1998:246) state that Biber et al. (1998:246) state that ““......a corpus is not simply a collection of texts. Rather, a a corpus is not simply a collection of texts. Rather, a

corpus seeks to represent a language or some part of a corpus seeks to represent a language or some part of a language. The appropriate design for corpus therefore language. The appropriate design for corpus therefore depends upon what it is meant to representdepends upon what it is meant to represent.” .”

Parameters: representativity of the corpus, Parameters: representativity of the corpus, control of variables in language varieties, control of variables in language varieties, recording, volume or size of the corpus and recording, volume or size of the corpus and length of each samplelength of each sample

In SLCP we use socio-economic activities as a In SLCP we use socio-economic activities as a representativity measure, e.g. meetings, representativity measure, e.g. meetings, sermons, interviews, etc.sermons, interviews, etc.

2. Transcription phase2. Transcription phase

Most crucialMost crucial Allwood and Hendrikse 2003:195 - Allwood and Hendrikse 2003:195 -

without transcriptions there will be without transcriptions there will be no computer readable corpusno computer readable corpus

Two components of a transcript: Two components of a transcript: Header and BodyHeader and Body

The Header E.G.The Header E.G.@ Recorded activity Identity code (ID): U-ZV-01-01-01@ Recorded activity Identity code (ID): U-ZV-01-01-01@ Name of recorder: Magda Altman, Brenda Gonzales@ Name of recorder: Magda Altman, Brenda Gonzales@ Duration of recorded activity: 3 hours @ Duration of recorded activity: 3 hours @ Recorded activity date: 2006-08-04@ Recorded activity date: 2006-08-04@ Recorded activity type: Interview with Traditional Healers@ Recorded activity type: Interview with Traditional Healers@ Recorded activity title: Interview with Traditional Healers@ Recorded activity title: Interview with Traditional Healers@ Short name: TH Interview 1@ Short name: TH Interview 1@ Recorded activity location: Queen Ntuli’s home, Folweni, Umbumbulu @ Recorded activity location: Queen Ntuli’s home, Folweni, Umbumbulu @Activity mode:@Activity mode: Face to face InterviewFace to face Interview@ Participant: B=F1 (Makhosi Queen Ntuli)@ Participant: B=F1 (Makhosi Queen Ntuli)@ Participant: M=F8 (Philisiwe Mkhize)@ Participant: M=F8 (Philisiwe Mkhize)@ Participant: K=F3 (Jabu Eunice Ncikazi)@ Participant: K=F3 (Jabu Eunice Ncikazi)@ Participant: J=F2 (Thokozile Shezi)@ Participant: J=F2 (Thokozile Shezi)@ Participant: G=F4 (Thembeni Roge Magubane)@ Participant: G=F4 (Thembeni Roge Magubane)@ @ Participant: H=GR (All participants)@ @ Participant: H=GR (All participants)@ Tape ID code: U-ZV-01-01@ Tape ID code: U-ZV-01-01@ Transcription name: U-ZV-01-01-T1@ Transcription name: U-ZV-01-01-T1@ Transcriber: Mtholeni@ Transcriber: Mtholeni@ Transcription date: @ Transcription date: @ Transcription system:@ Transcription system:@ Electronic checking@ Electronic checking@ Editor:@ Editor:@ Checker: @ Checker: @ Checking dates:@ Checking dates:@ Section: @ Section: @ Section:@ Section:@ Time coding@ Time coding@ Comment(s):@ Comment(s):

The BodyThe Body

3 types of lines in the transcription 3 types of lines in the transcription body:body:

§ - section line (the topic of discussion)§ - section line (the topic of discussion)

$ - contribution (interlocutor’s speech)$ - contribution (interlocutor’s speech)

@ - information line (comment)@ - information line (comment)

The Body…The Body…

Standardised orthography is used, Standardised orthography is used, but no capital letters or punctuation but no capital letters or punctuation marksmarks

Plain text format is used to make Plain text format is used to make transcription machine readabletranscription machine readable

Own communication management Own communication management (i.e. hesitations) and interactive (i.e. hesitations) and interactive communication management (i.e. communication management (i.e. feedback) are indicatedfeedback) are indicated

The Body...The Body...

Certain symbols are used to transcribe the Certain symbols are used to transcribe the following:following:

Elisions { } – curly bracketsElisions { } – curly brackets

Overlaps [ ] – square bracketsOverlaps [ ] – square brackets

Comments < > - angle bracketsComments < > - angle brackets

Pauses / or // or /// - slashes Pauses / or // or /// - slashes

Lengthening : - colonLengthening : - colon

Unclear speech (. . .) three bracketed dotsUnclear speech (. . .) three bracketed dots

The Body E.G.The Body E.G.

Example:Example: Elisions, overlaps, comments, Elisions, overlaps, comments, pauses, lengtheningpauses, lengthening

§ Religion§ Religion$A: uyakhonza konje$A: uyakhonza konje$B: ngiyakhonza ngiyamthand{a} <1 unkulunkulu>1 [ ] $B: ngiyakhonza ngiyamthand{a} <1 unkulunkulu>1 [ ]

ngiyamthanda angisoze ngimlahlengiyamthanda angisoze ngimlahle@ <name: person>@ <name: person>$A: [nanso_ke <1 sisi>1 // e: e:]$A: [nanso_ke <1 sisi>1 // e: e:]@ <adoptive: English: sister>@ <adoptive: English: sister>

Example: Example: Unclear speech and code-Unclear speech and code-switchingswitching

$Z: sekuphoqelekile ukuba (. . .) <1 neclaim>1 futhi (. . .) <2 that is $Z: sekuphoqelekile ukuba (. . .) <1 neclaim>1 futhi (. . .) <2 that is why>2 <3 ngiclaimile>3why>2 <3 ngiclaimile>3

@ <1 code-mix: English>@ <1 code-mix: English>@ <2 code-switch: English>@ <2 code-switch: English>@ <3 code-mix: English>@ <3 code-mix: English>

3. The checking phase3. The checking phase

The transcription is manually checked by The transcription is manually checked by another person than the transcriber to another person than the transcriber to ensure quality control and reliabilityensure quality control and reliability

The transcription is also checked The transcription is also checked electronically for correctness of format electronically for correctness of format before it is inserted into the corpusbefore it is inserted into the corpus

We currently use a GTS checking tool to We currently use a GTS checking tool to monitor compliance with the transcription monitor compliance with the transcription standards.standards.

4. The tagging phase4. The tagging phase

The process whereby the corpus is The process whereby the corpus is annotated by means of various tags -annotated by means of various tags -enriching a raw corpus with grammatical enriching a raw corpus with grammatical tagstags..

E.G. E.G. abantwana - abantwana - aa«prepref»«prepref»baba«pref»«pref»ntwantwa«nstem»«nstem»anaana«dimsuf»«dimsuf»

Corpus driven approach (information Corpus driven approach (information retrieved from raw data) retrieved from raw data) vs.vs. corpus based corpus based approach (information retrieved from an approach (information retrieved from an annotated corpus)annotated corpus)

4. Tagging …4. Tagging … Allwood and Hendrikse (2003, 199) argue Allwood and Hendrikse (2003, 199) argue

that while the corpus driven approach that while the corpus driven approach works well with isolating languages, in works well with isolating languages, in agglutinating languages the corpus based agglutinating languages the corpus based approach may be used. They also note approach may be used. They also note that Leech (1991) has warned against the that Leech (1991) has warned against the danger of bias underlying any form of danger of bias underlying any form of annotation. However, they argue that the annotation. However, they argue that the tagging of corpora is now fairly general tagging of corpora is now fairly general practice (Allwood and Hendrikse 2003, practice (Allwood and Hendrikse 2003, 199). The tagging set for the agglutinating 199). The tagging set for the agglutinating languages has been discussed in detail by languages has been discussed in detail by Allwood et al (2003).Allwood et al (2003).

Progress and success in SLCPProgress and success in SLCP

Only Xhosa out of the nine languages Only Xhosa out of the nine languages has been able to show greater has been able to show greater progressprogress

Why? It was used for piloting the Why? It was used for piloting the project and has a consistent project and has a consistent transcribertranscriber

Zulu is following behind with almost Zulu is following behind with almost 20 000 transcribed tokens so far20 000 transcribed tokens so far

Progress…Progress…

Number of Tokens per language

0

50000

100000

150000

200000

250000

300000

350000

Ndebele

N.Sot

ho

S.Sot

hoSwat

i

Tsong

a

Tswan

a

Venda

Xhosa

Zulu


RecordingsRecordings Audio Audio VideoVideo N. of recordingsN. of recordings 3333 112112 HoursHours 3838 128128 Un-transcribedUn-transcribed 22 1717 TranscribedTranscribed 3131 6868 CheckedChecked 3131 5454 TokensTokens 45 72345 723 201 201

292292


We also have some un-transcribed We also have some un-transcribed recordings for Tsonga, especially for recordings for Tsonga, especially for children speechchildren speech

People behind the progress - the People behind the progress - the corpus group - share on issues of corpus group - share on issues of progress, motivating one another progress, motivating one another and presenting on key research and presenting on key research aspects of the projectaspects of the project

ProblemsProblems Little or nothing is currently happening in the Little or nothing is currently happening in the

development of corpora for the remaining official development of corpora for the remaining official languageslanguages

Lack of appropriate monitoring - some of the Lack of appropriate monitoring - some of the video recordings get damaged and some of the video recordings get damaged and some of the digitized recordings have been lostdigitized recordings have been lost

Poor quality of recordingsPoor quality of recordings Lost dataLost data Uncoordinated individual activities Uncoordinated individual activities Insufficient tools, financial and human resourcesInsufficient tools, financial and human resources ……etc.etc.

Recommendations and solutions: Recommendations and solutions: hope for the futurehope for the future

Ultimately: A fully developed spoken Ultimately: A fully developed spoken corpus resource centre – corpus resource centre –

the establishment of a resource the establishment of a resource centre will not only be a sign of centre will not only be a sign of growth and prosperity, but also a growth and prosperity, but also a sign of an investment in the future of sign of an investment in the future of languages and their speakers.languages and their speakers.

Recommend…Recommend… To this end, the following recommendations need to be considered:To this end, the following recommendations need to be considered:1.1. More recordings and more trained transcribers are required in order to More recordings and more trained transcribers are required in order to

expedite the process.expedite the process.2.2. Recorders and transcribers for all the languages should be remunerated to Recorders and transcribers for all the languages should be remunerated to

encourage them to do more in their work.encourage them to do more in their work.3.3. A network with other institutions, such as universities, should be created.A network with other institutions, such as universities, should be created.4.4. Short and medium term corpus development targets should be set up.Short and medium term corpus development targets should be set up.5.5. A server dedicated to corpus must be established as a matter of urgency.A server dedicated to corpus must be established as a matter of urgency.6.6. A properly structured corpus archive must be set up and maintained by a A properly structured corpus archive must be set up and maintained by a

web master.web master.7.7. All the various corpus-related projects should be re-organised under one All the various corpus-related projects should be re-organised under one

corpus management structure.corpus management structure.8.8. Corpus maintenance, tagging and mining tools designed for the Corpus maintenance, tagging and mining tools designed for the

agglutinating languages and other peculiar searches (e.g. communicative agglutinating languages and other peculiar searches (e.g. communicative gestures) must be developed.gestures) must be developed.

9.9. Preliminary corpus mining should begin for the benefit of the tool Preliminary corpus mining should begin for the benefit of the tool development enterprise and to encourage the use of corpora for language development enterprise and to encourage the use of corpora for language research and development.research and development.

This will lead to the establishment of a dedicated corpus publication series This will lead to the establishment of a dedicated corpus publication series for the indigenous languages of South Africa.for the indigenous languages of South Africa.

ReferencesReferences Allwood J, Grönqvist L, and Hendrikse AP. 2003. Developing Allwood J, Grönqvist L, and Hendrikse AP. 2003. Developing

a tag set and tagger for the African Languages of South a tag set and tagger for the African Languages of South Africa with special reference to Xhosa. Africa with special reference to Xhosa. Southern African Southern African Linguistics and Applied Language StudiesLinguistics and Applied Language Studies 21 (4) 221-235. 21 (4) 221-235.

Alwood J and Hagman J. 1994. Some simple automatic Alwood J and Hagman J. 1994. Some simple automatic measures of spoken interaction. measures of spoken interaction. Proc. Of the 14th Scand. Proc. Of the 14th Scand. Conf. of Linguistics & 8th Conf. of Nordic and Gen. Conf. of Linguistics & 8th Conf. of Nordic and Gen. Linguistics, Vol. 72, Univ. of Göteborg.Linguistics, Vol. 72, Univ. of Göteborg.

Allwood J and Hendrikse AP. 2003. Spoken language Allwood J and Hendrikse AP. 2003. Spoken language corpora for the nine official languages of South Africa. corpora for the nine official languages of South Africa. Southern African Linguistics and Applied Language StudiesSouthern African Linguistics and Applied Language Studies 21 (4) 187-199.21 (4) 187-199.

Biber D, Condrad S, and Repen R. 1998. Biber D, Condrad S, and Repen R. 1998. Corpus Linguistics: Corpus Linguistics: Investigating Language Structure and UseInvestigating Language Structure and Use. Cambridge: . Cambridge: Cambridge University Press.Cambridge University Press.

Documents

Progress in Developing Spoken Language Corpus of Indigenous Languages in South Africa