Compiling a Spoken Chinese Corpus of Situated Discourse Gu Yueguo The Institute of Linguistics The Chinese Academy of Social Sciences


Page 1

Compiling a Spoken Chinese Corpus of Situated Discourse

Gu Yueguo

The Institute of Linguistics

The Chinese Academy of Social Sciences

Page 2

Corpora Overview

Spoken Chinese Corpora

• A corpus of situated discourse

• A corpus of major dialects

• A corpus of speech

Written Chinese Corpora

• A corpus of contemporary written Chinese

• A corpus of Pre-Qing written Chinese

Page 3

Main headings

Components of the compiling process

1. Real world discourse: what is it?

2. Recording

3. Encoding

   1. Transcription (a)

   2. Transcription (b)

   3. Mark-up

   4. Tagging

4. Application

Page 4

[Flow diagram of the compiling process, with the stages: (0) ‘real world’ spoken discourse; (1) Recording; (2a) Character transcription; (2b) Transcription for a special purpose; (3) Mark-up; (4) Coding; (5) Application]

Page 5

0

Discourse in the Real World

Page 6

Spoken Chinese

Single speaker

• No preparation: e.g. talk to oneself

• Topics pre-set with no preparation: e.g. narrate a personal story

• Topics pre-set with no written preparation: e.g. oral exam

• Talking based on a written script: e.g. soliloquy, 1-person cross talk

• Reading a written script: e.g. news reading, reading practice

Two or more speakers

• No preparation: *e.g. everyday talks

• Topics pre-set with no preparation: *e.g. sports salon

• Topics pre-set with no written preparation: *e.g. press interview

• Talking based on a written script: e.g. acting, cross talk

• Reading a written script: e.g. collective reciting

Page 7

Real world situated discourse

(1) It is situated in an actual social situation;

(2) It is situated in relation to actual users;

(3) It is situated in an inter-subjective world of discourse;

(4) It is situated in relation to actual goals;

(5) It is situated in a spatial and temporal setting;

(6) It is situated in the cognitive capacity of actual users;

(7) It is situated in the performance contingencies of actual users who are engaged in spontaneous talking with little pre-planning.

Page 8

[Diagram: an academic ("Academic") at the centre of three workplace networks, Building X, Building Y and Building Z, over the weekdays Mon, Tues, Wedn, Thurs and Fri. Contacts shown: clerks, colleagues, staff meeting, phone calls, F-staff, visitors, ZWF, students, other colleagues, Colleague 1, Colleague 2, Prjct team 1, Prjct team 2, Prjct team 3]

Page 9

[Diagram: an academic ("Academic") outside the workplace. Residential Building (Mon-Fri): wife, son, kindergarten, markets, neighbours, swimming pool. Summer Resort (Sat, Sun): senior managers, research center staff, conference organizers, hotel staff, sports playmates]

Page 10

Talking and Doing Interwoven in the Real World

1. Talking is the task, e.g. meeting, seminar (it is task-oriented, task-goal-directed, and segmented on the basis of the goal-attaining process; note that turn-taking rules are based on this type of talking-task relation)

2. Talking is the main constitutive part of the task, e.g. some classroom discourse, doctor-patient discourse (it is task-oriented, task-goal-directed, and segmented on the basis of the goal-attaining process)

3. Talking is a constitutive part of the task, e.g. giving instructions from time to time (task performance is dominant; talking tends to be fragmented)

4. Talking and task run in conflicting parallel; the achievement of the latter serves as a means to the goal of the former, e.g. business dinner (business table talk) (note that segmenting this kind of talk can be based on the task)

5. Talking is an embedded social part of the task, e.g. talking over the meal (talking has no specific goal to reach)

6. Talking is a decorative part of the task, e.g. talking accompanying tea-making

7. Talking is a hindrance to the task, e.g. talking during a written exam

8. Talking and task are independent of each other

Page 11

Micro performance analysis of five-minute activities

Columns: spatial-temporal span; relations between acts; doing; relation between doing and talking; talking

00:00-1:15

• Relations between acts: parallel and independent

• Doing: X helps himself to noodles; Y sorts out the things on the table

• Relation between doing and talking: conflictive; parallel and independent

• Talking: X and Y gossip

1:27-2:6

• Relations between acts: parallel and independent

• Doing: X sorts out the bowl and the chopsticks; Y switches on the computer

• Relation between doing and talking: parallel and relevant

• Talking: X and Y talk about the journal editing

2:11-3:06

• Relations between acts: parallel and independent

• Doing: X sorts out the things on the table; Y continues to sort out the things

• Relation between doing and talking: parallel and independent

• Talking: Y talks to X about a politician

3:19-4:25

• Relations between acts: parallel and independent

• Doing: X starts to reinstall his computer; Y starts to do the layout on computer

• Relation between doing and talking: parallel and relevant

• Talking: X talks to Y about the Journal layout

4:34-4:40

• Relations between acts: parallel and independent

• Doing: X continues reinstalling; Y continues doing the layout

• Relation between doing and talking: parallel and relevant

• Talking: X continues talking to Y about the Journal editing

Page 12

Sampling: Whose job?

Sinclair (1991:13) writes:

The specification of a corpus --- the types and proportions of material in it --- is hardly a job for linguists at all, but more appropriate to the sociology of culture. The stance of the linguist should be a readiness to describe and analyse any instances of language placed before him or her. In the infancy of the discipline of corpus linguistics, the linguists have to do the text selection as well; when the impact of the work is felt more widely, it may be possible to hand over this responsibility to language-oriented social scientists.

Page 13

The standard variety approach

It is arguable that Putonghua should be chosen as the target language, ruling other dialects out of the picture. There are at least two major reasons for doing so. First, Putonghua is the standard language used by the media and in education. Second, other spoken corpora have also adopted the standard variety.

Page 14

Criticisms of the standard variety approach

The approach is subject to serious criticisms relating to the preservation of the naturalness of language use. The standard variety is given its identity before the corpus is compiled, so the corpus cannot be used to represent its naturalness, nor to establish or demonstrate its identity. … what the compilers believe Putonghua looks like. Subjective judgment is also involved in sampling Putonghua speakers by filtering out non-standard speakers. … Unless they are ‘commissioned’ to talk among themselves, the activities in which the standard and non-standard interactants are engaged have to be properly filtered as well.

Page 15

The sampling: The workplace approach

It is true that situated discourses are unlimited in number. However, the types of social situations in which they are situated can in theory be exhaustively listed. According to the Beijing Yellow Book 1999, there are 67,783 social work units, which we divide into 6 major categories and 31 sub-categories.

Page 16

6 major categories of social work units

Code  Category                                     Units    Share
01    Government, Parties and Other Social Bodies    4823    7.12%
02    Economic organizations                        53838   79.43%
03    Education, research and arts                   6840   10.09%
04    Health, sports, and social welfare             1365    2.01%
05    Public welfare                                  890    1.31%
06    Military                                         27    0.04%
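The shares in this table follow directly from the unit counts. A minimal Python sketch of that arithmetic, assuming only the six counts listed on this slide; the names `UNIT_COUNTS` and `category_shares` are illustrative and do not come from the project:

```python
# Unit counts per major category, keyed by the two-digit codes used on the slide.
UNIT_COUNTS = {
    "01": 4823,   # government, parties and other social bodies
    "02": 53838,  # economic organizations
    "03": 6840,   # education, research and arts
    "04": 1365,   # health, sports, and social welfare
    "05": 890,    # public welfare
    "06": 27,     # military
}


def category_shares(counts):
    """Return each category's share of all units, in percent, rounded to two places."""
    total = sum(counts.values())
    return {code: round(100 * n / total, 2) for code, n in counts.items()}


if __name__ == "__main__":
    print("total units:", sum(UNIT_COUNTS.values()))
    for code, share in category_shares(UNIT_COUNTS).items():
        print(f"{code}: {share:.2f}%")
```

The six counts add up to the 67,783 units quoted from the Beijing Yellow Book 1999, so the computed shares add up to 100%.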

Page 17

No  Descriptive title                            Files   Size (bytes)
 1  accident mediation 1                             5     23,369,326
 2  accident mediation 2                             8     30,944,114
 3  Administrative meetings                        107    561,000,000
 4  assessment meeting                               6     68,500,000
 5  auction                                         30    158,000,000
 6  bfsu meeting                                    14     66,200,000
 7  Birthday celebration                            10     43,100,000
 8  btvu seminar                                    26    138,000,000
 9  bus talk                                        60    294,392,298
10  business negotiation 1                          27    143,285,178
11  business negotiation 2                          26    140,260,744
12  business negotiation 3                          54    284,761,458
13  business negotiation 4                           9     44,767,134

Page 18

14  child discourse                                163  1,115,063,560
15  Chinese and Korean first contact                 7     34,708,716
16  Chinese New Year celebration                    11    126,323,484
17  Classmates get-together                         14     73,063,728
18  Classroom discourse-teach Chinese to Koreans   125    574,000,000
19  commercial house key-handling procedure         16     84,512,806
20  community talks                                322  1,734,865,326
21  end year celebration                            17     78,310,716
22  fortune telling                                 33    390,741,362
23  Gu Yueguo a week record                        248  1,235,679,186
24  house allocation meeting                        44    239,388,838
25  house decoration team talks                     36    181,660,952
26  Jiangsu TVU review meeting                      11     49,675,918
27  kindergarten meeting                            28    146,741,690
28  Lan Baochun family talks                        22    285,975,640

Page 19

29  lawsuit                                         93    508,628,422
30  lovers conversation                             11     59,845,160
31  medical discourse                              156    764,274,198
32  ministry education meeting                      99    522,992,404
33  office talk ministry of communication          114    577,889,242
34  peasant family                                  73    373,917,094
35  Peking Univ ceremony                             7     46,894,312
36  play mah-jong                                   28    145,754,884
37  private conversation                            77    401,858,424
38  Radio Communication interviews                  24    919,456,512
39  sell and buy                                   296  1,150,000,000
40  seventy-eighty yrs old peasant talks            22    125,624,138
41  street market shopping                          37    190,887,972
42  student dormitory talks                         66    345,920,582
43  table talks                                     89    529,995,698
44  visit blood donors                              14     71,655,104
45  Zhu Rongji press conference                     20     97,984,672
    Total                                         2705 15,180,870,992

Total duration (1 second = 15.6503 KB): 15,180,870,992 bytes = 970005.11 seconds = 269.44 hours
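The total line converts the size of the MP3 collection into listening time. A small sketch of that conversion, assuming (as the figures on the slide imply) that 1 KB means 1,000 bytes, so that one second of audio takes 15,650.3 bytes; the function name `audio_duration` is illustrative:

```python
# 15.6503 KB per second of audio, taking 1 KB = 1000 bytes (an assumption
# that reproduces the seconds figure given on the slide).
BYTES_PER_SECOND = 15.6503 * 1000


def audio_duration(total_bytes, bytes_per_second=BYTES_PER_SECOND):
    """Return (seconds, hours) of audio for a collection of the given size."""
    seconds = total_bytes / bytes_per_second
    return seconds, seconds / 3600


if __name__ == "__main__":
    seconds, hours = audio_duration(15_180_870_992)
    print(f"{seconds:.2f} seconds, {hours:.2f} hours")
```

Under that assumption the collection comes to about 970,005 seconds, or roughly 269.4 hours, of recorded talk.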

Page 20

1

Recording

Page 21

Recording

1. Who does the recording?

2. What role does the person assume while recording?

3. What is the quality of the recording?

4. In what manner is the recording to be made?

5. How are the ethics of recording to be properly taken care of?

6. What details are to be noted while recording?

7. How are the recordings to be kept safe?

Page 22

What role does the person assume while recording?

The recording person as a legitimate observer: s/he is allowed by the authority to take a non-active part in the activity and record the talk. S/he is an outsider. The party is aware of her or his presence and of her or his purpose in being there.

The recording person as a genuine participant: s/he is an insider.

The recording person as a surreptitious observer: s/he is a member of the public, and her or his presence draws no particular attention from anyone else.

Page 23

In what manner is the recording to be made?

• With the approval of all the participants

• With the approval of the key participant

• With the approval of the unit authority

• Open recording which can be noticed by anyone

• Surreptitiously

Page 24

Recording Record Card

Name of recorder: ________________ Sex: ______________ Occupation: _______________________

Recording start date: _____ (year) ____ (month) ____ (day) Recording end date: _____ (year) ____ (month) ____ (day)

Recording start time: morning _____ o'clock, afternoon _____ o'clock, evening _____ o'clock

Recording end time: morning _____ o'clock, afternoon _____ o'clock, evening _____ o'clock

Place of talk: _____ province _____ city ____ county ____ township _____ village

Work unit: ______________________________________________

Setting of talk, e.g. office, friend's home, restaurant, meeting room, supermarket, on a train, workshop, at home, shopping mall, hospital, courtroom, hotel, in the street, at an evening party, ___________

Where were you while this side of the tape was being recorded? 1. ________________ 2. __________________ 3. ___________________

Manner of recording: open; secret; secret at first, then open; some people knew and consented; everyone knew and consented

Please enter the details of the speakers on this side of the tape in the table below (the more detailed the better):

Table columns (five blank rows): Name; Occupation, professional title, post; Age; Sex; Level of education; Accent; Relationship to you and to the other speakers

Purpose and occasion of the talk: ______________________________________________

Reminder: when this side has been recorded, check whether the tape needs to be turned over!

(The following is to be filled in by corpus staff)

Original sound-wave file name: _____________________ Chinese-character transcription file name: ____________________________

CD number of the original sound-wave file: ______________ Segmented sound-wave file name: __________________________

Classification folder name: ______________________ Other: ______________________________________

Page 25

How are the recordings to be kept safe?

The recordings on the 74-minute mini disks are all converted into WAV files by using the recording function of the sound card. The format is 16-bit, stereo, 44,100 Hz. The WAV files are then stored on 640 MB recordable compact discs. They are further backed up by being converted into MP3 format (to economize on storage space) and saved again on separate 640 MB recordable compact discs. Furthermore, all the MP3 files are stored on a removable 20 GB USB hard disk.
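The storage figures on this slide can be checked with a little arithmetic. The sketch below assumes uncompressed PCM audio in the stated format (16 bits, stereo, 44,100 Hz), ignores the small WAV file header, and takes 1 MB as 1,000,000 bytes; the slide does not say how recordings were divided across discs, and the function name `wav_bytes` is illustrative:

```python
SAMPLE_RATE_HZ = 44_100   # samples per second, per channel
BYTES_PER_SAMPLE = 2      # 16 bits
CHANNELS = 2              # stereo


def wav_bytes(seconds):
    """Size in bytes of the raw audio data for a recording of the given length."""
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


if __name__ == "__main__":
    full_mini_disk = wav_bytes(74 * 60)   # one full 74-minute mini disk
    cd_capacity = 640 * 10**6             # assumption: 1 MB = 10**6 bytes
    minutes_per_cd = cd_capacity / wav_bytes(1) / 60
    print(f"full mini disk as WAV: {full_mini_disk:,} bytes")
    print(f"WAV audio that fits on one CD: {minutes_per_cd:.1f} minutes")
```

Under these assumptions a full 74-minute disk yields 783,216,000 bytes of audio data, more than one 640 MB disc holds, which helps explain why MP3 conversion was worthwhile for the backup copies.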

Page 26

2

Transcription

Page 27

The encoding process

1. Transcription in Chinese characters

2. Transcription in Pinyin/IPA symbols

3. Transcription by using Praat

4. Mark-up by XML

5. Tagging

Page 28

Issues in segmentation

Segmenting sound streams into orthographic and phonetic linear units is the first major concern of the present project. It proves to be theoretically significant and practically difficult. The only natural unit boundaries are speaker turns (a turn being defined in terms of the speaker’s presence of phonation). Units larger or smaller than turns tend to be theoretical constructs rather than natural units.

Page 29

Basic unit ---?

Acoustically speaking, a spontaneous talk is a sequence of strings of sounds uttered by two or more speakers. Prosodic or intonational units seem to be natural segments of the sequence. They are treated as basic units of talk and seem to have the same status as the sentence does in written text. The weaknesses of such segmentation are (1) segments larger than intonational units are assumed to be the mere stacking of these basic units, which is untrue and hence misleading; and (2) talk is treated as a self-contained product waiting to be sliced into intonational units, thus ignoring the dynamic aspect of talk and its intrinsic relation with the social activities at large.

Page 30

Multiple level segmentation 1

The first-level segment: the activity boundary (segmenting talk from other social activities)

• Schedule boundary, e.g. a two-hour meeting, classroom discourse

• Visit boundary, e.g. a patient’s visit to a doctor

• Case boundary, e.g. an accident settlement

• Appointment boundary, e.g.

• Business boundary, e.g. buy something

Page 31

Multiple level segmentation 2

The second-level segment: goal-oriented segmentation (segmenting talk into goal-attaining chunks)

The segmentation is made on the basis of the goal-attaining process, i.e. the goal-attainment structure

• E.g. opening, negotiating, closing of a meeting

• E.g. examine-diagnose-prescribe-recommend

• The presentation of a speaker

Page 32

Multiple level segmentation 3

The third-level segment: turn-oriented segmentation

(segmenting goal-attaining chunks into turn-taking chunks)

The segmentation is made on the basis of turn-boundary

Page 33

Multiple level segmentation 4

The fourth-level segment: functional units (segmenting turn-taking chunks into functional units)

The segmentation is made on the basis of functional markers or clues.

• A meaningful cluster with a clear forward function

• A meaningful cluster with a clear backward function

• A meaningful cluster with a clear downward function

• A meaningful cluster having a clear cognitive function: planning or searching for words

Page 34

Multiple level segmentation 5

The fifth level segment: linear character and phonetic units
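The five segmentation levels nest inside one another, which suits a tree-shaped mark-up. The sketch below builds one such tree with Python's standard `xml.etree.ElementTree`. It is only an illustration: the element names, attribute names and sample values are hypothetical, since the slides do not give the project's actual XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, one per segmentation level (outermost first).
LEVELS = ["activity", "goal_chunk", "turn", "functional_unit", "unit"]


def build_sample():
    """Build a single nested branch covering all five levels."""
    activity = ET.Element("activity", {"boundary": "visit"})                # level 1
    chunk = ET.SubElement(activity, "goal_chunk", {"stage": "examine"})     # level 2
    turn = ET.SubElement(chunk, "turn", {"speaker": "doctor"})              # level 3
    unit_group = ET.SubElement(turn, "functional_unit", {"function": "forward"})  # level 4
    unit = ET.SubElement(unit_group, "unit")                                # level 5
    unit.text = "placeholder for characters or phonetic symbols"
    return activity


def depth(element):
    """Number of nesting levels from this element down to its deepest descendant."""
    return 1 + max((depth(child) for child in element), default=0)


if __name__ == "__main__":
    root = build_sample()
    print(ET.tostring(root, encoding="unicode"))
    print("levels:", depth(root))
```

A real transcript would hold many goal chunks per activity and many turns per chunk; the single branch here only shows how each level contains the next.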

Page 35

[Diagram with the labels: Trajectories of life path (repeated); Internalized language out of life path trajectories (repeated three times); Natural growth and development of language]

Page 36

[Diagram with the labels: Trajectories of life path (repeated); Internalized language out of life path trajectories (repeated three times)]

Linguistic theory as reconstruction, as modeling, as description, as standardization