Designing a Multi-Lingual Corpus Collection System
Jonathan Law
Naresh Trilok
Pace University
04/19/2002
Advisors:
Dr. Charles Tappert (Pace University)
Dr. Zhong-hua Wang (IBM)
Dr. Fred Grossman (Pace University)
What is a Corpus?
• Any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text).
• A corpus in modern linguistics must have these properties:
  – Sampling and representation
  – Finite size
  – Machine-readable form
  – A standard reference
Importance of a Corpus for Automatic Speech Recognition (ASR)
• To provide training data for speech recognition
• To supply training data for automatic language identification
• To offer a body of language to the research community
• To enable analysis of language at all levels
• To support transcription and labeling of documents for linguistic research
Major Components
– Corpus Collection Module
  • Via telephone
  • Native speakers
– Corpus Verification Module
  • Via Web or telephone
  • Native speakers
Data Recording Process
– Via toll-free number on the Tellme platform
  • Caller selects native language
  • Prompt for general attributes (for naming convention)
  • Prompt with pre-defined scripts (for short utterances)
  • Prompt with open-set responses (for long utterances)
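The slides do not spell out the naming convention, so the following is only a minimal Python sketch of how caller attributes might be folded into a recording file name. The field names (language, gender, call_id), the prompt label, and the name layout are all assumptions made for illustration, not the system's actual scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallerAttributes:
    """Hypothetical set of attributes collected at the start of a call."""
    language: str  # native language selected by the caller
    gender: str    # "male" or "female"
    call_id: int   # assumed per-call counter, not mentioned in the slides


def recording_filename(attrs: CallerAttributes, prompt_label: str) -> str:
    """Build an illustrative file name: <language>_<g>_<call id>_<prompt>.wav."""
    lang = attrs.language.strip().lower().replace(" ", "_")
    gender_code = attrs.gender.strip().lower()[:1]  # "m" or "f"
    return f"{lang}_{gender_code}_{attrs.call_id:05d}_{prompt_label}.wav"


if __name__ == "__main__":
    caller = CallerAttributes(language="Hindi", gender="Female", call_id=42)
    print(recording_filename(caller, "route"))
```

Encoding the attributes in the file name keeps each recording self-describing, which could make later verification by native speakers easier to organize by language.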
Corpus Collection Protocol
• “Script” of questions and prompts for user responses
• Reproduced in each language by a native speaker (all in wav files)
• Prompts and questions are the same in all languages
  – Are you male or female (gender)?
  – What day is today (date)?
  – What time is it (time)?
  – Please say all the days of the week.
  – Describe the route you take to work or school (route).
  – Describe the climate today (climate).
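As a rough illustration, the protocol above could be represented as a language-independent list of prompts, each mapped to a per-language wav file recorded by a native speaker. This Python sketch rests on several assumptions: the "days" label, the split into short and long utterances, and the prompts/<language>/<label>.wav directory layout are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Prompt:
    label: str     # short tag used to identify the response
    question: str  # English reference wording of the prompt
    kind: str      # "short" or "long" utterance (assumed classification)


# Same prompts in every language; only the recorded audio differs.
PROTOCOL: List[Prompt] = [
    Prompt("gender", "Are you male or female?", "short"),
    Prompt("date", "What day is today?", "short"),
    Prompt("time", "What time is it?", "short"),
    Prompt("days", "Please say all the days of the week.", "short"),
    Prompt("route", "Describe the route you take to work or school.", "long"),
    Prompt("climate", "Describe the climate today.", "long"),
]


def prompt_audio_path(language: str, prompt: Prompt) -> str:
    """Hypothetical location of the native-speaker recording of a prompt."""
    return f"prompts/{language.strip().lower()}/{prompt.label}.wav"


def call_plan(language: str) -> List[Tuple[str, str]]:
    """Ordered (label, audio file) pairs to play during one call."""
    return [(p.label, prompt_audio_path(language, p)) for p in PROTOCOL]


if __name__ == "__main__":
    for label, path in call_plan("Hindi"):
        print(label, path)
```

Keeping the prompt list in one place and varying only the audio path reflects the slide's point that the questions are identical across languages.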