8/6/2019 (Kishore) Problems and Prospects in Collecting Spoken Language Data
Problems and Prospects in Collecting Spoken Language Data
Kishore Prahallad
Suryakanth V Gangashetty
B. Yegnanarayana
Raj Reddy
IIIT Hyderabad, India
Carnegie Mellon University, USA

Outline
- Need for a digital library of audio and video data
- Characteristics of spoken language data
- Prototype data collection
  - IIIT Hyderabad
  - IIT Madras
- Lessons learnt
- Proposal to collect Indian language (IL) data
  - As a part of Jim Baker's global project

Need for Digital Library of Audio & Video Data
- Current and future data will be in audio and video formats
- Current technology makes it possible to digitize and store such large amounts of data
- Collection, storage and indexing of such data makes it possible to provide information to current and future generations
- Acts as a test bed for several research challenges that exist in organizing, indexing and retrieving such large data collections
  - Algorithms for quicker and easier access to the information present in AV format by providing a query using text / audio / video modes
  - Algorithms using multi-modal data for biometric authentication
  - Development of multi-lingual speech synthesis and speech recognition systems

Characteristics of Spoken Language Data
- Message: information to be conveyed
- Speaker: who is the speaker?
  - His/her background: age, gender, literacy levels, knowledge levels, mannerisms etc.
  - Emotions: anger, sadness, happiness etc.
  - Idiolect: an individual's distinctive style of speaking
- Medium of transmission: microphone, telephone, satellite etc.
- Environment: party environment, airport/station
- Language
  - Dialect: grammar and vocabulary associated with a regional or social use of a language
  - Culture and civilization: the richness of usage of vocabulary, grammar etc. indicates the times of the language and the society

Characteristics of Spoken Language Data
- How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond?
- How was a famous poem recited or sung by its author?
- How was a particular language spoken in different geographical locations of a state/country?
- How has a particular language/dialect evolved over a period of time?
- What were the rare languages/dialects that are no longer in existence? How were they spoken?

Phase 0: Prototype data collection at IIIT Hyd
- High quality studio recordings
  - 2 hrs of single speaker recordings for speech synthesis
  - Telugu, Hindi, Tamil and Indian-English
  - Developed text to speech systems in these 4 languages
- Telephone and cell-phone corpus
  - 150 hrs (540 speakers)
  - Telugu, Tamil and Marathi
  - Developed speech recognition systems in these 3 languages

Phase 0: Prototype data collection at IIT Madras
- 15 hours (72 speakers)
- TV news in Tamil, Telugu and Hindi languages
- Text to speech systems (TTS)
- Language identification
- Duration modeling for TTS systems

Tools Aiding Acquisition/Correction of Speech Data
- Transcription correction tool (TCT)
  - Spoken errors at phone, syllable, word level
  - Background noise, abrupt begin or end, low SNR
  - TCT corrects the above errors at three levels
- Audio & Video Transcription Tool
  - Used to annotate movie databases
- Correction of segment labels
  - Emulabel

Lessons Learnt
- Speech correction needs 3-6 times more effort than collection
  - Better to collect more data than to correct it
- Needs a unified framework
  - Standardized processes, procedures and tools
- Need a larger collection of spoken and text corpora
  - For building practical speech systems in Indian languages

Proposal for collection of larger Spoken Language Data for IL
- Focus on information present in speech mode
- Collect spoken language data from all Indian languages and also from neighboring countries
- Collect about 200,000 (0.2 M) hours of speech
  - As a part of Jim Baker's global project of collecting 1 million hours of speech

New in our approach
- Collection of large speech data up to 200,000 (0.2 M) hours
  - All Indian languages and dialects
    - 23 official Indian languages
    - Approx. 10,000 hours per language
  - All types: traditional, read, spoken, conversational, dialog, movies, broadcast etc.
  - All modes: microphone, clean, telephone, cellphone, satellite etc.
- Standard procedure for organizing, annotating and indexing
- More focus on larger collection (and on elimination rather than correction)
- Make this data available for general public use

Key Make-A-Difference Capability
- Availability of information (stories, lectures, poems, books, articles) in spoken language
  - For the illiterate
  - For the vision impaired
- Collection and storage of spoken language data of popular as well as rare languages & dialects
- Promotes research and development in
  - Speech technology
    - Speech-to-speech translation in Indian languages
    - Phonetic engine (language independent)
    - Speech synthesis (text-to-speech for Indian languages)
    - Speaker recognition (text independent and dependent)
    - Language identification
    - Speech enhancement
    - Speech signal processing
  - Biometrics
    - Multimodal: audio-video modes
  - Information access, storage and retrieval
    - Audio-video data (indexing)
    - Data mining (searching)
    - Speech coding (ultra-low bit coding)

Implementation Plan
- Phase 1: (3.5 months)
  - 10 languages
  - 33,300 hours
- Phase 2: (8 months)
  - 10 (of phase 1) languages
  - 66,000 hours
- Phase 3: (10 months)
  - 13 remaining languages
  - 80,000 hours
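
As a quick check on the plan, the phase figures can be added up and compared with the stated goal of about 200,000 hours. The sketch below uses only the numbers on the slides; the assumption that the three phases run back to back (so their durations can be summed) is ours, since the slides do not say whether phases overlap.

```python
# Planned hours and durations per phase, as listed on the Implementation Plan slide.
PHASE_HOURS = {"Phase 1": 33_300, "Phase 2": 66_000, "Phase 3": 80_000}
PHASE_MONTHS = {"Phase 1": 3.5, "Phase 2": 8, "Phase 3": 10}
TARGET_HOURS = 200_000  # "about 200,000 (0.2 M) hours" from the proposal slide


def plan_summary():
    """Sum the planned hours and months and compare with the 200,000-hour goal."""
    total_hours = sum(PHASE_HOURS.values())
    # Assumption: phases are sequential, so durations simply add up.
    total_months = sum(PHASE_MONTHS.values())
    return {
        "total_hours": total_hours,
        "total_months": total_months,
        "gap_to_target": TARGET_HOURS - total_hours,
    }


if __name__ == "__main__":
    print(plan_summary())
```

The three phases add up to somewhat less than 200,000 hours, which is consistent with the proposal describing the target as "about" 0.2 M hours.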

Mid-Term and Final Terms
- Mid-Term
  - Phase 1, collection of 33,300 hours of speech
  - Collection, storage and indexing of speech data for public information access
  - Visible research output using the speech data
  - Demonstrations of speech technology products
    - Speech recognition in 10 languages
- Final Term
  - Phase 1 + Phase 2

Q & A

Misc.

Impact of Audio Digital Library
- Availability of information in spoken language form for illiterate and others
- Promotes research in speech technology for Indian languages
- Enables development of speech technology products useful for the common man
  - Examples:
    - Speech-to-speech translation systems: for information exchange
    - Screen readers: for illiterate and physically challenged
    - Naturally speaking dialog systems: for information access over voice mode

Phase 1: Time Estimate
- Phase 1:
  - 10 official Indian languages
  - Parallel collection of data
  - ~ 3000 hours per language
    - 5,000 - 10,000 speakers
    - > 10 min of speech each per speaker
  - Total: 33,300 hours
- Time Estimates: (~ 3.5 months for all 10 languages)
  - 10-person team per language
  - Each person works
    - 8 hours a day
    - 30 mins of speech recording per hour
    - 1-3 speakers per hour
    - 240 mins of speech per day
    - 1-24 speakers per day
  - 240 speakers per day
  - 20,000 speakers per language in 84 working days
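
The time estimate can be reproduced step by step. The sketch below is one reading of the slide, not necessarily the authors' own calculation: it assumes the upper figure of 3 speakers per hour, 10 minutes of speech per speaker, and that all 10 language teams work in parallel for the same 84 working days. The constant and function names are illustrative.

```python
# Figures taken from the Phase 1 time-estimate slide.
TEAM_SIZE = 10            # persons per language team
HOURS_PER_DAY = 8         # working hours per person per day
SPEECH_MIN_PER_HOUR = 30  # minutes of recorded speech per working hour
SPEAKERS_PER_HOUR = 3     # assumption: upper end of the "1-3 speakers per hour" range
MIN_PER_SPEAKER = 10      # assumption: roughly 10 minutes of speech per speaker
WORKING_DAYS = 84
LANGUAGES = 10            # collected in parallel


def phase1_estimate():
    """Recompute speakers and hours of speech from the per-person rates."""
    speech_min_per_person_day = HOURS_PER_DAY * SPEECH_MIN_PER_HOUR
    speakers_per_person_day = HOURS_PER_DAY * SPEAKERS_PER_HOUR
    speakers_per_team_day = speakers_per_person_day * TEAM_SIZE
    speakers_per_language = speakers_per_team_day * WORKING_DAYS
    hours_per_language = speakers_per_language * MIN_PER_SPEAKER / 60
    return {
        "speech_min_per_person_day": speech_min_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": hours_per_language * LANGUAGES,
    }


if __name__ == "__main__":
    for name, value in phase1_estimate().items():
        print(f"{name}: {value}")
```

Under these assumptions the result lands close to the slide's rounded figures of 20,000 speakers per language and 33,300 hours in total.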

Phase 1: Cost Estimate
- Man power cost: Rs 140 Lakhs
- Equipment cost: Rs 55 Lakhs
- Communication cost: Rs 40 Lakhs
- Contingency (10%): Rs 25 Lakhs
- Total Cost: Rs 2.6 Crores (~ $ 565,000)
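
The cost total can be checked with a few lines of arithmetic. The sketch below uses only the slide's figures (1 lakh = 100,000 and 1 crore = 10,000,000 rupees); the exchange rate it prints is simply the one implied by the slide's rupee and dollar totals, not a quoted rate.

```python
LAKH = 100_000
CRORE = 10_000_000

# Line items from the Phase 1 cost-estimate slide, in lakhs of rupees.
COSTS_LAKHS = {"man power": 140, "equipment": 55, "communication": 40}
CONTINGENCY_PERCENT = 10
CONTINGENCY_ON_SLIDE_LAKHS = 25  # the slide rounds the 10% contingency up
TOTAL_USD_ON_SLIDE = 565_000


def cost_summary():
    """Add up the line items and derive the rupee/dollar rate implied by the slide."""
    subtotal = sum(COSTS_LAKHS.values())
    exact_contingency = subtotal * CONTINGENCY_PERCENT / 100
    total_lakhs = subtotal + CONTINGENCY_ON_SLIDE_LAKHS
    total_rupees = total_lakhs * LAKH
    return {
        "subtotal_lakhs": subtotal,
        "exact_contingency_lakhs": exact_contingency,
        "total_lakhs": total_lakhs,
        "total_crores": total_rupees / CRORE,
        "implied_rs_per_usd": total_rupees / TOTAL_USD_ON_SLIDE,
    }


if __name__ == "__main__":
    print(cost_summary())
```

The exact 10% contingency is slightly below the Rs 25 lakhs shown, so the slide's total of Rs 2.6 crores appears to use the rounded-up contingency.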

Man-Power Cost
- Data collection team: Rs 86 lakhs
  - 10 (for data collection) x Rs 10 K PM
  - 10 (for data correction) x Rs 10 K PM
  - 1 data manager (Rs 15 K PM)
  - 4 months cost: Rs 8,60,000 per language
- 5 engineers: Rs 4 Lakhs
  - B.Tech level (Rs 20,000 PM)
- Gifts per speaker: Rs 50 Lakhs
  - Rs 25 per speaker
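
The man-power figures can be rebuilt from the per-month rates. In the sketch below, the multipliers that the slide leaves implicit are assumptions on our part: 10 languages, engineers paid for the same 4 months as the collection teams, and 20,000 speakers per language (the figure from the Phase 1 time estimate).

```python
LAKH = 100_000
MONTHS = 4
LANGUAGES = 10                   # assumption: one team per Phase 1 language
SPEAKERS_PER_LANGUAGE = 20_000   # assumption: taken from the Phase 1 time estimate


def manpower_cost_lakhs():
    """Recompute the three man-power line items in lakhs of rupees."""
    # Per-language team: 10 collectors, 10 correctors, 1 data manager.
    team_per_month = 10 * 10_000 + 10 * 10_000 + 1 * 15_000
    team_per_language = team_per_month * MONTHS
    teams_total = team_per_language * LANGUAGES
    # Assumption: the 5 engineers are shared across languages and paid for 4 months.
    engineers_total = 5 * 20_000 * MONTHS
    gifts_total = 25 * SPEAKERS_PER_LANGUAGE * LANGUAGES
    return {
        "team_per_language_rs": team_per_language,
        "teams": teams_total / LAKH,
        "engineers": engineers_total / LAKH,
        "gifts": gifts_total / LAKH,
        "total": (teams_total + engineers_total + gifts_total) / LAKH,
    }


if __name__ == "__main__":
    print(manpower_cost_lakhs())
```

With these assumptions the three items reproduce the Rs 140 lakhs man-power cost quoted in the Phase 1 cost estimate.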

Machines Cost
- Machines:
  - 30 servers: Rs 30 Lakhs
    - 3 servers per language
    - Each server has 4 ports for data collection
  - 30 CTI cards: Rs 20 Lakhs
- Storage: 20 TB: Rs 5 Lakhs
  - Two copies of 20 TB
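
A small sketch of how the equipment items add up and how many collection ports they provide. The figures come from the slide; the only assumption is that "3 servers per language" refers to the 10 Phase 1 languages.

```python
LANGUAGES = 10            # assumption: the 10 Phase 1 languages
SERVERS_PER_LANGUAGE = 3
PORTS_PER_SERVER = 4

# Equipment line items from the slide, in lakhs of rupees.
EQUIPMENT_LAKHS = {"servers": 30, "CTI cards": 20, "storage": 5}


def equipment_summary():
    """Total the equipment cost and count the telephone ports available."""
    servers = SERVERS_PER_LANGUAGE * LANGUAGES
    ports_per_language = SERVERS_PER_LANGUAGE * PORTS_PER_SERVER
    return {
        "servers": servers,
        "ports_per_language": ports_per_language,
        "ports_total": ports_per_language * LANGUAGES,
        "equipment_lakhs": sum(EQUIPMENT_LAKHS.values()),
    }


if __name__ == "__main__":
    print(equipment_summary())
```

The equipment total matches the Rs 55 lakhs in the Phase 1 cost estimate, and each language team would have 12 ports available for parallel collection.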

Communications Cost
- Telephonic charges: Rs 20 Lakhs
  - Rs 1 per min (local telephonic charges)
- Transportation: Rs 20 Lakhs
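
The telephone budget can be related back to the amount of speech it pays for. This sketch uses the slide's rate of Rs 1 per minute and assumes, as a simplification, that every budgeted call minute yields a minute of recorded speech.

```python
LAKH = 100_000
TELEPHONE_BUDGET_RS = 20 * LAKH
TRANSPORT_BUDGET_RS = 20 * LAKH
RS_PER_MINUTE = 1  # local telephonic charge from the slide


def communications_summary():
    """Convert the telephone budget into call minutes and hours."""
    call_minutes = TELEPHONE_BUDGET_RS // RS_PER_MINUTE
    return {
        "call_minutes": call_minutes,
        # Assumption: each paid minute corresponds to one minute of recorded speech.
        "call_hours": call_minutes / 60,
        "total_lakhs": (TELEPHONE_BUDGET_RS + TRANSPORT_BUDGET_RS) / LAKH,
    }


if __name__ == "__main__":
    print(communications_summary())
```

The budgeted call time comes to roughly 33,300 hours, in line with the Phase 1 collection target, and the two items together give the Rs 40 lakhs communication cost.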