(Kishore) Problems and Prospects in Collecting Spoken Language Data

8/6/2019

Problems and Prospects in Collecting Spoken Language Data

Kishore Prahallad, Suryakanth V Gangashetty, B. Yegnanarayana, Raj Reddy

IIIT Hyderabad, India
Carnegie Mellon University, USA


Outline

- Need for a digital library of audio and video data
- Characteristics of spoken language data
- Prototype data collection
  - IIIT Hyderabad
  - IIT Madras
- Lessons learnt
- Proposal to collect Indian language (IL) data as a part of Jim Baker's global project


Need for Digital Library of Audio & Video Data

- Current and future data will be in audio and video formats
- Current technology makes it possible to digitize and store such large amounts of data
- Collection, storage and indexing of such data makes it possible to provide information to current and future generations
- Acts as a test bed for several research challenges that exist in organizing, indexing and retrieving such large data collections
  - Algorithms for quicker and easier access to the information present in AV format, with queries given in text, audio or video modes
  - Algorithms using multi-modal data for biometric authentication
  - Development of multilingual speech synthesis and speech recognition systems


Characteristics of Spoken Language Data

- Message: information to be conveyed
- Speaker
  - Who is the speaker?
  - His/her background: age, gender, literacy level, knowledge level, mannerisms etc.
  - Emotions: anger, sadness, happiness etc.
  - Idiolect: an individual's distinctive style of speaking
- Medium of transmission: microphone, telephone, satellite etc.
- Environment: party environment, airport/station
- Language
  - Dialect: grammar and vocabulary associated with a regional or social use of a language
  - Culture and civilization: the richness of usage of vocabulary, grammar etc. indicates the times of the language and the society


Characteristics of Spoken Language Data

- How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond?
- How was a famous poem recited or sung by its author?
- How was a particular language spoken in different geographical locations of a state/country?
- How has a particular language/dialect evolved over a period of time?
- What were the rare languages/dialects (which are no longer in existence)? How were they spoken?


Phase 0: Prototype data collection at IIIT Hyd

- High quality studio recordings
  - 2 hrs of single speaker recordings for speech synthesis
  - Telugu, Hindi, Tamil and Indian-English
  - Developed text to speech systems in these 4 languages
- Telephone and cell-phone corpus
  - 150 hrs (540 speakers)
  - Telugu, Tamil and Marathi
  - Developed speech recognition systems in these 3 languages


Phase 0: Prototype data collection at IIT Madras

- 15 hours (72 speakers)
- TV news in Tamil, Telugu and Hindi languages
- Text to speech systems (TTS)
- Language identification
- Duration modeling for TTS systems


Tools Aiding Acquisition/Correction of Speech Data

- Transcription correction tool (TCT)
  - Spoken errors at phone, syllable and word level
  - Background noise, abrupt begin or end, low SNR
  - TCT corrects the above errors at three levels
- Audio & Video Transcription Tool
  - Used to annotate movie databases
- Correction of segment labels
  - Emulabel


Lessons Learnt

- Speech correction needs 3-6 times more effort than collection
  - Better to collect more data than to correct it
- Needs a unified framework
  - Standardize processes, procedures and tools
- Need larger collections of spoken and text corpora
  - For building practical speech systems in Indian languages


Proposal for Collection of Larger Spoken Language Data for IL

- Focus on information present in speech mode
- Collect spoken language data from all Indian languages and also from neighboring countries
- Collect about 200,000 (0.2 M) hours of speech
  - As a part of Jim Baker's global project of collecting 1 million hours of speech


New in our approach

- Collection of large speech data, up to 200,000 (0.2 M) hours
  - All Indian languages and dialects
  - 23 official Indian languages
  - Approx. 10,000 hours per language
  - All types: traditional, read, spoken, conversational, dialog, movies, broadcast etc.
  - All modes: microphone, clean, telephone, cellphone, satellite etc.
- Standard procedure for organizing, annotating and indexing
- More focus on larger collection (and on elimination rather than correction)
- Make this data available for general public use
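
As a rough check of the scale above, the per-language target can be multiplied out. This is only a sketch: the variable names are illustrative, the inputs are the figures quoted on the slides, and the 1 million hour target is the global project figure mentioned in the proposal.

```python
# Figures quoted on the slides; the names are illustrative only.
OFFICIAL_LANGUAGES = 23
HOURS_PER_LANGUAGE = 10_000        # "approx. 10,000 hours per language"
GLOBAL_TARGET_HOURS = 1_000_000    # global project target of 1 million hours

total_hours = OFFICIAL_LANGUAGES * HOURS_PER_LANGUAGE
share_of_global = total_hours / GLOBAL_TARGET_HOURS

print(f"Total: {total_hours:,} hours ({share_of_global:.0%} of the global target)")
```

The product comes to 230,000 hours, so the 200,000 (0.2 M) figure reads as a round number for roughly a fifth of the global collection.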


Key Make-A-Difference Capability

- Availability of information (stories, lectures, poems, books, articles) in spoken language
  - For the illiterate
  - For the vision impaired
- Collection and storage of spoken language data of popular as well as rare languages & dialects
- Promotes research and development in
  - Speech technology
    - Speech-to-speech translation in Indian languages
    - Phonetic engine (language independent)
    - Speech synthesis (text-to-speech for Indian languages)
    - Speaker recognition (text independent and dependent)
    - Language identification
    - Speech enhancement
    - Speech signal processing
  - Biometrics
    - Multimodal: audio-video modes
  - Information access, storage and retrieval
    - Audio-video data (indexing)
    - Data mining (searching)
    - Speech coding (ultra-low bit coding)


Implementation Plan

- Phase 1 (3.5 months)
  - 10 languages
  - 33,300 hours
- Phase 2 (8 months)
  - 10 (of phase 1) languages
- Phase 3 (10 months)
  - 13 remaining languages



Mid-Term and Final Terms

- Mid-Term
  - Phase 1: collection of 33,300 hours of speech
  - Collection, storage and indexing of speech data for public information access
  - Visible research output using the speech data
  - Demonstrations of speech technology products
    - Speech recognition in 10 languages
- Final Term
  - Phase 1 + Phase 2


Q & A


Misc.


Impact of Audio Digital Library

- Availability of information in spoken language form for the illiterate and others
- Promotes research in speech technology for Indian languages
- Enables development of speech technology products useful for the common man. Examples:
  - Speech-to-speech translation systems: for information exchange
  - Screen readers: for the illiterate and physically challenged
  - Naturally speaking dialog systems: for information access over voice mode


Phase 1: Time Estimate

- Phase 1
  - 10 official Indian languages
  - Parallel collection of data
  - ~3000 hours per language
    - 5,000 - 10,000 speakers
    - > 10 min of speech per speaker
  - Total: 33,300 hours
- Time estimates (~3.5 months for all 10 languages)
  - 10-person team per language
  - Each person works 8 hours a day
    - 30 mins of speech recording per hour
    - 1-3 speakers per hour
    - 240 mins of speech per day
    - 1-24 speakers per day
  - 240 speakers per day
  - 20,000 speakers per language in 84 working days
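
The time estimate can be reproduced from the per-person figures. The sketch below is illustrative only: the function and constant names are not from the slides, and it assumes the upper end of the speaker rate (3 speakers per hour) and that all 10 team members record every working day.

```python
# Inputs read off the slide; the names are illustrative only.
PERSONS_PER_TEAM = 10
HOURS_PER_DAY = 8
RECORDING_MIN_PER_HOUR = 30
MAX_SPEAKERS_PER_HOUR = 3      # assumption: upper end of "1-3 speakers per hour"
WORKING_DAYS = 84
LANGUAGES = 10


def phase1_estimate():
    """Return the Phase 1 totals implied by the per-person figures."""
    min_per_person_day = HOURS_PER_DAY * RECORDING_MIN_PER_HOUR
    speakers_per_team_day = PERSONS_PER_TEAM * HOURS_PER_DAY * MAX_SPEAKERS_PER_HOUR
    speakers_per_language = speakers_per_team_day * WORKING_DAYS
    hours_per_language = PERSONS_PER_TEAM * min_per_person_day * WORKING_DAYS / 60
    return {
        "min_per_person_day": min_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": hours_per_language * LANGUAGES,
    }


if __name__ == "__main__":
    for name, value in phase1_estimate().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the sketch gives 20,160 speakers and 3,360 hours per language (33,600 hours in total), in line with the slide's rounded figures of 20,000 speakers, ~3000 hours per language and 33,300 hours overall.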


Phase 1: Cost Estimate

- Man power cost: Rs 140 Lakhs
- Equipment cost: Rs 55 Lakhs
- Communication cost: Rs 40 Lakhs
- Contingency (10%): Rs 25 Lakhs
- Total cost: Rs 2.6 Crores (~ $565,000)
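
The total can be checked by adding the components. This is a minimal sketch with illustrative names; the exchange rate is not stated on the slide and is only derived here from the two totals it gives.

```python
LAKH = 100_000
CRORE = 10_000_000

# Component costs in lakhs of rupees, as listed on the slide.
components_lakhs = {"man power": 140, "equipment": 55, "communication": 40}
contingency_lakhs = 25          # the slide's figure for the 10% contingency
total_usd_on_slide = 565_000

subtotal_lakhs = sum(components_lakhs.values())
exact_contingency_lakhs = subtotal_lakhs * 10 / 100
total_lakhs = subtotal_lakhs + contingency_lakhs
total_crores = total_lakhs * LAKH / CRORE
implied_rs_per_usd = total_lakhs * LAKH / total_usd_on_slide

print(f"Subtotal: Rs {subtotal_lakhs} lakhs")
print(f"10% of subtotal: Rs {exact_contingency_lakhs} lakhs")
print(f"Total: Rs {total_crores} crores")
print(f"Implied rate: Rs {implied_rs_per_usd:.1f} per USD")
```

The subtotal is Rs 235 lakhs, so an exact 10% contingency would be Rs 23.5 lakhs; the slide's Rs 25 lakhs appears to be rounded up, which brings the total to Rs 260 lakhs, i.e. Rs 2.6 crores.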


Man-Power Cost

- Data collection team: Rs 86 lakhs
  - 10 (for data collection) x Rs 10 K PM
  - 10 (for data correction) x Rs 10 K PM
  - 1 data manager (Rs 15 K PM)
  - 4 months cost: Rs 8,60,000 per language
- 5 engineers: Rs 4 Lakhs
  - B.Tech level (Rs 20,000 PM)
- Gifts per speaker: Rs 50 Lakhs
  - Rs 25 per speaker
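
The man-power figures can be rebuilt from the monthly rates. The sketch below is illustrative and rests on assumptions that the slide does not state outright: 10 languages, engineers paid for the same 4 months as the collection team, and 20,000 speakers per language as in the Phase 1 time estimate.

```python
LAKH = 100_000
LANGUAGES = 10                   # assumption: the 10 Phase 1 languages
MONTHS = 4
SPEAKERS_PER_LANGUAGE = 20_000   # assumption: figure from the time estimate

# Per-language team: 10 collectors, 10 correctors, 1 data manager (Rs per month).
team_per_month = 10 * 10_000 + 10 * 10_000 + 1 * 15_000
team_per_language = team_per_month * MONTHS
team_total = team_per_language * LANGUAGES

engineers_total = 5 * 20_000 * MONTHS   # assumption: engineers paid for 4 months
gifts_total = 25 * SPEAKERS_PER_LANGUAGE * LANGUAGES

manpower_lakhs = (team_total + engineers_total + gifts_total) / LAKH

print(f"Team cost per language: Rs {team_per_language:,}")
print(f"Team: {team_total / LAKH} lakhs, engineers: {engineers_total / LAKH} lakhs, "
      f"gifts: {gifts_total / LAKH} lakhs, total: {manpower_lakhs} lakhs")
```

With these assumptions the three items come to Rs 86, 4 and 50 lakhs, which add up to the Rs 140 lakhs used in the Phase 1 cost estimate.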


Machines Cost

- Machines
  - 30 servers: Rs 30 Lakhs
    - 3 servers per language
    - Each server has 4 ports for data collection
  - 30 CTI cards: Rs 20 Lakhs
- Storage: 20 TB: Rs 5 Lakhs
  - Two copies of 20 TB
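
A short check of the equipment figures, assuming the 10 languages of Phase 1 (the names below are illustrative, not from the slides):

```python
LANGUAGES = 10               # assumption: the 10 Phase 1 languages
SERVERS_PER_LANGUAGE = 3
PORTS_PER_SERVER = 4

# Equipment costs in lakhs of rupees, as listed on the slide.
costs_lakhs = {"servers": 30, "CTI cards": 20, "storage": 5}

servers = SERVERS_PER_LANGUAGE * LANGUAGES
ports_per_language = SERVERS_PER_LANGUAGE * PORTS_PER_SERVER
equipment_lakhs = sum(costs_lakhs.values())

print(f"{servers} servers, {ports_per_language} collection ports per language, "
      f"Rs {equipment_lakhs} lakhs in total")
```

This gives 30 servers, 12 collection ports per language and Rs 55 lakhs, matching the equipment line of the Phase 1 cost estimate.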


Communications Cost

- Telephonic charges: Rs 20 Lakhs
  - Rs 1 per min (local telephonic charges)
- Transportation: Rs 20 Lakhs
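
The telephonic charge appears to follow from the Phase 1 volume. The sketch below assumes that every recorded minute is billed as one local call minute, which the slide does not state; the names are illustrative.

```python
LAKH = 100_000
PHASE1_HOURS = 33_300        # Phase 1 total from the time estimate
RS_PER_MINUTE = 1            # local telephonic charge
TRANSPORT_LAKHS = 20

# Assumption: each recorded minute costs one billed call minute.
call_minutes = PHASE1_HOURS * 60
telephone_lakhs = call_minutes * RS_PER_MINUTE / LAKH
communication_lakhs = round(telephone_lakhs) + TRANSPORT_LAKHS

print(f"{call_minutes:,} call minutes, Rs {telephone_lakhs} lakhs in call charges")
print(f"Communication total: Rs {communication_lakhs} lakhs")
```

Under that assumption the call charges come to Rs 19.98 lakhs, which rounds to the Rs 20 lakhs on the slide and, with transportation, to the Rs 40 lakhs in the cost estimate.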
