(Kishore) Problems and Prospects in Collecting Spoken Language Data


  • 8/6/2019 (Kishore) Problems and Prospects in Collecting Spoken Language Data


    Problems and Prospects in Collecting Spoken Language Data

    Kishore Prahallad
    Suryakanth V Gangashetty
    B. Yegnanarayana
    Raj Reddy

    IIIT Hyderabad, India
    Carnegie Mellon University, USA


    Outline

    Need for a digital library of audio and video data
    Characteristics of spoken language data
    Prototype data collection
        IIIT Hyderabad
        IIT Madras
    Lessons learnt
    Proposal to collect Indian language (IL) data as a part of Jim Baker's global project


    Need for Digital Library of Audio & Video Data

    Current and future data will be in audio and video formats
    Current technology makes it possible to digitize and store such large amounts of data
    Collection, storage and indexing of such data makes it possible to provide information to current and future generations
    Acts as a test bed for several research challenges that exist in organizing, indexing and retrieving such large data collections
    Algorithms for quicker and easier access to the information present in AV format by providing a query using text / audio / video modes
    Algorithms using multi-modal data for biometric authentication
    Development of multi-lingual speech synthesis and speech recognition systems


    Characteristics of Spoken Language Data

    Message: information to be conveyed
    Speaker: who is the speaker?
    His/her background: age, gender, literacy levels, knowledge levels, mannerisms etc.
    Emotions: anger, sadness, happiness etc.
    Idiolect: an individual's distinctive style of speaking
    Medium of transmission: microphone, telephone, satellite etc.
    Environment: party environment, airport/station
    Language
    Dialect: grammar and vocabulary associated with a regional or social use of a language
    Culture and civilization: the richness of usage of vocabulary, grammar etc. indicates the times of the language and the society


    Characteristics of Spoken Language Data

    How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond?
    How was a famous poem recited or sung by its author?
    How was a particular language spoken in different geographical locations of a state/country?
    How has a particular language/dialect evolved over a period of time?
    What were the rare languages/dialects (which are no longer in existence)? How were they spoken?


    Phase 0: Prototype data collection at IIIT Hyd

    High quality studio recordings
        2 hrs of single speaker recordings for speech synthesis
        Telugu, Hindi, Tamil and Indian-English
        Developed text to speech systems in these 4 languages
    Telephone and cell-phone corpus
        150 hrs (540 speakers)
        Telugu, Tamil and Marathi
        Developed speech recognition systems in these 3 languages


    Phase 0: Prototype data collection at IIT Madras

    15 hours (72 speakers)
    TV news in Tamil, Telugu and Hindi languages
    Text to speech systems (TTS)
    Language identification
    Duration modeling for TTS systems


    Tools for Aiding Acquisition/Correction of Speech Data

    Transcription correction tool (TCT)
        Spoken errors at phone, syllable, word level
        Background noise, abrupt begin or end, low SNR
        TCT corrects the above errors at three levels
    Audio & Video Transcription Tool
        Used to annotate movie databases
    Correction of segment labels
        Emulabel


    Lessons Learnt

    Speech correction needs 3-6 times more than collection
        Better to collect more data than to correct it
    Needs a unified framework
        Standardize processes, procedures and tools
    Need larger collection of spoken and text corpora
        For building practical speech systems in Indian languages


    Proposal for collection of larger Spoken Language Data for IL

    Focus on information present in speech mode
    Collect spoken language data from all Indian languages and also from neighboring countries
    Collect about 200,000 (0.2 M) hours of speech
        As a part of Jim Baker's global project of collecting 1 million hours of speech


    New in our approach

    Collection of large speech data, up to 200,000 (0.2 M) hours
        All Indian languages and dialects
        23 official Indian languages
        Approx. 10,000 hours per language
        All types: traditional, read, spoken, conversational, dialog, movies, broadcast etc.
        All modes: microphone, clean, telephone, cellphone, satellite etc.
    Standard procedure for organizing, annotating and indexing
    More focus on larger collection (and on elimination rather than correction)
    Make this data available for general public use


    Key Make-A-Difference Capability

    Availability of information (stories, lectures, poems, books, articles) in spoken language
        For the illiterate
        Vision impaired
    Collection and storage of spoken language data of popular as well as rare languages & dialects
    Promotes research and development in
        Speech technology
            Speech-to-speech translation in Indian languages
            Phonetic engine (language independent)
            Speech synthesis (text-to-speech for Indian languages)
            Speaker recognition (text independent and dependent)
            Language identification
            Speech enhancement
            Speech signal processing
        Biometrics
            Multimodal: audio-video modes
        Information access, storage and retrieval
            Audio-video data (indexing)
            Data mining (searching)
            Speech coding (ultra-low bit coding)


    Implementation Plan

    Phase 1 (3.5 months)
        10 languages
        33,300 hours
    Phase 2 (8 months)
        10 (of phase 1) languages
        66,000 hours
    Phase 3 (10 months)
        13 remaining languages
        80,000 hours
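    The phase figures above can be tallied with a short sketch. It only adds up the numbers on the slide; treating the three phases as running one after another is an assumption, and the variable names are illustrative.

```python
# Hours, languages and durations per phase, as listed on the Implementation Plan slide.
phases = {
    "Phase 1": {"months": 3.5, "languages": 10, "hours": 33_300},
    "Phase 2": {"months": 8, "languages": 10, "hours": 66_000},
    "Phase 3": {"months": 10, "languages": 13, "hours": 80_000},
}

total_hours = sum(p["hours"] for p in phases.values())
# Assumption: phases run sequentially, so their durations simply add.
total_months = sum(p["months"] for p in phases.values())

# Phases 1 and 2 cover the same 10 languages, so their hours add up per language.
hours_per_lang_first10 = (phases["Phase 1"]["hours"] + phases["Phase 2"]["hours"]) / 10
hours_per_lang_last13 = phases["Phase 3"]["hours"] / 13

print(total_hours)                   # 179300
print(total_months)                  # 21.5
print(hours_per_lang_first10)        # 9930.0
print(round(hours_per_lang_last13))  # 6154
```

    Under these figures the first 10 languages reach roughly the 10,000 hours per language quoted earlier, and the three phases together come to about 179,300 hours against the stated target of about 200,000.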


    Mid-Term and Final Terms

    Mid-Term
        Phase 1, collection of 33,300 hours of speech
        Collection, storage and indexing of speech data for public information access
        Visible research output using the speech data
        Demonstrations of speech technology products
        Speech recognition in 10 languages
    Final Term
        Phase 1 + Phase 2


    Q & A


    Misc.


    Impact of Audio Digital Library

    Availability of information in spoken language form for the illiterate and others
    Promotes research in speech technology for Indian languages
    Enables development of speech technology products useful for the common man
        Examples:
        Speech-to-speech translation systems: for information exchange
        Screen readers: for the illiterate and physically challenged
        Naturally speaking dialog systems: for information access over voice mode


    Phase 1: Time Estimate

    Phase 1:
        10 official Indian languages
        Parallel collection of data
        ~ 3000 hours per language
        5,000 - 10,000 speakers
        > 10 min of speech per speaker
        Total: 33,300 hours
    Time estimate (~ 3.5 months for all 10 languages):
        10-person team per language
        Each person works 8 hours a day
        30 mins of speech recording per hour
        1-3 speakers per hour
        240 mins of speech per day
        1-24 speakers per day
        240 speakers per day
        20,000 speakers per language in 84 working days
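    The time estimate above can be reproduced as a short calculation. This is a sketch under stated assumptions: it takes the upper bound of 3 speakers per hour to reach the 240 speakers per day figure, and reads that figure as the output of one 10-person team. The constant names are illustrative.

```python
# Inputs taken from the Phase 1 time-estimate slide.
TEAM_SIZE = 10               # persons per language
HOURS_PER_DAY = 8            # working hours per person
RECORDING_MIN_PER_HOUR = 30  # minutes of speech recorded per working hour
MAX_SPEAKERS_PER_HOUR = 3    # assumption: upper end of the "1-3 speakers per hour" range
WORKING_DAYS = 84            # roughly 3.5 months
LANGUAGES = 10

speech_min_per_person_day = HOURS_PER_DAY * RECORDING_MIN_PER_HOUR   # 240 minutes
max_speakers_per_person_day = HOURS_PER_DAY * MAX_SPEAKERS_PER_HOUR  # 24 speakers
speakers_per_team_day = TEAM_SIZE * max_speakers_per_person_day      # 240 speakers
speakers_per_language = speakers_per_team_day * WORKING_DAYS         # 20160 speakers

# Total recorded speech per language, converted from minutes to hours.
hours_per_language = TEAM_SIZE * speech_min_per_person_day * WORKING_DAYS / 60
total_hours = hours_per_language * LANGUAGES

print(speakers_per_language)  # 20160
print(hours_per_language)     # 3360.0
print(total_hours)            # 33600.0
```

    With these inputs each speaker contributes about 10 minutes (240 minutes over 24 speakers), and the result lands close to the quoted 20,000 speakers per language and 33,300 hours in total.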


    Phase 1: Cost Estimate

    Man power cost: Rs 140 Lakhs
    Equipment cost: Rs 55 Lakhs
    Communication cost: Rs 40 Lakhs
    Contingency (10%): Rs 25 Lakhs
    Total cost: Rs 2.6 Crores (~ $565,000)
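    The cost lines above add up as follows. The sketch only restates the slide's figures; the implied exchange rate is derived from the quoted dollar amount and is not stated in the source.

```python
LAKH = 100_000       # 1 lakh = 100,000 rupees
CRORE = 100 * LAKH   # 1 crore = 100 lakhs

# Cost heads in lakhs of rupees, as quoted on the slide.
items_lakhs = {"Man power": 140, "Equipment": 55, "Communication": 40}

subtotal_lakhs = sum(items_lakhs.values())     # 235
contingency_exact_lakhs = subtotal_lakhs / 10  # 23.5, i.e. 10% of the subtotal
contingency_quoted_lakhs = 25                  # figure used on the slide (rounded up)

total_lakhs = subtotal_lakhs + contingency_quoted_lakhs
total_crores = total_lakhs * LAKH / CRORE

# Exchange rate implied by the quoted "~ $565,000" (derived, not from the source).
implied_rs_per_usd = total_lakhs * LAKH / 565_000

print(total_lakhs)                   # 260
print(total_crores)                  # 2.6
print(round(implied_rs_per_usd, 1))  # 46.0
```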


    Man-Power Cost

    Data collection team: Rs 86 Lakhs
        10 (for data collection) x Rs 10 K PM
        10 (for data correction) x Rs 10 K PM
        1 data manager (Rs 15 K PM)
        4 months cost: Rs 8,60,000 per language
    5 engineers: Rs 4 Lakhs
        B.Tech level (Rs 20,000 PM)
    Gifts for speakers: Rs 50 Lakhs
        Rs 25 per speaker
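    The man-power figures can be checked with a small calculation. It assumes 10 languages, a 4-month costing period for the engineers as well as the teams, and 20,000 speakers per language (the figure from the time-estimate slide); those links between slides are inferred, not stated.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees
MONTHS = 4
LANGUAGES = 10
SPEAKERS_PER_LANGUAGE = 20_000  # assumption: taken from the time-estimate slide

# Monthly pay for one language team: 10 collectors, 10 correctors, 1 data manager.
team_monthly = 10 * 10_000 + 10 * 10_000 + 1 * 15_000  # 215000
team_per_language = team_monthly * MONTHS              # 860000
team_total = team_per_language * LANGUAGES             # 8600000 (86 lakhs)

# Assumption: the 5 engineers are shared across languages and paid for 4 months.
engineers_total = 5 * 20_000 * MONTHS                  # 400000 (4 lakhs)

gifts_total = 25 * SPEAKERS_PER_LANGUAGE * LANGUAGES   # 5000000 (50 lakhs)

manpower_total = team_total + engineers_total + gifts_total

print(team_per_language)       # 860000
print(manpower_total // LAKH)  # 140
```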


    Machines Cost

    Machines:
        30 servers: Rs 30 Lakhs
            3 servers per language
            Each server has 4 ports for data collection
        30 CTI cards: Rs 20 Lakhs
    Storage: 20 TB: Rs 5 Lakhs
        Two copies of 20 TB


    Communications Cost

    Telephonic charges: Rs 20 Lakhs
        Rs 1 per min (local telephonic charges)
    Transportation: Rs 20 Lakhs
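    The equipment and communication heads can be cross-checked in the same way. Tying the telephonic charge to the 33,300 hours of Phase 1 speech is an assumption made here for illustration; the slides do not state how the Rs 20 Lakhs was derived.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees

# Equipment, in lakhs of rupees: servers, CTI cards, storage.
equipment_lakhs = 30 + 20 + 5  # 55

# Collection ports available per language: 3 servers with 4 ports each.
ports_per_language = 3 * 4     # 12

# Assumption: every recorded minute of the 33,300 Phase 1 hours is a local call at Rs 1 per minute.
phase1_hours = 33_300
telephony_rs = phase1_hours * 60 * 1  # 1998000 rupees

# Communication, in lakhs of rupees: quoted telephonic charges plus transportation.
communication_lakhs = 20 + 20  # 40

print(equipment_lakhs)      # 55
print(ports_per_language)   # 12
print(telephony_rs / LAKH)  # 19.98
print(communication_lakhs)  # 40
```

    Under that assumption the call charges come to just under the quoted Rs 20 Lakhs, and the two heads match the Rs 55 Lakhs and Rs 40 Lakhs used in the Phase 1 cost estimate.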