
Enabling Information Access through Mobile Based Dialog Systems and Screen Readers for Urdu

Data Collection and Tagging Team

• TTS Data

• ASR Data

TTS Data

• Corpus Design

• Speaker Selection and Recording

• Data Annotation

• Speech Annotation Quality Assessment

Corpus Design

• Source Data

– 37 million word corpus [Adeeba et al., 2014]

– 1M CLE Urdu Digest corpus

– Urdu news corpus

• Selection Mechanism

– Greedy algorithm [Habib et al., 2014]

• Shortlisted Data

– 10,500 sentences from the three corpora

– 8 hours of speech
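The greedy selection step can be illustrated with a minimal sketch. It is not the procedure of [Habib et al., 2014]: it assumes every candidate sentence is already transcribed into phonemes, uses diphones as the coverage unit, and at each step keeps the sentence that adds the most not-yet-covered diphones. The helper names `diphones` and `greedy_select` and the toy corpus are hypothetical.

```python
from typing import Dict, List, Set


def diphones(phones: List[str]) -> Set[str]:
    """Coverage units of one sentence (diphones are an assumed choice of unit)."""
    return {f"{a}-{b}" for a, b in zip(phones, phones[1:])}


def greedy_select(corpus: Dict[str, List[str]], budget: int) -> List[str]:
    """Repeatedly pick the sentence that adds the most uncovered diphones."""
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(corpus)  # sentence id -> phoneme sequence
    while remaining and len(selected) < budget:
        best_id, best_gain = None, 0
        for sid, phones in remaining.items():
            gain = len(diphones(phones) - covered)
            if gain > best_gain:  # ties keep the earlier sentence
                best_id, best_gain = sid, gain
        if best_id is None:  # no remaining sentence adds anything new
            break
        covered |= diphones(remaining.pop(best_id))
        selected.append(best_id)
    return selected


if __name__ == "__main__":
    # Toy corpus with romanized placeholder phonemes (hypothetical data).
    toy = {
        "s1": ["k", "i", "t", "a:", "b"],
        "s2": ["k", "i"],
        "s3": ["t", "a:", "b", "a:"],
    }
    print(greedy_select(toy, budget=10))
```

Here the budget is a sentence count; a real run could instead stop once the selected text reaches the target duration (about 8 hours of speech), and could weight units by frequency rather than counting them equally.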

Speaker Selection and Recording

• Speaker Selection Tests

– Phonetic level assessment

– Phonological level assessment

• Process of Recording the Speech Corpus

– 30 recording sessions

– 14 batches in 1 recording session

– 25 sentences in 1 batch

– Naming scheme for a batch: RS<ID>_B<ID>, e.g. RS1_B2

– Naming scheme for a sentence: RS<ID>_B<ID>_S<ID>_U<ID>, e.g. RS1_B2_S1_U1

– RS: Recording Session, B: Batch, S: Sentence, U: Utterance
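The recording plan and naming scheme can be checked with a short script. It confirms that 30 sessions × 14 batches × 25 sentences gives the 10,500 shortlisted sentences, and builds and parses names of the form RS<ID>_B<ID>_S<ID>_U<ID>. Treating U as the take number of a repeated recording is an assumption, and all function names are hypothetical.

```python
import re
from typing import List, Tuple

SESSIONS = 30             # recording sessions
BATCHES_PER_SESSION = 14  # batches in one recording session
SENTENCES_PER_BATCH = 25  # sentences in one batch

_UTTERANCE_RE = re.compile(r"RS(\d+)_B(\d+)_S(\d+)_U(\d+)")


def batch_id(session: int, batch: int) -> str:
    return f"RS{session}_B{batch}"


def utterance_id(session: int, batch: int, sentence: int, utterance: int = 1) -> str:
    # U is assumed to count takes of the same sentence (1 = first take).
    return f"{batch_id(session, batch)}_S{sentence}_U{utterance}"


def parse_utterance_id(name: str) -> Tuple[int, int, int, int]:
    """Split e.g. 'RS1_B2_S1_U1' into (session, batch, sentence, utterance)."""
    match = _UTTERANCE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid utterance id: {name!r}")
    session, batch, sentence, utterance = (int(g) for g in match.groups())
    return session, batch, sentence, utterance


def first_take_ids() -> List[str]:
    """IDs of the first take of every sentence in the recording plan."""
    return [
        utterance_id(rs, b, s)
        for rs in range(1, SESSIONS + 1)
        for b in range(1, BATCHES_PER_SESSION + 1)
        for s in range(1, SENTENCES_PER_BATCH + 1)
    ]


if __name__ == "__main__":
    ids = first_take_ids()
    print(len(ids))  # 30 * 14 * 25 = 10500
    print(ids[0], parse_utterance_id("RS1_B2_S1_U1"))
```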

Data Annotation

• The speech corpus must be annotated at six different levels for TTS synthesis [Mumtaz et al., 2014]:

– Segment Level Annotation

– Word Level Annotation

– Syllable Level Annotation

– Break Index Level Annotation

– Stress Level Annotation

– Intonation Level Annotation

• Annotation is done using Praat
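Praat keeps multi-tier annotations in TextGrid files. The sketch below serializes several annotation levels as interval tiers of one TextGrid in Praat's long text format and checks that each tier covers the whole file without gaps. Storing every level, including stress and intonation, as an interval tier is a simplifying assumption (such levels are often kept as point tiers), and the helper `to_textgrid` and the tier names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float, str]  # (start time in s, end time in s, label)


def to_textgrid(tiers: Dict[str, List[Interval]], xmax: float) -> str:
    """Serialize interval tiers to a Praat TextGrid (long text format)."""
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        f"size = {len(tiers)}",
        "item []:",
    ]
    for i, (name, intervals) in enumerate(tiers.items(), start=1):
        # An interval tier must cover the whole file without gaps or overlaps.
        prev_end = 0.0
        for start, end, _ in intervals:
            if abs(start - prev_end) > 1e-9 or end <= start:
                raise ValueError(f"tier {name!r}: interval ({start}, {end}) breaks contiguity")
            prev_end = end
        if abs(prev_end - xmax) > 1e-9:
            raise ValueError(f"tier {name!r} ends at {prev_end}, expected {xmax}")
        lines += [
            f"    item [{i}]:",
            '        class = "IntervalTier"',
            f'        name = "{name}"',
            "        xmin = 0",
            f"        xmax = {xmax}",
            f"        intervals: size = {len(intervals)}",
        ]
        for j, (start, end, label) in enumerate(intervals, start=1):
            text = label.replace('"', '""')  # Praat escapes a quote by doubling it
            lines += [
                f"        intervals [{j}]:",
                f"            xmin = {start}",
                f"            xmax = {end}",
                f'            text = "{text}"',
            ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two of the six levels, with romanized placeholder labels.
    print(to_textgrid({
        "segment": [(0.0, 0.2, "k"), (0.2, 0.5, "i")],
        "word": [(0.0, 0.5, "ki")],
    }, xmax=0.5))
```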

Speech Annotation Quality Assessment

• Segment Level Assessment

– Phoneme label checking

– Phoneme boundary checking using a maximum string alignment algorithm

• Word Level Assessment

– Word labels must not contain any non-speech phoneme label (SIL, PAU)

– The number of words in the text must equal the number of annotated words in the source file

– All labeled words must be syllabifiable according to the Urdu syllabification rules

– The pronunciation of each labeled word is compared with the standard Urdu pronunciation in the pronunciation lexicon
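Three of the word-level checks can be sketched as a single function: no SIL or PAU labels inside a word, equal word counts in the text and the annotation, and agreement with the lexicon pronunciation. The input layout (each annotated word paired with the phoneme labels inside it) and the romanized placeholder words are assumptions; the syllabification check and the segment-level boundary alignment are not covered here.

```python
from typing import Dict, List, Tuple

NON_SPEECH = {"SIL", "PAU"}  # non-speech labels that must not occur inside a word


def word_level_errors(
    text_words: List[str],
    annotated: List[Tuple[str, List[str]]],
    lexicon: Dict[str, List[str]],
) -> List[str]:
    """Return the problems found on the word tier of one utterance.

    text_words: words of the prompt text.
    annotated:  (word label, phoneme labels inside that word) from the annotation.
    lexicon:    word -> standard phoneme sequence.
    """
    errors: List[str] = []
    if len(text_words) != len(annotated):
        errors.append(
            f"word count mismatch: text has {len(text_words)}, annotation has {len(annotated)}"
        )
    for word, phones in annotated:
        found = sorted(NON_SPEECH.intersection(phones))
        if found:
            errors.append(f"{word}: contains non-speech label(s) {found}")
        expected = lexicon.get(word)
        if expected is None:
            errors.append(f"{word}: not found in the pronunciation lexicon")
        elif phones != expected:
            errors.append(
                f"{word}: annotated {' '.join(phones)} but lexicon has {' '.join(expected)}"
            )
    return errors


if __name__ == "__main__":
    lexicon = {"kitab": ["k", "i", "t", "a:", "b"]}  # romanized placeholder entry
    print(word_level_errors(["kitab"], [("kitab", ["k", "i", "SIL", "t", "a:", "b"])], lexicon))
```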

Status of the Urdu Speech Corpus

Figure: Progress chart covering the 1st to 10th hour of speech across the tasks: recording of the speech corpus; re-recording of the rejected sentences; segmentation of the speech corpus at sentence level; and phoneme, word, syllable, break index, stress and intonation level annotation. Legend: annotation completed; guidelines, testing process and annotation; guidelines process completed; unexplored.

Pronunciation Lexicon

• Sources

– 19.3 million words

– Oxford University Press (OUP)

– Speech corpus for the text-to-speech system

• The Urdu lexicon consists of three parts:

– Urdu word

– POS tag

– Pronunciation in IPA/CISAMPA

• Lexicon Size: 60K
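Because each lexicon entry carries a POS tag, the tag can choose among pronunciations of a homograph. The sketch below indexes entries by word and prefers the entry whose POS matches, falling back to the first entry. The fallback rule, the function names and the sample rows are assumptions; the `word_x` rows are hypothetical placeholders, and the IPA for کتاب is illustrative rather than taken from the CLE lexicon.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class LexEntry(NamedTuple):
    word: str           # Urdu word
    pos: str            # POS tag
    pronunciation: str  # space-separated phonemes (IPA or CISAMPA)


def build_index(entries: List[LexEntry]) -> Dict[str, List[LexEntry]]:
    """Group entries by word so homographs sit together."""
    index: Dict[str, List[LexEntry]] = defaultdict(list)
    for entry in entries:
        index[entry.word].append(entry)
    return dict(index)


def pronounce(index: Dict[str, List[LexEntry]], word: str, pos: str) -> Optional[str]:
    """Prefer the entry with a matching POS tag, else the first entry, else None."""
    candidates = index.get(word, [])
    for entry in candidates:
        if entry.pos == pos:
            return entry.pronunciation
    return candidates[0].pronunciation if candidates else None


if __name__ == "__main__":
    rows = [
        LexEntry("کتاب", "NN", "k ɪ t̪ ɑː b"),  # illustrative IPA
        LexEntry("word_x", "NN", "w1 w2"),      # hypothetical homograph, noun reading
        LexEntry("word_x", "VB", "w3 w4"),      # hypothetical homograph, verb reading
    ]
    index = build_index(rows)
    print(pronounce(index, "word_x", "VB"), pronounce(index, "کتاب", "NN"))
```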

ASR Data

• Vocabulary Size

– Weather Service: 144

– Location Service: 49

• Data Collection Design

– Speaker variation

– Geographic variation

– Mobile channel (multiple mobile operators, different mobile phones)

• Data Collection Methodology

– Randomize

– Lists

• ASR Recording Setup

– Available at 042-36811113

– Information required: Name, Zilla (district), Language

• ASR Data Cleaning (Guidelines)
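A randomized prompt list per caller can be sketched as follows. It assumes each caller is asked to say a random subset of the domain vocabulary with no repeats, and that the list is seeded by a speaker number so the same caller always gets the same list. The function names and the romanized vocabulary items are hypothetical.

```python
import random
from typing import Dict, List


def prompt_list(vocabulary: List[str], speaker_id: int, size: int) -> List[str]:
    """Randomized, reproducible prompt list for one caller (no repeated items)."""
    rng = random.Random(speaker_id)  # seeding by speaker id makes the list reproducible
    return rng.sample(vocabulary, k=min(size, len(vocabulary)))


def prompt_lists(vocabulary: List[str], speaker_ids: List[int], size: int) -> Dict[int, List[str]]:
    """One prompt list per speaker."""
    return {sid: prompt_list(vocabulary, sid, size) for sid in speaker_ids}


if __name__ == "__main__":
    # Hypothetical romanized weather-domain vocabulary items.
    vocab = ["lahore", "karachi", "multan", "aaj", "kal", "barish", "garmi", "sardi"]
    for sid, prompts in prompt_lists(vocab, [1, 2, 3], size=5).items():
        print(sid, prompts)
```

Seeding per speaker keeps the lists reproducible for later data cleaning; balancing how often each word is recorded across speakers would need an extra step not shown here.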

Status

Data Type          Duration                  No. of Speakers   Vocabulary Size
Weather Service    11 hours and 22 minutes   2218              144
Location Service   10 hours                  300               49

• Requirement

– Vocabulary Size: 500

– Number of Speakers: 200

– Duration: 6 hours

Telephony Team

• Telephony Framework R&D

• Current configuration

– LinkSys VoIP box (Hardware)

– Trixbox (Software, runs on CentOS)

• Multiple Sessions Handling

– Hardware can handle up to four calls

– Software has the capability to handle multiple calls
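The four-call hardware limit amounts to admission control: a new call is accepted only while a line is free. The sketch below models that with a bounded semaphore. It is a hypothetical illustration only, not Trixbox or Asterisk configuration (in the deployed system the limit comes from the VoIP box), and the class name `CallAdmission` is invented for the example.

```python
import threading


class CallAdmission:
    """Admit at most `capacity` simultaneous calls; extra callers are refused."""

    def __init__(self, capacity: int = 4) -> None:  # four calls, as on the VoIP box
        self._slots = threading.BoundedSemaphore(capacity)

    def try_admit(self) -> bool:
        """Take a free line without waiting; False means all lines are busy."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Free a line when a call ends."""
        self._slots.release()


if __name__ == "__main__":
    lines = CallAdmission()
    print([lines.try_admit() for _ in range(5)])  # fifth caller finds no free line
    lines.release()
    print(lines.try_admit())
```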

Dialog Framework Team

• Dialog Manager R&D

• RavenClaw

• Dialog Design for Weather forecast System

• Dialog Design for Location based Service

• Finalized Dialog framework
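RavenClaw describes a dialog as a task tree whose leaves ask for information, fetch data, or inform the user. The following is a hypothetical Python analogue of such a tree for the weather domain, not RavenClaw's own API: agencies run their children depth-first, request agents store the user's answer in a concept, a backend agent looks up the forecast, and an inform agent reads it out. All agent names, English prompts (standing in for Urdu ones) and the forecast table are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Concepts = Dict[str, str]


@dataclass
class Agent:
    name: str
    kind: str                                            # "agency", "request", "backend" or "inform"
    prompt: str = ""                                     # system prompt (request/inform)
    concept: Optional[str] = None                        # concept filled by request/backend agents
    lookup: Optional[Callable[[Concepts], str]] = None   # backend call
    children: List["Agent"] = field(default_factory=list)


def execute(agent: Agent, concepts: Concepts, user_turns: List[str],
            transcript: List[Tuple[str, str]]) -> None:
    """Run the task tree depth-first, left to right."""
    if agent.kind == "agency":
        for child in agent.children:
            execute(child, concepts, user_turns, transcript)
    elif agent.kind == "request":
        transcript.append(("system", agent.prompt))
        answer = user_turns.pop(0)  # stands in for the ASR result
        transcript.append(("user", answer))
        concepts[agent.concept] = answer
    elif agent.kind == "backend":
        concepts[agent.concept] = agent.lookup(concepts)
    elif agent.kind == "inform":
        transcript.append(("system", agent.prompt.format(**concepts)))
    else:
        raise ValueError(f"unknown agent kind: {agent.kind}")


# Hypothetical stand-in for the weather database.
FORECASTS = {("Lahore", "tomorrow"): "sunny, 34 C"}

weather_dialog = Agent("WeatherDialog", "agency", children=[
    Agent("Greet", "inform", prompt="Welcome to the weather service."),
    Agent("AskCity", "request", prompt="Which city?", concept="city"),
    Agent("AskDay", "request", prompt="For which day?", concept="day"),
    Agent("GetForecast", "backend", concept="forecast",
          lookup=lambda c: FORECASTS.get((c["city"], c["day"]), "not available")),
    Agent("GiveForecast", "inform", prompt="Weather in {city} {day}: {forecast}."),
])

if __name__ == "__main__":
    log: List[Tuple[str, str]] = []
    execute(weather_dialog, {}, ["Lahore", "tomorrow"], log)
    for speaker, text in log:
        print(f"{speaker}: {text}")
```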

Figure: System architecture for a single session [Haq et al., 2014]. Labels in the diagram: VoIP box, Asterisk server, IVR, telephony server/framework; Galaxy framework, Galaxy Hub, Urdu ASR recognizer, interaction manager; dialogue manager, dialog task tree, dialog DB, backend, weather DB, Internet; host machines running CentOS, Fedora and Windows; flows for call/session initiation, user input & signalling, system output & signalling, and text.

Status

Quarter   Deliverable
3         Report on telephony framework design
5         Prototype weather dialog system
6         Prototype location dialog system
8         Dialog manager for weather domain
9         Dialog manager for location finder domain
10        Spoken dialog system for weather domain
11        Spoken dialog system for location finder domain
11        General framework for dialog system

Status scale: 25% (Initial Stage), 50% (Prototype Stage), 75% (Final Stage), 100% (Completed)

Text to Speech Synthesis Team

• Natural Language Processing (NLP) engine

• Text to speech synthesis using Festival

• Text to speech synthesis using HTS

• Integration of NLP

• Evaluation

Natural Language Processing (NLP)

• Tokenize the words

• Normalize the text [Basit and Hussain, 2014]

• Select the appropriate pronunciation

• Apply syllabification via template matching

• Perform part-of-speech (POS) tagging
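Syllabification by template matching can be sketched in two steps: split the phoneme string at vowel nuclei, giving at most one consonant to each onset, then match each syllable's consonant/vowel shape against a template set and reject words that fit no template. The template set (C)V(C)(C) and the vowel symbols are assumptions for illustration, not the exact inventory of the Urdu NLP engine.

```python
from typing import List

# Hypothetical vowel symbols for the sketch (long vowels marked with ':').
VOWELS = {"a", "a:", "i", "i:", "u", "u:", "e:", "o:"}
# Assumed syllable templates: at most one onset consonant, at most two coda consonants.
TEMPLATES = {"V", "VC", "VCC", "CV", "CVC", "CVCC"}


def shape(syllable: List[str]) -> str:
    """Consonant/vowel shape of a syllable, e.g. ['t', 'a:', 'b'] -> 'CVC'."""
    return "".join("V" if p in VOWELS else "C" for p in syllable)


def syllabify(phones: List[str]) -> List[List[str]]:
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        raise ValueError("no vowel, cannot syllabify")
    starts = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        # One consonant between two nuclei (if any) becomes the next onset;
        # any others stay in the coda of the previous syllable.
        starts.append(nxt - 1 if nxt - prev > 1 else nxt)
    ends = starts[1:] + [len(phones)]
    syllables = [phones[s:e] for s, e in zip(starts, ends)]
    for syl in syllables:
        if shape(syl) not in TEMPLATES:
            raise ValueError(f"no template matches {syl} ({shape(syl)})")
    return syllables


if __name__ == "__main__":
    print(syllabify(["k", "i", "t", "a:", "b"]))  # romanized kitaab
```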

NLP Engine (Output)

Table: Sample input/output pairs of the NLP engine. The inputs include Urdu text, numbers in Latin and Urdu digits (12324, ۸۹۳۳), a date (12-2-2000), times (5:00, ۴:۴۰) and a fraction (1/3); each output is the input written out in Urdu words.

Festival (Speech Synthesis Engine)

– A general framework for building synthetic voices

– Incorporates the essential tools for building voices in a new language

– Supports both unit selection and HMM-based speech synthesis

– HMM-based (HTS) and unit selection voices for Urdu can be accessed at:

http://182.180.102.251:8080/urdutts/

Hidden Markov Model(HMM) based Speech Synthesis (HTS)

– HTS is a toolkit for building statistical parametric voices

– Uses Hidden Markov Models (HMMs) as synthesis models

– Works as a patch to the HMM Toolkit (HTK) [Nawaz et al., 2014]

– Uses Festival’s front-end as a text analyzer

NLP Integration with Festival

– Festival provides the flexibility of introducing additional modules, if required by the new language.

Figure: Input: Urdu text → NLP module → Festival speech synthesis (unit selection based or HMM-based speech synthesis) → Output: synthesized waveform

Evaluation

• Subjective Evaluation

– Naturalness: measures the naturalness of the synthesized voice on a scale of 1 to 5

– Intelligibility: measures the intelligibility of the synthesized voice on a scale of 1 to 5

• Interface can be accessed at: http://www.cle.org.pk/tts

• Objective Evaluation

– Uses automatic processes to evaluate the quality of the synthesized speech

– Performed by building two automatic speech recognizers (ASRs)

– Due to the low accuracy of the ASRs, the correlation between the subjective and objective tests is very low
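The subjective scores are mean opinion scores (MOS) on the 1 to 5 scale, and comparing them with an objective measure comes down to a correlation. The sketch below computes a per-sentence MOS and the Pearson correlation between those scores and an objective score. Using ASR word accuracy per sentence as the objective score is an assumption about how the two tests are paired, and the ratings and accuracies are hypothetical.

```python
from math import sqrt
from typing import Sequence


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Average of listener ratings on the 1-5 scale."""
    if not ratings:
        raise ValueError("no ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    # Hypothetical naturalness ratings per sentence and ASR word accuracy per sentence.
    naturalness = {"s1": [4, 3, 4], "s2": [2, 3, 2], "s3": [5, 4, 4]}
    asr_accuracy = {"s1": 0.62, "s2": 0.58, "s3": 0.71}
    ids = sorted(naturalness)
    mos = [mean_opinion_score(naturalness[i]) for i in ids]
    print(round(pearson(mos, [asr_accuracy[i] for i in ids]), 3))
```

A coefficient near zero would match the observation above that the objective scores track the listening tests poorly when the recognizers themselves are inaccurate.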

Quarter   Deliverable
2         TTS toolkit report
4         Prototype TTS NLP module
5         TTS corpus design report
7         Prototype TTS system
8         Baseline TTS system for 5000 words
9         TTS quality report
10        Improved TTS system for 10,000 words and test report
10        Screen reader integration with TTS system and test report
11        User guide for the TTS

Status scale: 25% (Initial Stage), 50% (Prototype Stage), 75% (Final Stage), 100% (Completed)

Automatic Speech Recognition(ASR) Team

• Spoken Dialog-based Weather forecast System for Urdu

• Spoken Dialog-based Location System for Urdu

• Keyword Spotting based ASR

• Accent Identification

ASR Systems

• Spoken Dialog-based Weather forecast System for Urdu

– Vocabulary Size: 174

• Spoken Dialog-based Location System for Urdu

– Vocabulary Size: 338

Keyword Spotting System [Irtza et al., 2014]

• Systems

– Location: 2 clusters

– Response: Yes/No

– Counting

– Time

• Vocabulary

• Accuracy issue

Integration with Dialog System

• ASR systems are integrated with Dialog Manager

Accent Identification

• An acoustic distance based measure is used to check the similarity between the Punjabi, Urdu, Pashto, Saraiki and Sindhi accents of the Urdu language [Afsheen et al., 2014].

• Currently working on a statistical accent identification system
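One simple acoustic distance between accents is the Euclidean distance between per-accent mean feature vectors (for example, mean MFCC frames). The sketch below computes that for every pair of accents. It is a stand-in to show the idea, not necessarily the measure used by [Afsheen et al., 2014]; the function names and the two-dimensional toy features are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Sequence, Tuple


def centroid(frames: List[Sequence[float]]) -> List[float]:
    """Mean feature vector over all frames of one accent."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def accent_distances(features: Dict[str, List[Sequence[float]]]) -> Dict[Tuple[str, str], float]:
    """Euclidean distance between the mean feature vectors of every accent pair."""
    centroids = {accent: centroid(frames) for accent, frames in features.items()}
    return {
        (a, b): dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


if __name__ == "__main__":
    # Toy 2-dimensional "feature frames" per accent (hypothetical values).
    toy = {
        "Punjabi": [[0.0, 0.0], [2.0, 0.0]],
        "Urdu": [[3.0, 4.0], [3.0, 4.0]],
        "Sindhi": [[1.0, 1.0], [1.0, 1.0]],
    }
    for pair, d in accent_distances(toy).items():
        print(pair, round(d, 3))
```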

Quarter   Deliverable
2         ASR toolkit report
3         Prototype ASR system
6         Baseline speech recognition system for weather domain
7         Test report for baseline speech recognition system for weather domain
8         Speech recognition system for weather domain
9         Speech recognition system for location finder domain
11        Final ASR system for weather and location finder domains and test report

Status scale: 25% (Initial Stage), 50% (Prototype Stage), 75% (Final Stage), 100% (Completed)

References

• [Irtza, 2014] Irtza, S., Rehman, K., Hussain, S. and Adeeba, F. "Urdu Keyword Spotting System using HMM", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).

• [Adeeba, 2014] Adeeba, F., Akram, Q., Khalid, H. and Hussain, S. "CLE Urdu Books N-grams", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).

• [Mumtaz, 2014] Mumtaz, B., Hussain, A., Hussain, S., Mahmood, A., Bhatti, R., Farooq, M. and Rauf, S. "Multitier Annotation of Urdu Speech Corpus", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).

• [Afsheen, 2014] Afsheen, Irtza, S., Farooq, M. and Hussain, S. "Accent Classification among Punjabi, Urdu, Pashto, Saraiki and Sindhi Accents of Urdu Language", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).

• [Haq, 2014] Haq, I. A., Anwar, A., Ahmad, A., Habib, T., Hussain, S. and Rahman, S. "Spoken Dialog System: Direction Guide for Lahore City", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).

• [Nawaz, 2014] Nawaz, O. and Habib, T. "Hidden Markov Model (HMM) based Speech Synthesis for Urdu Language", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).

• [Basit, 2014] Basit, H. R. and Hussain, S. "Text Processing for Urdu TTS System", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).

• [Habib, 2014] Habib, W., Basit, H. R., Hussain, S. and Adeeba, F. "Design of Speech Corpus for Open Domain Urdu Text to Speech System Using Greedy Algorithm", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/).
