Enabling Information Access through Mobile Based Dialog Systems and Screen Readers for Urdu
Data Collection and Tagging Team
• TTS Data
• ASR Data
TTS Data
• Corpus Design
• Speaker Selection and Recording
• Data Annotation
• Speech Annotation Quality Assessment
Corpus Design
• Source Data
– 37-million-word corpus [Adeeba et al., 2014]
– 1M CLE Urdu Digest corpus
– Urdu news corpus
• Selection Mechanism
– Greedy algorithm [Habib et al., 2014]
• Shortlisted Data
– 10,500 sentences from the three corpora
– 8 hours of speech
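A greedy selection step of this kind can be sketched as follows. This is a minimal illustration, not the method of [Habib et al., 2014]: it assumes each candidate sentence is represented by the set of phonetic units it contains (e.g. diphones) and repeatedly picks the sentence that adds the most units not yet covered. The unit type, function name and toy data are hypothetical.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]], max_sentences: int) -> List[str]:
    """Greedily pick sentences that add the most uncovered units.

    `candidates` maps a sentence ID to the set of phonetic units
    (assumed here to be diphones) that the sentence contains.
    """
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < max_sentences:
        # Score every remaining sentence by the number of new units it adds.
        best_id = max(remaining, key=lambda sid: len(remaining[sid] - covered))
        gain = remaining[best_id] - covered
        if not gain:
            break  # no remaining sentence improves coverage
        selected.append(best_id)
        covered |= gain
        del remaining[best_id]
    return selected


if __name__ == "__main__":
    # Toy candidate pool with hypothetical diphone sets.
    pool = {
        "s1": {"a-b", "b-c"},
        "s2": {"a-b", "b-c", "c-d"},
        "s3": {"d-e"},
    }
    print(greedy_select(pool, max_sentences=10))
```

Here s2 is chosen first because it adds three new units; s1 is never chosen because everything it contains is already covered. A real selection would also weight rare units and cap sentence length.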
Speaker Selection and Recording
• Speaker Selection Tests
– Phonetic-level assessment
– Phonological-level assessment
• Process of Recording the Speech Corpus
– 30 recording sessions
– 14 batches in 1 recording session
– 25 sentences in 1 batch
– Naming scheme for a batch: RS<ID>_B<ID>, e.g. RS1_B2
– Naming scheme for a sentence: RS<ID>_B<ID>_S<ID>_U<ID>, e.g. RS1_B2_S1_U1
– RS: Recording Session; B: Batch; S: Sentence; U: Utterance
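The naming scheme and the recording arithmetic can be checked with a small helper. The function and constant names below are hypothetical; the sketch only assumes that each field carries a plain decimal ID, as in the RS1_B2_S1_U1 example.

```python
import re
from typing import NamedTuple

# Matches names such as RS1_B2_S1_U1 (session, batch, sentence, utterance).
UTTERANCE_RE = re.compile(r"^RS(\d+)_B(\d+)_S(\d+)_U(\d+)$")


class UtteranceId(NamedTuple):
    session: int
    batch: int
    sentence: int
    utterance: int


def make_name(session: int, batch: int, sentence: int, utterance: int) -> str:
    """Build a file name following the RS<ID>_B<ID>_S<ID>_U<ID> scheme."""
    return f"RS{session}_B{batch}_S{sentence}_U{utterance}"


def parse_name(name: str) -> UtteranceId:
    """Split a file name back into its four numeric fields."""
    match = UTTERANCE_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid utterance name: {name!r}")
    return UtteranceId(*(int(group) for group in match.groups()))


# 30 sessions x 14 batches x 25 sentences gives the 10,500 shortlisted sentences.
SESSIONS, BATCHES_PER_SESSION, SENTENCES_PER_BATCH = 30, 14, 25
TOTAL_SENTENCES = SESSIONS * BATCHES_PER_SESSION * SENTENCES_PER_BATCH

if __name__ == "__main__":
    print(make_name(1, 2, 1, 1))
    print(parse_name("RS1_B2_S1_U1"))
    print(TOTAL_SENTENCES)
```

The product 30 × 14 × 25 = 10,500 matches the size of the shortlisted corpus, which is a quick sanity check on the session plan.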
Data Annotation
• The speech corpus needs to be annotated at six different levels [Mumtaz et al., 2014] for TTS synthesis:
– Segment Level Annotation
– Word Level Annotation
– Syllable Level Annotation
– Break Index Level Annotation
– Stress Level Annotation
– Intonation Level Annotation
• Annotation is done using PRAAT
Speech Annotation Quality Assessment
• Segment Level Assessment
– Phoneme label checking
– Phoneme boundary checking using a maximum string alignment algorithm
• Word Level Assessment
– Word labels should not contain any non-speech phoneme label (SIL, PAU)
– The number of words in the text should equal the number of annotated words in the source file
– All labeled words should be syllabifiable according to the Urdu syllabification rules
– The pronunciation of each labeled word is compared with the standard Urdu pronunciation in the pronunciation lexicon
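The first two word-level checks can be sketched as a function over one utterance. This assumes the word tier has already been exported from PRAAT as a list of label strings and that the transcript can be split on whitespace; both are simplifying assumptions, and the syllabification and lexicon checks are not covered here. Function and variable names are hypothetical.

```python
from typing import List

# Non-speech labels that must never appear as word labels.
NON_SPEECH_LABELS = {"SIL", "PAU"}


def check_word_tier(word_labels: List[str], transcript: str) -> List[str]:
    """Return the problems found on one utterance's word tier."""
    problems: List[str] = []
    # Check 1: no word interval may carry a non-speech label.
    for index, label in enumerate(word_labels):
        if label in NON_SPEECH_LABELS:
            problems.append(f"word {index}: non-speech label {label!r}")
    # Check 2: the text and the annotation must have the same word count
    # (whitespace tokenization is an assumption for this sketch).
    text_words = transcript.split()
    if len(text_words) != len(word_labels):
        problems.append(
            f"word count mismatch: text has {len(text_words)}, "
            f"annotation has {len(word_labels)}"
        )
    return problems


if __name__ == "__main__":
    # Romanized toy labels; real labels would be in Urdu script.
    print(check_word_tier(["yeh", "SIL", "kitaab"], "yeh kitaab"))
```

An empty list means the utterance passes both checks.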
Status of the Urdu Speech Corpus
[Figure: progress chart over the 1st to 10th hour of the speech corpus for each task: recording of the speech corpus; re-recording of the rejected sentences; segmentation of the speech corpus at sentence level; phoneme, word, syllable, break index, stress and intonation level annotation. Legend: annotation completed; guidelines, testing process and annotation; guidelines process completed; unexplored.]
Pronunciation Lexicon
• Sources
– 19.3 million words
– Oxford University Press (OUP)
– Speech corpus for the text-to-speech system
• Each entry in the Urdu lexicon consists of three parts:
– Urdu word
– POS tag
– Pronunciation in IPA/CISAMPA
• Lexicon Size: 60K
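A three-part entry maps naturally onto a small record type with a lookup keyed by word and POS tag. The tab-separated file layout, the tag name "NN" and the single sample entry below are illustrative assumptions, not the project's actual lexicon format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class LexiconEntry:
    word: str           # Urdu word in Urdu script
    pos: str            # POS tag (tag names here are hypothetical)
    pronunciation: str  # transcription in IPA or CISAMPA


def parse_lexicon(lines: Iterable[str]) -> List[LexiconEntry]:
    """Read hypothetical tab-separated lines: word<TAB>POS<TAB>pronunciation."""
    entries: List[LexiconEntry] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, pos, pronunciation = line.split("\t")
        entries.append(LexiconEntry(word, pos, pronunciation))
    return entries


class Lexicon:
    """Pronunciation lookup keyed by (word, POS tag)."""

    def __init__(self, entries: Iterable[LexiconEntry]) -> None:
        self._index: Dict[Tuple[str, str], str] = {
            (e.word, e.pos): e.pronunciation for e in entries
        }

    def lookup(self, word: str, pos: str) -> Optional[str]:
        """Return the pronunciation, or None if the pair is not listed."""
        return self._index.get((word, pos))


if __name__ == "__main__":
    # Illustrative entry: کتاب (kitab, "book") with an IPA transcription.
    lexicon = Lexicon(parse_lexicon(["کتاب\tNN\tkɪt̪ɑːb"]))
    print(lexicon.lookup("کتاب", "NN"))
```

Keying on the POS tag as well as the word is one plausible reason to store the tag: a homograph can then carry a different pronunciation for each part of speech.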
ASR Data
• Vocabulary Size
– Weather Service: 144
– Location Service: 49
• Data Collection Design
– Speaker variation
– Geographic variation
– Mobile channel (multiple mobile operators, different mobile phones)
• Data Collection Methodology
– Randomized lists
• ASR Recording Setup
– Available at 042-36811113
– Information required: name, zilla (district), language
• ASR Data Cleaning (Guidelines)
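Randomized prompt lists can be generated by shuffling the service vocabulary and cutting it into fixed-size lists. This is a sketch under assumptions: the list size, the per-speaker seed and the placeholder word IDs are hypothetical; only the vocabulary size of 49 for the location service comes from the slide.

```python
import random
from typing import List, Sequence


def make_prompt_lists(vocabulary: Sequence[str], list_size: int,
                      seed: int) -> List[List[str]]:
    """Shuffle the vocabulary and cut it into prompt lists.

    Every word appears exactly once across the returned lists. Using a
    different seed per speaker varies the word order between callers,
    while the same seed always reproduces the same lists.
    """
    words = list(vocabulary)
    random.Random(seed).shuffle(words)
    return [words[i:i + list_size] for i in range(0, len(words), list_size)]


if __name__ == "__main__":
    # Placeholder IDs standing in for the 49 location-service words.
    location_vocab = [f"w{i}" for i in range(49)]
    for prompts in make_prompt_lists(location_vocab, list_size=10, seed=7):
        print(prompts)
```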
Status
Data Type | Duration | No. of Speakers | Vocabulary Size
Weather Service | 11 hours 22 minutes | 2,218 | 144
Location Service | 10 hours | 300 | 49
• Requirement
– Vocabulary size: 500
– Number of speakers: 200
– Duration: 6 hours
Telephony Team
• Telephony Framework R&D
• Current configuration
– LinkSys VoIP box (Hardware)
– Trixbox (Software, runs on CentOS)
• Multiple Sessions Handling
– Hardware can handle up to four calls
– Software has the capability to handle multiple calls
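The four-call hardware limit behaves like a counting semaphore: a call holds one line for its duration, and any further call has to wait until a line is released. The sketch below only models that idea with `asyncio.Semaphore`. It is not Trixbox or Asterisk configuration. Whether the real system queues or rejects a fifth call is not stated on the slide; in this sketch it waits. All names are hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class LineStats:
    active: int = 0  # calls currently holding a line
    peak: int = 0    # highest number of simultaneous calls observed


async def handle_call(call_id: int, lines: asyncio.Semaphore,
                      stats: LineStats) -> int:
    """Hold one hardware line for the duration of a simulated call."""
    async with lines:  # waits here while all lines are busy
        stats.active += 1
        stats.peak = max(stats.peak, stats.active)
        await asyncio.sleep(0.01)  # placeholder for the dialog session
        stats.active -= 1
    return call_id


async def run_calls(n_calls: int, max_lines: int = 4) -> LineStats:
    """Start n_calls concurrent calls against max_lines lines."""
    lines = asyncio.Semaphore(max_lines)
    stats = LineStats()
    await asyncio.gather(*(handle_call(i, lines, stats) for i in range(n_calls)))
    return stats


if __name__ == "__main__":
    print(asyncio.run(run_calls(10)).peak)
```

With ten incoming calls, never more than four are active at once; the remaining calls proceed as lines free up.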
Dialog Framework Team
• Dialog Manager R&D
• RavenClaw
• Dialog Design for Weather Forecast System
• Dialog Design for Location-Based Service
• Finalized Dialog framework
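RavenClaw organizes a dialog as a task tree: inner nodes (agencies) group sub-tasks and leaves (agents) perform single actions such as informing, requesting or executing a backend call. The sketch below shows only that tree shape and a default depth-first, left-to-right execution order for a hypothetical weather dialog. It does not use RavenClaw's own task specification and ignores preconditions, triggers and error handling. Node names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    kind: str  # "agency" for inner nodes, or an agent type such as "request"
    children: List["Node"] = field(default_factory=list)


def execution_order(node: Node) -> List[str]:
    """List the leaf agents in depth-first, left-to-right order."""
    if not node.children:
        return [node.name]
    order: List[str] = []
    for child in node.children:
        order.extend(execution_order(child))
    return order


# Hypothetical task tree for the weather forecast dialog.
weather_task = Node("WeatherDialog", "agency", [
    Node("Greet", "inform"),
    Node("GetCity", "agency", [
        Node("RequestCity", "request"),
        Node("ConfirmCity", "request"),
    ]),
    Node("FetchForecast", "execute"),  # e.g. query the weather database
    Node("InformForecast", "inform"),
    Node("Goodbye", "inform"),
])

if __name__ == "__main__":
    print(execution_order(weather_task))
```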
System Architecture for Single Session [Haq et al., 2014]
[Figure: single-session system architecture. Labeled components: Urdu ASR (recognizer); telephony server/framework (VoIP box, Asterisk server, IVR); Galaxy framework (Galaxy hub, interaction manager, dialogue manager, dialog task tree, backend); weather DB; dialog DB; and the Internet, spread across Fedora, CentOS and Windows machines. Labeled flows: call/session initiation, user input and signalling, system output and signalling, and text.]
Status
Quarter | Deliverable
3 | Report on telephony framework design
5 | Prototype weather dialog system
6 | Prototype location dialog system
8 | Dialog manager for weather domain
9 | Dialog manager for location finder domain
10 | Spoken dialog system for weather domain
11 | Spoken dialog system for location finder domain
11 | General framework for dialog system
Status key: 25% (initial stage), 50% (prototype stage), 75% (final stage), 100% (completed)
Text to Speech Synthesis Team
• Natural Language Processing (NLP) engine
• Text to speech synthesis using Festival
• Text to speech synthesis using HTS
• Integration of NLP
• Evaluation
Natural Language Processing (NLP)
• Tokenizes the words
• Normalizes text [Basit and Hussain, 2014]
• Selects the appropriate pronunciation
• Applies syllabification via template matching
• Performs part-of-speech (POS) tagging
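Syllabification by template matching can be illustrated with a small matcher over a consonant/vowel (CV) pattern. The template inventory (CV, CVC, CVCC, V, VC, VCC), the romanized vowel symbols and the right-to-left, longest-match-first strategy with backtracking are all assumptions for illustration, not the engine's actual rules.

```python
from typing import List, Optional

# Assumed syllable templates (C = consonant, V = vowel), longest first.
TEMPLATES = ["CVCC", "VCC", "CVC", "VC", "CV", "V"]
# Hypothetical vowel inventory in a romanized phoneme notation.
VOWELS = {"a", "a:", "i", "i:", "u", "u:", "e:", "o:"}


def cv_pattern(phonemes: List[str]) -> str:
    """Map each phoneme to C or V."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)


def _split(pattern: str) -> Optional[List[str]]:
    """Match templates from the right end, longest first, with backtracking."""
    if not pattern:
        return []
    for template in TEMPLATES:
        if pattern.endswith(template):
            rest = _split(pattern[: -len(template)])
            if rest is not None:
                return rest + [template]
    return None  # no template sequence covers this pattern


def syllabify(phonemes: List[str]) -> List[List[str]]:
    """Split a phoneme sequence into syllables using the templates."""
    pattern = cv_pattern(phonemes)
    templates = _split(pattern)
    if templates is None:
        raise ValueError(f"no template sequence matches {pattern}")
    syllables, start = [], 0
    for template in templates:
        syllables.append(phonemes[start:start + len(template)])
        start += len(template)
    return syllables


if __name__ == "__main__":
    # kitaab ("book"): CVCVC -> CV + CVC
    print(syllabify(["k", "i", "t", "a:", "b"]))
```

Matching from the right lets a lone consonant before a vowel become that syllable's onset (ki.taab rather than kit.aab), which is the usual preference when onsets are limited to one consonant.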
NLP Engine (Output)
[Table: sample inputs to the NLP engine and their normalized Urdu outputs. Inputs include an Urdu phrase, numbers (12324, ۸۹۳۳), a date (12-2-2000), times (5:00, ۴:۴۰) and a fraction (1/3).]
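Before numbers, dates, times and fractions like these can be expanded into Urdu words, each token has to be classified. The sketch below covers only that classification step: it maps Urdu (Extended Arabic-Indic) digits to ASCII and tests a few ordered regular expressions. The patterns and class names are assumptions; the expansion rules themselves are not shown.

```python
import re
from typing import List, Tuple

# Map Urdu (Extended Arabic-Indic, U+06F0 to U+06F9) digits to ASCII digits.
URDU_DIGITS = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")

# Ordered patterns: the first one that matches decides the token class.
PATTERNS: List[Tuple[str, "re.Pattern[str]"]] = [
    ("date", re.compile(r"^\d{1,2}-\d{1,2}-\d{4}$")),
    ("time", re.compile(r"^\d{1,2}:\d{2}$")),
    ("fraction", re.compile(r"^\d+/\d+$")),
    ("number", re.compile(r"^\d+$")),
]


def classify(token: str) -> Tuple[str, str]:
    """Return (class, digit-normalized token); non-numeric tokens are 'word'."""
    ascii_token = token.translate(URDU_DIGITS)
    for label, pattern in PATTERNS:
        if pattern.match(ascii_token):
            return label, ascii_token
    return "word", token


if __name__ == "__main__":
    for tok in ["12324", "۸۹۳۳", "12-2-2000", "5:00", "۴:۴۰", "1/3"]:
        print(tok, classify(tok))
```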
Festival (Speech Synthesis Engine)
– A general framework for building synthetic voices
– Incorporates the essential tools for building voices in a new language
– Supports unit selection as well as HMM-based speech synthesis
– HMM-based (HTS) and unit selection voices for Urdu can be accessed at:
http://182.180.102.251:8080/urdutts/
Hidden Markov Model (HMM)-based Speech Synthesis (HTS)
– HTS is a toolkit for building statistical parametric voices
– The synthesis models used are hidden Markov models (HMMs)
– Works as a patch to the HMM Toolkit (HTK) [Nawaz and Habib, 2014]
– Uses Festival’s front-end as a text analyzer
NLP Integration with Festival
– Festival provides the flexibility of introducing additional modules, if required by the new language.
[Figure: Urdu text input → NLP module → Festival speech synthesis (unit selection based or HMM-based) → output: synthesized waveform]
Evaluation
• Subjective Evaluation
– Naturalness: measures the naturalness of the synthesized voice, rated on a scale of 1 to 5
– Intelligibility: measures the intelligibility of the synthesized voice, rated on a scale of 1 to 5
• Interface can be accessed at: http://www.cle.org.pk/tts
• Objective Evaluation
– Uses automatic processes to evaluate the synthesis quality
– Performed by building two automatic speech recognizers (ASRs)
– Due to the low accuracy of the ASRs, the correlation between the subjective and objective tests is very low
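The subjective scores and their agreement with an objective measure can be computed as below. This is a sketch under assumptions: per-system 1 to 5 ratings are averaged into a mean opinion score (MOS) and compared with an objective score such as ASR word accuracy using Pearson correlation. The slide does not say which correlation measure was used, and all numbers in the usage block are hypothetical.

```python
from typing import Sequence


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Average listener rating on the 1-5 scale."""
    if not ratings:
        raise ValueError("no ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical listener ratings for three synthesized test sets.
    mos = [mean_opinion_score(r) for r in ([4, 5, 4], [3, 3, 2], [2, 1, 2])]
    asr_accuracy = [0.71, 0.65, 0.40]  # hypothetical objective scores
    print(mos, round(pearson(mos, asr_accuracy), 3))
```

A correlation near zero would reflect the low agreement reported above; with an inaccurate recognizer, the objective column is dominated by recognition errors rather than synthesis quality.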
Quarter | Deliverable
2 | TTS toolkit report
4 | Prototype TTS NLP module
5 | TTS corpus design report
7 | Prototype TTS system
8 | Baseline TTS system for 5,000 words
9 | TTS quality report
10 | Improved TTS system for 10,000 words and test report
10 | Screen reader integration with TTS system and test report
11 | User guide for the TTS
Status key: 25% (initial stage), 50% (prototype stage), 75% (final stage), 100% (completed)
Automatic Speech Recognition(ASR) Team
• Spoken Dialog-based Weather forecast System for Urdu
• Spoken Dialog-based Location based system
• Keyword Spotting based ASR
• Accent Identification
ASR Systems
• Spoken Dialog-based Weather forecast System for Urdu
– Vocabulary Size: 174
• Spoken Dialog-based Location System for Urdu
– Vocabulary Size: 338
Keyword Spotting System [Irtza et al., 2014]
• Systems
– Location: 2 clusters
– Response: yes/no
– Counting
– Time
• Vocabulary
• Accuracy issue
Integration with Dialog System
• ASR systems are integrated with Dialog Manager
Accent Identification
• Acoustic distance-based measure to check the similarity between the Punjabi, Urdu, Pashto, Saraiki and Sindhi accents of Urdu [Afsheen et al., 2014]
• Currently working on a statistical accent identification system
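An acoustic distance measure between accents can be sketched by summarizing each accent's speech as the mean of its frame-level feature vectors and comparing those means pairwise. The choice of features (e.g. averaged MFCCs) and of Euclidean distance are assumptions; the slide does not say which features or distance [Afsheen et al., 2014] used. The two-dimensional toy vectors below are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def centroid(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean feature vector over all frames of one accent."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def pairwise_distances(
    accents: Dict[str, Sequence[Sequence[float]]],
) -> Dict[Tuple[str, str], float]:
    """Euclidean distance between every pair of accent centroids."""
    centroids = {name: centroid(frames) for name, frames in accents.items()}
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


if __name__ == "__main__":
    # Toy 2-D "feature frames" per accent.
    toy = {
        "Punjabi": [[0.0, 0.0], [2.0, 0.0]],
        "Sindhi": [[1.0, 0.0], [1.0, 0.0]],
        "Urdu": [[4.0, 4.0], [4.0, 4.0]],
    }
    for pair, distance in pairwise_distances(toy).items():
        print(pair, round(distance, 3))
```

A smaller distance suggests more similar accents; a statistical classifier would instead model the full feature distribution of each accent.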
Quarter | Deliverable
2 | ASR toolkit report
3 | Prototype ASR system
6 | Baseline speech recognition system for weather domain
7 | Test report for baseline speech recognition system for weather
8 | Speech recognition system for weather domain
9 | ASR speech recognition system for location finder domain
11 | Final ASR system for weather and location finder domain and test report
Status key: 25% (initial stage), 50% (prototype stage), 75% (final stage), 100% (completed)
References
• [Irtza et al., 2014] Irtza, S., Rehman, K., Hussain, S. and Adeeba, F. "Urdu Keyword Spotting System using HMM", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)
• [Adeeba et al., 2014] Adeeba, F., Akram, Q., Khalid, H. and Hussain, S. "CLE Urdu Books N-grams", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)
• [Mumtaz et al., 2014] Mumtaz, B., Hussain, A., Hussain, S., Mahmood, A., Bhatti, R., Farooq, M. and Rauf, S. "Multitier Annotation of Urdu Speech Corpus", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)
• [Afsheen et al., 2014] Afsheen, Irtza, S., Farooq, M. and Hussain, S. "Accent Classification among Punjabi, Urdu, Pashto, Saraiki and Sindhi Accents of Urdu Language", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)
• [Haq et al., 2014] Haq, I. A., Anwar, A., Ahmad, A., Habib, T., Hussain, S. and Rahman, S. "Spoken Dialog System: Direction Guide for Lahore City", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)
• [Nawaz and Habib, 2014] Nawaz, O. and Habib, T. "Hidden Markov Model (HMM) based Speech Synthesis for Urdu Language", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)
• [Basit and Hussain, 2014] Basit, H. R. and Hussain, S. "Text Processing for Urdu TTS System", poster presentation in Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)
• [Habib et al., 2014] Habib, W., Basit, H. R., Hussain, S. and Adeeba, F. "Design of Speech Corpus for Open Domain Urdu Text to Speech System Using Greedy Algorithm", in the Proceedings of Conference on Language and Technology 2014 (CLT 14), Karachi, Pakistan. (URL: http://cs.dsu.edu.pk/clt14/)