Assamese-English Statistical Machine Translation Using Moses PRESENTED BY KALYANEE KANCHAN BARUAH AND PRANJAL DAS

CONTENTS INTRODUCTION, LITERATURE REVIEW, IMPLEMENTATION, TRANSLITERATION IN TRANSLATION, EVALUATION, CONCLUSION AND FUTURE WORK, REFERENCES

INTRODUCTION What is Natural Language Processing? Natural Language Processing (NLP) is the ability of a computer program to understand human language as it is spoken or written. NLP automates the translation between computers and humans.

WHAT IS MACHINE TRANSLATION Machine translation (MT) is automated translation. It is the process by which computer software is used to translate a text from one natural language (such as Assamese) to another (such as English).

WHAT IS MACHINE TRANSLATION The ideal aim of machine translation systems is to produce the best possible translation without human assistance. Basically every machine translation system requires programs for translation and automated dictionaries and grammars to support translation.

ADVANTAGES OF MACHINE TRANSLATION Quick translation; low price; confidentiality; online translation and translation of web page content; overcomes technological barriers.

PROBLEMS IN MACHINE TRANSLATION Translation is not straightforward: word order, word sense, idioms.

TYPES OF MACHINE TRANSLATION BILINGUAL: MT systems that translate between two particular languages. MULTILINGUAL: MT systems that translate between any given pair of languages. Multilingual systems are preferred to bi-directional and bilingual ones because they can translate from any given language to any other given language and vice versa.

SOME EXISTING MT SYSTEMS Google Translate; Systran; Bing Translator; Babel Fish; Apertium.

SOME MAJOR MT PROJECTS IN INDIA Anglabharati (and Anubharati); Anusaaraka; MaTra; UCSG-based English-Kannada MT; Tamil-Hindi Anusaaraka and English-Tamil MT; Anuvadak English-Hindi software; Sampark.

MACHINE TRANSLATION APPROACHES

STATISTICAL MACHINE TRANSLATION Enables us to build machine translation systems automatically, using statistical models trained on text data. Every sentence in one language is treated as a possible translation of a sentence in another language.

STATISTICAL MACHINE TRANSLATION

LANGUAGE MODEL Gives the probability of a sentence. Uses an n-gram model. IRSTLM is used to build the language model. The probability of a sentence, P(S), is decomposed by the chain rule into the conditional probabilities of its words:
$$P(S) = P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\,P(w_4 \mid w_1 w_2 w_3) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1})$$
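As a rough illustration of how an n-gram model estimates the conditional probabilities above, the sketch below counts bigrams in a toy corpus and uses plain relative frequencies, P(w2 | w1) = count(w1 w2) / count(w1). It is not how IRSTLM works internally: real toolkits add smoothing, and the function name `train_bigram_model`, the `<s>` start marker and the three-sentence corpus are assumptions made for the example.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Return a function giving the unsmoothed estimate of P(word | previous)."""
    history_counts = Counter()   # how often each word appears as a history
    bigram_counts = Counter()    # how often each (previous, word) pair appears
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()   # "<s>" = sentence start (assumed marker)
        for previous, word in zip(tokens, tokens[1:]):
            history_counts[previous] += 1
            bigram_counts[(previous, word)] += 1

    def prob(word, previous):
        if history_counts[previous] == 0:
            return 0.0           # unseen history: no estimate without smoothing
        return bigram_counts[(previous, word)] / history_counts[previous]

    return prob

# Toy English corpus (illustrative only)
corpus = [
    "Jaipur is the capital of Rajasthan",
    "Delhi is the capital of India",
    "Jaipur is a famous city",
]
prob = train_bigram_model(corpus)
print(prob("the", "is"))       # "is the" occurs in 2 of the 3 occurrences of "is"
print(prob("capital", "the"))  # "the" is always followed by "capital" here
```

With so little data most bigrams get probability zero, which is why practical language models are trained on large corpora and smoothed.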

LANGUAGE MODEL Suppose that, from a large corpus, we have the following bigram probabilities:
Bigram | Probability
Eat on | .16
Eat some | .06
Eat lunch | .06
Eat dinner | .05
Eat at | .04
Eat a | .04
Eat Indian | .04
Eat today | .03
Eat Thai | .03
Eat breakfast | .03
Eat in | .02
Eat Chinese | .02
Eat Mexican | .02
Eat tomorrow | .01
Eat dessert | .007
Eat British | .001

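LANGUAGE MODEL (<s> marks the start of a sentence)
Bigram | Probability
<s> I | .25
<s> I'd | .06
<s> Tell | .04
<s> I'm | .02
I want | .32
I would | .29
I don't | .08
I have | .04
Want to | .65
Want a | .05
Want some | .04
Want Thai | .01
To eat | .26
To have | .14
To spend | .09
To be | .02
British food | .60
British restaurant | .15
British cuisine | .01
British lunch | .01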
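To show how such a table is used, the following sketch multiplies the bigram probabilities along one sentence. The six values are taken from the figures listed on these slides. Treating the single-word entry "I" as the sentence-initial bigram `<s> I` is an assumption, and `sentence_probability` is a hypothetical helper name.

```python
# Bigram probabilities taken from the slides; "<s>" (sentence start) is an assumed marker.
BIGRAM_PROB = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "british"): 0.001,
    ("british", "food"): 0.60,
}

def sentence_probability(sentence, table):
    """Multiply P(word | previous word) over the sentence; unseen bigrams count as 0."""
    tokens = ["<s>"] + sentence.lower().split()
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        probability *= table.get((previous, word), 0.0)
    return probability

p = sentence_probability("I want to eat British food", BIGRAM_PROB)
print(f"{p:.3e}")   # .25 * .32 * .65 * .26 * .001 * .60, roughly 8.1e-06
```

In practice the product is computed as a sum of log probabilities to avoid numerical underflow on long sentences.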

TRANSLATION MODEL Computes the probability of source sentence S, for a given target sentence T i.e. P(S|T). May be done word based or phrase based. Output of TM is fed into the Moses decoder. Giza++ along with mkcls is used to develop Translation Model.

TRANSLATION MODEL Example : Jaipur is a famous city of Rajasthan

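DECODER Maximizes the probability of the translated text. For a source sentence S, the decoder searches for the target sentence T that maximizes P(T|S); by Bayes' rule this is the T that maximizes P(S|T) P(T):
$$\hat{T} = \arg\max_{T} P(S \mid T)\,P(T)$$
Here the argmax is the DECODING ALGORITHM, P(S|T) is the TRANSLATION MODEL and P(T) is the LANGUAGE MODEL.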
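The argmax over P(S|T) P(T) can be sketched on a tiny, fixed list of candidate translations. This only illustrates the scoring rule: the candidate list and all probability values are hypothetical. The actual Moses decoder builds hypotheses incrementally with beam search and a weighted log-linear model rather than enumerating whole sentences.

```python
import math

def decode(candidates):
    """Return the candidate T with the highest log P(S|T) + log P(T)."""
    return max(
        candidates,
        key=lambda c: math.log(c["tm"]) + math.log(c["lm"]),
    )

# Hypothetical scores for a single source sentence S:
#   "tm" stands for P(S|T) from the translation model,
#   "lm" stands for P(T) from the language model.
candidates = [
    {"text": "Jaipur is a famous city of Rajasthan", "tm": 0.020, "lm": 0.0010},
    {"text": "Jaipur a famous city Rajasthan is", "tm": 0.030, "lm": 0.00001},
    {"text": "Jaipur is famous", "tm": 0.002, "lm": 0.0020},
]

best = decode(candidates)
print(best["text"])
```

The second candidate has the highest translation-model score but a very low language-model score, so the fluent first candidate wins once both factors are combined.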

ARCHITECTURE OF OUR SMT

HOW OUR SMT WORKS

IMPLEMENTATION
1. Install all packages in Moses
2. Install Giza++
3. Install IRSTLM
4. Training
5. Tuning
6. Generate output (decoding)

TRAINING THE MOSES DECODER
1. Prepare data
2. Run Giza++
3. Align words
4. Get lexical translation table
5. Extract phrases
6. Build lexicalized reordering model
7. Build generation models
8. Create configuration file

PREPARING THE DATA Tokenising - inserting spaces between words and punctuation. Truecasing - setting the case of the first word in each sentence. Cleaning - removing empty lines, redundant spaces, and lines that are too short or too long.
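The three preparation steps can be approximated in a few lines. This is a simplified sketch and not the Moses preprocessing scripts: the punctuation set, the truecasing heuristic (most frequent casing seen in non-initial position) and the length limits of 1 to 80 tokens are all assumptions chosen for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenise(line):
    """Insert spaces between words and punctuation (simplified punctuation set)."""
    line = re.sub(r'([.,!?;:()"।])', r" \1 ", line)
    return " ".join(line.split())

def train_truecaser(tokenised_lines):
    """Learn the most frequent casing of each word from non-initial positions."""
    counts = defaultdict(Counter)
    for line in tokenised_lines:
        for token in line.split()[1:]:          # skip the sentence-initial token
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(line, model):
    """Set the case of the first word to its learned form, if one is known."""
    tokens = line.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)

def clean(pairs, min_len=1, max_len=80):
    """Drop sentence pairs where either side is empty, too short or too long."""
    kept = []
    for src, tgt in pairs:
        src, tgt = " ".join(src.split()), " ".join(tgt.split())   # squeeze spaces
        if all(min_len <= len(s.split()) <= max_len for s in (src, tgt)):
            kept.append((src, tgt))
    return kept

raw = [
    "The famous Bikaneri Bhujias and sweets are some of the best items to purchase in Bikaner.",
    "The Amber Palace is a classic example of Mughal and Hindu architecture.",
]
tokenised = [tokenise(s) for s in raw]
model = train_truecaser(tokenised)
print(truecase(tokenised[1], model))
```

Cleaning works on sentence pairs rather than single lines, so that dropping a bad line on one side never puts the two halves of the parallel corpus out of step.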

EXAMPLE PARALLEL DATA Files: ass-eng1.as (Assamese) and ass-eng1.en (English). English side:
The famous Bikaneri Bhujias and sweets are some of the best items to purchase in Bikaner.
Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India.
The Amber Palace is a classic example of Mughal and Hindu architecture.
Kanak Vrindavan is a popular picnic spot in Jaipur.
Jaipur is also famous for marble statues, blue pottery and the Rajasthani shoes

SAMPLE OUTPUTS Input Assamese sentence | Output English sentence [the Assamese input column is missing]
Jaipur is a famous city of Rajasthan .
the Taj Mahal , is located in the heart of the Agra .
Jama Masjid built by Shahjahan .
Andhra Pradesh is one of the state of one of India .
Guwahati is connected by the capital of the State .
Agra is the one of the famous city
Delhi is the capital of India .

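PROBLEMS WITH PROPER NOUNS Input Assamese sentence | Output English sentence [the Assamese input column is missing]
[untranslated Assamese word] is a vast country .
[untranslated Assamese word] from the city is located at a distance of [untranslated Assamese word] of Rajasthan .
the capital of Goa , [untranslated Assamese word] |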

TRANSLITERATION IN TRANSLATION Transliteration: transcription from one alphabet to another. Proper nouns that do not occur in our corpus are not translated. For example, translating [Assamese sentence] gives "[Assamese word] is a vast country .", because [Assamese word] is not in our corpus.

TRANSLITERATION IN TRANSLATION Each letter of the Assamese alphabet is stored with its English transliteration in a Perl script. For example: [Assamese letter] -> k, [Assamese letter] -> kh, [Assamese letter] -> g. The script is run on the output of Moses with the following command:
echo '[Assamese sentence]' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./transliterate.pl
Output: kanada is a vast country .
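The post-processing idea can be sketched as follows: tokens that the decoder leaves in Assamese script are mapped letter by letter to Latin characters, and already-translated tokens pass through unchanged. This is written in Python for illustration and is not the authors' Perl script. The six table entries and the test word are assumptions chosen so that the example reproduces the "kanada" output above, and a usable table would have to cover the full alphabet, vowel signs and conjuncts.

```python
# Partial letter table (assumed romanisation, illustrative only).
TRANSLIT = {
    "ক": "k",    # U+0995
    "খ": "kh",   # U+0996
    "গ": "g",    # U+0997
    "ন": "n",    # U+09A8
    "ড": "d",    # U+09A1
    "া": "a",    # U+09BE, vowel sign aa
}

def is_assamese(token):
    """True if the token contains a character from the Bengali-Assamese Unicode block."""
    return any("\u0980" <= ch <= "\u09FF" for ch in token)

def transliterate_token(token):
    """Map each known letter to its Latin form; unknown characters are kept as they are."""
    return "".join(TRANSLIT.get(ch, ch) for ch in token)

def transliterate_unknowns(decoder_output):
    """Transliterate only the tokens the decoder left untranslated."""
    return " ".join(
        transliterate_token(tok) if is_assamese(tok) else tok
        for tok in decoder_output.split()
    )

print(transliterate_unknowns("কানাডা is a vast country ."))
```

Restricting the mapping to tokens in Assamese script means the step can be piped after the decoder without touching the English words it has already produced.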

IMPLEMENTING TRANSLITERATION Output before transliteration | Output after transliteration [the Assamese input column is missing]
[Assamese word] is a vast country . | kanada is a vast country .
[Assamese word] from the city is located at a distance of [Assamese word] of Rajasthan . | multan from the city is located at a distance of 999 of Rajasthan .
the capital of Goa , [Assamese word] | | the capital of Goa , panaji .

EVALUATION OF BLEU SCORE
Source/Target | BLEU score | 1/2/3/4-gram precision
Assamese/English | 7.02 | 30.5/8.5/4.1/2.3
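The reported score can be related to the four n-gram precisions: BLEU is the brevity penalty times the geometric mean of the modified n-gram precisions. The sketch below recomputes that geometric mean from the reported figures. The brevity penalty is not given on the slide, so a value of 1.0 is assumed, and `bleu_from_precisions` is a hypothetical helper name.

```python
import math

def bleu_from_precisions(precisions_percent, brevity_penalty=1.0):
    """BLEU on a 0-100 scale: BP times the geometric mean of the n-gram precisions."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_percent) / len(precisions_percent)
    return 100.0 * brevity_penalty * math.exp(log_mean)

# 1/2/3/4-gram precisions as reported; brevity penalty assumed to be 1.0
score = bleu_from_precisions([30.5, 8.5, 4.1, 2.3])
print(round(score, 2))   # about 7.03
```

The result, about 7.03, is close to the reported 7.02. The small gap is consistent with rounding of the precisions or a brevity penalty slightly below 1.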

CONCLUSION AND FUTURE WORK SMT is a corpus-based approach to MT and requires a parallel corpus before translation can be undertaken. A parallel corpus of about 2500 Assamese and English sentences was used to train the system. The SMT system developed accepts Assamese sentences as input and generates the corresponding translation in English. The results show that significant improvements can be made by increasing the size of the parallel corpus.

CONCLUSION AND FUTURE WORK In the future, we will try to integrate transliteration into our system. We will try to increase the size of our corpus so that we get a much better translation system. We will also try to implement the translation process without using the Moses toolkit.

REFERENCES
Machine Translation, [Online]. Available: http://en.wikipedia.org/wiki/Machine_translation
Statistical Machine Translation, [Online]. Available: http://en.wikipedia.org/wiki/Statistical_machine_translation
Problems in Machine Translation system, [Online]. Available: http://languagedirect.org/machine-translation/
Machine Translation, [Online]. Available: http://faculty.ksu.edu.sa/homiedan/Publications/Machine%20Translation.pdf
D. D. Rao, Machine Translation: A Gentle Introduction, Resonance, July 1998.
S. K. Dwivedi and P. P. Sukhadeve, Machine Translation System in Indian Perspectives, Journal of Computer Science, vol. 6, no. 10, pp. 1082-1087, May 2010.

REFERENCES
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer, The mathematics of statistical machine translation: parameter estimation, Computational Linguistics, vol. 19, no. 2, June 1993.
Natural Language Processing, [Online]. Available: http://www.techopedia.com/definition/653/natural-language-processing-nlp

THANK YOU