Assamese-English Statistical Machine Translation Using Moses
PRESENTED BY KALYANEE KANCHAN BARUAH AND PRANJAL DAS
CONTENTS INTRODUCTION LITERATURE REVIEW IMPLEMENTATION
TRANSLITERATION IN TRANSLATION EVALUATION CONCLUSION AND FUTURE
WORK REFERENCES
INTRODUCTION What is Natural Language Processing? Natural
Language Processing (NLP) is the ability of a computer program to
understand human speech as it is spoken. NLP automates the
translation between computers and humans.
WHAT IS MACHINE TRANSLATION Machine translation (MT) is
automated translation. It is the process by which computer software
is used to translate a text from one natural language (such as
Assamese) to another (such as English).
WHAT IS MACHINE TRANSLATION The ideal aim of machine
translation systems is to produce the best possible translation
without human assistance. Basically every machine translation
system requires programs for translation and automated dictionaries
and grammars to support translation.
ADVANTAGES OF MACHINE TRANSLATION Quick translation; low price;
confidentiality; online translation and translation of web page
content; overcomes technological barriers
PROBLEMS IN MACHINE TRANSLATION Translation is not
straightforward: word order, word sense, and idioms all cause difficulty.
TYPES OF MACHINE TRANSLATION BILINGUAL: MT systems that produce
translations between two particular languages. MULTILINGUAL: MT
systems that produce translations for any given pair of languages.
Multilingual systems are preferred to bidirectional and bilingual
ones because they can translate from any given language to any
other given language and vice versa.
SOME EXISTING MT SYSTEMS Google Translate; Systran; Bing
Translator; Babel Fish; Apertium
SOME MAJOR MT PROJECTS IN INDIA Anglabharti (and Anubharti);
Anusaaraka; MaTra; UCSG-based English-Kannada MT; Tamil-Hindi
Anusaaraka and English-Tamil MT; Anuvadak English-Hindi software;
Sampark
MACHINE TRANSLATION APPROACHES
STATISTICAL MACHINE TRANSLATION Enables us to build machine
translation systems automatically, using statistical models trained
on text data. Every sentence in a language has a possible
translation in another language.
STATISTICAL MACHINE TRANSLATION
LANGUAGE MODEL Gives the probability of a sentence. Uses an n-gram
model. IRSTLM is used to build the language model. The probability
of a sentence, P(S), is decomposed by the chain rule into the
conditional probabilities of its individual words:
$$P(S) = P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\,P(w_4 \mid w_1 w_2 w_3) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1})$$
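In practice an n-gram model shortens each history to the previous n-1 words. As a sketch of the bigram case (n = 2), which is the form the bigram probabilities in the example that follows rely on, and writing <s> for an assumed sentence-start symbol:

```latex
P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}), \qquad w_0 = \langle s \rangle
```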
LANGUAGE MODEL Suppose that, from a large corpus, we have the
following bigram probabilities:
Bigram          Prob.   Bigram          Prob.
Eat on          .16     Eat Thai        .03
Eat some        .06     Eat breakfast   .03
Eat lunch       .06     Eat in          .02
Eat dinner      .05     Eat Chinese     .02
Eat at          .04     Eat Mexican     .02
Eat a           .04     Eat tomorrow    .01
Eat Indian      .04     Eat dessert     .007
Eat today       .03     Eat British     .001
LANGUAGE MODEL
Bigram          Prob.   Bigram              Prob.
<s> I           .25     Want some           .04
<s> I'd         .06     Want Thai           .01
<s> Tell        .04     To eat              .26
<s> I'm         .02     To have             .14
I want          .32     To spend            .09
I would         .29     To be               .02
I don't         .08     British food        .60
I have          .04     British restaurant  .15
Want to         .65     British cuisine     .01
Want a          .05     British lunch       .01
LANGUAGE MODEL Then the probability of the sentence "I want to
eat British food" is P(I want to eat British food) = P(I | <s>)
P(want | I) P(to | want) P(eat | to) P(British | eat) P(food |
British) = .25*.32*.65*.26*.001*.60 = .0000081 (approximately)
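The product can be checked with a few lines of Python. This is a minimal sketch, not part of the Moses or IRSTLM toolchain: the dictionary holds only the six bigrams the sentence needs, taken from the tables above, `<s>` is an assumed sentence-start marker, and `sentence_probability` is a hypothetical helper name.

```python
# Bigram probabilities taken from the tables above; "<s>" marks sentence start.
BIGRAMS = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}


def sentence_probability(words, bigrams):
    """Multiply P(w_i | w_{i-1}) over the sentence, starting from <s>."""
    prob = 1.0
    prev = "<s>"
    for word in words:
        # An unseen bigram gets probability 0 here (no smoothing).
        prob *= bigrams.get((prev, word), 0.0)
        prev = word
    return prob


p = sentence_probability("I want to eat British food".split(), BIGRAMS)
print(f"{p:.7f}")  # prints 0.0000081
```

A real language model would smooth unseen bigrams rather than assign them zero; that step is left out to keep the arithmetic visible.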
TRANSLATION MODEL Computes the probability of the source sentence
S given a target sentence T, i.e. P(S|T). May be word-based or
phrase-based. The output of the TM is fed into the Moses decoder.
GIZA++, along with mkcls, is used to build the translation model.
TRANSLATION MODEL Example : Jaipur is a famous city of
Rajasthan
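DECODER Maximizes the probability of the translated text. A search
is performed for the target sentence T that maximizes P(T|S). By
Bayes' rule, and because P(S) is constant for a given input, this is
$$\hat{T} = \arg\max_T P(T \mid S) = \arg\max_T P(S \mid T)\,P(T)$$
Here the argmax is the decoding algorithm, P(S|T) is the translation
model, and P(T) is the language model.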
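The decoding rule can be illustrated over a tiny, fixed candidate list. This is only a sketch under strong assumptions: the scores below are invented for illustration, `decode` is a hypothetical function, and the real Moses decoder searches a very large space with beam search instead of enumerating candidates.

```python
def decode(source, candidates, translation_model, language_model):
    """Return the candidate T that maximises P(S|T) * P(T)."""
    def score(target):
        tm = translation_model.get((source, target), 0.0)  # P(S|T)
        lm = language_model.get(target, 0.0)               # P(T)
        return tm * lm
    return max(candidates, key=score)


# Hypothetical toy values, invented for illustration only.
source = "<Assamese sentence>"
candidates = [
    "Jaipur is a famous city of Rajasthan",
    "Jaipur famous city is Rajasthan of",
]
translation_model = {
    (source, candidates[0]): 0.30,
    (source, candidates[1]): 0.35,
}
language_model = {
    candidates[0]: 0.02,
    candidates[1]: 0.0001,
}

best = decode(source, candidates, translation_model, language_model)
print(best)
```

In this toy setting the second candidate has the higher translation-model score, but the language model penalises its word order, so the fluent candidate wins once the two scores are multiplied.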
ARCHITECTURE OF OUR SMT
HOW OUR SMT WORKS
IMPLEMENTATION Install all packages in Moses; install GIZA++;
install IRSTLM; training; tuning; generate output (decoding)
TRAINING THE MOSES DECODER Prepare data; run GIZA++; align words;
get lexical translation table; extract phrases; build lexicalized
reordering model; build generation models; create configuration
file.
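The "get lexical translation table" step can be pictured as a maximum-likelihood estimate over word-alignment links. The sketch below assumes the alignment has already been reduced to (source word, target word) pairs; `lexical_translation_table` is a hypothetical helper, and `f1` and `f2` are placeholder tokens rather than real Assamese words.

```python
from collections import Counter, defaultdict


def lexical_translation_table(aligned_pairs):
    """Estimate t(e | f) = count(f, e) / count(f) from aligned word pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(f for f, _ in aligned_pairs)
    table = defaultdict(dict)
    for (f, e), count in pair_counts.items():
        table[f][e] = count / source_counts[f]
    return dict(table)


# Hypothetical alignment links: (source word, target word).
aligned = [("f1", "city"), ("f1", "city"), ("f1", "town"), ("f2", "famous")]
table = lexical_translation_table(aligned)
print(table["f1"])  # "city" gets 2/3 of the mass, "town" gets 1/3
```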
PREPARING THE DATA Tokenising - inserting spaces between words
and punctuation. Truecasing - converting the first word of each
sentence to its most probable casing. Cleaning - removing empty
lines, redundant spaces, and lines that are too short or too long.
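Tokenising and cleaning can be sketched in Python as follows. This is an illustration of the two ideas, not the Moses scripts themselves: the punctuation set, the `min_len` and `max_len` limits, and both function names are assumptions, and truecasing is left out because it needs a model trained on the corpus.

```python
import re


def tokenise(line):
    """Insert spaces around punctuation, then collapse redundant spaces."""
    line = re.sub(r'([.,!?;:()"])', r" \1 ", line)
    return " ".join(line.split())


def clean(pairs, min_len=1, max_len=80):
    """Keep sentence pairs whose token counts both lie in [min_len, max_len]."""
    kept = []
    for src, tgt in pairs:
        src_tok, tgt_tok = tokenise(src), tokenise(tgt)
        n_src, n_tgt = len(src_tok.split()), len(tgt_tok.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src_tok, tgt_tok))
    return kept


print(tokenise("Jaipur, the Pink City."))  # Jaipur , the Pink City .
```

A pair is dropped if either side is empty or out of range, so the two files of the parallel corpus stay aligned line by line.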
EXAMPLE PARALLEL DATA Files: ass-eng1.as (Assamese) and
ass-eng1.en (English). English side:
The famous Bikaneri Bhujias and sweets are some of the best items to purchase in Bikaner.
Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India.
The Amber Palace is a classic example of Mughal and Hindu architecture.
Kanak Vrindavan is a popular picnic spot in Jaipur.
Jaipur is also famous for marble statues, blue pottery and the Rajasthani shoes
SAMPLE OUTPUTS Input Assamese Sentence / Output English sentence
Jaipur is a famous city of Rajasthan .
the Taj Mahal , is located in the heart of the Agra .
Jama Masjid built by Shahjahan .
Andhra Pradesh is one of the state of one of India .
Guwahati is connected by the capital of the State .
Agra is the one of the famous city
Delhi is the capital of India .
PROBLEMS WITH PROPER NOUNS Input Assamese Sentence / Output
English sentence
is a vast country .
.. from the city is located at a distance of of Rajasthan .
the capital of Goa , |
TRANSLITERATION IN TRANSLATION Transliteration: transcription
from one alphabet to another. Proper nouns that are not in our
corpus are not translated. For example, a sentence about a country
is translated as " is a vast country .", with the country name left
untranslated, because that name is not in our corpus.
TRANSLITERATION IN TRANSLATION Store each Assamese letter and
its English transliteration in a Perl script. For example: ক -> k,
খ -> kh, গ -> g. Pipe the Moses output through this Perl script
using the following command:
echo '<Assamese sentence>' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./transliterate.pl
Output: kanada is a vast country .
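The character-mapping idea can be sketched in Python. This is an analogous illustration, not the Perl script used in the project: only the k, kh and g entries come from the slide, the other three entries are assumptions added so the example runs, and the Assamese spelling used for the untranslated word is likewise assumed.

```python
# Character-level mapping. Only the first three entries (k, kh, g) come from
# the slide; the rest are illustrative assumptions.
TRANSLIT = {
    "\u0995": "k",   # Assamese letter ka
    "\u0996": "kh",  # Assamese letter kha
    "\u0997": "g",   # Assamese letter ga
    "\u09a8": "n",   # letter na (assumed entry)
    "\u09a1": "d",   # letter dda (assumed entry)
    "\u09be": "a",   # vowel sign aa (assumed entry)
}


def transliterate(text):
    """Replace every mapped character; leave all other characters unchanged."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


# Decoder output with one untranslated word (assumed spelling).
decoder_output = "\u0995\u09be\u09a8\u09be\u09a1\u09be is a vast country ."
print(transliterate(decoder_output))  # kanada is a vast country .
```

Because English characters are not in the mapping, the already translated part of the sentence passes through untouched, which is why the script can simply be placed after the decoder in the pipeline.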
IMPLEMENTING TRANSLITERATION INPUT ASSAMESE SENTENCE / OUTPUT
BEFORE TRANSLITERATION / OUTPUT AFTER TRANSLITERATION
Before: is a vast country .
After:  kanada is a vast country .
Before: .. from the city is located at a distance of of Rajasthan .
After:  multan from the city is located at a distance of 999 of Rajasthan .
Before: the capital of Goa , |
After:  the capital of Goa , panaji .
EVALUATION OF BLEU SCORE
Source/Target      BLEU score   1/2/3/4-gram precision
Assamese/English   7.02         30.5/8.5/4.1/2.3
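BLEU combines the four n-gram precisions through their geometric mean, scaled by a brevity penalty. The sketch below recomputes the score from the precisions in the table; the brevity penalty is not reported, so a value of 1.0 is an assumption, and `bleu_from_precisions` is a hypothetical helper rather than the evaluation script that was actually run.

```python
import math


def bleu_from_precisions(precisions_percent, brevity_penalty=1.0):
    """BLEU in percent: brevity penalty times the geometric mean of precisions."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_percent) / len(
        precisions_percent
    )
    return 100.0 * brevity_penalty * math.exp(log_mean)


score = bleu_from_precisions([30.5, 8.5, 4.1, 2.3])
print(round(score, 2))
```

With a brevity penalty of 1.0 this gives about 7.03, close to the reported 7.02; the small difference is consistent with rounding of the precisions or a brevity penalty slightly below 1.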
CONCLUSION AND FUTURE WORK SMT is a corpus-based approach to MT,
so it requires a parallel corpus before translation can be
undertaken. A parallel corpus of about 2500 Assamese and English
sentences was used to train the system. The SMT system developed
accepts Assamese sentences as input and generates the corresponding
translation in English. The results suggest that significant
improvements can be made by increasing the size of the parallel
corpus.
CONCLUSION AND FUTURE WORK In the future, we will try to
integrate transliteration into our system. We will try to increase
the volume of our corpus, so that we get a much better
translation system. We will also try to implement the translation
process without using the Moses toolkit.