Machine translation from English to Hindi

MAJOR PRESENTATION

Project Title: “ENGLISH TO HINDI MACHINE TRANSLATION”

Jaypee Institute of Information Technology, CSE Department, May 2014

Project Supervisor: Mr. K. Vimal Kumar

Submitted by: Garvita Sharma (10103467)

Rajat Jain (10103571)

PAPER COMMUNICATED TO International on artificial intelligence 2014 (“Word Order Based Machine Translation”)

NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.


INTRODUCTION TO NLP

Analyze, understand, and generate human languages just as humans do.

Apply computational techniques to the language domain.

Explain linguistic theories and use them to build systems that can be of social use.

Started off as a branch of Artificial Intelligence.

Borrows from Linguistics, Psycholinguistics, Cognitive Science, and Statistics.

Make computers learn our language rather than have us learn theirs.

NLP APPLICATIONS

Question answering: Who is the first Taiwanese president?

Text categorization/routing: e.g., customer e-mails.

Text mining: Find everything that interacts with BRCA1.

Machine (assisted) translation

Language teaching/learning

Usage checking

Spelling correction: Is that just dictionary lookup?

APPLICATIONS OF NLP

Machine Translation

Database Access

Information Retrieval: selecting from a set of documents the ones that are relevant to a query

Text Categorization: sorting text into fixed topic categories

Extracting data from text: converting unstructured text into structured data

Spoken language control systems

Spelling and grammar checkers

LEXICAL TRANSLATION PROBLEM

Even assuming monolingual disambiguation …

Style/register differences (e.g., domicile, merde, medical~anatomical~familiar)

Proper names (e.g., Addition Barrières)

Conceptual differences

Lexical gaps

MACHINE TRANSLATION APPROACHES

Grammar-based

Interlingua-based

Transfer-based

Direct

Example-based

Statistical

STATISTICAL MACHINE TRANSLATION

Statistical machine translation is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora.

RULE-BASED VS. STATISTICAL MT

Rule-based MT:

Hand-written transfer rules

Rules can be based on lexical or structural transfer

Pro: firm grip on complex translation phenomena

Con: often very labor-intensive -> lack of robustness

Statistical MT:

Mainly word- or phrase-based translations

Translations are learned from actual data

Pro: translations are learned automatically

Con: difficult to model complex translation phenomena

DOCUMENT VS SENTENCE

MT problem: generate high quality translations of documents

However, all current MT systems work only at sentence level!

Translation of independent sentences is a difficult problem that is worth solving

But remember that important discourse phenomena are ignored!

Example: How to translate the English "it" into French (choice of feminine vs. masculine "it") or German (feminine/masculine/neuter "it") if the object referred to is in another sentence?

COMPUTING TRANSLATION PROBABILITIES

Given a parallel corpus, we can estimate P(e | f). The maximum likelihood estimate of P(e | f) is: freq(e, f) / freq(f)

Way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts!

P(e | f) could be re-defined as:

$$P(e \mid f) = \prod_{f_j} \max_{e_i} P(e_i \mid f_j)$$

Problem: The English words maximizing P(e | f) might not result in a readable sentence.
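The relative-frequency estimate freq(e, f) / freq(f) and the word-by-word maximization described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes word-level alignments are already available, keeps the slide's notation (e is the English word, f the other-language word), and the function names and toy data are hypothetical.

```python
from collections import Counter


def estimate_word_probs(pairs):
    """MLE estimate P(e | f) = freq(e, f) / freq(f) from aligned (e, f) word pairs."""
    pair_counts = Counter(pairs)
    f_counts = Counter(f for _, f in pairs)
    return {(e, f): count / f_counts[f] for (e, f), count in pair_counts.items()}


def translate_word_by_word(f_words, probs):
    """Pick, for each f_j, the e_i maximizing P(e_i | f_j); the score is the product of the maxima."""
    e_words, score = [], 1.0
    for f in f_words:
        candidates = {e: p for (e, f_seen), p in probs.items() if f_seen == f}
        if not candidates:
            # Unseen word: zero counts, so the whole product collapses to 0.
            e_words.append(None)
            score = 0.0
            continue
        best = max(candidates, key=candidates.get)
        e_words.append(best)
        score *= candidates[best]
    return e_words, score


# Hypothetical toy alignments (English e, Hindi f), invented for illustration.
pairs = [("house", "घर"), ("house", "घर"), ("home", "घर"), ("water", "पानी")]
probs = estimate_word_probs(pairs)
words, score = translate_word_by_word(["घर", "पानी"], probs)
print(words, round(score, 3))
```

Each word is chosen independently of its neighbours, which is why the selected words need not form a readable sentence.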

PROBLEMS IN STATISTICAL TRANSLATION

Sentence alignment

Statistical anomalies

Data dilution

Idioms

Different word orders

Out of vocabulary (OOV) words

PROPOSED ALGORITHM

The algorithm that we follow is:

Calculation of the individual probabilities.

Calculation of probabilities according to the tagged words and their preceding words.

Combining the two probabilities.

Deriving the final probabilities.

Looking up a word that is unavailable in the corpus in the dictionary.

Adding the word and its corresponding meaning if it is not available in the dictionary either.

Restructuring of sentences.

Subject Verb Object (English) -> Subject Object Verb (Hindi)
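The probability and dictionary steps listed above might look roughly like the sketch below. The slides do not say how the two probabilities are combined, so the linear interpolation and its weight are assumptions, as are the function names, the table layouts, and every number in the toy data.

```python
def combine(p_word, p_context, weight=0.5):
    """Linear interpolation of the two probabilities (the weight is an assumption)."""
    return weight * p_word + (1.0 - weight) * p_context


def translate_word(word, tag, prev_word, word_probs, context_probs, dictionary):
    """Return the best Hindi candidate for an English word, or None if it is unknown."""
    candidates = word_probs.get(word)
    if candidates:
        context = context_probs.get((prev_word, tag, word), {})
        scores = {h: combine(p, context.get(h, 0.0)) for h, p in candidates.items()}
        return max(scores, key=scores.get)
    # Word absent from the corpus: fall back to the dictionary (None if missing there too).
    return dictionary.get(word)


# Hypothetical tables; all numbers are made up for illustration.
word_probs = {"bank": {"बैंक": 0.7, "किनारा": 0.3}}
context_probs = {("river", "NN", "bank"): {"बैंक": 0.1, "किनारा": 0.9}}
dictionary = {"book": "किताब"}

print(translate_word("bank", "NN", "river", word_probs, context_probs, dictionary))
print(translate_word("bank", "NN", "the", word_probs, context_probs, dictionary))
print(translate_word("book", "NN", "a", word_probs, context_probs, dictionary))

# Word missing everywhere: store a user-supplied meaning, then retry.
if translate_word("laptop", "NN", "a", word_probs, context_probs, dictionary) is None:
    dictionary["laptop"] = "लैपटॉप"
print(translate_word("laptop", "NN", "a", word_probs, context_probs, dictionary))
```

In this toy setup the preceding word "river" shifts the choice for "bank" away from the candidate with the highest individual probability, which is the effect the tagged-context probability is meant to have.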

OUTPUT

BLOCK DIAGRAM

GRAPHICAL USER INTERFACE

CONCLUSIONS

The project provides the following functionality:

Parallel translation according to the probabilities from the tagged corpus.

Calculation of probability according to the preceding word and the preceding word's tag.

Word-meaning retrieval from the attached dictionary when the input word is absent from the corpus.

Addition of a new word and its corresponding meaning when the word is absent from the dictionary as well.

FUTURE WORK

Future work can include the following functionality:

Sentence rearrangement according to the output language grammar.

Introducing tagging in the target language as well.

Calculation of preceding-word and tag probabilities for the target language in order to enhance accuracy and efficiency.
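Sentence rearrangement into the target-language order could start from something as simple as the sketch below. It assumes each translated chunk already carries a grammatical role label ("S", "O" or "V"); those labels, the function name, and the example sentence are hypothetical, and real sentences would need a parser and far richer rules.

```python
# Rank of each role in the target (Hindi) order: subject, object, verb.
SOV_RANK = {"S": 0, "O": 1, "V": 2}


def reorder_svo_to_sov(chunks):
    """Reorder (text, role) chunks from SVO to SOV; sorted() is stable, so ties keep their order."""
    return sorted(chunks, key=lambda chunk: SOV_RANK[chunk[1]])


# Already-translated chunks in English (SVO) order for "Ram eats a mango".
chunks = [("राम", "S"), ("खाता है", "V"), ("आम", "O")]
print(" ".join(text for text, _ in reorder_svo_to_sov(chunks)))
```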

