DESCRIPTION
Machine translation is a part of natural language processing. The suggested algorithm is word-based. We translate from English to Hindi. Submitted by Garvita Sharma (10103467, B3) and Rajat Jain (10103571, B6).
MAJOR PRESENTATION
Project Title: “ENGLISH TO HINDI MACHINE TRANSLATION”
Jaypee Institute of Information Technology, CSE Department, May 2014
Project Supervisor: Mr. K. Vimal Kumar
Submitted by: Garvita Sharma (10103467)
Rajat Jain (10103571)
PAPER COMMUNICATED TO International on artificial intelligence 2014 (“Word order based machine translation”)
NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
INTRODUCTION TO NLP
Analyze, understand and generate human languages just as humans do.
Apply computational techniques to the language domain.
Explain linguistic theories and use them to build systems that can be of social use.
Started off as a branch of Artificial Intelligence.
Borrows from Linguistics, Psycholinguistics, Cognitive Science and Statistics.
Make computers learn our language rather than have us learn theirs.
NLP APPLICATIONS
Question answering: Who is the first Taiwanese president?
Text categorization/routing: e.g., customer e-mails.
Text mining: Find everything that interacts with BRCA1.
Machine (assisted) translation
Language teaching/learning
Usage checking
Spelling correction: Is that just dictionary lookup?
APPLICATIONS OF NLP
Machine translation
Database access
Information retrieval: selecting from a set of documents the ones that are relevant to a query
Text categorization: sorting text into fixed topic categories
Extracting data from text: converting unstructured text into structured data
Spoken language control systems
Spelling and grammar checkers
LEXICAL TRANSLATION PROBLEM
Even assuming monolingual disambiguation …
Style/register differences (e.g. domicile, merde, medical~anatomical~familiar)
Proper names (e.g. Addition Barrières)
Conceptual differences
Lexical gaps
MACHINE TRANSLATION APPROACHES
Grammar-based
Interlingua-based
Transfer-based
Direct
Example-based
Statistical
STATISTICAL MACHINE TRANSLATION
Statistical machine translation is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora.
Rule-Based vs. Statistical MT
Rule-based MT:
Hand-written transfer rules
Rules can be based on lexical or structural transfer
Pro: firm grip on complex translation phenomena
Con: often very labor-intensive, leading to a lack of robustness
Statistical MT:
Mainly word- or phrase-based translations
Translations are learned from actual data
Pro: translations are learned automatically
Con: difficult to model complex translation phenomena
DOCUMENT VS SENTENCE
MT problem: generate high quality translations of documents
However, all current MT systems work only at sentence level!
Translation of independent sentences is a difficult problem that is worth solving
But remember that important discourse phenomena are ignored!
Example: How should English "it" be translated to French (feminine vs masculine) or German (feminine/masculine/neuter) if the object referred to is in another sentence?
COMPUTING TRANSLATION PROBABILITIES
Given a parallel corpus, we can estimate P(e | f). The maximum likelihood estimate of P(e | f) is:
$$P(e \mid f) = \frac{\mathrm{freq}(e, f)}{\mathrm{freq}(f)}$$
Taken over whole sentences, this is far too specific to yield reasonable frequencies: the vast majority of unseen data will have zero counts.
P(e | f) could be redefined as:
$$P(e \mid f) = \prod_{f_j} \max_{e_i} P(e_i \mid f_j)$$
Problem: the English words maximizing P(e | f) might not form a readable sentence.
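As a rough illustration, the maximum likelihood estimate freq(e, f)/freq(f) can be computed at word level and used to pick the most probable target word for each source word. This is a minimal sketch, not the project's code: the word-aligned pairs (with romanized Hindi), the function names `estimate_probabilities` and `translate_word_by_word`, and the pass-through of unseen words are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy word-aligned pairs (source word f, target word e).
# Invented for illustration; a real system would extract them from a parallel corpus.
aligned_pairs = [
    ("house", "ghar"), ("house", "ghar"), ("house", "makaan"),
    ("water", "paani"), ("water", "paani"), ("water", "jal"),
    ("big", "bada"),
]

def estimate_probabilities(pairs):
    """Maximum likelihood estimate: P(e | f) = freq(e, f) / freq(f)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(f for f, _ in pairs)
    table = defaultdict(dict)
    for (f, e), count in pair_counts.items():
        table[f][e] = count / source_counts[f]
    return table

def translate_word_by_word(sentence, table):
    """For each source word, pick the target word that maximizes P(e | f)."""
    output = []
    for f in sentence.split():
        candidates = table.get(f)
        if candidates:
            output.append(max(candidates, key=candidates.get))
        else:
            # Unseen word: it has zero counts, so pass it through unchanged.
            output.append(f)
    return output

table = estimate_probabilities(aligned_pairs)
print(table["house"])
print(translate_word_by_word("big house", table))
```

Because every word is chosen independently, the output keeps the source word order, which is one reason the maximizing words might not form a readable sentence.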
PROBLEMS IN STATISTICAL TRANSLATION
Sentence alignment
Statistical anomalies
Data dilution
Idioms
Different word orders
Out of vocabulary (OOV) words
PROPOSED ALGORITHM
The algorithm we follow is:
Calculate the individual word probabilities.
Calculate probabilities according to the tagged words and their preceding words.
Combine the two probabilities.
Derive the final probabilities.
Look up any unavailable word in the dictionary.
Add the word and its meaning if it is not available in the dictionary either.
Restructure the sentence:
Subject Verb Object (English) -> Subject Object Verb (Hindi)
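The word-selection steps above might be sketched as follows. The slides do not say how the two probabilities are combined, so the linear interpolation in `combine` is an assumption, as are the toy probability tables, the tag names, the romanized Hindi glosses, and the `translate_word` function itself.

```python
# Hypothetical toy tables; the slides do not give the real corpus or its values.
# P(target | source word), from individual word counts.
lexical_prob = {
    "light": {"roshni": 0.6, "halka": 0.4},
}
# P(target | source word, tag of the preceding word), from the tagged corpus.
context_prob = {
    ("light", "DET"): {"roshni": 0.9, "halka": 0.1},
    ("light", "ADV"): {"roshni": 0.2, "halka": 0.8},
}
# Fallback dictionary for words missing from the corpus.
dictionary = {"water": "paani"}

def combine(lexical, contextual, weight=0.5):
    """Linear interpolation of the two distributions (an assumed combination rule)."""
    words = set(lexical) | set(contextual)
    return {
        w: weight * lexical.get(w, 0.0) + (1 - weight) * contextual.get(w, 0.0)
        for w in words
    }

def translate_word(word, prev_tag, new_meaning=None):
    """Translate one word using corpus probabilities, then the dictionary."""
    lexical = lexical_prob.get(word, {})
    contextual = context_prob.get((word, prev_tag), {})
    if lexical or contextual:
        scores = combine(lexical, contextual)
        return max(scores, key=scores.get)
    if word in dictionary:
        return dictionary[word]
    if new_meaning is not None:
        # Word is in neither corpus nor dictionary: store the supplied meaning.
        dictionary[word] = new_meaning
        return new_meaning
    return word  # nothing known: leave the word untranslated

print(translate_word("light", "DET"))  # preceded by a determiner, e.g. "the light"
print(translate_word("light", "ADV"))  # preceded by an adverb, e.g. "very light"
print(translate_word("water", "DET"))  # found only in the dictionary
```

With these toy numbers, the tag of the preceding word is what separates the noun sense from the adjective sense of the same English word.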
OUTPUT
BLOCK DIAGRAM
GRAPHICAL USER INTERFACE
CONCLUSIONS
The project fulfils the following functionalities:
Parallel translation according to the probabilities from the tagged corpus.
Calculation of probability according to the preceding word and its tag.
Word meaning retrieval from the attached dictionary when the input word is absent from the corpus.
Facility to add a new word and its meaning when the word is absent from the dictionary as well.
FUTURE WORK
The future work can include the following functionalities:
Sentence rearrangement according to the output language grammar.
Introducing tagging in the target language as well.
Calculation of the preceding word and tag in the target language in order to enhance accuracy and efficiency.
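Sentence rearrangement from English Subject Verb Object order to Hindi Subject Object Verb order could look roughly like this. It is only a sketch under strong assumptions: a single clause, tokens already translated (romanized here) and already labeled with hypothetical role tags `S`, `V` and `O`, none of which come from the slides.

```python
# Target position of each role in a Hindi clause: subject, then object, then verb.
ROLE_ORDER = {"S": 0, "O": 1, "V": 2}

def reorder_svo_to_sov(tagged_tokens):
    """Stable-sort (word, role) pairs into subject, object, verb order."""
    ordered = sorted(tagged_tokens, key=lambda token: ROLE_ORDER[token[1]])
    return [word for word, _ in ordered]

# "Ram eats mango", translated word by word and still in English order.
tagged = [("Ram", "S"), ("khaata hai", "V"), ("aam", "O")]
print(" ".join(reorder_svo_to_sov(tagged)))
```

Python's `sorted` is stable, so multi-word subjects or objects keep their internal order.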