An Enhanced Model for Lexical Gap Processing in English-Vietnamese Machine Translation
Tuoi Phan Thi
Department of Computer Science and Computer Engineering, Ho Chi Minh City University of Technology
Ho Chi Minh City, Vietnam [email protected]
Hai Le Manh
Department of Information Technology, HUTECH University
Ho Chi Minh City, Vietnam
Abstract—Lexical gaps are a non-trivial problem in English-Vietnamese machine translation. A gap occurs when a word in the source language has no equivalent word in the target language. Recent studies have found that handling such gaps well, for example with a word-to-phrase translation model, can improve translation quality. However, there are still cases in which the word-to-phrase model cannot pick a correct phrase to build the target sentence. This study establishes a model, based on the word-to-phrase translation model, that improves translation quality by using a Vietnamese Lexical Functional Grammar (VLFG) rule set for Vietnamese noun phrases in the reconstruction algorithm.
Keywords: word-to-phrase translation, lexical gap, lexical functional grammar, English-Vietnamese machine translation
I. INTRODUCTION
A lexical gap is the lack of a word in the target language that is equivalent to a source word. For example, the English word “abeyant” has the equivalent Vietnamese phrase “tạm thời đình chỉ” in an English-Vietnamese dictionary (“tạm thời”: temporary; “đình chỉ”: suspend). In dictionary-based machine translation, lexical gaps cause structure mismatches and generate ungrammatical sentences. Hai et al. [1] developed a word-to-phrase model for lexical gap processing. The model works in three basic steps: (1) a bilingual dictionary uses Vietnamese Lexical Functional Grammar (VLFG) to keep target phrases in special structures; (2) the model replaces a lexical gap with an equivalent phrase structure from the dictionary; and (3) the model refines the target sentence structure using VLFG. However, the model successfully transferred only 82 of the 200 tested sentences [1]; many transferred sentences were rejected by the target grammar rule set. This paper enhances the word-to-phrase model proposed in [1] by limiting the Vietnamese phrase structures in the bilingual dictionary and improving the processing algorithms.
In this paper, Section II describes the word-to-phrase transfer model and its limitations. Section III presents new rules for VLFG, and Section IV improves the algorithms to enhance the word-to-phrase transfer model. Experimental results are presented in Section V, and conclusions are discussed in Section VI.
II. A WORD-TO-PHRASE TRANSFER MODEL AND ITS LIMITATION
The model proposed in [1] has three parts: a parser, a transfer module with a dictionary, and a generation module (Fig. 1).
Figure 1. A word-to-phrase transfer model ([1]).
In Fig. 1, the source text is analyzed by a parser. In this study, parsing is done with the open-source OpenNLP toolkit [3]. The transfer module maps source structures to target structures and corrects them using VLFG. The generation module then builds the target sentence from the corrected target structures and the lexicon.
Figure 2. Rejected sentence with grammar violation.
The model rejects most incorrect structures because of grammar constraints. Fig. 2 shows why a Vietnamese sentence was rejected by the model [2]: a violation occurs because a part of the c-structure (left part) does not match the f-structure (right matrix).
[Figure 2 content: c-structure of the sentence “anh đi rồi là phải” (words tagged PR V ADV KT ADJ under the nodes VP, IP, and S), shown beside its f-structure matrix with the attributes TOPIC, COMMENT, and RELATION.]
2012 International Conference on Asian Language Processing
978-0-7695-4886-9/12 $26.00 © 2012 IEEE
DOI 10.1109/IALP.2012.22
105
The enhanced model addresses this problem with two new components: (1) a new rule set for VLFG and (2) new algorithms for the generation module. The next section describes the new rule set for VLFG.
III. NEW RULE SET FOR VLFG
Vietnamese Lexical Functional Grammar was developed for the word-to-phrase machine translation model. The English-Vietnamese dictionary builds a complicated matrix for each lexical gap, as in Fig. 3.
Figure 3. An entry of the English-Vietnamese dictionary.
The new rule set for VLFG is applied together with a general phrase structure rule. This rule requires every kind of Vietnamese phrase to have a common structure. The most important of these is the noun phrase structure. Hieu et al. propose a common rule for Vietnamese noun phrases, shown in Table 1 [4].
TABLE 1. GENERAL STRUCTURES OF VIETNAMESE NOUN PHRASE COMPARED TO ENGLISH
In Table 1, the parts of the Vietnamese noun phrase are denoted X and those of the English noun phrase are denoted Y. This common structure is used in the dictionary as a constraint on functional components. Rule sets for Vietnamese verb phrases and adjective/adverb phrases have been studied but need more testing.
With the new rule set, the system accepts additional structures from the bilingual dictionary (Fig. 4).
Figure 4. New structure accepted
In Fig. 4, the components “will” and “be” map to an empty component in Vietnamese.
IV. NEW ALGORITHM FOR WORD-TO-PHRASE MACHINE TRANSLATION MODEL
Four algorithms are published in [1]. “Pruning a component” cuts a component out of a structure to create a simpler structure. “Replacing a component” moves components and replaces a simple component with a complicated structure. “Adding a component” inserts a component into a structure. “Modifying a structure” changes phrase structures into other VLFG structures. Fig. 5 describes the algorithm “Replacing a component”, which uses “Adding a component” and “Modifying a structure” to ensure that the Vietnamese sentence structure is correct after lexical gap processing.
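As a rough illustration of the two simplest operations, the sketch below prunes and adds components on a phrase structure tree. It is not the implementation from [1]: the nested-list tree representation, the function names `prune`, `find`, and `add_component`, and the sample clause are all assumptions made for this example, and the VLFG functional constraints that the real algorithms check are ignored.

```python
import copy

# Assumed representation: a tree is a nested list [label, child, ...];
# a leaf (a word) is a plain string.


def prune(tree, label):
    """Return a copy of tree without any subtree whose label equals label."""
    if isinstance(tree, str):
        return tree
    head, *children = tree
    kept = [prune(child, label) for child in children
            if isinstance(child, str) or child[0] != label]
    return [head, *kept]


def find(tree, label):
    """Return the first node (pre-order) carrying label, or None."""
    if isinstance(tree, str):
        return None
    if tree[0] == label:
        return tree
    for child in tree[1:]:
        hit = find(child, label)
        if hit is not None:
            return hit
    return None


def add_component(tree, parent_label, component, position):
    """Return a copy of tree with component inserted as child number
    position (0-based) of the first node labelled parent_label."""
    result = copy.deepcopy(tree)
    parent = find(result, parent_label)
    if parent is None:
        raise ValueError(f"no node labelled {parent_label!r}")
    parent.insert(1 + position, copy.deepcopy(component))
    return result


# Hypothetical input clause, chosen only for illustration.
clause = ["VP", ["MD", "will"], ["VB", "be"], ["ADJP", ["JJ", "abeyant"]]]
pruned = prune(clause, "MD")
extended = add_component(pruned, "VP", ["ADVP", ["RB", "now"]], 0)
```

Both functions return new trees instead of mutating their input, so a caller can try an operation and discard the result if a later grammar check rejects it.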
Figure 5. Diagram of the new algorithm “Component replacement”.
Table 2 gives one implementation of the algorithm “Component replacement”.
TABLE 2. ALGORITHM “COMPONENT REPLACEMENT”
The example below demonstrates lexical gap processing for the sentence “They water the buffaloes”. The word “water” has the appropriate Vietnamese phrase “cho … uống nước”, whose structure is shown in Fig. 6. The syntactic structure of the English sentence “They water the buffaloes” is (S (NP (PRP they)) (VP (VBP water) (NP (DT the) (NNS buffaloes)))). Because “water” is a lexical gap, the algorithm “Replacing a component” moves the noun phrase NP “the buffaloes” into a position in the phrase structure “cho uống nước” (water) and outputs the new structure: (S (NP (PRP they)) (VP (VBP cho (CNP (NP (DT the) (NNS buffaloes)) (VP (VBP uống) (NNS nước)))))).
Figure 6. Mapping between component structure and functional structure of the phrase “cho uống nước”.
This produces the grammatical Vietnamese sentence “Họ cho các con trâu uống nước” (Fig. 6).
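The “They water the buffaloes” example can be traced with a small sketch. This is not the algorithm of Table 2. It only assumes that a dictionary entry can be stored as a tree template with one open slot, and that the object NP directly following the gap word fills that slot. The names `SLOT`, `TEMPLATE_WATER`, `parse`, `fill_slot`, `replace_gap`, and `to_bracket` are hypothetical, and the nesting of the template follows the output structure quoted above.

```python
import copy

SLOT = "<OBJ>"  # assumed marker for the open position in a phrase template

# Assumed dictionary template for the lexical gap "water" ("cho ... uống nước").
TEMPLATE_WATER = ["VBP", "cho",
                  ["CNP", SLOT, ["VP", ["VBP", "uống"], ["NNS", "nước"]]]]


def parse(text):
    """Parse a bracketed tree string into nested lists [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return [label, *children]

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


def fill_slot(template, filler):
    """Return a copy of template with SLOT replaced by a copy of filler."""
    if template == SLOT:
        return copy.deepcopy(filler)
    if isinstance(template, str):
        return template
    return [fill_slot(part, filler) for part in template]


def replace_gap(tree, gap_word, template):
    """Replace the node over gap_word, together with the NP that directly
    follows it, by the template with that NP moved into the slot."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    out, i = [], 0
    while i < len(children):
        child = children[i]
        nxt = children[i + 1] if i + 1 < len(children) else None
        is_gap = isinstance(child, list) and child[1:] == [gap_word]
        if is_gap and isinstance(nxt, list) and nxt[0] == "NP":
            out.append(fill_slot(template, nxt))
            i += 2  # the object NP now lives inside the template
        else:
            out.append(replace_gap(child, gap_word, template))
            i += 1
    return [label, *out]


def to_bracket(tree):
    """Serialize nested lists back to bracket notation."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(to_bracket(part) for part in tree) + ")"


source = parse("(S (NP (PRP they)) (VP (VBP water) "
               "(NP (DT the) (NNS buffaloes))))")
target = replace_gap(source, "water", TEMPLATE_WATER)
print(to_bracket(target))
```

The remaining English leaves (“they”, “the”, “buffaloes”) would still need ordinary word-level transfer to reach the final Vietnamese sentence; that step is outside this sketch.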
V. EXPERIMENT RESULTS
A. Data selection
There are two sources of data: the Wordnet database [6] and Oxford English for Computing [7]. One hundred sentences were randomly chosen from each source. Two data sets are used in order to compare normal text with sentences from the information technology domain. The data were preprocessed before the parsing step; for example, every number must be written out in words, such as “two” instead of “2”.
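The number-spelling rule mentioned above can be sketched as follows. The paper does not describe its preprocessing code, so the function names `number_to_words` and `spell_out_numbers` are assumptions, and the sketch only covers whole numbers from 0 to 99; larger numbers are left unchanged.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer between 0 and 99 in English."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_out_numbers(sentence):
    """Replace each standalone run of digits (0..99) by its English words."""
    def repl(match):
        value = int(match.group())
        return number_to_words(value) if value <= 99 else match.group()
    return re.sub(r"\b\d+\b", repl, sentence)


print(spell_out_numbers("The farm has 2 buffaloes and 21 cows"))
```

Decimals, ordinals, and dates would need their own rules; the pattern here treats every run of digits as a separate whole number.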
B. English sentence parsing
English sentences are parsed with the open-source OpenNLP [3]. To check the validity of the feature structure of an English sentence, we created a unit for syntactic analysis. The mapping unit maps the English sentence structure to the appropriate Vietnamese sentence structure. Every English word is then transferred to a Vietnamese word, if one exists. Lexical gaps are passed to the lexical gap processing algorithm and converted to Vietnamese phrases.
C. Result evaluation
The evaluation method is simple: the output text is examined by a linguistic expert and marked “pass” or “not”. Table 3 shows the ratio of sentences containing a lexical gap.
TABLE 3. RATIO OF SENTENCES CONTAINING A LEXICAL GAP
Resource                           Sentences translated by Google MT   Sentences containing a lexical gap   Ratio
Wordnet [6]                        200                                 120                                  60%
Oxford English for Computing [7]   200                                 100                                  50%
In Table 3, of the 200 input sentences, 33 sentences from [6] and 27 sentences from [7] have wrong meanings in Vietnamese, caused by wrong meaning selection. Among the remaining sentences, 40 and 50 sentences, respectively, have functional structures that match the VLFG rules. Table 4 shows the results of translation and lexical gap processing.
From Table 4, two measures, recall and precision, have been calculated. Recall is taken over the sentences that remain after removing those with wrong meaning (100 − 33 = 67 and 100 − 27 = 73): it is 59.7% (40/67) and 68% (50/73) for the two sets of input data. Precision is 40% (40/100) and 50% (50/100), respectively. Of the 90 accepted output sentences, 11 sentences (27.5%) from [6] and 10 sentences (20%) from [7] needed their component structures rebuilt. These sentences required the structure modifying algorithm. Some complicated sentences used the component pruning algorithm.
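The reported percentages can be rechecked from the counts in Table 4 with a few lines of Python. The denominators are read off the fractions quoted in the text (recall over sentences without wrong meaning, precision over all sentences in a set); the class and field names are assumptions made for this check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    total: int            # sentences with a lexical gap in the set
    wrong_meaning: int    # rejected: wrong meaning selection
    wrong_structure: int  # rejected: structure violates VLFG rules
    accepted: int         # accepted output sentences
    rebuilt: int          # accepted sentences whose structure was rebuilt


def rates(outcome):
    """Recall, precision, and rebuilt share as defined by the quoted fractions."""
    meaning_ok = outcome.total - outcome.wrong_meaning
    return {
        "recall": outcome.accepted / meaning_ok,
        "precision": outcome.accepted / outcome.total,
        "rebuilt_share": outcome.rebuilt / outcome.accepted,
    }


wordnet = Outcome(total=100, wrong_meaning=33, wrong_structure=27,
                  accepted=40, rebuilt=11)
oxford = Outcome(total=100, wrong_meaning=27, wrong_structure=23,
                 accepted=50, rebuilt=10)

for name, outcome in (("Wordnet", wordnet), ("Oxford", oxford)):
    summary = {key: round(100 * value, 1)
               for key, value in rates(outcome).items()}
    print(name, summary)
```

In both sets the three outcome counts add up to the set size of 100, which is a quick consistency check on the table.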
TABLE 4. RESULTS OF TRANSLATION AND LEXICAL GAP PROCESSING
Resource                           Sentences with lexical gap       Number of sentences   Ratio (%)
Wordnet [6]                        Sentences with wrong meaning     33                    33
                                   Sentences with wrong structure   27                    27
                                   Sentences which are accepted     40                    40
                                   Total                            100                   100
Oxford English for Computing [7]   Sentences with wrong meaning     27                    27
                                   Sentences with wrong structure   23                    23
                                   Sentences which are accepted     50                    50
                                   Total                            100                   100
VI. CONCLUSIONS
This paper has presented a phrase transfer model, extended from rule-based machine translation, to handle lexical gaps in English-Vietnamese machine translation. A bilingual dictionary was built for English-Vietnamese lexical gap entries. Although the test data set is small and the evaluation is simple, some improvements have been observed.
Acknowledgements
This article is a part of our research on EVMT at Ho Chi Minh University of Technology, sponsored by Vietnam National University of Ho Chi Minh City in the key project B2010-20-03TĐ “A model processing lexical gap for English-Vietnamese translation”. We deeply thank Dr. Nguyen Chanh Thanh for his advice and code corrections.
References
[1] Le Manh Hai, Phan Thi Tuoi, “Lexical gap in English-Vietnamese machine translation: What to do?”, IALP 2010, 2010 International Conference on Asian Language Processing, Harbin, Heilongjiang, China, ISBN-13: 978-0-7695-4288-1, pp. 265-269.
[2] Le Manh Hai, “A solution for lexical gap in English-Vietnamese machine translation”, PhD thesis, Vietnam National University of Ho Chi Minh City, 2011 (in Vietnamese).
[3] OpenNLP (http://incubator.apache.org/opennlp/)
[4] Nguyen Chi Hieu, “A model exploiting characteristics of target language to define appropriate English-Vietnamese base noun phrases”, PhD thesis, HCM City University of Technology, Vietnam, 2008 (in Vietnamese).
[5] Le Manh Hai, Phan Thi Tuoi, Nguyen Chi Hieu, “Dictionaries for English-Vietnamese Machine Translation”, ICCPOL 2006, Singapore, 2006.
[6] George A. Miller, “Wordnet - a lexical database for the English language”, Communications of the ACM 38, November 1995.
[7] Keith Boeckner, P. Charles Brown, “Oxford English for Computing”, http://www.vinabook.com/oxford-english-for computing-tieng-anh-cho-vi-tinh-m11i4259.html