Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach...

Preview:

Citation preview

Intelligent Key Prediction byN-grams and Error-correction Rules

Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong PotipitiInformation Research and Development Division,National Electronics and Computer Technology

Center(NECTEC), Thailand

Outline Introduction Related

Works The Approach Experiments Conclusions

Introduction

The Thai-English Keyboard System, Problems and Solutions

The Thai-English Keyboard System

174 characters. 83 in Thai. 52 in English. 39 in Symbol.

4 characters/1 Button. 2 languages. 2 sets of characters.

Use special keys to select languages and to distinguish a set of characters.

ฤ ฟ

EnglishThai

Shift

No shift

A

a

A

The Problems Two annoyances for Thais to type

Thai-English bilingual texts. To switch between languages by

using a language switching key. To distinguish a set of characters by

using a combination of shift-key and a character-key.

Other multilingual users face the same problems.

Related Works

Language Identification and Key Prediction

Language Identification Microsoft Word 2000.

Automatic language identification.

Considering only surrounding text.

Problems. No key prediction. Often detect incorrect language. Difficult to switch to the proper

language.

Key Prediction Zheng et al.

Automatic word prediction. Base on lexical tree. Word class n-grams are

employed. Problems.

No language identification. Required large amount of

memory space.

The Approach

Overview, Language Identification, Key Prediction and Pattern Shortening

The System Overview There are two main

processes in our system.

Automatic language identification.

Automatic Thai key prediction.

Language?

Key Input

Thai KeyPrediction

Output

Eng

Thai

(1)

(2)1

11)(

m

iiiE KKpEprob

1

11)(

m

iiiT KKpTprob

Language IdentificationLanguage Identification

Thai Key Prediction

)|(.),|(maxarg 2-1-1

..,2,1

iiiii

n

iccccKpcccp

n

(3)

Error-correction Rules Some character

sequences often generate errors.

Error patterns are collected for error-correction process.

Mutual information is used to optimize the patterns.

Training corpus

Tri-gram prediction

Error correction rules

Errors from Tri-gram prediction

Error-collection Rule Reduction

Pattern No.

Error Key Sequences

Correct Patterns

1. k[lkl9

2. mpklkl9

3. kkklkl9

4. lkl9

Pattern Shortening

)()(

)()(

zpxp

zxpzxLm

y

yy

)()(

)()(

zpxp

zxpzxRm

y

yy

(4)

(5)

The Error-correction Process

Finite Automata pattern matching is applied.

The longest pattern is selected.

Key Input

Finite Automata

Terminal?

No Yes

Correction

Experiments

Language Identification and Key Prediction

Language Identification Result

Bi-Gram is employed. Using artificial

corpus for testing. The first 6 characters

are enough.

m (number of first characters)

Identification Accuracy (%)

3 94.27

4 97.06

5 98.16

6 99.1

7 99.11

Key Prediction Corpus Information

25 MB training corpus.

5 MB test corpus

Training Corpus

(%)

Test Corpus

(%)

Unshift Alphabets

88.63 88.95

Shift Alphabets

11.37 11.05

Key Prediction Result

Prediction Accuracy from

Training Corpus (%)

Prediction Accuracy from

Test Corpus (%)

Tri-gram Prediction

93.11 92.21

Tri-gram Prediction +

Error Correction99.53 99.42

Conclusions

The Approach and Future Works

The Approach We have applied the tri-gram

model and error-correction rules for intelligent Thai key prediction and English-Thai language identification.

The experiment reports 99 percent in accuracy, which is very impressive.

Future Works Hopefully, this technique is

applicable to other Asian languages and multilingual systems.

Our future work is to apply the algorithm to mobile phones, handheld devices and multilingual input systems.

The End

Questions and Answers

Recommended