IEEE Proceedings of 4th International Conference on Intelligent Human Computer Interaction, Kharagpur, India, December 27-29, 2012

Cost of Error Correction Quantification with Bengali Text Transcription

Soumalya Ghosh1∗, Debasis Samanta2∗ and Monalisa Sarma3†
∗Indian Institute of Technology Kharagpur, Kharagpur, India - 721302
†Indian Institute of Technology Indore, Indore, India - 452017
Email: [email protected], [email protected], [email protected]

Abstract—Text-based interaction in the user's mother language has become one of the most basic and widely used interaction techniques with the enormous growth of digital communication applications such as e-mailing, messaging, chatting, and blogging. The efficiency of a text entry technology depends mainly on text entry rate and error rate. The number of errors occurring during typing is generally measured by the Levenshtein minimum string distance (MSD) statistic. It counts the number of errors present in the transcribed text with respect to the presented text and quantifies the number of single edit operations required to transform the transcribed text into the presented text, reckoning one single edit operation per error. However, it is not applicable to quantifying error correction operations in Indian languages because of language-related features such as complex ('juktakkhor') and matra characters: a single error in a complex character may require multiple single edit operations to correct. Errors in complex characters essentially limit the practical usefulness of the MSD quantification method in the Indian language scenario. In this paper, we quantify the minimum number of single edit primitives required to correct the errors in Indian-language transcribed text typed with any single-stroke or single-tap text entry tool. To accomplish our objective, we first identify the mismatched character (error) positions in both the transcribed and the presented text by employing the longest common subsequence (LCS) algorithm. Then, we propose an algorithm to identify whether an error lies in a simple character or in a complex character. After that, we propose another algorithm to calculate the minimum number of operations needed to transform the transcribed text into the presented text, depending upon the error positions and type (error in a simple or a complex character). Lastly, we define the correction cost per error (CCPE) metric to calculate the average correction cost for an erroneous transcribed text.

Index Terms—Minimum string distance, keystrokes per character, longest common subsequence, human computer interaction.

I. INTRODUCTION

Text entry mechanisms have become ubiquitous in the area of Human Computer Interaction. Text entry technologies are designed and evaluated through empirical evaluation with the intervention of users. The design of efficient text entry mechanisms focuses mainly on two parameters: speed, and accuracy or error. Speed is generally measured in words per minute (WPM). Errors in the transcribed text (typed text) with respect to the presented text (text given to type) are usually measured by the MSD or keystrokes per character (KSPC) statistic.

The major shortcoming of the KSPC statistic is that it does not separately identify the cost of committing errors and the cost of fixing them. A large KSPC value indicates either that many errors were committed and correction took few keystrokes, or that few errors were committed but correcting them required many keystrokes. A small KSPC value indicates that few errors were committed while typing; however, if many errors were committed and remained uncorrected, the KSPC value will also be small. Moreover, the value of KSPC is highly dependent on the text entry mechanism under consideration. For example, even error-free typing gives different KSPC values for a single-tap (KSPC = 1) and a multi-tap (KSPC > 1) keyboard. So a comparative study of error analysis using the KSPC metric on text entered through different text entry technologies cannot be meaningful.

On the other hand, the MSD statistic calculates the number of single editing primitives required to convert the transcribed text into the presented text. The primitives are the single edit operations of any very general text entry technology: insertion, deletion, and substitution of a single character. The major shortcoming of the MSD statistic is that it only compares the presented and transcribed text and does not work on the user input stream; hence it accounts only for uncorrected errors, not for corrected errors. Moreover, it works only when a single edit operation suffices to correct a single error; MSD does not handle the case where multiple operations are required to correct one error.
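The MSD statistic described above is the classic Levenshtein distance; a minimal illustrative sketch in Python (not the authors' code) is:

```python
def msd(presented: str, transcribed: str) -> int:
    """Levenshtein minimum string distance: the number of single-character
    insertions, deletions, and substitutions needed to turn the
    transcribed text into the presented text."""
    m, n = len(presented), len(transcribed)
    # d[i][j] = distance between presented[:i] and transcribed[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if presented[i - 1] == transcribed[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

Because each cell adds at most one for each compared character, MSD charges exactly one operation per error, which is precisely the assumption that fails for complex characters.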

So far, typing error analysis on transcribed text with reference to presented text in English has been quantified through the simple MSD statistic [1], [2]. In contrast, the structures of Indian languages are completely different from English, containing simple, complex, and matra characters. An error in a simple character can be measured using MSD. But a complex character is made by combining multiple characters, and an error in one such character may require multiple edit operations to fix. In this case, fixing a single error requires additional edit operations on non-erroneous characters as well. Here MSD can count the number of errors but is unable to measure the total number of operations needed to fix them. So there is scope for work on Indian languages to quantify the minimum number of edit primitives required to transform transcribed text into presented text.

Bengali, being the national language of Bangladesh and widely spoken in eastern India (West Bengal, Tripura, Assam, etc.), is the world's seventh most spoken language1. In this paper, we target quantifying the minimum number of edit primitives required to correct the errors of Bengali transcribed text. For this purpose, we first identify the mismatched character (error) positions in the transcribed text with reference to the presented text by employing the LCS algorithm. Then, we propose an algorithm to identify whether an error is present in a simple or in a complex character and, finally, propose another algorithm to calculate the minimum number of operations required to convert the transcribed text into the presented text depending upon the error positions in simple and complex characters.

1http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers

978-1-4673-4369-5/12/$31.00 © 2012 IEEE

This paper is organized as follows. Details of existing related work are presented in Section II. In Section III, we describe Bengali text composition using any single-tap or single-stroke text entry tool. Finding matched and mismatched positions between presented and transcribed text is elaborated in Section IV. Section V quantifies the minimum number of edit primitives required to fix the errors. Error metrics are discussed in Section VI. Finally, Section VII concludes the paper with future directions.

II. STATE OF THE ART

Several works have been proposed for the measurement and classification of user typing errors. MacKenzie and Soukoreff [1] describe a technique to analyze character-level errors in evaluations of text entry methods using MSD. They then [2] extended this work, using both MSD and KSPC to calculate the number of errors in typed text. Lastly, they identified the shortcomings of both MSD and KSPC and proposed new statistics including the total error rate, the corrected error rate (errors committed but corrected), and the not-corrected error rate (errors left in the transcribed text). All of their work is for English text, not for Bengali text.

Gong and Tarasewich [3] define a new error metric for text entry methods, CPC (Cost per Correction), for correcting errors. CPC is calculated using new keystroke classes of their own definition, which are quite similar to those of Soukoreff and MacKenzie [2]. Moreover, they assume that error correction happens only through the backspace key, although this assumption must change as text entry methods change.

Wobbrock and Myers [4] presented an algorithm to perform automatic analysis of character-level errors in the input streams of text entry evaluations in an unconstrained way. The work also elaborates new character-level metrics that can support text entry methods. They performed two analyses employing the metrics on data from an actual text entry experiment: one analysis uses only the presented and transcribed strings, while the other uses the input streams. The results reveal that input stream error analysis yields more information from the same empirical data.

Arif and Stuerzlinger [5] discuss the most common performance metrics employed for text entry. They collect data from well-designed and well-reported experiments for the most important text entry methods, including those for handheld devices, to measure text entry rate (WPM) and error rate.

Arif and Stuerzlinger [6] analyze real-life text entry error correction behavior. They then proposed a model to predict error correction cost, considering human factors and various system parameters for character-based text entry mechanisms; essentially, they predict the error correction time for character-based text entry technologies.

All of the above-mentioned works mainly define error rates and calculate them for English text, but those principles are not directly applicable to Bengali text. In this work, we address this problem.

III. BENGALI TEXT COMPOSITION

Bengali characters, composed with any single-tap text entry tool, can be classified into two broad categories according to character formation: simple and complex characters. A simple character is formed by just one single character. A complex character, on the other hand, is formed by combining multiple simple characters; this combination is done with a special character called halant. To aid understanding, Table I gives an example of a Bengali word containing both a simple and a complex character, composed with a single-tap virtual keyboard.

TABLE I
BENGALI CHARACTER CLASSIFICATION

Word:              rÔ
Simple character:  r
Complex character: Ô
Characters typed:  rt n

The simple characters belonging to a complex character are not separately accessible or editable in standard text editors (Microsoft Word, Notepad, etc.). We can place the cursor at the beginning or end of the complex character, but the cursor cannot directly locate an individual simple character's position inside a complex character. As a consequence, multiple operations are needed to edit a simple character within a complex character. This scenario is illustrated pictorially in Table II. The table shows just one single error occurring in a complex character in the transcribed text; yet all the simple characters of the complex character must be substituted to correct the error (substitutions are represented by '−'). So, a minimum of three substitution operations is required to fix the single error. Moreover, Bengali contains matra characters (a, i, u, etc.) which can also be associated with simple and complex characters; an error on a matra character may likewise require more than one edit operation.

TABLE II
MULTIPLE EDIT OPERATIONS FOR ERROR IN COMPLEX CHARACTER

Presented text:   rÔ - rt n
Transcribed text: rpj - rp n
Corrected text:   rÔ - rp n −−− t n


IV. FINDING MISMATCHED AND MATCHED POSITIONS IN PRESENTED AND TRANSCRIBED TEXT

Before quantifying the edit primitives, we have to find the mismatched and matched character positions in both the transcribed and presented text. Errors are present only at the mismatched positions; the matched positions are error free. Here, we use the longest common subsequence (LCS) algorithm [7] to find the longest common subsequence of the transcribed and presented text. Let two sequences be defined as follows: X = (x1, x2, ..., xm) and Y = (y1, y2, ..., yn). The prefixes of X are X1, X2, ..., Xm; the prefixes of Y are Y1, Y2, ..., Yn. Let c(Xi, Yj) represent the length of a longest common subsequence of the prefixes Xi and Yj. It is given by the recurrence in Eqn. 1.

c(Xi, Yj) = 0                                  if i = 0 or j = 0,
          = c(Xi−1, Yj−1) + 1                  if i, j > 0 and xi = yj,
          = max(c(Xi, Yj−1), c(Xi−1, Yj))      if i, j > 0 and xi ≠ yj.    (1)

Based on Eqn. 1, Algorithm 1 [7] computes the length of an LCS of the transcribed and presented text. It takes two strings X = (x1, x2, ..., xm) and Y = (y1, y2, ..., yn) as input. It stores the c[i, j] values in a table c[0..m, 0..n] whose entries are computed in row-major order. It also maintains a table b[1..m, 1..n] to simplify the construction of an optimal solution: intuitively, b[i, j] points to the table entry corresponding to the optimal subproblem solution chosen when computing c[i, j]. The procedure returns the b and c tables; c[m, n] contains the length of an LCS of X and Y. From these tables we find the matched and mismatched positions, with which we then calculate the number of operations needed. Characters of the two strings are matched only if b[i, j] = 'Arrow-corner'. The matched and mismatched positions of both presented and transcribed text are computed by Algorithm 2.

Algorithm-1 Longest common subsequence [7]
LCS-LENGTH(X, Y)
  m = length[X]
  n = length[Y]
  for i = 1 to m do
    c[i, 0] = 0
  end
  for j = 1 to n do
    c[0, j] = 0
  end
  for i = 1 to m do
    for j = 1 to n do
      if xi == yj then
        c[i, j] = c[i-1, j-1] + 1
        b[i, j] = Arrow-corner
      else if c[i-1, j] >= c[i, j-1] then
        c[i, j] = c[i-1, j]
        b[i, j] = Arrow-up
      else
        c[i, j] = c[i, j-1]
        b[i, j] = Arrow-left
      endif
    end
  end
  return c and b
Stop

Algorithm-2 Matched and mismatched positions in the longest common subsequence
MATCHED-MISMATCHED-LCS(b, X, Y, i, j)
  if i = 0 or j = 0 then
    return
  endif
  if b[i, j] == Arrow-corner then
    Xi = matched and Yj = matched
    MATCHED-MISMATCHED-LCS(b, X, Y, i-1, j-1)
  else if b[i, j] == Arrow-up then
    Xi = mismatched
    MATCHED-MISMATCHED-LCS(b, X, Y, i-1, j)
  else
    Yj = mismatched
    MATCHED-MISMATCHED-LCS(b, X, Y, i, j-1)
  endif
Stop

Algorithms 1 and 2 are applied to both presented and

transcribed text to find the longest common subsequence and the matched and mismatched positions between them. Consider the example shown in Fig. 1: Algorithm 1 finds the LCS and Algorithm 2 identifies the matched and mismatched positions. Matched positions are shown by gray blocks and mismatched positions by blank (uncolored) blocks on both the x and y strings in the matrix in Fig. 1.

Fig. 1. LCS on presented (x) and transcribed (y) strings, with matched (gray blocks in x, y) and mismatched (uncolored blocks in x, y) positions between them
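Algorithms 1 and 2 can be sketched together in a few lines of Python (an illustrative rendering, not the authors' implementation; the direction table b is implicit in the backtracking comparisons):

```python
def lcs_match_positions(x: str, y: str):
    """Build the LCS length table (Algorithm 1) and backtrack (Algorithm 2),
    returning the sets of matched (error-free) 0-based positions in x and y.
    Positions absent from the sets are the mismatched (error) positions."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1          # Arrow-corner
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    matched_x, matched_y = set(), set()
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            matched_x.add(i - 1)   # character belongs to the LCS
            matched_y.add(j - 1)
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1                 # x[i-1] is mismatched (Arrow-up)
        else:
            j -= 1                 # y[j-1] is mismatched (Arrow-left)
    return matched_x, matched_y
```

For example, comparing a presented string "abcde" with a transcribed string "abde" marks 'c' as the only mismatched character in the presented string.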

V. EDIT PRIMITIVE QUANTIFICATION

In this section, we discuss the quantification of the minimum edit primitives required to correct the errors of transcribed Bengali text under different conditions. We first calculate the edit primitives when both the presented and transcribed text contain only simple characters, and then when both strings contain simple and complex characters. Details are given with examples in the following subsections.

A. Containing only simple characters

In this subsection, we consider presented and transcribed text (strings) that contain only simple characters. Before proceeding to quantify the minimum number of edit operations, we have to check whether both strings indeed contain only simple characters. If a string contains any complex character, then it must contain at least one halant character. The procedure for checking whether a string contains any complex character or only simple characters is described in Algorithm 3.

Algorithm-3 Simple character string identification
String-Identification(x)
  m = length[x]
  boolean Flag = 0
  for i = 1 to m do
    if x[i] == 'halant' then
      Flag = 1
      Break
    endif
  end
  if Flag == 0 then
    PRINT "All characters are simple"
  else
    PRINT "The string has complex character"
  endif
Stop

We then compute the minimum number of edit operations required to convert the transcribed text into the presented text when both contain only simple characters. To accomplish this, we first calculate the lengths of, and the length difference
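In Python, the halant scan of Algorithm 3 reduces to a membership test; a minimal sketch (the function name is ours; U+09CD is the Bengali halant/virama code point):

```python
HALANT = "\u09cd"  # Bengali sign virama (halant)

def has_complex_character(s: str) -> bool:
    """A string contains a complex character iff it contains a halant,
    mirroring the check in Algorithm 3."""
    return HALANT in s
```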

between, the two strings. Then we find the matched and mismatched sequences for each character position in both strings. First, if both strings are mismatched at the same character position (index) and neither the corresponding character nor the next character in the two strings is a matra character, then one edit operation is required to fix that character of the transcribed string; otherwise, two edit operations are required. Similarly, if one string is mismatched and the other is matched at the same character position, then one operation is required to fix it. Lastly, a minimum of one edit primitive is necessary if one string is mismatched where the other is already out of index. If the positions in both the presented and transcribed strings lie in a matched sequence, then no operation is needed, because the characters and their positions are already correct. Algorithm 4 illustrates this counting of the minimum number of required edit primitives.

Algorithm-4 Minimum number of operations when both strings contain only simple characters
Minimum-Operation(P, T)
  m = length[P]
  n = length[T]
  No_of_Operation = 0
  diff = mod(m - n)
  s = smaller(m, n)
  l = larger(m, n)
  boolean Flag = 0
  for i = 1 to s do
    if P[i] and T[i] == mismatched then
      if P[i] and T[i] == 'matra' then
        No_of_Operation = No_of_Operation + 2
      else if P[i+1] and T[i+1] == 'matra' then
        No_of_Operation = No_of_Operation + 2
        i = i + 1
      else
        No_of_Operation = No_of_Operation + 1
      endif
    endif
    if P[i] == mismatched and T[i] != mismatched then
      No_of_Operation = No_of_Operation + 1
    endif
    if P[i] != mismatched and T[i] == mismatched then
      No_of_Operation = No_of_Operation + 1
    endif
  end
  for i = s + 1 to l do
    if P[i] == mismatched or T[i] == mismatched then
      No_of_Operation = No_of_Operation + 1
    endif
  end
  return No_of_Operation
Stop

(a) Mismatched positions are in the same place    (b) Mismatched positions are in different places

Fig. 2. Calculation of the minimum number of required operations when both strings contain only simple characters

Algorithm 4 is illustrated with the examples shown in Fig. 2. Fig. 2(a) shows presented and transcribed text each containing only one mismatched position, in the same place in both (the 1st index of both strings). According to Algorithm 4, a minimum of 1 edit primitive is required to fix the error in the transcribed text: a single substitution at the 1st position of the transcribed text makes it identical to the presented text. Similarly, Fig. 2(b) depicts two mismatched positions present in the transcribed text only. Hence, according to Algorithm 4, only 2 edit operations (2 deletions) are required to transform the transcribed text into the presented text.
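A hedged Python sketch of the counting rule in Algorithm 4, assuming the mismatch flags have already been computed (e.g. by the LCS procedure of Section IV); the function name, the flag arrays, and the representative MATRA subset are ours:

```python
MATRA = set("\u09be\u09bf\u09c0\u09c1\u09c2\u09c7\u09c8\u09cb\u09cc")  # some Bengali vowel signs

def min_ops_simple(P, T, mism_P, mism_T):
    """Minimum edit operations when both strings hold only simple
    characters; mism_P[i] / mism_T[i] flag mismatched positions."""
    s, l = min(len(P), len(T)), max(len(P), len(T))
    ops = 0
    i = 0
    while i < s:
        if mism_P[i] and mism_T[i]:
            if P[i] in MATRA and T[i] in MATRA:
                ops += 2                      # error on a matra costs two
            elif i + 1 < s and P[i+1] in MATRA and T[i+1] in MATRA:
                ops += 2                      # a matra follows the error
                i += 1
            else:
                ops += 1                      # plain simple-character error
        elif mism_P[i] or mism_T[i]:
            ops += 1                          # mismatch on one side only
        i += 1
    # positions beyond the shorter string: one insertion/deletion each
    longer_mism = mism_P if len(P) > len(T) else mism_T
    ops += sum(1 for j in range(s, l) if longer_mism[j])
    return ops
```

On the Fig. 2 examples this yields 1 operation for case (a) and 2 for case (b).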

B. Containing both simple and complex characters

In this subsection, we consider presented and transcribed text containing both simple and complex characters, under the single constraint that both texts contain an equal number of complex characters. Before quantifying the minimum number of required edit primitives, we have to identify the initial, intermediate, and last character positions of each complex character. We use a string named char-ref to store, sequentially, each character's reference: whether it is a simple character, or the initial, an intermediate, or the last character of a complex character. To accomplish this, we scan all characters from the beginning to the last index of the string. If the character following the current character is not a halant, then the current character is a simple character; otherwise, the current character belongs to a complex character and is in its initial position. The next character is then in an intermediate position of the complex character, and we check whether the position after the intermediate one is the last position or not; if it is not the last position, it is another intermediate position. We execute this procedure iteratively for all the characters of the string. Algorithm 5 describes the procedure for identifying the character references of a string.

Algorithm 5 is illustrated with the example shown in Fig. 3. Here we take two Bengali strings (roÔo, oibíu), identify the reference of each of their characters, and sequentially store them into char-ref strings (s cf ci cl and s s cf ci ci cl) by applying Algorithm 5. Here, s stands for a simple


Algorithm-5 Character identification in a Bengali string
Char-Identification(x)
  m = length[x]
  String char-ref[m]
  // s stands for a simple character
  for i = 1 to m do
    if x[i+1] != 'halant' then
      char-ref[i] = s
      i = i + 1
    endif
    // cf, ci and cl represent the first, intermediate and
    // last character of a complex character, respectively
    if x[i+1] == 'halant' then
      char-ref[i] = cf
      char-ref[i+1] = ci
      char-ref[i+2] = cl
      i = i + 3
      while x[i] == 'matra' or x[i] == 'halant' do
        if x[i] == 'matra' then
          char-ref[i-1] = ci
          char-ref[i] = cl
          i = i + 1
          Break
        endif
        if x[i] == 'halant' then
          char-ref[i-1] = ci
          char-ref[i] = ci
          char-ref[i+1] = cl
          i = i + 2
        endif
      end
    endif
  end
  return char-ref
Stop

character, and the first, intermediate, and last characters of a complex character are represented by cf, ci, and cl, respectively.

Fig. 3. Character reference of Bengali string
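The core of the char-ref labeling can be sketched in Python; this is a simplified rendering of Algorithm 5 for halant-joined clusters (the function name and labels are ours, and the matra handling of the full algorithm is omitted for brevity):

```python
HALANT = "\u09cd"  # Bengali sign virama (halant)

def char_refs(s: str):
    """Label each character: 's' for simple, or 'cf'/'ci'/'cl' for the
    first/intermediate/last character of a complex (halant-joined) cluster."""
    refs = [None] * len(s)
    i = 0
    while i < len(s):
        # a character starts a complex cluster iff a halant follows it
        if i + 1 < len(s) and s[i + 1] == HALANT:
            j = i
            while j + 1 < len(s) and s[j + 1] == HALANT:
                j += 2          # skip the halant and the joined consonant
            refs[i] = "cf"
            for k in range(i + 1, j):
                refs[k] = "ci"  # halants and inner consonants
            refs[j] = "cl"
            i = j + 1
        else:
            refs[i] = "s"
            i += 1
    return refs
```

For a string of one simple character followed by a two-consonant cluster, this produces the reference sequence s cf ci cl, matching the pattern shown in Fig. 3.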

After classifying all the characters, we come to the stage of edit primitive quantification when both the presented and transcribed strings contain complex and simple characters. First, the simple and complex characters of both strings are stored in separate strings. Let P be the presented string or text; it is divided into two substrings, Ps and Pc, to which all simple and all complex characters of P are assigned sequentially and respectively. Similarly, the transcribed string T is subdivided into two parts, Ts and Tc. We then find the matched and mismatched positions of Ps and Ts using Algorithms 1 and 2, and with reference to these positions, the minimum number of edit primitives required to convert Ts into Ps is calculated using Algorithm 4.

We then take the substrings of complex characters (Pc and Tc) and find the mismatched positions using Algorithms 1 and 2 for each complex character pair (Pc1, Pc2, ..., Pcn and Tc1, Tc2, ..., Tcn) of the two strings. After that, Algorithm 6 is applied to the matched and mismatched positions to calculate the minimum number of required edit primitives separately for each complex character pair from Pc and Tc. According to Algorithm 6, if both characters (Pci, Tci) are of equal length and the mismatched position is the last character, the minimum number of required operations is 2 (one deletion and one insertion). If a mismatched position occurs anywhere other than the last position, the number of operations is the length of Pci (the minimal correction is total substitution). On the other hand, if the two strings are of unequal length and a mismatched position lies within the length of the smaller string, the number of operations is again the length of Pci (total substitution); if the mismatched positions lie only in the larger string's extra portion, the minimum number of operations is the length difference between the two strings (the operations are insertions or deletions).

Algorithm-6 Minimum number of operations when both strings contain simple and complex characters
Minimum-Operation(Pci, Tci)
  m = length[Pci]
  n = length[Tci]
  No_of_Operation = 0
  boolean Flag = 0
  if m == n then
    for j = 0 to m do
      if Pci[j] == mismatched or Tci[j] == mismatched then
        if char-ref[j] == cl then
          No_of_Operation = No_of_Operation + 2
          Break
        else
          No_of_Operation = No_of_Operation + m
          Break
        endif
      endif
    end
  endif
  if m > n then
    for j = 0 to n do
      if Pci[j] == mismatched or Tci[j] == mismatched then
        No_of_Operation = No_of_Operation + m
        Flag = 1
        Break
      endif
    end
    if Flag == 0 then
      No_of_Operation = No_of_Operation + (m - n)
    endif
  endif
  if m < n then
    for j = 0 to m do
      if Pci[j] == mismatched or Tci[j] == mismatched then
        No_of_Operation = No_of_Operation + m
        Flag = 1
        Break
      endif
    end
    if Flag == 0 then
      No_of_Operation = No_of_Operation + (n - m)
    endif
  endif
  return No_of_Operation
Stop
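The per-pair rule of Algorithm 6 can be condensed into a short Python sketch, assuming the mismatch flags and char-ref labels for the pair are already available (names are ours; this is an illustration, not the authors' code):

```python
def min_ops_complex_pair(m, n, mismatched, char_ref):
    """Minimum operations to fix one complex-character pair (Pci, Tci).
    m, n: lengths of the presented/transcribed clusters; mismatched[j] is
    True if position j is mismatched in either cluster; char_ref labels
    positions 'cf'/'ci'/'cl'."""
    first_bad = next((j for j in range(min(m, n)) if mismatched[j]), None)
    if m == n:
        if first_bad is None:
            return 0
        if char_ref[first_bad] == "cl":
            return 2        # last character: one deletion + one insertion
        return m            # otherwise substitute the whole cluster
    if first_bad is not None:
        return m            # error within the shared prefix: full substitution
    return abs(m - n)       # error only in the extra tail: insert/delete it
```

On the Fig. 4 case (equal lengths, mismatch at the first position of a length-3 cluster), this returns 3, matching the text.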

The calculation of the minimum number of required operations when both strings contain simple and complex characters is illustrated with an example in Fig. 4. It shows the case where both strings are of equal length and the mismatched position is the first position of the complex character. According to Algorithm 6, the minimum number of operations is the length of that particular cluster (number of operations = 3).

VI. ERROR METRICS

In this section, we introduce a new metric, namely correction cost per error (CCPE). It measures the average number of edit primitives required to fix each error. To compute CCPE, we first have to


Fig. 4. Calculation of the number of required operations when both strings contain simple and complex characters

calculate the number of errors present in the transcribed text with reference to the presented text. The number of errors is the maximum of the mismatched-character counts between the two strings, as shown in Eqn. 2. Correction cost per error (CCPE) is then computed by dividing the total number of required operations by the number of errors, as depicted in Eqn. 3.

Number of errors = max mismatched(P, T)   (2)

CCPE = (Total number of required operations) / (Number of errors)   (3)
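One concrete reading of Eqn. 2, sketched in Python: characters falling outside the longest common subsequence of the two strings are treated as mismatched, and the error count is the larger of the two mismatch counts. The functions below are an illustrative assumption, not the paper's code.

```python
def lcs_length(p, t):
    # classic dynamic-programming LCS length
    dp = [[0] * (len(t) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            if p[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(p)][len(t)]

def number_of_errors(p, t):
    # Eqn. 2: the larger of the mismatched-character counts in either string,
    # counting characters outside the LCS as mismatched (our reading)
    common = lcs_length(p, t)
    return max(len(p) - common, len(t) - common)

print(number_of_errors("abcd", "abxd"))  # 1
```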

Moreover, when all errors occur in simple characters only, CCPE will be one. On the other hand, when errors occur in both simple and complex characters, it may be greater than one, since multiple operations may be required to fix a single error in a complex character.
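CCPE itself (Eqn. 3) is a simple ratio; the short sketch below checks it against the rows of Table III.

```python
def ccpe(total_operations, number_of_errors):
    # Eqn. 3: average number of edit primitives needed to fix one error
    return total_operations / number_of_errors

# rows of Table III
print(ccpe(4, 1))             # 4.0
print(ccpe(6, 2))             # 3.0
print(round(ccpe(10, 3), 2))  # 3.33
print(ccpe(5, 2))             # 2.5
```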

Here we give a complete example of a presented text and corresponding transcribed texts in Table III. Errors in the transcribed texts are shown in italics. All the error metrics, namely the number of errors, the minimum number of operations required to fix the errors, and the correction cost per error, are also shown in Table III.

VII. CONCLUSION AND FUTURE WORK

We have developed a method for identifying the errors in Bengali transcribed text with reference to the presented text and quantifying the edit primitives required to fix those errors. We also analyze the different edit costs for errors in simple and complex characters. A complex character is formed by combining several simple characters, and its constituent characters are not individually editable. Here, we assume a mouse or pointing device is used to directly point to or select the character to edit, and we quantify the minimum number of edit primitives required to fix the errors after pointing to the error position. We also introduce a new error metric named correction cost per error (CCPE) to calculate the average number of operations required to fix a single error in a specific transcribed text. Our proposed methodology can be

applicable to error quantification in any type of single-stroke or single-tap text entry tool, where one character is assigned to only one key. The newly proposed CCPE metric can further be useful in evaluating the cost of error correction for characters, especially the complex characters of a language.

TABLE III
ANALYSIS OF PROPOSED METRICS ON BENGALI TEXT

Presented text (same for all rows): kolkata moHanogorŇ poiścmboećgr rajzanŇ, pŐŕb varoetr baoiNoijYk rajzanŇ.

Transcribed text | Number of errors | Number of operations | CCPE
kolkata moHanogorŇ poiścmboeěgr rajzanŇ, pŐŕb varoetr baoiNoijYk rajzanŇ. | 1 | 4 | 4
kolkata moHanogorŇ poiścmboećgr rajzanŇ, puŕb varoetr baoiNoiJYk rajzanŇ. | 2 | 6 | 3
kolkata moHanogorŇ p oişcmboeěgr rajza oin, pŐŕb varoetr baoiNoijYk rajzanŇ. | 3 | 10 | 3.33
kolkata moHanogorŇ p oiśqmboećgr rajzanŇ, pŐŕb varoetr baoiNoijYb rajzanŇ. | 2 | 5 | 2.5

In this work, we assume the constraint that the presented and transcribed strings contain the same number of complex characters. In future research, we will remove this constraint and calculate the minimum number of required edit primitives when different numbers of complex characters are present in the two strings. In this work, we also analyze the minimum number of required edit primitives only for single-tap text entry technologies. In the near future, we will extend our work to calculate the minimum number of required edit primitives for multi-tap and prediction-based text entry technologies.
