CHAPTER 4
CRYPTANALYSIS USING
LANGUAGE MODEL
4.1 INTRODUCTION
In language modeling, n-gram models are probabilistic models of
text that use a limited amount of history, expressed as character or
word dependencies, where n refers to the number of characters or
words that participate in the dependency relation. A statistical
language model assigns a probability to a sequence of n successive
characters or words by means of a probability distribution. Language
modeling is used in several natural language processing applications
such as speech recognition, machine translation, parsing, information
retrieval and cryptanalysis.
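The chain-rule view of such an n-gram model can be sketched in a few lines of Python. This is an illustrative maximum-likelihood bigram model over characters, not the model built later in the chapter; the corpus string is a made-up toy example.

```python
from collections import Counter

def bigram_model(corpus):
    """Estimate character unigram and bigram counts from a corpus."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    return unigrams, bigrams

def sequence_probability(text, unigrams, bigrams):
    """P(c1..cn) = P(c1) * prod_i P(c_i | c_{i-1}), maximum likelihood."""
    total = sum(unigrams.values())
    if unigrams[text[0]] == 0:
        return 0.0
    prob = unigrams[text[0]] / total
    for prev, cur in zip(text, text[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]
        if prob == 0.0:
            break
    return prob

# Toy corpus; a real model would be trained on a large text sample.
unigrams, bigrams = bigram_model("the cat sat on the mat")
p = sequence_probability("the", unigrams, bigrams)
```

Unsmoothed maximum-likelihood estimates assign zero probability to unseen n-grams, which is why practical models add smoothing.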
The complex properties and characteristics of natural languages
play an important role in cryptanalysis. Different approaches to
cryptanalysis in the literature use language characteristics to
assess the strength of a cipher system. One such approach deals with
probability analysis, wherein determining the probability of each
symbol in the encrypted message leads to a prediction of the plain
text. Language characteristics such as the probability distribution
are reflected in the transformed text produced by encrypting the
message. This information
along with knowledge of the symbol frequencies of the language,
helps to determine which cipher text symbol maps to which plaintext
symbol.
Extensive statistical analysis of the probability distribution of
characters adds knowledge that helps to retrieve the plain text
message, at least partly. The probability distribution as a
parameter in the reverse mapping process is largely language
specific; in general, the probability characteristics differ from
language to language. In the case of English, owing to the small
size of the character set, the probability characteristics are
effectively reflected in the transformed data. If the set of
meaningful units is large, the complexity of the probability
characteristics has to be evaluated. Moreover, single-letter
probability analysis is helpful in obtaining an initial key and in
performing a more powerful bigram analysis.
An attempt is made to understand the reflection of probability
characteristics and their impact on cryptanalysis through a case
study on Telugu script. In the case of Indic scripts, code points
are considered the message units.
4.2 FEATURES OF INDIC SCRIPT
Every language has parameters through which its rules are embodied
in the sequences that make up a document. The complexity of a script
depends mainly on its character, word and sentence formation
methods. A document with a meaningful summary can be represented as
D → S → W → C, where D is the document and S, W and C are sentences,
words and characters respectively. In the case of English, C is
represented through a one-to-one correspondence with character code
points in any machine, whereas Indic script representation involves
a two-fold phenomenon: C corresponds in real terms to a syllable,
which in turn is represented as a set of multiple character code
points. C can therefore be written as C → Sy → CC, where Sy is a
syllable and CC is a character code point.
In Indic scripts, words are treated as sequences of syllables (the
basic unit). The script grammar is used to segment a word into
syllables. The units of orthography are syllables, which are
essentially C*V core syllables, where C denotes a consonant and V a
vowel. Vowel-suppressed consonant segments are also allowed. A
syllable is formed using a canonical structure given by ((C)C)CV. A
detailed analysis of the various possible combinations of this
canonical structure is carried out by Pratap et al. [PRA 2001]. The
possible decompositions of syllables are V, CV, CCV, CCCV, C, CC and
CCC, where V is an independent vowel and C is a vowel-suppressed
consonant. CV is the basic consonant-vowel core unit, found in two
forms: a bare consonant form, and a combination of a consonant and a
vowel sign. The groups CCV and CCCV are conjunct
formations in which one or two consonants are grouped with the CV
core unit. The other groups, CC and CCC, are also conjunct
formations, but without a vowel. Various special symbols such as
Anuswara, Visarga and other Sanskrit symbols exist alongside these
Aksharas, leading to a large number of possible combinations. Vowels
and vowel-suppressed consonants are treated as independent
syllables. Syllables with a CV core are influenced by vowel
modifiers; similarly, ((C)C)CV combinations are influenced by
consonant modifiers.
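The canonical decomposition above can be approximated with a regular expression over the Unicode Telugu block. This is a hedged sketch, not the thesis's segmenter: the ranges cover only independent vowels, consonants, dependent vowel signs and the virama, ignoring Anuswara, Visarga and other special signs.

```python
import re

# Simplified Unicode ranges for Telugu (assumption: special signs ignored)
V  = "[\u0C05-\u0C14]"   # independent vowels
C  = "[\u0C15-\u0C39]"   # consonants
VS = "[\u0C3E-\u0C4C]"   # dependent vowel signs
H  = "\u0C4D"            # virama (halant)

# ((C)C)CV canonical structure: up to two conjunct consonants, each
# followed by a virama, then a consonant with an optional vowel sign
# (a trailing virama marks a vowel-suppressed C/CC/CCC group), or an
# independent vowel on its own.
SYLLABLE = re.compile(f"(?:{C}{H}){{0,2}}{C}{VS}?{H}?|{V}")

def syllables(word):
    """Split a Telugu word into canonical-structure syllables."""
    return SYLLABLE.findall(word)

# CCV conjunct (ka + virama + ra + vowel sign i) followed by a bare C.
word = "\u0C15\u0C4D\u0C30\u0C3F\u0C37"
parts = syllables(word)
```

A production segmenter would also handle nasalization and length marks, but the regex is enough to illustrate how a variable number of code points collapses into one syllable.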
The Indian Standard Code for Information Interchange (ISCII) is the
character code for the Indian languages, which derive from the
Brahmi script. It was evolved by a committee under the Department of
Electronics during 1986-88 and adopted in 1991 by the Bureau of
Indian Standards (BIS). The ISO-10646 and Unicode standards define
their repertoires for the written scripts of the world. ISCII is an
8-bit encoding that uses escape sequences to announce the particular
Indic script represented by the following coded character sequence.
Unicode is designed as a multilingual encoding scheme that requires
no escape sequences or switching between scripts. Except for a few
minor differences, ISCII and Unicode correspond directly, and the
layout is as shown in Figure 4.1. For any given Indic script, the
consonant and vowel codes of Unicode are based on ISCII. ISCII
combines letters with the characters NUKTA, INV and HALANT to
allow control over character formation. Unicode provides the same
using ZWJ & ZWNJ characters.
In Telugu, the first consonant forms the CV cluster, and the other
consonants after the CV cluster appear in dependent form. The basic
structure [Pratap et al.] deals with vowels, consonants, and
characters formed from a consonant plus a vowel sign. The other
characters are coded with the help of these three groups plus the
special signs Virama, Anuswara and Visarga. The possible groups for
conjuncts and their code sequences are shown in Table 4.1.
Table 4.1 ISCII/Unicode Code sequences for Conjuncts

Conjunct Character   Code sequence
CCA                  C + Virama + C
CCV                  C + Virama + C + Vs
CCCA                 C + Virama + C + Virama + C
CCCV                 C + Virama + C + Virama + C + Vs
CC                   C + Virama + C + Virama
CCC                  C + Virama + C + Virama + C + Virama
Figure 4.1 Basic Telugu Syllable Layout (a base symbol with
subscript, superscript, pre-symbol and post-symbol positions)
In the case of Indic scripts, there is a many-to-one correspondence
in the form of code sequences: a syllable is represented by a
non-uniform set of code points. For example, consider the word
NEWZELAND in English, which contains 9 basic units, the characters
N, E, W, Z, E, L, A, N, D, where each character has a fixed size of
1 byte. For Indic scripts, however, the basic unit, the syllable, is
a combination of several character codes. Writing the above word in
Telugu gives a word of 4 syllables. The first syllable is a CCV
structure that occupies 5 bytes of memory; the second is a CV
structure that occupies 2 bytes; the third is a structure that
occupies 3 bytes; and the fourth is a structure that occupies 2
bytes. Each syllable thus has a varying size, based on its canonical
structure, that can range from 1 to 10 bytes. This complexity in the
script is of much use in the process of cryptography.
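The variable syllable size can be checked directly on the code points. A minimal sketch, using a hypothetical CCV syllable of four code points; note the byte counts quoted above assume one byte per ISCII code point, whereas UTF-8 needs three bytes for each Telugu code point.

```python
# Hypothetical CCV syllable: ka + virama + ra + vowel sign i.
syllable = "\u0C15\u0C4D\u0C30\u0C3F"

code_points = len(syllable)                 # number of code points
iscii_bytes = code_points                   # 1 byte per code point in ISCII
utf8_bytes = len(syllable.encode("utf-8"))  # 3 bytes per Telugu code point
```

The same one-line measurement applied to English text always returns one code point per character, which is exactly the one-to-one mapping described above.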
4.3 PROBABILITY DISTRIBUTION OF CHARACTERS
The basic unit of script description is the syllable, defined by the
canonical structure ((C)C)CV. The machine representation of this
structure is composed of a set of character code points defined in
the Unicode code chart. Human perception of these code points is
non-linear, whereas machine perception is linear, as illustrated in
Figure 4.3 and Figure 4.4. This non-linearity differs between
languages, as shown in Figure 4.2 and Figure 4.5. Even though
syllables are the meaningful units of the script and abide by its
specific grammar rules, the character code points in the machine
representation are perceived as a reflection of these grammar rules.
It is necessary to understand the complex nature of the script
through the usage of its syllables, which has changed over time.
Figure 4.2 Probability distribution of characters of English
Figure 4.4 Probability distribution of characters of Telugu
Figure 4.3 Sorted Probability distribution of characters of English
In the actual transformation, the character code points are
transformed with the help of a cryptosystem. This transformation is
carried out onto a different plane where the mapping is a reversible
phenomenon: the text is first transformed into the bit-stream
domain and then into another domain. Both the original text and its
bit stream are human-understandable, but the transformed text cannot
be understood. The transformation characteristics of the meaningful
units, from the standpoint of their probability characteristics, are
the point of interest in the present work. Generally, the
probability characteristics differ from language to language.
Figure 4.5 Sorted Probability distribution of characters of Telugu
In the present work, the machine representation of character code
points is considered, and their probability distribution is adopted
as one source of information for cryptanalysis. The script
complexity of Indic scripts, in the form of this probability
distribution, is the basis for the proposed model. Many attempts
have been made to extract the probability distribution of the basic
alphabet from Latin text. One such analysis of the probability
distribution of characters in Latin text, carried out on a sample of
over 10,00,000 character code points, demonstrated the dominance of
a small set of characters in regular usage. A similar approach is
extended in the present work to evaluate the characteristics of the
variable character code points that are embodied in the syllables of
Telugu, Kannada and Hindi text.
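The percentage distributions reported in the tables below can be computed with a simple frequency count. A minimal sketch, using a toy string in place of the 32,00,000-code-point corpus.

```python
from collections import Counter

def code_point_distribution(corpus):
    """Percentage of the sample accounted for by each code point."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {cp: 100.0 * n / total for cp, n in counts.items()}

# Toy sample: 'a' accounts for 5 of the 11 code points.
dist = code_point_distribution("abracadabra")
```

Sorting the resulting dictionary by value reproduces the ranked tables used throughout this section.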
A sample of 32,00,000 character code points is used for the above
analysis, compiled mainly from present-day usage of the language:
passages from numerous newspapers, novels, stories, songs, sports
coverage and literature. For text in Telugu, the probabilities,
expressed as a percentage of the character code points of the
alphabet in the sample, are presented in Table 4.2. Certain
frequencies in the table are zero because they belong to characters
deprecated in current usage of the language; zero frequencies are
observed, for instance, for the Telugu digits 0 to 9, which are not
used in general text. An interesting
phenomenon is observed in the probability distribution of character
code points. The highest probability among vowels, about 1%, is
associated with the vowel U+0C05. All other vowels are observed with a
Table 4.2 Probability Distribution of character code points of
Telugu script (table body not reproduced: the Telugu glyphs were
lost in extraction; only the probability percentages, from 8.86%
down to 0.00%, survive)
probability less than or equal to 0.5%. Among consonants, the
highest probability, 6.2%, is associated with a single consonant,
and only four consonants have a probability greater than 4%. Among
vowel signs, only three are observed with a probability of around
7%. This phenomenon is associated mainly with the CV core, which is
reported to account for 54% of the syllable structure. The nasal
symbol is observed with 4.7% probability, and the highest
probability of all, 8.86%, is associated with the Halant. It is
quite interesting that the Halant is not treated as a syllable at
all; nevertheless, it plays a significant role in the conjunct
formation of syllables.
For Kannada, a sample of 16,94,650 character code points is used for
the same analysis, again compiled mainly from present-day usage of
the language: passages from numerous newspapers, novels, stories,
songs, sports coverage and literature. Table 4.3 shows the
probabilities, expressed as a percentage of the character code
points of the alphabet in the sample. Certain frequencies in the
table are zero, as they belong to characters deprecated in the
present usage of the language. The highest probability, 9.25%, is
associated with a single character.
Table 4.3 Probability Distribution of character code points of
Kannada script (table body not reproduced: the Kannada glyphs were
lost in extraction; the probabilities range from 9.25% down to
0.00%)
(Table 4.4 body not reproduced: the Hindi glyphs were lost in
extraction; the probabilities range from 8.76% down to 0.00%)
For Hindi, a sample of 9,36,707 character code points is used for
the analysis, compiled mainly from present-day usage of the
language. The highest probability, 8.76%, is associated with a vowel
sign character; among consonants, the highest probability is 6.57%,
as listed in Table 4.4.
Table 4.4 Probability Distribution of character code points of Hindi script
S.No.  Code Point  Probability %    S.No.  Code Point  Probability %
1      E           12.90            14     P           2.62
2      T           9.65             15     U           2.57
3      O           7.73             16     F           2.37
4      S           7.40             17     G           1.86
5      N           7.20             18     B           1.59
6      I           7.19             19     W           1.40
7      A           7.16             20     Y           1.18
8      R           6.65             21     V           0.89
9      C           4.52             22     K           0.75
10     H           3.92             23     X           0.30
11     L           3.81             24     Q           0.19
12     D           3.26             25     J           0.12
13     M           2.67             26     Z           0.09
For English, a sample of 10,00,000 character code points is used for
the analysis, compiled mainly from present-day usage of the
language. The highest probability, 12.9%, is associated with the
character E, whereas the minimum probability, 0.09%, is associated
with the character Z, as listed in Table 4.5.
Just as single letters have typical probability distributions,
multiple-letter combinations also occur with varying yet predictable
probabilities. Extending the unconditional probability approach, the
probabilities at which bigrams and trigrams occur in the text are
determined. For Telugu, approximately 4096 bigrams are possible; the
26 most frequently occurring bigrams are listed in Table 4.6. The
Table 4.5 Probability Distribution of character code points of English script
(Table 4.6 body not reproduced: the Telugu bigram glyphs were lost
in extraction; the probabilities range from 1.41% down to 0.70%)
most frequent bigram has a probability of 1.41%. From these values
it is easy to infer that the bigrams form a set of clusters around
specific values, which increases the complexity of reverse mapping.
A similar evaluation of the probability distribution is carried out
for Kannada and Hindi; the 26 most probable character bigrams are
listed in Table 4.7 and Table 4.8 respectively. The most frequent
bigram has a probability of 1.34% in Kannada and 1.10% in Hindi. As
in Telugu, a similar clustering of bigrams is observed in Kannada
and Hindi.
Table 4.6 Probability Distribution of most frequent bigram character code points of Telugu script
(Tables 4.7 and 4.8 bodies not reproduced: the Kannada and Hindi
bigram glyphs were lost in extraction; the probabilities range from
1.34% down to 0.62% for Kannada and from 1.10% down to 0.52% for
Hindi)
Table 4.7 Probability Distribution of most frequent bigram character code points of Kannada script

Table 4.8 Probability Distribution of most frequent bigram character code points of Hindi script
S.No.  Code Point  Probability %    S.No.  Code Point  Probability %
1      TH          2.47             14     NT          1.23
2      ER          1.96             15     AN          1.22
3      HE          1.96             16     ET          1.15
4      IN          1.83             17     SE          1.02
5      ES          1.68             18     ED          1.02
6      ON          1.60             19     TO          1.00
7      RE          1.44             20     CO          0.99
8      TE          1.37             21     EC          0.96
9      ST          1.37             22     IS          0.95
10     TI          1.35             23     RO          0.91
11     EN          1.33             24     ND          0.88
12     AT          1.25             25     IT          0.87
13     OR          1.23             26     AR          0.86
For English, 676 bigrams are logically possible. The 26 most
frequently occurring bigrams are listed in Table 4.9. The bigram TH
has the highest probability, 2.47%. From these values it is again
easy to infer that the bigrams form a set of clusters around
specific values. These 26 bigrams of English account for 35% of the
total distribution, whereas for the Indic scripts the corresponding
share is in the range of 19 to 22%. Reverse mapping is thus more
difficult for Indic scripts than for English.
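The coverage figures quoted above (35% for the top 26 English bigrams versus 19 to 22% for Indic scripts) can be estimated with a sketch like the following; the corpus here is a toy string, not the thesis corpus.

```python
from collections import Counter

def top_ngram_coverage(corpus, n, k):
    """Share of all n-gram occurrences held by the k most frequent."""
    grams = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    total = sum(grams.values())
    top = grams.most_common(k)
    coverage = 100.0 * sum(count for _, count in top) / total
    return top, coverage

# Toy corpus: "th" occurs 4 times among 22 bigrams.
top, cov = top_ngram_coverage("the theme of the thesis", 2, 5)
```

A high top-k coverage means a few frequent n-grams dominate and reverse mapping is easier; the flatter Indic distributions spread the mass across many more code point pairs.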
An extended analysis, from the bigram probability distribution to
the trigram probability distribution of code points, provides better
knowledge of the language characteristics.

Table 4.9 Probability Distribution of most frequent bigram character code points of English script

For Telugu, a total of almost 2,63,000 trigram code points is
possible. Table 4.10 shows the top 26 trigrams based on probability
distribution; the most frequent trigram code point has a probability
of 0.86%. Because of the large number of possible trigrams, more
code points are clustered around specific values, which makes the
mapping process more complex. Similarly, out of the 4,20,000
trigrams possible in Kannada, Table 4.11 lists the top 26 code
points of the trigram distribution; the most frequent has a
probability of 1.16%. For Hindi, a total of around 7,29,000 trigrams
is possible, of which the most frequent has a probability of 0.74%,
as listed in Table 4.12. Because of this huge trigram space, reverse
mapping in Hindi is much more complex.
(Table 4.10 body not reproduced: the Telugu trigram glyphs were lost
in extraction; the probabilities range from 0.86% down to 0.20%)
Table 4.10 Probability Distribution of most frequent trigram character code points of Telugu script
(Table 4.11 body not reproduced: the Kannada trigram glyphs were
lost in extraction; the probabilities range from 1.16% down to
0.17%)
(Table 4.12 body not reproduced: the Hindi trigram glyphs were lost
in extraction; the probabilities range from 0.74% down to 0.13%)
Table 4.11 Probability Distribution of most frequent trigram character code points of Kannada script

Table 4.12 Probability Distribution of most frequent trigram character code points of Hindi script
For English, a total of 17,576 trigrams is possible, far fewer than
for the Indic scripts. Table 4.13 lists the 26 most frequent
trigrams. The trigram THE has the highest probability, 1.76%,
followed by ION with 0.73% and ING with 0.65%. Because the trigram
space of English is smaller than that of the Indic scripts, reverse
mapping is less complex for English.
S.No.  Code Point  Probability %    S.No.  Code Point  Probability %
1      THE         1.76             14     EST         0.31
2      ION         0.73             15     ERE         0.31
3      ING         0.65             16     ATE         0.30
4      TIO         0.64             17     USE         0.29
5      AND         0.55             18     AGE         0.28
6      ENT         0.52             19     STH         0.28
7      FOR         0.45             20     HER         0.28
8      PRO         0.44             21     THA         0.27
9      CON         0.42             22     ONS         0.26
10     ESS         0.40             23     ECT         0.26
11     TER         0.39             24     NTH         0.25
12     ATI         0.39             25     ONT         0.25
13     INT         0.32             26     ETH         0.25
Table 4.13 Probability Distribution of most frequent trigram character code points of English script
4.4 CONDITIONAL PROBABILITY DISTRIBUTION OF
CHARACTERS
The statistical influences extending over n symbols of the text
provide better a priori knowledge to the system to achieve
consistency. For this purpose, the conditional probability of a
character given the preceding (n-1) characters needs to be
calculated. The conditional probability P(A|B) is the probability of
some event A given the occurrence of some other event B:

P(A|B) = P(A,B)/P(B)                                    (4.1)

where P(A,B) is the joint probability, i.e. the probability of the
two events occurring together, and P(B) is the unconditional
probability of the event B.
The unconditional probability distributions of the different unigram
character code points are calculated from a large corpus of the
language. The joint probabilities of all possible combinations of
character code points for a given character are also calculated.
This process is repeated for every character code point of the
language. Using these unconditional and joint probabilities together
with expression (4.1), the conditional probabilities of all
character code points of the language can be calculated. This
procedure is applied to four different languages: English, Telugu,
Kannada and Hindi.
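Expression (4.1) reduces to a ratio of counts when estimated from a corpus. A minimal sketch, estimating P(next | given) for every character; the example string is illustrative, not corpus data.

```python
from collections import Counter

def conditional_probabilities(corpus, given):
    """P(A|B) = P(B,A)/P(B): chance that `given` is followed by each
    character, estimated from bigram and unigram counts."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n_given = unigrams[given]
    return {ch: bigrams[(given, ch)] / n_given for ch in unigrams}

# In "mississippi", 's' is followed twice by 's' and twice by 'i'.
cond = conditional_probabilities("mississippi", "s")
```

Repeating the call for every character of the alphabet reproduces a table of the same shape as Table 4.14.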
The conditional probabilities for the English character S are listed
in Table 4.14. From the table, it is evident that ST has the highest
conditional probability, i.e., T is the character most likely to
follow S; similarly, SZ has the minimum conditional probability. The
same computation is done for all 26 characters of the language.
Conditional probabilities for all characters of Telugu, Kannada and
Hindi are computed likewise. For illustration, Table 4.15 shows the
conditional probabilities for a Telugu character, Table 4.16 for a
Kannada character, and Table 4.17 for a Hindi character. These
conditional probabilities are likewise calculated for all characters
of each language.
S.No.  Char  Prob %  Cond Prob    S.No.  Char  Prob %  Cond Prob
1      ST    7.40    2.49         14     SY    7.40    0.25
2      SE    7.40    1.90         15     SB    7.40    0.24
3      SS    7.40    1.38         16     SR    7.40    0.20
4      SI    7.40    1.31         17     SN    7.40    0.20
5      SA    7.40    1.25         18     SD    7.40    0.15
6      SO    7.40    1.02         19     SL    7.40    0.14
7      SC    7.40    0.54         20     SK    7.40    0.04
8      SU    7.40    0.52         21     SG    7.40    0.04
9      SP    7.40    0.52         22     SV    7.40    0.03
10     SH    7.40    0.38         23     SJ    7.40    0.02
11     SM    7.40    0.32         24     SQ    7.40    0.01
12     SF    7.40    0.28         25     SX    7.40    0.01
13     SW    7.40    0.28         26     SZ    7.40    0.00
Table 4.14 Conditional Probability Distribution of character code point S of English script
(Table 4.15 body not reproduced: the Telugu glyphs were lost in
extraction; the character's unigram probability is 5.28%, and its
conditional probabilities range from 3.78 down to 0.00)
Table 4.15 Conditional Probability Distribution of character
code point of Telugu script
(Table 4.16 body not reproduced: the Kannada glyphs were lost in
extraction; the character's unigram probability is 5.51%, and its
conditional probabilities range from 3.60 down to 0.00)
Table 4.16 Conditional Probability Distribution of character
code point of Kannada script
(Table 4.17 body not reproduced: the Hindi glyphs were lost in
extraction; the character's unigram probability is 2.69%, and its
conditional probabilities range from 5.56 down to 0.00)
Table 4.17 Conditional Probability Distribution of character
code point of Hindi script
4.5 ENCRYPTION AND DECRYPTION
The proposed model defines meaningful units that are embedded
in text documents as essential units and also treated as meaningful
units in the form of character or byte stream. The byte stream is a
symbolic representation of text. In case of Indic scripts this byte
stream is a complex byte stream, where as in case of Latin text the
byte stream is one-to-one mapping. The present model addressed this
specificity by taking into consideration of words in the form of
syllables and extraction of byte stream from syllables.
Algorithm for Encryption of Indic Scripts
1: Divide the given text document into a set of words.
2: Divide each word into syllables (the basic unit).
3: For each syllable, generate the character code point byte stream,
   which may consist of the single or multiple code points that form
   that syllable.
4: Generate the bit stream for the byte stream generated in Step 3.
5: Apply the encryption technique to the bit stream generated in
   Step 4 with a randomly generated key stream, which results in the
   cipher text.
6: Repeat Steps 3 to 5 for each syllable generated in Step 2.
Figure 4.6 Algorithm for Encryption of Indic Scripts
This byte stream consists of single or multiple code point units.
The code point streams derived from the syllables are converted to a
bit stream, which then undergoes a transformation similar to that of
any conventional system. A key stream is generated [KAT 2005] using
an efficient random number generator. A transformation function is
applied with this key stream, resulting in the cipher text. The
processes of encryption and decryption are illustrated in Figure 4.6
and Figure 4.7, and the cryptographic model for Indic scripts is
illustrated in Figure 4.8.
Algorithm for Decryption of Indic Scripts
1: Generate the bit stream for the cipher text.
2: Apply the decryption technique to the bit stream generated in
   Step 1 with the key stream generated during encryption, resulting
   in a byte stream.
3: Combine the bit streams of Step 2 to form the code point byte
   stream.
4: Combine the code point byte streams of Step 3 to form syllables.
5: Combine the syllables to form words, and the words to form the
   text document.
6: Repeat Steps 1 through 5 for all byte streams in the cipher text.
Figure 4.7 Algorithm for Decryption of Indic Scripts
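The two algorithms above can be sketched as a repeating-key XOR over the code point byte stream. This is an assumed stand-in: the thesis does not specify the transformation function, only that a randomly generated 8/16/32-bit key stream is applied to the bit stream; a repeating-key XOR is the simplest reversible choice.

```python
import os

def keystream(length, key):
    """Repeat a short key over the whole byte stream (assumption:
    the unspecified key-stream generator is modeled as repetition)."""
    return bytes(key[i % len(key)] for i in range(length))

def encrypt(text, key):
    data = text.encode("utf-8")              # code point byte stream
    ks = keystream(len(data), key)
    return bytes(a ^ b for a, b in zip(data, ks))

def decrypt(cipher, key):
    ks = keystream(len(cipher), key)
    return bytes(a ^ b for a, b in zip(cipher, ks)).decode("utf-8")

key = os.urandom(2)                          # 16-bit random key
plain = "\u0C15\u0C4D\u0C30\u0C3F"           # a Telugu syllable
cipher = encrypt(plain, key)
```

Because XOR is its own inverse, applying the same key stream twice recovers the plain text, matching Steps 1 and 2 of the decryption algorithm.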
(Figure 4.8 depicts the flow: text document → words → syllables →
code point byte stream, as single or multiple code points → bit
stream → encryption function with key stream → cipher text, and the
reverse path through the decryption function back to the text
document.)
Figure. 4.8 Flow chart for Encryption and Decryption of
Indic scripts
The proposed cryptographic model is tested on four languages:
English, Telugu, Kannada and Hindi. The encryption algorithm is
applied to texts of different sizes. For this process a key is
generated randomly using an OS-based random generator; simple 8-bit,
16-bit and 32-bit key streams are used in the present work. The
plain text is encrypted using the proposed algorithm and the
randomly generated key, resulting in the cipher text.
Figure 4.9 Sample Plain Text, Encrypted Text and
Decrypted Text in English
Figure 4.10 Sample Plain Text, Encrypted Text and
Decrypted Text in Telugu
4.6 CRYPTANALYSIS USING LANGUAGE MODEL
In a conventional cryptographic system, a plain text message m is
generated by the sender. An encryption transformation E, which
depends on a secret key k, transforms the plain text m into the
cipher text c via c = Ek(m). The cipher text c is then transmitted
to the receiver, where the decryption transformation D, which also
depends on the secret key k (in the case of symmetric key
cryptography), is used to recover the plain text m via m = Dk(c). The
information flow is illustrated in Figure 4.11. The assumption is
that an opponent does not possess k and cannot recover m from c
using D (note that the algorithms D and E may be kept public). For
the key k to remain secret, a secure communication channel is needed
between the sender and the receiver.

Figure 4.11 Decipherment Model that uses language statistics
The information available to a cryptanalyst varies: it may be only
the cipher text, or complete knowledge of the system (except the
key), the algorithm used, the characteristics of the language and
other language statistics. If s represents the information available
to an opponent and D1 represents the process of cryptanalysis, then
the deduced information m1 is expressed as m1 = D1(s). The degree of
coincidence between m and m1 is a measure of the strength of the
system.
Figure 4.12 Sample retrieved Plain Text in English
Figure 4.13 Sample retrieved Plain Text in Telugu
In the present work, the probability distribution of the different
n-gram character code points is taken as the a priori knowledge for
cryptanalysis. The probabilities of the different characters in the
cipher text are calculated and tabulated. A mapping is made between
the characters of the plain text and the cipher text based on the
probability distribution; the characters in the cipher text are then
replaced with the mapped plain text characters, and the percentage
of plain text retrieved is calculated.
When English text is considered, the problems are fewer because the
correspondence between the transformed text and the original text is
direct. Though the key is generated randomly, since it is fixed, the
mapping function transforms each character into a distinct point in
the orthogonal plane. For large text sizes almost all characters are
present, and this holds even for medium-sized text because of the
small number of characters that exist. Moreover, because of the
one-to-one mapping, predictability is higher. The percentage of
retrieved code points is calculated using the probability
distribution.
If Indic scripts are considered, the character codes that occur in
the original text need not cover the complete set. Even though the
mapping function maintains one-to-one correspondence, not all
character codes from the original set of code points may appear in
the transformed text. This may lead to confusion in the
cryptanalysis. A thresholding function is therefore adopted in the
cryptanalysis process for reverse mapping. The percentage of plain
text retrieved is listed in Table 4.18. The result for Telugu,
Kannada and Hindi is relatively low compared with English, which is
due to the complex nature of Indic scripts.

Table 4.18. Retrieved Plain Text using Unigram probability
S.No.  Cipher text length   Retrieval % using unigram probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 29.97    22.51   27.86    19.88
2      12000                35.93    26.68   27.96    31.55
3      25000                40.40    23.77   38.95    23.83
4      40000                42.34    16.90   26.73    27.10
5      50000                38.33    08.81   28.03    27.89
6      70000                63.17    23.60   26.11    26.31
7      90000                48.04    17.44   25.58    25.93
8      110000               78.59    18.76   25.70    27.22
       Average              47.10    19.81   28.37    26.21
The encryption algorithm is implemented on Telugu text samples of
different sizes. For this process an eight-bit key is generated using
an OS-based random number generator. Plain text is encrypted using
the proposed algorithm and the randomly generated 8-bit key,
resulting in cipher text. The frequencies of the different characters
in the cipher text are extracted. Mapping is carried out between the
characters of the plain text and the cipher text based on these
frequencies. The characters in the cipher text are then replaced with
the mapped plain-text characters, and the percentage of exact
retrieval compared with the plain text is calculated and presented in
Table 4.18.
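The key-generation step can be sketched as below. Since the proposed encryption algorithm itself is not reproduced in this section, a byte-wise XOR with the 8-bit key is used as a hypothetical stand-in for the transformations Ek and Dk; it is illustrative only, not the thesis's cipher.

```python
import os

def generate_key() -> int:
    """Draw an 8-bit key from the OS random generator, as described above."""
    return os.urandom(1)[0]

def xor_transform(data: bytes, key: int) -> bytes:
    """Hypothetical stand-in for Ek / Dk: XOR with a fixed byte is its own
    inverse, so the same function encrypts and decrypts."""
    return bytes(b ^ key for b in data)
```

Because XOR is an involution, `xor_transform(xor_transform(m, k), k)` returns the original plain text, mirroring the m = Dk(Ek(m)) relation of the model.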
From the table it is easy to infer that cryptanalysis of text in
complex languages like Telugu, Kannada and Hindi is much more
difficult. Hence the larger key size applicable to Latin text can be
reduced for applications in complex languages like Telugu. The
percentage of plain text retrieved is not linear with text size
because a proper threshold function is required to map cipher-text
symbols to the corresponding plain-text symbols.
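One plausible form of such a threshold function is to accept a cipher-to-plain mapping only when the probability gap is small, leaving uncertain symbols unmapped (the empty cells in the mapping tables below). The threshold value and the nearest-probability rule here are assumptions, not the thesis's exact function.

```python
from collections import Counter

def thresholded_mapping(cipher_text: str, ref_probs: dict, threshold: float = 0.02) -> dict:
    """Map each cipher symbol to the plain symbol with the nearest reference
    probability, but only when the absolute gap is below the threshold;
    otherwise the symbol is left unmapped."""
    n = len(cipher_text)
    mapping = {}
    for sym, count in Counter(cipher_text).items():
        p = count / n
        # plain symbol whose reference probability is closest to p
        best = min(ref_probs, key=lambda s: abs(ref_probs[s] - p))
        if abs(ref_probs[best] - p) < threshold:
            mapping[sym] = best
    return mapping
```

Raising the threshold maps more symbols but risks wrong pairings; lowering it leaves more cells empty, which matches the non-linear retrieval behaviour noted above.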
Using the unconditional probability distribution approach, the
decipherment model is applied to cipher texts of four languages with
sizes ranging from 6000 to 110000 character code points. The
retrieved plain text percentages are tabulated in Table 4.18. English
text resulted in a maximum retrieval of 78.59% and a minimum of
29.97%. For Telugu a maximum of 26.68% and a minimum of 8.81% of
plain text is retrieved. The retrieved text for Kannada lies in the
range of 38.95% to 25.58%, and for Hindi in the range of 31.55% to
19.88%.
Table 4.19. Mapping process in decipherment for a sample of most
frequently occurring characters of English using unconditional
probability
From the observed results it can be concluded that the reverse
mapping is more complex in the case of Indic scripts (with specific
reference to Telugu), even with smaller key sizes.
The mapping of fourteen independent characters of the English
language on samples varying from 6000 to 110000 code points is
presented in Table 4.19. An empty cell indicates that the character
is not mapped correctly. The mapping varies with sample size due to
the heuristic nature of the language, which results in variation of
the retrieval percentage. Table 4.20 and Table 4.21 illustrate
similar computations, with associated results for the code points of
Hindi and Telugu, and a similar behaviour is observed. This behaviour
results in inconsistent variation in the mapping process.
S.No  Sample size   Plain text characters
                    E  T  O  S  N  I  A  R  C  H  L  D  M  P
1     6000          E  T                 R
2     12000         E  T                 R
3     25000         E  T                 R     H     D  M
4     40000         E  T                 R     H     D  M
5     50000         E  T                 R           D  M
6     70000         E  T  O  S     I     R  C
7     90000         E  T                 R        L  D  M
8     110000        E  T  O  S     I     R  C  H  L  D  M
Table 4.20. Mapping process in decipherment for a sample of most
frequently occurring characters of Hindi using unconditional
probability

Table 4.21. Mapping process in decipherment for a sample of most
frequently occurring characters of Telugu using unconditional
probability
Now the unconditional probability approach is extended by using the
probabilities with which bigrams and trigrams occur within the cipher
text. These probabilities are mapped to the known probability
distribution of the original language. Using this reverse mapping,
the plain text is retrieved; the percentages of plain text retrieved
for the four languages with varying sample sizes using bigrams and
trigrams are listed in Table 4.22 and Table 4.23.

Using the unconditional probability distribution of bigram character
code points, the decipherment model is applied to cipher texts of the
four languages with sizes ranging from 6000 to 110000 character code
points. The retrieved plain text percentages using bigram statistics
are tabulated in Table 4.22. English text resulted in a maximum
retrieval of 10.77% and a minimum of 2.82%. For Telugu a maximum of
2.86% and a minimum of 0.03% of plain text is retrieved. The
retrieved text for Kannada lies in the range of 4.57% to 2.20%, and
for Hindi in the range of 3.90% to 0.07%. The retrieved percentage of
plain text is higher for English than for the Indic scripts.

Table 4.22. Retrieved Plain Text using bigram probability

S.No.  Cipher text length   Retrieval % using bigram probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 5.20     0.87    2.20     0.07
2      12000                2.82     2.47    2.52     3.47
3      40000                7.85     2.86    3.43     3.40
4      50000                9.22     2.13    3.84     2.02
5      70000                6.64     0.91    3.83     3.90
6      90000                10.77    0.03    4.57     3.49
7      110000               9.60     1.80    2.70     0.44
       Average              7.44     1.58    3.30     2.40

The same process is repeated using the trigram probability
distribution of character code points on cipher texts of the four
languages with sizes ranging from 6000 to 110000 character code
points. The retrieved plain text percentages using trigram statistics
are tabulated in Table 4.23. English text resulted in a maximum
retrieval of 4.18% and a minimum of 2.25%. For Telugu a maximum of
2.20% and a minimum of 1.00% of plain text is retrieved. The
retrieved text for Kannada lies in the range of 5.04% to 1.45%, and
for Hindi in the range of 1.95% to 0.60%. The retrieved percentage of
plain text is higher for English than for the Indic scripts.

Table 4.23. Retrieved Plain Text using Trigram probability

S.No.  Cipher text length   Retrieval % using trigram probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 2.95     1.00    1.45     0.60
2      12000                3.80     1.20    4.72     1.95
3      25000                2.25     1.67    5.04     0.84
4      40000                3.10     2.20    3.00     1.27
5      50000                3.54     2.01    2.52     0.74
6      70000                3.61     1.22    2.33     0.99
7      90000                4.18     1.15    3.90     1.00
8      110000               2.36     1.61    3.03     0.62
       Average              3.22     1.51    3.25     1.00
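The extension from unigrams to bigrams and trigrams changes only the unit being counted; the rank-matching idea is unchanged. The following sketch illustrates this (the helper names `ngram_probs` and `rank_map` are hypothetical, introduced here for illustration):

```python
from collections import Counter

def ngram_probs(text: str, n: int) -> dict:
    """Unconditional probability of each n-gram of code points in the text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

def rank_map(cipher_probs: dict, plain_probs: dict) -> dict:
    """Pair cipher n-grams with plain n-grams by descending probability rank."""
    by_prob = lambda d: sorted(d, key=d.get, reverse=True)
    return dict(zip(by_prob(cipher_probs), by_prob(plain_probs)))
```

Because the number of distinct bigrams and trigrams is far larger than the number of distinct characters, their estimated probabilities are much noisier at these sample sizes, which is consistent with the low retrieval percentages in Tables 4.22 and 4.23.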
Using the conditional probability distribution approach, the
decipherment model is applied to cipher texts of the four languages
with sizes ranging from 6000 to 110000 character code points. The
retrieved plain text percentages are tabulated in Table 4.24. English
text resulted in a maximum retrieval of 57.85% and a minimum of
42.42%. For Telugu a maximum of 33.93% and a minimum of 24.72% of
plain text is retrieved. The retrieved text for Kannada lies in the
range of 38.30% to 29.77%, and for Hindi in the range of 34.71% to
25.00%.
Table 4.25 illustrates the mapping of the 14 most probable characters
using conditional probability. An empty cell indicates that for a
given sample size the character is not mapped correctly. When this is
compared with the mapping process using unconditional probability,
illustrated in Table 4.19, some characters that are not mapped
correctly by the unconditional approach are mapped properly by the
conditional probability approach. This results in an increased
retrieval percentage of the message text.
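The conditional statistics used here can be estimated as successor distributions P(next | current) over adjacent code points. The sketch below shows the estimation step only; how the rows of this distribution are matched between cipher text and language model is a design choice not fixed by this summary.

```python
from collections import Counter, defaultdict

def conditional_probs(text: str) -> dict:
    """Estimate P(next | current) from adjacent code-point pairs:
    a nested mapping current -> {next: probability}."""
    pair_counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        pair_counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
        for cur, counts in pair_counts.items()
    }
```

Each symbol is now characterised by a whole successor distribution rather than a single marginal probability, which gives the matching step more evidence per symbol and explains the higher retrieval percentages in Table 4.24.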
Table 4.24. Retrieved Plain Text using Conditional probability

S.No.  Cipher text length   Retrieval % using conditional probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 42.42    24.72   29.77    31.15
2      12000                49.65    27.29   35.94    34.71
3      25000                52.76    27.99   38.30    25.00
4      40000                48.76    26.92   36.07    25.74
5      50000                43.42    26.46   34.88    31.33
6      70000                56.33    33.93   30.69    31.01
7      90000                57.85    33.13   30.33    32.67
8      110000               57.35    29.88   31.31    34.55
       Average              51.07    28.79   33.41    30.77
Similar conclusions are drawn from Table 4.26 and Table 4.27, which
represent the mapping process for the character code points of Telugu
and Hindi respectively. For these languages the conditional
probability distribution approach resulted in better mapping
performance than the unconditional approach.
Table 4.25. Mapping process in decipherment for a sample of most
frequently occurring characters of English using conditional
probability

S.No  Sample size   Plain text characters
                    E  T  O  S  N  I  A  R  C  H  L  D  M  P
1     6000          E  T  O           A  R     H           P
2     12000         E  T  O  S     I  A  R     H  L     M
3     25000         E  T  O  S     I     R     H  L  D  M
4     40000         E  T  O  S     I     R     H  L  D
5     50000         E  T     S     I     R     H  L  D     P
6     70000         E  T  O  S  N  I     R     H  L  D  M  P
7     90000         E  T  O  S  N  I  A  R     H  L  D  M  P
8     110000        E  T  O  S  N  I     R     H  L  D  M  P

Table 4.26. Mapping process in decipherment for a sample of most
frequently occurring characters of Telugu using conditional
probability
Figure 4.14 Average Retrieved Plain Text using Unconditional and
Conditional probability of different languages

Table 4.27. Mapping process in decipherment for a sample of most
frequently occurring characters of Hindi using conditional
probability
The average retrieval efficiency for English is observed to be 47.1%
using unconditional probability and 51.07% using conditional
probability, as presented in Figure 4.14. Using unconditional
probability, the retrieved plain text percentages for Telugu, Kannada
and Hindi are observed to be 19.81%, 28.37% and 26.21% respectively,
and the corresponding values using conditional probability are
28.79%, 33.41% and 30.77%. It is observed from these results that
conditional probability yields a relatively larger retrieval
percentage than the unconditional probability distribution. Thus
using conditional probability as a priori knowledge yields increased
retrieval efficiency and improved consistency in the retrieval
percentage. The retrieval percentage for English is larger than that
for the Indic scripts, viz. Telugu, Kannada and Hindi, when
conditional probabilities are applied. Quite interestingly, the
retrieval efficiency associated with conditional probability is found
to be more stable than that of the probability distribution alone. In
the case of conditional probability, smaller text sizes (below 2000)
of all languages possess improved retrieval efficiency when compared
with the probability distribution alone. Adoption of conditional
probability allowed correct coincidence with more character code
points, which is not the case in the matching pattern of probability
alone, as illustrated in Table 4.20. Similar improvements are
observed for the above languages, where the number of code points is
relatively large.
In the process of providing security, the content of the message is
termed script-dependent text. In the world of multilingual data,
every script possesses a different complexity level. A method for
deciphering substitution ciphers with low-order models of the
language, with a case study on Indic scripts, is provided in the
present work. An attempt is made to analyse the text-based crypto
model using the probability distribution of character code points as
a parameter, with a specific study on Indic scripts. The complex
orthographic nature of Indic scripts is explored while studying the
impact of the probability distribution of character code points in
the cryptanalysis of text retrieval. An extensive analysis is carried
out on a large set of character code points of Indic scripts compiled
from the present usage of the scripts. The encryption and decryption
process is tested on English and also on Telugu, Kannada and Hindi
with different key sizes. A comparison between the language
complexities of English and the Indic scripts is presented from the
standpoint of the probability distribution of character code points
while performing cryptanalysis. Evaluation of the model is carried
out with the help of the probability distribution as one of the
prominent characteristics of the text. Cryptanalysis is also carried
out using the conditional probability distribution, and an
improvement in the percentage of matches in the reverse
transformation is observed. Reduced mapping efficiency is an
indicative measure of language complexity with specific reference to
the probability distribution of character code points. This reflects
the fact, observed from the results, that the reverse mapping is much
more complex in the case of Indic scripts.