CHAPTER 4
CRYPTANALYSIS USING
LANGUAGE MODEL
4.1 INTRODUCTION
In language modeling, n-gram models are probabilistic models of
text that use a limited amount of history, expressed as character or
word dependencies, where n refers to the number of characters or
words that participate in the dependency relation. A statistical
language model assigns a probability to a sequence of n successive
characters or words by means of a probability distribution. Language
modeling is used in several natural language processing applications
such as speech recognition, machine translation, parsing, information
retrieval and cryptanalysis.
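The chain-rule view of such an n-gram model can be sketched in a few lines of Python. This is an illustrative maximum-likelihood bigram model over characters, not the model built later in the chapter; the corpus string is a made-up toy example.

```python
from collections import Counter

def bigram_model(corpus):
    """Estimate character unigram and bigram counts from a corpus."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    return unigrams, bigrams

def sequence_probability(text, unigrams, bigrams):
    """P(c1..cn) = P(c1) * prod_i P(c_i | c_{i-1}), maximum likelihood."""
    total = sum(unigrams.values())
    if unigrams[text[0]] == 0:
        return 0.0
    prob = unigrams[text[0]] / total
    for prev, cur in zip(text, text[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]
        if prob == 0.0:
            break
    return prob

# Toy corpus; a real model would be trained on a large text sample.
unigrams, bigrams = bigram_model("the cat sat on the mat")
p = sequence_probability("the", unigrams, bigrams)
```

Unsmoothed maximum-likelihood estimates assign zero probability to unseen n-grams, which is why practical models add smoothing.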
The complex properties and characteristics of natural languages
play an important role in cryptanalysis. Different approaches to
cryptanalysis in the literature use language characteristics to
assess the strength of a cipher system. One such approach deals with
probability analysis, wherein determining the probability of each
symbol in the encrypted message leads to a prediction of the plain
text. Language characteristics such as the probability distribution
are reflected in the transformed text produced by encrypting the
message. This information
along with knowledge of the symbol frequencies of the language,
helps to determine which cipher text symbol maps to which plaintext
symbol.
Extensive statistical analysis of the probability distribution of
characters adds knowledge that helps to retrieve the plain text
message, at least partly. The probability distribution as a
parameter in the reverse mapping process is largely language
specific; in general, the probability characteristics differ from
language to language. In the case of English, owing to the small
size of the character set, the probability characteristics are
effectively reflected in the transformed data. If the set of
meaningful units is large, the complexity of the probability
characteristics has to be evaluated. Moreover, single-letter
probability analysis is helpful in obtaining an initial key and in
performing a more powerful bigram analysis.
An attempt is made to understand the reflection of probability
characteristics and their impact on cryptanalysis through a case
study on Telugu script. In the case of Indic scripts, code points
are considered the message units.
4.2 FEATURES OF INDIC SCRIPT
Every language has parameters through which its rules are embodied
in the sequences that make up a document. The complexity of a script
depends mainly on its character, word and sentence formation
methods. A document with a meaningful summary can be represented as
D → S → W → C, where D is the document and S, W and C are sentences,
words and characters respectively. In the case of English, C is
represented through a one-to-one correspondence with character code
points in any machine, whereas Indic script representation involves
a two-fold phenomenon: C corresponds in real terms to a syllable,
which in turn is represented as a set of multiple character code
points. C can therefore be written as C → Sy → CC, where Sy is a
syllable and CC is a character code point.
In Indic scripts, words are treated as sequences of syllables (the
basic unit). The script grammar is used to segment a word into
syllables. The units of orthography are syllables, which are
essentially C*V core syllables, where C denotes a consonant and V a
vowel. Vowel-suppressed consonant segments are also allowed. A
syllable is formed using a canonical structure given by ((C)C)CV. A
detailed analysis of the various possible combinations of this
canonical structure is carried out by Pratap et al. [PRA 2001]. The
possible decompositions of syllables are V, CV, CCV, CCCV, C, CC and
CCC, where V is an independent vowel and C is a vowel-suppressed
consonant. CV is the basic consonant-vowel core unit, found in two
forms: a bare consonant form, and a combination of a consonant and a
vowel sign. The groups CCV and CCCV are conjunct
formations in which one or two consonants are grouped with the CV
core unit. The other groups, CC and CCC, are also conjunct
formations, but without a vowel. Various special symbols such as
Anuswara, Visarga and other Sanskrit symbols exist alongside these
Aksharas, leading to a large number of possible combinations. Vowels
and vowel-suppressed consonants are treated as independent
syllables. Syllables with a CV core are influenced by vowel
modifiers; similarly, ((C)C)CV combinations are influenced by
consonant modifiers.
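The canonical decomposition above can be approximated with a regular expression over the Unicode Telugu block. This is a hedged sketch, not the thesis's segmenter: the ranges cover only independent vowels, consonants, dependent vowel signs and the virama, ignoring Anuswara, Visarga and other special signs.

```python
import re

# Simplified Unicode ranges for Telugu (assumption: special signs ignored)
V  = "[\u0C05-\u0C14]"   # independent vowels
C  = "[\u0C15-\u0C39]"   # consonants
VS = "[\u0C3E-\u0C4C]"   # dependent vowel signs
H  = "\u0C4D"            # virama (halant)

# ((C)C)CV canonical structure: up to two conjunct consonants, each
# followed by a virama, then a consonant with an optional vowel sign
# (a trailing virama marks a vowel-suppressed C/CC/CCC group), or an
# independent vowel on its own.
SYLLABLE = re.compile(f"(?:{C}{H}){{0,2}}{C}{VS}?{H}?|{V}")

def syllables(word):
    """Split a Telugu word into canonical-structure syllables."""
    return SYLLABLE.findall(word)

# CCV conjunct (ka + virama + ra + vowel sign i) followed by a bare C.
word = "\u0C15\u0C4D\u0C30\u0C3F\u0C37"
parts = syllables(word)
```

A production segmenter would also handle nasalization and length marks, but the regex is enough to illustrate how a variable number of code points collapses into one syllable.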
The Indian Standard Code for Information Interchange (ISCII) is the
character code for the Indian languages, which derive from the
Brahmi script. It was evolved by a committee under the Department of
Electronics during 1986-88 and adopted in 1991 by the Bureau of
Indian Standards (BIS). The ISO-10646 and Unicode standards define
their repertoires for the written scripts of the world. ISCII is an
8-bit encoding that uses escape sequences to announce the particular
Indic script represented by the following coded character sequence.
Unicode is designed as a multilingual encoding scheme that requires
no escape sequences or switching between scripts. Except for a few
minor differences, ISCII and Unicode correspond directly, and the
layout is as shown in Figure 4.1. For any given Indic script, the
consonant and vowel codes of Unicode are based on ISCII. ISCII
combines letters with the characters NUKTA, INV and HALANT to
allow control over character formation. Unicode provides the same
using ZWJ & ZWNJ characters.
In Telugu, the first consonant forms the CV cluster, and the other
consonants after the CV cluster appear in dependent form. The basic
structure [Pratap et al.] deals with vowels, consonants, and
characters formed from a consonant plus a vowel sign. The other
characters are coded with the help of these three groups plus the
special signs Virama, Anuswara and Visarga. The possible groups for
conjuncts and their code sequences are shown in Table 4.1.
Table 4.1 ISCII/Unicode Code sequences for Conjuncts

Conjunct Character   Code sequence
CCA                  C + Virama + C
CCV                  C + Virama + C + Vs
CCCA                 C + Virama + C + Virama + C
CCCV                 C + Virama + C + Virama + C + Vs
CC                   C + Virama + C + Virama
CCC                  C + Virama + C + Virama + C + Virama
Figure 4.1 Basic Telugu Syllable Layout (a base symbol with
subscript, superscript, pre-symbol and post-symbol positions)
In the case of Indic scripts, there is a many-to-one correspondence
in the form of code sequences: a syllable is represented by a
non-uniform set of code points. For example, consider the word
NEWZELAND in English, which contains 9 basic units, the characters
N, E, W, Z, E, L, A, N, D, where each character has a fixed size of
1 byte. For Indic scripts, however, the basic unit, the syllable, is
a combination of several character codes. Writing the above word in
Telugu gives a word of 4 syllables. The first syllable is a CCV
structure that occupies 5 bytes of memory; the second is a CV
structure that occupies 2 bytes; the third is a structure that
occupies 3 bytes; and the fourth is a structure that occupies 2
bytes. Each syllable thus has a varying size, based on its canonical
structure, that can range from 1 to 10 bytes. This complexity in the
script is of much use in the process of cryptography.
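The variable syllable size can be checked directly on the code points. A minimal sketch, using a hypothetical CCV syllable of four code points; note the byte counts quoted above assume one byte per ISCII code point, whereas UTF-8 needs three bytes for each Telugu code point.

```python
# Hypothetical CCV syllable: ka + virama + ra + vowel sign i.
syllable = "\u0C15\u0C4D\u0C30\u0C3F"

code_points = len(syllable)                 # number of code points
iscii_bytes = code_points                   # 1 byte per code point in ISCII
utf8_bytes = len(syllable.encode("utf-8"))  # 3 bytes per Telugu code point
```

The same one-line measurement applied to English text always returns one code point per character, which is exactly the one-to-one mapping described above.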
4.3 PROBABILITY DISTRIBUTION OF CHARACTERS
The basic unit of script description is the syllable, defined by the
canonical structure ((C)C)CV. The machine representation of this
structure is composed of a set of character code points defined in
the Unicode code chart. Human perception of these code points is
non-linear, whereas machine perception is linear, as illustrated in
Figure 4.3 and Figure 4.4. This non-linearity differs between
languages, as shown in Figure 4.2 and Figure 4.5. Even though
syllables are the meaningful units of the script and abide by its
specific grammar rules, the character code points in the machine
representation are perceived as a reflection of these grammar rules.
It is necessary to understand the complex nature of the script
through the usage of its syllables, which has changed over time.
Figure 4.2 Probability distribution of characters of English
Figure 4.4 Probability distribution of characters of Telugu
Figure 4.3 Sorted Probability distribution of characters of English
In the actual transformation, the character code points are
transformed with the help of a cryptosystem. This transformation is
carried out onto a different plane where the mapping is a reversible
phenomenon: the text is first transformed into the bit-stream
domain and then into another domain. Both the original text and its
bit stream are human-understandable, but the transformed text cannot
be understood. The transformation characteristics of the meaningful
units, from the standpoint of their probability characteristics, are
the point of interest in the present work. Generally, the
probability characteristics differ from language to language.
Figure 4.5 Sorted Probability distribution of characters of Telugu
In the present work, the machine representation of character code
points is considered, and their probability distribution is adopted
as one source of information for cryptanalysis. The script
complexity of Indic scripts, in the form of this probability
distribution, is the basis for the proposed model. Many attempts
have been made to extract the probability distribution of the basic
alphabet from Latin text. One such analysis of the probability
distribution of characters in Latin text, carried out on a sample of
over 10,00,000 character code points, demonstrated the dominance of
a small set of characters in regular usage. A similar approach is
extended in the present work to evaluate the characteristics of the
variable character code points that are embodied in the syllables of
Telugu, Kannada and Hindi text.
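The percentage distributions reported in the tables below can be computed with a simple frequency count. A minimal sketch, using a toy string in place of the 32,00,000-code-point corpus.

```python
from collections import Counter

def code_point_distribution(corpus):
    """Percentage of the sample accounted for by each code point."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {cp: 100.0 * n / total for cp, n in counts.items()}

# Toy sample: 'a' accounts for 5 of the 11 code points.
dist = code_point_distribution("abracadabra")
```

Sorting the resulting dictionary by value reproduces the ranked tables used throughout this section.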
A sample of 32,00,000 character code points is used for the above
analysis, compiled mainly from present-day usage of the language:
passages from numerous newspapers, novels, stories, songs, sports
coverage and literature. For text in Telugu, the probabilities,
expressed as a percentage of the character code points of the
alphabet in the sample, are presented in Table 4.2. Certain
frequencies in the table are zero because they belong to characters
deprecated in current usage of the language; zero frequencies are
observed, for instance, for the Telugu digits 0 to 9, which are not
used in general text. An interesting
phenomenon is observed in the probability distribution of character
code points. The highest probability among vowels, about 1%, is
associated with the vowel U+0C05. All other vowels are observed with a
Table 4.2 Probability Distribution of character code points of
Telugu script (table body not reproduced: the Telugu glyphs were
lost in extraction; only the probability percentages, from 8.86%
down to 0.00%, survive)
probability less than or equal to 0.5%. Among consonants, the
highest probability, 6.2%, is associated with a single consonant,
and only four consonants have a probability greater than 4%. Among
vowel signs, only three are observed with a probability of around
7%. This phenomenon is associated mainly with the CV core, which is
reported to account for 54% of the syllable structure. The nasal
symbol is observed with 4.7% probability, and the highest
probability of all, 8.86%, is associated with the Halant. It is
quite interesting that the Halant is not treated as a syllable at
all; nevertheless, it plays a significant role in the conjunct
formation of syllables.
For Kannada, a sample of 16,94,650 character code points is used for
the same analysis, again compiled mainly from present-day usage of
the language: passages from numerous newspapers, novels, stories,
songs, sports coverage and literature. Table 4.3 shows the
probabilities, expressed as a percentage of the character code
points of the alphabet in the sample. Certain frequencies in the
table are zero, as they belong to characters deprecated in the
present usage of the language. The highest probability, 9.25%, is
associated with a single character.
Table 4.3 Probability Distribution of character code points of
Kannada script (table body not reproduced: the Kannada glyphs were
lost in extraction; the probabilities range from 9.25% down to
0.00%)
(Table 4.4 body not reproduced: the Hindi glyphs were lost in
extraction; the probabilities range from 8.76% down to 0.00%)
For Hindi, a sample of 9,36,707 character code points is used for
the analysis, compiled mainly from present-day usage of the
language. The highest probability, 8.76%, is associated with a vowel
sign character; among consonants, the highest probability is 6.57%,
as listed in Table 4.4.
Table 4.4 Probability Distribution of character code points of Hindi script
S.No.  Code Point  Probability %    S.No.  Code Point  Probability %
1      E           12.90            14     P           2.62
2      T           9.65             15     U           2.57
3      O           7.73             16     F           2.37
4      S           7.40             17     G           1.86
5      N           7.20             18     B           1.59
6      I           7.19             19     W           1.40
7      A           7.16             20     Y           1.18
8      R           6.65             21     V           0.89
9      C           4.52             22     K           0.75
10     H           3.92             23     X           0.30
11     L           3.81             24     Q           0.19
12     D           3.26             25     J           0.12
13     M           2.67             26     Z           0.09
For English, a sample of 10,00,000 character code points is used for
the analysis, compiled mainly from present-day usage of the
language. The highest probability, 12.9%, is associated with the
character E, whereas the minimum probability, 0.09%, is associated
with the character Z, as listed in Table 4.5.
Just as single letters have typical probability distributions,
multiple-letter combinations also occur with varying yet predictable
probabilities. Extending the unconditional probability approach, the
probabilities at which bigrams and trigrams occur in the text are
determined. For Telugu, approximately 4096 bigrams are possible; the
26 most frequently occurring bigrams are listed in Table 4.6. The
Table 4.5 Probability Distribution of character code points of English script
(Table 4.6 body not reproduced: the Telugu bigram glyphs were lost
in extraction; the probabilities range from 1.41% down to 0.70%)
most frequent bigram has a probability of 1.41%. From these values
it is easy to infer that the bigrams form a set of clusters around
specific values, which increases the complexity of reverse mapping.
A similar evaluation of the probability distribution is carried out
for Kannada and Hindi; the 26 most probable character bigrams are
listed in Table 4.7 and Table 4.8 respectively. The most frequent
bigram has a probability of 1.34% in Kannada and 1.10% in Hindi. As
in Telugu, a similar clustering of bigrams is observed in Kannada
and Hindi.
Table 4.6 Probability Distribution of most frequent bigram character code points of Telugu script
(Tables 4.7 and 4.8 bodies not reproduced: the Kannada and Hindi
bigram glyphs were lost in extraction; the probabilities range from
1.34% down to 0.62% for Kannada and from 1.10% down to 0.52% for
Hindi)
Table 4.7 Probability Distribution of most frequent bigram character code points of Kannada script

Table 4.8 Probability Distribution of most frequent bigram character code points of Hindi script
S.No.  Code Point  Probability %    S.No.  Code Point  Probability %
1      TH          2.47             14     NT          1.23
2      ER          1.96             15     AN          1.22
3      HE          1.96             16     ET          1.15
4      IN          1.83             17     SE          1.02
5      ES          1.68             18     ED          1.02
6      ON          1.60             19     TO          1.00
7      RE          1.44             20     CO          0.99
8      TE          1.37             21     EC          0.96
9      ST          1.37             22     IS          0.95
10     TI          1.35             23     RO          0.91
11     EN          1.33             24     ND          0.88
12     AT          1.25             25     IT          0.87
13     OR          1.23             26     AR          0.86
For English, 676 bigrams are logically possible. The 26 most
frequently occurring bigrams are listed in Table 4.9. The bigram TH
has the highest probability, 2.47%. From these values it is again
easy to infer that the bigrams form a set of clusters around
specific values. These 26 bigrams of English account for 35% of the
total distribution, whereas for the Indic scripts the corresponding
share is in the range of 19 to 22%. Reverse mapping is thus more
difficult for Indic scripts than for English.
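The coverage figures quoted above (35% for the top 26 English bigrams versus 19 to 22% for Indic scripts) can be estimated with a sketch like the following; the corpus here is a toy string, not the thesis corpus.

```python
from collections import Counter

def top_ngram_coverage(corpus, n, k):
    """Share of all n-gram occurrences held by the k most frequent."""
    grams = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    total = sum(grams.values())
    top = grams.most_common(k)
    coverage = 100.0 * sum(count for _, count in top) / total
    return top, coverage

# Toy corpus: "th" occurs 4 times among 22 bigrams.
top, cov = top_ngram_coverage("the theme of the thesis", 2, 5)
```

A high top-k coverage means a few frequent n-grams dominate and reverse mapping is easier; the flatter Indic distributions spread the mass across many more code point pairs.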
An extended analysis, from the bigram probability distribution to
the trigram probability distribution of code points, provides better
knowledge of the language characteristics.

Table 4.9 Probability Distribution of most frequent bigram character code points of English script

For Telugu, a total of almost 2,63,000 trigram code points is
possible. Table 4.10 shows the top 26 trigrams based on probability
distribution; the most frequent trigram code point has a probability
of 0.86%. Because of the large number of possible trigrams, more
code points are clustered around specific values, which makes the
mapping process more complex. Similarly, out of the 4,20,000
trigrams possible in Kannada, Table 4.11 lists the top 26 code
points of the trigram distribution; the most frequent has a
probability of 1.16%. For Hindi, a total of around 7,29,000 trigrams
is possible, of which the most frequent has a probability of 0.74%,
as listed in Table 4.12. Because of this huge trigram space, reverse
mapping in Hindi is much more complex.
(Table 4.10 body not reproduced: the Telugu trigram glyphs were lost
in extraction; the probabilities range from 0.86% down to 0.20%)
Table 4.10 Probability Distribution of most frequent trigram character code points of Telugu script
(Table 4.11 body not reproduced: the Kannada trigram glyphs were
lost in extraction; the probabilities range from 1.16% down to
0.17%)
(Table 4.12 body not reproduced: the Hindi trigram glyphs were lost
in extraction; the probabilities range from 0.74% down to 0.13%)
Table 4.11 Probability Distribution of most frequent trigram character code points of Kannada script

Table 4.12 Probability Distribution of most frequent trigram character code points of Hindi script
For English, a total of 17,576 trigrams is possible, far fewer than
for the Indic scripts. Table 4.13 lists the 26 most frequent
trigrams. The trigram THE has the highest probability, 1.76%,
followed by ION with 0.73% and ING with 0.65%. Because the trigram
space of English is smaller than that of the Indic scripts, reverse
mapping is less complex for English.
S.No.  Code Point  Probability %    S.No.  Code Point  Probability %
1      THE         1.76             14     EST         0.31
2      ION         0.73             15     ERE         0.31
3      ING         0.65             16     ATE         0.30
4      TIO         0.64             17     USE         0.29
5      AND         0.55             18     AGE         0.28
6      ENT         0.52             19     STH         0.28
7      FOR         0.45             20     HER         0.28
8      PRO         0.44             21     THA         0.27
9      CON         0.42             22     ONS         0.26
10     ESS         0.40             23     ECT         0.26
11     TER         0.39             24     NTH         0.25
12     ATI         0.39             25     ONT         0.25
13     INT         0.32             26     ETH         0.25
Table 4.13 Probability Distribution of most frequent trigram character code points of English script
4.4 CONDITIONAL PROBABILITY DISTRIBUTION OF
CHARACTERS
The statistical influences extending over n symbols of the text
provide better a priori knowledge to the system to achieve
consistency. For this purpose, the conditional probability of a
character given the preceding (n-1) characters needs to be
calculated. The conditional probability P(A|B) is the probability of
some event A given the occurrence of some other event B:

P(A|B) = P(A,B)/P(B)                                    (4.1)

where P(A,B) is the joint probability, i.e. the probability of the
two events occurring together, and P(B) is the unconditional
probability of the event B.
The unconditional probability distributions of the different unigram
character code points are calculated from a large corpus of the
language. The joint probabilities of all possible combinations of
character code points for a given character are also calculated.
This process is repeated for every character code point of the
language. Using these unconditional and joint probabilities together
with expression (4.1), the conditional probabilities of all
character code points of the language can be calculated. This
procedure is applied to four different languages: English, Telugu,
Kannada and Hindi.
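Expression (4.1) reduces to a ratio of counts when estimated from a corpus. A minimal sketch, estimating P(next | given) for every character; the example string is illustrative, not corpus data.

```python
from collections import Counter

def conditional_probabilities(corpus, given):
    """P(A|B) = P(B,A)/P(B): chance that `given` is followed by each
    character, estimated from bigram and unigram counts."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n_given = unigrams[given]
    return {ch: bigrams[(given, ch)] / n_given for ch in unigrams}

# In "mississippi", 's' is followed twice by 's' and twice by 'i'.
cond = conditional_probabilities("mississippi", "s")
```

Repeating the call for every character of the alphabet reproduces a table of the same shape as Table 4.14.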
The conditional probabilities for the English character S are listed
in Table 4.14. From the table, it is evident that ST has the highest
conditional probability, i.e., T is the character most likely to
follow S; similarly, SZ has the minimum conditional probability. The
same computation is done for all 26 characters of the language.
Conditional probabilities for all characters of Telugu, Kannada and
Hindi are computed likewise. For illustration, Table 4.15 shows the
conditional probabilities for a Telugu character, Table 4.16 for a
Kannada character, and Table 4.17 for a Hindi character. These
conditional probabilities are likewise calculated for all characters
of each language.
S.No.  Char  Prob %  Cond Prob    S.No.  Char  Prob %  Cond Prob
1      ST    7.40    2.49         14     SY    7.40    0.25
2      SE    7.40    1.90         15     SB    7.40    0.24
3      SS    7.40    1.38         16     SR    7.40    0.20
4      SI    7.40    1.31         17     SN    7.40    0.20
5      SA    7.40    1.25         18     SD    7.40    0.15
6      SO    7.40    1.02         19     SL    7.40    0.14
7      SC    7.40    0.54         20     SK    7.40    0.04
8      SU    7.40    0.52         21     SG    7.40    0.04
9      SP    7.40    0.52         22     SV    7.40    0.03
10     SH    7.40    0.38         23     SJ    7.40    0.02
11     SM    7.40    0.32         24     SQ    7.40    0.01
12     SF    7.40    0.28         25     SX    7.40    0.01
13     SW    7.40    0.28         26     SZ    7.40    0.00
Table 4.14 Conditional Probability Distribution of character code point S of English script
(Table 4.15 body not reproduced: the Telugu glyphs were lost in
extraction; the character's unigram probability is 5.28%, and its
conditional probabilities range from 3.78 down to 0.00)
Table 4.15 Conditional Probability Distribution of character
code point of Telugu script
(Table 4.16 body not reproduced: the Kannada glyphs were lost in
extraction; the character's unigram probability is 5.51%, and its
conditional probabilities range from 3.60 down to 0.00)
Table 4.16 Conditional Probability Distribution of character
code point of Kannada script
(Table 4.17 body not reproduced: the Hindi glyphs were lost in
extraction; the character's unigram probability is 2.69%, and its
conditional probabilities range from 5.56 down to 0.00)
Table 4.17 Conditional Probability Distribution of character
code point of Hindi script
4.5 ENCRYPTION AND DECRYPTION
The proposed model defines meaningful units that are embedded
in text documents as essential units and also treated as meaningful
units in the form of character or byte stream. The byte stream is a
symbolic representation of text. In case of Indic scripts this byte
stream is a complex byte stream, where as in case of Latin text the
byte stream is one-to-one mapping. The present model addressed this
specificity by taking into consideration of words in the form of
syllables and extraction of byte stream from syllables.
Algorithm for Encryption of Indic Scripts
1: Divide the given text document into a set of words.
2: Divide each word into syllables (the basic unit).
3: For each syllable, generate the character code point byte stream,
   which may consist of the single or multiple code points that form
   that syllable.
4: Generate the bit stream for the byte stream generated in Step 3.
5: Apply the encryption technique to the bit stream generated in
   Step 4 with a randomly generated key stream, which results in the
   cipher text.
6: Repeat Steps 3 to 5 for each syllable generated in Step 2.
Figure 4.6 Algorithm for Encryption of Indic Scripts
This byte stream consists of single or multiple code point units.
The code point streams derived from the syllables are converted to a
bit stream, which then undergoes a transformation similar to that of
any conventional system. A key stream is generated [KAT 2005] using
an efficient random number generator. A transformation function is
applied with this key stream, resulting in the cipher text. The
processes of encryption and decryption are illustrated in Figure 4.6
and Figure 4.7, and the cryptographic model for Indic scripts is
illustrated in Figure 4.8.
Algorithm for Decryption of Indic Scripts
1: Generate the bit stream for the cipher text.
2: Apply the decryption technique to the bit stream generated in
   Step 1 with the key stream generated during encryption, resulting
   in a byte stream.
3: Combine the bit streams of Step 2 to form the code point byte
   stream.
4: Combine the code point byte streams of Step 3 to form syllables.
5: Combine the syllables to form words, and the words to form the
   text document.
6: Repeat Steps 1 through 5 for all byte streams in the cipher text.
Figure 4.7 Algorithm for Decryption of Indic Scripts
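The two algorithms above can be sketched as a repeating-key XOR over the code point byte stream. This is an assumed stand-in: the thesis does not specify the transformation function, only that a randomly generated 8/16/32-bit key stream is applied to the bit stream; a repeating-key XOR is the simplest reversible choice.

```python
import os

def keystream(length, key):
    """Repeat a short key over the whole byte stream (assumption:
    the unspecified key-stream generator is modeled as repetition)."""
    return bytes(key[i % len(key)] for i in range(length))

def encrypt(text, key):
    data = text.encode("utf-8")              # code point byte stream
    ks = keystream(len(data), key)
    return bytes(a ^ b for a, b in zip(data, ks))

def decrypt(cipher, key):
    ks = keystream(len(cipher), key)
    return bytes(a ^ b for a, b in zip(cipher, ks)).decode("utf-8")

key = os.urandom(2)                          # 16-bit random key
plain = "\u0C15\u0C4D\u0C30\u0C3F"           # a Telugu syllable
cipher = encrypt(plain, key)
```

Because XOR is its own inverse, applying the same key stream twice recovers the plain text, matching Steps 1 and 2 of the decryption algorithm.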
(Figure 4.8 depicts the flow: text document → words → syllables →
code point byte stream, as single or multiple code points → bit
stream → encryption function with key stream → cipher text, and the
reverse path through the decryption function back to the text
document.)
Figure. 4.8 Flow chart for Encryption and Decryption of
Indic scripts
The proposed cryptographic model is tested on four languages:
English, Telugu, Kannada and Hindi. The encryption algorithm is
applied to texts of different sizes. For this process a key is
generated randomly using an OS-based random generator; simple 8-bit,
16-bit and 32-bit key streams are used in the present work. The
plain text is encrypted using the proposed algorithm and the
randomly generated key, resulting in the cipher text.
Figure 4.9 Sample Plain Text, Encrypted Text and
Decrypted Text in English
Figure 4.10 Sample Plain Text, Encrypted Text and
Decrypted Text in Telugu
4.6 CRYPTANALYSIS USING LANGUAGE MODEL
In a conventional cryptographic system, a plain text message m is
generated by the sender. An encryption transformation E, which
depends on a secret key k, transforms the plain text m into the
cipher text c via c = Ek(m). The cipher text c is then transmitted
to the receiver, where the decryption transformation D, which also
depends on the secret key k (in the case of symmetric key
cryptography), is used to recover the plain text m via m = Dk(c). The
information flow is illustrated in Figure 4.11. The assumption is
that an opponent does not possess k and cannot recover m from c
using D (note that the algorithms D and E may be kept public). For
the key k to remain secret, a secure communication channel is needed
between the sender and the receiver.

Figure 4.11 Decipherment Model that uses language statistics
The information available to a cryptanalyst varies: it may be only
the cipher text, or complete knowledge of the system (except the
key), the algorithm used, the characteristics of the language and
other language statistics. If s represents the information available
to an opponent and D1 represents the process of cryptanalysis, then
the deduced information m1 is expressed as m1 = D1(s). The degree of
coincidence between m and m1 is a measure of the strength of the
system.
Figure 4.12 Sample retrieved Plain Text in English
Figure 4.13 Sample retrieved Plain Text in Telugu
In the present work, the probability distribution of the different
n-gram character code points is taken as the a priori knowledge for
cryptanalysis. The probabilities of the different characters in the
cipher text are calculated and tabulated. A mapping is made between
the characters of the plain text and the cipher text based on the
probability distribution; the characters in the cipher text are then
replaced with the mapped plain text characters, and the percentage
of plain text retrieved is calculated.
When English text is considered, the problems are fewer because the
correspondence between the transformed text and the original text is
direct. Though the key is generated randomly, since it is fixed, the
mapping function transforms each character into a distinct point in
the orthogonal plane. For large text sizes almost all characters are
present, and this holds even for medium-sized text because of the
small number of characters that exist. Moreover, because of the
one-to-one mapping, predictability is higher. The percentage of
retrieved code points is calculated using the probability
distribution.
If Indic scripts are considered, the character codes that occur in
the original text need not cover the complete set. Even though the
mapping function maintains one-to-one correspondence, not all
character codes from the original set of code points may appear in
the transformed text. This may lead to confusion in the
cryptanalysis. A thresholding function is therefore adopted in the
cryptanalysis process for reverse mapping. The percentage of plain
text retrieved is listed in Table 4.18. The result for Telugu,
Kannada and Hindi is relatively low compared with English, which is
due to the complex nature of Indic scripts.

Table 4.18. Retrieved Plain Text using Unigram probability
S.No.  Cipher text length   Retrieval % using unigram probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 29.97    22.51   27.86    19.88
2      12000                35.93    26.68   27.96    31.55
3      25000                40.40    23.77   38.95    23.83
4      40000                42.34    16.90   26.73    27.10
5      50000                38.33    08.81   28.03    27.89
6      70000                63.17    23.60   26.11    26.31
7      90000                48.04    17.44   25.58    25.93
8      110000               78.59    18.76   25.70    27.22
       Average              47.10    19.81   28.37    26.21
The encryption algorithm is implemented on Telugu text samples of
different sizes. For this process an eight-bit key is generated using
an OS-based random number generator. Plain text is encrypted using
the proposed algorithm and the randomly generated 8-bit key,
resulting in cipher text. The frequencies of the different characters
in the cipher text are extracted. Mapping is carried out between the
characters of the plain text and the cipher text based on these
frequencies. The characters in the cipher text are then replaced with
the mapped plain-text characters, and the percentage of exact
retrieval compared with the plain text is calculated and presented in
Table 4.18.
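The key-generation step can be sketched as below. Since the proposed encryption algorithm itself is not reproduced in this section, a byte-wise XOR with the 8-bit key is used as a hypothetical stand-in for the transformations Ek and Dk; it is illustrative only, not the thesis's cipher.

```python
import os

def generate_key() -> int:
    """Draw an 8-bit key from the OS random generator, as described above."""
    return os.urandom(1)[0]

def xor_transform(data: bytes, key: int) -> bytes:
    """Hypothetical stand-in for Ek / Dk: XOR with a fixed byte is its own
    inverse, so the same function encrypts and decrypts."""
    return bytes(b ^ key for b in data)
```

Because XOR is an involution, `xor_transform(xor_transform(m, k), k)` returns the original plain text, mirroring the m = Dk(Ek(m)) relation of the model.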
From the table it is easy to infer that cryptanalysis of text in
complex languages like Telugu, Kannada and Hindi is much more
difficult. Hence the larger key size applicable to Latin text can be
reduced for applications in complex languages like Telugu. The
percentage of plain text retrieved is not linear with text size
because a proper threshold function is required to map cipher-text
symbols to the corresponding plain-text symbols.
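One plausible form of such a threshold function is to accept a cipher-to-plain mapping only when the probability gap is small, leaving uncertain symbols unmapped (the empty cells in the mapping tables below). The threshold value and the nearest-probability rule here are assumptions, not the thesis's exact function.

```python
from collections import Counter

def thresholded_mapping(cipher_text: str, ref_probs: dict, threshold: float = 0.02) -> dict:
    """Map each cipher symbol to the plain symbol with the nearest reference
    probability, but only when the absolute gap is below the threshold;
    otherwise the symbol is left unmapped."""
    n = len(cipher_text)
    mapping = {}
    for sym, count in Counter(cipher_text).items():
        p = count / n
        # plain symbol whose reference probability is closest to p
        best = min(ref_probs, key=lambda s: abs(ref_probs[s] - p))
        if abs(ref_probs[best] - p) < threshold:
            mapping[sym] = best
    return mapping
```

Raising the threshold maps more symbols but risks wrong pairings; lowering it leaves more cells empty, which matches the non-linear retrieval behaviour noted above.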
Using the unconditional probability distribution approach, the
decipherment model is applied to cipher texts of four languages with
sizes ranging from 6000 to 110000 character code points. The
retrieved plain text percentages are tabulated in Table 4.18. English
text resulted in a maximum retrieval of 78.59% and a minimum of
29.97%. For Telugu a maximum of 26.68% and a minimum of 8.81% of
plain text is retrieved. The retrieved text for Kannada lies in the
range of 38.95% to 25.58%, and for Hindi in the range of 31.55% to
19.88%.
Table 4.19. Mapping process in decipherment for a sample of most
frequently occurring characters of English using unconditional
probability
From the observed results it can be concluded that the reverse
mapping is more complex in the case of Indic scripts (with specific
reference to Telugu), even with smaller key sizes.
The mapping of fourteen independent characters of the English
language on samples varying from 6000 to 110000 code points is
presented in Table 4.19. An empty cell indicates that the character
is not mapped correctly. The mapping varies with sample size due to
the heuristic nature of the language, which results in variation of
the retrieval percentage. Table 4.20 and Table 4.21 illustrate
similar computations, with associated results for the code points of
Hindi and Telugu, and a similar behaviour is observed. This behaviour
results in inconsistent variation in the mapping process.
S.No  Sample size   Plain text characters
                    E  T  O  S  N  I  A  R  C  H  L  D  M  P
1     6000          E  T                 R
2     12000         E  T                 R
3     25000         E  T                 R     H     D  M
4     40000         E  T                 R     H     D  M
5     50000         E  T                 R           D  M
6     70000         E  T  O  S     I     R  C
7     90000         E  T                 R        L  D  M
8     110000        E  T  O  S     I     R  C  H  L  D  M
Table 4.20. Mapping process in decipherment for a sample of most
frequently occurring characters of Hindi using unconditional
probability

Table 4.21. Mapping process in decipherment for a sample of most
frequently occurring characters of Telugu using unconditional
probability
Now the unconditional probability approach is extended by using the
probabilities with which bigrams and trigrams occur within the cipher
text. These probabilities are mapped to the known probability
distribution of the original language. Using this reverse mapping,
the plain text is retrieved; the percentages of plain text retrieved
for the four languages with varying sample sizes using bigrams and
trigrams are listed in Table 4.22 and Table 4.23.

Using the unconditional probability distribution of bigram character
code points, the decipherment model is applied to cipher texts of the
four languages with sizes ranging from 6000 to 110000 character code
points. The retrieved plain text percentages using bigram statistics
are tabulated in Table 4.22. English text resulted in a maximum
retrieval of 10.77% and a minimum of 2.82%. For Telugu a maximum of
2.86% and a minimum of 0.03% of plain text is retrieved. The
retrieved text for Kannada lies in the range of 4.57% to 2.20%, and
for Hindi in the range of 3.90% to 0.07%. The retrieved percentage of
plain text is higher for English than for the Indic scripts.

Table 4.22. Retrieved Plain Text using bigram probability

S.No.  Cipher text length   Retrieval % using bigram probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 5.20     0.87    2.20     0.07
2      12000                2.82     2.47    2.52     3.47
3      40000                7.85     2.86    3.43     3.40
4      50000                9.22     2.13    3.84     2.02
5      70000                6.64     0.91    3.83     3.90
6      90000                10.77    0.03    4.57     3.49
7      110000               9.60     1.80    2.70     0.44
       Average              7.44     1.58    3.30     2.40

The same process is repeated using the trigram probability
distribution of character code points on cipher texts of the four
languages with sizes ranging from 6000 to 110000 character code
points. The retrieved plain text percentages using trigram statistics
are tabulated in Table 4.23. English text resulted in a maximum
retrieval of 4.18% and a minimum of 2.25%. For Telugu a maximum of
2.20% and a minimum of 1.00% of plain text is retrieved. The
retrieved text for Kannada lies in the range of 5.04% to 1.45%, and
for Hindi in the range of 1.95% to 0.60%. The retrieved percentage of
plain text is higher for English than for the Indic scripts.

Table 4.23. Retrieved Plain Text using Trigram probability

S.No.  Cipher text length   Retrieval % using trigram probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 2.95     1.00    1.45     0.60
2      12000                3.80     1.20    4.72     1.95
3      25000                2.25     1.67    5.04     0.84
4      40000                3.10     2.20    3.00     1.27
5      50000                3.54     2.01    2.52     0.74
6      70000                3.61     1.22    2.33     0.99
7      90000                4.18     1.15    3.90     1.00
8      110000               2.36     1.61    3.03     0.62
       Average              3.22     1.51    3.25     1.00
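The extension from unigrams to bigrams and trigrams changes only the unit being counted; the rank-matching idea is unchanged. The following sketch illustrates this (the helper names `ngram_probs` and `rank_map` are hypothetical, introduced here for illustration):

```python
from collections import Counter

def ngram_probs(text: str, n: int) -> dict:
    """Unconditional probability of each n-gram of code points in the text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

def rank_map(cipher_probs: dict, plain_probs: dict) -> dict:
    """Pair cipher n-grams with plain n-grams by descending probability rank."""
    by_prob = lambda d: sorted(d, key=d.get, reverse=True)
    return dict(zip(by_prob(cipher_probs), by_prob(plain_probs)))
```

Because the number of distinct bigrams and trigrams is far larger than the number of distinct characters, their estimated probabilities are much noisier at these sample sizes, which is consistent with the low retrieval percentages in Tables 4.22 and 4.23.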
Using the conditional probability distribution approach, the
decipherment model is applied to cipher texts of the four languages
with sizes ranging from 6000 to 110000 character code points. The
retrieved plain text percentages are tabulated in Table 4.24. English
text resulted in a maximum retrieval of 57.85% and a minimum of
42.42%. For Telugu a maximum of 33.93% and a minimum of 24.72% of
plain text is retrieved. The retrieved text for Kannada lies in the
range of 38.30% to 29.77%, and for Hindi in the range of 34.71% to
25.00%.
Table 4.25 illustrates the mapping of the 14 most probable characters
using conditional probability. An empty cell indicates that for a
given sample size the character is not mapped correctly. When this is
compared with the mapping process using unconditional probability,
illustrated in Table 4.19, some characters that are not mapped
correctly by the unconditional approach are mapped properly by the
conditional probability approach. This results in an increased
retrieval percentage of the message text.
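The conditional statistics used here can be estimated as successor distributions P(next | current) over adjacent code points. The sketch below shows the estimation step only; how the rows of this distribution are matched between cipher text and language model is a design choice not fixed by this summary.

```python
from collections import Counter, defaultdict

def conditional_probs(text: str) -> dict:
    """Estimate P(next | current) from adjacent code-point pairs:
    a nested mapping current -> {next: probability}."""
    pair_counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        pair_counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
        for cur, counts in pair_counts.items()
    }
```

Each symbol is now characterised by a whole successor distribution rather than a single marginal probability, which gives the matching step more evidence per symbol and explains the higher retrieval percentages in Table 4.24.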
Table 4.24. Retrieved Plain Text using Conditional probability

S.No.  Cipher text length   Retrieval % using conditional probability
       (in code points)     English  Telugu  Kannada  Hindi
1      6000                 42.42    24.72   29.77    31.15
2      12000                49.65    27.29   35.94    34.71
3      25000                52.76    27.99   38.30    25.00
4      40000                48.76    26.92   36.07    25.74
5      50000                43.42    26.46   34.88    31.33
6      70000                56.33    33.93   30.69    31.01
7      90000                57.85    33.13   30.33    32.67
8      110000               57.35    29.88   31.31    34.55
       Average              51.07    28.79   33.41    30.77
Similar conclusions are drawn from Table 4.26 and Table 4.27, which
represent the mapping process for the character code points of Telugu
and Hindi respectively. For these languages the conditional
probability distribution approach resulted in better mapping
performance than the unconditional approach.
Table 4.25. Mapping process in decipherment for a sample of most
frequently occurring characters of English using conditional
probability

S.No  Sample size   Plain text characters
                    E  T  O  S  N  I  A  R  C  H  L  D  M  P
1     6000          E  T  O           A  R     H           P
2     12000         E  T  O  S     I  A  R     H  L     M
3     25000         E  T  O  S     I     R     H  L  D  M
4     40000         E  T  O  S     I     R     H  L  D
5     50000         E  T     S     I     R     H  L  D     P
6     70000         E  T  O  S  N  I     R     H  L  D  M  P
7     90000         E  T  O  S  N  I  A  R     H  L  D  M  P
8     110000        E  T  O  S  N  I     R     H  L  D  M  P

Table 4.26. Mapping process in decipherment for a sample of most
frequently occurring characters of Telugu using conditional
probability
Figure 4.14 Average Retrieved Plain Text using Unconditional and
Conditional probability of different languages

Table 4.27. Mapping process in decipherment for a sample of most
frequently occurring characters of Hindi using conditional
probability
The average retrieval efficiency for English is observed to be 47.1%
using unconditional probability and 51.07% using conditional
probability, as presented in Figure 4.14. Using unconditional
probability, the retrieved plain text percentages for Telugu, Kannada
and Hindi are observed to be 19.81%, 28.37% and 26.21% respectively,
and the corresponding values using conditional probability are
28.79%, 33.41% and 30.77%. It is observed from these results that
conditional probability yields a relatively larger retrieval
percentage than the unconditional probability distribution. Thus
using conditional probability as a priori knowledge yields increased
retrieval efficiency and improved consistency in the retrieval
percentage. The retrieval percentage for English is larger than that
for the Indic scripts, viz. Telugu, Kannada and Hindi, when
conditional probabilities are applied. Quite interestingly, the
retrieval efficiency associated with conditional probability is found
to be more stable than that of the probability distribution alone. In
the case of conditional probability, smaller text sizes (below 2000)
of all languages possess improved retrieval efficiency when compared
with the probability distribution alone. Adoption of conditional
probability allowed correct coincidence with more character code
points, which is not the case in the matching pattern of probability
alone, as illustrated in Table 4.20. Similar improvements are
observed for the above languages, where the number of code points is
relatively large.
In the process of providing security, the content of the message is
termed script-dependent text. In the world of multilingual data,
every script possesses a different complexity level. A method for
deciphering substitution ciphers with low-order models of the
language, with a case study on Indic scripts, is provided in the
present work. An attempt is made to analyse the text-based crypto
model using the probability distribution of character code points as
a parameter, with a specific study on Indic scripts. The complex
orthographic nature of Indic scripts is explored while studying the
impact of the probability distribution of character code points in
the cryptanalysis of text retrieval. An extensive analysis is carried
out on a large set of character code points of Indic scripts compiled
from the present usage of the scripts. The encryption and decryption
process is tested on English and also on Telugu, Kannada and Hindi
with different key sizes. A comparison between the language
complexities of English and the Indic scripts is presented from the
standpoint of the probability distribution of character code points
while performing cryptanalysis. Evaluation of the model is carried
out with the help of the probability distribution as one of the
prominent characteristics of the text. Cryptanalysis is also carried
out using the conditional probability distribution, and an
improvement in the percentage of matches in the reverse
transformation is observed. Reduced mapping efficiency is an
indicative measure of language complexity with specific reference to
the probability distribution of character code points. This reflects
the fact, observed from the results, that the reverse mapping is much
more complex in the case of Indic scripts.