CHAPTER 4 CRYPTANALYSIS USING LANGUAGE MODEL

Source: shodhganga.inflibnet.ac.in/bitstream/10603/2193/13/13...


    4. CRYPTANALYSIS USING LANGUAGE MODEL

    4.1 INTRODUCTION

    In language modeling, n-gram models are probabilistic models of
    text that use a limited amount of history, that is, character or word
    dependencies, where n is the number of characters or words that
    participate in the dependency relation. A statistical language model
    assigns a probability to a sequence of n successive characters or
    words by means of a probability distribution. Language modeling is
    used in several natural language processing applications such as
    speech recognition, machine translation, parsing, information
    retrieval and cryptanalysis.

    The complex properties and characteristics of natural languages
    play an important role in cryptanalysis. Several approaches to
    cryptanalysis in the literature use language characteristics to
    assess the strength of a cipher system. One such approach is
    probability analysis, in which the probability of each symbol in the
    encrypted message is determined and used to predict the plain text.
    Language characteristics such as the probability distribution of
    symbols survive encryption and are reflected in the transformed
    text. This information, along with knowledge of the symbol
    frequencies of the language, helps to determine which cipher text
    symbol maps to which plain text symbol.
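    The frequency-based mapping described above can be sketched as follows. This is a minimal illustration, not the procedure used in this work: the function names are hypothetical, the cipher is a toy Caesar shift, and the "language" frequency order is taken from the plain text itself so that the example is deterministic. In practice the order would come from a corpus.

```python
from collections import Counter

def rank_by_frequency(symbols):
    """Return the distinct symbols, most frequent first (ties broken by symbol order)."""
    counts = Counter(symbols)
    return sorted(counts, key=lambda s: (-counts[s], s))

def guess_mapping(cipher_text, language_order):
    """Pair the i-th most frequent cipher symbol with the i-th most frequent plain symbol."""
    return dict(zip(rank_by_frequency(cipher_text), language_order))

# Toy data (assumption): a Caesar shift of 3 applied to a short message.
plain = "SEETHETESTSET"
cipher = "".join(chr((ord(c) - 65 + 3) % 26 + 65) for c in plain)

mapping = guess_mapping(cipher, rank_by_frequency(plain))
recovered = "".join(mapping[c] for c in cipher)
```

    On real cipher text the frequency ranks only approximate the language ranks, so such a mapping is a starting point to be refined, for instance with bigram statistics.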

    Extensive statistical analysis of the probability distribution of
    characters provides additional knowledge that helps to recover at
    least part of the plain text message. As a parameter of the reverse
    mapping process, this probability distribution depends strongly on
    the language, and the probability characteristics generally differ
    from language to language. In English, because the character set is
    small, the probability characteristics are likely to be reflected
    clearly in the transformed data. When the set of meaningful units is
    large, the complexity of the probability characteristics has to be
    evaluated. Moreover, single-letter probability analysis helps to
    obtain an initial key, which can then be refined by the more powerful
    bigram analysis. This chapter attempts to understand how probability
    characteristics are reflected in the cipher text and how this affects
    cryptanalysis, with a case study on Telugu script. For Indic scripts,
    code points are treated as the message units.

    4.2 FEATURES OF INDIC SCRIPT

    Every language has rules that are embodied in the sequence of
    units used to compose a document. The complexity of a script depends
    mainly on how its characters, words and sentences are formed. A
    meaningful document can be represented as the hierarchy D, S, W, C,
    where D is the document and S, W and C are its sentences, words and
    characters respectively. In English, C is represented by a one-to-one
    correspondence with character code points on any machine, whereas
    the representation of Indic scripts is two-fold: C corresponds to a
    syllable, which in turn is represented as a set of multiple character
    code points. C can therefore be expressed in terms of Sy and CC,
    where Sy is a syllable and CC is a character code point.

    In Indic scripts, words are treated as sequences of syllables, the
    basic unit. The script grammar is used to segment a word into
    syllables. The units of orthography are syllables, which are
    essentially C*V core syllables, where C denotes a consonant and V a
    vowel. Vowel-suppressed consonant segments are also allowed. A
    syllable is formed using the canonical structure ((C)C)CV. Pratap et
    al. [PRA 2001] carried out a detailed analysis of the possible
    combinations of this canonical structure. The possible decompositions
    of a syllable are V, CV, CCV, CCCV, C, CC and CCC, where V is an
    independent vowel and C is a vowel-suppressed consonant. CV is the
    basic consonant-vowel core unit. It is found in two forms: the
    consonant form, and the combination of a consonant and a vowel sign.
    The groups CCV and CCCV are conjunct formations in which one and two
    consonants respectively are grouped with the CV core unit. The groups
    CC and CCC are conjunct formations without a vowel. Special symbols
    such as Anuswara, Visarga and other Sanskrit symbols also occur with
    these Aksharas, leading to a large number of possible combinations.
    Independent vowels and consonants are treated as syllables in their
    own right. Syllables with a CV core are influenced by vowel
    modifiers; similarly, ((C)C)CV combinations are influenced by
    consonant modifiers.

    The Indian Standard Code for Information Interchange (ISCII) is the
    character code for Indian languages whose scripts originate from the
    Brahmi script. It was evolved by a committee under the Department of
    Electronics during 1986-88 and adopted in 1991 by the Bureau of
    Indian Standards (BIS). The ISO-10646 and Unicode standards define
    their repertoires for the written scripts of the world. ISCII is an
    8-bit encoding that uses escape sequences to announce the particular
    Indic script represented by the coded character sequence that
    follows. Unicode is designed to be a multilingual encoding scheme
    that requires no escape sequences or switching between scripts.
    Except for a few minor differences, ISCII and Unicode correspond
    directly, and the layout is as shown in Figure 4.1. For any given
    Indic script, the consonant and vowel codes of Unicode are based on
    ISCII. ISCII combines letters with the characters NUKTA, INV and
    HALANT to allow control over character formation. Unicode provides
    the same control using the ZWJ and ZWNJ characters.

    In Telugu, the first consonant forms the CV cluster and the
    consonants that follow the CV cluster appear in dependent form. The
    basic structure [Pratap et al.] covers vowels, consonants and
    characters formed by a consonant plus a vowel sign. All other
    characters are coded with the help of these three groups plus the
    special signs Virama, Anuswara and Visarga. The possible groups of
    conjuncts and their code sequences are shown in Table 4.1.

    Table 4.1 ISCII/Unicode Code sequences for Conjuncts

    Conjunct Character   Code sequence
    CCA                  C + Virama + C
    CCV                  C + Virama + C + Vs
    CCCA                 C + Virama + C + Virama + C
    CCCV                 C + Virama + C + Virama + C + Vs
    CC                   C + Virama + C + Virama
    CCC                  C + Virama + C + Virama + C + Virama
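    The code sequences of Table 4.1 can be generated mechanically. The sketch below assumes Unicode Telugu code points and reads Vs as a vowel sign; the helper name `conjunct` and its parameters are hypothetical and serve only as an illustration.

```python
VIRAMA = "\u0c4d"  # TELUGU SIGN VIRAMA

def conjunct(consonants, vowel_sign="", final_virama=False):
    """Build the code sequence C + Virama + C ... with an optional vowel sign or final virama."""
    if vowel_sign and final_virama:
        raise ValueError("a syllable ends in either a vowel sign or a virama, not both")
    sequence = VIRAMA.join(consonants)
    if final_virama:
        sequence += VIRAMA
    return sequence + vowel_sign

KA, SSA = "\u0c15", "\u0c37"   # TELUGU LETTER KA, TELUGU LETTER SSA
SIGN_I = "\u0c3f"              # TELUGU VOWEL SIGN I

cca = conjunct([KA, SSA])                      # C + Virama + C
ccv = conjunct([KA, SSA], vowel_sign=SIGN_I)   # C + Virama + C + Vs
cc = conjunct([KA, SSA], final_virama=True)    # C + Virama + C + Virama

ccv_codes = [f"{ord(ch):04X}" for ch in ccv]
```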

    Figure 4.1 Basic Telugu Syllable Layout (positions: base symbol,
    subscript, superscript, pre symbol, post symbol)

    In Indic scripts there is a many-to-one correspondence between
    code sequences and syllables: a syllable is represented by a
    non-uniform number of code points. For example, the English word
    NEWZELAND contains 9 basic units, the characters
    N, E, W, Z, E, L, A, N, D, each of a fixed size of 1 byte. In Indic
    scripts, however, the basic unit, the syllable, is a combination of
    several character codes. Consider the above English

    word in Telugu, then it can be written as . The above

    word contains 4 syllables , , , .

    The syllable = + + + + which is a CCV

    structure that occupies 5 bytes of memory. The syllable = +

    which is a CV structure that occupies 2 bytes of memory. The syllable

    = + + which is a structure that occupies 3 bytes of

    memory. The syllable = + which is a structure that

    occupies 2 bytes of memory. Each syllable thus varies in size
    according to its canonical structure, from 1 to 10 bytes. This
    complexity of the script is of much use in cryptography.
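    The variable size of a syllable can be inspected directly. The sketch below is an illustration using Python's standard `unicodedata` module on an example syllable chosen here (not one from the text). The byte counts quoted above assume one byte per code point, as in an 8-bit code such as ISCII; under UTF-8, an assumption made only for this sketch, each Telugu code point takes three bytes.

```python
import unicodedata

def describe_syllable(syllable):
    """List (code point, Unicode name) for every code point in a syllable."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in syllable]

# Example syllable (assumption): KA + VIRAMA + SSA, a CCA conjunct.
syllable = "\u0c15\u0c4d\u0c37"
parts = describe_syllable(syllable)

code_points = len(syllable)                  # units of an 8-bit, one-byte-per-code encoding
utf8_bytes = len(syllable.encode("utf-8"))   # storage size under UTF-8
```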

    4.3 PROBABILITY DISTRIBUTION OF CHARACTERS

    The basic unit of script description is the syllable, defined by
    the canonical structure ((C)C)CV. The machine representation of this
    structure is composed of a set of character code points defined in
    the Unicode code chart. Human perception of these code points is
    non-linear, whereas machine perception is linear, as illustrated in
    Figure 4.3 and Figure 4.4. This non-linearity differs from language
    to language, as shown in Figure 4.2 and Figure 4.5. Although
    syllables are the meaningful units of the script, they are bound by
    the specific grammar rules of the script, and the character code
    points in the machine representation reflect these grammar rules. It
    is necessary to understand the complex nature of the script in terms
    of how its syllables are used, which changes over time.

    Figure 4.2 Probability distribution of characters of English
    (x-axis: letters A to Z; y-axis: 0 to 14)

    Figure 4.3 Sorted Probability distribution of characters of English
    (x-axis, in sorted order: E T A O I N S H R D L C U M W F G X P B V K
    J Q Y Z; y-axis: 0 to 14)

    Figure 4.4 Probability distribution of characters of Telugu
    (y-axis: 0 to 9)

    In the actual transformation, the character code points are
    transformed by the cryptosystem. The transformation maps the text
    onto a different plane, and the mapping is reversible. Text is first
    transformed to the bit stream domain and then to another domain.
    Both domains are well defined, but the transformed text or
    information cannot be understood by a human reader. How the
    meaningful units are transformed, seen from the standpoint of their
    probability characteristics, is the point of interest in the present
    work. Generally, the probability characteristics differ from
    language to language.

    Figure 4.5 Sorted Probability distribution of characters of Telugu
    (y-axis: 0 to 9)

    In the present work, the machine representation of character code
    points is considered, and their probability distribution is used as
    one source of information for cryptanalysis. The script complexity
    of Indic scripts, expressed as a probability distribution, is the
    basis for the proposed model. Many earlier studies extracted the
    probability distribution of the basic alphabet for Latin text, using
    samples of over 10,00,000 character code points, and demonstrated
    the dominance of a small set of characters in regular usage. The
    present work extends this idea to evaluate the characteristics of
    the variable-length sequences of character code points embodied in
    the syllables of Telugu, Kannada and Hindi text.

    A sample of 32,00,000 character code points is used for this
    analysis, compiled from present-day text taken from newspapers,
    novels, stories, songs, sports writing and literature. For Telugu,
    the probabilities of the character code points of the alphabet in
    this sample, expressed as percentages, are given in Table 4.2. Some
    frequencies in the table are zero because the corresponding
    characters are deprecated in current usage. Zero frequencies are
    also observed for the Telugu digits 0 to 9, which are not in general
    use. An interesting phenomenon is observed in the probability
    distribution of character code points. The highest probability among
    vowels, 1%, is associated with the vowel U+0C05. All other vowels
    are observed with a

    Code Point  Prob %  (repeated for three column pairs)

    8.86   1.34   0.14
    7.66   1.05   0.10
    6.63   0.86   0.07
    6.51   0.74   0.06
    6.12   0.72   0.06
    5.28   0.59   0.05
    4.75   0.50   0.04
    4.71   0.49   0.03
    4.35   0.48   0.03
    3.36   0.45   0.02
    3.20   0.44   0.02
    2.92   0.42   0.02
    2.69   0.41   0.02
    2.59   0.40   0.00
    2.58   0.36   0.00
    2.30   0.34   0.00
    2.24   0.34   0.00
    2.19   0.31   0.00
    2.06   0.19   0.00
    2.01   0.16   0.00
    1.96   0.16   0.00
    1.89   0.15   0.00
    1.43   0.14

    Table 4.2 : Probability Distribution of character code points of
    Telugu script

    probability less than or equal to 0.5%. Among consonants, the
    highest probability, 6.2%, is associated with the consonant . Only
    four consonants are observed with probability greater than 4%. Among
    vowel signs, only three are observed with probability around 7%.
    This phenomenon is associated with the CV core, which accounts for
    54% of syllable structures. The nasal symbol is observed with 4.7%
    probability, and the highest probability overall, 8.86%, is
    associated with Halant. It is interesting that Halant is not treated
    as a syllable at all; its significant role lies in the conjunct
    formation of syllables.
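    A distribution of this kind can be computed with a simple counter. The sketch below is an illustration only: the function name is hypothetical, the Unicode Telugu block U+0C00 to U+0C7F is used as the alphabet, and the sample is a single word rather than a corpus.

```python
from collections import Counter

def code_point_distribution(text, lo=0x0C00, hi=0x0C7F):
    """Return {code point: percentage} for code points in [lo, hi], most frequent first."""
    counts = Counter(ch for ch in text if lo <= ord(ch) <= hi)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f"U+{ord(ch):04X}": 100.0 * n / total for ch, n in counts.most_common()}

# Tiny sample (assumption): the word "Telugu" written in Telugu script.
sample = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"
dist = code_point_distribution(sample)
most_frequent = next(iter(dist))   # the vowel sign U+0C41, which occurs twice
```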

    For Kannada, a sample of 16,94,650 character code points is used,
    compiled from present-day text taken from newspapers, novels,
    stories, songs, sports writing and literature. Table 4.3 shows the
    probabilities of the character code points of the alphabet in this
    sample, expressed as percentages. Some frequencies in the table are
    zero because the corresponding characters are deprecated in current
    usage. The highest probability, 9.25%, is associated with the
    character .

    Code Point  Prob %  (repeated for six column pairs)

    9.25 2.66 0.89 0.24 0.05 0.01

    7.76 2.64 0.86 0.23 0.05 0.01

    5.69 2.58 0.75 0.22 0.04 0.01

    5.51 2.45 0.68 0.21 0.04 0.01

    5.38 1.87 0.68 0.17 0.04 0.01

    5.38 1.66 0.64 0.16 0.04 0.00

    4.90 1.56 0.55 0.16 0.03 0.00

    4.39 1.53 0.55 0.11 0.03 0.00

    4.33 1.50 0.52 0.11 0.03 0.00

    3.93 1.41 0.51 0.09 0.02 0.00

    3.60 1.24 0.48 0.08 0.02 0.00

    3.54 1.07 0.35 0.07 0.02 0.00

    3.05 0.94 0.33 0.06 0.02

    Table 4.3 Probability Distribution of character code points of Kannada script

    Code Point  Prob %  (repeated for five column pairs)

    8.76   1.80   0.46   0.06   0.00
    6.57   1.73   0.41   0.05   0.00
    6.43   1.59   0.40   0.05   0.00
    6.27   1.52   0.37   0.04   0.00
    4.46   1.47   0.37   0.02   0.00
    4.20   1.21   0.31   0.02   0.00
    4.16   0.92   0.31   0.02   0.00
    3.95   0.82   0.28   0.02   0.00
    3.92   0.82   0.27   0.02   0.00
    3.60   0.81   0.17   0.02   0.00
    3.53   0.77   0.16   0.01   0.00
    3.42   0.77   0.16   0.01   0.00
    3.40   0.73   0.15   0.01   0.00
    2.76   0.68   0.14   0.01   0.00
    2.69   0.66   0.12   0.01   0.00
    2.37   0.64   0.11   0.01
    2.28   0.61   0.10   0.01
    1.96   0.60   0.07   0.01
    1.85   0.51   0.06   0.01

    For Hindi, a sample of 9,36,707 character code points is used,
    compiled from present-day text. The highest probability, 8.76%, is
    associated with the vowel sign character . Among consonants, the
    character  has the highest probability, 6.57%, as listed in
    Table 4.4.

    Table 4.4 : Probability Distribution of character code points of
    Hindi script

    S.No.  Code Point  Probability %    S.No.  Code Point  Probability %

    1 E 12.90 14 P 2.62

    2 T 9.65 15 U 2.57

    3 O 7.73 16 F 2.37

    4 S 7.40 17 G 1.86

    5 N 7.20 18 B 1.59

    6 I 7.19 19 W 1.40

    7 A 7.16 20 Y 1.18

    8 R 6.65 21 V 0.89

    9 C 4.52 22 K 0.75

    10 H 3.92 23 X 0.30

    11 L 3.81 24 Q 0.19

    12 D 3.26 25 J 0.12

    13 M 2.67 26 Z 0.09

    Table 4.5 : Probability Distribution of character code points of
    English script

    For English, a sample of 10,00,000 character code points is used
    for the analysis, compiled from present-day text. The highest
    probability, 12.9%, is associated with the character E, whereas the
    minimum probability, 0.09%, is associated with the character Z, as
    listed in Table 4.5.

    Like single letters, multiple-letter combinations also occur with
    varying and predictable probabilities. Extending the unconditional
    probability approach, the probabilities with which bigrams and
    trigrams occur in the text are determined. For Telugu, approximately
    4096 bigrams are possible. The 26 most frequent bigrams are listed
    in Table 4.6. The bigram  has the highest probability, 1.41%. These
    values show that the bigrams cluster around specific values, which
    increases the complexity of reverse mapping. A similar evaluation is
    carried out for Kannada and Hindi, and the 26 most probable bigrams
    are listed in Table 4.7 and Table 4.8 respectively. The highest
    probability is 1.34% for  in Kannada and 1.10% for  in Hindi. As in
    Telugu, the bigrams of Kannada and Hindi cluster around specific
    values.

    Sl.No  Code Point  Probability %    Sl.No  Code Point  Probability %
    1                  1.41             14                 0.80
    2                  1.17             15                 0.78
    3                  1.16             16                 0.77
    4                  1.11             17                 0.76
    5                  1.08             18                 0.75
    6                  1.06             19                 0.75
    7                  0.93             20                 0.75
    8                  0.91             21                 0.75
    9                  0.87             22                 0.74
    10                 0.87             23                 0.72
    11                 0.85             24                 0.71
    12                 0.84             25                 0.71
    13                 0.82             26                 0.70

    Table 4.6 Probability Distribution of most frequent bi gram
    character code points of Telugu script

    Sl.No  Code Point  Probability %    Sl.No  Code Point  Probability %
    1                  1.34             14                 0.78
    2                  1.28             15                 0.77
    3                  1.24             16                 0.74
    4                  1.21             17                 0.73
    5                  1.09             18                 0.71
    6                  1.06             19                 0.71
    7                  1.06             20                 0.70
    8                  1.04             21                 0.70
    9                  0.96             22                 0.65
    10                 0.91             23                 0.64
    11                 0.82             24                 0.64
    12                 0.81             25                 0.64
    13                 0.80             26                 0.62

    Table 4.7 Probability Distribution of most frequent bi gram
    character code points of Kannada script

    Sl.No  Code Point  Probability %    Sl.No  Code Point  Probability %
    1                  1.10             14                 0.66
    2                  1.06             15                 0.66
    3                  1.00             16                 0.64
    4                  0.98             17                 0.63
    5                  0.95             18                 0.60
    6                  0.92             19                 0.59
    7                  0.88             20                 0.59
    8                  0.87             21                 0.58
    9                  0.83             22                 0.58
    10                 0.78             23                 0.57
    11                 0.75             24                 0.57
    12                 0.74             25                 0.56
    13                 0.67             26                 0.52

    Table 4.8 : Probability Distribution of most frequent bi gram
    character code points of Hindi script

    S.No.  Code Point  Probability %    S.No.  Code Point  Probability %

    1 TH 2.47 14 NT 1.23

    2 ER 1.96 15 AN 1.22

    3 HE 1.96 16 ET 1.15

    4 IN 1.83 17 SE 1.02

    5 ES 1.68 18 ED 1.02

    6 ON 1.60 19 TO 1.00

    7 RE 1.44 20 CO 0.99

    8 TE 1.37 21 EC 0.96

    9 ST 1.37 22 IS 0.95

    10 TI 1.35 23 RO 0.91

    11 EN 1.33 24 ND 0.88

    12 AT 1.25 25 IT 0.87

    13 OR 1.23 26 AR 0.86

    For English, 676 bigrams are logically possible. The 26 most
    frequent bigrams are listed in Table 4.9. The bigram TH has the
    highest probability, 2.47%. These values show that the bigrams
    cluster around specific values. The 26 bigrams of English account
    for about 35% of the total distribution, whereas for Indic scripts
    the corresponding share is in the range of 19 to 22%. Reverse
    mapping is thus more difficult for Indic scripts than for English.
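    The share of the distribution covered by the most frequent bigrams can be checked with a short calculation. The sketch below is illustrative: the function names are hypothetical, and the list holds the 26 English bigram percentages of Table 4.9. Their sum comes to about 33.9%, in line with the figure of roughly 35% quoted above.

```python
from collections import Counter

def ngram_distribution(text, n):
    """Return {n-gram: percentage} over all overlapping n-grams of text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: 100.0 * c / total for g, c in grams.most_common()}

def top_k_coverage(percentages, k=26):
    """Sum of the k largest percentages, i.e. the share covered by the top k n-grams."""
    return sum(sorted(percentages, reverse=True)[:k])

# Percentages of the 26 most frequent English bigrams, copied from Table 4.9.
TABLE_4_9 = [2.47, 1.96, 1.96, 1.83, 1.68, 1.60, 1.44, 1.37, 1.37, 1.35,
             1.33, 1.25, 1.23, 1.23, 1.22, 1.15, 1.02, 1.02, 1.00, 0.99,
             0.96, 0.95, 0.91, 0.88, 0.87, 0.86]
english_coverage = top_k_coverage(TABLE_4_9)

# The same measure on a toy string (assumption), using its own bigram distribution.
toy = ngram_distribution("ABABAB", 2)
toy_coverage = top_k_coverage(toy.values(), k=1)
```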

    Table 4.9 : Probability Distribution of most frequent bi gram
    character code points of English script

    An enhanced analysis that extends the bigram probability
    distribution to the trigram probability distribution of code points
    gives better knowledge of the language characteristics. For Telugu,
    almost 2,63,000 trigrams are possible. Table 4.10 shows the top 26
    trigrams by probability. The trigram  has the highest probability,
    0.86%. Because so many trigrams are possible, more of them cluster
    around specific values, which makes the mapping process more
    complex. Similarly, of the 4,20,000 trigrams possible in Kannada,
    Table 4.11 lists the top 26. The trigram  has the highest
    probability, 1.16%. For Hindi, around 7,29,000 trigrams are
    possible, of which  has the highest probability, 0.74%, as listed in
    Table 4.12. Because of this huge trigram space, reverse mapping is
    much more complex in Hindi.

    Sl.No  Code Point  Probability %    Sl.No  Code Point  Probability %
    1                  0.86             14                 0.25
    2                  0.47             15                 0.25
    3                  0.46             16                 0.24
    4                  0.45             17                 0.24
    5                  0.40             18                 0.23
    6                  0.38             19                 0.22
    7                  0.31             20                 0.21
    8                  0.30             21                 0.21
    9                  0.29             22                 0.21
    10                 0.28             23                 0.21
    11                 0.27             24                 0.20
    12                 0.26             25                 0.20
    13                 0.25             26                 0.20

    Table 4.10 Probability Distribution of most frequent tri gram
    character code points of Telugu script

    Sl.No  Code Point  Probability %    Sl.No  Code Point  Probability %
    1                  1.16             14                 0.32
    2                  0.90             15                 0.30
    3                  0.90             16                 0.26
    4                  0.62             17                 0.24
    5                  0.53             18                 0.24
    6                  0.46             19                 0.21
    7                  0.45             20                 0.20
    8                  0.44             21                 0.20
    9                  0.40             22                 0.20
    10                 0.36             23                 0.19
    11                 0.33             24                 0.19
    12                 0.32             25                 0.18
    13                 0.32             26                 0.17

    Table 4.11 Probability Distribution of most frequent tri gram
    character code points of Kannada script

    Sl.No  Code Point  Probability %    Sl.No  Code Point  Probability %
    1                  0.74             14                 0.17
    2                  0.35             15                 0.16
    3                  0.33             16                 0.16
    4                  0.31             17                 0.15
    5                  0.29             18                 0.15
    6                  0.28             19                 0.15
    7                  0.27             20                 0.15
    8                  0.26             21                 0.14
    9                  0.24             22                 0.14
    10                 0.24             23                 0.14
    11                 0.23             24                 0.13
    12                 0.23             25                 0.13
    13                 0.20             26                 0.13

    Table 4.12 Probability Distribution of most frequent tri gram
    character code points of Hindi script

    For English, a total of 17,576 trigrams are possible, far fewer
    than for the Indic scripts. Table 4.13 lists the 26 most frequent
    trigrams. The trigram THE has the highest probability, 1.76%,
    followed by ION with 0.73% and ING with 0.65%. Because the trigram
    space of English is smaller than that of the Indic scripts, reverse
    mapping is less complex for English.
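    The sizes of the n-gram spaces quoted in this section follow from raising the alphabet size to the power n. The sketch below reproduces the English figures (26 letters) and, as an inference made here rather than a value stated in the text, recovers the alphabet sizes implied by the rounded trigram counts for the Indic scripts.

```python
def ngram_space(alphabet_size, n):
    """Number of logically possible n-grams over an alphabet of the given size."""
    return alphabet_size ** n

english = {n: ngram_space(26, n) for n in (2, 3)}   # 676 bigrams, 17576 trigrams

# Rounded trigram counts quoted in the text; the cube root gives the implied
# alphabet size (an inference, not a figure given by the source).
quoted_trigrams = {"Telugu": 263_000, "Kannada": 420_000, "Hindi": 729_000}
implied_alphabet = {name: round(count ** (1 / 3)) for name, count in quoted_trigrams.items()}

telugu_bigrams = ngram_space(implied_alphabet["Telugu"], 2)   # matches the 4096 quoted for Telugu
```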

    S.No.  Code Point  Probability %    S.No.  Code Point  Probability %
    1      THE         1.76             14     EST         0.31
    2      ION         0.73             15     ERE         0.31
    3      ING         0.65             16     ATE         0.30
    4      TIO         0.64             17     USE         0.29
    5      AND         0.55             18     AGE         0.28
    6      ENT         0.52             19     STH         0.28
    7      FOR         0.45             20     HER         0.28
    8      PRO         0.44             21     THA         0.27
    9      CON         0.42             22     ONS         0.26
    10     ESS         0.40             23     ECT         0.26
    11     TER         0.39             24     NTH         0.25
    12     ATI         0.39             25     ONT         0.25
    13     INT         0.32             26     ETH         0.25

    Table 4.13 Probability Distribution of most frequent tri gram
    character code points of English script

    4.4 CONDITIONAL PROBABILITY DISTRIBUTION OF CHARACTERS

    The statistical influences extending over n symbols of the text
    provide better a priori knowledge to the system and help it to
    achieve consistency. For this purpose, the conditional probability
    of a character, given the preceding (n-1) characters, needs to be
    calculated. The conditional probability P(A|B) is the probability of
    some event A, given the occurrence of some other event B:

    P(A|B) = P(A,B)/P(B) (4.1)

    where P(A,B) is the joint probability, that is, the probability of
    both events occurring together, and P(B) is the unconditional
    probability of the event B.

    The unconditional probabilities of the unigram character code
    points are calculated from a large corpus of the language. For each
    character code point, the joint probabilities of all its possible
    combinations with other character code points are also calculated,
    and this is repeated for every character code point of the language.
    From these unconditional and joint probabilities and expression
    (4.1), the conditional probabilities of all character code points of
    the language can be calculated. This procedure is applied to four
    languages: English, Telugu, Kannada and Hindi.
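    Expression (4.1) can be applied to bigram counts as in the sketch below. This is an illustration under stated assumptions: the function name is hypothetical, B is taken to be the preceding symbol and A the symbol that follows it, probabilities are estimated from one short toy string, and results are fractions rather than percentages.

```python
from collections import Counter

def conditional_probabilities(text, given):
    """Estimate P(next symbol | given) = P(given, next) / P(given) from adjacent pairs."""
    pairs = Counter(zip(text, text[1:]))      # joint counts of adjacent symbols
    n_pairs = sum(pairs.values())
    firsts = Counter(text[:-1])               # how often each symbol starts a pair
    if firsts[given] == 0:
        return {}
    p_given = firsts[given] / n_pairs         # P(B)
    result = {}
    for (first, second), count in pairs.items():
        if first == given:
            p_joint = count / n_pairs         # P(A,B)
            result[second] = p_joint / p_given
    return result

# Toy string (assumption): S is followed by T twice, by S once and by E once.
probs = conditional_probabilities("STRESSEDSTATES", "S")
```

    In this toy string the most likely successor of S is T, which mirrors the kind of observation drawn from a full corpus.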

    The conditional probabilities for the English character S are
    listed in Table 4.14. From the table, it is evident that ST has the
    highest conditional probability, which shows that S is most likely
    to be followed by T. Similarly, SZ has the minimum conditional
    probability. The same computation is done for all 26 characters of
    the language. Conditional probabilities are likewise computed for
    all characters of Telugu, Kannada and Hindi. For illustration,
    Table 4.15 shows the conditional probabilities of the Telugu
    character , Table 4.16 those of the Kannada character  and
    Table 4.17 those of the Hindi character .

    S.No.  Char  Prob %  Cond Prob    S.No.  Char  Prob %  Cond Prob

    1 ST 7.40 2.49 14 SY 7.40 0.25

    2 SE 7.40 1.90 15 SB 7.40 0.24

    3 SS 7.40 1.38 16 SR 7.40 0.20

    4 SI 7.40 1.31 17 SN 7.40 0.20

    5 SA 7.40 1.25 18 SD 7.40 0.15

    6 SO 7.40 1.02 19 SL 7.40 0.14

    7 SC 7.40 0.54 20 SK 7.40 0.04

    8 SU 7.40 0.52 21 SG 7.40 0.04

    9 SP 7.40 0.52 22 SV 7.40 0.03

    10 SH 7.40 0.38 23 SJ 7.40 0.02

    11 SM 7.40 0.32 24 SQ 7.40 0.01

    12 SF 7.40 0.28 25 SX 7.40 0.01

    13 SW 7.40 0.28 26 SZ 7.40 0.00

    Table 4.14 Conditional Probability Distribution of character code point S of English script

    Char  Prob %  Cond Prob    Char  Prob %  Cond Prob    Char  Prob %  Cond Prob

    5.28 3.78 5.28 0.12 5.28 0.00 5.28 2.80 5.28 0.12 5.28 0.00 5.28 2.57 5.28 0.11 5.28 0.00 5.28 2.28 5.28 0.09 5.28 0.00 5.28 0.97 5.28 0.07 5.28 0.00 5.28 0.52 5.28 0.07 5.28 0.00 5.28 0.47 5.28 0.07 5.28 0.00 5.28 0.45 5.28 0.04 5.28 0.00 5.28 0.45 5.28 0.02 5.28 0.00 5.28 0.43 5.28 0.02 5.28 0.00 5.28 0.43 5.28 0.02 5.28 0.00 5.28 0.39 5.28 0.02 5.28 0.00 5.28 0.31 5.28 0.01 5.28 0.00 5.28 0.28 5.28 0.01 5.28 0.00 5.28 0.27 5.28 0.01 5.28 0.00 5.28 0.27 5.28 0.01 5.28 0.00 5.28 0.26 5.28 0.01 5.28 0.00 5.28 0.20 5.28 0.01 5.28 0.00 5.28 0.18 5.28 0.01 5.28 0.00 5.28 0.17 5.28 0.01 5.28 0.00 5.28 0.17 5.28 0.01 5.28 0.00 5.28 0.14 5.28 0.00 5.28 0.00 5.28 0.13 5.28 0.00 5.28 0.12 5.28 0.00

    Table 4.15 Conditional Probability Distribution of character

    code point of Telugu script

    Char  Prob %  Cond Prob    Char  Prob %  Cond Prob    Char  Prob %  Cond Prob

    5.51 3.60 5.51 0.08 5.51 0.00

    5.51 2.30 5.51 0.05 5.51 0.00

    5.51 1.77 5.51 0.04 5.51 0.00

    5.51 1.38 5.51 0.04 5.51 0.00

    5.51 1.28 5.51 0.04 5.51 0.00

    5.51 0.81 5.51 0.04 5.51 0.00

    5.51 0.55 5.51 0.04 5.51 0.00

    5.51 0.54 5.51 0.03 5.51 0.00

    5.51 0.49 5.51 0.03 5.51 0.00

    5.51 0.49 5.51 0.02 5.51 0.00

    5.51 0.47 5.51 0.02 5.51 0.00

    5.51 0.44 5.51 0.02 5.51 0.00

    5.51 0.37 5.51 0.02 5.51 0.00

    5.51 0.37 5.51 0.02 5.51 0.00

    5.51 0.37 5.51 0.02 5.51 0.00

    5.51 0.33 5.51 0.01 5.51 0.00

    5.51 0.33 5.51 0.01 5.51 0.00

    5.51 0.29 5.51 0.01 5.51 0.00

    5.51 0.27 5.51 0.01 5.51 0.00

    5.51 0.24 5.51 0.01 5.51 0.00

    5.51 0.17 5.51 0.01 5.51 0.00

    5.51 0.16 5.51 0.01 5.51 0.00

    5.51 0.16 5.51 0.01 5.51 0.00

    5.51 0.15 5.51 0.00 5.51 0.00

    5.51 0.13 5.51 0.00 5.51 0.00

    5.51 0.11 5.51 0.00

    Table 4.16 Conditional Probability Distribution of character

    code point of Kannada script

    Char  Prob %  Cond Prob    Char  Prob %  Cond Prob    Char  Prob %  Cond Prob

    2.69 5.56 2.69 0.12 2.69 0.00 2.69 4.90 2.69 0.12 2.69 0.00 2.69 3.82 2.69 0.11 2.69 0.00 2.69 2.73 2.69 0.10 2.69 0.00 2.69 2.30 2.69 0.10 2.69 0.00 2.69 2.29 2.69 0.08 2.69 0.00 2.69 2.12 2.69 0.08 2.69 0.00 2.69 2.05 2.69 0.08 2.69 0.00 2.69 1.20 2.69 0.07 2.69 0.00 2.69 1.10 2.69 0.07 2.69 0.00 2.69 1.06 2.69 0.06 2.69 0.00 2.69 0.71 2.69 0.06 2.69 0.00 2.69 0.70 2.69 0.04 2.69 0.00 2.69 0.59 2.69 0.04 2.69 0.00 2.69 0.59 2.69 0.04 2.69 0.00 2.69 0.47 2.69 0.03 2.69 0.00 2.69 0.46 2.69 0.02 2.69 0.00 2.69 0.42 2.69 0.01 2.69 0.00 2.69 0.40 2.69 0.01 2.69 0.00 2.69 0.39 2.69 0.01 2.69 0.00 2.69 0.32 2.69 0.01 2.69 0.00 2.69 0.25 2.69 0.01 2.69 0.00 2.69 0.24 2.69 0.01 2.69 0.00 2.69 0.23 2.69 0.01 2.69 0.00 2.69 0.23 2.69 0.01 2.69 0.00 2.69 0.20 2.69 0.01 2.69 0.00 2.69 0.18 2.69 0.00 2.69 0.00 2.69 0.15 2.69 0.00 2.69 0.00 2.69 0.14 2.69 0.00 2.69 0.00 2.69 0.13 2.69 0.00 2.69 0.00

    Table 4.17 Conditional Probability Distribution of character

    code point of Hindi script


    4.5 ENCRYPTION AND DECRYPTION

    The proposed model treats the meaningful units embedded in text
    documents as its essential units, and handles them in the form of a
    character or byte stream. The byte stream is a symbolic
    representation of the text. For Indic scripts this byte stream is
    complex, whereas for Latin text it is a one-to-one mapping. The
    present model addresses this specificity by treating words as
    sequences of syllables and extracting the byte stream from the
    syllables.

    Algorithm for Encryption of Indic Scripts

    1 : Divide the given text document into a set of words.
    2 : Divide each word into syllables (the basic unit).
    3 : For each syllable, generate the character code point byte
        stream, which may consist of the single or multiple code points
        that form that syllable.
    4 : Generate the bit stream for the byte stream generated in step 3.
    5 : Apply the encryption technique to the bit stream generated in
        step 4, using a randomly generated key stream, which results in
        the cipher text.
    6 : Repeat steps 3 to 5 for each syllable generated in step 2.

    Figure 4.6 Algorithm for Encryption of Indic Scripts
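    The encryption steps can be sketched as follows. Several choices here are assumptions made for illustration and are not specified by the algorithm: the syllable segmentation is a simplified rule (combining marks and consonants that follow a virama join the current syllable), the byte stream is UTF-8, the encryption technique is a plain XOR with a random key stream, and word boundaries are not recorded. All function names are hypothetical.

```python
import secrets
import unicodedata

VIRAMA = "\u0c4d"  # TELUGU SIGN VIRAMA

def split_syllables(word):
    """Simplified segmentation: marks and post-virama consonants join the previous unit."""
    syllables = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if syllables and (is_mark or syllables[-1].endswith(VIRAMA)):
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

def encrypt(text):
    """Return (cipher_chunks, key_chunks), one pair of byte strings per syllable."""
    cipher_chunks, key_chunks = [], []
    for word in text.split():                          # step 1: words
        for syllable in split_syllables(word):         # step 2: syllables
            stream = syllable.encode("utf-8")          # steps 3-4: code point byte/bit stream
            key = secrets.token_bytes(len(stream))     # random key stream (assumed XOR key)
            cipher_chunks.append(bytes(b ^ k for b, k in zip(stream, key)))  # step 5
            key_chunks.append(key)
    return cipher_chunks, key_chunks

# Usage on a three-syllable Telugu word (the word "Telugu" itself).
cipher, keys = encrypt("\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41")
```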

    This byte stream consists of single code point units or multiple
    code point units. The code point streams derived from the syllables
    are converted to a bit stream, which then undergoes transformation
    as in any other cryptosystem. A key stream is generated [KAT 2005]
    using an efficient random number generator. The transformation
    function is applied with this key stream, which results in the
    cipher text. The encryption and decryption processes are given in
    Figure 4.6 and Figure 4.7, and the cryptographic model for Indic
    scripts is illustrated in Figure 4.8.

    Algorithm for Decryption of Indic Scripts

    1 : Generate the bit stream for the cipher text.
    2 : Apply the decryption technique to the bit stream generated in
        Step 1 with the key stream generated during encryption, resulting
        in a byte stream.
    3 : Combine the output of Step 2 to form the code point byte stream.
    4 : Combine the code point byte stream of Step 3 to form syllables.
    5 : Combine the syllables to form words and the words into the text
        document.
    6 : Repeat Steps 1 through 5 for all byte streams in the cipher text.

    Figure 4.7 Algorithm for Decryption of Indic Scripts
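    The decryption steps of Figure 4.7 can be sketched in the same spirit.
    The assumptions are again loud ones: the cipher is taken to be a bitwise
    XOR with a repeating key stream (the text does not name the technique),
    the cipher text is assumed to arrive as one list of bit strings per
    word with one bit string per syllable, and the code points are assumed
    to be UTF-8 encoded. The names from_bits, xor_bits and decrypt are
    hypothetical.

```python
def from_bits(bits):
    """Rebuild a byte stream from a string of '0'/'1' characters."""
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def xor_bits(bits, key_bits):
    """Assumed technique: XOR with a cyclically repeated key (self-inverse)."""
    return "".join(
        "1" if b != key_bits[i % len(key_bits)] else "0"
        for i, b in enumerate(bits)
    )


def decrypt(cipher_words, key):
    """Recover the text from per-word lists of per-syllable bit strings."""
    key_bits = "".join(format(b, "08b") for b in key)
    words = []
    for cipher_syllables in cipher_words:
        syllables = []
        for bits in cipher_syllables:                   # Steps 1 and 6
            byte_stream = from_bits(xor_bits(bits, key_bits))  # Steps 2, 3
            syllables.append(byte_stream.decode("utf-8"))      # Step 4
        words.append("".join(syllables))                # Step 5: words
    return " ".join(words)                              # Step 5: document
```

    Because XOR is its own inverse, applying the same key stream to the
    cipher bits restores the original code point byte stream.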


    [Flow chart. Encryption: text document -> words -> syllables -> code
    point byte stream (single code point or multiple code points; the
    symbolic representation of the byte stream) -> bit stream ->
    encryption function with key stream -> cipher text. Decryption:
    cipher text -> bit stream -> decryption function -> code point byte
    stream -> syllables -> words -> text document.]

    Figure 4.8 Flow chart for Encryption and Decryption of Indic scripts


    The proposed cryptographic model is tested on four languages: English,
    Telugu, Kannada and Hindi. The encryption algorithm is applied to texts
    of different sizes. A key is generated randomly using an OS-based
    random generator; simple 8-bit, 16-bit and 32-bit key streams are used
    in the present work. The plain text is encrypted using the proposed
    algorithm and the randomly generated key, resulting in cipher text.
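    As an illustration of the key generation step, the following sketch
    draws 8-bit, 16-bit and 32-bit keys from the operating system's random
    source through Python's standard secrets module and repeats a key
    cyclically to cover a message. The function names generate_key and
    key_stream are hypothetical, and the cyclic repetition is an assumption
    about how a short key is stretched into a key stream.

```python
import secrets


def generate_key(bits):
    """Draw a random key of the given size (a multiple of 8) from the OS."""
    if bits < 8 or bits % 8 != 0:
        raise ValueError("key size must be a positive multiple of 8 bits")
    return secrets.token_bytes(bits // 8)


def key_stream(key, length):
    """Repeat the fixed key cyclically to produce `length` key bytes."""
    return bytes(key[i % len(key)] for i in range(length))


# One key per size used in the present work.
keys = {size: generate_key(size) for size in (8, 16, 32)}
```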

    Figure 4.9 Sample Plain Text, Encrypted Text and Decrypted Text in English

    Figure 4.10 Sample Plain Text, Encrypted Text and Decrypted Text in Telugu

    4.6 CRYPTANALYSIS USING LANGUAGE MODEL

    In a conventional cryptographic system, a plain text message m is
    generated by the sender. An encryption transformation E, which depends
    on a secret key k, transforms the plain text m into cipher text c
    using the expression c = Ek(m). The cipher text c is then transmitted
    to the receiver, where the decryption transformation D, which also
    depends on the secret key k (in the case of symmetric key
    cryptography), is used to recover the plain text using the expression
    m = Dk(c). The information flow is illustrated in Figure 4.11. The
    assumption is that an opponent does not possess k and therefore cannot
    recover m from c using D. (Note that the algorithms D and E may be
    public.) For the key k to remain secret, a secure communication
    channel is needed between the sender and receiver.

    Figure 4.11 Decipherment Model that uses language statistics

    The information available to a cryptanalyst varies: it may be the
    cipher text alone, or complete knowledge of the system (except the
    key), including the algorithm used, the characteristics of the
    language and other language statistics. If s represents the
    information available to an opponent and D1 represents the process of
    cryptanalysis, then the deduced information m1 is expressed as
    m1 = D1(s). The coincidence between m and m1 is a measure of the
    strength of the system: the closer m1 is to m, the weaker the system.
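    One simple way to quantify the coincidence between m and m1 is the
    percentage of positions at which the two agree. The sketch below
    assumes this position-wise definition, which the text does not spell
    out; the function name retrieval_percentage is hypothetical.

```python
def retrieval_percentage(original, recovered):
    """Percentage of positions where the recovered text equals the original."""
    if not original:
        return 0.0
    matches = sum(1 for m, m1 in zip(original, recovered) if m == m1)
    return 100.0 * matches / len(original)
```

    For example, if five of six symbols are recovered correctly the
    measure is 5/6 of 100, or about 83.3%.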

    [Diagram for Figure 4.11. Labels: SENDER, RECEIVER, CRYPTANALYST,
    a priori knowledge about language, key (k) distribution. Flow:
    m -> c = Ek(m) -> c -> m = Dk(c), with the cryptanalyst producing m1.]

    Figure 4.12 Sample retrieved Plain Text in English

    Figure 4.13 Sample retrieved Plain Text in Telugu

    In the present work, the probability distribution of n-gram character
    code points is taken as the a priori knowledge for cryptanalysis. The
    probabilities of the different characters in the cipher text are
    calculated and tabulated. A mapping between the characters of the plain
    text and the cipher text is then built from the two probability
    distributions. The characters in the cipher text are replaced with the
    mapped plain text characters, and the percentage of plain text
    retrieved is calculated.
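    The mapping step described above can be sketched as a rank-matching
    frequency attack: symbols of the cipher text and of a reference corpus
    are each sorted by frequency, and symbols of equal rank are paired.
    This is an assumed, simplified reading of the procedure (no
    thresholding function is applied), and the names rank_by_frequency and
    frequency_attack are hypothetical.

```python
from collections import Counter


def rank_by_frequency(symbols):
    """Return the distinct symbols, most frequent first (ties by symbol)."""
    counts = Counter(symbols)
    return [s for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def frequency_attack(cipher, reference):
    """Pair cipher and reference symbols of equal frequency rank."""
    mapping = dict(zip(rank_by_frequency(cipher), rank_by_frequency(reference)))
    # Cipher symbols with no partner in the reference are shown as '?'.
    recovered = "".join(mapping.get(c, "?") for c in cipher)
    return recovered, mapping
```

    The recovered text can then be compared position by position with the
    original plain text to obtain the retrieval percentage.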

    For English text the problem is much simpler, because there is a
    direct correspondence between the transformed text and the original
    text. Although the key is generated randomly, it is fixed, so the
    mapping function transforms each character into a distinct point in
    the orthogonal plane. For large texts almost all characters are
    usually present, and this holds even for medium-sized texts because
    the character set is small. Moreover, the one-to-one mapping makes the
    text more predictable. The percentage of retrieved code points is
    calculated using the probability distribution.

    For Indic scripts, the character codes that occur in the original
    text need not form the complete set. Even though the mapping function
    preserves one-to-one correspondence, not every character code from the
    original set of code points may appear after the transformation. This
    may lead to confusion in the cryptanalysis, so a thresholding function
    is adopted in the cryptanalysis process for reverse mapping. The
    percentage of plain text retrieved is listed in Table 4.18. For
    Telugu, Kannada and Hindi it is lower than for English, owing to the
    complex nature of Indic scripts.

    Table 4.18. Retrieved Plain Text using Unigram probability

                                  Retrieval % using probability of unigrams
    S.No.    Cipher text length   ENGLISH   TELUGU   KANNADA   HINDI
             (in code points)
    1               6000           29.97    22.51     27.86    19.88
    2              12000           35.93    26.68     27.96    31.55
    3              25000           40.40    23.77     38.95    23.83
    4              40000           42.34    16.90     26.73    27.10
    5              50000           38.33     8.81     28.03    27.89
    6              70000           63.17    23.60     26.11    26.31
    7              90000           48.04    17.44     25.58    25.93
    8             110000           78.59    18.76     25.70    27.22
    Average                        47.10    19.81     28.37    26.21

    The encryption algorithm is applied to Telugu text samples of
    different sizes. For this process an eight-bit key is generated using
    an OS-based random generator. The plain text is encrypted using the
    proposed algorithm and the randomly generated 8-bit key, resulting in
    cipher text. The frequencies of the different characters in the cipher
    text are extracted, and the characters of the plain text and cipher
    text are mapped on the basis of these frequencies. The characters in
    the cipher text are then replaced with the mapped plain text
    characters, and the percentage of exact retrieval relative to the
    plain text is calculated; it is presented in Table 4.18.

    From the table it can be inferred that cryptanalysis of text in complex
    languages such as Telugu, Kannada and Hindi is much more difficult.
    Consequently, the larger key sizes required for applications in Latin
    text can be reduced for applications in complex languages like Telugu.
    The percentage of plain text retrieved does not vary linearly with text
    size, because a proper threshold function is required to map cipher
    text symbols to the corresponding plain text symbols.

    Using the unconditional probability distribution approach, the
    decipherment model is applied to cipher texts of four languages, with
    sizes ranging from 6000 to 110000 character code points. The retrieved
    plain text percentages are tabulated in Table 4.18. English text gave a
    maximum retrieval of 78.59% and a minimum of 29.97%. For Telugu, a
    maximum of 26.68% and a minimum of 8.81% of the plain text is
    retrieved. The retrieval for Kannada lies in the range 25.58% to
    38.95%, and for Hindi in the range 19.88% to 31.55%.
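    The minimum, maximum and average figures quoted from Table 4.18 can be
    checked with a few lines of Python. The values below are copied from
    the table; the names TABLE_4_18 and summarize are hypothetical.

```python
from statistics import mean

# Unigram retrieval percentages from Table 4.18, in order of sample size.
TABLE_4_18 = {
    "English": [29.97, 35.93, 40.40, 42.34, 38.33, 63.17, 48.04, 78.59],
    "Telugu": [22.51, 26.68, 23.77, 16.90, 8.81, 23.60, 17.44, 18.76],
    "Kannada": [27.86, 27.96, 38.95, 26.73, 28.03, 26.11, 25.58, 25.70],
    "Hindi": [19.88, 31.55, 23.83, 27.10, 27.89, 26.31, 25.93, 27.22],
}


def summarize(values):
    """Return the minimum, maximum and mean of a column of percentages."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


for language, values in TABLE_4_18.items():
    stats = summarize(values)
    print(f"{language}: {stats['min']:.2f} to {stats['max']:.2f}, "
          f"average {stats['mean']:.2f}")
```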


    From the observed results it can be concluded that reverse mapping is
    more complex for Indic scripts (with specific reference to Telugu),
    even with smaller key sizes.

    The mapping of fourteen characters of English for samples varying from
    6000 to 110000 code points is presented in Table 4.19. An empty cell
    (shown as -) indicates that the character is not mapped correctly. The
    mapping varies with sample size owing to the heuristic nature of the
    language, which results in variation in the percentage of retrieval.
    Table 4.20 and Table 4.21 present similar computations and results for
    the code points of Hindi and Telugu, which show the same behaviour.
    This results in inconsistent variation in the mapping process.

    Table 4.19. Mapping process in decipherment for a sample of most
    frequently occurring characters of English using unconditional
    probability

    S.No  Plain Text   E  T  O  S  N  I  A  R  C  H  L  D  M  P
    1         6000     E  T  -  -  -  -  -  R  -  -  -  -  -  -
    2        12000     E  T  -  -  -  -  -  R  -  -  -  -  -  -
    3        25000     E  T  -  -  -  -  -  R  -  H  -  D  M  -
    4        40000     E  T  -  -  -  -  -  R  -  H  -  D  M  -
    5        50000     E  T  -  -  -  -  -  R  -  -  -  D  M  -
    6        70000     E  T  O  S  -  I  -  R  C  -  -  -  -  -
    7        90000     E  T  -  -  -  -  -  R  -  -  L  D  M  -
    8       110000     E  T  O  S  -  I  -  R  C  H  L  D  M  -

    Table 4.20. Mapping process in decipherment for a sample of most
    frequently occurring characters of Hindi using unconditional
    probability

    [Rows: plain text samples of 6000, 12000, 25000, 40000, 50000, 70000,
    90000 and 110000.]

    Table 4.21 Mapping process in decipherment for a sample of most
    frequently occurring characters of Telugu using unconditional
    probability

    [Rows: plain text samples of 6000, 12000, 25000, 40000, 50000, 70000,
    90000 and 110000.]

    The unconditional probability approach is now extended by using the
    probabilities with which bigrams and trigrams occur within the cipher
    text. These probabilities are matched against the known probability
    distribution of the original language. Using this reverse mapping, the
    plain text is retrieved; the retrieved percentages of plain text for
    the four languages with varying sample sizes, using bigrams and
    trigrams, are listed in Table 4.22 and Table 4.23.

    Using the unconditional probability distribution of bigram character
    code points, the decipherment model is applied to cipher texts of four
    languages, with sizes ranging from 6000 to 110000 character code
    points. The retrieved plain text percentages using bigram statistics
    are tabulated in Table 4.22. English text gave a maximum retrieval of
    10.77% and a minimum of 2.82%. For Telugu, a maximum of 2.86% and a
    minimum of 0.03% of the plain text is retrieved. The retrieval for
    Kannada lies in the range 2.20% to 4.57%, and for Hindi in the range
    0.07% to 3.90%. The retrieved percentage of plain text is higher for
    English than for the Indic scripts.

    Table 4.22 Retrieved Plain Text using bigram probability

                                  Retrieval % using probability of bigrams
    S.No.    Cipher text length   ENGLISH   TELUGU   KANNADA   HINDI
             (in code points)
    1               6000            5.20     0.87      2.20     0.07
    2              12000            2.82     2.47      2.52     3.47
    3              40000            7.85     2.86      3.43     3.40
    4              50000            9.22     2.13      3.84     2.02
    5              70000            6.64     0.91      3.83     3.90
    6              90000           10.77     0.03      4.57     3.49
    7             110000            9.60     1.80      2.70     0.44
    Average                         7.44     1.58      3.30     2.40

    The same process is repeated using the trigram probability
    distribution of character code points, on cipher texts of four
    languages with sizes ranging from 6000 to 110000 character code
    points. The retrieved plain text percentages using trigram statistics
    are tabulated in Table 4.23. English text gave a maximum retrieval of
    4.18% and a minimum of 2.25%. For Telugu, a maximum of 2.20% and a
    minimum of 1.00% of the plain text is retrieved. The retrieval for
    Kannada lies in the range 1.45% to 5.04%, and for Hindi in the range
    0.60% to 1.95%. The retrieved percentage of plain text is higher for
    English than for Telugu and Hindi, while Kannada is comparable on
    average (3.25% against 3.22% for English).

    Table 4.23 Retrieved Plain Text using Trigram probability

                                  Retrieval % using probability of trigrams
    S.No.    Cipher text length   ENGLISH   TELUGU   KANNADA   HINDI
             (in code points)
    1               6000            2.95     1.00      1.45     0.60
    2              12000            3.80     1.20      4.72     1.95
    3              25000            2.25     1.67      5.04     0.84
    4              40000            3.10     2.20      3.00     1.27
    5              50000            3.54     2.01      2.52     0.74
    6              70000            3.61     1.22      2.33     0.99
    7              90000            4.18     1.15      3.90     1.00
    8             110000            2.36     1.61      3.03     0.62
    Average                         3.22     1.51      3.25     1.00
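    The extension to bigrams and trigrams can be sketched by computing the
    probability distribution of n-grams and pairing n-grams of equal
    frequency rank in the cipher text and in a reference corpus. This is an
    assumed, simplified reading of the procedure; the names
    ngram_distribution and map_ngrams are hypothetical.

```python
from collections import Counter


def ngram_distribution(symbols, n):
    """Relative frequency of each n-gram (as a tuple) in the sequence."""
    grams = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def map_ngrams(cipher, reference, n):
    """Pair cipher and reference n-grams of equal frequency rank."""
    def ranked(dist):
        # Most probable first; ties broken by the n-gram itself.
        return sorted(dist, key=lambda gram: (-dist[gram], gram))

    return dict(zip(ranked(ngram_distribution(cipher, n)),
                    ranked(ngram_distribution(reference, n))))
```

    The number of distinct n-grams grows quickly with n, so each one is
    observed less often in a sample of fixed size. This is one plausible
    reason why rank matching becomes less reliable for bigrams and
    trigrams than for unigrams.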

    Using the conditional probability distribution approach, the
    decipherment model is applied to cipher texts of four languages, with
    sizes ranging from 6000 to 110000 character code points. The retrieved
    plain text percentages are tabulated in Table 4.24. English text gave a
    maximum retrieval of 57.85% and a minimum of 42.42%. For Telugu, a
    maximum of 33.93% and a minimum of 24.72% of the plain text is
    retrieved. The retrieval for Kannada lies in the range 29.77% to
    38.30%, and for Hindi in the range 25.00% to 34.71%.

    Table 4.25 illustrates the mapping of the 14 most probable characters
    using conditional probability. An empty cell indicates that, for the
    given sample size, the character is not mapped correctly. Compared
    with the mapping using unconditional probability illustrated in
    Table 4.19, where several characters are not mapped correctly, more
    characters are mapped correctly using the conditional probability
    approach. This results in an increased retrieval percentage of the
    message text.
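    The conditional probability distribution itself can be estimated from
    counts of adjacent symbols. The sketch below assumes that "conditional
    probability" means the probability of a symbol given the symbol
    immediately before it, which the text does not state explicitly; the
    function name conditional_distribution is hypothetical, and the way
    these probabilities are used in the reverse mapping is not shown.

```python
from collections import Counter, defaultdict


def conditional_distribution(symbols):
    """Estimate P(next | previous) from adjacent pairs in the sequence."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    previous_counts = Counter(symbols[:-1])
    table = defaultdict(dict)
    for (previous, nxt), count in pair_counts.items():
        table[previous][nxt] = count / previous_counts[previous]
    return dict(table)
```

    Each row of the resulting table sums to one, so a cipher symbol can be
    characterised by the symbols that tend to follow it as well as by its
    own frequency.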


    Table 4.24. Retrieved Plain Text using Conditional probability

                                  Retrieval % using conditional probability
    S.No.    Cipher text length   ENGLISH   TELUGU   KANNADA   HINDI
             (in code points)
    1               6000           42.42    24.72     29.77    31.15
    2              12000           49.65    27.29     35.94    34.71
    3              25000           52.76    27.99     38.30    25.00
    4              40000           48.76    26.92     36.07    25.74
    5              50000           43.42    26.46     34.88    31.33
    6              70000           56.33    33.93     30.69    31.01
    7              90000           57.85    33.13     30.33    32.67
    8             110000           57.35    29.88     31.31    34.55
    Average                        51.07    28.79     33.41    30.77

    Similar conclusions are drawn from Table 4.26 and Table 4.27, which
    represent the mapping process for character code points of Telugu and
    Hindi respectively. For these languages, the conditional probability
    distribution approach gave better mapping performance than the
    unconditional approach.

    Table 4.25 Mapping process in decipherment for a sample of most
    frequently occurring characters of English using conditional
    probability (- marks a character that is not mapped correctly)

    S.No  Plain Text   E  T  O  S  N  I  A  R  C  H  L  D  M  P
    1         6000     E  T  O  -  -  -  A  R  -  H  -  -  -  P
    2        12000     E  T  O  S  -  I  A  R  -  H  L  -  M  -
    3        25000     E  T  O  S  -  I  -  R  -  H  L  D  M  -
    4        40000     E  T  O  S  -  I  -  R  -  H  L  D  -  -
    5        50000     E  T  -  S  -  I  -  R  -  H  L  D  -  P
    6        70000     E  T  O  S  N  I  -  R  -  H  L  D  M  P
    7        90000     E  T  O  S  N  I  A  R  -  H  L  D  M  P
    8       110000     E  T  O  S  N  I  -  R  -  H  L  D  M  P

    Table 4.26 Mapping process in decipherment for a sample of most
    frequently occurring characters of Telugu using conditional
    probability

    [Rows: plain text samples of 6000, 12000, 25000, 40000, 50000, 70000,
    90000 and 110000.]

    Table 4.27 Mapping process in decipherment for a sample of most
    frequently occurring characters of Hindi using conditional probability

    [Rows: plain text samples of 6000, 12000, 25000, 40000, 50000, 70000,
    90000 and 110000.]

    [Bar chart. x-axis: languages (English, Telugu, Kannada, Hindi);
    y-axis: average % retrieval (0 to 60); series: unconditional
    probability and conditional probability.]

    Figure 4.14 Average Retrieved Plain Text using Unconditional and
    Conditional probability of different languages

    The average retrieval efficiency for English is 47.10% using
    unconditional probability and 51.07% using conditional probability, as
    presented in Figure 4.14. Using unconditional probability, the
    retrieved plain text percentages for Telugu, Kannada and Hindi are
    19.81%, 28.37% and 26.21% respectively; using conditional probability
    they are 28.79%, 33.41% and 30.77% respectively. These results show
    that conditional probability gives a larger percentage of retrieval
    than the unconditional probability distribution. Thus, using
    conditional probability as a priori knowledge yields increased
    retrieval efficiency and improved consistency in the retrieval
    percentage. The retrieval percentage for English is larger than for
    the Indic scripts, viz. Telugu, Kannada and Hindi, when conditional
    probabilities are applied. Interestingly, the retrieval efficiency
    with conditional probability is found to be more stable than with the
    probability distribution alone. With conditional probability, smaller
    text sizes (below 2000) of all languages show improved retrieval
    efficiency compared with the probability distribution alone. The
    adoption of conditional probability allowed correct coincidence with
    more character code points, which is not the case for the matching
    pattern of probability alone illustrated in Table 4.20. Similar
    improvements are observed for the above languages, where the number of
    code points is relatively large.

    In the process of providing security, the context of the message is
    termed script-dependent text. In the world of multilingual data, every
    script possesses a different level of complexity. A method for
    deciphering substitution ciphers with low-order models of the
    language, with a case study on Indic scripts, is provided in the
    present work. An attempt is made to analyze the text-based crypto
    model using the probability distribution of character code points as a
    parameter, with specific study of Indic scripts. The complex
    orthographic nature of Indic scripts is explored while studying the
    impact of the probability distribution of character code points on the
    cryptanalysis of text retrieval. An extensive analysis is carried out
    on a large set of character code points of Indic scripts compiled from
    present usage of the scripts. The encryption and decryption process is
    tested on English and on Telugu, Kannada and Hindi with different key
    sizes. A comparison between the language complexities of English and
    Indic scripts is presented from the standpoint of the probability
    distribution of character code points in cryptanalysis. The model is
    evaluated using the probability distribution as one of the prominent
    characteristics of text. Cryptanalysis is also carried out using the
    conditional probability distribution, and an improvement is observed
    in the percentage of matches in the reverse transformation. Reduced
    efficiency of mapping is an indicative measure of language complexity,
    with specific reference to the probability distribution of character
    code points. The results thus show that reverse mapping is much more
    complex in the case of Indic scripts.