chapter 5 cryptanalysis of vigenere cipher and ...shodhganga.inflibnet.ac.in/bitstream/10603/26543/10/10_chapter5.pdf · 84 CHAPTER 5 CRYPTANALYSIS OF VIGENERE CIPHER AND SUBSTITUTION

    CHAPTER 5

    CRYPTANALYSIS OF VIGENERE CIPHER AND

    SUBSTITUTION CIPHER

    5.1 INTRODUCTION

    This chapter describes methods for the cryptanalysis of the Vigenere
    cipher and the substitution cipher. For the Vigenere cipher, the first step is
    to find the length of the key. Two ways of estimating the key length are
    explained: the Kasiski method
    (www.trincoll.edu/depts/cpsc/cryptography/vigenere.html) and a method based
    on a Genetic Algorithm. Once the key length is obtained, the proposed
    algorithm is applied to the ciphertext to find the correct key and recover
    the original plaintext. The results for the Vigenere cipher are given and
    compared with the existing method. For the substitution cipher, a parallel
    Genetic Algorithm is used, with a different fitness function in the proposed
    algorithm. The results are presented and compared with the existing method.

    5.2 THE KASISKI/KERCKHOFF METHOD

    Vigenere-like substitution ciphers were regarded by many as

    practically unbreakable for 300 years. In 1863, a Prussian Major named

    Kasiski proposed a method for breaking a Vigenere cipher that consisted of

    finding the length of the keyword and then dividing the message into that

    many simple substitution cryptograms. Frequency analysis could then be used

    to solve the resulting simple substitutions. Kasiski's technique for finding the

    length of the keyword was based on measuring the distance between repeated

    bigrams in the ciphertext. For example:

    Keyword: RELAT IONSR ELATI ONSRE LATIO NSREL

    Plaintext: TOBEO RNOTT OBETH ATIST HEQUE STION

    Ciphertext: KSMEH ZBBLK SMEMP OGAJX SEJCS FLZSY

    The bigram 'TO' occurs twice in the plaintext, at positions 0 and 9, and
    in both cases it lines up with the first two letters of the keyword.
    Because of this it produces the same ciphertext bigram, 'KS'. The same can be
    said of plaintext 'BE', which occurs twice, starting at positions 2 and 11, and
    is also encrypted to the same ciphertext bigram, 'ME'. In fact, a sufficiently
    long message encrypted with a Vigenere cipher will usually produce many such
    repeated bigrams. Although not every repeated bigram will be the result of the
    encryption of the same plaintext bigram, many will, and this provides the basis
    for breaking the cipher by measuring and factoring the distances between
    recurring bigrams. In this case the distance is 9, and from it the length of
    the keyword can be guessed. For this example, Kasiski's method would create
    Table 5.1.

    Table 5.1 Kasiski's Technique

    Repeated bigram   Location   Distance   Factors
    KS                9          9          3, 9
    SM                10         9          3, 9
    ME                11         9          3, 9
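
    The rows of Table 5.1 can be reproduced with a short script. The following is
    a minimal sketch, not part of the original work; the function names
    repeated_ngram_distances and factors are illustrative assumptions.

```python
from collections import defaultdict


def repeated_ngram_distances(ciphertext, n=2):
    """Map each repeated n-gram to the distances between consecutive occurrences."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i)
    return {
        gram: [b - a for a, b in zip(pos, pos[1:])]
        for gram, pos in positions.items()
        if len(pos) > 1
    }


def factors(distance):
    """Factors of a distance greater than 1, i.e. candidate keyword lengths."""
    return [f for f in range(2, distance + 1) if distance % f == 0]


# Ciphertext of the example above, with the grouping spaces removed.
ciphertext = "KSMEHZBBLKSMEMPOGAJXSEJCSFLZSY"
distances = repeated_ngram_distances(ciphertext)
for gram, ds in distances.items():
    print(gram, ds, [factors(d) for d in ds])
```

    For this ciphertext only the bigrams KS, SM and ME repeat, each at a distance
    of 9 with factors 3 and 9, in agreement with Table 5.1.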

    Factoring the distances between repeated bigrams is a way of
    identifying possible keyword lengths, with those factors that occur most
    frequently being the best candidates for the length of the keyword. Note that
    in this example, since 3 is also a factor of 9 (and of any of its multiples),
    both 3 and 9 are reasonable candidates for the keyword length. Although in this
    example there is no clear favorite, the possibilities have been narrowed down
    to a very small list. Note also that if a longer ciphertext were encrypted with
    the same keyword ('RELATIONS'), repeated bigrams would be expected at
    distances that are multiples of 9 (18, 27, 81, etc.). These would also have 3
    as a factor. Kasiski's important contribution was to note this phenomenon of
    repeated bigrams and to propose a method, factoring of distances, to analyze it.
    Once the keyword length is known, Kasiski's method for finding the
    correct key works as follows:

    If the keyword is N letters long, then every Nth letter must be

    enciphered using the same letter of the key text.

    Grouping every Nth letter together, the analyst has N

    messages, each encrypted using a one-alphabet substitution,

    and each piece can then be solved using frequency analysis.
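
    The grouping step can be sketched as follows. This is an illustrative sketch
    under stated assumptions: the names split_columns and guess_key_letter are
    hypothetical, the alphabet is the plain 26-letter one, and the most common
    letter of each column is simply assumed to stand for plaintext 'E'.

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def split_columns(ciphertext, key_length):
    """Group every key_length-th letter; column j was enciphered with key letter j."""
    return [ciphertext[j::key_length] for j in range(key_length)]


def guess_key_letter(column, assumed_top="E"):
    """Guess a column's key letter, assuming its most common letter encrypts assumed_top."""
    top = Counter(column).most_common(1)[0][0]
    shift = (ALPHABET.index(top) - ALPHABET.index(assumed_top)) % len(ALPHABET)
    return ALPHABET[shift]


columns = split_columns("KSMEHZBBLKSMEMPOGAJXSEJCSFLZSY", 9)
print(columns[0])  # letters enciphered with the first key letter
```

    With only three or four letters per column, as in this example, the single
    most common letter is an unreliable guide, which is why longer ciphertexts
    are needed in practice.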

    The drawbacks of the Kasiski method are the difficulty of finding
    repeated strings and the time needed to guess the key length. Also, for
    short messages there are often several good candidates for English 'E' in each
    column. This requires testing multiple hypotheses, which can be
    tedious and time-consuming.

    5.3 MODIFIED VIGENERE TABLE

    In this work the space character (written as _) is included in the
    alphabet. Table 5.2 shows the modified Vigenere tableau.

    Table 5.2 The modified Vigenere tableau

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _

    A A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _

    B B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ A

    C C D E F G H I J K L M N O P Q R S T U V W X Y Z _ A B

    D D E F G H I J K L M N O P Q R S T U V W X Y Z _ A B C

    E E F G H I J K L M N O P Q R S T U V W X Y Z _ A B C D

    F F G H I J K L M N O P Q R S T U V W X Y Z _ A B C D E

    G G H I J K L M N O P Q R S T U V W X Y Z _ A B C D E F

    H H I J K L M N O P Q R S T U V W X Y Z _ A B C D E F G

    I I J K L M N O P Q R S T U V W X Y Z _ A B C D E F G H

    J J K L M N O P Q R S T U V W X Y Z _ A B C D E F G H I

    K K L M N O P Q R S T U V W X Y Z _ A B C D E F G H I J

    L L M N O P Q R S T U V W X Y Z _ A B C D E F G H I J K

    M M N O P Q R S T U V W X Y Z _ A B C D E F G H I J K L

    N N O P Q R S T U V W X Y Z _ A B C D E F G H I J K L M

    O O P Q R S T U V W X Y Z _ A B C D E F G H I J K L M N

    P P Q R S T U V W X Y Z _ A B C D E F G H I J K L M N O

    Q Q R S T U V W X Y Z _ A B C D E F G H I J K L M N O P

    R R S T U V W X Y Z _ A B C D E F G H I J K L M N O P Q

    S S T U V W X Y Z _ A B C D E F G H I J K L M N O P Q R

    T T U V W X Y Z _ A B C D E F G H I J K L M N O P Q R S

    U U V W X Y Z _ A B C D E F G H I J K L M N O P Q R S T

    V V W X Y Z _ A B C D E F G H I J K L M N O P Q R S T U

    W W X Y Z _ A B C D E F G H I J K L M N O P Q R S T U V

    X X Y Z _ A B C D E F G H I J K L M N O P Q R S T U V W

    Y Y Z _ A B C D E F G H I J K L M N O P Q R S T U V W X

    Z Z _ A B C D E F G H I J K L M N O P Q R S T U V W X Y

    _ _ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
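
    Each row of Table 5.2 is the 27-symbol alphabet rotated by the position of
    the row label, so looking up an entry amounts to addition modulo 27. The
    sketch below illustrates this; the function names are assumptions made for
    illustration, and it assumes the row is selected by the key letter and the
    column by the plaintext letter.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"  # 27 symbols; "_" stands for the space


def tableau_row(key_letter):
    """Row of the modified tableau labelled by key_letter."""
    shift = ALPHABET.index(key_letter)
    return ALPHABET[shift:] + ALPHABET[:shift]


def encrypt(plaintext, key):
    """Add the repeating key to the plaintext, symbol by symbol, modulo 27."""
    n = len(ALPHABET)
    return "".join(
        ALPHABET[(ALPHABET.index(p) + ALPHABET.index(key[i % len(key)])) % n]
        for i, p in enumerate(plaintext)
    )


def decrypt(ciphertext, key):
    """Subtract the repeating key from the ciphertext, symbol by symbol, modulo 27."""
    n = len(ALPHABET)
    return "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(key[i % len(key)])) % n]
        for i, c in enumerate(ciphertext)
    )


print(tableau_row("B"))
print(encrypt("TO_BE", "KEY"))
```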

    5.4 PROPOSED METHOD FOR GUESSING THE LENGTH OF

    KEY IN VIGENERE CIPHER

    A new method is proposed to find the key length. It is suitable for
    texts larger than 500 bytes. The main idea is to use the frequency of
    bigrams and trigrams as the cost function of a Genetic Algorithm with a
    small number of parameters. The first candidate key length is two. The
    Genetic Algorithm is run with a small population size and a small number
    of generations, and the resulting fitness value is saved. The key length is
    then increased to three, and the new fitness value is compared with that of
    the previous key length; if the new fitness is better, it is kept as the best
    so far. The procedure continues up to some assumed maximum key length,
    for example 35. The key length with the best fitness is expected to be the
    correct one, and it is passed to the proposed algorithm for cryptanalysis of
    the Vigenere cipher to recover the key letters and the plaintext. If the
    decrypted text is not readable after the algorithm terminates, the search
    for the key length is continued from the assumed key length (35 in this
    example).
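
    The outer loop of this procedure can be sketched as below. This is only a
    sketch: the short Genetic Algorithm run is replaced by a hypothetical
    callback, best_fitness_for_length, and the stand-in used here simply looks
    up the fitness values reported in Table 5.5 instead of running a search.

```python
def guess_key_length(ciphertext, best_fitness_for_length, max_length=35):
    """Try key lengths 2..max_length and keep the one with the best fitness."""
    best_length, best_fitness = None, float("-inf")
    for length in range(2, max_length + 1):
        fitness = best_fitness_for_length(ciphertext, length)
        if fitness > best_fitness:  # a larger fitness is assumed to be better
            best_length, best_fitness = length, fitness
    return best_length, best_fitness


# Stand-in for the short GA run: the fitness values listed in Table 5.5.
TABLE_5_5 = {2: 32, 3: 78, 4: 49, 5: 81, 6: 85, 7: 50, 8: 67, 9: 52, 10: 91,
             11: 73, 12: 70, 13: 87, 14: 65, 15: 151, 16: 71, 17: 72, 18: 98,
             19: 75, 20: 93}


def lookup(ciphertext, length):
    return TABLE_5_5[length]


result = guess_key_length("", lookup, max_length=20)
print(result)
```

    With the Table 5.5 values the loop selects key length 15, whose fitness of
    151 is the largest.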

    5.5 PROPOSED ALGORITHM FOR CRYPTANALYSIS OF

    VIGENERE CIPHER

    The following is an outline of the proposed algorithm. Note that the
    algorithm uses two types of fitness function: frequency analysis using
    Equation 5.1 and a score function using Table 5.4.

    1. Inputs to the algorithm are the ciphertext, the key size, the
       relative character frequencies, and the table of common bigrams and
       trigrams.
    2. Initialize the algorithm parameters: maximum number of
       iterations.
    3. Generate 10 random keys, each of the known key length.
    4. Decrypt the ciphertext using each of the 10 generated keys.
    5. Calculate the suitability of each key from its decrypted text
       using the frequency analysis formula (Equation 5.1) or the score in
       Table 5.4.
    6. Sort the keys in increasing order of fitness value for the first type,
       or in decreasing order of fitness value for the score function.
    7. For 1 to (maximum number of iterations) do:
       Choose 5 pairs from the 10 keys.
       For 1 to (5 pairs) do:
          i. Apply crossover to get children.
          ii. Generate a random number from 2 to (key size - 1).
          iii. Swap the parts of the parents, for example:
               Parent 1  sungti | hutior
               Parent 2  subdti | dution
               Child 1   sungtidution
               Child 2   subdtihutior
          iv. Generate a random position number between 1 and the key
              size for each child and mutate the letter in that
              position.
       Decrypt the ciphertext with the 20 keys.
       Calculate the fitness value for each key.
       Sort the 20 keys in increasing (or decreasing) order of fitness
       value, as described in step 6.
       Choose the best 10 keys.
       Go to 7.
    8. Output the best solution.
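
    One iteration of step 7 can be sketched as follows. This is a minimal
    sketch, not the code used in this work: the function names are
    hypothetical, the pairing of parents by shuffling is an assumption (the
    outline does not say how the 5 pairs are chosen), and the fitness function
    is passed in by the caller.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def crossover(parent1, parent2, cut):
    """One-point crossover: the children swap the parts after position cut."""
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]


def mutate(key, rng):
    """Replace the letter at one random position (it may draw the same letter)."""
    pos = rng.randrange(len(key))
    return key[:pos] + rng.choice(ALPHABET) + key[pos + 1:]


def next_generation(keys, fitness, rng, higher_is_better=True):
    """Parents -> mutated children -> best len(keys) of parents and children."""
    parents = keys[:]
    rng.shuffle(parents)  # assumption: pairs are formed at random
    children = []
    for p1, p2 in zip(parents[0::2], parents[1::2]):
        cut = rng.randint(2, len(p1) - 1)  # random number from 2 to key size - 1
        for child in crossover(p1, p2, cut):
            children.append(mutate(child, rng))
    pool = keys + children
    # Use higher_is_better=False for a cost such as Equation 5.1.
    return sorted(pool, key=fitness, reverse=higher_is_better)[:len(keys)]


print(crossover("sungtihutior", "subdtidution", 6))
```

    The call at the end reproduces the crossover example of step iii with the
    cut after the sixth letter. Because the parents stay in the pool, the best
    fitness cannot get worse from one generation to the next.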

    The algorithm is illustrated using an example. The 10 random keys

    are listed as:

    AAIAJFNFSYHL

    CSOBEVRTVYFL

    YVYVLPRPOSCK

    DRMMOIWGEKSE

    NQQAUGLKJPPH

    EUHIKGKBMZAK

    BBGZJKFFFGTX

    XTLNADLMPGCS

    QXWBONQUGFGI

    ODPAYFVQSTOO

    After one generation the output is:

    EBHIKHKUMZPG

    ODPAYFVQSTOO

    DRMMOIWGEKSE

    EUHIKGKBMZAK

    OXWBONQSQTQO

    YVYEOGWIVKSE

    BBGZJKFFFGTX

    ZAOBEVRTVYFA

    NQQAUGLKJPPH

    AQQAPGLKJUNK

    XTMNADLPLGTX

    AAIAJFNFSYHL

    QXWBONQUGFGI

    CSFAJISFNYHL

    BBGZSKFJFGCF

    UDPAYVFOGFGI

    DRPMLMSPORCK

    XTLNADLMPGCS

    YVYVLPRPOSCK

    CSOBEVRTVYFL

    After 30 generations the output is:

    BUBSTHXFHIOO

    JUBSTHXFCIOO

    HIBSTQFUTIBN

    HUBSFHXHTIOO

    JUBSFHXHTIOO

    .

    .

    After 100 generations output is:

    SUBSTIUUTBON

    SUBSTUBUTION

    SUBSTUBUTION

    SUBSTUDUTION

    .

    .

    The final solution is (SUBSTIUUTBON) for a ciphertext of 1500 bytes
    after 100 generations.

    5.6 IMPLEMENTING THE CRYPTANALYSIS OF VIGENERE

    CIPHER

    The attack is implemented by generating 10 independent candidate keys
    for the target key. The first generation is generated randomly using a
    simple uniform random number generator. The fitness value is accumulated
    and finally normalized by the number of pairs; the termination criterion is
    the number of generations. The Genetic Algorithm then proceeds in the usual
    way to produce new generations and terminates according to this criterion.
    The attack continues to improve the fitness in order to reach the best key.
    The following functions are used in the code:

    Void Encrypt()
        Performs encryption.

    Void Decrypt()
        Performs decryption, taking the key as input.

    Void Keygen()
        Creates the initial population by generating n keys randomly.

    Void Getfitness()
        Measures the fitness of a particular chromosome in the
        population set, indexed by its position in the population.

    Void Sorting()
        Sorts the population of chromosomes by fitness value. (A chromosome
        is the genetic material of an individual; it represents the information
        about a possible solution to the given problem.)

    Void Crossover()
        Performs crossover between chromosomes and stores the results
        in the new population set as indexed by pos1, pos2.

    Void Mutation()
        Mutates the newly generated chromosomes.

    5.7 FITNESS MEASURE

    Two types of cost functions are used to calculate the fitness value.

    5.7.1 Based on frequency analysis

    The keys are evaluated by comparing the n-gram statistics of the
    decrypted message with the standard n-gram frequencies of the
    English language.

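    \[
    C_k = \alpha \sum_{i \in A} \left| K^{u}_{(i)} - D^{u}_{(i)} \right|
        + \beta \sum_{i,j \in A} \left| K^{b}_{(i,j)} - D^{b}_{(i,j)} \right|
        + \gamma \sum_{i,j,k \in A} \left| K^{t}_{(i,j,k)} - D^{t}_{(i,j,k)} \right|
    \qquad (5.1)
    \]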

    Equation (5.1) is a general formula used to calculate the suitability
    of a proposed key k. Here, A denotes the language alphabet [A, ..., Z, _], K
    and D denote the known language statistics and the decrypted message
    statistics respectively, and the indices u, b and t denote the unigram,
    bigram and trigram statistics respectively. The values of α, β and γ are the
    weights given to each of the three n-gram types, and can be set differently
    for unigrams, bigrams and trigrams (Dimovski and Gligoroski 2003a). The main
    characteristic of the algorithm is its ability to direct the random search
    process of the Genetic Algorithm by selecting the fittest chromosomes in
    the population. Evaluation of the fitness of each key relies on the
    statistical characteristics of the language. For example, the letter "E" is
    the most common letter in English, so the fitness of a key can be measured
    by how likely it is to give the correct letter frequencies in the
    deciphered text. Hence, the choice of fitness function is the main factor in
    the algorithm.
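
    A cost of the form of Equation 5.1 can be sketched in code as follows. This
    is a sketch under stated assumptions: the names ngram_freqs and cost are
    hypothetical, the statistics are taken to be relative frequencies, and the
    default weights of 1.0 are placeholders rather than values used in this
    work.

```python
from collections import Counter


def ngram_freqs(text, n):
    """Relative frequency of each n-gram in text (overlapping windows)."""
    total = len(text) - n + 1
    counts = Counter(text[i:i + n] for i in range(total))
    return {gram: c / total for gram, c in counts.items()}


def cost(decrypted, known, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of |K - D| over unigrams, bigrams and trigrams.

    known holds three dicts: reference unigram, bigram and trigram frequencies.
    weights plays the role of the three n-gram weights in Equation 5.1.
    """
    total = 0.0
    for n, (weight, reference) in enumerate(zip(weights, known), start=1):
        observed = ngram_freqs(decrypted, n)
        grams = set(reference) | set(observed)
        total += weight * sum(
            abs(reference.get(g, 0.0) - observed.get(g, 0.0)) for g in grams
        )
    return total


sample = "THE_CAT_AND_THE_HAT"
known = tuple(ngram_freqs(sample, n) for n in (1, 2, 3))
print(cost(sample, known), cost("ZZZZZZZZ", known))
```

    A decrypted text whose statistics match the reference exactly has cost zero,
    so with this function a lower value indicates a better key.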

    5.7.2 Based on common bigrams and trigrams

    In the process of determining the cost associated with a Vigenere
    cipher key, the proposed key is used to decrypt the ciphertext and then the
    statistics of the decrypted message are compared with the statistics of the
    language. Matthews proposed an intuitive alternative: instead of using all
    possible bigrams and trigrams, a subset of the most common ones is chosen
    (Matthews 1993).

    The method of Matthews is to list a number of the most common
    bigrams and trigrams and to assign a weight (or score) to each of them. In
    addition, the trigram EEE was included in the list and assigned a negative
    score. The idea behind this is as follows. Since E is very common in
    English, it could be expected that a plaintext message might contain a
    relatively high number of Es. Since runs of three Es (EEE) never occur
    normally in the English language, it makes sense to assign this trigram a
    negative score. Each weight is applied to the frequency of the
    corresponding bigram or trigram in the decrypted message.
    Table 2.1 shows the weight table used by Matthews in his paper.
    The bigrams, trigrams and weights were modified by Andrew John Clark
    (Clark 1994) to the values shown in Table 5.3. Notice that the bigram __
    (two consecutive spaces) and the trigram ___ (three consecutive spaces)
    have been included. Some of the other bigrams and trigrams are slightly
    different, because the space symbol was included in the
    encryption in (Clark 1994) (and was not in the work by Matthews) and also
    because different sets of language statistics were used.

    Table 5.3 The fitness weight proposed by Clark

    Bi/trigram   Score   Bi/trigram   Score
    E_           +2      S_           +1
    _T           +1      __           -6
    HE           +1      _TH          +5
    TH           +1      THE          +5
    _A           +1      HE_          +5
    ___          -10

    In this work, the scoring method used by Clark (1994) and Matthews
    (1993) is modified to evaluate the cost value of the fitness function:
    EEE, AND and ING are added to Clark's table. Table 5.4 shows the
    weight table used in this work.

    Table 5.4 The fitness weight proposed in this research

    Bi/trigram   Score   Bi/trigram   Score
    EEE          -5      ING          +5
    E_           +2      S_           +1
    _T           +1      __           -6
    HE           +1      _TH          +5
    TH           +1      THE          +5
    _A           +1      HE_          +5
    ___          -10     AND          +5
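
    The score function based on Table 5.4 can be sketched as below. The weights
    are those of Table 5.4; the function name score, the use of raw occurrence
    counts as the frequency, and the counting of overlapping occurrences are
    assumptions made for illustration.

```python
# Weights from Table 5.4; "_" stands for the space.
WEIGHTS = {
    "EEE": -5, "ING": 5,
    "E_": 2, "S_": 1,
    "_T": 1, "__": -6,
    "HE": 1, "_TH": 5,
    "TH": 1, "THE": 5,
    "_A": 1, "HE_": 5,
    "___": -10, "AND": 5,
}


def score(text):
    """Sum of weight * number of occurrences over the listed bigrams and trigrams."""
    total = 0
    for gram, weight in WEIGHTS.items():
        n = len(gram)
        # Count overlapping occurrences of gram in text.
        count = sum(1 for i in range(len(text) - n + 1) if text[i:i + n] == gram)
        total += weight * count
    return total


print(score("THE_"), score("EEE"), score("___"))
```

    For the text THE_ the matching entries are TH, HE, THE, HE_ and E_, giving
    1 + 1 + 5 + 5 + 2 = 14. With this function a higher value indicates a better
    key.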

    5.8 RESULT OF VIGENERE CIPHER

    The result of the key-length guessing method is tabulated in Table 5.5
    as a comparison of the fitness of the proposed key lengths. The population
    size is six, the number of generations is 200, and the text size is 2 kb.

    The fitness value 151 is the largest value. It was calculated
    using the frequency of the bigrams and trigrams in Table 5.4. The proposed
    key length 15, which gave the best fitness value, is passed to the next
    stage as the expected key length in the code that finds the keyword.

    Table 5.5 Result of Guessing Key Length

    Proposed key length   Fitness value
    2                     32
    3                     78
    4                     49
    5                     81
    6                     85
    7                     50
    8                     67
    9                     52
    10                    91
    11                    73
    12                    70
    13                    87
    14                    65
    15                    151
    16                    71
    17                    72
    18                    98
    19                    75
    20                    93

    The results for the Vigenere cipher are tabulated in Table 5.6 as a
    comparison of performance for different ciphertext lengths and different key
    lengths. The number of generations is limited to 1000.

    Table 5.6 Result of Vigenere cipher

    Ciphertext size   Key      Number of recovered key letters   Time (in seconds)
    (in characters)   length   Score function   Equation 5.1     Score   Equation 5.1
    2000              20       19               15               119     124
    3000              25       22               21               131     139
    4000              30       24               21               126     150

    The time taken for cryptanalysis of the Vigenere cipher is lower with
    the score cost function. The score function also recovers more key
    letters than frequency analysis.

    Results were obtained for several cases: a partial key obtained
    using Equation 5.1 as the fitness function, and a complete key obtained
    using the score in Table 5.4 as the fitness function. Table 5.7 shows that
    the key is completely obtained using Equation 5.1 with 1kb text if key size

    Table 5.7 Partial key obtained using Equation 5.1 with 1 kb text

    Key length   Key given   Key obtained   Fitness value   Time (sec)
    1            Z           Z              6.55            < 1
    2            GF          GF             6.51            3
    3            ARN         ARN            6.52            3
    4            RAGH        RAGH           6.54            13
    5            TOEME       TOEME          6.51            12
    6            COLLEG      COLLEG         6.51            13
    7            GOVERNM     EOVERNM        7.00            61
    8            COMPUTER    COMWUTER       6.49            100

    Table 5.8 Complete key obtained using score with 1kb text

    Key length   Key given   Key obtained   Fitness value   Time (sec)
    1            Z           Z              451

    Table 5.9 Partial key obtained using Equation 5.1 with 2 kb text

    Key length   Key given   Key obtained   Fitness value   Time (sec)
    1            Z           Z              12.93

    Table 5.11 shows that the key is completely obtained using

    Equation 5.1 with 3 kb text if key size

    Table 5.12 Complete key obtained using score with 3kb text

    Key length   Key given   Key obtained   Fitness value   Time (sec)
    1            Z           Z              1236

    5.9 A PARALLEL GENETIC ALGORITHM FOR

    CRYPTANALYSIS OF POLYALPHABETIC

    SUBSTITUTION CIPHER

    Dimovski and Gligoroski (2003a) proposed a number of Genetic
    Algorithms running in parallel for cryptanalysis of the polyalphabetic
    substitution cipher. Each Genetic Algorithm is used to solve a different part
    of the problem. Figure 5.1 is a pictorial representation of this method, with
    M GAs running in parallel and communicating every k iterations.
    Figure 5.1 shows a polyalphabetic substitution cipher consisting of M
    simple substitution ciphers.

    [Figure 5.1 A parallel Genetic Algorithm. Diagram boxes: Initialization;
    GA_0^1, GA_0^2, ..., GA_0^M; GA_1^1, GA_1^2, ..., GA_1^M;
    GA_k^1, GA_k^2, ..., GA_k^M; GA_X^1, GA_X^2, ..., GA_X^M; Output.]

    To solve these M substitution ciphers, M Genetic Algorithms
    (GA1, GA2, ..., GAM) are used, with GAj attempting to find the key to the
    cipher of position j. In determining the cost of each of the solutions in
    its pool, GAj uses the current best key from each of its neighbors to find
    the bigram and trigram statistics.

    5.9.1 Suitability Assessment

    Two ways of evaluating the fitness are used in this work.
    The first uses Equation 5.1. The second uses the score
    function based on Table 5.4.

    5.9.2 The Reproduction process

    Spillman et al (1993) proposed a mating function that relies on an
    ordering of the key. The characters in the key string are ordered such that
    the most frequent character in the ciphertext is mapped to the first element
    of the key (upon decryption), the second most frequent character in the
    ciphertext is mapped to the second element of the key, and so on. For
    example, the key NVHIZKLMFBPURS_QGTWXYJAOCED indicates that the most
    frequent character in the ciphertext represents a plaintext N, the second
    most frequent character in the ciphertext represents a plaintext V, etc.
    Given two parents constructed in this manner, the first element of the first
    child is chosen to be whichever of the first elements of the two parents is
    more frequent in the known language statistics. This process continues
    from left to right along the parents to create the first child. If,
    at any stage, the selected character already appears in the child being
    constructed, the second choice is used. If both of the characters in the
    parents for a given key position already appear in the child, then a
    character is chosen at random from the set of characters that do not
    already appear in the newly constructed child. The second child is formed in
    a similar manner, except that the direction of creation is from right to
    left and, in this case, the less frequent of the two parent elements is
    chosen.

    Parent 1

    NVHIZKLMFBPURS_QGTWXYJAOCED

    Parent 2

    MNOPCDEFGHIJKLQRSTXY_ZABUVW

    Child 1

    NVHICKEFGBPURS_QWTXYAZDOJML

    Child 2

    XNOPZGEMFHICKLQRSTWY_JABUVD
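
    The mating function described above can be sketched as follows. This is an
    illustrative sketch: the name mate and its parameters are hypothetical, and
    rank is an assumed mapping from each character to its frequency rank in the
    language statistics (0 for the most frequent). The children listed above
    depend on the language statistics used in the cited work, so this sketch is
    not expected to reproduce them.

```python
import random


def mate(parent1, parent2, rank, rng, reverse=False):
    """Build one child from two parent keys (permutations of the same alphabet).

    reverse=False builds the first child (left to right, more frequent first);
    reverse=True builds the second child (right to left, less frequent first).
    """
    size = len(parent1)
    child = [None] * size
    used = set()
    order = range(size - 1, -1, -1) if reverse else range(size)
    for pos in order:
        # Preferred choice first, second choice after it.
        choices = sorted((parent1[pos], parent2[pos]),
                         key=lambda c: rank[c], reverse=reverse)
        pick = next((c for c in choices if c not in used), None)
        if pick is None:
            # Both parent characters are already in the child: pick an unused one.
            pick = rng.choice(sorted(set(parent1) - used))
        child[pos] = pick
        used.add(pick)
    return "".join(child)


rank = {c: i for i, c in enumerate("ABCD")}  # toy statistics: A most frequent
rng = random.Random(1)
print(mate("BCDA", "CBAD", rank, rng))
print(mate("BCDA", "CBAD", rank, rng, reverse=True))
```

    Each child is again a permutation of the alphabet, since a character is
    never placed twice.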

    5.10 DESCRIPTION OF THE ALGORITHM

    The algorithm has been implemented with two different ways of
    calculating the fitness: the first is the same as the existing method (using
    Equation 5.1) and the second uses the score of bigrams and trigrams
    in Table 5.4:

    1. The inputs of the algorithm are language statistics for

    unigrams, bigrams and trigrams, the ciphertext, the block size

    (B=3 as example) and this GA's position within the block, j

    (1

    3. Calculate the cost for each key using unigram statistics only
    and sort the keys based on their fitness values.

    4. For iteration i (i = 1. . . G) Do:

    a) If i mod k= 0 send the best key from current pool to each

    of the neighbouring GAs (i.e., the GAs solving for

    positions j-1 and j + 1). Also receive the best keys from

    each of these GAs.

    b) Select first 6 pairs for each position of solutions from

    current pool to be the parents of the new generation.

    c) Mate using each pair of parents to produce 12 children

    that become the new generation (new pool) for each

    position.

    d) Mutate each of the children in new pool using the same

    swapping procedure as described in the attack on the

    simple substitution cipher (Spillman et al, 1993).

    e) Calculate the cost of each of the children in the new pool
    using the neighbouring keys obtained in Step 4a and either
    Equation 5.1 or the score of bigrams and trigrams in Table 5.4.

    f) Select the 12 best keys from the two pools current and

    new. Replace the current solutions in current pool with

    these solutions.

    5. Output the best key from current pool.

    5.11 RESULTS OF POLYALPHABETIC SUBSTITUTION

    CIPHER

    The key for each position has 27 characters and the block size is
    three, so the total key size is 3*27 = 81 characters.

    The algorithm has been implemented using two types of fitness
    value: based on frequency analysis as in Equation 5.1, and based on the
    score in Table 5.4. The results are tabulated in Table 5.13.

    Table 5.13 Results of proposed algorithm and existing method

                Frequency                          Score
    Text size   Key recovered   Text recovered %   Key recovered   Text recovered %
    500         12              31.47              13              35.85
    1000        16              44.71              20              45.00
    1500        25              51.76              39              75.11
    2000        57              88.96              65              93.1

    The results show that more key letters and more plaintext letters
    are recovered with the score-based method. Figures 5.2 and 5.3 plot the
    recovered key letters against text size and the recovered text letters
    against text size. Using the score function to calculate the fitness value
    in the proposed algorithm improves the recovery of key letters and text
    letters.

    [Plot: recovered key letters vs. ciphertext size. x-axis: ciphertext size
    (0 to 3000); y-axis: recovered key letters (0 to 70); series: Frequency
    method, Score fitness method.]

    Figure 5.2 Recovered key letters versus plaintext size

    [Plot: recovered plaintext letters percentage vs. ciphertext size. x-axis:
    ciphertext size (0 to 3000); y-axis: recovered plaintext % (0 to 100);
    series: Frequency method, Score fitness method.]

    Figure 5.3 Recovered plaintext letters percentage versus ciphertext size