chapter 5 cryptanalysis of vigenere cipher and ...shodhganga.inflibnet.ac.in/bitstream/10603/26543/10/10_chapter5.pdf · 84 CHAPTER 5 CRYPTANALYSIS OF VIGENERE CIPHER AND SUBSTITUTION

84

CHAPTER 5

CRYPTANALYSIS OF VIGENERE CIPHER AND

SUBSTITUTION CIPHER

5.1 INTRODUCTION

This chapter describes the methods of cryptanalysis of vigenere

cipher and substitution cipher. For cryptanalysis of vigenere cipher the first

step is finding the length of key. Two types of guessing the key

are explained. The first type is Kasiski method (www.trincoll.edu/depts/cpsc/

cryptography/vigenere.html) and the second type is using Genetic Algorithm.

After the key length is obtained, the proposed algorithm will be applied on

ciphertext to get the correct key and to recover the original plaintext. The

results of cryptanalysis of Vigenere cipher are given and compared with

existing method. In the cryptanalysis of substitution cipher parallel Genetic

Algorithm is used and a different fitness function is used in the proposed

algorithm. The results are presented and compared with existing method.

5.2 THE KASISKI/KERCKHOFF METHOD

Vigenere-like substitution ciphers were regarded by many as

practically unbreakable for 300 years. In 1863, a Prussian Major named

Kasiski proposed a method for breaking a Vigenere cipher that consisted of

finding the length of the keyword and then dividing the message into that

many simple substitution cryptograms. Frequency analysis could then be used

to solve the resulting simple substitutions. Kasiski's technique for finding the

85

length of the keyword was based on measuring the distance between repeated

bigrams in the ciphertext. For example:

Keyword: RELAT IONSR ELATI ONSRE LATIO NSREL

Plaintext: TOBEO RNOTT OBETH ATIST HEQUE STION

Ciphertext: KSMEH ZBBLK SMEMP OGAJX SEJCS FLZSY

The bigram 'TO' occur twice in the plaintext at position 0 and 9 and

in both cases it lines up perfectly with the first two letters of the keyword.

Because of this it produces the same ciphertext bigram, 'KS.' The same can be

said of plaintext 'BE' which occurs twice starting at positions 2 and 11, and

also is encrypted with the same ciphertext bigram, 'ME.' In fact, any message

encrypted with a Vigenere cipher will produce many such repeated bigrams.

Although not every repeated bigram will be the result of the encryption of the

same plaintext bigram, many will and this provides the basis for breaking the

cipher, by measuring and factoring the distances between recurring bigrams. In

this case the distance is 9. Kasiski was able to guess the length of the keyword.

For this example, the Kasiskis method would create Table 5.1.

Table 5.1 Kasiskis Technique

Repeated

Bigram Location Distance Factors

KS 9 9 3,9

SM 10 9 3,9

ME 11 9 3,9

86

Factoring the distances between repeated bigrams is a way of

identifying possible keyword lengths, with those factors that occur most

frequently being the best candidates for the length of the keyword. Note that

in this example since 3 is also a factor of 9 (and any of its multiples) both

3 and 9 would be reasonable candidates for keyword length. Although in this

example there is no clear favorite, the possibilities have been narrowed down

to a very small list. Note also that if a longer ciphertext were encrypted with

the same keyword ('RELATIONS'), repeated bigrams have been expected to

be finding at multiples of 9, 18, 27, 81, etc. These would also have 3 as a

factor. Kasiski's important contribution is to note this phenomenon of

repeated bigrams and propose a method, factoring of distances, to analyze it.

Once the keyword length is known, the method of Kasiski in findig the

correct key work as follows:

If the keyword is N letters long, then every Nth letter must be

enciphered using the same letter of the key text.

Grouping every Nth letter together, the analyst has N

messages, each encrypted using a one-alphabet substitution,

and each piece can then be solved using frequency analysis.

The drawbacks of using Kasiski method is the difficulty of finding

repeated strings, and involving more time to guess the key length. Also for

short messages there are often several good candidates for English 'E' in each

column. This requires the testing of multiple hypotheses, which can get quite

tedious and involve more time.

5.3 MODIFIED VIGENERE TABLE

In this work the space is included; Table 5.2 shows the modified

Vigenere tableaux.

87

Table 5.2 The modified Vigenere tableaux

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _

A A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _

B B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ A

C C D E F G H I J K L M N O P Q R S T U V W X Y Z _ A B

D D E F G H I J K L M N O P Q R S T U V W X Y Z _ A B C

E E F G H I J K L M N O P Q R S T U V W X Y Z _ A B C D

F F G H I J K L M N O P Q R S T U V W X Y Z _ A B C D E

G G H I J K L M N O P Q R S T U V W X Y Z _ A B C D E F

H H I J K L M N O P Q R S T U V W X Y Z _ A B C D E F G

I I J K L M N O P Q R S T U V W X Y Z _ A B C D E F G H

J J K L M N O P Q R S T U V W X Y Z _ A B C D E F G H I

K K L M N O P Q R S T U V W X Y Z _ A B C D E F G H I J

L L M N O P Q R S T U V W X Y Z _ A B C D E F G H I JK

M M N O P Q R S T U V W X Y Z _ A B C D E F G H I J K L

N N O P Q R S T U V W X Y Z _ A B C D E F G H I J K L M

O O P Q R S T U V W X Y Z _ A B C D E F G H I J K L M N

P P Q R S T U V W X Y Z _ A B C D E F G H I J K L M N O

Q Q R S T U V W X Y Z _ A B C D E F G H I J K L M N O P

R R S T U V W X Y Z _ A B C D E F G H I J K L M N O P Q

S S T U V W X Y Z _ A B C D E F G H I J K L M N O P Q R

T T U V W X Y Z _ A B C D E F G H I J K LM N O P Q R S

U U V W X Y Z _ A B C D E F G H I J K L M N O P Q R S T

V V W X Y Z _ A B C D E F G H I J K L M N O P Q R S T U

W W X Y Z _ A B C D E F G H I J K L M N O P Q R S T U V

X X Y Z _ A B C D E F G H I J K L M N O P Q R S T U V W

Y Y Z _A B C D E F G H I J K L M N O P Q R S T U V W X

Z Z _ A B C D E F G H I J K L M N O P Q R S T U V W X Y

_ _ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

88

5.4 PROPOSED METHOD FOR GUESSING THE LENGTH OF

KEY IN VIGENERE CIPHER

To find the key length, new method is proposed and this method is

suitable for the texts which are having the size more than 500 bytes. The main

idea in this method is to employ frequency of bigrams and trigrams as cost

function in Genetic Algorithm with few numbers of parameters. Genetic

Algorithm is applied here to find the key length. The first proposed key length

will be chosen as two and Genetic Algorithm operations are applied for small

population size and small number of generation, fitness value is saved for the

next generation and key length is then increased to three, again fitness value is

saved and compared with previous key length, if the new fitness is better, it

will be taken for next generation. The procedure is continued till some

assumed key length like 35 as example. The best solution is expected to be

the key length, this number will be used in the proposed algorithm for

cryptanalysis of Vigenere cipher to get the correct key letters and correct

plaintext. After termination of the algorithm, if the decrypted text is not

readable, the method of guessing the key length should be continued from the

assumed key length (35 as example).

5.5 PROPOSED ALGORITHM FOR CRYPTANALYSIS OF

VIGENERE CIPHER

The following is an outline of proposed algorithm: note that this

algorithm is used two types of fitness function, frequency analysis using the

Equation 5.1 and score function using Table 5.4.

89

1. Inputs to the algorithm are the ciphertext, the key size and

relative character frequencies, table of common bigrams and

trigrams.

2. Initialize the algorithm parameters: maximum number of

iterations.

3. Generate 10 keys randomly each one is having the same known

key length.

4. Decrypt the ciphertext by using the 10 generated keys.

5. Calculate the suitability of each key from every decrypted text

using the formula of frequency analysis or using the score in

Table 5.4.

6. Sort the keys based on the increased fitness values for first type

or based on decreased fitness value for score function.

7. for 1 to (maximum number of iteration) do:

Choose 5 pairs from 10 keys.

For 1 to (5 pairs) do

i. Apply crossover to get children

ii. Generate random number from 2 to (key size -1)

iii. Swap the parts of parents as example:

Parent 1 sungti | hutior

Parent 2 subdti | dution

Child1 sungtidution

Child 2 subdtihutior

iv. Generate random position number between (1 to key

size) for each child and mutate the letter in that

position.

Decrypt cipher text by 20 keys.

Calculate the fitness value for each key.

90

Sort the 20 keys based on increased (or decreased) fitness

values as mentioned in step no.5.

Choose best 10 keys.

Go to 7.

8. Output is the best solution.

The algorithm is illustrated using an example. The 10 random keys

are listed as:

AAIAJFNFSYHL

CSOBEVRTVYFL

YVYVLPRPOSCK

DRMMOIWGEKSE

NQQAUGLKJPPH

EUHIKGKBMZAK

BBGZJKFFFGTX

XTLNADLMPGCS

QXWBONQUGFGI

ODPAYFVQSTOO

After one generation the output is:

EBHIKHKUMZPG

ODPAYFVQSTOO

DRMMOIWGEKSE

EUHIKGKBMZAK

OXWBONQSQTQO

YVYEOGWIVKSE

BBGZJKFFFGTX

91

ZAOBEVRTVYFA

NQQAUGLKJPPH

AQQAPGLKJUNK

XTMNADLPLGTX

AAIAJFNFSYHL

QXWBONQUGFGI

CSFAJISFNYHL

BBGZSKFJFGCF

UDPAYVFOGFGI

DRPMLMSPORCK

XTLNADLMPGCS

YVYVLPRPOSCK

CSOBEVRTVYFL

After 30 generations the output is:

BUBSTHXFHIOO

JUBSTHXFCIOO

HIBSTQFUTIBN

HUBSFHXHTIOO

JUBSFHXHTIOO

.

.

After 100 generations output is:

SUBSTIUUTBON

SUBSTUBUTION

SUBSTUBUTION

92

SUBSTUDUTION

.

.

The final solution is (SUBSTIUUTBON) for ciphertext, which is

having 1500 bytes size for 100 generations.

5.6 IMPLEMENTING THE CRYPTANALYSIS OF VIGENERE

CIPHER

The attack is implemented by generating 10 independent keys to

represent the target key. The first generation is generated randomly using a

simple uniform random number generator. The fitness value is incremented

and finally normalized to the number of pairs, the criteria here is number of

generation. The Genetic Algorithm then goes in the normal way to generate

new generations. The algorithm is terminated based on the criteria described

earlier. The algorithm has been implemented to get fitness; essentially the

attack shall continue upward to get the best key. These functions are used in

the code:

Void Encrypt ()

This function performs encryption.

Void Decrypt ()

This function performs decryption taking input as key.

Void Keygen ()

This function creates the initial population it will generate n keys

randomly

93

Void Getfitness ()

This function measures fitness of a particular chromosome in the

population set indexed by its position in the population.

Void Sorting()

This function is responsible for sorting population of chromosomes

(The genetic material of an individual - represents the information about a

possible solution to the given problem) based on fitness value.

Void Crossover()

This function performs cross over between chromosomes and stores

them in the new population set as indexed by pos1, pos2.

Void Mutation ()

This function is responsible for mutation of the newly generated

chromosomes.

5.7 FITNESS MEASURE

Two types of cost functions are used to calculate the fitness value.

5.7.1 Based on frequency analysis

The method used to evaluate the keys is to compare n-gram

statistics of the decrypted message with the frequency of n-gram standard of

English language.

C k = .

Ai

u

i

u

i DK )()( + .

Aji

b

ji

b

ji DK,

),(),( + .

Akji

t

kji

t

kji DK,,

),,(),,( (5.1)

94

Equation (5.1) is a general formula used to calculate the suitability

of a proposed key (k). Here, A denotes the language alphabet [A. . . Z, _] K

and D denote known language statistics and decrypted message statistics

respectively and the indices u, b and t denote the unigram, bigram and trigram

statistics, respectively. The values of , and are the weights to each of the

three n-gram types which can be assumed with different values for unigrams,

brigrams and trigrams (Dimovski and Gligoroski 2003a). The main

characteristic of the algorithm is the ability of directing the random search

process of the Genetic Algorithm by selecting the fittest chromosomes among

the population. Evaluation of the fitness of the each key relied on the

language statistical characteristic. For example, the letter "E" is the most

common letter in English language, so the fitness of the key can be measured

based on how likely it is going to give correct letter frequency in the

deciphered text. Hence, the fitness function chosen is the main factor of the

algorithm.

5.7.2 Based on common bigrams and trigrams

In the process of determining the cost associated with a Vigenere

cipher key the proposed key is used to decrypt the ciphertext and then the

statistics of the decrypted message are compared with statistics of the

language. Matthews proposed an intuitive alternative. Instead of using all

possible bigrams and trigrams a subset of the most common ones are chosen

(Matthews 1993).

The method of Matthews is to list a number of the most common

bigrams and trigrams and to assign a weight (or score) to each of them. Also,

the trigram EEE was included in the list and assigned a negative score. The

idea behind this is interesting. Since E is very common in English, it could be

95

expected that a plaintext message might contain a relatively high number of

Es. Since these never occur normally in the English language, it makes sense

to assign such a trigram a negative score. Each weight is applied to the

frequency of the corresponding bigram or trigram in the decrypted message.

Table 2.1 shows the weight table used by Matthews in his paper.

The bigrams, trigrams and weights were modified by Andrew John Clark, in

(Clark 1994) to the values shown in Table 5.3 Notice that the bigram __

(two consecutive spaces) and the trigram ___ (three consecutive spaces)

have been included. Some of the other bigrams and trigrams are slightly

different, due to the fact that the space symbol has been included in the

encryption in (Clark 1994) (and was not in the work by Matthews) and also

due to different sets of language statistics used.

Table 5.3 The fitness weight proposed by Clark

Bi/trigram Score Bi/trigram Score

E_ +2 S_ +1

_T +1 __ -6

HE +1 _TH +5

TH +1 THE +5

_A +1 HE_ +5

___ -10

96

In this work, a modified method has used to evaluate the cost value

of the fitness function which is used by Clark (1994) and Matthews (1993),

EEE, AND and ING are included to the Andrew table, Table 5.4 shows the

weight table used in this work.

Table 5.4 The fitness weight proposed in this research

Bi/trigram Score Bi/trigram Score

EEE -5 ING +5

E_ +2 S_ +1

_T +1 __ -6

HE +1 _TH +5

TH +1 THE +5

_A +1 HE_ +5

___ -10 AND +5

5.8 RESULT OF VIGENERE CIPHER

The result of guessing method of the key length is tabulated as

comparison of fitness of some key lengths proposed, the population size is

six, number of generation is 200, and text size is 2 kb, as in Table 5.5.

The fitness value 151 is the biggest value, it has been calculated

using frequency of bigrams and trigrams in Table 5.4, the proposed key length

15 which gave the best fitness value will be submitted to next stage as

expected key length in the code to find the key word.

97

Table 5.5 Result of Guessing Key Length

Number of proposed key

length Fitness value

2 32

3 78

4 49

5 81

6 85

7 50

8 67

9 52

10 91

11 73

12 70

13 87

14 65

15 151

16 71

17 72

18 98

19 75

20 93

98

The results for Vigenere cipher are tabulated in Table 5.6 as a

comparison of performance for different ciphertext lengths and different key

lengths. The number of generation is limited to 1000.

Table 5.6 Result of Vigenere cipher

Cipher Text

Size

(in characters)

Key

length

Number of Recovered

key letters Time (in seconds)

Score

function

Equation

5.1 Score

Equation

5.1

2000 20 19 15 119 124

3000 25 22 21 131 139

4000 30 24 21 126 150

The time taken to cryptanalysis of Vigenere cipher is less by using

score cost function. Using score function gives better results of recovered key

letters in comparison with frequency analysis.

The results obtained for various cases such as partial key obtained

using Equation 5.1 as fitness function and completed key obtained using score

in Table 5.4 as fitness function. Table 5.7 shows that the key is completely

obtained using Equation 5.1 with 1kb text if key size

99

Table 5.7 Partial key obtained using Equation 5.1 with 1 kb text

Key length Key given Key obtained Fitness value Time

(sec)

1 Z Z 6.55 < 1

2 GF GF 6.51 3

3 ARN ARN 6.52 3

4 RAGH RAGH 6.54 13

5 TOEME TOEME 6.51 12

6 COLLEG COLLEG 6.51 13

7 GOVERNM EOVERNM 7.00 61

8 COMPUTER COMWUTER 6.49 100

Table 5.8 Complete key obtained using score with 1kb text


(sec)

1 Z Z 451

100

Table 5.9 Partial key obtained using Equation 5.1 with 2 kb text


(sec)

1 Z Z 12.93

101

Table 5.11 shows that the key is completely obtained using

Equation 5.1 with 3 kb text if key size

102

Table 5.12 Complete key obtained using score with 3kb text

Key

length Key given Key obtained

Fitness

value

Time

(sec)

1 Z Z 1236

103

5.9 A PARALLEL GENETIC ALGORITHM FOR

CRYPTANALYSIS OF POLYALPHABETIC

SUBSTITUTION CIPHER

Dimovski and Gligoroski (2003a) proposed a number of Genetic

Algorithms running in parallel for Cryptanalysis of the polyalphabetic

substitution cipher. Each Genetic Algorithm is using to solve a different part

of the problem. Figure 5.1 is a pictorial representation of this method with M

GAs running in parallel and communicating every k iterations. The

Figure 5.1 is showing a polyalphabetic substitution cipher consisting of M

simple substitution ciphers.

Figure 5.1 A parallel Genetic Algorithm

To solve these M substitution ciphers, M Genetic Algorithms

(GA1, GA2, . . . , GAM) are used which is attempting to find the key to the

Initialization

Output

GA0

1

GA1

1

GAk

1

GAX

1 GAX

2 GAX

M

GA0

M GA0

2

GA1

2

GAk

2 GAk

M

GA1

M

104

cipher of position j, in determining the cost of each of the solutions in its pool,

GAj uses the current best key from each of its neighbors to find the bigram

and trigram statistics.

5.9.1 Suitability Assessment

Two types of evaluating the fitness function are used in this work.

The first type is using Equation 5.1. The second type is using the score

function based on Table 5.4.

5.9.2 The Reproduction process

Spillman et al (1993) proposed a method for the mating function to

order the key. The characters in the key string are ordered such that the most

frequent character in the ciphertext is mapped to the first element of the key

(upon decryption), the second most frequent character in the ciphertext is

mapped to the second element of the key, and so on.. For example, the key:

NVHIZKLMFBPURS_QGTWXYJAOCED indicates that the most frequent

character in the ciphertext represents a plaintext N; the second most frequent

character in the ciphertext represents a plaintext V, etc. Given two parents

constructed in the manner just described, the first element of the first child is

chosen to be the one of the first two elements in each of the parents, which is

most frequent in the known language statistics. This process continues in a

left to right direction along each of the parents to create the first child only. If,

at any stage, a selection is made which already appears in the child being

constructed, the second choice is used. If both of the characters in the parents

for a given key position already appear in the child then a character is chosen

at random from the set of characters that do not already appear in the newly

constructed child. The second child is formed in a similar manner, except that

105

the direction of creation is from right to left and in this case, the least frequent

of the two parent elements is chosen.

Parent 1

NVHIZKLMFBPURS_QGTWXYJAOCED

Parent 2

MNOPCDEFGHIJKLQRSTXY_ZABUVW

Child 1

NVHICKEFGBPURS_QWTXYAZDOJML

Child 2

XNOPZGEMFHICKLQRSTWY_JABUVD

5.10 DESCRIPTION OF THE ALGORITHM

The implementation of the algorithm has been done in two different

ways to calculate the fitness, first one is similar to existing method (using

Equation 5.1) and the second way is by using score of bigrams and trigrams

of the Table 5.4:

1. The inputs of the algorithm are language statistics for

unigrams, bigrams and trigrams, the ciphertext, the block size

(B=3 as example) and this GAs position within the block, j

(1

106

3. Calculate the cost for each key using unigram statistics only

and sort them based of fitness values.

4. For iteration i (i = 1. . . G) Do:

a) If i mod k= 0 send the best key from current pool to each

of the neighbouring GAs (i.e., the GAs solving for

positions j-1 and j + 1). Also receive the best keys from

each of these GAs.

b) Select first 6 pairs for each position of solutions from

current pool to be the parents of the new generation.

c) Mate using each pair of parents to produce 12 children

that become the new generation (new pool) for each

position.

d) Mutate each of the children in new pool using the same

swapping procedure as described in the attack on the

simple substitution cipher (Spillman et al, 1993).

e) Calculate the cost of each of the children in new pool

using the neighbouring keys obtained in Step 4a and

Equation 5.1 or score of bigrams and trigrams Table 5.4.

f) Select the 12 best keys from the two pools current and

new. Replace the current solutions in current pool with

these solutions.

5. Output the best key from current pool.

107

5.11 RESULTS OF POLYALPHABETIC SUBSTITUTION

CIPHER

The key size of each position is 27 characters, the block size is

three then the total key size is 3*27 = 81 characters.

The algorithm has been implemented using two type of fitness

value: based on frequency analysis as in Equation 5.1 and based on score in

the Table 5.4 and the results are tabulated in Table 5.13.

Table 5.13 Results of proposed algorithm and existing method

Text size

Frequency Score

Key

recovered

Text

recovered %

Key

recovered

Text

recovered

500 12 31.47 13 35.85

1000 16 44.71 20 45.00

1500 25 51.76 39 75.11

2000 57 88.96 65 93.1

The results show that the recovered key letters and recovered

plaintext letters are more in the way which used based on score. The

Figures 5.2 and 5.3 show the graph of recovered key letters vs. text size and

recovered text letters vs. text size. The proposed algorithm by using score

fitness in calculating the fitness value has improved the way to get recovered

key letters and recovered text letters.

108

recovered key letters vs. ciphertext

size

0

10

20

30

40

50

60

70

0 1000 2000 3000

ciphertext size

reco

vere

d k

ey lett

ers

Frequency

method

Score fitness

method

Figure 5.2 Recovered key letters versus plaintext size

recovered plaintext letters percentage

vs. ciphertext size

0

20

40

60

80

100

0 1000 2000 3000

ciphertext size

reco

vere

d p

lain

text %

Frequency

method

Score fitness

method

Figure 5.3 Recovered plaintext letters percentage versus ciphertext size

Documents

chapter 5 cryptanalysis of vigenere cipher and ...shodhganga.inflibnet.ac.in/bitstream/10603/26543/10/10_chapter5.pdf · 84 CHAPTER 5 CRYPTANALYSIS OF VIGENERE CIPHER AND SUBSTITUTION