If you can't read please download the document
Upload
duongkiet
View
327
Download
4
Embed Size (px)
Citation preview
84
CHAPTER 5
CRYPTANALYSIS OF VIGENERE CIPHER AND
SUBSTITUTION CIPHER
5.1 INTRODUCTION
This chapter describes the methods of cryptanalysis of vigenere
cipher and substitution cipher. For cryptanalysis of vigenere cipher the first
step is finding the length of key. Two types of guessing the key
are explained. The first type is Kasiski method (www.trincoll.edu/depts/cpsc/
cryptography/vigenere.html) and the second type is using Genetic Algorithm.
After the key length is obtained, the proposed algorithm will be applied on
ciphertext to get the correct key and to recover the original plaintext. The
results of cryptanalysis of Vigenere cipher are given and compared with
existing method. In the cryptanalysis of substitution cipher parallel Genetic
Algorithm is used and a different fitness function is used in the proposed
algorithm. The results are presented and compared with existing method.
5.2 THE KASISKI/KERCKHOFF METHOD
Vigenere-like substitution ciphers were regarded by many as
practically unbreakable for 300 years. In 1863, a Prussian Major named
Kasiski proposed a method for breaking a Vigenere cipher that consisted of
finding the length of the keyword and then dividing the message into that
many simple substitution cryptograms. Frequency analysis could then be used
to solve the resulting simple substitutions. Kasiski's technique for finding the
85
length of the keyword was based on measuring the distance between repeated
bigrams in the ciphertext. For example:
Keyword: RELAT IONSR ELATI ONSRE LATIO NSREL
Plaintext: TOBEO RNOTT OBETH ATIST HEQUE STION
Ciphertext: KSMEH ZBBLK SMEMP OGAJX SEJCS FLZSY
The bigram 'TO' occur twice in the plaintext at position 0 and 9 and
in both cases it lines up perfectly with the first two letters of the keyword.
Because of this it produces the same ciphertext bigram, 'KS.' The same can be
said of plaintext 'BE' which occurs twice starting at positions 2 and 11, and
also is encrypted with the same ciphertext bigram, 'ME.' In fact, any message
encrypted with a Vigenere cipher will produce many such repeated bigrams.
Although not every repeated bigram will be the result of the encryption of the
same plaintext bigram, many will and this provides the basis for breaking the
cipher, by measuring and factoring the distances between recurring bigrams. In
this case the distance is 9. Kasiski was able to guess the length of the keyword.
For this example, the Kasiskis method would create Table 5.1.
Table 5.1 Kasiskis Technique
Repeated
Bigram Location Distance Factors
KS 9 9 3,9
SM 10 9 3,9
ME 11 9 3,9
86
Factoring the distances between repeated bigrams is a way of
identifying possible keyword lengths, with those factors that occur most
frequently being the best candidates for the length of the keyword. Note that
in this example since 3 is also a factor of 9 (and any of its multiples) both
3 and 9 would be reasonable candidates for keyword length. Although in this
example there is no clear favorite, the possibilities have been narrowed down
to a very small list. Note also that if a longer ciphertext were encrypted with
the same keyword ('RELATIONS'), repeated bigrams have been expected to
be finding at multiples of 9, 18, 27, 81, etc. These would also have 3 as a
factor. Kasiski's important contribution is to note this phenomenon of
repeated bigrams and propose a method, factoring of distances, to analyze it.
Once the keyword length is known, the method of Kasiski in findig the
correct key work as follows:
If the keyword is N letters long, then every Nth letter must be
enciphered using the same letter of the key text.
Grouping every Nth letter together, the analyst has N
messages, each encrypted using a one-alphabet substitution,
and each piece can then be solved using frequency analysis.
The drawbacks of using Kasiski method is the difficulty of finding
repeated strings, and involving more time to guess the key length. Also for
short messages there are often several good candidates for English 'E' in each
column. This requires the testing of multiple hypotheses, which can get quite
tedious and involve more time.
5.3 MODIFIED VIGENERE TABLE
In this work the space is included; Table 5.2 shows the modified
Vigenere tableaux.
87
Table 5.2 The modified Vigenere tableaux
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
A A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
B B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ A
C C D E F G H I J K L M N O P Q R S T U V W X Y Z _ A B
D D E F G H I J K L M N O P Q R S T U V W X Y Z _ A B C
E E F G H I J K L M N O P Q R S T U V W X Y Z _ A B C D
F F G H I J K L M N O P Q R S T U V W X Y Z _ A B C D E
G G H I J K L M N O P Q R S T U V W X Y Z _ A B C D E F
H H I J K L M N O P Q R S T U V W X Y Z _ A B C D E F G
I I J K L M N O P Q R S T U V W X Y Z _ A B C D E F G H
J J K L M N O P Q R S T U V W X Y Z _ A B C D E F G H I
K K L M N O P Q R S T U V W X Y Z _ A B C D E F G H I J
L L M N O P Q R S T U V W X Y Z _ A B C D E F G H I JK
M M N O P Q R S T U V W X Y Z _ A B C D E F G H I J K L
N N O P Q R S T U V W X Y Z _ A B C D E F G H I J K L M
O O P Q R S T U V W X Y Z _ A B C D E F G H I J K L M N
P P Q R S T U V W X Y Z _ A B C D E F G H I J K L M N O
Q Q R S T U V W X Y Z _ A B C D E F G H I J K L M N O P
R R S T U V W X Y Z _ A B C D E F G H I J K L M N O P Q
S S T U V W X Y Z _ A B C D E F G H I J K L M N O P Q R
T T U V W X Y Z _ A B C D E F G H I J K LM N O P Q R S
U U V W X Y Z _ A B C D E F G H I J K L M N O P Q R S T
V V W X Y Z _ A B C D E F G H I J K L M N O P Q R S T U
W W X Y Z _ A B C D E F G H I J K L M N O P Q R S T U V
X X Y Z _ A B C D E F G H I J K L M N O P Q R S T U V W
Y Y Z _A B C D E F G H I J K L M N O P Q R S T U V W X
Z Z _ A B C D E F G H I J K L M N O P Q R S T U V W X Y
_ _ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
88
5.4 PROPOSED METHOD FOR GUESSING THE LENGTH OF
KEY IN VIGENERE CIPHER
To find the key length, new method is proposed and this method is
suitable for the texts which are having the size more than 500 bytes. The main
idea in this method is to employ frequency of bigrams and trigrams as cost
function in Genetic Algorithm with few numbers of parameters. Genetic
Algorithm is applied here to find the key length. The first proposed key length
will be chosen as two and Genetic Algorithm operations are applied for small
population size and small number of generation, fitness value is saved for the
next generation and key length is then increased to three, again fitness value is
saved and compared with previous key length, if the new fitness is better, it
will be taken for next generation. The procedure is continued till some
assumed key length like 35 as example. The best solution is expected to be
the key length, this number will be used in the proposed algorithm for
cryptanalysis of Vigenere cipher to get the correct key letters and correct
plaintext. After termination of the algorithm, if the decrypted text is not
readable, the method of guessing the key length should be continued from the
assumed key length (35 as example).
5.5 PROPOSED ALGORITHM FOR CRYPTANALYSIS OF
VIGENERE CIPHER
The following is an outline of proposed algorithm: note that this
algorithm is used two types of fitness function, frequency analysis using the
Equation 5.1 and score function using Table 5.4.
89
1. Inputs to the algorithm are the ciphertext, the key size and
relative character frequencies, table of common bigrams and
trigrams.
2. Initialize the algorithm parameters: maximum number of
iterations.
3. Generate 10 keys randomly each one is having the same known
key length.
4. Decrypt the ciphertext by using the 10 generated keys.
5. Calculate the suitability of each key from every decrypted text
using the formula of frequency analysis or using the score in
Table 5.4.
6. Sort the keys based on the increased fitness values for first type
or based on decreased fitness value for score function.
7. for 1 to (maximum number of iteration) do:
Choose 5 pairs from 10 keys.
For 1 to (5 pairs) do
i. Apply crossover to get children
ii. Generate random number from 2 to (key size -1)
iii. Swap the parts of parents as example:
Parent 1 sungti | hutior
Parent 2 subdti | dution
Child1 sungtidution
Child 2 subdtihutior
iv. Generate random position number between (1 to key
size) for each child and mutate the letter in that
position.
Decrypt cipher text by 20 keys.
Calculate the fitness value for each key.
90
Sort the 20 keys based on increased (or decreased) fitness
values as mentioned in step no.5.
Choose best 10 keys.
Go to 7.
8. Output is the best solution.
The algorithm is illustrated using an example. The 10 random keys
are listed as:
AAIAJFNFSYHL
CSOBEVRTVYFL
YVYVLPRPOSCK
DRMMOIWGEKSE
NQQAUGLKJPPH
EUHIKGKBMZAK
BBGZJKFFFGTX
XTLNADLMPGCS
QXWBONQUGFGI
ODPAYFVQSTOO
After one generation the output is:
EBHIKHKUMZPG
ODPAYFVQSTOO
DRMMOIWGEKSE
EUHIKGKBMZAK
OXWBONQSQTQO
YVYEOGWIVKSE
BBGZJKFFFGTX
91
ZAOBEVRTVYFA
NQQAUGLKJPPH
AQQAPGLKJUNK
XTMNADLPLGTX
AAIAJFNFSYHL
QXWBONQUGFGI
CSFAJISFNYHL
BBGZSKFJFGCF
UDPAYVFOGFGI
DRPMLMSPORCK
XTLNADLMPGCS
YVYVLPRPOSCK
CSOBEVRTVYFL
After 30 generations the output is:
BUBSTHXFHIOO
JUBSTHXFCIOO
HIBSTQFUTIBN
HUBSFHXHTIOO
JUBSFHXHTIOO
.
.
After 100 generations output is:
SUBSTIUUTBON
SUBSTUBUTION
SUBSTUBUTION
92
SUBSTUDUTION
.
.
The final solution is (SUBSTIUUTBON) for ciphertext, which is
having 1500 bytes size for 100 generations.
5.6 IMPLEMENTING THE CRYPTANALYSIS OF VIGENERE
CIPHER
The attack is implemented by generating 10 independent keys to
represent the target key. The first generation is generated randomly using a
simple uniform random number generator. The fitness value is incremented
and finally normalized to the number of pairs, the criteria here is number of
generation. The Genetic Algorithm then goes in the normal way to generate
new generations. The algorithm is terminated based on the criteria described
earlier. The algorithm has been implemented to get fitness; essentially the
attack shall continue upward to get the best key. These functions are used in
the code:
Void Encrypt ()
This function performs encryption.
Void Decrypt ()
This function performs decryption taking input as key.
Void Keygen ()
This function creates the initial population it will generate n keys
randomly
93
Void Getfitness ()
This function measures fitness of a particular chromosome in the
population set indexed by its position in the population.
Void Sorting()
This function is responsible for sorting population of chromosomes
(The genetic material of an individual - represents the information about a
possible solution to the given problem) based on fitness value.
Void Crossover()
This function performs cross over between chromosomes and stores
them in the new population set as indexed by pos1, pos2.
Void Mutation ()
This function is responsible for mutation of the newly generated
chromosomes.
5.7 FITNESS MEASURE
Two types of cost functions are used to calculate the fitness value.
5.7.1 Based on frequency analysis
The method used to evaluate the keys is to compare n-gram
statistics of the decrypted message with the frequency of n-gram standard of
English language.
C k = .
Ai
u
i
u
i DK )()( + .
Aji
b
ji
b
ji DK,
),(),( + .
Akji
t
kji
t
kji DK,,
),,(),,( (5.1)
94
Equation (5.1) is a general formula used to calculate the suitability
of a proposed key (k). Here, A denotes the language alphabet [A. . . Z, _] K
and D denote known language statistics and decrypted message statistics
respectively and the indices u, b and t denote the unigram, bigram and trigram
statistics, respectively. The values of , and are the weights to each of the
three n-gram types which can be assumed with different values for unigrams,
brigrams and trigrams (Dimovski and Gligoroski 2003a). The main
characteristic of the algorithm is the ability of directing the random search
process of the Genetic Algorithm by selecting the fittest chromosomes among
the population. Evaluation of the fitness of the each key relied on the
language statistical characteristic. For example, the letter "E" is the most
common letter in English language, so the fitness of the key can be measured
based on how likely it is going to give correct letter frequency in the
deciphered text. Hence, the fitness function chosen is the main factor of the
algorithm.
5.7.2 Based on common bigrams and trigrams
In the process of determining the cost associated with a Vigenere
cipher key the proposed key is used to decrypt the ciphertext and then the
statistics of the decrypted message are compared with statistics of the
language. Matthews proposed an intuitive alternative. Instead of using all
possible bigrams and trigrams a subset of the most common ones are chosen
(Matthews 1993).
The method of Matthews is to list a number of the most common
bigrams and trigrams and to assign a weight (or score) to each of them. Also,
the trigram EEE was included in the list and assigned a negative score. The
idea behind this is interesting. Since E is very common in English, it could be
95
expected that a plaintext message might contain a relatively high number of
Es. Since these never occur normally in the English language, it makes sense
to assign such a trigram a negative score. Each weight is applied to the
frequency of the corresponding bigram or trigram in the decrypted message.
Table 2.1 shows the weight table used by Matthews in his paper.
The bigrams, trigrams and weights were modified by Andrew John Clark, in
(Clark 1994) to the values shown in Table 5.3 Notice that the bigram __
(two consecutive spaces) and the trigram ___ (three consecutive spaces)
have been included. Some of the other bigrams and trigrams are slightly
different, due to the fact that the space symbol has been included in the
encryption in (Clark 1994) (and was not in the work by Matthews) and also
due to different sets of language statistics used.
Table 5.3 The fitness weight proposed by Clark
Bi/trigram Score Bi/trigram Score
E_ +2 S_ +1
_T +1 __ -6
HE +1 _TH +5
TH +1 THE +5
_A +1 HE_ +5
___ -10
96
In this work, a modified method has used to evaluate the cost value
of the fitness function which is used by Clark (1994) and Matthews (1993),
EEE, AND and ING are included to the Andrew table, Table 5.4 shows the
weight table used in this work.
Table 5.4 The fitness weight proposed in this research
Bi/trigram Score Bi/trigram Score
EEE -5 ING +5
E_ +2 S_ +1
_T +1 __ -6
HE +1 _TH +5
TH +1 THE +5
_A +1 HE_ +5
___ -10 AND +5
5.8 RESULT OF VIGENERE CIPHER
The result of guessing method of the key length is tabulated as
comparison of fitness of some key lengths proposed, the population size is
six, number of generation is 200, and text size is 2 kb, as in Table 5.5.
The fitness value 151 is the biggest value, it has been calculated
using frequency of bigrams and trigrams in Table 5.4, the proposed key length
15 which gave the best fitness value will be submitted to next stage as
expected key length in the code to find the key word.
97
Table 5.5 Result of Guessing Key Length
Number of proposed key
length Fitness value
2 32
3 78
4 49
5 81
6 85
7 50
8 67
9 52
10 91
11 73
12 70
13 87
14 65
15 151
16 71
17 72
18 98
19 75
20 93
98
The results for Vigenere cipher are tabulated in Table 5.6 as a
comparison of performance for different ciphertext lengths and different key
lengths. The number of generation is limited to 1000.
Table 5.6 Result of Vigenere cipher
Cipher Text
Size
(in characters)
Key
length
Number of Recovered
key letters Time (in seconds)
Score
function
Equation
5.1 Score
Equation
5.1
2000 20 19 15 119 124
3000 25 22 21 131 139
4000 30 24 21 126 150
The time taken to cryptanalysis of Vigenere cipher is less by using
score cost function. Using score function gives better results of recovered key
letters in comparison with frequency analysis.
The results obtained for various cases such as partial key obtained
using Equation 5.1 as fitness function and completed key obtained using score
in Table 5.4 as fitness function. Table 5.7 shows that the key is completely
obtained using Equation 5.1 with 1kb text if key size
99
Table 5.7 Partial key obtained using Equation 5.1 with 1 kb text
Key length Key given Key obtained Fitness value Time
(sec)
1 Z Z 6.55 < 1
2 GF GF 6.51 3
3 ARN ARN 6.52 3
4 RAGH RAGH 6.54 13
5 TOEME TOEME 6.51 12
6 COLLEG COLLEG 6.51 13
7 GOVERNM EOVERNM 7.00 61
8 COMPUTER COMWUTER 6.49 100
Table 5.8 Complete key obtained using score with 1kb text
Key length Key given Key obtained Fitness value Time
(sec)
1 Z Z 451
100
Table 5.9 Partial key obtained using Equation 5.1 with 2 kb text
Key length Key given Key obtained Fitness value Time
(sec)
1 Z Z 12.93
101
Table 5.11 shows that the key is completely obtained using
Equation 5.1 with 3 kb text if key size
102
Table 5.12 Complete key obtained using score with 3kb text
Key
length Key given Key obtained
Fitness
value
Time
(sec)
1 Z Z 1236
103
5.9 A PARALLEL GENETIC ALGORITHM FOR
CRYPTANALYSIS OF POLYALPHABETIC
SUBSTITUTION CIPHER
Dimovski and Gligoroski (2003a) proposed a number of Genetic
Algorithms running in parallel for Cryptanalysis of the polyalphabetic
substitution cipher. Each Genetic Algorithm is using to solve a different part
of the problem. Figure 5.1 is a pictorial representation of this method with M
GAs running in parallel and communicating every k iterations. The
Figure 5.1 is showing a polyalphabetic substitution cipher consisting of M
simple substitution ciphers.
Figure 5.1 A parallel Genetic Algorithm
To solve these M substitution ciphers, M Genetic Algorithms
(GA1, GA2, . . . , GAM) are used which is attempting to find the key to the
Initialization
Output
GA0
1
GA1
1
GAk
1
GAX
1 GAX
2 GAX
M
GA0
M GA0
2
GA1
2
GAk
2 GAk
M
GA1
M
104
cipher of position j, in determining the cost of each of the solutions in its pool,
GAj uses the current best key from each of its neighbors to find the bigram
and trigram statistics.
5.9.1 Suitability Assessment
Two types of evaluating the fitness function are used in this work.
The first type is using Equation 5.1. The second type is using the score
function based on Table 5.4.
5.9.2 The Reproduction process
Spillman et al (1993) proposed a method for the mating function to
order the key. The characters in the key string are ordered such that the most
frequent character in the ciphertext is mapped to the first element of the key
(upon decryption), the second most frequent character in the ciphertext is
mapped to the second element of the key, and so on.. For example, the key:
NVHIZKLMFBPURS_QGTWXYJAOCED indicates that the most frequent
character in the ciphertext represents a plaintext N; the second most frequent
character in the ciphertext represents a plaintext V, etc. Given two parents
constructed in the manner just described, the first element of the first child is
chosen to be the one of the first two elements in each of the parents, which is
most frequent in the known language statistics. This process continues in a
left to right direction along each of the parents to create the first child only. If,
at any stage, a selection is made which already appears in the child being
constructed, the second choice is used. If both of the characters in the parents
for a given key position already appear in the child then a character is chosen
at random from the set of characters that do not already appear in the newly
constructed child. The second child is formed in a similar manner, except that
105
the direction of creation is from right to left and in this case, the least frequent
of the two parent elements is chosen.
Parent 1
NVHIZKLMFBPURS_QGTWXYJAOCED
Parent 2
MNOPCDEFGHIJKLQRSTXY_ZABUVW
Child 1
NVHICKEFGBPURS_QWTXYAZDOJML
Child 2
XNOPZGEMFHICKLQRSTWY_JABUVD
5.10 DESCRIPTION OF THE ALGORITHM
The implementation of the algorithm has been done in two different
ways to calculate the fitness, first one is similar to existing method (using
Equation 5.1) and the second way is by using score of bigrams and trigrams
of the Table 5.4:
1. The inputs of the algorithm are language statistics for
unigrams, bigrams and trigrams, the ciphertext, the block size
(B=3 as example) and this GAs position within the block, j
(1
106
3. Calculate the cost for each key using unigram statistics only
and sort them based of fitness values.
4. For iteration i (i = 1. . . G) Do:
a) If i mod k= 0 send the best key from current pool to each
of the neighbouring GAs (i.e., the GAs solving for
positions j-1 and j + 1). Also receive the best keys from
each of these GAs.
b) Select first 6 pairs for each position of solutions from
current pool to be the parents of the new generation.
c) Mate using each pair of parents to produce 12 children
that become the new generation (new pool) for each
position.
d) Mutate each of the children in new pool using the same
swapping procedure as described in the attack on the
simple substitution cipher (Spillman et al, 1993).
e) Calculate the cost of each of the children in new pool
using the neighbouring keys obtained in Step 4a and
Equation 5.1 or score of bigrams and trigrams Table 5.4.
f) Select the 12 best keys from the two pools current and
new. Replace the current solutions in current pool with
these solutions.
5. Output the best key from current pool.
107
5.11 RESULTS OF POLYALPHABETIC SUBSTITUTION
CIPHER
The key size of each position is 27 characters, the block size is
three then the total key size is 3*27 = 81 characters.
The algorithm has been implemented using two type of fitness
value: based on frequency analysis as in Equation 5.1 and based on score in
the Table 5.4 and the results are tabulated in Table 5.13.
Table 5.13 Results of proposed algorithm and existing method
Text size
Frequency Score
Key
recovered
Text
recovered %
Key
recovered
Text
recovered
500 12 31.47 13 35.85
1000 16 44.71 20 45.00
1500 25 51.76 39 75.11
2000 57 88.96 65 93.1
The results show that the recovered key letters and recovered
plaintext letters are more in the way which used based on score. The
Figures 5.2 and 5.3 show the graph of recovered key letters vs. text size and
recovered text letters vs. text size. The proposed algorithm by using score
fitness in calculating the fitness value has improved the way to get recovered
key letters and recovered text letters.
108
recovered key letters vs. ciphertext
size
0
10
20
30
40
50
60
70
0 1000 2000 3000
ciphertext size
reco
vere
d k
ey lett
ers
Frequency
method
Score fitness
method
Figure 5.2 Recovered key letters versus plaintext size
recovered plaintext letters percentage
vs. ciphertext size
0
20
40
60
80
100
0 1000 2000 3000
ciphertext size
reco
vere
d p
lain
text %
Frequency
method
Score fitness
method
Figure 5.3 Recovered plaintext letters percentage versus ciphertext size