
An effective source recognition algorithm: extraction of significant binary words

Shri Kant *, Neelam Verma

Scientific Analysis Group, Defence R&D Organisation, New SAG Buildings, Metcalfe House, Delhi 110054, India

Received 28 May 1999; received in revised form 6 June 2000

Abstract

The present paper deals with the computation of binary words, viz., N-grams and delay grams, where N is the statistically valid size of pattern words. Three feature selection methods are proposed to arrive at the most significant binary pattern words. The proposed methods reduce the dimensionality of the pattern space drastically, and the features they yield have greater discriminatory power for identifying the different pseudorandom sequence generators. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Cryptography; Cipher system; Feature selection; Dimensionality reduction; Significant binary words; System identification and classifier

1. Introduction

Numerous papers and research articles on techniques for recognizing patterns, i.e., classifying an object into one of its prespecified categories, have appeared in the pattern recognition literature during the last three decades. Becker (1978) discussed a pattern recognition design dealing with binary data patterns. Recognition of binary patterns, i.e., binary random variables, is of much importance for the classification of pictorial data, binary streams of numbers, symbolic pattern recognition, etc. Here, we are interested in designing a classifier which can identify the broad category of pseudorandom sequence generator for cryptographic application. We will now briefly introduce the current state of the art in cryptography.

Vital and strategic information cannot be left unprotected against unauthorized access on an open communication channel, because adversaries have the means and the technology to eavesdrop. Hence, messages have to be made intrinsically secure before they are sent through the open communication channel. Simmons (1992) has discussed in detail the cryptographic techniques for protecting vital information.

In cryptography, plaintext bits p1, p2, ..., pn are transformed into ciphertext bits c1, c2, ..., cn by an invertible transformation consisting of a cryptoalgorithm together with keys, popularly known as a ciphersystem. These transformations generate random key bits k1, k2, ..., kn, which, when XORed with the plaintext bits p1, p2, ..., pn, produce the ciphertext bits c1, c2, ..., cn.

www.elsevier.nl/locate/patrec

Pattern Recognition Letters 21 (2000) 981–988

* Corresponding author. Tel.: +91-11-3977709; fax: +91-11-2919828.

E-mail address: [email protected] (S. Kant).

0167-8655/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.

PII: S0167-8655(00)00054-4


The eavesdropper intercepts the ci's and passes them to the cryptanalyst, who then has to perform the uphill task of extracting the pi's directly from the ci's without any knowledge of the ki's. The problem becomes more complicated if the cryptanalyst does not know which type of cryptosystem has been used. To provide this aid to an analyst, we have designed and developed a technique which uses the patterns of significant binary words extracted from the ci's to recognize the source ciphersystem, provided the classifier has already been trained for those cryptosystems.

There are two broad categories of ciphersystems, viz., block ciphersystems and stream ciphersystems. A block cipher is a system that transforms a fixed-length block of pi's into a same-length block of ci's:

C = E(P, k),

where the encryption function E is applied to P under the influence of a key k of length m bits. In general, C = p_t^{-1}(F(p_t(P))), where p is a permutation and F is a non-linear function. The security of a block cipher mainly depends on the non-linear logic function F and the number of rounds the process iterates.

A stream cipher is an algorithm that converts each pi into ci as follows:

Ci = Pi XOR ki,

where the ki's are generated through a keystream generator. The security of a stream cipher depends entirely on the non-linear structure of the keystream generator. If the generator's output is an endless stream of 0's or 1's, then the ci's and pi's are the same, or a mere transposition will give us the pi's. If it produces a long repeating bit pattern, then the algorithm is a simple XOR with negligible security. If the generator spits out an endless stream of truly random bits, then it behaves like a Vernam cipher (Zeng et al., 1991). The main building block of such a generator is the feedback shift register (FSR), described by Golomb (1967). An FSR consists of n flip-flops and a feedback function that expresses each new element b(t), t >= n, of the sequence in terms of the previously generated elements: b(t) = f(b(t - n), b(t - n + 1), ..., b(t - 1)). Depending on whether the function f is linear (implementable with simple XOR) or not, the generator is called a linear feedback shift register (LFSR) or a non-linear FSR (NLFSR). Each individual storage element of the FSR is called a stage, and the binary signals b(0), b(1), b(2), ..., b(n - 1) are loaded into the stages as initial data to generate the sequence. The period of the sequence produced depends both on the number of stages and on the details of the feedback connection.
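To illustrate the LFSR construction above, the following minimal Python sketch (ours, not from the paper) implements the recurrence b(t) = b(t - 4) XOR b(t - 3), i.e., the primitive polynomial x^4 + x + 1, and checks that the sequence attains the maximal period 2^4 - 1 = 15:

```python
def lfsr(taps, state, n):
    """Generate n output bits from a linear feedback shift register.

    state -- initial bits b(0), ..., b(len(state) - 1), oldest first
    taps  -- stage indices whose XOR forms the new bit
             b(t) = f(b(t - n), ..., b(t - 1))
    """
    state = list(state)
    out = []
    for _ in range(n):
        out.append(state[0])          # oldest stage is the output bit
        fb = 0
        for t in taps:
            fb ^= state[t]            # linear feedback: XOR of tapped stages
        state = state[1:] + [fb]      # shift and append the feedback bit
    return out

# 4-stage LFSR for x^4 + x + 1: b(t) = b(t - 4) XOR b(t - 3)
seq = lfsr(taps=[0, 1], state=[1, 0, 0, 0], n=30)
assert seq[:15] == seq[15:]           # maximal period 2^4 - 1 = 15
```

A non-zero initial loading is required; the all-zero state of a linear FSR reproduces itself forever.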

Different classes of ciphersystem have been designed by varying the size and number of building blocks (FSRs) and the feedback functions. A few of them have been analysed directly from the cipher bit stream, as is evident from Siegenthaler (1985), Meier and Staffelbach (1989) and Golic and Mihaljevic (1991). Kumar (1997, Chapter 5) has discussed in detail the analysis of block ciphers under certain conditions. Though certain analytical procedures are quoted in the literature, all these procedures are system oriented. It is therefore essential for the cryptanalyst to diagnose the possible ciphersystems for the ciphertext to be analysed. Keeping this in view, we have developed a classifier that first classifies the crypt into one of the broad categories, which we call macrolevel identification, and then narrows the identification down to a particular ciphersystem under that category, i.e., microlevel identification. After knowing the most probable ciphersystem for the ciphertext to be analysed, the analyst can attack accordingly.

2. Possible features from binary sequences

We described earlier that the source of cryptographic key sequence generation is basically a combination of FSRs and a linear or non-linear combining function for feedback. The generated sequence is XORed with the plain sequence to get the cipher sequence. Hence, each cipher sequence may possess characteristics of the plaintext sequence, and these characteristics may vary from source to source. We have therefore studied the occurrence of all possible types of binary pattern words. From this large set of binary pattern words, we have selected the few that satisfy stringent statistical criteria. These selected binary pattern words have finally been used in the classifier design.

2.1. Frequencies of binary pattern words

Let us assume that we have a cipher sequence of length 2500 bits from a particular source. Possible pattern words then go up to 9-grams only, and the total number of binary pattern words will be

2 + 2^2 + 2^3 + ... + 2^8 + 2^9 = 1022.

The basis for computing the frequencies of binary pattern words only up to 9-grams in the above case is the fact that the application of chi-square statistics becomes meaningless when the expected frequencies are less than 5 (Rees, 1989, Chapter 12).
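The frequency counts above can be sketched in a few lines of Python (our illustration, using overlapping windows over the bit string):

```python
from collections import Counter

def ngram_counts(bits, n):
    """Frequencies of all n-bit pattern words in a bit string
    (overlapping windows)."""
    return Counter(bits[i:i + n] for i in range(len(bits) - n + 1))

# total number of pattern words from 1-grams up to 9-grams
assert sum(2 ** n for n in range(1, 10)) == 1022

# small demo: 2-grams of "110100" are "11","10","01","10","00"
assert ngram_counts("110100", 2) == {"10": 2, "11": 1, "01": 1, "00": 1}
```

For a 2500-bit sequence a given 9-gram is expected about (2500 - 8) / 2^9, roughly 4.9, times, which sits right at the boundary of the chi-square rule of thumb quoted above; longer words would fall below it.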

2.2. Delay gram counts

Select the permissible N-grams and then count the frequencies with certain delays of Di bits, where 1 <= sum(Di) <= N - q, in which N is the permissible gram size and q is the number of fixed bits. The pattern is called a di-delay gram if two bits (q = 2) are fixed, a tri-delay gram if three bits (q = 3) are fixed, and so on. The number of possible delay grams is also restricted and depends upon the available length of the bit stream. We have already seen that for a 2500-bit binary stream the permissible length of a binary pattern is 9 bits. If we want to compute delay grams for the same stream, then delay patterns of the following lengths can be computed.

2.2.1. Di-delay grams
A di-delay gram is a digram (00, 01, 10, 11) with a delay Di, where 1 <= Di <= N - 2, i.e., (0-Di-0, 0-Di-1, 1-Di-0, 1-Di-1). There are seven possible selections of Di, i.e., #p = 7, i = 1, 2, ..., N - 2. Hence, the number of possible di-delay grams is #p * 2^2 = 7 * 4 = 28.
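Reading the notation 0-Di-0 as two fixed bits with Di unconstrained bits in between (so each window spans Di + 2 positions, consistent with Di <= N - 2), the di-delay gram counts can be sketched as follows (our illustration):

```python
from collections import Counter

def di_delay_counts(bits, d):
    """Frequencies of the four di-delay grams b1-d-b2: two fixed bits
    separated by d unconstrained bits, so each window spans d + 2 bits."""
    c = Counter()
    for i in range(len(bits) - d - 1):
        c[bits[i] + bits[i + d + 1]] += 1
    return c

# "1101001" with delay 2 pairs positions (0,3), (1,4), (2,5), (3,6)
assert di_delay_counts("1101001", 2) == {"11": 2, "10": 1, "00": 1}
```

Tri- and higher delay grams follow the same pattern with two or more gaps between the fixed bits.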

2.2.2. Tri-delay grams
A tri-delay gram is a trigram with possible delays Di and Dj between the two pairs of trigram bits, e.g., {0-Di-0-Dj-0, 0-Di-1-Dj-0, ..., 1-Di-1-Dj-1}, where i and j take values such that Di + Dj does not exceed the permissible length of the binary pattern N, i.e., 1 <= Di + Dj <= N - 3. For N = 9, there are only 15 possible selections of Di and Dj, i.e., #p = 15, and hence the number of tri-delay grams is #p * 2^3 = 15 * 8 = 120.

2.2.3. Tetra-delay grams
A tetra-delay gram is a tetragram with delays Di, Dj and Dk between the pairs of bits in the tetragram, i.e., {0-Di-0-Dj-0-Dk-0, 0-Di-0-Dj-0-Dk-1, ..., 1-Di-1-Dj-1-Dk-1}, where the combination of i, j, k is selected in such a way that Di + Dj + Dk does not exceed N - 4. The possible number of selections of Di, Dj and Dk is 10, i.e., #p = 10, which finally gives only 160 tetra-delay grams (#p * 2^4).

2.2.4. Penta-delay grams
This is a delay gram with delays Di, Dj, Dk and Dl between the pairs of bits in the pentagram, i.e., (0-Di-0-Dj-0-Dk-0-Dl-0, ..., 1-Di-1-Dj-1-Dk-1-Dl-1). In this case, sum(Di) is not allowed to exceed N - 5, and hence the selection of Di, Dj, Dk and Dl is unique (i.e., #p = 1). The number of possible penta-delay grams is #p * 2^5 = 32.

From the above discussion, we can clearly observe that it is not possible to compute delay grams beyond the penta-delay gram: if we fix six bits, we require five delays Di, and then sum(Di) + q exceeds N, the permissible length of the binary pattern for the given sequence of length 2500 bits.

Now the total number of binary pattern words (features) that can be computed from a given cryptsequence 2500 bits long is 1022 + 28 + 120 + 160 + 32 = 1362. Another revealing fact is that computing delay gram pattern occurrences results in substantial data reduction; e.g., the 28 di-delay grams carry the pooled effect of the 1016 patterns of 3- to 9-grams, since each di-delay gram admits seven delays and hence represents sum(2^i), i = 1, 2, ..., 7, i.e., up to 128 9-gram patterns. In the case of tri-delay grams, 516 patterns of 4- to 9-grams are covered by the 120 tri-delay grams. In the case of tetra-delay grams, 248 patterns of 5- to 9-grams are covered by the 160 possible tetra-delay grams. There is no data reduction in the case of penta-delay grams. It is evident from the above discussion that if we keep N fixed and vary q, then as q increases the number of possible N-bit patterns covered by the higher delay grams reduces drastically, because there are fewer choices for varying the delays Di.
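The delay-selection counts #p and the total of 1362 features can be verified combinatorially; a short Python check (ours) under the constraint that the q - 1 positive delays of a q-fixed-bit pattern satisfy sum(Di) <= N - q:

```python
from itertools import product

def num_delay_selections(q, N=9):
    """Number of choices of the q - 1 positive delays D_i with
    sum(D_i) <= N - q (q fixed bits, total span at most N)."""
    budget = N - q
    return sum(1 for ds in product(range(1, budget + 1), repeat=q - 1)
               if sum(ds) <= budget)

assert num_delay_selections(2) == 7    # di-delay grams
assert num_delay_selections(3) == 15   # tri-delay grams
assert num_delay_selections(4) == 10   # tetra-delay grams
assert num_delay_selections(5) == 1    # penta-delay grams

ngrams = sum(2 ** n for n in range(1, 10))                        # 1022
delays = sum(num_delay_selections(q) * 2 ** q for q in (2, 3, 4, 5))
assert ngrams + delays == 1362         # total number of features
```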

If we consider the number of occurrences of these binary pattern words as a feature vector, then we have 1362 feature components, or pattern attributes, representing a sequence. From the computational point of view, we can enumerate the information of the 2^N N-gram patterns with the help of 2^(N-1) pieces of information only. In the case of tri- and higher delay grams, if the delay Di is equal between the pairs of bits, then 2^(N-1) pieces of information are sufficient; otherwise, we require more information (Becker, 1978, Chapter 3).

Designing a classifier with 1362-dimensional feature vectors is impracticable because it increases the computational cost enormously. Under these circumstances, it is essential to reduce the dimensionality of the feature vectors. Most of the binary words may not even be relevant for representing a given binary cryptsequence. Several feature selection techniques based on entropy minimization, the Karhunen-Loeve (KL) expansion, divergence maximization, principal component analysis, factor analysis, canonical analysis, etc. are available in the literature (Bow, 1984; Tou and Gonzalez, 1974; Theodoridis and Koutroumbas, 1999). All these algorithms make use of variance-covariance matrices, the computation of eigenvalues and corresponding eigenvectors, etc. Some of them use linear combinations of the most similar features, arranged in order of maximum information content, the last combination containing the least information. It is evident that these algorithms demand considerable computing power in terms of time and space. The algorithms we propose in the following section are very simple in terms of computation and space: they take the binary words (features) one by one and select or reject each according to its statistical significance. The selected significant binary words have the desired discriminatory characteristics.

3. Selection of features/dimensionality reduction

As we have seen in Section 2, the total number of possible binary words for a given sequence of length 2500 bits is 1362. Developing a classifier with this number of feature components is not practicable. Hence, the dimensionality of the feature vector has to be reduced to a size suitable for developing a system identification algorithm. The three algorithms described below deal with the search for significant binary words, which ultimately reduces the dimensionality of the feature vector.

3.1. Algorithm 1

This algorithm is suited to reducing the dimensionality of the problem when there are two sources to be identified. After computing all the measurements, we apply t-statistics to select the most significant features having the desired discriminating power. The steps of the algorithm are as under:

Step 1. The basic requirements for applying the t-test are:
1. The measurements should be normally distributed, which has been tested using chi-square statistics (e.g., Neter et al., 1993).
2. The variance of each measurement should be equal for the two classes, which has been verified by the F-statistic as follows. Set up the hypotheses

HO: sigma_1i = sigma_2i,
HA: sigma_1i != sigma_2i.

Compute F = (s_2i)^2 / (s_1i)^2, where s_ji is the sample standard deviation; if F_cal > F_tab, reject HO, which ultimately rejects the ith measurement. This step alone reduces the dimensionality from 1362 to approximately 900.

Step 2. The t-statistic is calculated at the 1%, 5% and 10% levels of significance (l.o.s.) for the above 900 measurements as follows:

HO: mu_1i = mu_2i,
HA: mu_1i != mu_2i,

where the mu_ji's are the population means of the measurements. Compute the t-statistic with the sample means x̄_ji for testing the hypothesis:


t = (x̄_1i - x̄_2i) / sqrt( [((n1 - 1)(s_1i)^2 + (n2 - 1)(s_2i)^2) / (n1 + n2 - 2)] * (1/n1 + 1/n2) ),

where n1 and n2 are the numbers of samples of the two sources, respectively. If |t_cal| > t_tab, reject HO, and this particular measurement is one of the components of the feature vector to be used for classification purposes. This reduces the dimensionality from 900 to approximately 190-200.

Note. The algorithm reduces the number of features from 1362 to somewhere in the range 190-200. There is still scope for further reduction in the dimensionality. To achieve this we can proceed in two ways:

(a) Arrange all the binary words in descending order of t_cal. It has been observed that 2.66 <= |t_cal| <= 11.5. If we choose a threshold th >= 6.5, the dimensionality reduces from the range 190-200 to 20-25.

(b) Another way of reducing the dimensionality is to find the common significant binary words, which can be obtained by arranging them in descending order and fixing the threshold value th >= 6.5. We get a feature vector of the same dimension as mentioned in (a).
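The two steps of Algorithm 1 can be sketched as follows (our illustration; the threshold t_min = 6.5 follows the note above, while f_max is a hypothetical stand-in for the tabulated F value, which depends on the sample sizes and the l.o.s.):

```python
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample pooled t-statistic for one feature (Step 2)."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

def select_features(class1, class2, f_max=2.5, t_min=6.5):
    """Keep feature q only if the two class variances pass the F-check
    (Step 1) and |t_cal| clears the threshold of the note (Step 2)."""
    kept = []
    for q in range(len(class1[0])):
        x1 = [v[q] for v in class1]
        x2 = [v[q] for v in class2]
        f = variance(x2) / variance(x1)
        if f > f_max or 1 / f > f_max:     # variances differ: reject the word
            continue
        if abs(pooled_t(x1, x2)) > t_min:  # significant binary word: keep it
            kept.append(q)
    return kept

# toy data: feature 0 separates the two classes, feature 1 does not
c1 = [[10, 5], [11, 6], [12, 5], [10, 6], [11, 5]]
c2 = [[20, 5], [21, 6], [22, 6], [20, 5], [21, 5]]
assert select_features(c1, c2) == [0]
```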

3.2. Algorithm 2

Step 1. Compute all the measurements for the two sources of bit stream, C^j_q, where q = 1, ..., 1362 and j = 1, 2; here 1 stands for block cipher and 2 stands for stream cipher.

Step 2. Apply the single-population t-test to search for the significant binary words, for which the requirement is to test the normality of each binary word. This has been tested with the help of the chi-square goodness-of-fit test. After observing normality, set up the hypotheses

HO: mu_q = mu_q0,
HA: mu_q != mu_q0,

where mu_q is the mean frequency of the qth measurement and mu_q0 is the expected frequency of the qth binary word, q = 1, 2, ..., 1362; the hypothesis was tested at the 1%, 5% and 10% l.o.s.

Step 3. Compute the statistic

t_cal = (M_q - mu_q0) / sqrt(V_q / n),

where M_q is the mean frequency, V_q the variance of the qth measurement and n the number of samples in that source. The binary word b_q is said to be significant if |t_cal| > t_tab.

Step 4. Repeat Steps 2 and 3 for all q measurements.

Step 5. Repeat Steps 2-4 for j = 2.

Note. This algorithm reduces the number of features from 1362 to somewhere in the range 190-200. Further reduction from 190-200 to 20-25 is obtained as described in the note to Algorithm 1.
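Steps 2 and 3 of Algorithm 2 can be sketched as follows (our illustration; t_tab = 2.0 is a hypothetical placeholder for the tabulated critical value at the chosen l.o.s.):

```python
from statistics import mean, variance

def one_sample_t(counts, mu0):
    """t_cal = (M_q - mu_q0) / sqrt(V_q / n) for one binary word (Step 3)."""
    n = len(counts)
    return (mean(counts) - mu0) / (variance(counts) / n) ** 0.5

def significant_words(samples, expected, t_tab=2.0):
    """samples[i][q]: observed frequency of word q in the ith training
    sequence; expected[q]: its expected frequency under randomness.
    Keep word q when |t_cal| > t_tab (Steps 2-4)."""
    return [q for q in range(len(expected))
            if abs(one_sample_t([s[q] for s in samples], expected[q])) > t_tab]

# toy data: word 0 occurs far more often than expected, word 1 does not
obs = [[9, 5], [10, 5], [11, 4], [10, 6], [10, 5]]
assert significant_words(obs, expected=[5, 5]) == [0]
```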

3.3. Algorithm 3

In this algorithm, ANOVA has been used to search for the significant measurements. The algorithm steps are as under:

Step 1. Compute the q = 1, ..., 1362 measurements for each C^j_i, i = 1, ..., Nj, j = 1, 2, ..., p, where N is the total number of samples in all the classes and the numbers of samples Nj are the same for all classes.

Step 2. Find

M_jq (j = 1, ..., p) and M_q = (1/p) sum_{j=1}^{p} M_jq,

SSB = Nj sum_{j=1}^{p} (M_jq - M_q)^2,

SST = sum_{j=1}^{p} sum_{i=1}^{Nj} (M_jiq - M_q)^2,

SSE = SST - SSB.

Step 3. Set up the hypotheses

HO: mu_1q = mu_2q = mu_3q = ... = mu_pq,
HA: not all the mu_jq are equal,

where mu_pq is the population mean of the qth measurement for the pth source (Neter et al., 1993, Chapter 21). Calculate

F_cal = (SSB / (p - 1)) / (SSE / (N - p)).

Step 4. If F_cal(p-1, N-p; alpha) > F_tab(p-1, N-p; alpha), reject HO at the 1%, 5% and 10% l.o.s. Take as features only those binary words for which HO is rejected. The dimensionality is reduced drastically from 1362 to 10-15.
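The ANOVA computation of Steps 2-3 can be sketched as follows (our illustration; as in Step 1, equal class sizes Nj are assumed, so the grand mean over all values coincides with the mean of the class means):

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA for one measurement across p sources:
    F_cal = (SSB / (p - 1)) / (SSE / (N - p))."""
    p = len(groups)
    N = sum(len(g) for g in groups)
    grand = mean(v for g in groups for v in g)                  # M_q
    ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)  # SSB
    sst = sum((v - grand) ** 2 for g in groups for v in g)      # SST
    sse = sst - ssb                                             # SSE
    return (ssb / (p - 1)) / (sse / (N - p))

# one binary word measured on three sources; the third clearly differs
f = anova_f([[1, 2, 3], [2, 3, 4], [9, 10, 11]])
assert abs(f - 57.0) < 1e-9
```

A word is kept as a feature when its F_cal exceeds the tabulated F at the chosen l.o.s.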

4. Designing of classifier

In this section, we describe two levels of identification, viz., macrolevel and microlevel. In macrolevel identification, we want to discriminate the two major classes of binary stream generation for cryptographic application, popularly known as block and stream ciphers. In microlevel identification, we are interested in discriminating among the individual ciphersystems within the block and stream classes.

4.1. Macrolevel identification

Let bi and si be the two major sources of cipher bit generation. From each source, we took about 300 judiciously chosen cryptsequences for training the classifier. The source bi comprises DES and its variants, RSA and IDEA, whereas the source si comprises LFSR-, clock-controlled shift register-, non-linear combiner- and non-linear feedforward shift register-based systems. The training sets have been chosen in a random fashion. We trained the classifier using 50-250 labelled samples and tested it on the labelled (self) samples as well as the test samples. The algorithm for discrimination works as under:

Step 1. Find the representative feature vector for each message by one of the following methods:
(1) frequencies of 2- to 5-bit binary words, providing 4- to 32-dimensional feature vectors;
(2) the significant binary pattern words obtained as feature vectors by using Algorithms 1 and 2.

Step 2. Use the linear statistical classifier as described in Kant (1993) and Kant and Narain (1998).

Step 3. Repeat Steps 1 and 2 for different feature vectors and tabulate the discrimination results achieved.

Results. The algorithm was applied repeatedly for different types of feature vectors. Self-classification is always more than 92%, whereas test (cross) classification varies between 80% and 92%. This can be observed in Fig. 1, from which it is also evident that block ciphers have a higher success rate than stream ciphers, i.e., misclassification is greater for stream ciphers than for block ciphers. The X-axis represents the varying-length feature vectors, viz., 2-, 3-, 4- and 5-grams; the Y-axis represents the percentage of successful classification. Another important observation, seen in Fig. 2, is that as we increase the number of training samples, the results become more stable.

Fig. 1.

4.2. Microlevel identification

Discrimination between the two major classes of ciphers, viz., block and stream, reduces the complexity of the problem, but the analyst is still interested in knowing the exact cipher algorithm that has been used to generate the given cryptsequences. Under each class, there is a variety of algorithms available for generating the key sequences. Our aim is to arrive at the exact type of algorithm so that the analysis can be done accordingly. Keeping this in view, a multi-class classifier has been developed and tested for four types of block ciphers and four types of stream ciphers, as mentioned in Section 4.1, in a simulated environment. The steps of the algorithm are as under:

Step 1. Compute all measurements from the M sources of cryptsequence generators, C^j_i, where i = 1, 2, ..., N and j = 1, 2, ..., M, for the different cryptosystem algorithms. In the present case, M = 4 and N varies, depending on the cryptsequence length.

Step 2. Select the feature vector from the above measurement space with the help of Algorithms 2 and 3.

Step 3. Use the Perceptron algorithm of Bow (1984, Chapter 3).

Step 4. If self-classification is not up to the desired level, then increase the number of cryptsequences used for learning. It has been observed that the classification rate starts stabilising after 150 learning samples.
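The perceptron of Step 3 can be sketched as follows; this is our minimal two-class version (the multi-class case can be handled one-vs-rest), not the exact routine of Bow (1984):

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Two-class perceptron on significant-word feature vectors.
    labels are +1/-1; the last weight is the bias of the augmented vector."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            xa = list(x) + [1.0]                      # augment with bias input
            if y * sum(wi * xi for wi, xi in zip(w, xa)) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                               # converged on training set
            break
    return w

def classify(w, x):
    s = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if s > 0 else -1

# toy significant-word counts for two ciphersystems
X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [-1, -1, 1, 1]
w = train_perceptron(X, y)
assert [classify(w, x) for x in X] == y
```

For linearly separable training data the update rule is guaranteed to converge; on real feature vectors the number of training samples matters, which is consistent with Step 4 above.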

Results. Cryptsequences from four different block ciphers and four different stream ciphers have been studied. Arbitrary cryptsequences of each type were taken for training the classifier. The training was started with 50 samples, and the classification success rate was found to stabilise after 150 samples. Several sets of 50 cryptsequences have been tested. The success rate of classification on the test set is in the range 75-90%, whereas for the labelled samples it was always above 90%. The success rates in both cases are very encouraging and offer a way to reduce the cryptanalyst's difficulty to some extent. However, identification of a given crypt can be done only for those systems on which the classifiers have been thoroughly trained. These algorithms have also been found very useful in other applications related to coding scheme recognition and speech analysis.

Acknowledgements

We are grateful to the Director, Scientific Analysis Group, M/O Defence, for providing a conducive environment to carry out this work. We are also thankful to Dr. Laxmi Narain, Sc. 'F', Divisional Head, Mathematics Division of SAG, for his valuable discussions and for providing data to develop the algorithm.

References

Becker, P.W., 1978. Recognition of Patterns: Using the Frequencies of Occurrence of Binary Words. Springer, New York.

Bow, S.T., 1984. Pattern Recognition: Application to Large Data-Set Problems. Marcel Dekker, New York.

Fig. 2.


Golic, J.D., Mihaljevic, M.J., 1991. A generalized correlation attack on a class of stream ciphers based on the Levenshtein distance. J. Cryptol. 3 (3), 201-212.

Golomb, S.W., 1967. Shift Register Sequences. Holden-Day, San Francisco (revised edition, 1980, Aegean Park Press, CA, USA).

Kant, S., 1993. What help can classification techniques provide to a cryptanalyst? In: Proc. Third Int. Conf. P.R & D.T, pp. 651-660.

Kant, S., Narain, L., 1998. Analysis of some stream and block ciphers using pattern recognition tools. NSCR'98, C10-C21.

Kumar, I.J., 1997. Cryptology: System Identification and Key Clustering. Aegean Park Press, CA, USA.

Meier, W., Staffelbach, O., 1989. Fast correlation attacks on certain stream ciphers. J. Cryptol. 1 (3), 159-176.

Neter, J., Wasserman, W., Whitmore, G.A., 1993. Applied Statistics. Allyn and Bacon, Boston.

Rees, D.G., 1989. Essential Statistics. Chapman & Hall, London.

Siegenthaler, T., 1985. Decrypting a class of stream ciphers using ciphertext only. IEEE Trans. Comput. C-34 (1), 81-85.

Simmons, G.J. (Ed.), 1992. Contemporary Cryptology: The Science of Information Integrity. IEEE Press, New York.

Theodoridis, S., Koutroumbas, K., 1999. Pattern Recognition. Academic Press, New York.

Tou, J.T., Gonzalez, R.C., 1974. Pattern Recognition Principles. Addison-Wesley, Reading, MA.

Zeng, K.C., Yang, C.Y., Wei, D.Y., Rao, T.R.N., 1991. Pseudorandom bit generators in stream-cipher cryptography. IEEE Computer, February 1991, pp. 8-16.
