[IEEE 2009 9th International Symposium on Communications and Information Technology (ISCIT) - Icheon, South Korea (2009.09.28-2009.09.30)] 2009 9th International Symposium on Communications

Application of Data Mining in Cryptanalysis

Pejman Khadivi and Marjan Momtazpour

Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran

E-mail: {pkhadivi, momtazpour}@ec.iut.ac.ir

Abstract: Cryptography is a popular method for hiding information and achieving confidentiality in the digital world. Cryptanalysis, in turn, is an interesting and useful science from several viewpoints. All digital material that is processed by computers or transferred over data transmission systems carries information. However, the information embodied in the data is entangled with known or unknown patterns and features. As an example, each author uses certain writing patterns and techniques which are almost unique. While cryptography algorithms hide information from eavesdroppers, this hiding is performed through recoding. Hence, the information and its related features still remain in the encrypted output of the cryptosystem, and discovering the hidden features in encrypted texts can be used to analyze the output of a cryptography algorithm. In this paper, the application of data mining in cryptanalysis is explored. It is shown how this attack may be employed to classify ciphertexts. Also, a number of methods to improve the security of cryptography systems are introduced. Simulation results support the arguments of the paper.

I. INTRODUCTION

With the fast growth of the Internet and information technology, people around the world have wide access to information available from different sources. However, as has been the case since ancient times, information hiding and confidentiality are important issues in communications. Over the centuries, different cryptography algorithms and methods have been employed to encrypt and hide valuable information from eavesdroppers [1].

The main goal of cryptography is to conceal information from unauthorized parties. Cryptanalysts, however, try to find weaknesses in cryptography algorithms in order to obtain meaningful information from the encrypted data [1-11]. With modern cryptography algorithms, finding even one bit of information is valuable from a cryptanalysis perspective.

Different forms of information are entangled with known or unknown patterns and characteristics. Also, it is almost clear that the amount of information in the plaintext and in the ciphertext is equal. Cryptosystems conceal information through data recoding. Hence, the information and its related features still exist in the ciphertexts.

Extracting features and patterns from data is an important task of data mining [12], [13]. Data mining is the process of analyzing data from different viewpoints and extracting hidden, useful patterns from it. These patterns can then be used to improve our understanding of different economic, social, or engineering systems in order to increase income, decrease costs, or improve behavior.

In this paper, we study the application of data mining in cryptanalysis. It is illustrated how this attack may be employed to classify encrypted texts and to extract useful information from the ciphertexts. While simulation results show that this attack may be successful, the outcome of the method depends greatly on the employed features, the genre of the data, and the encryption algorithm. In order to improve the security of cryptography systems, a number of methods are introduced at the end of the paper.

The remainder of the paper is structured as follows. Short introductions to data mining and cryptology are presented in Sections II and III, respectively. Section IV proposes a framework for applying data mining methods to cryptanalysis. Section V presents a case study. Some recommendations for security improvement are proposed in Section VI. Section VII is dedicated to some concluding remarks.

II. DATA MINING

In recent years, data mining has played an important role in extracting knowledge and exploring interesting information [12], [13]. This section gives a brief introduction to data mining.

Data mining refers to exploring and extracting previously unknown knowledge from large amounts of data. Any form of data can be used in data mining, including text, web site data, and multimedia (sound, image, and video) information. Hence, different areas of research activity may be found within data mining, such as text mining, web mining, temporal and spatial mining, and so on. Data mining is used in society, industry, and science, for example in market analysis, fraud detection, scientific exploration, and medicine.

Classification, clustering, and Association Rule Mining (ARM) are the most important and best-known methods employed in data mining. Mining tasks can be categorized into two groups: supervised and unsupervised methods. Classification is a supervised learning method that assigns data to several classes, each with a specific label. Decision trees (such as ID3 and C4.5) are a common approach to classifying a set of data [14].

Clustering methods, such as k-means or hierarchical clustering, are unsupervised mining tasks. In these methods, related data items are grouped into clusters, and there are no predefined labels for the clusters.

978-1-4244-4522-6/09/$25.00 ©2009 IEEE

Association rule mining is an unsupervised method that generates rules and extracts relational associations from raw data. In this method, frequent patterns are found first, and rules are then generated from these frequent patterns. Market analysis, catalog design, and other business applications use this method.

Time series mining, temporal mining, and spatial mining are other data mining methods, which can be applied to special types of data in applications such as weather forecasting, earthquake prediction, geosciences, and so on [14].

III. CRYPTOGRAPHY AND CRYPTANALYSIS

Historically, cryptography dates back about 4000 years [1]. Due to the importance of information and communications security, cryptography and its related aspects have been at the center of attention. Over the years, a large number of algorithms and methods have been proposed for the encryption of data. By definition, "cryptography is the study of mathematical techniques related to aspects of information security such as confidentiality, data integrity, entity authentication, and data origin authentication" [1].

Cryptography methods are classified as symmetric or asymmetric. In symmetric cryptosystems, both the source and destination parties use the same shared secret key for data encryption and decryption, whereas in asymmetric methods the encryption and decryption keys are different. From a different perspective, cryptosystems are categorized as block ciphers and stream ciphers. Symmetric-key block ciphers are the best-known and most important elements in many cryptosystems. Examples include the well-known DES, IDEA, TEA, and AES algorithms.

Definition [1]: An n-bit block cipher is a function $E: V_n \times K \rightarrow V_n$, such that for each key $k \in K$, $E(P, k)$ is an invertible mapping from $V_n$ to $V_n$, written $E_k(P)$. The inverse mapping is the decryption function, denoted by $D_k(C)$. $C = E_k(P)$ denotes that ciphertext C results from encrypting plaintext P under k.

These days, cryptography is used in many applications and domains. Different cryptosystems can be found in military applications, wireless sensor networks [6], multimedia [8], and web applications. In different systems, cryptography algorithms are implemented in software or hardware [5]. Also, new technologies and sciences help researchers to propose new methods of encryption.

Visual cryptography is a new cryptographic technique which allows visual information to be encrypted in such a way that the decryption can be performed by a human, without any decryption algorithm [3]. Quantum cryptography is an emerging technology in which two parties can secure network communications by applying the phenomena of quantum physics. The security of these transmissions is based on the inviolability of the laws of quantum mechanics [4]. DNA cryptography is a new field of cryptography that has arisen from research on DNA computing in recent years [7].

Cryptanalysis is the art of extracting useful information from encrypted communications without knowing the proper keys. Many different approaches have been proposed in the literature; the most well-known ones are linear cryptanalysis, differential cryptanalysis, and chosen-plaintext linear cryptanalysis.

Fundamentally, linear cryptanalysis uses a linear relation between the inputs and outputs of an encryption algorithm that holds with a certain probability. Differential cryptanalysis, on the other hand, analyzes the effect of particular differences in plaintext pairs on the differences of the resultant ciphertexts. These differences can be used to assign probabilities to the possible keys and to locate the most probable key. Historically, linear cryptanalysis was introduced as a theoretical attack on DES [15] and later successfully used in the practical cryptanalysis of this algorithm [16]. Differential cryptanalysis was also first presented for DES [9]. Besides these well-known methods of cryptanalysis, other methods can be found in the literature [2], [8], [10], [11].

IV. FEATURES AND PATTERNS IN CIPHERTEXTS

In this section, a framework is proposed for applying data mining to the analysis of encrypted information. Let us assume that an information source, S, generates plain data items, P. This information is encrypted by a certain cryptography algorithm. It is also assumed that the output of this cryptosystem is C. Hence, we have:

$$C = f(P)$$

where f is a function that models the behavior of the cryptosystem. However, if the block cipher algorithm has a block size of m bits, then the raw data, P, is divided into n sub-blocks, $P_1, P_2, \ldots, P_n$, where each block has a length of m bits. Each of these sub-blocks produces encrypted information based on the following equation:

$$C_i = f(P_i)$$
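The block-wise model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_blocks`, `encrypt_blockwise`, and `toy_f` are hypothetical, `toy_f` (a fixed byte-wise XOR) merely stands in for a real cipher and offers no security, and input whose length is not a multiple of the block size is zero-padded, which the paper does not specify.

```python
from typing import Callable, List


def split_blocks(data: bytes, block_bytes: int) -> List[bytes]:
    """Split P into sub-blocks P_1..P_n (zero-padding the tail is an assumption)."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    remainder = len(data) % block_bytes
    if remainder:
        data += bytes(block_bytes - remainder)
    return [data[i:i + block_bytes] for i in range(0, len(data), block_bytes)]


def encrypt_blockwise(data: bytes, f: Callable[[bytes], bytes],
                      block_bytes: int = 16) -> List[bytes]:
    """Apply C_i = f(P_i) independently to every block."""
    return [f(block) for block in split_blocks(data, block_bytes)]


def toy_f(block: bytes) -> bytes:
    """Toy stand-in for a block cipher: XOR each byte with 0x5A (NOT secure)."""
    return bytes(b ^ 0x5A for b in block)


# Two identical plain blocks followed by one different block.
plain = b"A" * 32 + b"some other text!"
cipher_blocks = encrypt_blockwise(plain, toy_f)

print(len(cipher_blocks))
print(cipher_blocks[0] == cipher_blocks[1])
```

Because every block is mapped by the same function f, equal plain blocks yield equal cipher blocks in this sketch, which is the property the later case study relies on.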

It should be mentioned that the plain information, P, has features that depend on the genre of the information, the information encoding, and the information creator. As an example, the set of features in images is different from the ones in texts. Also, different images have different features based on their producers. As another example, each author uses certain writing patterns and techniques which are almost unique and different from the techniques used by other authors.

In this paper, we refer to the features in non-encrypted data, P, as the embedded features in P. It should be noted that these embedded features can be unknown beforehand.

For a block cipher such as f and plain data P, embedded features can be divided into two categories: interblock and intrablock features. Intrablock features are those which are limited to a single block, $P_i$. These features are assumed to be independent of other blocks. On the other hand, interblock features are not limited to one block.

One of the most important questions to be answered regarding cryptosystems is how many of the “embedded features” in the non-encrypted text, P, can pass through the encryption function f.

Due to the non-linear behavior of modern block ciphers, intrablock features are less likely to be passed by the cryptosystem. However, intrablock features can be easily passed in classic cryptography methods.

One of the most famous features in plain data is the frequency of characters and symbols. It is well-known that symbol frequencies can be used to break classic cryptosystems with the help of known statistics about characters and symbols in data.

With respect to certain properties of modern cryptographic algorithms, such as their non-linearity and the avalanche criterion [1], it seems that most of the features in the plaintexts must fade in the ciphertexts. However, the case study of the next section illustrates that this conclusion is not true. The following paragraphs present a simple framework for using data mining and feature extraction in the cryptanalysis of (modern) cryptosystems.

In order to use data mining methods for cryptanalysis, we need some useful features of the plain data and the ciphered data. Hence, in the first step, different pieces of non-encrypted information must be analyzed by data miners for feature extraction.

As the second step, the data encryption algorithm must be studied in order to find features that may be passed through the encryption stage. Only these features can be employed for cryptanalysis purposes.

Appropriate data mining methods, such as clustering, classification, or ARM, can be used to relate extracted features in the ciphered data with the ones in the plain information. Regarding the employed mining method, useful information can be extracted from the encrypted data.

The above strategy may be employed for a variety of applications, including recognition of the cryptosystem, plain data classification, author recognition, etc. The following section illustrates a simple case study.

Fig. 1. Distribution of symbols in text and image plain data.

V. IMAGE AND TEXT CLASSIFICATION: A CASE STUDY

In this section, a simple case study is presented on the application of feature analysis to the classification of encrypted texts and images.

The main problem in this case study is that we are faced with a number of encrypted data sets and are asked to categorize the related plain data sets as text or image. A large number of simulations have been performed to extract useful features for this purpose.

A. Simulation Environment

In this section, the simulation environment used for the case study is introduced. The dataset used for this study contains 59 images in bitmap format as well as 59 text files. This dataset is available at [17].

Without loss of generality, for test case generation, only 320 bytes of each input file have been encrypted and analyzed. This is equal to 20 blocks of a block cipher with a block size of 128 bits. For data encryption, two cryptography algorithms are employed: AES and a classic replacement encryption model.

B. Features and Classification

Generally, texts are constructed from alphanumeric characters. If each character is encoded by an ASCII code, only a limited set of codes is used to represent the text data. Let us assume that each of the byte values (between 0 and 255) is considered as a symbol. Then the frequency of symbols in images is more uniform compared with texts. This is illustrated in Figure 1. It is almost clear from this figure that most of the symbols in a text file belong to a small set of characters.

Classic replacement cryptosystems pass the frequency of symbols to their encrypted output. This symbol frequency is a useful feature for cryptanalysis. However, modern block ciphers, such as AES, do not transfer the frequency feature to their outputs. This is illustrated in Figure 2. As is clear from this figure, the distribution of symbols in the plaintext is almost similar to the distribution of symbols in the ciphertext with classic cryptosystems. The uniform distribution of symbols in the output of the AES algorithm is an important advantage of this modern cryptosystem. Figure 3 illustrates the same statistics for image data sources.
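The claim that a replacement cipher passes symbol frequencies to its output can be checked with a small sketch. It assumes that the "classic replacement" model is a byte-to-byte substitution with a fixed permutation table; the paper does not give the exact construction, and the names `make_substitution_table`, `substitute`, and `sorted_frequencies` as well as the sample text are hypothetical.

```python
import random
from collections import Counter


def make_substitution_table(seed: int) -> bytes:
    """Build a random permutation of the 256 byte values (the toy cipher key)."""
    table = list(range(256))
    random.Random(seed).shuffle(table)
    return bytes(table)


def substitute(data: bytes, table: bytes) -> bytes:
    """Replace every byte by its image under the permutation table."""
    return data.translate(table)


def sorted_frequencies(data: bytes) -> list:
    """Symbol counts in descending order, ignoring which symbol is which."""
    return sorted(Counter(data).values(), reverse=True)


text = b"the quick brown fox jumps over the lazy dog and the cat"
table = make_substitution_table(seed=2009)
cipher = substitute(text, table)

# The symbols are relabelled, but the shape of the histogram is unchanged.
same_profile = sorted_frequencies(text) == sorted_frequencies(cipher)
print(same_profile)
```

Since the table is a bijection on byte values, the histogram of the ciphertext is a relabelled copy of the plaintext histogram, which is why frequency statistics survive this kind of encryption.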

Fig. 2. Distribution of text symbols in plain and cipher modes.


Fig. 3. Distribution of image symbols in plain and cipher modes.

Fig. 4. Comparison between the distribution of symbols in the output of AES cryptosystem for image and text sources of data.

Due to the non-linear functionality of AES and the avalanche effect, it is expected that the distribution of symbols in the output of the AES algorithm is the same for both images and texts. However, simulation results show that there is a noticeable difference between these symbol distributions. Figure 4 compares these distributions. Simulations show that the variance of the symbol distribution with AES encryption is about $2.39 \times 10^{-7}$ for text data, while it is equal to $7.33 \times 10^{-6}$ for images. This difference suggests a method of attack applicable to ciphertext classification.

Figure 5 illustrates how the variance of the symbol distribution changes for different texts and images in the simulation dataset. Simple rules can be generated based on this feature for encrypted data classification. As an example, the following rule classifies all the encrypted texts as text, while 55.9% of cipher images are recognized as image:

IF Variance(Cipher Data) < $1.448 \times 10^{-5}$ THEN Cipher is TEXT

ELSE Cipher is IMAGE

where, Variance(Cipher Data) is the variance of symbol distribution in the encrypted data.
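A minimal sketch of this rule is given below. It assumes that the variance is taken over the relative frequencies of the 256 byte values in one encrypted sample; the paper does not state the exact formula, so this definition and the names `symbol_distribution_variance` and `classify_by_variance` are assumptions made for illustration.

```python
from collections import Counter

VARIANCE_THRESHOLD = 1.448e-5  # threshold quoted in the rule above


def symbol_distribution_variance(data: bytes) -> float:
    """Variance of the relative frequencies of the 256 byte values (assumed definition)."""
    if not data:
        raise ValueError("data must not be empty")
    counts = Counter(data)
    n = len(data)
    mean = 1.0 / 256  # relative frequencies over 256 symbols always average 1/256
    return sum((counts.get(s, 0) / n - mean) ** 2 for s in range(256)) / 256


def classify_by_variance(cipher: bytes) -> str:
    """Apply the IF/ELSE rule: low variance means TEXT, otherwise IMAGE."""
    if symbol_distribution_variance(cipher) < VARIANCE_THRESHOLD:
        return "TEXT"
    return "IMAGE"


perfectly_flat = bytes(range(256))  # every symbol exactly once, variance 0
single_symbol = bytes(320)          # 320 copies of one symbol, large variance

print(classify_by_variance(perfectly_flat))
print(classify_by_variance(single_symbol))
```

Under this assumed definition, 320 uniformly random bytes have an expected variance of about (1/256)(255/256)/320, roughly $1.2 \times 10^{-5}$, which lies just below the quoted threshold.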

Fig. 5. Variance of symbol distribution in the output of AES cryptosystem for all the images and texts in dataset.

Fig. 6. Comparison between the similarity factor in the output of AES cryptosystem for image and text sources of data.

One of the basic features of images is that long runs of pixels with the same color can be found in them with a high probability. While the length of these runs depends on the image, it can be shown that the average length of runs in images is greater than the average length of runs in texts. Hence, the appearance of long runs of data items in the plain or cipher data may be used as an indicator for classification purposes.

Let us assume that a block cipher algorithm is employed and that the block size is equal to m bits. Then, runs with lengths greater than or equal to 2m bits can be transferred to the output of the cryptosystem. As an example, in the case of AES with a block size of 16 bytes, runs longer than 32 bytes are encrypted into similar cipher blocks. Hence, it is more likely to find similar blocks in encrypted images than in encrypted texts.
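The effect of long runs can be illustrated with a small sketch. It assumes that each block is encrypted independently of the others, and it uses a keyed BLAKE2 hash truncated to 16 bytes as a deterministic stand-in for a block encryption. This stand-in is not AES and is not invertible; it only reproduces the property that equal input blocks give equal output blocks. All names, the key, and the sample data are hypothetical.

```python
import hashlib

BLOCK = 16  # block size in bytes, as for AES


def standin_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Keyed, deterministic stand-in for one block encryption (NOT AES)."""
    return hashlib.blake2b(block, key=key, digest_size=BLOCK).digest()


def encrypt_independent_blocks(data: bytes, key: bytes) -> list:
    """Process each full 16-byte block on its own; a trailing partial block is dropped."""
    return [standin_block_encrypt(data[i:i + BLOCK], key)
            for i in range(0, len(data) - BLOCK + 1, BLOCK)]


def count_repeated_blocks(blocks: list) -> int:
    """Number of blocks that duplicate an earlier block."""
    return len(blocks) - len(set(blocks))


key = b"illustrative key"
# 5 header bytes, a 48-byte run of one pixel value, then 27 varied bytes.
image_like = b"\x01\x02\x03\x04\x05" + b"\xff" * 48 + bytes(range(27))
# 80 distinct byte values, so no runs at all.
text_like = bytes(range(32, 112))

repeats_image = count_repeated_blocks(encrypt_independent_blocks(image_like, key))
repeats_text = count_repeated_blocks(encrypt_independent_blocks(text_like, key))
print(repeats_image, repeats_text)
```

In this example the 48-byte run starts at offset 5, so it fully covers two aligned 16-byte blocks, which map to the same output. A shorter or differently aligned run may cover only one full block, in which case no repeated block appears.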

We define the similarity factor as the probability of finding two similar blocks in the encrypted output of a cryptosystem. Other definitions of similarity can also be proposed based on the distance between blocks. Figure 6 compares the similarity factors of the encrypted images and texts of the dataset of [17]. The cryptography algorithm used in this simulation is AES and the block size is 16 bytes. Similar results are shown for classic replacement cryptosystems in Figures 7 and 8. In Figure 7, the similarity factor has been measured for blocks of size 16 bytes, while the block size in Figure 8 is equal to one byte. It is almost clear from these figures that with smaller block sizes, the similarity factor of encrypted text is higher. As another basic result, it can be seen that there is almost no difference between the behavior of AES and classic replacement cryptosystems from the similarity factor perspective.

Simple rules can be generated based on the similarity factor for encrypted data classification. As an example, the following rule classifies all the encrypted texts as text, while 55.9% of cipher images are recognized as image:

IF S.F.(Cipher Data) > 0 THEN Cipher is IMAGE

ELSE Cipher is TEXT

where, S.F.(Cipher Data) is the similarity factor of the cipher data.
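The similarity-factor rule can be sketched as follows. The paper defines the similarity factor only as a probability, so the estimate used here (the fraction of block pairs that are byte-for-byte identical) is one assumed reading, and the names `similarity_factor` and `classify_by_similarity` are hypothetical.

```python
from itertools import combinations


def similarity_factor(cipher: bytes, block_bytes: int = 16) -> float:
    """Fraction of block pairs that are identical (assumed reading of the definition)."""
    blocks = [cipher[i:i + block_bytes]
              for i in range(0, len(cipher) - block_bytes + 1, block_bytes)]
    pairs = list(combinations(blocks, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def classify_by_similarity(cipher: bytes, block_bytes: int = 16) -> str:
    """Apply the IF/ELSE rule: any repeated block means IMAGE, otherwise TEXT."""
    return "IMAGE" if similarity_factor(cipher, block_bytes) > 0 else "TEXT"


# Four distinct 16-byte blocks.
no_repeats = bytes(range(64))
# Blocks A, A, B, C: exactly one identical pair out of six.
one_repeat = bytes(range(16)) * 2 + bytes(range(100, 132))

print(classify_by_similarity(no_repeats))
print(classify_by_similarity(one_repeat))
```

With the 320-byte samples of the case study there are 20 blocks and therefore 190 block pairs per sample, so a single repeated block already yields a non-zero similarity factor under this reading.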

In the following section, some recommendations are proposed to protect cryptosystems against data mining attacks.

VI. RECOMMENDATIONS FOR SECURITY IMPROVEMENT

In the previous sections, it has been shown that feature extraction and data mining techniques can be used to analyze cryptosystems. The following recommendations improve the level of security against data mining attacks:

• With the help of source coding methods, one can eliminate some features from the plain data. Hence, data compression is recommended before any encryption.

• Using any kind of channel coding technique before data encryption should be avoided. Channel coding methods add redundancy and extra hidden features to the plain data.

• Block ciphers with larger block sizes are recommended. With larger block sizes, more features are categorized as intrablock features and hence are not passed to the output of the cryptosystem.

Fig. 7. Comparison between the similarity factor in the output of classic replacement cryptosystem for image and text sources of data. Similarity block size is 16 bytes.

• Block ciphers are recommended to be employed in CBC, CFB or OFB modes of operation [1]. This helps to eliminate a large number of interblock features and hence makes feature extraction a more difficult task.

• Cryptography algorithms must be selected with respect to the input data. The best cryptosystem is the one that hides more inter- and intrablock features of the plain data.
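The recommendation on chained modes of operation can be illustrated with a short sketch. It compares independent block processing with CBC-style chaining, in which each plain block is XORed with the previous cipher block before encryption. As a loud assumption, the per-block function is again a keyed BLAKE2 hash truncated to 16 bytes, which is not AES and not invertible; only the chaining structure is the point. All names, the key, and the all-zero IV are hypothetical.

```python
import hashlib

BLOCK = 16  # block size in bytes


def standin_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Keyed, deterministic stand-in for one block encryption (NOT AES)."""
    return hashlib.blake2b(block, key=key, digest_size=BLOCK).digest()


def independent_mode(data: bytes, key: bytes) -> list:
    """Every block is processed on its own: C_i = F(P_i)."""
    return [standin_block_encrypt(data[i:i + BLOCK], key)
            for i in range(0, len(data) - BLOCK + 1, BLOCK)]


def chained_mode(data: bytes, key: bytes, iv: bytes) -> list:
    """CBC-style chaining: C_i = F(P_i XOR C_{i-1}), with C_0 = IV."""
    previous, out = iv, []
    for i in range(0, len(data) - BLOCK + 1, BLOCK):
        mixed = bytes(p ^ c for p, c in zip(data[i:i + BLOCK], previous))
        previous = standin_block_encrypt(mixed, key)
        out.append(previous)
    return out


key = b"illustrative key"
iv = bytes(BLOCK)            # fixed IV for the demonstration only
run = b"\xff" * (4 * BLOCK)  # a 64-byte run of one pixel value

distinct_independent = len(set(independent_mode(run, key)))
distinct_chained = len(set(chained_mode(run, key, iv)))
print(distinct_independent, distinct_chained)
```

With independent processing the four identical plain blocks collapse to a single cipher block value, whereas chaining makes each output depend on all preceding blocks, so the repetition is no longer visible. A real CBC deployment would also need an unpredictable IV rather than the fixed one used here.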

VII. CONCLUSIONS

In this paper, the application of data mining and feature extraction in cryptanalysis and encrypted data classification has been explored. It has been shown how data mining techniques may be employed to classify encrypted data and to extract useful information from the ciphertexts. After proposing a framework for this purpose, a simple case study was presented, illustrating how this framework may be employed for encrypted data classification. Recommendations have been presented to improve the security of cryptosystems against the proposed attack.

REFERENCES

[1] A. J. Menezes, et al., Handbook of Applied Cryptography, CRC Press, 1996.

[2] D. Arroyo, G. Alvarez, V. Fernandez, "A basic framework for the cryptanalysis of digital chaos-based cryptography", Proceedings of the 6th International Multi-Conference on Systems, Signals and Devices, March 2009.

[3] D. Jena, S.K. Jena, "A Novel Visual Cryptography Scheme", Proceedings of the International Conference on Advanced Computer Control, pp. 207-211, Jan. 2009.

[4] M.S. Sharbaf, "Quantum Cryptography: A New Generation of Information Technology Security System", Proceedings of ITNG'09, pp. 1644-1648, April 2009.

[5] W.N. Chelton, M. Benaissa, "Fast Elliptic Curve Cryptography on FPGA", IEEE Trans. on Very Large Scale Integration Systems, Vol. 16, Issue 2, pp. 198-205, Feb. 2008.

Fig. 8. Comparison between the similarity factor in the output of classic replacement cryptosystem for image and text sources of data. Similarity block size is one byte.

[6] T.C. Aysal, K.E. Barner, "Sensor Data Cryptography in Wireless Sensor Networks", IEEE Transactions on Information Forensics and Security, Vol. 3, Issue 2, pp. 273-289, June 2008.

[7] C. Guangzhao et al., "An encryption scheme using DNA technology", Proceedings of BICTA'08, pp. 37-42, 2008.

[8] G. Jakimoski, K.P. Subbalakshmi, "Cryptanalysis of Some Multimedia Encryption Schemes", IEEE Transactions on Multimedia, Vol. 10, Issue 3, pp. 330-338, April 2008.

[9] E. Biham, A. Shamir, "Differential cryptanalysis of DES-like cryptosystems", Journal of Cryptology, Vol. 4, No. 1, pp. 3-72, 1991.

[10] E. Biham, O. Dunkelman, N. Keller, "The Rectangle Attack, Rectangling the Serpent", Proceedings of EUROCRYPT 2001, Lecture Notes in Computer Science 2045, p. 340 ff., Springer-Verlag.

[11] F.X. Standaert, et al., "Cryptanalysis of Block Ciphers: A Survey", Technical Report CG-2003/2, UCL Crypto Group, Universite Catholique de Louvain, 2003.

[12] M.W. Berry, Survey of Text Mining: Clustering, Classification, and Retrieval, Springer-Verlag, 2004.

[13] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[14] R.S. Michalski, Machine Learning and Data Mining, Wiley, 1998.

[15] National Bureau of Standards, "Data Encryption Standard", Federal Information Processing Standard 46, 1977.

[16] M. Matsui, "The First Experimental Cryptanalysis of the Data Encryption Standard", Advances in Cryptology - CRYPTO ’94 (Lecture Notes in Computer Science no. 839), Springer-Verlag, pp. 1-11, 1994.

[17] http://ece.iut.ac.ir/faculty/khadivi/datasets/dmcrypt.rar.
