5
A Controllable and Adaptable Computer Virus Detection Model WANG Xin School of computer science and control Guilin University of Electronic Technology Guilin 541004, China E-mail: [email protected] JIANG Hua School of computer science and control Guilin University of Electronic Technology Guilin 541004, China E-mail: [email protected] DENG ZhenRong School of computer science and control Guilin University of Electronic Technology Guilin 541004, China E-mail: [email protected] Abstract—This paper presents a controllable and adaptable computer virus detection model based on immune system, improves the immune mechanism of the model by introducing the thought of variable detector radius, and achieves the detector radius be set and adjusted automatically. The virus detection model has some effects on solving the conflict between the number of detectors and the scope of detection. Thus it enables the system to have better dynamic and self-adaptability. Keywords-Artificial Immune; detector; controllable adaptive; virus detection I. INTRODUCTION The problem of biological immune system (BIS: Biological Immune System) and detecting virus in computers is almost the same. One of the most natural applications in the immune calculation is virus detection. The principle of immunity has been applied to the virus detection in the Forrest [1] and the virus laboratory [2] at IBM. After that the research in Computer Virus Immune System(CVIS) has become more and more popular. People like S. Homeyr, J. O. Kphart, P. Harmer, and T. Okamoto have put forward a series of computer-virus- immunity model [3] . However, these existing CVIS has not been widely used, mainly because these CVISs still have some shortcomings. One of the most important shortcomings is the contradiction between system’s efficiency, detection rate and error rate (error-affirmative and error-negative). Because library of CVIS itself is too large, the number of detectors is too large. And the cost of training mature detectors and the size of the monitors’ set shows an exponential relationship, which directly resulted in time inefficiencies. But if we decrease the size of the monitors’ set, the system will drop its detection rate and raise its error-affirmative rate. For example, in the virus laboratory at IBM's immune system, because the tolerance- transaction takes too much time, we have to change our original mind that want to do distributed defense to gather the tolerance-transaction to treating and training on a fixed, which make the system effective and feasible. But in fact, this method delays the response-time to detecting unknown viruses at client. What’s more is that if the analytical center collapses or the analytical center loses contact with the client, the client will not be able to defense against unknown viruses. So, in this paper, a controllable and adaptable computer virus detection model based on immune system, AISVDM (Virus Detection Model base-on Artificial Immune System), has been put forward. It’ll discuss with solving the conflict between the number of detector and the scope of detection and the model’s dynamic and self-adaptability. II. IMMUNE-SYSTEM-BASED VIRUS DETECTION In an immune system, the most important question is how to distinguish non-autologous organism from the body. Because the information in computer software system will be eventually reduced to a binary string, so in fact, the computer virus detection is a problem which is according to some certain rules and a priori knowledge analyzing and detecting problems in binary strings. A. The definition of immune components In the natural immune system, the body represents a set of all normal cells in biological body. And non-autologous body is on behalf of a set of external cell. In this model, the definition of the state space of computer virus detection is that using a fixed-length binary string consisting of finite set U to express all of the cells, and (a binary string is the same as a hexadecimal string ), i is natural number. U can be divided into two subsets: N expresses non-autologous body; S represents the body. The relationship between the two is as . D, the definition of a set of immune cells (detector), is as D ={ d|d=< a, age, count, max-age>, a U age, count, max- age Z+ }, a for antibody, age for detector s age, count for the number of matching the antigens, that’s called accumulative affinity, Z+ for the positive integer set, max-age for the largest age of detector. Here the length of a check for 24. It means the definition of antibody is a length-of-24 hexadecimal virus signature. The set of immune cells also can be divided into immune cells of immature and mature immune cells and memory immune cells, using the closest matching rules match the rules of biological consecutive r-bit matching rules (Rcb). Matching function is defined as: _ 1, (, )/ (, ) 0, r dis match f xy l f xy otherwise β = ソ (1) In this case, l is for the length of x, y; f r_dis (x, y) is for the match rate of x, y. B. Implementation of the immune mechanism 1) Negative selection mechanism Immune mechanisms which are mainly used in the system negative selection included negative selection mechanism and clonal selection mechanism. Negative selection process is the processes of immune cells’ self- tolerance to ensure not identify immune cells as viruses in the 2009 Fifth International Joint Conference on INC, IMS and IDC 978-0-7695-3769-6/09 $26.00 © 2009 IEEE DOI 10.1109/NCM.2009.42 1976 2009 Fifth International Joint Conference on INC, IMS and IDC 978-0-7695-3769-6/09 $26.00 © 2009 IEEE DOI 10.1109/NCM.2009.42 1965 2009 Fifth International Joint Conference on INC, IMS and IDC 978-0-7695-3769-6/09 $26.00 © 2009 IEEE DOI 10.1109/NCM.2009.42 1977

[IEEE 2009 Fifth International Joint Conference on INC, IMS and IDC - Seoul, South Korea (2009.08.25-2009.08.27)] 2009 Fifth International Joint Conference on INC, IMS and IDC - A

Embed Size (px)

Citation preview

Page 1: [IEEE 2009 Fifth International Joint Conference on INC, IMS and IDC - Seoul, South Korea (2009.08.25-2009.08.27)] 2009 Fifth International Joint Conference on INC, IMS and IDC - A

A Controllable and Adaptable Computer Virus Detection Model

WANG Xin School of computer science and control

Guilin University of Electronic Technology

Guilin 541004, China E-mail: [email protected]

JIANG Hua School of computer science and control

Guilin University of Electronic Technology

Guilin 541004, China E-mail: [email protected]

DENG ZhenRong School of computer science and control

Guilin University of Electronic Technology

Guilin 541004, China E-mail: [email protected]

Abstract—This paper presents a controllable and adaptable computer virus detection model based on immune system, improves the immune mechanism of the model by introducing the thought of variable detector radius, and achieves the detector radius be set and adjusted automatically. The virus detection model has some effects on solving the conflict between the number of detectors and the scope of detection. Thus it enables the system to have better dynamic and self-adaptability.

Keywords-Artificial Immune; detector; controllable adaptive; virus detection

I. INTRODUCTION The problem of biological immune system (BIS: Biological

Immune System) and detecting virus in computers is almost the same. One of the most natural applications in the immune calculation is virus detection. The principle of immunity has been applied to the virus detection in the Forrest [1] and the virus laboratory[2] at IBM. After that the research in Computer Virus Immune System(CVIS) has become more and more popular. People like S. Homeyr, J. O. Kphart, P. Harmer, and T. Okamoto have put forward a series of computer-virus-immunity model[3]. However, these existing CVIS has not been widely used, mainly because these CVISs still have some shortcomings. One of the most important shortcomings is the contradiction between system’s efficiency, detection rate and error rate (error-affirmative and error-negative). Because library of CVIS itself is too large, the number of detectors is too large. And the cost of training mature detectors and the size of the monitors’ set shows an exponential relationship, which directly resulted in time inefficiencies. But if we decrease the size of the monitors’ set, the system will drop its detection rate and raise its error-affirmative rate. For example, in the virus laboratory at IBM's immune system, because the tolerance-transaction takes too much time, we have to change our original mind that want to do distributed defense to gather the tolerance-transaction to treating and training on a fixed, which make the system effective and feasible. But in fact, this method delays the response-time to detecting unknown viruses at client. What’s more is that if the analytical center collapses or the analytical center loses contact with the client, the client will not be able to defense against unknown viruses. So, in this paper, a controllable and adaptable computer virus detection model based on immune system, AISVDM (Virus Detection Model base-on Artificial Immune System), has been put forward. It’ll discuss with solving the conflict between the number of detector and the scope of detection and the model’s dynamic and self-adaptability.

II. IMMUNE-SYSTEM-BASED VIRUS DETECTION In an immune system, the most important question is how to distinguish non-autologous organism from the body. Because the information in computer software system will be eventually reduced to a binary string, so in fact, the computer virus detection is a problem which is according to some certain rules and a priori knowledge analyzing and detecting problems in binary strings.

A. The definition of immune components In the natural immune system, the body represents a set of

all normal cells in biological body. And non-autologous body is on behalf of a set of external cell. In this model, the definition of the state space of computer virus detection is that using a fixed-length binary string consisting of finite set U to express all of the cells, and (a binary string is the same as a hexadecimal string ), i is natural number. U can be divided into two subsets: N expresses non-autologous body; S represents the body. The relationship between the two is as . D, the definition of a set of immune cells (detector), is as D ={ d|d=< a, age, count, max-age>, a U age, count, max-age Z+ }, a for antibody, age for detector s age, count for the number of matching the antigens, that’s called accumulative affinity, Z+ for the positive integer set, max-age for the largest age of detector. Here the length of a check for 24. It means the definition of antibody is a length-of-24 hexadecimal virus signature. The set of immune cells also can be divided into immune cells of immature and mature immune cells and memory immune cells, using the closest matching rules match the rules of biological consecutive r-bit matching rules (Rcb). Matching function is defined as:

_1, ( , ) /( , )

0,r dis

match

f x y lf x y

otherwise

β≥= (1)

In this case, l is for the length of x, y; fr_dis(x, y) is for the match rate of x, y.

B. Implementation of the immune mechanism 1) Negative selection mechanism

Immune mechanisms which are mainly used in the system negative selection included negative selection mechanism and clonal selection mechanism. Negative selection process is the processes of immune cells’ self-tolerance to ensure not identify immune cells as viruses in the

2009 Fifth International Joint Conference on INC, IMS and IDC

978-0-7695-3769-6/09 $26.00 © 2009 IEEE

DOI 10.1109/NCM.2009.42

1976

2009 Fifth International Joint Conference on INC, IMS and IDC

978-0-7695-3769-6/09 $26.00 © 2009 IEEE

DOI 10.1109/NCM.2009.42

1965

2009 Fifth International Joint Conference on INC, IMS and IDC

978-0-7695-3769-6/09 $26.00 © 2009 IEEE

DOI 10.1109/NCM.2009.42

1977

Page 2: [IEEE 2009 Fifth International Joint Conference on INC, IMS and IDC - Seoul, South Korea (2009.08.25-2009.08.27)] 2009 Fifth International Joint Conference on INC, IMS and IDC - A

detection. In the process of self-tolerance, set of immune cells will eliminate all detecting cells that can identify autologous cells in immune cells' set. This process can be used like following formula:

ftolerance(D)=D-{d | d B y Self(fmatch(d.a ,y)=1)} (2) The merits of the algorithm are simple and easy to realize

and to analyze the problem space and abnormal detection without a priori knowledge. It has a strong robustness, parallel, distributed detection and so on. However, there are some weaknesses like: detectors are produced by random algorithms, it is inevitable that a lot of invalid detector will be produced; and its time complexity grew exponentially with the size of self-set, it is difficult to meet the complex system of timeliness requirements; in addition, nonself space can not be covered completely by the detectors. Therefore, the question in based on negative selection methods is how to cover nonself space by detectors and the number of detectors, that is, how to use less detectors to cover nonself space. In the literature [4] [5], De Castro proposed a simulated annealing algorithm method of induction of group diversity. Drawing on the idea, the simulated annealing method will be used to optimize the detector to ensure hardly cover the self-space in premise; expand the set of detector coverage not my space; make sure that in the process the number of detectors are fixed. Because of wanting to expand the coverage of nonself space, the first objective function is the detector volume, just like formula (2), the constraint condition is that the detector can’t fall into self-space.

C D = Volume(D) s.t.D ⊄ self (3) C(D)is for objective function Volume(D) is for detectors space.

However, calculating the detector’s volume wastes too much time. So it will be changed to an expression that can be calculated easily and reflect the initial purpose. Intuitively, expanding on nonself spaces detector coverage will reduce the overlapping between detectors. So the maximum optimization problem can be converted into minimum optimization problem, added constraints to the objective function at the same time. The simplified formula of the objective function such as (2.4) follows:

C (D) = Overlapping (D) + Self covering (D) (4) Overlapping (D) is for the overlapping spaces of set of

detectors. Self covering (D) is for the detector setting on the self-space coverage. The distance between detectors gets smaller, the overlapping spaces of set of detectors gets larger. So, overlapping spaces between detector di and dj Spaces Overlapping (di, dj) approximate as follows:

i ji j i j

1,fh_dis(d ,d )/(lx,ly)>= ;Overlapping(d ,d ) =fmatch(d ,d )=

0,otherwise;

β (5)

So, the expression of Overlapping (D) is as follows:

i ji j

Overlapping(D) = Overlapping(d ,d ) ≠

(6)

Similarly, coverage Self covering (di, sj) of detector di and self-space-element sj, coverage Spaces Self covering (D) of set of detectors and self-space are as follows:

i ji j match i j

1,fr_dis(d ,s )/l ;Selfcovering(d ,s ) =f (d ,s )=

0,otherwise;

β≥ (7)

i j

i jd s

Selfcovering(D) = Selfcovering(d ,s ) (8)

It will get the eventual minimize objective function by putting formula (5), (6), (7), (8) into the formula (4). The objective function corresponds to the energy function in the simulated annealing algorithm. Solution space is the set of nonself space. The initial solution is the generated detector. Drawing on the idea of simulated annealing algorithm optimizes the detector. When calculate objective function, only on the current optimized detector, it will do the coverage between detector each other and coverage-calculation in self-space. The optimized detector’s objective function which is given above contains a coverage-element of detector to the set of self-space. And this element is to ensure that detector covers elements of the self-set in the mobile process the less the better. It is inevitable that detectors and self-set element will overlap. In order to avoid making mistakes in detection, negative selection algorithm will be used again after the optimization to doing filtration, that is, remove the autologous-matched detector. Specific optimization algorithm is as follows:

a) Initialization: Adjacent radius of detector rnear; Attenuation rate of adjacent radius ; Evolutionary generations nummax; Initial parameter t; Attenuation coefficient of temperature ; Initial solution state S, that is, detector’s set D; Iterative times for each t L.

b) For detector di i = 1,2, , N do step to optimize.

For k =1, 2, … , L, do step • Update the position of detector di, that is, generate a

variation detector di' using rnear for mutation-digit (variation algorithm used see 2.2.3);

• Calculate the objective function increment C'= C (di')-C (d), C (d) for the objective function above;

• If C'< 0, accept di' as a new current solution, otherwise accept di' as a new current solution under the probability of exp ( C'/ t), and so di = di';

• If the termination conditions are matched, output the current solution as the optimal solution. Procedure ends. Otherwise, return to .

Judge whether the given evolutionary generations nummax is matched, if so, transferred to steps , or make t(num +1) = *t(num) match, t be reduced gradually and adjacent radius of detector rnear, that is, rnear [ *rnear], then go to step .

Judge whether all detectors have already been optimized, if so, do autologous-tolerance again, and then output the

197719661978

Page 3: [IEEE 2009 Fifth International Joint Conference on INC, IMS and IDC - Seoul, South Korea (2009.08.25-2009.08.27)] 2009 Fifth International Joint Conference on INC, IMS and IDC - A

optimized combination of detector D '; Otherwise, return to Step . Mature detector will be initialized by using algorithm as follows:

• Initialization: set generating N candidate-detectors (that is, property string) in set of detectors D;

• Checkup: While (detector’s set size <N) do:

Affinity calculation: calculate affinity between each self-element and candidate-detector;

Choosing: If the candidate-detector has identified any one element of self-set, this detector should be eliminated.

Else put this detector inside the set of detectors. • Using optimization algorithms of set of detectors to

optimize mature detector, then get D' (the number of detectors is N').

• If N'=N'-N<0, N detectors will be generated randomly, and using step to check, or the initialization is done.

Set D consist of 1000 randomly generated detectors; the initial value of adjacent radius of detector rnear is two third of each detector length, that means, rnear= 16; the initial parameter t =100; iterative times for each t L =10; evolutionary generations nummax = 200; attenuation rate of adjacent radius = 0.8; attenuation coefficient of temperature = 0.8. Then optimize set of detectors by using simulated annealing algorithm. The overlapping between detectors mutate with evolutionary generations as Figure 1. It can be seen from the figure, with evolutionary generations increasing. The overlapping between detectors decrease and the curve decline rapidly at the initial stage. It proves that using simulated annealing algorithm can decrease the overlapping between detectors. Finally, using negative selection algorithm filters detectors and deletes detector which is in the self-set.

Figure 1 The overlapping between detectors mutate with evolutionary

generations

2) Clonal Selection mechanism

Clonal selection algorithm is used to identifying unknown computer viruses. Network system is a constantly changing real-time system. Viruses often change its way to spread via Internet. Kim and Bentley proposed an algorithm which is based on AIS of the dynamic clonal selection (DynamiCS) [6]

in 2002. It has expanded on AIS of the dynamic clonal selection and improved the system's dynamic and self-adaptability. Three different detector groups implement adaptation of Dynamics’ AIS-based coordinated. They are: immature, mature and memory detectors groups. Dynamic Clonal Selection Algorithm wants to streamline the system so that it only has the part which has key effect on the system adaptation. Algorithm also started with randomly generated and immature detector. A detector after autologous-tolerance will become a mature detector. A mature detector will become a memory detector, when the times of identifying viruses (count) reaches a certain threshold value of A. If mature detector doesn’t find any virus in its life cycle Tm, death of the detector means deleting this detector in set of detectors. Memory detector has a longer life cycle Tr to simulate the memory immune cells. If in its life cycle, the virus is not detected, the memory detector degenerates to a mature detector. The definition of detector and the set of detectors are as follows:

D ={ d | d=< a,age,count ,max-age>, a U age,count,max-age Z+} (9)

Dm = {x | x B x.max-age = Tm y Self ( fmatich(x.a, y) = 0)} (10)

Dr = {x | x B x.max-age = Tr ?y Self ( fmatich(x.a, y) = 0)} (11)

Clonal selection algorithm is as follows:

Clone_selection {

Set the initial detection cell set D;

Generations T = 1; / * set it to * /

While (Virus can’t be identified by all detection cell in D and the generations of clonal selection T is less than Nmax) {

T++;

Immature detector age++;

Mature detector age++;

Memory detector age ++; //reset parameters

Calculate affinity between all immune cells and viruses;

Select a subset whose affinity is larger than s;

Clone D1 generated by the immune cell’s subset depending on the affinity;

Clone immune cell in mutated D1 according to the affinity;

Do autologous-tolerance on D1; / * tolerance * /

Add D1 to D as an immature detector;

Removed the eldest mature detector;

If (memory detector free time> Tr) change it to a mature detector and set generations 1;

If (size of immature detector group + size of mature detector group <non-memory detector group’s maximum)

{

197819671979

Page 4: [IEEE 2009 Fifth International Joint Conference on INC, IMS and IDC - Seoul, South Korea (2009.08.25-2009.08.27)] 2009 Fifth International Joint Conference on INC, IMS and IDC - A

Do

{Generate a random detector;

Add new detector to the immature group; }

Until (size of immature detector group + size of mature detector group = non-memory detector group’s maximum)

}

}

If (immune cell D identifies virus) immune cell D’s count add 1;

If (the times of identifying virus by immune cell D reach the number of memory detector activation threshold A), add immune cell D to memory detector group;

Not be able to detect viruses, The End;

}

Cloning number of a detector di nd=[ *fr_dis(di,v)/l], v is for the virus being detected, is for limitation times of clones, l is for length of d, fr_dis(di,v) is for match-degree of Rcb between di and v, s is for clonal selection threshold ( s< ). Ways for mutating detector di appears 2.2.3.

This system detects known viruses by using memory detector set and unknown viruses by procedure of clonal selection. In the procedure, using negative selection algorithm does autologous-tolerance to ensure not identifying self cells. And it generates new detector randomly or mutates mature detector (including memory detector).

Virus identification algorithm is as follows:

detect_virus { Initial parameter settings; Initialize a set of immune cells D; Read suspicious sample’s document for identification; If (virus is identified by a immune cell in D) { Known viruses, detection success; //known virus

detection;} Else {/ * unknown virus, called clonal selection process to

identify * / Call algorithm clone_selection identification; } If (immune cell D identifies virus) immune cell D’s count

add 1; If (the times of identifying virus by immune cell D reach

the number of memory detector activation threshold A), add immune cell D to memory detector group;

}

3) Mutation of detector

Detector mutate as the way followed. Select r-bit characters randomly as sub-generation character (testing immediate mutation or controllable mutation determine the value of r). According to the probability of a character in the location select what’re another 24-r-bit. In this experiment, 60 virus signatures for the sample, accounting any one of the bits, the probability of a certain character is Pchi. The calculation

formula is as follows: Pchi=Nchi/36. Nchi is for the times of the character ch at the NO. i bit. Thirty six means the number of the group inside twenty six characters and ten numbers. For example, appearing times of character a is Na3=3, so the probability of character a at the third bit is Pa3=3/36=0.083. Mutating generation i, the detector is 2f251aba002775fa1fb975fa, and the next generation i+1 is 1fh412b40023grf41f49r5f4.

III. CONCLUSIONS Biological immune system becomes a complete and

powerful immune system through natural evolution. At present, the research of biological immune mechanism is still imperfect. Relatively speaking, the research of computer-imitation on biological immune has just started. It’s the same as the computer immune mechanism. Furthermore, the biological immune system is a highly parallel system, but computer virus immune system can’t be parallel as biological immune system. So the most exigent problem which can be applied and have practicality is improving the efficiency of system in the situation of limiting resources and having a good time efficiency, making system have nice dynamics, controllability, self-adaptability and robustness. In this paper, these problems have been discussing and built artificial immune model AISVDM (Virus Detection Model base-on Artificial Immune System) for detecting viruses. In this model, the concept of variable radius of the detector has been introduced and drawing on simulated annealing algorithm for optimization as well as automatically setting and adjusting of radius of detector. It has effect on solving the conflict between the number of detector and the scope of detection. Using optimized dynamic clonal selection to study and detect viruses make the system has better dynamic and self-adaptability. But there are a lot of questions need to be further explored.

1) In theory, it can detect any unknown viruses by CVIS inheriting the thought of distinguish self or nonself in BIS. But how can we ensure the self-information is safe, how to kill the virus and to recover the system faster and more efficient? Whether a more direct and effective way to do that exist? And can it be found through reaching BIS?

2) The model set up to avoid a genetic control, cell signaling and other complex questions on biological immune system. How to put these mechanisms into the model and what will happen to dynamic, adaptability, security and time efficiency of the system?

3) The artificial immune system algorithm is just for specific cases. Profound and significant research achievement about the algorithm or even the calculation and estimation, proving the convergence of whole system is still a fat lot.

REFERENCES

[1] Stephanie Forrest, Alan S.Perelson, Lawrence Allen, Rajesh Cherukuri. Self-Nonself Discrimination in a Computer[C]. In Proceedings of IEEE Symposium on Computer Security and Privacy, 2005.7.

197919681980

Page 5: [IEEE 2009 Fifth International Joint Conference on INC, IMS and IDC - Seoul, South Korea (2009.08.25-2009.08.27)] 2009 Fifth International Joint Conference on INC, IMS and IDC - A

[2] Kephart J O. Biological inspired defenses against computer viruses[A]. Proceeding of 14th Joint Conference on Artificial Intelligence[C]. Los Alamitos: IEEE Computer Society Press,2004.

[3] J.O. Kephart, Gregory B. Sorkin, Morton Swimmer, and Steve R. White. Blueprint for a Computer Immune System[C]. Proceedings of the 1997 International Virus Bulletin Conference, San Francisco, California, October, 2006.

[4] De Castro L N.& Von Zuben F J.Artificial Immune systems Part -Basic Theory and application[J].Technical Report-RT DCA 2007 01

89

[5] De Castro L N.& Von Zuben F J.A Simulated Annealing Approach to Increase Diversity In the Antibody Repertorie[J].submitted to the Evolutionary Computation journal. 2000

[6] Bentley, Kim, Immune Memory in the Dynamic Clonal Selection Algorithm[C], 1st International Conference on Artificial Immune Systems (ECARIS-2002),University of kent at Canterbury, UK,2002.9

[7] [7] A Hofmeyr, S. Forrest, Architecture for an Artificial Immune System[J], Evolutionary Computation, vol.7(1), 2000.

198019691981