
Rechkov. Lomonosov Report


Page 1: Rechkov. Lomonosov Report

Introduction Assembler as a native language Anomalies detection

Detecting abnormal executable files using binary code mining

Rechkov Anton

TU Berlin Germany & TTI SFU Russia

21st March 2012

Rechkov Anton Lomonosov Scholarship Report 21th March 2012 1 / 31

Page 2: Rechkov. Lomonosov Report


Malware evolution

Ciphered: encrypted malware code of viruses.

Oligomorphic: generation of a decryptor by randomly selecting each piece of the decryptor from several predefined alternatives.

Polymorphic: generation of a sample by encrypting the malware body and modifying the decryptor on each replication.

Metamorphic: reprogramming of the whole virus body by an obfuscation engine.

Page 3: Rechkov. Lomonosov Report

Modern detection techniques

Signature analysis: searching for a fixed pattern in code.

Emulation: unpacking and analysis through emulation of the malware code, followed by continued signature analysis.

Behavioral analysis: analysis of the flow graph of functions.
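As an illustration of the signature-analysis item above, a minimal sketch follows. It assumes a signature is a byte pattern in which `None` marks a wildcard position; the function name `find_signature` and the sample bytes are hypothetical and are not taken from the report.

```python
from typing import Optional, Sequence


def find_signature(data: bytes, pattern: Sequence[Optional[int]]) -> int:
    """Return the first offset where pattern matches data, or -1.

    A None entry in pattern is a wildcard that matches any byte.
    """
    n, m = len(data), len(pattern)
    for offset in range(n - m + 1):
        if all(p is None or data[offset + i] == p for i, p in enumerate(pattern)):
            return offset
    return -1


# Hypothetical example: "push ebp; mov ebp, esp; sub esp, <any imm8>" on x86.
code = bytes([0x90, 0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10, 0xC3])
signature = [0x55, 0x8B, 0xEC, 0x83, 0xEC, None]
print(find_signature(code, signature))
```

A fixed pattern of this kind is exactly what polymorphic and metamorphic code is designed to evade, which motivates the statistical approach in the rest of the report.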

Page 4: Rechkov. Lomonosov Report

Code modification

Obfuscation: transformation of executable program code which preserves functionality but complicates analysis and understanding of the algorithms.

Deobfuscation: resolving irrelevant code by:

Algebraic models

Formal grammars
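To make the notion of irrelevant code concrete, here is a small sketch of rule-based junk removal on a list of instruction strings. It is a simple peephole filter, not the algebraic-model or formal-grammar approach named above; the rule set and the sample instructions are assumptions chosen for illustration, and the three patterns are simply treated as neutral for 32-bit code.

```python
def strip_junk(instructions):
    """Remove a few semantically neutral patterns from an instruction list.

    Rules (illustrative only): drop `nop`, drop `mov r, r`,
    and drop an adjacent `push r` / `pop r` pair on the same register.
    """
    result = []
    for ins in instructions:
        parts = ins.replace(",", " ").split()
        if parts == ["nop"]:
            continue
        if len(parts) == 3 and parts[0] == "mov" and parts[1] == parts[2]:
            continue
        if (len(parts) == 2 and parts[0] == "pop" and result
                and result[-1].split() == ["push", parts[1]]):
            result.pop()  # cancel the matching push kept earlier
            continue
        result.append(ins)
    return result


obfuscated = ["push eax", "pop eax", "mov ebx, ebx", "xor ecx, ecx", "nop", "inc ecx"]
print(strip_junk(obfuscated))
```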

Page 6: Rechkov. Lomonosov Report

Outline

1 Assembler as a native language
  Binary code mining
  Native language processing
  Stochastic models

2 Anomalies detection

Page 7: Rechkov. Lomonosov Report

Binary code mining

Table of Contents

1 Assembler as a native language
  Binary code mining
  Native language processing
  Stochastic models

2 Anomalies detection
  Preparation
  Code generator lexemes
  Anomalies detection by neural networks
  Anomalies detection by probability model

Page 8: Rechkov. Lomonosov Report

Structure of a compiler

Code generator engine:
  Machine code generator
  Optimizers:
    interprocedural optimization (IPO)
    profile-guided optimization (PGO)
    high-level optimizations
  Mutation code generator / obfuscator

Figure: Common compiler scheme

Page 9: Rechkov. Lomonosov Report

Common code generator features

High-level optimizations
  Unique intermediate language
  Pre-optimization in the intermediate representation

Code generation
  Code templates from intermediate to target
  Number of used instruction types

Machine-dependent optimizer
  Instruction costs

Page 12: Rechkov. Lomonosov Report

Validating the theory

Experiment
  Determine instruction sequences
  Compile source code with several compilers
  Compare distributions

Compilers
  ⇒ MSVC
  ⇒ LLVM
  ⇒ GCC
  ⇒ Intel C++ Compiler
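The "compare distributions" step can be sketched as follows. The sketch assumes that each compiled binary has already been reduced to a list of instruction mnemonics, and it uses total variation distance as the comparison measure; the report does not state which measure was used, and the mnemonic lists below are made up.

```python
from collections import Counter


def ngram_distribution(mnemonics, n=2):
    """Relative frequency of each n-gram of consecutive mnemonics."""
    grams = [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0..1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical mnemonic streams for the same source built by two compilers.
build_a = ["push", "mov", "sub", "mov", "add", "pop", "ret"]
build_b = ["push", "mov", "sub", "lea", "add", "pop", "ret"]

dist_a = ngram_distribution(build_a)
dist_b = ngram_distribution(build_b)
print(total_variation(dist_a, dist_a))  # identical distributions
print(total_variation(dist_a, dist_b))  # larger when the code differs
```

A distance near zero for builds by the same compiler and a clearly larger one across compilers would support the idea that each code generator leaves its own statistical fingerprint.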

Page 14: Rechkov. Lomonosov Report

XTEA distribution test

Figure: Frequency of words in the binary, with panels (a) LLVM, (b) MSVC, (c) Intel C++, (d) GCC.

Page 15: Rechkov. Lomonosov Report

Mean distribution for optimized binaries

Page 17: Rechkov. Lomonosov Report

Native language processing

Text Mining

Language detection

Author detection

Text Classification

Document clustering

Page 19: Rechkov. Lomonosov Report

Stochastic models

Neural networks

Advantages

+ effective with a small number of training vectors

+ assessment of the proximity of all samples

Disadvantages

- predetermined model:
  manual word definition
  manual analysis of excessive elements
  retraining limitations

Page 20: Rechkov. Lomonosov Report

Probability model

Advantages

+ self-sufficient word definition

+ training only on positive vectors

+ unified training (flexible retraining)

Disadvantages

- large sample set needed for training

- errors in determining the distribution

- computational complexity

Page 23: Rechkov. Lomonosov Report

Preparation

Collecting statistics samples

Python
  Detecting the list of maximal repeated sequences
  Disassembling
  Searching strings

Matlab
  Stochastic models

Page 27: Rechkov. Lomonosov Report

Code generator lexemes

From disassembling to lexemes

Lexeme
  sequences of 3 to 6 instructions
  unknown bytes are ignored
  maximal repeated sequences
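The lexeme rules above can be sketched like this. The sketch assumes the disassembler output is a list of mnemonics in which undecodable bytes appear as the placeholder `None`, keeps every 3-to-6-instruction sequence that occurs at least twice, and then applies one possible reading of "maximal repeated sequences". The placeholder convention, the repeat threshold, and both function names are assumptions.

```python
from collections import Counter


def extract_lexemes(mnemonics, min_len=3, max_len=6, min_count=2):
    """Count instruction sequences of min_len..max_len that repeat.

    Entries equal to None stand for unknown bytes and are dropped first.
    """
    known = [m for m in mnemonics if m is not None]
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(known) - n + 1):
            counts[tuple(known[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}


def keep_maximal(lexemes):
    """Drop a lexeme when a longer one with the same count contains it."""
    def contains(long_seq, short_seq):
        return any(long_seq[i:i + len(short_seq)] == short_seq
                   for i in range(len(long_seq) - len(short_seq) + 1))
    return {s: c for s, c in lexemes.items()
            if not any(len(t) > len(s) and d == c and contains(t, s)
                       for t, d in lexemes.items())}


stream = ["push", "mov", "sub", None, "call", "push", "mov", "sub", "call", "ret"]
lexemes = extract_lexemes(stream)
for seq, count in sorted(keep_maximal(lexemes).items()):
    print(count, " ".join(seq))
```

Counting every window this way is quadratic in the worst case, which is one reason to move to a suffix-based index for large binaries.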

Page 28: Rechkov. Lomonosov Report

Lexemes analysis

Suffix tree:
  memory-efficient
  string searching faster than O(N^2)
  fast assessment of maximal repeats in strings

Figure: Suffix tree example
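As a compact stand-in for the suffix tree, the sketch below finds the longest repeated substring with a naively built suffix array. A suffix array is a different structure from the suffix tree named on the slide, and this naive construction does not reach the complexity a real suffix tree offers; it only illustrates how sorted suffixes expose repeats. The function name and the sample input are hypothetical.

```python
def longest_repeat(s):
    """Return the longest substring that occurs at least twice in s."""
    # Naive suffix array: sort start positions by the suffix they begin.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    best = s[:0]
    for a, b in zip(order, order[1:]):
        # Longest common prefix of two neighbouring suffixes.
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        if k > len(best):
            best = s[a:a + k]
    return best


print(longest_repeat("banana"))  # ana
```

Because the function only uses slicing and comparison, it also accepts a tuple of mnemonics in place of a character string.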

Page 30: Rechkov. Lomonosov Report

Anomalies detection by neural networks

Radial basis networks

  no need to choose the number of hidden layers
  no pathological convergence
  fast convergence through a combination of learning algorithms

Figure: Neural net architecture
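For reference, a radial basis network with a single hidden layer of Gaussian units computes the function below. This is the textbook form, not necessarily the exact architecture in the figure; the symbols $\mathbf{c}_j$ (centres), $\sigma_j$ (widths), $w_j$ (output weights), and $M$ (number of hidden units) are notation introduced here.

```latex
y(\mathbf{x}) = w_0 + \sum_{j=1}^{M} w_j \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{c}_j \rVert^2}{2\sigma_j^2} \right)
```

There is exactly one hidden layer, so only its size and centres have to be chosen. Once the centres and widths are fixed, the weights $w_j$ enter linearly and can be fitted by least squares.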

Page 31: Rechkov. Lomonosov Report

Compiler detection

Figure: Compiler detection testing

Page 33: Rechkov. Lomonosov Report

Anomalies detection by probability model

Multivariate Gamma

Using a set of bivariate and trivariate Gamma distributions:
  a Gamma distribution is assumed
  sample proximity
  fast training

Figure: Empirical and theoretical PDF of an element (x-axis: X, y-axis: PDF; curves: Gamma PDF, Empirical PDF)
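A univariate version of the fitting step can be sketched as follows. It assumes method-of-moments estimation of the Gamma shape and scale, which the report does not confirm, and it does not cover the bivariate or trivariate case; the sample values are invented.

```python
import math
from statistics import fmean, pvariance


def fit_gamma(samples):
    """Method-of-moments estimate of Gamma (shape k, scale theta)."""
    mean = fmean(samples)
    var = pvariance(samples, mu=mean)
    return mean * mean / var, var / mean


def gamma_pdf(x, k, theta):
    """Density of Gamma(k, theta) at x, computed in log space for x > 0."""
    if x <= 0:
        return 0.0
    log_pdf = (k - 1) * math.log(x) - x / theta - math.lgamma(k) - k * math.log(theta)
    return math.exp(log_pdf)


# Hypothetical relative frequencies of one lexeme across training binaries.
freqs = [0.010, 0.015, 0.020, 0.025, 0.030, 0.020, 0.018, 0.022]
k, theta = fit_gamma(freqs)
print(k, theta, gamma_pdf(0.02, k, theta))
```

Evaluating the fitted density at a new sample's frequency gives the kind of proximity score that a probability model can threshold.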

Page 34: Rechkov. Lomonosov Report

Probability model testing

Error graphs of compiler probabilities as a function of the coefficient of the minimal value $P^i_{\min}$:

$$p = P^i_{\min} \cdot 10^{\mathrm{coef}}$$

Figure (first plot): error vs. coeff for min value, range 0 to 10; curves: false positive GCC O0, false negative Clang, false negative Intel, false negative GCC O2, false negative MS.

Figure (second plot): error vs. coeff for min value, range 0 to 20; curves: false positive MS, false negative LLVM.
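The thresholding behind these error curves might look like the sketch below. It assumes that a sample is attributed to compiler i when its model probability is at least p = P_min^i * 10^coef, with P_min^i the smallest probability seen on that compiler's training samples. The decision rule, the direction of the inequality, and all scores are assumptions, not results from the report.

```python
def error_rates(own_scores, foreign_scores, p_min, coef):
    """False-negative and false-positive rates for one coefficient value.

    own_scores: model probabilities of samples built by compiler i.
    foreign_scores: model probabilities of samples built by other compilers.
    """
    threshold = p_min * 10 ** coef
    false_neg = sum(s < threshold for s in own_scores) / len(own_scores)
    false_pos = sum(s >= threshold for s in foreign_scores) / len(foreign_scores)
    return false_neg, false_pos


# Hypothetical probabilities; p_min is taken from separate training data.
p_min = 1e-6
own = [5e-4, 2e-3, 8e-5, 5e-2]
foreign = [3e-6, 4e-5, 9e-7, 2e-4]

for coef in range(0, 5):
    fn, fp = error_rates(own, foreign, p_min, coef)
    print(coef, fn, fp)
```

Sweeping the coefficient trades one error type against the other, which is the shape a plot of error against "coeff for min value" would show under this rule.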

Page 35: Rechkov. Lomonosov Report

Probability model testing

Problem of existing zero elements

Figure: two plots of error vs. coeff for min value, range 0 to 10; curves in each plot: false positive GCC O2, false negative Clang, false negative Intel, false negative GCC O0, false negative MS.

Page 36: Rechkov. Lomonosov Report

Conclusion

Proposed a connection between native language and assembler

Developed algorithms for lexical analysis of assembler language

Developed experimental stochastic models:
  based on neural networks
  based on a probability model

Implemented lexical analysis of assembler language

Approximate false positive errors of compiler detection:
  27%
  10-15%

Page 37: Rechkov. Lomonosov Report

Questions?
