Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Manel Abdellatif
introducing
Parallel programming for accelerating Malware detection
1
Motivations
Introduction : Parallelism & security
• Ever-growing number of threats • The market boost of embedded systems• The increasing variety of operating systems• Malware detection is a highly common and computationally-intensive
problem in intrusion detection systems
1Real time detection
2High throughtput and performance
3Need to execute many detection algorithms
What about small-scale systems?
o How to get benefit from parallel architectures to monitor small-scale systems?
• Great use of small-scale systems : mobile phones, gaming consoles, SoC etc.
• Memory and computation performance constraints• Ever growth of attacks on small-scale systems • Improvement in parallel computation performance
Security threats Parallel performance computation
VS
Introduction : Parallelism & security
Introduction : Parallelism & security
Previous Work• Development of parallel architecture for malware
detection based on pattern matching technique• Achieving better computing performance • Use of Cuda and desktop GPU
Current Work• Migration to mobile platform • Use of OpenCL• Building of behavioral malware dataset based on
syscalls patterns • Development of memory optimization techniques
Benefits of mobile GPGPU
Parallel Processing architecture
To ensure high security level of mobile devices, accelerating malware detection can be provided by GPU parallel processing
Offering a complementary processing unit to the CPU
Adapted to SIMD architecture
Fast memory types access (shared memory , constant memory)
More and more evolving
GPUs driven by high-end applications: prepared to pay a premium for high performance
5
Parallel Processing architecture
Evolution of GPUs for embedded systems
Parallel Processing architecture
6
Parallel Processing architecture
Popular mobile GPUs
Parallel Processing architecture
7
Parallel Processing architecture
Adreno 330
-Inbuilt in Snapdragon™ 800 Series Processors. -speed can push to 3.6 gig pixels/s-Used in HTC one, Xperia z ultra
Power VR SGX544mp3
-Inbuild in Exynos 5 Octa processor-used in galaxy s4 @ 533 MHz clock speed
Mali T604
-First time used in Exynos 5 -The 1st Midgard architecture gpu for arm-Used in famous series of Google tablet nexus 10.
GPU architecture
Parallel Processing architecture
8
Parallel Processing architecture
Architecture
9
Architecture Challenges
10
(1) Which type of dataset can we use in
order to have good detection accuracy?
Limitation of Mobile GPU memory VS important memory
requirement of DFA(3) The need of applying
memory compacting techniques
(2) How can we increase GPU processing performance?
Challenge 1
11
Which type of dataset can we use in
order to have good
detection accuracy?
Thread
1
Thread 2.1
Thread 3.1
Thread 3.2
Thread 2.2
Thread 3.4
Thread 1
Thread 2.1
Thread 3.1
Thread 3.2
Thread 4.1
Thread 2.2
Thread 3.4
Thread 4.2
Extraction of malicious behavior based on syscalls sequences with the thread-grained extraction technique
Malware detection
13
• Key concept: Malwares which have the same malicious code embedded on benign applications, will have common malicious behaviour
• Tracking the tree architecture of applications threads
• Malicious behaviour is likely to appear at the same thread level on applications having the same malware
Application 1 Application 2
Thread tree architecture of tow applications belongingTo the same malware family
Thread 1
Thread 2.1
Thread 3.1
Thread 3.2
Thread 2.2
Thread 3.4
Thread 1
Thread 2.1
Thread 3.1
Thread 3.2
Thread 4.1
Thread 2.2
Thread 3.4
Thread 4.2
Training phase
Malware detection
14
Recording Phase
• Having different applications that belongs to the same malware family
• Execution of every application • Tracking thread tree structure of the
application • Recording syscalls sequences for
every thread created by the application
Extraction phase • Extraction of common syscalls
subsequences belonging to threads from the same family and having the same level
• Storage of common subsequences Application 1
Application 2
Training phase
Malware detection
Filtering Phase
• Get syscall patterns from benign applications B
• For every Csi in our malware dataset M
• Counting the number of common subsequence Csi appearing in B and in M
• Calculate malicious probability of Csi • Storing Csi into Malware common subsequences if Csi haw high
malicious probability • We choose to work with Csi having malicious probability = 1 (Csi
appearing in M and not in B)
The result of traning phase= Malware behavioural dataset build of syscalls patterns
Architecture
16
(1) Which type of dataset can we use ?
(2) How can we increase GPU processing performance?
Challenge 2
17
How can we increase the
parallell processing
performance of pattern
matching on the GPU?
Input data
•Signatures•Syscalls•Bytecode …
Pattern matchi
ng
Parallel Pattern Matching Algorithm
• Match of data streams by malware scanner against a large set of known signatures, using a pattern matching algorithm.
• Pattern matching algorithms analyze the data stream and compare it against a database of signatures to detect known malware.
• Fairly complex signature patterns composed of different-size strings, range constraints, and sometimes recursive forms.
Example • Aho-corasick • Wu-manber• Knuth-Morris-Pratt
Patterns Model
Malicious application
Benign application
18
Parallel Pattern Matching Algorithm
Aho-Corasick
Parallel Pattern Matching Algorithm
• AC algorithm is based on a DFA
structure built from reference
patterns.
• The construction of automaton
is done in pre-processing
phase.
• The matching process is done
in processing phase.
• The automaton structure can
be essentially described by tow
tables: transition table and
failure state table.
19
Parallel Pattern Matching AlgorithmParallel Pattern Matching Algorithm
Direct implementation of parallel pattern matching
Parallel Pattern Matching Algorithm
• Idea Input stream segmentation For every segment we associate a
thread Problem of boundary detection
• Possible solution Every thread check the pattern
presence on the edges. Each thread must scan for a
minimum length which is s almost equal to the segment length plus the longest pattern length of an AC state machine
20
Parallel Pattern Matching AlgorithmParallel Pattern Matching Algorithm
Parallel Failureless Aho-Corasick
Lin, C. H., Liu, C. H., Chien, L. S., & Chang, S. C. (2013).Accelerating pattern matching using a novel parallel algorithm on gpus. Computers, IEEE Transactions on, 62(10), 1906-1916.
Parallel Pattern Matching Algorithm
• Gaols Increase pattern matching
computation throughput via parallelization.
resolve the throughput bottleneck caused by the overlapped computation.
• Idea Byte allocation per thread Failure transitions elimination The thread stops his work if no
valid transition is found.
21
Parallel Pattern Matching AlgorithmParallel Pattern Matching Algorithm
• No boundary detection problems
• Shorter worst-case and average life times of threads than the straightforward implementation
• Most of threads have a high probability to end their work early
• Access and required memory are reduced
Parallel Failureless Aho-Corasick
Parallel Pattern Matching Algorithm
22
Parallel Pattern Matching AlgorithmParallel Pattern Matching Algorithm
• Reducing the global memory transactions of the system
• Making benefit from memory architecture of GPU by
using constant memory and local memory
• Minimize transfers: Intermediate data can be allocated, operated on, and deallocated without ever copying them to host memory
• Group transfers: One large transfer much better than many small ones
Parallel Failureless Aho-Corasick
Parallel Pattern Matching Algorithm
Increase of the algorithm performance on GPU
23
Parallel Pattern Matching AlgorithmParallel Pattern Matching Algorithm
Challenge 3
24
• Malwares grows continuously
• The number of signatures is increasing proportionally
Scaling problems for mobile anti-malwares due to:
• Limitation of Mobile GPU memory VS Important memory requirement for DFA structureThe need of applying
memory compacting techniques
Eliminating failure transition
Eliminating Final states table by performing state reordering
Applying P3FSM technique
Memory optimization technique
25
1
2
3
P3FSM: Portable Predictive Pattern Matching Finite State Machine
Memory optimization technique
26
Parallel Pattern Matching Algorithm
27
Hardware
Mobile Phone• HTC one
GPU• Adreno 320
CPU• Qualcomm Snapdragon 600, quad-core CPU @
1.7GHz
Benchmark• 600 Malicious syscalls patterns
Parallel Pattern Matching AlgorithmExperimentations
Thread per block resizing
28
Experimentations
Best throughput with 64x64 threads = 72Mb/s
16
32
64
12
4
01
02
03
04
05
06
07
08
0
Local work-group resizing
Work-group size
Th
rou
gh
pu
t (M
b/s
)
Effective use of the different GPU memory types
29
Experimentations
On
ly g
lob
al m
em
ory
con
stant m
em
ory
+ sh
are
d m
em
ory
54
56
58
60
62
64
66
68
70
72
74
Different mobile GPU memory use
Th
rou
gh
pu
t Around 15% gain of
performance with the use of constant and shared memories
Applying shared memory to improve the latency of global memory accesses
Acceleration
30
Experimentations
An acceleration of around 2.3x is obtained with the parallel processing on the mobile GPU over serial processing
The framework throughput is dominated by data transfers between the host /device which consist of 60% of the total processing time
Memory requirement
31
Experimentations
Number of patterns
PFAC (KB) P3FSM (KB)
100 446 37
200 703 83
300 1086 197
600 2843 462
Storing DFA structure on the GPU is memory consuming especially that mobile GPU memory is small.
Difference in memory requirement between PFAC DFA and P3FSM. P3FSM that compacts the DFA structure by many times comparing to
standard PFAC DFA.
32
Implementation of a parallel host-based anti-malware on mobile GPU using behavioral detection techniques
Series of optimizations to deal with the low memory problem of mobile devices and the ever-increasing computing and memory requirements of malware detection
Perspectives: Integrating a GPU monitor which tracks down the GPU
memory usage and allows the automaton adjustment in real-time to fit the reduced GPU memory
Use of mobile GPU clusters Working on malware dataset to achieve better detection
accuracy
Conclusion
33
Thank you for your attention