Binary‐level program analysis: Static Disassembly
Gang Tan
CSE 597, Spring 2019
Penn State University




Disassemblers

• Disassembler
  – Converts machine code in a binary file into assembly code or code in an equivalent IR
• Assume a decode function
  – decode(code, offset) returns the next instruction and the instruction’s size
    • code is a list of bytes, and offset is the position where the next instruction begins
  – E.g., assume code = “6A 03 83 C4 0C B8 CC CC CC CC”
    • decode(code, 0) => (“push 3”, 2), consuming “6A 03”
    • decode(code, 2) => (“add esp, 0x0C”, 3), consuming “83 C4 0C”
    • decode(code, 5) => (“mov eax, 0xCCCCCCCC”, 5), consuming “B8 CC CC CC CC”
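The decode interface above can be sketched in Python. This is only a toy under stated assumptions: it recognizes just the three encodings used in the example (push imm8, add esp imm8, mov eax imm32), and raising ValueError on failure is a choice made for illustration. A real x86 decoder must handle prefixes, ModRM/SIB bytes, and many more opcodes.

```python
def decode(code: bytes, offset: int) -> tuple[str, int]:
    """Toy decoder: return (instruction text, size in bytes) or raise ValueError."""
    if not 0 <= offset < len(code):
        raise ValueError(f"offset {offset} is outside the code buffer")
    opcode = code[offset]
    if opcode == 0x6A:                              # push imm8
        size = 2
    elif code[offset:offset + 2] == b"\x83\xC4":    # add esp, imm8
        size = 3
    elif opcode == 0xB8:                            # mov eax, imm32
        size = 5
    else:
        raise ValueError(f"cannot decode byte 0x{opcode:02X} at offset {offset}")
    insn = code[offset:offset + size]
    if len(insn) < size:
        raise ValueError("instruction runs past the end of the code buffer")
    if opcode == 0x6A:
        # Simplification: the immediate is printed unsigned (no sign extension).
        return f"push {insn[1]}", size
    if opcode == 0x83:
        return f"add esp, 0x{insn[2]:02X}", size
    imm32 = int.from_bytes(insn[1:], "little")      # x86 immediates are little-endian
    return f"mov eax, 0x{imm32:08X}", size


# The byte string from the example above.
code = bytes.fromhex("6A 03 83 C4 0C B8 CC CC CC CC")
offset = 0
while offset < len(code):
    text, size = decode(code, offset)
    print(f"{offset}: {text} ({size} bytes)")
    offset += size
```

Each call returns the size of the decoded instruction, so the caller decides where to decode next. The disassembly algorithms later in the slides differ mainly in how they make that choice.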


Dynamic vs Static Disassemblers

• Dynamic disassemblers
  – The binary code is executed, and the execution traces are recorded and decoded
  – Advantages: accurate; can disassemble obfuscated binary code
  – Disadvantages: recording traces takes time; each run covers only one execution path
• Static disassemblers
  – A binary file is disassembled without executing it
  – Advantages: fast; covers more than one execution path
  – Disadvantages: many challenges; vulnerable to code obfuscation


Static Disassemblers

• Input: a binary file
• Goal
  – Disassemble the executable sections in the binary file
  – May use other information in the binary file
    • E.g., symbol tables if they are available
• Output: a Control‐Flow Graph (CFG)


CFG

• Nodes are basic blocks of assembly instructions
  – A basic block is a piece of straight‐line code: no jumps into or out of the middle of the block
• Directed edges connect basic blocks
  – An edge from b1 to b2 means that b2 may start executing right after b1 finishes
• A basic block may have multiple outgoing edges
  – E.g., when it ends with a conditional jump instruction
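One way to turn a list of decoded instructions into basic blocks and edges is the classic leader-based partition, sketched below. The `Instr` record, its `kind` labels, and `build_cfg` are hypothetical names introduced for illustration. The sketch assumes the instruction list is non-empty and that all branch targets are direct.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    offset: int
    size: int
    text: str
    kind: str = "plain"            # "plain", "jmp", "jcc" (conditional), or "ret"
    target: Optional[int] = None   # direct branch target for "jmp"/"jcc"


def build_cfg(instrs):
    """Return (blocks, edges); blocks maps a leader offset to its instructions."""
    instrs = sorted(instrs, key=lambda i: i.offset)
    offsets = {i.offset for i in instrs}
    # Leaders: the first instruction, every branch target, and every
    # instruction that follows a control-flow instruction.
    leaders = {instrs[0].offset}
    for i in instrs:
        if i.kind in ("jmp", "jcc", "ret"):
            if i.target in offsets:
                leaders.add(i.target)
            if i.offset + i.size in offsets:
                leaders.add(i.offset + i.size)
    blocks, current = {}, None
    for i in instrs:
        if i.offset in leaders:
            current = i.offset
            blocks[current] = []
        blocks[current].append(i)
    edges = set()
    for leader, body in blocks.items():
        last = body[-1]
        fall_through = last.offset + last.size
        if last.kind in ("jmp", "jcc") and last.target in blocks:
            edges.add((leader, last.target))
        if last.kind in ("plain", "jcc") and fall_through in blocks:
            edges.add((leader, fall_through))
    return blocks, edges


# A conditional jump gives the first block two outgoing edges.
program = [
    Instr(0, 2, "cmp eax, 0"),
    Instr(2, 2, "je 6", kind="jcc", target=6),
    Instr(4, 2, "mov eax, 1"),
    Instr(6, 1, "ret", kind="ret"),
]
blocks, edges = build_cfg(program)
print(sorted(blocks), sorted(edges))
```

The block that ends in the conditional jump gets one edge to the jump target and one to the fall-through block, which matches the last bullet above.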


CFG Example


Static Disassembly Challenges

• Variable‐sized instruction sets
  – Instruction boundaries are not known for stripped binaries
• Embedded data in code
  – E.g., a compiler may embed jump tables into the code section
  – Note: compilers do less of this nowadays, but an obfuscator might still do it
• Targets of indirect jumps/calls require static analysis
  – E.g., “jmp 16[ebp]”, “call eax”


Static Disassemblers

• Some major algorithms
  – Linear sweep
  – Recursive traversal
  – Mixed approaches


Linear Sweep

• Linear sweep
  – Start at the entry point of a code section
  – Decode instructions one by one until the end of the section or an illegal instruction is reached
• The Unix utility program objdump adopts linear sweep


Linear Sweep Pseudo Code

• Input
  – code holds the bytes of the input code section
  – codeSize is the code section size

    currOffset = 0;
    instrSet = {};
    while (currOffset < codeSize) {
      (instr, size) = decode(code, currOffset);
      instrSet = instrSet ∪ {(currOffset, instr, size)};
      currOffset += size;
    }
    buildCFG(instrSet)

• Assume decode throws an exception when it fails (cannot decode, reaches the end of the code buffer, etc.)
• buildCFG builds basic blocks and adds CFG edges
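The pseudocode can be made runnable with a toy decoder. The sketch below assumes a small opcode table (`TOY_OPCODES`, a hypothetical name) in which each instruction's length depends only on its first byte. That does not hold for real x86, where 0x83 is followed by a ModRM byte that can change the length. Building the CFG is left out.

```python
# Toy opcode table: first byte -> (mnemonic, total length in bytes).
TOY_OPCODES = {
    0x6A: ("push imm8", 2),
    0x83: ("add r32, imm8", 3),    # assumes a register-direct ModRM byte
    0xB8: ("mov eax, imm32", 5),
    0xC3: ("ret", 1),
}


def decode(code: bytes, offset: int) -> tuple[str, int]:
    """Return (mnemonic, size); raise ValueError when decoding fails."""
    if offset >= len(code) or code[offset] not in TOY_OPCODES:
        raise ValueError(f"cannot decode at offset {offset}")
    name, size = TOY_OPCODES[code[offset]]
    if offset + size > len(code):
        raise ValueError(f"instruction at offset {offset} is truncated")
    return name, size


def linear_sweep(code: bytes) -> list[tuple[int, str, int]]:
    """Decode back to back from offset 0 until the end or the first failure."""
    instr_set = []
    curr_offset = 0
    try:
        while curr_offset < len(code):
            name, size = decode(code, curr_offset)
            instr_set.append((curr_offset, name, size))
            curr_offset += size
    except ValueError:
        pass  # an illegal or truncated instruction ends the sweep
    return instr_set


code = bytes.fromhex("6A 03 83 C4 0C B8 CC CC CC CC")
for offset, name, size in linear_sweep(code):
    print(offset, name, size)
```

The sweep never looks at what an instruction does, only at its length, which is why it is simple and also why it cannot tell code from embedded data.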


Linear Sweep

• Advantage: simple and easy to implement
• Disadvantage
  – Mistreats data as code if the two are mixed

    Code section:  instr  instr  data   instr  instr
    Linear sweep:  instr  instr  wrong  wrong  wrong

  – The instruction just before the data (1) could be a jump over the data to the next instr, or (2) could be a ret
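A small constructed example makes the misalignment concrete. The byte string below is an assumption made for illustration: a `jmp` skips one data byte (0xB8), which linear sweep then decodes as the opcode of a 5-byte `mov eax, imm32`, swallowing the two real instructions that follow. Instruction lengths come from a toy first-byte table (`LENGTHS`, a hypothetical name).

```python
# Toy length table keyed by first byte (illustrative; real x86 needs full decoding).
LENGTHS = {0x6A: 2, 0x83: 3, 0xB8: 5, 0xC3: 1, 0xEB: 2}


def sweep_offsets(code: bytes) -> list[int]:
    """Instruction start offsets as found by a linear sweep."""
    offsets, pos = [], 0
    while pos < len(code):
        size = LENGTHS.get(code[pos])
        if size is None or pos + size > len(code):
            break
        offsets.append(pos)
        pos += size
    return offsets


# Layout:  0: push 3          (6A 03)
#          2: jmp +1          (EB 01)    jumps over the data byte to offset 5
#          4: data byte       (B8)
#          5: add esp, 0x0C   (83 C4 0C)
#          8: ret             (C3)
code = bytes.fromhex("6A 03 EB 01 B8 83 C4 0C C3")
true_offsets = [0, 2, 5, 8]
swept = sweep_offsets(code)
print("true :", true_offsets)
print("swept:", swept)   # offset 4 is treated as code; offsets 5 and 8 are lost
```

Here the sweep reports instruction starts at offsets 0, 2, and 4. Everything after the data byte is decoded wrongly, as in the diagram above.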


Recursive Traversal

• Idea
  – Disassemble instructions by following the control flow graph as it is constructed during disassembly


Recursive Traversal Pseudo Code

    worklist = {0}; processed = {};
    while (worklist ≠ {}) {
      offset = removeOneNode(worklist);
      processed = processed ∪ {offset};
      (instr, size) = decode(code, offset);
      switch (instr)
        case non‐control‐flow‐instr: add(offset+size);
        case unconditional‐jmp(dest): add(dest);
        case cond‐jmp(dest1,dest2): add(dest1); add(dest2);
        …
    }

    Procedure add(offset):
      if (offset ∉ processed) then worklist = worklist ∪ {offset}
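The worklist algorithm can be sketched in Python with a toy decoder. The opcode table (`TOY`), the `kind` labels, and the choice to skip offsets that fail to decode are assumptions made for illustration. Only direct short jumps (rel8) are handled, and `ret` is treated as having no statically known successor.

```python
# Toy table: first byte -> (mnemonic, length, kind).
TOY = {
    0x6A: ("push imm8", 2, "plain"),
    0x83: ("add r32, imm8", 3, "plain"),   # assumes a register-direct ModRM byte
    0xB8: ("mov eax, imm32", 5, "plain"),
    0xEB: ("jmp rel8", 2, "jmp"),
    0x74: ("je rel8", 2, "jcc"),
    0xC3: ("ret", 1, "ret"),
}


def decode(code: bytes, offset: int):
    """Return (mnemonic, size, kind, target); raise ValueError on failure."""
    if not 0 <= offset < len(code) or code[offset] not in TOY:
        raise ValueError(f"cannot decode at offset {offset}")
    name, size, kind = TOY[code[offset]]
    if offset + size > len(code):
        raise ValueError(f"instruction at offset {offset} is truncated")
    target = None
    if kind in ("jmp", "jcc"):
        # rel8 is a signed displacement relative to the next instruction.
        rel = int.from_bytes(code[offset + 1:offset + 2], "little", signed=True)
        target = offset + size + rel
    return name, size, kind, target


def recursive_traversal(code: bytes, entry: int = 0) -> dict[int, str]:
    worklist, processed, instrs = {entry}, set(), {}
    while worklist:
        offset = worklist.pop()
        processed.add(offset)
        try:
            name, size, kind, target = decode(code, offset)
        except ValueError:
            continue                        # not decodable: drop this path
        instrs[offset] = name
        successors = []
        if kind in ("plain", "jcc"):
            successors.append(offset + size)    # fall-through
        if kind in ("jmp", "jcc"):
            successors.append(target)           # direct branch target
        for succ in successors:
            if succ not in processed:
                worklist.add(succ)
    return dict(sorted(instrs.items()))


# push 3; jmp +1; <data byte B8>; add esp, 0x0C; ret
code = bytes.fromhex("6A 03 EB 01 B8 83 C4 0C C3")
print(recursive_traversal(code))
```

On this input the traversal follows the `jmp` to offset 5 and never decodes the data byte at offset 4. A back-to-back sweep over the same bytes would decode that byte as an instruction.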


Recursive Traversal

• Advantage
  – Recursive traversal can accommodate data embedded in the code section
• Disadvantage
  – Hard to determine the control‐flow edges out of indirect jumps and calls
• IDA Pro uses recursive traversal
  – A commercial disassembler
  – An incomplete control flow graph (CFG) is emitted
  – The CFG is incomplete because there are no edges for indirect branch or call instructions


Disassembling Obfuscated Code

• “Static Disassembly of Obfuscated Binaries” by Kruegel et al., USENIX Security 2004
  – Linear sweep and recursive traversal are combined
  – Heuristics are used to remove spurious nodes from initial CFGs
• Obfuscated binaries
  – No symbol info
  – Produced by obfuscations such as inserting unreachable junk data into the code section
    • E.g., “ins1; ins2” => “ins1; mov eax, someConst; jmp eax; junk bytes; ins2”


Disassembling Obfuscated Code

• The algorithm:
  – Identify functions
    • Match binary code with common prologs
    • Common prolog: “push %ebp; mov %esp, %ebp”, i.e., bytes 0x55 0x89 0xe5
  – Construct intra‐procedural CFGs
    • Decode from every address; throw away illegal instructions
      – To accommodate variable‐sized instruction sets
      – May result in overlapping instructions
    • Identify all direct jump instructions in a function
    • Direct jump instructions whose targets are inside the function and direct conditional branch instructions are selected as jump candidates
    • An initial CFG is constructed by treating the entry instruction and jump candidates as the starting points using recursive traversal
  – Resolve block conflicts in the initial CFGs
    • Five steps are taken to remove spurious nodes
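The function identification step can be approximated by a plain byte-pattern search. The sketch below assumes the code section is available as a Python bytes object and looks only for the single prolog named above. `find_prologs` is a hypothetical helper, not the paper's implementation. Matches are only candidates, since the same three bytes may also occur inside data or in the middle of another instruction.

```python
PROLOG = bytes.fromhex("5589e5")   # push %ebp; mov %esp, %ebp


def find_prologs(code: bytes) -> list[int]:
    """Return every offset at which the prolog byte pattern occurs."""
    starts = []
    pos = code.find(PROLOG)
    while pos != -1:
        starts.append(pos)
        pos = code.find(PROLOG, pos + 1)
    return starts


# Two tiny "functions": prolog; leave; ret; nop; prolog; ret
code = bytes.fromhex("5589e5 c9 c3 90 5589e5 c3")
print(find_prologs(code))
```

Each offset returned would then serve as the entry instruction for building that function's intra‐procedural CFG.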


Example Program


Block Conflict Resolution

Figure: initial control flow graph
• Blue nodes represent the nodes in the real CFG
• Red nodes represent spurious nodes
• Node A is the entry node
• Pink dashed lines indicate a conflict between two nodes
• Solid arrows represent the edges in the initial CFG


Block Conflict Resolution

• The first step removes nodes that conflict with valid nodes
  – The entry node must be valid
  – Nodes reachable from a valid node must be valid
  – Nodes in conflict with valid nodes must be invalid
• The second step removes common ancestors of conflicting nodes
  – Assumption: valid nodes do not overlap
  – If two nodes in conflict share an ancestor, the ancestor must be invalid, since a valid ancestor would make both conflicting nodes valid
• The third step removes conflicting nodes with fewer predecessors
  – Assumes that valid nodes are more tightly integrated into a CFG
  – More predecessors imply tighter integration
  – Clearly a heuristic
• The fourth step removes conflicting nodes with fewer direct successors
  – Assumes that valid nodes are more tightly integrated into a CFG
  – More direct successors imply tighter integration
  – Also a heuristic
• The last step removes nodes in conflict randomly
  – Pick one of the two conflicting nodes at random
  – Being desperate here
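The third step can be sketched as a comparison of predecessor counts for each conflicting pair. This is a simplified reading of the bullet above, not the paper's exact procedure. The graph encoding (a dict from node to its set of predecessors), the node names, and the function name are all hypothetical, and predecessor counts are not recomputed after a removal.

```python
def remove_fewer_predecessors(preds, conflicts):
    """Return the nodes removed by the 'fewer predecessors' heuristic.

    preds:     dict mapping each node to the set of its predecessors
    conflicts: iterable of (a, b) pairs of overlapping nodes
    """
    removed = set()
    for a, b in conflicts:
        if a in removed or b in removed:
            continue                      # this conflict is already resolved
        if len(preds[a]) < len(preds[b]):
            removed.add(a)
        elif len(preds[b]) < len(preds[a]):
            removed.add(b)
        # On a tie nothing is removed; later steps must decide.
    return removed


# Hypothetical graph: n1 has one predecessor, n2 has two; n3 and n4 tie.
preds = {
    "n1": {"entry"},
    "n2": {"entry", "n5"},
    "n3": {"n5"},
    "n4": {"entry"},
}
conflicts = [("n1", "n2"), ("n3", "n4")]
print(remove_fewer_predecessors(preds, conflicts))
```

The fourth step would have the same shape with successor sets in place of predecessor sets. Pairs that still tie after both steps are the ones left for the random choice in the last step.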


Block Conflict Resolution

Figure: control flow graph after the first step (node B is removed). The legend is the same as for the initial control flow graph.


Block Conflict Resolution

Figure: control flow graph after the second step (node J is removed). The legend is the same as for the initial control flow graph.


Block Conflict Resolution

Figure: control flow graph after the third step (node K is removed). The legend is the same as for the initial control flow graph.


Block Conflict Resolution

Figure: control flow graph after the fourth step (node C is removed). The legend is the same as for the initial control flow graph.


Disassembler Accuracy

Program      Objdump   Linn/Debray   IDA Pro   This paper
compress95   56.07     69.96         24.19     91.04
gcc          65.54     82.18         45.09     88.45
go           66.08     78.12         43.01     91.81
ijpeg        60.82     74.23         31.46     91.60
li           56.65     72.78         29.07     89.86
m88ksim      58.42     75.66         29.56     90.39
perl         57.66     72.01         31.36     86.93
vortex       66.02     76.97         42.65     90.71
Mean         60.91     75.24         34.55     90.10

Percentage of instructions correctly disassembled by each tool on SPECint 95. All programs went through an obfuscation tool.


Student Presentations related to Static Disassembly

• Shingled Graph Disassembly: Finding the Undecideable Path; presenter: Yi Zheng

• Superset Disassembly: Statically Rewriting x86 Binaries Without Heuristics; presenter: Eric Pauley

• Static Binary Rewriting without Supplemental Information; presenter: Tingwei Hua

• Recognizing Functions in Binaries with Neural Networks; presenter: Ryan Sheatsley


Next: Static Analysis Basics

• On a high‐level language
  – Techniques applicable to assembly code
  – We will read papers that apply static analysis to assembly code
• Dataflow analysis
  – First discuss the theory
  – Then go through an implementation in Datalog
• Inter‐procedural analysis; flow sensitivity, path sensitivity, context sensitivity
