
Workload Offloading for Native Codes from ARM to x86

Hyunjoon Park, Gwangmu Lee, Taehwa Kim, Hanjun Kim
CoreLab, POSTECH

ARM is widely used in various smart devices

Source: http://www.rudebaguette.com/assets/smart_devices.jpg

ARM is much slower than x86

[Figure: Execution time normalized to x86 execution time (y-axis, 0 to 35), ARM vs. x86, for 2mm, 3mm, cholesky, correlation, covariance, doitgen, dynprog, fdtd-2d, gemm, reg_detect, symm, syr2k, syrk, and GEOMEAN.]

Client: ARMv7 Processor 1.70GHz 4-core (Lubuntu 14.04)
Server: Intel(R) Xeon(R) E5-2407 2.20GHz 8-core (Ubuntu 14.04)

Offloading has been proposed

• Existing offloading techniques rely on virtual machines

[Figure: VM-based offloading. ARM client stack: OS, Application, Migration, Profiler, Runtime, App. VM, Manager. x86 server stack: VMM, Virtual HW, OS, Application, Migration, Profiler, App. VM, Manager.]

Source: Byung-Gon Chun et al. CloneCloud: elastic execution between mobile device and cloud. EuroSys '11

VMs are SLOW!!!

[Figure: Execution time of an image edge detection program. Runtime normalized to C (y-axis, 0 to 9) for C, C++ using STL containers, Java, JIT JavaScript, and Interpreted JavaScript. One bar is annotated 50X.]

Source: Mojtaba Mehrara et al. Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism. HPCA '11

Huge Performance Overhead of Virtual Machine


Offloading for Native Code is necessary

[Figure: ARM side with OS A and Application Binary; x86 side with OS B and Application Binary.]

• Different ISAs
• Different Memory Layouts
• Different ABIs (Application Binary Interface)
  • Sizes, layout, and alignment of data types
  • Calling convention
  • System Libraries

Overall System

[Figure: Overall compiler pipeline.
Inputs: Source Code, Target Info.
Target Selector: Hot Function Detector, Function Filter.
Native Offloader: Partitioner, Unified Virtual Address Mngr., ABI Convertor, Communication Optimizer.
Outputs: ARM Binary (Whole Prgm), x86 Binary (Offloaded Fcn).]


164.gzip in SPEC 2000

main() {
  init();
  compress();
  uncompress();
  verification();
}


init() {
  ..
  file_read_to_memory();
  ..
}

• Constraint cases
  • File I/O
  • System call
  • Machine specific code


Function        Coverage
compress        37%
uncompress      42%
verification    1.5%
Total           100%
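The coverage table suggests how a hot function detector could pick offloading candidates. The following is a minimal sketch, not the authors' implementation: the 10% threshold, the `FuncProfile` type, and the `select_hot` helper are all assumptions made for illustration. Only the coverage numbers come from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One profiled function: its name and its share of total run time (percent). */
typedef struct {
    const char *name;
    double coverage_pct;
} FuncProfile;

/* Hypothetical cut-off: functions below this share are not worth offloading. */
#define HOT_THRESHOLD_PCT 10.0

/* Copies the names of functions at or above the threshold into `hot` and
   returns how many were selected. `hot` must have room for n entries. */
size_t select_hot(const FuncProfile *prof, size_t n, const char **hot) {
    assert(prof != NULL && hot != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (prof[i].coverage_pct >= HOT_THRESHOLD_PCT) {
            hot[count++] = prof[i].name;
        }
    }
    return count;
}

/* Coverage figures from the 164.gzip table. */
static const FuncProfile gzip_profile[] = {
    {"compress", 37.0},
    {"uncompress", 42.0},
    {"verification", 1.5},
};
```

With this threshold, `select_hot(gzip_profile, 3, hot)` keeps compress and uncompress and drops verification, which matches the functions the later slides offload.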


Native Offloader: Partitioner

Client: ARM

main() {
  init();
  compress();
  uncompress();
  verification();
}

Server: x86

main() {
  while (id = recv()) {
    switch (id) {
    }
    send(ret);
  }
}

Client: ARM

main() {
  init();
  send(compress_id);
  ret = recv();
  uncompress();
  verification();
}

Server: x86

main() {
  while (id = recv()) {
    switch (id) {
      case compress_id:
        ret = compress();
        break;
    }
    send(ret);
  }
}

Client: ARM

main() {
  init();
  send(compress_id);
  ret = recv();
  send(uncompress_id);
  ret = recv();
  verification();
}

Server: x86

main() {
  while (id = recv()) {
    switch (id) {
      case compress_id:
        ret = compress();
        break;
      case uncompress_id:
        ret = uncompress();
        break;
    }
    send(ret);
  }
}
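The partitioned client and server can be sketched in plain C as follows. This is an illustration under stated assumptions, not the generated code: the network is replaced by a direct function call, and the names `server_handle`, `offload`, `COMPRESS_ID`, and `UNCOMPRESS_ID`, as well as the stand-in function bodies, are hypothetical.

```c
#include <assert.h>

/* Hypothetical IDs for the offloaded functions. */
enum { COMPRESS_ID = 1, UNCOMPRESS_ID = 2 };

/* Stand-ins for the real hot functions; they only return a marker value. */
static int compress(void)   { return 100; }
static int uncompress(void) { return 200; }

/* Server side: handle one request. In the slides this body sits inside a
   while (id = recv()) loop and the result goes back through send(ret). */
int server_handle(int id) {
    int ret = -1;                 /* -1 signals an unknown function ID */
    switch (id) {
    case COMPRESS_ID:
        ret = compress();
        break;                    /* without break, control falls into the next case */
    case UNCOMPRESS_ID:
        ret = uncompress();
        break;
    }
    return ret;
}

/* Client side: the direct call is replaced by a request/response pair.
   Here the "network" is a plain function call. */
int offload(int id) {
    return server_handle(id);     /* stands in for send(id); ret = recv(); */
}
```

A client would call `offload(COMPRESS_ID)` where the original program called `compress()`. A real system would also have to move arguments and memory, which the later slides address.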

Native Offloader: Unified Virtual Address Mngr.

[Figure: Client's memory layout and server's memory layout, each showing text, global variables, Heap (brk), and Stack (sp). Two regions are marked "Overwritten".]

[Figure: Client's memory layout and server's memory layout, each showing text, global variables, Heap (brk), and Stack (sp), with no region marked as overwritten.]
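One common way to let pointers keep their meaning across two machines is to map a region at the same virtual address on both sides. The sketch below shows that idea with POSIX `mmap`; it is an assumption about how such a manager could work, not the mechanism described in the slides. The base address, the region size, and the `uva_reserve` name are hypothetical.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical address and size that client and server agree on. */
#define UVA_HEAP_BASE ((uintptr_t)0x40000000u)
#define UVA_HEAP_SIZE ((size_t)1 << 20)

/* Ask the kernel for anonymous memory at the agreed base. The address is
   only a hint, so the result is checked. Returns the mapping, or NULL if
   the kernel failed or placed the region somewhere else. */
void *uva_reserve(void) {
    void *p = mmap((void *)UVA_HEAP_BASE, UVA_HEAP_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return NULL;
    }
    if ((uintptr_t)p != UVA_HEAP_BASE) {
        munmap(p, UVA_HEAP_SIZE);  /* wrong place: pointers would not match */
        return NULL;
    }
    return p;
}
```

If both sides obtain the region at `UVA_HEAP_BASE`, an address stored in a migrated page refers to the same object on either machine. Stack and brk placement would need similar agreement.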

Native Offloader: ABI Convertor

struct Foo {
  char a;
  long long b;
  int c;
};

[Figure: Structure layout of struct Foo (fields a, b, c) on ARM and on x86, and the x86 layout after conversion.]

struct Foo_cvrt {
  char a;
  char dummy0[7];
  long long b;
  int c;
  int dummy1;
};
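The effect of the explicit padding can be checked with `offsetof`. The sketch below assumes the 32-bit x86 System V ABI, which aligns `long long` to 4 bytes inside a struct, and the ARM EABI, which aligns it to 8 bytes. The `Layout` type and the two helper functions are hypothetical, and the padding members are named `dummy0` and `dummy1` because C does not allow two members with the same name.

```c
#include <assert.h>
#include <stddef.h>

/* Original declaration: padding is left to each target's ABI.
   Assumed layouts: ARM EABI   a@0, b@8, c@16, size 24
                    32-bit x86 a@0, b@4, c@12, size 16 */
struct Foo {
    char a;
    long long b;
    int c;
};

/* Converted declaration: padding is spelled out, so the layout no longer
   depends on how strictly the target aligns long long. */
struct Foo_cvrt {
    char a;
    char dummy0[7];
    long long b;
    int c;
    int dummy1;
};

/* Offsets and total size, used to compare declarations. */
struct Layout { size_t off_b; size_t off_c; size_t size; };

/* Layout of the original struct on whatever machine compiles this file. */
struct Layout foo_layout(void) {
    struct Layout l = { offsetof(struct Foo, b), offsetof(struct Foo, c),
                        sizeof(struct Foo) };
    return l;
}

/* Layout of the converted struct; expected to match the ARM layout. */
struct Layout foo_cvrt_layout(void) {
    assert(sizeof(long long) == 8 && sizeof(int) == 4);  /* assumed type sizes */
    struct Layout l = { offsetof(struct Foo_cvrt, b), offsetof(struct Foo_cvrt, c),
                        sizeof(struct Foo_cvrt) };
    return l;
}
```

On x86-64 the original `struct Foo` already has the ARM layout, so the difference only shows up when `foo_layout()` is evaluated in a 32-bit x86 build.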


internal Foo fn1(Foo a);

Function offloaded() {
  …
  Foo a = *pa;
  Foo ret = fn1(a);
  …
}


internal Foo fn1(Foo a);
internal Foo_cvrt fn1_cvrt(Foo_cvrt a);

Function offloaded() {
  …
  Foo_cvrt a = *pa;
  Foo_cvrt ret = fn1_cvrt(a);
  …
}


external Foo fn2(Foo a);

Function offloaded() {
  …
  Foo a = *pa;
  Foo tret = fn2(a);
  …
}


external Foo fn2(Foo a);

Function offloaded() {
  …
  Foo_cvrt a = *pa;
  Foo ta = convert_to_x86(a);
  Foo tret = fn2(ta);
  Foo_cvrt ret = convert_to_arm(tret);
  …
}
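The conversion around an external call can be written out in C roughly as follows. This is a sketch under assumptions: the slides only name `convert_to_x86` and `convert_to_arm`, so the field-by-field bodies, the stand-in body of `fn2`, and the return type of `offloaded` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Native layout expected by external (already compiled) x86 code. */
struct Foo { char a; long long b; int c; };

/* ARM-compatible layout used inside the offloaded code. */
struct Foo_cvrt { char a; char dummy0[7]; long long b; int c; int dummy1; };

/* Copy field by field, so the result is correct whatever the padding is. */
struct Foo convert_to_x86(struct Foo_cvrt in) {
    struct Foo out;
    out.a = in.a;
    out.b = in.b;
    out.c = in.c;
    return out;
}

struct Foo_cvrt convert_to_arm(struct Foo in) {
    struct Foo_cvrt out;
    memset(&out, 0, sizeof out);   /* keep the padding bytes deterministic */
    out.a = in.a;
    out.b = in.b;
    out.c = in.c;
    return out;
}

/* Hypothetical stand-in for an external library function. */
struct Foo fn2(struct Foo a) {
    a.b += 1;
    a.c *= 2;
    return a;
}

/* Offloaded function: converts at the boundary of the external call only. */
struct Foo_cvrt offloaded(const struct Foo_cvrt *pa) {
    struct Foo_cvrt a = *pa;
    struct Foo ta = convert_to_x86(a);
    struct Foo tret = fn2(ta);
    return convert_to_arm(tret);
}
```

Internal functions avoid this cost because they are recompiled against `Foo_cvrt` directly; only calls that leave the recompiled code pay for the two copies.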

Native Offloader: Communication Optimizer

• Speculative page migration (Before offloading)

Profiling result: Used In Offloaded(): Page #1, #2, #3 …

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1
2
3
4
……

Migration
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0

• Lazy Loading (During offloading)

Server memory
Page#  Value         Dirty
1      0x5052        0
2      0xFF00        0
3      0x2A48        0
4      (Page Fault)
……

Request
Page#  Value   Dirty
4      0xF35A  0


Migration
Page#  Value   Dirty
4      0xF35A  0
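Speculative migration followed by lazy loading can be simulated in a few lines of C. The sketch below is an illustration only: one 16-bit value stands in for a whole page, a function call stands in for the network transfer, and the `ServerMem` type and helper names are hypothetical. Pages are indexed from 0 in the code, so slide page #4 is index 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PAGES 4

/* Client memory, using the page values shown in the slides. */
static const uint16_t client_mem[NUM_PAGES] = {0x5052, 0xFF00, 0x2A48, 0xF35A};

/* Server-side copy of the client's pages. */
typedef struct {
    uint16_t value[NUM_PAGES];
    bool present[NUM_PAGES];
    int faults;                    /* number of on-demand fetches so far */
} ServerMem;

/* Copy one page from the client (stands in for a network transfer). */
static void fetch_page(ServerMem *s, size_t page) {
    s->value[page] = client_mem[page];
    s->present[page] = true;
}

/* Before offloading: push the pages that profiling says will be used. */
void prefetch(ServerMem *s, const size_t *pages, size_t n) {
    for (size_t i = 0; i < n; i++) {
        fetch_page(s, pages[i]);
    }
}

/* During offloading: reading a missing page counts as a fault and
   fetches that page on demand. */
uint16_t server_read(ServerMem *s, size_t page) {
    if (!s->present[page]) {
        s->faults++;
        fetch_page(s, page);
    }
    return s->value[page];
}
```

Prefetching indices 0 to 2 and then reading index 3 produces exactly one fault, mirroring the request and migration of page #4 in the slides. A second read of the same page is served locally.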

• Write-back (After offloading)

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1      0x5052  0
2      0x00AC  1
3      0x2000  1
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1      0x5052  0
2      0x00AC  0
3      0x2000  0
4      0xF35A  0
……

Write-back
Page#  Value   Dirty
2      0x00AC  0
3      0x2000  0
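The write-back step can be sketched the same way: only pages whose dirty bit is set travel back to the client. The `PageSet` type and the `server_write` and `write_back` helpers are hypothetical, one 16-bit value again stands in for a page, and pages are indexed from 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PAGES 4

/* Page contents plus one dirty bit per page. */
typedef struct {
    uint16_t value[NUM_PAGES];
    bool dirty[NUM_PAGES];
} PageSet;

/* Server-side write: update the value and mark the page dirty. */
void server_write(PageSet *server, size_t page, uint16_t v) {
    server->value[page] = v;
    server->dirty[page] = true;
}

/* After offloading: copy only dirty pages back to the client and clear
   their dirty bits. Returns the number of pages transferred. */
size_t write_back(PageSet *server, PageSet *client) {
    size_t sent = 0;
    for (size_t p = 0; p < NUM_PAGES; p++) {
        if (server->dirty[p]) {
            client->value[p] = server->value[p];
            server->dirty[p] = false;
            sent++;
        }
    }
    return sent;
}
```

Starting from the slide values and writing 0x00AC and 0x2000 to the second and third pages, `write_back` transfers two pages and leaves the other two untouched on the client.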

Evaluation

[Figure: Speedup (y-axis, 0 to 10) for gemm, 2mm, 3mm, cholesky, correlation, covariance, doitgen, dynprog, fdtd-2d, reg_detect, symm, syr2k, syrk, and GeoMean.]

Client: ARMv7 Processor 1.70GHz 4-core (Lubuntu 14.04)
Server: Intel(R) Xeon(R) E5-2407 2.20GHz 8-core (Ubuntu 14.04)

Conclusion

• We developed a compiler framework that provides workload offloading for native code from ARM to x86.

• We solve the problems of different ISAs, memory layouts, and ABIs that occur when offloading native code.
