1
Lightweight Monitoring of the Progress of Remotely Executing Computations
Shuo Yang, Ali R. Butt, Y. Charlie Hu, Samuel P. Midkiff
Purdue University
2
Harvesting Unused Resources
• Typical workloads are bursty
  – Periods of little or no processing
  – Periods of insufficient CPU resources
  – Idle cycles cannot be saved for future use
• Exploit value from the wasted idle resources
  – Achieve more available processing capability for "free" or at low cost
  – "Smooth out" the workload
3
The Need for Remote Monitoring
• Centralized cycle sharing
  – SETI@Home, Genome@Home, IBM (with United Devices), etc.
  – Condor, Microsoft (with GridIron), etc.
• P2P-based cycle sharing (Butt et al. [VM'04])
  – Individual nodes can utilize the system (more incentive)
  – Nodes can span administrative domains (more available resources)
• Remote execution motivates remote monitoring
  – Unreliable resources
  – Untrusted resources
4
Review of GridCop [Yang et al., PPoPP'05]
[Diagram: the submitted job (H-code), with its reporting module, runs in a sandboxed JVM on the host machine; the processing module (S-code) runs in a JVM on the submitter; progress reports and partial computation flow from the host to the submitter.]
5
Our New Contribution: Key Difference From GridCop
• Uses probabilistic code instrumentation
  – Prevents replay attacks (like GridCop)
  – No recomputation needed, reducing network traffic and submitter-machine overhead
• Ties the progress information closely to program structure
  – Makes spoofing more difficult
  – PC values reflect the internal structure of the program binary
6
Outline
• Overview
• Design of Lightweight Monitoring Mechanism
• Experimental Results
• Related Research and Conclusions
7
System Overview: Code Generation
[Diagram: the code generation system transforms the original code into host-code and submitter-code.]
• Host-code, executed on the host: emits progress information ("beacons") during the computation
• Submitter-code, executed on the submitter: processes the "beacons"
8
System Overview
[Diagram: the submitted job (H-code), with its reporting module, runs on the host machine; the reporting module sends beacons to the processing module (S-code) on the submitter.]
9
Basic Idea of the FSA Tracking
• Beacons are placed at significant execution points along the CFG
  – Beacons can be viewed as states in an FSA
  – They can be placed at any site satisfying the compiler's instrumentation criteria, e.g., MPI call sites in this paper
• The host emits beacon messages at significant execution points
  – An FSA emitting transition symbols
• The submitter processes the beacon messages (see the sketch below)
  – A mirror FSA recognizing legal transitions
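As a rough illustration of the submitter side, the mirror FSA can be encoded as a small transition function over beacon values. The following is a minimal hypothetical sketch, not the paper's implementation; it hard-codes the three-state S1/S2/S3 example from the next slide.

  /* Hypothetical sketch of a submitter-side mirror FSA.
   * Legal runs: S1 -> S2 -> S3, or S1 -> S3 when the branch
   * containing S2 is not taken. */
  #include <stdio.h>

  enum state { START, S1, S2, S3, REJECT };

  /* One step of the mirror FSA: given the current state and an
   * incoming beacon, return the next state, or REJECT if the
   * transition is illegal. */
  enum state step(enum state cur, enum state beacon)
  {
      switch (cur) {
      case START: return beacon == S1 ? S1 : REJECT;
      case S1:    return (beacon == S2 || beacon == S3) ? beacon : REJECT;
      case S2:    return beacon == S3 ? S3 : REJECT;
      default:    return REJECT;
      }
  }

  int main(void)
  {
      enum state stream[] = { S1, S3 };   /* a legal beacon stream */
      enum state cur = START;
      for (int i = 0; i < 2; i++) {
          cur = step(cur, stream[i]);
          if (cur == REJECT) {
              puts("illegal transition: possible cheating host");
              return 1;
          }
          printf("progress: beacon %d of 2 accepted\n", i + 1);
      }
      return 0;
  }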
10
An FSA Example
main() {
  ...
  mpi_irecv(...);   // S1
  ...
  if (predicate) {
    mpi_send(...);  // S2
  }
  ...
  mpi_wait();       // S3
  ...
}

[FSA diagram: states S1, S2, S3; legal transitions are S1→S2→S3, or S1→S3 when the branch containing S2 is not taken.]
11
Binary file Location Beacon (BLB)
BLB values are the virtual addresses of instructions in the virtual memory of a process – the states of the FSA
[Diagram: process address-space layout (stack, heap, bss, initialized data, code segment). Within the code segment:
  804a641: call mpi_irecv
  804a679: call mpi_send
  804a69b: call mpi_wait]
12
PC values – labels driving the transitions in the FSA

main() {
  ...
  pc = getPC();
  mpi_irecv(...);       // 0x804a641
  deposit_beacon(pc);
  if (predicate) {
    pc = getPC();
    mpi_send(...);      // 0x804a679
    deposit_beacon(pc);
  }
  pc = getPC();
  mpi_wait();           // 0x804a69b
  deposit_beacon(pc);
  ...
}

[FSA diagram: states @804a641, @804a679, @804a69b; the transitions are labeled with the emitted PC values.]

• The compiler inserts a getPC() in front of each BLB
• getPC() returns the address of the next instruction (one possible realization is sketched below)
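One plausible way to realize getPC() with GCC on x86 is via __builtin_return_address: the return address of the call to getPC() is exactly the address of the next instruction in the caller, matching the semantics stated above. Both bodies below are illustrative sketches under that assumption, not the paper's implementation.

  #include <stdio.h>

  /* Sketch: must not be inlined, or there is no call and hence
   * no return address to read. */
  __attribute__((noinline))
  void *getPC(void)
  {
      /* Address of the instruction following "call getPC" in the
       * caller, i.e. the BLB value for the upcoming MPI call site. */
      return __builtin_return_address(0);
  }

  /* Sketch: the real reporting module would forward this value to
   * the submitter over the network; here it is merely printed. */
  void deposit_beacon(void *pc)
  {
      printf("beacon @%p\n", pc);
  }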
13
Tracking the Progress of an MPI Program

[Same instrumented code and FSA as on slide 12: as the host executes, deposit_beacon() emits the PC values 804a641, 804a679, and 804a69b, and the submitter's mirror FSA checks that the incoming stream follows legal transitions.]
14
Attacks on the FSA Mechanism
• Susceptible to a replay attack
  – Record the stream of beacons from a previous run
  – Replay the stream in a future run (cheating to gain undeserved compensation)
• Reverse engineering of the binary executable
  – Understand the control flow graph
  – Expensive: NP-hard in the worst case ([Wang, PhD thesis, University of Virginia])
15
Probabilistic BLB
• Each MPI function call site is a BLB candidate, but not necessarily a BLB site
  – A candidate is used as a BLB site with probability PB in (0, 1), as sketched below
• Effect: an individual MPI function call site may be a BLB in the FSA produced by one code generation, but not by the next
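A minimal sketch of the per-site decision the code generator might make, assuming an independent coin flip at each candidate site; the name choose_as_blb and the value of PB are illustrative, not from the paper.

  #include <stdlib.h>

  #define PB 0.5   /* BLB instrumentation probability, assumed value in (0,1) */

  /* Decided once per code generation for each candidate MPI call
   * site: does this site become a BLB in this submission? */
  int choose_as_blb(void)
  {
      return (double)rand() / (double)RAND_MAX < PB;
  }

Because the coin is flipped afresh at every code generation, two submissions of the same job generally instrument different subsets of call sites, which is exactly what yields a different FSA per run (next slides).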
16
Probabilistic BLBs Guard against Attacks
• The same job can have a different FSA each time it is submitted to the host
  – This leads to a different legal beacon-value stream
  – Defeats the replay attack by making it detectable: a stream recorded from a previous run will, with high probability, contain beacon values that the new FSA rejects as illegal transitions
• Reverse engineering by binary analysis must be repeated by a cheating host on every run
  – Break once, spoof only once
  – Too expensive!
17
One FSA with Probabilistic BLBs

[Same instrumentation as slide 12: in this code generation all three MPI call sites were chosen as BLBs, yielding the three-state FSA @804a641, @804a679, @804a69b.]
18
Another FSA with Probabilistic BLBs

main() {
  ...
  pc = getPC();
  mpi_irecv(...);   // 0x804a641
  deposit_beacon(pc);
  if (predicate) {
    mpi_send(...);  // 0x804a679: not chosen as a BLB this time
  }
  pc = getPC();
  mpi_wait();       // 0x804a69b
  deposit_beacon(pc);
  ...
}

[FSA diagram: only two states, @804a641 and @804a69b; the mpi_send site emits no beacon.]
19
Outline
• Overview
• Design of Lightweight Monitoring Mechanism
• Experimental Results
• Related Research and Conclusions
20
Experimental Setup
• Submitter machine @ UIUC (thanks to Josep Torrellas)
  – Intel 3 GHz Xeon, 512 KB cache, 1 GB main memory
  – Running a Linux 2.4.20 kernel
• Host machine @ Purdue
  – A cluster of 8 Pentium IV machines (each node: 512 KB cache, 512 MB main memory), interconnected by Fast Ethernet
  – Running FreeBSD 4.7 and MPICH 1.2.5
• Network access
  – Both sites connected to their campus networks via Ethernet
  – UIUC–Purdue: a typical scenario of cycle sharing across a WAN
21
Benchmarks & Evaluation Metrics
• NAS Parallel Benchmarks (NPB) 3.2: a benchmark suite for evaluating the performance of parallel computational resources
1. Run-time computation overhead
2. Network traffic overhead: network resources are not "free"
3. Beacon distribution over time: the capability to track progress incrementally
22
Host Side Computation Overhead at Different Numbers of Nodes
• Overhead = (Tmonitoring – Toriginal) / Toriginal × 100%
• Lower bars are better
• Overhead does not increase monotonically with the number of processes

[Bar chart: host-side overhead, from 0.00% to 2.50%, for EP, IS, MG, and CG on 2, 4, and 8 nodes.]
23
Host Side Computation Overhead under Different Input Sizes
• Overhead = (Tmonitoring – Toriginal) / Toriginal × 100%
• Lower bars are better
• Lower overhead for the larger problem size

[Bar chart: host-side overhead, from 0.00% to 2.00%, for EP, IS, MG, and CG with input sizes B and C on 8 nodes.]
24
Submitter Side Computation Cost
             Size B (8 procs)   Size C (8 procs)
EP           0.06%              0.02%
IS           0.07%              0.02%
MG           0.15%              0.03%
CG           0.17%              0.07%

• Overhead = time(submitter code) / execution time
• An imperfect metric: the number depends on the submitter's hardware, the submitter's workload, the host's speed, etc.
25
Network Traffic Incurred by Monitoring
             Size B (8 procs)   Size C (8 procs)
EP           4.2 bytes/s        1.0 bytes/s
IS           101.5 bytes/s      23.2 bytes/s
MG           21.2 KB/s          1.5 KB/s
CG           21.9 KB/s          7.9 KB/s

• Bytes sent over the network between the host and submitter machines, divided by the total execution time
• Low bandwidth usage
26
Beacon Distribution over Time
A uniform distribution of beacons over time enables incremental progress tracking
27
Outline
• Overview
• Design of Lightweight Monitoring Mechanism
• Experimental Results
• Related Research and Conclusions
28
Related Research
• L. F. Sarmenta [CCGrid'01], W. Du et al. [ICDCS'04]
  – A host performs the same computation on different inputs
  – Needs a central manager
• Yang et al. [PPoPP'05]
  – Partially duplicates the computation
  – Incurs more network traffic because of the recomputation
• Hofmeyr et al. [J. of Computer Security '98], Chen and Wagner [CCS'02]
  – Use system-call sequences to detect intrusions
  – Approaches for achieving host security
29
Conclusions
• Lightweight monitoring over a WAN/Internet is possible
• No changes to the host-side system are required
• Instrumentation can be performed automatically
30
Host Side Overhead Details (Slide 22)
• Overhead = (Tmonitoring – Toriginal) / Toriginal
• Overhead does not increase monotonically with an increase in the number of processes (Nprocess); see the worked example below
• When Nprocess increases:
  – The denominator, Toriginal, decreases
  – The numerator, Tmonitoring – Toriginal, also decreases (the number of MPI calls decreases, decreasing the overhead of BLB message generation)
• Synchronization: always one extra thread per process, no matter how many processes are running
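To make the trend concrete with purely hypothetical numbers: if Toriginal = 100 s and Tmonitoring = 102 s on 2 nodes, the overhead is 2 / 100 = 2%; if, on 8 nodes, Toriginal shrinks to 30 s while the instrumentation cost shrinks to 0.3 s (fewer MPI calls), the overhead is 0.3 / 30 = 1%, so the ratio need not grow with the node count.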
31
Host Side Overhead Details (Slide 23)
• Overhead = (Tmonitoring – Toriginal) / Toriginal
• Lower overhead for the larger problem size
• When the problem size increases:
  – The denominator, Toriginal, increases
  – The numerator, Tmonitoring – Toriginal, remains similar, since the number of MPI calls is similar