Upload
thai
View
41
Download
1
Embed Size (px)
DESCRIPTION
METERG: Measurement-Based End-to-End Performance Estimation Technique in QoS-Capable Multiprocessors. Jae W. Lee and Krste Asanovic { leejw, krste } @mit.edu MIT Computer Science and Artificial Intelligence Lab April 5, 2006. - PowerPoint PPT Presentation
Citation preview
J. LeeApril 5, 2006
MIT CSAIL 1/23
METERG: Measurement-Based End-to-End Performance Estimation
Techniquein QoS-Capable Multiprocessors
Jae W. Lee and Krste Asanovic{leejw, krste}@mit.edu
MIT Computer Science and Artificial Intelligence LabApril 5, 2006
J. LeeApril 5, 2006
MIT CSAIL 2/23
Multiprocessors Are Ubiquitous
Embedded CMP(e.g. ARM MPCore)
P1
MEM I/O
Pn
Shared bus
Servers
Desktops
Handheld/Embedded
• Our focus: Running soft real-time apps (e.g. media codecs, web services) on general-purpose MPs
– Assuming multiprogrammed workloads: real-time + best-effort apps
• Problem: Contention for shared resources causes wider variation of a task’s execution time.
J. LeeApril 5, 2006
MIT CSAIL 3/23
Execution Time Variation in MPs
• For fixed input to fixed task, execution time varies depending on the activities of other processors because of shared memory system.
Processor 1
Shared Bus
SharedMemory
Processor n
Shared Resource
Instance # of TaskX
Executiontime
i i+4i+2
SevereContention
NoContention
i+1 i+3
J. LeeApril 5, 2006
MIT CSAIL 4/23
Impact of Resource Contention
• Execution time increases by 85 % when number of background processes increases from 0 to 7.
• How can we limit the impact of resource contention?
QoS Support
0
0.5
1
1.5
2
0 1 3 7
1.021.00
1.85
1.05
Normalized Execution Time of memread on P1 (a fixed input)
MeasureExecution
Time
Processor 1
Shared Bus
SharedMemory
Processor 8
Shared Resource
Processor 2
Variable Number of Background Processes
Number of Background Processes
J. LeeApril 5, 2006
MIT CSAIL 5/23Memory QoS Reduces MP Execution Time Variation
• Memory QoS guarantees per-processor memory bandwidth and latency.
– It guarantees minimum performance and distributes unclaimed bandwidth to sharers in order to maximize throughput.
• Challenge: How can we find the minimal QoS parameters that meet a RT task’s given performance goal?
– Minimal reservation improves total system throughput.
Instance # of TaskX
i i+2i+1
Range ofExecution Time
i+3 i+4
Without QoS
With QoS
J. LeeApril 5, 2006
MIT CSAIL 6/23Translating Performance Goal in User Metric into QoS Parameters
• User’s performance goal (user metric) should be translated into QoS parameters (system metric).
– User metric: Execution time, Transactions per Second (TPS), Frames per Second (FPS)
– System metric: Memory bandwidth, Latency
• Example: MP server virtualization– Question: What guarantees can you make to users?
– Users prefer user metric (e.g. TPS) to system metric (e.g. Memory bandwidth).
Partition
1
Partition
2
Partition
n
Multiprocessor System Substrate
Minimal QoS parameters are important
to maximize system throughput.
J. LeeApril 5, 2006
MIT CSAIL 7/23Finding Minimal Resource Reservation (1): Analysis
• Static timing analyses: Tools for WCET analyses – Consist of Hardware modeling + program analysis
– [+] Can find minimal resource reservation for all possible inputs
– [-] Analysis cost is prohibitive.
» Becoming harder for increasing hardware/software complexity in MPs
» Possibly overkill for soft RT apps:
Our primary concern is performance degradation from resource contention (not from input data).
We can tolerate a small number of deadline violations to maximize overall system throughput.
J. LeeApril 5, 2006
MIT CSAIL 8/23Finding Minimal Resource Reservation (2): Measurement
• Measurement-based Approach– Measures task’s execution time for a certain input set
– Motto: “The best model for a system is the system itself.” [Colin-RTSS ’03]
– Goal: To find execution time under worst-case contention (=TMAX(x, i)) for given QoS parameter (x) and instruction sequence (i)
Execution Timefor fixed inst seq i
Probabilitydistribution
Range of variation
TMAX(x, i)
Degree of Contention
(x: QoS parameter)
Low ← → High
J. LeeApril 5, 2006
MIT CSAIL 9/23Finding Minimal Resource Reservation (2): Measurement
• Measurement-based Approach (Cont)– [+] Easy
– [-] Not absolutely safe
Can be practically useful for soft real-time apps
Can be useful if we are given “near-worst” input set
– [-] Measured time varies by degree of contention in runtime
We will address this problem in the rest of this talk!Probabilitydistribution
Range of variation
TMAX(x, i)
Degree of Contention
(x: QoS parameter)
Low ← → High
Execution Timefor fixed inst seq i
J. LeeApril 5, 2006
MIT CSAIL 10/23
Our Approach: METERG
• METERG QoS System: Improving measurement-based approach
– Estimates TMAX(x, i) easily by measurement
– Provides hardware support for safe and tight estimation for TMAX(x, i)
– Introduces two QoS modes
» Deployment mode (operation mode)
» Enforcement mode (measurement mode for estimation)
TMAX(x, i)
Probabilitydistribution
Range of variation(Deployment)
Execution Timefor fixed inst seq i
Range of variation(Enforcement) 2. Very narrow and
close to TMAX(x, i) (Tight)
1. Always greater than TMAX(x, i) (Safe)
J. LeeApril 5, 2006
MIT CSAIL 11/23Modifying Shared Blocks to Support Two QoS Modes
• Now each QoS block supports two operation modes:
1) Deployment mode: Conventional QoS mode (Operation)– QoS parameters are treated as a lower bound of received resources
in runtime.
2) Enforcement mode: New QoS Mode (Estimation)– QoS parameters are treated as an upper bound of received
resources in runtime.
QoSParameter
5 0 5 01) Deployment Mode 2) Enforcement Mode
J. LeeApril 5, 2006
MIT CSAIL 12/23
Assumptions
• Single-threaded apps; no shared objects.
• No contentions within a node.
(e.g. multithreaded processors)
• No preemption of a scheduled task.
• Negligible impact of initial states on execution time.
J. LeeApril 5, 2006
MIT CSAIL 13/23
Baseline MP System
Processor 1
Shared Bus
SharedMemory
Processor n
QoS-capableShared block
x1 xn
• Simple fixed frame-based scheduling for memory access– Example (x1=0.2, xn=0.3)
– xi determines
» Minimum bandwidth
» Worst-case latency
1 n n 1 n
TakenBy P1
Not taken:Given to a requesterby round-robin fashion
Frame (size = 10 Timeslots)
xi: Fraction of time slots reserved for Processor i
J. LeeApril 5, 2006
MIT CSAIL 14/23Runtime Usage Example in METERG System
• Example (x1=0.2)
Reservation:
1 1
Frame (size = 10 Time Slots)
Runtime usage (Enforcement):
X X X X X X X X
Reserved &used
Reservedbut unused
Runtime usage (Deployment):
Reserved &used
Not allowed to use
Not reservedbut used
0.2
0.2
Processor 1
Shared Bus
SharedMemory
Processor n
QoS-capableShared block
x1 xn
J. LeeApril 5, 2006
MIT CSAIL 15/23Obtain Minimal Resource Reservation Using METERG
How to find minimal resource reservation with METERG:
In measurement phase:
• Step 1: Measure performance in enforcement mode for a QoS parameter.
• Step 2: Iterate Step 1 to find a minimal QoS parameter (yet meeting the performance goal).
• Step 3: Store the minimal QoS parameter.
In operation phase:
• Step 4: Execute the program in deployment mode with the stored QoS parameter for guaranteed performance.
J. LeeApril 5, 2006
MIT CSAIL 16/23Bandwidth Guarantees Are Not Enough: An Example
• Shared bus example again (x1=0.2)
• Bandwidth inequality is guaranteed but not latency.BandwidthDEP (xi) ≥ BandwidthENF (xi) (O)
MAX [LatencyDEP (xi)] ≤ MIN [LatencyENF (xi)] (X)
May cause end-to-end performance inversion.
Enforcement Mode
X X X X X X X X
Deployment Mode
Not allowedto use
Memory Request
2 Timeslots4 Timeslots
Memory Request
Used by otherprocessors
J. LeeApril 5, 2006
MIT CSAIL 17/23Two Enforcement Modes: Safety-Tightness Tradeoff
• Relaxed enforcement (R-ENF) mode: Bandwidth-only guarantees
BandwidthDEP (xi) ≥ BandwidthR-ENF (xi)
– There is a (small) chance for observed execution time in enforcement mode to be smaller than TMAX(x, i). (Not Safe)
• Strict enforcement (S-ENF) mode: Bandwidth and latency guarantees
BandwidthDEP (xi) ≥ BandwidthS-ENF (xi) &&
MAX[LatencyDEP (xi)] ≤ MIN[LatencyS-ENF (xi)]
– Observed execution time is always greater than TMAX(x, i) in enforcement mode as long as there is no timing anomaly in processor. (Safe)
– Estimation of TMAX(x, i) is looser.
J. LeeApril 5, 2006
MIT CSAIL 18/23Memory System with Strict Enforcement Mode: An Implementation
Processor 1
Processor n
METERGQoS
Memoryblock
Mem_Reply
Mem_Req
Delay Queue(bypassed in DEP mode)
1 data1 TMAX(DEP)(x1)-Tactual1
1 data20
V Data Timer
TMAX(DEP)(x1)-Tactual2
OpMode1
• Delay queue– In Deployment mode: Bypassed
– In Enforcement mode: Deferring delivery to processor for “lucky” messages
TMAX(DEP)(x1): Worst-Case latencyin Deployment Modefor given param x1
Tactual1:Actual time taken toservice request 1
J. LeeApril 5, 2006
MIT CSAIL 19/23
Evaluation: Simulation Setup
Processor 1
Shared Bus
E D
RequestReply
NetworkInterface
(NI)
E D
ReplyRequest
SharedMemory
OpMode1OpMode8
Processor 8
METERG QoS block
• Simulation Setup– 8-way SMP running Linux– In-order core running with 5x
faster clock than the system bus– Detailed memory contention
model with a simple fixed-frame TDM bus
– 32 KB unified blocking L1 cache / No L2 cache
• Synthetic benchmark: Memread– Infinite loop accessing a large
chunk of memory sequentially– Performance bottlenecked by
memory system performance
J. LeeApril 5, 2006
MIT CSAIL 20/23Evaluation: Varying Number of Processes
0.8
1
1.2
1.4
1.6
1.8
2
0 1 3 7
1.85 (BE)
1.45 (ENF)
1.10 (DEP)
• Execution time degradation as number of background processes increases from 0 to 7without QoS (Best-Effort): 85 %
with QoS (Deployment): 10 %
• Estimated execution time upper bound for given QoS parameter (x1=0.25): 45 %
0 1 3 7# of Background Processes
Normalized Execution Time
(x1 =0.25)
Processor 1
Shared Bus
SharedMemory
Processor 8
Shared Resource
Processor 2
Variable # of BE Processes
Measure P1(x1 = 0.25)
J. LeeApril 5, 2006
MIT CSAIL 21/23Evaluation: Varying QoS Parameters
0.9
1.2
1.5
1.8
2.1
2.4
2.7
0.1 0.25 0.5 0.75
2.35 (ENF)
1.51 (DEP)
Normalized Execution Time
• Performance estimation becomes tighter as QoS parameter (x1) increases.– Because the worst-case latency in accessing memory decreases
accordingly
QoS Parameter (x1)
1.011.09
Processor 1
Shared Bus
SharedMemory
Processor 8
Shared Resource
Processor 2
7 BE Background Processes
Measure P1(Variable x1)
J. LeeApril 5, 2006
MIT CSAIL 22/23Evaluation: Varying Number of QoS Processes
0
0.2
0.4
0.6
0.8
1
1 QoS + 7 BE 2 QoS + 6 BE 3 QoS + 5 BE 4 QoS + 4 BE
P1
P2
P3
P4
P5
P6
P7
P8
(0.25, 0, 0, 0, 0, 0, 0, 0)
(0.25, 0.25, 0.25, 0.20, 0, 0, 0, 0)
(0.25, 0.25, 0, 0, 0, 0, 0, 0)
(0.25, 0.25, 0.25, 0, 0, 0, 0, 0)
0.91
(0.54)
(0.69)
0.45
0.91 0.90
0.37
0.89
0.27
Estimatedlower boundfor xi=0.25
8-BECase
0.91 0.90
0.14
Normalized IPC*
• Performance impact on a QoS process by other QoS processes is negligible (<2%) as long as system is not oversubscribed.
1 QoS + 7 BE 2 QoS + 6 BE 3 QoS + 5 BE 4 QoS + 4 BE
0.81
QoS Parameters:
(* Higher Number = Better Performance)
J. LeeApril 5, 2006
MIT CSAIL 23/23
Conclusion
• In general-purpose MPs, shared resource contention causes large variation of execution time of a task.
(~2x increase observed in 8-processor case)
• METERG QoS System provides an easy way (by measurement) to estimate execution time under worst-case resource contention for a given QoS parameter.
– Introducing a new QoS mode (Enforcement Mode) for performance estimation
• Using METERG QoS System, minimal QoS parameter can be easily found for given instruction sequence and performance goal.