Cross-stack Energy Optimization: Fact or Fiction?
Kevin Skadron
University of Virginia
Dept. of Computer Science
Flavors of X-Stack
• “Up” the stack
  – Circuits → Microarchitecture
  – HW → SW
    • e.g., sensors → throttling
    • Ideally, application itself can adapt (algorithm, precision, QoS, etc.)
  – …
• “Down” the stack
  – Often overlooked, but OS, HW can benefit from application knowledge
  – SW → HW
    • e.g., access patterns, thread priorities, private/shared, etc.
  – GPU example: texture (API → driver → HW)
    • e.g., reconfigurable hardware
Up: Dymaxion: Index Transformation
• SIMD/SIMT: Because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed
• User → middleware → (device driver) → (hardware)

feature[index] → feature’[transform(index)]
Code Example

Original Version

HOST
    cudaMemcpy(feature_d, feature, …);
    kmeans_kernel_orig<<<dimGrid, dimBlock>>>(feature_d, ...);

DEVICE
    __global__ void kmeans_kernel_orig(float *feature_d, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            ... feature_d[index] ...
        }
    }

Dymaxion Version

HOST
    map_row2col(feature_remap, feature, …);
    kmeans_kernel_map<<<dimGrid, dimBlock>>>(feature_remap, ...);

DEVICE
    __global__ void kmeans_kernel_map(float *feature_remap, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            ... feature_remap[transform_row2col(index, npoints, nfeatures)] ...
        }
    }
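
The slide does not show what the remapping helpers do internally. As a rough host-side C++ sketch, assuming `feature` is a row-major `npoints × nfeatures` array and the transform is a plain row-major to column-major remap, the index math could look like the following. The names `row2col_index` and `remap_row2col` are hypothetical stand-ins for `transform_row2col` and `map_row2col`; the actual Dymaxion helpers may take different arguments or do more work.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for transform_row2col: converts a row-major index
// (point * nfeatures + feature) into a column-major index
// (feature * npoints + point).
inline std::size_t row2col_index(std::size_t index,
                                 std::size_t npoints,
                                 std::size_t nfeatures) {
    std::size_t point = index / nfeatures;    // which data point (row)
    std::size_t feature = index % nfeatures;  // which feature (column)
    return feature * npoints + point;
}

// Hypothetical stand-in for map_row2col: copies src into dst so that
// dst[row2col_index(i, ...)] == src[i] for every element i.
inline void remap_row2col(std::vector<float>& dst,
                          const std::vector<float>& src,
                          std::size_t npoints,
                          std::size_t nfeatures) {
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[row2col_index(i, npoints, nfeatures)] = src[i];
    }
}
```

With this layout, threads that handle consecutive points read consecutive addresses for the same feature, which is the contiguous access pattern that SIMD/SIMT hardware favors. The kernel keeps computing the original row-major index and only wraps the access in the transform.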
Down: Lack of Sensors and Actuators
• Feedback control: sensors and actuators
• Chicken and egg problem
• Lack of sensors is a big problem now
  – Can’t control what we can’t measure
  – Performance monitors not designed for this
    • Too coarse-grained, can’t monitor enough
  – Moving in the right direction
• Need more actuators, too
  – Currently mainly have just DVFS and scheduling/placement
  – Some HDDs offer DRPM
  – Reconfiguration is a form of actuation, too
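
To make the sensor/actuator pairing concrete, here is a minimal C++ sketch of one feedback step, assuming a temperature sensor as the input and a discrete DVFS level as the only actuator. The struct, its fields, and the thresholds are all hypothetical illustrations, not something taken from the talk.

```cpp
#include <algorithm>

// Hypothetical controller: one sensor (temperature), one actuator (DVFS level).
struct DvfsController {
    int level;      // current DVFS level, 0 = slowest
    int max_level;  // highest available level
    double hot_c;   // assumed threshold: throttle above this temperature
    double cool_c;  // assumed threshold: speed up again below this temperature

    // One control step: take a sensor reading, return the new actuator setting.
    int step(double temp_c) {
        if (temp_c > hot_c) {
            level = std::max(level - 1, 0);          // too hot: step down
        } else if (temp_c < cool_c) {
            level = std::min(level + 1, max_level);  // cool: step back up
        }
        // Between cool_c and hot_c the level is held (hysteresis band).
        return level;
    }
};
```

Even a loop this simple depends on a reading that is timely and fine-grained enough to act on, which is the "can't control what we can't measure" point above.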
Wish List
• Sensors/constraint communication
  – Up: Structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc.
    • Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc.
  – Down: Access patterns, private/shared, priority/performance expectations, etc.
    • Requires new programming constructs and new (possibly privileged) instructions
• Actuators
  – Many system components hard to control
    • e.g., HDDs, DRAM, power supply
  – Control memory behavior, light sleep modes
    • Ordering/buffering/prefetching/contention
  – More reconfigurability, coarse-grained architectures
    • Why use cache when you can use scratchpad; registers, routed network when you can do direct producer-consumer, etc.?
Summary
• Turn fiction into non-fiction!
• Some good ideas already in papers
  – Revisit: why weren’t they adopted?
• New ideas:
  – Imagine ideal sensing and actuation
  – Show a promising control/adaptation/reconfiguration algorithm
  – Propose plausible sensors/actuators
Backup
What is “Cross Stack”?
• Layer X adapts based on information in Layer Y
  – Example: OS uses hardware info
    • e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location
  – Or hardware uses OS info
    • e.g., thread priorities, task deadlines guide hardware DVFS policy
  – Important: leverage information across layers to make globally efficient decisions
  – Ultimately: break down costly interfaces
    • Unnecessary copies, extra state, redundant computation
• Different from energy optimization happening independently in multiple layers
  – e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines)
  – Risky: control loops can fight
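
One way to picture the alternative to two independent loops is a single arbitration point that sees both layers' requests. The C++ sketch below is only an illustration under assumed semantics: the type, the function, and the policy (the OS request acts as a deadline floor, the hardware request is otherwise followed) are hypothetical and not from the talk.

```cpp
#include <algorithm>

// Hypothetical requests, each expressed as a DVFS level (0 = slowest).
struct DvfsRequests {
    int hw_level;  // level the hardware policy wants, e.g. from instruction flow
    int os_level;  // minimum level the OS needs to meet task deadlines
};

// Single arbitration point (assumed policy): never drop below the OS
// deadline floor, otherwise follow the hardware's choice, and keep the
// result within [0, max_level].
inline int arbitrate(const DvfsRequests& r, int max_level) {
    int level = std::max(r.hw_level, r.os_level);
    return std::min(std::max(level, 0), max_level);
}
```

Because exactly one function sets the level, the hardware and OS policies cannot repeatedly undo each other's decisions, which is the failure mode when control loops fight.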
Fact or Fiction
• Should be fact!
• But mostly fiction
  – Can’t measure power/energy effectively in many systems and components
  – Control options are typically high-overhead
    • DVFS, task migration, etc.
  – Most solutions are single-layer
• Baby steps
  – Cluster/datacenter front end monitors per-node activity, temperature, and schedules accordingly
  – Autotuning
  – Reducing copies