Cross-stack Energy Optimization: Fact or Fiction?


Kevin Skadron
University of Virginia
Dept. of Computer Science

Flavors of X-Stack

• “Up” the stack
  – Circuits → Microarchitecture
  – HW → SW
    • e.g., sensors → throttling
    • Ideally, application itself can adapt (algorithm, precision, QoS, etc.)
  – …
• “Down” the stack
  – Often overlooked, but OS, HW can benefit from application knowledge
  – SW → HW
    • e.g., access patterns, thread priorities, private/shared, etc.
  – GPU example: texture (API → driver → HW)
    • e.g., reconfigurable hardware


Up: Dymaxion: Index Transformation

• SIMD/SIMT: Because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed
• User → middleware → (device driver) → (hardware)

feature[index] → feature’[transform(index)]


Code Example

Original Version

HOST

    /* copy the features to the device in their original layout */
    cudaMemcpy(feature_d, feature, …);
    kmeans_kernel_orig<<<dimGrid, dimBlock>>>(feature_d, ...);

DEVICE

    __global__ void kmeans_kernel_orig(float *feature_d, ...)
    {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            ...feature_d[index]...
        }
    }

Dymaxion Version

HOST

    /* remap feature into feature_remap (row layout to column layout) */
    map_row2col(feature_remap, feature, …);
    kmeans_kernel_map<<<dimGrid, dimBlock>>>(feature_remap, ...);

DEVICE

    __global__ void kmeans_kernel_map(float *feature_remap, ...)
    {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            /* same logical index; only the physical location changes */
            ...feature_remap[transform_row2col(index, npoints, nfeatures)]...
        }
    }
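To make the index transformation concrete, here is a minimal host-side sketch in plain C of what a row-to-column remap could look like. The names sketch_transform_row2col and sketch_map_row2col are hypothetical stand-ins, not Dymaxion's actual implementation. The sketch assumes the original array is stored point-major (index = point_id * nfeatures + l) and that the remapped array is feature-major.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the slide's transform_row2col():
 * maps a point-major (row-major) index to a feature-major
 * (column-major) index. Assumed layout, not taken from Dymaxion. */
size_t sketch_transform_row2col(size_t index, size_t npoints, size_t nfeatures)
{
    assert(index < npoints * nfeatures);
    size_t point = index / nfeatures;   /* row in the original layout */
    size_t feat  = index % nfeatures;   /* column in the original layout */
    return feat * npoints + point;      /* same element, column-major */
}

/* Hypothetical stand-in for the slide's map_row2col(): copies src into
 * dst so that dst[transform(i)] == src[i] for every element. */
void sketch_map_row2col(float *dst, const float *src,
                        size_t npoints, size_t nfeatures)
{
    for (size_t i = 0; i < npoints * nfeatures; i++)
        dst[sketch_transform_row2col(i, npoints, nfeatures)] = src[i];
}
```

Under this assumed layout, threads with consecutive point_id values that read the same feature l touch consecutive addresses in the remapped array, which is the contiguous access pattern SIMD/SIMT hardware favors.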

Down: Lack of Sensors and Actuators

• Feedback control: sensors and actuators
• Chicken and egg problem
• Lack of sensors is a big problem now
  – Can’t control what we can’t measure
  – Performance monitors not designed for this
    • Too coarse-grained, can’t monitor enough
  – Moving in the right direction
• Need more actuators, too
  – Currently mainly have just DVFS and scheduling/placement
  – Some HDDs offer DRPM
  – Reconfiguration is a form of actuation, too
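The sensor/actuator pairing above can be sketched as one step of a simple feedback loop. This is a minimal illustration in plain C under stated assumptions: a power reading in watts stands in for the sensor, a discrete DVFS level stands in for the actuator, and NUM_LEVELS, dvfs_step, and the margin parameter are all hypothetical names chosen for the example.

```c
#include <assert.h>

#define NUM_LEVELS 4  /* hypothetical number of DVFS operating points */

/* One step of a threshold controller with hysteresis (assumed policy):
 * step down when measured power exceeds the budget, step up only when
 * it is below the budget by more than margin_watts, otherwise hold. */
int dvfs_step(int level, double measured_watts,
              double budget_watts, double margin_watts)
{
    assert(level >= 0 && level < NUM_LEVELS);
    if (measured_watts > budget_watts && level > 0)
        return level - 1;               /* over budget: slow down */
    if (measured_watts < budget_watts - margin_watts &&
        level < NUM_LEVELS - 1)
        return level + 1;               /* clear headroom: speed up */
    return level;                       /* inside the dead band: hold */
}
```

The loop is only as good as its inputs: if the power reading is coarse-grained or stale, the controller reacts late, which is the "can't control what we can't measure" point.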


Wish List

• Sensors/constraint communication
  – Up: Structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc.
    • Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc.
  – Down: Access patterns, private/shared, priority/performance expectations, etc.
    • Requires new programming constructs and new (possibly privileged) instructions
• Actuators
  – Many system components hard to control
    • e.g., HDDs, DRAM, power supply
  – Control memory behavior, light sleep modes
    • Ordering/buffering/prefetching/contention
  – More reconfigurability, coarse-grained architectures
    • Why use cache when you can use scratchpad; registers, routed network when you can do direct producer-consumer, etc.?


Summary

• Turn fiction into non-fiction!
• Some good ideas already in papers
  – Revisit: why weren’t they adopted?
• New ideas:
  – Imagine ideal sensing and actuation
  – Show a promising control/adaptation/reconfiguration algorithm
  – Propose plausible sensors/actuators


Backup


What is “Cross Stack”?

• Layer X adapts based on information in Layer Y
  – Example: OS uses hardware info
    • e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location
  – Or hardware uses OS info
    • e.g., thread priorities, task deadlines guide hardware DVFS policy
  – Important: leverage information across layers to make globally efficient decisions
  – Ultimately: break down costly interfaces
    • Unnecessary copies, extra state, redundant computation
• Different from energy optimization happening independently in multiple layers
  – e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines)
  – Risky: control loops can fight
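As one illustration of hardware using OS information, the sketch below picks a frequency from a task deadline. It is a minimal example in plain C, not a policy taken from the talk: the function name pick_freq_for_deadline, the cycles_left estimate, and the ascending frequency table are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the lowest frequency that can retire cycles_left
 * before the deadline, or the highest index if none can (or if the
 * deadline has already passed). freqs_hz is assumed sorted ascending;
 * cycles_left and seconds_left are assumed to be supplied by the OS. */
size_t pick_freq_for_deadline(const double *freqs_hz, size_t nfreqs,
                              double cycles_left, double seconds_left)
{
    assert(nfreqs > 0);
    if (seconds_left <= 0.0)
        return nfreqs - 1;              /* already late: run flat out */
    double needed_hz = cycles_left / seconds_left;
    for (size_t i = 0; i < nfreqs; i++)
        if (freqs_hz[i] >= needed_hz)
            return i;                   /* slowest setting that still fits */
    return nfreqs - 1;                  /* cannot meet deadline: best effort */
}
```

Because a single decision point combines the deadline (an OS-level fact) with the frequency table (a hardware-level fact), there is only one control loop, which avoids the situation where separate hardware and OS DVFS policies work against each other.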


Fact or Fiction

• Should be fact!
• But mostly fiction
  – Can’t measure power/energy effectively in many systems and components
  – Control options are typically high-overhead
    • DVFS, task migration, etc.
  – Most solutions are single-layer
• Baby steps
  – Cluster/datacenter front end monitors per-node activity and temperature, and schedules accordingly
  – Autotuning
  – Reducing copies

