Thrifty: An Exascale Architecture for Energy Proportional Computing

Incremental In-Memory Checkpointing

[Figure: local checkpoints across processors P1 to P5; panels labeled "Producer rollback / Consumer rollback" and "Producer chkpoint / Consumer chkpoint", each showing P1 and P2.]

• Local coordinated checkpointing (not global)
• Only communicating processors checkpoint/rollback together
• Overheads of a few percent expected

RULES:
– P1 rolls back: P1’s consumers roll back
– P2 checkpoints: P2’s producers checkpoint
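The two rules above can be sketched as a reachability computation over recorded producer/consumer communication. This is a minimal illustration in C, not the hardware mechanism described on the poster: the `DepGraph` type, the function names, the fixed `NPROC`, and the assumption that each rule is applied transitively are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPROC 5 /* P1..P5; index 0 stands for P1 (illustrative size) */

/* dep[p][c] is true when producer p has sent data to consumer c since
 * p's last checkpoint. Hypothetical bookkeeping, not from the poster. */
typedef struct {
    bool dep[NPROC][NPROC];
} DepGraph;

void dep_init(DepGraph *g) { memset(g, 0, sizeof *g); }

void dep_record(DepGraph *g, int producer, int consumer) {
    assert(producer >= 0 && producer < NPROC);
    assert(consumer >= 0 && consumer < NPROC);
    g->dep[producer][consumer] = true;
}

/* Mark every processor reachable from `start`, following
 * producer->consumer edges (forward) or consumer->producer edges. */
static void closure(const DepGraph *g, int start, bool forward,
                    bool out[NPROC]) {
    int stack[NPROC], top = 0;
    assert(start >= 0 && start < NPROC);
    memset(out, 0, NPROC * sizeof out[0]);
    out[start] = true;
    stack[top++] = start; /* each processor is pushed at most once */
    while (top > 0) {
        int p = stack[--top];
        for (int q = 0; q < NPROC; q++) {
            bool edge = forward ? g->dep[p][q] : g->dep[q][p];
            if (edge && !out[q]) {
                out[q] = true;
                stack[top++] = q;
            }
        }
    }
}

/* Rule: a processor that rolls back takes its consumers with it. */
void rollback_set(const DepGraph *g, int start, bool out[NPROC]) {
    closure(g, start, true, out);
}

/* Rule: a processor that checkpoints takes its producers with it. */
void checkpoint_set(const DepGraph *g, int start, bool out[NPROC]) {
    closure(g, start, false, out);
}
```

Processors that never communicated with the initiator, directly or indirectly, stay out of both sets. That is what keeps the coordination local.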


Motivation

• Current petascale architectures are not scalable
– Consume several MW of power
– Can easily waste 20% of capacity to faults and recovery

• Thrifty:
– Novel exascale architecture and software stack
– Aims at highly-efficient, energy proportional computing
– Innovates in power/energy efficiency, resiliency, and performance
– Tackles problem at the circuit, architecture, compilation, and runtime layers

Approach

1. Power/energy efficiency: one order of magnitude efficiency gain
• Circuits/architectures for low Vdd and fine-grain power management
• Energy-aware compiler drives power management
• Application models for energy proportionality and power management

2. Resiliency: reduce loss to faults/recovery to less than 5% of execution
• Circuits/architectures for error detection and tolerance
• Architecture/software for incremental in-memory scalable checkpointing

3. Performance: one order of magnitude performance increase
• Architectures for fine-grain synch and communication
• Auto-tuning compiler
• Identifying application idioms and efficiently mapping them

Thrifty Architecture

Architecture Description

• 1K-core processor chips with low Vdd and extensive clock/power gating
• Cores organized in clusters with fine-grain communication/synch support
• Banked on-chip memory with simple, distributed engines
• On-chip network with wide links and simple routers
• Novel features for fault detection and tolerance
• Incremental, in-memory scalable checkpointing driven by hardware
• Novel hardware to minimize data movement and facilitate auto-tuning

Application Auto-Tuning for Energy Efficiency

“Auto-Tuning for Energy Usage in Scientific Applications”, by A. Tiwari, M. A. Laurenzano, L. Carrington, and A. Snavely. In PROPER Workshop at Euro-Par, September 2011.

• Identify HW and SW tunables that have high impact on power draw
• Tweak them to optimize for energy usage while maintaining performance

• Approach: Compiler-based using ROSE, CHiLL; Search-based using Active Harmony
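The goal stated above (lower energy while maintaining performance) can be sketched as a constrained search over configurations. The sketch below is a plain exhaustive search in C and does not use or imitate the Active Harmony, ROSE, or CHiLL interfaces. The `MeasureFn` hook, the `slack` parameter, and the numbers in `toy_model` are assumptions made up for illustration. Energy is taken as average power times runtime.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double seconds; /* measured runtime */
    double watts;   /* measured average power draw */
} Measurement;

/* Hypothetical hook: run the application under configuration `config`
 * (for example a thread count or loop-transformation variant). */
typedef Measurement (*MeasureFn)(int config);

/* Return the configuration with the lowest energy (watts * seconds)
 * among those whose runtime stays within `slack` of configuration 0,
 * which is treated as the performance baseline. */
int tune_for_energy(MeasureFn measure, int nconfigs, double slack) {
    assert(measure != NULL && nconfigs > 0 && slack >= 0.0);
    Measurement base = measure(0);
    int best = 0;
    double best_energy = base.seconds * base.watts;
    for (int c = 1; c < nconfigs; c++) {
        Measurement m = measure(c);
        if (m.seconds > base.seconds * (1.0 + slack))
            continue; /* too slow: violates the performance constraint */
        double energy = m.seconds * m.watts;
        if (energy < best_energy) {
            best_energy = energy;
            best = c;
        }
    }
    return best;
}

/* Made-up measurements, for illustration only. */
Measurement toy_model(int config) {
    static const Measurement table[4] = {
        {10.0, 100.0}, {10.2, 90.0}, {12.0, 70.0}, {10.4, 85.0}
    };
    assert(config >= 0 && config < 4);
    return table[config];
}
```

With the toy table, `tune_for_energy(toy_model, 4, 0.05)` selects configuration 3. Configuration 2 uses less energy but is rejected because it runs more than 5% slower than the baseline. A real search-based tuner would sample the space rather than enumerate it.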

Compiler-Driven Transf. for Energy Reduction

“Studying the Impact of Application-Level Optimizations on the Power Consumption of Multi-Core Architectures”, by S. M. F. Rahman, J. Guo, A. Bhat, C. Garcia, M. H. Sujon, Q. Yi, C. Liao, and D. Quinlan. In Computing Frontiers, May 2012.

• 42 sequential and multithreaded benchmarks, measured with a Watts Up PRO power meter
• 10+ major sequential and parallel optimizations and execution configurations

• Performance difference often dominates Performance per Watt (P/W)
• O2 and O3: similar performance with different power consumption
• Multithreaded apps: more room for tuning P/W, e.g. thread affinity, scheduling

Power Measurements in a Many-Core (SCC)

“Comparing the Power and Performance of Intel’s SCC to State-of-the-Art CPUs and GPUs”, by E. Totoni, B. Behzad, S. Ghike and J. Torrellas. In International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2012.

• Power consumption breakdown changes with Vdd:
– At high Vdd, cores dominate; at low Vdd, memory controllers dominate

• 48 Pentium-class cores consume less power than 4 Core i7-class cores
– But program speed is typically lower

Compiler-Driven Transf. for Resilience

“ROSE::FTTransform - A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research”, by J. Lidman, D. J. Quinlan, C. Liao, and S. A. McKee. Submitted for publication, March 2012.

Original loop:

#pragma resiliency
for (int i = 1; i < arraySize-1; i++)
    a[i] = (a[i-1] + a[i+1]) / 2.0;

for (int i = 1; i < (arraySize - 1); i++) {
    int ii, correctCnt = 0;
    float aI[3] = {a[i], a[i], a[i]};
    #pragma omp parallel for
    for(ii = 0; ii < 3; ii += 1) {
        float aII[3] = {aI[ii], aI[ii], aI[ii]};
        // Original statement: aI[ii] =
        aII[0] = ((a[i - 1] + a[i + 1]) / 2.0);
        aII[1] = ((a[i - 1] + a[i + 1]) / 2.0);
        aII[2] = ((a[i - 1] + a[i + 1]) / 2.0);
        aI[ii] = aII[0];
        if (!(aII[2] == aII[1] && aII[1] == aII[0]))
            aI[ii] = (aII[0] + (aII[1] + aII[2])) / 3.00000F;
    }
    #pragma omp parallel for reduction (+:correctCnt)
    for(ii = (0); ii < 2; ii += 1)
        correctCnt += array_inter[ii] == array_inter[ii + 1];
    if (!(correctCnt == 2)) {
        printf("Result is not consistent across executions...
        assert(false);
    }
}

• Automated introduction of triple modular redundancy (TMR)
– 20% performance cost
– Low power
• Transformations use OpenMP
• Future work: tune these transformations to Thrifty
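As a reading aid for the TMR idea, here is a small hand-written sketch in C that evaluates the stencil three times and takes a majority vote. It is not output of ROSE::FTTransform and uses plain sequential C rather than OpenMP. Unlike the generated code shown on the poster, which averages the replicas when they disagree, this sketch uses majority voting. The names `tmr_vote` and `smooth_tmr` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Majority vote over three replicas. *agreed is set to 0 when no two
 * replicas match, in which case r0 is returned unchanged. */
float tmr_vote(float r0, float r1, float r2, int *agreed) {
    assert(agreed != NULL);
    *agreed = 1;
    if (r0 == r1 || r0 == r2)
        return r0;
    if (r1 == r2)
        return r1;
    *agreed = 0;
    return r0;
}

/* The statement protected in the original loop. */
static float stencil(const float *a, size_t i) {
    return (a[i - 1] + a[i + 1]) / 2.0f;
}

/* In-place smoothing with three redundant evaluations per element.
 * Returns the number of elements for which the vote failed. */
size_t smooth_tmr(float *a, size_t n) {
    size_t unresolved = 0;
    for (size_t i = 1; i + 1 < n; i++) {
        /* volatile discourages the compiler from merging the replicas */
        volatile float r0 = stencil(a, i);
        volatile float r1 = stencil(a, i);
        volatile float r2 = stencil(a, i);
        int agreed;
        a[i] = tmr_vote(r0, r1, r2, &agreed);
        if (!agreed)
            unresolved++;
    }
    return unresolved;
}
```

On fault-free hardware the three replicas always agree, so `smooth_tmr` returns 0 and produces the same array as the original loop. The vote only matters when a transient fault corrupts one evaluation.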

Scalable Checkpointing

“Rebound: Scalable Checkpointing for Coherent Shared Memory”, by Rishi Agarwal, Pranav Garg, and Josep Torrellas. In International Symposium on Computer Architecture (ISCA), June 2011.

VARIUS-NTV: Model of Faults at Low Voltage

“VARIUS-NTV: A Microarchitectural Model to Capture the Increased Sensitivity of Manycores to Process Variations at Near-Threshold Voltage”, by Ulya R. Karpuzcu, Krishna B. Kolluru, Nam Sung Kim and Josep Torrellas. In International Conference on Dependable Systems and Networks (DSN), June 2012.

• Models variation in power and frequency at Near Threshold Voltage (NTV)
• Applies to both cores and memory modules
• Gives resulting errors

Gate delay: EKV-based formula

[Figure: Variation in Frequency (288 Cores at 11nm); axis label: Cluster]

• More variation at near threshold computing (NTC) than in conventional superthreshold computing (STC)

Application Idioms

Idiom: Common pattern of computation and data access in applications

“Automatic Recognition of Performance Idioms in Scientific Applications”, by J. He, A. Snavely, R. F. Van der Wijngaart, M. Frumkin. In International Parallel and Distributed Processing Symposium (IPDPS), May 2011.

Identifying Application Idioms

• Constructed a tool to automatically run through large codes at compile time and classify patterns

• Finding: Performance and energy efficiency of applications can be represented as a tiling of the constituent idioms

• Opportunity: Drive Thrifty dynamic reconfigurations on-the-fly from the application level idiom by idiom to save energy

Team Members

• University of Illinois:
– Josep Torrellas, Rishi Agarwal, Adi Agrawal, Ben Ahrens, Amin Ansari, Dennis Crawford, Hassan Elsami, Maria Garzaran, Austin Gibbons, Prabhat Jain, Ulya Karpuzcu, Wooil Kim, Shanxiang Qi

• Lawrence Livermore National Laboratory:
– Daniel Quinlan, Leo Liao, Thomas Panas, Justin Too

• University of California San Diego:
– Laura Carrington, Pietro Cicotti, Sandeep Gupta, Mitesh Meswani, Kayla Seager, Ananta Tiwari

• Intel Corporation:
– Wilfred Pinfold

Josep Torrellas (University of Illinois), Daniel Quinlan (Lawrence Livermore National Lab), Laura Carrington (University of California, San Diego), Wilfred Pinfold (Intel)

http://extremescale.cs.uiuc.edu/thrifty October 2012
