Thrifty: An Exascale Architecture for Energy Proportional Computing. Josep Torrellas (University of Illinois), Daniel Quinlan (Lawrence Livermore National Lab), Laura Carrington (University of California, San Diego), Wilfred Pinfold (Intel).
Incremental In-Memory Checkpointing
[Figure: processors P1 to P5 grouped into local checkpoints; producer rollback and consumer rollback between P1 and P2; producer checkpoint and consumer checkpoint between P1 and P2]
• Local coordinated checkpointing (not global)
• Only communicating processors checkpoint/rollback together
• Overheads of a few percent expected
RULES:
• If P1 rolls back, P1’s consumers roll back
• If P2 checkpoints, P2’s producers checkpoint
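The two rules above can be sketched as a reachability computation over a producer/consumer graph. This is only an illustration under stated assumptions: the `dep_graph` bookkeeping, the function names, and the fixed processor count are hypothetical, and the sketch applies the rules transitively in software rather than modeling the hardware-driven mechanism the slides refer to.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPROC 5  /* P1..P5 from the figure, indexed 0..4 (assumption) */

/* produced[p][c] is true if processor p produced data that processor c
 * consumed since the last checkpoint (hypothetical bookkeeping). */
typedef bool dep_graph[NPROC][NPROC];

/* Marks `start` and everything reachable from it, following consumer
 * edges (p -> c) or producer edges (c -> p) depending on the flag. */
static void closure(dep_graph produced, int start, bool follow_consumers,
                    bool set[NPROC])
{
    int stack[NPROC], top = 0;

    assert(start >= 0 && start < NPROC);
    memset(set, 0, NPROC * sizeof(bool));
    set[start] = true;
    stack[top++] = start;          /* each processor is pushed at most once */
    while (top > 0) {
        int p = stack[--top];
        for (int q = 0; q < NPROC; q++) {
            bool edge = follow_consumers ? produced[p][q] : produced[q][p];
            if (edge && !set[q]) {
                set[q] = true;
                stack[top++] = q;
            }
        }
    }
}

/* Rollback rule: if `start` rolls back, its consumers roll back. */
void rollback_set(dep_graph produced, int start, bool set[NPROC])
{
    closure(produced, start, true, set);
}

/* Checkpoint rule: if `start` checkpoints, its producers checkpoint. */
void checkpoint_set(dep_graph produced, int start, bool set[NPROC])
{
    closure(produced, start, false, set);
}
```

Processors that have not communicated with `start`, directly or indirectly, stay out of both sets, which is what makes the scheme local rather than global.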
Thrifty: An Exascale Architecture for Energy Proportional Computing
Motivation
• Current petascale architectures are not scalable
  – Consume several MW of power
  – Can easily waste 20% of capacity to faults and recovery
• Thrifty:
  – Novel exascale architecture and software stack
  – Aims at highly-efficient, energy proportional computing
  – Innovates in power/energy efficiency, resiliency, and performance
  – Tackles problem at the circuit, architecture, compilation, and runtime layers
Approach
1. Power/energy efficiency: one order of magnitude efficiency gain
• Circuits/architectures for low Vdd and fine-grain power management
• Energy-aware compiler drives power management
• Application models for energy proportionality and power management
2. Resiliency: reduce loss to faults/recovery to less than 5% of execution
• Circuits/architectures for error detection and tolerance
• Architecture/software for incremental in-memory scalable checkpointing
3. Performance: one order of magnitude performance increase
• Architectures for fine-grain synch and communication
• Auto-tuning compiler
• Identifying application idioms and efficiently mapping them
Thrifty Architecture
Architecture Description
• 1K-core processor chips with low Vdd and extensive clock/power gating
• Cores organized in clusters with fine-grain communication/synch support
• Banked on-chip memory with simple, distributed engines
• On-chip network with wide links and simple routers
• Novel features for fault detection and tolerance
• Incremental, in-memory scalable checkpointing driven by hardware
• Novel hardware to minimize data movement and facilitate auto-tuning
Application Auto-Tuning for Energy Efficiency
“Auto-Tuning for Energy Usage in Scientific Applications”, by A. Tiwari, M. A. Laurenzano, L. Carrington, and A. Snavely. In PROPER Workshop at Euro-Par, September 2011.
• Identify HW and SW tunables that have high impact on power draw
• Tweak them to optimize for energy usage while maintaining performance
• Approach: Compiler-based using ROSE, CHiLL; Search-based using Active Harmony
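The selection step of a search-based tuner can be sketched as follows: among measured configurations, pick the one with the lowest energy whose run time stays within a tolerance of the baseline. This is a minimal sketch, not the Active Harmony API; the `sample_t` fields, the `slack` parameter, and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One measured run of a tunable setting. All fields are hypothetical. */
typedef struct {
    int    config_id;  /* which tunable setting was run */
    double seconds;    /* measured run time */
    double watts;      /* measured average power draw */
} sample_t;

/* Energy in joules = average power * run time. */
double energy_joules(const sample_t *s)
{
    return s->watts * s->seconds;
}

/* Returns the index of the lowest-energy sample whose run time is within
 * `slack` (0.05 means 5%) of the baseline, or -1 if none qualifies. */
int pick_config(const sample_t *samples, size_t n,
                double baseline_seconds, double slack)
{
    int best = -1;

    assert(samples != NULL && slack >= 0.0);
    for (size_t i = 0; i < n; i++) {
        if (samples[i].seconds > baseline_seconds * (1.0 + slack))
            continue;  /* violates the "maintain performance" constraint */
        if (best < 0 ||
            energy_joules(&samples[i]) < energy_joules(&samples[best]))
            best = (int)i;
    }
    return best;
}
```

Treating performance as a hard constraint and energy as the objective mirrors the second bullet; a real tuner would also decide which configurations to measure next, which this sketch leaves out.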
Compiler-Driven Transf. for Energy Reduction
“Studying the Impact of Application-Level Optimizations on the Power Consumption of Multi-Core Architectures”, by S. M. F. Rahman, J. Guo, A. Bhat, C. Garcia, M. H. Sujon, Q. Yi, C. Liao, and D. Quinlan. In Computing Frontiers, May 2012.
• 42 sequential and multithreaded benchmarks, measured with a Watts Up PRO power meter
• 10+ major sequential and parallel optimizations and execution configurations
• Performance difference often dominates Performance per Watt (P/W)
• O2 and O3: similar performance with different power consumption
• Multithreaded apps: more room for tuning P/W, e.g. thread affinity, scheduling
Power Measurements in a Many-Core (SCC)
“Comparing the Power and Performance of Intel’s SCC to State-of-the-Art CPUs and GPUs”, by E. Totoni, B. Behzad, S. Ghike and J. Torrellas. In International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2012.
• Power consumption breakdown changes with Vdd:
  – At high Vdd, cores dominate; at low Vdd, memory controllers dominate
• 48 Pentium-class cores consume less power than 4 Core i7-class cores
  – But program speed is typically lower
#pragma resiliency
for (int i = 1; i < arraySize-1; i++)
  a[i] = (a[i-1] + a[i+1]) / 2.0;
Compiler-Driven Transf. for Resilience
“ROSE::FTTransform A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research”, by J. Lidman, D. J. Quinlan, C. Liao, and S. A. McKee. Submitted for publication, March 2012.
for (int i = 1; i < (arraySize - 1); i++) {
  int ii, correctCnt = 0;
  float aI[3] = {a[i], a[i], a[i]};
  #pragma omp parallel for
  for(ii = 0; ii < 3; ii += 1) {
    float aII[3] = {aI[ii], aI[ii], aI[ii]};
    // Original statement: aI[ii] =
    aII[0] = ((a[i - 1] + a[i + 1]) / 2.0);
    aII[1] = ((a[i - 1] + a[i + 1]) / 2.0);
    aII[2] = ((a[i - 1] + a[i + 1]) / 2.0);
    aI[ii] = aII[0];
    if (!(aII[2] == aII[1] && aII[1] == aII[0]))
      aI[ii] = (aII[0] + (aII[1] + aII[2])) / 3.00000F;
  }
  #pragma omp parallel for reduction (+:correctCnt)
  for(ii = (0); ii < 2; ii += 1)
    correctCnt += array_inter[ii] == array_inter[ii + 1];
  if (!(correctCnt == 2)) {
    printf("Result is not consistent across executions...
    assert(false);
  }
}
• Automated TMR (triple modular redundancy) introduction
  – 20% performance cost
  – Low power
• Transformations use OpenMP
• Future work: tune these transformations to Thrifty
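The idea behind the TMR transformation can be sketched in a few self-contained lines: execute the statement three times and accept a value only if at least two copies agree. This is a simplified illustration, not the ROSE::FTTransform output. It uses majority voting instead of the averaging fallback in the generated code above, it omits OpenMP, and the names `tmr_vote`, `stencil`, and `smooth_tmr` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Majority vote over three redundant results. Returns true and stores the
 * agreed value in *out if at least two copies match exactly; returns false
 * if all three disagree (treated as uncorrectable in this sketch). */
bool tmr_vote(float r0, float r1, float r2, float *out)
{
    assert(out != NULL);
    if (r0 == r1 || r0 == r2) { *out = r0; return true; }
    if (r1 == r2)             { *out = r1; return true; }
    return false;
}

/* The stencil statement from the pragma-annotated loop, factored out so
 * it can be executed three times. */
static float stencil(const float *a, size_t i)
{
    return (a[i - 1] + a[i + 1]) / 2.0f;
}

/* Runs the loop with each statement triplicated and voted. Returns the
 * number of iterations in which no majority existed. */
size_t smooth_tmr(float *a, size_t n)
{
    size_t failures = 0;

    for (size_t i = 1; i + 1 < n; i++) {
        float v;
        if (tmr_vote(stencil(a, i), stencil(a, i), stencil(a, i), &v))
            a[i] = v;      /* commit only a voted result */
        else
            failures++;    /* leave a[i] untouched and report */
    }
    return failures;
}
```

An optimizing compiler is free to merge the three identical calls into one, so a real transformation has to keep the redundant copies distinct, for example by running them on different threads as the OpenMP version does.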
Scalable Checkpointing
“Rebound: Scalable Checkpointing for Coherent Shared Memory”, by Rishi Agarwal, Pranav Garg, and Josep Torrellas. In International Symposium on Computer Architecture (ISCA), June 2011.
VARIUS-NTV: Model of Faults at Low Voltage
“VARIUS-NTV: A Microarchitectural Model to Capture the Increased Sensitivity of Manycores to Process Variations at Near-Threshold Voltage”, by Ulya R. Karpuzcu, Krishna B. Kolluru, Nam Sung Kim and Josep Torrellas. In International Conference on Dependable Systems and Networks (DSN), June 2012.
• Models variation in power and frequency at Near Threshold Voltage (NTV)
• Applies to both cores and memory modules
• Gives resulting errors
Gate delay: EKV-based formula
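The formula itself did not survive on this slide. As a hedged reference point, the standard EKV drain-current interpolation and the gate delay that follows from it are sketched below; the exact expression and constants used by VARIUS-NTV may differ. Here $n$ is the subthreshold slope factor and $v_t = kT/q$ is the thermal voltage, both symbols introduced for this sketch.

```latex
% On-current from the EKV interpolation (valid from sub- to super-threshold)
I_{on} \;\propto\; \ln^{2}\!\left(1 + e^{\frac{V_{dd} - V_{th}}{2 n v_t}}\right)

% Gate delay: charge a fixed load to V_dd with current I_on
t_g \;\propto\; \frac{V_{dd}}{I_{on}}
    \;=\; \frac{V_{dd}}{\ln^{2}\!\left(1 + e^{\frac{V_{dd} - V_{th}}{2 n v_t}}\right)}

% Limiting cases
V_{dd} \gg V_{th}: \quad I_{on} \propto \left(\frac{V_{dd} - V_{th}}{2 n v_t}\right)^{2}
\qquad
V_{dd} \ll V_{th}: \quad I_{on} \propto e^{\frac{V_{dd} - V_{th}}{n v_t}}
```

Near threshold the current moves toward the exponential regime, so a given shift in $V_{th}$ changes $t_g$ far more than it does at a conventional supply voltage.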
Variation in Frequency (288 Cores at 11nm)
• More variation at near-threshold computing (NTC) than at conventional super-threshold computing (STC)
[Figure: frequency variation plotted by cluster]
Application Idioms
Idiom: Common pattern of computation and data access in applications
“Automatic Recognition of Performance Idioms in Scientific Applications”, by J. He, A. Snavely, R. F. Van der Wijngaart, M. Frumkin. In Inter. Parallel and Distributed Processing Symp., May 2011.
Identifying Application Idioms
• Constructed a tool to automatically run through large codes at compile time and classify patterns
• Finding: Performance and energy efficiency of applications can be represented as a tiling of the constituent idioms
• Opportunity: Drive Thrifty dynamic reconfigurations on-the-fly from the application level idiom by idiom to save energy
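The tiling view and the reconfiguration opportunity can be sketched together: if an application is a sequence of idiom instances, its energy is the sum over those instances, and choosing a configuration per idiom can only match or beat any single fixed configuration. Everything here is an assumption made for illustration: the idiom labels, the three-configuration space, the per-tile energy table, and the omission of reconfiguration cost.

```c
#include <assert.h>
#include <stddef.h>

#define NCONFIG 3  /* hypothetical number of hardware configurations */

/* One idiom instance in the application's tiling. energy[c] is the
 * (hypothetical, pre-measured) energy in joules of running this instance
 * under configuration c. */
typedef struct {
    const char *idiom;        /* illustrative label, e.g. "stream" */
    double      energy[NCONFIG];
} idiom_tile_t;

/* Energy if the whole application runs under one fixed configuration. */
double energy_fixed(const idiom_tile_t *tiles, size_t n, int config)
{
    double total = 0.0;

    assert(config >= 0 && config < NCONFIG);
    for (size_t i = 0; i < n; i++)
        total += tiles[i].energy[config];
    return total;
}

/* Energy if the configuration is re-chosen for every idiom instance.
 * chosen[i] receives the configuration picked for tile i.
 * Reconfiguration cost is ignored in this sketch. */
double energy_per_idiom(const idiom_tile_t *tiles, size_t n, int *chosen)
{
    double total = 0.0;

    assert(chosen != NULL);
    for (size_t i = 0; i < n; i++) {
        int best = 0;
        for (int c = 1; c < NCONFIG; c++)
            if (tiles[i].energy[c] < tiles[i].energy[best])
                best = c;
        chosen[i] = best;
        total += tiles[i].energy[best];
    }
    return total;
}
```

In practice the gain depends on how expensive a reconfiguration is relative to the length of each idiom instance, which is the part this sketch leaves out.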
Team Members
• University of Illinois:
  – Josep Torrellas, Rishi Agarwal, Adi Agrawal, Ben Ahrens, Amin Ansari, Dennis Crawford, Hassan Elsami, Maria Garzaran, Austin Gibbons, Prabhat Jain, Ulya Karpuzcu, Wooil Kim, Shanxiang Qi
• Lawrence Livermore National Laboratory:
  – Daniel Quinlan, Leo Liao, Thomas Panas, Justin Too
• University of California San Diego:
  – Laura Carrington, Pietro Cicotti, Sandeep Gupta, Mitesh Meswani, Kayla Seager, Ananta Tiwari
• Intel Corporation:
  – Wilfred Pinfold
Josep Torrellas (University of Illinois), Daniel Quinlan (Lawrence Livermore National Lab), Laura Carrington (University of California, San Diego), Wilfred Pinfold (Intel)
http://extremescale.cs.uiuc.edu/thrifty October 2012
Description of the Research