
Thrifty: An Exascale Architecture for Energy Proportional Computing


Josep Torrellas (University of Illinois), Daniel Quinlan (Lawrence Livermore National Lab), Laura Carrington (University of California, San Diego), Wilfred Pinfold (Intel)


Incremental In-Memory Checkpointing

[Figure: processors P1 to P5 taking local checkpoints; panels for producer rollback / consumer rollback and producer checkpoint / consumer checkpoint between P1 and P2]

• Local coordinated checkpointing (not global)
• Only communicating processors checkpoint/rollback together
• Overheads of a few percent expected

RULES:
– If P1 rolls back, P1’s consumers roll back
– If P2 checkpoints, P2’s producers checkpoint
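One way to read the two rules is as a transitive closure over producer/consumer links recorded since the last checkpoint. The sketch below is an illustration under that assumption, not the hardware mechanism itself; the dependence matrix `dep`, the constant `NPROC`, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPROC 5  /* hypothetical processor count, matching P1..P5 */

/* dep[p][c] is true if processor p produced data that processor c
 * consumed since the last checkpoint (hypothetical bookkeeping). */

/* Collect `start` plus everything reachable from it, following either
 * consumer links (dep[cur][other]) or producer links (dep[other][cur]). */
static void closure(bool dep[NPROC][NPROC], int start,
                    bool follow_consumers, bool out[NPROC])
{
    int stack[NPROC];  /* each processor is pushed at most once */
    int top = 0;

    assert(start >= 0 && start < NPROC);
    memset(out, 0, NPROC * sizeof out[0]);
    out[start] = true;
    stack[top++] = start;

    while (top > 0) {
        int cur = stack[--top];
        for (int other = 0; other < NPROC; other++) {
            bool linked = follow_consumers ? dep[cur][other]
                                           : dep[other][cur];
            if (linked && !out[other]) {
                out[other] = true;
                stack[top++] = other;
            }
        }
    }
}

/* Rule: if a processor checkpoints, its producers checkpoint. */
void checkpoint_set(bool dep[NPROC][NPROC], int start, bool out[NPROC])
{
    closure(dep, start, false, out);
}

/* Rule: if a processor rolls back, its consumers roll back. */
void rollback_set(bool dep[NPROC][NPROC], int start, bool out[NPROC])
{
    closure(dep, start, true, out);
}
```

With links P1 to P2, P2 to P3, and P4 to P5, a rollback of P1 involves only P1, P2, and P3; P4 and P5 are untouched, which is the sense in which the scheme is local rather than global.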


Motivation

• Current petascale architectures are not scalable
  – Consume several MW of power
  – Can easily waste 20% of capacity to faults and recovery
• Thrifty:
  – Novel exascale architecture and software stack
  – Aims at highly efficient, energy-proportional computing
  – Innovates in power/energy efficiency, resiliency, and performance
  – Tackles the problem at the circuit, architecture, compilation, and runtime layers

Approach

1. Power/energy efficiency: one order of magnitude efficiency gain
  • Circuits/architectures for low Vdd and fine-grain power management
  • Energy-aware compiler drives power management
  • Application models for energy proportionality and power management
2. Resiliency: reduce loss to faults/recovery to less than 5% of execution
  • Circuits/architectures for error detection and tolerance
  • Architecture/software for incremental in-memory scalable checkpointing
3. Performance: one order of magnitude performance increase
  • Architectures for fine-grain synch and communication
  • Auto-tuning compiler
  • Identifying application idioms and efficiently mapping them

Thrifty Architecture

Architecture Description

• 1K-core processor chips with low Vdd and extensive clock/power gating
• Cores organized in clusters with fine-grain communication/synch support
• Banked on-chip memory with simple, distributed engines
• On-chip network with wide links and simple routers
• Novel features for fault detection and tolerance
• Incremental, in-memory scalable checkpointing driven by hardware
• Novel hardware to minimize data movement and facilitate auto-tuning

Application Auto-Tuning for Energy Efficiency

“Auto-Tuning for Energy Usage in Scientific Applications”, by A. Tiwari, M. A. Laurenzano, L. Carrington, and A. Snavely. In PROPER Workshop at Euro-Par, September 2011.

• Identify HW and SW tunables that have high impact on power draw
• Tweak them to optimize for energy usage while maintaining performance
• Approach: Compiler-based using ROSE, CHiLL; Search-based using Active Harmony
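The selection step at the end of such a search can be sketched as follows: among measured configurations, keep the one with the lowest energy whose run time stays within a tolerated slowdown of the baseline. This is only an illustration; it does not use the Active Harmony API, an exhaustive scan stands in for the search strategy, and the `measurement` type, `pick_min_energy`, and the `max_slowdown` parameter are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One measured run of a candidate configuration (hypothetical type). */
typedef struct {
    double seconds;  /* measured run time */
    double watts;    /* measured average power draw */
} measurement;

/* Return the index of the lowest-energy configuration whose run time is
 * at most max_slowdown times the baseline run time. Entry 0 is taken to
 * be the baseline and is returned if nothing beats it. */
size_t pick_min_energy(const measurement *m, size_t n, double max_slowdown)
{
    assert(m != NULL && n > 0 && max_slowdown >= 1.0);

    double limit = m[0].seconds * max_slowdown;
    size_t best = 0;
    double best_energy = m[0].seconds * m[0].watts;  /* joules */

    for (size_t i = 1; i < n; i++) {
        double energy = m[i].seconds * m[i].watts;
        if (m[i].seconds <= limit && energy < best_energy) {
            best = i;
            best_energy = energy;
        }
    }
    return best;
}
```

Setting `max_slowdown` to 1.0 accepts only configurations that are at least as fast as the baseline, which corresponds to "maintaining performance" in the strict sense; a looser bound trades run time for energy.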

Compiler-Driven Transformations for Energy Reduction

“Studying the Impact of Application-Level Optimizations on the Power Consumption of Multi-Core Architectures”, by S. M. F. Rahman, J. Guo, A. Bhat, C. Garcia, M. H. Sujon, Q. Yi, C. Liao, and D. Quinlan. In Computing Frontiers, May 2012.

• 42 sequential and multithreaded benchmarks, measured with a Watts Up PRO power meter
• 10+ major sequential and parallel optimizations and execution configurations
• Differences in performance often dominate differences in performance per watt (P/W)
• O2 and O3: similar performance with different power consumption
• Multithreaded apps: more room for tuning P/W, e.g. thread affinity, scheduling

Power Measurements in a Many-Core (SCC)

“Comparing the Power and Performance of Intel’s SCC to State-of-the-Art CPUs and GPUs”, by E. Totoni, B. Behzad, S. Ghike, and J. Torrellas. In International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2012.

• Power consumption breakdown changes with Vdd:
  – At high Vdd, cores dominate; at low Vdd, memory controllers dominate
• 48 Pentium-class cores consume less power than 4 Core i7-class cores
  – But program speed is typically lower

Compiler-Driven Transformations for Resilience

“ROSE::FTTransform - A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research”, by J. Lidman, D. J. Quinlan, C. Liao, and S. A. McKee. Submitted for publication, March 2012.

Input (annotated source):

    #pragma resiliency
    for (int i = 1; i < arraySize-1; i++)
        a[i] = (a[i-1] + a[i+1]) / 2.0;

Output (generated code):

    /* Fragment; needs <assert.h>, <stdbool.h>, and <stdio.h>. */
    for (int i = 1; i < (arraySize - 1); i++) {
        int ii, correctCnt = 0;
        float aI[3] = {a[i], a[i], a[i]};
        #pragma omp parallel for
        for (ii = 0; ii < 3; ii += 1) {
            float aII[3] = {aI[ii], aI[ii], aI[ii]};
            // Original statement (result goes to aI[ii]), computed three times:
            aII[0] = ((a[i - 1] + a[i + 1]) / 2.0);
            aII[1] = ((a[i - 1] + a[i + 1]) / 2.0);
            aII[2] = ((a[i - 1] + a[i + 1]) / 2.0);
            aI[ii] = aII[0];
            // If the three copies disagree, fall back to their mean.
            if (!(aII[2] == aII[1] && aII[1] == aII[0]))
                aI[ii] = (aII[0] + (aII[1] + aII[2])) / 3.00000F;
        }
        // Count agreeing neighbor pairs among the three outer copies in aI.
        #pragma omp parallel for reduction (+:correctCnt)
        for (ii = (0); ii < 2; ii += 1)
            correctCnt += aI[ii] == aI[ii + 1];
        if (!(correctCnt == 2)) {
            printf("Result is not consistent across executions...");
            assert(false);
        }
    }

• Automated introduction of triple modular redundancy (TMR)
  – 20% performance cost
  – Low power
• Transformations use OpenMP
• Future work: tune these transformations to Thrifty
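For comparison, a hand-written version of the same idea can be sketched in a few lines. Note the swap in technique: this sketch uses plain majority voting, not the averaging fallback of the generated code, and it is sequential rather than OpenMP-parallel. The names `tmr_vote` and `smooth_tmr` are hypothetical, and `volatile` is used here only as a simple way to keep a compiler from merging the three identical computations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Majority vote over three redundant results. Stores the agreed value in
 * *out and returns true if at least two copies match; returns false and
 * leaves *out untouched if all three differ. */
bool tmr_vote(float r0, float r1, float r2, float *out)
{
    assert(out != NULL);
    if (r0 == r1 || r0 == r2) {
        *out = r0;
        return true;
    }
    if (r1 == r2) {
        *out = r1;
        return true;
    }
    return false;
}

/* In-place neighbor averaging, a[i] = (a[i-1] + a[i+1]) / 2, with each
 * element computed three times and voted on. Returns the number of
 * elements for which no two copies agreed (those are left unchanged). */
size_t smooth_tmr(float *a, size_t n)
{
    size_t unresolved = 0;

    assert(a != NULL || n == 0);
    for (size_t i = 1; i + 1 < n; i++) {
        volatile float r0 = (a[i - 1] + a[i + 1]) / 2.0f;
        volatile float r1 = (a[i - 1] + a[i + 1]) / 2.0f;
        volatile float r2 = (a[i - 1] + a[i + 1]) / 2.0f;
        float voted;

        if (tmr_vote(r0, r1, r2, &voted))
            a[i] = voted;
        else
            unresolved++;
    }
    return unresolved;
}
```

On fault-free hardware all three copies agree, so `smooth_tmr` behaves like the original loop and returns 0; a single corrupted copy is outvoted by the other two.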

Scalable Checkpointing

“Rebound: Scalable Checkpointing for Coherent Shared Memory”, by Rishi Agarwal, Pranav Garg, and Josep Torrellas. In International Symposium on Computer Architecture (ISCA), June 2011.

VARIUS-NTV: Model of Faults at Low Voltage

“VARIUS-NTV: A Microarchitectural Model to Capture the Increased Sensitivity of Manycores to Process Variations at Near-Threshold Voltage”, by Ulya R. Karpuzcu, Krishna B. Kolluru, Nam Sung Kim, and Josep Torrellas. In International Conference on Dependable Systems and Networks (DSN), June 2012.

• Models variation in power and frequency at Near Threshold Voltage (NTV)
• Applies to both cores and memory modules
• Gives resulting errors

Gate delay: EKV-based formula

[Figure: Variation in Frequency (288 cores at 11nm), by cluster]

• More variation at near-threshold computing (NTC) than at conventional super-threshold computing (STC)
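The gate-delay formula is only named here, not given. As a rough guide, the generic textbook form of the EKV drain-current model and the delay it implies are shown below; this is an assumption for illustration and may differ in detail from the formula used in VARIUS-NTV. Here $I_S$ is a technology-dependent specific current, $n$ the subthreshold slope factor, $\phi_t$ the thermal voltage, and $C$ the switched capacitance.

```latex
I_D \approx I_S \left[ \ln\!\left( 1 + e^{\frac{V_{dd} - V_{th}}{2 n \phi_t}} \right) \right]^2,
\qquad
t_{gate} \propto \frac{C \, V_{dd}}{I_D}
```

When $V_{dd} - V_{th} \gg 2 n \phi_t$ (super-threshold), the logarithm is approximately its argument's exponent and $I_D$ grows quadratically with $V_{dd} - V_{th}$. As $V_{dd}$ approaches $V_{th}$, the dependence becomes exponential, so the same shift in $V_{th}$ changes the delay, and hence the achievable frequency, by a larger factor. That is consistent with the larger frequency variation reported at NTC.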

Application Idioms

Idiom: Common pattern of computation and data access in applications

“Automatic Recognition of Performance Idioms in Scientific Applications”, by J. He, A. Snavely, R. F. Van der Wijngaart, and M. Frumkin. In Inter. Parallel and Distributed Processing Symp., May 2011.

Identifying Application Idioms

• Constructed a tool to automatically run through large codes at compile time and classify patterns
• Finding: Performance and energy efficiency of applications can be represented as a tiling of the constituent idioms
• Opportunity: Drive Thrifty dynamic reconfigurations on-the-fly from the application level, idiom by idiom, to save energy
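The tiling view suggests a simple accounting, sketched below under stated assumptions: an application is a sequence of idiom instances, each idiom has a known energy cost under each hardware configuration, and the per-idiom choice is compared against running everything under one fixed configuration. The energy table, the number of configurations `NCONFIG`, and both function names are hypothetical, and the cost of reconfiguring between idioms is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define NCONFIG 3  /* hypothetical number of hardware configurations */

/* energy[idiom][config]: energy of one instance of an idiom under a
 * given configuration (hypothetical, e.g. from profiling).
 * tiling[k]: idiom id of the k-th segment of the application. */

/* Energy of the whole tiling under one fixed configuration. */
double fixed_config_energy(double energy[][NCONFIG], const int *tiling,
                           size_t len, int config)
{
    double total = 0.0;

    assert(config >= 0 && config < NCONFIG);
    for (size_t k = 0; k < len; k++)
        total += energy[tiling[k]][config];
    return total;
}

/* Energy when the cheapest configuration is chosen per idiom instance. */
double per_idiom_energy(double energy[][NCONFIG], const int *tiling,
                        size_t len)
{
    double total = 0.0;

    for (size_t k = 0; k < len; k++) {
        const double *row = energy[tiling[k]];
        double best = row[0];
        for (int c = 1; c < NCONFIG; c++)
            if (row[c] < best)
                best = row[c];
        total += best;
    }
    return total;
}
```

The per-idiom total can never exceed the best fixed-configuration total under this model; how much of that gap survives in practice depends on the reconfiguration overhead that the sketch leaves out.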

Team Members

• University of Illinois:
  – Josep Torrellas, Rishi Agarwal, Adi Agrawal, Ben Ahrens, Amin Ansari, Dennis Crawford, Hassan Elsami, Maria Garzaran, Austin Gibbons, Prabhat Jain, Ulya Karpuzcu, Wooil Kim, Shanxiang Qi
• Lawrence Livermore National Laboratory:
  – Daniel Quinlan, Leo Liao, Thomas Panas, Justin Too
• University of California San Diego:
  – Laura Carrington, Pietro Cicotti, Sandeep Gupta, Mitesh Meswani, Kayla Seager, Ananta Tiwari
• Intel Corporation:
  – Wilfred Pinfold


http://extremescale.cs.uiuc.edu/thrifty October 2012

Description of the Research