ARIADNE Agnostic Reconfiguration In A Disconnected Network Environment Konstantinos Aisopos (Princeton, MIT), Andrew DeOrio (Michigan), Li-Shiuan Peh (MIT),

Embed Size (px)

DESCRIPTION

Why is reconfiguration challenging? XY routing S D X X Y Y

Citation preview

ARIADNE Agnostic Reconguration In A Disconnected Network Environment Konstantinos Aisopos (Princeton, MIT), Andrew DeOrio (Michigan), Li-Shiuan Peh (MIT), Valeria Bertacco (Michigan) What is reconfiguration? Silicon technologies move into the nanometer regime transistors become unreliable for architects: permanent faults Our focus in this talk: Network-on-Chip cannot resend need to re-route around the fault reconfiguration: the process of replacing the routing algorithm SD Shekhar Borkar In future chips of 100 billion transistors, 10% of transistors will eventually fail over the lifetime of the chip P P P$ S$ NIC R R Why is reconfiguration challenging? XY routing S D X X Y Y S D Why is reconfiguration challenging? Agnostic Reconfiguration algorithm In A Disconnected Network Environment S D XY routing Why is reconfiguration challenging? Agnostic Reconfiguration algorithm In A Disconnected Network Environment Outline Motivation Ariadne Baseline Deadlocks Synchronization Evaluation Overhead Performance Reliability Conclusions S D How will S find a path to D? ? ? S D RT RT E D: RT RT S D: RT RT W D: RT RT N D: S D How will S find a path to D? RT RT E,S D: RT RT S D: RT RT W,N D: RT RT E,N D: RT RT W,S D: RT RT N D: S D How will S find a path to D? D S RT RT W D: RT RT W D: RT RT S D: Upon a fault that changes the topology a node can let everyone know how it can be reached with a single broadcast N nodes can let everyone know how they can be reached with N broadcasts ARIADNE: baseline Upon a fault that changes the topology Every node broadcasts in-turn to let others know how it can be reached ARIADNE: baseline 1st 19 2nd 20 3rd 17 last fault detector: 18 statically assigned node IDs Upon a fault that changes the topology Every node broadcasts in-turn to let others know how it can be reached ARIADNE: baseline 1st 19 2nd 20 3rd 17 last fault detector: 18 Issues: deadlock avoidance synchronization (when to broadcast, multiple detectors) S D ARIADNE: deadlocks S D D S D S r first bcast ONLY: nodes are assigned ranks 0 bcaster root 1 immediate neighbors 2 2-hop neighbors 3 3-hop neighbors up*/down*disable routes where rank goes higher up down lower unique ordering: among nodes with same rank, arbitrarily select a higher one assume routing circle: 1 node will have higher rank than its neighbors, breaking the circular route r ARIADNE: deadlocks first bcast ONLY: nodes are assigned ranks 0 bcaster root 1 immediate neighbors 2 2-hop neighbors 3 3-hop neighbors up*/down*disable routes where rank goes higher up down lower unique ordering: among nodes with same rank, arbitrarily select a higher one assume routing circle: 1 node will have higher rank than its neighbors, breaking the circular route r ARIADNE: deadlocks first bcast ONLY: nodes are assigned ranks 0 bcaster root 1 immediate neighbors 2 2-hop neighbors 3 3-hop neighbors up*/down*disable routes where rank goes higher up down lower unique ordering: among nodes with same rank, arbitrarily select a higher one connectivity: can reach any node via the root S D S D assume routing circle: 1 node will have higher rank than its neighbors, breaking the circular route Upon a fault that changes the topology Every node broadcasts in-turn to let others know how it can be reached ARIADNE: deadlocks Issues: deadlock avoidance synchronization Upon a fault that changes the topology Every node broadcasts in-turn to let others know how it can be reached RULE: (i) first broadcast ranks nodes (ii) remaining broadcasts spread ONLY via enabled turns ARIADNE: deadlocks Issues: synchronization in turn broadcasts: when does previous broadcast complete? can broadcasts overlap? ARIADNE: synchronization 1-bit how does the recipient of a flag know the broadcasting node? NO arbitration Solution : Atomic Broadcasts Nodes utilize the cycle count as a global reference point Each node is assigned a unique broadcast slot from the global cycle counter ARIADNE: synchronization cycle count (same for all nodes) XXXXXX bcast cycle bcast node 97 5 waits for waits for 50 initiates bcast s bcast completes 6 initiates bcast s bcast completes 4s bcast completes reconfiguration completes in (16) 2 =(number of nodes) 2 cycles log(16) bits log(16) bits longest (in hops) broadcast ARIADNE: synchronization cycle count (same for all nodes) XXXXXX bcast cycle bcast node 97 5 waits for waits for 50 initiates bcast s bcast completes 8 waits for waits for 80 1 st hop 2 nd hop 8 resigns from becoming the root node (!) we need to reconfigure once even for multiple faults ARIADNE: synchronization Outline Motivation Ariadne Baseline Deadlocks Synchronization Evaluation Overhead Performance Reliability Conclusions overhead performance reliability On-chip routing algorithms for irregular topologies Immunet (V. Puente, ISCA04) Vicis routing algo (D. Fick, DATE09) reserves an escape VC for deadlock freedom (routes deterministically in a ring) exceptions to turn model to apply it to an arbitrary topology Evaluation ARIADNE synthesized a baseline 5-stage pipelined router (5 ports, 2 VCs, 5-flit buffer/VC) with Synopsys Design Compiler (IBM 130nm target library): router area (mm 2 ): baseline=2.708, Ariadne=2.761, Vicis=2.748, Immunet= % 2.0% 6.0% Evaluation: Overhead Experimental Setup: Garnet + GEMS network topology8x8 2D mesh memory controllers4 at chip corners channel width64 bits router architecture5-stage pipeline router ports, VCs5, 2 (private) router buffers/port5-flit for each VC processorsIn-order SPARC cores coherenceMOESI protocol L1 cachingprivate unified 32KB/node ways: 2 latency: 3 cycles L2 cachingshared distributed 1MB/node ways: 16 latency: 15 cycles Network Architecture (GARNET) System Configuration (GEMS) Evaluation: Performance Average over 10 PARSEC benchmarks 1000 fault configurations lower is better deadlocks traffic routing in a ring overhead performance reliability On-chip routing algorithms for irregular topologies Immunet (V. Puente, ISCA04) Vicis routing algo (D. Fick, DATE09) reserves an escape VC for deadlock freedom (routes deterministically in a ring) exceptions to turn model to apply it to an arbitrary topology Evaluation: Performance + Reliability ARIADNE 1.5% 2.0% 6.0% Outline Motivation Ariadne Baseline Deadlocks Synchronization Evaluation Overhead Performance Reliability Conclusions We have presented Ariadne. a reconfiguration algorithm that provides deadlock-free routing paths in irregular network topologies that result from faulty links is implemented in a fully distributed mode, resulting in simple hardware and low complexity enables a trade-off between performance and reliable functionality on unreliable silicon Thank You! Questions? The Greek legend of Princess Ariadne Ariadne (), was the daughter of King Minos of Crete. Minos attacked Athens after his son was killed there. The Athenians asked for terms, and were required to sacrifice seven young men and seven maidens every nine years to the Minotaur, a monster with the head of a bull on the body of a man. One year, the sacrificial party included Theseus, a young man who volunteered to come and kill the Minotaur. Ariadne fell in love at first sight, and helped him by giving him a ball of red fleece thread that she was spinning, to find his way out of the Minotaur's labyrinth. similarly to Princess Ariadne, our Ariadne algorithm helps packets nd their way in the labyrinth-like topology of a faulty network. [source: wikipedia] BACKUP SLIDES Evaluation: Results (reconfiguration dynamics) Ariadne Architecture Ariadne vs. off-chip approaches Off-chip networks utilize centralized software algorithms for reconfiguration C (-) topology needs to be communicated to a central node (-) interfacing with OS (-) updated routing tables need to be delivered back to each node OS Ariadne: Motivation Many transistor failures are expected to occur at NoCs fabricated at advanced technology nodes These failures will result in router link failures and disconnected routers Evaluation: Related Work On-chip routing algorithms for irregular topologies implemented for comparison Immunet (V. Puente, ISCA 2004) Vicis routing algorithm (D. Fick, DATE 2009) 1 escape VC reserved for deadlock freedom: routes deterministically in a ring S D latency when traffic 6% overhead (3 routing tables) reliability Other VCs route adaptively. BUT, if no other VC is available, a packet switches to escape VC Evaluation: Related Work On-chip routing algorithms for irregular topologies implemented for comparison Immunet (V. Puente, ISCA 2004) Vicis routing algorithm (D. Fick, DATE 2009) 1 escape VC reserved for deadlock freedom: routes deterministically in a ring latency when traffic 6% overhead (3 routing tables) reliability Other VCs route adaptively. BUT, if no other VC is available, a packet switches to escape VC No faults? use turn model Upon fault occurrence: re-enable disabled turns to increase connectivity (assuming north last) ?