Instruction Set Extensions for Photonic Synchronous Coalesced Access

Paul Keltcher, David Whelihan, Jeffrey Hughes

September 12, 2013

This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Distribution Statement A. Approved for public release; distribution is unlimited.

Instruction Set Extensions for Photonic Synchronous ...ieee-hpec.org/2013/index_htm_files/KetchnerHPEC_2013_Presentation.pdfHPEC 2013- 6 PK 10/12/13 Distribution Statement A. Approved



HPEC 2013- 2 PK 10/12/13 Distribution Statement A. Approved for public release; distribution is unlimited.

Trends

•  Application Needs:
    –  Large distributed data sets
    –  Algorithm complexity
    –  Parallel processing

•  Technology Capabilities:
    –  No more frequency scaling
    –  More cores
    –  Distributed caches

[Figure: Current Multi-core Computers]


Current Architectures

[Figure labels: Memory; Network, Caches, Prefetchers, etc.; CPU (four instances)]

Lack of global coordination hurts both performance and power.


Our Approach

•  Need better synchronization mechanisms.

•  Two components
    –  Exploit application knowledge
    –  Technology enabler

•  Our Approach
    –  Extended ISA
    –  Utilize Photonics

This talk is about employing new technologies that enable us to program parallel processors easily and efficiently.

[Figure labels: Application knowledge (data movement); Instruction Set Extensions; New technology capabilities; New technology (photonics)]


Outline

•  Background
•  The Multi-processor ISA
•  Matrix transpose example
•  Conclusions


Memory Access

•  Physical memories are 2-d arrays that are accessed from only one dimension
    –  To read a memory element, an entire row in the physical array must be read
    –  Memory is most efficiently accessed sequentially

Memory is a 1-dimensional resource, best accessed in contiguous chunks

[Figure labels: Rows; Data Elements; Memory Read Sequence; Memory Operations (reads): 4, 16]
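The row-at-a-time behavior described on this slide can be illustrated with a small counting sketch. It assumes a 4x4 array and a hypothetical memory that keeps only the most recently read row available; under that assumption, reading elements in row order costs 4 row reads and reading them in column order costs 16, which is one plausible reading of the 4 and 16 shown in the figure. The function name and the single-row buffering model are illustrative assumptions, not details from the talk.

```python
def count_row_reads(access_order, row_width):
    """Count physical row reads, assuming only the last-read row stays available."""
    reads = 0
    open_row = None
    for address in access_order:
        row = address // row_width
        if row != open_row:
            # Touching a different row forces a new full-row read.
            reads += 1
            open_row = row
    return reads


ROWS, COLS = 4, 4

# Sequential (row-major) order: finish each physical row before moving on.
sequential = [r * COLS + c for r in range(ROWS) for c in range(COLS)]

# Strided (column-major) order: every access lands in a different row.
strided = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

sequential_reads = count_row_reads(sequential, COLS)
strided_reads = count_row_reads(strided, COLS)
print(sequential_reads, strided_reads)
```

The same 16 data elements are delivered in both cases; only the access order changes the number of row reads.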


Distributed Memory Access

•  Restructuring of data can be extremely performance limiting
•  Micro-architecture innovations can hinder efficient distributed data movement

[Figure labels: Processor / Memory (four pairs); To Main Memory; Read Sequence; Memory Operations (reads): 4, 16; Network Transfers: 4, 16]

Compute parallelism can increase performance, but it can also greatly exacerbate non-local data access problems


Chip-scale Photonic Technology

•  In our work we use chip-scale photonics to tightly integrate processors and memory
•  Chip-scale photonic links are distance independent
•  Photonics enables synchronization at long distance

Photonics enables scalable, global, synchronous communication


The Synchronous Coalesced Access (SCA)*

•  The distance independent nature of photonics can be used to synchronize transfers from spatially separate chips or regions on chips
•  Multiple independent data transfers synthesized on-the-fly
•  This coordination can result in long ordered streams of data sourced from multiple locations
    –  Globally synthesized accesses

[Figure: Data Movement: Synchronous Coalesced Access (SCA). Labels: µP (four instances); Memory; Photonic waveguide modulator; data element]

* IPDPS 2013 “P-sync: A Photonically Enabled Architecture for Efficient Non-local Data Access”


P-sync Architecture*

•  Worker Processors share two TDM silicon photonic waveguides.
•  The Head Node coordinates memory traffic for Worker Processors by issuing Memory Read/Write requests.

[Figure labels: WPN-1 WPN-2 WP1 WP0 Head; DRAM; Light Source; Clock; Inbound (SCA-1) Waveguide; Outbound (SCA) Waveguide]

* IPDPS 2013 “P-sync: A Photonically Enabled Architecture for Efficient Non-local Data Access”



Globally Synchronous Load-Store Instructions

•  One coordinating instruction to initialize the communication schedule

    coalesce_sca base_address, size

•  One instruction to write into the address space set up by the coordinating instruction.

    sca.b32 local_data, sca_index

Program the memory, not the processor.
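As a rough software analogy for how the two instructions fit together, the sketch below models a coalesce as a window of `size` slots that workers fill by index, followed by a single contiguous write once every slot is filled. The `ScaChannel` class, its fields, and the error handling are hypothetical illustrations; the talk does not describe the hardware at this level, and real workers would run concurrently rather than being called in sequence.

```python
class ScaChannel:
    """Toy model of one synchronous coalesced access (hypothetical, not the P-sync hardware)."""

    def __init__(self, memory):
        self.memory = memory      # flat list standing in for DRAM
        self.base = None          # start address of the open window
        self.slots = None         # data collected for the open window
        self.pending = 0          # slots still waiting for data
        self.transactions = 0     # contiguous writes issued so far

    def coalesce_sca(self, base_address, size):
        """Head node: open a window of `size` slots starting at base_address."""
        if self.slots is not None:
            raise RuntimeError("previous coalesce has not completed")
        self.base = base_address
        self.slots = [None] * size
        self.pending = size

    def sca(self, local_data, sca_index):
        """Worker: place local_data into slot sca_index of the open window."""
        if self.slots is None:
            raise RuntimeError("no coalesce window is open")
        self.slots[sca_index] = local_data
        self.pending -= 1
        if self.pending == 0:
            # All slots filled: commit the whole window as one contiguous write.
            end = self.base + len(self.slots)
            self.memory[self.base:end] = self.slots
            self.transactions += 1
            self.slots = None


memory = [0] * 8
channel = ScaChannel(memory)
channel.coalesce_sca(base_address=4, size=4)
# Workers may arrive in any order; the slot index fixes the final layout.
for worker_id in (2, 0, 3, 1):
    channel.sca(local_data=10 + worker_id, sca_index=worker_id)
print(memory, channel.transactions)
```

The point of the model is that the workers never name a global address: the head node fixes the destination, and each worker supplies only its data and slot index.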


Matrix Transpose on P-sync

Memory

    A0,0 A0,1 A0,2 A0,3
    A1,0 A1,1 A1,2 A1,3
    A2,0 A2,1 A2,2 A2,3
    A3,0 A3,1 A3,2 A3,3

[Figure labels: WPN-1 WPN-2 WP1 WP0 Head; DRAM; Light Source; Clock; Inbound (SCA-1) Waveguide; Outbound (SCA) Waveguide]

Head:    coalesce_sca Base, N


Matrix Transpose on P-sync

WP0:    sca.b32 local[2], 0
WP1:    sca.b32 local[2], 1
WP2:    sca.b32 local[2], 2
WP3:    sca.b32 local[2], 3


Matrix Transpose on P-sync

[Figure: data elements on the waveguide: A3,2 A2,2 A1,2 A0,2]

Distributed non-local data is combined on the waveguide to form a single efficient memory transaction.


Full Transpose Code

Head Node:

    R1 = global_base_address
    R2 = 0                          // current row
    R3 = num_rows                   // loop count
    .loop:
        // R4 = global address of row to coalesce
        // R4 = R1 + (R2 * row_size)
        coalesce_sca R4, row_size
        add.u32 R2, R2, 1           // current row
        sub.u32 R3, R3, 1           // loop count
        // continue until all rows have been coalesced.
        bra loop

Worker Node:

    R1 = local_base_address
    R2 = 0                          // row index
    R3 = row_size                   // loop count
    .loop:
        // R5 = local data
        ld.local.u32 R5, local[R2]
        sca.b32 R5, proc_id
        add.u32 R2, R2, 1           // pos in row
        sub.u32 R3, R3, 1           // loop count
        // continue until this row has been coalesced.
        bra loop


•  On the head node, coalesce_sca is executed once for each row of the matrix.

•  Each time through the loop, it waits for the prior coalesce to complete.


•  Each worker processor uses the same sca_index each time through the loop.

•  It is the head node that sets the global address.

•  The sca.b32 instruction waits for its sca_index each time through the loop.
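The interaction of the head and worker loops can be sketched as a sequential simulation. It assumes a square matrix with one worker per row (worker `proc_id` holds row `proc_id`), a global base address of 0, and lockstep execution in place of the hardware's waiting behavior; `transpose_with_sca` and its variable names are illustrative, not part of the proposed ISA.

```python
def transpose_with_sca(matrix):
    """Simulate the head/worker loops; worker i is assumed to hold row i of `matrix`."""
    n = len(matrix)
    assert all(len(row) == n for row in matrix), "sketch assumes a square matrix"

    global_memory = [None] * (n * n)   # stands in for DRAM, base address 0
    transactions = 0

    for current_row in range(n):       # head node loop: one coalesce per row
        base = current_row * n         # R4 = R1 + (R2 * row_size), with R1 = 0
        window = [None] * n            # coalesce_sca R4, row_size

        for proc_id in range(n):       # the same loop iteration on every worker
            local = matrix[proc_id]    # data local to this worker
            # sca.b32 local[R2], proc_id: the slot index is always proc_id.
            window[proc_id] = local[current_row]

        # The filled window lands in memory as one contiguous write.
        global_memory[base:base + n] = window
        transactions += 1

    rows = [global_memory[r * n:(r + 1) * n] for r in range(n)]
    return rows, transactions


a = [[f"A{r},{c}" for c in range(4)] for r in range(4)]
result, transactions = transpose_with_sca(a)
for row in result:
    print(" ".join(row))
print("transactions:", transactions)
```

In this model the third coalesce (current row 2) gathers A0,2 through A3,2, matching the elements shown on the waveguide in the transpose example, and the full 4x4 transpose takes four contiguous writes.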


Related

•  General purpose CPUs
    –  Optimized for locality (caches, prefetchers, etc.)
    –  When there is no locality, these optimizations penalize performance
    –  Existing cache-bypassing operations are single-thread

•  GPUs
    –  Coalescing only works within a Warp
    –  The Kepler Shuffle Instruction manipulates data within a Warp

In these solutions the programmer cannot express global memory transactions across the entire architecture


Conclusions

•  We introduce two new instructions that permit efficient multi-processor synchronization.
    –  coalesce_sca
    –  sca

•  These new instructions, in combination with SCA capability, give us
    –  simple code
    –  high network and memory efficiency due to well-coordinated communication

Global synchrony enables parallel efficiency