Vector Extension Proposal v0.2
Krste Asanović, UC Berkeley / SiFive Inc.
5th RISC-V Workshop, Berkeley, CA, November 30, 2016



Goals for RISC-V Standard V Extension

§ Efficient and scalable to all reasonable design points
  - Low-cost microcontroller or high-performance supercomputer
  - In-order, decoupled, or out-of-order microarchitectures
  - Integer, fixed-point, and/or floating-point data types
§ Good compiler target
§ Support both implicit auto-vectorization (OpenMP) and explicit SPMD (OpenCL) programming models
§ Work with virtualization layers
§ Fit into standard fixed 32-bit encoding space
§ Be base for future vector++ extensions

RISC-V Vector Extension Update Summary

§ Last presentation v0.1, 2nd RISC-V Workshop, June 2015
§ Progress slow last year due to other higher priority parts of standard, but time to work on this now
§ Working group forming now (I'm chair)
§ Goal is to ratify 12 months from now at 7th workshop

V Key Features

§ Cray-style vectors
  - "The right way" to exploit SIMD parallelism
§ Implementation-dependent vector length
  - Same binary runs with different hardware vector lengths
  - Support wide range of implementations, microcontroller to supercomputer
§ Reconfigurable vector register file
§ Mixed-precision support

V Extension State

[Figure: V extension architectural state]

§ Standard RISC-V scalar x and f registers (x0-x31, f0-f31)
§ Up to 32 vector data registers, v0-v31, of at least 4 elements each (elements 0 to MVL-1), with variable bits/element (8, 16, 32, 64, 128)
§ Up to 8 vector predicate registers, p0-p7, with 1 bit per element (elements 0 to MVL-1)
§ Vector configuration CSRs: vcmaxw, vctype, vcnpred
§ Vector length CSR: vl
§ Vector fixed-point rounding mode and saturation flag CSRs: vxrm, vxsat

Vector Unit Configuration

§ Each vector data register is configured with a width and type, or disabled
§ Configurable number of predicate registers (0-8)
§ Maximum vector length (MVL) function of configuration, physical register storage, and microarchitecture

[Figure: example configuration with vector data registers v0-v6 and predicate registers vp0-vp2, laid out from Element 0 to Element MVL-1]

Vector Mandatory Supported Types

54 Volume I: RISC-V User-Level ISA V2.2-draft

CSR name     Number   Base ISA
vl           0x020    RV32, RV64, RV128
vxrm         0x020    RV32, RV64, RV128
vxsat        0x020    RV32, RV64, RV128
vcsr         0x020    RV32, RV64, RV128
vcnpred      0x020    RV32, RV64, RV128
vcmaxw       0x020    RV32, RV64, RV128
vcmaxw1      0x020    RV32
vcmaxw2      0x020    RV32, RV64
vcmaxw3      0x020    RV32
vctype       0x020    RV32, RV64, RV128
vctype1      0x020    RV32
vctype2      0x020    RV32, RV64
vctype3      0x020    RV32
vctypev0     0x020    RV32, RV64, RV128
vctypev1     0x020    RV32, RV64, RV128
...
vctypev31    0x020    RV32, RV64, RV128

Table 9.1: Vector extension CSRs.

Base ISA   Supported Fixed-Point Widths
RV32I      X8, X16, X32
RV64I      X8, X16, X32, X64
RV128I     X8, X16, X32, X64, X128

Extensions   Supported Floating-Point Widths
F            F16, F32
FD           F16, F32, F64
FDQ          F16, F32, F64, F128

Table 9.2: Supported data element widths depending on base integer ISA and supported floating-point extensions. Note that supporting a given floating-point width mandates support for all narrower floating-point widths.

floating-point types (F16, F32, F64, and F128 respectively). When the V extension is added, it must support the vector data element types implied by the supported scalar types as defined by Table 9.2. The largest element width supported:

ELEN = max(XLEN, FLEN)

Compiler support for vectorization is greatly simplified when any hardware-supported data types are supported by both scalar and vector instructions.

Adding the vector extension to any machine with floating-point support adds support for the IEEE standard half-precision 16-bit floating-point data type. This includes a set of scalar half-precision instructions described in Section ??. The scalar half-precision instructions follow the template for other floating-point precisions, but using the hitherto unused fmt field encoding of 10.
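As a rough illustration of Table 9.2 and the ELEN rule, the following Python sketch looks up the mandatory element types for a given base ISA and floating-point extension set. The function names and the convention that FLEN is 0 when no floating-point extension is present are assumptions made for this example, not part of the proposal.

```python
# Sketch of Table 9.2. Names here are illustrative, not defined by the spec.
FIXED_WIDTHS = {
    32: ["X8", "X16", "X32"],
    64: ["X8", "X16", "X32", "X64"],
    128: ["X8", "X16", "X32", "X64", "X128"],
}
FLOAT_WIDTHS = {
    "": [],
    "F": ["F16", "F32"],
    "FD": ["F16", "F32", "F64"],
    "FDQ": ["F16", "F32", "F64", "F128"],
}
# Widest scalar floating-point register per extension set.
# Assumption: treat FLEN as 0 when there is no floating-point support.
FLEN = {"": 0, "F": 32, "FD": 64, "FDQ": 128}


def supported_types(xlen, fp_ext=""):
    """Vector element types implied by the scalar types (Table 9.2)."""
    return FIXED_WIDTHS[xlen] + FLOAT_WIDTHS[fp_ext]


def elen(xlen, fp_ext=""):
    """Largest supported element width: ELEN = max(XLEN, FLEN)."""
    return max(xlen, FLEN[fp_ext])


if __name__ == "__main__":
    print(supported_types(32, "FD"), elen(32, "FD"))
```

For example, an RV32 machine with the F and D extensions would support X8 through X32 plus F16 through F64, with ELEN equal to 64.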

We only support scalar half-precision floating-point types as part of the vector extension, as the main benefits of half-precision are obtained when using vector instructions that amortize per-operation control overhead. Not supporting a separate scalar half-precision floating-point extension also reduces the number of standard instruction-set variants.

§ Adding V extension to scalar floating-point extension adds scalar half-precision (IEEE 16-bit FP) instructions

Vector Maximum Width Configuration

§ Each vector data register has a 4-bit field in the vcmaxw CSR that describes the maximum width of elements in that vector
§ Total of 32x4b=128 bits of width state held in one (RV128), two (RV64) or four (RV32) CSRs
§ Any write to vcmaxw initializes all vector unit state

Copyright © 2010–2016, The Regents of the University of California. All rights reserved.

9.3 Vector Configuration Registers (vcmaxw, vctype, vcp)

The vector unit must be configured before use. Each architectural vector data register (v0–v31) is configured with the maximum number of bits allowed in each element of that vector data register, or can be disabled to free physical vector storage for other architectural vector data registers. The number of available vector predicate registers can also be set independently.

The available MVL depends on the configuration setting, but MVL must always have the same value for the same configuration parameters on a given implementation. Implementations must provide an MVL of at least four elements for all supported configuration settings.

Each vector data register's current maximum-width is held in a separate four-bit field in the vcmaxw CSRs, encoded as shown in Table 9.3.

Width      Encoding
Disabled   0000
8          1000
16         1001
32         1010
64         1011
128        1100

Table 9.3: Encoding of vcmaxw fields. All other values are reserved.

Several earlier vector machines had the ability to configure physical vector register storage into a larger number of short vectors or a shorter number of long vectors, in particular the Fujitsu VP series [12].

In addition, each vector data register has an associated dynamic type field that is held in a four-bit field in the vctype CSRs, encoded as shown in Table 9.4. The dynamic type field of a vector data register is constrained to only hold types that have equal or lesser width than the value in the corresponding vcmaxw field for that vector data register. Changes to vctype do not alter MVL.

Vector data registers have both a maximum element width and a current element data type to support vector function calls, where the caller does not know the types needed by the callee, as described below.

To reduce configuration time, writes to a vcmaxw field also write the corresponding vctype field. The vcmaxw field can be written any value taken from the type encoding in Table 9.4, but only the width information as shown in Table 9.3 will be recorded in the vcmaxw fields whereas the full type information will be recorded in the corresponding vctype field.

Attempting to write any vcmaxw field with a width larger than that supported by the implementation will raise an illegal instruction exception. Implementations are allowed to record a vcmaxw value larger than the value requested. In particular, an implementation may choose to hardwire vcmaxw fields to the largest supported width.

Vector Type Configuration

§ Each data register has current type encoded in 4-bit field in vctype register
§ Writes to vcmaxw set both vcmaxw and vctype; vcmaxw retains only width, not type
§ Writes to vctype zero only the associated vector register

Type       vctype encoding   vcmaxw equivalent
Disabled   0000              0000
F16        0001              1001
F32        0010              1010
F64        0011              1011
F128       0100              1100
X8         1000              1000
X16        1001              1001
X32        1010              1010
X64        1011              1011
X128       1100              1100

Table 9.4: Encoding of vctype fields. The third column shows the value that will be saved when writing to vcmaxw fields. All other values are reserved.

Attempting to write an unsupported type or a type that requires more than the current vcmaxw width to a vctype field will raise an exception.

Any write to a field in the vcmaxw register configures the vector unit and causes all vector data registers to be zeroed and all vector predicate registers to be set, and the vector length register vl to be set to the maximum supported vector length.

Any write to a vctype field zeros only the associated vector data register, leaving the other vector unit state undisturbed. Attempting to write a type needing more bits than the corresponding vcmaxw value to a vctype field will raise an illegal instruction exception.
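The interaction between vcmaxw and vctype writes can be sketched as a small Python model. This is only an illustration under stated assumptions: MVL is held fixed (real hardware derives it from the configuration), predicate registers and reserved encodings are not modelled, and the class and method names are invented for the example.

```python
# Type codes from Table 9.4 (vctype encoding).
VCTYPE = {
    "Disabled": 0b0000,
    "F16": 0b0001, "F32": 0b0010, "F64": 0b0011, "F128": 0b0100,
    "X8": 0b1000, "X16": 0b1001, "X32": 0b1010, "X64": 0b1011, "X128": 0b1100,
}


class IllegalInstruction(Exception):
    """Stands in for the illegal instruction exception."""


def vcmaxw_equivalent(type_code):
    # Table 9.4, third column: only the width survives, so F16 (0001) becomes 1001.
    return (type_code | 0b1000) if type_code else 0


def width_bits(type_code):
    # Table 9.3: 1000 -> 8, 1001 -> 16, ..., 1100 -> 128; Disabled -> 0.
    code = vcmaxw_equivalent(type_code)
    return (8 << (code & 0b0111)) if code else 0


class VectorConfig:
    """Hypothetical model of the per-register configuration state."""

    def __init__(self, mvl=4):
        # Assumption: MVL is fixed here instead of depending on the configuration.
        self.mvl = mvl
        self.vcmaxw = [0] * 32
        self.vctype = [0] * 32
        self.data = [[0] * mvl for _ in range(32)]
        self.vl = mvl

    def write_vcmaxw_field(self, n, type_code):
        # A vcmaxw write records the width, copies the full type into vctype,
        # zeroes every vector data register, and resets vl to MVL.
        self.vcmaxw[n] = vcmaxw_equivalent(type_code)
        self.vctype[n] = type_code
        self.data = [[0] * self.mvl for _ in range(32)]
        self.vl = self.mvl

    def write_vctype_field(self, n, type_code):
        # A vctype write must fit in the configured width and zeroes only register n.
        if width_bits(type_code) > width_bits(self.vcmaxw[n]):
            raise IllegalInstruction("type wider than vcmaxw field")
        self.vctype[n] = type_code
        self.data[n] = [0] * self.mvl
```

In this model, configuring v3 as F64 and later retyping it to X32 clears only v3, while an attempt to retype it to X128 raises the exception.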

Vector registers are zeroed on reconfiguration to prevent security holes and to avoid exposing differences between how different implementations manage physical vector register storage.

In-order implementations will probably use a flag bit per register to mux in 0 instead of garbage values on each source until it is overwritten. For in-order machines, partial writes due to predication or vector lengths less than MVL complicate this zeroing, but these cases can be handled by adopting a hardware read-modify-write, adding a zero bit per element, or a trap to machine-mode trap handler if first write access after configuration is partial. Out-of-order machines can just point initial rename table at physical zero register.

In RV128, vcmaxw is a single CSR holding 32 4-bit width fields. Bits (4N + 3)–(4N) hold the maximum width of vector data register N. In RV64, the vcmaxw2 CSR provides access to the upper 64 bits of vcmaxw. In RV32, the vcmaxw1 CSR provides access to bits 63–32 of vcmaxw, while the vcmaxw3 CSR provides access to bits 127–96.
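The field layout can be illustrated with a short Python sketch that packs 32 four-bit width codes into the 128-bit vcmaxw value and splits it into the CSRs visible at each XLEN. The helper names are invented for this example, and placing bits 95–64 in vcmaxw2 on RV32 is an inference from Table 9.1 rather than something the text states.

```python
def pack_vcmaxw(fields):
    """Pack 32 four-bit codes; register N occupies bits (4N + 3)-(4N)."""
    value = 0
    for n, code in enumerate(fields):
        value |= (code & 0xF) << (4 * n)
    return value


# CSR names per XLEN, lowest-order word first.
# Assumption: on RV32, vcmaxw2 holds bits 95-64 (inferred, not stated in the text).
CSR_NAMES = {
    128: ["vcmaxw"],
    64: ["vcmaxw", "vcmaxw2"],
    32: ["vcmaxw", "vcmaxw1", "vcmaxw2", "vcmaxw3"],
}


def csr_view(value, xlen):
    """Split the 128-bit width state into XLEN-bit CSR values."""
    mask = (1 << xlen) - 1
    return {
        name: (value >> (xlen * i)) & mask
        for i, name in enumerate(CSR_NAMES[xlen])
    }


if __name__ == "__main__":
    fields = [0] * 32
    fields[0] = 0b1010   # v0: 32-bit elements
    fields[8] = 0b1011   # v8: 64-bit elements
    for xlen in (32, 64, 128):
        print(xlen, {k: hex(v) for k, v in csr_view(pack_vcmaxw(fields), xlen).items()})
```

With this layout, the field for v8 starts at bit 32, so on RV32 it appears in the low bits of vcmaxw1.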

The vcnpred CSR contains a single 4-bit WLRL field giving the number of enabled architectural predicate registers, between 0 and 8. Any write to vcnpred zeros all vector data registers, sets all bits in visible vector predicate registers, and sets the vector length register vl to the maximum supported vector length. Attempting to write a value larger than 8 to vcnpred raises an illegal instruction exception.

Vector Predicate Configuration

§ The vcnpred CSR holds number of predicate registers (0-8)
§ Writes to vcnpred initialize all vector unit state

Faster configuration

§ Setting all configuration bits directly via vcmaxw requires creating/loading long immediates and writing possibly multiple CSRs (RV32/64)
§ A vcfgd CSR alias is defined for faster writes of common vector data configurations
§ One 5-bit field per supported type, set to highest vector register number with that type or zero for none

# Vector-vector 32-bit add loop.
# Assume vector unit configured with correct types.
# a0 holds N
# a1 holds pointer to result vector
# a2 holds pointer to first source vector
# a3 holds pointer to second source vector.
loop:   setvl t0, a0
        vld v0, a2        # Load first vector
        sll t1, t0, 2     # multiply by bytes
        add a2, t1        # Bump pointer
        vld v1, a3        # Load second vector
        add a3, t1        # Bump pointer
        vadd v0, v1       # Add elements
        sub a0, t0        # Decrement elements completed
        vst v0, a1        # Store result vector
        add a1, t1        # Bump pointer
        bnez a0, loop     # Any more?

Figure 9.1: Example vector-vector add loop.

A corresponding vcfgdi instruction is encoded as a CSRRWI that takes a 5-bit immediate value to set the configuration, and returns MVL in the destination register.

One of the primary uses of vcfgdi is to configure the vector unit with single-byte element vectors for use in memcpy and memset routines. A single instruction can configure the vector unit for these operations.

The vcfgd instruction also clears the vcnpred register, so no predicate registers are allocated.

Base ISA   Fields, most significant first (width in bits)
RV32       0 (2)   F64 (5)   F32 (5)   F16 (5)   X32 (5)   X16 (5)   X8 (5)
RV64       0 (24)  F128 (5)  X64 (5)   F64 (5)   F32 (5)   F16 (5)   X32 (5)   X16 (5)   X8 (5)
RV128      0 (83)  X128 (5)  F128 (5)  X64 (5)   F64 (5)   F32 (5)   F16 (5)   X32 (5)   X16 (5)   X8 (5)

Figure 9.2: Format of the vcfgd value for different base ISAs, holding 5-bit vector register numbers for each supported type. Fields must either contain 0 indicating no vector registers are allocated for that type, or a vector register number greater than all to the right. All vector register numbers in between two non-zero fields are allocated to the type with the higher vector register number.

The vcfgd value specifies how many vector registers of each datatype are allocated, and is divided into 5-bit fields, one per supported datatype. A value of 0 in a field indicates that no registers of that type are allocated. A non-zero value indicates the highest vector register number allocated to that type.

Each 5-bit field in the vcfgd value must contain either zero, indicating that no vector registers are allocated for that type, or a vector register number greater than all fields in lower bit positions, indicating the highest vector register containing the associated type. This encoding can compactly represent any arbitrary allocation of vector registers to data types, except that there must be at least two vector registers (v0 and v1) allocated to the narrowest required type. An example allocation is shown in Figure 9.3.

Fast configuration example

Field   0    F64   F32   F16   X32   X16   X8
Value   0    18    12    0     1     0     0

Vector registers   vcmaxw   vctype   Type
v31–v19            0000     0000     Disabled
v18–v13            1011     0011     F64
v12–v2             1010     0010     F32
v1–v0              1010     1010     X32

Figure 9.3: Example use of vcfgd value to set configuration.
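The vcfgd field rule can be checked with a small decoder. The following Python sketch expands an RV32-format vcfgd value into a per-register type list; the function name and the use of ValueError for a malformed value are assumptions of this example (the text does not say how hardware reports a bad encoding).

```python
# RV32 field order from Figure 9.2, least significant field first.
RV32_FIELDS = ["X8", "X16", "X32", "F16", "F32", "F64"]


def decode_vcfgd(value, fields=RV32_FIELDS):
    """Return the type assigned to each of v0..v31 by a vcfgd value."""
    alloc = ["Disabled"] * 32
    next_reg = 0  # lowest register not yet assigned to a type
    for i, name in enumerate(fields):
        top = (value >> (5 * i)) & 0x1F
        if top == 0:
            continue  # no registers of this type
        if top < next_reg:
            # A non-zero field must exceed every field in lower bit positions.
            raise ValueError("field %s is not greater than lower fields" % name)
        for reg in range(next_reg, top + 1):
            alloc[reg] = name
        next_reg = top + 1
    return alloc


if __name__ == "__main__":
    # Figure 9.3: X32 = 1, F32 = 12, F64 = 18, all other fields zero.
    value = (1 << 10) | (12 << 20) | (18 << 25)
    alloc = decode_vcfgd(value)
    print(alloc[0], alloc[2], alloc[13], alloc[19])
```

Because a non-zero field names the highest register of its type and the first allocated type starts at v0, the narrowest allocated type always covers at least v0 and v1, which matches the restriction noted above.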

Separate vcfgp and vcfgpi instructions are provided, using the CSRRW and CSRRWI encodings respectively, that write the source value to the vcnpred register and return the new MVL. These writes also clear the vector data registers, set all bits in the allocated predicate registers, and set vl=MVL. A vcfgp or vcfgpi instruction can be used after a vcfgd to complete a reconfiguration of the vector unit.

If a zero argument is given to vcfgd the vector unit will be unconfigured with no enabled registers, and the value 0 will be returned for MVL. Only the configuration registers vcmaxw and vcnpred can be accessed in this state, either directly or via vcfgd, vcfgdi, vcfgp, or vcfgpi instructions. Other vector instructions will raise an illegal instruction exception.

To quickly change the individual types of a vector register, each vector data register n has a dedicated CSR address to access its vctype field, named vctypevn. The vcfgt and vcfgti instructions are assembler pseudo-instructions for regular CSRRW and CSRRWI instructions that update the type fields and return the original value. The vcfgti instruction is typically used to change to a desired type while recording the previous type in one instruction, and the vcfgt instruction is used to revert back to the saved type.


Maximum Vector Length

§ Setting vcmaxw and vcnpred determines current maximum vector length (MVL)
  - vctype does not affect MVL
§ Any change to vcmaxw or vcnpred initializes all vector unit state
  - Must not rely on state in between reconfigurations
  - Gives flexibility to implementations
  - Avoid security holes from leaking state
§ CSRRW / CSRRWI instructions to change vcmaxw/vcnpred return resulting MVL
  - This is different than plain CSRRW that returns old value
  - Most code will not use MVL directly

Set Vector Length

§ Active vector length held in vl CSR, a WARL register holding values between 0 and MVL inclusive.
§ Any configuration changes initialize vl to MVL.
§ Usually vl modified with setvl instruction encoded as CSRRW/CSRRWI instruction
§ Source argument to setvl is application vector length (AVL), returns value placed in vl


AVL Value             vl setting
AVL ≥ 2MVL            MVL
2MVL > AVL > MVL      ⌊AVL/2⌋
MVL ≥ AVL             AVL

Table 9.5: Operation of setvl instruction to set vector length register vl based on requested application vector length (AVL) and current maximum vector length (MVL).

9.4 Vector Length

The active vector length is held in the XLEN-bit WARL vector length CSR vl, which can only hold values between 0 and MVL inclusive. Any writes to the maximum configuration registers (vcmaxw or vcnpred) cause vl to be initialized with MVL. Writes to vctype do not affect vl.

The active vector length is usually written with the setvl instruction, which is encoded as a csrrw instruction to the vl CSR number. The source argument to the csrrw is the requested application vector length (AVL) as an unsigned XLEN-bit integer. The setvl instruction calculates the value to assign to vl according to Table 9.5.

The rules for setting the vl register help keep vector pipelines full over the last two iterations of a stripmined loop. Similar rules were previously used in Cray-designed machines [4].

The vl register is updated with the value given by Table 9.5, which never exceeds the minimum of AVL and MVL, and this value is also returned as the result of the setvl instruction. Note that unlike a regular csrrw instruction, the value returned is not the original CSR value but the modified value.

The idea of having implementation-defined vector length dates back to at least the IBM 3090 Vector Facility [3], which used a special "Load Vector Count and Update" (VLVCU) instruction to control stripmine loops. The setvl instruction included here is based on the simpler setvlr instruction introduced by Asanovic [2].

The setvl instruction is typically used at the start of every iteration of a stripmined loop to set the number of vector elements to operate on in the following loop iteration. The current MVL can be obtained by performing a setvl with a source argument that has all bits set (largest unsigned integer).

No element operations are performed for any vector instruction when vl=0.
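The setvl rule in Table 9.5 and its effect on a stripmined loop can be sketched in Python as follows. The function names are invented for this illustration, and the model assumes a configured vector unit (MVL of at least 1).

```python
def setvl(avl, mvl):
    """Value placed in vl for a requested application vector length (Table 9.5)."""
    if avl >= 2 * mvl:
        return mvl
    if avl > mvl:
        return avl // 2  # split the remainder over the last two iterations
    return avl


def stripmine_chunks(n, mvl):
    """Element counts processed by each iteration of a stripmined loop."""
    if mvl < 1:
        raise ValueError("vector unit must be configured (MVL >= 1)")
    chunks = []
    while n > 0:
        vl = setvl(n, mvl)
        chunks.append(vl)
        n -= vl
    return chunks


if __name__ == "__main__":
    print(stripmine_chunks(20, 8))
```

With N = 20 and MVL = 8, the loop runs with vl = 8, then 6, then 6, so the final two iterations are balanced instead of finishing with a short 4-element pass.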

9.5 Rapid Configuration Instructions

It can take several instructions to set vcmaxw, vctype and vcnpred to a given configuration. To accelerate configuring the vector unit, specialized vcfg instructions are added that are encoded as writes to CSRs with encoded immediate values that set multiple fields in the vcmaxw, vctype, and vcnpred configuration registers.

The vcfgd instruction is encoded as a CSRRW that takes a register value encoded as shown in Figure 9.2, and which returns the corresponding MVL in the destination register.

32-bit integer vector-vector add example

        vcfgd 2*X32          # Only need two vector registers
stripmine:
        vsetvl t0, a0        # a0 holds vector length
        vld v0, a1           # Get first vector
        vld v1, a2           # Get second vector
        vadd v1, v0          # Add vectors
        vst v1, a3           # Store result vector
        sll t1, t0, 2        # Multiply count by 4 to get bytes
        add a1, t1           # Bump pointers
        add a2, t1
        add a3, t1
        sub a0, t0           # Subtract number done
        bnez a0, stripmine   # Any more?
        vuncfg               # Turn off vector unit by zeroing config

16-bit integer vector-vector add example

        vcfgd 2*X16          # Only need two vector registers
stripmine:
        vsetvl t0, a0        # a0 holds vector length
        vld v0, a1           # Get first vector
        vld v1, a2           # Get second vector
        vadd v1, v0          # Add vectors
        vst v1, a3           # Store result vector
        sll t1, t0, 1        # Multiply count by 2 to get bytes
        add a1, t1           # Bump pointers
        add a2, t1
        add a3, t1
        sub a0, t0           # Subtract number done
        bnez a0, stripmine   # Any more?
        vuncfg               # Turn off vector unit by zeroing config

16-bit + 32-bit vector add example

        vcfgd 1*X32|1*X16
stripmine:
        vsetvl t0, a0        # a0 holds vector length
        vld v0, a1           # Get first 16-bit vector
        vld v1, a2           # Get second 32-bit vector
        vadd v1, v0          # Add vectors
        vst v1, a3           # Store result vector
        sll t1, t0, 1        # Multiply count by 2 to get bytes
        sll t2, t0, 2        # Multiply count by 4 to get bytes
        add a1, t1           # Bump pointers
        add a2, t2
        add a3, t2
        sub a0, t0           # Subtract number done
        bnez a0, stripmine   # Any more?
        vuncfg               # Turn off vector unit by zeroing config

Vector Length Portability

§ Same binary code works regardless of:
  - Number of physical register bits
  - Number of physical lanes
  - Mixed-precision packing strategy
§ Architecture guarantees minimum vector length of four regardless of configuration to avoid stripmine overhead for short vectors
  - E.g., if use 32 * 64-bit vector registers,
  - need 128 * 8-byte physical element registers
  - 1KB SRAM
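The storage arithmetic behind the 1KB figure can be written out as a short Python sketch. The first function follows directly from the minimum vector length of four. The second is a purely hypothetical MVL policy (divide storage evenly across enabled registers, ignoring predicate registers and packing constraints), since the actual MVL is implementation-defined.

```python
def min_storage_bytes(num_regs, elem_bits, min_vl=4):
    """Physical storage needed to give num_regs registers min_vl elements each."""
    return num_regs * min_vl * elem_bits // 8


def hypothetical_mvl(storage_bytes, reg_widths_bits):
    """One possible MVL policy (an assumption, not the spec):
    every element index needs one slot per enabled register."""
    bits_per_element_index = sum(reg_widths_bits)
    return storage_bytes * 8 // bits_per_element_index


if __name__ == "__main__":
    print(min_storage_bytes(32, 64))              # all 32 registers at 64 bits
    print(hypothetical_mvl(1024, [64] * 32))      # full configuration
    print(hypothetical_mvl(1024, [32, 32]))       # only two 32-bit registers
```

Under this assumed policy, the same 1KB of storage that yields MVL = 4 with all 32 registers at 64 bits would yield a much longer vector when only two 32-bit registers are enabled, which is the motivation for a reconfigurable register file.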

Polymorphic Instruction Encoding

§ Single signed integer ADD opcode works on different size inputs and outputs
  - Size of inputs and outputs inherent in register number
  - Sign-extend smaller input
  - Modulo arithmetic on overflow to destination
  - Restrict supported combinations to simplify hardware
§ Integer, Fixed-point, Floating-point arithmetic
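A minimal Python sketch of the polymorphic signed add on one element pair: each input is sign-extended from its own register width, and the sum wraps modulo the destination width. The function names are illustrative, and the sketch does not model which width combinations an implementation would actually support.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def poly_add(a, a_bits, b, b_bits, dst_bits):
    """Signed add with per-operand widths and modulo wrap to the destination width."""
    total = sign_extend(a, a_bits) + sign_extend(b, b_bits)
    return sign_extend(total, dst_bits)


if __name__ == "__main__":
    # 16-bit -1 plus 32-bit 5 into a 32-bit destination.
    print(poly_add(0xFFFF, 16, 5, 32, 32))
    # 16-bit overflow wraps when the destination is also 16 bits wide.
    print(poly_add(0x7FFF, 16, 1, 16, 16))
```

The same two 16-bit inputs produce 32768 when the destination is 32 bits wide, which shows why a wider destination type avoids the wrap.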

Vector Loads and Stores

Addressing modes:
§ Unit-stride (scalar base)
§ Constant stride (scalar base, scalar stride)
§ Indexed (scalar base, vector offset)

Types inherent in destination register number (for integers, signed/unsigned determined at use)

Support vector AMOs:
§ E.g., vector fetch-and-add
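The three addressing modes differ only in how each element's address is formed. The Python sketch below generates the per-element addresses for each mode; it assumes strides and offsets are expressed in bytes, which the slide does not specify, and the function names are invented for the example.

```python
def unit_stride(base, vl, elem_bytes):
    """Consecutive elements starting at a scalar base address."""
    return [base + i * elem_bytes for i in range(vl)]


def constant_stride(base, stride, vl):
    """Elements separated by a scalar stride (assumed to be in bytes)."""
    return [base + i * stride for i in range(vl)]


def indexed(base, offsets, vl):
    """Scalar base plus a per-element offset vector (assumed to be in bytes)."""
    return [base + offsets[i] for i in range(vl)]


if __name__ == "__main__":
    print(unit_stride(0x1000, 4, 4))
    print(constant_stride(0x1000, 64, 4))
    print(indexed(0x1000, [0, 40, 8, 16], 4))
```

Unit-stride is the special case of constant stride where the stride equals the element size, and indexed addressing covers gather and scatter patterns.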

Vector Predication

§ Up to eight vector predicate registers p0-p7, one bit per element
§ Logical operations between predicate registers
§ All vector instructions are predicated under p0
  - Implicit predicate due to encoding constraints
§ Instruction to swap two predicate registers
  - Reduce overhead of scheduling complex control flow
  - Can implement just in rename table if OoO core
§ Popcount instruction returns number of active bits in predicate register to scalar integer register
  - Used for divergent control flow optimizations
§ Other cross-element flag operations to support complex loop optimizations
§ Support for software vector length speculation

Vector Predication and Vector Register Renaming

Previous approaches in vector archs:
1) Destination has old value if predicate false
  - Simpler spec, better for in-order/no renaming
  - Have to copy old value to new destination with renaming
2) Destination has zero value if predicate false
  - Better for out-of-order with renaming
  - Need additional merge(s) to rebuild complete vector
3) Destination has undefined value if predicate false
  - More complex code, better for out-of-order with renaming
  - Need additional merge(s) to rebuild complete vector
  - Messy definition

§ We're choosing 1), as simpler and safer.
§ Use microarchitectural tricks for OoO machines to reduce amount of data transfer.
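The difference between approaches 1) and 2) can be seen in a small Python model of a predicated add over the active elements. The function names are invented for the illustration, and elements beyond the active vector length are not modelled.

```python
def vadd_keep_old(dst, a, b, pred):
    """Approach 1: destination keeps its old value where the predicate is false."""
    return [x + y if p else old for old, x, y, p in zip(dst, a, b, pred)]


def vadd_zeroing(a, b, pred):
    """Approach 2: destination is zero where the predicate is false."""
    return [x + y if p else 0 for x, y, p in zip(a, b, pred)]


def merge(pred, on_true, on_false):
    """Extra merge needed under approach 2 to rebuild the complete vector."""
    return [t if p else f for p, t, f in zip(pred, on_true, on_false)]


def popcount(pred):
    """Number of active predicate bits, as returned to a scalar register."""
    return sum(1 for p in pred if p)


if __name__ == "__main__":
    dst, a, b = [9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40]
    pred = [1, 0, 1, 0]
    print(vadd_keep_old(dst, a, b, pred))
    print(merge(pred, vadd_zeroing(a, b, pred), dst))
```

Both lines print the same vector, but approach 2 needs the additional merge step to get there, which is the cost the slide refers to.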

Vector Function Calls

§ In auto-vectorized code, want to make vector calls to function library with separate vector calling convention
  - Args in vector registers
  - Active elements communicated by vector length and vp0
§ Need to abstract callee register usage from caller
§ Caller has to allocate registers for callee to use
§ Set vcmaxw to largest value, then callee can change type with vctype
§ Vector runtime can optimize calling convention within vector runtime library

for (i=0; i<N; i++) x[i] = exp(y[i]/z[i]);

OpenCL / CUDA / SPMD programming

§ Not a great programming model, should move community back to autovectorization/autoparallelization, but needed for compatibility
§ Predication used to handle divergent control flow
  - See Yunsup's thesis
§ Configuration must be set at kernel launch to maximum width used anywhere in call tree
§ Need general vector function call capability with standard callee/caller save protocol

OS Support

§ Restartable page faults via microcode state dump, opaque to OS
  - Similar to DEC Vector VAX implementation
  - If implementation has precise traps, can skip
§ Privileged specification describes XS sstatus field used to encode coprocessor status (Off, Initial, Clean, Dirty) to reduce context save/restore overhead.
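How the XS status reduces context-switch cost can be sketched in Python. The numeric encodings follow the privileged specification's status-field convention (Off = 0, Initial = 1, Clean = 2, Dirty = 3). The function and its save_state callback are hypothetical and stand in for whatever an OS would actually do.

```python
from enum import IntEnum


class XS(IntEnum):
    OFF = 0      # unit disabled, no state to preserve
    INITIAL = 1  # state still holds its initial values
    CLEAN = 2    # state matches the last saved copy
    DIRTY = 3    # state modified since the last save


def save_on_switch_out(status, save_state):
    """Save vector state only when it is Dirty; return the status afterwards.

    save_state is a hypothetical callback that writes the state to memory.
    """
    if status == XS.DIRTY:
        save_state()
        return XS.CLEAN
    return status


if __name__ == "__main__":
    saves = []
    print(save_on_switch_out(XS.DIRTY, lambda: saves.append("saved")).name, saves)
    print(save_on_switch_out(XS.CLEAN, lambda: saves.append("saved")).name, saves)
```

Only the Dirty case pays for a full save of the vector register file, which can be large for long-vector implementations.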

Questions?