Krste Asanović UC Berkeley / SiFive Inc.
5th RISC-V Workshop, Berkeley, CA, November 30, 2016
Vector Extension Proposal v0.2
Goals for RISC-V Standard V Extension
§ Efficient and scalable to all reasonable design points
  - Low-cost microcontroller or high-performance supercomputer
  - In-order, decoupled, or out-of-order microarchitectures
  - Integer, fixed-point, and/or floating-point data types
§ Good compiler target
§ Support both implicit auto-vectorization (OpenMP) and explicit SPMD (OpenCL) programming models
§ Work with virtualization layers
§ Fit into standard fixed 32-bit encoding space
§ Be base for future vector++ extensions
RISC-V Vector Extension Update Summary
§ Last presentation v0.1, 2nd RISC-V Workshop, June 2015
§ Progress slow last year due to other higher-priority parts of standard, but time to work on this now
§ Working group forming now (I'm chair)
§ Goal is to ratify 12 months from now at 7th workshop
V Key Features
§ Cray-style vectors
  - "The right way" to exploit SIMD parallelism
§ Implementation-dependent vector length
  - Same binary runs with different hardware vector lengths
  - Support wide range of implementations, microcontroller to supercomputer
§ Reconfigurable vector register file
§ Mixed-precision support
V Extension State
[Figure: V extension state]
§ Standard RISC-V scalar x and f registers (x0-x31, f0-f31)
§ Up to 32 vector data registers, v0-v31, of at least 4 elements each (elements 0 to MVL-1), with variable bits/element (8, 16, 32, 64, 128)
§ Up to 8 vector predicate registers, p0-p7, with 1 bit per element
§ Vector length CSR: vl
§ Vector configuration CSRs: vcmaxw, vctype, vcnpred
§ Vector fixed-point rounding mode and saturation flag CSRs: vxrm, vxsat
Vector Unit Configuration
§ Each vector data register is configured with a width and type, or disabled
§ Configurable number of predicate registers (0-8)
§ Maximum vector length (MVL) is a function of configuration, physical register storage, and microarchitecture
[Figure: vector register file layout, showing registers v0-v6 and vp0-vp2 across Element 0, Element 1, ..., MVL-1]
Vector Mandatory Supported Types
54 Volume I: RISC-V User-Level ISA V2.2-draft
CSR name    Number   Base ISA
vl          0x020    RV32, RV64, RV128
vxrm        0x020    RV32, RV64, RV128
vxsat       0x020    RV32, RV64, RV128
vcsr        0x020    RV32, RV64, RV128
vcnpred     0x020    RV32, RV64, RV128
vcmaxw      0x020    RV32, RV64, RV128
vcmaxw1     0x020    RV32
vcmaxw2     0x020    RV32, RV64
vcmaxw3     0x020    RV32
vctype      0x020    RV32, RV64, RV128
vctype1     0x020    RV32
vctype2     0x020    RV32, RV64
vctype3     0x020    RV32
vctypev0    0x020    RV32, RV64, RV128
vctypev1    0x020    RV32, RV64, RV128
...
vctypev31   0x020    RV32, RV64, RV128
Table 9.1: Vector extension CSRs.
Supported Fixed-Point Widths
RV32I     X8, X16, X32
RV64I     X8, X16, X32, X64
RV128I    X8, X16, X32, X64, X128

Supported Floating-Point Widths
F         F16, F32
FD        F16, F32, F64
FDQ       F16, F32, F64, F128

Table 9.2: Supported data element widths depending on base integer ISA and supported floating-point extensions. Note that supporting a given floating-point width mandates support for all narrower floating-point widths.
floating-point types (F16, F32, F64, and F128 respectively). When the V extension is added, it must support the vector data element types implied by the supported scalar types as defined by Table 9.2. The largest element width supported:
ELEN = max(XLEN, FLEN)
Compiler support for vectorization is greatly simplified when any hardware-supported data types are supported by both scalar and vector instructions.
Adding the vector extension to any machine with floating-point support adds support for the IEEE standard half-precision 16-bit floating-point data type. This includes a set of scalar half-precision instructions described in Section ??. The scalar half-precision instructions follow the template for other floating-point precisions, but using the hitherto unused fmt field encoding of 10.
We only support scalar half-precision floating-point types as part of the vector extension, as
Adding V extension to scalar floating-point extension adds scalar half-precision (IEEE 16-bit FP) instructions
Vector Maximum Width Configuration
§ Each vector data register has a 4-bit field in the vcmaxw CSR that describes the maximum width of elements in that vector
§ Total of 32x4b=128 bits of width state held in one (RV128), two (RV64) or four (RV32) CSRs
§ Any write to vcmaxw initializes all vector unit state
Copyright © 2010–2016, The Regents of the University of California. All rights reserved.
the main benefits of half-precision are obtained when using vector instructions that amortize per-operation control overhead. Not supporting a separate scalar half-precision floating-point extension also reduces the number of standard instruction-set variants.
9.3 Vector Configuration Registers (vcmaxw, vctype, vcp)
The vector unit must be configured before use. Each architectural vector data register (v0–v31) is configured with the maximum number of bits allowed in each element of that vector data register, or can be disabled to free physical vector storage for other architectural vector data registers. The number of available vector predicate registers can also be set independently.
The available MVL depends on the configuration setting, but MVL must always have the same value for the same configuration parameters on a given implementation. Implementations must provide an MVL of at least four elements for all supported configuration settings.
Each vector data register's current maximum width is held in a separate four-bit field in the vcmaxw CSRs, encoded as shown in Table 9.3.
Width      Encoding
Disabled   0000
8          1000
16         1001
32         1010
64         1011
128        1100
Table 9.3: Encoding of vcmaxw fields. All other values are reserved.
Several earlier vector machines had the ability to configure physical vector register storage into a larger number of short vectors or a smaller number of long vectors, in particular the Fujitsu VP series [12].
In addition, each vector data register has an associated dynamic type field that is held in a four-bit field in the vctype CSRs, encoded as shown in Table 9.4. The dynamic type field of a vector data register is constrained to only hold types that have equal or lesser width than the value in the corresponding vcmaxw field for that vector data register. Changes to vctype do not alter MVL.
Vector data registers have both a maximum element width and a current element data type to support vector function calls, where the caller does not know the types needed by the callee, as described below.
To reduce configuration time, writes to a vcmaxw field also write the corresponding vctype field. The vcmaxw field can be written any value taken from the type encoding in Table 9.4, but only the width information as shown in Table 9.3 will be recorded in the vcmaxw fields whereas the full type information will be recorded in the corresponding vctype field.
Attempting to write any vcmaxw field with a width larger than that supported by the implementation will raise an illegal instruction exception. Implementations are allowed to record a vcmaxw
Vector Type Configuration
§ Each data register has its current type encoded in a 4-bit field in the vctype register
§ Writes to vcmaxw set both vcmaxw and vctype; vcmaxw retains only width, not type
§ A write to vctype zeros only the associated vector register
Type       vctype encoding   vcmaxw equivalent
Disabled   0000              0000
F16        0001              1001
F32        0010              1010
F64        0011              1011
F128       0100              1100
X8         1000              1000
X16        1001              1001
X32        1010              1010
X64        1011              1011
X128       1100              1100

Table 9.4: Encoding of vctype fields. The third column shows the value that will be saved when writing to vcmaxw fields. All other values are reserved.
value larger than the value requested. In particular, an implementation may choose to hardwire vcmaxw fields to the largest supported width.
Attempting to write an unsupported type or a type that requires more than the current vcmaxw width to a vctype field will raise an exception.
Any write to a field in the vcmaxw register configures the vector unit and causes all vector data registers to be zeroed and all vector predicate registers to be set, and the vector length register vl to be set to the maximum supported vector length.
Any write to a vctype field zeros only the associated vector data register, leaving the other vector unit state undisturbed. Attempting to write a type needing more bits than the corresponding vcmaxw value to a vctype field will raise an illegal instruction exception.
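The encodings in Tables 9.3 and 9.4 and the width check on vctype writes can be modelled in a few lines. The following is only an illustrative Python sketch, not part of the proposal: the names VCTYPE, VCMAXW, vcmaxw_equivalent and check_vctype_write are hypothetical, and a Python ValueError stands in for the illegal instruction exception.

```python
# Type encodings copied from Table 9.4; the dictionary name is illustrative.
VCTYPE = {
    "Disabled": 0b0000,
    "F16": 0b0001, "F32": 0b0010, "F64": 0b0011, "F128": 0b0100,
    "X8": 0b1000, "X16": 0b1001, "X32": 0b1010, "X64": 0b1011, "X128": 0b1100,
}

# Width encodings copied from Table 9.3 (width in bits -> vcmaxw field value).
VCMAXW = {0: 0b0000, 8: 0b1000, 16: 0b1001, 32: 0b1010, 64: 0b1011, 128: 0b1100}


def width_bits(type_name):
    """Element width in bits for a name such as 'F32' or 'X8' (0 if disabled)."""
    return 0 if type_name == "Disabled" else int(type_name[1:])


def vcmaxw_equivalent(type_name):
    """Value recorded in a vcmaxw field when this type is written to it."""
    return VCMAXW[width_bits(type_name)]


def check_vctype_write(type_name, vcmaxw_field):
    """Return the vctype encoding, or raise if the type is wider than vcmaxw allows.

    ValueError is a stand-in for the illegal instruction exception.
    """
    field_width = {enc: bits for bits, enc in VCMAXW.items()}[vcmaxw_field]
    if width_bits(type_name) > field_width:
        raise ValueError("type wider than configured maximum width")
    return VCTYPE[type_name]
```

For example, a register configured with a 64-bit maximum width accepts F32 but a register configured for 32 bits rejects F64.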
Vector registers are zeroed on reconfiguration to prevent security holes and to avoid exposing differences between how different implementations manage physical vector register storage.
In-order implementations will probably use a flag bit per register to mux in 0 instead of garbage values on each source until it is overwritten. For in-order machines, partial writes due to predication or vector lengths less than MVL complicate this zeroing, but these cases can be handled by adopting a hardware read-modify-write, adding a zero bit per element, or trapping to a machine-mode trap handler if the first write access after configuration is partial. Out-of-order machines can just point the initial rename table at a physical zero register.
In RV128, vcmaxw is a single CSR holding 32 4-bit width fields. Bits (4N + 3)–(4N) hold the maximum width of vector data register N. In RV64, the vcmaxw2 CSR provides access to the upper 64 bits of vcmaxw. In RV32, the vcmaxw1 CSR provides access to bits 63–32 of vcmaxw, while the vcmaxw3 CSR provides access to bits 127–96.
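The bit layout can be checked with a small model. This is a hypothetical Python sketch, not text from the proposal: pack_vcmaxw and split_csrs are illustrative names, and the mapping of the third 32-bit chunk (bits 95–64) to vcmaxw2 on RV32 is an assumption, since the text above only spells out vcmaxw1 and vcmaxw3.

```python
def pack_vcmaxw(fields):
    """Pack 32 four-bit width encodings into one 128-bit integer.

    fields[N] is the Table 9.3 encoding for vector data register N and
    lands in bits (4N + 3)..(4N).
    """
    if len(fields) != 32:
        raise ValueError("expected one field per vector data register")
    value = 0
    for n, enc in enumerate(fields):
        if not 0 <= enc <= 0xF:
            raise ValueError("field does not fit in four bits")
        value |= enc << (4 * n)
    return value


def split_csrs(value, xlen):
    """Split the 128-bit value into XLEN-bit chunks, least significant first.

    RV128 -> [vcmaxw]
    RV64  -> [vcmaxw, vcmaxw2]
    RV32  -> [vcmaxw, vcmaxw1, vcmaxw2 (assumed), vcmaxw3]
    """
    mask = (1 << xlen) - 1
    return [(value >> shift) & mask for shift in range(0, 128, xlen)]
```

With v0 and v1 set to 32-bit width (1010) and everything else disabled, only the lowest CSR is non-zero, which is why a narrow configuration needs a single CSR write even on RV32.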
The vcnpred CSR contains a single 4-bit WLRL field giving the number of enabled architectural predicate registers, between 0 and 8. Any write to vcnpred zeros all vector data registers, sets all bits in visible vector predicate registers, and sets the vector length register vl to the maximum supported vector length. Attempting to write a value larger than 8 to vcnpred raises an illegal instruction exception.
Vector Predicate Configuration
§ The vcnpred CSR holds the number of predicate registers (0-8)
§ A write to vcnpred initializes all vector unit state
Faster configuration
§ Setting all configuration bits directly via vcmaxw requires creating/loading long immediates and writing possibly multiple CSRs (RV32/64)
§ A vcfgd CSR alias is defined for faster writes of common vector data configurations
§ One 5-bit field per supported type, set to highest vector register number with that type or zero for none
# Vector-vector 32-bit add loop.
# Assume vector unit configured with correct types.
# a0 holds N
# a1 holds pointer to result vector
# a2 holds pointer to first source vector
# a3 holds pointer to second source vector.
loop:   setvl t0, a0      # Set vl for this iteration; t0 = elements to process
        vld   v0, a2      # Load first vector
        sll   t1, t0, 2   # Multiply element count by 4 to get bytes
        add   a2, t1      # Bump pointer
        vld   v1, a3      # Load second vector
        add   a3, t1      # Bump pointer
        vadd  v0, v1      # Add elements
        sub   a0, t0      # Decrement elements completed
        vst   v0, a1      # Store result vector
        add   a1, t1      # Bump pointer
        bnez  a0, loop    # Any more?
Figure 9.1: Example vector-vector add loop.
A corresponding vcfgdi instruction is encoded as a CSRRWI that takes a 5-bit immediate value to set the configuration, and returns MVL in the destination register.
One of the primary uses of vcfgdi is to configure the vector unit with single-byte element vectors for use in memcpy and memset routines. A single instruction can configure the vector unit for these operations.
The vcfgd instruction also clears the vcnpred register, so no predicate registers are allocated.
RV32 (32 bits):    | 0 (2) | F64 (5) | F32 (5) | F16 (5) | X32 (5) | X16 (5) | X8 (5) |
RV64 (64 bits):    | 0 (24) | F128 (5) | X64 (5) | F64 (5) | F32 (5) | F16 (5) | X32 (5) | X16 (5) | X8 (5) |
RV128 (128 bits):  | 0 (83) | X128 (5) | F128 (5) | X64 (5) | F64 (5) | F32 (5) | F16 (5) | X32 (5) | X16 (5) | X8 (5) |

Figure 9.2: Format of the vcfgd value for different base ISAs, holding 5-bit vector register numbers for each supported type (field widths in bits are shown in parentheses, most significant field first). Fields must either contain 0, indicating no vector registers are allocated for that type, or a vector register number greater than all to the right. All vector register numbers in between two non-zero fields are allocated to the type with the higher vector register number.
The vcfgd value specifies how many vector registers of each datatype are allocated, and is divided into 5-bit fields, one per supported datatype. A value of 0 in a field indicates that no registers of that type are allocated. A non-zero value indicates the highest vector
Each 5-bit field in the vcfgd value must contain either zero, indicating that no vector registers are allocated for that type, or a vector register number greater than all fields in lower bit positions, indicating the highest vector register containing the associated type. This encoding can compactly
Fast configuration example
Field   0    F64   F32   F16   X32   X16   X8
Value   0    18    12    0     1     0     0

Vector registers   vcmaxw   vctype   Type
v31–v19            0000     0000     Disabled
v18–v13            1011     0011     F64
v12–v2             1010     0010     F32
v1–v0              1010     1010     X32
Figure 9.3: Example use of vcfgd value to set configuration.
represent any arbitrary allocation of vector registers to data types, except that there must be at least two vector registers (v0 and v1) allocated to the narrowest required type. An example allocation is shown in Figure 9.3.
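The allocation rule can be made concrete with a small decoder. This is a hedged Python sketch for the RV32 layout of Figure 9.2 only; encode_vcfgd, decode_vcfgd and RV32_FIELDS are hypothetical names, and ValueError stands in for whatever the hardware does with a malformed value.

```python
# RV32 field order from Figure 9.2, listed from the lowest bit position upward.
RV32_FIELDS = ["X8", "X16", "X32", "F16", "F32", "F64"]


def encode_vcfgd(highest):
    """Build a vcfgd value from {type name: highest vector register number}."""
    value = 0
    for pos, name in enumerate(RV32_FIELDS):
        value |= (highest.get(name, 0) & 0x1F) << (5 * pos)
    return value


def decode_vcfgd(value, fields=RV32_FIELDS):
    """Return a list of 32 type names, one per vector data register."""
    alloc = ["Disabled"] * 32
    next_free = 0  # lowest register not yet given a type
    for pos, name in enumerate(fields):
        top = (value >> (5 * pos)) & 0x1F
        if top == 0:
            continue  # no registers of this type
        if top < next_free:
            raise ValueError("field must exceed all fields in lower positions")
        for reg in range(next_free, top + 1):
            alloc[reg] = name
        next_free = top + 1
    return alloc
```

Decoding the Figure 9.3 value (X32 = 1, F32 = 12, F64 = 18) gives v0–v1 as X32, v2–v12 as F32, v13–v18 as F64 and the rest disabled. Because a field value of 0 means "none", the lowest allocated type always covers at least v0 and v1.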
Separate vcfgp and vcfgpi instructions are provided, using the CSRRW and CSRRWI encodings respectively, that write the source value to the vcnpred register and return the new MVL. These writes also clear the vector data registers, set all bits in the allocated predicate registers, and set vl=MVL. A vcfgp or vcfgpi instruction can be used after a vcfgd to complete a reconfiguration of the vector unit.
If a zero argument is given to vcfgd, the vector unit will be unconfigured with no enabled registers, and the value 0 will be returned for MVL. Only the configuration registers vcmaxw and vcnpred can be accessed in this state, either directly or via vcfgd, vcfgdi, vcfgp, or vcfgpi instructions. Other vector instructions will raise an illegal instruction exception.
To quickly change the individual types of a vector register, each vector data register n has a dedicated CSR address to access its vctype field, named vctypevn. The vcfgt and vcfgti instructions are assembler pseudo-instructions for regular CSRRW and CSRRWI instructions that update the type fields and return the original value. The vcfgti instruction is typically used to change to a desired type while recording the previous type in one instruction, and the vcfgt instruction is used to revert back to the saved type.
Maximum Vector Length
§ Setting vcmaxw and vcnpred determines current maximum vector length (MVL)
  - vctype does not affect MVL
§ Any change to vcmaxw or vcnpred initializes all vector unit state
  - Must not rely on state in between reconfigurations
  - Gives flexibility to implementations
  - Avoid security holes from leaking state
§ CSRRW / CSRRWI instructions to change vcmaxw/vcnpred return resulting MVL
  - This is different from plain CSRRW, which returns the old value
  - Most code will not use MVL directly
Set Vector Length
§ Active vector length held in vl CSR, a WARL register holding values between 0 and MVL inclusive.
§ Any configuration changes initialize vl to MVL.
§ Usually vl is modified with the setvl instruction, encoded as a CSRRW/CSRRWI instruction
§ Source argument to setvl is application vector length (AVL); returns value placed in vl
AVL value            vl setting
AVL ≥ 2MVL           MVL
2MVL > AVL > MVL     ⌊AVL/2⌋
MVL ≥ AVL            AVL

Table 9.5: Operation of setvl instruction to set vector length register vl based on requested application vector length (AVL) and current maximum vector length (MVL).
9.4 Vector Length
The active vector length is held in the XLEN-bit WARL vector length CSR vl, which can only hold values between 0 and MVL inclusive. Any writes to the maximum configuration registers (vcmaxw or vcnpred) cause vl to be initialized with MVL. Writes to vctype do not affect vl.
The active vector length is usually written with the setvl instruction, which is encoded as a csrrw instruction to the vl CSR number. The source argument to the csrrw is the requested application vector length (AVL) as an unsigned XLEN-bit integer. The setvl instruction calculates the value to assign to vl according to Table 9.5.
The rules for setting the vl register help keep vector pipelines full over the last two iterations of a stripmined loop. Similar rules were previously used in Cray-designed machines [4].
The vl register is updated with the minimum of AVL and MVL, and this value is also returned as the result of the setvl instruction. Note that unlike a regular csrrw instruction, the value returned is not the original CSR value but the modified value.
The idea of having implementation-defined vector length dates back to at least the IBM 3090 Vector Facility [3], which used a special "Load Vector Count and Update" (VLVCU) instruction to control stripmine loops. The setvl instruction included here is based on the simpler setvlr instruction introduced by Asanovic [2].
The setvl instruction is typically used at the start of every iteration of a stripmined loop to set the number of vector elements to operate on in the following loop iteration. The current MVL can be obtained by performing a setvl with a source argument that has all bits set (largest unsigned integer).
No element operations are performed for any vector instruction when vl=0.
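A short model shows how the vl rule behaves over a stripmined loop. This is an illustrative Python sketch, not the specified implementation: it follows Table 9.5, which differs from a plain min(AVL, MVL) when AVL lies strictly between MVL and 2MVL, and the function names setvl and stripmine are chosen here only for the example.

```python
def setvl(avl, mvl):
    """Return the value placed in vl for a requested AVL, following Table 9.5."""
    if avl >= 2 * mvl:
        return mvl
    if avl > mvl:
        return avl // 2  # split the remainder over the last two iterations
    return avl


def stripmine(n, mvl):
    """Return the vl used on each iteration of a stripmined loop over n elements."""
    if mvl < 1:
        raise ValueError("vector unit must be configured (MVL >= 1)")
    lengths = []
    while n != 0:
        vl = setvl(n, mvl)
        lengths.append(vl)
        n -= vl
    return lengths
```

With MVL = 4, ten elements are processed as 4, 3, 3 rather than 4, 4, 2, so the last two iterations stay balanced. Passing an all-ones AVL returns MVL, matching the query idiom described above.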
9.5 Rapid Configuration Instructions
It can take several instructions to set vcmaxw, vctype and vcnpred to a given configuration. To accelerate configuring the vector unit, specialized vcfg instructions are added that are encoded as writes to CSRs with encoded immediate values that set multiple fields in the vcmaxw, vctype, and vcnpred configuration registers.
The vcfgd instruction is encoded as a CSRRW that takes a register value encoded as shown in Figure 9.2, and which returns the corresponding MVL in the destination register.
32-bit integer vector-vector add example
        vcfgd 2*X32          # Only need two vector registers
stripmine:
        vsetvl t0, a0        # a0 holds vector length
        vld v0, a1           # Get first vector
        vld v1, a2           # Get second vector
        vadd v1, v0          # Add vectors
        vst v1, a3           # Store result vector
        sll t1, t0, 2        # Multiply count by 4 to get bytes
        add a1, t1           # Bump pointers
        add a2, t1
        add a3, t1
        sub a0, t0           # Subtract number done
        bnez a0, stripmine   # Any more?
        vuncfg               # Turn off vector unit by zeroing config
16-bit integer vector-vector add example
        vcfgd 2*X16          # Only need two vector registers
stripmine:
        vsetvl t0, a0        # a0 holds vector length
        vld v0, a1           # Get first vector
        vld v1, a2           # Get second vector
        vadd v1, v0          # Add vectors
        vst v1, a3           # Store result vector
        sll t1, t0, 1        # Multiply count by 2 to get bytes
        add a1, t1           # Bump pointers
        add a2, t1
        add a3, t1
        sub a0, t0           # Subtract number done
        bnez a0, stripmine   # Any more?
        vuncfg               # Turn off vector unit by zeroing config
16-bit + 32-bit vector add example
        vcfgd 1*X32|1*X16
stripmine:
        vsetvl t0, a0        # a0 holds vector length
        vld v0, a1           # Get first 16-bit vector
        vld v1, a2           # Get second 32-bit vector
        vadd v1, v0          # Add vectors
        vst v1, a3           # Store result vector
        sll t1, t0, 1        # Multiply count by 2 to get bytes (16-bit elements)
        sll t2, t0, 2        # Multiply count by 4 to get bytes (32-bit elements)
        add a1, t1           # Bump pointers
        add a2, t2
        add a3, t2
        sub a0, t0           # Subtract number done
        bnez a0, stripmine   # Any more?
        vuncfg               # Turn off vector unit by zeroing config
Vector Length Portability
§ Same binary code works regardless of:
  - Number of physical register bits
  - Number of physical lanes
  - Mixed-precision packing strategy
§ Architecture guarantees minimum vector length of four regardless of configuration to avoid stripmine overhead for short vectors
  - E.g., if using 32 * 64-bit vector registers,
  - need 128 * 8-byte physical element registers
  - 1KB SRAM
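The 1KB figure follows directly from the minimum-MVL guarantee. The Python sketch below reproduces that arithmetic; min_storage_bytes is a hypothetical helper, and mvl_estimate models just one possible packing strategy (storage divided evenly across enabled registers), which is an assumption since the actual MVL is implementation-defined.

```python
def min_storage_bytes(num_regs, elem_bits, min_mvl=4):
    """Physical storage needed to give every register min_mvl elements."""
    return num_regs * min_mvl * elem_bits // 8


def mvl_estimate(storage_bytes, enabled_widths_bits):
    """Hypothetical MVL if storage is split evenly across the enabled registers.

    enabled_widths_bits lists the configured maximum width of each enabled
    register. Real implementations may pack differently.
    """
    bits_per_element_slice = sum(enabled_widths_bits)
    if bits_per_element_slice == 0:
        return 0  # unconfigured vector unit
    return storage_bytes * 8 // bits_per_element_slice
```

Under this assumed packing, the same 1KB of storage that gives MVL = 4 with all 32 registers at 64 bits would give MVL = 128 when only two 32-bit registers are enabled, which is the motivation for a reconfigurable register file.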
Polymorphic Instruction Encoding
§ Single signed integer ADD opcode works on different size inputs and outputs
  - Size of inputs and outputs inherent in register number
  - Sign-extend smaller input
  - Modulo arithmetic on overflow to destination
  - Restrict supported combinations to simplify hardware
§ Integer, Fixed-point, Floating-point arithmetic
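The sign-extend and modulo rules for the polymorphic integer ADD can be modelled per element. This is a minimal Python sketch under the assumption that element widths come from the configured register types; to_signed and poly_add are illustrative names, not instructions from the proposal.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def poly_add(a, a_bits, b, b_bits, dest_bits):
    """Add two raw element bit patterns of possibly different widths.

    Each input is sign-extended from its own width, and the sum wraps
    modulo 2**dest_bits. The result is returned as a raw bit pattern.
    """
    total = to_signed(a, a_bits) + to_signed(b, b_bits)
    return total & ((1 << dest_bits) - 1)
```

For instance, a 16-bit element holding -1 added to a 32-bit element holding 5 yields 4 in a 32-bit destination, while 0x7FFF + 1 into a 16-bit destination wraps to 0x8000.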
Vector Loads and Stores
Addressing modes:
§ Unit-stride (scalar base)
§ Constant stride (scalar base, scalar stride)
§ Indexed (scalar base, vector offset)
Types inherent in destination register number (for integers, signed/unsigned determined at use)
Support vector AMOs:
§ E.g., vector fetch-and-add
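The three addressing modes differ only in how each element's address is formed. The Python sketch below lists the per-element addresses for each mode; the function names are hypothetical, and treating strides and offsets as byte quantities is an assumption the slides do not state.

```python
def unit_stride(base, vl, elem_bytes):
    """Addresses of vl consecutive elements starting at base."""
    return [base + i * elem_bytes for i in range(vl)]


def constant_stride(base, stride_bytes, vl):
    """Addresses of vl elements separated by a fixed scalar stride (assumed bytes)."""
    return [base + i * stride_bytes for i in range(vl)]


def indexed(base, offsets, vl):
    """Addresses formed from a scalar base plus a per-element vector offset (assumed bytes)."""
    return [base + offsets[i] for i in range(vl)]
```

Unit-stride is the special case of constant stride where the stride equals the element size, and indexed mode covers gather/scatter patterns.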
Vector Predication
§ Up to eight vector predicate registers p0-p7, one bit per element
§ Logical operations between predicate registers
§ All vector instructions are predicated under p0
  - Implicit predicate due to encoding constraints
§ Instruction to swap two predicate registers
  - Reduce overhead of scheduling complex control flow
  - Can implement just in rename table if OoO core
§ Popcount instruction returns number of active bits in predicate register to scalar integer register
  - Used for divergent control flow optimizations
§ Other cross-element flag operations to support complex loop optimizations
§ Support for software vector length speculation
Vector Predication and Vector Register Renaming
Previous approaches in vector archs:
1) Destination has old value if predicate false
  - Simpler spec, better for in-order/no renaming
  - Have to copy old value to new destination with renaming
2) Destination has zero value if predicate false
  - Better for out-of-order with renaming
  - Need additional merge(s) to rebuild complete vector
3) Destination has undefined value if predicate false
  - More complex code, better for out-of-order with renaming
  - Need additional merge(s) to rebuild complete vector
  - Messy definition
§ We're choosing 1), as simpler and safer.
§ Use microarchitectural tricks for OoO machines to reduce amount of data transfer.
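The difference between policies 1) and 2) can be seen on a single predicated operation. The following Python sketch is illustrative only, with hypothetical function names; it models vectors as lists and shows that policy 2) followed by an explicit merge reproduces what policy 1) delivers directly.

```python
def write_keep_old(dest, result, pred):
    """Policy 1: elements with a false predicate keep the old destination value."""
    return [r if p else d for d, r, p in zip(dest, result, pred)]


def write_zero(result, pred):
    """Policy 2: elements with a false predicate are written as zero."""
    return [r if p else 0 for r, p in zip(result, pred)]


def merge(pred, on_true, on_false):
    """Select per element between two vectors under a predicate."""
    return [t if p else f for p, t, f in zip(pred, on_true, on_false)]
```

Under policy 1) the complete vector is available after one instruction, whereas policy 2) needs the additional merge step to rebuild it.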
Vector Function Calls
§ In auto-vectorized code, want to make vector calls to function library with separate vector calling convention
  - Args in vector registers
  - Active elements communicated by vector length and vp0
§ Need to abstract callee register usage from caller
§ Caller has to allocate registers for callee to use
§ Set vcmaxw to largest value, then callee can change type with vctype
§ Vector runtime can optimize calling convention within vector runtime library
for (i=0; i<N; i++) x[i] = exp(y[i]/z[i]);
OpenCL / CUDA / SPMD programming
§ Not a great programming model, should move community back to autovectorization/autoparallelization, but needed for compatibility
§ Predication used to handle divergent control flow
  - See Yunsup's thesis
§ Configuration must be set at kernel launch to maximum width used anywhere in call tree
§ Need general vector function call capability with standard callee/caller save protocol
OS Support
§ Restartable page faults via microcode state dump, opaque to OS
  - Similar to DEC Vector VAX implementation
  - If implementation has precise traps, can skip
§ Privileged specification describes XS sstatus field used to encode coprocessor status (Off, Initial, Clean, Dirty) to reduce context save/restore overhead.
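How the four status values cut context-switch cost can be sketched as a simple decision function. This Python sketch is an assumption-laden illustration: the numeric encodings 0 to 3 follow the convention the privileged specification uses for its FS/XS fields and should be checked against the version in use, and must_save, status_after_save and status_after_write are hypothetical helper names, not OS or hardware interfaces.

```python
from enum import IntEnum


class XS(IntEnum):
    """Extension status values named on the slide; numeric values are assumed."""
    OFF = 0
    INITIAL = 1
    CLEAN = 2
    DIRTY = 3


def must_save(status):
    """Vector state needs saving on a context switch only if it was modified."""
    return status == XS.DIRTY


def status_after_save(status):
    """Once saved, dirty state matches its in-memory copy and becomes clean."""
    return XS.CLEAN if status == XS.DIRTY else status


def status_after_write(status):
    """Any write to vector state while the unit is enabled marks it dirty."""
    if status == XS.OFF:
        raise ValueError("vector unit is off; access would trap")
    return XS.DIRTY
```

A process that never touches the vector unit stays in Off or Initial, so the OS can skip saving the potentially large vector register file for it.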
Questions?