The present technique relates to the field of data processing.
An instruction set architecture (ISA) defines the set of instructions available to a software developer or compiler when developing a particular software program, and in a corresponding way defines the set of instructions which need to be supported by a processor implementation in hardware to allow the hardware to be compatible with software written according to the ISA. For example, the ISA may define, for each instruction, the encoding of the instruction, a representation of its input operands and result value, and the functions for mapping the input operands to the result of the instruction.
A vector ISA supports at least one vector instruction which operates on a vector operand comprising two or more independent vector elements represented within a single register, and/or generates a vector result comprising two or more independent vector elements. The vector instruction can be processed in a SIMD (single instruction, multiple data) fashion to allow multiple independent calculations to be performed on different data values in response to a single instruction. Vector instructions can be useful, for example, to allow a scalar loop of instructions written in high level code to be vectorised so that processing corresponding to multiple loop iterations can be performed in response to a single iteration of the vectorised loop. This helps to improve performance by reducing the number of instructions which need to be fetched, decoded and executed to carry out a certain amount of data processing.
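By way of illustration only (a sketch not drawn from the examples below, using Neon intrinsics for concreteness), a scalar loop and a 4-lane vectorised equivalent might look like:

```c
#include <arm_neon.h>

/* Scalar loop: one 32-bit addition per iteration. */
void add_scalar(float *x, const float *y, int n) {
    for (int i = 0; i < n; i++)
        x[i] += y[i];
}

/* Vectorised equivalent: each 128-bit Neon instruction performs four
 * independent 32-bit additions, so one iteration of the vectorised loop
 * covers four iterations of the scalar loop (remainder handling omitted). */
void add_vector(float *x, const float *y, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vst1q_f32(x + i, vaddq_f32(vx, vy));
    }
}
```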
At least some examples provide an apparatus comprising: processing circuitry to perform data processing; and instruction decoding circuitry to control the processing circuitry to perform the data processing in response to decoding of program instructions defined according to a scalable vector instruction set architecture supporting vector instructions operating on vectors of scalable vector length to enable the same instruction sequence to be executed on apparatuses with hardware supporting different maximum vector lengths; in which: the instruction decoding circuitry and the processing circuitry are configured to support a sub-vector-supporting instruction which treats a given vector as comprising a plurality of sub-vectors with each sub-vector comprising a plurality of vector elements, each sub-vector having an equal sub-vector length; and in response to the sub-vector-supporting instruction, the instruction decoding circuitry is configured to control the processing circuitry to perform an operation for the given vector at sub-vector granularity.
At least some examples provide a method comprising: decoding, using instruction decoding circuitry, program instructions defined according to a scalable vector instruction set architecture supporting vector instructions operating on vectors of scalable vector length to enable the same instruction sequence to be executed on apparatuses with hardware supporting different maximum vector lengths; and controlling processing circuitry to perform data processing in response to decoding of the program instructions; in which: the instruction decoding circuitry and the processing circuitry support a sub-vector-supporting instruction which treats a given vector as comprising a plurality of sub-vectors with each sub-vector comprising a plurality of vector elements, each sub-vector having an equal sub-vector length; and in response to the sub-vector-supporting instruction, the instruction decoding circuitry controls the processing circuitry to perform an operation for the given vector at sub-vector granularity.
At least some examples provide a computer program to control a host data processing apparatus to provide an instruction execution environment for execution of target code; the computer program comprising: instruction decoding program logic to decode instructions of the target code to control the host data processing apparatus to perform data processing in response to the instructions of the target code; in which: the instruction decoding program logic supports decoding of program instructions defined according to a scalable vector instruction set architecture supporting vector instructions operating on vectors of scalable vector length to enable the same instruction sequence to be executed on apparatuses with hardware supporting different maximum vector lengths; the instruction decoding program logic comprises sub-vector-supporting instruction decoding program logic to decode a sub-vector-supporting instruction which treats a given vector as comprising a plurality of sub-vectors with each sub-vector comprising a plurality of vector elements, each sub-vector having an equal sub-vector length; and in response to the sub-vector-supporting instruction, the instruction decoding program logic is configured to control the host data processing apparatus to perform an operation for the given vector at sub-vector granularity.
The computer program may be stored on a storage medium. The storage medium may be a non-transitory storage medium.
Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings.
An apparatus has processing circuitry to perform data processing, and instruction decoding circuitry to control the processing circuitry to perform the data processing in response to decoding of program instructions defined according to a scalable vector instruction set architecture (ISA). The scalable vector ISA (also known as a “vector length agnostic” vector ISA) supports vector instructions operating on vectors of scalable vector length to enable the same instruction sequence to be executed on apparatuses with hardware supporting different maximum vector lengths. This is useful because it allows different hardware designers of processor implementations to choose different maximum vector lengths depending on whether their design priority is high-performance or reduced circuit area and power consumption, while software developers need not tailor their software to a particular hardware platform as the software written according to the scalable vector ISA can be executed across any hardware platform supporting the scalable vector ISA, regardless of the particular maximum vector length supported by a particular hardware platform. Hence, the vector length to be used for a particular vector instruction of the scalable vector ISA is unknown at compile time (neither defined to be fixed in the ISA itself, nor specified by a parameter of the software itself). The operations performed in response to a given vector instruction of the scalable vector ISA may differ depending on the vector length chosen for a particular hardware implementation (e.g. hardware supporting a greater maximum vector length may process a greater number of vector elements for a given vector instruction than hardware supporting a smaller maximum vector length). An implementation with a shorter vector length may therefore require a greater number of loop iterations to carry out a particular function than an implementation with a longer vector length.
While the scalable vector ISA can be very useful to enable development of platform-independent program code which can easily be ported between processor implementations with differing maximum vector lengths, there may be a significant amount of legacy code which was compiled assuming a known vector length. It may take a considerable amount of software development effort to redevelop the legacy software for use with the scalable vector ISA. This is particularly the case because some software techniques typically used to improve performance for vectorised code, such as software pipelining or loop unrolling, may rely on the vector length used for the instructions being known at compile time. Therefore, it may not be straightforward to remap instructions of the non-scalable vector ISA to instructions of the scalable vector ISA as some techniques used in legacy code may not be available in the scalable vector ISA. This may be a significant barrier to adoption of the scalable vector ISA and may result in some software developers choosing not to use the scalable vector ISA, so that a significant amount of software executing on newer processors supporting the scalable vector ISA may still use a less performance-efficient non-scalable vector ISA which uses a relatively short maximum vector length, even on a processor implementation supporting a large maximum vector length using the scalable vector ISA (that processor implementation may also support the non-scalable vector ISA for backwards compatibility reasons). This means that the full performance capability of the hardware may not be used for many programs.
In the examples discussed below, the instruction decoding circuitry and the processing circuitry support, within the scalable vector ISA, a sub-vector-supporting instruction which treats a given vector as comprising a plurality of sub-vectors with each sub-vector comprising a plurality of vector elements and each sub-vector having equal sub-vector length. In response to the sub-vector-supporting instruction, the instruction decoding circuitry controls the processing circuitry to perform an operation for the given vector at sub-vector granularity. This helps to reduce the software development effort required to enable adoption of the scalable vector ISA because each vector of a non-scalable vector ISA can be mapped to one of the sub-vectors of a vector in the scalable ISA. This makes mapping non-scalable software to scalable software simpler and therefore reduces the barrier to use of the scalable vector ISA, making it more likely that a greater fraction of code executing on an apparatus supporting the scalable vector ISA is actually using the scalable vector ISA, which will tend to improve average performance across a range of processor implementations because those higher-end processors which support longer vector lengths may be more likely to be able to make use of those longer vector lengths to improve performance.
Each sub-vector may have a sub-vector length which is known at compile time for a given instruction sequence to be executed using the sub-vector-supporting instruction. This is helpful because it allows a vector operand or result of the scalable vector ISA to be defined as a “vector of vectors” comprising a number of smaller sub-vectors each of known vector length, so that multiple vectors defined according to software written using a non-scalable vector ISA can be combined into a larger vector according to the scalable vector ISA. As the length of each sub-vector is known at compile time, any performance-improving software techniques relying on knowledge of vector length at compile time can be applied at the granularity of sub-vectors, making it much easier for a software developer to map code written according to the non-scalable vector ISA into code defined according to the scalable vector ISA while retaining those software techniques.
The number of sub-vectors comprised by the given vector is unknown at compile time for a given instruction sequence to be executed using the sub-vector-supporting instruction. In other words, the overall vector length for the given vector may be a scalable vector length as permitted by the scalable vector ISA (which may define a variety of maximum vector lengths which may be permitted for different processor implementations). This means that the sub-vector-supporting instruction can still benefit from the platform-independent properties of the scalable vector ISA while supporting a range of different performance/power points. Nevertheless, software optimisations which rely on knowledge of a vector length to be used as a granularity for vectorisation can still be used, because they can be applied with reference to the sub-vectors of known sub-vector length, with a variable number of such sub-vectors then being accommodated in a larger given vector of scalable vector length using the sub-vector-supporting instruction. For example, a variable number of iterations of a vectorised loop which previously would have been implemented in vectorised form using a non-scalable vector ISA can be mapped to a vectorised loop in the scalable vector ISA, with each iteration of the original vectorised loop corresponding to one of the sub-vectors of the vectors processed by sub-vector-supporting instructions of the scalable vector ISA. This makes compilation and software development of code much simpler as it may avoid the need to revert from vectorised non-scalable code to a scalar loop before converting the scalar loop back into vectorised scalable code according to the scalable ISA; instead it may be simpler to map vector instructions of the non-scalable ISA directly to scalable vector instructions of the scalable ISA without an intervening devectorisation step (other compilers may simply compile directly for the scalable ISA, including sub-vector-supporting instructions, without basing the compilation on previous code compiled for a non-scalable ISA). The sub-vector length known at compile time may be independent of the vector length used for the given vector, so that the sub-vector length is the same for a given instruction of a given piece of software regardless of the actual vector length used for the overall given vector by a given hardware implementation.
In response to the sub-vector-supporting instruction, the instruction decoding circuitry may control the processing circuitry to process each of the sub-vectors in response to the same instance of executing the sub-vector-supporting instruction. For example, each of the operations performed at sub-vector granularity could be processed in parallel. Alternatively, the operations performed at sub-vector granularity could be processed sequentially, or in part sequentially and in part in parallel, or in a pipelined manner. Regardless of the exact timing with which the various operations are performed at sub-vector granularity, each of the operations at sub-vector granularity is performed in response to a single instance of execution of the sub-vector-supporting instruction, so that the SIMD benefits of vector ISAs can be realised.
There are a number of ways in which the sub-vector length could be known at compile time. For some implementations of the scalable vector ISA, the ISA may define the sub-vector length as a variable software-defined parameter which can be specified by the software code itself, so as to allow the software to select between two or more different sub-vector sizes. For example, this could help provide support for remapping code from two or more different non-scalable vector ISAs with different non-scalable vector lengths.
However, in one example each sub-vector may have a sub-vector length of an architecturally-defined fixed size which is independent of a vector length used for the given vector. This can simplify the architecture and avoid any need for software to specify the sub-vector length. Instead, the sub-vector length may be fixed in the architecture definition of the sub-vector-supporting instruction defined in the scalable vector ISA.
The architecturally-defined fixed size may correspond to an architecturally-defined maximum vector length prescribed for vector instructions processed according to a predetermined non-scalable vector ISA. For example, the predetermined non-scalable vector ISA could be the “Advanced SIMD” architecture (also known as the Neon™ architecture) provided by Arm® Limited of Cambridge, UK.
For example, the architecturally-defined fixed size of the sub-vector length may be 128 bits. This is useful for compatibility with the Advanced SIMD architecture which defines a maximum vector length of 128 bits.
It is also possible to implement sub-vector-supporting instructions targeting other non-scalable vector ISAs, in which case the sub-vector length may vary depending on the particular ISA targeted.
Also, it is possible to implement the sub-vector-supporting instructions without intending to target any particular non-scalable vector ISA, but simply to choose a given fixed sub-vector length independent of any aim to emulate a length used in a particular non-scalable vector ISA. Even if there is no particular non-scalable vector ISA being targeted, it can still be useful to define the sub-vectors to have a sub-vector length of an architecturally-defined fixed size, to enable software developers and compilers to make use of software performance optimisations that depend on compile-time knowledge of that fixed size.
The sub-vector-supporting instruction may support a variable element size, so that each vector element of each sub-vector has a variable element size selected from two or more different sizes supported by the scalable vector ISA. The sub-vector length may be independent of which element size is used for each vector element within each sub-vector. Hence, the same sub-vector length may be used regardless of whether the selected element size is a larger element size or a smaller element size. The number of vector elements defined per sub-vector may correspond to the ratio between the sub-vector length and the selected element size. The selected element size may be defined by a software-specified parameter of the instruction sequence comprising the sub-vector-supporting instruction, and so is known at compile time. Hence, as both the sub-vector length and the element size may be known at compile time, the number of vector elements provided per sub-vector may also be known at compile time, although the total number of vector elements in the given vector as a whole may be unknown at compile time because the overall vector length of the given vector is unknown at compile time according to the scalable vector ISA.
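As a worked illustration (a minimal sketch; the 128-bit sub-vector size and 32-bit element size are example parameters consistent with the discussion above, not prescribed values), the compile-time and run-time quantities can be modelled as:

```c
#include <stddef.h>
#include <stdio.h>

enum { SUBVEC_BITS = 128 };  /* architecturally fixed, known at compile time */

/* Elements per sub-vector: known at compile time once the (software-specified)
 * element size is chosen. */
static size_t elems_per_subvector(size_t elem_bits) {
    return SUBVEC_BITS / elem_bits;
}

/* Sub-vectors per vector: depends on the implemented vector length, so on a
 * scalable architecture it is only known at run time. */
static size_t subvectors_per_vector(size_t vl_bits) {
    return vl_bits / SUBVEC_BITS;
}

int main(void) {
    size_t vl_bits = 512;  /* one possible hardware choice, unknown when compiling */
    printf("%zu sub-vectors of %zu 32-bit elements each\n",
           subvectors_per_vector(vl_bits), elems_per_subvector(32));
    /* prints: 4 sub-vectors of 4 32-bit elements each */
    return 0;
}
```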
The operation performed at sub-vector granularity may vary. A number of different sub-vector-supporting instructions can be defined as part of the scalable vector ISA to enable a number of different operations to be performed at sub-vector granularity depending on the choice of the programmer or compiler.
In some examples, for at least one sub-vector-supporting instruction, the operation performed at sub-vector granularity is an operation performed, for each sub-vector, on vector elements within that sub-vector, independent of elements in other sub-vectors. Hence, this can allow operations, which in a non-scalable vector ISA would have been performed using a number of separate vector instructions (each operating on a respective vector of non-scalable vector length known at compile time), to be performed in a single sub-vector-supporting instruction (e.g. those separate vector instructions may correspond to different iterations of a loop written in the code of the non-scalable vector ISA).
In some examples, for at least one sub-vector-supporting instruction, the operation performed at sub-vector granularity is an operation performed, for each element position within a sub-vector, on respective vector elements at that element position within each of the plurality of sub-vectors. Such an instruction could be useful, for example, to replicate processing which would have been performed using a sequence of vector instructions within a single loop iteration where those vector instructions would have combined data values in the corresponding element positions within a number of vector operands specified for that sequence of vector instructions.
In some examples, for at least one sub-vector-supporting instruction, the operation performed at sub-vector granularity is an operation to set, or perform an operation depending on, selected predicate bits of a predicate value, where the selected predicate bits are predicate bits corresponding to sub-vector-sized portions of a vector. This may differ from many instructions of the scalable vector ISA which may set or interpret the predicate at granularity of individual vector elements smaller than the sub-vector length. Such sub-vector-supporting predicate-setting or predicate-dependent instructions can be useful for allowing processing of entire sub-vectors to be selectively masked in certain instances, e.g. in conditions where corresponding code of a non-scalable vector ISA would have masked out an entire iteration of a vectorised loop performed on vectors of a particular vector length known at compile time.
In some examples, the scalable vector ISA may support at least one variant of a sub-vector-supporting permute instruction. In response to a sub-vector-supporting permute instruction, the instruction decoder controls the processing circuitry to set, for each sub-vector of a vector result, the sub-vector to a permutation of one or more vector elements selected from among vector elements within a correspondingly-positioned sub-vector of at least one vector operand. The sub-vector-supporting permute instruction could be incapable of setting a vector element of a given sub-vector of the vector result based on bits of vector elements at a different, non-corresponding, sub-vector position within one of the vector operands. By performing the permutation at sub-vector granularity, rather than across the entire vector length of the given vector as a whole, this may allow the behaviour of the sub-vector-supporting permute instruction to mirror behaviour of a number of separate non-scalable permute instructions defined in a non-scalable vector ISA which assume a known vector length for the permutation, while still enabling the number of such sub-vector-granularity permutations performed for a given instruction to be scaled based on the implemented vector length supported in hardware according to the scalable vector ISA.
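A hedged illustration of the behaviour described follows (a C model rather than architecture pseudocode; the choice of an element-reverse permutation and of 32-bit elements is purely by way of example):

```c
#include <stddef.h>
#include <stdint.h>

/* C model of a sub-vector-granularity permute: here the permutation reverses
 * the four 32-bit elements within each 128-bit sub-vector. No element ever
 * crosses a sub-vector boundary. */
void subvec_reverse32(uint32_t *dst, const uint32_t *src, size_t vl_bits) {
    const size_t epsv = 128 / 32;      /* elements per sub-vector: 4        */
    const size_t nsv = vl_bits / 128;  /* sub-vector count: scales with VL  */
    for (size_t sv = 0; sv < nsv; sv++)
        for (size_t e = 0; e < epsv; e++)
            dst[sv * epsv + e] = src[sv * epsv + (epsv - 1 - e)];
}
```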
In some examples, the scalable vector ISA may support at least one variant of a sub-vector-supporting reduction instruction. In response to a sub-vector-supporting reduction instruction, the instruction decoder may control the processing circuitry to perform at least one reduction operation at sub-vector granularity, each reduction operation to reduce a plurality of vector elements of an operand vector to a single data value within a result. Such a reduction operation when performed at granularity of an individual sub-vector may give a different result to a corresponding reduction operation performed across the entire vector length, as vector elements in different sub-vectors may not be combined with each other. Such reductions can be useful to emulate processing which in a sequence of non-scalable vectorised code might have been implemented by a sequence of multiple instructions within a given loop iteration or a series of loop iterations each comprising a single instance of an instruction to combine each element of a vector operand with corresponding elements of an accumulator value tracking the result of similar combinations in any preceding loop iterations.
Different variants of a sub-vector-supporting reduction instruction are possible, which vary in the way in which the reduction is performed at sub-vector granularity.
For example, for an intra-sub-vector sub-vector-supporting reduction instruction, for each reduction operation the plurality of vector elements comprise the respective vector elements within a corresponding sub-vector of the operand vector. Including such an instruction in the ISA can be useful to allow a software developer to use the instruction to emulate, in a scalable architecture with unknown vector length at compile time, behaviour of non-scalable code which performed the reductions across all the vector elements of a single vector operand. The result of each sub-vector granularity reduction could be placed in a different sub-vector of the result value. Alternatively, a variant of the intra-sub-vector sub-vector-supporting reduction instruction could place the result of each sub-vector granularity reduction in respective vector elements of one or more sub-vectors of the result value.
In another variant, for an inter-sub-vector sub-vector-supporting reduction instruction, for each reduction operation the plurality of vector elements comprise the vector elements at corresponding element positions within a plurality of sub-vectors of the operand vector. For this type of instruction, the vector elements that are reduced to a single result are vector elements selected at intervals of the sub-vector length. Including such an instruction in the ISA can be useful to allow a software developer to use the instruction to emulate, in a scalable architecture with unknown vector length at compile time, behaviour of non-scalable code which performed the reductions across the vector elements at the same element position within vectors processed in a number of successive loop iterations.
The scalable vector ISA may support any one or more of these types of sub-vector-supporting reduction instructions.
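The difference between the two variants can be illustrated with a C model (a sketch only, assuming 128-bit sub-vectors of four 32-bit elements and an add reduction by way of example):

```c
#include <stddef.h>
#include <stdint.h>

/* Intra-sub-vector reduction: the four 32-bit elements within each 128-bit
 * sub-vector are summed to a single value; elements in different sub-vectors
 * are never combined. One result per sub-vector. */
void reduce_intra(uint32_t *dst, const uint32_t *src, size_t nsv) {
    for (size_t sv = 0; sv < nsv; sv++) {
        uint32_t acc = 0;
        for (size_t e = 0; e < 4; e++)
            acc += src[sv * 4 + e];
        dst[sv] = acc;
    }
}

/* Inter-sub-vector reduction: for each element position, the elements at that
 * position in every sub-vector (i.e. elements selected at intervals of the
 * sub-vector length) are summed. One result per element position. */
void reduce_inter(uint32_t dst[4], const uint32_t *src, size_t nsv) {
    for (size_t e = 0; e < 4; e++) {
        uint32_t acc = 0;
        for (size_t sv = 0; sv < nsv; sv++)
            acc += src[sv * 4 + e];
        dst[e] = acc;
    }
}
```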
In some examples, the scalable vector ISA may support at least one variant of a sub-vector-supporting load/store instruction. In response to a sub-vector-supporting load/store instruction, the instruction decoder may control the processing circuitry to perform a load/store operation to transfer, at sub-vector granularity, one or more sub-vectors between a memory system and at least one vector register. This can be useful to emulate behaviour of load/store instructions of a non-scalable vector ISA which would have performed corresponding load/store operations on vectors of a known vector length.
The sub-vector-supporting load/store instruction may be a predicated instruction associated with a predicate value. In response to the sub-vector-supporting load/store instruction, the instruction decoder may control the processing circuitry to control, based on predicate bits selected from the predicate value at sub-vector granularity, whether each transfer of the one or more sub-vectors is performed or masked. This may differ from the behaviour of other load/store instructions of the scalable vector ISA which may perform the load/store operation across the entire (scalable) vector length with the predicates selected at a granularity of the element size used for the vector elements of the vector (the element size being smaller than the sub-vector length).
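A hedged C model of the sub-vector-granularity predication described above (assuming, by way of example, 128-bit (16-byte) sub-vectors and one predicate flag per sub-vector):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* C model of a predicated sub-vector store: one predicate flag governs each
 * whole 16-byte (128-bit) sub-vector rather than each individual element.
 * Inactive sub-vectors are masked, leaving memory unchanged. */
void subvec_store(uint8_t *mem, const uint8_t *vreg,
                  const uint8_t *pred, size_t nsv) {
    for (size_t sv = 0; sv < nsv; sv++)
        if (pred[sv])
            memcpy(mem + sv * 16, vreg + sv * 16, 16);
}
```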
In some examples, the scalable vector ISA may support at least one variant of a sub-vector-supporting increment/decrement instruction. In response to the sub-vector-supporting increment/decrement instruction, the instruction decoder may control the processing circuitry to increment or decrement an operand value based on how many sub-vector-sized portions of a vector are indicated as active by predicate bits selected from a predicate value at sub-vector granularity. This can be helpful for loop control, so that a loop control variable used by software to decide whether at least one further loop iteration still needs to be performed can be incremented or decremented according to the number of sub-vectors processed in the latest iteration of the loop (as the number of sub-vectors processed in that loop iteration may not be known at compile time, it can be useful to provide an instruction which enables the number of sub-vectors processed to be deduced and used to update a loop control variable accordingly).
The predicate value used by the sub-vector-supporting increment/decrement instruction to determine how to update the operand value may be one of: a predicate value specified as a predicate operand by the sub-vector-supporting increment/decrement instruction; and a predicate value implied by a predicate pattern identifier specified by the sub-vector-supporting increment/decrement instruction, the predicate pattern identifier specifying a predetermined pattern of predicate bits at sub-vector granularity.
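For illustration, a C model of such an increment (a sketch; the representation of the predicate as one flag per sub-vector is an assumption made for clarity):

```c
#include <stddef.h>
#include <stdint.h>

/* C model of a sub-vector-granularity increment: the loop control variable
 * advances by the number of active sub-vectors, a quantity only known at run
 * time on a scalable architecture. */
uint64_t inc_by_active_subvectors(uint64_t counter,
                                  const uint8_t *pred, size_t nsv) {
    for (size_t sv = 0; sv < nsv; sv++)
        if (pred[sv])
            counter++;
    return counter;
}
```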
In some examples, the scalable vector ISA may support at least one variant of a sub-vector-supporting predicate-setting instruction. In response to the sub-vector-supporting predicate-setting instruction, the instruction decoder may control the processing circuitry to perform a predicate setting operation to set bits of a predicate value at sub-vector granularity, to indicate which sub-vectors of a vector are active. Such an instruction can be useful to control which sub-vectors are processed by other predicated sub-vector-supporting instructions. Bits of the predicate value set to a particular value (e.g. 0) may cause processing of the corresponding sub-vector to be masked. Such a predicate-setting instruction may be a different instruction to other predicate-setting instructions of the scalable vector ISA which may set the predicate value at granularity of a vector element size which may be smaller than the sub-vector length.
The predicate setting operation for the sub-vector-supporting predicate-setting instruction may comprise setting the predicate value based on one of: a predicate pattern identifier specifying a predetermined pattern of predicate bits to be applied at sub-vector granularity; and sub-vector-granularity comparison operations based on a comparison of a first operand and a second operand. With this approach, the new value of the predicate is not explicitly specified in the sub-vector-supporting predicate setting instruction, but can be defined according to some general pattern which may be scalable for different vector lengths. This is useful because the number of predicate bits to be set may be unknown at compile time due to the scalable vector length.
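For example, a while-lt style comparison applied at sub-vector granularity might be modelled as follows (a sketch, not architecture pseudocode):

```c
#include <stddef.h>
#include <stdint.h>

/* C model of a while-lt style predicate-setting operation at sub-vector
 * granularity: sub-vector k is marked active while (i + k) is still below the
 * loop limit n; remaining sub-vectors are marked inactive. */
void set_pred_whilelt(uint8_t *pred, size_t nsv, uint64_t i, uint64_t n) {
    for (size_t sv = 0; sv < nsv; sv++)
        pred[sv] = (i + sv < n) ? 1 : 0;
}
```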
Various examples of sub-vector-supporting instructions are described above. It will be appreciated that any given implementation of the scalable vector ISA need not support all of these types of instructions. Any one or more of the instructions described in this application could be implemented.
The techniques discussed above may be implemented within a data processing apparatus which has hardware circuitry provided for implementing the instruction decoder and processing circuitry discussed above.
However, the same technique can also be implemented within a computer program which executes on a host data processing apparatus to provide an instruction execution environment for execution of target code. Such a computer program may control the host data processing apparatus to simulate the architectural environment which would be provided on a hardware apparatus which actually supports target code according to the scalable vector instruction set architecture, even if the host data processing apparatus itself does not support that architecture. The computer program may have instruction decoding program logic which emulates functions of the instruction decoding circuitry discussed above. For example, the instruction decoding program logic generates, in response to a given instruction of the target code, a corresponding sequence of code in the native instruction set of the host data processing apparatus, to control the host data processing apparatus to perform a function corresponding to that of the decoded instruction. The instruction decoding program logic includes sub-vector-supporting instruction decoding program logic to decode a sub-vector-supporting instruction as discussed above, to control the host data processing apparatus to perform an operation for a given vector at sub-vector granularity. Such a simulation program can be useful, for example, when legacy code written for one instruction set architecture is being executed on a host processor which supports a different instruction set architecture. Also, the simulation can allow software development for a newer version of the instruction set architecture to start before processing hardware supporting that new architecture version is ready, as the execution of the software on the simulated execution environment can enable testing of the software in parallel with ongoing development of the hardware devices supporting the new architecture. The simulation program may be stored on a storage medium, which may be a non-transitory storage medium.
Specific examples are now described with reference to the drawings. It will be appreciated that the claims are not limited to these particular examples.
The execute stage 16 includes a number of processing units, for executing different classes of processing operation. For example, the execution units may include a scalar processing unit 20 (e.g. comprising a scalar arithmetic/logic unit (ALU) 20 for performing arithmetic or logical operations on scalar operands read from the registers 14); a vector processing unit 22 for performing vector operations on vectors comprising multiple vector elements; and a load/store unit 28 for performing load/store operations to access data in a memory system 8, 30, 32, 34. Other examples of processing units which could be provided at the execute stage could include a floating-point unit for performing operations involving values represented in floating-point format, or a branch unit for processing branch instructions.
The registers 14 include scalar registers 25 for storing scalar values, vector registers 26 for storing vector values, and predicate registers 27 for storing predicate values. The predicate values stored in the predicate registers 27 may be used by the vector processing unit 22 when processing vector instructions, with a predicate value in a given predicate register indicating which vector elements of a corresponding vector operand stored in the vector registers 26 are active vector elements or inactive vector elements (where operations corresponding to inactive data elements may be suppressed or may not affect a result value generated by the vector processing unit 22 in response to a vector instruction).
A memory management unit (MMU) 36 controls address translations between virtual addresses (specified by instruction fetches from the fetch circuitry 6 or load/store requests from the load/store unit 28) and physical addresses identifying locations in the memory system, based on address mappings defined in a page table structure stored in the memory system. The page table structure may also define memory attributes which may specify access permissions for accessing the corresponding pages of the address space, e.g. specifying whether regions of the address space are read only or readable/writable, specifying which privilege levels are allowed to access the region, and/or specifying other properties which govern how the corresponding region of the address space can be accessed. Entries from the page table structure may be cached in a translation lookaside buffer (TLB) 38, which is a cache maintained by the MMU 36 for caching page table entries or other information for speeding up access to page table entries from the page table structure stored in memory.
In this example, the memory system includes a level one data cache 30, the level one instruction cache 8, a shared level two cache 32 and main system memory 34. It will be appreciated that this is just one example of a possible memory hierarchy and other arrangements of caches can be provided. The specific types of processing unit 20 to 28 shown in the execute stage 16 are just one example, and other implementations may have a different set of processing units or could include multiple instances of the same type of processing unit so that multiple micro-operations of the same type can be handled in parallel.
The processing pipeline 4 supports a scalable vector ISA, which means that the vector instructions of the ISA may be processed without it being known at compile time (the time at which the executed instructions were compiled by a compiler) what vector length will be used for execution of those vector instructions. This enables a variety of different processing apparatuses supporting different maximum vector lengths to execute the same software code, to avoid the burden on the software developer in producing software code suitable for execution across a range of processing platforms. This means that the hardware designer of a given processing apparatus has freedom to select the implemented maximum vector length depending on the designer's preferred performance/power requirements (systems aimed at higher performance may select a longer maximum vector length than systems aimed at better energy efficiency).
Of course, other examples could be based on different scalable and non-scalable architectures, and so the range of sizes available for selection for the scalable vector length and the fixed size specified for the non-scalable vector length could differ from those described here.
As well as the vector registers 26, a number of predicate registers 27 (labelled P0 to P15) are provided, for storing predicate values used to selectively mask operations performed on vector elements of vector operands provided using the vector registers 26. The predicate registers 27 may have a bit per vector element in the vector registers 26 when defined according to the minimum vector element size supported by the ISA when the maximum vector length supported in hardware is used (the maximum vector length here refers to the selected size LEN*128 used by a particular hardware implementation, not the maximum vector length (e.g. 2048 bits) permitted by the architecture for any hardware implementation). For example, if the minimum element size is 8 bits, then for the example of LEN*128 bit vector registers, each predicate register may have size LEN*16 bits.
The vector length agnostic property of the scalable vector ISA is useful because within a fixed encoding space available for encoding instructions of the ISA, it is not feasible to create different instructions for every different vector length that may be demanded by processor designers, when considering the wide range of requirements scaling from relatively small energy-efficient microcontrollers to servers and other high-performance-computing systems. By not having a fixed vector length known at compile time, multiple markets can be addressed using the same ISA, without effort from software developers in tailoring code to each performance/power/area point.
To achieve the scalable property of the scalable vector ISA, the functionality of the vector instructions of the scalable vector ISA is defined in the architecture with reference to a parameter (e.g. the multiplier LEN mentioned above, the implemented vector length being LEN*128 bits) representing the vector length chosen for a particular hardware implementation, which is not known at compile time.
In both the 128-bit and 256-bit examples, the vector instructions scale to the number of vector elements that can fit within the corresponding vector length. In this example, the element size used is 64 bits, so that there are 2 vector elements per vector in the 128-bit example and 4 vector elements per vector in the 256-bit example. Accordingly, only two predicate bits of p0 are used in the 128-bit example and only four predicate bits of p0 are used in the 256-bit example (in practice, the register p0 may include a greater number of predicate bits to enable smaller vector element sizes to be supported).
The whilelt, incd and b.first instructions are loop control instructions which set predicates and update a loop count value according to the supported vector length and conditionally branch back to the start of the loop depending on a comparison of the loop count value. The other instructions are load/store instructions or vector processing instructions which process a number of vector elements according to the supported vector length.
The incrementing instruction incd increments the loop counter based on the number of vector elements processed in the corresponding loop iteration (e.g. VL/ES where VL is the vector length and ES is the vector element size). The predicate-setting instruction whilelt uses a comparison between the loop counter (i, represented by register x4 in the instruction sequence) and the termination limit (n, represented by register x3, which is loaded from memory by the load instruction at line 3 of the code example above) to set the predicate depending on whether various incremented versions of the loop counter are less than (lt) the limit value. For those elements where the incremented loop counter is still less than the limit value, the predicate is set to true, while the predicate is set to false once the incremented loop counter reaches the termination limit (in this example, this occurs for the fourth element processed for instruction 2 in the 256-bit example, but for the 2-element vector in the 128-bit example this limit is not yet reached on the first pass and will only be reached by a subsequent predicate-setting instruction (the whilelt instruction shown at line 14 in the code example above) that is executed to determine whether there is still another vector element to process (see the 9th instruction in decoded program order in the 128-bit example)).
Note how ultimately the result produced by the instruction sequence is the same for both the 128-bit and 256-bit examples (the data stored to memory for array y[] has the same values 3, 50, 41, 32, from most significant address to least significant address), but the 128-bit example has obtained its result in two loop iterations while the 256-bit example only needed one loop iteration. Hence, the instructions at lines 8-15 in the code example above were required to be decoded and executed twice in the 128-bit example (see instructions 4 to 17 numbered in decoded program order for the 128-bit example).
While the scalable vector ISA can be very useful to enable platform-independent code to be developed supporting a range of different hardware implementations, nevertheless many software programs in use have been optimised specifically for a non-scalable vector ISA supporting a fixed length vector unit, such as the 128-bit vector unit provided for systems compliant with the Neon™ non-scalable vector ISA of Arm® Limited. For backwards compatibility, systems supporting the scalable vector ISA may also support the instructions of the non-scalable vector ISA to ensure such legacy software can still be executed. However, this means that even if the hardware supports greater vector lengths, the non-scalable software cannot benefit from the extra performance that would be available in hardware.
It may be desirable for the program code written in the non-scalable vector ISA to be redeveloped using instructions of the scalable vector ISA, as this can open up greater performance opportunities, exploiting the longer vector lengths available on many hardware implementations of the scalable vector ISA. However, if it is attempted to rewrite a program written for a non-scalable vector ISA (where the vector length is known at compile time) into program code written for a scalable vector ISA (where the vector length is unknown at compile time), this may require considerable effort for the programmer or the compiler writer. The instructions of the non-scalable vector ISA may not map to scalable vector instructions in a simple manner, because often the program code in the non-scalable ISA may include specific optimisations which were chosen dependent on the knowledge of a fixed vector length known at compile time (e.g. 128 bits), and if it cannot be known at compile time what vector length will be used, this may prevent use of some such optimisations, which may in some cases prevent use of a vectorised loop in the scalable vectorised code altogether.
For example, the following C code may implement a partial sum reduction (e.g. this may be an operation from a digital signal processing application):
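(The original listing is not reproduced here; the following is a hedged sketch of a scalar loop of the shape described, in which all names other than acc and state[biquad].y1, mentioned below, are hypothetical placeholders.)

```c
/* Hypothetical types and fields; only acc and state[biquad].y1 come from
 * the surrounding text. */
typedef struct {
    float y1;        /* accumulated output referenced in the text */
    float coeff[4];  /* hypothetical coefficients */
    float x[4];      /* hypothetical input samples */
} biquad_state_t;

void partial_sum(biquad_state_t *state, int biquad) {
    float acc = state[biquad].y1;
    for (int i = 0; i < 4; i++)
        acc += state[biquad].coeff[i] * state[biquad].x[i];  /* multiply-add */
    state[biquad].y1 = acc;
}
```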
In the code above, the result of the multiply-add operations in each iteration is accumulated in acc and stored in state[biquad].y1. After being vectorised, each element of the vector used to store the value of state[biquad].y1 should store the partial sum of the accumulation (Element 0 should store the value of acc in the 0th iteration, Element 1 should store the resultant value of acc after the first two iterations, and so on).
On a non-scalable vector ISA, this partial sum reduction can be achieved by broadcasting each one of the four elements into a new vector register and partially accumulating these four vector registers using the mla (multiply-accumulate) instructions, as shown below:
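(Again, the original listing is not reproduced here; the following Neon-intrinsics sketch illustrates the dup + mla idiom described, with hypothetical parameter names. The arrangement of the coefficient vectors c0..c3 so that lane k receives only the terms of the k-th partial sum is an assumption, not taken from the source.)

```c
#include <arm_neon.h>

/* Each of the four lanes of x is broadcast into its own vector register with
 * a lane duplicate (dup), then folded into the accumulator with a
 * multiply-accumulate (mla). The coefficient vectors c0..c3 (hypothetical)
 * are assumed to be arranged so that lane k of acc accumulates exactly the
 * terms making up the k-th partial sum. */
float32x4_t partial_accumulate(float32x4_t acc, float32x4_t x,
                               float32x4_t c0, float32x4_t c1,
                               float32x4_t c2, float32x4_t c3) {
    acc = vmlaq_f32(acc, vdupq_laneq_f32(x, 0), c0);
    acc = vmlaq_f32(acc, vdupq_laneq_f32(x, 1), c1);
    acc = vmlaq_f32(acc, vdupq_laneq_f32(x, 2), c2);
    acc = vmlaq_f32(acc, vdupq_laneq_f32(x, 3), c3);
    return acc;
}
```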
However, in a scalable vector ISA this partial sum reduction cannot be performed in the same way, because at compile time the vector length is unknown and therefore the compiler does not know how many dup instructions should be included. This may prevent the successful vectorisation of the scalar loop defined in the C code, and may force the scalable vector code to resort to a scalar loop so that the benefits of vectorisation cannot be realised.
This is just one example of a software optimisation which may rely on compile-time knowledge of the vector length. Other examples may include loop unrolling, which reduces the number of loop control instructions needed to be executed by mapping a certain number of original loop iterations to a smaller number of loop iterations each comprising a greater number of instructions, with one iteration of the “unrolled” loop corresponding to multiple iterations of the original loop; and software pipelining, where a compiler re-orders the execution of instructions of the loop so that some instructions of a later loop iteration may be executed ahead of instructions from an earlier loop iteration.
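For instance, loop unrolling in its simplest form might look as follows (an illustrative sketch only, not tied to any particular listing in this application):

```c
/* Four original iterations folded into one unrolled iteration; writing the
 * loop this way presumes the per-step width and trip-count structure are
 * known when the code is compiled. */
void scale_unrolled(float *x, float k, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {  /* one unrolled iteration = 4 original ones */
        x[i]     *= k;
        x[i + 1] *= k;
        x[i + 2] *= k;
        x[i + 3] *= k;
    }
    for (; i < n; i++)            /* remainder iterations */
        x[i] *= k;
}
```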
With the sub-vector approach discussed above, since the sub-vector length is known, it becomes much simpler to convert code compiled for use with a non-scalable vector ISA (assuming a fixed vector length) to code compiled for use with the scalable vector ISA (using sub-vector-supporting instructions which assume sub-vectors of known sub-vector length but allow for scalable overall vector length). Also, even if compiling directly for the scalable vector ISA supporting the sub-vector-supporting instructions (without starting from non-scalable vectorised code), the sub-vector-supporting instructions can be useful to allow software performance-improving techniques such as those discussed above to be applied which would not otherwise be possible for scalable vector instructions which operate at element-by-element or whole-vector granularity in a vector of scalable length, rather than at the granularity of sub-vectors of fixed size.
Hence, as discussed in the examples below, a number of sub-vector-supporting instructions may be defined which control the processing circuitry 16 (e.g. the vector processing unit 22 and/or load/store unit 28) to perform operations at the granularity of sub-vectors, rather than at the granularity of individual elements or of the overall vector. The operations performed at granularity of sub-vectors can be performed in parallel, sequentially, part in parallel and part sequentially, or in a pipelined manner, in response to a single instance of execution of a sub-vector-supporting instruction (hence, it is not necessary to use predicate values set between respective instances of executing the sub-vector-supporting instruction to partition the vector into sub-vectors with each sub-vector processed in a separate pass through the sub-vector-supporting instruction).
It is not essential to provide sub-vector-supporting instructions corresponding to all vector operations which might be desired to be performed on vector operands. Many operations (e.g. add or multiply) may be applied at an element-by-element granularity and so may give the correct results even when applied to operands designed to support the vector-of-vectors approach described above.
As illustrated in the drawings, the reduction operations discussed above may be implemented either across corresponding elements of each sub-vector or across the elements within an individual sub-vector.
It will be appreciated that the addressing mode shown is just one example, and other addressing modes could also be used.
All the instructions described above can help make it easier to adapt code written for a fixed-length vector architecture to a scalable vector architecture. It will be appreciated that not all of these instructions need be implemented in a given implementation. Also, similar sub-vector-granularity instructions could be defined for other operations.
In summary, to enable a more straightforward transition for software developers transitioning from a non-scalable, vector-length-prescribing, architecture such as Neon™ to a scalable, vector-length-agnostic, architecture such as SVE, the above examples add sub-vector (e.g. quad-word (128-bit) sized) elements and treat each sub-vector as an element in the scalable architecture. In doing so, a vector-in-vector style is formed to vectorise each fixed-length vector of the non-scalable architecture using the vector-length-agnostic style of the scalable architecture. This allows mapping of non-scalable to scalable code with a rough 1-to-1 mapping of instructions, so that vectorisation can leverage the longer and more flexible vector lengths permitted in the scalable architecture and yet retain code optimisations introduced for the non-scalable architecture which rely on an assumption of a vector length known at compile time.
To the extent that embodiments have previously been described with reference to particular hardware constructs or features, in a simulated embodiment, equivalent functionality may be provided by suitable software constructs or features. For example, particular circuitry may be implemented in a simulated embodiment as computer program logic. Similarly, memory hardware, such as a register or cache, may be implemented in a simulated embodiment as a software data structure stored in the host storage (e.g. memory or registers) of the host processor 330. In arrangements where one or more of the hardware elements referenced in the previously described embodiments are present on the host hardware (for example, host processor 330), some simulated embodiments may make use of the host hardware, where suitable.
The simulator program 310 may be stored on a computer-readable storage medium (which may be a non-transitory medium), and provides a program interface (instruction execution environment) to the target code 300 (which may include applications, operating systems and a hypervisor) which is the same as the interface of the hardware architecture being modelled by the simulator program 310. Thus, the program instructions of the target code 300 may be executed from within the instruction execution environment using the simulator program 310, so that a host computer 330 which does not actually have the hardware features of the apparatus 2 discussed above (e.g. an instruction decoder 10 and processing circuitry 16 supporting the sub-vector-supporting instructions as discussed above) can emulate these features.
Hence, the simulator program 310 may have instruction decoding program logic 312 for decoding instructions of the target code 300 and mapping these to corresponding sets of instructions in the native instruction set of the host apparatus 330. The instruction decoding program logic 312 includes sub-vector-supporting instruction decoding program logic 313 for decoding the sub-vector-supporting instructions described above. Register emulating program logic 314 maps register accesses requested by the target code to accesses to corresponding data structures maintained on the host hardware of the host apparatus 330, such as by accessing data in registers or memory of the host apparatus 330. Memory management program logic 316 implements address translation, page table walks and access permission checking to simulate access to a simulated address space by the target code 300, in a corresponding way to the MMU 36 as described in the hardware-implemented embodiment above. Memory address space simulating program logic 318 is provided to map the simulated physical addresses, obtained by the memory management program logic 316 based on address translation using the page table information maintained by software of the target program code 300, to host virtual addresses used to access host memory of the host processor 330. These host virtual addresses may themselves be translated into host physical addresses using the standard address translation mechanisms supported by the host (the translation of host virtual addresses to host physical addresses being outside the scope of what is controlled by the simulator program 310).
In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims.
Priority application: Number 2203431.8 | Date: Mar 2022 | Country: GB | Kind: national
Filing document: PCT/GB2022/053244 | Filing date: 12/15/2022 | Kind: WO