The present technique relates to the field of data processing, and more particularly to the handling of memory access operations.
Vector processing systems have been developed that seek to improve code density, and often performance, by enabling a given vector instruction to be executed in order to cause an operation defined by that given vector instruction to be performed independently in respect of multiple data elements within a vector of data elements. In the context of memory access operations, it is hence possible to load a plurality of contiguous data elements from memory into a specified vector register in response to a vector load instruction or to store a plurality of contiguous data elements from a specified vector register to memory in response to a vector store instruction. It is also possible to provide vector gather or vector scatter variants of those vector load or store instructions, so as to allow the data elements processed to reside at arbitrary locations in memory. When using such vector gather or vector scatter instructions, in addition to a vector being identified for the plurality of data elements to be processed, a vector can also be identified to provide a plurality of address indications used to determine the memory address of each data element.
There is increasing interest in capability-based architectures in which certain capabilities are defined for a given process, and an error can be triggered if there is an attempt to carry out operations outside the defined capabilities. The capabilities can take a variety of forms, but one type of capability is a bounded pointer (which may also be referred to as a “fat pointer”).
Each capability can include constraining information that is used to restrict the operations that can be performed when using that capability. For instance, considering a bounded pointer, this may provide information used to identify a non-extendable range of memory addresses accessible by processing circuitry when using that capability, along with one or more permission flags identifying associated permissions.
It would be desirable to support the execution of vector gather or vector scatter instructions whilst enabling the various address indications to be specified by capabilities, in order to gain the security benefits offered through the use of capabilities. However, capabilities that provide an address indication are inherently larger than an equivalent standard address indication, due to the constraining information that is provided in association with the address indication to form the capability.
In a first example arrangement there is provided an apparatus comprising: processing circuitry to perform vector processing operations; a set of vector registers; and an instruction decoder to decode vector instructions to control the processing circuitry to perform the vector processing operations specified by the vector instructions; wherein: the instruction decoder is responsive to a given vector memory access instruction specifying a plurality of memory access operations, where each memory access operation is to be performed to access an associated data element, to determine, from a data vector indication field of the given vector memory access instruction, at least one vector register in the set of vector registers associated with a plurality of data elements, and to determine, from at least one capability vector indication field of the given vector memory access instruction, a plurality of vector registers in the set of vector registers containing a plurality of capabilities, each capability being associated with one of the data elements in the plurality of data elements and providing an address indication and constraining information constraining use of that address indication when accessing memory, wherein the number of vector registers determined from the at least one capability vector indication field is greater than the number of vector registers determined from the data vector indication field; the instruction decoder is further arranged to control the processing circuitry: to determine, for each given data element in the plurality of data elements, a memory address based on the address indication provided by the associated capability, and to determine whether the memory access operation to be used to access the given data element is allowed in respect of that determined memory address having regard to the constraining information of the associated capability; and to enable performance of the memory access operation for each data element for which the memory access operation is allowed, where performance of the memory access operation for any given data element causes that given data element to be moved between the determined memory address in the memory and the at least one vector register.
In a further example arrangement there is provided a method of performing memory access operations within an apparatus providing processing circuitry to perform vector processing operations and a set of vector registers, the method comprising: employing an instruction decoder, in response to a given vector memory access instruction specifying a plurality of memory access operations, where each memory access operation is to be performed to access an associated data element, to determine, from a data vector indication field of the given vector memory access instruction, at least one vector register in the set of vector registers associated with a plurality of data elements, and to determine, from at least one capability vector indication field of the given vector memory access instruction, a plurality of vector registers in the set of vector registers containing a plurality of capabilities, each capability being associated with one of the data elements in the plurality of data elements and providing an address indication and constraining information constraining use of that address indication when accessing memory, wherein the number of vector registers determined from the at least one capability vector indication field is greater than the number of vector registers determined from the data vector indication field; controlling the processing circuitry: to determine, for each given data element in the plurality of data elements, a memory address based on the address indication provided by the associated capability, and to determine whether the memory access operation to be used to access the given data element is allowed in respect of that determined memory address having regard to the constraining information of the associated capability; and to enable performance of the memory access operation for each data element for which the memory access operation is allowed, where performance of the memory access operation for any given data element causes that given data element to be moved between the determined memory address in the memory and the at least one vector register.
In a still further example arrangement there is provided a computer program for controlling a host data processing apparatus to provide an instruction execution environment, comprising: processing program logic to perform vector processing operations; vector register emulating program logic to emulate a set of vector registers; and instruction decode program logic to decode vector instructions to control the processing program logic to perform the vector processing operations specified by the vector instructions; wherein: the instruction decode program logic is responsive to a given vector memory access instruction specifying a plurality of memory access operations, where each memory access operation is to be performed to access an associated data element, to determine, from a data vector indication field of the given vector memory access instruction, at least one vector register in the set of vector registers associated with a plurality of data elements, and to determine, from at least one capability vector indication field of the given vector memory access instruction, a plurality of vector registers in the set of vector registers containing a plurality of capabilities, each capability being associated with one of the data elements in the plurality of data elements and providing an address indication and constraining information constraining use of that address indication when accessing memory, wherein the number of vector registers determined from the at least one capability vector indication field is greater than the number of vector registers determined from the data vector indication field; the instruction decode program logic is further arranged to control the processing program logic: to determine, for each given data element in the plurality of data elements, a memory address based on the address indication provided by the associated capability, and to determine whether the memory access operation to be used to access the given data element is allowed in respect of that determined memory address having regard to the constraining information of the associated capability; and to enable performance of the memory access operation for each data element for which the memory access operation is allowed, where performance of the memory access operation for any given data element causes that given data element to be moved between the determined memory address in the memory and the at least one vector register.
In a yet further example arrangement there is provided an apparatus comprising: processing means for performing vector processing operations; a set of vector register means; and instruction decode means for decoding vector instructions to control the processing means to perform the vector processing operations specified by the vector instructions; wherein: the instruction decode means, responsive to a given vector memory access instruction specifying a plurality of memory access operations, where each memory access operation is to be performed to access an associated data element, for determining, from a data vector indication field of the given vector memory access instruction, at least one vector register means in the set of vector register means associated with a plurality of data elements, and for determining, from at least one capability vector indication field of the given vector memory access instruction, a plurality of vector register means in the set of vector register means containing a plurality of capabilities, each capability being associated with one of the data elements in the plurality of data elements and providing an address indication and constraining information constraining use of that address indication when accessing memory, wherein the number of vector register means determined from the at least one capability vector indication field is greater than the number of vector register means determined from the data vector indication field; the instruction decode means is further arranged for controlling the processing means: to determine, for each given data element in the plurality of data elements, a memory address based on the address indication provided by the associated capability, and to determine whether the memory access operation to be used to access the given data element is allowed in respect of that determined memory address having regard to the constraining information of the associated capability; and to enable performance of the memory access operation for each data element for which the memory access operation is allowed, where performance of the memory access operation for any given data element causes that given data element to be moved between the determined memory address in the memory and the at least one vector register means.
The present technique will be described further, by way of illustration only, with reference to examples thereof as illustrated in the accompanying drawings, in which:
In accordance with the techniques described herein, an apparatus is provided that has processing circuitry to perform vector processing operations, a set of vector registers, and an instruction decoder to decode vector instructions to control the processing circuitry to perform the vector processing operations specified by the vector instructions. The vector processing operation specified by a vector instruction may be implemented by performing the required operation independently on each of a plurality of data elements in a vector, and those required operations may be performed in parallel, sequentially one after the other, or in groups (where for example the operations in a group may be performed in parallel, and each group may be performed sequentially).
The instruction decoder may be arranged to process a given vector memory access instruction specifying a plurality of memory access operations, where each memory access operation is to be performed to access an associated data element, and hence the plurality of memory access operations can collectively be viewed as implementing a vector memory access operation specified by the vector memory access instruction. In particular, in response to such a given vector memory access instruction, the instruction decoder may be arranged to determine, from a data vector indication field of the given vector memory access instruction, at least one vector register in the set of vector registers associated with a plurality of data elements. Each vector register determined from the data vector indication field may hence for example form a source register for a vector scatter operation seeking to store data elements from that source register to various locations in memory, or may act as a destination register for a vector gather operation seeking to load data elements from various locations in memory for storage in that vector register.
The instruction decoder is also arranged to determine, from at least one capability vector indication field of the given vector memory access instruction, a plurality of vector registers in the set of vector registers containing a plurality of capabilities. In one example implementation, a single capability vector indication field is used, and the plurality of vector registers are determined from the information in that single capability vector indication field. However, in an alternative implementation, multiple capability vector indication fields may be provided, for example to allow each capability vector indication field to identify a corresponding vector register. In one example implementation each vector register of the plurality of vector registers contains a plurality of capabilities, whilst in another example each vector register of the plurality of vector registers contains a single capability.
Each capability in the determined plurality of vector registers is associated with one of the data elements in the plurality of data elements and provides an address indication and constraining information constraining use of that address indication when accessing memory. The constraining information can take a variety of forms, but may for example identify range information that is used to determine an allowable range of memory addresses that may be accessed when using the address indication provided by the capability, and/or one or more permission attributes specifying types of accesses that may be performed using the address indication (for example whether read accesses are allowed, whether write accesses are allowed, whether the capability can be used to generate memory addresses of instructions to be fetched and executed, whether accesses are allowed from a particular level of security or privilege, etc.). In a further example the constraining information may be a constraint identifying value indicative of an entry in a set of constraint information. Each entry in the set of constraint information can take a variety of forms, but may for example identify range information that is used to determine an allowable range of memory addresses that may be accessed when using the address indication provided by the capability, and/or one or more permission attributes specifying types of accesses that may be performed using the address indication (for example whether read accesses are allowed, whether write accesses are allowed, whether the capability can be used to generate memory addresses of instructions to be fetched and executed, whether accesses are allowed from a particular level of security or privilege, etc.). In some implementations the generated memory address may be a physical memory address that directly corresponds to a location in the memory system, whereas in other implementations the generated memory address may be a virtual address upon which address translation may need to be performed in order to determine the physical memory address to access.
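Purely by way of illustration, the following Python sketch models a capability in the manner described above, assuming a simple base/limit bounds encoding and a small set of permission flags; the field names and the access_allowed helper are illustrative assumptions rather than any particular architectural encoding.

```python
# Illustrative sketch only: a minimal model of a capability, assuming a simple
# base/limit bounds encoding and read/write permission flags. The field names
# are hypothetical and do not reflect any particular architecture.
from dataclasses import dataclass

@dataclass
class Capability:
    address: int      # the address indication (pointer)
    base: int         # lowest address the capability may access
    limit: int        # first address beyond the allowed range
    perms: set        # e.g. {"read"} or {"read", "write"}

    def access_allowed(self, addr: int, size: int, access_type: str) -> bool:
        """Check an access of `size` bytes at `addr` against the constraining info."""
        in_bounds = self.base <= addr and (addr + size) <= self.limit
        permitted = access_type in self.perms
        return in_bounds and permitted

# Example: a capability covering bytes 0x1000..0x10FF, read-only.
cap = Capability(address=0x1000, base=0x1000, limit=0x1100, perms={"read"})
assert cap.access_allowed(0x1000, 4, "read")
assert not cap.access_allowed(0x10FC, 8, "read")   # straddles the limit
assert not cap.access_allowed(0x1000, 4, "write")  # permission not granted
```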
In accordance with the techniques described herein, the number of vector registers determined from the at least one capability vector indication field is greater than the number of vector registers determined from the data vector indication field.
The instruction decoder is further arranged to control the processing circuitry to determine, for each given data element in the plurality of data elements, a memory address (which may be either a virtual address or a physical address) based on the address indication provided by the associated capability, and to determine whether the memory access operation to be used to access the given data element is allowed in respect of that determined memory address having regard to the constraining information of the associated capability. As mentioned earlier, the constraining information can take a variety of forms, and hence the checks performed here to determine whether the memory access operation to be used to access the given data element is allowed may take various forms. Those checks may hence for example identify whether the determined memory address can be accessed given any range constraining information in the capability, but also may determine whether the type of access is allowed (e.g. if the access operation is to perform a write to memory, does the constraining information in the capability allow such a write to be performed).
The processing circuitry can then be arranged to enable performance of the memory access operation for each data element for which the memory access operation is allowed, where performance of the memory access operation for any given data element causes that given data element to be moved between the determined memory address in the memory and the at least one vector register (it being appreciated that the direction of movement depends upon whether the data is being loaded from memory into the registers or stored from the registers into memory). In one example implementation, the given data element in the original location may be left untouched during this process, and hence in that case the move operation may be performed by copying the given data element. This, for example, may typically be the case at least when loading a data element from memory for storage within a vector register, where the data element then stored within the vector register is a copy of the data element stored in memory.
Whilst in one example implementation the memory access operations may be performed for each data element for which those memory access operations are allowed, in other implementations it may be decided to suppress performance of one or more allowed memory access operations in instances where another of the memory access operations is not allowed. Exactly which allowable accesses get suppressed in such a situation may depend on the implementation, and on where, within the vector of data elements, the data element whose associated access is not allowed resides. Purely by way of illustrative example, it may be that the various accesses are performed sequentially, and hence when one access is detected that is not allowed, it may be decided to suppress the subsequent accesses irrespective of whether they are allowed or not, but with the earlier accesses having already been performed.
In one example implementation, a mechanism is provided to keep track of valid capabilities stored within the vector registers. In particular, in one example implementation, the apparatus further comprises capability indication storage providing a valid capability indication field in association with each capability sized block within given vector registers of the set of vector registers, wherein each valid capability indication field is arranged to be set to indicate when the associated capability sized block stores a valid capability and is otherwise cleared. Whilst in one example implementation any of the vector registers in the set of vector registers may be able to store capabilities, in another example implementation the ability to store capabilities may be limited to a subset of the vector registers in the set, and in that latter case the capability indication storage will only need to provide a valid capability indication field for each capability sized block within that subset of the vector registers.
Whilst in one example implementation the capability indication storage may be provided separately to the set of vector registers, in an alternative example implementation the capability indication storage may be incorporated within the set of vector registers.
In order to constrain how the valid capability indication fields are set, the processing circuitry may be arranged to only allow any valid capability indication field to be set to indicate that a valid capability is stored in the associated capability sized block in response to execution of one or more specific instructions amongst a set of instructions that are executable by the apparatus. By restricting the setting of the valid capability indication field in this way, this can improve security, for example by inhibiting any attempt to indicate that a capability sized block of general purpose data within a vector register should be treated as a capability. Hence, operations performed on a vector that do not create a valid capability, either through a non-capability operation or through mutating a capability in such a way that it ceases to be valid, can be arranged to cause the associated valid capability indication field to be cleared, hence indicating that a valid capability is not stored therein. Thus, by way of example, a partial write to a capability sized block of data, or a write of a non-capability, will clear the associated valid capability indication field. A valid capability indication field may also be cleared by various non-instruction operations, for example the stacking and clearing of vector register state associated with exception handling, or in some implementations a reset operation.
As mentioned earlier, the number of vector registers used to provide the required capabilities when executing the above-mentioned given vector memory access instruction is larger than the number of vector registers containing the data elements being subjected to the memory access operations. In one example implementation, the number of vector registers forming the plurality of vector registers determined from the at least one capability vector indication field is a power of two. In particular, the number of vector registers required to store the capabilities is dependent on the ratio between the size of the capabilities and the size of the data elements, and in one example implementation that ratio can vary by powers of two. It should be noted herein that when considering the size of a capability, any associated flag used to indicate that the capability is a valid capability (such as the earlier-mentioned valid capability indication field) is not considered to be part of the capability itself.
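As a simple illustration of the point above, the following sketch infers the number of capability vector registers from the (assumed power-of-two) ratio of the capability size to the data element size; the function name and parameters are purely illustrative.

```python
# Sketch: the number of vector registers needed to hold one capability per data
# element, assuming the data elements occupy a single vector register and that
# the capability size is a power-of-two multiple of the data element size.
def capability_register_count(cap_bits: int, elem_bits: int) -> int:
    assert cap_bits % elem_bits == 0
    ratio = cap_bits // elem_bits
    assert ratio & (ratio - 1) == 0, "ratio assumed to be a power of two"
    return ratio

print(capability_register_count(cap_bits=64, elem_bits=32))   # -> 2
print(capability_register_count(cap_bits=128, elem_bits=32))  # -> 4
```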
As mentioned earlier, if desired, multiple capability vector indication fields can be used to specify the various vector registers storing the capabilities required when executing the given vector memory access instruction. Such an approach allows the various vector registers to be arbitrarily located with respect to each other, and specified in the instruction encoding. However, in one example implementation, the at least one capability vector indication field is a single capability vector indication field arranged to identify one vector register and the instruction decoder is arranged to determine the remaining vector registers of the plurality of vector registers based on a determined relationship. Such an approach can be advantageous from an instruction encoding point of view, since typically instruction encoding space is quite limited, and it may not be practical to provide multiple capability vector indication fields to identify each of the vector registers that are to store the required capabilities.
The way in which the remaining vector registers are determined based on the identified one vector register and the determined relationship can take a variety of forms, dependent on implementation. For example, the determined relationship may specify that the vector registers are sequential to each other, that the vector registers are an even/odd pair, or that a known offset exists between the various vector registers. Alternatively, any other suitable indicated relationship may be used.
In one particular example implementation, the number of vector registers in the plurality of vector registers storing the required capabilities is 2^N, and the single capability vector indication field is indicative of a first vector register number identifying the one vector register, where the first vector register number is constrained to have its N least significant bits at a logic zero value. The instruction decoder is then arranged to generate vector register numbers for each of the remaining vector registers by reusing the first vector register number and selectively setting at least one of the N least significant bits to a logic one value. This can provide a particularly simple and efficient mechanism for computing the various vector registers that will provide the capabilities required when executing the given vector memory access instruction.
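Purely by way of illustration, the following sketch shows one way the register numbers could be derived in the manner just described; the helper name is illustrative only.

```python
# Sketch of the register-number derivation described above: the field encodes a
# first register number whose N least significant bits are zero, and the other
# 2^N - 1 register numbers are formed by setting combinations of those N bits.
def capability_register_numbers(first_reg: int, n: int) -> list:
    assert first_reg & ((1 << n) - 1) == 0, "low N bits must be zero"
    return [first_reg | k for k in range(1 << n)]

# Example: N = 1 (two capability registers), first register Q4 -> Q4, Q5.
print(capability_register_numbers(4, 1))   # [4, 5]
# Example: N = 2 (four capability registers), first register Q8 -> Q8..Q11.
print(capability_register_numbers(8, 2))   # [8, 9, 10, 11]
```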
In some implementations, the number of vector registers required to hold the capabilities will be fixed, for example due to the given vector memory access instruction only being supported for use with data elements of a particular fixed size, and where the capabilities are also of a fixed size. However, in a more general case, the number of vector registers can be inferred at runtime by the instruction decoder, based on knowledge of the size of the data elements upon which the given vector memory access instruction will be executed, and the size of the capabilities.
There are a number of ways in which the single capability vector indication field may be arranged to indicate the first vector register number. Whilst the single capability vector indication field may directly identify the first vector register number in one example implementation, in other implementations it may specify information sufficient to enable that first vector register number to be determined. For example, in the above case, where the first vector register number is constrained to have its N least significant bits at a logic zero value, those N least significant bits do not need to be identified within the single capability vector indication field, and instead can be hardwired to logic zero values.
The manner in which the capabilities associated with the various data elements are laid out within the vector registers used to provide the capabilities may vary dependent on implementation. However, in one example implementation, for any given pair of data elements associated with adjacent locations in the at least one vector register, the associated capabilities are stored in different vector registers of said plurality of vector registers. It has been found that such an arrangement can allow an efficient implementation when executing the given vector memory access instruction.
The way in which the location within the multiple vector registers of the associated capability for any particular data element is determined may vary dependent on implementation. However, in one example implementation the at least one vector register determined from the data vector indication field comprises a single vector register, and each data element is associated with a corresponding data lane of the single vector register. Further, each capability is located within a capability lane within one of the vector registers in said plurality of vector registers. It should be noted here that the width of the data lane will typically be different from the width of the capability lane, due to the fact that the data elements and capabilities are of a different size. With such an arrangement, then for a given data element the vector register within the plurality of vector registers containing the associated capability may be determined in dependence on a given number of least significant bits of a lane number of the corresponding data lane, and the capability lane containing the associated capability may be determined in dependence on the remaining bits of the lane number of the corresponding data lane. This hence provides a particularly efficient mechanism for determining the location of the associated capability for each data element.
In one particular example arrangement, the number of vector registers containing the plurality of capabilities is P, considered logically as a sequence with values 0 to P−1, and the number of capability lanes in any given vector register is M, with values from 0 to M−1. Further, the data lane associated with the given data element is data lane X, where the data lanes take values from 0 to (P×M)−1. Using such terminology, then in one example implementation the location of the associated capability within the plurality of vector registers may be determined by dividing X by P to produce a quotient and a remainder, where the quotient identifies the capability lane containing the associated capability, and the remainder identifies the vector register within the plurality of vector registers containing the associated capability. Hence, in such an implementation both the vector register and the capability lane needed to locate the associated capability for a given data element can be readily and efficiently determined.
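By way of illustration, the following sketch applies the divide-by-P rule described above; note that, consistent with the earlier observation, adjacent data lanes map to different capability registers.

```python
# Sketch of the lane mapping described above: for data lane X and P capability
# registers, the quotient X // P gives the capability lane and the remainder
# X % P gives which of the P registers (in their logical 0..P-1 order) holds
# the associated capability.
def locate_capability(data_lane: int, num_cap_regs: int):
    cap_lane, reg_index = divmod(data_lane, num_cap_regs)
    return reg_index, cap_lane

# Example: P = 2 capability registers, eight data lanes.
for x in range(8):
    reg, lane = locate_capability(x, 2)
    print(f"data lane {x}: capability register {reg}, capability lane {lane}")
# data lane 0 -> register 0, lane 0; data lane 1 -> register 1, lane 0;
# data lane 2 -> register 0, lane 1; ... so adjacent data lanes use different registers.
```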
It should be noted that whilst in the above example the plurality of vector registers containing the plurality of capabilities is considered logically as a sequence with values 0 to P−1, this does not mean that the logical vector numbers associated with those vector registers need to be contiguous logical vector numbers, nor indeed does it mean that the vector registers have to be physically sequentially located with respect to each other within the set of vector registers.
In one example implementation, the set of vector registers may be logically partitioned into a plurality of sections, where each section contains a corresponding portion from each of the vector registers in the set of vector registers, and the plurality of capabilities may be located within the plurality of vector registers such that, for each data element, the associated capability is stored within the same section as that data element. By such an approach, this can allow execution of the given vector memory access instruction to be divided into multiple “beats”, and during each beat only one section of the set of vector registers is accessed in order to execute the given vector memory access instruction. By allowing the vector memory access instruction to be divided into multiple beats, this can allow execution of the vector memory access instruction to be overlapped with execution of one or more other instructions, which can lead to a highly efficient implementation. In particular, since during any particular beat the data elements and capabilities required to perform the memory access operations during that beat can all be obtained from a single section of the set of vector registers, this leaves any other sections available for access during execution of an overlapped instruction.
In one example implementation, the processing circuitry may be arranged to perform, over one or more beats, the memory access operations for the data elements within a given section, before performing, over one or more beats, the memory access operations for the data elements within a next section. Whilst in one example implementation each beat amongst the multiple beats used to execute the given vector memory access instruction may access a different section, this is not a requirement and it may be the case in some implementations that more than one of those beats accesses the same section.
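Purely as an illustration of the section locality property described above, the following sketch assumes (as one possible example) 128-bit registers, 32-bit data elements, 64-bit capabilities and 64-bit sections, and checks that each data element and its associated capability fall in the same section; all of these concrete sizes are assumptions made for illustration only.

```python
# Illustrative sizes assumed: 128-bit registers, 32-bit data elements, 64-bit
# capabilities (so P = 2 capability registers) and two 64-bit sections per
# register. The check confirms that each data element and its capability lie in
# the same section, which is what allows beat-wise, section-at-a-time execution.
REG_BITS, ELEM_BITS, CAP_BITS, SECTION_BITS = 128, 32, 64, 64
P = CAP_BITS // ELEM_BITS                      # number of capability registers (here 2)

for x in range(REG_BITS // ELEM_BITS):         # data lanes 0..3
    data_section = (x * ELEM_BITS) // SECTION_BITS
    cap_lane = x // P                          # capability lane within its register
    cap_section = (cap_lane * CAP_BITS) // SECTION_BITS
    print(f"lane {x}: data in section {data_section}, capability in section {cap_section}")
    assert data_section == cap_section
```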
There are a number of ways in which the capabilities required when executing the above-mentioned given vector memory access instruction may be loaded from memory and then configured within the multiple vector registers in the arrangements discussed earlier, and indeed a number of ways in which those capabilities within the vector registers can be stored back to memory in due course. However, in one example implementation, the instruction decoder is arranged to decode a plurality of vector capability memory transfer instructions that together cause the instruction decoder to control the processing circuitry to transfer a plurality of capabilities between the memory and the plurality of vector registers, and to rearrange the plurality of capabilities during the transfer such that in memory the plurality of capabilities are sequentially stored and in the plurality of vector registers the plurality of capabilities are de-interleaved such that any given pair of capabilities within said plurality that are sequentially stored in the memory are stored in different vector registers of said plurality of vector registers.
It should be noted that the plurality of vector capability memory transfer instructions used to take the above steps do not need to directly follow each other, and hence do not need to be executed sequentially one after the other. Instead, there could be multiple, distinct, instructions that each perform part of the required work, and once all of the instructions have been executed then the required rearrangement of the capabilities as they are moved (in one example copied) between the memory and the vector registers will have been performed. The plurality of vector capability memory transfer instructions may be either load instructions used to load the capabilities from memory into the multiple vector registers, or store instructions used to store the capabilities from the multiple vector registers back to memory.
In one example implementation, each vector capability memory transfer instruction is arranged to identify different capabilities to each other vector capability memory transfer instruction, and each vector capability memory transfer instruction is arranged to identify an access pattern that causes the processing circuitry to transfer the identified capabilities whilst performing the rearrangement specified by the access pattern. Hence, in such an arrangement execution of each individual vector capability memory transfer instruction will cause the required rearrangement to be performed in respect of the capabilities being transferred by that instruction, with other vector capability memory transfer instructions then being used to transfer other capabilities and perform the required rearrangement for those capabilities.
With such an implementation, it is possible to arrange for the various different instructions to all transfer the same maximum amount of data, that maximum amount of data being selected having regard to the finite memory bandwidth available in any particular system. Such an approach can prevent any individual instruction from stalling, and hence no sequencing state machine is required in order to implement such an approach. Such an approach also allows other instructions to be scheduled whilst this capability transfer process is ongoing. Further, by arranging each of the instructions to operate on different capabilities in the manner discussed above, any individual instruction can be arranged, for each beat, to operate only within a single section of the vector registers. As discussed earlier, operating only within a given section allows overlapping of instructions that operate on different sections.
In one example implementation, the memory is formed of multiple memory banks and, for each vector capability memory transfer instruction, the access pattern is defined so as to cause more than one of the memory banks to be accessed when that vector capability memory transfer instruction is executed by the processing circuitry. Banked memory makes it easier for hardware to implement parallel transfers to/from memory, and hence specifying an access pattern that enables this is beneficial.
In addition to the vector capability memory transfer instructions mentioned above, vector load and store instructions can be used to load data elements from memory into the vector registers or store those data elements from the vector registers back to memory as and when required.
Whilst the number of vector registers used to hold the data elements and the number of vector registers used to hold the associated capabilities may vary dependent on implementation, in one particular example implementation the at least one vector register determined from the data vector indication field of the given vector memory access instruction comprises a single vector register, the capabilities are twice the size of the data elements (as mentioned earlier any flag used to indicate that the capability is a valid capability is not considered to be part of the capability when considering the size of the capability), and the plurality of vector registers determined from the at least one capability vector indication field comprise two vector registers. It has been found that such an arrangement provides a particularly useful implementation for performing vector gather and scatter operations using memory addresses derived from capabilities.
In one example implementation, the given vector memory access instruction may further comprise an immediate value indicative of an address offset, and the processing circuitry may be arranged to determine, for each given data element in the plurality of data elements, the memory address of the given data element by combining the address offset with the address indication provided by the associated capability. This can provide an efficient implementation for computing the memory addresses from the address indications provided in the various capabilities.
In one example implementation, the given vector memory access instruction may further comprise an immediate value indicative of an address offset, and, for each given data element, the processing circuitry may be arranged to update the address indication of the associated capability in the plurality of vector registers by adjusting the address indication in dependence on the address offset. Hence, by way of example, once the address indication in a particular capability has been used during execution of a first vector memory access instruction, that address indication as indicated within the capability stored in the vector register can be updated in the above manner so that it is ready to use in association with a subsequent vector memory access instruction.
In some instances, both of the above adjustment processes can be performed, such that the address offset is combined with (e.g. added to) the address indication provided by the capability in order to identify the memory address to access, and that same updated address is written back to the capability register as an updated address indication. Typically, the same immediate value will be used for both adjustment processes, but if desired different immediate values could be used for each adjustment process.
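By way of illustration, the following sketch shows both adjustment processes described above, using a simple dictionary-based model of the capabilities; the helper and field names are illustrative assumptions.

```python
# Sketch of the two optional adjustments described above: the immediate offset
# is added to each capability's address indication to form the access address,
# and the same adjusted address may be written back so that the capability is
# ready for use by a subsequent vector memory access instruction.
def gather_addresses(capabilities, offset, write_back=True):
    addresses = []
    for cap in capabilities:                 # cap is a dict with an 'address' field
        addr = cap["address"] + offset       # address actually accessed
        addresses.append(addr)
        if write_back:
            cap["address"] = addr            # update of the stored address indication
    return addresses

caps = [{"address": 0x1000}, {"address": 0x2000}]
print([hex(a) for a in gather_addresses(caps, offset=8)])  # ['0x1008', '0x2008']
print(caps)                                                # address indications updated in place
```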
Particular example implementations will now be discussed with reference to the figures.
The set of scalar registers 10 comprises a number of scalar registers for storing scalar values which comprise a single data element. Some instructions supported by the instruction decoder 6 and processing circuitry 4 may be scalar instructions which process scalar operands read from the scalar registers 10 to generate a scalar result written back to a scalar register.
The set of vector registers 12 includes a number of vector registers, each arranged to store a vector value comprising multiple elements. In response to a vector instruction, the instruction decoder 6 may control the processing circuitry 4 to perform a number of lanes of vector processing on respective elements of a vector operand read from one of the vector registers 12, to generate either a scalar result to be written to a scalar register 10 or a further vector result to be written to a vector register 12. Some vector instructions may generate a vector result from one or more scalar operands, or may perform an additional scalar operation on a scalar operand in the scalar register file as well as lanes of vector processing on vector operands read from the vector register file 12. Hence, some instructions may be mixed scalar-vector instructions for which at least one of the one or more source registers and a destination register of the instruction is a vector register 12 and another of the one or more source registers and the destination register is a scalar register 10.
Vector instructions may also include vector load/store instructions which cause data values to be transferred between the vector registers 12 and locations in the memory system 8. The load/store instructions may include contiguous load/store instructions for which the locations in memory correspond to a contiguous range of addresses, or gather/scatter type vector load/store instructions which specify a number of discrete addresses and control the processing circuitry 4 to load data from each of those addresses into respective elements of a vector register or to store data from respective elements of a vector register to the discrete addresses.
The processing circuitry 4 may support processing of vectors with a range of different data element sizes. For example, a 128-bit vector register 12 could be partitioned into sixteen 8-bit data elements, eight 16-bit data elements, four 32-bit data elements or two 64-bit data elements. A control register may be used to specify the current data element size being used, or alternatively this may be a parameter of a given vector instruction to be executed.
The processing circuitry 4 may include a number of distinct hardware blocks for processing different classes of instructions. For example, load/store instructions which interact with the memory system 8 may be processed by a dedicated load/store unit 18, whilst arithmetic or logical instructions could be processed by an arithmetic logic unit (ALU). The ALU itself may be further partitioned into a multiply-accumulate unit (MAC) for performing operations involving multiplication, and a further unit for processing other kinds of ALU operations. A floating-point unit can also be provided for handling floating-point instructions. Pure scalar instructions which do not involve any vector processing could also be handled by a separate hardware block compared to vector instructions, or re-use the same hardware blocks.
As discussed earlier, one type of vector load/store instruction that may be supported is a vector gather/scatter instruction. Such a vector instruction may indicate a number of discrete addresses in memory and control the processing circuitry 4 to load data from those discrete addresses into respective elements of a vector register (in the case of a vector gather instruction) or to store data from respective elements of a vector register to the discrete addresses (in the case of a vector scatter instruction). In accordance with the techniques described herein, rather than using a vector of standard address indications to identify the various memory addresses, a new form of vector gather/scatter instruction is provided that is able to specify vectors of capabilities to be used to determine the various memory addresses. This can provide a finer grain of control over the performance of the individual memory access operations used to implement a vector gather/scatter operation, since a separate capability can be defined for use in association with each of those individual memory access operations. In addition to providing an address indication, each capability will typically include constraining information that is used to restrict the operations that can be performed when using that capability. For example, the constraining information may identify a non-extendable range of memory addresses that are accessible by the processing circuitry when using the address indication provided by the capability, and may also provide one or more permission flags identifying associated permissions (for example whether read accesses are allowed, whether write accesses are allowed, whether accesses are allowed from a specified privilege or security level, whether the capability can be used to generate memory addresses of instructions to be fetched and executed, etc.).
When executing this new form of vector gather/scatter instruction, each data element to be moved between memory and a vector register (the direction of movement being dependent on whether a vector gather operation or a vector scatter operation is being performed) will have an associated capability, and capability access checking circuitry 16 within the processing circuitry 4 may be used to perform a capability check for each data element to determine whether the memory access operation to be used to access that given data element is allowed having regard to the constraining information specified by the associated capability. This may hence involve checking both whether the memory address is accessible given any range constraining information in the capability, and whether the type of access is allowed given the constraining information in the capability. More details as to how the plurality of capabilities required when executing such a vector gather/scatter instruction are arranged within a series of vector registers will be discussed in more detail with reference to a number of the remaining figures.
As shown in
When a capability is loaded into a register 100 accessible to the processing circuitry, then the tag bit moves with the capability information. Accordingly, when a capability is loaded into the register 100, an address indication 102 (which may also be referred to herein as a pointer) and metadata 104 providing the constraining information (such as the earlier-mentioned range information and permissions information) will be loaded into the register. In addition, in association with that register, or as a specific bit field within it, the tag bit 106 will be set to identify that the contents represent a valid capability. Similarly, when a valid capability is stored back out to memory, the relevant tag bit 120 will be set in association with the data block in which the capability is stored. By such an approach, it is possible to distinguish between a capability and normal data, and hence ensure that normal data cannot be used as a capability.
The apparatus may be provided with dedicated capability registers for storing capabilities (not shown in
In the specific example of
Whilst in the example of
It should be noted that whilst in the examples of
At step 172, it is determined whether the data being written in respect of a given capability sized portion of a vector register is of a full capability block size. If not, then the process proceeds to step 174 where the tag bit is cleared (if it was previously set). Such an approach prevents illegal modification of a capability. For example, if an attempt is made to modify a certain number of bits of a valid capability stored within a vector register, then the above process will cause the tag bit to be cleared, preventing the modified version now stored in the vector register from being used as a capability.
However, assuming a full capability sized block of information is being written into the given capability sized portion of the vector register, then it is determined at step 176 whether a valid capability is being written. If not, then again the process proceeds to step 174 where the tag bit is cleared. However, if a valid capability is being written, then the process proceeds to step 178 where the tag bit is set.
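Purely by way of illustration, the decision made at steps 172 to 178 could be modelled as follows; the is_valid_capability parameter is a placeholder for whatever validity determination the hardware makes, not a real interface.

```python
# Sketch of the tag-update rule from steps 172-178, using an illustrative model
# in which each capability-sized block of a vector register has one tag bit.
def update_tag_on_write(write_size_bits, capability_size_bits, is_valid_capability):
    # Step 172: is a full capability-sized block being written?
    if write_size_bits != capability_size_bits:
        return 0            # step 174: partial write -> tag cleared
    # Step 176: is the value being written itself a valid capability?
    if not is_valid_capability:
        return 0            # step 174: non-capability data -> tag cleared
    return 1                # step 178: tag set

print(update_tag_on_write(32, 64, is_valid_capability=False))   # 0 (partial write)
print(update_tag_on_write(64, 64, is_valid_capability=False))   # 0 (ordinary data)
print(update_tag_on_write(64, 64, is_valid_capability=True))    # 1 (valid capability)
```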
It should be noted that it is not just during the execution of instructions that write to the vector registers that a tag bit associated with a capability sized block within a vector register may be cleared. In particular, as indicated by
A data vector indication field 210 is used to identify at least one vector register that is to be associated with the data elements that will be moved between the vector register set and memory through execution of the instruction. In one example implementation, a single vector register is identified by the data vector indication field 210. It will be appreciated that such an identified vector register will act as a source vector register when performing a vector scatter operation, or will act as a destination vector register when performing a vector gather operation.
At least one capability vector indication field 215 may also be provided whose contents are used to identify the plurality of vector registers storing the capabilities required to determine the memory addresses of each of the data elements to be subjected to the vector scatter or vector gather operation. Whilst in one implementation multiple capability vector indication fields may be provided, for example one field for each of the vector registers containing the required capabilities, in another example implementation a single capability vector indication field is used to provide sufficient information to determine one of the vector registers storing the capabilities, with the other vector registers then being determined based on some predetermined relationship. This latter approach can be advantageous from an instruction encoding point of view. The predetermined relationship can take a variety of forms. For example, the vector registers may be sequential to each other, may form an even/odd pair, or a known offset may exist between the various vector registers.
As shown in
As another example of optional information that may be provided within one or more fields 220, information may be provided to specify the data element size of the data elements to be accessed during execution of the instruction, and/or the capability size. In some implementations this information may be unnecessary, since the capability size may be fixed, and also it may be the case that the vector memory access instructions of the type described herein are only allowed to be performed on data elements of a specific size, and hence in that example instance both the data element size and the capability size are known without needing to be specified separately by the instruction.
It should be noted that whilst in
At step 240, the multiple vector registers containing the required capabilities are also determined, using the information in the at least one capability vector indication field. As discussed earlier, multiple capability vector indication fields can be provided, each for example identifying one of the vector registers, or alternatively a single capability vector indication field may be provided to enable determination of one of the vector registers, with the other vector registers then being determined having regard to a known relationship.
At step 245, for each given data element that the vector memory access instruction relates to, a memory address is determined for that given data element based on the address indication provided by the associated capability. In addition, it is determined whether the memory access operation to be used to access that given data element is allowed based on the constraining information of the associated capability. This may involve not only determining whether the memory address is within the allowed range specified by range constraining information in the associated capability, but also whether any other constraints specified by the metadata of the associated capability are met (for example whether a write access is allowed using the associated capability in the event that a vector scatter operation is being performed, and hence the individual memory access operation being performed for the given data element is a write operation).
At step 250, performance of the memory access operation can be enabled for each data element for which the memory access operation has been determined to be allowed. Whilst in one example implementation the memory access operations may be performed for each data element for which those memory access operations are allowed, in other implementations it may be decided to suppress performance of one or more allowed memory access operations in instances where another of the memory access operations is not allowed. As mentioned earlier, exactly which allowable accesses get suppressed in such a situation may depend on the implementation, and on where, within the vector of data elements, the data element whose associated access is not allowed resides.
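By way of illustration only, the following sketch models steps 245 and 250 for the gather (load) direction, using a dictionary-based memory and the simple base/limit/permissions capability fields assumed earlier; elements whose check fails are simply skipped here, although as noted above an implementation may instead suppress later accesses.

```python
# Sketch of steps 245 and 250 for a gather (load) operation, using an
# illustrative byte-addressed memory model (a dict) and illustrative capability
# fields. Elements whose capability check fails are not accessed.
def vector_gather(capabilities, memory, elem_bytes):
    result = []
    for cap in capabilities:
        addr = cap["address"]
        allowed = (cap["base"] <= addr
                   and addr + elem_bytes <= cap["limit"]
                   and "read" in cap["perms"])
        if allowed:
            result.append(memory.get(addr, 0))   # copy the element into the vector
        else:
            result.append(None)                  # access not performed
    return result

memory = {0x100: 0xAA, 0x200: 0xBB}
caps = [
    {"address": 0x100, "base": 0x100, "limit": 0x104, "perms": {"read"}},
    {"address": 0x200, "base": 0x000, "limit": 0x100, "perms": {"read"}},  # out of bounds
]
print(vector_gather(caps, memory, elem_bytes=1))   # [170, None]: only the allowed access is performed
```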
At step 330, a first vector register number is determined from the information provided in the capability vector indication field, but in this implementation the least significant N bits of that vector register number are constrained to be logic zero values. In such an implementation, it will be appreciated that the capability vector indication field does not need to specify those bits, since they can be hardwired to 0.
At step 340, each other vector register number for the multiple vector registers containing the required capabilities is determined by manipulation of the N least significant bits of the first determined vector register number. This provides a particularly simple and efficient mechanism for specifying the multiple vector registers containing the required capabilities.
In the examples shown in
Such an arrangement has been found to be highly advantageous, as it means that the capabilities required in association with a particular sequence of data elements can all be found within the same portion 357, 359 of the vector registers. In particular, in the example shown in
Whilst in
It is also not a requirement that the vector registers be considered to be 128-bit registers, and in the example of
When performing the earlier described beat wise execution of a vector memory access instruction, then in one example implementation each section of the vector register may be arranged to store one or more capabilities. Hence, considering the examples of
At step 470, the quotient and the remainder resulting from the above computation are used to identify the capability lane and vector register, respectively, containing the associated capability. At step 475, it is determined whether data lane X is the last data lane, and if not the value of X is incremented at step 480 before returning to step 465. Once at step 475 it is determined that data lane X is the last data lane, then the process ends at step 485.
In some applications such as digital signal processing (DSP), there may be a roughly equal number of ALU and load/store instructions and therefore some large blocks such as the MACs can be left idle for a significant amount of the time. This inefficiency can be exacerbated on vector architectures as the execution resources are scaled with the number of vector lanes to gain higher performance. On smaller processors (e.g. single issue, in-order cores) the area overhead of a fully scaled out vector pipeline can be prohibitive. One approach to minimise the area impact whilst making better usage of the available execution resource is to overlap the execution of instructions, as shown in
Hence, it can be desirable to enable micro-architectural implementations to overlap execution of vector instructions. However, if the architecture assumes that there is a fixed amount of instruction overlap, then while this may provide high efficiency if the micro-architectural implementation actually matches the amount of instruction overlap assumed by the architecture, it can cause problems if scaled to different micro-architectures which use a different overlap or do not overlap at all.
Instead, an architecture may support a range of different overlaps as shown in examples of
As shown in
As shown in the lower example of
On the other hand, a more area efficient implementation may provide narrower processing units which can only process two beats per tick, and as shown in the middle example of
A yet more energy/area-efficient implementation may provide hardware units which are narrower and can only process a single beat at a time, and in this case one beat may be processed per tick, with the instruction execution overlapped and staggered by two beats as shown in the top example of
It will be appreciated that the overlaps shown in
As well as varying the amount of overlap from implementation to implementation to scale to different performance points, the amount of overlap between vector instructions can also change at run time between different instances of execution of vector instructions within a program. Hence, the processing circuitry 4 may be provided with beat control circuitry 20 as shown in
At step 490, a sequence of vector capability memory transfer instructions is decoded, where each such instruction defines an associated access pattern and identifies a subset of the capabilities that are required by any particular instance of the earlier described vector gather/scatter instruction. In one example implementation, each individual vector capability memory transfer instruction identifies a different subset of capabilities to each other vector capability memory transfer instruction in the sequence.
At step 492, the capabilities are then moved between memory and identified vector registers whilst performing de-interleaving (in the event that a load operation is being performed) or interleaving (in the event that a store operation is being performed) as defined by the access patterns of each vector capability memory transfer instruction. As a result, the plurality of capabilities can be arranged to be sequentially stored in memory, whilst in the multiple vector registers the plurality of capabilities are de-interleaved such that any given pair of capabilities that are sequentially stored in memory are stored in different vector registers.
The plurality of vector capability memory transfer instructions used to perform the steps illustrated in
In one example implementation, the memory is formed of multiple memory banks and, for each vector capability memory transfer instruction the access pattern is defined so as to cause more than one of the memory banks to be accessed when that vector capability memory transfer instruction is executed. Banked memory makes it easier for hardware to implement parallel transfers to/from memory, and hence specifying access patterns that enable this is beneficial. This is illustrated schematically in
Purely by way of example, considering the arrangement of capabilities shown in
VLDRC2_1: C0→Qn[63:0], C3→Q(n+1)[127:64]
VLDRC2_2: C1→Q(n+1)[63:0], C2→Qn[127:64]
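Purely by way of illustration, the following sketch models the two access patterns listed above, with the four 64-bit capabilities C0 to C3 stored sequentially in memory and each 128-bit register modelled as two 64-bit capability lanes; the register and lane indexing used here is illustrative only.

```python
# Sketch of the two access patterns listed above: capabilities C0..C3 are
# sequential in memory, and registers Qn and Q(n+1) are each modelled as two
# 64-bit capability lanes (lane 0 = bits [63:0], lane 1 = bits [127:64]).
def vldrc2_1(mem_caps, qn, qn1):
    qn[0] = mem_caps[0]     # C0 -> Qn[63:0]
    qn1[1] = mem_caps[3]    # C3 -> Q(n+1)[127:64]

def vldrc2_2(mem_caps, qn, qn1):
    qn1[0] = mem_caps[1]    # C1 -> Q(n+1)[63:0]
    qn[1] = mem_caps[2]     # C2 -> Qn[127:64]

mem_caps = ["C0", "C1", "C2", "C3"]          # sequentially stored in memory
qn, qn1 = [None, None], [None, None]
vldrc2_1(mem_caps, qn, qn1)
vldrc2_2(mem_caps, qn, qn1)
print(qn, qn1)   # ['C0', 'C2'] ['C1', 'C3']: sequential capabilities end up in different registers
```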
With reference to
Historically, powerful processors have been required to provide simulator implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when there is a desire to run code native to another processor for compatibility or re-use reasons. For example, the simulator implementation may provide an instruction execution environment with additional functionality which is not supported by the host processor hardware, or provide an instruction execution environment typically associated with a different hardware architecture. An overview of simulation is given in “Some Efficient Architecture Simulation Techniques”, Robert Bedichek, Winter 1990, USENIX Conference, Pages 53 to 63.
To the extent that examples have previously been described with reference to particular hardware constructs or features, in a simulated implementation equivalent functionality may be provided by suitable software constructs or features. For example, particular circuitry may be provided in a simulated implementation as computer program logic. Similarly, memory hardware, such as a register or cache, may be provided in a simulated implementation as a software data structure. Also, the physical address space used to access memory 8 in the hardware apparatus 2 could be emulated as a simulated address space which is mapped on to the virtual address space used by the host operating system 510 by the simulator 505. In arrangements where one or more of the hardware elements referenced in the previously described examples are present on the host hardware (for example host processor 515), some simulated implementations may make use of the host hardware, where suitable.
The simulator program 505 may be stored on a computer readable storage medium (which may be a non-transitory medium), and provides a virtual hardware interface (instruction execution environment) to the target code 500 (which may include applications, operating systems and a hypervisor) which is the same as the hardware interface of the hardware architecture being modelled by the simulator program 505. Thus, the program instructions of the target code 500 may be executed from within the instruction execution environment using the simulator program 505, so that a host computer 515 which does not actually have the hardware features of the apparatus 2 discussed above can emulate those features. The simulator program may include processing program logic 520 to emulate the behaviour of the processing circuitry 4, instruction decode program logic 525 to emulate the behaviour of the instruction decoder 6, and vector register emulating program logic 522 to maintain data structures to emulate the vector registers 12. Hence, the techniques described herein for performing vector gather or scatter operations using capabilities can in the example of
In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
Although illustrative examples of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise examples, and that various changes, additions and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims. For example, various combinations of the features of the dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.