The present technique relates to the field of data processing. More particularly it relates to matrix processing.
Matrix processing operations which generate a two-dimensional matrix as a result matrix can be an important operation in some fields of data processing, for example in machine learning or image processing.
At least some examples provide an apparatus comprising: matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a two-dimensional matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and position shifting circuitry to apply a variable position shift to vary which row or column of the result matrix is updated based on a given element of one of the first and second input operands stored in the operand storage circuitry during a given matrix processing operation, the variable position shift based on one of a plurality of alternative shift amounts selectable for the given matrix processing operation, each alternative shift amount corresponding to a position shift of said one of the first and second input operands relative to the result matrix by a different number of rows or columns.
At least some examples provide an apparatus comprising: means for performing a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a two-dimensional matrix; means for storing information for forming the first and second input operands for the means for performing; and means for applying a variable position shift to vary which row or column of the result matrix is updated based on a given element of one of the first and second input operands stored in the means for storing during a given matrix processing operation, the variable position shift based on one of a plurality of alternative shift amounts selectable for the given matrix processing operation, each alternative shift amount corresponding to a position shift of said one of the first and second input operands relative to the result matrix by a different number of rows or columns.
At least some examples provide a data processing method comprising: performing a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a two-dimensional matrix and the first and second input operands are dependent on information stored in operand storage circuitry; and during a given matrix processing operation, applying a variable position shift to vary which row or column of the result matrix is updated based on a given element of one of the first and second input operands stored in the operand storage circuitry, the variable position shift based on one of a plurality of alternative shift amounts selectable for the given matrix processing operation, each alternative shift amount corresponding to a position shift of said one of the first and second input operands relative to the result matrix by a different number of rows or columns.
Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings, in which:
Two-dimensional (2D) convolution operations are a popular operation in the field of machine learning, particularly for neural networks. 2D convolutions can also be used for other purposes such as applying filters to images. In a 2D convolution operation, a kernel is provided to define the filter or other operation to be applied. The kernel is applied to one or more input channels which each comprise a matrix typically of greater size than the kernel. In the 2D convolution operation, for a given output element position within an output matrix, the value for the given output element position depends on a sum of products of respective pairs of kernel values and input channel values. For each output matrix position the selection of the input channel values to multiply with the corresponding kernel values is different. For a given output element position, the kernel values that are multiplied with the corresponding input matrix elements are those which are aligned in position when the kernel is logically positioned so that the central kernel element is over the element of the input matrix that corresponds in position to the given output element position. Examples of 2D convolution are described further below.
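For illustration, the sum-of-products relationship described above can be modelled in software as follows. This is a minimal sketch (the function name and zero padding at the matrix edges are illustrative choices, not part of the present technique), assuming an odd-sized kernel whose central element defines the alignment:

```python
def conv2d(inp, kernel):
    """Naive 2D convolution producing an output the same size as the input.

    For each output position (i, j), sums products of kernel values with the
    input elements aligned under the kernel when its central element is
    logically positioned over input element (i, j). Positions outside the
    input matrix contribute zero (zero padding).
    """
    h, w = len(inp), len(inp[0])
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2  # coordinates of the central kernel element
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for ki in range(kh):
                for kj in range(kw):
                    ii, jj = i + ki - ch, j + kj - cw
                    if 0 <= ii < h and 0 <= jj < w:  # inside the input matrix
                        acc += kernel[ki][kj] * inp[ii][jj]
            out[i][j] = acc
    return out
```

For example, a kernel whose only non-zero value is a one at the central position leaves the input unchanged, which is a convenient sanity check on the alignment convention.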
One reason why 2D convolution operations are relatively complex to implement in data processing is that they may require calculation of sums of products of a number of pairs of kernel and input elements for many different combinations of the kernel values and input elements, including adding products involving input matrix elements which may not be stored at adjacent addresses within a memory address space. Hence, a typical approach for performing 2D convolutions is to perform (prior to the sum-of-product calculations themselves), some remapping (rearrangement) operations to remap the data stored for the input matrix in memory, so as to generate a number of bespoke data structures which correspond to the values to be operated on for each respective kernel position of the kernel. However, this remapping involves many instances of copying data from one memory location to another, which incurs extra latency and wastes memory space. Hence, it may be desirable to find a way of implementing 2D convolution so that the operations required can be applied directly based on the layout of the input channel data within the memory space without needing such remapping.
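The remapping described above is similar in spirit to the "im2col"-style rearrangement used by some machine learning libraries. A minimal software sketch (illustrative only; zero is used for out-of-bounds positions) shows why this wastes memory: each input element is copied approximately once per kernel position.

```python
def im2col(inp, kh, kw):
    """Build, for each of the kh*kw kernel positions, a flat copy of the
    input elements that kernel position would multiply (zero for positions
    outside the input). Every input element is duplicated roughly kh*kw
    times, which is the copying and memory overhead that the present
    technique seeks to avoid."""
    h, w = len(inp), len(inp[0])
    ch, cw = kh // 2, kw // 2  # central kernel element coordinates
    cols = []
    for ki in range(kh):
        for kj in range(kw):
            col = []
            for i in range(h):
                for j in range(w):
                    ii, jj = i + ki - ch, j + kj - cw
                    col.append(inp[ii][jj] if 0 <= ii < h and 0 <= jj < w else 0)
            cols.append(col)
    return cols
```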
In the examples below an apparatus has matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a two-dimensional matrix. The first and second input operands do not themselves need to be two-dimensional and in some examples may be one-dimensional vectors, although other examples could apply the matrix processing operation to two-dimensional input operands. Operand storage circuitry is provided to store information forming the first and second input operands for the matrix processing circuitry. Masking circuitry performs a masking operation to mask at least part of the matrix processing operation or the information stored to the operand storage circuitry based on masking state data indicative of one or more masked row or column positions to be treated as representing a masking value. The masking state data could be defined as an operand of the matrix processing instruction which instructs the matrix processing circuitry to perform the matrix processing operation, or may be some stored state data which is configured separately and is not explicitly referenced by the matrix processing instruction.
By providing masking based on masking state data indicative of masked row/column positions, this enables the matrix processing to skip certain rows or columns of input data, which can be particularly useful for 2D convolution operations. The masking circuitry could perform the masking operation either at the time of loading operands into the operand storage circuitry, or at the time of performing the matrix processing operation itself, or both on loading the operand storage circuitry and on performing the matrix processing operation.
This approach helps to support more efficient 2D convolution operations. For example, the 2D convolution operation may be split (by software) into a number of separate 1×1 convolution operations which apply kernel value(s) from a single kernel position within a larger kernel matrix to a number of input matrix elements of a given input channel, and update respective elements within an output matrix based on the result (in some cases multiple channels of such 1×1 convolution processing could be done in parallel). Such 1×1 convolutions would allow the operation for a given kernel position to be applied without needing remapping of the structure in memory, with successive results of 1×1 convolutions for different kernel positions being accumulated together (with an appropriate shift of the output matrix elements being updated relative to the input matrix elements used to calculate those outputs, to account for which kernel position is being applied), so that after performing the 1×1 convolutions for each kernel position the result is equivalent to the result of the 2D convolution.
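The decomposition described above can be modelled as follows. This sketch (illustrative only) computes a 2D convolution as an accumulation of per-kernel-position 1×1 convolutions, each applying a single kernel value across the whole input with a shift of the updated output elements relative to the input elements read:

```python
def conv2d_via_1x1(inp, kernel):
    """Compute a 2D convolution as an accumulation of 1x1 convolutions,
    one per kernel position. Each 1x1 convolution applies one kernel value
    to the input matrix, with the shift (di, dj) accounting for which
    kernel position is being applied."""
    h, w = len(inp), len(inp[0])
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2
    out = [[0] * w for _ in range(h)]
    for ki in range(kh):
        for kj in range(kw):
            kval = kernel[ki][kj]
            di, dj = ki - ch, kj - cw  # shift for this kernel position
            for i in range(h):
                for j in range(w):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kval * inp[ii][jj]
    return out
```

After the loop over kernel positions completes, the accumulated result is equivalent to the result of the direct 2D convolution, without any remapping of the input data.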
To support this, it can be useful to provide the masking circuitry which can be controlled, based on the masking state data, to mask out a given row or column position so that the data from some rows/columns of the corresponding input channels can be treated as if it represents a masking value instead of the actual data stored in memory. This is because, when the 2D convolution is split into successive 1×1 convolutions, for most output element positions the correct result for a given 1×1 convolution can be achieved by reading a corresponding input matrix element, multiplying that element by a corresponding kernel value, and writing the result to a corresponding output matrix element (with a shift in position between the relative position of the input matrix element within the input matrix and the relative position of the corresponding output matrix element within the output matrix, that shift being by the same number of element positions for each of the multiplications performed for a given kernel position). However, at the edges of the matrix there are some elements for which this approach would give the wrong result, e.g. due to an element on one edge of the output matrix being updated based on an element at the opposite edge of the input matrix, causing an error referred to below as a ‘wraparound’ error. By providing the masking operation, this allows rows or columns of the input data which should not affect the output to be masked out. Hence, by providing support for masking of rows/columns, this can enable improved performance for 2D convolution operations, which can be important for neural network performance.
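The ‘wraparound’ error can be demonstrated with a small sketch (illustrative only, assuming a row-major flattened buffer and a masking value of zero). With a flat element shift of +1, the last element of each image row would wrongly pick up the first element of the next image row unless the corresponding input column is masked:

```python
def shifted_1x1_flat(flat, w, kval, shift, masked_cols=()):
    """Apply a 1x1 convolution with a flat element shift over a row-major
    flattened buffer of row width w. Without masking, a shift of +1 makes
    each end-of-row output element (wrongly) accumulate the first element
    of the following row: the 'wraparound' error. Input columns listed in
    masked_cols are treated as holding the masking value (zero here)."""
    out = [0] * len(flat)
    for p in range(len(flat)):
        src = p + shift
        if 0 <= src < len(flat) and (src % w) not in masked_cols:
            out[p] = kval * flat[src]
    return out
```

For a 2×3 input stored as [1, 2, 3, 4, 5, 6], an unmasked shift of +1 makes the output element for position (0, 2) pick up the value 4, which belongs to the opposite edge of the next row; masking input column 0 restores the correct per-row behaviour.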
It will be appreciated that the control of which particular rows/columns of a matrix are masked out is controlled by software, so is not a feature of a particular processor implementation. The apparatus provides features which enable software to select the rows/columns to be masked.
When a given row or column of the given operand matrix is indicated as masked by the masking state data, there may be different options for selecting the masking value to be used for that row/column position. For many practical applications it can be useful for the masking value to be zero. This can help to support the skipping of rows to deal with the ‘wraparound’ problem described above where the rows/columns on one edge of the input matrix should be prevented from affecting the calculation of output matrix elements on the opposite edge. Also, the masking value of zero can be useful for enabling padding values to be supplied to be multiplied with kernel elements which are positioned outside the bounds of the input matrix when a padded 2D convolution operation is applied and the kernel is at a position centred near the edge of the input matrix. Hence, in some hardware implementations it may be sufficient that the masking circuitry supports only a fixed masking value to be used for any masked row/column positions, e.g. a masking value of zero.
However, for some applications using 2D convolutions, it may be desired to use padding values other than zero (e.g. if the matrices are represented using a quantization scheme where each value is offset from its true value by a certain number, so that the “zero point” is represented by a numeric value other than zero). To support such operations, it can be useful to provide the ability to select a non-zero value as a masking value. Therefore, in some implementations, in the masking operation, the masking value can be selected from among a plurality of masking values (e.g. zero or another pre-configured value), based on at least one of: a masking value selection parameter specified by the instruction which causes the masking operation to be performed (e.g. a load instruction for loading information to the operand storage circuitry, or a matrix processing instruction for controlling the matrix processing circuitry to perform the matrix processing operation); a control value stored in a control register; and a masking vector specifying separate masking values for a plurality of elements of a masked row/column. With the last option, the masking vector could be read from a vector register.
The masking state data may have an encoding identifying, within a two-dimensional array of elements, elements to be treated as representing the masking value. Hence, the masking state data may (fully or partially) identify positions of masked elements across two dimensions. Providing state data which can apply masking in two dimensions can be useful for dealing with a number of issues involved in 2D convolution processing, including the “wraparound” error problem discussed above, the fact that at the tail of a loop there may be a number of unused “out of bounds” elements which extend beyond the end of the data structure to be processed, and providing support for the “position shifting” feature described in more detail below.
For example, the masking state data could specify first masking state data indicative of one or more masked rows or column positions for which all elements in the masked row or column position are to be treated as representing the masking value, and second masking state data indicative of whether individual element positions within a given row or column are to be masked or not. The masking out of entire rows or columns using the first masking state data can be useful for dealing with the “wraparound” error and/or “out of bounds” rows/columns in a first dimension, and the individual masking of particular elements within a not-fully-masked row or column can be useful for supporting “out of bounds” columns/rows in a second dimension and/or the position shifting feature described below (or for more general per-element predication). The first masking state data may comprise a set of elements identifying the masked/non-masked row/column positions in one dimension (row or column), while the second masking state data may comprise a set of elements identifying masked/non-masked positions in the orthogonal dimension (column or row). In some cases, the second masking state data may specify the individual indications of masked/non-masked elements only for a single row/column, as the same set of second masking state data could be shared across rows/columns (or if different patterns of masked/non-masked elements are needed for different rows/columns, then the second masking state data could be adjusted between processing one row/column and the next).
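The combination of first and second masking state data described above can be modelled as follows (a minimal sketch; the representation as Python lists of flags is illustrative only). An element is treated as masked if either its whole row is masked by the first masking state data or its element position within the row is masked by the second masking state data, which is shared across rows:

```python
def effective_mask(row_mask, elem_mask):
    """Combine first masking state data (whole masked rows) with second
    masking state data (individual element positions within a row, shared
    across rows) into a 2D mask: element (r, c) is masked if its row is
    masked or its column position is masked."""
    return [[bool(rm) or bool(em) for em in elem_mask] for rm in row_mask]
```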
The masking state data may have an encoding capable of indicating, as masked row or column positions, at least two non-adjacent row or column positions separated by at least one non-masked row or column position. This recognises that when a 2D convolution is split into a number of 1×1 convolutions then there may be a number of non-adjacent row or column positions that need to be masked to prevent the input values on one edge of the input matrix affecting the output values at the opposite edge of the output matrix. Also, the locations to be padded for padded 2D convolutions may not correspond to contiguous addresses in memory.
The masking state data can be represented in a number of different ways. In general the masking state data may be any set of information which can indicate which row/column positions within a matrix structure are to be masked. One approach can be that the masking state data (e.g. the first masking state information described above) comprises a number of masking state indicators each corresponding to a respective row or column position of a given operand matrix and indicating whether the corresponding row or column position is a masked row or column position. For example the masking state data could include a bitmap where each bit corresponds to a given row or column position and is set to one value if that row or column position is to be masked and to another value if that row or column position is to remain unmasked. Similarly, the second masking state data may comprise a second bitmap indicating the masked element positions within a particular row/column.
It is not necessary for the masking state data to distinguish whether it refers to respective rows of the given operand matrix or to respective columns of the given operand matrix. Different software applications may choose different layouts for a matrix within memory (e.g. row-major or column-major), but the format of the masking state data may be the same regardless.
The operand storage circuitry can be implemented in different ways. In some examples the operand storage circuitry may comprise a set of input registers from which the first and second operands can be read when performing a given matrix processing operation.
However, it can be useful to provide, as part of the operand storage circuitry, matrix transposition circuitry which comprises a number of storage units to store respective matrix elements of a given operand matrix. The storage units of the matrix transposition circuitry may be readable in row groups corresponding to rows of the given operand matrix, and may also be readable in column groups corresponding to columns of the given operand matrix. Providing such matrix transposition circuitry can be very helpful in dealing with the fact that different machine learning algorithms may use different layouts to store the input channel data within memory. For example, some algorithms may use a row-major layout in memory where the offset between the memory addresses of adjacent elements of the same row of the matrix is smaller than the offset between the memory addresses of adjacent elements in the same column of the given operand matrix. Other algorithms may use a column-major layout where the offset between the addresses of adjacent elements in the same column is smaller than the offset between adjacent elements within the same row. The matrix transposition circuitry enables on-the-fly remapping of whether a row-major or column-major format is used, since, if the given operand matrix is written to the matrix transposition circuitry in row groups, it can be read out from the matrix transposition circuitry in column groups, or vice versa, so that the subsequent matrix processing operations can assume a consistent format regardless of whether the data for the input matrix stored in memory is row-major or column-major. This can simplify code development and avoids the need for remapping or rearrangement of data within the memory storage itself.
Note that the storage units of the matrix transposition circuitry do not need to be physically arranged in rows and columns. It is sufficient that the storage units of the matrix transposition circuitry are logically readable in groups of storage elements corresponding to rows or in groups corresponding to columns. For example, the matrix transposition circuitry can be implemented as a set of registers which have multiple read/write ports so that portions of the registers can be addressed in different combinations. For example, if each register stores a row group, a column group may be considered to be formed by a set of portions of data (the set comprising one portion per register, at corresponding positions within each register). Alternatively, the opposite mapping may be used where each column group maps to one register and a row group is a stripe of portions of data within corresponding positions in each register. Also, note that it is not essential that “rows” of a matrix stored in memory are written into “row groups” of the matrix transposition circuitry — while this is possible, such rows of the matrix could equally well be written into “column groups” of the matrix transposition circuitry. Hence, the “row groups” and “column groups” of the storage units in the matrix transposition circuitry refer to orthogonal groupings by which the storage units of the matrix transposition circuitry can be read, but do not need to conform to the same row/column direction as the matrices in memory. In fact, to improve pipelining of reads/writes for the matrix transposition circuitry it can sometimes be useful to alternate the choice of whether successive groups of lines (either rows or columns) of an input matrix are written into the matrix transposition circuitry in row groups or column groups.
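The behaviour of the matrix transposition circuitry can be modelled in software as follows (a sketch of the logical behaviour only, not of any hardware implementation). Writing a matrix in row groups and reading it back in column groups yields the transpose, which is what enables on-the-fly conversion between row-major and column-major layouts:

```python
class MatrixTranspositionBuffer:
    """Software model of storage units writable and readable either in
    row groups or in column groups. The two groupings are orthogonal views
    of the same n x n array of storage cells."""

    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    def write_row_group(self, i, values):
        self.cells[i] = list(values)

    def write_column_group(self, j, values):
        for i, v in enumerate(values):
            self.cells[i][j] = v

    def read_row_group(self, i):
        return list(self.cells[i])

    def read_column_group(self, j):
        return [self.cells[i][j] for i in range(self.n)]
```

For example, writing the rows [1, 2] and [3, 4] as row groups and then reading column group 0 yields [1, 3], i.e. the first column of the written matrix.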
Hence, when loading data to the matrix transposition circuitry, load circuitry may select whether to load at least one row group or at least one column group of storage units of the matrix transposition circuitry based on a portion of the matrix data structure in memory. The selection of whether to load at least one row group or at least one column group may be based on one or both of: row/column direction selection information specified by the load instruction; and row/column direction selection information stored in a control register which is updatable in response to a row/column direction switching instruction. Some implementations could use only one of these options to determine whether to load a row group or a column group (either information specified by the load instruction, or information specified in the control register).
Alternatively, an implementation could combine both of these pieces of information. For example, the control register bit could indicate either row mode or column mode, but a bit in the load instruction could indicate whether or not the meaning of the stored bit should be inverted (so that for load instructions with the “inverted” bit set, the instruction will load a row when the stored bit indicates a column and will load a column when the stored bit indicates a row).
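The combination of the stored direction bit and the instruction's “inverted” bit described above amounts to an exclusive-OR, as the following sketch illustrates (the function name and return values are illustrative only):

```python
def load_direction(control_reg_row_mode, instr_invert_bit):
    """Resolve the row/column load direction by combining the control
    register's row-mode bit with the load instruction's 'inverted' bit:
    the instruction bit flips the meaning of the stored bit (an XOR)."""
    row = bool(control_reg_row_mode) ^ bool(instr_invert_bit)
    return "row" if row else "column"
```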
Similarly, on reading data out from the matrix transposition circuitry to supply an operand for a matrix processing operation (or to transfer information to operand registers from which operands may subsequently be obtained for a matrix processing operation), row/column direction selection information could specify whether to read a row group or a column group of the matrix transposition circuitry (again that selection information could be specified by an instruction and/or in a control register, with the option to use both combining the row/column direction bit in a register and the “inverted” bit in the instruction for store instructions similar to load instructions as described above).
The masking operation based on the masking state data could be performed at different times relative to the loading of operands for matrix processing and the processing of matrix processing operations themselves.
In some implementations, the matrix processing circuitry may comprise the masking circuitry. The masking circuitry of the matrix processing circuitry may be responsive to the masking information to perform the matrix processing operation with a portion of one of the first and second operands corresponding to the one or more masked row or column positions treated as representing the masking value instead of an actual value of the portion of said one of said first and second operands stored in the operand storage circuitry. Hence, although the actual data from the input channels can be loaded from memory to the operand storage circuitry as normal, replacement of such input data with a masking value to provide padding or to avoid the wraparound errors described above can be controlled by masking the data read from the operand storage circuitry on input to the matrix processing circuitry. This approach can be particularly useful for implementations which also support the option to apply variable position shifting as discussed further below.
In some implementations, the masking circuitry may be comprised by load circuitry which is responsive to a load instruction to load information corresponding to a target row or column of a given operand matrix to the operand storage circuitry based on a portion of a matrix data structure stored in memory. In this case, when the target row or column corresponds to a masked row or column position indicated by the masking state data, the load circuitry may load a portion of said operand storage circuitry corresponding to the target row or column with data having the masking value instead of data based on the portion of the matrix data structure stored in memory. With this approach, the masking can be applied at the point of loading the operands from memory, which avoids unnecessary loading of matrix elements which will be masked anyway. Out of bounds data (corresponding to addresses beyond the end of a data structure to be processed, which are referenced by a load instruction in a final iteration of a loop because the amount of data to be processed is not an exact multiple of the amount of data that can be processed in one iteration) can also be masked using the masking circuitry, preventing it from being loaded and hence preventing address faults from being raised by accesses to addresses which might be invalid.
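The load-time masking described above can be sketched as follows (a software model only; the per-row offsets and boolean mask representation are illustrative choices). Rows flagged as masked are filled with the masking value and their memory accesses are skipped entirely, so no address fault can arise for a masked row:

```python
def masked_load(memory, base, offsets_per_row, row_size, mask, masking_value=0):
    """Model of load circuitry filling operand storage one row at a time.
    Rows flagged in `mask` are filled with the masking value; their memory
    accesses are skipped, avoiding faults on potentially invalid addresses."""
    operand = []
    for r, off in enumerate(offsets_per_row):
        if mask[r]:
            operand.append([masking_value] * row_size)  # no memory access made
        else:
            addr = base + off
            operand.append(memory[addr:addr + row_size])
    return operand
```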
Some hardware implementations could support both types of masking, which could be useful as, for example, padding and masking of out of bounds data may be more efficiently handled by masking at the point of loading, but if variable position shifting is supported then dealing with the “wraparound” errors of the type discussed above may require masking at different input rows/columns for different instances of reading the same set of input data, in which case applying the masking at the point of reading the operand storage circuitry to perform a particular matrix processing operation can be more effective. Hence, to provide greatest flexibility, some implementations may support both types of masking.
For those implementations which provide load circuitry comprising the masking circuitry to apply masking at the point of loading operand data from memory, when the masking state data corresponding to the target row or column indicates that the target row or column corresponds to a masked row or column position, the load circuitry may determine whether each of the matrix elements of the target row or column should be masked, based on an item of masking state data shared between the two or more matrix elements of the target row or column. Hence, it is not necessary to provide individual masking state for each individual element within the target row or column (although this would be possible if desired, as described above with the example of the second masking state data providing 2D masking). For the purpose of supporting the “split into 1×1 convolutions” approach to handling 2D convolutions, a common memory layout for input channel data is to group the input elements at the same x-y position for multiple input channels together in a contiguous block of memory, in which case it may be that the masking can be applied to an entire row or column of the input matrix structure defining the input data for each of those input channels. This means it can be sufficient to share an item of masking state data among a whole row or column of an operand matrix being processed.
For the load masking example, the masking state data could be represented using a set of masking state indicators (e.g. a bitmap) as discussed above.
However, another approach may be that the masking state data comprises a number of offset values each corresponding to a respective row or column position of the given operand matrix and indicating an offset of an address of a corresponding portion of a matrix data structure in memory relative to a base address. In this case, a masked row or column position may be indicated by the offset value for the masked row or column position having a predetermined reserved offset value. This approach can be useful because it means that the masking state data can be represented using part of the addressing information used to identify the memory addresses from which portions of the matrix data structure in memory should be loaded. Hence, for each respective row or column position, the base address and the corresponding offset value for that row or column position can be used to identify the address in memory from which a portion of the matrix data structure should be loaded when the offset value does not have the predetermined reserved offset value. However, if the offset value for a given row or column position has the predetermined reserved offset value then instead of loading in the corresponding portion of the matrix data structure in memory, the masking value may be written to the portion of the operand storage circuitry which would otherwise store the portion of the matrix for that row or column. Hence, this approach avoids the need to provide separate masking state data beyond state data used for addressing of the matrix data structure in memory. The predetermined reserved offset value could be any reserved value that is designated as not being allowed to be used for real offset values, such as −1 (e.g. in signed binary representation, a value where all offset bits are 1).
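The reserved-offset approach described above can be sketched as follows (illustrative model only, using −1 as the predetermined reserved offset value). The offsets then serve both as addressing information and as masking state, with no separate masking bitmap needed:

```python
RESERVED = -1  # predetermined reserved offset value (all-ones in signed binary)

def load_with_offsets(memory, base, offsets, row_size, masking_value=0):
    """Model of load circuitry where per-row offset values double as
    masking state: a row whose offset equals the reserved value is filled
    with the masking value instead of being loaded from memory; otherwise
    the row is loaded from address (base + offset)."""
    rows = []
    for off in offsets:
        if off == RESERVED:
            rows.append([masking_value] * row_size)  # masked row, no load
        else:
            rows.append(memory[base + off: base + off + row_size])
    return rows
```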
In one example the masking state data may be stored within at least one masking state register provided within the processing apparatus. For example, there may be certain instructions for writing masking state data to the masking state register(s), prior to executing load instructions for loading portions of the operand matrix under control of the masking state data.
The masking state register could be a dedicated register provided specifically for controlling masking when performing matrix processing and/or loading operands for the matrix processing.
In other examples, the at least one masking state register could comprise at least one predicate register. In response to a vector instruction (or single instruction multiple data instruction) for controlling processing circuitry to perform vector processing using one or more vector operands comprising a one-dimensional array of elements, the vector predicate register can be read to provide a predicate value which controls whether respective lanes of vector processing are masked. Hence, the same register(s) could be shared between indicating vector predicates for vector operations and indicating the masking state data for matrix operations.
At least one masking state addressing register may be provided to store masking state addressing information which identifies locations in memory from which the masking state data can be obtained. For example, when the masking state data is represented using a set of offset values as discussed above, the set of offset values could be stored in memory, and the masking state addressing information in the masking state addressing register could identify where that set of offset values is stored in memory. This approach could reduce the number of registers which are architecturally required to be provided for supporting the matrix processing, which may be preferred for some lower power micro-architectural implementations.
Nevertheless, even if it is not architecturally required to provide registers for storing the masking state information itself (as those micro-architectures which do not wish to provide dedicated hardware for storing this information can instead load it when required from memory), some micro-architecture designers may nevertheless choose to provide a masking state cache to cache the masking state data obtained from memory so that it can be accessed more quickly for future accesses, to help improve performance. This can be useful because it may be that the pattern of masked/unmasked rows/columns may be the same for a number of matrix operations, so caching can save a significant number of memory accesses.
Regardless of the form of the masking state data, the load circuitry may determine a target address of the portion of the matrix data structure in memory based on addressing information, which could be defined in various ways. The addressing information could be obtained from a register explicitly referenced by the instruction which causes the load to be performed, or could be obtained from a default register implicitly referenced for the load instruction.
In one example, the addressing information could comprise a set of address pointers, where each address pointer indicates an address of a portion of the matrix data structure corresponding to a respective row or column position of the given operand matrix.
In another example, the addressing information may comprise a base address of the matrix data structure stored in memory and offset information for determining an address of the portion of the matrix data structure corresponding to a given row or column of the given operand matrix relative to the base address. While in some examples this offset information may be represented using the same set of offset values as used for the masking state data, this is not essential and in other examples the offset information may be separate from the masking state data. The offset information could be represented in different ways, e.g. using a stride value which indicates a difference between an address of the portion of the matrix data structure corresponding to one row or column of the given operand matrix and an address of the portion of the matrix data structure corresponding to the next row or column of a given operand matrix, or by explicitly recording the offset for multiple rows/columns in an offset data structure as described earlier. The use of a stride value avoids the need to explicitly encode each separate offset value for the respective rows, but the use of a more explicit offset data structure allows the masking state to be represented in the same structure as the offsets and would permit processing of a matrix with an irregular pattern of memory accesses for the respective rows/columns. Either way, representing the addresses using offset information relative to a base address can allow the addressing information to be represented using fewer bits than if the addressing information indicated the absolute addresses corresponding to each row/column position of the given operand matrix.
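As an illustrative software model of the addressing options described above (the function and parameter names here are assumptions for illustration, not part of any apparatus), the address of each row may be derived either from a single stride or from an explicit list of per-row offsets, both expressed relative to a base address:

```python
def row_addresses(base, num_rows, stride=None, offsets=None):
    """Model address generation for the rows of an operand matrix.

    Either a single stride (regular layout) or an explicit list of
    per-row offsets (irregular layout) may be supplied; both express
    the row addresses relative to a base address.
    """
    if offsets is not None:
        return [base + off for off in offsets]
    return [base + r * stride for r in range(num_rows)]
```

For example, `row_addresses(0x1000, 4, stride=0x40)` yields four addresses 0x40 bytes apart, whereas supplying an explicit offset list permits an irregular pattern of accesses for the respective rows, at the cost of encoding one offset per row.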
In some examples the addressing information could also include further information which provides sub-portion selection information to select which sub-portion of the portion of the matrix data structure in memory identified based on the addressing information is to be loaded to the operand storage circuitry when loading a given target row or column. This recognises that, given limitations on the maximum size of matrices which can be processed in hardware, when processing input matrices of a larger size then the operation may need to be split into a number of sub-operations each acting on a smaller portion of the input matrix. As the layout of matrix data in memory may include rows or columns of a greater size than the block of matrix data to be operated on by a given set of matrix processing instructions, the sub-portion selection information can be used to narrow down which sub-portion of a row or column should be processed for a given operation.
Hence, there are a number of options for representing the addressing information which identifies the location in memory for which a given target row or column is to be loaded. At least one addressing register may be provided to store the addressing information. Prior to executing load instructions or matrix processing instructions, the program being executed may load the at least one addressing register with the appropriate addressing information for selecting the portion of the matrix data structure to be processed.
In some implementations, prefetch circuitry can be provided to generate prefetch requests for prefetching portions of the given operand matrix from memory depending on the addressing information stored in the at least one addressing register. For example, if the addressing information includes an array of offset values, then while loading earlier rows or columns of the given operand matrix, the prefetch circuitry could look ahead and start prefetching data based on the offsets for later rows/columns, so that performance is improved. Alternatively, other micro-architectures may prefer not to provide the prefetch circuitry, to save power and circuit area.
For some implementations, the first and second input operands for the matrix processing operation may be two-dimensional matrix operands. For example, the matrix processing circuitry may support a full matrix multiply operation being performed in a single instruction, which can be beneficial for performance. However, this approach may be more expensive in terms of power consumption and in circuit area.
Hence, other implementations may prefer to provide matrix processing circuitry which supports performing the matrix processing operation on one-dimensional vector operands to generate a two-dimensional result matrix. For example the matrix processing operation may comprise an outer product operation applied to the 1D vector operands to generate the 2D result matrix. This recognises that in practice a matrix multiplication operation applied to two 2D matrix operands to generate a 2D result matrix can be decomposed into a number of separate outer product operations which are applied to respective combinations of individual rows/columns of the input matrix operands, with the results of the outer product operations being accumulated together to generate the end result equivalent to the 2D matrix multiply result. Hence, it can be particularly useful for the outer product operation to comprise an outer-product-and-accumulate operation, for which the result matrix comprises updated values for respective elements of an accumulator matrix, where the updated value for a given element of the accumulator matrix corresponds to a result of adding a previous value of that given element of the accumulator matrix to a corresponding element of an outer-product result matrix corresponding to a result of performing the outer product operation on the first and second input operands represented as one-dimensional vectors. This operation can be useful for supporting the 2D convolution operations discussed above.
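The decomposition described above can be sketched as follows (a minimal NumPy model with illustrative names; not a description of any particular hardware): a matrix multiplication is built up as a series of outer-product-and-accumulate steps, one per column/row pair of the input operands, with each step generating a full 2D contribution to the accumulator matrix:

```python
import numpy as np

def matmul_via_outer_products(a, b):
    """Compute a @ b by accumulating one outer product per inner
    index, mimicking a series of outer-product-and-accumulate
    matrix processing operations on 1D vector operands."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    acc = np.zeros((m, n), dtype=a.dtype)
    for i in range(k):
        # One step: 1D column i of a, 1D row i of b, 2D result
        # accumulated into the running accumulator matrix.
        acc += np.outer(a[:, i], b[i, :])
    return acc
```

Each loop iteration corresponds to one outer-product-and-accumulate instruction: the inputs are one-dimensional vectors, but the result updated per step is the full two-dimensional accumulator matrix.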
The matrix processing circuitry may generate the result matrix as a two-dimensional matrix based on the first and second input operands, in response to a single instruction. Hence, even if a matrix multiply operation is split into multiple instructions performing separate outer product operations with each outer product operation acting on one-dimensional vector operands, each individual outer product operation may nevertheless generate a two-dimensional result matrix. This may provide improved performance compared to approaches which use vector processing circuitry to perform a series of vector operations equivalent to a matrix operation, where each vector operation processes 1D vector operands to generate a 1D vector result.
An example apparatus has matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a 2D matrix. Operand storage circuitry stores information for forming the first and second input operands for the matrix processing circuitry. Position shifting circuitry is provided to apply a variable position shift to vary which row or column of the result matrix is updated based on a given element of one of the first and second input operands stored in the operand storage circuitry during a given matrix processing operation. The variable position shift is based on one of a number of alternative shift amounts selectable for the given matrix processing operation. Each alternative shift amount corresponds to a position shift of the one of the first and second input operands relative to the result matrix by a different number of rows or columns.
The position shifting circuitry is useful for supporting the approach where 2D convolution operations are decomposed into a number of separate 1×1 convolutions accumulating into a result matrix. The inventor recognised that in such a series of 1×1 convolutions, the 1×1 convolution operations corresponding to a number of adjacent kernel positions require very similar input data, but with a relative shift of one or more row/column positions between the inputs for the respective kernel positions. Hence, by providing circuitry to apply a variable row/column position shift of the input to a given matrix processing operation relative to the output, this means that the same operand data loaded from memory can act as inputs for the matrix processing operations for a number of different kernel positions during the series of 1×1 convolutions implementing the 2D convolution operation, which can reduce the number of load operations needed to load data from memory for performing a given 2D convolution operation.
As discussed above, while some implementations could implement full matrix multiplication operations, for limiting the hardware costs other implementations may implement the matrix processing operation as an outer product operation applied to one-dimensional vector operands as the first and second input operands, to generate a two-dimensional result matrix. Hence, in this case the variable position shift may vary which row or column of the result matrix is updated based on a given element within one of the first and second input vector operands. Again, for similar reasons to those discussed above it can be particularly useful for the matrix processing operation to be an outer-product-and-accumulate operation where the result matrix comprises updated values for respective elements of an accumulator matrix, formed based on a previous value for the accumulator matrix and the corresponding elements generated for the outer-product result. This operation can be useful for supporting the 1×1 convolution approach to handling 2D convolutions.
The position shifting circuitry may select between the respective alternative shift amounts based on a parameter specified by a matrix processing instruction for controlling the matrix processing circuitry to perform the matrix processing operation. In some implementations, the parameter identifying the shift amount could be part of the opcode of the matrix processing instruction, so that a number of different opcodes may be allocated for the respective shift amounts, each corresponding to the same type of matrix processing operation (other than having a different shift amount). Alternatively a separate parameter in the instruction encoding could be defined, e.g. a shift amount selection field separate from the opcode identifying the particular matrix processing operation to be performed. The parameter for selecting the shift amount could be represented as an immediate value within the instruction encoding, or could be identified within a register specified by the matrix processing instruction.
Alternatively, in some implementations a certain dedicated register for storing the shift amount selection parameter could be provided, so that the register read in response to the matrix processing instruction to obtain the shift amount selection parameter is implicit, and so does not need explicit encoding in the instruction encoding.
The matrix processing circuitry may also support predication where certain rows or columns within the result matrix can be identified as active or inactive row or column positions as identified by predicate information accessible to the matrix processing circuitry. Hence, when a given row or column of the result matrix corresponds to an active row or column position indicated by the predicate information, then the matrix processing circuitry may generate elements of the given row or column of the result matrix having values depending on a corresponding row or column of one of the first and second input operands (which row or column is the corresponding row or column depends on the one of the alternative shift amounts selected for that particular matrix processing operation). When the given row or column of the result matrix corresponds to an inactive row or column position indicated by the predicate information, then elements of the given row or column of the result matrix are generated having values independent of the corresponding row or column of one of the first and second input operands. For example when a given row or column of the result matrix is inactive then the corresponding elements may retain their previous values without being updated based on the corresponding row or column of the input operand. By providing the ability to prevent certain rows or columns of the input operands affecting the output, this helps deal with the ‘wraparound’ problem discussed above. This predication may be one example of the masking operation described earlier.
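A simple software sketch of this predication is given below (illustrative only, assuming the predicate is a per-row bit vector and that the operation is an outer-product-and-accumulate on 1D vector operands): rows of the result whose predicate bit is clear retain their previous accumulator values.

```python
import numpy as np

def outer_accumulate_predicated(acc, a, b, row_predicate):
    """Outer-product-and-accumulate in which inactive rows of the
    result (predicate bit clear) are left unchanged, e.g. to stop
    certain input rows from affecting the output."""
    for r in range(acc.shape[0]):
        if row_predicate[r]:
            acc[r, :] += a[r] * b
        # Inactive rows retain their previous accumulator values.
    return acc
```

Here a masked row simply keeps its prior contents, which is one of the behaviours described above for an inactive row position; an implementation could equally substitute a masking value.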
Again, as for the masking examples discussed above, the operand storage circuitry may comprise matrix transposition circuitry which enables reading and writing of storage units of the matrix transposition circuitry either in row groups or in column groups. This helps to support more efficient handling of matrix data structures stored in memory represented either in row-major or column-major form. All of the features discussed above for the matrix transposition circuitry may also be provided when the position shifting example is used.
When the matrix transposition circuitry is provided, then the operand storage circuitry may also comprise operand registers for storing the first and second input operands for the matrix processing operation, separate from the matrix transposition circuitry itself. The operand registers may be the storage circuitry from which the operands for a given processing operation are read in response to a matrix processing instruction for controlling the processing circuitry to perform the matrix processing operation.
A dedicated move instruction could be provided to control operand moving circuitry to read out at least one row or column of the given operand matrix from the matrix transposition circuitry and write the at least one row or column to the operand registers. This may simplify the encoding of a matrix processing instruction because any additional parameters for selecting whether a column or a row is to be read from the matrix transposition circuitry (or for selecting which particular row or column should be read) can be encoded in the move instruction so that less encoding space within the matrix processing instruction needs to be expended on such parameters.
However, another approach would be that operands could be read out from the matrix transposition circuitry in response to a matrix processing instruction and provided directly to the circuit logic for performing the matrix processing operation, without needing to go via a set of operand registers.
While such operand moving circuitry responsive to a move instruction, or the ability to directly read operands from the matrix transposition circuitry were not explicitly described above for the example using masking, these features can also be provided in that example.
Also, the masking functionality described in the earlier section can be combined with the position shifting functionality described above. Hence, even in the position shifting example it is also possible to provide masking circuitry which performs a masking operation based on masking state data as described above.
In fact, it can be particularly useful to combine both the masking functionality on the loads and the position shifting (including the predication applied at the input to the matrix processing operation). One might expect the predication to be redundant where masking on loads is supported, but in fact it can be useful to provide both functionalities. The masking on loads can be used to insert padding values which support padded 2D convolution, while the predication applied at the input to a matrix processing operation provides further masking to prevent certain rows from affecting the output (to deal with the wraparound problem discussed above). The position of the rows affected by the wraparound problem may differ from kernel position to kernel position, so when the position shifting functionality is used to allow multiple kernel positions to be calculated based on a set of data loaded for a single kernel position, the predication based on the predicate value may be used to select the individual rows to be suppressed for each individual kernel position, which would be difficult to handle if such wraparounds were dealt with solely at the point of loading data from memory. Nevertheless, the masking approach can be useful for supplying the padding values.
Nevertheless, in the earlier described examples, if the position shifting is not supported then the masking at the point of carrying out a load operation can be sufficient to deal with a wraparound problem if performing a separate load for each kernel position, or alternatively masking on loads may not be supported at all and instead masking/predication may be applied at the time of performing a matrix processing operation.
Again, as for the masking example, the result matrix generated for the matrix processing operation may be a two-dimensional result matrix generated from the first and second input operands in response to a single instruction, so does not require separate processing of individual vector instructions each generating a one-dimensional vector result.
In the 2D convolution operation, for each output element within the output matrix, the kernel is centred on the element of the input matrix at the corresponding position to the output element being generated, and the output element is generated with a value corresponding to the sum of the products of the respective kernel elements and input matrix elements which are at corresponding positions relative to the centred kernel. For example, for output matrix element F′ which corresponds in position to input element F, the value for F′ is generated by multiplying respective pairs of input and kernel elements which are at the corresponding positions assuming that the central kernel element K5 is positioned over the input element F corresponding to the output position F′. Hence, F′=A*K1+B*K2+C*K3+E*K4+F*K5+G*K6+I*K7+J*K8+K*K9.
Similarly, for each other matrix element within the output matrix, the element is generated based on a sum of products but with the kernel over a different element of the input matrix. For example for output element G′, the kernel matrix has its central element K5 over input matrix element G, which means that the sum of products is G′=B*K1+C*K2+D*K3+F*K4+G*K5+H*K6+J*K7+K*K8+L*K9. Similar operations are performed for generating the output elements J′ and K′.
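The sum-of-products rule described above corresponds to the following sketch of an unpadded ("valid") 2D convolution, in which the kernel is only centred on positions where it fits entirely inside the input matrix (an illustrative NumPy model using the same correlation convention as the examples above, i.e. no kernel reflection):

```python
import numpy as np

def conv2d_valid(inp, kernel):
    """Unpadded 2D convolution: each output element is the sum of
    products of the kernel with the input window it overlaps."""
    kh, kw = kernel.shape
    oh = inp.shape[0] - kh + 1
    ow = inp.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=inp.dtype)
    for r in range(oh):
        for c in range(ow):
            # Sum of products over the overlapped kh x kw window.
            out[r, c] = np.sum(inp[r:r+kh, c:c+kw] * kernel)
    return out
```

For a 4×4 input and a 3×3 kernel this yields a 2×2 output, matching the four inner output elements F′, G′, J′, K′ computed above.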
As shown in
For the calculations when the kernel is centred on one of these outer element positions, then the kernel elements which would sit outside the input matrix are multiplied with padding values (PV). For example, for the calculation for generating output element A′, this would require the central kernel position K5 to sit over element A of the input matrix, and so while there are valid input values for positions A, B, E, F in the input matrix corresponding to kernel elements K5, K6, K8, K9, the other kernel elements K1, K2, K3, K4, K7 are multiplied with padding values when generating the sum of products to generate the new value for output matrix A′.
Similarly, for other elements around the boundary of the output matrix, the padding values will be in different positions relative to the kernel, depending on the edge of the input matrix at which that kernel is overlapping. For example, for output position L′ the padding values will be needed for the right hand column of the kernel K3, K6, K9 as these are the positions which would extend outside the input matrix when the kernel is centred over position L. Similarly, for output element N′ then kernel position K5 will be centred on position N and so this means that the bottom row of kernel positions K7, K8, K9 extends outside the input matrix and so requires padding.
In one example, the padding value could simply be zero. However, some 2D convolution operations may require other types of padding values. For example in some cases a quantization scheme could be used where an offset is applied to the true values of the matrix when generating the stored numeric values for each matrix element, so that ‘zero’ may actually be represented using a non-zero numeric value. In this case, the padding value may be a non-zero value representing the zero point. The padding values may be set based on averaging of other elements within the input matrix. The precise rules for setting the padding values may depend on the particular application being performed. Hence, it can be useful to support the ability to select between the number of alternative types of padding value (e.g. based on a control register and/or a parameter specified by a matrix processing instruction).
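A minimal sketch of a padded 2D convolution with a selectable padding value (for example a non-zero quantization zero point) might look like the following; the function and parameter names are illustrative assumptions:

```python
import numpy as np

def conv2d_padded(inp, kernel, pad_value=0):
    """Same-size 2D convolution in which kernel elements falling
    outside the input are multiplied by a selectable padding value
    (e.g. a non-zero zero point) rather than always by zero."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Surround the input with the chosen padding value.
    padded = np.full((inp.shape[0] + 2 * ph, inp.shape[1] + 2 * pw),
                     pad_value, dtype=inp.dtype)
    padded[ph:ph + inp.shape[0], pw:pw + inp.shape[1]] = inp
    out = np.zeros_like(inp)
    for r in range(inp.shape[0]):
        for c in range(inp.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out
```

Selecting `pad_value` models the ability, mentioned above, to choose between alternative padding types per operation (e.g. via a control register or instruction parameter).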
While not shown in the example of
Unpadded and padded 2D convolution operations can be useful for a range of processing applications. For example, 2D convolutions can be useful for applying filters to images, for example for blurring, sharpening, edge detection, etc. The kernel applied may be selected based on the type of filter desired, and may have particular values for the kernel elements which will bring out some features such as edges. Effectively the kernel may slide over each successive image pixel and apply an operation to generate a new value for an output pixel based on that pixel and a number of surrounding pixels using the relationship defined by the kernel.
Another type of processing which may include 2D convolutions is in the field of machine learning, for example in implementing neural networks. For example, a neural network trained to detect features within image data could be implemented using a set of kernels which are applied to the image data in 2D convolution operations. More generally, feature maps representing some data to be processed can be processed with kernels in order to make inferences about the data.
As shown in
In this example, the number of output channels OC is equal to the number of input channels IC, but this is not essential. Other examples could have different numbers for IC and OC. Also, the 2D convolution shown in
When 2D convolutions are to be applied to a number of input channels then there may be a number of choices for the layout used to store the data of the input channels within memory.
Hence, when referring to
It will be appreciated that while for ease of understanding
The NHWC memory layout shown in
Regardless of the particular memory layout selected for a given application, one problem with the 2D convolution approach is that the elements which are required for combining with the kernel elements for generating a given output element within the output matrix may not be within contiguous memory addresses within the memory address space. For example, for calculating the top left output position A′ in the padded 2D convolution of
Similarly, for each other output position within the output matrix, a different row 2 is generated by gathering together the respective input elements needed to generate that output position. Hence, this requires OH*OW rows 2 of additional data to be generated where each row comprises KH*KW*IC elements. While this may generate a lot of overhead in extracting the respective subsets of elements from the data stored in memory and copying them elsewhere in memory to generate the rows, this can greatly simplify the subsequent 2D convolution operation which can then simply apply the kernel values directly to a contiguous block of memory in a matrix processing operation to generate the corresponding output matrix.
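The im2row-style rearrangement described above can be sketched as follows for a single input channel (illustrative only; with IC input channels each gathered row would instead hold KH*KW*IC elements):

```python
import numpy as np

def im2row(inp, kh, kw):
    """Gather, for each output position, the input elements the
    kernel overlaps into one contiguous row, so the convolution
    becomes a plain matrix-vector product with the flattened
    kernel."""
    oh = inp.shape[0] - kh + 1
    ow = inp.shape[1] - kw + 1
    rows = np.zeros((oh * ow, kh * kw), dtype=inp.dtype)
    for r in range(oh):
        for c in range(ow):
            # One gathered row per output position (r, c).
            rows[r * ow + c] = inp[r:r + kh, c:c + kw].ravel()
    return rows
```

After this rearrangement, `im2row(inp, kh, kw) @ kernel.ravel()` reshaped to (oh, ow) gives the convolution result, which illustrates both the simplification of the subsequent matrix operation and the copying overhead the rearrangement incurs.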
However, this approach has several problems. One problem is that the performance of matrix processing operations implemented in a given data processing system is continually improving. As matrix processing performance improves, Amdahl's Law means that other operations performed alongside the matrix processing operations themselves have an increasingly important impact on overall performance. Even if the matrix processing operations themselves can continue to improve in performance, if other operations such as the im2row operation shown in
Another type of convolution operation is a 1×1 convolution operation, which is similar to the 2D convolution described above but with a kernel which is a 1×1 matrix instead of having a 2-dimensional extent. With a 1×1 kernel, the result of a 1×1 convolution operation is simply an output matrix in which each element corresponds to the result of multiplying a corresponding element of the input matrix by the same kernel element. As shown in
In the examples of the 2D convolutions shown above, the calculation of the sum of products has been shown separately for each position of the output matrix, with each group of products being for different pairs of input/kernel positions but the same output position.
However, it is also possible to partition the multiplications in a different grouping, considering the set of multiplications associated with a single kernel position as a group, with that group of multiplications generating one of the products to be summed for each output position. Considering the example of
Similarly, for each other kernel position K2-K9, it can be determined which input element (or a padding value) should be multiplied with that kernel element to generate another of the products summed for each of the output positions. Note that a given input matrix element contributes to a different element of the output matrix for each kernel position. For example, when considering input element F, this will contribute to output element K′ when multiplied with kernel element K1, contribute to output element J′ when multiplied with kernel element K2, contribute to output element I′ when multiplied with kernel element K3, and so on, until F contributes to output element A′ when multiplied with kernel element K9.
Therefore, between respective kernel element positions, there is a relative shift between the position of a given output element in the output matrix and the position of the corresponding input element that contributes to that given output element for that particular kernel element position. For instance, the shift of the effective input matrix between the K1 multiplication and the K2 multiplication is a shift left by one column position.
This means that, by performing a series of 1×1 convolutions and accumulating the results of each 1×1 convolution into an accumulator matrix representing the running totals for the output matrix, the result can be equivalent to the result of the 2D convolution operation performed over a kernel size larger than 1×1. For example, the result of each of the K2 multiplications shown may be added to the corresponding elements of the accumulator matrix resulting from the K1 multiplications (with, say, the result of K2*B being added to the accumulator matrix element at position F′ set based on K1*A in the K1 1×1 convolution), and the result of each of the K3 multiplications may then be added to the corresponding elements of the accumulator matrix resulting from the K1 and K2 multiplications (with the result of K3*C being added to the accumulated value for output element F′ so that F′ now equals K1*A+K2*B+K3*C). This continues for each successive kernel position, and so by the end of the ninth 1×1 convolution operation, the output matrix has the same result as if the 2D convolution operation had been performed with a 3×3 kernel matrix. It will be appreciated that it is not essential to calculate the 1×1 convolutions in the order K1, K2, K3, ..., K9 shown in
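The accumulation of shifted 1×1 convolutions described above can be modelled as follows (illustrative NumPy sketch for a single channel): each kernel element multiplies a window of the input shifted by that element's row/column offset, and the per-element products are accumulated into the running output matrix.

```python
import numpy as np

def conv2d_as_1x1_accumulation(inp, kernel):
    """Build a 'valid' 2D convolution as a series of 1x1
    convolutions: one scalar kernel element times a shifted
    window of the input per step, accumulated into the output."""
    kh, kw = kernel.shape
    oh = inp.shape[0] - kh + 1
    ow = inp.shape[1] - kw + 1
    acc = np.zeros((oh, ow), dtype=inp.dtype)
    for i in range(kh):
        for j in range(kw):
            # 1x1 convolution for kernel element (i, j): the input
            # window is shifted by (i, j) relative to the output.
            acc += kernel[i, j] * inp[i:i + oh, j:j + ow]
    return acc
```

After all kh*kw steps the accumulator equals the 2D convolution result, and the loop order over kernel elements is immaterial because accumulation is a sum.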
As shown in
Hence, as shown in
The input matrix 10 can be loaded directly from the matrix data structure in memory laid out as shown in
Similarly, the output matrix 12 has a corresponding layout to the input matrix 10, and so once all the 1×1 convolutions for the 2D convolution have been accumulated together, the result can be written directly back to a matrix data structure in memory laid out as in
As shown in the top part of
Therefore, to allow the 1×1 convolutions to be applied over a larger number of rows even if there are selected rows which encounter the wraparound problem, it can be useful to support a masking operation which allows certain rows of the input to be skipped when generating the output. This is shown by the “X” marked on the path between input rows D, H and output rows I′, M′. The masking operation may be controlled by masking state data which defines the positions of the masked rows (or if the matrices are instead arranged with the input elements for a given input channel position extending within the same column, the masked columns). Examples of encoding the masking state data are described below. The masking operation could be implemented at the time of loading the data from memory into registers (so that instead of loading the actual data elements from memory, a masking value is instead loaded into corresponding portions of the operand storage for storing information for forming the input channel matrix 10). Alternatively, the masking operation could be performed at the time of performing the matrix processing operation itself, so that when the matrix processing circuitry reads operands for processing, predication is applied to mask out a row of elements as it is read and ensure that the matrix processing circuitry treats those elements as if they represented the masking value instead of the actual value stored in operand storage. The masking value could be zero, or could be non-zero if a zero point is represented using a non-zero value. Either way, this means that the wraparound problem is prevented from causing errors, and this enables the 2D convolution to be performed in fewer instructions as each 1×1 convolution can be applied to a matrix size larger than the block of contiguous rows that does not encounter the wraparound problem.
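The masking-on-load option described above can be sketched as follows (an illustrative model treating memory as a flat array; the function and parameter names are assumptions): rows whose mask bit is clear receive the masking value instead of data from memory.

```python
import numpy as np

def load_rows_masked(memory, base, stride, num_rows, row_width,
                     mask, masking_value=0):
    """Load num_rows rows of row_width elements from a flat
    'memory' array; rows whose mask bit is clear are filled with
    the masking value instead of being loaded (e.g. to break the
    wraparound between adjacent image rows)."""
    out = np.empty((num_rows, row_width), dtype=memory.dtype)
    for r in range(num_rows):
        if mask[r]:
            addr = base + r * stride
            out[r] = memory[addr:addr + row_width]
        else:
            out[r] = masking_value
    return out
```

A non-zero `masking_value` models the case where the zero point of a quantization scheme is represented by a non-zero numeric value.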
For the other kernel weight positions K2-K9, similar matrix multiplication operations to that shown in
For the centre-left kernel position K4, K4 needs to be multiplied with element A of the input matrix when generating output element B (because K4 will be multiplied by A when the central position of the kernel K5 is over element B). Similarly, there is a 1 position shift between input elements and output elements for each of the other positions within the input/output matrices 10, 12.
Similarly, for the centre-right kernel position, K6 needs to be multiplied with input element B to generate output element A, with input element C to generate output element B, and so on.
As shown in
However, it can be seen that in general the input data for rows A-P of the input matrix 10 is essentially the same for each of the three kernel weight positions K4, K5, K6, except that relative to the centre position K5, for the centre-left position K4 the input matrix 10 is shifted down one row position relative to the output, so that input row A is used to generate output row B instead of generating row A as in the central position K5. Similarly for the centre-right position the input matrix 10 is shifted up one row relative to the output matrix 12 so that input row B feeds into output row A.
Therefore, it is observed that by providing circuitry which performs a variable position shift of the inputs relative to the outputs, so that it can be adjusted which row of the output matrix is updated based on a particular row of the input matrix, and which supports multiple different alternative shift amounts that can be selected, this enables a block of matrix data loaded from memory to be reused for the 1×1 convolutions for multiple different kernel positions. This means the memory bandwidth associated with the loads for loading input rows A-P can be amortized across multiple different matrix processing operations, which greatly improves performance. If this position shifting is used, then as the positions of the masked rows for dealing with the wraparound problem vary from kernel position to kernel position, then the masking would be needed at the point of reading the previously loaded operands from registers or a matrix transposition box.
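The variable position shift can be modelled as follows (an illustrative sketch of an outer-product-and-accumulate in which input row r updates result row r + shift; rows shifted outside the result simply drop out):

```python
import numpy as np

def shifted_outer_accumulate(acc, a, b, shift):
    """Outer-product-and-accumulate with a selectable row position
    shift: input row r of vector 'a' updates result row r + shift,
    so the same loaded operand rows can serve several adjacent
    kernel positions without being reloaded from memory."""
    n = acc.shape[0]
    for r in range(n):
        dst = r + shift
        if 0 <= dst < n:   # rows shifted outside the result drop out
            acc[dst, :] += a[r] * b
    return acc
```

Calling this with shift values of -1, 0 and +1 on the same loaded operand models the centre-right, centre and centre-left kernel positions respectively, amortizing one set of loads across three matrix processing operations.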
The execute stage 36 includes a number of processing units, for executing different classes of processing operation. For example the execution units may include a scalar arithmetic/logic unit (ALU) 40 for performing arithmetic or logical operations on scalar operands read from the registers 34; a floating point unit 42 for performing operations on floating-point values; a branch unit 44 for evaluating the outcome of branch operations and adjusting the program counter which represents the current point of execution accordingly; a matrix processing unit 46 for matrix processing (which will be discussed in more detail below);
and a load/store unit 48 for performing load/store operations to access data in a memory system 28, 50, 52, 54.
In this example, the memory system includes a level one data cache 50, the level one instruction cache 28, a shared level two cache 52 and main system memory 54. It will be appreciated that this is just one example of a possible memory hierarchy and other arrangements of caches can be provided. The specific types of processing unit 40 to 48 shown in the execute stage 36 are just one example, and other implementations may have a different set of processing units or could include multiple instances of the same type of processing unit so that multiple micro-operations of the same type can be handled in parallel. It will be appreciated that
In some implementations the data processing apparatus 20 may be a multi-processor apparatus which comprises a number of CPUs (central processing units, or processor cores) 60 each having a processing pipeline 24 similar to the one shown for one of the CPUs 60 of
One approach for supporting matrix processing operations can be to decompose the individual multiplications of a given matrix processing operation into separate integer or vector instructions which can be processed on the processing pipeline 24 of a given CPU 60. However, this may be relatively slow.
Another approach to accelerating matrix processing can be to provide, as one of the devices 64 connected to the interconnect 66, a hardware accelerator with dedicated hardware designed for handling matrix operations. To interact with such a hardware accelerator, the CPU 60 would execute load/store instructions using the load/store unit 48, to write configuration data to the hardware accelerator defining the matrix operands to be read from memory by the hardware accelerator and defining the processing operations to be applied to the operands. The CPU can then read the results of the matrix processing back from the hardware accelerator using a load instruction specifying an address mapped to registers within the hardware accelerator. While this approach can be faster than using integer operations within the pipeline, there may nevertheless be an overhead associated with using the load/store mechanism to transfer information between the general purpose processor 60 and the hardware accelerator 64, and the hardware accelerator approach can also create challenges when different virtual machines running on the same processing system need to share access to the hardware accelerator. Therefore, this approach may not scale well in a virtualised implementation having a number of virtual machines.
Therefore, as shown in
While
The matrix transpose box 74 includes a number of storage elements 88 each for storing a different matrix element of a given operand (input) matrix. The storage elements 88 are logically arranged in rows and columns so that they are accessible either as a row group 90, where all of the storage elements 88 which correspond to the same row of the input matrix are readable/writable, or as a column group 92 where all of the storage elements 88 which correspond to the same column of the input matrix are readable/writable. The physical arrangements of the storage elements 88 on the integrated circuit does not need to follow the logical arrangement in rows and columns and can take any physical arrangement. The ability to read or write the elements 88 in row groups 90 and column groups 92 is provided instead by providing read/write ports and multiplexing circuitry so that the relevant elements which correspond to a given row or a given column can be read, regardless of their physical location in the chip.
This means that when loading data from a matrix data structure in memory, the matrix load circuitry 80 may select (in response to a row/column direction selection parameter 89) whether to load an individual row group 90 of the matrix transpose box 74 or an individual column group 92 with data from a portion of the matrix structure in memory selected based on addressing information 94. A load instruction 98 decoded by the instruction decoder 30 to control the matrix load circuitry 80 may specify a row/column ID 99 which identifies which particular row or column is to be loaded. The instruction could specify the row/column ID 99 directly as an immediate parameter, or indirectly by specifying a register which contains the row/column ID 99.
The row/column selection parameter 89 could be explicitly encoded in the load instruction 98, using a field within the instruction encoding which selects whether a row group 90 or a column group 92 of the matrix transpose box 74 is loaded with data from memory. Alternatively, the row/column direction selection parameter could be implicitly encoded. For example, there may be a control parameter stored in a control register which specifies whether matrix load instructions 98 should currently load rows of the matrix transpose box 74 or columns. The control parameter in the control register could switch states when a row/column direction switching instruction is executed. This avoids the need for every matrix load instruction to specify an explicit row/column direction selection parameter. Also, it is possible to use both a parameter specified in the instruction encoding and a parameter stored in a control register, with the combination of the control register bit and the row/column selection bit in the instruction encoding selecting which of the row/column directions is used. For example, the control register bit could indicate whether rows or columns are selected, but the bit in the instruction encoding could select whether the bit in the control register is inverted or not, e.g.:
Of course, other encodings could be used instead—this is just one example.
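One way the combined encoding described above could behave is as a simple exclusive-OR of the two bits (a sketch only; the encoding of the row direction as 0 and the column direction as 1, and the function name, are illustrative assumptions):

```python
ROW, COLUMN = 0, 1  # illustrative encoding of the two directions

def effective_direction(ctrl_bit, instr_bit):
    """Effective row/column direction: the instruction-encoding bit
    selects whether the control-register bit is used as-is (0) or
    inverted (1)."""
    return ctrl_bit ^ instr_bit
```

So a single direction-switching write to the control register flips the behaviour of all subsequent loads, while any individual load can still override it via its instruction bit.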
Also, the load circuitry 80 is responsive to masking state information 96, 97 to select whether or not to replace the values loaded into the matrix transpose box 74 with masking values instead of the values loaded from memory. In this example, the masking state information includes first masking state information 96 and second masking state information 97.
The first masking state information 96 is used to control masking of certain row/column positions to prevent the corresponding row/column group of the matrix transpose box 74 being updated based on the corresponding values of memory. For each row/column position in the matrix transpose box 74, the first masking state information 96 identifies whether that row/column position is a masked row/column position or an unmasked row/column position. That is, if the row/column selection parameter(s) 89 indicate that elements are to be written in rows, the masking indications of the first masking state information correspond to different row positions. If the row/column selection parameter(s) 89 indicate that the elements are to be written to the matrix transpose box 74 in columns, then the masking indications of the first masking state information correspond to different column positions.
If the first masking state information 96 specifies that the target row/column to be loaded is an unmasked row/column, then the second masking state information 97 can be used to identify which individual element positions within the target row/column are masked, and the matrix load circuitry 80 obtains the corresponding data from the matrix structure stored in memory and writes the non-masked elements of the target row/column to the corresponding elements 88 of the selected row/column group of the matrix transpose box 74 (with any masked out elements in the selected row/column group being set to the masking value instead). Hence, the second masking state information 97 may provide a set of masking indications where each masking indication corresponds to a different position extending in the opposite dimension to the positions associated with the masking indications of the first masking state information. That is, if the row/column selection parameter(s) 89 indicate that elements are to be written in rows, the masking indications of the second masking state information correspond to different column positions. If the row/column selection parameter(s) 89 indicate that the elements are to be written to the matrix transpose box 74 in columns, then the masking indications of the second masking state information correspond to different row positions.
The first and second masking state information 96, 97 together represent two-dimensional masking state information as they indicate positions of masked elements across two dimensions of the matrix to be loaded into the matrix transpose box 74. However, each individual instruction only uses the part of the first masking state information corresponding to a single target row/column (parts of the first masking state information relating to other rows/columns are ignored). Nevertheless, the first and second masking state information 96, 97 may together define the masked positions across the 2D matrix transpose box as a whole so that it is not necessary to change the masking state data between loading one row/column and the next.
On the other hand, if the selected row/column position is indicated by the first masking state information 96 as a masked row/column position, then instead of supplying the data loaded from memory, a masking value is written to each of the matrix elements 88 within the selected row/column. Here, each of the elements within the selected row/column may share the same item of first masking state data 96, either identifying all elements in the selected row/column as masked or identifying all matrix elements 88 within the selected row/column as unmasked. When the load instruction specifies a masked row/column, then in response to the masking state information 96 the matrix load circuitry 80 instead writes a masking value to each of the elements within the masked row/column.
Regardless of whether the masking value is supplied to a particular element 88 of the matrix transpose box 74 due to masking of a whole row based on the first masking state data 96 or masking of an individual element based on the second masking state data 97, the masking value can be a predetermined value such as zero, or could be one of a number of alternative masking values that are selectable based on masking selection information which could be stored in a register or within a parameter specified explicitly by the load instruction.
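The two-level masking on load can be modelled as follows (a minimal sketch; the function name, the Boolean representation of the masking indications and the choice of zero as the masking value are illustrative assumptions; in the real design a wholly masked row also suppresses the memory access entirely):

```python
MASK_VALUE = 0  # the predetermined masking value; zero is one option

def load_row(transpose_box, row, mask1, mask2, memory_row):
    """Load one row group of a modelled transpose box.

    mask1[row] == False marks the whole row as masked (first masking
    state information 96): no memory data is used at all.
    mask2[j] == False masks individual element position j within the
    row (second masking state information 97).
    """
    if not mask1[row]:
        transpose_box[row] = [MASK_VALUE] * len(transpose_box[row])
        return
    transpose_box[row] = [x if m else MASK_VALUE
                          for x, m in zip(memory_row, mask2)]

box = [[None] * 4 for _ in range(2)]
mask1 = [True, False]              # row 1 is wholly masked
mask2 = [True, True, False, True]  # element position 2 is masked
load_row(box, 0, mask1, mask2, [5, 6, 7, 8])
load_row(box, 1, mask1, mask2, [9, 9, 9, 9])
```

Note that the same `mask1`/`mask2` pair serves both loads, matching the point above that the two-dimensional masking state need not change between loading one row/column and the next.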
The addressing information 94 could be stored within the general purpose registers 34 of the CPU which are also used for general integer operands, or in some examples could be stored within some dedicated matrix addressing information registers which store information specific to identifying a portion of a matrix structure to be loaded from memory.
Also in the example of
Similarly, the second masking state information (mask2) 97 is represented as a bitmap which includes a number of bit flag indicators 101 each corresponding to a column/row position (the opposite dimension to the positions indicated by each bit flag indicator 100 in the mask1 bitmap 96), so that mask2 indicates the positions of individual masked elements within the target row/column having the row/column number 99 specified by the load instruction 98 as described above.
The registers storing the first/second masking state information 96, 97 could be dedicated registers for storing the masking state information for masking of matrix operands/processing (and which serve no other purpose), or could serve a dual function so that the same registers could also be used for other information when processing instructions other than matrix processing related instructions. For example, the masking state information 96, 97 could be read from predicate registers, which can also be used to store vector predicates which control masking of lanes of vector processing when a vector instruction is executed.
With this approach, when processing an individual load instruction 98 the matrix load circuitry 80 could calculate the address of the portion of data to be loaded into the selected row or column of the matrix transpose box 74, by adding the base address to the product of the stride value 106 and the row/column number 99 specified by the instruction, optionally offset by the intra-row/column offset value 108 if necessary.
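The address calculation described above amounts to a simple linear expression (a sketch; the function name and the example base/stride values are illustrative):

```python
def element_address(base, stride, rc_number, intra_offset=0):
    """Address of a target row/column: the base address plus the
    product of the stride and the row/column number, optionally
    offset within the row/column (modelling addressing information
    94: base 104, stride 106, intra-row/column offset 108)."""
    return base + stride * rc_number + intra_offset

# e.g. row/column number 3, stride of 64 bytes, intra-offset of 8 bytes
addr = element_address(0x1000, 64, 3, 8)
```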
The offset data structure 110 defines an array of offset values where each offset 114 corresponds to a particular row/column number that can be selected by an individual matrix load instruction 98. When a load instruction specifies a given row/column number (e.g. column 2 as in the example shown in
However, certain offset values are reserved so that they cannot be used as valid offsets but instead indicate the position of a masked row/column. For example, the reserved offset value may be −1 (that is, a binary value having all bits set to 1 in a two's complement representation). Hence, when calculating the address for an individual load instruction, if it is determined that the selected offset value 114-2 for the selected row/column number has the reserved value, then this is interpreted as a masked row or column position, and therefore instead of performing the actual load from the portion of the matrix data structure stored in memory, each of the elements 88 in the relevant row or column group 90, 92 of the matrix transpose box 74 is instead filled with the masking value, for example zero.
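The reserved-offset scheme can be sketched as follows (illustrative only; the function name and the example offset values are assumptions, and returning `None` stands in for the hardware suppressing the load and filling the row/column with the masking value):

```python
RESERVED_OFFSET = -1  # reserved value marking a masked row/column

def row_load_address(base, offsets, rc_number):
    """Return the load address for a row/column of the input matrix,
    or None when the offset array holds the reserved value for that
    row/column (i.e. the row/column is masked and no load occurs)."""
    off = offsets[rc_number]
    return None if off == RESERVED_OFFSET else base + off

offsets = [0, 64, RESERVED_OFFSET, 192]  # row/column 2 is masked
```

A `None` result means the load circuitry generates the padding value on the fly instead of accessing memory, as described above.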
Hence, with this approach the offsets which define the positions in memory from which respective rows or columns of the input matrix are to be loaded into the matrix transpose box also serve as masking state information, which avoids the need for a separate register for the masking state values.
An advantage of using an array 110 of offset values 114 as part of the addressing information is that, compared to an alternative approach of storing a table of absolute addresses indicating the addresses of respective rows/columns of matrix data in memory, this requires much less storage capacity as the offsets can be indicated relative to a common base address and so can be represented using fewer bits. Nevertheless, other implementations could omit the base register 104 in the example of
Also, the use of a special reserved value in the offset field 110 to represent the masked row/column positions can be more efficient than if padding was instead supported by storing the padding value in memory itself and representing the masked rows/columns by specifying, in the field of the offset array 110 corresponding to a masked row/column, an offset value which points to the actual location in memory where the padding value is stored. With the special reserved value approach, there is no need to perform an actual load to memory in order to obtain the padding value, as the padding value can instead be generated on the fly by the load circuitry 80 based on detecting the reserved offset value.
While
Regardless of how the particular masking state information 96, 97 and addressing information 94 is represented, this functionality enables the required portions of a matrix stored in memory to be loaded into the matrix transpose box 74 to permit the 1×1 convolution operations described earlier to be applied to that portion of the matrix. The masking enables certain lines of the input to be skipped as shown in
Having written rows or columns of a given operand matrix into the matrix transpose box 74, the data can be read out in row or column groups by the operand moving circuitry 82 and transferred to the input operand register 70 ready for matrix processing. The operand moving circuitry 82 is not limited to reading out the data from the matrix transpose box 74 in the same row/column direction as the direction in which the data was loaded by the matrix load circuitry 80. In practice, it can be useful for the operand moving circuitry 82 to read out the data in the opposite row/column direction to the one used on loading, if the data structure stored in memory for the input operands is stored in a different row/column-major format compared to the output data structure. This on-the-fly transposition of matrices as they are loaded into the matrix transpose box 74 and read out for processing may be performed in hardware much more efficiently than would be possible by remapping data layouts within memory. Hence, this can greatly improve performance when dealing with input matrices of potentially different memory layouts.
Note that for any given memory layout of a matrix structure stored in memory, it is possible to load that layout either column-wise or row-wise into the matrix transpose box 74, so whether the row/column selection parameter 89 specifies the row direction or the column direction may be selected entirely independently of the actual layout used in the underlying matrix structure in memory. This is because, to transpose the matrix using the matrix transpose box, it is irrelevant whether the data is loaded column-wise and read out row-wise or loaded row-wise and read out column-wise, as both achieve the same result. In fact, when performing such on-the-fly transpositions, it can be useful to alternate between loading matrix data row-wise and loading it column-wise, to achieve better pipelining of the read out of earlier rows or columns of a matrix for processing and the loading in of later rows or columns of the matrix.
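A toy model of the transpose box illustrates why the load direction is independent of the memory layout (a sketch only; the class and method names are illustrative, and a real implementation provides the row/column access via read/write ports and multiplexing circuitry rather than nested lists):

```python
class TransposeBox:
    """Toy model of the matrix transpose box 74: a square array of
    storage elements addressable either as row groups or as column
    groups."""

    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    def write_row(self, i, values):
        self.cells[i] = list(values)

    def read_col(self, j):
        return [self.cells[i][j] for i in range(self.n)]

box = TransposeBox(3)
for i, row in enumerate([[1, 2, 3], [4, 5, 6], [7, 8, 9]]):
    box.write_row(i, row)
# Reading column-wise yields the transpose of what was written row-wise;
# writing column-wise and reading row-wise would give the same result.
```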
For example, imagine a series of operations where a series of rows of the matrix structure in memory are loaded into rows 0 to 7 of the matrix transpose box 74, but are then read out column-wise because the output data structure with which they are being combined has the opposite memory layout. In this case, having loaded the final row 7 into the matrix transpose box, the operand moving circuitry 82 can then start reading out columns one by one, starting with column 0 and finishing with column 7. However, as soon as the data for column 0 has been read out, then while the operand moving circuitry 82 continues to read out successive columns 1-7 for processing by the matrix processing logic 84, the matrix load circuitry 80 could start loading in further rows of the matrix structure from memory for a next chunk of the matrix to be processed. As columns 1-7 may still be needed by the matrix processing logic 84, it is therefore more efficient to start loading those further rows of the matrix structure into the respective columns 0, 1, 2, etc. as those columns successively become free due to the operand moving circuitry reading them out for processing. Hence, later parts of matrices can be loaded into respective columns of the matrix transpose box 74 at early column positions 0, 1 while the read out of the later columns associated with the previous chunk of matrix is still ongoing. For example, once the operand moves performed by the operand moving circuitry 82 have read out the data in a certain column, say column 2, the load into that column for the next pass can start, and so this enables some performance improvement through pipelining.
Then, once all of the columns have been loaded for the next chunk of the matrix in memory to be processed, the next set of operand moving operations performed by the operand moving circuitry 82 could be performed row wise while loads proceed just behind to fill the row groups 90 of the matrix transpose box just read by the operand moving circuitry 82. Hence, it can be seen that (when on-the-fly transposition is used), by alternating which direction is used for a set of loads, this can provide better performance than if the same row/column direction was used throughout the matrix.
Alternatively, if a particular set of operations is being performed where there is no need for on-the-fly transposition of the matrix layout (e.g. as the output data structure has the same layout in memory as the input data structure), then a fixed one of the row/column directions could be selected for both the matrix load operations and the operand moving operations. Nevertheless, there may still be pipelining so that operands can be read out from certain rows/columns for processing while loads are being performed into other rows/columns.
In the example of
However, such a matrix multiply operation would require, for each output element position of the output matrix 12, 4 separate products to be calculated, and then an addition of 5 terms (the 4 products and the previous value of the output element). This may be slow to implement and difficult to fit with pipeline timings of other operations.
In contrast, an outer product operation takes a first vector operand u=(u1, u2, . . . , um) and a second vector operand v=(v1, v2, . . . , vn) each comprising a one-dimensional array of elements and combines these to form a two-dimensional result matrix W where W[i, j]=ui×vj.
Hence, each element of the result matrix is derived from a single product of a single element of the input vector operand with a single element of the second vector operand.
For an outer-product-and-accumulate operation, each element of an updated result matrix W′ also depends on the corresponding element in the previous value of the result matrix W: W′[i, j]=W[i, j]+ui×vj.
Hence, even for the outer-product-and-accumulate operation, each element requires only the calculation of a single product added to one additional term. This can be performed much faster with lower hardware cost.
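The outer-product-and-accumulate operation can be sketched directly from the definitions above (the function name is illustrative):

```python
def outer_product_accumulate(W, u, v):
    """W'[i][j] = W[i][j] + u[i] * v[j]: each element of the updated
    result matrix needs only a single product and a single addition."""
    return [[W[i][j] + u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]
```

Because every element depends on exactly one product, all elements can be computed in parallel with shallow per-element logic, which is what makes this faster to implement than a full matrix multiply per instruction.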
The full matrix multiply operation can be decomposed into individual outer product operations. For example, when taking a vector operand 206 as shown in
Hence, to support the outer-product-and-accumulate operation performed by the matrix processing logic 84, the input operand registers 70 store one-dimensional vector operands and the operand moving circuitry 82 reads out parts of the input matrix in the matrix transpose box 74 a row or a column at a time. Hence, even though the underlying given operand matrix on which the operations are being performed is a two-dimensional matrix structure, at the point of applying a matrix processing operation it is treated as a series of one-dimensional vector operands, but nevertheless the matrix processing logic 84 is able to generate a result matrix as a two-dimensional matrix structure in one instruction, corresponding to the result of applying the outer product/accumulate operation on a pair of vector operands. This means that the operation is still faster than if individual vector processing instructions were processed, each of which can only generate a single row/column of a result matrix at a time.
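The decomposition of a full matrix multiply into a series of outer-product-and-accumulate operations, as described above, can be checked with a short model (a sketch; the function and variable names are illustrative):

```python
def matmul_via_outer_products(A, B):
    """Compute A @ B by accumulating one outer product per shared
    dimension index t: each step combines column t of A (a vector)
    with row t of B (a vector) into the 2D accumulator C."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for t in range(k):
        col = [A[i][t] for i in range(m)]  # column t of first operand
        row = B[t]                         # row t of second operand
        C = [[C[i][j] + col[i] * row[j] for j in range(n)]
             for i in range(m)]
    return C
```

Each loop iteration corresponds to one outer-product-and-accumulate instruction operating on a pair of one-dimensional vector operands read out from the operand storage.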
In the example of
Alternatively, other approaches may only provide sufficient input operand register storage 70 for a single vector operand pair, in which case that single pair of vector registers would need to be loaded with the new value for each different combination of row/column of the respective input matrices being multiplied.
Also, it is not essential to provide separate register banks for the two operands A, B. In another example, both operands A and B may be selected from respective registers in a single combined register file.
As shown in
The matrix processing logic 84 includes position shifting circuitry 260 for applying a variable position shift between the elements of one of the input operands 250 and the corresponding element positions in the output matrix 270 generated in response to the matrix processing instruction 240. The shift information 244 can be represented either as an explicit parameter within the matrix processing instruction 240, or by a control parameter stored in a control register. The shift parameter 244 specifies one of a number of variable shift amounts. Based on the selected shift amount, the position shifting circuitry activates a number of multiplexers to select which of the input elements from the first vector operand 250 are supplied to each element position within a shifted input operand 272. For example, if a variable shift amount of 0 is selected then each element of the input vector 250 is passed through to the correspondingly positioned element in the shifted input vector 272, while if a variable shift amount of 1 is selected then the element at a given element position within the shifted input vector 272 is set to the value of the element at the next highest element position within the original input vector 250. For the element at the highest element position within the shifted input vector 272, a padding value 274 can be supplied, as there is no higher element position within the original input vector to inject when a variable shift amount greater than 0 is selected. Similarly, for higher values of the shift amount a larger shift of position can be applied, so as to adjust which position of the input vector 250 is supplied through to the shifted positions in the shifted input vector 272. No shift is applied to the second vector operand 252, which is simply used in its original position.
The matrix processing logic 84 then performs the outer product operation so that each element C′[i, j] is generated according to the expression C′[i, j]=C[i, j]+P[i]·Ashift[i]×B[j], where i is iterated across all rows of the result matrix C′[i, j] and j is iterated across all columns of result matrix C′[i, j]. Here, the predicate bit P[i] corresponding to a given row position i in the result matrix specifies whether that row is masked (inactive) or unmasked (active). In this example the inactive rows of the output matrix 270 are indicated by predicate bits equal to 0 while the active rows are indicated by predicate bits of 1, but it will be appreciated that other examples could take the opposite mapping of the predicate value, so that the inactive rows may be identified using predicate bits of 1 and the active rows by predicate bits of 0. For inactive rows, in this example the corresponding elements of the shifted input vector 272 are assumed to be replaced with a masking value of zero, but other examples could use a non-zero masking value.
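The predicated, shifted accumulation described above can be sketched as follows (an illustrative model; representing the predicate as 0/1 and padding with zero follow the example in the text, while the function name is an assumption):

```python
def shifted_outer_accumulate(C, A, B, P, shift, pad=0):
    """C'[i][j] = C[i][j] + Ashift[i] * B[j] for active rows (P[i]
    truthy), where Ashift[i] = A[i + shift], padded when the shifted
    index falls outside the original vector. Inactive rows leave the
    previous accumulator value unchanged."""
    n = len(A)
    Ashift = [A[i + shift] if 0 <= i + shift < n else pad
              for i in range(n)]
    return [[C[i][j] + (Ashift[i] * B[j] if P[i] else 0)
             for j in range(len(B))] for i in range(n)]

# Shift amount 1: row i of the result sees element i+1 of the first
# operand; the predicate masks out the last row.
result = shifted_outer_accumulate([[0, 0], [0, 0], [0, 0]],
                                  [1, 2, 3], [10, 1], [1, 1, 0], 1)
```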
Hence, with this approach the variable position shift provided by the position shifting circuitry 260 helps to support the approach shown in
While
Hence, providing the features discussed above with respect to
While
Also, in some implementations it may not be essential to provide the input operand registers 70 at all, since if the matrix transpose box 74 is provided, another approach could be for the matrix processing logic 84 to read its operands directly from the storage elements 88 of the matrix transpose box 74. Hence, while in general some operand storage circuitry may be provided to be loaded with rows or columns of a matrix by the matrix load circuitry 80 and from which operands can be obtained by the matrix processing logic 84, it is not necessary to provide both the matrix transpose box 74 and the input operand registers 70, and either can be provided on its own, or both can be provided in combination as in the example of
While
Performance can be improved to the greatest extent if both the row/column masking functionality and the position shifting functionality described above are provided, but this is not essential and some implementations may provide only one or the other of these functionalities.
On the other hand, if the target row or column position is not a masked row or column position, then at step 308 the matrix load circuitry 80 obtains the second masking state data 97, which is per-element masking state data indicating positions of any individual masked column/row positions within the target row/column. At step 310 the matrix load circuitry determines whether there are any active elements within the target row/column (it is possible that even though the first masking state data 96 indicated the target row/column was not masked, the second masking state data 97 may have set all elements in the target row/column to be inactive). If there is at least one active element in the target row/column, then at step 312 the matrix load circuitry 80 triggers a load operation to read from the memory a portion of the matrix data structure which corresponds to the target row or column. The address from which the data is loaded may be derived from the addressing information 94, for example by adding the base address 104 to the product of the row/column number and the specified stride 106 in the example of
If at step 310 the matrix load circuitry 80 determines that all of the elements in the target row/column are indicated as inactive by the second masking state data 97, then at step 314 the load operation is prevented from taking place, and each element of the target row/column in the operand storage circuitry (i.e. storage elements 88 of the matrix transpose box 74 or an input operand register 70) is filled with the masking value, without needing to perform any load from memory at all.
While
Hence, at step 328 a variable position shift is applied by the position shifting circuitry 260 based on the shift amount selected at step 326, so that it is varied which row or column of the 2D result matrix 270 is updated based on a given element of one of the input operands 250. At step 324 of
Hence, in summary these ideas help support more efficient hardware to support processing of 2D convolution operations which are a common operation in the field of machine learning and image processing.
Further examples are set out in the following clauses:
first masking state data indicative of one or more masked rows or column positions for which all elements in the masked row or column position are to be treated as representing the masking value; and second masking state data indicative of whether individual element positions within a given row or column are to be masked or not.
In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2007068.6 | May 2020 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2021/051153 | 5/13/2021 | WO |