This invention relates to integrated circuit devices and methods for vector data processing. In the field of vector data processing an integrated circuit device may have at least one instruction processing module arranged for executing vector data processing upon receipt of a respective one of a set of data processing instructions. Such a single data processing instruction may operate on multiple data elements, also called SIMD.
The United States patent application document US 2010/0106944 describes a data processing apparatus and method for performing rearrangement operations. The data processing apparatus has a register data store with a plurality of registers, each register storing a plurality of data elements. Processing circuitry is responsive to control signals to perform processing operations on the data elements. An instruction decoder is responsive to at least one but no more than N rearrangement instructions, where N is an odd plural number, to generate control signals to control the processing circuitry to perform a rearrangement process. The process involves obtaining as source data elements the data elements stored in N registers of said register data store as identified by the at least one re-arrangement instruction; performing a rearrangement operation to rearrange the source data elements between a regular N-way interleaved order and a de-interleaved order in order to produce a sequence of result data elements; and outputting the sequence of result data elements for storing in the register data store. This provides a technique for performing N-way interleave and de-interleave operations.
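By way of illustration only, the following C sketch models such a rearrangement for N=3: it de-interleaves a regularly 3-way interleaved sequence into three separate streams. The array sizes and values are merely assumed for the example and do not form part of the described apparatus.

    #include <stdio.h>

    /* Illustrative sketch only: de-interleave a regularly 3-way interleaved
     * sequence (x0,y0,z0,x1,y1,z1,...) into three separate streams, as in the
     * prior-art rearrangement described above.  Sizes are assumed. */
    #define N 3           /* number of interleaved streams */
    #define LEN 4         /* elements per stream */

    int main(void) {
        int src[N * LEN] = {10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33};
        int dst[N][LEN];

        for (int i = 0; i < LEN; i++)
            for (int s = 0; s < N; s++)
                dst[s][i] = src[i * N + s];   /* stream s, element i */

        for (int s = 0; s < N; s++) {
            for (int i = 0; i < LEN; i++)
                printf("%d ", dst[s][i]);
            printf("\n");
        }
        return 0;
    }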
However, the known system requires many instructions for some matrix processing operations.
The present invention provides an integrated circuit device, and a method, as described in the accompanying claims.
Specific embodiments of the invention are set forth in the dependent claims. Aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.
Further details, aspects and embodiments of the invention will be described, by way of example only, with reference to the drawings.
Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. In the Figures, elements which correspond to elements already described may have the same reference numerals.
Examples of the present invention will now be described with reference to an example of an instruction processing architecture, such as a central processing unit (CPU) architecture. However, it will be appreciated that the present invention is not limited to the specific instruction processing architecture herein described with reference to the accompanying drawings, and may equally be applied to alternative architectures. For the illustrated example, an instruction processing architecture is provided comprising separate data and address registers. However, it is contemplated in some examples that separate address registers need not be provided, with data registers being used to provide address storage. Furthermore, for the illustrated examples, the instruction processing architecture is shown as comprising four data execution units. Some examples of the present invention may equally be implemented within an instruction processing architecture comprising any number of data execution units. Additionally, because the illustrated example embodiments of the present invention may, for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated below, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
The data processing instructions include at least one matrix processing instruction for processing elements of a matrix. The elements of rows of the matrix are sequentially stored in the set of registers 110. The matrix processing instruction triggers accessing matrix elements via an accessing unit. Thereto, the instruction processing module has an accessing unit 120,120′ for accessing selected elements of the matrix, which selected elements are non-sequentially stored according to a predetermined pattern across multiple registers of the set of registers.
In the example, the instruction processing module has a first accessing unit 120 that is shown to enable access to the data elements D2 in WideReg A, D7 in WideReg B, D8 in WideReg C and D13 in WideReg D for executing the data processing instruction 130. Furthermore, in the right half of the Figure, further data processing is depicted in which the instruction processing module has a further accessing unit 120′ that is shown to enable access to the data elements D3 in WideReg A, D4 in WideReg B, D9 in WideReg C and D14 in WideReg D, which elements are added to generate data elements 0,1,2,3 to be output to memory or the execution unit.
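By way of illustration only, the following C sketch gives a functional model of such an access pattern: each processing lane reads a different wide register, at an element position that rotates with the lane index. The mapping of data elements to wide registers (WideReg A holding D0 to D3, and so on) and the rotation formula are assumptions made for the example and are not a definition of the actual circuitry.

    #include <stdio.h>

    #define LANES 4   /* processing lanes / elements per wide register */

    /* Functional model (illustration only): four wide registers A..D each hold
     * LANES data elements.  For a given access index k, lane j reads wide
     * register j at element position (k + j) % LANES, i.e. each lane accesses a
     * different wide register through a rotating pattern.  With A = D0..D3,
     * B = D4..D7, C = D8..D11, D = D12..D15 and k = 2 this selects D2, D7, D8
     * and D13, matching the example above. */
    static void access_pattern(int wide[LANES][LANES], int k, int out[LANES]) {
        for (int j = 0; j < LANES; j++)
            out[j] = wide[j][(k + j) % LANES];
    }

    int main(void) {
        int wide[LANES][LANES];
        for (int r = 0; r < LANES; r++)
            for (int e = 0; e < LANES; e++)
                wide[r][e] = 4 * r + e;        /* WideReg r holds D(4r)..D(4r+3) */

        int out[LANES];
        access_pattern(wide, 2, out);          /* k = 2 -> D2, D7, D8, D13 */
        for (int j = 0; j < LANES; j++)
            printf("lane %d reads D%d\n", j, out[j]);
        return 0;
    }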
By providing the access unit, the system is given the ability to access a different wide register for each processing lane, so that respective processing lanes can write to or read from different registers. It is noted that this ability may be implemented in the register file 110 in combination with a permutation unit as depicted in the Figures. The combined ability to access different wide registers and to apply a suitable permutation is called the access unit in this document. It is noted that multiple processing lanes, having respective execution units, may be provided to facilitate a single data processing instruction operating on multiple data elements. The data processing device is further arranged to handle and execute a set of additional matrix instructions that support the new register addressing modes.
The instructions 230 in the first example are “ld (r0),d8:d9:d10:d11” (i.e. load the data addressed by processor register r0 into the data elements D8,D9,D10,D11) or “add d0:d1:d2:d3, d8:d9:d10:d11” (i.e. add external data elements 0,1,2,3 to the data elements D8,D9,D10,D11).
The instructions 230′ in the second example are “st d4:d5:d6:d7, (r0)” (i.e. store the data elements D4,D5,D6,D7 to the address in processor register r0) or “add d4:d5:d6:d7, d0:d1:d2:d3” (i.e. add the data elements D4,D5,D6,D7 to the data elements 0,1,2,3).
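By way of illustration only, the following C sketch models the assumed semantics of such an add over two colon-separated register lists; the register names and contents are merely illustrative and do not define the actual instruction set.

    #include <stdio.h>

    /* Minimal sketch (assumed semantics, illustration only): a register file of
     * scalar data registers d0..d15 and a lane-wise add over two register
     * lists, as in "add d4:d5:d6:d7, d0:d1:d2:d3". */
    #define NUM_D 16
    #define LANES 4

    static int d[NUM_D];

    /* d[dst[i]] += d[src[i]] for each of the LANES registers in the lists */
    static void add_lists(const int dst[LANES], const int src[LANES]) {
        for (int i = 0; i < LANES; i++)
            d[dst[i]] += d[src[i]];
    }

    int main(void) {
        for (int i = 0; i < NUM_D; i++) d[i] = 100 + i;   /* arbitrary contents */

        const int dst[LANES] = {4, 5, 6, 7};              /* d4:d5:d6:d7 */
        const int src[LANES] = {0, 1, 2, 3};              /* d0:d1:d2:d3 */
        add_lists(dst, src);

        for (int i = 4; i < 8; i++)
            printf("d%d = %d\n", i, d[i]);
        return 0;
    }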
It is noted that in the prior art, elements are accessed which are sequentially stored in the registers, e.g. in the second example of the Figures.
The prior art access may be provided with a permutation unit between the operating register and the wide registers storing the vector data. However, although such a permutation would enable rotation or swapping of data elements to or from a single wide register, such a permutation unit would not enable access to data elements of different registers, e.g. for accessing a column of a stored matrix. Such access is only provided by said access units as described with reference to the Figures.
It is noted that load, store and add are commonly used matrix instructions, but the set of instructions may comprise any further instruction, such as MUL, MAC, SUBTR, LOGIC, etc. Such instructions are used for multiplication, accumulation, subtraction, and logical functions. For example, a specific instruction may transfer data, multiply those data and execute accumulation. Such an instruction may specify multiple data transfers and multiplication operations, and/or subtraction and addition circuit operations.
It is noted that, in the processing module, the access to the data registers has been enhanced by providing said access units (not shown in the Figures).
The wide vector may constitute a matrix. The elements of rows of the matrix are sequentially stored in the set of registers 410. The matrix processing instruction triggers accessing matrix elements via an accessing unit. Thereto, the instruction processing module has an accessing unit 420 for accessing selected elements of the matrix, which selected elements are non-sequentially stored according to a predetermined pattern across multiple registers of the set of registers.
For enabling the non-sequential access, the access unit 420 is coupled to said multiple registers 410 and includes a permutation function, as indicated by arrows in the unit as depicted in the Figure.
Combining the permute function with the new mechanism that writes (or reads) each part of the data bus or respective lane into (or from) a different register enables the new register ordering that in turn enables a fast matrix element access, in particular column access. It is noted that a matrix to be so processed may be two-dimensional and have a row size of n and a column size of m elements. Note that n and m are integers of any value and n may differ from m. In practice n and m will usually be equal. The size of the matrix row may be 2^n, i.e. two to the power n. Also, the size of the matrix column may be 2^n. In practice, n may be 2, 3 or 4 and the matrix is a two-dimensional matrix of a matrix size 4×4, 8×8 or 16×16 respectively. Furthermore, the matrix processing instruction may comprise an indication of the matrix row and/or column size.
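By way of illustration only, the following C sketch contrasts such a column access with a conventional wide access: the matrix is assumed to be stored row-wise, one row per wide register, and one access gathers a whole column by letting each lane read a different wide register. The storage layout is an assumption made for the example.

    #include <stdio.h>

    #define N 4    /* matrix and wide-register size: N = 2^2 */

    /* Illustration only: the matrix is stored row-wise, one row per wide
     * register.  A conventional wide access reads the N elements of a single
     * register (one row).  The access unit instead lets lane j read wide
     * register j, so that one access gathers a whole column. */
    static void read_column(int wide[N][N], int c, int out[N]) {
        for (int j = 0; j < N; j++)
            out[j] = wide[j][c];    /* lane j reads a different wide register */
    }

    int main(void) {
        int wide[N][N], col[N];
        for (int r = 0; r < N; r++)
            for (int e = 0; e < N; e++)
                wide[r][e] = r * N + e;      /* row r of the matrix */

        read_column(wide, 1, col);           /* column 1: D1, D5, D9, D13 */
        for (int j = 0; j < N; j++)
            printf("%d ", col[j]);
        printf("\n");
        return 0;
    }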
In the example, the instruction processing module has an accessing unit 420 that is shown to enable access to the data elements D2 in WideReg A, D7 in WideReg B, D8 in WideReg C and D13 in WideReg D for executing the data processing instruction 130.
By providing the access unit, the system is given the ability to access a different wide register for each processing lane. For example, D9 from wide register B is accessed to provide output data element 2, whereas element D14 from wide register D is accessed to provide output data element 0. In the new load operation, a single load loads a single row, but each column element of the row ends up in a different wide register. Two such operations locate two elements of the same column side by side in the same wide register, as marked by an ellipse 495, and thus enable wide access to them by a later operation, for example either a store operation or an ALU operation such as ADD.
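By way of illustration only, the following C sketch models one possible realization of such a load: element (r, c) of row r is assumed to be written by lane c into wide register c at position r, so that after two loads two elements of the same column sit side by side. The exact mapping, including any rotation of positions as shown in the Figures, is an assumption made for the example.

    #include <stdio.h>

    #define N 4   /* lanes per wide register */

    /* Functional model of the new load (assumed mapping, illustration only):
     * loading row r sends element (r, c) over bus lane c, and lane c writes it
     * into wide register c at position r.  After two such loads, positions 0
     * and 1 of every wide register hold two elements of the same column, side
     * by side, so a later wide store or ALU operation can access them together. */
    int main(void) {
        int src[N][N], wide[N][N];

        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++) {
                src[r][c] = r * N + c;
                wide[r][c] = -1;               /* wide registers start empty */
            }

        for (int r = 0; r < 2; r++)            /* two new-style loads: rows 0 and 1 */
            for (int c = 0; c < N; c++)
                wide[c][r] = src[r][c];        /* each lane writes a different register */

        for (int w = 0; w < N; w++)            /* register w now starts with column w */
            printf("WideReg %d: %d %d\n", w, wide[w][0], wide[w][1]);
        return 0;
    }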
The instruction processing modules are shown to operate on data 540,590 from/to a memory or an execution unit, which data has 4 data elements 0,1,2,3. The execution unit as such, which is part of the instruction processing module, is not shown in the Figures.
It is noted that the access unit 520 loads the respective column values in the respective locations of the wide registers as indicated by subsequent permutations while accessing the respective locations according to a predetermined pattern. The arrows as shown in the Figure in unit 520 and below are an example of such permutation. Writing to different wide registers in different processing lanes is used in this load operation.
It is noted that the access unit 570 retrieves the respective column values from the respective locations of the wide registers as indicated by subsequent permutations while accessing the respective locations according to a predetermined pattern. The arrows as shown in the Figure in unit 570 and below are an example of such permutation.
The subsequent execution of the processing as shown in the Figures is exemplified by the software program below.
In the following an example is provided of a software program using the matrix processing instructions for execution on a processor comprising the instruction processing module as described above, based on a matrix size of 8×8. Practical values for the matrix size may be 2^n, e.g. n being 2, 3 or 4, and the matrix size correspondingly being 4×4, 8×8, or 16×16. Other matrix sizes may also be implemented where required and efficient for certain applications.
The instructions comprise load (LD2), store (ST2) and add (ADDA) instructions. The instructions are shown to have an indication of the matrix row and/or column size by the parameters as indicated after the respective instruction code. Also, in the example, the matrix processing instructions include an indication of the predetermined pattern for accessing the elements by the enumeration of the respective elements. The program is an example of a reversal of values in a matrix, also called matrix transpose:
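By way of illustration only, the following C sketch gives a functional model of such a transpose for an 8×8 matrix, using eight column-spreading loads followed by eight wide stores; it does not reproduce the device's own LD2/ST2/ADDA listing, and the assumed mapping of elements to wide registers is merely illustrative.

    #include <stdio.h>

    #define N 8   /* matrix is N x N, N = 2^3 */

    /* Functional model only (not the processor's instruction set): each "wide
     * register" holds N elements.  A column-spreading load of row r places
     * element (r, c) into wide register c at position r, so that after N such
     * loads wide register c holds column c.  Storing the wide registers row by
     * row then writes out the transposed matrix. */
    int main(void) {
        int src[N][N], wide[N][N], dst[N][N];

        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)
                src[r][c] = r * N + c;

        /* N column-spreading loads (one per source row) */
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)        /* lane c writes wide register c */
                wide[c][r] = src[r][c];

        /* N wide stores (one per wide register) */
        for (int w = 0; w < N; w++)
            for (int e = 0; e < N; e++)
                dst[w][e] = wide[w][e];

        /* dst is now the transpose of src */
        printf("dst[0][5] = %d (= src[5][0] = %d)\n", dst[0][5], src[5][0]);
        return 0;
    }

In this model the transpose uses eight wide loads and eight wide stores, i.e. 16 wide memory operations, which corresponds to the figure of 16 cycles mentioned below if one wide memory operation is performed per cycle.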
For such a transpose function, any “traditional” implementation of the transpose of a big 2D array that cannot fit into a single row of the target processor requires reading or writing the array without vectorization, since a single element of the array has to be accessed at a time; no other elements can be accessed at the same time. Since load/store accesses are a limiting factor in every processor architecture, which normally can perform only one or two transactions every cycle regardless of the transaction width, accessing an array without vectorization will cost N cycles, where N is the length of the vector. In the example N is 8.
The function executed by the above program using the enhanced matrix processing instructions requires only 16 cycles. Performing the same function using traditional code (not using vectorization) would require 96 cycles for the permutation of 128 complex values:
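By way of comparison and for illustration only, a scalar C equivalent of such traditional, non-vectorized code is sketched below; it is not the LD2.2F/ST2.16F listing itself, but it shows that every element is moved individually.

    #include <stdio.h>

    #define N 8

    /* Conventional, non-vectorized transpose for comparison: every element is
     * loaded and stored individually, because a conventional wide access can
     * only read or write consecutive elements of one row. */
    int main(void) {
        int src[N][N], dst[N][N];

        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)
                src[r][c] = r * N + c;

        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)
                dst[c][r] = src[r][c];       /* one scalar load + store per element */

        printf("dst[0][5] = %d\n", dst[0][5]);
        return 0;
    }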
The traditional code without the new instructions needs to break all the loads, or vectorized loads (LD2.16F), into 8 separate loads (LD2.2F), and to use linear register order for the stores (ST2.16F or ST2.SRS.16F). It is easy to see that the number of cycles increases significantly when using traditional code.
In a further practical application the new matrix instructions may be used for FFT. A special register order may be provided for use in such a routine. In addition to the above transpose routine, it uses special FFT reverse-carry addressing. The new FFT implementation using the enhanced matrix instructions enables FFT reverse-carry reordering by wide load and store across the matrix (8 words in parallel), accelerating this phase by a factor of 8.
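By way of illustration only, the following C sketch shows the reverse-carry (bit-reversed) index computation underlying such FFT reordering for an 8-point example; in the described device the reordering itself would be performed by wide loads and stores across the matrix rather than element by element.

    #include <stdio.h>

    #define LOG2N 3                 /* 8-point example */
    #define N (1 << LOG2N)

    /* Reverse-carry (bit-reversed) index, as used for FFT reordering: the bits
     * of the index are mirrored, e.g. for N = 8, index 3 (011b) maps to 6 (110b). */
    static unsigned bit_reverse(unsigned i, unsigned bits) {
        unsigned r = 0;
        for (unsigned b = 0; b < bits; b++) {
            r = (r << 1) | (i & 1);
            i >>= 1;
        }
        return r;
    }

    int main(void) {
        int x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = i;

        for (int i = 0; i < N; i++)
            y[bit_reverse(i, LOG2N)] = x[i];   /* reverse-carry reordering */

        for (int i = 0; i < N; i++) printf("%d ", y[i]);
        printf("\n");
        return 0;
    }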
In summary, the enhancement resides in that the instruction module can access a different wide register at each respective processing lane. The additional circuitry does not require a wider data path, while it provides wide access to columns of 2D complex non-serial data structures.
In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims. For example, the connections may be a type of connection suitable to transfer signals from or to the respective nodes, units or devices, for example via intermediate devices. Accordingly, unless implied or stated otherwise the connections may for example be direct connections or indirect connections.
Because the apparatus implementing the present invention is, for the most part, composed of electronic components and circuits known to those skilled in the art, circuit details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
Although the invention has been described with respect to specific conductivity types or polarity of potentials, skilled artisans will appreciate that conductivity types and polarities of potentials may be reversed.
Also, the invention is not limited to physical devices or units implemented in non-programmable hardware but can also be applied in programmable devices or units able to perform the desired device functions by operating in accordance with suitable program code. Furthermore, the devices may be physically distributed over a number of apparatuses, while functionally operating as a single device.
Furthermore, the units and circuits may be suitably combined in one or more semiconductor devices.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.
Filing Document: PCT/IB2013/050220
Filing Date: 1/10/2013
Country: WO
Kind: 00