I. Field of the Disclosure
The technology of the disclosure relates generally to parallel data processing using vector processors.
II. Background
One class of computational tasks encountered by modern computer processors involves performing scalar operations on one of a number of accumulators based on input data, with a value of the input data determining which accumulator is a target of each scalar operation. A non-limiting example of this class of computational tasks is histogram generation. To generate a histogram, a processor calculates cumulative frequencies of occurrence for individual data values or ranges of data values within input data (e.g., by counting a number of times each data value appears within the input data). In this manner, overall distribution of data values within the input data may be determined, and may be used to generate a visual representation of the distribution.
Histograms are frequently used in image processing to illustrate tonal distribution in a digital image by plotting a number of pixels for each tonal value within the digital image. For instance, the digital image may comprise pixels each having an 8-bit intensity value. Accordingly, generating a histogram from the digital image may require a processor to use 256 (i.e., 2^8) accumulators, with each accumulator corresponding to one of the possible intensity values for the pixels of the digital image. The processor carries out operations to examine each pixel of the digital image and determine an intensity value for the pixel. The intensity value for the pixel is then used to determine which accumulator should be incremented.
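The per-pixel counting described above can be sketched as a simple scalar loop. The following is a minimal illustration (function and variable names are hypothetical, not from this disclosure); each 8-bit intensity value itself selects which of the 256 accumulators to increment:

```python
def histogram_8bit(pixels):
    """Count occurrences of each possible 8-bit intensity value (0-255)."""
    accumulators = [0] * 256  # one accumulator per possible intensity value
    for intensity in pixels:
        # the input data value itself determines the target accumulator
        accumulators[intensity] += 1
    return accumulators

counts = histogram_8bit([0, 255, 255, 17])
# counts[255] == 2, counts[0] == 1, counts[17] == 1
```

Note that each iteration performs a read of one accumulator, a scalar add, and a write back, which is why a purely scalar implementation may require multiple clock cycles per data value.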
Computational tasks such as histogram generation may be computationally intensive, as processing of each data value involves receiving the data value as input, retrieving a value of an accumulator corresponding to the data value, and writing a new value to the accumulator based on a scalar operation performed on the retrieved value. Thus, each data value may require multiple processor clock cycles to process. Moreover, processing of the input data may be further limited by an availability of bandwidth to update accumulators. For example, a data cache in which the accumulators are stored may provide only a limited number of read paths and/or write paths during each processor clock cycle.
One approach for optimizing this class of computational tasks involves the use of multicore processing to parallelize scalar operations using multiple instruction, multiple data (MIMD) techniques. Under this approach, each processing thread of a multicore processor provides a private set of accumulators, and processes one section of input data. The accumulators for each of the processing threads are then “reduced,” or added together, after all processing threads complete processing on their respective portions of the input data. However, this approach may result in dependency issues and/or memory conflicts among processing threads, and may provide only minimal performance increases as additional processing clusters are used.
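The MIMD approach described above can be modeled as follows. This is a simplified single-process sketch (the chunking scheme and names are assumptions for illustration): each "thread" builds a private set of accumulators over its portion of the input, and the private sets are then reduced by element-wise addition:

```python
def mimd_histogram(pixels, num_threads=4):
    """Model of the MIMD approach: per-thread private accumulators,
    followed by a reduction (element-wise addition) step."""
    chunk = (len(pixels) + num_threads - 1) // num_threads
    private = []
    for t in range(num_threads):  # stand-in for parallel processing threads
        counts = [0] * 256  # private accumulators for this thread
        for intensity in pixels[t * chunk:(t + 1) * chunk]:
            counts[intensity] += 1
        private.append(counts)
    # reduction step: add the per-thread accumulators together
    return [sum(column) for column in zip(*private)]

total = mimd_histogram([5] * 10 + [200] * 3)
# total[5] == 10, total[200] == 3
```

The reduction step and the duplicated per-thread storage illustrate why this approach scales poorly as additional processing clusters are added.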
Aspects disclosed in the detailed description include parallelization of scalar operations by vector processors using data-indexed accumulators in vector register files. Related circuits, methods, and computer-readable media are also disclosed. In this regard, in one aspect, a vector processor is configured to provide single instruction, multiple data (SIMD) functionality for parallelizing scalar operations. The vector processor includes a vector register file providing a plurality of vector registers. Each vector register is logically subdivided into a plurality of accumulators. The total number of accumulators in the plurality of vector registers corresponds to a number of possible data values in anticipated input data. The vector register file also provides a plurality of write ports that enable multiple writes to be made to the vector registers during each processor clock cycle. To enable parallelization of scalar operations, the vector processor is configured to receive an input data vector. The vector processor then executes vector operations to access an input data value (e.g., a subset of the input data vector) for each write port of the vector register file. For each input data value, a register index and an accumulator index are determined. Together, the register index and the accumulator index may serve as a mapping to the appropriate accumulator, with the register index indicative of the vector register containing the accumulator and the accumulator index indicative of the specific accumulator within the vector register. Accordingly, the accumulators may be considered “data-indexed,” in that the input data value determines the accumulator to be acted upon. A scalar operation is performed on the vector register indicated by the register index, with the specific scalar operation based on the register index and the accumulator index. 
In this manner, the vector processor can take advantage of the data parallelization capabilities of the vector register file to increase processing bandwidth, thus improving overall processing performance.
In another aspect, a vector processor comprising a vector register file is provided. The vector register file includes a plurality of vector registers, each configured to provide a plurality of accumulators. The vector register file is also configured to provide a plurality of write ports. The vector processor is configured to receive an input data vector. For each write port of the plurality of write ports, the vector processor is configured to execute one or more vector operations to access an input data value of the input data vector. The vector processor is further configured to, for each write port of the plurality of write ports, determine, based on the input data value, a register index indicative of a vector register among the plurality of vector registers. The vector processor is also configured to, for each write port of the plurality of write ports, determine, based on the input data value, an accumulator index indicative of an accumulator among the plurality of accumulators of the vector register. The vector processor is additionally configured to, based on the register index and the accumulator index, perform a scalar operation on the vector register indicated by the register index.
In another aspect, a vector processor is provided, comprising a means for receiving an input data vector. The vector processor further comprises, for each write port of a plurality of write ports of a vector register file of the vector processor, a means for accessing an input data value of the input data vector. The vector processor also comprises, for each write port of the plurality of write ports, a means for determining, based on the input data value, a register index indicative of a vector register among a plurality of vector registers in the vector register file. The vector processor additionally comprises, for each write port of the plurality of write ports, a means for determining, based on the input data value, an accumulator index indicative of an accumulator among a plurality of accumulators of the vector register. The vector processor further comprises, for each write port of the plurality of write ports, a means for performing a scalar operation on the vector register indicated by the register index, based on the register index and the accumulator index.
In another aspect, a method for parallelizing scalar operations in vector processors is provided. The method comprises receiving, by a vector processor, an input data vector. The method further comprises, for each write port of a plurality of write ports of a vector register file of the vector processor, executing one or more vector operations to access an input data value of the input data vector. The method also comprises, for each write port of the plurality of write ports, executing the one or more vector operations to determine, based on the input data value, a register index indicative of a vector register among a plurality of vector registers in the vector register file. The method additionally comprises, for each write port of the plurality of write ports, executing the one or more vector operations to determine, based on the input data value, an accumulator index indicative of an accumulator among a plurality of accumulators of the vector register. The method further comprises, for each write port of the plurality of write ports, executing the one or more vector operations to perform a scalar operation on the vector register indicated by the register index, based on the register index and the accumulator index.
In another aspect, a non-transitory computer-readable medium having stored thereon computer-executable instructions is provided. The computer-executable instructions cause a vector processor to receive an input data vector. The computer-executable instructions further cause the vector processor to, for each write port of a plurality of write ports of a vector register file of the vector processor, execute one or more vector operations to access an input data value of the input data vector. The computer-executable instructions also cause the vector processor to, for each write port of the plurality of write ports, execute the one or more vector operations to determine, based on the input data value, a register index indicative of a vector register among a plurality of vector registers in the vector register file. The computer-executable instructions additionally cause the vector processor to, for each write port of the plurality of write ports, execute the one or more vector operations to determine, based on the input data value, an accumulator index indicative of an accumulator among a plurality of accumulators of the vector register. The computer-executable instructions further cause the vector processor to, for each write port of the plurality of write ports, execute the one or more vector operations to perform a scalar operation on the vector register indicated by the register index, based on the register index and the accumulator index.
With reference now to the drawing figures, several exemplary aspects of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
Aspects disclosed in the detailed description include parallelization of scalar operations by vector processors using data-indexed accumulators in vector register files. Related circuits, methods, and computer-readable media are also disclosed. In this regard, in one aspect, a vector processor is configured to provide single instruction, multiple data (SIMD) functionality for parallelizing scalar operations. The vector processor includes a vector register file providing a plurality of vector registers. Each vector register is logically subdivided into a plurality of accumulators. The total number of accumulators in the plurality of vector registers corresponds to a number of possible data values in anticipated input data. The vector register file also provides a plurality of write ports that enable multiple writes to be made to the vector registers during each processor clock cycle. To enable parallelization of scalar operations, the vector processor is configured to receive an input data vector. The vector processor then executes vector operations to access an input data value (e.g., a subset of the input data vector) for each write port of the vector register file. For each input data value, a register index and an accumulator index are determined. Together, the register index and the accumulator index may serve as a mapping to the appropriate accumulator, with the register index indicative of the vector register containing the accumulator and the accumulator index indicative of the specific accumulator within the vector register. Accordingly, the accumulators may be considered “data-indexed,” in that the input data value determines the accumulator to be acted upon. A scalar operation is performed on the vector register indicated by the register index, with the specific scalar operation based on the register index and the accumulator index.
In this manner, the vector processor can take advantage of the data parallelization capabilities of the vector register file to increase processing bandwidth, thus improving overall processing performance.
In this regard,
Before discussing the particular circuitry and vector processing operations provided by the vector processor 12 for parallelization of scalar operations using the vector register file 16, constituent elements of the computer processor 10 are first described.
With continuing reference to
The computer processor 10 further includes an instruction dispatch circuit 36 configured to fetch instructions from program memory 38, as indicated by arrow 40. The instruction dispatch circuit 36 may decode the fetched instructions. Based on a type of the instructions, the instruction dispatch circuit 36 may direct the fetched instructions to either the scalar processor 24 via a scalar data path 42, or through a vector data path 44 to the vector processor 12.
As discussed above, one class of computational tasks that may be encountered by the computer processor 10 involves performing scalar operations on data-indexed accumulators (i.e., accumulators for which a value of input data determines which accumulator is a target of the scalar operation). Carrying out these computational tasks using the scalar processor 24 may result in suboptimal processor performance. As an example, each input data value may require multiple processor clock cycles for the scalar processor 24 to process. Processing of the input data may be further limited by the bandwidth available to the scalar processor 24 to update the accumulators. For instance, the accumulators may be stored in the data cache 32, which may provide only a limited number of read ports (not shown) and/or write ports (not shown) during each processor clock cycle.
In this regard, the vector processor 12 of
In the example of
The vector register file 16 in this example provides a total of four (4) write ports 22(0)-22(3), enabling up to four (4) of the vector registers 18 to be updated during each processor clock cycle. The write ports 22(0)-22(3) correspond to the write ports 22(0)-22(Z) of
The vector processor 12 receives an input data vector 19 representing a set of pixel intensity values within the digital image for which a histogram will be generated. In this example, the input data vector 19 is a vector comprising sixteen (16) values numbered 0-15. According to some aspects, the input data vector 19 may be a 128-bit vector of sixteen (16) 8-bit values. The input data vector 19 may be received as an input stream (not shown), or may be stored in a register or other memory (not shown) accessible to the vector processor 12, as non-limiting examples. In some aspects, the input data vector 19 may comprise more or fewer bits and/or more or fewer values than shown in
For each of the write ports 22(0)-22(3), the vector processor 12 provides corresponding multiplexer logic blocks 48(0)-48(3). In some aspects, the multiplexer logic blocks 48(0)-48(3) may be implemented as microcode defining vector instructions for performing operations described herein. Each multiplexer logic block 48 receives, as input, a subset of the input data vector 19, and selects one input data value 50 from the subset for processing during a processor clock cycle. In the example of
As discussed above, when generating a histogram, each of the input data values 50 indicates which accumulator 46 is to be incremented. For example, if the input data value 50(0) is 255 (indicating a pixel intensity of 255), the vector processor 12 should cause the corresponding accumulator 46(255) to be incremented. Because the accumulators 46 are stored within the vector registers 18(0)-18(31), it is necessary for the vector processor 12 to decode the input data values 50 to determine which specific vector register 18 contains the accumulator 46 to be incremented. Accordingly, the vector processor 12 as shown in
Each of the register decoders 52(0)-52(3) receives respective input data values 50(0)-50(3) from corresponding multiplexer logic blocks 48(0)-48(3), and operates on the input data values 50(0)-50(3) to generate register indices 54(0)-54(3), respectively. Each of the register indices 54(0)-54(3) is indicative of one of the vector registers 18 containing the accumulator 46 to be incremented. In the example of
As seen in
The accumulator decoders 56(0)-56(3) receive respective input data values 50(0)-50(3) from corresponding multiplexer logic blocks 48(0)-48(3), and operate on the input data values 50(0)-50(3) to generate accumulator indices 58(0)-58(3), respectively. Each of the accumulator indices 58(0)-58(3) is indicative of one of the accumulators 46 within one of the vector registers 18 identified by a corresponding one of the register indices 54. In this example, each of the accumulator indices 58 must be a value in the range from 0 to 7 (i.e., a 3-bit value), indicating which of the eight (8) accumulators 46 within one of the vector registers 18 identified by a corresponding one of the register indices 54 is to be incremented. Thus, generating the accumulator indices 58 may be accomplished by masking the five (5) high-order bits of each of the input data values 50 (e.g., by performing logical AND operations). It is to be understood that additional and/or different operations may be carried out by the accumulator decoders 56 to generate the accumulator indices 58.
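Taken together, the register decoders 52 and accumulator decoders 56 realize a shift-and-mask mapping from an 8-bit input data value to one of 256 accumulators spread across thirty-two (32) vector registers of eight (8) accumulators each. A minimal Python sketch of this decoding (function and constant names are illustrative, not from the disclosure):

```python
ACCUMULATORS_PER_REGISTER = 8  # assumed layout: 32 registers x 8 accumulators

def decode(input_data_value):
    """Map an 8-bit input data value to (register index, accumulator index)."""
    # register decoder: logical right shift by 3 bits (divide by 8)
    register_index = input_data_value >> 3
    # accumulator decoder: mask the five high-order bits, keeping the low 3 bits
    accumulator_index = input_data_value & 0x7
    return register_index, accumulator_index

reg, acc = decode(250)
# reg == 31, acc == 2: accumulator 250 is the third accumulator of register 31
```

Other layouts (e.g., more accumulators per register) would change only the shift amount and mask width.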
Once the register indices 54 and the accumulator indices 58 have been generated, the vector processor 12 may synthesize vector instructions (not shown) to perform scalar operations on the vector registers 18 appropriately. In the example of
As discussed above, in the example of
Some aspects may provide that the collision merge logic block 60 may determine that two or more of the register indices 54 corresponding to two or more of the write ports 22 are identical. In such a case, the vector processor 12 may synthesize vector instructions to merge the scalar operations to be performed on multiple accumulators 46 within the vector register 18 indicated by the matching register indices 54 into a single merged scalar operation. In some aspects, the collision merge logic block 60 may further determine that two or more of the accumulator indices 58 corresponding to the matching register indices 54 are also identical. Accordingly, the vector processor 12 may synthesize vector instructions to merge the scalar operations to be performed on the accumulator 46 indicated by the matching accumulator indices 58 into a merged scalar operation (e.g., to increment the accumulator 46 by more than one).
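The collision handling described above can be sketched as a pre-merge pass over the decoded indices. The following is a hypothetical software model (real hardware would use comparator logic across the write ports); increments targeting the same (register index, accumulator index) pair are combined into a single merged scalar operation with a larger increment amount:

```python
from collections import Counter

def merge_collisions(input_data_values):
    """Merge scalar increments targeting the same (register, accumulator)
    pair into one merged scalar operation with a combined increment."""
    merged = Counter()
    for value in input_data_values:
        register_index = value >> 3      # assumed 32 x 8 accumulator layout
        accumulator_index = value & 0x7
        merged[(register_index, accumulator_index)] += 1
    # each entry is one merged scalar operation: increment by the count
    return dict(merged)

ops = merge_collisions([250, 250, 17, 250])
# ops == {(31, 2): 3, (2, 1): 1}
```

Here the three occurrences of the value 250 collapse into a single operation that increments accumulator 250 by three, avoiding a write-port conflict on vector register 31.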
As noted above, in some aspects, the vector register file 16 of
In this example, the register decoder 70 generates a register index 74 for the input data value 62 by performing a logical right shift of a plurality of bits of the input data value 62 by three (3) bits. The result, as seen in
The vector processor 12 first accesses an input data value 62 of the input data vector 64 (block 82). As discussed above, the input data value 62 may represent a subset of the input data vector 64. Based on the input data value 62, the vector processor 12 determines a register index 74 indicative of a vector register 18(31) among a plurality of vector registers 18 in the vector register file 16 (block 84). In some aspects, the vector processor 12 may determine the register index 74 by performing one or more logical right shifts of a plurality of bits of the input data value 62 (block 86). The vector processor 12 also determines, based on the input data value 62, an accumulator index 76 indicative of an accumulator 46(250) among a plurality of accumulators 46 of the vector register 18 (block 88). Some aspects may provide that the vector processor 12 determines the accumulator index 76 by masking one or more high-order bits of the input data value 62 to zero (0) (block 90). It is to be understood that additional and/or other operations may be performed by the vector processor 12 to determine the register index 74 and/or the accumulator index 76 based on the input data value 62.
The vector processor 12 then performs a scalar operation on the vector register 18(31) indicated by the register index 74, based on the register index 74 and the accumulator index 76 (block 92). As a non-limiting example, the scalar operation may comprise operations to increment the accumulator 46 indicated by the accumulator index 76. In some aspects, the scalar operation may include additional and/or other arithmetic and/or logical operations on the accumulator 46. For instance, the scalar operation may include calculating a weighted histogram, in which the accumulator 46 is incremented according to a weighting value (not shown) included as part of the input data value 62.
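The weighted-histogram variant mentioned above can be modeled by replacing the fixed increment of one with a per-value weight. The following is a minimal sketch; the pairing of each data value with its weight is an assumption made for illustration:

```python
def weighted_histogram(values_and_weights):
    """Accumulate a per-value weight, rather than a count of one."""
    accumulators = [0.0] * 256
    for value, weight in values_and_weights:
        # the data value still indexes the accumulator; the weight
        # determines the amount added by the scalar operation
        accumulators[value] += weight
    return accumulators

acc = weighted_histogram([(10, 0.5), (10, 2.0), (200, 1.0)])
# acc[10] == 2.5, acc[200] == 1.0
```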
As discussed above with respect to
In
The vector processor 12 next combines the two or more scalar operations into a merged scalar operation (block 100). As a non-limiting example, two scalar operations to increment the same accumulator 46 by one (1) may be merged into a merged scalar operation to increment the accumulator 46 by two (2). The vector processor 12 then performs the scalar operation by performing the merged scalar operation (block 102).
Parallelization of scalar operations by vector processors using data-indexed accumulators in vector register files according to aspects disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a portable computer, a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, and a portable digital video player.
In this regard,
Other master and slave devices can be connected to the system bus 112. As illustrated in
The CPU(s) 106 may also be configured to access the display controller(s) 124 over the system bus 112 to control information sent to one or more displays 130. The display controller(s) 124 sends information to the display(s) 130 to be displayed via one or more video processors 132, which process the information to be displayed into a format suitable for the display(s) 130. The display(s) 130 can include any type of display, including but not limited to a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer-readable medium and executed by a processor or other processing device, or combinations of both. The master and slave devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The aspects disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary aspects herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps illustrated in the flow chart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present application claims priority to U.S. Provisional Patent Application Ser. No. 62/029,039 filed on Jul. 25, 2014 and entitled “PARALLELIZATION OF SCALAR OPERATIONS BY VECTOR PROCESSORS USING DATA-INDEXED ACCUMULATORS IN VECTOR REGISTER FILES, AND RELATED CIRCUITS, METHODS, AND COMPUTER-READABLE MEDIA,” which is incorporated herein by reference in its entirety.