Hereinafter, embodiments of the present invention will be described in detail using the accompanying drawings.
A first embodiment of the present invention will be described in detail with reference to the accompanying drawings.
Various IO devices are connected to the DMA controller 7 via a DMA bus 10. To the IO device are connected a video input part 11 for carrying out a video input such as a camera and NTSC signal, a video output part 12 for outputting videos such as NTSC, a voice input part 13 for inputting voices of a microphone or the like, a voice output part 14 for outputting voices of a loudspeaker, optical output, or the like, a serial input part 15 and a serial output part 16 for carrying out serial transfer of a remote control or the like, a stream input part 17 for inputting streams such as a TCI bus, a stream I/O part 18 for outputting streams of a hard disk or the like, and various IO devices 19. To the PCI bus 22 are connected various PCI devices 23, such as a hard disk and a flash memory.
To the display control part 8 is connected a display 21 which is a display device. The picture processing part 6 is a processing part for carrying out processing to a two-dimensional image, such as video codec, scaling of images, and filtering of images. In this way, this embedded system is a system which has both input and output of video and voice, and carries out picture and voice processings. This system includes, for example, a cellular phone, a HDD recorder, a monitoring device, an on-vehicle image processing device, and the like.
The signal line group 50d is an input signal and is stored in a register 510. A clockwise input signal group 511, i.e., an output of the register 510, which is delayed by one cycle, is inputted to a BID decoder 512, a selector 513, and the signal line group 50a. To the BID decoder 512, at least WE and BID among the input signal group 511 are inputted. The BID decoder 512 has a block ID [4:0] for recognizing its own block number.
Moreover, if the data write condition is satisfied, the selector 513 is selected to the input signal line group 50b side, and the input signal line group 50b is outputted to the signal line group 50e. At this time, SBR_OUT_REQ of the input signal line group 50b is outputted to SBR_WE_OUT of the input signal line group 50e. If SBR_OUT_REQ is 0, it is inputted to a shift register slot at the next stage as an invalid transaction. This protocol is the same as the data write in
These behaviors of the BID decoder 512 enables: a behavior that an input from the signal line group 50d is received as a data write transaction; a behavior that the signal line group 50b is outputted to a shift register slot at the next stage as a data write transaction; and that a transaction is succeeded to the next stage even if the transaction is not the data write transaction to its own block. In this way, the clockwise data transfer from the left side block to the right side block is realized.
Similarly, with respect to the above description, the signal line group 50d is replaced with the signal line group 50g, the signal line group 50e is replaced with the signal line group 50f, the signal line group 50a is replaced with the signal line group 50c, the register 510 is replaced with a register 514, the BID decoder 512 is replaced with a BID decoder 516, the selector 513 is replaced with a selector 517, and the SBR_OUT_REQ signal is replaced with an SBL_OUT_REQ signal, thereby allowing a counterclockwise data transfer from the right side block to the left side block to be realized.
In addition, when a data write transaction occurred simultaneously from the signal line group 50a and the signal line group 50c to a memory with a single port memory, such as a memory, a conflict at the memory write port will occur. In order to prevent this, there are several methods. One of them is that one side of the shift type bus is stalled to prioritize a data write from one side. In this case, the conflict signal is broadcasted to all the blocks before stopping the shift type bus. Moreover, by inputting the signal line group 50a and signal line group 50c to FIFO, the frequency of the conflict can be prevented. Moreover, in the case where such a memory is used, an interleave type memory configuration is employed so that the writing from the clockwise shift type bus and the writing from the counterclockwise shift type bus may be carried out to separate bank memories, and thus the conflict can be prevented. However, the data flow is simple, and for the data delivery between blocks, the clockwise shift type bus is used, and for reading an external memory, i.e., a data write transaction via the internal bus bridge 60, the counterclockwise shift type bus is used, and thus the conflict can be prevented. Moreover, the probability that the data write transactions occur to one memory in the same cycle from the clockwise shift type bus and from the counterclockwise shift type bus and thus a conflict occur is extremely small. For this reason, the extent to which the performance decreases may be low.
With this method, the bus transfer can be achieved without having a global bus arbitration circuit which is usually timing-critical. Moreover, by being through registers in the unit of block by means of the registers 510 and 514 in the shift register slot 500, the long wirings and timing critical paths can be reduced in an actual LSI floor plan. Generally, in a tri-state bus architecture and a crossbar switch type bus, as the number of blocks increased, the critical timing and the amount of wirings will increase, however according to this method, even when the number of blocks to be connected to the bus is increased, an increase in the critical timing and the amount of wirings can be suppressed.
Moreover, the data transfer can be carried out in parallel in the same cycle between a plurality of blocks, so that a high data transfer performance can be obtained. Especially when carrying out the data transfer only to adjacent blocks, a data bandwidth in proportional to the number of blocks can be obtained. As described above, the bus protocol of the shift type bus 50 is only data writing. In the bus protocol of data write, an address (ADDR_OUT) and a data (DATA_OUT) can be outputted in the same cycle as a request signal (WE_OUT), and thus a simpler bus can be configured as compared with a bus structure in which the data write is carried out using a FIFO or a queue while holding the state.
The clock stop signal 500p is inputted to the terminal 50p. When this clock stop signal 50p is active, the signal line group 50d and signal line group 50g are selected at both selector 513 and selector 517, respectively. This allows for the through-propagation without being through the register from the input to the output. This method allows for a data transfer, for example, even when a clock for one block is stopped. Because this shift type bus 50 does not have a global bus arbitration circuit, a clock is supplied to only a block which should at least operate, thus allowing for a data transfer between blocks and reducing the number of registers to operate, so that the power consumption can be reduced. In addition, by supplying a clock to the whole shift type bus 50 and not supplying the clock to each block, each block can be also stopped with an increase in power worth of the registers 510, 514, and 518.
In this way, the shift type bus 50 allows for connection between adjacent blocks with a simple interface. Accordingly, a plurality of blocks can be connected by extending the block ID field. Although in the description of this embodiment the shift type bus 60 is described as a common bus in the picture processing part 6, the invention is not limited thereto. For example, use of the shift type bus interface at LSI pins allows for serial connection of a plurality of LSIs, so that communication not only with adjacent LSIs but also with LSIs which are distant arrangement-wise. In addition, in the inter-LSI connection, a reduction in pin counts can be also achieved using a high-speed serial interface or the like.
Moreover, the shift type bus 50 has a Last signal. If this signal line is “1” upon data transfer, a data memory ready counter DMRC in a synchronization control part 473 described later is counted up. This provides a synchronization between blocks at instruction level. The detail thereof will be described later. In addition, the shift type bus also has a read transaction. This read transaction also will be described later.
Again, the picture processing part 6 is described using
Next, the picture processing engine 66 in the first embodiment is described in more detail using
Moreover, the picture processing engine 66 includes an instruction memory 31 and data memory 35 capable of carrying out a data write from the shift type bus 50. To the data path part 36, an instruction memory control part 32 for controlling the instruction memory 31 is connected via a path 42 and a data memory control part 33 is connected via a path 43. The instruction memory control part 32 is a block which controls a data write from the shift type bus 50 to the instruction memory 31 and controls an instruction supply to a CPU part 30, and the instruction memory control part 32 is connected to the instruction memory 31 via a path 40, to the CPU part 30 via a path 37, and to the data path part 36a via the path 42, respectively. The data memory control part 33 is a block which controls a data write from the shift type bus 50 to the data memory 35 and controls a data output from the data memory 35 to the shift type bus 50, which data output the local DMAC 34 controls. The data memory control part 33 further controls an access from the CPU 30 to the data memory 35. The control of the data memory 35 is carried out using a path 41.
The data write from the shift type bus 50 to the data memory 35 and the data output from the data memory 35 to the shift type bus 50 are controlled via the path 43 in concert with the data path part 36. The connection to the CPU part 30 is controlled by two paths. The data read processing from the data memory 35 to the CPU part 30 is controlled by a path 38, and the data write from the CPU part 30 to the data memory 35 is controlled by a path 39. In both cases, the access address of the data memory 35 is supplied via a path 45.
In addition, although in the description of this embodiment, for ease of description, the number of the data memory 35 is one, an interleave configuration using a plurality of data memories is also possible. With the interleave configuration, the access to a plurality of data memories 35 can be carried out in parallel. In prior to describing the present invention, the calculation contents by the CPU 30 are defined. However, these calculation contents are for describing the essence of the present invention, and the types of calculation contents are not limited thereto.
The CPU part 30 is a CPU for carrying out calculations, and the like, to the two-dimensional image. In this embodiment, for ease of description, assume that the CPU part 30 has four instructions shown below. However, the types of the instruction are for ease of description, and the instruction types are not limited thereto. However, a means to specify a register pointer and a height direction described later is the indispensable element. Let the four instructions be a branch instruction, a read instruction, a write instruction, and an add instruction. Table 8 to Table 11 show the required bit fields in the instruction format of each instruction.
Here, the CPU part 30 is described as having a read instruction, a write instruction, and a divide-add instruction, except for a branch instruction. Accordingly, during a read instruction, at the time when a read data 38 is returned, the control line 308 uses a register number pointer value, in which register a read data is stored, as a storage location register number pointer. During a write instruction, a write data register number is used because reading the register file 304 is required. During a divide-add instruction, both reading and writing to the register file 304 are required and thus these are controlled. Although in this description the instruction decode signal 309 becomes active only during the divide-add instruction, in case of having other instructions a signal for controlling the arithmetic logical unit is outputted in accordance with the type of the instruction. The control line 310 selects the read data 38 at the time of a read instruction, and selects a calculation result 314 of the arithmetic logical unit 313 at the time of a divide-add instruction. A selected calculation data 315 is stored in the register file 304. Moreover, at the time of a read instruction and at the time of a write instruction, the instruction decode part 303 controls the arithmetic logical unit 313 to generate an access address 45 of the data memory 35.
In addition, the arithmetic logical unit 303 consists of 8-parallel SIMD type arithmetic logical units like in JP-A-2000-57111, where eight 8-bit width additions can be executed in parallel. That is, eight divide-add operations can be executed in parallel. Moreover, the data width of the CPU 30 is set to 8 bytes. Accordingly, a read instruction, a write instruction, and a divide-add instruction can be executed in the unit of 8 bytes. Moreover, assume that 8, 16, and 32 can be defined in the width field of a read instruction, a write instruction, and a divide-add instruction, and in the count field, 1 to 16 can be specified at an interval of one.
The operation of generating the access address 45 of the instruction decode part 303 and arithmetic logical unit 313 is described using
The instruction decode part 303 includes a Wc counter, which is cleared to 0 upon activation of an instruction (Step 90). Next, in Step 91, a read instruction, a write instruction, and a divide-add instruction are executed using Src and Dest, and (Addr+Wc). Next, in Step 92, one is added to Src and Dest, and 8 is added to Wc. In Step 93, the Width field specified in the instruction field is compared with Wc. If Width is greater than Wc, the flow returns to Step 91 again to repeat the instruction execution. If Width is equal to or smaller than Wc, the flow changes to Step 94 to determine whether the Count value shown in the instruction field is 0 or not. If the Count value is not 0, the flow changes to Step 95, where one is subtracted from the Count value and Pitch is added to Addr, and again the flow changes to Step 90 to repeat the instruction execution. If the Count value is 0, the instruction execution is terminated. At this time, the instruction decode part 303 outputs the instruction fetch request 37r.
The behavior of the flowchart of
In the calculation contents shown in
In addition, in
The instruction memory control part 32 in the first embodiment is described using
In case of an instruction fetch access, the arbitration part 320 causes the selector 323 to select an output of an instruction program counter 322 for reading the instruction memory 31, and outputs a control line 321 to increment the program counter 322. An instruction 40d returned from the instruction memory 31 is stored in an instruction register 324 and is returned to the CPU part 30 as the instruction 37i. At the same time, the operation code field of the instruction is inputted to a branch control part 325, where whether it is a branch instruction or not is determined and a signal 326 which is set to 1 at the time of a branch instruction is inputted to the arbitration part 320. Moreover, a read index field of the instruction register is inputted to a branch condition register 327. The branch condition register 327 is a group of registers consisting of a plurality of one bit width words, and the word is specifies by a read index field of the branch condition register, and a signal 328 with one bit width is inputted to the arbitration part 320.
The actual branching occurs if the signal 326 is 1 and if the signal 328 is 1. The combinations other than this are recognized as instructions other than the branch instruction. The arbitration part 320 returns the instruction ready signal 37d only at the time of instructions other than the branch instruction. At the time of the branch instruction, the instruction ready signal 37d is not returned, and the selector 323 selects an immediate value stored in the instruction register 324. At this time, the program counter 322 is updated with a value incremented by this immediate value.
According to this method, when an interval of issuing the instruction fetch request 37r of the CPU takes several cycles, the cycles which it takes to re-read the instruction due to a branch instruction can be masked completely, so that the performance decrease due to the branching can be suppressed. In the CPU part 30 in the present invention, a two-dimensional operand is specified, so that the pitch of issuing the instruction fetch request 37r is large and thus the above-described advantage is significant.
The data memory control part 33 in the first embodiment is described using
First, connection to the CPU part 30 is described. The data memory address 45 at the time of a read instruction and write instruction is through the address selector 331 and is inputted to the data memory 35 as the data memory address 41a. At the time of a write instruction, the write data 39 is inputted to the data memory 35 via a data selector 332 as the write data 41w. At the time of a read instruction, in accordance with the data memory address 41a the read data 41d is read and stored in a data register 333. The stored read data is returned to the CPU part 30 as the read data 38. In addition, if a value of the master S/D register is specified in DestReg of a read instruction, the read data is outputted to the read data 43r. Next, in a write processing from the shift type bus 50, the address line 43a is through the address selector 331 and is inputted to the data memory 35 as the data memory address 41a. At the same time, the data line 43d is inputted to the data memory 35 via the data selector 332 as the write data 41w.
Finally, at the time of access from the local DMAC 34, the address 43p is through the address selector 331 and is inputted to the data memory 35 as the data memory address 41a. The read data 41d read correspondingly is stored in the data register 333 and is returned as the read data 43r.
The local DMAC 34 in the first embodiment is described using
The local DMAC 34 includes four sets of register groups, i.e., a master D register 340 and master S register 341 which can be rewritten by a read instruction, and a slave D register 342 and slave S register 343 which can be written from the shift type bus 50. Table 12 to Table 15 show the format of each register.
The data transfer using the local DMAC 34 has three types of operation modes.
The first one is a data write mode. The data write mode is a mode in which its own data memory 35 is read using a parameter of the master D register 340, and the data is transferred to a block of other picture processing engine or the like using a parameter of the master S register 341 and the data is written to an address-mapped region of the data memory 35 or the like.
The second one is a read command mode. The read command mode is a processing in which the values themselves of the master D register and the master S register are transferred to a block of other picture processing engine or the like, as the data, and the values are stored in the slave D register and the slave S register of the other block. This operates as a read request to other block. In addition, at the time of a read command mode, as an interface of the shift type bus 50, a CMD signal is set to 1 for transferring. A block which receives a read command recognizes based on the CMD signal whether or not this shift type bus transfer is a read command or not.
The third one is a read mode. This is a mode in which in response to the read request received in the above-described read command mode, the data memory 35 is read using a parameter of the slave D register 342, and the data is transferred to a block, such as other picture processing engine, using a parameter of the slave S register 343, and the data is stored in a address-mapped region of the data memory 35, or the like. With a combination of these three modes, a data transfer is achieved between blocks, such as the picture processing engines, or the like
The master D register 340 and master S register 341 can be updated by a read instruction issued by the CPU part 30, and at this time, a data is inputted from the signal line 44pw to thereby update two registers. That is, a descriptor, in which the contents of data transfer is described, is stored in the data memory 35 in advance, and the data transfer is started by copying the contents to the master D register 340 and the master S register 341.
Upon update of the two registers, the state changes to two states depending on the Mode field of the master D register 340. If the Mode field indicates a data write mode, MADDR, MWidth, MCount, and MPitch of the master D register 340 are transferred to a data memory address generator 346 via an address selector 344. The data memory address generator 346 generates an address for reading the data memory 35, and outputs the address 44da. The address is generated by the same method as the access address 45 which the instruction decode part 303 in the CPU part 30 generates. Accordingly, the data memory address generator 346 has a Wc counter, where a two-dimensional rectangular address is generated by an address generation replacing MWidth, MCount, and MPitch with Width, Count, and Pitch, respectively.
In the same way, SADDR, SWidth, SCount, and SPitch of the master S register 341 are inputted to a shift type bus address generator 347 via an address selector 345, where an address to be outputted to the shift type bus 50 is generated, thereby outputting the address 44sa. The address generation by this shift type bus address generator 347 also expresses a two-dimensional rectangular like in the address generation of the data memory address generator 346. With these two addresses, the read data 43r is read from the data memory 35 sequentially, so that a data write processing is achieved from the picture processing engine 66 to the shift type bus 50, as the signal line group 50b. At this time, the destination block is a block which the field SBID of the master S register 341 indicates. At this time, whether to use the counterclockwise shift type bus or to use the clockwise shift type bus is determined in accordance with a MDIR flag.
In addition, in this method, the address 44da of the data memory 35 and the address 44sa for outputting to the shift type bus are generated using MWidth, MCount, MPitch, and SWidth, SCount, SPitch, respectively. In this way, the address generation by two sets of registers each allows the shape of a two-dimensional rectangular to be converted, thus allowing for data transfer. However, when transferring as the same rectangular, the address can be generated by the parameter of only one of the registers.
On the other hand, when the Mode field indicates a read command mode, the values of the master D register 340 and master S register 341 are outputted as the direct output signal 44swb to thereby transfer the read command to other block. At this time, the destination block is a block which the MBID field of the master D register 340 indicates. When the destination block received this read command, the slave D register 342 and slave S register 343 are updated to start the processing as a read mode. The read command is through the path 44sw and is updated in the slave D register 342 and slave S register 343. After the destination block receives the read command, the read data is read and outputted to the shift type bus 50 by almost the same operation as that of the above-described data write processing. MADDR, MWidth, MCount, and MPitch of the slave D register 342 are inputted to the data memory address generator 346 via the address selector 344 to access the data memory 35 as the address 44da. Subsequent behavior is the same as the one at the time of data write. In the same way, SADDR, SWidth, SCount, and SPitch of the slave S register 343 are inputted to the shift type bus address generator 347 via the selector 345, where the address 44sa is generated. Subsequent operation is the same as the one at the time of data write. With these three behaviors of the local DMAC 34, in the shift type bus 50 the data transfer is achieved with only a write transaction in which an address and a data can be outputted in the same cycle. Generally, in order to improve the performance of a bus, a split type bus is used in which an address and a data are separated to each other. In the split type bus, an address and a data are managed by ID, such as the same transaction ID, and a slave side of each request queues the address into FIFO or the like and waits until receiving a data. Accordingly, the bus performance is limited by the number of stages of the queue or FIFO. On the other hand, in this method, in every bus transfer, an address and a data can be transferred in the same cycle and thus the saturation of the performance due to the number of stages of FIFO or the like will not occur.
In addition, the operation of the local DMAC 34 is activated by a read instruction, and upon this activation, the CPU part 30 can execution the next instruction. However, only during transfer execution using the local DMAC 34, the use of next local DMAC 34 is prohibited and is stalled However, the performance decrease due to conflict will not occur by increasing the pitch of issuing an activation of the local DMAC 34. Meanwhile, the CPU part 30 executes other processing sequence and thus the processing of the CPU part 30 and an interblock transfer can be executed in parallel, allowing the required number of processing cycles to be reduced. Moreover, concerning a read transfer, the receipt of the next read command is prohibited and the termination is not executed on the shift type bus 50 during execution of a read processing because the local DMAC includes only one set of slave D register 342 and slave S register 343. The shift type bus 50 is loop-shaped, and thus a restart of the read command is enabled by receiving a read command at the time when the read command circled the shift type bus 50. By carrying out most of the data transfer between blocks in a write mode and thus suppressing the generation frequency of a read, this performance decrease can be reduced. Because the picture processing involves a lot of data flow-like behaviors and the interblock transfer mostly uses a write mode, this method can suppress the performance decrease.
In transferring by means of the local DMAC 34, a “Last” signal can be outputted to the shift type bus 50. Namely, at the time of transferring while the Last field in the master D register 340 or the slave D register 342 is “1”, only one cycle is asserted at the time of the last transfer in transferring a two-dimensional rectangular. Accordingly, whether the direct memory transfer of interest is completed or not can be recognized. This is used at the time of interblock synchronization described later.
The data path part 36 in the first embodiment is described using
In this way, according to the first embodiment, the power consumption can be reduced by reducing the frequency of access to the instruction memory 31 and stopping the clock supply to each block, and the like. Moreover, by means of masking in the branch instruction and the operation in parallel with the local DMAC 34, and the like, the number of processing cycles is substantially reduced to achieve a reduction in power consumption.
A second embodiment of the present invention is described using
In the descriptions of
Moreover, when a data is direct memory transferred from other picture processing engine 6 to the data memory 35 by using the local DMAC 34 and then the CPU part 30 reads this transfer data, it should be recognized that this direct memory transfer is completed. If the data transfer is not completed, the CPU part 30 is stalled. This is called the interblock synchronization. In addition, although the interblock synchronization can be used also in the first embodiment, the description is made only with this second embodiment. The synchronization control part 473 carries out these three synchronization processings. Next, the synchronization control method is described. In the synchronization control, the synchronization is carried out by means of four counters to be arranged for each CPU, two counters to be arranged as one pair in a block, and five flags defined on an instruction. Table 16 shows the definition of the counters. Moreover, Table 17 shows the definition of a synchronization field to be arranged in an instruction.
First, the input synchronization is described using
Moreover, even when the processing speed of the vector calculation part 46 is slow and there is a difference between the count-up of SRC and the count-up of ERC, the preparation of the calculation data 30wb by the CPU part 30, i.e., the count-up of the execution ready counter ERC, is possible and thus can operate as a data pre-fetch.
In the same way, when the CPU part 30 uses the calculation data 30i which the vector calculation part 46 generated, as opposed to the above description the DRE field is used by an instruction of the vector calculation part 46, and the ISYNC field is used by an instruction of the CPU part 30, and by means of the execution ready counter ERC [CPU part 30] and slave request counter SRC [CPU part 30] arranged in the CPU part 30, the input synchronization is enabled. In addition, although the input synchronization using the execution ready counter ERC and slave request counter SRC has been described here, the input synchronization is possible even with one bit width flag. For example, the flag is set based on the update condition of the execution ready counter ERC. Until this flag and the ISYNC flag of a CPU instruction at the receiving side of a calculation data both are set to 1, two CPUs are stalled. By clearing the flag at the time when the stall is released, a synchronization between two CPUs is enabled with few logic circuits.
Next, the output synchronization is described using
In the operation of this example, at the time when an instruction whose RFR field is set to 1 is terminated by an instruction of the vector calculation part 46, the CPU part 30 can write to the register file 462 of the vector calculation part 46. At the time when an instruction whose RFR field is set to 1 is terminated, the register file ready counter RFRC [CPU part] of the CPU part 30 is counted up. By this time, an instruction whose OSYNC is set by the CPU 30 part is stalled upon activation request. This stall condition is when the value of the register file ready counter RFRC [CPU part] is smaller than or equal to the master request counter MRC [CPU part]. When an instruction whose OSYNC is set by the CPU part 30 is activated and received, the master request counter MRC [CPU part] is counted up. Also in this method, like in the input synchronization, when the processing of a CPU at the preceding stage is extremely slow and the processing of a CPU at the subsequent stage is fast, more free space in the register file can be freed up. In this case, a stall will not occur at the time of the output synchronization of the CPU at the preceding stage. In the same way, until the register file 304 of the CPU part 30 becomes writable, in the output synchronization in which the vector calculation part 46 is stalled, the vector calculation part 46 uses OSYNC and the CPU part 30 sets the RFR field, thereby achieving the output a synchronization between two CPUs. With a combination of these input synchronization and output synchronization, a fine-grain synchronization between two CPUs at register file level is achieved. These synchronization methods are characterized in that an instruction itself includes a synchronization field.
Finally, the interblock synchronization is described using
The second counter is a data memory access counter DARC and is a counter which is counted up when an instruction, whose MSYNC arranged in an operation code of a read instruction is “1”, becomes executable. Accordingly, the timing that the CPU part 30 can execute reading is when the data memory ready counter DMRC is greater than the data memory access counter DARC. In other words, if the data memory ready counter DMRC is equal to or smaller than the data memory access counter DARC, the CPU part 30 is stalled. In this way, a synchronization between blocks is enabled at instruction level of the read instruction.
In this way, according to the second embodiment, because the interval of issuing an instruction is large even when a plurality of CPUs capable of using a two-dimensional operand share an instruction memory, the performance decrease can be suppressed and the memory area can be reduced by sharing the instruction memory. Moreover, the read and write processings to the data memory 35 are carried out in the CPU part 30, the data processing is carried out in the vector calculation part 46, and the synchronization between two CPUs at register file level is carried out by a synchronization means, thereby allowing the calculation throughput to be improved. Moreover, at instruction level, the a synchronization between blocks is achieved.
A third embodiment is described using
Moreover, also concerning the synchronization processing, the control thereof differs. In the description of the second embodiment, the input synchronization method and output synchronization method between the adjacent CPUs were described. Also in the third embodiment, the same synchronization processings are carried out. That is, the input synchronization and output synchronization are carried out between the adjacent CPUs. Moreover, synchronization is also carried out between the CPU part 30s at the final stage and the CPU 30 at the first stage. Moreover, the CPU part 30 and CPU part 30s both access to the data memory 35. Accordingly, the data memory control part 33 shown in
Moreover, according to this method, even in a configuration in which two or more CPUs are connected in series and in a ring shape, a multi-CPU configuration with a synchronization between CPUs is achieved. Moreover, even when the number of CPUs increased, the number of read-write ports of a register file will not increase, thus not allowing the area of a network and register file to be increased. For example, in an increase in the number of CPUs by the VLIW configuration shown in JP-A-2001-100977, the number of ports of a register increases in proportion to the number of arithmetic logical units and the area cost increases. In contrast, in the series connection according to this method these will not increase.
Moreover, in the VLIW system, the timings that a plurality of arithmetic logical units are activated differ to each other. For example, consider an example in which in the same calculation loop, a first arithmetic logical unit carries out a memory read, and a second arithmetic logical unit carries out a general calculation, and a third arithmetic logical unit carries out a memory write. At this time, although the numbers of calculation cycles in which the respective CPUs actually operate differ, the processings are carried out in the same calculation loop and therefore the operation rate of the arithmetic logical units decreases, and as a result, the number of required processing cycles increases and the power consumption increases. On the other hand, according to this method, CPUs each are capable of including a program counter, respectively, and is capable of processing its own calculation without depending on the operation of other CPUs as well as the operation of program counters of other CPUs. For example, when changing one parameter between the fifth and sixth time loops out of 10 times of loops, although in the VLIW system the instruction sequence needs to be described with two loops of 5 times each, in this method the CPUs each have a program counter and thus only a CPU which changes the parameter can specify the instruction sequence with two loops, so that the calculation operation rate can be improved and the capacity of the instruction memory 31 to use can be reduced.
Next, there is shown an embodiment concerning a method of specifying a two-dimensional operand consisting of a Width field and a Count field in the operand of an instruction. Up till now, a reduction in the number of instructions by specifying a two-dimensional operand, and a reduction in power consumption by reducing the number of times of reading the instruction memory 31, and a reduction in power consumption and reduction in the area cost by reducing the capacity of the instruction memory 31, have been described. In addition to these, a reduction in power consumption by reducing the number of processing cycles can be also achieved. Here, the embodiment is described using inner product calculation and convolution calculation.
The inner product calculation is one of the generic image processings used for a video codec, an image filter, and the like. Here, an inner product calculation of 4×4 matrix is described as an example.
With these four instructions, the first row of the inner product calculation is calculated and then by changing Src1 register, four rows of calculations are carried out. Accordingly, a total of 16 instructions are calculated consuming 16 cycles. In addition, as a pre-processing, the transposition of Matrix A is required. Accordingly, the number of required cycles is actually greater than 16 cycles.
On the other hand, in this embodiment capable of specifying a two-dimensional operand, a configuration of an arithmetic logical unit shown in
Pay attention to the first row of the calculation result of the example of inner product calculation of
Next, for the inner product calculation with respect to the transposed matrix, the operation thereof is described using an example of inner product calculation of
In addition, in a matrix calculation in which only the second matrix is transposed, the same data supply as that of the inner product without transposition is carried out for the inputs of Src1 and Src2, and the arithmetic logical unit is realized with a configuration in which four elements are added in one cycle like in the ordinary SIMD type arithmetic logical unit. In this method, the outputs of four Registers 601 are added without using Register 608 at the input of the sigma adder 607. Next, an operation example of a convolution calculation is described. The convolution calculation is used in filtering processing, edge enhancement, and the like, by a low pass filter, high pass filter, and the like of images. Moreover, this calculation is also used in a motion compensation processing in a video codec. In the convolution calculation, unlike the inner product calculation, the second matrix (serve as a convolution coefficient) is fixed, and with this convolution coefficient the calculation is carried out to the whole data elements of the first matrix.
The second cycle:
The third cycle:
The fourth cycle:
By sigma adding these 4 cycles of data in the sigma adder 607, a convolution calculation result of the first row is obtained. In the fifth cycle, again by inputting Register 2 to Src1 and inputting Register 8 to Src2, the convolution calculation of the second row is carried out. As a result, the convolution calculation results of 4×4 matrix is obtained in 16 cycles.
In addition, in these descriptions, although a shift register is used for supplying Src1 and Src2, the same effect is obtained by selecting the data using a selector and carrying out the same data supply. Accordingly, the invention is characterized by a means for supplying data.
In the general SIMD type arithmetic logical unit shown in
On the other hand, in the horizontal convolution calculation by the general SIMD type arithmetic logical unit shown in
Accordingly, specifying a two-dimensional operand as in this method means expressing a plurality of source instructions with one instruction, so that it is possible to reduce the processing cycles, including a pre-processing and a post-processing other than truly required product sum operation. As a result, the processing can be realized with a low operation frequency and the power consumption can be reduced further.
It should be further understood by those skilled in the art that although the foregoing description has been made on embodiments of the invention, the invention is not limited thereto and various changes and modifications may be made without departing from the spirit of the invention and the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2006-170382 | Jun 2006 | JP | national |