1. Field of the Invention
The present invention relates to a data processing apparatus capable of realizing various kinds of processings, and more particularly, to a stream processor thereof.
2. Description of the Related Art
A data processing apparatus has been required to have a higher capability of processing a large amount of data such as moving video data at a higher speed, so that such a data processing apparatus has not only a host processor such as a central processing unit (CPU) but also digital signal processors (DSPs) or application specific integrated circuit (ASIC) units for decreasing the processing load of the host processor.
On the other hand, in a data processing apparatus, various kinds of data encoding/decoding processings are required for processing multi-media data such as stationary video data, moving video data, audio data and music data. Also, various kinds of communication protocols are used for transmitting/receiving data via networks such as the Internet. Further, encipherment/decipherment processings are required for maintaining data security protection.
Thus, in order to sufficiently decrease the processing load of the host processor, a large number of DSPs or ASIC units are required, which would increase the size and the manufacturing cost of the data processing apparatus.
Instead of providing a large number of DSPs and ASIC units, a prior art data processing apparatus is constructed by a programmable logic device (PLD) whose task program is changed by a changing section as occasion demands (see: JP-11-184718-A). This will be explained later in detail.
In the above-described prior art data processing apparatus, since a task program of a task carried out by the programmable logic device is changed by the changing section, various tasks can be carried out, which would decrease the size and the manufacturing cost.
In the above-described prior art data processing apparatus, however, a task program of a task carried out by the programmable logic device is not actively changed by the changing section, but the host processor actually determines a task program to be carried out by the programmable logic device. As a result, the processing load of the host processor is still large. Also, the storing and loading operation of intermediate data increases latency of a memory, which would decrease the throughput of the host processor.
Note that a task switching operation of the host processor (CPU) is also disclosed in JP-2004-220070-A.
Also, a data-array type processor whose task programs can also be changed is known to correspond to the programmable logic device of the above-described prior art data processing apparatus (see: JP-2001-312481-A, JP-2003-196246-A, and Hideharu Amano, Akiya Jouraku and Kenichiro Anjo, “A dynamically adaptive switch fabric on a multicontext reconfigurable device”, Proceeding of International Field Programmable Logic and Application Conference, pp. 161-170, September 2003).
For example, an array-type data processing apparatus is constructed by a host processor (CPU), a stream processor formed by an array-type processor unit formed by a plurality of processor elements arranged in an array and an input/output control circuit for controlling input/output operations of the array-type processor unit, and a memory for storing task programs and intermediate data for the stream processor. Due to the presence of the array-type processor unit, a plurality of processings can be carried out in parallel.
Even in this array-type data processing apparatus, in the same way as in the above-described prior art data processing apparatus, since the replacement of a task program and intermediate data for the stream processor is performed by the host processor (CPU), the processing load of the host processor (CPU) is still large, so that the processing capability and throughput of the data processing apparatus would be decreased.
According to the present invention, in a stream processor, an input direct memory access (DMA) circuit is adapted to receive a task command and task data in correspondence with a task from an external memory. A processor unit is adapted to receive the task command and the task data from the input direct memory access circuit and perform the task upon the task data in accordance with a task program designated by the task command. A direct memory access controller is adapted to load the task program from the external memory into the processor unit upon receipt of a task program load request from the processor unit.
The present invention will be more clearly understood from the description set forth below, as compared with the prior art, with reference to the accompanying drawings, wherein:
Before the description of the preferred embodiment, a prior art data processing apparatus will be explained with reference to
In
Since the programmable logic device 110 cannot load a task program into the memory of the programmable logic device 110 per se, such a task program is loaded into the memory of the programmable logic device 110 by the CPU 100 and the changing section 120.
That is, every time the CPU 100 needs to make the programmable logic device 110 carry out a task, the CPU 100 transmits a load request for loading a task program of the task and information specifying the task program to the changing section 120. Also, the CPU 100 transmits task data to be carried out to the programmable logic device 110.
On the other hand, when the changing section 120 has received the above-mentioned load request from the CPU 100, the changing section 120 reads a task program designated by the CPU 100 from the memory 130 and loads it into the memory of the programmable logic device 110. As a result, the programmable logic device 110 changes its internal circuit to perform a task using the received task program. After the task is completed, the programmable logic device 110 generates an interrupt signal INT and transmits it to the CPU 100. Then, the CPU 100 again determines the next task to be carried out by the programmable logic device 110. As a result, when the next task is the same as the one carried out immediately before by the programmable logic device 110, the CPU 100 transmits the next task data to the programmable logic device 110. Contrary to this, when the next task is different from the one carried out immediately before by the programmable logic device 110, the CPU 100 transmits a load request for loading another task program and information specifying the next task program, thus renewing the task program stored in the memory of the changing section 120.
Thus, in the data processing apparatus of
In the data processing apparatus of
In
The multi-tasking operation of the data processing system of
In
First, at cycle 1, when the programmable logic device 110 completes the task B using its task program while the CPU 100 carries out the task 0, the programmable logic device 110 generates an interrupt signal INT and transmits it to the CPU 100 to request loading of the task program of the task A.
Next, at cycle 2, the CPU 100 stops the task 0 and stores and reserves intermediate data of the task 0, i.e., the value of internal registers thereof such as a general-purpose register, a status register, a program counter and a stack pointer in the memory 130.
Next, at cycle 3, the CPU 100 reads initial data of the next task A to be carried out by the programmable logic device 110 from the memory 130 and writes it into the internal registers of the CPU 100. Note that the initial data indicates the kind of the task program.
Next, at cycle 4, the CPU 100 transmits a load request REQ for loading the task program of a task A and the information specifying this task program to the changing section 120. Also, the CPU 100 transmits the task data to the programmable logic device 110. As a result, at cycles 5, 6 and 7, the task program A and the task data a1 are loaded from the memory 130 into the programmable logic device 110. Thus, at cycle 8, the programmable logic device 110 performs the task A upon the task data using the task program of the task A.
On the other hand, at cycle 5, the CPU 100 stores the kind of the task program of a task A in the memory 130. Then, at cycle 6, the stored intermediate data of the task 0 is loaded from the memory 130 to the internal registers of the CPU 100, so that, at cycles 7 and 8, the CPU 100 carries out the task 0 again.
Also, at cycle 8, when the programmable logic device 110 completes the task A upon the task data using the task program a1 while the CPU 100 carries out the task 0, the programmable logic device 110 generates an interrupt signal INT and transmits it to the CPU 100 to request loading of the next task program and the next task data.
Thus, the larger the number of interrupt signals INT from the programmable logic device 110, the larger the overhead of the CPU 100 caused by the storing operation of intermediate data to the memory 130 and the loading operation of the stored intermediate data to the CPU 100, i.e., the increase of latency of the memory 130. This would decrease the throughput of the CPU 100.
In
Note that the CPU 1 of
The stream processor 5 is constructed by an input direct memory access (DMA) circuit 51 for reading the descriptors DSC and the task data TDA from the memory 3, an array-type processor unit 52 formed by a plurality of processor elements arranged in an array for performing a task, an input first-in first-out buffer (FIFO) 53 for receiving the task data TDA associated with a short descriptor DSC′ of a transaction identifier TID, a task command TCMD and a data size of the task data TDA of one descriptor DSC from the input DMA circuit 51 and supplying it to the array-type processor unit 52, an output FIFO 54 for receiving the output data OUT associated with the transaction identifier TID from the array-type processor unit 52, a memory access control circuit 55, a descriptor supervising table 56 for receiving the transaction identifier TID and a return (output data) address RADR of one descriptor DSC from the input DMA circuit 51 to control the memory access control circuit 55, and a DMA controller 57 for controlling the array-type processor unit 52.
Each of the input DMA circuit 51, the DMA controller 57, the memory access control circuit 55, and the descriptor supervising table 56 can be formed by using a logic circuit and a memory, or a CPU (or DSP). Also, a plurality of array-type processor units can be provided instead of the single array-type processor unit 52; in this case, each of the array-type processor units is associated with one input FIFO similar to the input FIFO 53 and one output FIFO similar to the output FIFO 54.
The DMA controller 57 will be explained later in detail.
The format of the descriptors DSC stored in the memory 3 is shown in
The interrupt flag FINT is a bit used for informing the CPU 1 of the completion of processing by the stream processor 5.
The task command TCMD is an indicator for indicating a task carried out by the stream processor 5.
The transaction identifier TID is an identifier for distinguishing the descriptors DSC from each other, i.e., for distinguishing the input data (task data) processed by the array-type processor unit 52.
The input data address IADR is a pointer for pointing to a start address of the memory 3 in which the task data is stored.
The input data (task data) size ISZ is size data of the input data (task data).
The return (output data) address RADR is a pointer for pointing to a start address of the memory 3 in which output data of the array-type processor unit 52 is stored.
The descriptor DSC as shown in
Note that the bit width of the descriptor DSC and the short descriptor DSC′ as shown in
The descriptors DSC in the memory 3 are prepared in advance by the CPU 1 carrying out a descriptor preparing program stored in the memory 3. When tasks are carried out by the stream processor 5, the descriptors DSC are input by the input DMA circuit 51. Then, task programs are loaded into the array-type processor unit 52 via the DMA controller 57 in accordance with the input descriptors DSC, while task data are loaded into the array-type processor unit 52 via the input DMA circuit 51 and the input FIFO 53. Finally, output data OUT is supplied from the array-type processor unit 52 via the output FIFO 54 and the memory access control circuit 55 to the memory 3.
The input DMA circuit 51 has a descriptor pointer DP for pointing to an address of the memory 3 associated with the descriptors DSC as shown in
The input DMA circuit 51 reads one descriptor DSC from the memory 3 in accordance with the value of the descriptor pointer DP to extract the task command TCMD, the transaction identifier TID and the task data size ISZ as the short descriptor DSC′ as shown in
On the other hand, the input DMA circuit 51 extracts the transaction identifier TID and the return address RADR from the read descriptor DSC and transmits them to the descriptor supervising table 56 whose content is shown in
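The splitting of one descriptor DSC by the input DMA circuit 51 can be illustrated by the following minimal sketch. It is only an illustration under stated assumptions: the class names, the function name split_descriptor and the use of a dictionary for the descriptor supervising table 56 are hypothetical, and the field values in the usage lines are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """One descriptor DSC as stored in the memory 3 (field widths not modeled)."""
    fint: int  # interrupt flag FINT
    tcmd: int  # task command TCMD
    tid: int   # transaction identifier TID
    iadr: int  # input data address IADR
    isz: int   # input data (task data) size ISZ
    radr: int  # return (output data) address RADR


@dataclass(frozen=True)
class ShortDescriptor:
    """Short descriptor DSC' passed on toward the input FIFO 53."""
    tcmd: int
    tid: int
    isz: int


def split_descriptor(dsc, supervising_table):
    """Hypothetical helper: record TID -> RADR and return the short descriptor."""
    # The return address goes to the supervising table, keyed by TID
    # (a dict stands in for the table; this is an assumption).
    supervising_table[dsc.tid] = dsc.radr
    # Only TCMD, TID and ISZ travel with the task data.
    return ShortDescriptor(dsc.tcmd, dsc.tid, dsc.isz)


# Usage with arbitrary example values.
table = {}
short = split_descriptor(Descriptor(0, 4, 7, 0x1000, 64, 0x2000), table)
```

In this sketch the return address RADR never reaches the processor unit; it is kept aside and looked up later by the transaction identifier TID.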
The input FIFO 53 sequentially stores sets each formed by one task command TCMD, one transaction identifier TID, one task data size ISZ and task data TDA defined by the transaction identifier TID and the task data size ISZ. Every time a task program is loaded by the array-type processor unit 52 or processing of the previous task by the array-type processor unit 52 is completed, the input FIFO 53 transmits the next set to be processed to the array-type processor unit 52. Thus, when a plurality of tasks are processed by the array-type processor unit 52, such tasks can be effectively and successively processed by the array-type processor unit 52 without stopping the operation thereof. On the other hand, while the array-type processor unit 52 loads a task program or intermediate data or carries out a task program, the input FIFO 53 can input the above-mentioned sets. Therefore, the processing efficiency of the stream processor 5 can be increased.
Every time the array-type processor unit 52 has received one task command TCMD from the input FIFO 53, the array-type processor unit 52 loads one task program TASK PRG via the DMA controller 57 from the memory 3, and then carries out the task program TASK PRG. As a result, the array-type processor unit 52 generates output data OUT as a result of processing the task data TDA and transmits the output data OUT via the output FIFO 54 to the memory access control circuit 55. In this case, the transaction identifier TID is associated with start data of the output data OUT.
The output FIFO 54 sequentially stores output data OUT associated with its transaction identifier TID. When the memory access control circuit 55 cannot transmit output data OUT to the bus 6 due to the access competition thereto or the like, the output FIFO 54 would not transmit the output data OUT to the memory access control circuit 55. On the other hand, after the access competition state to the memory 3 has disappeared, the output FIFO 54 would transmit the output data OUT associated with the transaction identifier TID to the memory access control circuit 55. Thus, the output data OUT of the array-type processor unit 52 can be sequentially stored in the output FIFO 54 without stopping the operation of the array-type processor unit 52. Therefore, the decrease of processing efficiency of the stream processor 5 would be suppressed.
When the memory access control circuit 55 receives the output data OUT associated with the transaction identifier TID, the memory access control circuit 55 accesses the descriptor supervising table 56 to extract the return address RADR by referring to the transaction identifier TID. As a result, the memory access control circuit 55 stores the output data OUT transmitted from the output FIFO 54 in an area of the memory 3 starting at the return address RADR.
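The lookup of the return address RADR and the write-back of the output data OUT can be sketched as follows. This is a simplified model, not the circuit itself: the function name store_output is hypothetical, a bytearray stands in for the memory 3, a dictionary stands in for the descriptor supervising table 56, and bus arbitration is ignored.

```python
def store_output(memory, supervising_table, tid, out):
    """Hypothetical model of the memory access control circuit 55."""
    # Find the return address RADR registered for this transaction identifier.
    radr = supervising_table[tid]
    # Store the output data OUT in the area starting at RADR.
    memory[radr:radr + len(out)] = out
    return radr


# Usage with arbitrary example values.
memory = bytearray(16)   # stand-in for the memory 3
table = {7: 4}           # TID 7 -> RADR 4
store_output(memory, table, tid=7, out=b"\xaa\xbb")
```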
The DMA controller 57 is explained next in detail.
The DMA controller 57 is constructed by index registers 571 and 572, an arbitration circuit 573, a DMA command table 574, a DMA control unit 575, a data transmitter 576, a data receiver 577 and a bus interface 578.
The index register 571 receives an index IDX1 from the array-type processor unit 52 which is calculated in accordance with the task command TCMD using a task loading program LOAD PRG. In the simplest example, IDX1=TCMD.
The index register 572 receives an index IDX2 via the bus interface 578 from the CPU 1. The index IDX2 is used for loading a task loading program LOAD PRG into the array-type processor unit 52.
Note that the task loading program LOAD PRG includes the following processings:
extracting a task command TCMD from the short descriptor DSC′;
determining whether or not loading another task program TASK PRG is required in accordance with the extracted task command TCMD;
calculating an index IDX1 in accordance with the extracted task command TCMD;
transmitting the index IDX1 to the DMA controller 57; and
receiving a load completion signal CPL from the DMA controller 57.
The task loading program LOAD PRG will be explained later in detail by referring to
When the index registers 571 and 572 have received indexes IDX1 and IDX2, respectively, the index registers 571 and 572 generate index transfer request signals REQ1 and REQ2, respectively, and transmit them to the arbitration circuit 573 which, in turn, generates index transfer grant signals GNT1 and GNT2 and transmits them to the index registers 571 and 572, respectively. As a result, the indexes IDX1 and IDX2 are supplied as indexes IDX to the DMA command table 574. In this case, however, if the index transfer request signals REQ1 and REQ2 are generated simultaneously, the arbitration circuit 573 generates only one of the index transfer grant signals GNT1 and GNT2 in accordance with a prescribed manner so that collision between the indexes IDX1 and IDX2 can be avoided. For example, a priority is given to one of the transfer request signals REQ1 and REQ2. The arbitration circuit 573 first transmits the one of the transfer grant signals GNT1 and GNT2 corresponding to the request having the priority to the respective one of the index registers 571 and 572, and then transmits the other of the transfer grant signals GNT1 and GNT2 to the other of the index registers 571 and 572.
The DMA control unit 575 carries out various processings such as processings for loading a task loading program LOAD PRG, a task program TASK PRG, intermediate data INTDA1, and processings for saving intermediate data INTDA2 in accordance with the content of the DMA command table 574.
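The priority-based example of the arbitration can be sketched as follows. The function name arbitrate and its priority parameter are hypothetical; the text only states that a priority is given to one of the requests, so the fixed-priority choice below is an assumption.

```python
def arbitrate(req1, req2, priority=1):
    """Return the order in which the grants GNT1 and GNT2 would be issued.

    req1 and req2 model the request signals REQ1 and REQ2; priority (1 or 2)
    names the request that wins when both are raised at the same time.
    """
    # Collect the requesters that are currently asking for a transfer.
    pending = [n for n, req in ((1, req1), (2, req2)) if req]
    # Stable sort: the prioritized requester moves to the front, if present.
    pending.sort(key=lambda n: n != priority)
    return pending


both_default = arbitrate(True, True)           # REQ1 has the priority
both_cpu_first = arbitrate(True, True, 2)      # REQ2 has the priority
only_second = arbitrate(False, True)
```

Because the grants are issued one after another, only one index at a time reaches the DMA command table 574 in this model.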
The data transmitter 576 transmits a task loading program LOAD PRG, a task program TASK PRG and intermediate data INTDA1 from the bus interface 578 to the array-type processor unit 52 whose write (destination) address WRADR is designated by the DMA control unit 575. On the other hand, the data receiver 577 transmits intermediate data INTDA2 from the array-type processor unit 52 whose read (source) address RDADR is designated by the DMA control unit 575 to the bus interface 578.
The bus interface 578 transmits the task loading program LOAD PRG, the task program TASK PRG and the intermediate data INTDA1 from the memory 3 whose source address RDADR is designated by the DMA control unit 575 to the data transmitter 576. On the other hand, the bus interface 578 transmits the intermediate data INTDA2 from the data receiver 577 to the memory 3 whose destination address WRADR is designated by the DMA control unit 575.
Note that a renewed index IDX by the DMA control unit 575 is fed back to the DMA command table 574.
In
The end flag EN (=“1”) is used for showing that one task is completed, i.e., a switching of tasks is required. That is, when EN=“1”, one or more DMA commands designated by one or more indexes are completed to generate a load completion signal CPL from the DMA control unit 575 to the array-type processor unit 52.
The read enable flag RE (=“0”) is used for showing that the task command is adapted to load a task loading program LOAD PRG, a task program TASK PRG or intermediate data INTDA1 from the memory 3 to the array-type processor unit 52. On the other hand, the read enable flag RE (=“1”) is used for showing that the task command is adapted to save intermediate data INTDA2 from the array-type processor unit 52 to the memory 3. The intermediate data INTDA1 and INTDA2 are data of the registers (not shown) of the array-type processor unit 52.
Note that one task program is actually constructed by one or more programs.
The interrupt flag FINT (=“1”) is used for generating an interrupt signal INT when one task is completed. For example, when the CPU 1 generates an index IDX2, the CPU 1 can carry out the next processing immediately upon receipt of an interrupt signal INT generated after the task loading program LOAD PRG is loaded when the last task command of one task is completed.
The transfer data length LGTH is used for defining a data length of a task loading program LOAD PRG, a task program TASK PRG, intermediate data INTDA1, or intermediate data INTDA2. In
The read (source) address RDADR is a start address of an area of the memory 3 or the array-type processor unit 52 from which a task loading program LOAD PRG, a task program TASK PRG, intermediate data INTDA1, or intermediate data INTDA2 is read. On the other hand, the write (destination) address WRADR is a start address of an area of the memory 3 or the array-type processor unit 52 into which a task loading program LOAD PRG, a task program TASK PRG, intermediate data INTDA1, or intermediate data INTDA2 is written. Here, M0, M1, . . . designate addresses of the memory 3, while P0, P1, . . . designate addresses of the array-type processor unit 52.
The operation of the DMA control unit 575 will be explained next with reference to a routine of
First, at step 601, it is determined whether the read enable flag RE of the DMA command is “0” or “1”. As a result, when RE=“0”, the control proceeds to steps 602 through 604 which load a task loading program LOAD PRG, a task program TASK PRG or intermediate data INTDA1 from the memory 3 to the array-type processor unit 52. On the other hand, when RE=“1”, the control proceeds to steps 605 through 608 which save intermediate data INTDA2 from the array-type processor unit 52 to the memory 3.
At step 602, a write request using a write address WRADR and a transfer data length LGTH is performed upon the array-type processor unit 52. In this case, a task loading program LOAD PRG, a task program TASK PRG or intermediate data INTDA1 is written into the array-type processor unit 52.
At step 603, a read request using a read address RDADR and the transfer data length LGTH is performed upon the memory 3 via the bus interface 578. In this case, the task loading program LOAD PRG, the task program TASK PRG or the intermediate data INTDA1 is read from the memory 3 to the data transmitter 576 which, in turn, transmits it to the array-type processor unit 52, where it is written. After the transmission by the data transmitter 576 is completed, the data transmitter 576 generates a transmission completion signal CPL1 (=“1”) and transmits it to the DMA control unit 575. Thus, after the DMA control unit 575 has received this transmission completion signal CPL1, the control proceeds via step 604 to step 609.
On the other hand, at step 605, a read request using a read address RDADR and a transfer data length LGTH is performed upon the array-type processor unit 52. In this case, intermediate data INTDA2 is read from the array-type processor unit 52 to the data receiver 577. As a result, the data receiver 577 generates a data transfer preparing signal PR (=“1”) and transmits it to the DMA control unit 575. Thus, after the DMA control unit 575 has received this data transfer preparing signal PR, the control proceeds via step 606 to step 607.
At step 607, a write request using the write address WRADR and the transfer data length LGTH is performed upon the memory 3 via the bus interface 578. In this case, the intermediate data INTDA2 is written from the data receiver 577 into the memory 3. After the transmission by the data receiver 577 is completed, the data receiver 577 generates a transmission completion signal CPL2 (=“1”) and transmits it to the DMA control unit 575. Thus, the control proceeds via step 608 to step 609.
At step 609, it is determined whether the end flag EN is “0” or “1”. As a result, when EN=“0”, the control proceeds to step 610 which increments the index IDX by +1. Then, the incremented index IDX is supplied to the DMA command table 574. Thus, the control at steps 601 to 609 is for the incremented index IDX. On the other hand, when EN=“1”, the control proceeds to step 611 which generates a completion signal CPL and transmits it to the array-type processor unit 52, which would perform a task.
At step 612, it is determined whether the interrupt flag FINT is “1” or “0”. Only when FINT=“1”, does the DMA control unit 575 generate an interrupt signal INT and transmit it directly to the CPU 1. As a result, the CPU 1 can carry out other processings immediately.
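Steps 601 through 612 can be summarized by the following minimal sketch. It is a software model under stated assumptions, not the circuit: the names DmaCommand and run_dma are hypothetical, bytearrays stand in for the memory 3 and for the storage of the array-type processor unit 52, and the handshake signals CPL1, CPL2 and the data transfer preparing signal are collapsed into synchronous copies.

```python
from dataclasses import dataclass


@dataclass
class DmaCommand:
    """One entry of the DMA command table 574 (field widths not modeled)."""
    en: int     # end flag EN: 1 = last command of the chain
    re: int     # read enable flag RE: 0 = load to processor, 1 = save to memory
    fint: int   # interrupt flag FINT
    lgth: int   # transfer data length LGTH
    rdadr: int  # read (source) address RDADR
    wradr: int  # write (destination) address WRADR


def run_dma(table, idx, memory, processor):
    """Execute the command chain starting at index idx.

    Returns (cpl, interrupt, last_idx), where cpl models the completion
    signal CPL and interrupt models the interrupt signal INT.
    """
    while True:
        cmd = table[idx]
        src = slice(cmd.rdadr, cmd.rdadr + cmd.lgth)
        dst = slice(cmd.wradr, cmd.wradr + cmd.lgth)
        if cmd.re == 0:
            # Steps 602 to 604: load from the memory into the processor unit.
            processor[dst] = memory[src]
        else:
            # Steps 605 to 608: save from the processor unit into the memory.
            memory[dst] = processor[src]
        if cmd.en == 1:
            # Steps 609, 611 and 612: signal completion, interrupt only if FINT=1.
            return True, cmd.fint == 1, idx
        # Step 610: renew the index and continue with the next command.
        idx += 1


# Usage with arbitrary example values: two chained load commands.
memory = bytearray(range(16))
processor = bytearray(8)
table = [
    DmaCommand(en=0, re=0, fint=0, lgth=2, rdadr=0, wradr=0),
    DmaCommand(en=1, re=0, fint=1, lgth=2, rdadr=4, wradr=2),
]
result = run_dma(table, 0, memory, processor)
```

The loop shows why a single index is enough to trigger a whole chain: the end flag EN, not the requester, decides when the chain stops.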
The routine of
Thus, once one index IDX is supplied to the DMA command table 574, the routine of
The operation of the array-type processor unit 52 of
First, at step 701, a previous task command TCMD0 is initialized at −1, for example.
Next, at step 702, it is determined whether there is a short descriptor DSC′ in the input FIFO 53. Only when there is such a short descriptor DSC′ in the input FIFO 53, does the control proceed to step 703 which extracts a task command TCMD from the short descriptor DSC′.
At step 704, it is determined whether the task command TCMD is the same as the previous task command TCMD0. As a result, when TCMD≠TCMD0, the control proceeds to steps 705 through 707. On the other hand, when TCMD=TCMD0, the control proceeds directly to step 708.
At step 705, an index IDX1 is calculated in accordance with the task command TCMD. For example, IDX1←TCMD. The index IDX1 is supplied via the index register 571 and the arbitration circuit 573 to the DMA command table 574. As a result, the DMA control unit 575 is operated in accordance with the routine of
Next, at step 706, the previous task command TCMD0 is replaced by TCMD.
Next, at step 707, the array-type processor unit 52 awaits a completion signal CPL from the DMA control unit 575. Only when the array-type processor unit 52 has received such a completion signal CPL, does the control proceed to step 708. At step 708, the array-type processor unit 52 fetches task data TDA associated with the short descriptor DSC′ from the input FIFO 53.
Next, at step 709, the array-type processor unit 52 carries out a task defined by the task command TCMD using the task program TASK PRG and the task data TDA.
Step 710 repeats the control at steps 708 and 709 until there is no task data for the task command TCMD.
Then, the control returns to step 702.
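Steps 701 through 710 can be sketched as the following loop. It is an illustrative model only: the function name run_processor and the two callbacks are hypothetical, each FIFO entry is modeled as a pair of a task command and a list of task data, the index calculation is taken as IDX1←TCMD, and the blocking call to load_task_program stands in for steps 705 and 707 together. Unlike the hardware, the loop ends when the FIFO is empty instead of waiting.

```python
from collections import deque


def run_processor(input_fifo, load_task_program, run_task):
    """Hypothetical model of the routine of the array-type processor unit 52."""
    prev_tcmd = -1                      # step 701: initialize TCMD0
    while input_fifo:                   # step 702: is a short descriptor present?
        tcmd, task_data = input_fifo.popleft()   # step 703: extract TCMD
        if tcmd != prev_tcmd:           # step 704: has the task changed?
            # Steps 705 and 707: send IDX1 (here equal to TCMD) and wait
            # for the completion signal CPL, modeled as a blocking call.
            load_task_program(tcmd)
            prev_tcmd = tcmd            # step 706: remember the loaded task
        for tda in task_data:           # steps 708 to 710: process all task data
            run_task(tcmd, tda)


# Usage with arbitrary example values: two entries of task 0, then one of task 4.
loads = []
outputs = []
fifo = deque([(0, [1, 2]), (0, [3]), (4, [5])])
run_processor(fifo, loads.append, lambda tcmd, tda: outputs.append((tcmd, tda)))
```

In this example the task program is requested only twice for three FIFO entries, because the second entry carries the same task command as the first.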
In
Note that the operation of the routine of
The multi-tasking operation of the data processing apparatus of
In
First, at cycle 1, the CPU 1 prepares DMA command data as shown in
Next, at cycle 2, the CPU 1 generates an index IDX2 whose value is n and transmits it to the index register 572, to thereby request loading of the task loading program LOAD PRG of
Next, at cycle 3, the DMA command table 574 generates a DMA command designated by the index n as shown in
Next, at cycle 4, after the CPU 1 has received the interrupt signal INT from the DMA control unit 575, the CPU 1 sets “ADR0” in the descriptor pointer DP of the input DMA circuit 51 (see:
Next, at cycle 5, referring to the process PR1 of
Next, at cycle 6, after the DMA control unit 575 receives a task command designated by the index IDX1(=0) (see:
Next, at cycle 7, referring to the process PR4 of
Next, at cycle 8, referring to the process PR3 of
Next, at cycle 9, referring to the process PR4 of
Next, at cycle 10, referring to the process PR1 of
Next, at cycle 11, after the DMA control unit 575 receives a task command designated by the index IDX1(=4) (see:
Next, at cycle 12, referring to the process PR4 of
Next, at cycle 13, referring to the process PR3 of
Finally, at cycle 14, referring to the process PR4 of
Note that, at cycles 6 and 11, the index IDX is renewed by incrementing the index IDX; however, the index IDX can be renewed by decrementing the index IDX if the table of
Also, the array-type processor unit 52 can be replaced by another type processor unit which can perform a task upon task data in accordance with a task program.
Additionally, note that, although the entire data processing apparatus of
As explained hereinabove, according to the present invention, since the CPU 1 only has to set DMA commands in the stream processor 5, tasks can be carried out without the control of the CPU 1 for the stream processor 5, so that the processing burden of the CPU 1 can be remarkably decreased. Also, even when the stream processor 5 loads one task program therein, a task switching never occurs in the CPU 1. Therefore, since the latency increase by the memory access of the CPU 1 is suppressed, the decrease in throughput can be suppressed.
Number | Date | Country | Kind |
---|---|---|---|
2005-164579 | Jun 2005 | JP | national |