The present invention relates generally to the issuance of Direct Memory Access (DMA) request commands and, more particularly, to operation of command queues.
Over the past few years, DMA has become an important aspect of computer architecture, and multiprocessor systems have been developed that use DMA to provide ever faster processing capabilities. With DMA, there are typically two types of requests or commands that a processor can issue for the DMA Controller (DMAC) to execute: load and store. Depending on the system, an individual processor can load from or store to an Input/Output (I/O) device, another processor's local memory, a memory device, and so forth.
More recently, though, the multiprocessors and DMACs have been incorporated onto a single chip. Reduction to a single chip allows for a reduced size as well as increased speed. The DMACs, the processors, Bus Interface Units (BIUs), and a bus can all be incorporated onto the chip. The dataflow of such a system starts from the processor core, which dispatches a DMA command; that command is stored in a DMA command queue. Each DMA command may be unrolled, or broken into smaller bus requests, to the BIU. Each resulting unrolled request is stored in the BIU outstanding bus request queue. The BIU then forwards the request to the bus controller. Generally, the requests are sent out from the BIU in the order in which they were received from the DMAC. When a bus request is completed, its BIU outstanding bus request queue entry becomes available to receive a new DMA request. However, bottlenecks can result from the physical sizes of the BIU outstanding bus request queue at the source device and the snoop queues at the destination device. The bottlenecks are typically a function of queue order and/or delays in executing commands. For example, command two, a load from another processor's local memory, can be delayed waiting for command one, a store to Dynamic Random Access Memory (DRAM). Hence, the resulting bottlenecks can cause dramatic losses in operational speed.
A contributor to the bottlenecks can be the execution order of DMA commands, since certain commands execute faster than others. For example, DMA commands that move data between processors on the same chip can complete faster than DMA commands to external memory or I/O devices, which typically take much longer. As a result, bus requests for data movement to memory or I/O devices remain in the BIU outstanding request queue much longer. Eventually, the BIU outstanding request queue may become completely occupied with the slower bus requests, leaving little or no room for additional bus requests from the DMAC. This results in performance degradation, since the processor has to stall while waiting for available space in the BIU outstanding bus request queue.
Another contributor to the bottlenecks can be retries. When multiple source devices are moving data to or from the same destination device, the destination device has to reject a bus request when its snoop queue is full, which forces the source device to retry the same bus request at a later time.
Another contributor to the bottlenecks can be the order of execution of commands in the destination device. In a conventional DRAM access, the DRAM device can operate in parallel on consecutive memory banks. Moreover, bidirectional buses are typically utilized to interface with DRAM devices. If the direction of data movement is changed frequently, bus bandwidth is reduced due to the additional bus cycles required to turn the bus around. Also, it is desirable to perform a series of reads or writes to the same memory page to obtain greater parallelism in DRAM access.
Therefore, there is a need for a method and/or apparatus for improving the efficiency of a DMA issue mechanism that addresses the aforementioned problems.
The present invention provides a method and a computer program for executing commands in a DMAC. A slot is first selected. Once the slot has been selected, a determination is made as to which groups in the selected slot are valid. If there are no valid groups, then another slot is selected. However, if there is at least one valid group, a round robin arbitration scheme is used to select a group. Within the selected group, the oldest pending DMA command is chosen and unrolled. The unrolled bus request is then dispatched to the BIU. After the unrolling, the DMA command parameters are updated and written back into the DMA command queue.
For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
In the following discussion, numerous specific details are set forth to provide a thorough understanding of the present invention. However, those skilled in the art will appreciate that the present invention may be practiced without such specific details. In other instances, well-known elements have been illustrated in schematic or block diagram form in order not to obscure the present invention in unnecessary detail. Additionally, for the most part, details concerning network communications, electromagnetic signaling techniques, and the like, have been omitted inasmuch as such details are not considered necessary to obtain a complete understanding of the present invention, and are considered to be within the understanding of persons of ordinary skill in the relevant art.
It is further noted that, unless indicated otherwise, all functions described herein may be performed in either hardware or software, or some combinations thereof. In a preferred embodiment, however, the functions are performed by a processor such as a computer or an electronic data processor in accordance with code such as computer program code, software, and/or integrated circuits that are coded to perform such functions, unless indicated otherwise.
Referring to
Each of the processors 101, 103, and 105 is configured in a similar fashion to communicate data. The first processor 101, the second processor 103, and the third processor 105 each further comprise a first processor core 104, a second processor core 106, and a third processor core 108, respectively. The first processor core 104 is coupled to a first DMAC 110 through a first load communication channel 152 and a first store communication channel 150. The second processor core 106 is coupled to a second DMAC 112 through a second load communication channel 156 and a second store communication channel 154. The third processor core 108 is coupled to a third DMAC 114 through a third load communication channel 160 and a third store communication channel 158. The first DMAC 110 is coupled to the first BIU 116 through a fourth store communication channel 162 and a fourth load communication channel 164. The second DMAC 112 is coupled to the second BIU 118 through a fifth store communication channel 166 and a fifth load communication channel 168. The third DMAC 114 is coupled to the third BIU 120 through a sixth store communication channel 170 and a sixth load communication channel 172.
Each of the respective processors also operates in a similar fashion. A command, either a load or a store command, originates in a processor core. There are a variety of commands that can be issued by a given processor; however, the focus, for purposes of illustration, is on three distinct command types: processor to processor, processor to memory devices, and processor to I/O devices. Once the command is issued by the processor core, the command is passed on to the DMAC. The DMAC then unrolls the command to the BIU, where an outstanding bus request queue stores the unrolled bus requests. At a later time, each bus request is sent out to the bus. When the bus controller grants the request, the source and destination devices perform the data transfer to complete the bus request.
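The unrolling step described above can be sketched in software. The following is a minimal illustration only; the structure names and the maximum bus transfer size are assumptions chosen for the example, not details taken from the embodiment:

```python
from dataclasses import dataclass

# Assumed maximum number of bytes a single bus request may carry.
MAX_BUS_TRANSFER = 128

@dataclass
class BusRequest:
    address: int
    size: int

def unroll(address: int, size: int) -> list[BusRequest]:
    """Break one DMA command into bus-request-sized pieces.

    Each piece is bounded by MAX_BUS_TRANSFER, so a large DMA command
    becomes several smaller requests queued at the BIU.
    """
    requests = []
    offset = 0
    while offset < size:
        chunk = min(MAX_BUS_TRANSFER, size - offset)
        requests.append(BusRequest(address + offset, chunk))
        offset += chunk
    return requests
```

For example, a 300-byte DMA command unrolls into three bus requests of 128, 128, and 44 bytes, each of which occupies one entry in the BIU outstanding bus request queue.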
The multiprocessor computer system utilizing DMAC 100 operates by utilizing a bus 130 to communicate data and bus requests among the varying components. The first processor 101 is coupled to the bus 130 through a seventh store communication channel 174 and a seventh load communication channel 176. The second processor 103 is coupled to the bus 130 through an eighth store communication channel 178 and an eighth load communication channel 180. The third processor 105 is coupled to the bus 130 through a ninth store communication channel 182 and a ninth load communication channel 184. The memory controller 122 utilizes a bidirectional memory bus implementation to communicate data to and from the memory devices 124. Hence, the memory controller 122 is coupled to the bus 130 via a bidirectional memory bus implementation through a tenth store communication channel 186 and a tenth load communication channel 188. Also, the I/O Controller 126 is coupled to the bus 130 through an eleventh store communication channel 190 and an eleventh load communication channel 192.
In addition to connections to the bus 130, there can also be connections between varieties of other components. More particularly, controllers, such as the memory controller 122 and the I/O controller 126, require connections to other respective devices. The memory controller 122 is coupled to the memory devices 124 through a first bandwidth controlled communication channel 194. The I/O controller 126 is coupled to the I/O devices 128 through a second bandwidth controlled communication channel 196 and a third bandwidth controlled communication channel 198.
Referring to
Within the DMAC, such as the DMAC 110 of
The enabling or disabling of the slot is used to match the bus bandwidth characteristics (i.e. if the bus is bidirectional such as a memory bus, the slot function is disabled). If the slot function is enabled for the streaming ID group, the load command will be assigned a value of zero in the slot field 210; the store command will be assigned a value of one in the slot field 210. If the slot function is disabled then both load and store commands will be assigned a value of zero in the slot field 210.
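The slot field assignment just described can be expressed as a small function. This is a sketch only; the function name and parameters are illustrative, not taken from any actual implementation:

```python
def slot_field(is_store: bool, slot_enabled: bool) -> int:
    """Value assigned to the slot field of a DMA command entry.

    With the slot function enabled for the streaming ID group, load
    commands receive slot 0 and store commands slot 1. With it
    disabled (e.g. for a bidirectional memory bus), both receive 0.
    """
    if not slot_enabled:
        return 0
    return 1 if is_store else 0
```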
Typically, though, there are three bus request operations that can take place: processor to processor, processor to external or system memory, and processor to I/O devices. Each of the three operations can be assigned into streaming ID groups.
Generally, processor to processor commands are assigned to streaming ID group 0, processor to memory commands are assigned to streaming ID group 1, and processor to IO commands are assigned to streaming ID group 2. In this case, the slot function is enabled for streaming ID groups 0 and 2, and disabled for group 1 in order to match the bus bandwidth characteristics associated with the DMA command.
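The group assignment and per-group slot enables described above might be tabulated as follows. The destination labels are assumptions made for the sketch:

```python
# Streaming ID group per DMA command destination, as described above.
STREAMING_ID_GROUP = {
    "processor": 0,   # processor-to-processor transfers
    "memory": 1,      # processor-to-memory transfers
    "io": 2,          # processor-to-I/O transfers
}

# Slot function enabled for groups 0 and 2, disabled for group 1,
# whose bidirectional memory bus favors runs of same-direction traffic.
SLOT_FUNCTION_ENABLED = {0: True, 1: False, 2: True}

def classify(destination: str) -> tuple[int, bool]:
    """Return (streaming ID group, slot-function enabled) for a destination."""
    group = STREAMING_ID_GROUP[destination]
    return group, SLOT_FUNCTION_ENABLED[group]
```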
A DMA command is typically unrolled into one or more bus requests to the BIU. These bus requests are queued in the BIU's outstanding DMA bus request queue, which has a limited size. By configuring a quota for each streaming ID group, this queue is divided into three virtual queues. Depending on the software application, the sizes of the three virtual queues can be dynamically configured via the streaming ID quotas.
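One way to picture the virtual-queue division is as a per-group quota check against a fixed-size physical queue. The queue size and helper names below are assumed purely for illustration:

```python
# Assumed physical size of the BIU outstanding bus request queue.
BIU_QUEUE_SIZE = 16

def configure_quotas(shares: dict[int, int]) -> dict[int, int]:
    """Partition the physical queue among streaming ID groups.

    shares gives each group's desired number of entries; the total
    must fit within the physical queue.
    """
    assert sum(shares.values()) <= BIU_QUEUE_SIZE
    return dict(shares)

def may_issue(group: int, outstanding: dict[int, int],
              quotas: dict[int, int]) -> bool:
    """A group may issue only while under its quota, so slow requests
    filling one virtual queue cannot crowd out the other groups."""
    return outstanding[group] < quotas[group]
```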
Referring to
Once the DMA commands have been entered into the command queue as shown in the flow chart 300 of
If Slot 0 is chosen to be executed next, then the DMAC makes a series of determinations to select the issuing command queue. In step 304, the DMAC determines which groups have valid pending DMA commands. Associated with each group is a maximum issue count, or quota. The quota limits the number of bus requests that can be issued, to prevent system overflow. To maintain proper operation of the system, the DMAC determines in step 306 whether each of the groups within the slot has exceeded its respective quota.
Once the validity and quota determinations have been made, the DMAC selects the next command. In step 308, the DMAC utilizes a round robin selection scheme among the command groups. At the time of selection, a determination is made in step 310 as to whether any valid group is under its quota limit and has a pending command. If there is no such group, then operation alternates to the other slot, Slot 1. However, if there is such a group, then the oldest command from the selected group is unrolled in step 312. The round robin pointer is then advanced to the next streaming ID command group and the size of the queue is reduced in step 314, and the slot is then alternated in step 302.
If Slot 1 is chosen to be executed next, then the DMAC makes a series of determinations to select the issuing command queue. In step 316, the DMAC determines which groups have valid pending DMA commands. Associated with each group is a maximum issue count, or quota. The quota limits the number of bus requests that can be issued, to prevent system overflow. To maintain proper operation of the system, the DMAC determines in step 318 whether each of the groups within the slot has exceeded its respective quota.
Once the validity and quota determinations have been made, the DMAC selects the next command. In step 320, the DMAC utilizes a round robin selection scheme among the command groups. At the time of selection, a determination is made in step 322 as to whether any valid group is under its quota limit and has a pending command. If there is no such group, then operation alternates to the other slot, Slot 0. However, if there is such a group, then the oldest command from the selected group is unrolled in step 324. The round robin pointer is then advanced to the next streaming ID command group and the size of the queue is reduced in step 326, and the slot is then alternated in step 302.
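Taken together, the slot alternation, validity and quota checks, and round robin selection described in the steps above can be sketched as a small software arbiter. The class and its data structures are illustrative assumptions, not the patented hardware mechanism itself:

```python
from collections import deque

class Arbiter:
    """Sketch of the issue flow: alternate between two slots, and within
    a slot use round robin over streaming ID groups, skipping groups
    that are empty or over quota."""

    def __init__(self, slot_groups, quotas):
        # slot_groups maps each of the two slots to the streaming ID
        # groups it serves; quotas maps each group to its issue limit.
        self.slot_groups = slot_groups
        self.quotas = quotas
        self.queues = {(s, g): deque()
                       for s, gs in slot_groups.items() for g in gs}
        self.outstanding = {g: 0 for gs in slot_groups.values() for g in gs}
        self.rr = {s: 0 for s in slot_groups}   # round robin pointer per slot
        self.slot = 0

    def push(self, slot, group, command):
        """Enter a DMA command into the queue for (slot, group)."""
        self.queues[(slot, group)].append(command)

    def complete(self, group):
        """Called when the BIU retires a bus request for this group."""
        self.outstanding[group] -= 1

    def next_command(self):
        """Pick the next DMA command to unroll, or None if none is eligible."""
        for _ in range(len(self.slot_groups)):        # try each slot once
            groups = self.slot_groups[self.slot]
            start = self.rr[self.slot]
            for i in range(len(groups)):              # round robin over groups
                g = groups[(start + i) % len(groups)]
                q = self.queues[(self.slot, g)]
                if q and self.outstanding[g] < self.quotas[g]:
                    self.rr[self.slot] = (start + i + 1) % len(groups)
                    self.outstanding[g] += 1
                    cmd = q.popleft()                 # oldest pending command
                    self.slot = 1 - self.slot         # alternate slots
                    return cmd
            self.slot = 1 - self.slot                 # no valid group: other slot
        return None
```

As a usage example, an arbiter whose Slot 0 serves groups 0, 1, and 2 (loads, plus all memory traffic) and whose Slot 1 serves groups 0 and 2 (stores) will issue a pending memory command from Slot 0, alternate, and then issue a pending store from Slot 1.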
It should be noted that all processor to memory commands, be they load or store commands, are unrolled through Slot 0. The reason for issuing commands in this manner is to improve efficiency. Changing the direction of a bidirectional bus is time consuming. Moreover, external memory comprises a plurality of banks that can each process requests individually, so the external memory is capable of receiving multiple commands, and the time required to process each request can be very long. Hence, it is advantageous to issue as many requests to external memory as possible as burst loads or stores, to minimize changing the direction of the bidirectional bus and to maximize parallel loads or parallel stores.
It will further be understood from the foregoing description that various modifications and changes may be made in the preferred embodiment of the present invention without departing from its true spirit. This description is intended for purposes of illustration only and should not be construed in a limiting sense. The scope of this invention should be limited only by the language of the following claims.
Having thus described the present invention by reference to certain of its preferred embodiments, it is noted that the embodiments disclosed are illustrative rather than limiting in nature and that a wide range of variations, modifications, changes, and substitutions are contemplated in the foregoing disclosure and, in some instances, some features of the present invention may be employed without a corresponding use of the other features. Many such variations and modifications may be considered desirable by those skilled in the art based upon a review of the foregoing description of preferred embodiments. Accordingly, it is appropriate that the appended claims be construed broadly and in a manner consistent with the scope of the invention.