The invention, as well as a preferred mode of use and further objectives and advantages thereof, will best be understood by reference to the following detailed description of illustrative embodiments when read in conjunction with the accompanying drawings, wherein:
With reference now to the figures and in particular with reference to
With reference now to the figures,
As shown in
The local memory or local store (LS) 163-170 is a non-coherent addressable portion of a large memory map which, physically, may be provided as small memories coupled to the SPUs 140-154. The local stores 163-170 may be mapped to different address spaces. These address regions are continuous in a non-aliased configuration. A local store 163-170 is associated with its corresponding SPU 140-154 and SPE 120-134 by its address location. Any resource in the system has the ability to read/write from/to the local store 163-170 as long as the local store is not placed in a secure mode of operation, in which case only its associated SPU may access the local store 163-170 or a designated secured portion of the local store 163-170.
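The access rule for a local store can be illustrated with a minimal sketch. The function name `may_access`, the `secure_range` parameter, and the treatment of the secured portion as a half-open address range are assumptions made for illustration only; the description does not specify how the secure mode is represented.

```python
def may_access(requester, owner_spu, address, secure_range=None):
    """Decide whether `requester` may read or write `address` in a local store.

    secure_range is None when the local store is not in secure mode,
    otherwise a hypothetical (low, high) half-open range of secured addresses.
    """
    if secure_range is None:
        return True                    # not secure: any resource may access
    low, high = secure_range
    if low <= address < high:
        return requester == owner_spu  # secured portion: owning SPU only
    return True                        # outside the secured portion
```

To model a local store that is secured in its entirety, pass a range covering every local store address.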
The CBE 100 may be a system-on-a-chip such that each of the elements depicted in
The SPEs 120-134 are coupled to each other and to the L2 cache 114 via the EIB 196. In addition, the SPEs 120-134 are coupled to MIC 198 and BIC 197 via the EIB 196. The MIC 198 provides a communication interface to shared memory 199. The BIC 197 provides a communication interface between the CBE 100 and other external buses and devices.
The PPE 110 is dual threaded. The combination of the dual-threaded PPE 110 and the eight SPEs 120-134 makes the CBE 100 capable of handling 10 simultaneous threads and over 128 outstanding memory requests. The PPE 110 acts as a controller for the eight SPEs 120-134, which handle most of the computational workload. The PPE 110 may be used to run conventional operating systems while the SPEs 120-134 perform vectorized floating point code execution, for example.
Each of the SPEs 120-134 comprises a synergistic processing unit (SPU) 140-154, a memory flow control unit 155-162, a local memory or store 163-170, and an interface unit 180-194. The local memory or store 163-170, in one exemplary embodiment, comprises a 256 KB instruction and data memory which is visible to the PPE 110 and can be addressed directly by software.
The memory flow control units (MFCs) 155-162 serve as an interface for an SPU to the rest of the system and other elements. The MFCs 155-162 provide the primary mechanism for data transfer, protection, and synchronization between main storage and the local storages 163-170. There is logically an MFC for each SPU in a processor. Some implementations can share resources of a single MFC between multiple SPUs. In such a case, all the facilities and commands defined for the MFC must appear independent to software for each SPU. The effects of sharing an MFC are limited to implementation-dependent facilities and commands.
DMA control unit 212 processes a queue 222 of DMA commands. In the Cell Broadband Engine (CBE), there is a PPE-initiated DMA queue and an SPE-initiated DMA queue. For simplicity, one DMA queue 222 is shown in
MMU 214 performs address translation and protection using a segment table and page table model. A DMA transaction may involve a data transfer between a local store address, for example, and an effective address, which can be translated into a system-wide real address using the MFC page table. MMU 214 consists of a segment look-aside buffer (SLB) 216 and translation look-aside buffers (TLBs) 218. SLB 216 is managed through memory mapped input/output (MMIO) registers. The TLBs 218 cache the DMA page table entries. Storage descriptor register (SDR) 220 contains the DMA page table pointer. This architecture allows the PPE and all of the MFCs to share a common page table, which enables the application to use effective addresses directly in DMA operations without any need to locate the real address pages.
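The two-stage translation described above can be sketched as follows. This is a simplified model, not the hardware implementation: the 4 KB page size, the 256 MB segment size, the dictionary-based tables, and the name `translate` are all assumptions introduced for illustration.

```python
PAGE_BITS = 12      # assumption: 4 KB pages
SEGMENT_BITS = 28   # assumption: 256 MB segments


def translate(ea, slb, tlb):
    """Translate an effective address (EA) to a real address (RA).

    slb maps an effective segment ID to a virtual segment ID.
    tlb maps (virtual segment ID, page index) to a real page number.
    Returns (status, ra) where status is 'hit', 'slb_miss' or 'tlb_miss'.
    """
    esid = ea >> SEGMENT_BITS
    if esid not in slb:
        return ("slb_miss", None)   # software must reload the SLB
    vsid = slb[esid]
    page_index = (ea & ((1 << SEGMENT_BITS) - 1)) >> PAGE_BITS
    real_page = tlb.get((vsid, page_index))
    if real_page is None:
        return ("tlb_miss", None)   # a tablewalk would be needed
    offset = ea & ((1 << PAGE_BITS) - 1)
    return ("hit", (real_page << PAGE_BITS) | offset)
```

The sketch keeps the SLB and TLB outcomes separate because the two kinds of miss are handled differently: an SLB miss is fixed by software, while a TLB miss is resolved by a tablewalk.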
In the Cell Broadband Engine, a data transfer from external memory to an SPE local store may be called a DMA GET command, and a data transfer from the SPE local store to external memory may be called a DMA PUT command. The CBE processor supports DMA commands, and the majority of them are variants of GET or PUT. MFC synchronization commands are different from GET/PUT commands. MFC synchronization commands may be used between multiple GET and PUT DMA commands to enforce ordering of DMA transactions relative to each other.
DMA queue entry 300 may also include tag and class 310. The tag identifies the DMA or a group of DMAs. Any number of DMAs can be tagged with the same tag and thereby form a group. The tag is required for querying completion status of the group. The class is an identifier that determines the resource ID associated with the SPE.
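How a tag supports a completion query for a group of DMAs can be sketched with a per-tag counter. The class `TagGroups` and its method names are hypothetical; the description states only that the tag is required for querying the completion status of the group.

```python
from collections import Counter


class TagGroups:
    """Tracks outstanding DMA commands per tag (illustrative model only)."""

    def __init__(self):
        self.outstanding = Counter()

    def enqueue(self, tag):
        # A new DMA command joins the group identified by `tag`.
        self.outstanding[tag] += 1

    def complete(self, tag):
        # One DMA command of the group has finished.
        if self.outstanding[tag] == 0:
            raise ValueError(f"no outstanding DMA for tag {tag}")
        self.outstanding[tag] -= 1

    def group_done(self, tag):
        # The group is complete when no command with this tag is pending.
        return self.outstanding[tag] == 0
```

A group with no commands enqueued reports as done, which is a design choice of this sketch rather than something the description specifies.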
In accordance with an illustrative embodiment, DMA queue entry 300 also includes MMU-miss dependency flag 312. This flag is set or cleared by the result of the MMU translation. The DMA issue mechanism uses MMU-miss dependency flag 312 to block the issue of commands that are known to result in a translation miss. When the MMU completes processing of a miss, the MMU sends a miss clear signal to the DMA control unit to reset all MMU-miss dependency flags.
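A queue entry carrying the MMU-miss dependency flag, together with an issue selector that skips flagged entries, might look like the sketch below. The field names, the `effective_address` field, and the oldest-first selection order are assumptions for illustration; the actual issue policy is not stated here.

```python
from dataclasses import dataclass


@dataclass
class DmaQueueEntry:
    tag: int                            # identifies the DMA or its group
    class_id: int                       # resource ID class
    effective_address: int              # assumed field, for illustration
    mmu_miss_dependency: bool = False   # set on a translation miss


def select_for_issue(queue):
    """Return the first entry not blocked by the MMU-miss dependency flag."""
    for entry in queue:
        if not entry.mmu_miss_dependency:
            return entry
    return None   # every entry is waiting on the MMU


def miss_clear(queue):
    """Model the miss clear signal: reset every MMU-miss dependency flag."""
    for entry in queue:
        entry.mmu_miss_dependency = False
```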
Returning to
A miss may occur in either table, SLB 216 or TLBs 218, depending on whether certain parts of the effective address match. The application attempts to load the tables with the correct data. First, MMU 214 goes to the SLB for the segment. If a miss occurs in SLB 216, then the DMA control unit 212 invokes an interrupt to processing unit 202, and the application must fix the SLB. The TLB 218 may have 64 congruence classes, for example. Six bits in the effective address define the congruence class. If the effective address does not match one of the addresses in its congruence class in the TLB 218, then this results in a miss. If there is a miss in the TLB 218, then DMA control unit 212 sets the MMU-miss dependency flag.
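The congruence class lookup can be illustrated as follows. Which six bits of the effective address select the class is not stated, so this sketch assumes the six low-order bits of the page number with 4 KB pages; the list-of-lists TLB layout and the function names are likewise assumptions.

```python
PAGE_BITS = 12     # assumption: 4 KB pages
NUM_CLASSES = 64   # 64 congruence classes, selected by six bits


def congruence_class(ea):
    """Pick the class from six bits of the EA (assumed: low page-number bits)."""
    return (ea >> PAGE_BITS) & (NUM_CLASSES - 1)


def tlb_lookup(tlb, ea):
    """Return the real page number on a hit, or None on a miss.

    tlb is a list of NUM_CLASSES lists of (page_number, real_page) pairs.
    """
    page = ea >> PAGE_BITS
    for entry_page, real_page in tlb[congruence_class(ea)]:
        if entry_page == page:
            return real_page
    return None   # no entry in this class matches: TLB miss
```

Two pages whose page numbers differ by a multiple of 64 fall into the same class under this assumption, so a lookup for one can miss even though the class holds an entry for the other.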
For a first miss, MMU 214 will perform a tablewalk to do a page lookup to try to get the correct data in TLB 218. Note, however, that on subsequent misses, the MMU will not perform a tablewalk since one is already in progress. DMA control unit 212 may continue to issue DMA commands from DMA queue 222 as long as there is not a subsequent miss for that queue entry while MMU 214 is performing the tablewalk. For each subsequent miss, the MMU-miss dependency flag is set for that DMA queue entry. When MMU 214 completes the tablewalk, MMU 214 returns a miss clear to DMA control unit 212, which then resets the MMU-miss dependency flag for that command.
One simplification of this mechanism is that DMA control unit 212 may not record which entry corresponds to the miss MMU 214 is processing. When MMU 214 sends a miss clear, DMA control unit 212 resets all DMA queue entries with MMU-miss dependency flags set. DMA commands in DMA queue 222 that were blocked from issue by the MMU-miss dependency flag are now allowed to be selected by DMA control unit 212 for issue.
When the DMA command corresponding to the previous translation miss processed by the MMU is issued, and DMA control unit 212 makes a new translation request to MMU 214, the translation will be a hit. Other DMA commands that had their MMU-miss dependency flags set while MMU 214 was processing the miss may also be selected by DMA control unit 212 for issue. Thus, after a miss clear, all DMA commands that were previously blocked due to translation misses can be issued.
Consider an example with five DMA commands in queue, ready to issue. When the DMA device is ready to issue DMA Command 0, for which EA translation is required, the DMA control unit sends an address translation request to the MMU. In this example, the address translation results in a hit. The DMA control unit receives the RA from the MMU, and the DMA device sends the command to the bus interface unit.
Next, when the DMA device is ready to issue DMA Command 1, for which EA translation is required, the DMA control unit sends an address translation request to the MMU. The address translation results in a miss. The DMA device sets the MMU-miss dependency flag of DMA Command 1, and the MMU does a tablewalk (first miss).
Then, when the DMA device is ready to issue DMA Command 2, for which the RA is already valid, the DMA control unit does not make a request to the MMU. The DMA device sends DMA Command 2 to the bus interface unit.
When the DMA device is ready to issue DMA Command 3, for which address translation is required, the DMA control unit sends an address translation request to the MMU. The result of the address translation is a hit. The DMA control unit receives the RA from the MMU, and the DMA device sends the command to the bus interface unit.
Next, when the DMA device is ready to issue DMA Command 4, for which EA translation is required, the DMA control unit sends an address translation request to the MMU. In this example, the result of address translation is a miss. The DMA device sets the MMU-miss dependency flag for DMA Command 4. The MMU does not do a tablewalk for this miss, because a tablewalk is already in progress.
Then, when the DMA device is ready to issue DMA Command 0, for which the RA is already valid, the DMA device sends the command to the bus interface unit. Assuming DMA Command 0 is unrolled into several smaller transfers, this is the second time the command has been unrolled.
Thereafter, the MMU tablewalk completes for DMA Command 1. The MMU sends a miss clear to the DMA control unit, which in turn clears all MMU-miss dependency flags on all entries in the queue. Now all five commands are eligible for issue again.
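The walk-through above can be replayed with a short script. The outcome labels ("hit", "miss", "valid") and the function `run` are illustrative names; the sequence of outcomes is taken directly from the example, where "valid" marks a command whose RA is already valid so that no MMU request is made.

```python
# (command number, outcome) in the order the example issues them
SCRIPT = [
    (0, "hit"),
    (1, "miss"),    # first miss: the MMU starts a tablewalk
    (2, "valid"),
    (3, "hit"),
    (4, "miss"),    # tablewalk already in progress: flag only
    (0, "valid"),   # second unrolled transfer of DMA Command 0
]


def run(script, num_commands=5):
    flags = [False] * num_commands   # MMU-miss dependency flags
    tablewalk_for = None             # command whose miss is being walked
    issued = []                      # commands sent to the bus interface unit
    for cmd, outcome in script:
        if outcome == "miss":
            flags[cmd] = True
            if tablewalk_for is None:
                tablewalk_for = cmd
        else:
            issued.append(cmd)
    blocked = [i for i, flag in enumerate(flags) if flag]
    # Tablewalk completes: the miss clear resets every flag in the queue.
    flags = [False] * num_commands
    return issued, blocked, tablewalk_for, flags
```

Replaying `SCRIPT` shows DMA Commands 0, 2, 3, and 0 again issuing while Commands 1 and 4 are blocked, which is the hit-under-miss behavior the example describes.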
Accordingly, blocks of the flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or by combinations of special purpose hardware and computer instructions.
With reference now to
If an address translation request is received in block 402, the memory management unit attempts address translation (block 404) and determines whether the address translation results in a hit or miss (block 406). If the address translation results in a hit, the memory management unit returns the real address to the DMA control unit (block 408), and operation returns to block 402 to wait for the next address translation request.
If the address translation attempt results in a miss in block 406, the memory management unit notifies the DMA control unit of the miss (block 410) and starts a tablewalk (block 412). Then, the memory management unit determines whether the tablewalk returns with the page table lookup for the translation look-aside buffer (block 414). If the tablewalk returns, the memory management unit sends a miss clear signal to the DMA control unit (block 416), and operation returns to block 402 to wait for the next address translation request.
If the tablewalk does not return in block 414, the memory management unit determines whether there is a subsequent address translation request while the memory management unit is performing the tablewalk (block 418). If there is not a subsequent address translation request, operation returns to block 414 to determine whether the tablewalk has returned.
If there is a subsequent address translation request in block 418, the memory management unit attempts address translation (block 420) and determines whether the address translation results in a hit or a miss (block 422). If address translation results in a hit, the memory management unit returns the real address (block 424), and operation returns to block 414 to determine whether the tablewalk returns. If the address translation attempt is a miss in block 422, then the memory management unit notifies the DMA control unit of the miss, and operation returns to block 414 to determine whether the tablewalk returns.
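The memory management unit's side of this flow can be modeled with a small class. This is a behavioral sketch under simplifying assumptions: translation is keyed by page number only, the tablewalk always finds an entry, and the names `Mmu`, `translate`, and `finish_tablewalk` are hypothetical.

```python
class Mmu:
    """Behavioral model of hit-under-miss translation (illustrative only)."""

    def __init__(self, tlb):
        self.tlb = dict(tlb)     # page number -> real page number
        self.walk_page = None    # page being walked, or None if idle

    def translate(self, page):
        """Return ('hit', real_page) or ('miss', None)."""
        if page in self.tlb:
            return ("hit", self.tlb[page])   # hits are served during a walk
        if self.walk_page is None:
            self.walk_page = page            # first miss starts the tablewalk
        return ("miss", None)                # later misses are only reported

    def finish_tablewalk(self, page_table):
        """Load the walked entry into the TLB; True models the miss clear."""
        if self.walk_page is None:
            return False
        # Assumption: the page is present in page_table (no page fault path).
        self.tlb[self.walk_page] = page_table[self.walk_page]
        self.walk_page = None
        return True
```

A command that missed while another tablewalk was in progress misses again when reissued after the miss clear, and that second miss then starts its own tablewalk.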
If there is a DMA command in the queue in block 502, the DMA control unit makes a request to the memory management unit for address translation for the selected DMA command in the queue (block 504). The DMA control unit determines whether the address translation request resulted in a hit or a miss (block 506). If the address translation is a hit, the DMA control unit issues the command (block 508), and operation returns to block 502 to determine whether there is a DMA command in the queue to issue.
If the address translation is a miss, the DMA control unit sets the MMU-miss dependency flag for the command (block 510). Next, the DMA control unit determines whether the memory management unit returns a miss clear signal (block 512). If the memory management unit returns a miss clear, the DMA control unit resets all MMU-miss dependency flags for all DMA commands in the DMA queue (block 514). Then, operation returns to block 502 to wait for a DMA command to be ready in the DMA queue.
If the memory management unit does not return a miss clear in block 512, the DMA control unit determines whether there is a DMA command in the DMA queue to issue (block 516). If there is not a DMA command in the queue to issue, operation returns to block 512 to determine whether the memory management unit returns a miss clear signal.
If there is a DMA command in the queue in block 516, the DMA control unit makes a request to the memory management unit for address translation for the selected DMA command in the queue (block 518). The DMA control unit determines whether the address translation request resulted in a hit or a miss (block 520). If the address translation is a hit, the DMA control unit issues the command (block 522), and operation returns to block 512 to determine whether the memory management unit returned a miss clear. If the address translation request resulted in a miss, then the DMA control unit sets the MMU-miss dependency flag for the command (block 524), and operation returns to block 512 to determine whether the memory management unit returns a miss clear signal.
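The DMA control unit's side of the flow can be sketched in the same style. The sketch assumes each command is a single transfer that leaves the queue once issued (the unrolling of a command into several transfers is not modeled), and the names `DmaControlUnit`, `step`, and `miss_clear` are hypothetical. The `translate` argument is any callable returning `("hit", real_page)` or `("miss", None)`.

```python
class DmaControlUnit:
    """Behavioral model of the issue loop (illustrative only)."""

    def __init__(self, commands):
        # commands: iterable of (command id, page number) pairs
        self.queue = [{"id": i, "page": p, "flag": False} for i, p in commands]
        self.issued = []

    def step(self, translate):
        """Process the first unblocked command; return its id, or None."""
        for cmd in self.queue:
            if cmd["flag"]:
                continue                  # blocked by MMU-miss dependency flag
            status, _ = translate(cmd["page"])
            if status == "hit":
                self.queue.remove(cmd)    # safe: we return right after
                self.issued.append(cmd["id"])
            else:
                cmd["flag"] = True        # block until the miss clear
            return cmd["id"]
        return None                       # nothing eligible for issue

    def miss_clear(self):
        """Reset all MMU-miss dependency flags in the queue."""
        for cmd in self.queue:
            cmd["flag"] = False
```

Because a flagged command is skipped rather than stalling the loop, commands behind it in the queue can still issue when their translations hit.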
Thus, the illustrative embodiments solve the disadvantages of the prior art by providing a direct memory access engine and memory management unit with hit-under-miss capability. A memory management unit (MMU) performs address translation and protection using a segment table and page table model. A direct memory access (DMA) transaction may involve a data transfer between a local store address, for example, and an effective address, which can be translated into a system-wide real address using the MFC page table. Each DMA queue entry may also include an MMU-miss dependency flag. This flag is set or cleared by the result of the MMU translation. The DMA issue mechanism uses the MMU-miss dependency flag to block the issue of commands that are known to result in a translation miss. When the MMU completes processing of a miss, the MMU sends a miss clear signal to the DMA control unit, which resets the MMU-miss dependency flag of every DMA queue entry that has the flag set. DMA commands in the DMA queue that were blocked from issue by the MMU-miss dependency flag may then be selected by the DMA control unit for issue.
It should be appreciated that the illustrative embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In one exemplary embodiment, the mechanisms of the illustrative embodiments are implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the illustrative embodiments may take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.