1. Technical Field
The present invention relates to a system and method for using a data-only transfer protocol to store atomic cache line data in a local storage area. More particularly, the present invention relates to a system and method for a processing engine to use a data-only transfer protocol in conjunction with an external bus node to transfer data from an internal atomic cache to an internal local storage area.
2. Description of the Related Art
A computer system comprises a processing engine that includes an atomic cache. The processing engine uses the atomic cache for tasks that depend upon the atomicity of cache line accesses, which require reading and writing cache line data without interruption, such as processor synchronization (e.g., semaphore utilization).
In a large symmetrical multi-processor system, the system typically uses lock acquisition to synchronize access to data structures. Systems that run producer-consumer applications must ensure that the produced data is globally visible before allowing consumers to access the produced data structure. Usually, the producer attempts to acquire a lock using a lock-load instruction, such as a “Getllar” command, and verifies the acquisition by examining a lock-word value. The “Getllar” command has a transfer size of one cache line, and the command executes immediately instead of being queued in the processing engine's DMA command queue like other DMA commands. Once the producer application has acquired the lock, the producer application owns the data structure until it releases the lock. In turn, the consumer waits for the lock release before accessing the data structure.
When attempting to acquire a lock, software “spins” or loops on an atomic update sequence that executes the Getllar instruction and compares the data with a software specific definition indicating “lock_free.” If the value is “not free,” the software branches back to the Getllar instruction to restart the sequence. When the value indicates “free,” the software exits the loop and uses a conditional lock_store instruction to update the lock word to “lock taken.” The conditional lock_store fails when the processor that is attempting to acquire the lock no longer holds the reservation. When this occurs, the software again restarts the loop beginning with the Getllar instruction. A challenge found is that this spin loop causes the same data to be retrieved out of cache over and over when the lock is taken by another processing element.
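By way of non-limiting example, the spin sequence described above may be expressed in C-style pseudocode as follows. The names getllar(), putllc(), LOCK_FREE, and LOCK_TAKEN are hypothetical stand-ins for the lock-load command, the conditional lock-store command, and the software-specific lock-word values; the actual mnemonics and interfaces depend upon the particular processing engine.

    #include <stdint.h>

    #define LOCK_FREE   0u                      /* software-specific "lock_free" value  */
    #define LOCK_TAKEN  1u                      /* software-specific "lock taken" value */

    /* Hypothetical wrappers for the atomic lock-line commands. */
    extern uint32_t getllar(volatile uint32_t *lock_word);              /* lock-load with reservation */
    extern int      putllc(volatile uint32_t *lock_word, uint32_t val); /* conditional lock-store;
                                                                            nonzero on success */
    void acquire_lock(volatile uint32_t *lock_word)
    {
        for (;;) {
            /* Lock-load: fetch the cache line and establish a reservation. */
            uint32_t value = getllar(lock_word);

            /* Lock not free: branch back and reissue the lock-load command. */
            if (value != LOCK_FREE)
                continue;

            /* Conditional lock-store: succeeds only if the reservation is
             * still held; otherwise restart the sequence at the lock-load. */
            if (putllc(lock_word, LOCK_TAKEN))
                break;                          /* lock acquired */
        }
    }

When the lock is held by another processing element, this loop repeatedly re-fetches the same cache line, which motivates the reduced-latency transfer described below.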
What is needed, therefore, is a system and method that reduces latency for DMA requests corresponding to atomic cache lines.
It has been discovered that the aforementioned challenges are resolved using a system and method for a processing engine to use a data-only transfer protocol in conjunction with an external bus node to transfer data from an internal atomic cache to an internal local storage area. When the processing engine encounters a request to transfer cache line data from the atomic cache to the local storage (e.g., a GETLLAR command), the processing engine utilizes a data-only transfer protocol to pass cache line data through the external bus node and back to the processing engine. The data-only transfer protocol comprises a data phase without a command phase or a snoop phase.
A processing engine identifies a direct memory access (DMA) command that corresponds to a cache line located in the atomic cache. As such, the processing engine sends a data request to an external bus node controller that, in turn, sends a data grant back to the processing engine when the bus node controller determines that an external broadband data bus is inactive. In addition, the bus node controller configures a bus node's external multiplexer to receive data from the processing engine instead of receiving data from an upstream bus node.
When the processing engine receives the data grant from the bus node controller, the processing engine transfers the cache line data from the atomic cache to the bus node. In turn, the bus node feeds the cache line data back to the processing engine without delay and the processing engine stores the cache line data in its local storage area.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
The following is intended to provide a detailed description of an example of the invention and should not be taken to be limiting of the invention itself. Rather, any number of variations may fall within the scope of the invention, which is defined in the claims following the description.
When processing engine 100 encounters a “GETLLAR” (get lock line and reservation) command to transfer data from a cache line located in atomic cache 120 to local storage 110, processing engine 100 utilizes internal multiplexer 130. A challenge found is that arbitration control 125 prioritizes bus data from latch 180 ahead of cache line data from atomic cache 120. As a result, the cache line data stalls at internal multiplexer 130, waiting for the bus data from latch 180 to complete.
Processing engine 100 identifies a direct memory access (DMA) command that corresponds to a cache line located in atomic cache 120. As such, processing engine 100 sends a data request to bus node controller 200 and, in turn, bus node controller 200 sends a data grant to processing engine 100 when bus 162 is inactive. In addition, bus node controller 200 configures external multiplexer 165 to receive data from cache line buffer 145. Bus 162, external multiplexer 165, and cache line buffer 145 are the same as those shown previously.
Processing engine 100 receives the data grant from bus node controller 200 and transfers the cache line data from atomic cache 120 through multiplexer 140 into cache line buffer 145, which feeds into external multiplexer 165. External multiplexer 165 passes the cache line data to latch 170, which feeds into bus node 175 and latch 180. From latch 180, the cache line data feeds into latch 135, which transfers the cache line data into local storage 110.
Processing commences at 300, whereupon the master device (e.g., processing engine) sends a bus command to a bus controller at step 310. At step 320, the bus controller reflects the command to one or more slave devices. Once the command is reflected to the slave devices, the snoop phase begins at step 330, whereupon the slave devices snoop the bus command. At step 340, the slave devices send snoop responses back to the bus controller, which include cache line status information to maintain memory coherency. The bus controller combines the snoop responses and sends the combined snoop responses to the master device at step 350, which the master device receives at step 360.
Once the master device receives the combined snoop responses, the data phase begins at step 370, whereupon the master device sends a data request to the bus controller based upon the snoop responses. At step 380, the master device receives a data grant from the bus controller, signifying approval to send data onto the bus. Once the master device receives the data grant, the master device sends the data onto the bus to the destination slave device (step 390), and processing ends at 395.
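By way of non-limiting illustration, the full command, snoop, and data phase sequence described above may be sketched from the master device's point of view as follows; all type and function names are hypothetical and serve only to make the phase ordering explicit.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t address; int ttype; } bus_command_t;   /* hypothetical */
    typedef struct { bool retry; } combined_snoop_t;                 /* hypothetical */

    extern void send_bus_command(const bus_command_t *cmd);              /* step 310 */
    extern combined_snoop_t wait_combined_snoop_response(void);          /* steps 320-360 */
    extern void send_data_request(void);                                 /* step 370 */
    extern void wait_for_data_grant(void);                               /* step 380 */
    extern void send_data_to_slave(const void *line, unsigned bytes);    /* step 390 */

    /* Master-side view of a conventional transfer: command phase,
     * snoop phase, and then data phase. */
    void conventional_transfer(const bus_command_t *cmd, const void *line, unsigned bytes)
    {
        send_bus_command(cmd);                                    /* command phase */
        combined_snoop_t snoop = wait_combined_snoop_response();  /* snoop phase   */

        if (!snoop.retry) {                                       /* data phase    */
            send_data_request();
            wait_for_data_grant();
            send_data_to_slave(line, bytes);
        }
    }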
Processing commences at 400, whereupon the master device sends a data request to the bus node controller at step 420. The data request may result from an atomic cache line request that the master device identified.
At step 440, the master device receives a data grant from the bus node controller, signifying that the bus is currently inactive. Once the master device receives the data grant, the master device sends the cache line data onto the bus, and the cache line data is routed back into the master device's local storage area.
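For contrast, a corresponding non-limiting sketch of the data-only transfer protocol, again using hypothetical function names, shows that only the data phase remains:

    /* Master-side view of the data-only transfer protocol: no bus command is
     * issued and no snoop responses are collected (hypothetical names). */
    extern void send_data_request(void);                                  /* step 420 */
    extern void wait_for_data_grant(void);                                /* step 440 */
    extern void send_data_to_bus_node(const void *line, unsigned bytes);

    void data_only_transfer(const void *cache_line, unsigned bytes)
    {
        send_data_request();                      /* request use of the external data bus          */
        wait_for_data_grant();                    /* bus node controller reports the bus inactive  */
        send_data_to_bus_node(cache_line, bytes); /* bus node feeds the line back to local storage */
    }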
Processing commences at 500, whereupon processing fetches an instruction from instruction memory at step 510. A determination is made as to whether the instruction is a direct memory access (DMA) instruction (decision 520). If the instruction is not a DMA instruction, decision 520 branches to “No” branch 522, which loops back to process (step 525) and fetch another instruction. This looping continues until the fetched instruction is a DMA instruction, at which point decision 520 branches to “Yes” branch 528.
A determination is made as to whether the DMA instruction corresponds to a cache line included in atomic cache, such as a “GETLLAR” command (decision 530). If the DMA command does not correspond to an atomic cache line, decision 530 branches to “No” branch 532, which loops back to process (step 525) and fetch another instruction. This looping continues until processing fetches a DMA command that requests data from an atomic cache line, at which point decision 530 branches to “Yes” branch 538.
Processing sends a data request to bus node controller 200 included in bus node 155 at step 540. At step 550, processing receives a data grant from bus node controller 200, signifying the bus is inactive. Bus node 155 is the same as that shown previously.
Once processing receives the data grant, processing sends data from atomic cache 120 to bus node 155, receives the data back from bus node 155, and stores the data in local storage 110 (step 560).
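By way of non-limiting example, this decision flow may be sketched as follows. The instruction_t type and helper function names are hypothetical; the sketch merely illustrates how a GETLLAR-type command is steered to the data-only transfer path while other instructions are processed normally.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { int opcode; uint64_t effective_addr; void *local_store_addr; } instruction_t;

    extern instruction_t fetch_instruction(void);                          /* step 510 */
    extern bool is_dma_command(const instruction_t *instr);                /* decision 520 */
    extern bool targets_atomic_cache(const instruction_t *instr);          /* decision 530 (e.g., GETLLAR) */
    extern void process_instruction(const instruction_t *instr);           /* step 525 */
    extern void transfer_atomic_line_to_local_store(const instruction_t *instr); /* steps 540-560 */

    void instruction_loop(void)
    {
        for (;;) {
            instruction_t instr = fetch_instruction();        /* step 510 */

            if (!is_dma_command(&instr)) {                    /* decision 520, "No" branch 522 */
                process_instruction(&instr);                  /* step 525 */
                continue;
            }
            if (!targets_atomic_cache(&instr)) {              /* decision 530, "No" branch 532 */
                process_instruction(&instr);                  /* step 525 */
                continue;
            }
            /* "Yes" branch 538: send a data request to the bus node controller,
             * wait for the data grant, and transfer the cache line from the
             * atomic cache through the bus node and back into local storage
             * (steps 540-560). */
            transfer_atomic_line_to_local_store(&instr);
        }
    }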
Processing checks bus activity at step 620, and a determination is made as to whether the bus is active (decision 630). If the bus is active, decision 630 branches to “Yes” branch 632, which loops back to continue to check the bus activity. This looping continues until the bus is inactive, at which point decision 630 branches to “No” branch 638, whereupon processing switches an external bus multiplexer to select, as its input, cache line data from the atomic cache included in processing engine 100 (step 640). At step 645, processing sends a data grant to processing engine 100, informing processing engine 100 to send the cache line data.
At step 650, processing engine 100 sends the cache line data to the bus node, which the bus node sends back to processing engine 100 to store in a local storage area.
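By way of non-limiting example, the bus node controller's side of this sequence may be sketched as follows, again using hypothetical function names:

    #include <stdbool.h>

    extern bool bus_is_active(void);                                  /* step 620 / decision 630 */
    extern void select_mux_input_from_processing_engine(void);        /* step 640 */
    extern void send_data_grant(void);                                /* step 645 */
    extern void forward_cache_line_to_processing_engine(void);        /* step 650 */

    /* Bus-node-controller handling of a data request for an atomic cache line. */
    void handle_data_request(void)
    {
        while (bus_is_active())          /* decision 630: "Yes" branch 632 loops    */
            ;                            /* until the external data bus is inactive */

        /* "No" branch 638: switch the bus node's external multiplexer so that it
         * selects cache line data from the processing engine rather than data
         * from an upstream bus node (step 640). */
        select_mux_input_from_processing_engine();

        /* Step 645: inform the processing engine to send the cache line data. */
        send_data_grant();

        /* Step 650: the bus node feeds the received cache line data back to the
         * processing engine, which stores it in its local storage area. */
        forward_cache_line_to_processing_engine();
    }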
PCI bus 714 provides an interface for a variety of devices that are shared by host processor(s) 700 and Service Processor 716 including, for example, flash memory 718. PCI-to-ISA bridge 735 provides bus control to handle transfers between PCI bus 714 and ISA bus 740, universal serial bus (USB) functionality 745, power management functionality 755, and can include other functional elements not shown, such as a real-time clock (RTC), DMA control, interrupt support, and system management bus support. Nonvolatile RAM 720 is attached to ISA Bus 740. Service Processor 716 includes JTAG and I2C busses 722 for communication with processor(s) 700 during initialization steps. JTAG/I2C busses 722 are also coupled to L2 cache 704, Host-to-PCI bridge 706, and main memory 708 providing a communications path between the processor, the Service Processor, the L2 cache, the Host-to-PCI bridge, and the main memory. Service Processor 716 also has access to system power resources for powering down information handling device 701.
Peripheral devices and input/output (I/O) devices can be attached to various interfaces (e.g., parallel interface 762, serial interface 764, keyboard interface 768, and mouse interface 770) coupled to ISA bus 740. Alternatively, many I/O devices can be accommodated by a super I/O controller (not shown) attached to ISA bus 740.
In order to attach computer system 701 to another computer system to copy files over a network, LAN card 730 is coupled to PCI bus 710. Similarly, to connect computer system 701 to an ISP to connect to the Internet using a telephone line connection, modem 775 is connected to serial port 764 and PCI-to-ISA Bridge 735.
Control plane 810 includes processing unit 820 which runs operating system (OS) 825. For example, processing unit 820 may be a Power PC core that is embedded in BEA 800 and OS 825 may be a Linux operating system. Processing unit 820 manages a common memory map table for BEA 800. The memory map table corresponds to memory locations included in BEA 800, such as L2 memory 830 as well as non-private memory included in data plane 840.
Data plane 840 includes synergistic processing elements (SPEs) 845, 850, and 855. Each SPE is used to process data information and each SPE may have different instruction sets. For example, BEA 800 may be used in a wireless communications system and each SPE may be responsible for separate processing tasks, such as modulation, chip rate processing, encoding, and network interfacing. In another example, each SPE may have identical instruction sets and may be used in parallel to perform operations benefiting from parallel processes. Each SPE includes a synergistic processing unit (SPU), which is a processing core, such as a digital signal processor, a microcontroller, a microprocessor, or a combination of these cores.
SPEs 845, 850, and 855 are connected to processor element bus 860, which passes information between control plane 810, data plane 840, and input/output 870. Bus 860 is an on-chip coherent multi-processor bus that passes information between I/O 870, control plane 810, and data plane 840. Input/output 870 includes flexible input-output logic which dynamically assigns interface pins to input/output controllers based upon peripheral devices that are connected to BEA 800.
One of the preferred implementations of the invention is a client application, namely, a set of instructions (program code) in a code module that may, for example, be resident in the random access memory of the computer. Until required by the computer, the set of instructions may be stored in another computer memory, for example, in a hard disk drive, or in a removable memory such as an optical disk (for eventual use in a CD ROM) or floppy disk (for eventual use in a floppy disk drive). Thus, the present invention may be implemented as a computer program product for use in a computer. In addition, although the various methods described are conveniently implemented in a general purpose computer selectively activated or reconfigured by software, one of ordinary skill in the art would also recognize that such methods may be carried out in hardware, in firmware, or in more specialized apparatus constructed to perform the required method steps.
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, changes and modifications may be made without departing from this invention and its broader aspects. Therefore, the appended claims are to encompass within their scope all such changes and modifications as are within the true spirit and scope of this invention. Furthermore, it is to be understood that the invention is solely defined by the appended claims. It will be understood by those with skill in the art that if a specific number of an introduced claim element is intended, such intent will be explicitly recited in the claim, and in the absence of such recitation no such limitation is present. For non-limiting example, as an aid to understanding, the following appended claims contain usage of the introductory phrases “at least one” and “one or more” to introduce claim elements. However, the use of such phrases should not be construed to imply that the introduction of a claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an”; the same holds true for the use in the claims of definite articles.