The Peripheral Component Interconnect Express (PCIe) standard is widely used in digital communications for a variety of computing systems. In a PCIe network, various electronic devices are coupled through one or more serial links controlled by a central switch. The switch controls the coupling of the serial links and, thus, the routing of data between components. Each serial link or “lane” carries streams of information packets between the devices. Furthermore, the traffic on each lane may be divided into three packet types: posted packets, non-posted packets, and completion packets. Each packet type may be processed as a separate packet stream. To enable quality of service (QoS) between the three packet types, each type of packet may be assigned a different priority level. A packet stream designated as a higher-priority type will generally be processed more often than packet streams designated as lower-priority types. In this way, the higher-priority packet stream will generally have access to the lane more often than lower-priority packet streams and will therefore consume a larger portion of the lane's bandwidth.
Prioritizing packet types can, however, lead to a situation known as “starvation,” which occurs when higher priority packet types consume nearly all of the lane's bandwidth and lower-priority packets are not processed with sufficient speed. Packet starvation may result in poor performance of devices coupled to the PCIe network.
Certain exemplary embodiments are described in the following detailed description and in reference to the drawings, in which:
In accordance with an exemplary embodiment of the present invention, a PCIe interface receives a stream of packets from a first device, processes the packets and sends the packets to a second device, giving the highest priority to posted packets. Starvation of the lower-priority packet streams is avoided by using a counter that tracks the arrival and subsequent transmission of lower-priority packets to ensure that the lower-priority packets are processed within a sufficient amount of time. If a lower-priority packet is not processed before the counter reaches a specified threshold, the PCIe interface generates a “stop-credit” signal that temporarily stops the PCIe interface from receiving packets. By stopping the PCIe interface from receiving additional packets, all of the posted packets will eventually be processed and sent to the second device, thereby enabling the PCIe interface to begin processing lower-priority packets. Sometime after beginning to process lower-priority packets, the stop-credit signal may be deactivated, and the PCIe interface may again begin receiving additional packets. Using this process, some or all of the lower-priority packets may be processed and sent to the second device before the PCIe interface receives additional posted packets. Thus, starvation of the lower-priority packet stream is avoided while ensuring that the posted packets are processed ahead of the lower-priority packets.
Those of ordinary skill in the art will appreciate that the PCIe fabric 100 may comprise hardware elements including circuitry, software elements including computer code stored on a machine-readable medium or a combination of both hardware and software elements. Additionally, the functional blocks shown in
A computing fabric generally includes several networked computing resources, or “network nodes,” connected to each other via one or more network switches. In an exemplary embodiment of the present invention, the nodes of the PCIe fabric 100 may include several host blades 102. The host blades 102 may be configured to provide any suitable computing function, such as data storage or parallel processing, for example. The PCIe fabric 100 may include any suitable number of host blades 102. The host blades 102 may be communicatively coupled to each other through a PCIe interface 104, an I/O device such as a network interface controller (NIC) 106, and a network 108. Each host blade 102 is communicatively coupled to the network 108 through the PCIe interface 104 and the NIC 106, enabling the host blades 102 to communicate with each other as well as with other devices coupled to the network 108. The PCIe interface 104 couples the host blades 102 to the NIC 106 and may also couple one or more host blades 102 directly. The PCIe interface 104 may include a switch that allows the PCIe interface 104 to couple to each of the host blades 102 alternately, enabling the host blades 102 to share the PCIe interface 104 to the NIC 106.
The PCIe interface 104 receives streams of packets from the host blade 102, processes the packets, and organizes the packets into another packet stream that is then sent to the NIC 106. The NIC 106 then sends the packets to the target device through the network 108. The target device may be another host blade 102 or some other device coupled to the network 108. The network 108 may be any suitable network, such as a local area network or the Internet, for example. As discussed above, the PCIe interface 104 may be configured to receive three types of packets from the host blade 102, and each packet type may be accorded a designated priority. Accordingly, the PCIe interface may be configured to receive and process higher priority packets ahead of lower-priority packets, while also preventing starvation of the lower-priority packet stream. The PCIe interface 104 is described further below with reference to
PCIe transactions generally employ a credit-based flow control mechanism to ensure that the receiving device has enough capacity, for example, buffer space, to receive the data being sent. Accordingly, the PCIe controller 200 transmits flow control credits to the host blade 102 via the PCIe outbound traffic 208. The flow control credits grant the host blade 102 the privilege to send a certain number of packets to the PCIe controller 200. As packets are transmitted to the PCIe controller 200, the flow control credits are expended. Once all of the credits are used, the host blade 102 may not send additional packets to the PCIe controller 200 until the PCIe controller 200 grants additional credits to the host blade 102. As the PCIe controller 200 processes the received packets, additional buffer capacity may become available within the PCIe controller 200 and additional credits may be granted to the host blade 102. As long as the PCIe controller 200 grants sufficient credits to the host blade 102, a steady stream of packets may be sent from the host blade 102 to the PCIe controller 200. If, however, the PCIe controller 200 stops granting credits to the host blade 102, the host blade 102 will, likewise, stop sending packets to the PCIe controller 200 as soon as the flow control credits granted to the host blade 102 have been expended.
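The credit exchange described above can be illustrated with a short sketch. The following Python model is not part of the described embodiment; the class name CreditedLink and its methods are hypothetical, and the model assumes one credit per packet and a receiver that advertises every free buffer slot as a credit.

```python
class CreditedLink:
    """Toy model of credit-based flow control between a sender and a receiver."""

    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots  # receiver capacity not yet advertised as credits
        self.credits = 0                # credits currently held by the sender
        self.buffered = 0               # packets waiting inside the receiver
        self.granting = True            # the receiver may pause credit grants

    def grant_credits(self):
        """Receiver advertises all free buffer slots as credits, unless paused."""
        if self.granting:
            self.credits += self.free_slots
            self.free_slots = 0

    def send(self):
        """Sender transmits one packet if it holds a credit; returns True on success."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.buffered += 1
        return True

    def process_one(self):
        """Receiver processes one buffered packet, which frees one slot."""
        if self.buffered:
            self.buffered -= 1
            self.free_slots += 1


link = CreditedLink(buffer_slots=2)
link.grant_credits()
print(link.send(), link.send(), link.send())  # the third send fails: credits are spent
```

In this sketch, setting granting to False plays the role of the controller withholding credits: the sender drains whatever credits it already holds and then stalls until grants resume.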
When the PCIe controller 200 receives an inbound packet, it interprets the packet type information in the packet header and sends the packet to the memory 204. The memory 204 may be used to temporarily hold packets that are destined for the priority receiver 202, and may include any suitable memory device, such as a random access memory (RAM), for example. Furthermore, the memory 204 may be divided into separate buffers for each packet type, referred to herein as the posted RAM 216, the non-posted RAM 218, and the completion RAM 220, each of which may be a first-in-first-out (FIFO) buffer. The RAM buffers 216, 218, and 220 may hold any suitable number of packets. In some embodiments, for example, each of the RAM buffers 216, 218, and 220 may hold approximately 128 packets. Packets received by the PCIe controller 200 from the host blade 102 may be sent to one of the RAM buffers 216, 218, and 220 according to packet type. Posted packets 210 are sent to the posted RAM 216, non-posted packets 212 are sent to the non-posted RAM 218, and completion packets 214 are sent to the completion RAM 220. If any one of the RAM buffers 216, 218, and 220 becomes full, the PCIe controller 200 will temporarily stop issuing flow control credits to the host blade 102.
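By way of illustration only, the routing of packets into per-type FIFO buffers may be sketched as follows. The names PacketMemory, store, and may_grant_credits are hypothetical, and the depth of 128 reflects the approximate figure given above rather than a required value.

```python
from collections import deque

BUFFER_DEPTH = 128  # approximate per-buffer capacity mentioned in the text


class PacketMemory:
    """Three FIFO buffers, one per packet type."""

    def __init__(self, depth=BUFFER_DEPTH):
        self.depth = depth
        self.buffers = {
            "posted": deque(),
            "non_posted": deque(),
            "completion": deque(),
        }

    def store(self, packet_type, payload):
        """Append a packet to the FIFO that matches its type."""
        fifo = self.buffers[packet_type]
        if len(fifo) >= self.depth:
            raise OverflowError(f"{packet_type} buffer is full")
        fifo.append(payload)

    def may_grant_credits(self):
        """Credits are withheld while any one of the buffers is full."""
        return all(len(fifo) < self.depth for fifo in self.buffers.values())
```

A controller built on this sketch would consult may_grant_credits before issuing further flow control credits to the host.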
As packets 210, 212, and 214 are stored to the respective RAM buffers 216, 218, and 220 by the PCIe controller 200, packets 210, 212, or 214 are simultaneously retrieved by the priority receiver 202, one packet at a time. The priority receiver 202 switches between the posted RAM 216, the non-posted RAM 218, and the completion RAM 220, retrieving packets and ordering the packets into a single packet stream 222 that is transmitted to the NIC 106. Each time the priority receiver 202 receives a packet 210, 212, or 214, the packet is placed next in line in the packet stream 222 and sent to the NIC 106. Therefore, the resulting packet stream 222 is determined by the order in which packets are received from the RAM buffers 216, 218, and 220. Moreover, the frequency with which the priority receiver 202 receives packets from any one of the posted RAM 216, the non-posted RAM 218, or the completion RAM 220 determines the relative bandwidth accorded to each of the packet streams represented by the three different packet types.
The order in which the packets 210, 212, or 214 are received from the memory 204 is determined, in part, by the priority assigned to each packet type. It will be appreciated that if the PCIe interface 104 does not process packets in a suitable order, it may be possible, in some cases, for the host blade 102 to obtain outdated information in response to a memory read operation. In other words, if the PCIe interface 104 sends a later-arriving read operation (non-posted packet) to the NIC 106 before an earlier-arriving write operation (posted packet) directed to the same memory location of the target device, the data returned in response to the read operation may not be current. To avoid this situation, embodiments of the present invention assign the highest priority to posted packets 210 (memory writes). This means that the priority receiver 202 will receive posted packets 210 from the posted RAM 216 whenever there are posted packets 210 available in the posted RAM 216. In other words, non-posted packets 212 and completion packets 214 will not be received by the priority receiver 202 unless the posted RAM 216 is empty. Assigning the highest priority to posted packets 210 in this way avoids the possible problem of processing a later-arriving read operation ahead of an earlier-arriving write operation.
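The posted-first rule described above amounts to a strict-priority selection, which may be sketched as follows. The function name select_next is hypothetical, and the order in which the two lower-priority buffers are consulted here (non-posted before completion) is an assumption made only to keep the example short.

```python
from collections import deque


def select_next(posted, non_posted, completion):
    """Return the next packet under strict posted-first priority, or None if all are empty."""
    if posted:
        # Posted packets are always retrieved while any are available.
        return posted.popleft()
    if non_posted:
        # Assumption: non-posted is consulted before completion.
        return non_posted.popleft()
    if completion:
        return completion.popleft()
    return None


posted = deque(["P1", "P2"])
non_posted = deque(["N1"])
completion = deque(["C1"])
order = [select_next(posted, non_posted, completion) for _ in range(5)]
print(order)  # ['P1', 'P2', 'N1', 'C1', None]
```

Because each buffer is a FIFO and the posted buffer is always drained first, a read is never forwarded while a write still waits in the posted buffer.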
However, one consequence of giving posted packets 210 the highest priority is that if the host blade 102 provides a steady stream of posted packets 210 to the PCIe controller 200, the non-posted packets 212 and completion packets 214 may not be retrieved and processed by the priority receiver 202 for a significant amount of time. Failure to process lower-priority packets in a timely manner may hinder the performance of one of the devices coupled to the PCIe fabric 100. In some instances, for example, failure to timely process a completion packet 214 may result in a completion time-out, in which case the requesting device may send a duplicate read request. The PCIe standard provides that a device may initiate a completion time-out within 50 microseconds to 50 milliseconds after sending a read request.
Therefore, exemplary embodiments of the present invention also include techniques for enabling lower-priority packets to be processed in a timely manner. Accordingly, the priority receiver 202 may include a counter 224 that provides a value referred to herein as a “delay-reference.” In some embodiments, the delay-reference may be an amount of time that a lower-priority packet has been held in the non-posted RAM 218 and/or the completion RAM 220. In other embodiments, the delay-reference may be a count of the number of posted packets 210 that have been received by the priority receiver 202 from the posted RAM 216 while a lower-priority packet has been held in the non-posted RAM 218 and/or the completion RAM 220. If the delay-reference for a lower-priority packet exceeds a certain threshold, referred to herein as the “stop-credit threshold,” the priority receiver 202 issues a stop-credit signal 226 to the PCIe controller 200. The PCIe controller 200 in turn stops sending flow control credits to the host blade 102. As discussed above, this causes the host blade 102 to stop sending packets to the PCIe controller 200. As a result, the PCIe controller 200 will eventually run out of packets to send to the memory 204. Meanwhile, the priority receiver 202 continues to receive and process packets from the memory 204. When all of the posted packets 210 have been received from the posted RAM 216, the priority receiver 202 then starts receiving and processing the lower-priority packets from the non-posted RAM 218 and the completion RAM 220. The stop-credit signal 226 may be maintained long enough for one or more of the lower-priority packets to be processed before additional posted packets 210 become available in the posted RAM 216.
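The interaction between the delay-reference, the stop-credit threshold, and the draining of the posted buffer can be seen in a small simulation. This is a sketch under stated assumptions rather than the described hardware: the names PriorityReceiver and simulate are hypothetical, the delay-reference is taken to be a count of posted packets, the two lower-priority buffers are merged into one queue, the host adds at most one posted packet per cycle, and the stop-credit flag is cleared as soon as one lower-priority packet is retrieved.

```python
from collections import deque


class PriorityReceiver:
    """Toy model: posted packets first, guarded by a delay-reference counter."""

    def __init__(self, stop_credit_threshold):
        self.posted = deque()
        self.lower = deque()      # non-posted and completion packets, merged for brevity
        self.threshold = stop_credit_threshold
        self.delay_reference = 0  # posted packets retrieved while a lower-priority packet waits
        self.stop_credit = False

    def step(self):
        """Retrieve one packet and update the counter and the stop-credit flag."""
        if self.posted:
            packet = self.posted.popleft()
            if self.lower:
                self.delay_reference += 1
                if self.delay_reference >= self.threshold:
                    self.stop_credit = True
            return packet
        if self.lower:
            packet = self.lower.popleft()
            self.delay_reference = 0
            self.stop_credit = False
            return packet
        self.stop_credit = False
        return None


def simulate(threshold, cycles):
    """Host offers one posted packet per cycle unless credits are stopped."""
    rx = PriorityReceiver(threshold)
    rx.lower.append("NP0")  # one lower-priority packet is already waiting
    sent = []
    for i in range(cycles):
        if not rx.stop_credit:
            rx.posted.append(f"P{i}")
        sent.append(rx.step())
    return sent


print(simulate(threshold=3, cycles=6))  # ['P0', 'P1', 'P2', 'NP0', 'P4', 'P5']
```

In this run the waiting packet NP0 is forwarded after three posted packets. Without the stop-credit flag, the posted queue in this model would never be empty when step is called, and NP0 would wait indefinitely.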
The delay-reference tracking of the lower-priority packets may be accomplished in a variety of ways. For example, the counter 224 may count an actual time, such as the number of microseconds or milliseconds that have passed since the counter 224 was started or reset. Accordingly, the counter 224 may be coupled to a clock and configured to count clock pulses. In this case, the stop-credit threshold may be some fraction of the maximum or minimum completion packet timeout defined by the PCIe standard. For example, in an exemplary embodiment, the stop-credit threshold may be 50 percent of the minimum completion packet timeout, or 25 microseconds. Setting the stop-credit threshold at a fraction of the completion timeout may allow lower-priority packets to be processed in sufficient time to prevent a requesting device from timing out and resending another request packet.
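The threshold arithmetic above may be worked through as follows. The function names are hypothetical, and the 250 MHz clock is an assumed value chosen only to show how a time threshold might translate into a count of clock pulses.

```python
MIN_COMPLETION_TIMEOUT_NS = 50_000      # 50 microseconds, the lower bound cited in the text
MAX_COMPLETION_TIMEOUT_NS = 50_000_000  # 50 milliseconds, the upper bound cited in the text


def stop_credit_threshold_ns(timeout_ns, percent):
    """Return the given percentage of a completion timeout, in nanoseconds."""
    return timeout_ns * percent // 100


def threshold_in_clock_ticks(threshold_ns, clock_hz):
    """Convert a time threshold to the number of clock pulses the counter would count."""
    return threshold_ns * clock_hz // 1_000_000_000


threshold_ns = stop_credit_threshold_ns(MIN_COMPLETION_TIMEOUT_NS, 50)
ticks = threshold_in_clock_ticks(threshold_ns, 250_000_000)  # assumed 250 MHz clock
print(threshold_ns, ticks)  # 25000 6250
```

Integer arithmetic is used so that the result is exact; 50 percent of 50 microseconds gives the 25 microsecond threshold of the example embodiment.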
Alternatively, the counter 224 may count a number of packets that have been processed by the priority receiver 202 since the arrival of a lower-priority packet, and the stop-credit threshold may be specified as any suitable number of higher-priority packets, for example, 4, 8, or 256 posted packets. In other words, upon the arrival of a lower-priority packet, the counter 224 may begin counting the number of posted packets 210 received by the priority receiver 202. If the counter 224 reaches the specified packet count threshold before a lower-priority packet is processed, then the stop-credit signal 226 is issued. This technique allows an approximate upper limit to be placed on the number of posted packets 210 that may be processed before processing of non-posted packets 212 or completion packets 214 is performed. For example, the stop-credit threshold may be set at 8, in which case the stop-credit signal 226 may be sent to the PCIe controller 200 after the priority receiver 202 receives 8 posted packets 210 consecutively. In some exemplary embodiments, the stop-credit threshold may be specified as a packet count that is known to approximately correspond with the passage of a certain amount of actual time, based on the speed at which the PCIe interface 104 processes the packets. Furthermore, the actual time may correspond with a portion of the PCIe completion time-out.
Additionally, in some exemplary embodiments, a single counter 224 may be used for both the non-posted packets 212 and the completion packets 214. In this case, the counter 224 may start when either a non-posted packet 212 or a completion packet 214 arrives in the non-posted RAM 218 or the completion RAM 220. Additionally, the counter 224 may restart when a packet has been received by the priority receiver 202 from either the non-posted RAM 218 or the completion RAM 220. In other words, the processing of either a non-posted packet 212 or a completion packet 214 may be sufficient to restart the counter 224. In other exemplary embodiments, the counter 224 may reset only if a packet is processed from the same RAM buffer 218 or 220 that caused the counter 224 to start. In other words, if the arrival of a non-posted packet 212 in the non-posted RAM 218 causes the counter 224 to start, only the retrieval of a non-posted packet 212 from the non-posted RAM 218 will cause the counter 224 to reset. Conversely, if the arrival of a completion packet 214 in the completion RAM 220 causes the counter 224 to start, only the retrieval of a completion packet 214 from the completion RAM 220 will cause the counter 224 to reset.
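The two reset policies for a single shared counter may be contrasted in a brief sketch. The class SharedDelayCounter and its method names are hypothetical. As a simplification, a reset here returns the counter to an idle state, and the caller is assumed to signal a new arrival if another lower-priority packet is still waiting.

```python
class SharedDelayCounter:
    """One counter shared by the non-posted and completion buffers."""

    def __init__(self, reset_on_same_buffer_only):
        self.reset_on_same_buffer_only = reset_on_same_buffer_only
        self.started_by = None  # buffer whose packet started the counter; None while idle
        self.value = 0

    def on_arrival(self, buffer_name):
        """Start the counter when a lower-priority packet arrives and the counter is idle."""
        if self.started_by is None:
            self.started_by = buffer_name
            self.value = 0

    def on_posted_retrieved(self):
        """Advance the delay-reference while a lower-priority packet is waiting."""
        if self.started_by is not None:
            self.value += 1

    def on_lower_retrieved(self, buffer_name):
        """Apply the configured reset policy; return True if the counter was reset."""
        if self.started_by is None:
            return False
        if self.reset_on_same_buffer_only and buffer_name != self.started_by:
            return False
        self.value = 0
        self.started_by = None
        return True


strict = SharedDelayCounter(reset_on_same_buffer_only=True)
strict.on_arrival("non_posted")
strict.on_posted_retrieved()
print(strict.on_lower_retrieved("completion"))  # False: a different buffer was served
print(strict.on_lower_retrieved("non_posted"))  # True: the starting buffer was served
```

With reset_on_same_buffer_only set to False, retrieval from either buffer resets the counter, which corresponds to the first policy described above.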
In an exemplary embodiment, separate counters 224 may be used for the non-posted packets 212 held in the non-posted RAM 218 and the completion packets 214 held in the completion RAM 220. In this embodiment, one of the counters 224 may track packets in the non-posted RAM 218, while one of the counters 224 tracks the completion RAM 220. Furthermore, each counter 224 may independently trigger the stop-credit signal 226 if either counter 224 reaches the stop-credit threshold. A different threshold may be set for each of the RAM buffers 218, 220, to tune the system for the number of packets received. The methods described above may be better understood with reference to
As discussed above in reference to
Next, at block 410 a determination is made regarding whether the counter 224 is at or above the stop-credit threshold. If the counter 224 is not at or above the stop-credit threshold, then process flow returns to block 402, at which time the priority receiver 202 is ready to receive a new packet. If, however, the counter 224 is at or above the stop-credit threshold, the method 400 advances to block 412. At block 412, the value “stop-credit” is set to a value of “true,” and the priority receiver 202 therefore sends a stop-credit signal 226 to the PCIe controller 200. As discussed above in reference to
Returning to block 404, if a determination is made that a posted packet 210 is not available because the posted RAM 216 is empty, then the priority receiver may receive a lower-priority packet. Accordingly, process flow may advance to block 414, wherein a determination is made regarding whether a lower-priority packet is available. If either a non-posted packet 212 or completion packet 214 is available in the non-posted RAM 218 or the completion RAM 220, process flow advances to block 416, and the lower-priority packet is received by the priority receiver 202.
If both a non-posted packet 212 and a completion packet 214 are available, the packet that is received by the priority receiver 202 will depend on the relative priority assigned to the non-posted packets 212 and the completion packets 214. Exemplary embodiments of the present invention may include any suitable priority assignment between non-posted packets 212 and completion packets 214. For example, at block 416 a higher priority may be given to either the non-posted packets 212 or the completion packets 214. As another example, the priority may alternate between the non-posted 212 and the completion packets 214 each time a lower-priority packet is received from the non-posted RAM 218 or the completion RAM 220. In this way, the priority receiver 202 may alternately process packets from the non-posted RAM 218 and the completion RAM 220, when posted packets 210 are not available. Other priority conditions may be provided to distinguish between the non-posted packets 212 and the completion packets 214 while still falling within the scope of the present claims.
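One of the options described above, alternating between the non-posted and completion buffers, may be sketched as follows. The class name AlternatingArbiter is hypothetical, and the choice to favor the non-posted buffer on the first tie is arbitrary.

```python
from collections import deque


class AlternatingArbiter:
    """Alternate between the non-posted and completion FIFOs when both hold packets."""

    def __init__(self):
        self.prefer_non_posted = True  # which buffer wins the next tie (initial choice is arbitrary)

    def select(self, non_posted, completion):
        """Return the next lower-priority packet, or None if both buffers are empty."""
        if non_posted and completion:
            source = non_posted if self.prefer_non_posted else completion
        elif non_posted:
            source = non_posted
        elif completion:
            source = completion
        else:
            return None
        # After each retrieval, prefer the buffer that was not just served.
        self.prefer_non_posted = source is completion
        return source.popleft()


arbiter = AlternatingArbiter()
non_posted = deque(["N1", "N2", "N3"])
completion = deque(["C1"])
print([arbiter.select(non_posted, completion) for _ in range(5)])
# ['N1', 'C1', 'N2', 'N3', None]
```

When only one buffer holds packets, that buffer is served regardless of the preference, so neither lower-priority stream blocks the other.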
After receiving the lower-priority packet, process flow may advance to block 418. At this time a lower-priority packet will have been received by the priority receiver 202. Therefore, if the counter 224 has previously been started and is currently tracking the delay-reference of the lower-priority packet, the delay-reference information stored by the counter 224 may no longer be current. Accordingly, at block 418 the counter 224 may be reset. Resetting the counter 224 causes the counter 224 to begin tracking a delay-reference of the next available lower-priority packet in the memory 204. In exemplary embodiments with two counters 224, for example, one counter 224 for the non-posted RAM 218 and one counter 224 for the completion RAM 220, the receipt of the lower-priority packet may reset only the counter 224 associated with the RAM buffer from which the lower-priority packet was received. In exemplary embodiments with one counter 224 for both non-posted packets 212 and completion packets 214, the counter 224 may be reset regardless of whether a non-posted packet 212 or a completion packet 214 was received.
In some exemplary embodiments, the stop-credit signal 226 may be activated (“stop-credit” set to true) for only as long as it takes to empty the posted RAM 216 and receive at least one lower-priority packet from the non-posted RAM 218 or the completion RAM 220. Accordingly, the stop-credit signal 226 may be deactivated (“stop-credit” set to false) at block 418, as shown in
Moreover, turning the stop-credit signal 226 off at block 418, when there may still be several lower-priority packets in the non-posted RAM 218 and the completion RAM 220, enables efficient use of the PCIe interface 104 bandwidth. This is true because the speed at which the PCIe interface 104 transfers data from the host blade 102 to the NIC 106 is limited by the speed at which the priority receiver 202 can process packets from the memory 204. As long as the priority receiver 202 continues to receive a steady stream of packets from the memory 204, the stop-credit signal 226 will not significantly diminish the data transfer speed between the host blade 102 and the NIC 106. Conversely, if the stop-credit signal 226 causes the memory 204 to empty before additional packets are delivered to the memory 204 from the PCIe controller 200, then the priority receiver 202 will experience a period of inactivity, wherein no packets are being delivered to the NIC 106 despite the fact that one or more host blades 102 have additional data packets to send to the NIC 106. Such a period of inactivity may reduce the average data transmission rate of the PCIe interface 104. However, a brief period wherein the PCIe controller 200 stops receiving packets does not significantly reduce the overall speed of the PCIe interface 104 as long as the priority receiver 202 continues receiving packets from the memory 204. Therefore, by turning off the stop-credit signal 226 at block 418 after only a single lower-priority packet has been received by the priority receiver 202, the likelihood of the priority receiver 202 experiencing a period of inactivity is reduced, because the process of enabling the host blade 102 to send additional packets begins before the memory 204 has been emptied.
On the other hand, in some embodiments, it may be advantageous to keep the stop-credit signal 226 activated until both the non-posted RAM 218 and the completion RAM 220 are empty. Accordingly, in some exemplary embodiments, the stop-credit signal 226 may not be deactivated at block 418, but rather at block 420, as will be discussed below. After block 418, process flow returns to block 402, and the priority receiver 202 is ready to receive a new packet. Returning to block 414, if a lower-priority packet is not available, the method 400 advances to block 420. As discussed above, the stop-credit signal 226 may, in some embodiments, be turned off at block 420 rather than block 418. Thus, at block 420, the stop-credit signal 226 may be deactivated. As discussed above in relation to block 418, turning off the stop-credit signal 226 may cause the PCIe controller 200 to resume sending flow control credits to the host blade 102, and the PCIe controller 200 may begin receiving additional packets from the host blade 102. Additionally, the delay-reference counter 224 may be stopped at block 420 because there are no longer any lower-priority packets available in the non-posted RAM 218 and the completion RAM 220. Referring briefly to
Furthermore, the processor 501 may be communicatively coupled to tangible, computer-readable media 502 for the processor 501 to store programs and data. The tangible, computer-readable media 502 can include read only memory (ROM) 504, which can store programs that may be executed on the processor 501. The ROM 504 can include, for example, programmable ROM (PROM) and erasable programmable ROM (EPROM), among others. The computer-readable media 502 can also include random access memory (RAM) 506 for storing programs and data during operation of the processor 501.
Further, the computer readable media 502 can include units for longer term storage of programs and data, such as a hard disk drive 508 or an optical disk drive 510. One of ordinary skill in the art will recognize that the hard disk drive 508 does not have to be a single unit, but can include multiple hard drives or a drive array. Similarly, the computer readable media 502 can include multiple optical drives 510, for example, CD-ROM drives, DVD-ROM drives, CD/RW drives, DVD/RW drives, Blu-Ray drives, and the like. The computer readable media 502 can also include flash drives 512, which can be, for example, coupled to the processor 501 through an external USB bus.
The processor 501 can be adapted to operate as a communications interface according to an exemplary embodiment of the present invention. Moreover, the tangible, machine-readable medium 502 can store machine-readable instructions such as computer code that, when executed by the processor 501, cause the processor 501 to perform a method according to an exemplary embodiment of the present invention.