The present invention relates to communications by a processor within a system of multiple processors or over a network.
One of the performance bottlenecks of computing systems which include multiple processors is the speed at which data are transferred in messages between processors. Communication bandwidth, defined as the amount of data transferred per unit of time, depends on a number of factors, including not only the transfer rate between processors of a multiple processor system but many others. Factors which determine communication bandwidth typically include both fixed cost factors, which apply to all messages regardless of their length, and variable cost factors, which vary in relation to the length of the message.
In order to best describe the factors affecting communication bandwidth, it is helpful to illustrate a computing system and various methods used to transfer messages between processors of such a system.
Storage devices 140 are used for paging memory in and out as needed to support programs executed at each processor 110, especially application programs (hereinafter “applications”). By contrast, local memory 115 is available to hold data which applications are actively using at each processor 110. When such data is no longer needed, it is typically paged out to the storage devices 140 under control of an operating system function such as the “virtual memory manager” (VMM). When an application needs the data again, it is paged in from the storage devices 140.
Communications between processors 110 of the system can be handled in one of two basic ways. A first way, which is referred to as a “copy mode” transport mechanism, is illustrated with respect to
Similarly, when an application using user buffer 202 sends a message, the data is copied from the user buffer 202 into the send buffer 210b, from which it is copied into adapter memory 135b. From there, the data is sent over switch 130 to memory 135a of adapter 125a. The data is copied from adapter memory 135a into receive buffer 220a, and from there it is copied into user buffer 200.
The copy mode transport mechanism provides an efficient way of sending and receiving messages having relatively small amounts of data between processors, because this mechanism traditionally requires little time to set up the data transfer operation. However, for larger amounts of data, the copying time becomes excessive for the intermediate steps of copying the data from the user buffer 200 to the send buffer 210 on the send side, and from the receive buffer 220 to the user buffer 202 on the receive side. For this reason, various methods have been proposed for transferring data between processors which omit these intermediate steps of copying the data. Such methods are known generally as “zero copy” transport mechanisms. An example of such zero copy transport mechanism is shown in
However, there are certain resources over which even the operating system is not given control. These resources are considered “super-privileged” and are managed by a Hypervisor layer 450 which operates below each of the operating systems. The Hypervisor 450 controls the particular resources of the hardware 460 allocated to each logical partition according to control algorithms, such resources including particular tables and areas of memory that the Hypervisor 450 grants the operating system of the particular logical partition access to use. The computing system hardware 460 includes the CPU, its memory (not shown) and the adapter 125. The hardware typically reserves some of its resources for its own purposes and allows the Hypervisor to use or allocate the rest of its resources, for example, to each logical partition.
Within each logical partition, the user is free to select the user space applications and protocols that are compatible with the particular operating system in that logical partition. Typically, end user applications operate above other user space applications used for communication and handling of data. For example, in LPAR 2, the operating system 402b is AIX, and the communication protocol layers HAL 404, LAPI 406 and MPI 408 operate thereon in the user space of the logical partition. One or more end user applications operate above the MPI layer 408. On the other hand, in LPAR 4, the operating system 402c is LINUX, and the communication protocol layers KHAL 410 (kernel version hardware abstraction layer), KLAPI 412 (kernel version LAPI) and GPFS 414 (“General Parallel File System”) operate thereon in the user space of the logical partition. Other logical partitions may use other operating systems and/or other communication protocol stacks, such as Transmission Control Protocol (TCP) 420 and Internet Protocol (IP) 422 in LPAR 3, and Asynchronous Transfer Mode (ATM) 430 over an upper layer protocol (ULP) 432 in LPAR 5. Still another combination may run in an LPAR N, such as Internet Small Computer System Interface (iSCSI) 440 operating over an upper layer protocol (ULP) 442 and HAL 444.
One difficulty of conventional zero copy transport mechanisms is the setup time required to prepare a message to be sent. This will be described with respect to
Thereafter, once the necessary resources are allocated, as shown at 520, address translation for converting virtual addresses to physical addresses must be performed to prepare the message to be sent. This step is carried out in units of “pages”, a page being a common unit of data, typically accessed by one transfer instruction. Conventionally, a page contains 4K bytes of data. The pages to be translated are identified from the virtual (starting) address and the message length provided by the initial message request.
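As a rough illustration of this identification step, the following sketch (the names are hypothetical; the specification itself supplies no code) computes how many conventional 4K pages a message touches, given the virtual starting address and message length from the message request:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u   /* conventional 4K page */

    /* Count the 4K pages spanned by a message, given the virtual
     * (starting) address and length supplied with the message request. */
    static uint64_t pages_spanned(uint64_t vaddr, uint64_t length)
    {
        uint64_t first = vaddr / PAGE_SIZE;
        uint64_t last  = (vaddr + length - 1) / PAGE_SIZE;
        return last - first + 1;
    }

    int main(void)
    {
        /* A 16M message starting on a page boundary spans 4096 pages. */
        printf("%llu\n",
               (unsigned long long)pages_spanned(0x100000, 16u << 20));
        return 0;
    }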
For this address translation step, two operations are actually required. The first required operation is to “pin” each page of the data to be transferred by the message. To “pin” a page means to lock its location, i.e., to fix the relationship between the virtual address and the physical address so that no other entity, such as the virtual memory manager (VMM), can move the page to a different physical address, e.g., by “paging out” that page from the local memory 115 of a processor 110 to a storage device 140 (
These are time intensive operations, as will be apparent from the following. Regardless of the size of the message to be sent, addresses need to be pinned on the basis of pages, and pages are 4K bytes in size. As used herein, “byte” means eight bits and is denoted “B”, “K” means the number 1024, and “M” means the number K², i.e., 1024×1024, which, multiplied out, is 1,048,576. Similarly, “G” means the number K×M, i.e., 1024M, which can be expressed as 1024×1024×1024 = 1,073,741,824. The quantities “K” and “M” are conveniently used to refer to amounts of bytes of data and other units of information handled by computers.
When the amount of data to be transferred by a message is 16M, which is 4096 pages, i.e., 4K pages of 4K bytes each, these pinning and address translation operations require that the chain of PTE tables be traversed a great number of times. Since each of the 4K (i.e., 4096) addresses must be looked up by way of the first table 700 and then by Table 2 (710) in the pinning operation, a total of 8K lookups are performed to pin the addresses. Then, in the translating operation, the PTE must be fetched for each of the 4K addresses by way of the first table 700 and then by Table 2 (710). Here, the two tables are traversed a total of 8K times to fetch the PTEs. In total, 16K table traversals are performed to pin and translate addresses for the 16M message.
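The counting can be made explicit in a few lines (a sketch of the arithmetic only, not of any actual table-walking code; the variable names are illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* A 16M message with 4K pages, where each page lookup walks
         * a chain of two tables (the first table 700, then Table 2). */
        unsigned pages      = (16u << 20) / (4u << 10); /* 4096 pages       */
        unsigned pin_walks  = pages * 2;                /* 8K walks to pin  */
        unsigned xlat_walks = pages * 2;                /* 8K to fetch PTEs */
        printf("total traversals: %u\n", pin_walks + xlat_walks); /* 16K */
        return 0;
    }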
Therefore, from the foregoing, it is apparent that inefficiencies exist in prior art methods of transmitting messages which need to be addressed.
According to an aspect of the invention, a method is provided for facilitating zero-copy communications between computing systems of a group of computing systems. The method includes allocating, in a first computing system of the group of computing systems, a pool of privileged communication resources from a privileged resource controller to a communications controller. The communications controller designates the privileged communication resources from the pool for use in handling individual ones of the zero-copy communications, thereby avoiding a requirement to obtain individual ones of the privileged resources from the owner of the privileged resources at setup time for each zero-copy communication.
According to another aspect of the invention, a machine-readable recording medium is provided having instructions thereon for performing a method of facilitating zero-copy communications between computing systems of a group of computing systems, in which the method includes allocating, in a first computing system of the group of computing systems, a pool of privileged communication resources from a privileged resource controller to a communications controller. The communications controller designates the privileged communication resources from the pool for use in handling individual ones of the zero-copy communications, thereby avoiding a requirement to obtain individual ones of the privileged resources from the owner of the privileged resources at setup time for each zero-copy communication.
According to yet another aspect of the invention, a communications resource controller is provided which is operable to facilitate zero-copy communications between computing systems of a group of computing systems. The communications resource controller includes means for allocating, in a first computing system of the group of computing systems, a pool of privileged communication resources from a privileged resource controller, and means for designating ones of the privileged communication resources from the pool for use in servicing the zero-copy communications, so as to avoid a requirement to obtain individual ones of the privileged resources from the privileged resource controller at setup time for each respective zero-copy communication.
The recitation herein of a list of desirable objects which are met by various embodiments of the present invention is not meant to imply or suggest that any or all of these objects are present as essential features, either individually or collectively, in the most general embodiment of the present invention or in any of its more specific embodiments.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of practice, together with further objects and advantages thereof, may best be understood by reference to the following description taken in connection with the accompanying drawings in which:
Accordingly, in the embodiments of the invention described herein, the prior art inefficiencies of transmitting messages between processors of a system or over a network are addressed. Inefficiencies are addressed as follows. A local “master controller” is established for each logical partition of a processor, having the function of assigning privileged communication resources to user applications for their use in transmitting messages via a zero copy mechanism. By the master controller assigning the communication resources, time-consuming resource allocation requests to the operating system, the Hypervisor and to the adapter can be avoided.
The master controller is implemented partly in a lower layer application programming interface and partly in a device driver (DD) of the operating system. Pools of privileged and super-privileged communication resources are allocated to the master controller from resources owned by the Hypervisor, the operating system and the adapter at the time of initialization, e.g., at the time of initial program load (IPL). The pools of resources include particular regions of memory, channels, translation tables, miscellaneous tables, and data structures of the operating system kernel. The master controller monitors the available resources in the pools and dynamically maintains the number of resources available according to targets.
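One plausible shape for this bookkeeping is sketched below; the structure and field names are illustrative assumptions rather than anything specified herein:

    #include <stddef.h>

    /* Illustrative pool bookkeeping for the master controller.  Each pool
     * holds privileged resources granted at IPL by the Hypervisor, the
     * operating system or the adapter, and is replenished toward a target. */
    struct resource_pool {
        void  **items;      /* the pooled resources (channels, tables, ...) */
        size_t  available;  /* how many are currently free                  */
        size_t  target;     /* level the master controller tries to hold    */
    };

    struct master_controller {
        struct resource_pool channels;        /* from the adapter        */
        struct resource_pool trans_tables;    /* from the Hypervisor     */
        struct resource_pool misc_tables;     /* from the Hypervisor     */
        struct resource_pool data_structs;    /* translation-info caches */
        struct resource_pool kernel_structs;  /* from the OS kernel      */
    };

Holding a per-pool target lets the controller replenish or shed resources ahead of demand rather than during message setup.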
Static assignments of particular combinations of communication resources are avoided. In an embodiment of the invention, memory is allocated to user applications for zero copy messaging through a mechanism such as “malloc”. “Malloc” operations are handled by the master controller rather than the operating system. In a malloc operation, the master controller allocates a particular data buffer to a user application. Such data buffer can then be referenced in a subsequent message request by the user application to perform a zero copy communication. In response to the message request, the master controller assigns a channel from the pool of channels that it maintains, a translation table from the pool of translation tables it maintains, miscellaneous tables, and a data structure from the respective pools that it maintains. In an embodiment, the master controller assigns each resource independently of its assignment of any other resource, except that the resources must correspond to each other in size. Resource contention is reduced in this way by not requiring fixed combinations of resources and by allowing any resource of the requisite size to be assigned for use in satisfying a particular message request.
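A minimal sketch of this independent, size-matched assignment follows (hypothetical types and names; analogous helpers would serve the translation-table, miscellaneous-table and data-structure pools):

    #include <stddef.h>
    #include <stdbool.h>

    struct channel { size_t capacity; bool in_use; };

    /* Take the first free channel large enough for the request.  Any
     * adequately sized channel will do, so no fixed pairing of a channel
     * with a particular translation table is ever required. */
    static struct channel *grab_channel(struct channel *pool, size_t n,
                                        size_t need)
    {
        for (size_t i = 0; i < n; i++)
            if (!pool[i].in_use && pool[i].capacity >= need) {
                pool[i].in_use = true;
                return &pool[i];
            }
        return NULL;  /* pool exhausted: replenish from the owner */
    }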
Address translation is avoided, when possible, by the user application referencing the same previously allocated data buffer as the source data for successive message requests. In such a case, the master controller is able simply to reference a data structure containing translation information for one or more previously sent messages, and thereby avoid performing address translation. The data structure thus represents a “cache” containing translation information for a data buffer which has been previously referenced in a message request. An example of such translation information is a pointer to a PTE entry in the PTE table. In one embodiment, the master controller also examines use data for each data structure, retaining the data structures which correspond to more recently referenced data buffers and discarding those whose data buffers have not been recently referenced.
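A lookup against such a cache might be sketched as follows (the entry layout is an assumption made for illustration):

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative cache entry: translation information retained from a
     * previously sent message over the same data buffer. */
    struct xlat_cache_entry {
        uint64_t  vaddr;      /* start of the data buffer           */
        size_t    length;     /* extent previously translated       */
        uint64_t *pte_ref;    /* e.g., pointer to a PTE table entry */
        unsigned  use_count;  /* recency/frequency of reference     */
    };

    /* If the referenced buffer was translated before, reuse the cached
     * translation information and skip the PTE-table walk entirely. */
    static struct xlat_cache_entry *
    lookup_translation(struct xlat_cache_entry *cache, size_t n,
                       uint64_t vaddr, size_t length)
    {
        for (size_t i = 0; i < n; i++)
            if (cache[i].vaddr == vaddr && cache[i].length >= length) {
                cache[i].use_count++;
                return &cache[i];
            }
        return NULL;  /* miss: address translation must be performed */
    }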
If translation information for the data buffer referenced by the requested message is not available from previously performed address translation, then low-cost techniques are employed for performing translations as necessary and for passing the translation information to the adapter.
Thus,
With combined reference to
At this point, a description of the differences between data buffers of different sizes would be helpful. Smaller data buffers, e.g., data buffers up to 16M in size, are each mapped according to a conventional page size of 4K bytes per page. However, larger data buffers, e.g., those of 32M and larger, can be mapped to large pages, e.g., pages of 16M each. Page translation of large data buffers according to such “large pages” is more efficient because much less time is spent performing address translation. As an example, for a 32M data buffer in a particular memory region, when the page size is 4K, at least 16K traversals of the PTE table are required to perform address translation. This is because, as discussed above relative to
However, when the page size is increased to 16M, this number of table traversals is reduced to only two traversals of the PTE table. It is evident that as the size of the data buffer is increased to a large size such as 256M, the number of PTE table traversals using a 4K page size can become prohibitive. Accordingly, such large data buffers are desirably mapped to large page sizes such as 16M.
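The selection rule just described reduces to a one-line test (the 32M cut-over is from the text; the helper name is hypothetical):

    #include <stdint.h>

    #define SMALL_PAGE (4u << 10)    /* 4K  */
    #define LARGE_PAGE (16u << 20)   /* 16M */

    /* Map buffers of 32M and up with 16M pages; a 32M buffer then needs
     * only two page translations instead of 8K of them. */
    static uint32_t page_size_for(uint64_t buf_len)
    {
        return (buf_len >= (32ull << 20)) ? LARGE_PAGE : SMALL_PAGE;
    }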
Further resources allocated to the master controller include channels allocated from adapter resources, such as CHAN 1, CHAN 2, . . . CHAN N. In addition to the channels, tables are also allocated to the master controller from the Hypervisor, such tables including translation tables TTBL 1, TTBL 2, etc., used for posting translation information, and other miscellaneous tables. Additionally, a pool of data structures DS 1, DS 2, etc., is allocated, the data structures to be used to contain translation information for the addresses of the most recently and/or most frequently transferred data of user applications in that logical partition. The data structures also contain information including use counts from which it is determined which data structures should be retained by the master controller, and which can be purged.
The data structures can be viewed as containing address translation information for much of the “working set” of the data that is referenced in message requests by user applications in a particular logical partition. Ideally, the ratio of the translation information contained in the data structures to the data actually being referenced in message requests should be high. In such case, the data structures serve as a type of cache for translation information relating to data that is frequently being passed in messages from one processor to another over the switch 130. The master controller's assignment of data buffers to user applications, and the applications' use of those buffers, should be arranged such that the data buffers represent relatively small areas of memory, so that those areas are more likely to be referenced repeatedly in messages.
Referring again to
Thereafter, in operation, the master controller monitors the resources available in each pool, as shown at 830. Certain resources such as channels and translation tables are used only once by a particular user application, e.g., MPI or GPFS, during the sending of a particular message, and are then returned to the master controller for reassignment in response to another message request. Therefore, these resources remain available after each use. However, certain other resources, such as the data buffers and data structures, can be assigned to a user application and then used by that application over a longer period of time. In such case, at step 840 the master controller determines which of the resources are still needed. The master controller does this by determining which of the resources have been used most recently or most frequently, and which others, by contrast, have not. Those resources which have not been used recently or frequently are returned (step 850) to the corresponding pools for re-allocation according to subsequent requests. In doing so, the master controller informs the user application that the resource has been de-allocated. In addition, if the monitoring indicates that the number of such resources in a pool exceeds what the master controller expects to need for subsequent message requests, the surplus resources are returned to the privileged resource owner, i.e., the Hypervisor, operating system and/or adapter.
Also, as indicated at step 860, the master controller monitors the amount of resources available in the pools, and if it appears that an additional resource will be needed soon, then the master controller requests its allocation, and the Hypervisor, operating system, and/or the adapter then allocate the requested resource, as indicated at 870. The arrow that closes the loop to step 810 at which the master controller assigns a data buffer to a user application indicates that the master controller performs an ongoing process of monitoring the use of and re-assigning resources for messages from the pools. Likewise, the master controller also obtains privileged resources to add to the pools from the owners (Hypervisor, operating system, adapter) of the resources as needed, and returns them when no longer needed.
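The monitoring loop over one pool of long-lived resources (data buffers, data structures) might be sketched as follows; the aging rule and thresholds are illustrative assumptions:

    #include <stddef.h>
    #include <stdbool.h>

    struct pooled_res {
        bool     assigned;   /* currently held by a user application */
        unsigned use_count;  /* decayed count of recent references   */
    };

    static void monitor_pool(struct pooled_res *pool, size_t n,
                             size_t expected_demand, unsigned stale_below)
    {
        size_t free_count = 0;
        for (size_t i = 0; i < n; i++) {
            if (pool[i].assigned && pool[i].use_count < stale_below) {
                pool[i].assigned = false;  /* reclaim; inform the user app */
            }
            pool[i].use_count /= 2;        /* age the usage statistics     */
            if (!pool[i].assigned)
                free_count++;
        }
        if (free_count > expected_demand) {
            /* return the surplus to the Hypervisor, OS or adapter */
        } else if (free_count == 0) {
            /* request an additional allocation from the owner     */
        }
    }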
A method of transmitting a message by way of a zero copy mechanism will now be described with respect to
As discussed above, the data structure holds a number of TCEs corresponding to the number of page addresses that were referenced by a previous message. The data structure also includes use counts indicating which TCEs have been used most frequently and/or most recently. Those TCEs which have been used less frequently or recently are discarded by overwriting them with more recently used TCEs. However, in each case, the master controller associates one virtual address and one continuous range of memory with each data structure. If all of the TCEs for the message to be transmitted already exist in a data structure, then the translation table is loaded with the TCEs from the data structure. On the other hand, some TCEs for the message to be transmitted may exist in a data structure for a previously transmitted message while others do not. In that case, the existing TCEs for that part of the message are placed in the translation table, and only those addresses which have not been previously translated are now translated.
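The partial-hit behavior described here can be sketched as follows (the layouts and helper names are assumptions; a TCE is shown as a 64-bit value purely for illustration):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    typedef uint64_t tce_t;  /* one translation control entry */

    /* Fill the translation table for a message of npages pages, taking
     * each TCE from the data structure when it is already cached and
     * translating only the pages that are not. */
    static void load_translation_table(tce_t *ttbl,
                                       const tce_t *cached,
                                       const bool *valid,
                                       size_t npages,
                                       tce_t (*translate_page)(size_t))
    {
        for (size_t p = 0; p < npages; p++)
            ttbl[p] = valid[p] ? cached[p]          /* cache hit: reuse        */
                               : translate_page(p); /* miss: pin and translate */
    }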
When address translation still needs to be performed, desirably, one or more time-saving techniques are used to obtain and provide the translation information to the adapter in an efficient way, as indicated at 1035. One technique is to reduce the number of traversals of the PTE table that are required to pin and translate each page of the data to be sent. As discussed above, one way to do this is to assign data buffers that are mapped according to large pages, e.g., 16M pages, when assigning very large data buffers, e.g., those of size 32M and greater. In such case, the number of traversals of the PTE table is reduced by a factor of 4K. Thus, for a large region of memory such as the 32M example discussed above, while 8K traversals of the PTE table are ordinarily performed when the data buffer is mapped to 4K size pages, only two traversals of the PTE table are required when the data buffer is mapped to 16M size pages.
According to an embodiment of the invention, another technique provided to reduce the number of traversals of the PTE table is the use of simultaneous pinning and translation of the pages of the message to be sent. As discussed above relative to
This is further highlighted by returning to the previous example described as background of the invention. By the prior art method, when the message payload length is 16M, the number of times that the PTE table is traversed to pin each address is once for every page, which is 16M/4K, i.e., 4K times. Since traversing the PTE table requires traversing a chain of at least two tables, 8K table traversals are required to pin the addresses. As also described above, an additional 8K table traversals are required to translate the addresses. Thus, a total of 16K table traversals are required to perform the necessary address translation for a 16M message. However, by the method according to this embodiment of the invention, since the PTE table is traversed only once instead of twice, the number of table traversals is reduced by half, to 8K.
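The saving can again be made concrete with a few lines of arithmetic (a counting sketch only; names are illustrative):

    #include <stdio.h>

    int main(void)
    {
        unsigned pages = (16u << 20) / (4u << 10);  /* 4096 pages          */
        unsigned chain = 2;                         /* two-table PTE chain */

        unsigned separate = pages * chain * 2;  /* pin pass + translate pass   */
        unsigned combined = pages * chain;      /* one pass pins and translates */
        printf("separate: %uK, combined: %uK\n",
               separate >> 10, combined >> 10);     /* 16K vs. 8K */
        return 0;
    }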
In one embodiment, another technique used to reduce the time associated with translating addresses is to pack the translation information in the translation table. Referring to
By contrast, in this embodiment of the invention, eight TCEs are packed into each 128-byte wide area of the translation table, such that when each 128-byte transfer occurs along the bus 112, eight TCEs are transferred from the processor 110 to the adapter 125. Accordingly, transfer of packed TCEs along the processor-adapter bus 112 in this manner represents an eight-fold increase in the transfer rate of TCEs to the adapter.
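A sketch of the packing follows; the 16-byte TCE width is inferred from “eight TCEs per 128-byte area” and is an assumption:

    #include <stdint.h>
    #include <string.h>

    #define BUS_XFER     128u                    /* bytes per bus transfer */
    #define TCE_SIZE     16u                     /* assumed TCE width      */
    #define TCE_PER_XFER (BUS_XFER / TCE_SIZE)   /* 8 TCEs per transfer    */

    struct tce { uint8_t bytes[TCE_SIZE]; };

    /* Pack TCEs contiguously so that each 128-byte transfer on bus 112
     * carries eight of them rather than one. */
    static void pack_tces(uint8_t *xfer_buf, const struct tce *tces,
                          unsigned n)
    {
        for (unsigned i = 0; i < n && i < TCE_PER_XFER; i++)
            memcpy(xfer_buf + i * TCE_SIZE, &tces[i], TCE_SIZE);
    }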
As shown at step 1040, once the translation information is ready, the adapter is notified that there is data to be sent, and the translation table (TTBL) to be used is identified to the adapter. Other information such as the channel to be used is also identified to the adapter at this time. Thereafter, at step 1050, the adapter stores the contents of the translation table to its own memory, and then, at step 1060, the adapter transmits the message over the allocated channel across the switch to the receiving processor. This is the usual process used when a message has an average length, e.g., of 1M.
However, occasions exist where it is desirable to handle a request to send a large amount of payload data by sending the payload data as two or more messages each carrying a portion of the payload data, at least some messages of which can be transmitted simultaneously. This way of handling a message request is called “striping.” Referring to
As discussed above as background to the invention, despite the advantages of zero copy messaging for larger size messages, the amount of setup time required therefor makes zero copy messaging too costly for smaller size messages. While the improvements described herein seek to reduce the setup time required for messaging by way of a zero copy mechanism, there is still a crossover point in the size of the message to be transmitted at which it would take less time to transmit the message by way of a copy mode mechanism rather than the zero copy mechanism. In the embodiment of the invention illustrated in
In an embodiment of the invention, at least N−1 (one less than N) messages each have the same data payload length that is determined according to a desired striping size, and one message contains the remainder of the data. However, in another embodiment of the invention, striping is performed using messages having different data payload lengths. For example, one message request to transmit a message of length 8M could be striped according to the one embodiment as eight messages each having a data payload length of 1M. In another embodiment, as an example, an 8M message could be striped as four messages, one having a data payload length of 4M, one message having a data payload length of 2M, and the other two messages having a data payload length of 1M each.
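The first of these embodiments, N−1 equal stripes plus a remainder, might be planned as follows (hypothetical helper name):

    #include <stdio.h>

    /* Split a payload of len bytes into stripes of size stripe: all
     * stripes but the last are full-size, and the last carries the
     * remainder (or a full stripe when len divides evenly). */
    static unsigned plan_stripes(unsigned long long len,
                                 unsigned long long stripe,
                                 unsigned long long *last)
    {
        unsigned n = (unsigned)(len / stripe);
        *last = len % stripe;
        if (*last == 0)
            *last = stripe;      /* evenly divisible */
        else
            n += 1;
        return n;
    }

    int main(void)
    {
        unsigned long long last;
        unsigned n = plan_stripes(8ull << 20, 1ull << 20, &last);
        printf("%u stripes, last = %lluM\n", n, last >> 20); /* 8 stripes, 1M */
        return 0;
    }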
In one embodiment, the threshold used to determine whether a requested message having a data payload length L should be striped as a plurality N of zero copy mode messages each having a data payload length L/N is based on the relation between the amount of setup time T_S needed to prepare the requested message for transmission as the N striped messages and the transit time T_TR of the requested message across the bus 112 (
T_TR = L/bus rate = 0.5M/1.5 GBs ≈ 340 μsecs.
By the above relation, the threshold for striping the 0.5M message as two zero copy mode messages is that the setup time T_S for preparing the two striped messages is less than ½ of the transit time. Specifically, in this example, the setup time T_S for preparing the two striped messages, each having length 256K, should be less than 170 μsecs for the requested 0.5M length message to be striped. Using techniques provided according to the embodiments of the invention, the setup time T_S for preparing striped zero copy mode messages having 256K lengths is reduced to about 120 μsecs. Such small setup time applies, for example, when the message is able to be sent without requiring address translation because the data buffer referenced by the message request has already been translated and a data structure contains the needed translation information.
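Expressed as a predicate, for the two-way case worked in this example (names are illustrative):

    #include <stdbool.h>

    /* Stripe a request into two zero copy messages only when the setup
     * time for the two striped messages is under half the transit time
     * of the full message (170 usecs for the 0.5M example above). */
    static bool should_stripe_in_two(double setup_us, double transit_us)
    {
        return setup_us < transit_us / 2.0;
    }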
The time required to send data via each of the copy mode and zero copy transport mechanisms will now be described. An equation for the copy mode transfer time TC to send a message of length L via a copy mode mechanism is:
T_C = m_C L + C_C
where m_C is the time interval per byte corresponding to the copy rate, for example 1/(1 GBs) (gigabyte/sec), and C_C is a constant time interval, e.g., 40 μsecs, to account for latency in copying the data into the FIFO pinned memory and latency in handshaking across the bus 112 (
Thus, the copy mode transfer time for a 0.5 M length message is determined as
T_C = 0.5M/1 GBs + 40 μsecs = 488 + 40 = 528 μsecs.
Bandwidth is a measure of the amount of data transmitted per unit of time. Therefore, for this message having a particular length of 0.5M, the copy mode bandwidth is 0.5M/528 μsecs = 947 MBs (megabytes/sec).
On the other hand, the zero copy transfer time is determined by another equation as follows:
T_Z = m_Z L + K_Z + C(L)
where m_Z is the time interval per byte corresponding to the bus transfer rate, for example 1/(1.5 GBs) (gigabyte/sec), and K_Z is a constant time interval, e.g., 60 μsecs, to account for latency in obtaining the needed resources, e.g., translation table, channel, etc., and C(L) is an amount of time which varies according to the amount of data to be transferred. It generally takes longer to perform the necessary translations for a larger amount of data than it does for a smaller amount of data. C(L) accounts for this variable element of the time. In an example, for a message having a payload length of 0.5M, the numbers are as follows:
T_Z = 0.5M/1.5 GBs + 60 μsecs + 80 μsecs = 326 μsecs + 140 μsecs = 466 μsecs.
The corresponding bandwidth is 0.5M/466 μsecs = 1072 MBs. Thus, in this example, since T_Z is lower than T_C and the zero copy bandwidth BW_Z is higher than the copy mode bandwidth BW_C, the decision should be to use a zero copy mechanism to send the message. On the other hand, if the message payload length is smaller, such as 200K bytes, for example, the above equations would lead to the opposite result, i.e., that the copy mode transport mechanism should be employed to transfer the message rather than the zero copy mechanism.
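The whole comparison is easily mechanized. The sketch below plugs the example's constants (C_C = 40 μsecs, K_Z = 60 μsecs, C(L) = 80 μsecs for this message size) into the two equations and picks the faster mechanism; the function names are illustrative:

    #include <stdio.h>

    #define MiB (1024.0 * 1024.0)
    #define GiB (1024.0 * MiB)

    /* Copy-mode and zero-copy transfer-time models from the text:
     *   T_C = m_C * L + C_C
     *   T_Z = m_Z * L + K_Z + C(L)
     * Times in microseconds; lengths in bytes; rates in bytes/second. */
    static double t_copy(double len, double copy_rate)
    {
        return len / copy_rate * 1e6 + 40.0;          /* C_C = 40 usecs */
    }

    static double t_zero(double len, double bus_rate, double c_of_l)
    {
        return len / bus_rate * 1e6 + 60.0 + c_of_l;  /* K_Z = 60 usecs */
    }

    int main(void)
    {
        double len = 0.5 * MiB;
        double tc = t_copy(len, 1.0 * GiB);           /* ~528 usecs */
        double tz = t_zero(len, 1.5 * GiB, 80.0);     /* ~466 usecs */
        printf("use %s\n", tz < tc ? "zero copy" : "copy mode");
        return 0;
    }

With a 200K payload the same functions reverse the decision, matching the observation above.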
In another example, in a particular computing system, the copy rate is 1.7 GBs and the bus transfer rate is 2 GBs. Plugging these rates into the above equations,
the copy mode transfer time becomes:
T_C = 0.5M/1.7 GBs + 40 μsecs = 287 μsecs + 40 μsecs = 327 μsecs.
and the zero copy transfer time becomes:
T_Z = 0.5M/2 GBs + 60 μsecs + 80 μsecs = 244 μsecs + 140 μsecs = 384 μsecs.
Under these conditions, the setup time for the zero copy transfer is a greater factor in the equations. Therefore, in this case, the threshold for choosing the zero copy transfer mode should be set higher than 0.5M.
In a further example, it is assumed that a 2M message is to be sent and that the total setup time for the zero copy mode message is now 200 μsecs instead of 140 μsecs as before. In that case, the message should be sent via a zero copy transport mechanism because the copy mode transfer time becomes:
T_C = 2M/1.7 GBs + 40 μsecs = 1148 μsecs + 40 μsecs = 1188 μsecs
and the zero copy transfer time becomes:
T_Z = 2M/2 GBs + 200 μsecs = 976 μsecs + 200 μsecs = 1176 μsecs,
which is less than the copy mode transfer time.
However, certain conditions may change during operation of the processor, such as when the processor is under high demand and resources take longer to obtain. Under such conditions, the fixed and variable amounts of time required to set up a zero copy message may increase, and the bandwidth monitoring facility 1210 may detect a decrease in the zero copy bandwidth BW_Z to a level below the copy mode bandwidth BW_C for messages having a particular size that is close to the threshold level. In such case, control is exerted, as shown at 1220, to adjust the threshold to a new value which is more appropriate to the current conditions. Thereafter, the new value is used for deciding whether a zero copy transport mechanism or a copy mode mechanism should be used. In an embodiment, the monitoring of such bandwidths is not based on just one measurement at each interval, but rather on a collection of measurements taken over time. In such case, the bandwidth measurement for each mode of transmission represents a filtering of such measurements. For example, a simple moving average formula can be applied to average the measurements over a most recent interval of interest, e.g., ten sampling intervals.
As discussed above, a sampling interval for zero copy operation may be that required for transmitting 64 packets, each packet containing 2K bytes. In such a case, the interval needed to transfer the 128K bytes is approximately 81 μsecs at the bus transfer rate of 1.5 GBs. Averaging is then performed over an interval of 10 samples, which takes 10 × 81 μsecs = 810 μsecs. However, in an embodiment, the more recent measurements are weighted more heavily, e.g., the most recent of the 10 sampling intervals count for much more in the moving average, such that the moving average is more reflective of the most recent interval than of the measurements taken earlier. For the copy mode mechanism, since the message length is usually smaller than for zero copy mechanisms, the sampling interval is preferably made somewhat shorter than the 81 μsecs interval used for the zero copy mode. Likewise, the averaging interval can be made correspondingly shorter than the 810 μsecs example interval for zero copy mechanisms.
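A weighted moving average of this kind might be computed as follows; the linear weighting is an illustrative assumption, since the text specifies only that recent samples count more:

    /* Weighted moving average over the last n bandwidth samples
     * (oldest first), weighting the most recent samples most heavily. */
    static double weighted_bw(const double *samples, int n)
    {
        double sum = 0.0, wsum = 0.0;
        for (int i = 0; i < n; i++) {
            double w = (double)(i + 1);  /* newest sample weighs most */
            sum  += w * samples[i];
            wsum += w;
        }
        return sum / wsum;
    }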
In addition, provision is made for varying the interval at which the bandwidth is monitored at step 1210. It is recognized that different system conditions could cause the zero copy bandwidth and the copy mode bandwidth to sometimes vary only slowly, while varying more rapidly at other times. In recognition of this, in one embodiment, it is a goal to sample the bandwidth at a rate sufficient to fully determine the frequency at which the raw bandwidth measurements vary. From sampling theory, in order to obtain complete data for determining that frequency, the Nyquist criterion must be satisfied, i.e., the sampling rate must be higher than twice the maximum rate at which the bandwidth measurements vary. Moreover, since the rates of change of the copy mode bandwidth and the zero copy bandwidth themselves change over time, in this embodiment the sampling rate is also varied over time, according to observed system conditions.
While the invention has been described with reference to certain preferred embodiments, those skilled in the art will recognize the many modifications and enhancements which can be made without departing from the true scope and spirit of the invention, which is limited only by the claims appended below.