The present invention relates in general to I/O adapters, and particularly to memory management in I/O adapters.
Computer networking is now ubiquitous. Computing demands require ever-increasing amounts of data to be transferred between computers over computer networks in shorter amounts of time. Today, there are three predominant computer network interconnection fabrics. Virtually all server configurations have a local area network (LAN) fabric that is used to interconnect any number of client machines to the servers. The LAN fabric interconnects the client machines and allows the client machines access to the servers and perhaps also allows client and server access to network attached storage (NAS), if provided. The protocol most commonly employed today for a LAN fabric is TCP/IP over Ethernet. A second type of interconnection fabric is a storage area network (SAN) fabric, which provides for high-speed access to block storage devices by the servers. The protocol most commonly employed today for a SAN fabric is Fibre Channel. A third type of interconnection fabric is a clustering network fabric. The clustering network fabric is provided to interconnect multiple servers to support such applications as high-performance computing, distributed databases, distributed data storage, grid computing, and server redundancy. Although it was hoped by some that INFINIBAND would become the predominant clustering protocol, this has not happened so far. Many clusters employ TCP/IP over Ethernet as their interconnection fabric, and many other clustering networks employ proprietary networking protocols and devices. A clustering network fabric is characterized by a need for very high transmission speed and low latency.
It has been noted by many in the computing industry that a significant performance bottleneck associated with networking in the near term will not be the network fabric itself, as has been the case in the past. Rather, the bottleneck is now shifting to the processor in the computers themselves. More specifically, network transmissions will be limited by the amount of processing required of a central processing unit (CPU) to accomplish network protocol processing at high data transfer rates. Sources of CPU overhead include the processing operations required to perform reliable connection networking transport layer functions (e.g., TCP/IP), perform context switches between an application and its underlying operating system, and copy data between application buffers and operating system buffers.
It is readily apparent that this processing overhead must be offloaded from the processors and operating systems within a server configuration in order to alleviate the performance bottleneck associated with current and future networking fabrics. One way in which this has been accomplished is by providing a mechanism for an application program running on one computer to transfer data from its host memory across the network to the host memory of another computer. This operation is commonly referred to as a remote direct memory access (RDMA) operation. Advantageously, RDMA largely eliminates the need for the operating system running on the server CPU to copy the data from application buffers to operating system buffers and vice versa. RDMA also drastically reduces the latency of an inter-host memory data transfer by reducing the amount of context switching between the operating system and application.
Two examples of protocols that employ RDMA operations are INFINIBAND and iWARP, each of which specifies an RDMA Write and an RDMA Read operation for transferring large amounts of data between computing nodes. The RDMA Write operation is performed by a source node transmitting one or more RDMA Write packets including payload data to the destination node. The RDMA Read operation is performed by a requesting node transmitting an RDMA Read Request packet to a responding node and the responding node transmitting one or more RDMA Read Response packets including payload data. Implementations and uses of RDMA operations are described in detail in the following documents, each of which is incorporated by reference in its entirety for all intents and purposes:
Essentially all commercially viable operating systems and processors today provide memory management. That is, the operating system allocates regions of the host memory to applications and to the operating system itself, and the operating system and processor control access by the applications and the operating system to the host memory regions based on the privileges and ownership characteristics of the memory regions. An aspect of memory management particularly relevant to RDMA is virtual memory capability. A virtual memory system provides several desirable features. One example of a benefit of virtual memory systems is that they enable programs to execute with a larger virtual memory space than the existing physical memory space. Another benefit is that virtual memory facilitates relocation of programs in different physical memory locations during different or multiple executions of the program. Another benefit of virtual memory is that it allows multiple processes to execute on the processor simultaneously, each having its own allocated physical memory pages to access without having to be swapped in from disk, and without having to dedicate the full physical memory to one process.
In a virtual memory system, the operating system and CPU enable application programs to address memory as a contiguous space, or region. The addresses used to identify locations in this contiguous space are referred to as virtual addresses. However, the underlying hardware must address the physical memory using physical addresses. Commonly, the hardware views the physical memory as pages. A common memory page size is 4 KB. Thus, a memory region is a set of memory locations that are virtually contiguous, but that may or may not be physically contiguous. As mentioned, the physical memory backing the virtual memory locations typically comprises one or more physical memory pages. Thus, for example, an application program may allocate from the operating system a buffer that is 64 KB, which the application program addresses as a virtually contiguous memory region using virtual addresses. However, the operating system may have actually allocated sixteen physically discontiguous 4 KB memory pages. Thus, each time the application program uses a virtual address to access the buffer, some piece of hardware must translate the virtual address to the proper physical address to access the proper memory location. An example of the address translation hardware in an IA-32 processor, such as an Intel® Pentium® processor, is the memory management unit (MMU).
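By way of illustration only, the following C sketch (hypothetical code, not taken from any specification) shows the translation implied by the example above: given a byte offset into the virtually contiguous 64 KB buffer, the hardware selects one of the sixteen physical page addresses using the upper bits of the offset and adds the remaining 12-bit byte offset.

    #include <stdint.h>

    #define PAGE_SIZE   4096u
    #define PAGE_SHIFT  12
    #define PAGE_MASK   (PAGE_SIZE - 1)

    /* Physical addresses of the sixteen (possibly discontiguous) 4 KB pages
     * that the operating system allocated to back the 64 KB buffer. */
    static uint64_t page_list[16];

    /* Translate a byte offset within the buffer to a physical address. */
    uint64_t buffer_offset_to_physical(uint64_t offset)
    {
        uint64_t page_index  = offset >> PAGE_SHIFT;  /* which backing page  */
        uint64_t byte_offset = offset & PAGE_MASK;    /* offset within page  */
        return page_list[page_index] + byte_offset;
    }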
A typical computer, or computing node, or server, in a computer network includes a processor, or central processing unit (CPU), a host memory (or system memory), an I/O bus, and one or more I/O adapters. The I/O adapters, also referred to by other names such as network interface cards (NICs) or storage adapters, include an interface to the network media, such as Ethernet, Fibre Channel, INFINIBAND, etc. The I/O adapters also include an interface to the computer I/O bus (also referred to as a local bus, such as a PCI bus). The I/O adapters transfer data between the host memory and the network media via the I/O bus interface and network media interface.
An RDMA Write operation posted by the system CPU made to an RDMA enabled I/O adapter includes a virtual address and a length identifying locations of the data to be read from the host memory of the local computer and transferred over the network to the remote computer. Conversely, an RDMA Read operation posted by the system CPU to an I/O adapter includes a virtual address and a length identifying locations in the local host memory to which the data received from the remote computer on the network is to be written. The I/O adapter must supply physical addresses on the computer system's I/O bus to access the host memory. Consequently, an RDMA requires the I/O adapter to perform the translation of the virtual address to a physical address to access the host memory. In order to perform the address translation, the operating system address translation information must be supplied to the I/O adapter. The operation of supplying an RDMA enabled I/O adapter with the address translation information for a virtually contiguous memory region is commonly referred to as a memory registration.
Effectively, the RDMA enabled I/O adapter must perform the memory management, and in particular the address translation, that the operating system and CPU perform in order to allow applications to perform RDMA data transfers. One obvious way for the RDMA enabled I/O adapter to perform the memory management is the way the operating system and CPU perform memory management. As an example, many CPUs are Intel IA-32 processors that perform segmentation and paging, as shown in
The processor calculates a virtual address (referred to in
To translate a virtual, or linear, address to a physical address, the IA-32 MMU performs the following steps. First, the MMU adds the directory index bits of the virtual address to the base address of the page directory to obtain the address of the appropriate page directory entry. (The operating system previously programmed the page directory base address of the currently executing process, or task, into the page directory base register (PDBR) of the MMU when the process was scheduled to become the current running process.) The MMU then reads the page directory entry to obtain the base address of the appropriate page table. The MMU then adds the page table index bits of the virtual address to the page table base address to obtain the address of the appropriate page table entry. The MMU then reads the page table entry to obtain the physical memory page address, i.e., the base address of the appropriate physical memory page, or physical address of the first byte of the memory page. The MMU then adds the byte offset bits of the virtual address to the physical memory page address to obtain the physical address translated from the virtual address.
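The two-level walk just described may be summarized by the following simplified C sketch. It assumes 4 KB pages and 4-byte directory and table entries (so the index is scaled by four when forming an entry address), ignores the present, accessed, and protection bits that a real IA-32 MMU also evaluates, and uses read32 as a stand-in for a host memory read.

    #include <stdint.h>

    /* Simplified IA-32-style translation: 10-bit directory index, 10-bit page
     * table index, 12-bit byte offset. */
    uint32_t ia32_translate(uint32_t pdbr,         /* page directory base (PDBR) */
                            uint32_t linear_addr,
                            uint32_t (*read32)(uint32_t phys_addr))
    {
        uint32_t dir_index   = (linear_addr >> 22) & 0x3FFu;
        uint32_t table_index = (linear_addr >> 12) & 0x3FFu;
        uint32_t byte_offset =  linear_addr        & 0xFFFu;

        uint32_t pde = read32(pdbr + dir_index * 4);              /* 1st access */
        uint32_t page_table_base = pde & 0xFFFFF000u;

        uint32_t pte = read32(page_table_base + table_index * 4); /* 2nd access */
        uint32_t page_base = pte & 0xFFFFF000u;

        return page_base + byte_offset;          /* translated physical address */
    }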
The IA-32 page tables and page directories are each 4 KB and are aligned on 4 KB boundaries. Thus, each page table and each page directory has 1024 entries, and the IA-32 two-level page directory/page table scheme can specify virtual to physical memory page address translation information for 2^20 memory pages. As may be observed, the amount of memory the operating system must allocate for page tables to perform address translation for even a small memory region (even a single byte) is relatively large. However, this apparent inefficiency is typically not as it appears because most programs require a linear address space that is larger than the amount of memory allocated for page tables. Thus, in the host computer realm, the IA-32 scheme is a reasonable tradeoff in terms of memory usage.
As may also be observed, the IA-32 scheme requires two memory accesses to translate a virtual address to a physical address: a first to read the appropriate page directory entry and a second to read the appropriate page table entry. These two memory accesses may appear to impose undue pressure on the host memory in terms of memory bandwidth and latency, particularly in light of the present disparity between CPU cache memory access times and host memory access times and the fact that CPUs tend to make frequent relatively small load/store accesses to memory. However, the apparent bandwidth and latency pressure imposed by the two memory accesses is largely alleviated by a translation lookaside buffer within the MMU that caches recently used page table entries.
As mentioned above, the memory management function imposed upon host computer virtual memory systems typically has at least two characteristics. First, the memory regions are typically relatively large virtually contiguous regions. This is mainly because most operating systems perform page swapping, or demand paging, and therefore allow a program to use the entire virtual memory space of the processor. Second, the memory regions are typically relatively static; that is, memory regions are typically allocated and de-allocated relatively infrequently. This is mainly because programs tend to run a relatively long time before they exit.
In contrast, the memory management functions imposed upon RDMA enabled I/O adapters are typically quite the opposite of those imposed upon processors with respect to the two characteristics of memory region size and allocation frequency. This is because RDMA application programs tend to allocate buffers to transfer data that are relatively small compared to the size of a typical program. For example, it is not unusual for a memory region to be merely the size of a memory page when used for inter-processor communications (IPC), such as commonly employed in clustering systems. Additionally, many application programs unfortunately tend to allocate and de-allocate a buffer each time they perform an I/O operation, rather than initially allocating buffers and re-using them, which causes the I/O adapter to receive memory region registrations much more frequently than programs are started and terminated. This application program behavior may also require the I/O adapter to maintain many more memory regions during a period of time than the host computer operating system.
Because RDMA enabled I/O adapters are typically requested to register a relatively large number of relatively small memory regions and are requested to do so relatively frequently, it may be observed that employing a two-level page directory/page table scheme such as the IA-32 processor scheme may cause the following inefficiencies. First, a substantial amount of memory may be required on the I/O adapter to store all of the page directories and page tables for the relatively large number of memory regions. This may significantly drive up the cost of an RDMA enabled I/O adapter. An alternative is for the I/O adapter to generate an error in response to a memory registration request due to lack of resources. This is an undesirable solution. Second, as mentioned above, the two-level scheme requires at least two memory accesses per virtual address translation required by an RDMA request—one to read the appropriate page directory entry and one to read the appropriate page table entry. The two memory accesses may add latency to the address translation process and to the processing of an RDMA request. Additionally, the two memory accesses impose additional memory bandwidth consumption pressure upon the I/O adapter memory system.
Finally, it has been noted by the present inventors that in many cases the memory regions registered with an I/O adapter are not only virtually contiguous (by definition), but are also physically contiguous, for at least two reasons. First, because a significant portion of the memory regions tend to be relatively small, they may be smaller than or equal to the size of a physical memory page. Second, a memory region may be allocated to an application or device driver by the operating system at a time when physically contiguous memory pages were available to satisfy the needs of the requested memory region, which may particularly occur if the device driver or application runs soon after the system is bootstrapped and continues to run throughout the uptime of the system. In such a situation in which the memory region is physically contiguous, allocating a full two-level IA-32-style set of page directory/page table resources by the I/O adapter to manage the memory region is a significantly inefficient use of I/O adapter memory.
Therefore, what is needed is an efficient memory registration scheme for RDMA enabled I/O adapters.
The present invention provides an I/O adapter that allocates a variable set of data structures in its local memory for storing memory management information to perform virtual to physical address translation depending upon multiple factors. One of the factors is whether the memory pages of the registered memory region are physically contiguous. Another factor is whether the number of non-physically-contiguous memory pages is greater than the number of entries in a page table. Another factor is whether the number of non-physically-contiguous memory pages is greater than the number of entries in a small page table or a large page table. Based on the factors, a zero-level, one-level, or two-level structure for storing the translation information is allocated. Advantageously, the smaller the number of levels, the fewer accesses to the I/O adapter memory need be made in response to an RDMA request for which address translation must be performed. Also advantageously, the amount of I/O adapter memory required to store the translation information may be significantly reduced, particularly for a mix of memory region registrations in which the size and frequency of access is skewed toward the smaller memory regions.
In one aspect, the present invention provides a method for performing memory registration for an I/O adapter having a memory. The method includes creating a first pool of a first type of page table and a second pool of a second type of page table within the I/O adapter memory. The first type of page table includes storage for a first predetermined number of entries each for storing a physical page address. The second type of page table includes storage for a second predetermined number of entries each for storing a physical page address. The second predetermined number of entries is greater than the first predetermined number of entries. The method also includes, in response to receiving a memory registration request specifying physical page addresses of a number of physical memory pages backing a virtually contiguous memory region, allocating one of the first type of page table for storing the physical page addresses, if the number of physical memory pages is less than or equal to the first predetermined number of entries, and allocating one of the second type of page table for storing the physical page addresses, if the number of physical memory pages is greater than the first predetermined number of entries and less than or equal to the second predetermined number of entries.
In another aspect, the present invention provides a method for registering a memory region with an I/O adapter, in which the memory region comprises a virtually contiguous memory range implicating a plurality of physical memory pages in a host computer coupled to the I/O adapter, and the I/O adapter includes a memory. The method includes receiving a memory registration request. The request includes a list specifying a physical page address of each of the plurality of physical memory pages. The method also includes allocating an entry in a memory region table of the I/O adapter memory for the memory region, in response to receiving the memory registration request. The method also includes determining whether the plurality of physical memory pages are physically contiguous based on the list of physical page addresses. The method also includes, if the plurality of physical memory pages are physically contiguous, forgoing allocating any page tables for the memory region, and storing a physical page address of a beginning physical memory page of the plurality of physical memory pages into the memory region table entry.
In another aspect, the present invention provides an I/O adapter for interfacing a host computer to a transport medium, in which the host computer has a memory for storing virtually contiguous memory regions each backed by a plurality of physical memory pages, and the memory regions have been previously registered with the I/O adapter. The I/O adapter includes a memory that stores a memory region table. The table includes a plurality of entries. Each entry stores an address and an indicator associated with one of the virtually contiguous memory regions. The indicator indicates whether the plurality of memory pages backing the memory region are physically contiguous. The I/O adapter also includes a protocol engine, coupled to the memory region table, which receives from the host computer a request to transfer data between the transport medium and a location specified by a virtual address within the memory region associated with one of the plurality of table entries. The virtual address is specified by the data transfer request. The protocol engine reads the table entry associated with the memory region, in response to receiving the request. If the indicator indicates the plurality of memory pages are physically contiguous, the memory region table entry address is a physical page address of one of the plurality of memory pages that includes the location specified by the virtual address.
In another aspect, the present invention provides an I/O adapter for interfacing a host computer to a transport medium, in which the host computer has a memory. The I/O adapter includes a memory region table including a plurality of entries. Each entry stores an address and a level indicator associated with a memory region. The I/O adapter also includes a protocol engine, coupled to the memory region table, which receives from the host computer a request to transfer data between the transport medium and a virtual address in a memory region in the host memory associated with an entry in the memory region table. The protocol engine responsively reads the memory region table entry and examines the entry level indicator. If the level indicator indicates two levels, the protocol engine reads an address of a page table from an entry in a page directory. The entry within the page directory is specified by a first index comprising a first portion of the virtual address. An address of the page directory is specified by the memory region table entry address. The protocol engine further reads a physical page address of a physical memory page backing the virtual address from an entry in the page table. The entry within the page table is specified by a second index comprising a second portion of the virtual address. If the level indicator indicates one level, the protocol engine reads the physical page address of the physical memory page backing the virtual address from an entry in a page table. The address of the page table is specified by the memory region table entry address. The entry within the page table is specified by the second index comprising the second portion of the virtual address.
In another aspect, the present invention provides an RDMA-enabled I/O adapter for interfacing a host computer to a transport medium, in which the host computer has a host memory. The I/O adapter includes a memory region table including a plurality of entries. Each entry stores information describing a memory region. The I/O adapter also includes a protocol engine, coupled to the memory region table, that receives first, second, and third RDMA requests specifying respective first, second, and third virtual addresses in respective first, second, and third memory regions described in respective first, second, and third of the plurality of memory region table entries. In response to the first RDMA request, the protocol engine reads the first entry to obtain a physical page address specifying a first physical memory page backing the first virtual address. In response to the second RDMA request, the protocol engine reads the second entry to obtain an address of a first page table, and reads an entry in the first page table indexed by a first portion of bits of the second virtual address to obtain a physical page address specifying a second physical memory page backing the second virtual address. In response to the third RDMA request, the protocol engine reads the third entry to obtain an address of a page directory, reads an entry in the page directory indexed by a second portion of bits of the third virtual address to obtain an address of a second page table, and reads an entry in the second page table indexed by the first portion of bits of the third virtual address to obtain a physical page address specifying a third physical memory page backing the third virtual address.
In another aspect, the present invention provides an I/O adapter for interfacing a host computer to a transport medium, in which the host computer has a memory for storing a virtually contiguous memory region backed by a plurality of physical memory pages, and the memory region has been previously registered with the I/O adapter. The I/O adapter includes a memory for storing address translation information for use by the adapter to translate a virtual address to a physical address of a location within the memory region. The address translation information is stored in the memory in response to the previous registration of the memory region. The I/O adapter also includes a protocol engine, coupled to the memory, that performs only one access to the memory to fetch a portion of the address translation information to translate the virtual address to the physical address, if the plurality of physical memory pages are physically contiguous.
In another aspect, the present invention provides an I/O adapter for interfacing a host computer to a transport medium, in which the host computer has a memory for storing a virtually contiguous memory region backed by a plurality of physical memory pages, and the memory region has been previously registered with the I/O adapter. The I/O adapter includes a memory, for storing address translation information for use by the adapter to translate a virtual address to a physical address of a location within the memory region. The address translation information is stored in the memory in response to the previous registration of the memory region. The I/O adapter also includes a protocol engine, coupled to the memory, that performs only two accesses to the memory to fetch a portion of the address translation information to translate the virtual address to the physical address, if the plurality of physical memory pages are not greater than a predetermined number. The protocol engine performs only three accesses to the memory to fetch a portion of the address translation information to translate the virtual address to the physical address, if the plurality of physical memory pages are greater than the predetermined number.
In another aspect, the present invention provides a method for performing memory registration for an I/O adapter coupled to a host computer, the host computer having a host memory. The method includes creating a first pool of a first type of page table and a second pool of a second type of page table within the host memory. The first type of page table includes storage for a first predetermined number of entries each for storing a physical page address. The second type of page table includes storage for a second predetermined number of entries each for storing a physical page address. The second predetermined number of entries is greater than the first predetermined number of entries. The method also includes, in response to receiving a memory registration request specifying physical page addresses of a number of physical memory pages backing a virtually contiguous memory region, allocating one of the first type of page table for storing the physical page addresses, if the number of physical memory pages is less than or equal to the first predetermined number of entries, and allocating one of the second type of page table for storing the physical page addresses, if the number of physical memory pages is greater than the first predetermined number of entries and less than or equal to the second predetermined number of entries.
In another aspect, the present invention provides a method for registering a virtually contiguous memory region with an I/O adapter, the memory region comprising a virtually contiguous memory range implicating a plurality of physical memory pages in a host computer coupled to the I/O adapter, the host computer having a memory comprising the physical memory pages. The method includes receiving a memory registration request. The request includes a list specifying a physical page address of each of the plurality of physical memory pages. The method also includes allocating an entry in a memory region table of the host computer memory for the memory region, in response to receiving the memory registration request. The method also includes determining whether the plurality of physical memory pages are physically contiguous based on the list of physical page addresses. The method also includes forgoing allocating any page tables for the memory region and storing a physical page address of a beginning physical memory page of the plurality of physical memory pages into the memory region table entry, if the plurality of physical memory pages are physically contiguous.
In another aspect, the present invention provides an I/O adapter for interfacing a host computer to a transport medium, the host computer having a memory. The I/O adapter includes a protocol engine that accesses a memory region table stored in the host computer memory. The table includes a plurality of entries, each storing an address and a level indicator associated with a virtually contiguous memory region. The protocol engine receives from the host computer a request to transfer data between the transport medium and a virtual address in a memory region in the host memory associated with an entry in the memory region table, responsively reads the memory region table entry, and examines the entry level indicator. If the level indicator indicates two levels, the protocol engine reads an address of a page table from an entry in a page directory. The entry within the page directory is specified by a first index comprising a first portion of the virtual address. An address of the page directory is specified by the memory region table entry address. The page directory and the page table are stored in the host computer memory. If the level indicator indicates two levels, the protocol engine also reads a physical page address of a physical memory page backing the virtual address from an entry in the page table. The entry within the page table is specified by a second index comprising a second portion of the virtual address. However, if the level indicator indicates one level, the protocol engine reads the physical page address of the physical memory page backing the virtual address from an entry in a page table. The entry within the page table is specified by the second index comprising the second portion of the virtual address. The address of the page table is specified by the memory region table entry address. The page table is stored in the host computer memory.
Referring now to
The operating system 362 manages the host memory 304 as a set of physical memory pages 324 that back the virtual memory address space presented to application programs 358 by the operating system 362.
The host memory 304 also includes a queue pair (QP) 374, which includes a send queue (SQ) 372 and a receive queue (RQ) 368. The QP 374 enables the application programs 358 and device driver 318 to submit work queue elements (WQEs) to the I/O adapter 306 and receive WQEs from the I/O adapter 306. The host memory 304 also includes a completion queue (CQ) 366 that enables the application programs 358 and device driver 318 to receive completion queue entries (CQEs) of completed WQEs. The QP 374 and CQ 366 may comprise, but are not limited to, implementations as specified by the iWARP or INFINIBAND specifications. In one embodiment, the I/O adapter 306 comprises a plurality of QPs similar to QP 374. The QPs 374 include a control QP, which is mapped into kernel address space and used by the operating system 362 and device driver 318 to post memory registration requests 334 and other administrative requests. The QPs 374 also comprise a dedicated QP 374 for each RDMA-enabled network connection (such as a TCP connection) to submit RDMA requests to the I/O adapter 306. The connection-oriented QPs 374 are typically mapped into user address space so that user-level application programs 358 can post requests to the I/O adapter 306 without transitioning to kernel level.
The application programs 358 and device driver 318 may submit RDMA requests and memory registration requests 334 to the I/O adapter 306 via the SQs 372. The memory registration requests 334 provide the I/O adapter 306 with the information it needs to map virtual addresses to physical addresses of a memory region 322. The memory registration requests 334 may include, but are not limited to, an iWARP Register Non-Shared Memory Region Verb or an INFINIBAND Register Memory Region Verb.
The I/O adapter 306 includes an I/O controller 308 coupled to an I/O adapter memory 316 via a memory bus 356. The I/O controller 308 includes a protocol engine 314, which executes a memory region table (MRT) update process 312. The I/O controller 308 transfers data with the I/O adapter memory 316, with the host memory 304, and with a network via a physical data transport medium 428 (shown in
The I/O adapter memory 316 stores a variety of data structures, including a memory region table (MRT) 382. The MRT 382 comprises an array of memory region table entries (MRTE) 352. The contents of an MRTE 352 are described in detail with respect to
Advantageously, the I/O adapter 306 is capable of employing page tables 336 of two different sizes, referred to herein as small page tables 336 and large page tables 336, to enable more efficient use of the I/O adapter memory 316, as described herein. In one embodiment, the size of a PTE 346 is 8 bytes. In one embodiment, the small page tables 336 each comprise 32 PTEs 346 (or 256 bytes) and the large page tables 336 each comprise 512 PTEs 346 (or 4 KB). The I/O adapter memory 316 stores a free pool of small page tables 342 and a free pool of large page tables 344 that are allocated for use in managing a memory region 322 in response to a memory registration request 334, as described in detail with respect to
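Assuming the sizes given above (8-byte PTEs, 32-entry small page tables, 512-entry large page tables), the page table pools might be declared as in the following illustrative C sketch; the type and field names are hypothetical, not taken from the specification.

    #include <stdint.h>

    #define PTE_SIZE          8     /* bytes per page table entry             */
    #define SMALL_PT_ENTRIES  32    /* small page table: 32 * 8  = 256 bytes  */
    #define LARGE_PT_ENTRIES  512   /* large page table: 512 * 8 = 4 KB       */

    /* A page table is simply an array of physical page addresses (PTEs). */
    typedef struct { uint64_t pte[SMALL_PT_ENTRIES]; } small_page_table_t;
    typedef struct { uint64_t pte[LARGE_PT_ENTRIES]; } large_page_table_t;

    /* Free pools carved out of the I/O adapter memory when the pools are
     * created; page tables are drawn from the appropriate pool at memory
     * registration time and returned when a region is de-registered. */
    struct page_table_pools {
        small_page_table_t *small_pool;   /* base of small page table pool */
        uint32_t            small_free;   /* number of free small tables   */
        large_page_table_t *large_pool;   /* base of large page table pool */
        uint32_t            large_free;   /* number of free large tables   */
    };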
MRTE 352 N+1 points directly to physical memory page 324 P+2, i.e., MRTE 352 N+1 stores the physical page address 332 of physical memory page 324 P+2. This is possible because the physical memory pages 324 for memory region 322 N+1 are all contiguous, i.e., physical memory pages 324 P+2 and P+3 are physically contiguous. Advantageously, a minimal amount of I/O adapter memory 316 is used to store the information for managing memory region 322 N+1 because it is detected that all the physical memory pages 324 are physically contiguous, as described in more detail with respect to the remaining Figures. That is, rather than unnecessarily allocating two levels of page table 336 resources, the I/O adapter 306 allocates zero page tables 336.
MRTE 352 N+2 points to a third page table 336. The first PTE 346 of the third page table 336 stores the physical page address 332 of physical memory page 324 P, and the second PTE 346 stores the physical page address 332 of physical memory page 324 P+7. Advantageously, a smaller amount of I/O adapter memory 316 is used to store the information for managing memory region 322 N+2 than for memory region 322 N because the I/O adapter 306 detects that the number of physical memory pages 324 may be specified by a single page table 336 and does not require two levels of page table 336 resources, as described in more detail with respect to the remaining Figures.
Referring now to
The I/O controller 308 also includes the protocol engine 314 of
The protocol engine 314 includes a control processor 406, a transmit pipeline 408, a receive pipeline 412, a context update and work scheduler 404, an MRT update process 312, and two arbiters 414 and 416. The context update and work scheduler 404 and MRT update process 312 receive notification of new work requests from the write queue 426. In one embodiment, the context update and work scheduler 404 comprises a hardware state machine, and the MRT update process 312 comprises firmware instructions executed by the control processor 406. However, it should be noted that the functions described herein may be performed by hardware, firmware, software, or various combinations thereof. The context update and work scheduler 404 communicates with the receive pipeline 412 and the transmit pipeline 408 to process RDMA requests. The MRT update process 312 reads and writes the I/O adapter memory 316 to update the MRT 382 and allocate and de-allocate MRTEs 352, page tables 336, and page directories 338 in response to memory registration requests 334. The output of the first arbiter 414 is coupled to the transaction switch 418, and the output of the second arbiter 416 is coupled to the memory interface 424. The requesters of the first arbiter 414 are the receive pipeline 412 and the transmit pipeline 408. The requesters of the second arbiter 416 are the receive pipeline 412, the transmit pipeline 408, the control processor 406, and the MRT update process 312. The protocol engine 314 also includes a direct memory access controller (DMAC) for transferring data between the transaction switch 418 and the host memory 304 via the host interface 402.
Referring now to
At block 504, the I/O adapter 306 creates the pool of small page tables 342 and the pool of large page tables 344 based on the information specified in the command received at block 502. Flow ends at block 504.
Referring now to
The MRTE 352 also includes a Two_Level_PT bit 614. When the PT_Required bit 612 is set, then if the Two_Level_PT bit 614 is set, the Address 604 points to a page directory 338; otherwise, the Address 604 points to a page table 336. The MRTE 352 also includes a PT_Size 616 field that indicates whether small or large page tables 336 are being used to store the page translation information for this memory region 322.
The MRTE 352 also includes a Valid bit 618 that indicates whether the MRTE 352 is associated with a valid memory region 322 registration. The MRTE 352 also includes an Allocated bit 622 that indicates whether the index into the MRT 382 for the MRTE 352 (e.g., iWARP STag or INFINIBAND memory region handle) has been allocated. For example, an application program 358 or device driver 318 may request the I/O adapter 306 to perform an Allocate Non-Shared Memory Region STag Verb to allocate an STag, in response to which the I/O adapter 306 will set the Allocated bit 622 for the allocated MRTE 352; however, the Valid bit 618 of the MRTE 352 will remain clear until the I/O adapter 306 receives, for example, a Register Non-Shared Memory Region Verb specifying the STag, at which time the Valid bit 618 will be set.
The MRTE 352 also includes a Zero_Based bit 624 that indicates whether the virtual addresses used by RDMA operations to access the memory region 322 will be offsets from the beginning of the virtual memory region 322 or will be full virtual addresses. For example, the iWARP specification refers to these two modes as virtual address-based tagged offset (TO) memory-regions and zero-based TO memory regions. A TO is the iWARP term used for the value supplied in an RDMA request that specifies the virtual address of the first byte to be transferred. Thus, the TO may be either a full virtual address or a zero-based offset virtual address, depending upon the memory region 322 mode. The TO in combination with the STag memory region identifier enables the I/O adapter 306 to generate a physical address of data to be transferred by an RDMA operation, as described with respect to
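Collecting the fields described above, an MRTE might be laid out as in the following illustrative C sketch. The field widths and ordering are assumptions; the Base_VA and FBO fields are included only because they are used in the effective first byte offset calculation described later.

    #include <stdint.h>

    /* Illustrative memory region table entry (MRTE) layout. */
    typedef struct {
        uint64_t address;       /* physical page address (zero-level), page table
                                   address (one-level), or page directory address
                                   (two-level), per the bits below               */
        uint64_t base_va;       /* Base_VA of the region (VA-based regions)      */
        uint32_t fbo;           /* first byte offset within the first page       */
        uint8_t  pt_required;   /* 0: zero-level, address points to a data page  */
        uint8_t  two_level_pt;  /* if pt_required: 1 = page directory, 0 = PT    */
        uint8_t  pt_size;       /* small or large page tables for this region    */
        uint8_t  valid;         /* entry describes a valid registration          */
        uint8_t  allocated;     /* STag / memory region handle allocated         */
        uint8_t  zero_based;    /* TOs are zero-based offsets vs. full VAs       */
    } mrte_t;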
Referring now to
At block 702, an application program 358 makes a memory registration request 334 to the operating system 362, which validates the request 334 and then forwards it to the device driver 318 all of
At decision block 704, the device driver 318 determines whether all of the physical memory pages 324 specified in the page list 328 of the memory registration request 334 are physically contiguous, such as memory region 322 N+1 of
At block 706, the device driver 318 commands the I/O adapter 306 to allocate an MRTE 352 only, as shown in
At decision block 708, the device driver 318 determines whether the number of physical memory pages 324 specified in the page list 328 is less than or equal to the number of PTEs 346 in a small page table 336. If so, flow proceeds to block 712; otherwise, flow proceeds to decision block 714.
At block 712, the device driver 318 commands the I/O adapter 306 to allocate an MRTE 352 and one small page table 336, as shown in
At decision block 714, the device driver 318 determines whether the number of physical memory pages 324 specified in the page list 328 is less than or equal to the number of PTEs 346 in a large page table 336. If so, flow proceeds to block 716; otherwise, flow proceeds to block 718.
At block 716, the device driver 318 commands the I/O adapter 306 to allocate an MRTE 352 and one large page table 336, as shown in
At block 718, the device driver 318 commands the I/O adapter 306 to allocate an MRTE 352, a page directory 338, and r large page tables 336, where r is equal to the number of physical memory pages 324 in the page list 328 divided by the number of PTEs 346 in a large page table 336 and then rounded up to the nearest integer, as shown in
In one embodiment, the device driver 318 may perform an alternate set of steps based on the availability of free small page tables 336 and large page tables 336. For example, if a single large page table 336 is implicated by a memory registration request 334, but no large page tables 336 are available, the device driver 318 may specify a two-level multiple small page table 336 allocation instead. Similarly, if a small page table 336 is implicated by a memory registration request 334, but no small page tables 336 are available, the device driver 318 may specify a single large page table 336 allocation instead.
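The allocation decision of blocks 704 through 718, together with the contiguity test, may be summarized by the following illustrative C sketch; the function and enumerator names are hypothetical, and the pool-exhaustion fallbacks described in the preceding paragraph are not shown.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SIZE         4096u
    #define SMALL_PT_ENTRIES  32
    #define LARGE_PT_ENTRIES  512

    typedef enum {
        ALLOC_MRTE_ONLY,        /* zero-level: pages are physically contiguous      */
        ALLOC_SMALL_PT,         /* one-level: one small page table                  */
        ALLOC_LARGE_PT,         /* one-level: one large page table                  */
        ALLOC_PD_AND_LARGE_PTS  /* two-level: page directory + r large page tables  */
    } alloc_kind_t;

    static bool pages_contiguous(const uint64_t *pages, uint32_t n)
    {
        for (uint32_t i = 1; i < n; i++)
            if (pages[i] != pages[i - 1] + PAGE_SIZE)
                return false;
        return true;
    }

    /* Decide which translation resources to request for a registration whose
     * page list contains n physical page addresses.  For a two-level allocation,
     * *n_large_pts receives r = ceil(n / LARGE_PT_ENTRIES). */
    alloc_kind_t choose_allocation(const uint64_t *pages, uint32_t n,
                                   uint32_t *n_large_pts)
    {
        *n_large_pts = 0;
        if (pages_contiguous(pages, n))
            return ALLOC_MRTE_ONLY;
        if (n <= SMALL_PT_ENTRIES)
            return ALLOC_SMALL_PT;
        if (n <= LARGE_PT_ENTRIES)
            return ALLOC_LARGE_PT;
        *n_large_pts = (n + LARGE_PT_ENTRIES - 1) / LARGE_PT_ENTRIES;
        return ALLOC_PD_AND_LARGE_PTS;
    }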
In one embodiment, if the device driver 318 receives an iWARP Allocate Non-Shared Memory Region STag Verb or an INFINIBAND Allocate L_Key Verb, the device driver 318 performs the steps of
Referring now to
At block 902, the I/O adapter 306 receives an RDMA request from an application program 358 via the SQ 372 all of
At block 904, the I/O controller 308 reads the MRTE 352 indexed by the memory region identifier and examines the PT_Required bit 612 and the Two_Level_PT bit 614 to determine the memory registration level type for the memory region 322. Flow proceeds to decision block 905.
At block 905, the I/O adapter 306 calculates an effective first byte offset (EFBO) using the TO received at block 902 and the translation information stored by the I/O adapter 306 in the MRTE 352 in response to a previous memory registration request 334, as described with respect to the previous Figures, and in particular with respect to
EFBO(zero-based) = FBO + TO  (1)
EFBO(VA-based) = FBO + (TO - Base_VA)  (2)
In an alternate embodiment, if the Zero_Based bit 624 indicates the memory region 322 is virtual address-based, then the EFBO 1008 is calculated according to equation (3) below.
EFBO(VA-based) = TO - (Base_VA & ~(Page_Size - 1))  (3)
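Equations (1) through (3) may be expressed as in the following illustrative C sketch; equation (3), the alternate form for virtual address-based memory regions, is shown in the comment.

    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* Effective first byte offset (EFBO) per equations (1) and (2).
     *   fbo     : first byte offset of the region within its first physical page
     *   to      : tagged offset supplied in the RDMA request
     *   base_va : base virtual address of the region (VA-based regions only)   */
    uint64_t efbo(int zero_based, uint64_t fbo, uint64_t to, uint64_t base_va)
    {
        if (zero_based)
            return fbo + to;                /* equation (1) */
        return fbo + (to - base_va);        /* equation (2) */
        /* Alternate embodiment, equation (3):
         *   return to - (base_va & ~(uint64_t)(PAGE_SIZE - 1));                */
    }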
As noted above with respect to
At decision block 906, the I/O controller 308 determines whether the level type is zero, i.e., whether the PT_Required bit 612 is clear. If so, flow proceeds to block 908; otherwise, flow proceeds to decision block 912.
At block 908, the I/O controller 308 already has the physical page address 332 from the Address 604 of the MRTE 352, and therefore advantageously need not make another access to the I/O adapter memory 316. That is, with a zero-level memory registration, the I/O controller 308 must make no additional accesses to the I/O adapter memory 316 beyond the MRTE 352 access to translate the TO into the physical address 1012. The I/O controller 308 adds the physical page address 332 to the byte offset bits 1002 of the EFBO 1008 to calculate the translated physical address 1012, as shown in
At decision block 912, the I/O controller 308 determines whether the level type is one, i.e., whether the PT_Required bit 612 is set and the Two_Level_PT bit 614 is clear. If so, flow proceeds to block 914; otherwise, the level type is two (i.e., the PT_Required bit 612 is set and the Two_Level_PT bit 614 is set), and flow proceeds to block 922.
At block 914, the I/O controller 308 calculates the address of the appropriate PTE 346 by adding the MRTE 352 Address 604 to the page table index bits 1004 of the EFBO 1008, as shown in
At block 916, the I/O controller 308 reads the PTE 346 specified by the address calculated at block 914 to obtain the physical page address 332, as shown in
At block 918, the I/O controller 308 adds the physical page address 332 to the byte offset bits 1002 of the EFBO 1008 to calculate the translated physical address 1012, as shown in
At block 922, the I/O controller 308 calculates the address of the appropriate PDE 348 by adding the MRTE 352 Address 604 to the directory table index bits 1006 of the EFBO 1008, as shown in
At block 924, the I/O controller 308 reads the PDE 348 specified by the address calculated at block 922 to obtain the base address of a page table 336, as shown in
At block 926, the I/O controller 308 calculates the address of the appropriate PTE 346 by adding the address read from the PDE 348 at block 924 to the page table index bits 1004 of the EFBO 1008, as shown in
At block 928, the I/O controller 308 reads the PTE 346 specified by the address calculated at block 926 to obtain the physical page address 332, as shown in
At block 932, the I/O controller 308 adds the physical page address 332 to the byte offset bits 1002 of the EFBO 1008 to calculate the translated physical address 1012, as shown in
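The translation performed at blocks 906 through 932 may be summarized by the following illustrative C sketch. The split of the EFBO into byte offset, page table index, and page directory index bits is an assumption (5 or 9 index bits for 32- or 512-entry page tables), read64 stands in for one access to the I/O adapter memory 316, and in the zero-level case the entire EFBO is treated as an offset from the beginning physical page, which is equivalent because the region is physically contiguous.

    #include <stdint.h>

    #define PAGE_SHIFT  12
    #define PAGE_MASK   0xFFFu
    #define PTE_SIZE    8              /* bytes per PTE/PDE in this embodiment */

    /* One read from the I/O adapter memory 316. */
    extern uint64_t read64(uint64_t adapter_mem_addr);

    /* Translate an EFBO to a host physical address using the level type recorded
     * in the MRTE (mrte_address, pt_required, two_level_pt). */
    uint64_t translate_efbo(uint64_t mrte_address, int pt_required,
                            int two_level_pt, uint64_t efbo,
                            unsigned pt_index_bits)
    {
        uint64_t byte_off = efbo & PAGE_MASK;

        if (!pt_required)                 /* zero-level: no further memory access */
            return mrte_address + efbo;   /* block 908                            */

        uint64_t pt_index = (efbo >> PAGE_SHIFT) & ((1u << pt_index_bits) - 1);
        uint64_t pt_base  = mrte_address;

        if (two_level_pt) {               /* two-level: read the PDE first        */
            uint64_t pd_index = efbo >> (PAGE_SHIFT + pt_index_bits);
            pt_base = read64(mrte_address + pd_index * PTE_SIZE);   /* blocks 922-924 */
        }

        /* One-level and two-level: read the PTE holding the physical page address. */
        uint64_t page_addr = read64(pt_base + pt_index * PTE_SIZE); /* blocks 914-916, 926-928 */
        return page_addr + byte_off;                                /* blocks 918, 932          */
    }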
After the I/O adapter 306 translates the TO into the physical address 1012, it may begin to perform the data transfer specified by the RDMA request. It should be understood that as the I/O adapter 306 sequentially performs the transfer of the data specified by the RDMA request, if the length of the data transfer is such that as the transfer progresses it reaches physical memory page 324 boundaries, in the case of a one-level or two-level memory region 322, the I/O adapter 306 must perform the operation described in
Although not shown in
Referring now to
As shown in
In addition, the number of accesses per unit work to a PDE 348 or PTE 346 is calculated given the assumptions of number of memory regions 322 and percent accesses for each memory region 322 size range. A unit work is the processing required to translate one virtual address to one physical address; thus, for example, each scatter/gather element requires at least one unit work, and each page boundary encountered requires another unit work, except advantageously in the zero-level case of the present invention as described above. The values are given per 100 units of work. For the conventional IA-32 method, each unit work requires three accesses to I/O adapter memory 316: one to an MRTE 352, one to a page directory 338, and one to a page table 336. In contrast, for the present invention, in the zero-level category, each unit work requires only one access to I/O adapter memory 316: one to an MRTE 352; in the one-level categories, each unit work requires two accesses to I/O adapter memory 316: one to an MRTE 352 and one to a page table 336; in the two-level category, each unit work requires three accesses to I/O adapter memory 316: one to an MRTE 352, one to a page directory 338, and one to a page table 336.
As shown in the table, the number of PDE/PTEs is reduced from 1,379,840 (10.5 MB) to 77,120 (602.5 KB), which is a 94% reduction by the present invention over the conventional IA-32 method based on the values chosen in the example. Also as shown, the number of accesses per unit work to an MRTE 352, PDE 348, or PTE 346 is reduced from 300 to 144, which is a 52% reduction by the present invention over the conventional IA-32 method based on the values chosen in the example, thereby reducing the bandwidth of the I/O adapter memory 316 consumed and reducing RDMA latency. Thus, it may be observed that the embodiments of the memory management method described herein advantageously potentially significantly reduce the amount of I/O adapter memory 316 required and therefore the cost of the I/O adapter 306 in the presence of relatively small and relatively frequently registered memory regions. Additionally, the embodiments advantageously potentially reduce the average amount of I/O adapter memory 316 bandwidth consumed and the latency required to perform a memory translation in response to an RDMA request.
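For reference, the byte figures above follow directly from the 8-byte entry size given earlier:

1,379,840 entries × 8 bytes/entry = 11,038,720 bytes ≈ 10.5 MB
77,120 entries × 8 bytes/entry = 616,960 bytes ≈ 602.5 KB
1 - (77,120/1,379,840) ≈ 0.944, i.e., approximately a 94% reduction
1 - (144/300) = 0.52, i.e., a 52% reduction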
Referring now to
The advantage of the embodiment of
Although the present invention and its objects, features, and advantages have been described in detail, other embodiments are encompassed by the invention. For example, although embodiments have been described in which the device driver performs the steps to determine the number of levels of page tables required to describe a memory region and performs the steps to determine which size page table to use, the I/O adapter could perform some or all of these steps rather than the device driver. Furthermore, although an embodiment has been described in which the number of different sizes of page tables is two, other embodiments are contemplated in which the number of different sizes of page tables is greater than two. Additionally, although embodiments have been described with respect to memory regions, the I/O adapter is also configured to support memory management of subsets of memory regions, including, but not limited to, memory windows such as those defined by the iWARP and INFINIBAND specifications.
Still further, although embodiments have been described in which a single host CPU complex with a single operating system is accessing the I/O adapter, other embodiments are contemplated in which the I/O adapter is accessible by multiple operating systems within a single CPU complex via server virtualization enabled by, for example, VMware (see www.vmware.com) or Xen (see www.xensource.com), or by multiple host CPU complexes each executing its own one or more operating systems enabled by work underway in the PCI SIG I/O Virtualization work group. In these virtualization embodiments, the I/O adapter may translate virtual addresses into physical addresses, and/or physical addresses into machine addresses, and/or virtual addresses into machine addresses, as defined for example by the aforementioned virtualization embodiments, in a manner similar to the translation of virtual to physical addresses described above. In a virtualization context, the term "machine address," rather than "physical address," is used to refer to the actual hardware memory address. In the server virtualization context, for example, when a CPU complex is hosting multiple operating systems, three types of address space are defined: the term virtual address is used to refer to an address used by application programs running on the operating systems similar to a non-virtualized server context; the term physical address, which is in reality a pseudo-physical address, is used to refer to an address used by the operating systems to access what they falsely believe are actual hardware resources such as host memory; the term machine address is used to refer to an actual hardware address that has been translated from an operating system physical address by the virtualization software, commonly referred to as a Hypervisor. Thus, the operating system views its physical address space as a contiguous set of physical memory pages in a physically contiguous address space, and allocates subsets of the physical memory pages, which may be physically discontiguous subsets, to the application program to back the application program's contiguous virtual address space; similarly, the Hypervisor views its machine address space as a contiguous set of machine memory pages in a machine contiguous address space, and allocates subsets of the machine memory pages, which may be machine discontiguous subsets, to the operating system to back what the operating system views as a contiguous physical address space. The salient point is that the I/O adapter is required to perform address translation for a virtually contiguous memory region in which the to-be-translated addresses (i.e., the input addresses to the I/O adapter address translation process, which are typically referred to in the virtualization context as either virtual or physical addresses) specify locations in a virtually contiguous address space, i.e., the address space appears contiguous to the user of the address space (whether that user is an application program, an operating system, or address translating hardware), and the translated-to addresses (i.e., the output addresses from the I/O adapter address translation process, which are typically referred to in the virtualization context as either physical or machine addresses) specify locations in potentially discontiguous physical memory pages.
Advantageously, the address translation schemes described herein may be employed in these virtualization contexts to achieve the benefits described, such as reduced memory space and bandwidth consumption and reduced latency. The embodiments may thus be employed in I/O adapters that do not service RDMA requests, but are still required to perform virtual-to-physical and/or physical-to-machine and/or virtual-to-machine address translations based on address translation information about a memory region registered with the I/O adapter.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the scope of the invention. Thus, the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
This application claims the benefit of U.S. Provisional Application No. 60/666,757 (Docket: BAN.0201), filed on Mar. 30, 2005, which is herein incorporated by reference for all intents and purposes.