Embodiments of the subject matter described herein relate generally to PCI express lightweight notification implementation mechanisms. More particularly, embodiments of the subject matter relate to host implementation of LN notification protocols.
PCI Express (peripheral component interconnect express), or PCIe, is a state-of-the-art computer expansion bus standard designed to replace the older PCI and PCI-X bus standards. Base specifications and engineering change notices (ECNs) are developed and maintained by the PCI special interest group (PCI-SIG), which comprises more than 900 member companies including Advanced Micro Devices, the Hewlett-Packard Company, and Intel Corporation. The PCIe bus serves as the primary motherboard-level interconnect for many consumer, server, and industrial applications, linking the host system processor with both integrated (surface mount) and add-on (expansion) peripherals.
The lightweight notification (LN) protocol was approved for PCIe base specification version 3.0 in October 2011. The lightweight notification ECN provides an optional normative protocol which allows an endpoint function (e.g., a PCIe device) to register an interest in specified cachelines in host memory, and to request that an LN notification message be sent from the CPU/memory complex to the device when the contents of a registered cacheline change. The LN protocol permits multiple LN-enabled endpoints to register the same cacheline(s) concurrently. Consequently, an LN notification message, generated when a registered cacheline is updated, may be unicast to a single endpoint using ID-based routing, or broadcast to multiple devices using multicast routing.
Although the potential increase in input/output (I/O) bandwidth and the potential decrease in I/O latency associated with the use of LN protocols are substantial, neither the PCIe standard nor the lightweight notification ECN define precisely how LN is to be implemented in the CPU/memory complex.
Exemplary methods and corresponding structure for implementing LN protocols in a central processing unit (CPU) memory complex are provided herein. The method implements a lightweight notification (LN) protocol in a central processing unit (CPU) host having associated system memory, and includes: defining a range of system memory for use as an LN data structure, the range comprising a plurality of cachelines each having a length of N bytes; allocating a portion of each cacheline for LN storage and a portion for payload data; and configuring a first location in each cacheline as a routing field such that when the first location contains a first value, its associated cacheline corresponds to a unicast LN message, and when the first location contains a second value, its associated cacheline corresponds to a multicast LN message.
Various methods and corresponding structure for implementing LN protocols in a CPU host are also provided. An exemplary method of implementing lightweight notification (LN) protocols involves a host having a range of system memory designated for use as an LN data structure, the range including a plurality of cachelines each having a length of N bytes with an M<N byte subset of each cacheline reserved for LN storage. The method includes: configuring, for each said cacheline in the range, a first location in LN storage for use as a routing field, such that when the first location contains a first value its associated cacheline corresponds to a unicast LN message, and when the first location contains a second value its associated cacheline corresponds to a multicast LN message; configuring, for each said cacheline in the range, a portion of the N bytes for use as payload data; and sending an LN notification message from the host to a PCIe endpoint when the payload data of a registered cacheline is updated.
An exemplary embodiment of a CPU/memory complex is also provided for use with LN protocols. The system includes: a CPU complex configured to communicate with a PCIe endpoint device of the type that includes a lightweight notification requester (LNR) module configured to send LN read and LN write request messages to the CPU complex and to receive LN notification messages from the CPU complex; a range of system memory designated for use as an LN data structure, the memory range including a plurality of cachelines each having a length of N bytes with an M<N byte subset of each cacheline reserved for LN storage; and a processor including a lightweight notification completer (LNC) configured to send LN notification messages to the LNR.
The foregoing summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A more complete understanding of the subject matter may be derived by referring to the detailed description and claims when considered in conjunction with the following figures, wherein like reference numbers refer to similar elements throughout the figures.
The following detailed description is merely illustrative in nature and is not intended to limit the embodiments of the subject matter or the application and uses of such embodiments. As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any implementation described herein as exemplary is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, there is no intention to be bound by any expressed or implied theory presented in the preceding technical field, background, brief summary or the following detailed description.
Techniques and technologies may be described herein in terms of functional and/or logical block components, and with reference to symbolic representations of operations, processing tasks, and functions that may be performed by various computing components or devices. Such operations, tasks, and functions are sometimes referred to as being computer-executed, computerized, software-implemented, or computer-implemented. It should be appreciated that the various block components shown in the figures may be realized by any number of hardware, software, and/or firmware components configured to perform the specified functions. For example, an embodiment of a system or a component may employ various integrated circuit components, e.g., memory elements, logic elements, look-up tables, or the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices.
The subject matter presented here relates to methods and apparatus for implementing lightweight notification (LN) protocols in a host processor system. The processor system and/or one or more associated cache memories, system memories, or other data structures, modules, or elements are configured for LN storage. More particularly, a predefined region of memory includes a plurality of cachelines, each having a length of N bytes. The cachelines may be configured in the form of any desired data structure such as, for example, a queue or ring buffer. A first subset of M bytes (M<N) is reserved as the LN storage mechanism, and a second subset of D bytes is allocated for payload data. Typically, (D+M)=N; that is, the entire cacheline is available for payload data, except for the M-byte portion of the cacheline reserved for LN storage. Alternatively, (D+M)<N, where the portion of the cacheline not used for LN storage or payload data may be used for other bookkeeping, software overhead, or other administrative purposes.
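The co-located partitioning described above can be sketched as a C structure. The concrete sizes below (a 64-byte cacheline with a 4-byte LN storage region) and the ring-buffer depth are illustrative assumptions only; the description requires merely that M<N and (D+M)≤N.

```c
#include <stdint.h>

/* Illustrative sizes only: N = 64, M = 4, D = 60.
 * The description requires only M < N and (D + M) <= N. */
#define LN_LINE_BYTES    64u                                  /* N */
#define LN_STORAGE_BYTES  4u                                  /* M */
#define LN_PAYLOAD_BYTES (LN_LINE_BYTES - LN_STORAGE_BYTES)   /* D */

/* One cacheline in the designated LN range: payload data and the
 * LN storage mechanism are co-located in the same line. */
struct ln_cacheline {
    uint8_t payload[LN_PAYLOAD_BYTES];    /* D bytes of payload data  */
    uint8_t ln_storage[LN_STORAGE_BYTES]; /* M bytes reserved for LN  */
};

/* The designated range may be organized as any desired data structure,
 * e.g., a ring buffer of LN cachelines (depth is hypothetical). */
struct ln_ring {
    struct ln_cacheline line[32];
    unsigned head, tail;
};

_Static_assert(sizeof(struct ln_cacheline) == LN_LINE_BYTES,
               "payload plus LN storage must fill the cacheline");
```

Because both fields are byte arrays, the structure has no padding, so the payload and LN storage regions exactly tile the N-byte line.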
Referring now to the drawings,
In the illustrated embodiment, one or more of controller hub 104, switch 108, and end point devices 110, 112 include respective I/O modules 114 configured to implement a layered protocol stack in accordance with, for example, the open systems interconnect (OSI) model. In an embodiment, I/O modules 114 facilitate PCIe compliant communication between and among processor 102, hub 104, switch 108, and devices 110 and 112.
In the detailed embodiment shown in
In one embodiment, the processor 102 may include multiple instances of the execution core 202, and one or more of the cache memories 204, 206, 208 may be shared between two or more instances of the execution core 202. For example, in one embodiment, two execution cores 202 may share the L4 cache memory 208, while respective instances of execution core 202 may have separate, dedicated instances of the L1 cache memory 204 and the L2 cache memory 206. Other arrangements are also possible and contemplated. Those skilled in the art will appreciate that PCIe compliant links are configured to maintain coherency with respect to processor caches and system memory as provided for in PCIe base specification version 3.0, which is available at http://www.pcisig.com/specifications/pciexpress.
The processor 102 also includes the memory controller 212 in the embodiment shown. The memory controller 212 may provide an interface between the processor 102 and the system memory 106, which may include one or more memory banks. The memory controller 212 may also be coupled to each of the cache memories 204, 206, 208. More particularly, the memory controller 212 may load cache lines (i.e., blocks of data stored in system memory) directly into any one or all of the cache memories 204, 206, 208. In one embodiment, the memory controller 212 may load a cache line into one or more of the cache memories 204, 206, 208 responsive to a demand request by the execution core 202.
As briefly discussed above, the LN protocol enables endpoints to register interest in specific cachelines in host memory, and to be notified via a hardware mechanism when the contents of a registered cacheline are updated. With continued reference to
The processor system 100 may be configured to operate in the manner described in detail below. For example,
It should be further appreciated that a described process may include any number of additional or alternative tasks, that the tasks shown in the figures need not be performed in the illustrated order, and that a described process may be incorporated into a more comprehensive procedure or process having additional functionality not described in detail herein. Moreover, one or more of the tasks shown in the figures could be omitted from an embodiment of a described process as long as the intended overall functionality remains intact.
With continued reference to
The LN protocol permits multiple LNRs to register the same line concurrently. In this case, LNC 210 notifies the multiple LNRs either by sending a directed LN notification message to each requesting LNR, or by sending a broadcast LN notification to each root port associated with an LNR which has registered a watch request.
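The notification choice made by the LNC when a watched line is updated can be modeled in software as follows. The function names, the fixed-size registration table, and the counters standing in for hardware message emission are all hypothetical, not part of the ECN.

```c
#include <stdint.h>

#define MAX_REGISTRANTS 8u /* hypothetical per-line registration limit */

/* Per-cacheline registration state kept by a modeled LNC. */
struct ln_registration {
    uint16_t requester_id[MAX_REGISTRANTS]; /* PCIe requester IDs     */
    unsigned count;                         /* registered LNRs so far */
};

/* Counters stand in for actual hardware message emission. */
static unsigned directed_sent, broadcast_sent;

static void send_directed_notification(uint16_t requester_id)
{
    (void)requester_id;
    directed_sent++;   /* one ID-routed LN notification per LNR */
}

static void send_broadcast_notification(void)
{
    broadcast_sent++;  /* one LN notification broadcast toward root ports */
}

/* Notify the registered LNRs of an update to a watched line, either by
 * a directed message to each LNR or by a single broadcast. */
static void lnc_notify(const struct ln_registration *reg, int use_broadcast)
{
    if (use_broadcast) {
        send_broadcast_notification();
    } else {
        for (unsigned i = 0; i < reg->count; i++)
            send_directed_notification(reg->requester_id[i]);
    }
}
```

The `use_broadcast` flag models the implementation-defined policy choice between the two notification strategies the paragraph above describes.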
Referring now to
In accordance with an embodiment, cacheline 502 exhibits a co-located layout in which the LN storage data and payload data are co-located in the same cacheline. In particular, cacheline 502 includes payload region 504 and LN storage region 506. In one embodiment, payload (memory) region 504 has a length “D” (indicated by the arrow 510) of 60 bytes, and LN storage region 506 has a length “M” (indicated by the arrow 512) of 4 bytes. Alternatively, LN storage region 506 may be any desired number of bytes (or data words) in length such that M=1, 2, 8, etc. Similarly, memory region 504 may be any desired number of bytes or words in length such that the total byte length N of cacheline 502 is equal to the sum of the payload data byte length D plus the LN storage byte length M; that is, N=D+M.
In an alternate embodiment, the total byte length N of cacheline 502 is greater than the sum of the payload data byte length D and the LN storage byte length M; that is, (D+M)<N, where the difference is attributable to bookkeeping, software overhead, administration, or the like. It should be noted that LN storage portion 506 is reserved for the LN storage mechanism and, typically, is not otherwise usable by the device; thus, the range of system memory (i.e., the plural cachelines 502) utilizes a programming model altered from that of regular system memory, in that the programming model is adapted to implement the LN storage mechanisms described herein.
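Within the M-byte LN storage region of the co-located layout, the routing and destination fields described below can be given a concrete encoding. The field positions, widths, and values in this sketch are hypothetical; the description requires only that a first value in the routing field denote a unicast notification and a second value denote a multicast notification.

```c
#include <stdint.h>

/* Hypothetical encoding of a 4-byte co-located LN storage region.
 * Field positions and values are illustrative, not defined by the ECN. */
#define LN_ROUTE_UNICAST   0x0u  /* "first value": directed notification   */
#define LN_ROUTE_MULTICAST 0x1u  /* "second value": broadcast notification */

struct ln_storage {
    uint8_t routing;           /* routing field: unicast vs. multicast */
    uint8_t reserved;          /* bookkeeping / future use             */
    union {
        uint16_t requester_id; /* unicast: ID-routed destination       */
        uint16_t port_mask;    /* multicast: bit-vector of root ports  */
    } dest;                    /* destination field                    */
};

_Static_assert(sizeof(struct ln_storage) == 4,
               "LN storage region is M = 4 bytes in this sketch");

/* Decode helper: does an update to this line produce a broadcast? */
static inline int ln_is_multicast(const struct ln_storage *s)
{
    return s->routing == LN_ROUTE_MULTICAST;
}
```

The union reflects that the destination field is interpreted differently depending on the routing field: a single requester ID for unicast, or a bit-vector selecting root ports for multicast.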
A variety of implementations are possible and contemplated by the schematic layout shown in
The endpoint device and/or endpoint function to which the unicast notification message is to be directed may be defined by one or more second locations 604, 606 within LN storage 506 designated for use as a destination field. In
Referring now to
The endpoint devices and/or endpoint functions to which the multicast notification message is to be broadcast may be defined by one or more second locations 704 within LN storage 506 designated for use as a destination field. In
With continued reference to
The method 800 further includes monitoring (task 812) each LN-configured cacheline and detecting (task 814) a change in the contents of the payload data bytes associated with a registered cacheline. When the system determines that a cacheline has been updated, the method 800 sends (task 816) a notification message to the requesting endpoint device(s) as discussed in connection with
Referring now to
In an embodiment, the method 900 may be configured to switch dynamically between the unicast and broadcast modes of operation. For example, if only one requester has registered an interest in a particular line, the unicast mode is employed. If a second or subsequent request is registered for the same line, the method converts to the broadcast mode. If the line is eventually evicted (thereby causing eviction notices to be sent), the method starts again in unicast mode the next time a request is registered for that line.
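The dynamic switching policy described above amounts to a small per-line state machine, sketched below. The type names and the structure are hypothetical; only the transitions (first registration starts unicast, second converts to broadcast, eviction resets) come from the description.

```c
/* Per-line notification mode, switched as registrations arrive:
 * unicast for a single registrant, broadcast for two or more,
 * and back to unicast after the line is evicted. */
enum ln_mode { LN_MODE_UNICAST, LN_MODE_BROADCAST };

struct ln_line_state {
    enum ln_mode mode;
    unsigned registrants;
};

static void ln_register_watch(struct ln_line_state *s)
{
    s->registrants++;
    /* a second or subsequent registration converts the line to broadcast */
    s->mode = (s->registrants > 1) ? LN_MODE_BROADCAST : LN_MODE_UNICAST;
}

static void ln_evict_line(struct ln_line_state *s)
{
    /* eviction notices go out; the next registration starts in unicast */
    s->registrants = 0;
    s->mode = LN_MODE_UNICAST;
}
```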
In an alternate embodiment, the LN storage mechanism is stored in a pre-configured range in system memory as above, but the LN storage fields are located separate from the registered cacheline. That is, each LN-capable cacheline has an associated LN storage area that is located in another cacheline. In this way, the entire cacheline may still be used as memory, and the memory address of the registered cacheline is used to determine the location (memory address) of the corresponding LN storage area. When the cacheline is modified (or when an LN operation is processed), two separate cachelines are affected: a first cacheline containing the payload data, and a second associated cacheline which stores the LN mechanism (e.g., the routing, destination, or other LN-related information).
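In such a split layout, deriving the LN storage address from the registered cacheline's address can be as simple as an index computation, as sketched below. The region base addresses and the sizes are hypothetical; the description says only that the registered line's address determines the location of its LN storage area.

```c
#include <stdint.h>

#define CACHELINE_BYTES   64u  /* N (illustrative)                        */
#define LN_STORAGE_BYTES   4u  /* M bytes of LN state per registered line */

/* Hypothetical bases of the two regions in the designated memory range. */
static uintptr_t payload_base;    /* registered, fully-usable cachelines */
static uintptr_t ln_storage_base; /* packed LN storage areas             */

/* Map a registered cacheline's address to its separate LN storage area:
 * the line's index within the payload region selects a packed M-byte slot. */
static uintptr_t ln_storage_addr(uintptr_t line_addr)
{
    uintptr_t index = (line_addr - payload_base) / CACHELINE_BYTES;
    return ln_storage_base + index * LN_STORAGE_BYTES;
}
```

Note that because the LN state for a line lives in a different cacheline, an update touches two lines, matching the two-cacheline behavior described above.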
While at least one exemplary embodiment has been presented in the foregoing detailed description, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiment or embodiments described herein are not intended to limit the scope, applicability, or configuration of the claimed subject matter in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing the described embodiment or embodiments. It should be understood that various changes can be made in the function and arrangement of elements without departing from the scope defined by the claims, which includes known equivalents and foreseeable equivalents at the time of filing this patent application.