Non-limiting and non-exhaustive embodiments of the invention are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified.
Embodiments of a system and method for a cache-coherent network interface are described herein. In the following description numerous specific details are set forth to provide a thorough understanding of the embodiments. One skilled in the relevant art will recognize, however, that the techniques described herein can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring certain aspects.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Processor 210 and CCNI 205 both couple to and share system interconnect 215 as full participants on system interconnect 215. Since CCNI 205 couples to system interconnect 215 as a client thereof (i.e., does not couple via ICH 222), it is addressable on system interconnect 215. Participating on system interconnect 215 as a full participant provides CCNI 205 with a high bandwidth, low latency direct link to processor 210. Since CCNI 205 is addressable on system interconnect 215, its internal hardware registers 250A and/or internal software buffers 250B (collectively internal memory 250) can be mapped into the system address space of processor 210. With internal memory 250 included in the memory map or address space of processor 210, processor 210 can then directly access (e.g., write to or read from) internal memory 250 without issuing interrupts or requests to a gatekeeper or third party controller agent. In other words, internal memory 250 simply appears to be an extension of system memory 225, which processor 210 can write to or read from at will. Direct access to internal memory 250 enables processor 210 to quickly access data coming in from a network via CCNI 205 or to check internal control and status registers of CCNI 205 with very low latency. Internal memory 250 may be implemented as a variety of different cacheable memory types, including write-back cacheable memory, write-through cacheable memory, write-combining cacheable memory, or the like.
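To make the mapping concrete, the sketch below models internal memory 250 as a C structure the processor touches with ordinary loads and stores, just as it would system memory 225. The field names and layout are invented for illustration, and an ordinary in-process variable stands in for the device; real code would obtain the pointer from the platform's memory map rather than declaring the structure itself.

```c
#include <stdint.h>

/* Hypothetical layout of CCNI internal memory 250 as it might appear
 * once mapped into the processor's system address space. Names and
 * layout are illustrative only, not taken from the specification. */
typedef struct {
    volatile uint32_t status;       /* internal status register (250A)  */
    volatile uint32_t control;      /* internal control register (250A) */
    volatile uint8_t  rx_buf[256];  /* internal software buffer (250B)  */
} ccni_internal_mem;

/* In a real system this object would sit behind a pointer obtained
 * from the memory map of address space 305; an ordinary variable
 * stands in for the device here so the example is self-contained. */
static ccni_internal_mem ccni_sim;

/* The processor reads status and writes control with plain accesses,
 * with no interrupt or third-party controller agent in the path. */
uint32_t ccni_read_status(ccni_internal_mem *m)
{
    return m->status;
}

void ccni_write_control(ccni_internal_mem *m, uint32_t v)
{
    m->control = v;
}
```

Because the structure lives in the processor's address space, these accesses are indistinguishable from accesses to system memory, which is what allows them to be cached and kept coherent.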
Including internal memory 250 in the address space of processor 210 provides the added benefit that content stored in internal memory 250 can be cached into the L1 cache or L2 cache of processor 210 as cacheable memory. Standard cache coherency mechanisms can be extended to ensure the cached copies of the content from internal memory 250 are kept up-to-date and valid within the L1 cache or L2 cache. A cache coherency agent may be assigned to maintain this cache coherency. Accordingly, when data arrives from the network at CCNI 205, the cache coherency agent can invalidate portions of the L1 or L2 cache, and transfer the data directly into the L1 or L2 cache for immediate access and processing by processor 210. The cache coherency agent may be implemented in a variety of manners including as a hardware entity in CCNI 205, a software driver executing on processor 210, firmware executing on a microcontroller internal to CCNI 205, a software application executing on processor 210, a kernel function of an operating system ("OS") executing on processor 210, or some combination of these.
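The invalidation step above can be sketched as a toy coherency model: the agent marks the processor's cached copy of an internal-memory line invalid when new network data lands, so the next read refetches the up-to-date contents. Everything here is a simplified simulation with invented names; a real coherency agent operates on hardware cache lines via the coherency protocol of system interconnect 215.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A single cached copy of a region of internal memory 250, with an
 * explicit valid bit standing in for the hardware coherency state. */
typedef struct {
    uint8_t data[64];   /* cached copy of an internal-memory region */
    bool    valid;      /* valid vs. invalidated by the agent       */
} cache_line;

/* Stands in for an RX buffer region of internal memory 250. */
static uint8_t internal_mem[64];

/* Cache coherency agent's action when new network data arrives:
 * invalidate the processor's cached copy. */
void agent_invalidate(cache_line *line)
{
    line->valid = false;
}

/* Processor read: hits the cache while the line is valid; after an
 * invalidation, refetches the current contents and revalidates. */
uint8_t cached_read(cache_line *line, size_t off)
{
    if (!line->valid) {
        memcpy(line->data, internal_mem, sizeof line->data);
        line->valid = true;
    }
    return line->data[off];
}
```

The point of the model is the ordering: without the invalidation, the processor would keep consuming its stale cached copy even after new data arrived in the device.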
System interconnect 215 operates as a front side bus ("FSB") of processor 210, providing a coherent system interconnect for each client coupled thereto. A coherent system interconnect is a communication link that supports transport of cache coherency protocols thereover. System interconnect 215 may be a high speed serial or parallel link. For example, in one embodiment system interconnect 215 is implemented with the Common System Interconnect ("CSI") by Intel Corporation. In an alternative embodiment, system interconnect 215 is implemented with the HyperTransport ("HT") interconnect by Advanced Micro Devices, Inc.
In one embodiment, NV memory 240 is a flash memory device. In other embodiments, NV memory 240 includes any one of read only memory ("ROM"), programmable ROM, erasable programmable ROM, electrically erasable programmable ROM, or the like. In one embodiment, system memory 225 includes random access memory ("RAM"), such as dynamic RAM ("DRAM"), synchronous DRAM ("SDRAM"), double data rate SDRAM ("DDR SDRAM"), static RAM ("SRAM"), or the like. DSU 235 represents any storage device for software data, applications, and/or operating systems, but will most typically be a nonvolatile storage device. DSU 235 may optionally include one or more of an integrated drive electronics ("IDE") hard disk, an enhanced IDE ("EIDE") hard disk, a redundant array of independent disks ("RAID"), a small computer system interface ("SCSI") hard disk, or the like. It should be appreciated that various other elements of processing system 200 may have been excluded from
CSR Aperture (“CSRA”) 350, RX Data Aperture (“RXDA”) 355, RX Descriptor Aperture (“RXA”) 360, and TX Descriptor Aperture (“TXA”) 365 are coherent memory mapped apertures (collectively apertures 370) that expose their respective internal memory structures of CCNI 300 to software executing on processor 210. Each aperture 370 is backed by a corresponding hardware register 250A or software buffer 250B of internal memory 250. From the perspective of processor 210, apertures 370 look just like system memory 225 and are mapped as cacheable memory. Apertures 370 act as a sort of “window” into internal memory 250 and may be mapped anywhere within address space 305 of processor 210. In one embodiment, apertures 370 are regions of address space 305, each starting at a respective base address and continuing for a defined offset, that include pointers into their respective internal memory 250 locations. Writing to an aperture 370 will result in a change in the corresponding register/buffer of internal memory 250, while reading from an aperture 370 will return the latest contents of the corresponding register/buffer of internal memory 250. Access to internal memory 250 via apertures 370 may be implemented using standard cache control mechanisms.
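The base-plus-offset behavior of an aperture 370 can be sketched as a simple address translation: an address falling within the aperture's window in address space 305 resolves to the corresponding location in the backing internal memory 250. The structure and function names below are hypothetical, and pointer arithmetic stands in for the hardware decode logic.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative model of a coherent memory-mapped aperture 370: a
 * region of address space 305 starting at a base address and spanning
 * a fixed size, backed by a register/buffer of internal memory 250. */
typedef struct {
    uint64_t base;    /* start of the aperture in address space 305 */
    uint64_t size;    /* length of the aperture window              */
    uint8_t *backing; /* corresponding internal memory 250          */
} aperture;

/* Resolve an address inside the aperture window to the backing
 * internal-memory location; NULL for addresses outside the window.
 * A write through the returned pointer changes the backing
 * register/buffer; a read returns its latest contents. */
uint8_t *aperture_translate(const aperture *a, uint64_t addr)
{
    if (addr < a->base || addr >= a->base + a->size)
        return NULL;
    return a->backing + (addr - a->base);
}
```

In hardware the "translation" is simply address decode on system interconnect 215, which is why accesses through an aperture behave exactly like accesses to system memory 225 and can use standard cache control mechanisms.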
Data transfer via apertures 370 is effected via a number of data paths within CCNI 300. All communication between processor 210 and CCNI 300 occurs via a data path (1), which physically traverses system interconnect 215 to system interconnect interface 310. A data path (2) enables processor 210 to directly write data or commands into TX descriptor buffers 320. TX descriptor buffers 320 are accessible via TXA 365. A data path (3) enables memory transfer engine(s) 335 to read data and/or commands (e.g., transmit descriptors) from TX descriptor buffers 320. A data path (4) enables processor 210 to directly write data and/or commands (e.g., receive descriptors) into RX descriptor buffers 325. RX descriptor buffers 325 are accessible via RXA 360. A data path (5) enables memory transfer engine(s) 335 to read data and/or commands (e.g., receive descriptors) to execute receive related functions on data currently buffered in RX data buffers 330. A data path (6) enables memory transfer engine(s) 335 to issue commands directly on system interconnect 215 as well as read/write data directly on system interconnect 215. A data path (7) is the transmit path exiting CCNI 300 from memory transfer engine(s) 335 onto a network 380 (e.g., LAN, WAN, Internet, PC-to-PC direct link, etc.). A data path (8) is the receive path entering CCNI 300 from network 380 into RX data buffers 330. A data path (9) enables processor 210 to directly read or snoop data currently buffered in RX data buffers 330 and received from network 380. RX data buffers 330 are accessible to processor 210 via RXDA 355. A data path (10) enables memory transfer engine(s) 335 to read data from RX data buffers 330 and move it into system memory 225 directly on system interconnect 215. It is noteworthy that while conventional NICs can move receive data into system memory, a conventional NIC cannot place the data directly on the high bandwidth, low latency FSB for transport to system memory 225.
Rather, conventional NICs must transport the received data over a PCI bus via ICH 110 and adhere to cumbersome ordering rules. Finally, a data path (11) enables processor 210 to read/write directly to CSRs 315. CSRs 315 are accessible via CSRA 350.
In a process block 405, processor 210 generates new data and transmit commands. The data and transmit commands may be initially created and stored in the L1 or L2 cache of processor 210. If the data transfer is intended to be an "immediate data transfer" (decision block 410), then process 400 continues to a process block 415. An immediate data transfer is a type of zero-copy transfer where the data to be transmitted is not first written into system memory 225.
In process block 415, the transmit commands and the data are evicted from the L1 or L2 cache of processor 210. The evicted transmit commands and data are written into TX descriptor buffers 320 of CCNI 300 through TXA 365 along data paths (1) and (2) (process block 420). In a process block 425, memory transfer engine(s) 335 accesses the transmit commands (e.g., transmit descriptors) in TX descriptor buffers 320 along data path (3) and executes the transmit commands. In a process block 430, memory transfer engine(s) 335 transfers the data also buffered in TX descriptor buffers 320 onto network 380 along data path (7) in response to executing the transmit commands.
Returning to decision block 410, if the data transfer is not an immediate data transfer, then process 400 continues to a process block 435. In process block 435, the transmit commands are evicted or pushed into TX descriptor buffers 320 along data paths (1) and (2). Again, the transmit commands are pushed into TX descriptor buffers 320 through TXA 365. In a process block 440, memory transfer engine(s) 335 accesses TX descriptor buffers 320 along data path (3) to retrieve and execute the transmit commands (process block 445). In this case, the transmit commands include DMA transfer commands to DMA fetch the data from L1 or L2 cache (or system memory 225 if the data has been evicted from L1 and L2 cache into system memory 225) and push it onto network 380 along data paths (1), (6), and (7). It should be appreciated that the DMA transfers from L1 or L2 cache (or system memory 225) are transferred across system interconnect 215 (not a PCI or PCI-Express bus), and therefore are considerably faster than a DMA transfer by NIC 140 in
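The two transmit paths of process 400 can be sketched as two descriptor kinds handled by one engine: an "immediate" descriptor carries its payload inline (zero-copy, process blocks 415 through 430), while a DMA descriptor instead points at data resident in cache or system memory 225 (process blocks 435 through 445). All field and function names below are hypothetical, and memcpy stands in for the actual transfers over system interconnect 215 and onto network 380.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative transmit descriptor as it might sit in TX descriptor
 * buffers 320 after being pushed through TXA 365. */
enum { TX_IMMEDIATE, TX_DMA };

typedef struct {
    int            kind;            /* TX_IMMEDIATE or TX_DMA           */
    uint8_t        inline_data[32]; /* payload for immediate transfers  */
    const uint8_t *dma_src;         /* cache/system-memory source (DMA) */
    size_t         len;             /* bytes to transmit                */
} tx_descriptor;

/* Memory transfer engine 335: executes one descriptor and "transmits"
 * the bytes by writing them into the wire buffer, which stands in for
 * network 380. Returns the number of bytes sent. */
size_t engine_execute_tx(const tx_descriptor *d, uint8_t *wire)
{
    /* Immediate: payload already in the descriptor buffer.
     * DMA: fetch from the source over system interconnect 215. */
    const uint8_t *src = (d->kind == TX_IMMEDIATE) ? d->inline_data
                                                   : d->dma_src;
    memcpy(wire, src, d->len);
    return d->len;
}
```

The design choice the sketch illustrates is latency versus descriptor size: the immediate path avoids a fetch round-trip entirely, while the DMA path keeps descriptors small for bulk data.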
When data arrives over network 380 via data path (8) (decision block 510), the data is buffered into RX data buffers 330 (process block 515). In a process block 520, processor 210 is notified of the new data in response to a polling event. In one embodiment, when the new data arrives in RX data buffers 330, cache coherency agent 340 invalidates the corresponding portion of the cache of processor 210; this invalidation is identified by the polling event, indicating that the data in RX data buffers 330 has changed. In other embodiments, processor 210 does not continuously poll RXDA 355 for new data; rather, an interrupt event may be issued by CCNI 300 directly onto system interconnect 215 to notify processor 210. Accordingly, using the event driven interrupt mechanism, process block 505 is not executed.
Once processor 210 becomes aware of the new data in RX data buffers 330, there are multiple transfer types or techniques by which processor 210 may retrieve the data. In decision block 525, if the transfer is a zero-copy snoop transfer, then process 500 continues to a process block 530. A zero-copy snoop transfer is referred to as a “zero-copy transfer” because the data is copied directly into L1 or L2 cache by processor 210 without first copying the received data into system memory 225. A zero-copy snoop transfer is referred to as a “snoop transfer” because the transfer is initiated when processor 210 directly snoops into RX data buffers 330 to determine whether new data has arrived, as opposed to receiving an interrupt event.
In a process block 530, processor 210 reads the data directly from RX data buffers 330 through RXDA 355 along data paths (1) and (9), and then enrolls or copies the received data directly into the L1 or L2 cache (process block 535) for immediate consumption.
Returning to decision block 525, if the transfer mechanism is to be a DMA transfer, then process 500 proceeds to a process block 540. In process block 540, receive commands (e.g., receive descriptors) are transferred into RX descriptor buffers 325 via data paths (1) and (4). In one embodiment, processor 210 pushes the receive commands into RX descriptor buffers 325 via RXA 360. In a process block 545, memory transfer engine(s) 335 accesses the receive commands along data path (5) for execution. In response to the receive commands, memory transfer engine(s) 335 fetches the received data from RX data buffers 330 along data path (10) and transfers the received data into system memory 225 via system interconnect 215 along data paths (6) and (1).
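The two receive paths of process 500 can likewise be sketched side by side: a zero-copy snoop read straight out of RX data buffers 330 into the processor's cache (process blocks 530 and 535), and a descriptor-driven DMA that moves the data into system memory 225 (process blocks 540 and 545). Names are illustrative, and memcpy again stands in for transfers through RXDA 355 and over system interconnect 215.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative receive descriptor as pushed through RXA 360 into
 * RX descriptor buffers 325. */
typedef struct {
    size_t   offset; /* location of the data in RX data buffers 330 */
    size_t   len;    /* bytes to move                                */
    uint8_t *dest;   /* destination in system memory 225 (DMA path)  */
} rx_descriptor;

/* Zero-copy snoop transfer: processor 210 reads directly through
 * RXDA 355 into its L1/L2 cache, bypassing system memory 225. */
void snoop_read(const uint8_t *rx_buf, size_t off, uint8_t *cache,
                size_t len)
{
    memcpy(cache, rx_buf + off, len);
}

/* DMA transfer: memory transfer engine 335 executes a receive
 * descriptor and pushes the data onto system interconnect 215
 * toward system memory 225. */
void engine_execute_rx(const uint8_t *rx_buf, const rx_descriptor *d)
{
    memcpy(d->dest, rx_buf + d->offset, d->len);
}
```

The snoop path trades processor involvement for minimum latency on small payloads, while the descriptor path offloads bulk movement to the engine.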
In one embodiment, CCNI 300 includes and maintains its own internal CCNI cache 345. CCNI cache 345 is accessible to processor 210 in a similar manner to system memory 225 and viewed by processor 210 simply as an extension of its system memory 225. In this embodiment, both received and transmit data may be cached locally by CCNI 300. For example, data received from network 380 may be cached locally for direct access by processor 210 therefrom. Data to be transmitted may be written into CCNI cache 345 by processor 210 with corresponding transmit descriptors written into TX descriptor buffers 320. Subsequently, when memory transfer engine 335 executes the transmit descriptor, memory transfer engine(s) 335 may pull the data directly from the local CCNI cache 345.
Directly coupling CCNI 300 to processor 210 over a cache coherent system interconnect enables processor 210 to directly and efficiently read network data received at CCNI 300 in any manner it chooses. Rather than having to adhere to strict ordering and fencing rules required for transfers over PCI or PCI-Express, the cacheable memory of CCNI 300 enables a host of technologies like software controlled zero-copy receive, software based packet splitting, and software based out of order packet processing. CCNI 300 enables processor 210 to directly peer into internal memory 250 to obtain control and status data at will and directly manage the resources of its network interface.
As illustrated, each processor 210 maintains an address space 305 which includes apertures 370 for accessing a CCNI 205 sharing the same system interconnect 215. CCNIs 205 are full participants with processors 210 on their respective system interconnects 215. Although the illustrated system interconnects 215 assume a multi-drop front side bus configuration, other configurations with point-to-point interfaces between processors 210 and CCNI 205, with or without integrated memory controllers, may be implemented as well.
Sharing a single coherent system interconnect, such as system interconnect 215, between CCNI 205 and multiple processors 210 enables assigning one or more processors 210 to specialized tasks to preprocess packets arriving or departing on network 380. For example, packets arriving at CCNI 205 may be initially cached by a first one of processors 210 that is assigned the task of decompression and/or decryption, then evicted into the cache of another one of processors 210 executing a software application consuming the data. In the outgoing direction, one of processors 210 may be assigned the task of compressing and/or encrypting data generated by a second one of processors 210, prior to transferring the data over system interconnect 215 to CCNI 205 for transmission onto network 380.
The processes explained above are described in terms of computer software and hardware. The techniques described may constitute machine-executable instructions embodied within a machine (e.g., computer) readable medium, that when executed by a machine will cause the machine to perform the operations described. Additionally, the processes may be embodied within hardware, such as an application specific integrated circuit (“ASIC”) or the like.
A machine-accessible medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant, manufacturing tool, any device with a set of one or more processors, etc.). For example, a machine-accessible medium includes recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.), as well as electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.