To reduce the likelihood of dropping packets, a compute device may dedicate at least one core (e.g., a processing device of a processor that is capable of reading and executing instructions) to polling (e.g., checking to determine whether any items, such as packets, are available) an input queue that temporarily stores packets received by a network interface controller (NIC) of the compute device. More specifically, a thread executed by a dedicated core polls the input queue at a fixed interval to retrieve packets, guarding against the input queue filling up and losing packets (e.g., because there is no capacity in the queue to store another packet). The core that is dedicated to polling the queue does not perform any work for an underlying workload (e.g., an application, a set of operations, etc.) to which the received packets pertain (e.g., contain data for the corresponding workload). In compute devices that execute multiple virtual machines (e.g., each on behalf of a different customer of a cloud computing service), a separate core is typically dedicated to polling a corresponding input queue used by each virtual machine, making a significant number of the cores unavailable to assist in the execution of the underlying workloads.
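The conventional arrangement described above can be illustrated with a short sketch. The following is a minimal illustration only, not code from this disclosure; the queue, dispatch function, and polling interval names are hypothetical stand-ins for the behavior of a dedicated polling core.

```python
# Minimal sketch of a conventional dedicated polling core. The names
# (input_queue, dispatch, interval_s) are hypothetical illustrations.
import queue
import time

def dedicated_polling_core(input_queue: queue.Queue, dispatch,
                           interval_s: float = 1e-4) -> None:
    """Spin forever, draining the NIC input queue at a fixed interval.

    The core running this loop contributes no cycles to the underlying
    workload; it exists solely to keep the queue from overflowing.
    """
    while True:
        try:
            while True:
                # Non-blocking poll: remove every available packet.
                dispatch(input_queue.get_nowait())
        except queue.Empty:
            pass                    # nothing left in the queue
        time.sleep(interval_s)      # wait out the fixed polling interval
```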
The concepts described herein are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. Where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.
While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. Additionally, it should be appreciated that items included in a list in the form of “at least one A, B, and C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C). Similarly, items listed in the form of “at least one of A, B, or C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).
The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory machine-readable (e.g., computer-readable) storage medium, which may be read and executed by one or more processors. A machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).
In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.
Referring now to
As shown in
The main memory 116 may be embodied as any type of volatile (e.g., dynamic random access memory (DRAM), etc.) or non-volatile memory or data storage capable of performing the functions described herein. Volatile memory may be a storage medium that requires power to maintain the state of data stored by the medium. Non-limiting examples of volatile memory may include various types of random access memory (RAM), such as dynamic random access memory (DRAM) or static random access memory (SRAM). One particular type of DRAM that may be used in a memory module is synchronous dynamic random access memory (SDRAM). In particular embodiments, DRAM of a memory component may comply with a standard promulgated by JEDEC, such as JESD79F for DDR SDRAM, JESD79-2F for DDR2 SDRAM, JESD79-3F for DDR3 SDRAM, JESD79-4A for DDR4 SDRAM, JESD209 for Low Power DDR (LPDDR), JESD209-2 for LPDDR2, JESD209-3 for LPDDR3, and JESD209-4 for LPDDR4. Such standards (and similar standards) may be referred to as DDR-based standards and communication interfaces of the storage devices that implement such standards may be referred to as DDR-based interfaces.
In one embodiment, the memory device is a block addressable memory device, such as those based on NAND or NOR technologies. A memory device may also include a three dimensional crosspoint memory device (e.g., Intel 3D XPoint™ memory), or other byte addressable write-in-place nonvolatile memory devices. In one embodiment, the memory device may be or may include memory devices that use chalcogenide glass, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level Phase Change Memory (PCM), a resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), anti-ferroelectric memory, magnetoresistive random access memory (MRAM) that incorporates memristor technology, resistive memory including metal oxide base, oxygen vacancy base, and conductive bridge random access memory (CB-RAM), or spin transfer torque (STT)-MRAM, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, or a combination of any of the above, or other memory. The memory device may refer to the die itself and/or to a packaged memory product.
In some embodiments, 3D crosspoint memory (e.g., Intel 3D XPoint™ memory) may comprise a transistor-less stackable cross point architecture in which memory cells sit at the intersection of word lines and bit lines and are individually addressable and in which bit storage is based on a change in bulk resistance. In some embodiments, all or a portion of the main memory 116 may be integrated into the processor 114. In operation, the main memory 116 may store various software and data used during operation such as applications, programs, libraries, and drivers. In the illustrative embodiment, the memory 116 defines one or more queues 150 (e.g., queues 152, 154), each of which may be embodied as a set of memory locations accessible at one or more memory addresses and capable of storing data sets (e.g., packets) to be operated on by one or more of the cores 140, the NIC 122, and/or other components of the compute device 110. For example, the queue 152 may temporarily store packets received by the NIC 122 (e.g., through the network 170 from the client device 160 or from another source) and the queue 154 may temporarily store packets to be sent by the NIC 122 (e.g., through the network 170).
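As a rough illustration of how such a queue might be modeled, the following sketch treats a queue 150 as a bounded first-in-first-out buffer with an explicit capacity. The class and method names are assumptions for illustration, not an interface defined by this disclosure.

```python
# Minimal sketch of a fixed-capacity queue standing in for an input
# queue 152 or output queue 154; all names are illustrative.
from collections import deque

class PacketQueue:
    """Bounded FIFO that drops packets once its capacity is reached."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: deque = deque()

    def enqueue(self, packet) -> bool:
        """Returns False (packet dropped) when the queue is full."""
        if len(self._items) >= self.capacity:
            return False   # overflow: the loss the described method avoids
        self._items.append(packet)
        return True

    def dequeue_all(self) -> list:
        """Drain every stored packet, as a polling core would."""
        drained = list(self._items)
        self._items.clear()
        return drained

    def free_slots(self) -> int:
        """Remaining capacity, used by the predictions sketched later."""
        return self.capacity - len(self._items)
```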
The compute engine 112 is communicatively coupled to other components of the compute device 110 via the I/O subsystem 118, which may be embodied as circuitry and/or components to facilitate input/output operations with the compute engine 112 (e.g., with the processor 114 and/or the main memory 116) and other components of the compute device 110. For example, the I/O subsystem 118 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, integrated sensor hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.), and/or other components and subsystems to facilitate the input/output operations. In some embodiments, the I/O subsystem 118 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with one or more of the processor 114, the main memory 116, and other components of the compute device 110, into the compute engine 112.
The communication circuitry 120 may be embodied as any communication circuit, device, or collection thereof, capable of enabling communications over the network 170 between the compute device 110 and another compute device (e.g., the client device 160, etc.). The communication circuitry 120 may be configured to use any one or more communication technologies (e.g., wired or wireless communications) and associated protocols (e.g., Ethernet, Bluetooth®, Wi-Fi®, WiMAX, etc.) to effect such communication.
The illustrative communication circuitry 120 includes a network interface controller (NIC) 122, which may also be referred to as a host fabric interface (HFI). The NIC 122 may be embodied as one or more add-in-boards, daughter cards, network interface cards, controller chips, chipsets, or other devices that may be used by the compute device 110 to connect with another compute device (e.g., the client device 160, etc.). In some embodiments, the NIC 122 may be embodied as part of a system-on-a-chip (SoC) that includes one or more processors, or included on a multichip package that also contains one or more processors. In some embodiments, the NIC 122 may include a local processor (not shown) and/or a local memory (not shown) that are both local to the NIC 122. In such embodiments, the local processor of the NIC 122 may be capable of performing one or more of the functions of the compute engine 112 described herein. Additionally or alternatively, in such embodiments, the local memory of the NIC 122 may be integrated into one or more components of the compute device 110 at the board level, socket level, chip level, and/or other levels. In the illustrative embodiment, the NIC 122 includes a work estimator logic unit 124, which may be embodied as any device or circuitry (e.g., a co-processor, an ASIC, etc.) configured to write metadata to a packet descriptor for a received packet (e.g., a packet written to the input queue 152) that indicates a computational load associated with the packet. For example, the work estimator logic unit 124 may write metadata indicative of an instantaneous burst size associated with the packet (e.g., the number of packets that the NIC 122 received in a burst, such as within a relatively short predefined period of time) and/or may write metadata indicative of the type of workload the packet pertains to (e.g., a cryptographic workload, a compression workload, an artificial intelligence (AI) model training workload, etc.). As described herein, the computational load associated with the received packets, in the illustrative embodiment, is inversely related to the rate at which the cores 140 are able to remove the packets from the input queue 152 (e.g., a set of received packets that take more time to process will be removed from the queue 152 more slowly than a set of packets associated with a workload that has a smaller computational load).
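A hypothetical shape for such a packet descriptor is sketched below. The field names and the workload-type encoding are illustrative assumptions, not a descriptor format defined by this disclosure.

```python
# Minimal sketch of per-packet metadata the work estimator logic unit 124
# might write into a packet descriptor; field names are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto

class WorkloadType(Enum):
    CRYPTO = auto()        # cryptographic workload
    COMPRESSION = auto()   # compression workload
    AI_TRAINING = auto()   # AI model training workload

@dataclass
class PacketDescriptor:
    buffer_addr: int             # where the packet payload lives in memory
    length: int                  # payload length in bytes
    burst_size: int              # packets the NIC received in the same burst
    workload_type: WorkloadType  # workload the packet pertains to
```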
The one or more illustrative data storage devices 128 may be embodied as any type of devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, or other data storage devices. Each data storage device 128 may include a system partition that stores data and firmware code for the data storage device 128. Each data storage device 128 may also include one or more operating system partitions that store data files and executables for operating systems.
The client device 160 may have components similar to those described in
As described above, the compute device 110 and the client device 160 are illustratively in communication via the network 170, which may be embodied as any type of wired or wireless communication network, including global networks (e.g., the Internet), local area networks (LANs) or wide area networks (WANs), cellular networks (e.g., Global System for Mobile Communications (GSM), 3G, Long Term Evolution (LTE), Worldwide Interoperability for Microwave Access (WiMAX), etc.), digital subscriber line (DSL) networks, cable networks (e.g., coaxial networks, fiber networks, etc.), or any combination thereof.
Referring now to
Subsequently, in block 212, the compute device 110 determines a predicted time period in which a queue 150 may receive packet data without overflowing (e.g., without running out of capacity and dropping packets). In doing so, the compute device 110 may determine the predicted time period for an input queue 152, as indicated in block 214. In the illustrative embodiment, the compute device 110 determines the predicted time period for a worst case scenario, as indicated in block 216. For example, and as indicated in block 218, the compute device 110 may determine the predicted time period as a function of the capacity of the queue (e.g., the number of packets that the queue is capable of storing) and the peak packet receipt rate of the NIC 122. In other words, the compute device 110 may determine how long the queue 152 can receive packets if the NIC 122 is receiving those packets and placing them in the queue 152 at the maximum theoretical speed at which the NIC 122 is able to operate. In block 220, the compute device 110 may determine the time period as a function of the metadata in the packet descriptor(s) indicative of the instantaneous number of received packets (e.g., from block 208). Such information may be indicative of the present rate at which packets are being received from the NIC 122.
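Both the worst-case computation of block 218 and the burst-metadata variant of block 220 reduce to simple arithmetic, sketched below with illustrative function and parameter names (the disclosure does not prescribe a particular formula).

```python
# Minimal sketches of the input-queue predictions; all names are
# hypothetical illustrations of blocks 218 and 220.

def predict_safe_period_worst_case(free_slots: int,
                                   peak_pkts_per_sec: float) -> float:
    """Longest time the input queue can absorb packets at the NIC's
    theoretical peak receipt rate before overflowing (block 218)."""
    return free_slots / peak_pkts_per_sec

def predict_safe_period_from_burst(free_slots: int, burst_size: int,
                                   burst_window_s: float) -> float:
    """Same prediction, using the instantaneous receipt rate implied by
    burst-size metadata in recent packet descriptors (block 220)."""
    observed_rate = burst_size / burst_window_s
    return free_slots / observed_rate
```

For instance, under these assumptions, an input queue with 1,024 free slots and a NIC peak receipt rate of 10 million packets per second would yield a worst-case safe period of 1,024 / 10,000,000 ≈ 102 microseconds.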
Referring now to
The compute device 110 may also determine the predicted time period for an output queue 154 to fill up, as indicated in block 234. In doing so, the compute device 110 may determine the predicted time period as a function of the rate at which the NIC 122 sends packets (e.g., from the output queue 154) out of the compute device 110 (e.g., to another compute device), as indicated in block 236. As indicated in block 238, the compute device 110 may determine the predicted time period as a function of a historical rate at which the NIC 122 has sent packets (e.g., an average rate over a predefined time period). The compute device 110 may also determine the time period as a function of a rate at which packets are added to the output queue 154, as indicated in block 240. In doing so, the compute device 110 may determine the rate as a function of the types of workloads presently being executed by the compute device 110, as indicated in block 242. The compute device 110 may determine the rate as a function of the computational load associated with each workload type, as indicated in block 244. Additionally or alternatively, as indicated in block 246, the compute device 110 may determine the rate as a function of a historical rate of packet generation for each workload type (e.g., a compression workload may have generated 200 packets per second over a predefined time period while an encryption workload may have generated 400 packets per second over the predefined time period, etc.). Subsequently, the method 200 advances to block 248 of
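A minimal sketch of this output-queue prediction follows, assuming the per-workload historical generation rates and the NIC send rate are available as plain numbers; the names are illustrative, not an interface from this disclosure.

```python
# Minimal sketch of blocks 236-246 for the output queue 154; names and
# the specific formula are hypothetical illustrations.
import math

def predict_output_safe_period(free_slots: int,
                               nic_send_rate: float,
                               workload_rates: dict) -> float:
    """Time until the output queue fills, given how fast workloads add
    packets and how fast the NIC drains them."""
    # e.g. {"compression": 200.0, "encryption": 400.0} packets per second
    add_rate = sum(workload_rates.values())
    net_fill_rate = add_rate - nic_send_rate   # packets/sec accumulating
    if net_fill_rate <= 0:
        return math.inf   # the NIC drains at least as fast as workloads fill
    return free_slots / net_fill_rate
```

Using the historical rates from the example above (200 packets per second for a compression workload and 400 for an encryption workload) with a hypothetical NIC send rate of 500 packets per second, the net fill rate would be 100 packets per second, so an output queue with 1,000 free slots would be predicted to fill in roughly 10 seconds.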
Referring now to block 250 of
In block 258, the compute device 110 determines whether the predicted time period associated with the input queue 152 (e.g., from block 212) has elapsed. If so, the compute device 110 polls the input queue 152 and removes any received packets. In polling the input queue 152, the compute device 110 may suspend execution of a workload task performed by the core assigned to poll the input queue 152 (e.g., the core 142). Additionally, in block 264, the compute device 110 may determine whether a predicted time period (e.g., a time period determined in block 234) associated with the output queue 154 has elapsed. If so, the compute device 110 may poll the output queue 154 for any packets and send those packets out of the compute device 110 with the NIC 122. In doing so, the compute device 110 may also suspend execution of a workload task performed by the polling core 142. Subsequently, the compute device 110 loops back to block 202 of
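Pulling these steps together, the following sketch (reusing the hypothetical PacketQueue above) shows one possible shape of the loop executed by the core 142, which both advances the workload during the predicted safe periods and polls the queues when those periods elapse. It is an illustration of the described flow under stated assumptions, not an implementation from this disclosure.

```python
# Minimal sketch of the shared polling/workload loop (blocks 258 and 264);
# all helper names (nic_send, run_workload_slice, predictors) are hypothetical.
import time

def shared_core_loop(input_queue, output_queue, nic_send, run_workload_slice,
                     predict_input_period, predict_output_period) -> None:
    input_deadline = time.monotonic() + predict_input_period()
    output_deadline = time.monotonic() + predict_output_period()
    while True:
        # During the predicted safe periods, the core advances the workload
        # instead of spinning on the queues.
        run_workload_slice()

        now = time.monotonic()
        if now >= input_deadline:
            # Block 258: input period elapsed; briefly suspend the workload
            # and drain any received packets.
            received = input_queue.dequeue_all()
            # ... hand `received` off to the workload for processing ...
            input_deadline = time.monotonic() + predict_input_period()

        if now >= output_deadline:
            # Block 264: output period elapsed; poll the output queue and
            # send any pending packets with the NIC.
            for packet in output_queue.dequeue_all():
                nic_send(packet)
            output_deadline = time.monotonic() + predict_output_period()
```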
Referring now to
Illustrative examples of the technologies disclosed herein are provided below. An embodiment of the technologies may include any one or more, and any combination of, the examples described below.
Example 1 includes a compute device comprising a network interface controller; a compute engine that includes a plurality of cores and a memory to include a queue to store packets received by the network interface controller; wherein the compute engine is configured to determine a predicted time period for the queue to receive packets without overflowing; execute, during the time period and with a core that is assigned to periodically poll the queue for packets, a workload; and poll, with the assigned core, the queue to remove packets from the queue.
Example 2 includes the subject matter of Example 1, and wherein the network interface controller is configured to write, to a packet descriptor associated with a received packet, metadata indicative of a computational load associated with the received packet.
Example 3 includes the subject matter of any of Examples 1 and 2, and wherein the metadata is indicative of a number of packets received by the network interface controller within a predefined time period.
Example 4 includes the subject matter of any of Examples 1-3, and wherein the metadata is indicative of a type of workload to be executed by one or more of the cores.
Example 5 includes the subject matter of any of Examples 1-4, and wherein the queue is an input queue and the memory further comprises an output queue to store packets that are to be sent from the compute device by the network interface controller and the compute engine is further to determine a second predicted time period for the output queue to receive packets without overflowing and to execute, with a core that is assigned to poll the output queue, a workload during the second predicted time period.
Example 6 includes the subject matter of any of Examples 1-5, and wherein to determine the predicted time period comprises to determine the predicted time period as a function of a capacity of the queue and a peak packet receipt rate of the network interface controller.
Example 7 includes the subject matter of any of Examples 1-6, and wherein to determine the predicted time period comprises to determine the predicted time period as a function of metadata present in a packet descriptor of a received packet, wherein the metadata is indicative of a computational load associated with the received packet.
Example 8 includes the subject matter of any of Examples 1-7, and wherein to determine the predicted time period comprises to determine the predicted time period as a function of a rate at which packets are removed from the queue.
Example 9 includes the subject matter of any of Examples 1-8, and wherein to determine the predicted time period comprises to determine the predicted time period as a function of a historical rate at which packets have been removed from the queue.
Example 10 includes the subject matter of any of Examples 1-9, and wherein to determine the predicted time period comprises to determine a rate at which packets are removed from the queue as a function of a workload type that each received packet pertains to and a computational load associated with each workload type.
Example 11 includes the subject matter of any of Examples 1-10, and wherein the compute engine is further configured to determine each workload type at least from metadata present in a packet descriptor associated with each packet.
Example 12 includes the subject matter of any of Examples 1-11, and wherein the compute engine is further configured to adjust the rate in an inverse relationship with the computational load associated with each workload type.
Example 13 includes the subject matter of any of Examples 1-12, and wherein the compute engine is further configured to determine the rate as a weighted average rate for a set of the packets associated with multiple different workload types.
Example 14 includes the subject matter of any of Examples 1-13, and wherein the compute engine is further to operate the assigned core in a low power state for a portion of the predicted time period.
Example 15 includes one or more machine-readable storage media comprising a plurality of instructions stored thereon that, in response to being executed, cause a compute device to determine a predicted time period for a queue to receive packets without overflowing; execute, during the time period and with a core that is assigned to periodically poll the queue for packets, a workload; and poll, with the assigned core, the queue to remove the packets from the queue.
Example 16 includes the subject matter of Example 15, and wherein the plurality of instructions further cause the compute device to write, to a packet descriptor associated with a received packet, metadata indicative of a computational load associated with the received packet.
Example 17 includes the subject matter of any of Examples 15 and 16, and wherein the plurality of instructions further cause the compute device to write metadata indicative of a number of packets received by a network interface controller of the compute device within a predefined time period.
Example 18 includes the subject matter of any of Examples 15-17, and wherein the plurality of instructions further cause the compute device to write metadata indicative of a type of workload to be executed by one or more of the cores.
Example 19 includes the subject matter of any of Examples 15-18, and wherein the queue is an input queue and the plurality of instructions further cause the compute device to determine a second predicted time period for an output queue to receive packets without overflowing and to execute, with a core that is assigned to poll the output queue, a workload during the second predicted time period.
Example 20 includes the subject matter of any of Examples 15-19, and wherein the plurality of instructions further cause the compute device to determine the predicted time period as a function of a capacity of the queue and a peak packet receipt rate of the network interface controller.
Example 21 includes the subject matter of any of Examples 15-20, and wherein the plurality of instructions further cause the compute device to determine the predicted time period as a function of metadata present in a packet descriptor of a received packet, wherein the metadata is indicative of a computational load associated with the received packet.
Example 22 includes a method comprising determining, by a compute device, a predicted time period for a queue to receive packets without overflowing; executing, during the time period and with a core of the compute device that is assigned to periodically poll the queue for packets, a workload; and polling, with the assigned core, the queue to remove the packets from the queue.
Example 23 includes the subject matter of Example 22, and further including writing, with a network interface controller and to a packet descriptor associated with a received packet, metadata indicative of a computational load associated with the received packet.
Example 24 includes the subject matter of any of Examples 22 and 23, and wherein writing the metadata comprises writing a number of packets received by the network interface controller within a predefined time period.
Example 25 includes a compute device comprising means for determining a predicted time period for a queue to receive packets without overflowing; circuitry for executing, during the time period and with a core of the compute device that is assigned to periodically poll the queue for packets, a workload; and circuitry for polling, with the assigned core, the queue to remove the packets from the queue.