With semiconductor manufacturing minimum feature sizes reaching into the single nanometers, semiconductor chips are being developed that integrate significant amounts of disparate functionality on a single chip. As such, system designers are interested in ways to combine these functions to effect complex computational processes.
A new data center paradigm is emerging in which “infrastructure” tasks are offloaded from traditional general purpose “host” CPUs (where application software programs are executed) to an infrastructure processing unit (IPU), data processing unit (DPU) or smart networking interface card (SmartNIC), any/all of which are hereafter referred to as an IPU.
Network-based computer services, such as those provided by cloud services and/or large enterprise data centers, commonly execute application software programs for remote clients. Here, the application software programs typically execute a specific (e.g., “business”) end-function (e.g., customer servicing, purchasing, supply-chain management, email, etc.). Remote clients invoke/use these applications through temporary network sessions/connections that are established by the data center between the clients and the applications.
In order to support the network sessions and/or the applications' functionality, however, certain underlying computationally intensive and/or traffic intensive functions (“infrastructure” functions) are performed.
Examples of infrastructure functions include encryption/decryption for secure network connections, compression/decompression for smaller footprint data storage and/or network communications, virtual networking between clients and applications and/or between applications, packet processing, ingress/egress queuing of the networking traffic between clients and applications and/or between applications, ingress/egress queuing of the command/response traffic between the applications and mass storage devices, error checking (including checksum calculations to ensure data integrity), distributed computing remote memory access functions, etc.
Traditionally, these infrastructure functions have been performed by the host CPUs “beneath” their end-function applications. However, the intensity of the infrastructure functions has begun to affect the ability of the host CPUs to perform their end-function applications in a timely manner relative to the expectations of the clients, and/or, perform their end-functions in a power efficient manner relative to the expectations of data center operators. Moreover, the host CPUs, which are typically complex instruction set computer (CISC) processors, are better utilized executing the processes of a wide variety of different application software programs than the more mundane and/or more focused infrastructure processes.
As such, as observed in
Here, for instance, the mass storage pool 102 includes numerous storage devices 106 (e.g., solid state drives (SSDs)) to support “big data” applications, database applications or even remotely calling clients that desire to access data that has been previously stored in a mass storage pool 102. The application acceleration resource pool 103 includes numerous specialized processors (acceleration cores) 107 (e.g., GPUs) that are tuned to better perform certain numerically intensive, application level tasks (e.g., machine learning of customer usage patterns, image processing, etc.). In a common scenario, applications 105 running on the host CPUs 104 access a mass storage pool 102 to obtain data that the applications perform operations upon, and/or, invoke an acceleration resource pool 103 to “speed-up” certain numerically intensive functions.
The host CPU, mass storage and acceleration pools 101, 102, 103 are coupled to one another by one or more networks 108. Notably, each pool 101, 102, 103 has an IPU 109_1, 109_2, 109_3 on its front end or network side. Here, the IPU 109 performs pre-configured infrastructure functions on the inbound (request) packets it receives from the network 108 before delivering the requests to its respective pool's end function (e.g., application software program 105, mass storage device 106, acceleration core 107). As the end functions send their output responses (e.g., application software resultants, read data, acceleration resultants), the IPU 109 performs pre-configured infrastructure functions on the outbound packets before transmitting them into the network 108.
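By way of illustration only, the following minimal C sketch models this front-end behavior in software; all type and function names (ipu_packet, apply_infra_functions, etc.) are hypothetical placeholders and not an actual IPU programming interface.

    #include <stddef.h>

    /* Hypothetical types and helpers; not an actual IPU API. */
    struct ipu_packet { void *data; size_t len; int inbound; };

    static void apply_infra_functions(struct ipu_packet *p) { (void)p; /* e.g., (de)cryption, (de)compression, queuing */ }
    static void deliver_to_end_function(struct ipu_packet *p) { (void)p; /* application 105, SSD 106 or acceleration core 107 */ }
    static void transmit_to_network(struct ipu_packet *p) { (void)p; /* egress into network 108 */ }

    /* Front-end dispatch: infrastructure functions are applied before an
     * inbound request reaches the pool's end function, and again before an
     * outbound response is transmitted into the network. */
    void ipu_front_end(struct ipu_packet *pkt)
    {
        apply_infra_functions(pkt);
        if (pkt->inbound)
            deliver_to_end_function(pkt);
        else
            transmit_to_network(pkt);
    }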
The general purpose processing cores 211, by contrast, will perform their tasks more slowly and with more power consumption but can be programmed to perform a wide variety of different functions (via the execution of software programs). Here, it is notable that although the processing cores can be general purpose CPUs like the data center's host CPUs 104, in many instances the IPU's general purpose processors 211 are reduced instruction set computer (RISC) based processors rather than CISC based processors (which the host CPUs 104 are typically implemented with). That is, the host CPUs 104 that execute the data center's application software programs 105 tend to be CISC based processors because of the extremely wide variety of different tasks that the data center's application software could be programmed to perform.
By contrast, the infrastructure functions performed by the IPUs tend to be a more limited set of functions that are better served with a RISC processor. As such, the IPU's RISC processors can perform the infrastructure functions with noticeably less power consumption than CISC processors without significant loss of performance.
The FPGA(s) 212 provide for more programming capability than an ASIC block but less programming capability than the general purpose cores 211, while, at the same time, providing for more processing performance capability than the general purpose cores 211 but less processing performance capability than an ASIC block.
The IPU 309 also includes multiple memory channel interfaces 328 to couple to external memory 329 that is used to store instructions for the general purpose cores 311 and input/output data for the IPU cores 311 and each of the ASIC blocks 321-326. The IPU 309 further includes multiple PCIe physical interfaces and an Ethernet Media Access Control (MAC) block 330 to implement network connectivity to/from the IPU 309.
In the particular example of
When the application desires to store the page 431 of data, a message is sent 1 from the application to an IPU core 411 that is processing the storage software stack 440 on behalf of the application. The transport layer 441 of the software stack 440 receives the message and invokes 2 the NVMe ASIC block 422 through the NVMe ASIC block's device driver 445. Here, the NVMe ASIC block 422 includes a direct-memory-access (DMA) sub-ASIC (sASIC) 426 for performing DMAs in hardware. In response to the invocation 2 from the transport layer 441, the DMA sASIC 426 (under control of device driver 445) reads 3 the memory page 431 from host CPU memory 410 and stores 4 the page 432 in IPU memory 429. (The DMA sASIC 426 typically receives the page's location in host CPU memory 410 from the transport layer 441, via device driver 445, which received it from the application as part of the initial storage request 1.)
The storage software stack 440 then proceeds to execute at different layers of the stack as appropriate to perform the storage operation. Here, the NVMe target layer 442 mimics the NVMe protocol behavior of an NVMe storage device so that the application is presented with an experience “as if” it were communicating directly with a storage device. The block device layer 443 includes functionality that is traditionally found in an operating system directly above a storage device's device driver for communicating with and controlling the storage device (e.g., submission queuing, completion queuing, timeout monitoring, reset handling, etc.).
The block device layer 443 also includes an encryption engine 444 that invokes the device driver 446 of the encryption ASIC block 425 if encryption is to be performed. Thus, if the page of data 432 is to be encrypted before it is physically sent to remote storage, the block device layer 443 will invoke the encryption engine 444 which, in turn, invokes 5 the encryption ASIC block 425 through its device driver 446 to encrypt the page. In response to the invocation 5, the encryption ASIC block 425 will read 6 the DMA'd page 432 from its location in IPU memory 429, encrypt the page and store the encrypted page 433 as another page in IPU memory 429.
Thus, for any page to be stored requiring encryption, there will be two separate instances of the page 432, 433 stored in IPU memory 429 (the unencrypted page 432 that was received via DMA and the encrypted page 433). Here, the transport and block device layers 441, 443 in the software stack 440 operate in isolation as two separate processes that write their respective outputs 4, 7 (the DMA'd page 432 and the encrypted page 433) to two different locations in IPU memory 429.
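By way of illustration, the two-pass flow just described can be sketched in C as follows; the buffer handling and function names (dma_read_host_page, encrypt_page) are hypothetical stand-ins for invocations 2 and 5 of the device drivers 445, 446, not actual driver interfaces.

    #include <stdlib.h>

    #define PAGE_SIZE 4096  /* illustrative page size */

    /* Hypothetical driver entry points standing in for invocations 2 and 5. */
    static void dma_read_host_page(void *dst) { (void)dst; /* DMA sASIC 426 copies page 431 */ }
    static void encrypt_page(const void *src, void *dst) { (void)src; (void)dst; /* encryption ASIC block 425 */ }

    /* Two-pass flow: the unencrypted page 432 and the encrypted page 433
     * each occupy a full page of IPU memory 429 at the same time. */
    void store_page_two_pass(void)
    {
        void *plain  = malloc(PAGE_SIZE);  /* page 432 (DMA output 4) */
        void *cipher = malloc(PAGE_SIZE);  /* page 433 (encryption output 7) */

        dma_read_host_page(plain);    /* transport layer 441 path */
        encrypt_page(plain, cipher);  /* block device layer 443 path */

        /* ... cipher is sent toward remote storage ... */
        free(plain);
        free(cipher);
    }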
A problem is the limited amount of memory 429 that is available on the IPU 409. Consumption of two entire pages 432, 433 of data in IPU memory 429 per page write operation is inefficient (e.g., with 4 kilobyte (KB) pages, each in-flight page write transiently consumes 8 KB of IPU memory 429) and can create situations where the IPU 409 does not have enough memory space to process inbound data at the rate it is expected to.
A solution, as observed in
That is, the 64B units that are sequentially brought into the IPU 509 by the DMA sASIC 526 can be streamed directly (or indirectly via a 64B buffer in IPU memory 529 or IPU register space) to the encryption ASIC block 525, which encrypts the 64B units as they arrive and then stores them in IPU memory 529 in encrypted form 533.
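The streaming behavior can be pictured with the following hypothetical C sketch, which models the two hardware blocks as functions operating on 64B units; in an actual IPU these steps are performed by the DMA sASIC 526 and encryption ASIC block 525 themselves.

    #include <stdint.h>
    #include <stddef.h>

    #define UNIT 64          /* illustrative DMA transfer granularity */
    #define PAGE_SIZE 4096

    /* Hypothetical models of the hardware blocks. */
    static void dma_fetch_unit(size_t off, uint8_t out[UNIT]) { (void)off; (void)out; /* 64B from host CPU memory */ }
    static void encrypt_unit(const uint8_t in[UNIT], uint8_t out[UNIT]) { (void)in; (void)out; /* encryption ASIC block 525 */ }

    /* Chained flow: each 64B unit is encrypted as it arrives, so only the
     * encrypted page 533 ever occupies a full page of IPU memory 529. */
    void store_page_chained(uint8_t cipher_page[PAGE_SIZE])
    {
        uint8_t unit[UNIT];  /* single 64B staging buffer, not a full page */

        for (size_t off = 0; off < PAGE_SIZE; off += UNIT) {
            dma_fetch_unit(off, unit);
            encrypt_unit(unit, cipher_page + off);
        }
    }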
According to one approach, the transport layer 541 of the software stack 540 is expanded to include its own encryption engine 544 so that the transport layer 541 can invoke the encryption ASIC 525 device driver 546 and arrange for encryption to be performed directly on DMA output 7.
Unfortunately, the addition of a second encryption engine to the transport layer 541 increases the size of the IPU's overall code footprint (the software stack 540 now includes two encryption engines (one at each of the transport 541 and block device 543 layers) instead of just one encryption engine at the block device layer 543). The expansion of program code does not necessarily solve the problem because, although additional IPU memory space is not needed for a second version of the payload, more IPU memory space is nevertheless needed for the program code of the second encryption engine.
A better approach, as observed in
As observed in
Here, when the transport layer 541 receives the request 1 from the application software, the transport layer 541 passes 2 a variable “X” to the acceleration framework 547 in order to effectively request use of the DMA sASIC 526.
Importantly, the transport layer 541 is also written to invoke 3 the encryption ASIC block 525 through the block device layer 543 and its encryption engine 544, which includes passing the same variable “X” to the encryption engine 544. Thus, unlike the above mentioned approach, which incorporates an entire second encryption engine into the transport layer 541, in the improved approach of
In response to the invocation 3 from the transport layer 541, the encryption engine 544 within the block device layer 543 passes 4 the same variable “X” to the acceleration framework 547 in order to effectively request use of the encryption ASIC block 525. When the acceleration framework 547 observes two concurrent requests 2, 4 that passed the same variable “X”, the acceleration framework 547 understands that chaining is being requested.
That is, the acceleration framework 547 understands that the encryption ASIC 525 is to operate directly upon the DMA output 7 from the DMA sASIC 526. The acceleration framework 547 then arranges, through appropriate manipulation 5 of device drivers 545, 546, for the DMA output stream 7 to be presented as an input stream to the encryption ASIC 525. As such, a full un-encrypted page is never stored in the IPU memory 529. Rather, at the completion of the chained operation, only a full encrypted page is stored in the IPU memory 529.
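As a rough C sketch of the layer-side behavior (all names are hypothetical; the opaque token stands in for the variable “X”):

    #include <stddef.h>

    /* Hypothetical identifiers; the token "X" is an opaque value shared by
     * both requests so that the acceleration framework 547 can correlate
     * them. */
    typedef void *chain_token_t;
    enum accel_id { ACCEL_DMA, ACCEL_ENCRYPT };

    static void accel_framework_request(enum accel_id id, chain_token_t X, const void *arg)
    { (void)id; (void)X; (void)arg; /* see the framework-side sketch below */ }

    /* Block device layer 543 / encryption engine 544: forwards the same
     * token X to the acceleration framework 547 (request 4). */
    static void block_device_encrypt(chain_token_t X)
    {
        accel_framework_request(ACCEL_ENCRYPT, X, NULL);   /* request 4 */
    }

    /* Transport layer 541: requests the DMA sASIC 526 with token X
     * (request 2), then invokes 3 the encryption engine with the same X. */
    void transport_layer_store(chain_token_t X, const void *host_page)
    {
        accel_framework_request(ACCEL_DMA, X, host_page);  /* request 2 */
        block_device_encrypt(X);                           /* invocation 3 */
    }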
In various embodiments, the acceleration framework 547 is designed to pause, suspend or otherwise not take immediate action when a layer from a software stack attempts to invoke an accelerator. For example, upon the acceleration framework 547 receiving the earlier invocation 2 from the transport layer 541, rather than immediately invoke the NVMe device driver 545, the acceleration framework 547 pauses and waits to see if any immediately following invocations are made from other (e.g., deeper) layers of the software stack to the acceleration framework 547 using the same value “X”.
Thus, in various embodiments, the acceleration framework 547 is designed to wait for an appropriate number of machine cycles to see if any subsequent invocations include the same variable. Once enough time has passed for all lower stack layers to have invoked an ASIC block, the acceleration framework 547 understands which ASIC blocks are to be chained. Note that later invocations typically are made by deeper layers of the software, which, in turn, correspond to operations that are properly performed later (first, DMA of the page, and then, encryption of the page). Thus, the acceleration framework 547 can infer the correct order of the ASIC block chain from the order in which invocations that include the same value are received by the acceleration framework 547.
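A simplified, purely illustrative C sketch of this deferred-binding behavior follows; a real framework would use callbacks, per-device queues and a wait window measured in machine cycles, none of which are shown.

    #define MAX_CHAIN 4

    typedef void *chain_token_t;
    enum accel_id { ACCEL_DMA, ACCEL_ENCRYPT, ACCEL_CRC, ACCEL_COMPRESS };

    /* Invocations that arrived with the same token, kept in arrival order;
     * deeper layers invoke later, so arrival order is chain order. */
    struct pending_chain {
        chain_token_t token;
        enum accel_id stage[MAX_CHAIN];
        int           depth;
    };

    /* Record an invocation without yet programming any device driver. */
    static void collect(struct pending_chain *pc, enum accel_id id)
    {
        if (pc->depth < MAX_CHAIN)
            pc->stage[pc->depth++] = id;
    }

    /* Hypothetical driver manipulation 5: stream 'from' output into 'to'. */
    static void route_output_to_input(enum accel_id from, enum accel_id to)
    { (void)from; (void)to; /* program device drivers 545, 546 */ }

    /* After the wait window expires, bind the chain in arrival order. */
    static void bind_chain(const struct pending_chain *pc)
    {
        for (int i = 0; i + 1 < pc->depth; i++)
            route_output_to_input(pc->stage[i], pc->stage[i + 1]);
    }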
In various embodiments, the storage software stack 540 includes layers from the Storage Performance Development Kit (SPDK), such as lower layers 542, 543, and the value X is a virtual “memory domain” (SPDK permits its software layers to receive virtual memory domains as input variables) or other virtual/dummy memory location value.
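For example, a sketch using SPDK's memory-domain API (spdk/dma.h), under the assumption that the shared value X is implemented as an SPDK memory domain handle; error handling is abbreviated:

    #include "spdk/dma.h"

    /* Create a memory domain whose handle serves as the shared value "X"
     * passed through the SPDK-based layers 542, 543. */
    struct spdk_memory_domain *make_chain_token(void)
    {
        struct spdk_memory_domain *domain = NULL;

        if (spdk_memory_domain_create(&domain, SPDK_DMA_DEVICE_TYPE_DMA,
                                      NULL, "ipu_chain_token") != 0)
            return NULL;

        return domain;  /* this pointer is the value "X" */
    }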
In various embodiments, the acceleration framework 547 is designed with knowledge of which ASIC blocks can be chained and in which order, including chains of more than two ASIC blocks. For those ASIC block chains that the underlying hardware can actually implement, the acceleration framework 547 or associated metadata identifies these “workable” chains so that the various layers of software executed by the IPU processing cores 511 can be written to construct them. As invocations are received by the acceleration framework 547 during runtime, the acceleration framework 547 first confirms that any requested chain (as inferred from a series of invocations that include the same variable) is included in the list of workable chains.
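A minimal C sketch of such a runtime check follows; the table contents and names are illustrative only:

    #include <stdbool.h>
    #include <stddef.h>

    enum accel_id { ACCEL_DMA, ACCEL_ENCRYPT, ACCEL_CRC, ACCEL_END };

    /* Illustrative list of chains the underlying hardware can actually
     * implement, each terminated by ACCEL_END. */
    static const enum accel_id workable_chains[][4] = {
        { ACCEL_DMA, ACCEL_ENCRYPT, ACCEL_END },
        { ACCEL_DMA, ACCEL_ENCRYPT, ACCEL_CRC, ACCEL_END },
    };

    /* Confirm that a requested chain (in invocation order) is workable. */
    static bool chain_is_workable(const enum accel_id *req, int len)
    {
        if (len < 1 || len > 3)
            return false;
        for (size_t i = 0; i < sizeof(workable_chains) / sizeof(workable_chains[0]); i++) {
            int j = 0;
            while (j < len && workable_chains[i][j] == req[j])
                j++;
            if (j == len && workable_chains[i][j] == ACCEL_END)
                return true;
        }
        return false;
    }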
Although embodiments above have emphasized the chaining of ASIC blocks for storage purposes, other chaining possibilities can be implemented depending on hardware capability. Examples include chaining:
1) calculation of cyclic redundancy check (CRC) information on an outbound page to be stored, with a third ASIC block that follows the DMA and encryption engine blocks;
2) CRC calculation (first ASIC block) and decryption (second ASIC block) for a page being read from remote storage;
3) DMA (first ASIC block) and digital signature determination (integrity check value (ICV)) (second ASIC block) for IPSec egress packets;
4) DMA (first ASIC block), ICV determination (second ASIC block) and encryption (third ASIC block) for IPSec egress packets;
5) authentication (first ASIC block) and decryption (second ASIC block) for IPSec ingress packets;
6) DMA (first ASIC block), then compression (second ASIC block), then encryption (third ASIC block) (e.g., in an egress direction);
7) decryption (first ASIC block), then decompression (second ASIC block), then DMA (third ASIC block) (e.g., in an ingress direction); etc.
For the IPSec implementations described just above, the software layers that invoke the acceleration framework can be Data Plane Development Kit (DPDK) software layers of a DPDK software stack.
In various embodiments the device drivers 545, 546 and acceleration framework 547 execute on the same IPU processing core. The software stack 540 can also execute on the same IPU processing core as the device drivers 545, 546 and acceleration framework 547, or, on a different IPU processing core. In other embodiments the device drivers 545, 546 operate on a different processing core than the acceleration framework 547.
Referring back to
Also, in various embodiments, the platform on which the applications 105 are executed includes a virtual machine monitor (VMM), or hypervisor, that instantiates multiple virtual machines (VMs). Operating system (OS) instances respectively execute on the VMs and the applications execute on the OS instances. Alternatively or in combination, container engines (e.g., Kubernetes container engines) respectively execute on the OS instances. The container engines provide virtualized OS instances, and containers respectively execute on the virtualized OS instances. The containers provide isolated execution environments for suites of applications, which can include applications for micro-services.
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in program code (e.g., machine-executable instructions). The program code, when processed, causes a general-purpose or special-purpose processor to perform the program code's processes. Alternatively, these processes may be performed by specific/custom hardware components that contain hard wired interconnected logic circuitry (e.g., application specific integrated circuit (ASIC) logic circuitry) or programmable logic circuitry (e.g., field programmable gate array (FPGA) logic circuitry, programmable logic device (PLD) logic circuitry) for performing the processes, or by any combination of program code and logic circuitry.
Elements of the present invention may also be provided as a machine-readable medium for storing the program code. The machine-readable medium can include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, or other types of media/machine-readable media suitable for storing electronic instructions.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.