Many different solutions have been proposed to offload host networking processes to hardware. For example, smart network interface cards (SmartNICs) based on field-programmable gate arrays (FPGAs) have been contemplated. Such solutions offer programmability comparable to software along with performance and efficiency comparable to hardware. Other solutions include SmartNICs based on application-specific integrated circuits (ASICs), which provide cost-effective performance but are less flexible than FPGA-based SmartNICs.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Implementations of architectures for network processing using a fixed-function logic per-operation (per-op) component close-coupled with programmable logic and software are provided. One aspect provides an integrated circuit device for network processing, the device comprising a composable processing pipeline that includes a programmable per-op component and a fixed-function logic per-op component that is close-coupled with programmable logic and software. The device further comprises a compute complex component comprising processing circuitry implementing the software for controlling the programmable per-op component and the fixed-function logic per-op component, wherein for a first processing pipeline, the processing circuitry is configured to perform a first function using the programmable per-op component, and for a second processing pipeline, the processing circuitry is configured to perform a second function using the fixed-function logic per-op component.
Network processing devices, such as SmartNICs, can be implemented in various ways. Common implementations of such devices use FPGAs and/or ASICs. Different implementations and architectures may be designed to be application-specific, providing various functionalities for different purposes. FPGAs are programmable/re-programmable integrated circuits that provide high flexibility. For example, their programmability/re-programmability allows for more standardized manufacturing and interfaces while still enabling their use in different applications. ASIC architectures, on the other hand, are generally manufactured for specific functions/purposes. As such, they generally operate at higher speeds and perform their intended functions more efficiently than other logic devices. Additionally, because they are manufactured for specific purposes, their space requirements are comparatively lower than those of other logic devices. However, these advantages must be weighed against high initial development and testing costs.
In some SmartNIC architectures, a combination of both FPGA and ASIC designs is employed. Many such devices are generally implemented with three major components: a per-op component, a per-byte component, and a control component.
The modules 102-106 can be implemented with various components and hardware architectures. The per-op module 102 can include one or more per-op components, which are programmable components that can provide various functions. For example, the per-op components can process the headers and metadata of network packets and/or storage transactions. The per-op components can be implemented to support programmability at full per-operation rates. In some implementations, a hardened path is used to reduce power in common cases while still providing full programmability support for every operation. Otherwise, power consumption would be dictated by processing every operation in a programmable way. In the example integrated circuit device 100, the per-op components are implemented with FPGA programmable logic. Other types of programmable logic devices can also be used. In some implementations, the per-op components are implemented with one or more microcontrollers.
The per-byte module 104 can include one or more per-byte components, which provide compute-intensive functions. The per-byte components are generally implemented in hard logic and are not programmable, although in some implementations the per-byte module 104 includes a component that is configurable. A per-byte component can be considered a data path processor controlled by a per-op component. The per-byte module 104 provides interfaces, such as PCIe physical layers (PHYs) and controllers and Ethernet PHYs and controllers, as well as data movement, transformation (e.g., crypto), and computational (e.g., cyclic redundancy check (CRC)) capabilities. For example, the per-byte components can process the data bytes of each network packet and/or storage transaction and can provide input/output (IO) interfaces. A per-byte component can accept commands with operands from a per-op component. For example, such a command can include “Read host data specified by provided gather-list into buffer, CRC-ing, encrypting, decrypting, checksum-ing, and CRC-ing while doing so.” In the example integrated circuit device 100, the per-byte components are implemented with ASICs.
The compute complex module 106 can be implemented with a processor-based compute subsystem. For example, the compute complex module 106 can be implemented using various central processing unit (CPU) architectures. In some implementations, the compute complex module 106 includes a plurality of CPU cores configured to run control plane software agents.
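To make the idea of a per-byte command with operands more concrete, the following is a minimal sketch, assuming a command consists of a gather list of (offset, length) regions in host memory plus an ordered list of operations. The names `PerByteCommand`, `execute`, and the op strings are hypothetical and chosen only for illustration. Only CRC-32 (via Python's `zlib.crc32`) and the RFC 1071 Internet checksum are modeled; encryption is omitted. Hardware would compute these values while streaming the data, whereas this sketch gathers first and then computes, for clarity.

```python
import zlib
from dataclasses import dataclass, field


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


@dataclass
class PerByteCommand:
    """Hypothetical command a per-op component hands to a per-byte engine."""
    gather_list: list  # (offset, length) pairs into host memory
    ops: list = field(default_factory=list)  # e.g. ["crc32", "inet_checksum"]


def execute(cmd: PerByteCommand, host_memory: bytes) -> dict:
    """Gather host data into one buffer, then run each requested op over it."""
    buf = b"".join(host_memory[off:off + n] for off, n in cmd.gather_list)
    results = {"buffer": buf}
    for op in cmd.ops:
        if op == "crc32":
            results["crc32"] = zlib.crc32(buf)
        elif op == "inet_checksum":
            results["inet_checksum"] = internet_checksum(buf)
        else:
            raise ValueError(f"unsupported per-byte op: {op!r}")
    return results


# Usage: gather two 4-byte regions of a toy "host memory" and checksum them.
host_memory = bytes(range(16))
cmd = PerByteCommand(gather_list=[(0, 4), (8, 4)], ops=["crc32", "inet_checksum"])
out = execute(cmd, host_memory)
print(hex(out["inet_checksum"]))  # 0xebe7
```

The per-op side only builds the small `PerByteCommand`; the byte-level work stays in the per-byte engine, mirroring the split between per-op and per-byte components described above.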
The example processing pipeline 200 depicts the data flow of incoming packets received from a network connection, such as an Ethernet connection. The incoming packets arrive at the first ASIC per-byte component 204A, which performs an outer partial checksum operation. Packet headers and metadata are then sent to the first FPGA per-op component 202A for packet processing. The data is then sent to the second ASIC per-byte component 204B, along with command data for invoking the second ASIC per-byte component 204B to perform its intended function. In the example pipeline 200, the second ASIC per-byte component 204B performs decryption and Internet checksum functions. The packet headers and metadata are then sent to the second FPGA per-op component 202B for packet processing. The data is then sent to the third ASIC per-byte component 204C, along with command data for invoking the third ASIC per-byte component 204C to perform its intended function. In the example pipeline 200, the third ASIC per-byte component 204C performs packet editing and direct memory access (DMA) to the host system.
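The alternation between per-byte and per-op stages in pipeline 200 can be traced with a short model. This is a sketch under stated assumptions, not the device's implementation: the `Stage` type and `run` function are hypothetical, and the only rule enforced is the one described above, namely that each per-byte stage after the first is driven by a command produced by the preceding per-op stage, while the first per-byte stage is triggered by packet arrival.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    component: str  # reference numeral used in the text, e.g. "204A"
    kind: str       # "per-byte" (ASIC) or "per-op" (FPGA)
    function: str   # function the stage performs


# Receive-side flow of example pipeline 200, in traversal order.
PIPELINE_200 = [
    Stage("204A", "per-byte", "outer partial checksum"),
    Stage("202A", "per-op", "packet processing"),
    Stage("204B", "per-byte", "decryption + Internet checksum"),
    Stage("202B", "per-op", "packet processing"),
    Stage("204C", "per-byte", "packet editing + DMA to host"),
]


def run(pipeline: list[Stage]) -> list[str]:
    """Walk the pipeline and log which component did what, and why.

    Per-op stages see only headers/metadata and emit a command; the next
    per-byte stage consumes that command. The first per-byte stage is
    triggered by packet arrival instead of a command.
    """
    log = []
    pending = "arrival"
    for stage in pipeline:
        if stage.kind == "per-op":
            log.append(f"{stage.component}: {stage.function} on headers/metadata")
            pending = f"command from {stage.component}"
        else:
            if pending is None:
                raise RuntimeError(f"{stage.component} has no command to execute")
            log.append(f"{stage.component}: {stage.function} ({pending})")
            pending = None
    return log


for line in run(PIPELINE_200):
    print(line)
```

Every per-op hop in this model is also a crossing between ASIC and FPGA, which is the data movement the composable architecture described next aims to reduce.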
The implementations described in
In view of the observations above, network processing device architectures including a composable processing pipeline are provided. The composable processing pipeline includes a configurable fixed-function logic per-op component that is close-coupled with programmable logic and software. In some implementations, the fixed-function logic per-op component is ASIC-based. The ASIC per-op component can be implemented to perform well-known or frequently used functions, taking advantage of the speed and efficiency of the ASIC architecture to improve the performance of the network processing device. Such architectures can be implemented for various applications. For example, a SmartNIC architecture can be implemented using a configurable ASIC per-op component close-coupled with programmable logic and software to provide flexibility in supporting various use case scenarios. A set of uniform application programming interfaces (APIs) can be defined for configurable hard-wired logic, software, and programmable logic to invoke per-op and per-byte offload/acceleration functions implemented in ASIC. In contrast, traditional architectures use different sets of APIs separately for hardware and software. In some implementations, a configurable ASIC per-op component is implemented to be invokable by programmable logic and by software in a compute complex component via a set of uniform APIs. For example, the ASIC per-op component can be implemented such that each functional and sub-functional block can be invoked directly by hardware, by the FPGA, or by software running in the compute complex to perform its set function(s).
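One way to picture a uniform API is a single invocation signature shared by both kinds of per-op component, so that hardware, FPGA logic, or compute-complex software calls any functional block the same way. The sketch below assumes that shape; the class names, the `invoke(block, metadata, caller)` signature, the caller labels, and the block names are all hypothetical. The only behavioral difference modeled is that the programmable component can load new blocks while the fixed-function one cannot.

```python
from typing import Callable, Dict, Optional

Block = Callable[[dict], dict]
CALLERS = {"hardware", "fpga", "software"}  # who may invoke a block


class PerOpComponent:
    """Hypothetical uniform API: every functional block is invoked the same way."""

    kind = "generic"

    def __init__(self, blocks: Optional[Dict[str, Block]] = None) -> None:
        self._blocks: Dict[str, Block] = dict(blocks or {})

    def invoke(self, block: str, metadata: dict, caller: str) -> dict:
        if caller not in CALLERS:
            raise ValueError(f"unknown caller {caller!r}")
        if block not in self._blocks:
            raise KeyError(f"{self.kind} per-op component has no block {block!r}")
        result = self._blocks[block](dict(metadata))  # block works on a copy
        result.update(handled_by=self.kind, caller=caller)
        return result


class AsicPerOp(PerOpComponent):
    """Fixed-function: the set of blocks is fixed when the component is built."""
    kind = "asic"


class FpgaPerOp(PerOpComponent):
    """Programmable: new blocks can be loaded after deployment."""
    kind = "fpga"

    def load(self, name: str, fn: Block) -> None:
        self._blocks[name] = fn


def parse_headers(meta: dict) -> dict:
    meta["parsed"] = True
    return meta


asic = AsicPerOp({"parse_headers": parse_headers})
fpga = FpgaPerOp()
fpga.load("custom_decap", lambda meta: {**meta, "decapped": True})

# The same call shape works regardless of which component or caller is involved.
print(asic.invoke("parse_headers", {"len": 64}, caller="software"))
print(fpga.invoke("custom_decap", {"len": 64}, caller="hardware"))
```

Keeping one signature means a pipeline can be recomposed (moving a function between components) without changing the code that invokes it, which is the contrast drawn above with separate hardware and software API sets.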
Architectures implementing configurable ASIC per-op components close-coupled with programmable logic and software can provide processing pipelines with optimized data flows for many different use cases. For example, use cases and related processing pipelines that require little programmability can use such configurable ASIC per-op components to reduce latency and the cross-section bandwidth between the ASIC and the FPGA. In some implementations, the ASIC per-op component includes network packet and storage IO processing functional blocks. Each block can be individually invoked by FPGA programmable logic or by software in the compute complex component to perform its set function(s). In some processing pipelines, such functions can also be invoked by the arrival of network packets and storage transactions.
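The two invocation paths mentioned above, explicit calls from FPGA logic or software and triggering by the arrival of a packet or storage transaction, can be sketched as a small dispatcher. This is an illustrative model only: `EventTriggeredPerOp`, `bind`, `on_arrival`, the block names, and the event names such as `nvme_submit` are hypothetical. It also assumes that an arrival event not bound to any block is simply not handled by this component (for example, it is left to programmable logic).

```python
from typing import Callable, Dict, List, Optional


class EventTriggeredPerOp:
    """Hypothetical ASIC per-op sketch: blocks run on explicit calls or on arrival events."""

    def __init__(self) -> None:
        self.blocks: Dict[str, Callable[[dict], str]] = {
            "net_packet": lambda tx: f"net_packet processed {tx['id']}",
            "storage_io": lambda tx: f"storage_io processed {tx['id']}",
        }
        self.triggers: Dict[str, str] = {}  # arrival event type -> block name
        self.log: List[str] = []

    def bind(self, event_type: str, block: str) -> None:
        """Configure which block an arrival event of this type triggers."""
        if block not in self.blocks:
            raise KeyError(block)
        self.triggers[event_type] = block

    def invoke(self, block: str, tx: dict) -> str:
        """Explicit invocation, e.g. by FPGA logic or compute-complex software."""
        out = self.blocks[block](tx)
        self.log.append(out)
        return out

    def on_arrival(self, event_type: str, tx: dict) -> Optional[str]:
        """Event-driven invocation; unbound event types are not handled here."""
        block = self.triggers.get(event_type)
        return self.invoke(block, tx) if block is not None else None


asic = EventTriggeredPerOp()
asic.bind("packet_rx", "net_packet")
asic.bind("nvme_submit", "storage_io")
asic.on_arrival("packet_rx", {"id": 1})     # triggered by arrival
asic.invoke("storage_io", {"id": 2})        # invoked explicitly by software
asic.on_arrival("custom_event", {"id": 3})  # not bound: left to other components
print(asic.log)
```

Because both paths end in the same `invoke` call, a block behaves identically whether it was triggered by an event or called on purpose.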
The ASIC per-op component 304 can be implemented as a configurable component that is close-coupled with programmable logic and software in the compute complex component 106. In some implementations, the ASIC per-op component 304 includes functional blocks that can be individually invoked using a set of uniform APIs as defined for the FPGA per-op component 302. Such implementations provide a high degree of flexibility in combining software and hardware functional blocks to implement processing pipelines for various use cases and to allow customization in various deployment scenarios. For example, a first processing pipeline can use the FPGA per-op component 302 to perform its set function(s) while bypassing the ASIC per-op component 304. A second processing pipeline can use the ASIC per-op component 304 to perform its set function(s) while bypassing the FPGA per-op component 302.
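The bypass described above amounts to each composed pipeline carrying its own set of skipped per-op stages. A minimal sketch, assuming a fixed stage order of ingress per-byte, FPGA per-op 302, ASIC per-op 304, egress per-byte (the ordering and the `ComposedPipeline` type are hypothetical; only the reference numerals come from the text):

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical stage order; labels follow the reference numerals in the text.
STAGES = ["per-byte (ingress)", "FPGA per-op 302", "ASIC per-op 304", "per-byte (egress)"]


@dataclass(frozen=True)
class ComposedPipeline:
    name: str
    bypass: FrozenSet[str]  # per-op components this pipeline skips

    def path(self, stages: List[str]) -> List[str]:
        """Return the stages that data actually traverses, in order."""
        unknown = self.bypass - set(stages)
        if unknown:
            raise ValueError(f"cannot bypass unknown stages: {sorted(unknown)}")
        return [s for s in stages if s not in self.bypass]


first = ComposedPipeline("first", frozenset({"ASIC per-op 304"}))
second = ComposedPipeline("second", frozenset({"FPGA per-op 302"}))

for p in (first, second):
    print(p.name, "->", " | ".join(p.path(STAGES)))
```

Both pipelines share the same hardware; only the bypass set differs, which is what allows one device to serve different use cases.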
Compared to the pipeline described in
The architecture depicted in
By bypassing the ASIC per-op components 402A, 402B, the composable processing pipeline 500 depicted in
In addition to processing pipelines and data flows that use bypassing and different combinations of ASIC and FPGA per-op components, the architecture described herein enables processing pipelines in which software running in a compute complex component provides the main control.
At step 902, the method 900 includes performing a first composed processing pipeline. Performing the first composed processing pipeline includes, at substep 902A, selecting, using a compute complex component, the programmable per-op component for performing a first function of the first composed processing pipeline. At substep 902B, the compute complex component controls the programmable per-op component to perform the first function. In some implementations, performing the first composed processing pipeline includes bypassing the fixed-function logic per-op component. An example of such a pipeline bypassing the fixed-function logic per-op component is depicted in
At step 904, the method 900 includes performing a second composed processing pipeline. Performing the second composed processing pipeline includes, at substep 904A, selecting, using the compute complex component, the fixed-function logic per-op component for performing a second function of the second composed processing pipeline. At substep 904B, the compute complex component controls the fixed-function logic per-op component to perform the second function. The fixed-function logic per-op component can be configured to be invokable by programmable logic and by software in the compute complex component via a set of uniform APIs. In some implementations, the fixed-function logic per-op component includes functional blocks and/or sub-functional blocks, including but not limited to network packet and storage IO processing functional blocks. The functional and sub-functional blocks can be implemented to be individually invoked by programmable logic or by software in the compute complex component to perform their set function(s).
In some implementations, performing the second composed processing pipeline includes bypassing the programmable per-op component. An example of such a pipeline bypassing the programmable per-op component is depicted in
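Steps 902 and 904 share one shape: the compute complex selects a per-op component for a function, then controls it to perform that function, and the component not selected is bypassed. The sketch below models that loop under an assumed selection policy (prefer the fixed-function component whenever it supports the function, otherwise fall back to the programmable one); the policy, the `PerOp` and `ComputeComplex` classes, and the function names are hypothetical and not taken from the method itself.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PerOp:
    name: str
    functions: Set[str]  # functions this component can perform
    performed: List[str] = field(default_factory=list)

    def perform(self, function: str) -> None:
        if function not in self.functions:
            raise ValueError(f"{self.name} cannot perform {function!r}")
        self.performed.append(function)


class ComputeComplex:
    """Hypothetical control software for steps 902 and 904 of method 900."""

    def __init__(self, programmable: PerOp, fixed: PerOp) -> None:
        self.programmable = programmable
        self.fixed = fixed

    def select(self, function: str) -> PerOp:
        # Assumed policy: prefer the fixed-function component when it supports
        # the function; otherwise fall back to the programmable component.
        if function in self.fixed.functions:
            return self.fixed
        if function in self.programmable.functions:
            return self.programmable
        raise LookupError(f"no per-op component supports {function!r}")

    def run_composed_pipeline(self, function: str) -> str:
        component = self.select(function)  # substep 902A / 904A: select
        component.perform(function)        # substep 902B / 904B: control
        return component.name              # the other per-op component is bypassed


fpga = PerOp("programmable per-op", {"custom_encap", "flow_lookup"})
asic = PerOp("fixed-function per-op", {"flow_lookup", "storage_io"})
cc = ComputeComplex(programmable=fpga, fixed=asic)

print(cc.run_composed_pipeline("custom_encap"))  # first composed pipeline
print(cc.run_composed_pipeline("storage_io"))    # second composed pipeline
```

A real selection policy could weigh other factors, such as load or configuration state; the capability check here is only the simplest rule that reproduces the two pipelines of method 900.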
In addition to the composed processing pipelines described above with respect to steps 902 and 904, other variations and scenarios can be implemented using a similar integrated circuit device design. At step 906, the method 900 optionally includes performing a third composed processing pipeline. The third composed processing pipeline includes performing a third function using the fixed-function logic per-op component and performing a fourth function using a programmable per-op component, which may or may not be the same programmable per-op component described above with respect to the performance of the first function in the first composed processing pipeline. For example, the integrated circuit device can be implemented to perform functions not fully supported by a fixed-function logic per-op component by using a programmable per-op component to perform complementary functions to the fixed-function logic per-op component. In some implementations, the fixed-function logic per-op component performs a “front-end processing” function, and the programmable per-op component performs a “back-end processing” function that is complementary to the front-end processing. An example of such a pipeline performing separate front-end and back-end processing using different components is depicted in
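The front-end/back-end split of the third composed pipeline can be sketched as two chained functions: a fixed-function front end that performs the third function (parsing a fixed header layout) and a programmable back end that performs the complementary fourth function for cases the front end does not fully support. Everything concrete here is an assumption for illustration: the 4-byte header layout (version, protocol, 16-bit big-endian length), the set of "fully supported" protocol numbers, and the function names.

```python
from typing import Dict

FULLY_SUPPORTED = {6, 17}  # protocol numbers the fixed-function logic handles end to end (assumed)


def fixed_function_front_end(frame: bytes) -> Dict:
    """Hypothetical front end: parse a fixed 4-byte header (version, proto, length)."""
    if len(frame) < 4:
        raise ValueError("frame shorter than header")
    length = int.from_bytes(frame[2:4], "big")
    return {"version": frame[0], "proto": frame[1], "payload": frame[4:4 + length]}


def programmable_back_end(meta: Dict) -> Dict:
    """Hypothetical back end: complements the front end where it falls short."""
    if meta["proto"] in FULLY_SUPPORTED:
        meta["action"] = "forward"  # nothing extra needed
    else:
        # e.g. a new or custom protocol: classify on the first payload bytes
        meta["action"] = "custom:" + meta["payload"][:2].hex()
    return meta


def third_composed_pipeline(frame: bytes) -> Dict:
    return programmable_back_end(fixed_function_front_end(frame))


print(third_composed_pipeline(bytes([1, 6, 0, 0])))
print(third_composed_pipeline(bytes([1, 99, 0, 2, 0xAB, 0xCD, 0xEF])))
```

The fixed-function stage does the common, well-defined parsing; only the part that varies by deployment lives in programmable logic.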
In some embodiments, the methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as a computer-application program or service, an application-programming interface (API), a library, and/or other computer-program product.
Computing system 1000 includes processing circuitry 1002, volatile memory 1004, and a non-volatile storage device 1006. Computing system 1000 may optionally include a display subsystem 1008, input subsystem 1010, communication subsystem 1012, and/or other components not shown in
Processing circuitry typically includes one or more logic processors, which are physical devices configured to execute instructions. For example, the logic processors may be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.
The logic processor may include one or more physical processors configured to execute software instructions. Additionally or alternatively, the logic processor may include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. Processors of the processing circuitry 1002 may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the processing circuitry optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. For example, aspects of the computing system disclosed herein may be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration. In such a case, it will be understood that these virtualized aspects run on different physical logic processors of various different machines. These different physical logic processors of the different machines will be understood to be collectively encompassed by processing circuitry 1002.
Non-volatile storage device 1006 includes one or more physical devices configured to hold instructions executable by the processing circuitry to implement the methods and processes described herein. When such methods and processes are implemented, the state of non-volatile storage device 1006 may be transformed—e.g., to hold different data.
Non-volatile storage device 1006 may include physical devices that are removable and/or built in. Non-volatile storage device 1006 may include optical memory, semiconductor memory, and/or magnetic memory, or other mass storage device technology. Non-volatile storage device 1006 may include nonvolatile, dynamic, static, read/write, read-only, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. It will be appreciated that non-volatile storage device 1006 is configured to hold instructions even when power is cut to the non-volatile storage device 1006.
Volatile memory 1004 may include physical devices that include random access memory. Volatile memory 1004 is typically utilized by processing circuitry 1002 to temporarily store information during processing of software instructions. It will be appreciated that volatile memory 1004 typically does not continue to store instructions when power is cut to the volatile memory 1004.
Aspects of processing circuitry 1002, volatile memory 1004, and non-volatile storage device 1006 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.
The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 1000 typically implemented in software by a processor to perform a particular function using portions of volatile memory, which function involves transformative processing that specially configures the processor to perform the function. Thus, a module, program, or engine may be instantiated via processing circuitry 1002 executing instructions held by non-volatile storage device 1006, using portions of volatile memory 1004. It will be understood that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” may encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
When included, display subsystem 1008 may be used to present a visual representation of data held by non-volatile storage device 1006. The visual representation may take the form of a GUI. As the herein described methods and processes change the data held by the non-volatile storage device, and thus transform the state of the non-volatile storage device, the state of display subsystem 1008 may likewise be transformed to visually represent changes in the underlying data. Display subsystem 1008 may include one or more display devices utilizing virtually any type of technology. Such display devices may be combined with processing circuitry 1002, volatile memory 1004, and/or non-volatile storage device 1006 in a shared enclosure, or such display devices may be peripheral display devices.
When included, input subsystem 1010 may comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, camera, or microphone.
When included, communication subsystem 1012 may be configured to communicatively couple various computing devices described herein with each other, and with other devices. Communication subsystem 1012 may include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem may be configured for communication via a wired or wireless local- or wide-area network, broadband cellular network, etc. In some embodiments, the communication subsystem may allow computing system 1000 to send and/or receive messages to and/or from other devices via a network such as the Internet.
The following paragraphs provide additional description of the subject matter of the present disclosure. One aspect provides an integrated circuit device for network processing, the device comprising: a composable processing pipeline comprising: a programmable per-op component; and a fixed-function logic per-op component that is close-coupled with programmable logic and software; and a compute complex component comprising processing circuitry implementing the software for controlling the programmable per-op component and the fixed-function logic per-op component, wherein: for a first composed processing pipeline, the processing circuitry is configured to perform a first function using the programmable per-op component; and for a second composed processing pipeline, the processing circuitry is configured to perform a second function using the fixed-function logic per-op component. In this aspect, additionally or alternatively, wherein: for the first composed processing pipeline, the processing circuitry is configured to bypass the fixed-function logic per-op component; and for the second composed processing pipeline, the processing circuitry is configured to bypass the programmable per-op component. In this aspect, additionally or alternatively, wherein the fixed-function logic per-op component comprises functional blocks that can be individually invoked using a uniform set of application programming interfaces (APIs). In this aspect, additionally or alternatively, wherein the functional blocks can be individually invoked by the compute complex component. In this aspect, additionally or alternatively, wherein the functional blocks comprise one or more of a network packet processing functional block or a storage input/output processing functional block. In this aspect, additionally or alternatively, wherein the programmable per-op component comprises a field-programmable gate array (FPGA) per-op component. 
In this aspect, additionally or alternatively, wherein the fixed-function logic per-op component comprises an application-specific integrated circuit (ASIC). In this aspect, additionally or alternatively, the integrated circuit device further comprises an application-specific integrated circuit (ASIC) per-byte component, wherein, for the second composed processing pipeline, the processing circuitry is configured to perform a third function using the ASIC per-byte component. In this aspect, additionally or alternatively, wherein the ASIC per-byte component can be invoked by the fixed-function logic per-op component or the programmable per-op component using a uniform set of application programming interfaces (APIs). In this aspect, additionally or alternatively, the integrated circuit device further comprises a second programmable per-op component, wherein, for a third composed processing pipeline, the processing circuitry is configured to perform a third function using the fixed-function logic per-op component and a fourth function using the second programmable per-op component.
Another aspect provides a method for network processing enacted on an integrated circuit device comprising a composable processing pipeline, the method comprising: performing a first composed processing pipeline, comprising: selecting, using a compute complex component, a programmable per-op component for performing a first function of the first composed processing pipeline; and controlling, using the compute complex component, the programmable per-op component to perform the first function; and performing a second composed processing pipeline, comprising: selecting, using the compute complex component, a fixed-function logic per-op component for performing a second function of the second composed processing pipeline, wherein the fixed-function logic per-op component is close-coupled with programmable logic and software running on the compute complex component; and controlling, using the compute complex component, the fixed-function logic per-op component to perform the second function. In this aspect, additionally or alternatively, wherein: performing the first composed processing pipeline comprises bypassing the fixed-function logic per-op component; and performing the second composed processing pipeline comprises bypassing the programmable per-op component. In this aspect, additionally or alternatively, wherein the fixed-function logic per-op component comprises functional blocks that can be individually invoked by a compute complex component of the integrated circuit device using a uniform set of application programming interfaces (APIs). In this aspect, additionally or alternatively, wherein the programmable per-op component comprises a field-programmable gate array (FPGA) per-op component; and wherein the fixed-function logic per-op component comprises an application-specific integrated circuit (ASIC) per-op component. 
In this aspect, additionally or alternatively, wherein performing the second composed processing pipeline comprises performing a third function using an application-specific integrated circuit (ASIC) per-byte component, wherein the ASIC per-byte component is invokable by the fixed-function logic per-op component, the programmable per-op component, or the compute complex component using a uniform set of application programming interfaces (APIs).
Another aspect provides an integrated circuit device for network processing, the device comprising: a composable processing pipeline comprising: a field-programmable gate array (FPGA) per-op component; and an application-specific integrated circuit (ASIC) per-op component; and a compute complex component comprising processing circuitry implementing software for controlling the FPGA per-op component and the ASIC per-op component, wherein: for a first composed processing pipeline, the processing circuitry is configured to bypass the ASIC per-op component; and for a second composed processing pipeline, the processing circuitry is configured to bypass the FPGA per-op component. In this aspect, additionally or alternatively, wherein the ASIC per-op component comprises functional blocks that can be individually invoked using a uniform set of application programming interfaces (APIs). In this aspect, additionally or alternatively, wherein the functional blocks can be individually invoked by the compute complex component. In this aspect, additionally or alternatively, the integrated circuit device further comprises an ASIC per-byte component, wherein, for the second composed processing pipeline, the processing circuitry is configured to perform a function using the ASIC per-byte component. In this aspect, additionally or alternatively, wherein the ASIC per-byte component can be invoked by the ASIC per-op component, the FPGA per-op component, or the compute complex component using a uniform set of application programming interfaces (APIs).
“And/or” as used herein means any or all of multiple stated possibilities. For example, the phrase “element A and/or element B” covers embodiments having element A alone, element B alone, or elements A and B taken together.
It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.