Embodiments generally relate to computing systems. More particularly, embodiments relate to cloud-based systems using an infrastructure processing unit architecture with high performance offload.
Some existing cloud-based systems (e.g., data centers) use an infrastructure processing unit (IPU) to handle certain tasks relating to management/control of the cloud (e.g., data center) infrastructure, instead of such tasks being performed by a host central processing unit (CPU). Such offloading of infrastructure tasks to the IPU frees the host CPU to handle more tasks requested by remote clients and/or to handle client-requested tasks more quickly and efficiently. Existing IPU-based implementations, however, present data traffic bottlenecks and are restricted in the number of storage drives that can be supported, both of which limit system performance and scalability.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
A performance-enhanced computing system as described herein provides an improved system architecture with an IPU for offloading infrastructure tasks. The improved system architecture uses a multi-root switch to route application data between host processor memory and cloud storage, which provides additional bandwidth and connectivity to storage devices. By using transaction identifier substitution (remapping), the technology provides for bypassing temporary storage of the application data in local IPU memory, permitting direct transfer of data between host memory and cloud storage devices. The technology helps improve the overall performance of cloud-based systems by increasing data storage throughput, eliminating critical data bottlenecks, and enabling an increase in the number of storage drives controlled by the architecture.
The IPU 130 provides infrastructure support and services to the host CPU 110 including, for example, networking and security services as well as managing data storage in the first storage tier 160 and the second storage tier 170. The IPU 130 can include, e.g., an Intel® Mt. Evans IPU. The IPU 130 has IPU local memory 140 used by the IPU 130 - e.g., for storing data, submission/completion queues, etc. The IPU local memory 140 can include, e.g., DRAM. The IPU 130 exposes to the host CPU 110 a set of PCIe endpoints - e.g., single root I/O virtualization (SR-IOV) virtual functions. The PCIe endpoints expose nonvolatile memory express (NVMe) virtual interfaces - e.g., submission/completion queues, which accept NVMe commands and return NVMe responses. NVMe is a storage access and transport protocol used for solid state drives (SSDs), e.g., flash memory drives. The IPU 130 also runs a host-based flash translation layer (FTL) to manage SSD operations such as, e.g., logical-to-physical address translation, garbage collection, wear-leveling, etc. Through FTL the IPU 130 provides virtual SSDs for use by the host CPU 110.
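For illustration, the following Python sketch models the kind of logical-to-physical (L2P) address translation an FTL maintains for a virtual SSD; the class names, the 4 KiB page granularity, and the mapping structure are hypothetical assumptions and are not the IPU 130's actual implementation.

```python
# Minimal sketch (not the IPU 130's actual FTL) of logical-to-physical
# address translation for a host-visible virtual SSD. Names and the
# 4 KiB page granularity are illustrative assumptions.
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed translation granularity

@dataclass(frozen=True)
class PhysicalAddress:
    drive_id: int   # which physical SSD in the storage tier
    page: int       # physical page index on that drive

class FlashTranslationLayer:
    def __init__(self):
        self.l2p = {}  # logical page number -> PhysicalAddress

    def write(self, logical_page: int, phys: PhysicalAddress) -> None:
        # SSD writes go to a new physical page; the old mapping becomes
        # invalid and is reclaimed later by garbage collection.
        self.l2p[logical_page] = phys

    def lookup(self, logical_page: int) -> PhysicalAddress:
        # Translate a host (virtual SSD) logical page to a physical location.
        return self.l2p[logical_page]

ftl = FlashTranslationLayer()
ftl.write(10, PhysicalAddress(drive_id=0, page=4242))
assert ftl.lookup(10).drive_id == 0
```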
The IPU 130 has a total of 16 peripheral component interconnect express (PCIe) lanes. The IPU 130 is coupled to the host CPU 110 via eight of the PCIe lanes, and the IPU 130 is coupled to the PCIe switch 150 (a single-root switch) via the remaining eight PCIe lanes. The IPU 130 interfaces with SSDs in the first storage tier 160 and SSDs in the second storage tier 170 via the PCIe switch 150. The PCIe switch 150 is coupled to the SSDs in the first storage tier 160 via a set of eight PCIe lanes, where each solid state drive (SSD) in the first storage tier 160 occupies four PCIe lanes. Similarly, the PCIe switch 150 is coupled to the SSDs in the second storage tier 170 via a set of eight PCIe lanes, where each SSD in the second storage tier 170 occupies four PCIe lanes.
The first storage tier 160 and the second storage tier 170 each include two physical SSDs (physical devices) for data storage. Based on the system architecture 100, the first storage tier 160 and the second storage tier 170 are each limited to two SSDs because the PCIe switch 150 has only eight PCIe lanes available for coupling to each storage tier and each solid state drive (SSD) occupies four of those lanes. The system architecture 100 is effectively limited to an eight-lane PCIe switch because the IPU 130 has only eight of its 16 PCIe lanes available for coupling to the PCIe switch 150; the other eight IPU lanes are coupled to the host CPU 110. The first storage tier 160 can, for example, be a “cache” tier (e.g., used for “hot” data/objects), and the SSDs in the first storage tier 160 can include, e.g., Intel® Optane™ memory devices. The second storage tier 170 can, for example, be a “capacity” tier (e.g., used for “cold” data/objects), and the SSDs in the second storage tier 170 can include, e.g., quad-level cell (QLC) flash memory devices.
The host CPU 110 does not directly communicate with the SSDs in the first storage tier 160 or in the second storage tier 170. Indeed, the host CPU 110 does not even know the identity/location of the physical SSDs (e.g., the SSDs are hidden from the host CPU 110). Instead, the IPU 130 provides virtual SSDs to the host CPU 110 via FTL using NVMe commands/interfaces. The IPU 130 effectively converts host read/write requests for virtual SSDs into requests having physical addressing for the SSDs in the first storage tier 160 or in the second storage tier 170.
In operation, the IPU 130 exposes PCIe virtual functions (VFs) that the FTL uses to expose virtual NVMe volumes (virtual SSDs) to the host CPU 110 through the virtual NVMe interfaces. These virtual NVMe volumes can be directly assigned to the virtual machines (VMs) running on the host CPU 110. The FTL stores its persistent metadata (e.g., the logical-to-physical (L2P) address mapping table) on an SSD partition. For example, FTL metadata is stored on an SSD in the first storage tier 160, while the remaining capacity of the first storage tier 160 is used as a cache tier to place hot/temporary data.
The IPU 130 maintains NVMe I/O submission queue(s) and NVMe I/O completion queue(s). NVMe commands (I/O) are placed into a submission queue, and completions (e.g., for the commands) are placed in an associated completion queue. Multiple submission queues can be associated with the same completion queue. The IPU 130 also maintains NVMe administration queue(s) for device management and control - e.g., creation and deletion of I/O submission and completion queues, etc.
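The paired-queue model described above can be sketched as follows; the classes are illustrative stand-ins for NVMe queue pairs and do not reflect the IPU 130's actual queue implementation.

```python
# Illustrative sketch of the paired NVMe submission/completion queue model;
# multiple submission queues may feed one completion queue.
from collections import deque

class CompletionQueue:
    def __init__(self):
        self.entries = deque()

    def post(self, command_id: int, status: int = 0) -> None:
        self.entries.append({"cid": command_id, "status": status})

class SubmissionQueue:
    def __init__(self, cq: CompletionQueue):
        self.cq = cq                 # associated completion queue
        self.entries = deque()

    def submit(self, command: dict) -> None:
        self.entries.append(command)

    def process_one(self) -> None:
        # The controller consumes a command and posts its completion
        # to the associated completion queue.
        cmd = self.entries.popleft()
        self.cq.post(cmd["cid"])

cq = CompletionQueue()
sq0, sq1 = SubmissionQueue(cq), SubmissionQueue(cq)   # two SQs share one CQ
sq0.submit({"cid": 1, "opcode": "write"})
sq0.process_one()
assert cq.entries[0]["cid"] == 1
```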
Similarly, as another example, when the host CPU 110 requests a transfer of application data from a VF/virtual SSD to the host local memory 120, the IPU 130 causes the requested application data to be retrieved from a physical location in one of the SSDs (the target SSD) and temporarily stored in the IPU local memory 140. For example, if the IPU 130 causes the application data to be transferred from a target SSD in the first storage tier 160, the flow of data follows the data path 135 between the target SSD in the first storage tier 160 and the IPU local memory 140. As another example, if the IPU 130 causes the application data to be transferred from a target SSD in the second storage tier 170, the flow of data follows the data path 137 between the target SSD in the second storage tier 170 and the IPU local memory 140. The IPU 130 then causes the application data to be moved from temporary storage in the IPU local memory 140 to the host local memory 120, where the flow of data follows the data path 125.
Similarly, as another example, data stored in the second storage tier 170 may, as a result of an application, become “hot” (e.g., objects that have been reassigned from a cold tier to a hot tier) and thus be transferred to the first storage tier 160. To carry out this data transfer, the IPU 130 causes the subject stored data to be retrieved from an SSD in the second storage tier 170 and temporarily stored in the IPU local memory 140, with the data flow following the data path 147. The IPU 130 then causes the subject data to be moved from temporary storage in the IPU local memory 140 to a target SSD in the first storage tier 160, with the data flow following the data path 145.
Garbage collection (e.g., FTL garbage collection) involves, at a high level, moving good data to new pages or blocks and erasing blocks with old data to provide new storage, and is conducted under the control of the IPU 130. As an example, to perform garbage collection on a target SSD in the second storage tier 170, the IPU 130 causes valid data in a used block on the target drive to be retrieved from the target drive and temporarily stored in the IPU local memory 140. The IPU 130 then causes this valid data to be moved to a new block on the target drive, and the used block on the target drive is erased. Thus the garbage collection causes data transfers that follow the data path 147 between the second storage tier 170 and the IPU local memory 140.
As a result of the store-and-forward scheme used in the system architecture 100, all of the data transfers between the IPU 130 and the memory storage tiers (i.e., the first storage tier 160 and/or the second storage tier 170) - e.g., data transfers resulting from application execution as well as data compaction and garbage collection - utilize the eight PCIe lanes between the IPU 130 and the PCIe switch 150. This results in a critical data bottleneck in those eight PCIe lanes. For example, a typical PCIe bandwidth demand generated by application reads and writes to a single hot/cold drive pair (e.g., one SSD in the first storage tier 160 and one SSD in the second storage tier 170) is on the order of 8 GB/s.
Further, the FTL compaction process, in the worst case, reads all of the application data from SSDs in the first storage tier 160 and writes the data to SSDs in the second storage tier 170, thus generating two times 8 GB/s (i.e., 16 GB/s) of additional bandwidth demand. Assuming a nominal FTL write amplification factor for big data workloads, the FTL garbage collection process further generates on the order of 10 GB/s of bandwidth demand (of note, this bandwidth demand is much higher for random workloads). Thus, there is a potential bandwidth demand of 34 GB/s through the critical bottleneck (the eight PCIe lanes) between the IPU 130 and the PCIe switch 150. In the worst case, therefore, the eight available PCIe lanes in the IPU 130 cannot sustain the bandwidth demand of even a single SSD pair. Even if the FTL compaction and garbage collection did not burden the critical bottleneck, application read/write demands could not scale beyond the current two-drives-per-tier architecture.
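The worst-case arithmetic above can be checked with a short calculation; the demand figures are the estimates quoted in this description, and the approximately 2 GB/s-per-lane link figure is an assumed PCIe 4.0-class value used only for illustration.

```python
# Back-of-the-envelope check of the bottleneck described above. The demand
# figures are the estimates quoted in the text; ~2 GB/s per lane approximates
# PCIe 4.0-class throughput per direction (an assumption for illustration).
app_rw_gbps       = 8.0    # application reads/writes, one hot/cold SSD pair
compaction_gbps   = 16.0   # worst-case FTL compaction (read hot + write cold)
garbage_coll_gbps = 10.0   # nominal FTL garbage collection, big-data workloads

total_demand = app_rw_gbps + compaction_gbps + garbage_coll_gbps   # ~34 GB/s

lanes_to_switch = 8
per_lane_gbps   = 2.0      # assumed per-lane, per-direction throughput
link_capacity   = lanes_to_switch * per_lane_gbps                  # ~16 GB/s

print(f"demand ~{total_demand:.0f} GB/s vs. x8 link ~{link_capacity:.0f} GB/s")
# demand ~34 GB/s vs. x8 link ~16 GB/s: the x8 link cannot sustain one SSD pair
```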
Furthermore, as described above, a transfer of application data between the host local memory 120 and a storage tier (e.g., the first storage tier 160 or the second storage tier 170) results in two application data transfers: one transfer between the host local memory 120 and the IPU local memory 140, and another transfer between the IPU local memory 140 and a physical SSD in the respective storage tier. Such duplication of data transfers caused by the store-and-forward scheme results in delays and inefficiencies in application execution. Moreover, the store-and-forward scheme requires IPU implementations to provision sufficient local DRAM bandwidth as well as sufficient compute resources (e.g., a number of CPU cores on the IPU), which are further disadvantages, especially given power constraints.
The IPU 230 includes the same or similar functionality as the IPU 130, and can include, e.g., an Intel® Mt. Evans IPU. The IPU 230 - similar to the IPU 130 - has a total of 16 PCIe lanes, all of which are used to couple the IPU 230 to the MR switch 250. In other words, unlike the system architecture 100 - where the IPU 130 splits its 16 PCIe lanes between the host CPU 110 (eight lanes) and the PCIe switch 150 (the remaining eight lanes) - in the system architecture 200 the IPU 230 devotes all 16 of its PCIe lanes to the MR switch 250. The IPU 230 interfaces with the other devices (e.g., the host CPU 110, SSDs in the first storage tier 260 and/or SSDs in the second storage tier 270) through the MR switch 250. The IPU 230 is used primarily for handling the control path that governs how data moves in the system architecture 200. The IPU 230 also runs a flash translation layer (FTL) which has been offloaded from the host CPU 110 - thus saving processing cycles for the host CPU 110. FTL metadata is stored on an SSD in the first storage tier 260. As described herein, the system architecture 200 avoids the store-and-forward scheme of existing designs and instead employs direct data movement between the host local memory 120 and the memory storage tiers (e.g., the first storage tier 260 and/or the second storage tier 270) via the MR switch 250.
The MR switch 250 is coupled to four SSDs in the first storage tier 260 via a set of 16 PCIe lanes, where each SSD in the first storage tier 260 occupies four PCIe lanes. Similarly, the MR switch 250 is coupled to four SSDs in the second storage tier 270 via a set of 16 PCIe lanes, where each SSD in the second storage tier 270 occupies four PCIe lanes. The downstream switch ports coupling to the SSDs in the storage tiers 260 and 270 are assigned to a root port (RP) of the IPU 230 (and are hidden from the host CPU 110). As further described herein with reference to
The first storage tier 260 and the second storage tier 270 each include four physical SSDs (physical devices) for data storage. Based on the system architecture 200, the first storage tier 260 and the second storage tier 270 can each accommodate four SSDs because the MR switch 250 has 16 PCIe lanes available for coupling to each storage tier and each SSD occupies four of those lanes. The first storage tier 260 can, for example, be a “cache” tier (e.g., used for “hot” data/objects), and the SSDs in the first storage tier 260 can include, e.g., Intel® Optane™ memory devices. The second storage tier 270 can, for example, be a “capacity” tier (e.g., used for “cold” data/objects), and the SSDs in the second storage tier 270 can include, e.g., quad-level cell (QLC) flash memory devices. FTL metadata is stored on an SSD in the first storage tier 260, while the remaining capacity of the first storage tier 260 is used as a cache tier to place hot/temporary data.
The view of this architecture from the perspective of the host CPU 110 is simply an SR-IOV capable IPU 230. The host CPU 110 does not know the identity/location of the physical SSDs (the SSDs are hidden from the host CPU 110). Instead, the IPU 230 provides virtual SSDs to the host CPU 110 via FTL using NVMe commands/interfaces. The IPU 230 effectively converts host read/write requests for virtual SSDs into requests having physical addressing for the SSDs in the first storage tier 260 or in the second storage tier 270. The IPU 230 can use a private PCIe fabric via the MR switch 250 to manage the physical SSDs.
In operation, the IPU 230 exposes PCIe virtual functions (VFs) that the FTL uses to expose virtual NVMe volumes (virtual SSDs) to the host CPU 110 through the virtual NVMe interfaces. These virtual NVMe volumes can be directly assigned to the virtual machines (VMs) running on the host CPU 110. The host CPU 110 enumerates these VFs, discovers virtual NVMe controllers, and then binds NVMe drivers to them. The host CPU 110 neither “sees” nor enumerates the physical NVMe SSDs (they are hidden by the MR switch 250). Only the IPU 230 can enumerate the physical NVMe SSDs.
As described further herein with reference to
The remapping table stored in the memory of the MR switch 250 also includes, in embodiments, transaction tags to enable retagging, by the MR switch 250, of requests from both the IPU 230 and the physical SSDs. Retagging is implemented to disambiguate the routing of completions from the host CPU 110 that target the IPU’s VF requester ID (e.g., virtual SSD ID), because the corresponding requests may have originated either from the IPU 230 or from one of the physical SSDs. The MR switch 250 must store the transaction ID (requester ID plus tag) of the original request and substitute it with the transaction ID of the VF. Subsequently, when the host CPU 110 returns a data transfer completion, the MR switch 250 must substitute (e.g., restore) the transaction ID of the original request.
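As a rough sketch of this substitute-and-restore behavior (not the MR switch 250's actual logic), a transaction ID can be modeled as a (requester ID, tag) pair that is saved on the upstream path and restored on the returning completion; the class and field names below are hypothetical.

```python
# Conceptual sketch of transaction ID substitution/restoration; data structures
# and field names are illustrative, not the MR switch 250's actual design.
from typing import Dict, Tuple

TxnId = Tuple[int, int]  # (requester ID, tag)

class TransactionRemapper:
    """Illustrative substitute/restore logic for upstream requests/completions."""

    def __init__(self, vf_requester_id: int):
        self.vf_requester_id = vf_requester_id
        self.outstanding: Dict[TxnId, TxnId] = {}  # remapped ID -> original ID
        self.next_tag = 0

    def substitute(self, original: TxnId) -> TxnId:
        # Upstream request from a physical SSD: store the original transaction
        # ID and present the host with the VF's requester ID plus a fresh tag.
        remapped = (self.vf_requester_id, self.next_tag)
        self.next_tag = (self.next_tag + 1) % 256  # illustrative tag space
        self.outstanding[remapped] = original
        return remapped

    def restore(self, remapped: TxnId) -> TxnId:
        # Completion returning from the host: restore the original transaction
        # ID so the completion routes back to the physical SSD.
        return self.outstanding.pop(remapped)

remapper = TransactionRemapper(vf_requester_id=0x0200)
host_view = remapper.substitute((0x0400, 7))  # SSD request now appears to come from the VF
assert remapper.restore(host_view) == (0x0400, 7)
```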
In embodiments, the IPU 230 also provides a base address register (BAR) that maps the entire range of the IPU local memory 140 into the host address space. The host CPU 110 does not directly access this BAR; rather, it simply exists to provide an addressable PCIe path from the physical SSDs to the IPU local memory 140. The MR switch 250 does not need to provide ID substitution (remapping) for any peer-to-peer transfers between the physical SSDs and the IPU local memory 140.
As an example, if the IPU 230 causes the application data to be transferred to a target SSD in the first storage tier 260, the flow of data follows a data path 225 between the host local memory 120 and the target SSD in the first storage tier 260. As another example, if the IPU 230 causes the application data to be transferred to a target SSD in the second storage tier 270, the flow of data follows a data path 227 between the host local memory 120 and the target SSD in the second storage tier 270. As explained in more detail with reference to
Similarly, as another example, data stored in the second storage tier 270 may, as a result of an application, become “hot” (e.g., objects that have been reassigned from a cold tier to a hot tier) and thus be transferred to the first storage tier 260. To carry out this data transfer, the IPU 230 causes the subject stored data to be retrieved from an SSD in the second storage tier 270 and transferred directly to a target SSD in the first storage tier 260, with the data flow following the data path 265 (e.g., on a peer-to-peer basis).
Garbage collection is conducted under the control of the IPU 230. As an example, to perform garbage collection on a target SSD in the second storage tier 270, the IPU 230 causes valid data in a used block on the target SSD to be copied to a new block on the target SSD using on-SSD copy support (if available), and the used block on the target SSD is then erased. Thus, the garbage collection causes data transfers that follow a local data path 267 within the target SSD. Alternatively, if on-SSD copy support is unavailable, the IPU 230 causes the valid data in the used block on the target SSD to be copied to a new block on the target SSD (or to a block in another SSD in the storage tier) using a peer-to-peer transfer via the MR switch 250, such that the data path 267 extends from the target SSD through the MR switch 250 and back.
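The control decision described above - prefer an on-SSD copy, otherwise fall back to a peer-to-peer transfer through the MR switch - can be sketched as follows; the Ssd class and its methods are hypothetical stand-ins rather than a real driver interface.

```python
# Sketch of the garbage-collection control decision described above.
class Ssd:
    def __init__(self, name: str, supports_on_ssd_copy: bool):
        self.name = name
        self.supports_on_ssd_copy = supports_on_ssd_copy

    def copy_block(self, src_block: int, dst_block: int) -> None:
        # Data never leaves the drive: the copy is handled inside the SSD.
        print(f"{self.name}: on-SSD copy block {src_block} -> block {dst_block}")

    def erase_block(self, block: int) -> None:
        print(f"{self.name}: erase block {block}")

def peer_to_peer_copy(src: Ssd, src_block: int, dst: Ssd, dst_block: int) -> None:
    # Fallback path: data moves SSD-to-SSD through the MR switch, still
    # bypassing temporary storage in the IPU local memory.
    print(f"P2P copy {src.name}:{src_block} -> {dst.name}:{dst_block} via MR switch")

def garbage_collect(target: Ssd, used_block: int, new_block: int) -> None:
    if target.supports_on_ssd_copy:
        target.copy_block(used_block, new_block)
    else:
        peer_to_peer_copy(target, used_block, target, new_block)
    target.erase_block(used_block)  # reclaim the used block after the valid data is moved

garbage_collect(Ssd("qlc0", supports_on_ssd_copy=True), used_block=17, new_block=901)
```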
By providing for direct data transfers between the host local memory 120 and a target SSD (over up to 16 PCIe lanes), the system architecture 200 eliminates the data bottleneck presented by the existing system architecture 100. For example, the IPU 230 is only involved in the NVMe command and response flows and for FTL metadata flows, while the transfer of application data bypasses the IPU 230 and temporary storage in the IPU local memory 140. Furthermore, the FTL compaction data flow can take place peer to peer through the MR switch 250 without involving the IPU local memory 140. Similarly, FTL garbage collection flow can take place within the target SSD using the on-SSD copy command or peer-to-peer transfer without involving the IPU local memory 140. Moreover, since the IPU 230 is not in the critical data path, it can use older generation PCIe (e.g., PCIe 4.0) at lower power and cost, even while the MR switch 250 and the SSDs in the first storage tier 260 and the second storage tier 270, being on the critical data path, use newer generation PCIe (e.g., PCIe 5.0).
Some or all components and/or features in the system architecture 200 can be implemented using one or more of a CPU, an IPU, a graphics processing unit (GPU), an artificial intelligence (AI) accelerator, a field programmable gate array (FPGA) accelerator, an application specific integrated circuit (ASIC), and/or via a processor with software, or in a combination of a processor with software and an FPGA or ASIC. More particularly, components of the system architecture 200 can be implemented in one or more modules as a set of program or logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in hardware, or any combination thereof. For example, hardware implementations can include configurable logic, fixed-functionality logic, or any combination thereof. Examples of configurable logic include suitably configured programmable logic arrays (PLAs), FPGAs, complex programmable logic devices (CPLDs), and general purpose microprocessors. Examples of fixed-functionality logic include suitably configured ASICs, combinational logic circuits, and sequential logic circuits. The configurable or fixed-functionality logic can be implemented with complementary metal oxide semiconductor (CMOS) logic circuits, transistor-transistor logic (TTL) logic circuits, or other circuits.
For example, computer program code to carry out operations by the system architecture 200 can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, program or logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
The MR switch 306 performs remapping (e.g., substitution and/or restoration of transaction IDs) as described herein. For example, MR switch 306 performs remapping of a transaction identifier field in a data transfer message (e.g., in a PCIe write request, a PCIe read request, or a PCIe data completion) between a first transaction identifier associated with the virtual function and a second transaction identifier associated with a physical storage device (e.g., the target SSD 308). In embodiments the MR switch 306 corresponds to the MR switch 250 with a remapping module (
The example command flow sequences described herein follow a typical three-phase process (e.g., in compliance with FTL/NVMe): a command phase, where the host issues a work order for the IPU (e.g., NVMe commands to read or write data); a data transfer phase, where data is transferred between the host local memory and storage using data transfer messages (e.g., a PCIe write request, or a PCIe read request followed by a PCIe data completion); and a completion phase, where the IPU issues a work order completion (e.g., an NVMe command completion). Physical SSDs have a DMA engine to perform DMA data transfers to and from the SSD for transferring application data as described herein.
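A minimal sketch of this three-phase process is shown below; the dictionaries stand in for NVMe commands and completions and are not actual NVMe data structures.

```python
# Skeleton of the three-phase flow described above; the dictionaries are
# illustrative stand-ins for NVMe commands/completions, not real structures.
def three_phase_flow():
    # Command phase: the host posts an NVMe command (the "work order") on a
    # submission queue exposed by the IPU's virtual function.
    command = {"opcode": "read", "cid": 7, "host_buffer": 0x2000, "lba": 128, "len": 8}

    # Data transfer phase: the physical SSD's DMA engine moves the data using
    # PCIe write requests (for NVMe writes) or PCIe read requests followed by
    # PCIe data completions (for NVMe reads).
    data_transfer = f"DMA between host buffer {command['host_buffer']:#x} and the SSD"

    # Completion phase: the IPU posts an NVMe command completion to the host's
    # completion queue.
    completion = {"cid": command["cid"], "status": "success"}
    return command, data_transfer, completion

print(three_phase_flow())
```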
Turning to
The PCIe write request issued at label 316 includes a transaction ID (e.g., as part of a transaction ID field in a PCIe packet). The transaction ID includes a requester ID; in embodiments, the transaction ID also includes a transaction tag. When the target SSD 308 issues the PCIe write request, the PCIe write request has a requester ID that is associated with the target SSD 308. If this request (with the requester ID that is associated with the target SSD 308) happens to be routed to the host 302, the host 302 (e.g., via an IOMMU such as the IOMMU 115) will not recognize the requester ID and will therefore reject the data transfer request (e.g., the data transfer request will not be fulfilled).
The MR switch 306 intercepts the PCIe write request issued by the target SSD 308 and performs, via the remapping module, a substitution of the transaction identifier (ID) at label 318 — e.g., based on an applicable entry in a remapping table (such as the remapping table 450 described herein with reference to
Once the MR switch 306 performs the transaction ID substitution (e.g., substituting at least the requester ID and, in embodiments, the transaction tag), the PCIe write request(s) - now bearing the substituted transaction ID - is/are routed via the MR switch 306 to the host 302 at label 320. Because of the transaction ID substitution, the PCIe write request(s) will appear to the host 302 as if the requester is the virtual function (operated by the IPU 304) rather than the target SSD 308, such that the host 302 will permit the data transfer according to the received PCIe write request(s) - e.g., the host 302 will cause the data accompanying the PCIe write request(s) to be stored in host memory.
At label 322 the target SSD 308 issues an NVMe command completion to the IPU 304. The IPU 304 issues an NVMe command completion to the host 302 at label 324. In some embodiments, the IPU 304 “sniffs” (e.g., detects) that the target SSD 308 has completed the data transfer phase and immediately posts an NVMe command completion to the host 302. The IPU 304 accomplishes this by, e.g., setting a filter on the writes originating from the SSD, which enables the IPU 304 to more quickly return the NVMe command completion to the host 302.
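The write data flow at labels 316 through 324 can be condensed into the following sketch, assuming a simple table-driven substitution; the transaction ID values and event strings are illustrative placeholders.

```python
# Condensed sketch of the NVMe write data flow described above (labels 316-324).
# Transaction IDs and the simple dict-based remapping are illustrative only.
VF_TXN_ID  = ("VF_requester_id", "tag_G")     # presented to the host
SSD_TXN_ID = ("SSD_requester_id", "tag_T")    # used by the target SSD

def nvme_write_data_flow(payload: bytes):
    remap = {SSD_TXN_ID: VF_TXN_ID}           # stands in for the remapping table
    events = []
    # Label 316: the target SSD issues a PCIe write request with its own
    # transaction ID (requester ID + tag) and the data payload.
    req = {"txn_id": SSD_TXN_ID, "data": payload}
    events.append(("316", "SSD issues PCIe write request", req["txn_id"]))
    # Label 318: the MR switch substitutes the VF's transaction ID so the host
    # IOMMU recognizes the requester.
    req["txn_id"] = remap[req["txn_id"]]
    events.append(("318", "MR switch substitutes transaction ID", req["txn_id"]))
    # Label 320: the retagged request reaches the host, which stores the data.
    events.append(("320", "host stores data in host memory", req["txn_id"]))
    # Labels 322/324: NVMe command completions flow SSD -> IPU -> host.
    events.append(("322/324", "NVMe completions posted", None))
    return events

for label, step, txn in nvme_write_data_flow(b"app data"):
    print(label, step, txn)
```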
Turning now to
The PCIe read request issued at label 356 includes a transaction ID (e.g., as part of a transaction ID field in a PCIe packet). The transaction ID includes a requester ID; in embodiments, the transaction ID also includes a transaction tag. When the target SSD 308 issues the PCIe read request, the PCIe read request has a requester ID that is associated with the target SSD 308. If this request (with the requester ID that is associated with the target SSD 308) happens to be routed to the host 302, the host 302 (e.g., via an IOMMU such as the IOMMU 115) will not recognize the requester ID and will therefore reject the data transfer request (e.g., the data transfer request will not be fulfilled).
The MR switch 306 intercepts the PCIe read request issued by the target SSD 308 and performs, via the remapping module, a substitution of the transaction identifier (ID) at label 358 — e.g., based on an applicable entry in a remapping table (such as the remapping table 450 described herein with reference to
Once the MR switch 306 performs the transaction ID substitution (e.g., substituting at least the requester ID and, in embodiments, the transaction tag), the PCIe read request(s) - now bearing the substituted transaction ID - is/are routed via the MR switch 306 to the host 302 at label 360. Because of the transaction ID substitution, the PCIe read request(s) will appear to the host 302 as if the requester is the virtual function (operated by the IPU 304) rather than the target SSD 308, such that the host 302 will permit the data transfer according to the received PCIe read request(s) - e.g., the host 302 will cause the data requested by the PCIe read request(s) to be read from host memory and sent to the requester.
At label 362, the host 302 issues a PCIe data completion (e.g., CplD) responsive to the received PCIe read request. The PCIe data completion includes the data to be transferred (e.g., via DMA transfer). In some cases, based on the total amount of data to be transferred and the amount of data that can be transferred with a single PCIe data completion, the host 302 will issue multiple PCIe data completions to perform the data transfer, where each PCIe data completion transfers a portion of the total data. The PCIe data completion issued at label 362 includes a transaction ID (e.g., as part of a transaction ID field in a PCIe packet). The transaction ID includes a requester ID (which, for PCIe data completions, may also be referred to as a completer ID); in embodiments, the transaction ID also includes a transaction tag.
The MR switch 306 intercepts the PCIe data completion issued by the host 302 and performs, via the remapping module, a transaction ID substitution at label 364 — e.g., an “inverse” substitution restoring the transaction ID for the target SSD 308 based on an applicable entry in a remapping table (such as the remapping table 450 described herein with reference to
Once the MR switch 306 performs the transaction ID substitution at label 364 (e.g., restoring at least the requester ID and, in embodiments, the transaction tag), the PCIe data completion(s) - now bearing the transaction ID for the target SSD 308 - is/are routed via the MR switch 306 to the target SSD 308 at label 366, and the target SSD 308 then stores the data sent with the PCIe data completion(s). At label 368 the target SSD 308 issues an NVMe command completion to the IPU 304. The IPU 304 issues an NVMe command completion to the host 302 at label 370.
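The read data flow at labels 356 through 370, including the inverse substitution on the returning data completion, can be condensed into the following sketch; as before, the transaction ID values are illustrative placeholders.

```python
# Condensed sketch of the NVMe read data flow described above (labels 356-370),
# including the inverse substitution on the data completion. Names are illustrative.
VF_TXN_ID  = ("VF_requester_id", "tag_G")
SSD_TXN_ID = ("SSD_requester_id", "tag_T")

def nvme_read_data_flow():
    remap   = {SSD_TXN_ID: VF_TXN_ID}                    # forward substitution
    restore = {v: k for k, v in remap.items()}           # inverse substitution
    events = []
    # Labels 356-360: the SSD issues a PCIe read request for host memory; the MR
    # switch substitutes the VF transaction ID before routing it to the host.
    read_req = {"txn_id": remap[SSD_TXN_ID]}
    events.append(("356-360", "read request retagged and routed to host", read_req["txn_id"]))
    # Label 362: the host returns a PCIe data completion (CplD) carrying the data,
    # addressed with the VF transaction ID.
    cpl = {"txn_id": VF_TXN_ID, "data": b"app data"}
    events.append(("362", "host issues data completion", cpl["txn_id"]))
    # Labels 364-366: the MR switch restores the SSD's original transaction ID and
    # routes the completion (and its data) back to the target SSD.
    cpl["txn_id"] = restore[cpl["txn_id"]]
    events.append(("364-366", "completion restored and routed to SSD", cpl["txn_id"]))
    # Labels 368/370: NVMe command completions flow SSD -> IPU -> host.
    events.append(("368/370", "NVMe completions posted", None))
    return events

for label, step, txn in nvme_read_data_flow():
    print(label, step, txn)
```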
The MR switch 400 is coupled to an IPU 430 via, e.g., PCIe lanes. In embodiments, the IPU 430 corresponds to the IPU 230 (
In the example illustrated in
As one example, for a first transaction Tr1 (label 462) the remapping table includes a target SSD requester ID R1, a target SSD tag T1, a remapped requester ID Q2, and a remapped tag G1. The entries R1, T1, Q2 and G1 can be part of the same record for the transaction Tr1. For a data transfer message (e.g., read request or write request) provided by the target SSD relating to the transaction Tr1, the MR switch 400 will perform a transaction ID substitution, by substituting the remapped requester ID Q2 (associated with a virtual SSD) in place of the requester ID R1 (associated with the target SSD), before the data transfer message is provided to the host. In embodiments where transaction tag remapping occurs, the MR switch 400 will also substitute the transaction tag G1 for the transaction tag T1. For a data completion, the MR switch 400 will perform a transaction ID substitution (e.g., an “inverse” substitution restoring the transaction ID) by substituting the requester ID R1 (associated with the target SSD) in place of the remapped requester ID Q2 (associated with the virtual SSD). In embodiments where the transaction tag was remapped, the MR switch 400 will also substitute the transaction tag T1 for the transaction tag G1.
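The Tr1 example above can be traced with a small standalone snippet; the table layout is hypothetical and the values R1, T1, Q2 and G1 are the placeholders used in this description.

```python
# Tiny standalone illustration of the Tr1 example above: the remapping table
# record holds (R1, T1) for the target SSD and (Q2, G1) for the virtual SSD.
# Table layout and values are illustrative placeholders from the text.
remapping_table = {
    "Tr1": {"ssd_requester_id": "R1", "ssd_tag": "T1",
            "remapped_requester_id": "Q2", "remapped_tag": "G1"},
}

def to_host(txn: str, requester_id: str, tag: str):
    # Upstream request: substitute the virtual-SSD transaction ID.
    rec = remapping_table[txn]
    assert (requester_id, tag) == (rec["ssd_requester_id"], rec["ssd_tag"])
    return rec["remapped_requester_id"], rec["remapped_tag"]

def to_ssd(txn: str, requester_id: str, tag: str):
    # Downstream completion: restore the target-SSD transaction ID.
    rec = remapping_table[txn]
    assert (requester_id, tag) == (rec["remapped_requester_id"], rec["remapped_tag"])
    return rec["ssd_requester_id"], rec["ssd_tag"]

assert to_host("Tr1", "R1", "T1") == ("Q2", "G1")
assert to_ssd("Tr1", "Q2", "G1") == ("R1", "T1")
```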
More particularly, the method 500 can be implemented as one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in hardware, or any combination thereof. For example, hardware implementations can include configurable logic, fixed-functionality logic, or any combination thereof. Examples of configurable logic include suitably configured PLAs, FPGAs, CPLDs, and general purpose microprocessors. Examples of fixed-functionality logic include suitably configured ASICs, combinational logic circuits, and sequential logic circuits. The configurable or fixed-functionality logic can be implemented with CMOS logic circuits, TTL logic circuits, or other circuits.
For example, computer program code to carry out operations shown in the method 500 and/or functions associated therewith can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, program or logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
Turning to
In some embodiments, illustrated processing block 540 provides for maintaining a remapping table to hold the first transaction identifier and the second transaction identifier. The remapping table can correspond to the remapping table 450 (
In some embodiments, illustrated processing block 550a provides for performing data compaction by transferring stored data, via the switch, between the physical storage device and another physical storage device while, at block 550b, bypassing temporary storage of the stored data in the memory local to the IPU.
Turning now to
Turning now to
In embodiments, the system 40 includes a host processor 58 (e.g., central processing unit/CPU) that includes an integrated memory controller (IMC) 62, wherein the illustrated IMC 62 communicates with a system memory 64 (e.g., DRAM) over a bus or other suitable communication interface. In embodiments the host processor 58 and the IO module 60 are integrated onto a shared semiconductor die 56 in a system on chip (SoC) architecture. In embodiments, the host processor 58 corresponds to the host CPU 110 (
In embodiments, the IPU 42 includes IPU core(s) 43 which are processing cores. In some embodiments the IPU 42 is implemented using an FPGA (e.g., an FPGA platform). The IPU communicates with IPU local memory 44 (e.g., DRAM) over a bus or other suitable communication interface. In embodiments, the IPU 42 corresponds to the IPU 230 (
The switch 47 is coupled to the host processor 58, the IPU 42, and to storage devices 49. In embodiments, the switch 47 includes logic 48 to implement features such as, e.g., a remapping module (e.g., the remapping module 420 in
The computing system 40 is therefore performance-enhanced at least to the extent that it provides for transaction identifier remapping to enable direct application data transfers between host memory and storage drives in a cloud-based architecture having an IPU, while bypassing temporary storage of the application data in local IPU memory.
The semiconductor apparatus 30 can be constructed using any appropriate semiconductor manufacturing processes or techniques. For example, the logic 34 can include transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 32. Thus, the interface between the logic 34 and the substrate(s) 32 may not be an abrupt junction. The logic 34 can also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 32.
Embodiments of each of the above systems, devices, components, features and/or methods, including the system architecture 200, the IPU 230, the MR switch 250, the command flow sequence 300, the IPU 304, the MR switch 306, the MR switch 400, the remapping module 420, the IPU 430, the method 500, and/or any other system components, can be implemented in hardware, software, or any suitable combination thereof. For example, hardware implementations can include configurable logic, fixed-functionality logic, or any combination thereof. Examples of configurable logic include suitably configured PLAs, FPGAs, CPLDs, and general purpose microprocessors. Examples of fixed-functionality logic include suitably configured ASICs, combinational logic circuits, and sequential logic circuits. The configurable or fixed-functionality logic can be implemented with CMOS logic circuits, TTL logic circuits, or other circuits.
Alternatively, or additionally, all or portions of the foregoing systems, devices, components, features and/or methods can be implemented in one or more modules as a set of program or logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components can be written in any combination of one or more operating system (OS) applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
Example A1 includes a semiconductor apparatus comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable logic or fixed-functionality hardware logic, the logic to route, via a switch, application data in a data transfer message between a physical storage device and a host system, the host system to interface with a virtual function of an infrastructure processing unit (IPU), by remapping a transaction identifier field in the data transfer message between a first transaction identifier associated with the virtual function and a second transaction identifier associated with the physical storage device, wherein the physical storage device is to be managed by the IPU, and wherein to route the application data between the host system and the physical storage device includes to bypass temporary storage of the application data in a memory local to the IPU.
Example A2 includes the apparatus of Example A1, wherein the data transfer message is a write request issued by the physical storage device, and wherein to perform remapping the transaction identifier field in the data transfer message the logic is to: substitute, in the write request, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device.
Example A3 includes the apparatus of Example A1 or A2, wherein the data transfer message is a data completion issued by the host system, and wherein to perform remapping the transaction identifier field in the data transfer message the logic is to substitute, in the data completion, the second transaction identifier associated with the physical storage device in place of the first transaction identifier associated with the virtual function.
Example A4 includes the apparatus of Example A1, A2 or A3, wherein the logic is to substitute, in a read request to be issued by the physical storage device, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device, and route, via the switch, the read request to the host system, wherein the data completion is to be issued by the host system responsive to the read request.
Example A5 includes the apparatus of any of Examples A1-A4, wherein the logic is to maintain a remapping table to hold the first transaction identifier and the second transaction identifier.
Example A6 includes the apparatus of any of Examples A1-A5, wherein the first transaction identifier includes a virtual requester identifier associated with the virtual function, wherein the second transaction identifier includes a requester identifier for the physical storage device, and wherein the transaction identifier field includes a requester identifier field.
Example A7 includes the apparatus of any of Examples A1-A6, wherein the first transaction identifier further includes a first tag, wherein the second transaction identifier further includes a second tag, and wherein the transaction identifier field further includes a tag field.
Example A8 includes the apparatus of any of Examples A1-A7, wherein the logic is to perform data compaction by transferring stored data, via the switch, between the physical storage device and another physical storage device while bypassing temporary storage of the stored data in the memory local to the IPU.
Example A9 includes the apparatus of any of Examples A1-A8, wherein the physical storage device is hidden from the host system.
Example S1 includes a computing system comprising a host system comprising a host processor coupled to a host memory, an infrastructure processing unit (IPU), a plurality of storage devices, and a multi-root (MR) switch coupled to the host system, the IPU and the plurality of storage devices, wherein the computing system includes logic implemented at least partly in one or more of configurable logic or fixed-functionality hardware logic, the logic to route, via the MR switch, application data in a data transfer message between one of the physical storage devices and the host system, the host system to interface with a virtual function of the IPU, by remapping a transaction identifier field in the data transfer message between a first transaction identifier associated with the virtual function and a second transaction identifier associated with the one of the physical storage devices, wherein the physical storage devices are to be managed by the IPU, and wherein to route the application data between the host system and the one of the physical storage devices includes to bypass temporary storage of the application data in a memory local to the IPU.
Example S2 includes the computing system of Example S1, wherein the data transfer message is a write request issued by the physical storage device, and wherein to perform remapping the transaction identifier field in the data transfer message the logic is to substitute, in the write request, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device.
Example S3 includes the computing system of Example S1 or S2, wherein the data transfer message is a data completion issued by the host system, and wherein to perform remapping the transaction identifier field in the data transfer message the logic is to substitute, in the data completion, the second transaction identifier associated with the physical storage device in place of the first transaction identifier associated with the virtual function.
Example S4 includes the computing system of Example S1, S2 or S3, wherein the logic is to substitute, in a read request to be issued by the physical storage device, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device, and route, via the switch, the read request to the host system, wherein the data completion is to be issued by the host system responsive to the read request.
Example S5 includes the computing system of any of Examples S1-S4, wherein the logic is to maintain a remapping table to hold the first transaction identifier and the second transaction identifier.
Example S6 includes the computing system of any of Examples S1-S5, wherein the first transaction identifier includes a virtual requester identifier associated with the virtual function, wherein the second transaction identifier includes a requester identifier for the physical storage device, and wherein the transaction identifier field includes a requester identifier field.
Example S7 includes the computing system of any of Examples S1-S6, wherein the first transaction identifier further includes a first tag, wherein the second transaction identifier further includes a second tag, and wherein the transaction identifier field further includes a tag field.
Example S8 includes the computing system of any of Examples S1-S7, wherein the logic is to perform data compaction by transferring stored data, via the switch, between the physical storage device and another physical storage device while bypassing temporary storage of the stored data in the memory local to the IPU.
Example S9 includes the computing system of any of Examples S1-S8, wherein the physical storage device is hidden from the host system.
Example C1 includes at least one computer readable storage medium comprising a set of instructions which, when executed by a computing device, cause the computing device to route, via a switch, application data in a data transfer message between a physical storage device and a host system, the host system to interface with a virtual function of an infrastructure processing unit (IPU), by remapping a transaction identifier field in the data transfer message between a first transaction identifier associated with the virtual function and a second transaction identifier associated with the physical storage device, wherein the physical storage device is to be managed by the IPU, and wherein to route the application data between the host system and the physical storage device includes to bypass temporary storage of the application data in a memory local to the IPU.
Example C2 includes the at least one computer readable storage medium of Example C1, wherein the data transfer message is a write request issued by the physical storage device, and wherein to perform remapping the transaction identifier field in the data transfer message the instructions, when executed, cause the computing device to substitute, in the write request, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device.
Example C3 includes the at least one computer readable storage medium of Example C1 or C2, wherein the data transfer message is a data completion issued by the host system, and wherein to perform remapping the transaction identifier field in the data transfer message the instructions, when executed, cause the computing device to substitute, in the data completion, the second transaction identifier associated with the physical storage device in place of the first transaction identifier associated with the virtual function.
Example C4 includes the at least one computer readable storage medium of Example C1, C2 or C3, wherein the instructions, when executed, cause the computing device to substitute, in a read request to be issued by the physical storage device, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device, and route, via the switch, the read request to the host system, wherein the data completion is to be issued by the host system responsive to the read request.
Example C5 includes the at least one computer readable storage medium of any of Examples C1-C4, wherein the instructions, when executed, cause the computing device to maintain a remapping table to hold the first transaction identifier and the second transaction identifier.
Example C6 includes the at least one computer readable storage medium of any of Examples C1-C5, wherein the first transaction identifier includes a virtual requester identifier associated with the virtual function, wherein the second transaction identifier includes a requester identifier for the physical storage device, and wherein the transaction identifier field includes a requester identifier field.
Example C7 includes the at least one computer readable storage medium of any of Examples C1-C6, wherein the first transaction identifier further includes a first tag, wherein the second transaction identifier further includes a second tag, and wherein the transaction identifier field further includes a tag field.
Example C8 includes the at least one computer readable storage medium of any of Examples C1-C7, wherein the instructions, when executed, cause the computing device to perform data compaction by transferring stored data, via the switch, between the physical storage device and another physical storage device while bypassing temporary storage of the stored data in the memory local to the IPU.
Example C9 includes the at least one computer readable storage medium of any of Examples C1-C8, wherein the physical storage device is hidden from the host system.
Example M1 includes a method comprising routing, via a switch, application data in a data transfer message between a physical storage device and a host system, the host system interfacing with a virtual function of an infrastructure processing unit (IPU), by remapping a transaction identifier field in the data transfer message between a first transaction identifier associated with the virtual function and a second transaction identifier associated with the physical storage device, wherein the physical storage device is managed by the IPU, and wherein routing the application data between the host system and the physical storage device includes bypassing temporary storage of the application data in a memory local to the IPU.
Example M2 includes the method of Example M1, wherein the data transfer message is a write request issued by the physical storage device, and wherein remapping the transaction identifier in the data transfer message comprises substituting, in the write request, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device.
Example M3 includes the method of Example M1 or M2, wherein the data transfer message is a data completion issued by the host system, and wherein remapping the transaction identifier in the data transfer message comprises substituting, in the data completion, the second transaction identifier associated with the physical storage device in place of the first transaction identifier associated with the virtual function.
Example M4 includes the method of Example M1, M2 or M3, further comprising substituting, in a read request issued by the physical storage device, the first transaction identifier associated with the virtual function in place of the second transaction identifier associated with the physical storage device, and routing, via the switch, the read request to the host system, wherein the data completion is issued by the host system responsive to the read request.
Example M5 includes the method of any of Examples M1-M4, further comprising maintaining a remapping table to hold the first transaction identifier and the second transaction identifier.
Example M6 includes the method of any of Examples M1-M5, wherein the first transaction identifier includes a virtual requester identifier associated with the virtual function, wherein the second transaction identifier includes a requester identifier for the physical storage device, and wherein the transaction identifier field includes a requester identifier field.
Example M7 includes the method of any of Examples M1-M6, wherein the first transaction identifier further includes a first tag, wherein the second transaction identifier further includes a second tag, and wherein the transaction identifier field further includes a tag field.
Example M8 includes the method of any of Examples M1-M7, further comprising performing data compaction by transferring stored data, via the switch, between the physical storage device and another physical storage device while bypassing temporary storage of the stored data in the memory local to the IPU.
Example M9 includes the method of any of Examples M1-M8, wherein the physical storage device is hidden from the host system.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections, including logical connections via intermediate components (e.g., device A may be coupled to device C via device B). In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrases “one or more of A, B or C” may mean A, B, C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.