This application is submitted in the name of the following inventor:
Not applicable
Not Applicable
Not Applicable
The present disclosure generally relates to dynamic and preemptive erasure encoding in software defined storage (SDS) systems.
Aspects of the subject technology include a technique of erasure encoding includes receiving data for storage, inspecting the data for purposes of erasure encoding, and beginning preemptive erasure encoding of the data without waiting for the data to be completely delivered. The data may be received in packets over a network through an ordered transmission protocol, for example a reliable protocol such as TCP (transmission control protocol).
Inspection of headers of the packets may determine which of the data will be subject to the step of beginning preemptive erasure encoding. This inspection may be “deep packet inspection” as described in more detail below.
The preemptive erasure encoding may be performed based on customer parameters. The customer parameters may include a number or size of data chunks into which the data will be divided and a number of erasure encoding shards that will be created by the preemptive erasure encoding. The customer parameters may be modified for operation at wire speed.
In some aspects, one or more processors of a device that performs the storage may be offloaded from having to perform erasure encoding because of the preemptive erasure encoding. At least part of the preemptive erasure encoding may be performed by one or more network interface cards, which may include one or more field programmable gate arrays that perform at least part of the preemptive erasure encoding.
Some or all of the data and results of the preemptive erasure encoding may be forwarded to local storage and to remote storage.
The subject technology also includes systems that may perform the foregoing techniques. In some aspects, such systems include one or more network interfaces such as a network interface card that receives data for storage, and one or more processors that inspect the data for purposes of erasure encoding and begin preemptive erasure encoding of the data without waiting for the data to be completely delivered. In some aspects, one or more of the systems includes logic adjacent to the one or more network interfaces that performs at least part of the preemptive erasure encoding.
This brief summary has been provided so that the nature of the invention may be understood quickly. Additional steps and/or different steps than those set forth in this summary may be used. A more complete understanding of the invention may be obtained by reference to the following description in connection with the attached drawings.
Briefly, techniques according to aspects of the subject technology include receiving data for storage, inspecting the data for purposes of erasure encoding, and beginning preemptive erasure encoding of the data without waiting for the data to be completely delivered. The data may be received in packets over a network through an ordered transmission protocol, for example a reliable protocol such as TCP (transmission control protocol).
Inspection of headers of the packets may determine which of the data will be subject to the step of beginning preemptive erasure encoding. This inspection may be “deep packet inspection” as described in more detail below.
The preemptive erasure encoding may be performed based on customer parameters. The customer parameters may include a number or size of data chunks into which the data will be divided and a number of erasure encoding shards that will be created by the preemptive erasure encoding. The customer parameters may be modified for operation at wire speed.
For example, a customer preferably is able to choose any erasure code parameters they wish, even esoteric combinations of k+m. This capability may be achieved by performing erasure encoding according to aspects of the subject technology using a device that includes a network interface card (NIC). The NIC may in turn include at least one field programmable gate array (FPGA). The customer's chosen erasure encoding parameters can be built and dynamically loaded into the FPGA, for example remotely “in the field.”
In some aspects, one or more processors of a device that performs the storage may be offloaded from having to perform erasure encoding because of the preemptive erasure encoding. At least part of the preemptive erasure encoding may be performed by one or more network interface cards, which may include one or more field programmable gate arrays that perform at least part of the preemptive erasure encoding.
Some or all of the data and results of the preemptive erasure encoding may be forwarded to local storage and to remote storage.
The subject technology also includes systems that may perform the foregoing operations. For example, such systems include one or more network interfaces such as a network interface card that receives data for storage, and one or more processors that inspect the data for purposes of erasure encoding and begin preemptive erasure encoding of the data without waiting for the data to be completely delivered. In some aspects, one or more of the systems includes logic adjacent to the one or more network interfaces that performs at least part of the preemptive erasure encoding.
Further non-limiting details about some aspects of the subject technology are set forth in this section. Inclusion of details in this section about general aspects of the subject technology is not an admission that any of the disclosed details are prior art.
Most distributed Software Defined Storage (SDS) product's default replication level provides excellent protection against data loss by storing three copies of your data on different storage nodes. The chance of losing all three disks that contain the same objects, within the period that it takes the SDS system to rebuild from a failed disk, is on the extreme edge of probability. However, storing three copies of data vastly increases both the purchase cost of the hardware and also associated operational costs such as power and cooling. Furthermore, storing copies also means that for every data write, the backend storage must write three times the amount of data. In some scenarios, either of these drawbacks may mean that SDS is not a viable option. Erasure codes are designed to offer a solution. Erasure encoding allows distributed SDS systems to provide more usable storage from the same raw capacity.
Erasure encoding allows SDS systems to achieve either greater usable storage capacity or increase resilience to disk failure for the same number of disks versus the standard replica method. Erasure encoding achieves this by splitting up an object into a number of parts, calculating a type of forward error correction (FEC) code, the erasure code, and storing the results in one or more extra parts. Each part is then stored on a separate drive. These parts are referred to as k shards and M chunks, where k refers to the number of data shards and M refers to the number of erasure code shards. As in RAID, these can often be expressed in the form k+m (k=real data; some number of equal chunks of a data object; M=error correcting data). When the encoding function is called, it returns chunks of the same size to match the size of the data chunks. Data chunks which can be concatenated to reconstruct the original object and encoding chunks which can be used to rebuild a lost chunk.
In more detail, k may represent the number of data chunks, i.e. the number of chunks in which the original object is divided. For instance, if k=2 a 10 KB object will be divided into k objects of 5 KB each. M represents the number of encoding chunks, i.e. the number of additional chunks computed by the encoding functions. Further data can be found at the time of this disclosure at http://docs.ceph.com/docs/master/rados/operations/erasure-code/
One implementation specific detail that is often configured by the customer is the exact values of k+m. As the customer requirements evolve, the values for these parameters may change. In the solution described in this disclosure, a customer specific implementation of k+m, and all the encoding calculations involved in this, can be updated dynamically at any time to generate an accelerated point solution that is customer specific.
In the event of a drive failure which contains an object's shard which is one of the calculated erasure codes, data is read from the remaining drives that store data with no impact. However, in the event of a drive failure which contains the data shards of an object, the SDS can use the erasure codes to mathematically recreate the data from a combination of the remaining data and erasure code shards.
In order to calculate the code bits required for erasure encoding protection, Reed-Solomon and Galois field (finite field) matrix theory with relatively time-consuming software calculations may be employed. When calculated in software, these calculations utilize precious processor resources and also increase traffic latency for data writes and reads that need to be recovered due to missing data. As network speeds increase, these performance drawbacks are exacerbated.
For this reason, erasure encoding pools of storage are often advertised and utilized for large data pools, but with the expected tradeoff of higher main processor utilization and increased latency. These drawbacks have been addressed through various offload attempts into hardware for accelerating these erasure code calculations. However, existing known solutions still require termination of the TCP streams at the processor and an exchange of data with another offload element in order to calculate the erasure code shards. The SoftIron method intends to bypass the need for TCP stream termination before erasure encoding calculation can commence.
Known solutions today claiming to improve the speed of erasure encoding calculations (see Mellanox ConnectX-4) require the processor to initiate and terminate all erasure encoding offload transactions. Prior art on a similar subject as it relates to RAID offload (see U.S. Pat. No. 9,459,957) highlights a network path in the description of possible accelerated paths. However, there are no details on how this is accomplished.
Aspects of the subject technology focus at least in part on accelerating erasure encoding. Some aspects include deep packet inspection of TCP traffic where termination of the traffic is not required before computation can commence. These and other aspects of the subject technology disclosed herein may result in a measurable decrease in latency by performing erasure code calculations of storage data preemptively and in parallel to network stack termination code in the processor. Some of these aspects include deep packet inspection of storage TCP flows in or by one or more network interface card(s) (NICs), one or more of which may include one or more custom field programmable gate array(s) (FPGAs), before the data is expected to be computed by the processor for the same storage flows.
Of note, NICs that implement aspects of the subject technology do not necessarily require a round trip to transfer data into an offload FPGA or other processor to calculate erasure shards. Aspects of the subject technology may achieve this result through deep packet inspection of storage TCP and/or other information flows in by one or more custom FPGA based NICs before data is expected to be computed for storage.
Turning to the figures,
Data traffic enters the primary storage node via a NIC 101. Data is forwarded to storage node processor 102 for processing. Data is forwarded to the erasure encoding 103 for calculation of erasure code shards. Erasure code shards are returned to the processor. The processor 102 then forwards the original data and/or erasure code shards to at least one end target, namely local storage and/or remote storage 104.
Aspects of the subject technology radically change the foregoing process. With respect to
In the example, network and erasure encoding offload functionality are merged into an FPGA that includes the ability to perform deep packet inspection on incoming network traffic to bypass storage data and preemptively begin erasure code calculations. The FPGA includes normal NIC functionality, deep packet inspection and erasure encoding calculation capabilities, and the ability to easily configure and monitor targeted storage data flows. For the sake of brevity, aspects involving a single FPGA and NIC are described. However, multiple FPGAs and NICs may be used.
Data traffic enters the primary storage node via the network interface card that also includes the erasure code calculation logic.
Internally in the FPGA, traffic identified as belonging to storage flows requiring erasure encoding are copied to the erasure code block internally for erasure code shard calculation.
Both original data and erasure code shards are forwarded to a processor. The processor then forwards to two end targets: (a) the original data and erasure code shards destined to the local storage media, and (b) the original data and erasure code shards destined for remote storage nodes that must go back out over the network.
On possible aspect of the subject technology is storage aware deep packet inspection logic in incorporated into or facilitated by an FPGA in the NIC. This logic preferably has the ability to perform filtering of incoming TCP and/or other data packet header(s) in order to identify and copy packets requiring erasure encoding through a line rate match filter and arbitration scheme. In addition, some out of order packet arrival can be managed using internal and external buffering (memory) on the FPGA.
Local context memory may be used to define the filter used in the header search. This context memory can be programmed manually via the customer or IT professional if the TCP session data for the erasure coded streams are known or can be programmed automatically with additional software provided that works directly with the SDS product of choice to update the context memory as erasure coded pools are added.
Note the shown aspects are not necessarily a TCP/IP offload solution, but rather preferably illustrate a snooper that attempts to identify packet data eventually requiring erasure encoding and thereby may reduce latency of the system by preemptively making calculations that will imminently be requested by the storage node processor.
In some aspects, the transfer of network data to processor(s) during normal operations is not required before being directed back to the hardware for acceleration. Work preferably begins on the erasure code calculations as soon as the Ethernet, TCP, or other frame(s) containing the targeted packet or information related to the subject data hits the ingress network of the storage node. This highlights the benefit of using deep packet inspection to decode the erasure encoding flagged TCP sessions for pre-emptive erasure encode.
In accord with the above, network interface(s) 201 such as NICs, which may include one or more FPGAs, receives data and preferably performs deep packet inspection and/or erasure encoding on the data. The network interface(s) include pack inspection elements 202 and erasure encoding element(s) 203, for example processor(s), memory, and/or other computing elements, that may perform erasure encoding in parallel to reception of the data. Element 204 represents a transceiver to storage node processor(s) 205. The data and/or erasure code data is then sent to local and/or network storage 206.
In step 301, received data is inspected. This inspection may involve “deep packet inspection” as described further below, for example with reference to
Step 303 represents provision and/or acceptance of customer k+m parameters, as discussed above. This data preferably is provided to step 302.
Headers are inspected in step 304. In some aspects, step 304 is part of step 302.
The customer parameters may be modified in step 305, for example to enable operation at wire speed.
In step 401, k+m values are chosen, specified by a customer, or otherwise determined. In step 402, one or more NICs are configured and/or designed based on the k+m values from step 401. If the NIC(s) include one or more FPGAs, configuration and/or design of the NIC(s) may include configuration and/or design of those FPGA(s).
In step 403, the NIC(s) may be updated, for example to accommodate new k+m values. If the NIC(s) include one or more FPGAs, update may include field and/or other update of the FPGA(s). In preferred aspects, the customer parameters may be modified for operation at wire speed through such updating of the FPGA(s).
Any and/or all of these steps may occur dynamically, for example in real time based on customer inputs and/or requests.
Network 501 may be a private (e.g., Virtual Private Network—VPN) or public (e.g., the World Wide Web) network from which data for potential pre-emptive erasure encoding may be received. In the case of a VPN, one or more VPN termination (encryption) block(s) may be added.
The data is received by at least one transceiver and/or receiver 502. The data is encoded/decoded by element 503, packet data 504 is determined, and framing such as MAC framing 505 is performed. The resulting data is sent to other steps and/or elements according to aspects of the subject technology as illustrated by the arrow extending from block 505. Data may also be sent from other steps and/or elements. The data preferably also is sent to at least one CPU 506 via at least one CPU interface 507 such as a PCIe, for example to be processed, stored on network storage, and/or further managed for such storage.
Flow 508 represents deep packet inspection of the data from block 505 or any of the other foregoing blocks. For example, the inspection may be of SDS application headers to find data associated with erasure coded pools. Multiple headers, for example based on different transmission protocol layers, may be inspected. Four such headers 509 to 512 are illustrated. Fewer or more headers may be inspected.
Data for pre-emptive erasure encoding preferably is sent to staging memory 513 as represented by flow 514. Use of staging memory may help account for possible out of order data before delivering contiguous blocks for erasure encoding.
Staging memory 513 may be part of one or more FPGAs that reside in or on a NIC and/or are otherwise accessible according to aspects of the subject technology.
This data is then erasure encoded in step/block 515. Again, this step/block preferably is performed by or a part of at least one FPGA that resides in or on a NIC and/or are otherwise accessible according to aspects of the subject technology. Erasure encoding preferably is performed on contiguous data to generate erasure code shard(s) for saving to local memory and/or shipping to network memory in response to requests from the CPUs.
For example, the data and/or code shards may be sent to memory 516 for local storage and/or forwarding to network storage by or under control of CPU 506. In some aspects, staging memory 513 and memory 516 may be the same memory, may contain portions of each other, or may be entirely different.
Step/block 601 represents a customer choosing k+m erasure encoding parameters depending on their unique application and/or redundancy requirements. Step/block 602 represents designing NIC(s) and/or related FPGA(s) based on those parameters. For example, a bitfile for updating FPGA(s) may be generated. The bitfile and/or other design parameters may be pushed to a customer's computing platform 603, for example in their own or a leased data center. Storage nodes 604 may then be updated through interface 605. This update may occur through a local or remote “push” operation. Alternatively, other techniques for updating the storage nodes such as via a USB drive or other local media may be used.
The subject technology has been shown to improve storage efficiency in several experiments. Results of some such experiments are shown in
These figures show an example of possible latency reduction by a NIC utilizing a PCIe Generation 3, x8 system bus connected to the CPU. Using calculations of bandwidth and latency, we see that the typical one-way latency for a 128-byte PCIe packet is about 19.5 nanoseconds. Note that we stop at 128-byte maximum transfer size because that is the typical size allowed by Intel CPUs to PCIe endpoints.
General specifications for this test were the following:
Curve 701 in
Curve 801 in
The subject technology may be performed by and improve the performance of one or more computing devices. The computing device preferably includes at least a tangible computing element. Examples of a tangible computing element include but are not limited to a microprocessor, application specific integrated circuit, programmable gate array, memristor based device, and the like. A tangible computing element may operate in one or more of a digital, analog, electric, photonic, and/or some other manner. Examples of a computing device include but are not limited to a mobile computing device such as a smart phone or tablet computer, a wearable computing device (e.g., Google® Glass), a laptop computer, a desktop computer, a server, a client that communicates with a server, a smart television, a game console, a part of a cloud computing system, a virtualized computing device that ultimately runs on tangible computing elements, or any other form of computing device. The computing device preferably includes or accesses storage for instructions and data used to perform steps such as those discussed above.
Additionally, some operations may be considered to be performed by multiple computing devices. For example, steps of displaying may be considered to be performed by both a local computing device and a remote computing device that instructs the local computing device to display something. For another example, steps of acquiring or receiving may be considered to be performed by a local computing device, a remote computing device, or both. Communication between computing devices may be through one or more other computing devices and/or networks.
The invention is in no way limited to the specifics of any particular embodiments and examples disclosed herein. For example, the terms “aspect,” “example,” “preferably,” “alternatively,” and the like denote features that may be preferable but not essential to include in some embodiments of the invention. In addition, details illustrated or disclosed with respect to any one aspect of the invention may be used with other aspects of the invention. Additional elements and/or steps may be added to various aspects of the invention and/or some disclosed elements and/or steps may be subtracted from various aspects of the invention without departing from the scope of the invention. Singular elements/steps imply plural elements/steps and vice versa. Some steps may be performed serially, in parallel, in a pipelined manner, or in different orders than disclosed herein. Many other variations are possible which remain within the content, scope, and spirit of the invention, and these variations would become clear to those skilled in the art after perusal of this application.