The present disclosure relates to the field of computational storage, and particularly to cohesively utilizing multiple computational storage devices to accelerate computation.
As the scaling of semiconductor technology (also known as Moore's Law) slows down and approaches an end, the computing capability of CPUs can no longer improve at its historical pace. This makes it increasingly necessary to complement CPUs with other computing devices, such as GPUs and FPGAs, that can handle certain computation-intensive workloads much more efficiently, an approach known as heterogeneous computing. For many data-intensive applications, computational storage can complement CPUs to implement highly effective heterogeneous computing platforms. The essence of computational storage is to empower data storage devices with additional processing or computing capability. Loosely speaking, any data storage device (e.g., HDD, SSD, or DIMM) that can carry out data processing tasks beyond its core data storage duties can be classified as computational storage. One desirable property of computational storage is that the total computing capability scales with the data storage capacity: when computing systems deploy multiple computational storage devices to increase storage capacity, the aggregate computing capability naturally increases as well.
With multiple storage devices, computing systems typically distribute one file or a big chunk of data across multiple storage devices in order to improve data access parallelism. However, such distributed data storage could cause severe resource contention when utilizing computational storage devices to accelerate streaming computation tasks with a sequential data access pattern (e.g., encryption and checksum).
Accordingly, embodiments of the present disclosure are directed to methods for utilizing multiple computational storage devices to accelerate streaming computation tasks.
A first aspect of the disclosure is directed to a host-assisted method for accelerating a streaming computation task, including: storing a plurality of data segments x to be processed for the streaming computation task among a plurality of computational storage devices; and, at the computational storage device in which a next data segment x_i to be processed for the streaming computation task is stored: receiving, from a host, an intermediate result u_{i-1} of the streaming computation task; performing a next streaming computation of the streaming computation task on the data segment x_i using the received intermediate result u_{i-1} to generate an intermediate result u_i of the streaming computation task; and sending the intermediate result u_i of the streaming computation task to the host.
A second aspect of the disclosure is directed to a method for reducing resource contention while performing a plurality of streaming computation tasks in a system including a host coupled to a plurality of computational storage devices, including: for each of the plurality of streaming computation tasks: for each data segment of a plurality of data segments to be processed for the streaming computation task: randomly choosing a computational storage device from the plurality of computational storage devices; and storing the data segment to be processed for the streaming computation task in the randomly chosen computational storage device.
A third aspect of the disclosure is directed to a storage system for performing a streaming computation task, including: a plurality of computational storage devices for storing a plurality of data segments x to be processed for the streaming computation task; and a host coupled to the plurality of computational storage devices, wherein the computational storage device in which a next data segment x_i to be processed for the streaming computation task is stored is configured to: receive, from the host, an intermediate result u_{i-1} of the streaming computation task; perform a next streaming computation of the streaming computation task on the data segment x_i using the received intermediate result u_{i-1} to generate an intermediate result u_i of the streaming computation task; and send the intermediate result u_i of the streaming computation task to the host.
The numerous advantages of the present disclosure may be better understood by those skilled in the art by reference to the accompanying figures.
Reference will now be made in detail to embodiments of the disclosure, examples of which are illustrated in the accompanying drawings.
Computational storage devices can perform in-line computation on the data read path, as illustrated in the accompanying drawings: as data segments are read from the storage media, the computation engine 18 inside the computational storage device 10 can process them on the fly before results are returned to the host 22.
Streaming computation tasks (e.g., encryption and checksum) must process the data in a strictly sequential manner. For example, for to-be-processed data x = [x_0, x_1, ..., x_{n-1}], a streaming computation task must complete the processing of data segment x_{i-1} before it can process x_i.
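By way of example, a CRC32 checksum has exactly this chained structure; the following minimal Python sketch illustrates the dependency u_i = f(u_{i-1}, x_i):

```python
import zlib

def streaming_checksum(segments):
    """u_i = f(u_{i-1}, x_i): each step needs the previous intermediate result."""
    u = 0                          # initial value before any segment is processed
    for x_i in segments:
        # The step over x_i cannot start until u_{i-1} is known, which is
        # exactly what forces strictly sequential processing.
        u = zlib.crc32(x_i, u)
    return u

# Chained per-segment processing equals processing the whole stream at once.
segments = [b"alpha", b"beta", b"gamma"]
assert streaming_checksum(segments) == zlib.crc32(b"alphabetagamma")
```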
For computing systems that contain multiple computational storage devices, data striping is typically applied across the multiple computational storage devices in order to improve data access parallelism and hence data access speed. As illustrated in the accompanying drawings, each file or large chunk of data is partitioned into segments that are distributed across the computational storage devices 10.
However, when striping data across multiple computational storage devices, a streaming computation task may require data from multiple computational storage devices. As a result, the computation engine in any one computational storage device cannot accomplish the entire streaming computation on its own.
According to embodiments, a host-assisted method is provided that can enable multiple computational storage devices 10 to collectively realize the streaming computation. For any streaming computation task over the data x = [x_0, x_1, ..., x_{n-1}], in order to carry out the computation on the data segment x_i, all i preceding data segments (i.e., x_0, x_1, ..., x_{i-1}) should already have been processed to produce an intermediate result u_{i-1}.
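By way of illustration, the following Python sketch outlines this host-assisted flow. MockDevice and its compute method are hypothetical stand-ins for a computational storage device 10 and its in-line computation engine 18, with a CRC32 step standing in for an arbitrary streaming function f:

```python
import zlib

class MockDevice:
    """Hypothetical stand-in for a computational storage device 10.

    Holds the data segments stored on its media and runs one streaming step
    in-line; CRC32 stands in for an arbitrary u_i = f(u_{i-1}, x_i).
    """
    def __init__(self):
        self.segments = {}                       # segment index -> stored bytes

    def compute(self, segment_id, u_prev):
        # The device reads x_i from its own storage media and folds it into
        # the intermediate result u_{i-1} received from the host.
        return zlib.crc32(self.segments[segment_id], u_prev)

def host_assisted_streaming(placement, devices, u_init=0):
    """Host-side loop; placement[i] is the index of the device storing x_i."""
    u = u_init                                   # intermediate result before x_0
    for i, dev_idx in enumerate(placement):
        # Ship u_{i-1} to the device holding x_i; it returns u_i to the host.
        u = devices[dev_idx].compute(i, u)
    return u

# Example: stripe six segments across three mock devices and run one task.
devices = [MockDevice() for _ in range(3)]
data = [b"seg%d" % i for i in range(6)]
placement = [i % 3 for i in range(6)]
for i, d in enumerate(placement):
    devices[d].segments[i] = data[i]
assert host_assisted_streaming(placement, devices) == zlib.crc32(b"".join(data))
```

Note that only the device holding the next data segment is busy at any instant, which motivates the concurrency and data placement considerations discussed below.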
In the above-described host-assisted streaming computation, for each streaming computation task, only one computational storage device 10 can carry out the streaming computation at any one time. According to embodiments, to better leverage the computation engines 18 in a plurality of computational storage devices 10, multiple concurrent streaming computation tasks may be performed over different sets of data. Given multiple concurrent streaming computation tasks, the host 22 can use an operational flow similar to that described above for each task, so that different tasks can proceed on different computational storage devices 10 at the same time.
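For illustration, the per-task loop from the hypothetical host_assisted_streaming sketch above can be launched concurrently for several tasks:

```python
from concurrent.futures import ThreadPoolExecutor

# Each entry in `placements` describes one task's segment-to-device layout;
# running the per-task host loop in parallel lets different computational
# storage devices serve different tasks at the same time.
def run_tasks_concurrently(placements, devices):
    with ThreadPoolExecutor(max_workers=len(placements)) as pool:
        futures = [pool.submit(host_assisted_streaming, p, devices)
                   for p in placements]
        return [f.result() for f in futures]
```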
In order to improve the achievable operational parallelism, it is highly desirable to reduce computation resource contention, i.e., to reduce the probability that one computational storage device 10 is scheduled to serve two or more streaming computation tasks at the same time. Given the data x = [x_0, x_1, ..., x_{n-1}] and m computational storage devices (denoted as S_0, S_1, ..., S_{m-1}), conventional practice simply stores each data segment x_i on the computational storage device S_j, where j = i mod m. All the data are striped across all the computational storage devices 10 in exactly the same pattern. However, such a conventional data placement approach may cause severe resource contention. For example, if multiple streaming computation tasks start at the same time, they will all compete for the resources of the first computational storage device S_0, and then move in lockstep to S_1, S_2, and so on, contending at every step.
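A few illustrative lines make this lockstep collision pattern of conventional striping concrete:

```python
# Conventional striping: segment x_i is stored on device S_{i mod m}, so every
# task's placement is identical and concurrent tasks collide at every step.
def round_robin_placement(n_segments, m_devices):
    return [i % m_devices for i in range(n_segments)]

task_a = round_robin_placement(8, 4)   # [0, 1, 2, 3, 0, 1, 2, 3]
task_b = round_robin_placement(8, 4)   # identical pattern
assert task_a == task_b                # both tasks start at S_0 and move in lockstep
```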
In order to reduce such resource contention, randomized data placement methods are presented. In particular, according to embodiments, even if a plurality of streaming computation tasks collide at one computational storage device 10 (i.e., the streaming computation tasks need to process data segments stored on the same computational storage device 10), the tasks will most likely move on to different computational storage devices 10 for their subsequent segments. Randomized data placement can be implemented in different manners; two possible implementations, each of which randomly places the data segments of each of a plurality of streaming computation tasks, are presented below.
In a first randomized data placement method, illustrated in the accompanying drawings, for each data segment x_i, the host 22 randomly chooses one computational storage device 10 from the plurality of computational storage devices 10 and stores the data segment x_i on the chosen device. The host 22 keeps a record of the placement of each data segment so that it can later direct each streaming computation step to the correct computational storage device 10.
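A minimal sketch of this first method (the function name is illustrative):

```python
import random

# First method: pick a device uniformly at random for every data segment;
# the host keeps the returned record so it can route each streaming step.
def random_placement(n_segments, m_devices):
    return [random.randrange(m_devices) for _ in range(n_segments)]

# Two tasks laid out this way rarely share the same device sequence.
task_a = random_placement(8, 4)
task_b = random_placement(8, 4)
```

Because each task draws its placement independently, two tasks that collide at one computational storage device 10 are unlikely to collide again on their next segments; the record returned here is what the host 22 consults to route each streaming step.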
In a second randomized data placement method, it is first noted that, given the vector [0, 1, ..., m-1], where m is the number of computational storage devices 10, there are in total m! (i.e., the factorial of m) different permutations of the computational storage devices 10, where each unique permutation is denoted as p_k with an index k ∈ [1, m!]. Given the data x = [x_0, x_1, ..., x_{n-1}] and m computational storage devices, without loss of generality, it is assumed that n is divisible by m, i.e., n = t·m where t is an integer. The data x is partitioned into t segment groups, where each segment group d_i = [x_{(i-1)·m}, x_{(i-1)·m+1}, ..., x_{i·m-1}] contains m consecutive data segments. For each segment group d_i, one permutation p_k is randomly chosen and used to realize the data placement, i.e., the j-th data segment in the segment group d_i is stored on the computational storage device S_h, where the index h is the j-th element of the chosen permutation p_k. The host 22 keeps a record of the index of the chosen permutation for each segment group. The corresponding operational flow is illustrated in the accompanying drawings.
In this operational flow, the host 22 first partitions the data x into the t segment groups; for each segment group d_i, the host 22 then randomly chooses one permutation p_k, stores the m data segments of the group on the computational storage devices 10 according to p_k, and records the index k of the chosen permutation.
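A corresponding illustrative sketch of the second method:

```python
import random

# Second method: with n = t*m, map each group of m consecutive segments onto
# the devices through an independently drawn random permutation; storing just
# the permutation index per group suffices for the host's record.
def permutation_placement(n_segments, m_devices):
    assert n_segments % m_devices == 0
    placement = []
    for _ in range(n_segments // m_devices):
        perm = list(range(m_devices))
        random.shuffle(perm)        # uniform over all m! permutations
        placement.extend(perm)      # j-th segment of the group -> S_{perm[j]}
    return placement
```

Unlike the fully random method, each segment group still touches every computational storage device 10 exactly once, so the balanced striping of conventional placement is preserved while the placement pattern of each task remains randomized.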
Advantageously, when using a randomized data placement (e.g., as depicted in the accompanying drawings), even if multiple streaming computation tasks momentarily collide at one computational storage device 10, they will most likely move on to different computational storage devices 10 for their subsequent data segments, thereby reducing resource contention and improving the achievable operational parallelism.
It is understood that aspects of the present disclosure may be implemented in any manner, e.g., as a software program, an integrated circuit board, or a controller card that includes a processing core, I/O, and processing logic. Aspects may be implemented in hardware or software, or a combination thereof. For example, aspects of the processing logic may be implemented using field programmable gate arrays (FPGAs), ASIC devices, or other hardware-oriented systems.
Aspects may be implemented with a computer program product stored on a computer readable storage medium. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, etc. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Python, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
The computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. The computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by hardware and/or computer readable program instructions.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The foregoing description of various aspects of the present disclosure has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the concepts disclosed herein to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to a person skilled in the art are included within the scope of the present disclosure as defined by the accompanying claims.
Related U.S. Application Data: Provisional Application No. 62/748,433, filed Oct. 2018 (US).