System and method for offloading computation to storage nodes in distributed system

Description

BACKGROUND
Field

This disclosure is generally related to distributed computing systems. More specifically, this disclosure is related to a method and system that improves the performance of a distributed computing system.

Related Art

A distributed computing system can refer to a system whose components are located on different networked computers, which communicate and coordinate their actions to achieve a common goal. Typical distributed computing systems can include two types of nodes: compute nodes and storage nodes. Compute nodes can be responsible for receiving and processing incoming requests from users or applications, whereas storage nodes can be responsible for storing data. A typical system operation may include a compute node receiving a user request, obtaining necessary data from a storage node, processing the obtained data according to the user request, and sending the processed or updated data back to the storage node for storage. Such a process can involve data being passed between the compute node and the storage node multiple times. Moving large amounts of data can consume considerable amounts of time and bandwidth, and can become a significant bottleneck for performance improvement in the distributed system.

SUMMARY

One embodiment described herein provides a distributed computing system. The distributed computing system can include a compute cluster comprising one or more compute nodes and a storage cluster comprising a plurality of storage nodes. A respective compute node can be configured to: receive a request for a computation task; obtain path information associated with data required by the computation task; identify at least one storage node based on the obtained path information; send at least one computation instruction associated with the computation task to the identified storage node; and receive computation results from the identified storage node subsequent to the identified storage node performing the computation task.

In a variation on this embodiment, the distributed computing system further includes at least one master node. The master node can be configured to maintain a compute context associated with the data, generate data-placement paths based on the compute context, and provide the path information according to the data-placement paths to the compute node.

In a variation on this embodiment, the compute node can be further configured to partition the computation task into a number of sub-tasks based on the path information, which indicates locations of the data on the plurality of storage nodes. The computation task is partitioned in such a way that a respective sub-task only requires data stored on a single storage node. The compute node can then send the respective sub-task to the corresponding single storage node to allow the single storage node to execute the respective sub-task.

In a further variation, the compute node is further configured to receive computation results from multiple storage nodes executing the sub-tasks to generate a combined result.

In a variation on this embodiment, the compute node can be further configured to: receive data to be written into the storage cluster, group the to-be-written data into one or more data chunks based on compute context associated with the to-be-written data, and submit the compute context associated with the to-be-written data to a master node of the distributed system.

In a further variation, the compute node can be further configured to: receive, from the master node, data-placement paths for the data chunks, with a respective data-placement path indicating a storage node for storing a corresponding data chunk; and write the data chunks into corresponding storage nodes identified by the data-placement paths.

In a further variation, the master node can be configured to store the compute context as part of metadata of the to-be-written data.

In a variation on this embodiment, the identified storage node can be further configured to execute the computation task, determine whether execution of the computation task updates the data, and send a data-update notification to a master node of the distributed computing system in response to determining that the data is updated.

In a further variation, the master node can be configured to update metadata associated with the data in response to receiving the data-update notification, perform a lookup for data paths associated with replicas of the data stored on other storage nodes within the storage cluster, and send the data paths associated with the replicas to the identified storage node.

In a further variation, the identified storage node can be further configured to synchronize the replicas of the data stored on the other storage nodes, in response to receiving the data paths associated with the replicas.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates the architecture of a distributed computing system.

FIG. 2 presents a diagram illustrating an exemplary data-placement system, according to one embodiment.

FIG. 3 presents a flowchart illustrating an exemplary process for writing data in the distributed system, according to one embodiment.

FIG. 4 presents a diagram illustrating the flow of information during computation, according to one embodiment.

FIG. 5 presents a diagram illustrating the flow of information during data synchronization, according to one embodiment.

FIG. 6 presents a flowchart illustrating an exemplary process for executing a computation task, according to one embodiment.

FIG. 7 presents a flowchart illustrating an exemplary process for data synchronization, according to one embodiment.

FIG. 8A shows the exemplary structure of a compute node, according to one embodiment.

FIG. 8B shows the exemplary structure of a storage node, according to one embodiment.

FIG. 9 conceptually illustrates an electronic system, which implements some embodiments of the subject technology.

In the figures, like reference numerals refer to the same figure elements.

DETAILED DESCRIPTION

The following description is presented to enable any person skilled in the art to make and use the embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

OVERVIEW

Embodiments of the present disclosure solve the problem of improving the performance of a distributed computing system by offloading data-intensive computation tasks from the compute node to the storage nodes. More specifically, in order to offload the computational tasks to the storage nodes, the system can combine the computing-task partition with the data placement to ensure that data needed for the computation is stored locally on the storage node performing the computation. To do so, the master node of the distributed system needs to keep the data context used for computation in addition to the distributed storage logic. This approach can significantly reduce the data amount loaded from storage clusters to the compute node, thus providing the benefits of reducing system latency, network bandwidth consumption, required capacity of the compute cache, and the overall CPU consumption.

Distributed Computing System

FIG. 1 illustrates the architecture of a distributed computing system. Distributed computing system 100 can include a compute cluster comprising a number of compute nodes (e.g., compute nodes 102 and 104) and a storage cluster comprising a number of storage nodes (e.g., storage nodes 106 and 108). The compute nodes and the storage nodes can be coupled to each other via a network 110, which can include a high-speed Ethernet. Network 110 can also include other types of wired or wireless networks.

As shown in FIG. 1, a compute node can include one or more memories, CPUs, and at least one cache. For example, compute node 102 can include memories 112 and 114, CPU(s) 116, and a compute cache 118. In some embodiments, memories 112 and 114 can include large capacity dual in-line memory modules (DIMMs); CPU(s) 116 can include one or more powerful CPUs (which can include single-core or multi-core CPUs) that can handle computation-intensive tasks; and compute cache 118 can be implemented using solid-state drives (SSDs). The compute node can be designed and configured to provide high efficiency computation and processing. For example, CPU(s) 116 can include computation-intensive CPUs.

A storage node can also include memories and CPUs. For example, storage node 106 can include memories 122 and 124, and CPU(s) 126. Compared to those in the compute node, the CPUs in the storage node can be less powerful (e.g., having a lower processing speed) and the memories in the storage node can have less capacity. Storage node 106 can further include a number of storage modules, such as storage modules 128 and 130. In some embodiments, storage modules 128 and 130 can include large capacity SSDs or hard disk drives (HDDs). Each storage node can be designed and configured to provide high-performance large-capacity storage. Moreover, the plurality of storage nodes within the distributed system can form a storage cluster, which not only can provide the desired large storage capacity but also can provide sufficient reliability by employing multiple replicas.

In a conventional distributed system, the compute nodes perform all required computation and data processing. For example, when a user makes a request for certain data processing (e.g., updating a stored table) tasks, a compute node receives the user request and fetches the needed data (e.g., the stored table) from the storage cluster. Subsequent to processing the data, the compute node returns the processed data (e.g., the updated table) to the storage cluster for storage. As one can see, even for a simple processing task, the data needs to travel back and forth at least twice between the compute node and the storage node. A compute cache (e.g., compute cache 118) can be implemented to improve the performance of the distributed system by reducing data movements. However, the capacity of the compute cache is limited, and in the event of a cache miss, the data still needs to be fetched from the storage cluster.

The back-and-forth movements of the data not only increase the system latency (computation happens after data is loaded from the storage cluster) but also increase the operation complexity. More specifically, the long data write path means that synchronization among the multiple replicas of the data will be needed. Moreover, the loading and writing of the data can consume a large amount of bandwidth, thus leading to degraded system performance.

Schemes for Offloading Computations to Storage Nodes

To improve the performance of the distributed system, in some embodiments, certain computation tasks (e.g., data-intensive computation tasks) can be offloaded from the compute nodes to one or more storage nodes, thus reducing system latency and the amount of data movement. For example, certain e-commerce applications, such as updating product inventory or adjusting product prices, can be data-intensive, because the number of entries in the to-be-updated table can be huge. On the other hand, such computations are relatively simple and often do not need powerful CPUs. It is more efficient to perform such computations (e.g., table updates) at the storage node where the data is stored. In a distributed system, data needed for a computation may be scattered among multiple storage nodes. To ensure that a storage node performing the computation has all of the data needed for the computation, the system needs to make sure, during data placement, that data placed onto the storage node can meet the requirements of the computation. In order to do so, the data-placement layer needs to have, in addition to distributed-storage logic information, compute context information associated with the to-be-stored data. When placing data, along with its replicas, among the plurality of storage nodes within the distributed system, a compute node can take into consideration both the distributed-storage logic and the compute context.

FIG. 2 presents a diagram illustrating an exemplary data-placement system, according to one embodiment. Data-placement system 200 can be a distributed system, and can include one or more master nodes (e.g., master node 202), a plurality of compute nodes (e.g., compute node 204), and a plurality of storage nodes (e.g., storage nodes 206, 208, and 210).

In a distributed system implementing the master/slave architecture, a master node (also referred to as a primary node) (e.g., master node 202) can be in charge of distributing data (e.g., assigning data-placement paths) among the storage nodes. A compute node can group data and write data to the storage nodes based on the data-placement paths assigned by the master node. For example, master node 202 can send data-placement path information to compute node 204, which writes the data to storage nodes 206, 208, and 210 based on the data-placement path. A master node can be a compute node or a storage node.

As discussed previously, to improve efficiency, data needed for a computation should be grouped together and stored on the same storage node performing the computation. In other words, data grouping and data-placement path-assigning need to take into consideration the compute context of the data. In some embodiments, the compute nodes send the file-organization information along with the data context used for computation to the master node, which uses such information to assign a data-placement path and send the data-placement path to the compute nodes.

FIG. 3 presents a flowchart illustrating an exemplary process for writing data in the distributed system, according to one embodiment. During operation, a compute node receives user data to be stored in the distributed system (operation 302). The compute node can merge or group the received data into one or more chunks based on the compute context associated with the data (operation 304). The compute context associated with the data can include but is not limited to: the data type, the source of the data, the format of the data, possible computations that can be performed on the data, etc. Data grouped into a particular chunk may share similar compute contexts and is more likely to participate in the same computation task or sub-task. The compute node can also send the compute context associated with the data to the master node, which registers the compute context as part of the metadata (operation 306).
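The grouping in operation 304 can be pictured with a minimal Python sketch. It assumes, purely for illustration, that a compute context is a hashable tuple of (data type, source) and that data sharing an identical context belongs in the same chunk; the function name `group_by_context` and the sample records are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict


def group_by_context(records):
    """Group (context, payload) pairs into chunks keyed by compute context.

    The context is assumed to be a hashable value, here a
    (data_type, source) tuple chosen only for illustration.
    """
    chunks = defaultdict(list)
    for context, payload in records:
        chunks[context].append(payload)
    return dict(chunks)


# Hypothetical incoming data: two price-table rows and one inventory row.
records = [
    (("price_table", "shop_a"), "row-1"),
    (("inventory_table", "shop_a"), "row-2"),
    (("price_table", "shop_a"), "row-3"),
]
chunks = group_by_context(records)
```

Under these assumptions the two price-table rows land in one chunk and the inventory row in another, so a later price update would touch a single chunk.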

The master node can then generate data-placement paths based on the data grouping and other data-storage considerations (e.g., current loads of the storage nodes, redundancy requirement, etc.) (operation 308) and send the data-placement paths to the compute node (operation 310). The master node maintains both the compute context and the data-placement paths for each data chunk. The compute node can then write the data to corresponding storage nodes based on the received data-placement paths (operation 312). The compute node needs to make sure that each individual data chunk is kept together and written into the same storage node, whereas multiple copies of the data chunk can be written into multiple different storage nodes. In some embodiments, to provide redundancy, at least three copies of each data chunk are written into the storage nodes, with each copy being written into a different storage node.
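One way a master node might turn storage-node loads and a redundancy requirement into data-placement paths is sketched below. The least-loaded-first policy, the function name `assign_placement`, and the load counters are assumptions made for illustration; the disclosure does not specify a particular placement algorithm.

```python
def assign_placement(chunk_ids, node_loads, replicas=3):
    """Return {chunk_id: [node, ...]} with each chunk on `replicas` distinct nodes.

    Assumed policy: pick the currently least-loaded nodes, breaking ties by
    node name, and count each placed copy as one unit of load.
    """
    if replicas > len(node_loads):
        raise ValueError("not enough storage nodes for the requested replica count")
    loads = dict(node_loads)  # work on a copy so the caller's view is untouched
    placement = {}
    for chunk_id in chunk_ids:
        chosen = sorted(loads, key=lambda n: (loads[n], n))[:replicas]
        for node in chosen:
            loads[node] += 1
        placement[chunk_id] = chosen
    return placement


# Hypothetical cluster of four idle storage nodes and two chunks to place.
paths = assign_placement(["chunk-0", "chunk-1"], {"s1": 0, "s2": 0, "s3": 0, "s4": 0})
```

Each chunk stays whole on every node it is assigned to, and its three copies always sit on three different nodes.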

The system can determine whether a predetermined number of copies of each chunk have been successfully written (operation 314). For example, when three copies are to be written, the system can determine whether at least two copies of each data chunk have been successfully written. In some embodiments, each time a copy is written successfully into a storage node, the storage node can report to the compute node. If a sufficient number of copies have been successfully written, the compute node can acknowledge that the write is complete and the data is available (operation 316). If not, the compute node continues to write. This way, the data can be available for users or applications before all copies are written, thus reducing latency. The compute node can finish writing the remaining data copies (e.g., the last copy of the three data copies) to the storage nodes (operation 318).
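The early acknowledgement in operations 314 through 318 can be sketched as follows. The sketch is sequential and in-memory for simplicity; `write_with_early_ack`, the `write_fn` callback, and the threshold of two out of three copies are illustrative assumptions, and a real system would issue the writes over the network, likely concurrently.

```python
def write_with_early_ack(chunk, targets, write_fn, ack_threshold=2):
    """Write `chunk` to every target node.

    Returns (acked_after, successes): the 1-based write attempt at which the
    acknowledgement threshold was first met (None if never met) and the total
    number of successful copies once all writes have been attempted.
    """
    successes = 0
    acked_after = None
    for attempt, node in enumerate(targets, start=1):
        if write_fn(node, chunk):
            successes += 1
            if acked_after is None and successes == ack_threshold:
                acked_after = attempt  # data can be reported as available here
    return acked_after, successes


stores = {}  # hypothetical in-memory stand-in for the storage nodes


def write_fn(node, chunk):
    stores[node] = chunk
    return True


result = write_with_early_ack("chunk-0", ["s1", "s2", "s3"], write_fn)
```

Here the write is acknowledged after the second copy lands, and the third copy is still written afterwards.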

FIG. 4 presents a diagram illustrating the flow of information during computation, according to one embodiment. During operation, a compute node 402 can launch a computation task. In some embodiments, compute node 402 can send a query 404 to a master node 406 to request data-placement information associated with the computation. For example, if the computation involves updating a table, compute node 402 can query master node 406 about the storage locations of the table content (e.g., columns and/or rows of the table). Master node 406 can respond to query 404 using the stored data-placement path information 408. Using the table update as an example, master node 406 can send storage information associated with the table (e.g., which portion of the table is stored at a particular storage node) to compute node 402.

In some embodiments, compute node 402 can then partition the computation task into a plurality of sub-tasks based on the compute context of the data. More specifically, compute node 402 can partition the computation task in such a way that each sub-task only requires data stored on a single storage node. This way, the single storage node can perform the sub-task without the need to request additional data from other storage nodes. For example, when partitioning the computation task of updating a table into a number of sub-tasks, compute node 402 can partition the computation task based on the way the table is stored in multiple storage nodes. A sub-task can include updating a section (e.g., a set of rows or columns) of the table, with such a section being stored on a particular storage node. Hence, that particular storage node can perform the sub-task without the need to obtain additional table content from other storage nodes. Note that task partitioning can be optional. When the computation task is relatively small, compute node 402 may choose not to partition the computation task.
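The partitioning rule, that every sub-task touches data on exactly one storage node, can be illustrated with a short sketch. The mapping from table sections to node names and the function name `partition_by_node` are hypothetical; in the described system the locations would come from the path information returned by the master node.

```python
from collections import defaultdict


def partition_by_node(section_locations):
    """Map each storage node to the table sections it holds locally.

    Each entry of the result is one sub-task: it needs only data that is
    already stored on the node it is keyed by.
    """
    sub_tasks = defaultdict(list)
    for section, node in section_locations.items():
        sub_tasks[node].append(section)
    return {node: sorted(sections) for node, sections in sub_tasks.items()}


# Hypothetical path information: which node stores which rows of the table.
locations = {"rows_0_99": "s1", "rows_100_199": "s2", "rows_200_299": "s1"}
sub_tasks = partition_by_node(locations)
```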

Compute node 402 can send computation instruction 410 to corresponding storage node 412. When the computation task has been divided into a plurality of sub-tasks, compute node 402 can send the computation instruction for each sub-task to its corresponding storage node. For distributed systems, the data often have multiple replicas stored at multiple storage nodes, and master node 406 may send path information associated with the multiple data replicas to compute node 402. However, instead of offloading the computation task to the multiple storage nodes storing the multiple replicas, compute node 402 offloads the computation task to a single replica of the data (e.g., to the one or more storage nodes that store the single replica). To do so, compute node 402 randomly selects a data replica, identifies one or more storage nodes storing the data replica, and sends the computation instruction for each sub-task to the corresponding storage node. For example, a replica of a table may be stored in three different storage nodes, with each storage node storing a section of the table. Accordingly, compute node 402 may send the computation instruction for updating each section of the table to each corresponding storage node. Alternatively, each of the three storage nodes may store a replica of the entire table, and only one replica is selected for each sub-task. More specifically, instead of offloading the table-updating task to one storage node, compute node 402 can partition the table-updating task into three sub-tasks, with each sub-task updating a portion of the table. Compute node 402 can then send the computation instruction for each sub-task to each of the three storage nodes. When selecting which replica to send the computation task to, compute node 402 may perform load balancing (i.e., ensure that the sub-tasks are distributed evenly among the storage nodes to avoid heavy load on any particular storage node). Note that each computation instruction only affects a portion of the table, and the combined affected portions from the three storage nodes form a complete replica of the table.
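The load-balanced replica choice can be sketched for the case where each of three storage nodes holds a full replica of the table. Note that this sketch swaps the random selection mentioned above for a deterministic least-loaded choice so that the outcome is reproducible; the function name `pick_replicas` and the node and section names are hypothetical.

```python
def pick_replicas(candidates):
    """candidates: {sub_task: [nodes holding a replica of its data]}.

    Returns {sub_task: chosen node}, always choosing the candidate node that
    has received the fewest sub-tasks so far (ties broken by node name).
    """
    assigned = {}
    load = {}
    for sub_task in sorted(candidates):
        node = min(candidates[sub_task], key=lambda n: (load.get(n, 0), n))
        assigned[sub_task] = node
        load[node] = load.get(node, 0) + 1
    return assigned


# Three sub-tasks, each of which could run on any of the three replicas.
candidates = {
    "section_a": ["s1", "s2", "s3"],
    "section_b": ["s1", "s2", "s3"],
    "section_c": ["s1", "s2", "s3"],
}
assignment = pick_replicas(candidates)
```

With these inputs each storage node receives exactly one sub-task, so the three updated sections together cover one complete replica of the table.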

FIG. 4 also shows that storage node 412 can include a network interface card (NIC) 414, processor(s) 416, and storage device(s) 418. Processor(s) 416 can include any type of processing unit, such as a central processing unit (CPU), graphics processing unit (GPU), field-programmable gate array (FPGA), etc. Storage device(s) 418 can include hard-disk drives (HDDs), such as conventional magnetic recording (CMR) HDDs and shingled magnetic recording (SMR) HDDs, as well as solid-state drives (SSDs), etc. A typical storage node can have a relatively large storage capacity. Subsequent to receiving the computation instruction, processor(s) 416 can load data from the local drives (e.g., storage device(s) 418) to perform the computation task or sub-task based on the computation instruction. Storage node 412 can then return computation result 420 to compute node 402. For a partitioned computation task, compute node 402 gathers computation results from all sub-tasks. Compute node 402 can further return the computation result to the user or application requesting the result.

During and after the computation, the data (or a portion of the data) is often updated. Because the distributed system maintains multiple replicas of the data, it is essential to maintain synchronization among the multiple replicas in order to ensure data consistency. In some embodiments, once a storage node updates its locally stored data by performing a computation task or sub-task, the storage node needs to send an update notification to the master node. The master node can then update its metadata record associated with the data and look up paths to other replicas of the data. Based on the looked up paths, the master node can synchronize all replicas of the data according to the updated data. This approach proactively synchronizes data based on the updated copy. In some embodiments, data consistency is also checked periodically. However, unlike the data scrub scheme used in a distributed storage system where data majority is used as a selection criterion, in the invented system, the updated version can be used as a selection criterion for correct data. In other words, a data replica that is most recently updated can be chosen as the correct copy for data synchronization.
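The difference between the two selection criteria can be shown with a small sketch. It assumes each replica carries a monotonically increasing version number as its freshness marker; the disclosure only says that the most recently updated replica is chosen, so the `version` field and both function names are illustrative.

```python
from collections import Counter


def pick_by_majority(replicas):
    """Conventional scrub: the value held by most replicas is taken as correct."""
    value, _count = Counter(r["value"] for r in replicas).most_common(1)[0]
    return value


def pick_by_latest(replicas):
    """Criterion described above: the most recently updated replica is correct."""
    return max(replicas, key=lambda r: r["version"])["value"]


# Node s1 has just executed an offloaded update; s2 and s3 are not yet synchronized.
replicas = [
    {"node": "s1", "version": 2, "value": "price=12"},
    {"node": "s2", "version": 1, "value": "price=10"},
    {"node": "s3", "version": 1, "value": "price=10"},
]
majority_choice = pick_by_majority(replicas)
latest_choice = pick_by_latest(replicas)
```

A majority vote would restore the stale value and discard the offloaded update, whereas the latest-version criterion keeps it.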

FIG. 5 presents a diagram illustrating the flow of information during data synchronization, according to one embodiment. During operation, a storage node 502 can perform a computation task or sub-task, resulting in data stored at storage node 502 being updated. Storage node 502 sends computation result 504 to compute node 506, which is the node assigning the computation task or sub-task. Compute node 506 can gather results from all sub-tasks and return the results to the user or application requesting the result. Moreover, storage node 502 sends an update notification 508 to master node 510, notifying master node 510 that data stored on storage node 502 has been updated. Master node 510 updates its own record (e.g., metadata associated with the data) and looks up paths to other replicas of the data. For example, master node 510 can determine that other replicas of the data are stored in storage nodes 512 and 514. Master node 510 can send the path information 516 to storage node 502, notifying storage node 502 that other replicas are stored in storage nodes 512 and 514. Based on the path information, storage node 502 can send updated data 518 to storage nodes 512 and 514, which then synchronize their local copy of the data to the updated data.

Note that, in certain scenarios, multiple storage nodes may be used to perform a computation task, with each storage node performing a sub-task and updating its local copy of the data based on the local computation result. For example, when each of storage nodes 502, 512, and 514 updates a separate section of a table, each storage node can update its local copy of the table based on the local computation result. To ensure data consistency, each storage node needs to use its updated table section to synchronize corresponding table sections stored at other storage nodes. If storage node 502 updates the first n rows of the table, storage node 502 needs to send the updated first n rows of the table to storage nodes 512 and 514 to allow these two storage nodes to update the first n rows of the table stored locally.
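Section-level synchronization can be sketched with in-memory tables. The sketch assumes the updated sections are disjoint row ranges, as in the example above; the function name `sync_sections` and the toy table values are hypothetical.

```python
def sync_sections(tables, owned_rows):
    """Push each node's updated rows to every other replica, in place.

    tables:     {node: list of rows}, one full table copy per storage node
    owned_rows: {node: range of row indices that node updated locally}
    The ranges are assumed to be disjoint, so the push order does not matter.
    """
    for owner, rows in owned_rows.items():
        for peer in tables:
            if peer == owner:
                continue
            for i in rows:
                tables[peer][i] = tables[owner][i]


# Each node has updated one row of its own copy (original rows were 1, 2, 3).
tables = {"s1": [11, 2, 3], "s2": [1, 12, 3], "s3": [1, 2, 13]}
owned_rows = {"s1": range(0, 1), "s2": range(1, 2), "s3": range(2, 3)}
sync_sections(tables, owned_rows)
```

After the exchange all three copies hold the fully updated table.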

FIG. 6 presents a flowchart illustrating an exemplary process for executing a computation task, according to one embodiment. During operation, a compute node launches a computation task and queries the master node for path information associated with the data involved in the computation task (operation 602). For example, for e-commerce applications, a computation task can involve updating the price or inventory information associated with a product, and such a computation task often involves updating a table. The compute node can obtain location information of the data (e.g., the table), including replicas of the data, from the master node. In some embodiments, different portions of the required data (e.g., different sections of the table) may be stored on different storage nodes. In such a scenario, the path information can include the location of each data portion.

Upon receiving the location information of the data, the compute node can partition the computation task into a number of sub-tasks based on the data location information (operation 604). More specifically, the computation task can be partitioned in such a way that a respective sub-task only involves data stored within a single storage node. If the computation task is updating a large table, and different sections (e.g., rows or columns) of the table are stored in different nodes, then the compute node can partition the table-updating task into a number of sub-tasks, with each sub-task updating only a section of the table that is stored on a single storage node. This way, all data required for executing the sub-task is located on a single storage node, making it possible for the storage node to execute the sub-task.

The compute node can then send the sub-tasks to corresponding storage nodes based on the previously obtained path information (operation 606). More specifically, if a particular sub-task requires a portion of data, which is stored on a particular storage node according to the path information, the sub-task can be sent to that particular storage node. In the event of multiple replicas of the data existing on multiple storage nodes, the compute node can randomly select a replica to send the sub-task to, instead of sending the sub-task to all replicas. In some embodiments, the compute node sends detailed computation instructions associated with a sub-task to its corresponding storage node, thus enabling the storage node to execute the sub-task. The computation instructions can specify what type of operation is to be performed on which data. For example, a table-update instruction may specify that all numbers in the top five rows of a table should be increased by 20% or that the first two columns should be merged. The storage node then loads data from its local drives, which can be SSDs or HDDs, executes the sub-task based on the received computation instruction, and sends the result of the sub-task to the compute node.
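A computation instruction of the kind described, "increase all numbers in the top five rows by 20%", might be represented and executed on a storage node roughly as below. The dictionary layout of the instruction, the operation name `scale_rows`, and the use of integer values (e.g., prices in cents, with integer division) are all assumptions for illustration; the disclosure does not define an instruction format.

```python
def execute_instruction(table, instruction):
    """Apply a 'scale_rows' instruction to `table` in place.

    instruction = {"op": "scale_rows", "rows": (first, last), "percent": p}
    scales every value in rows [first, last) by (100 + p) percent using
    integer arithmetic. Returns the number of rows changed.
    """
    if instruction["op"] != "scale_rows":
        raise ValueError(f"unsupported operation: {instruction['op']}")
    first, last = instruction["rows"]
    percent = instruction["percent"]
    changed = 0
    for i in range(first, min(last, len(table))):
        table[i] = [value * (100 + percent) // 100 for value in table[i]]
        changed += 1
    return changed


# Hypothetical locally stored table with six rows of integer values.
table = [[100, 200] for _ in range(6)]
instruction = {"op": "scale_rows", "rows": (0, 5), "percent": 20}
rows_changed = execute_instruction(table, instruction)
```

Only the small instruction travels over the network; the table rows never leave the storage node, and the node returns a compact result such as the number of rows changed.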

The compute node receives results from the storage nodes executing the sub-tasks and combines the sub-task results to generate a final result of the computation task (operation 608). Depending on the need, the compute node may return the final result to the user or application requesting the computation (operation 610). In some embodiments, if the user or application requested multiple computation tasks to be performed, the compute node can wait until all computation tasks have been performed before returning the computation results.
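Result gathering in operation 608 can be sketched as a check that every sub-task has reported, followed by a combination step. Summing per-sub-task row counts is only one possible way to combine results and is assumed here for illustration, as are the function name `combine_results` and the sub-task names.

```python
def combine_results(expected_sub_tasks, results):
    """Combine per-sub-task results into one final result.

    expected_sub_tasks: names of all sub-tasks that were sent out
    results:            {sub_task: number of rows that sub-task updated}
    Raises RuntimeError if any sub-task has not reported yet.
    """
    missing = set(expected_sub_tasks) - set(results)
    if missing:
        raise RuntimeError(f"missing results for: {sorted(missing)}")
    return sum(results[name] for name in expected_sub_tasks)


# Hypothetical results returned by three storage nodes.
results = {"section_a": 100, "section_b": 100, "section_c": 100}
total_rows_updated = combine_results(["section_a", "section_b", "section_c"], results)
```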

FIG. 7 presents a flowchart illustrating an exemplary process for data synchronization, according to one embodiment. Note that certain computations generate a result without affecting the stored data, whereas certain computations (e.g., the table-update computation) will update the data stored in the storage nodes. Once the data has been updated, other replicas of the data need to synchronize to the updated data to maintain data consistency. During operation, a storage node can execute a computation task or sub-task offloaded from a compute node by performing a computation based on received computation instructions (operation 702). Subsequently, the storage node can determine whether its locally stored data has been updated by the computation (operation 704). If not, there is no need for data synchronization; the process ends.

If the data has been updated, the storage node can notify the master node that its local data has been updated (operation 706) and query the master node for path information associated with other replicas of the data (operation 708). In other words, the storage node with the updated data needs to find the locations of other un-updated copies of the data. Based on the received path information, the storage node can use its own updated local copy to synchronize other replicas of the data (operation 710). In some embodiments, the storage node with the updated data can send the updated data to other storage nodes storing replicas of the original data such that those other storage nodes can update their local data copy accordingly.
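Operations 704 through 710 can be sketched with an in-memory master and plain dictionaries standing in for the storage nodes. The class name `Master`, its methods, and the `synchronize` helper are hypothetical names introduced for this sketch; real nodes would exchange these messages over the network.

```python
class Master:
    """Minimal stand-in for the master node's replica bookkeeping."""

    def __init__(self, replica_map):
        self.replica_map = replica_map  # {data_id: [names of nodes holding a replica]}
        self.update_log = []            # received data-update notifications

    def notify_update(self, data_id, node):
        self.update_log.append((data_id, node))

    def other_replicas(self, data_id, node):
        return [n for n in self.replica_map[data_id] if n != node]


def synchronize(master, stores, node, data_id, changed):
    """Run the storage-node side of the flow; return the peers that were synchronized."""
    if not changed:                                   # operation 704: nothing was updated
        return []
    master.notify_update(data_id, node)               # operation 706
    peers = master.other_replicas(data_id, node)      # operation 708
    for peer in peers:                                # operation 710
        stores[peer][data_id] = stores[node][data_id]
    return peers


master = Master({"table": ["s1", "s2", "s3"]})
stores = {"s1": {"table": "v2"}, "s2": {"table": "v1"}, "s3": {"table": "v1"}}
synced = synchronize(master, stores, "s1", "table", changed=True)
```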

In certain situations, a table and its replicas may be stored on different storage nodes, and each storage node may update a section of the table by performing a sub-task of a table-update operation. To ensure consistency of the table among all the copies, each storage node can use its updated table section to synchronize the corresponding sections of the other copies, using a process similar to the one shown in FIG. 7. After all storage nodes having updated table sections have synchronized other copies using their updated sections, the entire table is updated and synchronized among all copies.

FIG. 8A shows the exemplary structure of a compute node, according to one embodiment. Compute node 800 can include an application interface 802, a path-querying module 804, a computation-task-partitioning module 806, a sub-task distribution module 808, and a result-gathering module 810.

Application interface 802 can be responsible for interfacing with user applications. More specifically, compute node 800 can receive, via application interface 802, user data and a computation request. Moreover, compute node 800 can return the computation result to the user applications via application interface 802. Path-querying module 804 can be responsible for querying a master node for path information associated with data needed for performing a computation. In some embodiments, path-querying module 804 can send the query to the master node via an interface that allows the communication among the nodes within the distributed system. The query can include information used for identifying the data, such as a file name.

Computation-task-partitioning module 806 can be responsible for partitioning a requested computation task into one or more sub-tasks based on the path information associated with the data. More specifically, the partitioning is done in such a way that any sub-task only requires data stored on one storage node, thus ensuring that no data migration is needed to execute the sub-task. Sub-task distribution module 808 can be responsible for distributing the partitioned sub-tasks to corresponding storage nodes based on the path information associated with the data. Result-gathering module 810 can be responsible for gathering and combining results of all sub-tasks in order to generate the final result for the computation task.
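As a hedged illustration of the partition, distribute, and gather flow handled by modules 806, 808, and 810, the following sketch groups data blocks by the storage node that holds them, so that each sub-task touches exactly one node. The function names, the shape of `path_info` (a mapping from block identifier to storage node identifier), and the `execute_on_node` callback standing in for a remote call are assumptions made for this example.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


def partition_by_node(path_info: Dict[str, str]) -> Dict[str, List[str]]:
    """Group block IDs by the storage node that holds them (one sub-task per node)."""
    sub_tasks: Dict[str, List[str]] = defaultdict(list)
    for block_id, node_id in path_info.items():
        sub_tasks[node_id].append(block_id)
    return dict(sub_tasks)


def run_offloaded(path_info: Dict[str, str],
                  execute_on_node: Callable[[str, List[str]], Any],
                  combine: Callable[[List[Any]], Any]) -> Any:
    """Partition the task, send each sub-task to its node, and combine the results."""
    sub_tasks = partition_by_node(path_info)
    # execute_on_node stands in for sending a sub-task to a storage node
    # and receiving its partial result.
    partial_results = [execute_on_node(node_id, blocks)
                       for node_id, blocks in sorted(sub_tasks.items())]
    return combine(partial_results)
```

For example, a sum over blocks spread across two nodes could be run by passing a callback that sums the locally stored values on each node and passing the built-in `sum` as `combine`; only the per-node partial sums, not the blocks themselves, would then travel back to the compute node.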

FIG. 8B shows the exemplary structure of a storage node, according to one embodiment. Storage node 820 can include a task-receiving module 822, a computation module 824, a data-update-notification module 826, a path-querying module 828, and a data-synchronization module 830.

Task-receiving module 822 can be responsible for receiving computation tasks or sub-tasks from a compute node. More specifically, task-receiving module 822 can receive computation instructions for performing the received task or sub-task. Computation module 824 can be responsible for performing the computation based on the received computation instructions. To perform the computation, computation module 824 can load the required data from a local drive (e.g., an SSD or HDD). Data-update-notification module 826 can be responsible for sending data-update notifications to the master node, in response to the data stored in the local drive being updated by the computation. Path-querying module 828 can be responsible for querying the master node for path information associated with the replicas of the data. Data-synchronization module 830 can be responsible for synchronizing, using the updated local data, data replicas stored in other remote storage nodes.
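The receive-and-compute portion of storage node 820 might be sketched as below. This is only an illustrative sketch: the instruction format (a dictionary with `op`, `keys`, and an optional `factor`), the two example operations, and the use of a dictionary as the local drive are hypothetical and are not specified by the disclosure.

```python
from typing import Dict, List, Tuple


def handle_sub_task(local_drive: Dict[str, float],
                    instruction: dict) -> Tuple[float, List[str]]:
    """Run one sub-task on local data only; return (result, keys that were updated)."""
    keys = instruction["keys"]
    missing = [k for k in keys if k not in local_drive]
    if missing:
        # A well-partitioned sub-task never needs data from another node.
        raise KeyError(f"sub-task needs non-local data: {missing}")
    if instruction["op"] == "sum":
        # Read-only computation: nothing to synchronize afterwards.
        return sum(local_drive[k] for k in keys), []
    if instruction["op"] == "scale":
        # Updating computation: report which keys changed so that the
        # update-notification and synchronization steps can follow.
        for k in keys:
            local_drive[k] *= instruction["factor"]
        return float(len(keys)), list(keys)
    raise ValueError(f"unsupported operation: {instruction['op']}")
```

A non-empty list of updated keys is the signal that would trigger the notification, path query, and replica synchronization performed by modules 826, 828, and 830.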

In general, embodiments of the present disclosure provide a solution for reducing the data transfer amount by offloading data-intensive computation to storage nodes. In addition to reducing the amount of data being transferred among the different nodes, which reduces latency and bandwidth consumption, this approach can also reduce the CPU consumption used for data transfer. Moreover, by offloading the data-intensive processing to the storage nodes, the requirement on the cache hit rate of the compute node can be relaxed, hence making it possible for the compute node to have a smaller cache. This disclosure presents the solution for enhancing system performance by placing data based on computation context in the distributed system and offloading computation tasks onto multiple storage nodes based on the data locality. This significantly reduces the data amount loaded from storage clusters to the compute node, so that the novel system is able to reduce the latency, reduce the network bandwidth consumption, reduce the total capacity of the compute cache SSD, and reduce the overall CPU consumption.

In some embodiments, not all computation tasks are offloaded to the storage nodes. Computation-intensive tasks, which require powerful CPUs, can still be processed by the compute node. In the previously discussed examples, the nodes in the distributed system have been characterized as compute nodes or storage nodes based on their configuration, where compute nodes are configured to have good computation capability and the storage nodes are configured to have large storage capacity. It is also possible for the distributed system to have nodes that are not clearly characterized as compute or storage nodes. These nodes may have relatively powerful CPUs (which may not be as powerful as those on a compute node) and relatively large storage capacity (which may not be as large as that of a storage node). These nodes can serve as compute nodes or storage nodes, depending on the system needs. The general principle of the embodiments can also be applied to these nodes, meaning that they can store data and can launch, or receive from other nodes, computation tasks that involve only local data. Depending on the data location, a node can perform the computation task or offload the computation task to one or more other nodes.
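The location-based decision described for such hybrid nodes can be illustrated with a short sketch. The function name `plan_execution` and the `path_info` mapping from block identifier to node identifier are assumptions made for this example only.

```python
from typing import Dict, List, Tuple


def plan_execution(self_id: str,
                   path_info: Dict[str, str]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split the needed blocks into those to process locally and those to offload."""
    local_blocks: List[str] = []
    offload: Dict[str, List[str]] = {}
    for block_id, node_id in path_info.items():
        if node_id == self_id:
            # The data is already on this node: compute here.
            local_blocks.append(block_id)
        else:
            # The data lives elsewhere: offload to the node that holds it.
            offload.setdefault(node_id, []).append(block_id)
    return local_blocks, offload
```

Under this sketch a node computes on whatever it already stores and hands the remainder to the nodes holding that data, so that no block has to be moved before computation.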

In the disclosed examples, the distributed system can have a master-slave type of architecture, where one or more master nodes maintain the storage logic as well as compute context of the data. It is also possible for a distributed system to have a peer-to-peer type of architecture, and the storage logic and compute context of data can be maintained in a distributed manner among all nodes in the system. In such a scenario, subsequent to receiving a computation request, a compute node may broadcast a query to the whole system to obtain path information associated with the data.
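A peer-to-peer path query of the kind mentioned above could, under simplifying assumptions, look like the following sketch. Here the broadcast is simulated in-process by iterating over a dictionary that maps each hypothetical peer identifier to the set of file names that peer stores; the function name and this data layout are illustrative only.

```python
from typing import Dict, List, Set


def broadcast_path_query(peers: Dict[str, Set[str]], file_name: str) -> List[str]:
    """Ask every peer whether it stores file_name; return the IDs of those that do."""
    # Each loop iteration stands in for one peer answering the broadcast query.
    return sorted(peer_id for peer_id, stored in peers.items()
                  if file_name in stored)
```

The returned list plays the role of the path information that a master node would otherwise supply.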

FIG. 9 conceptually illustrates an electronic system, which implements some embodiments of the subject technology. Electronic system 900 can be a client, a server, a computer, a smartphone, a PDA, a laptop, or a tablet computer with one or more processors embedded therein or coupled thereto, or any other sort of electronic device. Such an electronic system includes various types of computer-readable media and interfaces for various other types of computer-readable media. Electronic system 900 includes a bus 908, processing unit(s) 912, a system memory 904, a read-only memory (ROM) 910, a permanent storage device 902, an input device interface 914, an output device interface 906, and a network interface 916.

Bus 908 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of electronic system 900. For instance, bus 908 communicatively connects processing unit(s) 912 with ROM 910, system memory 904, and permanent storage device 902.

From these various memory units, processing unit(s) 912 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure. The processing unit(s) can be a single processor or a multi-core processor in different implementations.

ROM 910 stores static data and instructions that are needed by processing unit(s) 912 and other modules of the electronic system. Permanent storage device 902, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when electronic system 900 is off. Some implementations of the subject disclosure use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as permanent storage device 902.

Other implementations use a removable storage device (such as a floppy disk, flash drive, and various types of disk drive) as permanent storage device 902. Like permanent storage device 902, system memory 904 is a read-and-write memory device. However, unlike storage device 902, system memory 904 is a volatile read-and-write memory, such as a random access memory. System memory 904 stores some of the instructions and data that the processor needs at runtime. In some implementations, the processes of the subject disclosure are stored in system memory 904, permanent storage device 902, and/or ROM 910. From these various memory units, processing unit(s) 912 retrieves instructions to execute and data to process in order to execute the processes of some implementations.

Bus 908 also connects to input and output device interfaces 914 and 906. Input device interface 914 enables the user to communicate information and send commands to the electronic system. Input devices used with input device interface 914 include, for example, alphanumeric keyboards and pointing devices (also called “cursor control devices”). Output device interface 906 enables, for example, the display of images generated by the electronic system 900. Output devices used with output device interface 906 include, for example, printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD). Some implementations include devices such as a touchscreen that function as both input and output devices.

Finally, as shown in FIG. 9, bus 908 also couples electronic system 900 to a network (not shown) through a network interface 916. In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an intranet) or a network of networks, such as the Internet. Any or all components of electronic system 900 can be used in conjunction with the subject disclosure.

The functions described above can be implemented in digital electronic circuitry, in computer software, firmware, or hardware. The techniques can be implemented using one or more computer program products. Programmable processors and computers can be included in or packaged as mobile devices. The processes and logic flows can be performed by one or more programmable processors and by one or more programmable logic circuits. General and special purpose computing devices and storage devices can be interconnected through communication networks.

The foregoing descriptions of various embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.

Claims

1. A distributed computing system, the system comprising: a compute cluster comprising one or more compute nodes; a storage cluster comprising a plurality of storage nodes; and a master node for distributing data among the storage nodes; wherein a respective storage node comprises: a processor; a memory; a receiving module configured to receive, from a compute node, computation instructions associated with a computation task, wherein the computation task is divided into a number of sub-tasks in such a way that each sub-task only requires data stored on a single storage node; a computation module configured to execute a corresponding sub-task using data stored in the storage node without requesting additional data from a different storage node, and send a computation result to the compute node; a path-querying module, wherein in response to determining that executing the sub-task updates data locally stored on the storage node, the path-querying module is configured to query the master node to identify other storage nodes in the storage cluster that store replicas of the locally stored data; and an update module configured to send the updated data to the identified other storage nodes.
2. The distributed computing system of claim 1, wherein the master node comprises: a receiving module configured to receive, from the compute node, compute context associated with to-be-written data; a data-path generation module configured to generate data-placement paths based on the compute context; and a transmitting module configured to provide the data-placement paths to the compute node to allow the compute node to write the to-be-written data to one or more storage nodes based on the data-placement paths.
3. The distributed computing system of claim 2, wherein the compute node comprises: a computation-task-partitioning module configured to partition the computation task into a number of sub-tasks based on the data-placement paths; and a distribution module configured to send each sub-task to a corresponding single storage node.
4. The distributed computing system of claim 3, wherein the compute node further comprises a result-gathering module configured to receive computation results from multiple storage nodes executing the sub-tasks to generate a combined result.
5. The distributed computing system of claim 2, wherein the master node is configured to store the compute context as part of metadata of the to-be-written data.
6. The distributed computing system of claim 5, wherein the update module of the storage node is further configured to: in response to determining that executing the sub-task updates the locally stored data, send a data-update notification to the master node to allow the master node to update the corresponding metadata.
7. The distributed computing system of claim 1, wherein the compute node comprises a first processor having a first processing speed, and wherein the storage node comprises a second processor having a second processing speed that is slower than the first processing speed.
8. A computer-implemented method for offloading computation tasks from a compute cluster comprising one or more compute nodes to a storage cluster comprising a plurality of storage nodes in a distributed computing system, the method comprising: receiving, by a storage node from a compute node, computation instructions associated with a computation task, wherein the computation task is divided into a number of sub-tasks in such a way that each sub-task only requires data stored on a single storage node; executing, by the storage node, a corresponding sub-task using data stored in the storage node without requesting additional data from a different storage node; sending a computation result to the compute node; in response to determining that executing the sub-task updates data locally stored on the storage node, querying a master node in the distributed computing system to identify other storage nodes in the storage cluster that store replicas of the data; and sending the updated data to the identified other storage nodes.
9. The computer-implemented method of claim 8, further comprising: receiving, by the master node from the compute node, compute context associated with to-be-written data; generating data-placement paths based on the compute context; and providing the data-placement paths to the compute node to allow the compute node to write the to-be-written data to one or more storage nodes based on the data-placement paths.
10. The computer-implemented method of claim 9, further comprising: partitioning, by the compute node, the computation task into a number of sub-tasks based on the data-placement paths; and sending each sub-task to a corresponding single storage node.
11. The computer-implemented method of claim 10, further comprising receiving, by the compute node, computation results from multiple storage nodes executing the sub-tasks to generate a combined result.
12. The computer-implemented method of claim 9, further comprising: storing, by the master node, the compute context as part of metadata of the to-be-written data.
13. The computer-implemented method of claim 12, further comprising: in response to determining that executing the sub-task updates the locally stored data, sending, by the storage node, a data-update notification to the master node to allow the master node to update the corresponding metadata.
14. The computer-implemented method of claim 8, wherein the compute node comprises a first processor having a first processing speed, and wherein the storage node comprises a second processor having a second processing speed that is lower than the first processing speed.
15. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for offloading computation tasks from a compute cluster comprising one or more compute nodes to a storage cluster comprising a plurality of storage nodes in a distributed computing system, the method comprising: receiving, by a storage node from a compute node, computation instructions associated with a computation task, wherein the computation task is divided into a number of sub-tasks in such a way that each sub-task only requires data stored on a single storage node; executing, by the storage node, a corresponding sub-task using data stored in the storage node without requesting additional data from a different storage node; sending a computation result to the compute node; in response to determining that executing the sub-task updates data locally stored on the storage node, querying a master node in the distributed computing system to identify other storage nodes in the storage cluster that store replicas of the data; and sending the updated data to the identified other storage nodes.
16. The non-transitory computer-readable storage medium of claim 15, wherein the method further comprises: receiving, by the master node from the compute node, compute context associated with to-be-written data; generating data-placement paths based on the compute context; and providing the data-placement paths to the compute node to allow the compute node to write the to-be-written data to one or more storage nodes.
17. The non-transitory computer-readable storage medium of claim 16, wherein the method further comprises: partitioning, by the compute node, the computation task into a number of sub-tasks based on the data-placement paths; and sending each sub-task to a corresponding single storage node.
18. The non-transitory computer-readable storage medium of claim 17, wherein the method further comprises receiving, by the compute node, computation results from multiple storage nodes executing the sub-tasks to generate a combined result.
19. The non-transitory computer-readable storage medium of claim 16, wherein the method further comprises: storing, by the master node, the compute context as part of metadata of the to-be-written data.
20. The non-transitory computer-readable storage medium of claim 19, wherein the method further comprises: in response to determining that executing the sub-task updates the locally stored data, sending, by the storage node, a data-update notification to the master node to allow the master node to update the corresponding metadata.

RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 16/238,359, entitled “SYSTEM AND METHOD FOR OFFLOADING COMPUTATION TO STORAGE NODES IN DISTRIBUTED SYSTEM,” by inventor Shu Li, filed 2 Jan. 2019, the disclosure of which is incorporated herein by reference for all purposes.

US Referenced Citations (507)

Number	Name	Date	Kind
3893071	Bossen	Jul 1975	A
4562494	Bond	Dec 1985	A
4718067	Peters	Jan 1988	A
4775932	Oxley	Oct 1988	A
4858040	Hazebrouck	Aug 1989	A
5394382	Hu	Feb 1995	A
5602693	Brunnett	Feb 1997	A
5715471	Otsuka	Feb 1998	A
5732093	Huang	Mar 1998	A
5802551	Komatsu	Sep 1998	A
5930167	Lee	Jul 1999	A
6098185	Wilson	Aug 2000	A
6148377	Carter	Nov 2000	A
6226650	Mahajan et al.	May 2001	B1
6243795	Yang	Jun 2001	B1
6457104	Tremaine	Sep 2002	B1
6658478	Singhal	Dec 2003	B1
6795894	Neufeld	Sep 2004	B1
7351072	Muff	Apr 2008	B2
7565454	Zuberi	Jul 2009	B2
7599139	Bombet	Oct 2009	B1
7953899	Hooper	May 2011	B1
7958433	Yoon	Jun 2011	B1
8024719	Gorton, Jr.	Sep 2011	B2
8085569	Kim	Dec 2011	B2
8144512	Huang	Mar 2012	B2
8166233	Schibilla	Apr 2012	B2
8260924	Koretz	Sep 2012	B2
8281061	Radke	Oct 2012	B2
8452819	Sorenson, III	May 2013	B1
8516284	Chan	Aug 2013	B2
8527544	Colgrove	Sep 2013	B1
8751763	Ramarao	Jun 2014	B1
8819367	Fallone	Aug 2014	B1
8825937	Atkisson	Sep 2014	B2
8832688	Tang	Sep 2014	B2
8868825	Hayes	Oct 2014	B1
8904061	O'Brien, III	Dec 2014	B1
8949208	Xu	Feb 2015	B1
9015561	Hu	Apr 2015	B1
9031296	Kaempfer	May 2015	B2
9043545	Kimmel	May 2015	B2
9088300	Chen	Jul 2015	B1
9092223	Pani	Jul 2015	B1
9129628	Fallone	Sep 2015	B1
9141176	Chen	Sep 2015	B1
9208817	Li	Dec 2015	B1
9213627	Van Acht	Dec 2015	B2
9213632	Song	Dec 2015	B1
9251058	Nellans	Feb 2016	B2
9258014	Anderson	Feb 2016	B2
9280472	Dang	Mar 2016	B1
9280487	Candelaria	Mar 2016	B2
9311939	Malina	Apr 2016	B1
9336340	Dong	May 2016	B1
9436595	Benitez	Sep 2016	B1
9495263	Pang	Nov 2016	B2
9529601	Dharmadhikari	Dec 2016	B1
9529670	O'Connor	Dec 2016	B2
9569454	Ebsen	Feb 2017	B2
9575982	Sankara Subramanian	Feb 2017	B1
9588698	Karamcheti	Mar 2017	B1
9588977	Wang	Mar 2017	B1
9607631	Rausch	Mar 2017	B2
9671971	Trika	Jun 2017	B2
9722632	Anderson	Aug 2017	B2
9747202	Shaharabany	Aug 2017	B1
9830084	Thakkar	Nov 2017	B2
9836232	Vasquez	Dec 2017	B1
9852076	Garg	Dec 2017	B1
9875053	Frid	Jan 2018	B2
9910705	Mak	Mar 2018	B1
9912530	Singatwaria	Mar 2018	B2
9923562	Vinson	Mar 2018	B1
9933973	Luby	Apr 2018	B2
9946596	Hashimoto	Apr 2018	B2
10013169	Fisher	Jul 2018	B2
10199066	Feldman	Feb 2019	B1
10229735	Natarajan	Mar 2019	B1
10235198	Qiu	Mar 2019	B2
10268390	Warfield	Apr 2019	B2
10318467	Barzik	Jun 2019	B2
10361722	Lee	Jul 2019	B2
10417086	Lin	Sep 2019	B2
10437670	Koltsidas	Oct 2019	B1
10459663	Agombar	Oct 2019	B2
10459794	Baek	Oct 2019	B2
10466907	Gole	Nov 2019	B2
10484019	Weinberg	Nov 2019	B2
10530391	Galbraith	Jan 2020	B2
10635529	Bolkhovitin	Apr 2020	B2
10642522	Li	May 2020	B2
10649657	Zaidman	May 2020	B2
10649969	De	May 2020	B2
10678432	Dreier	Jun 2020	B1
10756816	Dreier	Aug 2020	B1
10831734	Li	Nov 2020	B2
10928847	Suresh	Feb 2021	B2
10990526	Lam	Apr 2021	B1
11016932	Qiu	May 2021	B2
11023150	Pletka	Jun 2021	B2
11068165	Sharon	Jul 2021	B2
11068409	Li	Jul 2021	B2
11126561	Li	Sep 2021	B2
11138124	Tomic	Oct 2021	B2
11243694	Liang	Feb 2022	B2
11360863	Varadan	Jun 2022	B2
20010003205	Gilbert	Jun 2001	A1
20010032324	Slaughter	Oct 2001	A1
20010046295	Sako	Nov 2001	A1
20020010783	Primak	Jan 2002	A1
20020039260	Kilmer	Apr 2002	A1
20020073358	Atkinson	Jun 2002	A1
20020095403	Chandrasekaran	Jul 2002	A1
20020112085	Berg	Aug 2002	A1
20020161890	Chen	Oct 2002	A1
20030074319	Jaquette	Apr 2003	A1
20030145274	Hwang	Jul 2003	A1
20030163594	Aasheim	Aug 2003	A1
20030163633	Aasheim	Aug 2003	A1
20030217080	White	Nov 2003	A1
20040010545	Pandya	Jan 2004	A1
20040066741	Dinker	Apr 2004	A1
20040103238	Avraham	May 2004	A1
20040143718	Chen	Jul 2004	A1
20040255171	Zimmer	Dec 2004	A1
20040267752	Wong	Dec 2004	A1
20040268278	Hoberman	Dec 2004	A1
20050033828	Watanabe	Feb 2005	A1
20050038954	Saliba	Feb 2005	A1
20050097126	Cabrera	May 2005	A1
20050138325	Hofstee	Jun 2005	A1
20050144358	Conley	Jun 2005	A1
20050149827	Lambert	Jul 2005	A1
20050174670	Dunn	Aug 2005	A1
20050177672	Rao	Aug 2005	A1
20050177755	Fung	Aug 2005	A1
20050195635	Conley	Sep 2005	A1
20050235067	Creta	Oct 2005	A1
20050235171	Igari	Oct 2005	A1
20050240681	Fujiwara	Oct 2005	A1
20060031709	Hiraiwa	Feb 2006	A1
20060101197	Georgis	May 2006	A1
20060156009	Shin	Jul 2006	A1
20060156012	Beeson	Jul 2006	A1
20060184813	Bui	Aug 2006	A1
20070033323	Gorobets	Feb 2007	A1
20070061502	Lasser	Mar 2007	A1
20070061542	Uppala	Mar 2007	A1
20070101096	Gorobets	May 2007	A1
20070168581	Klein	Jul 2007	A1
20070204128	Lee	Aug 2007	A1
20070250756	Gower	Oct 2007	A1
20070266011	Rohrs	Nov 2007	A1
20070283081	Lasser	Dec 2007	A1
20070283104	Wellwood	Dec 2007	A1
20070285980	Shimizu	Dec 2007	A1
20080028223	Rhoads	Jan 2008	A1
20080034154	Lee	Feb 2008	A1
20080065805	Wu	Mar 2008	A1
20080082731	Karamcheti	Apr 2008	A1
20080104369	Reed	May 2008	A1
20080112238	Kim	May 2008	A1
20080163033	Yim	Jul 2008	A1
20080195829	Wilsey	Aug 2008	A1
20080301532	Uchikawa	Dec 2008	A1
20090006667	Lin	Jan 2009	A1
20090089544	Liu	Apr 2009	A1
20090110078	Crinon	Apr 2009	A1
20090113219	Aharonov	Apr 2009	A1
20090125788	Wheeler	May 2009	A1
20090177944	Kanno	Jul 2009	A1
20090183052	Kanno	Jul 2009	A1
20090254705	Abali	Oct 2009	A1
20090282275	Yermalayeu	Nov 2009	A1
20090287956	Flynn	Nov 2009	A1
20090307249	Koifman	Dec 2009	A1
20090307426	Galloway	Dec 2009	A1
20090310412	Jang	Dec 2009	A1
20100031000	Flynn	Feb 2010	A1
20100169470	Takashige	Jul 2010	A1
20100217952	Iyer	Aug 2010	A1
20100229224	Etchegoyen	Sep 2010	A1
20100241848	Smith	Sep 2010	A1
20100281254	Carro	Nov 2010	A1
20100321999	Yoo	Dec 2010	A1
20100325367	Kornegay	Dec 2010	A1
20100332922	Chang	Dec 2010	A1
20110031546	Uenaka	Feb 2011	A1
20110055458	Kuehne	Mar 2011	A1
20110055471	Thatcher	Mar 2011	A1
20110060722	Li	Mar 2011	A1
20110072204	Chang	Mar 2011	A1
20110099418	Chen	Apr 2011	A1
20110153903	Hinkle	Jun 2011	A1
20110161621	Sinclair	Jun 2011	A1
20110161784	Selinger	Jun 2011	A1
20110191525	Hsu	Aug 2011	A1
20110218969	Anglin	Sep 2011	A1
20110231598	Hatsuda	Sep 2011	A1
20110239083	Kanno	Sep 2011	A1
20110252188	Weingarten	Oct 2011	A1
20110258514	Lasser	Oct 2011	A1
20110289263	McWilliams	Nov 2011	A1
20110289280	Koseki	Nov 2011	A1
20110292538	Haga	Dec 2011	A1
20110296411	Tang	Dec 2011	A1
20110299317	Shaeffer	Dec 2011	A1
20110302353	Confalonieri	Dec 2011	A1
20110302408	McDermott	Dec 2011	A1
20120017037	Riddle	Jan 2012	A1
20120039117	Webb	Feb 2012	A1
20120047338	Akirav	Feb 2012	A1
20120084523	Littlefield	Apr 2012	A1
20120089774	Kelkar	Apr 2012	A1
20120096330	Przybylski	Apr 2012	A1
20120117399	Chan	May 2012	A1
20120147021	Cheng	Jun 2012	A1
20120151253	Horn	Jun 2012	A1
20120159099	Lindamood	Jun 2012	A1
20120159289	Piccirillo	Jun 2012	A1
20120173792	Lassa	Jul 2012	A1
20120203958	Jones	Aug 2012	A1
20120210095	Nellans	Aug 2012	A1
20120233523	Krishnamoorthy	Sep 2012	A1
20120246392	Cheon	Sep 2012	A1
20120278579	Goss	Nov 2012	A1
20120284587	Yu	Nov 2012	A1
20120324312	Moyer	Dec 2012	A1
20120331207	Lassa	Dec 2012	A1
20130013880	Tashiro	Jan 2013	A1
20130013887	Sugahara	Jan 2013	A1
20130016970	Koka	Jan 2013	A1
20130018852	Barton	Jan 2013	A1
20130024605	Sharon	Jan 2013	A1
20130054822	Mordani	Feb 2013	A1
20130061029	Huff	Mar 2013	A1
20130073798	Kang	Mar 2013	A1
20130080391	Raichstein	Mar 2013	A1
20130138871	Chiu	May 2013	A1
20130144836	Adzic	Jun 2013	A1
20130145085	Yu	Jun 2013	A1
20130145089	Eleftheriou	Jun 2013	A1
20130151759	Shim	Jun 2013	A1
20130159251	Skrenta	Jun 2013	A1
20130159723	Brandt	Jun 2013	A1
20130166820	Batwara	Jun 2013	A1
20130173845	Aslam	Jul 2013	A1
20130179898	Fang	Jul 2013	A1
20130191601	Peterson	Jul 2013	A1
20130205183	Fillingim	Aug 2013	A1
20130219131	Alexandron	Aug 2013	A1
20130227347	Cho	Aug 2013	A1
20130238955	D Abreu	Sep 2013	A1
20130254622	Kanno	Sep 2013	A1
20130318283	Small	Nov 2013	A1
20130318395	Kalavade	Nov 2013	A1
20130325419	Al-Shaikh	Dec 2013	A1
20130329492	Yang	Dec 2013	A1
20130346532	D'Amato	Dec 2013	A1
20140006688	Yu	Jan 2014	A1
20140019650	Li	Jan 2014	A1
20140019661	Hormuth	Jan 2014	A1
20140025638	Hu	Jan 2014	A1
20140082273	Segev	Mar 2014	A1
20140082412	Matsumura	Mar 2014	A1
20140095758	Smith	Apr 2014	A1
20140095769	Borkenhagen	Apr 2014	A1
20140095827	Wei	Apr 2014	A1
20140108414	Stillerman	Apr 2014	A1
20140108891	Strasser	Apr 2014	A1
20140164447	Tarafdar	Jun 2014	A1
20140164879	Tam	Jun 2014	A1
20140181532	Camp	Jun 2014	A1
20140195564	Talagala	Jul 2014	A1
20140215129	Kuzmin	Jul 2014	A1
20140223079	Zhang	Aug 2014	A1
20140233950	Luo	Aug 2014	A1
20140250259	Ke	Sep 2014	A1
20140279927	Constantinescu	Sep 2014	A1
20140304452	De La Iglesia	Oct 2014	A1
20140310574	Yu	Oct 2014	A1
20140337457	Nowoczynski	Nov 2014	A1
20140359229	Cota-Robles	Dec 2014	A1
20140365707	Talagala	Dec 2014	A1
20140379965	Gole	Dec 2014	A1
20150006792	Lee	Jan 2015	A1
20150019798	Huang	Jan 2015	A1
20150039849	Lewis	Feb 2015	A1
20150067436	Hu	Mar 2015	A1
20150082317	You	Mar 2015	A1
20150106556	Yu	Apr 2015	A1
20150106559	Cho	Apr 2015	A1
20150121031	Feng	Apr 2015	A1
20150142752	Chennamsetty	May 2015	A1
20150143030	Gorobets	May 2015	A1
20150186657	Nakhjiri	Jul 2015	A1
20150199234	Choi	Jul 2015	A1
20150227316	Warfield	Aug 2015	A1
20150234845	Moore	Aug 2015	A1
20150269964	Fallone	Sep 2015	A1
20150277937	Swanson	Oct 2015	A1
20150286477	Mathur	Oct 2015	A1
20150294684	Qjang	Oct 2015	A1
20150301964	Brinicombe	Oct 2015	A1
20150304108	Obukhov	Oct 2015	A1
20150310916	Leem	Oct 2015	A1
20150317095	Voigt	Nov 2015	A1
20150341123	Nagarajan	Nov 2015	A1
20150347025	Law	Dec 2015	A1
20150363271	Haustein	Dec 2015	A1
20150363328	Candelaria	Dec 2015	A1
20150370700	Sabol	Dec 2015	A1
20150372597	Luo	Dec 2015	A1
20160014039	Reddy	Jan 2016	A1
20160026575	Samanta	Jan 2016	A1
20160041760	Kuang	Feb 2016	A1
20160048327	Jayasena	Feb 2016	A1
20160048341	Constantinescu	Feb 2016	A1
20160054922	Awasthi	Feb 2016	A1
20160062885	Ryu	Mar 2016	A1
20160077749	Ravimohan	Mar 2016	A1
20160077764	Ori	Mar 2016	A1
20160077968	Sela	Mar 2016	A1
20160078245	Amarendran	Mar 2016	A1
20160098344	Gorobets	Apr 2016	A1
20160098350	Tang	Apr 2016	A1
20160103631	Ke	Apr 2016	A1
20160110254	Cronie	Apr 2016	A1
20160124742	Rangasamy	May 2016	A1
20160132237	Jeong	May 2016	A1
20160141047	Sehgal	May 2016	A1
20160154601	Chen	Jun 2016	A1
20160155750	Yasuda	Jun 2016	A1
20160162187	Lee	Jun 2016	A1
20160179399	Melik-Martirosian	Jun 2016	A1
20160188223	Camp	Jun 2016	A1
20160188890	Naeimi	Jun 2016	A1
20160203000	Parmar	Jul 2016	A1
20160224267	Yang	Aug 2016	A1
20160232103	Schmisseur	Aug 2016	A1
20160234297	Ambach	Aug 2016	A1
20160239074	Lee	Aug 2016	A1
20160239380	Wideman	Aug 2016	A1
20160274636	Kim	Sep 2016	A1
20160283140	Kaushik	Sep 2016	A1
20160306699	Resch	Oct 2016	A1
20160306853	Sabaa	Oct 2016	A1
20160321002	Jung	Nov 2016	A1
20160335085	Scalabrino	Nov 2016	A1
20160342345	Kankani	Nov 2016	A1
20160343429	Nieuwejaar	Nov 2016	A1
20160350002	Vergis	Dec 2016	A1
20160350385	Poder	Dec 2016	A1
20160364146	Kuttner	Dec 2016	A1
20160381442	Heanue	Dec 2016	A1
20170004037	Park	Jan 2017	A1
20170010652	Huang	Jan 2017	A1
20170068639	Davis	Mar 2017	A1
20170075583	Alexander	Mar 2017	A1
20170075594	Badam	Mar 2017	A1
20170091110	Ash	Mar 2017	A1
20170109199	Chen	Apr 2017	A1
20170109232	Cha	Apr 2017	A1
20170123655	Sinclair	May 2017	A1
20170147499	Mohan	May 2017	A1
20170161202	Erez	Jun 2017	A1
20170162235	De	Jun 2017	A1
20170168986	Sajeepa	Jun 2017	A1
20170177217	Kanno	Jun 2017	A1
20170177259	Motwani	Jun 2017	A1
20170177483	Vinod	Jun 2017	A1
20170185316	Nieuwejaar	Jun 2017	A1
20170185498	Gao	Jun 2017	A1
20170192848	Pamies-Juarez	Jul 2017	A1
20170199823	Hayes	Jul 2017	A1
20170212680	Waghulde	Jul 2017	A1
20170212708	Suhas	Jul 2017	A1
20170220254	Warfield	Aug 2017	A1
20170221519	Matsuo	Aug 2017	A1
20170228157	Yang	Aug 2017	A1
20170242722	Qiu	Aug 2017	A1
20170249162	Tsirkin	Aug 2017	A1
20170262176	Kanno	Sep 2017	A1
20170262178	Hashimoto	Sep 2017	A1
20170262217	Pradhan	Sep 2017	A1
20170269998	Sunwoo	Sep 2017	A1
20170277655	Das	Sep 2017	A1
20170279460	Camp	Sep 2017	A1
20170285976	Durham	Oct 2017	A1
20170286311	Juenemann	Oct 2017	A1
20170322888	Booth	Nov 2017	A1
20170344470	Yang	Nov 2017	A1
20170344491	Pandurangan	Nov 2017	A1
20170353576	Guim Bernat	Dec 2017	A1
20180024772	Madraswala	Jan 2018	A1
20180024779	Kojima	Jan 2018	A1
20180033491	Marelli	Feb 2018	A1
20180052797	Barzik	Feb 2018	A1
20180067847	Oh	Mar 2018	A1
20180069658	Benisty	Mar 2018	A1
20180074730	Inoue	Mar 2018	A1
20180076828	Kanno	Mar 2018	A1
20180088867	Kaminaga	Mar 2018	A1
20180107591	Smith	Apr 2018	A1
20180113631	Zhang	Apr 2018	A1
20180143780	Cho	May 2018	A1
20180150640	Li	May 2018	A1
20180165038	Authement	Jun 2018	A1
20180165169	Camp	Jun 2018	A1
20180165340	Agarwal	Jun 2018	A1
20180167268	Liguori	Jun 2018	A1
20180173620	Cen	Jun 2018	A1
20180188970	Liu	Jul 2018	A1
20180189175	Ji	Jul 2018	A1
20180189182	Wang	Jul 2018	A1
20180212951	Goodrum	Jul 2018	A1
20180219561	Litsyn	Aug 2018	A1
20180226124	Perner	Aug 2018	A1
20180232151	Badam	Aug 2018	A1
20180260148	Klein	Sep 2018	A1
20180270110	Chugtu	Sep 2018	A1
20180293014	Ravimohan	Oct 2018	A1
20180300203	Kathpal	Oct 2018	A1
20180307620	Zhou	Oct 2018	A1
20180321864	Benisty	Nov 2018	A1
20180322024	Nagao	Nov 2018	A1
20180329776	Lai	Nov 2018	A1
20180336921	Ryun	Nov 2018	A1
20180349396	Blagojevic	Dec 2018	A1
20180356992	Lamberts	Dec 2018	A1
20180357126	Dhuse	Dec 2018	A1
20180373428	Kan	Dec 2018	A1
20180373655	Liu	Dec 2018	A1
20180373664	Vijayrao	Dec 2018	A1
20190004944	Widder	Jan 2019	A1
20190012111	Li	Jan 2019	A1
20190034454	Gangumalla	Jan 2019	A1
20190042571	Li	Feb 2019	A1
20190050312	Li	Feb 2019	A1
20190050327	Li	Feb 2019	A1
20190065085	Jean	Feb 2019	A1
20190073261	Halbert	Mar 2019	A1
20190073262	Chen	Mar 2019	A1
20190087089	Yoshida	Mar 2019	A1
20190087115	Li	Mar 2019	A1
20190087328	Kanno	Mar 2019	A1
20190108145	Raghava	Apr 2019	A1
20190116127	Pismenny	Apr 2019	A1
20190166725	Jing	May 2019	A1
20190171532	Abadi	Jun 2019	A1
20190172820	Meyers	Jun 2019	A1
20190196748	Badam	Jun 2019	A1
20190196907	Khan	Jun 2019	A1
20190205206	Hornung	Jul 2019	A1
20190212949	Pletka	Jul 2019	A1
20190220392	Lin	Jul 2019	A1
20190227927	Miao	Jul 2019	A1
20190272242	Kachare	Sep 2019	A1
20190278654	Kaynak	Sep 2019	A1
20190278849	Chandramouli	Sep 2019	A1
20190317901	Kachare	Oct 2019	A1
20190320020	Lee	Oct 2019	A1
20190339998	Momchilov	Nov 2019	A1
20190361611	Hosogi	Nov 2019	A1
20190377632	Oh	Dec 2019	A1
20190377821	Pleshachkov	Dec 2019	A1
20190391748	Li	Dec 2019	A1
20200004456	Williams	Jan 2020	A1
20200004674	Williams	Jan 2020	A1
20200013458	Schreck	Jan 2020	A1
20200042223	Li	Feb 2020	A1
20200042387	Shani	Feb 2020	A1
20200082006	Rupp	Mar 2020	A1
20200084918	Shen	Mar 2020	A1
20200089430	Kanno	Mar 2020	A1
20200092209	Chen	Mar 2020	A1
20200097189	Tao	Mar 2020	A1
20200133841	Davis	Apr 2020	A1
20200143885	Kim	May 2020	A1
20200159425	Flynn	May 2020	A1
20200167091	Haridas	May 2020	A1
20200210309	Jung	Jul 2020	A1
20200218449	Leitao	Jul 2020	A1
20200225875	Oh	Jul 2020	A1
20200242021	Gholamipour	Jul 2020	A1
20200250032	Goyal	Aug 2020	A1
20200257598	Yazovitsky	Aug 2020	A1
20200322287	Connor	Oct 2020	A1
20200326855	Wu	Oct 2020	A1
20200328192	Zaman	Oct 2020	A1
20200348888	Kim	Nov 2020	A1
20200364094	Kahle	Nov 2020	A1
20200371955	Goodacre	Nov 2020	A1
20200387327	Hsieh	Dec 2020	A1
20200401334	Saxena	Dec 2020	A1
20200409559	Sharon	Dec 2020	A1
20200409791	Devriendt	Dec 2020	A1
20210010338	Santos	Jan 2021	A1
20210075633	Sen	Mar 2021	A1
20210089392	Shirakawa	Mar 2021	A1
20210103388	Choi	Apr 2021	A1
20210124488	Stoica	Apr 2021	A1
20210132999	Haywood	May 2021	A1
20210191635	Hu	Jun 2021	A1
20210263795	Li	Aug 2021	A1
20210286555	Li	Sep 2021	A1

Foreign Referenced Citations (4)

Number	Date	Country
2003022209	Jan 2003	JP
2011175422	Sep 2011	JP
9418634	Aug 1994	WO
1994018634	Aug 1994	WO

Non-Patent Literature Citations (19)

Entry
https://web.archive.org/web/20071130235034/http://en.wikipedia.org:80/wiki/logical_block_addressing wikipedia screen shot retrieved on wayback Nov. 20, 2007 showing both physical and logical addressing used historically to access data on storage devices (Year: 2007).
Ivan Picoli, Carla Pasco, Bjorn Jonsson, Luc Bouganim, Philippe Bonnet. “uFLIP-OC: Understanding Flash I/O Patterns on Open-Channel Solid-State Drives.” APSys'17, Sep. 2017, Mumbai, India. pp. 1-7, 2017, <10.1145/3124680.3124741>. <hal-01654985>.
EMC Powerpath Load Balancing and Failover Comparison with native MPIO operating system solutions. Feb. 2011.
Tsuchiya, Yoshihiro et al. “DBLK: Deduplication for Primary Block Storage”, MSST 2011, Denver, CO, May 23-27, 2011 pp. 1-5.
Chen Feng, et al. “CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Devices”, FAST'11, San Jose, CA Feb. 15-17, 2011, pp. 1-14.
Wu, Huijun et al. “HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud”, Cornell Univ. arXiv: 1702.08153v2[cs.DC], Apr. 16, 2017, pp. 1-14, https://www.syncids.com/#.
WOW: Wise Ordering for Writes—Combining Spatial and Temporal Locality in Non-Volatile Caches by Gill (Year: 2005).
Helen H. W. Chan et al. “HashKV: Enabling Efficient Updates in KV Storage via Hashing”, https://www.usenix.org/conference/atc18/presentation/chan, (Year: 2018).
S. Hong and D. Shin, “NAND Flash-Based Disk Cache Using SLC/MLC Combined Flash Memory,” 2010 International Workshop on Storage Network Architecture and Parallel I/Os, Incline Village, NV, 2010, pp. 21-30.
Arpaci-Dusseau et al. “Operating Systems: Three Easy Pieces”, Originally published 2015; Pertinent: Chapter 44; flash-based SSDs, available at http://pages.cs.wisc.edu/~remzi/OSTEP/.
Jimenez, X., Novo, D. and P. Ienne, “Phoenix: Reviving MLC Blocks as SLC to Extend NAND Flash Devices Lifetime,” Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013.
Yang, T. Wu, H. and W. Sun, “GD-FTL: Improving the Performance and Lifetime of TLC SSD by Downgrading Worn-out Blocks,” IEEE 37th International Performance Computing and Communications Conference (IPCCC), 2018.
C. Wu, D. Wu, H. Chou and C. Cheng, “Rethink the Design of Flash Translation Layers in a Component-Based View”, in IEEE Access, vol. 5, pp. 12895-12912, 2017.
Po-Liang Wu, Yuan-Hao Chang and T. Kuo, “A file-system-aware FTL design for flash-memory storage systems,” 2009, pp. 393-398.
S. Choudhuri and T. Givargis, “Performance improvement of block based NAND flash translation layer”, 2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and Systems Synthesis (CODES+ISSS). Salzburg, 2007, pp. 257-262.
A. Zuck, O. Kishon and S. Toledo. “LSDM: Improving the Performance of Mobile Storage with a Log-Structured Address Remapping Device Driver”, 2014 Eighth International Conference on Next Generation Mobile Apps, Services and Technologies, Oxford, 2014, pp. 221-228.
J. Jung and Y. Won, “nvramdisk: A Transactional Block Device Driver for Non-Volatile RAM”, in IEEE Transactions on Computers, vol. 65, No. 2, pp. 589-600, Feb. 1, 2016.
Te I et al. (Pensieve: a Machine Assisted SSD Layer for Extending the Lifetime) (Year: 2018).
ARM (“Cortex-R5 and Cortex-R5F”, Technical Reference Manual, Revision r1p1) (Year: 2011).

Related Publications (1)

	Number	Date	Country
	20210311801 A1	Oct 2021	US

Continuations (1)

	Number	Date	Country
Parent	16238359	Jan 2019	US
Child	17350933		US

System and method for offloading computation to storage nodes in distributed system

Abstract