The present disclosure claims priority to Chinese patent application No. 202111118002.7, titled “TASK SCHEDULING METHOD, CHIP, AND ELECTRONIC DEVICE”, filed with the China National Intellectual Property Administration on Sep. 24, 2021, which is incorporated herein by reference in its entirety.
The present disclosure relates to the field of acceleration architecture, and in particular to a task scheduling method, a chip and an electronic device.
With the rapid development of emerging industries such as big data, artificial intelligence (AI) and fifth-generation (5G) communications, the volume of data generated grows exponentially, and the demand for computing power for data processing keeps increasing. Moore's Law and the Dennard scaling law jointly drove the rapid development of the chip industry for 30 years. As Moore's Law slows down and Dennard scaling breaks down, the improvement of computing power in general-purpose central processing units (CPUs) can no longer meet the demand of current data centers for more computational power. Heterogeneous computing based on the domain-specific architecture (DSA) uses various accelerators to accelerate characteristic services, thereby improving system computing power and reducing costs. The most typical accelerator is the deep learning accelerator; no matter whether a graphics processing unit (GPU), a field-programmable gate array (FPGA) or one of various types of neural network processing units (NPUs) is used, the computing power of the system can be improved several times over compared with a CPU-only solution.
In view of the above, in order to overcome at least one aspect of the above-mentioned problem, an embodiment of the present disclosure proposes a task scheduling method, including:
In some embodiments, the method further includes:
In some embodiments, the method further includes:
In some embodiments, the method further includes:
In some embodiments, the method further includes:
In some embodiments, the method further includes:
In some embodiments, the sending a notification to the scheduler in response to an operating phase when the corresponding sub-engine executes the corresponding sub-task to be processed being the same as the start phase in the received task parameter further includes:
In some embodiments, the issuing, by the comparator, the notification to the scheduler further includes:
Based on the same creative concept, according to another aspect of the present disclosure, an embodiment of the present disclosure further provides a chip including a digital logic circuit that, when operated, implements the steps of any one of the task scheduling methods described above.
Based on the same creative concept, according to another aspect of the present disclosure, an embodiment of the present disclosure further provides an electronic device, including the chip described above.
In order to explain the technical solutions of the embodiments of the present disclosure or the related art more clearly, the accompanying drawings needed in the illustration of the embodiments or the related art are briefly introduced below. Apparently, the drawings described below show merely some embodiments of the present disclosure; for those skilled in the art, other drawings may be obtained from these drawings without any creative effort.
In order that the objects, technical solutions and advantages of the present disclosure may be more clearly understood, embodiments of the present disclosure will be described in further detail below in combination with the detailed embodiments with reference to accompanying drawings.
It should be noted that all the expressions related to “first” and “second” in the embodiments of the present disclosure are used for distinguishing two different entities or different parameters with the same name. It can be seen that “first” and “second” are merely for the convenience of expression and should not be understood as limiting the embodiments of the present disclosure; the subsequent embodiments will not repeat this explanation one by one.
The hardware accelerator of a domain-specific architecture is designed for a certain business domain, and the business domain often contains multiple user scenarios. In each of the scenarios, the hardware accelerator needs to implement different functions that generally have similar or common characteristics. Therefore, when a hardware accelerator is designed, the functions to be implemented are generally split, so that the business processes in various scenarios become, as far as possible, combinations of independent sub-processes, and a dedicated hardware acceleration module, called a sub-engine, is designed for each sub-process.
The sub-engines are generally multiplexed among different user scenarios, i.e., a certain sub-engine may be used in multiple user scenarios, and the difference is that the task parameter of the sub-engine, the position of the sub-engine in the business process, and other sub-engines constituting the process may be different.
For example, a redundant array of independent disks (RAID) accelerator in a storage server can implement various scenarios such as RAID0/1/5/6, and functional modules such as a direct memory access (DMA) module, a storage page allocation/recovery module, a disk read/write module, an exclusive OR calculation module and a finite field calculation module can be obtained by splitting these scenarios into sub-flows. For RAID0/1, sub-engines 1 to 3 are required to be used, and these two scenarios have different task parameters of the sub-engines. For RAID5, sub-engines 1 to 4 are required to be used, and for RAID6, sub-engines 1 to 5 are required to be used.
The hardware accelerator, when operated, implements the functions of different user scenarios by combining different sub-engines, and the order of the sub-engines in the data flow is also different for each of the above-mentioned read/write sub-scenarios.
For example, for a read operation of RAID0, the hardware accelerator first schedules the storage page allocation module to allocate a block of data cache space; then the disk read/write module is called to read data from the disk and put it into the cache space, where the organization and sequencing of the data for RAID0 is completed; then the DMA module is called to move the data from the cache space to the host-end memory; and finally, the storage page recovery module is called to recover the cache space. For the write operation of RAID0, however, after the storage page allocation module is called, the DMA module is called to transfer data from the host end to the cache space and complete the organization and sequencing of the data; then the disk read/write module is called to sequentially write the data in the cache space to the disk; and finally the cache space also needs to be recovered. Therefore, sub-engines 1 to 3 are used in both the read and write scenarios of RAID0, but the calling sequence of the read scenario is 2-3-1-2, while the calling sequence of the write scenario is 2-1-3-2.
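The reuse of the same sub-engines in different calling orders can be sketched as follows. This is an illustrative software model only, not the patent's hardware implementation; the names and the numbering (1 = DMA, 2 = storage page allocation/recovery, 3 = disk read/write) follow the RAID example above.

```python
# Illustrative sketch: the RAID0 read and write scenarios reuse the
# same sub-engines but in a different calling order.
SUB_ENGINES = {
    1: "DMA",
    2: "page allocation/recovery",
    3: "disk read/write",
}

# Calling sequences for the two RAID0 sub-scenarios described above.
CALL_SEQUENCES = {
    "raid0_read":  [2, 3, 1, 2],   # allocate -> disk read -> DMA -> recover
    "raid0_write": [2, 1, 3, 2],   # allocate -> DMA -> disk write -> recover
}

def describe(scenario: str) -> list:
    """Return the sub-engine names in calling order for a scenario."""
    return [SUB_ENGINES[i] for i in CALL_SEQUENCES[scenario]]

print(describe("raid0_read"))
print(describe("raid0_write"))
```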
The scheduling of the sub-engines by the hardware accelerator is implemented using a module called a parser and a module called a scheduler. There are various implementations of the parser and the scheduler, which can be implemented in software or hardware. An implementation example is given below.
The parser parses the command from the host end according to the user scenario, decomposes the command into a plurality of sub-tasks, where each sub-task corresponds to one sub-engine, and organizes these sub-tasks into a list in order. The scheduler is used to dispatch the sub-tasks to the sub-engines, that is, to read a sub-task entry from the task list and then send it to the corresponding sub-engine according to the type of the sub-task entry.
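The parser/scheduler pair described above can be modeled in a few lines. The class and field names below are illustrative assumptions, not the patent's actual interfaces: the parser splits a host command into an ordered sub-task list, and the scheduler walks that list and hands each entry to the sub-engine matching its type.

```python
# Minimal software sketch (assumed names) of the parser and scheduler.
from dataclasses import dataclass

@dataclass
class SubTask:
    engine_type: str   # which sub-engine should run this entry
    params: dict       # task parameters for that sub-engine

def parse(command: dict) -> list:
    """Decompose a host command into an ordered sub-task list."""
    return [SubTask(engine_type=t, params=dict(command))
            for t in command["steps"]]

def schedule(task_list: list, engines: dict) -> list:
    """Dispatch each sub-task entry to its corresponding sub-engine."""
    log = []
    for entry in task_list:
        engines[entry.engine_type](entry.params)   # hand the entry over
        log.append(entry.engine_type)
    return log

# Usage: a RAID0 write decomposes into allocate -> DMA -> disk -> recover.
engines = {t: (lambda p: None) for t in ("alloc", "dma", "disk", "recover")}
cmd = {"steps": ["alloc", "dma", "disk", "recover"]}
print(schedule(parse(cmd), engines))   # each entry dispatched in order
```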
As shown in
The above-mentioned conventional method has various implementation forms, but generally has the following features: the sub-engine notifies the scheduler to start the next task after one sub-task is completed; the data cache area of each sub-task needs to be able to accommodate all the output data of the sub-task.
However, this method also has obvious disadvantages:
Another type of traditional hardware accelerator is implemented with cascaded sub-engines, i.e., the data output port of sub-engine 1 is connected to the data input port of sub-engine 2, and so on. When sub-engine 1 outputs its first data, sub-engine 2 starts to work; a first-in-first-out (FIFO) interface or another streaming data interface is generally used between the engines. In this way, the hardware accelerator can achieve very low delay, because a pipelined manner of operation is adopted between the engines. Moreover, a data cache with a large capacity is not required, because a streaming interface is used between the engines. However, this traditional method has a major disadvantage: poor versatility, so it cannot be used for complex scenarios. Since the sub-engines need to exchange data directly in this method, the connection relationship of the sub-engines is relatively fixed. Even when a data selector is used, only a few options can be supported, and the order of data flow between engines cannot be changed. Therefore, only simple scenarios with relatively few processing steps and relatively fixed flows can generally be supported, and scenarios such as the RAID acceleration described above cannot be achieved.
According to an aspect of the present disclosure, an embodiment of the present disclosure proposes a task scheduling method.
The technical solution proposed in the present disclosure enables sub-tasks executed in chronological order to partially or fully overlap in execution time. Therefore, compared with the traditional method, the overlapping time between every two engines operating in chronological order can be saved. In general, for a task requiring N sub-engines, the solution proposed in the present disclosure can reduce the delay to roughly 1/N thereof.
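The 1/N figure can be checked with a back-of-envelope sketch. The numbers below are illustrative assumptions, not measurements from the disclosure: each sub-engine is assumed to take the same time T, and each later sub-engine is assumed to start after only a small per-stage startup offset.

```python
# Back-of-envelope latency comparison (illustrative numbers only).
# Serially, N sub-engines of duration T finish in N*T; if each engine
# starts as soon as the previous one produces its first data, the
# total latency approaches T plus small per-stage startup offsets,
# i.e. roughly 1/N of the serial delay.
def serial_latency(n_engines: int, t_per_engine: float) -> float:
    return n_engines * t_per_engine

def overlapped_latency(n_engines: int, t_per_engine: float,
                       startup: float) -> float:
    # Each later engine lags the previous one only by `startup`.
    return t_per_engine + (n_engines - 1) * startup

N, T = 4, 100.0          # assumed: 4 sub-engines, 100 time units each
print(serial_latency(N, T))              # 400.0
print(overlapped_latency(N, T, 5.0))     # 115.0 -> close to 400/4
```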
In some embodiments, the method further includes:
In some embodiments, in step S1, in response to receiving an issued task, the task is divided into a plurality of sub-tasks by the parser, and a sub-task list is generated, the task parameter corresponding to each sub-task being recorded in the sub-task list, and the task parameter including a start phase of the next sub-task.
For two sub-engines, the data flows of which are in a chronological order, the latter sub-engine may start executing at a start point or an end point of a certain phase of a previous sub-engine. The two sub-engines may also start at the same time, namely, the latter sub-engine starts to execute at the start point of phase 1 of the previous sub-engine. For example, as shown in
The phases of a sub-engine and a task are pre-defined for different sub-engine types and tasks. When an IO command is parsed into the sub-task list, the parser adds the start phase of an engine to the task parameter of the sub-engine preceding that engine.
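The start-phase bookkeeping described above might be represented as follows. The field names and phase labels are assumptions for illustration only: each sub-task entry carries, in its task parameter, the phase of its own execution at which the next sub-engine may be started, and a comparator checks the current operating phase against that saved value.

```python
# Sketch (assumed names) of start phases recorded in the sub-task list.
subtask_list = [
    {"engine": "alloc", "params": {"next_start_phase": "phase1_begin"}},
    {"engine": "dma",   "params": {"next_start_phase": "phase2_begin"}},
    {"engine": "disk",  "params": {"next_start_phase": None}},  # last entry
]

def should_notify(current_phase: str, entry: dict) -> bool:
    """Comparator logic: notify the scheduler when the sub-engine's
    operating phase equals the start phase saved in its parameters."""
    return (entry["params"]["next_start_phase"] is not None
            and current_phase == entry["params"]["next_start_phase"])

print(should_notify("phase1_begin", subtask_list[0]))   # True
print(should_notify("phase1_begin", subtask_list[1]))   # False
```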
In some embodiments, in step S4, the sending a notification to the scheduler in response to an operating phase when the corresponding sub-engine executes the corresponding sub-task to be processed being the same as the start phase in the received task parameter further includes:
In some embodiments, in step S4, the issuing, by the comparator, the notification to the scheduler further includes:
The notification may be implemented by writing specific information into a register specified by the scheduler; the scheduler captures the event by detecting the write action on the bus and decoding the written content. After the scheduler captures the event, the next task is dispatched to the corresponding sub-engine, and so on.
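The register-write notification can be sketched as below. The register address, the payload encoding, and the class names are illustrative assumptions: a sub-engine writes a value identifying the next sub-engine into a register the scheduler watches, and the scheduler decodes the write and dispatches accordingly.

```python
# Sketch (assumed names) of notification by register write.
NOTIFY_REG = 0x40  # illustrative register address specified by the scheduler

class Scheduler:
    def __init__(self):
        self.dispatched = []

    def on_bus_write(self, addr: int, value: int):
        """Capture bus writes; a write to NOTIFY_REG is a notification
        whose payload identifies which sub-engine wants the next task."""
        if addr == NOTIFY_REG:
            engine_id = value & 0xFF          # decode the written content
            self.dispatched.append(engine_id) # dispatch next task to it

class SubEngine:
    def __init__(self, bus, engine_id):
        self.bus, self.engine_id = bus, engine_id

    def notify(self, next_engine_id: int):
        # Issue the notification by writing to the scheduler's register.
        self.bus.on_bus_write(NOTIFY_REG, next_engine_id)

sched = Scheduler()
SubEngine(sched, 1).notify(2)   # request the next task for sub-engine 2
print(sched.dispatched)         # [2]
```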
In some embodiments, the method further includes:
In order to avoid overflow when a small data cache is used to process a large data block, the present disclosure uses two counters to realize a passive flow control method, in which a source sub-engine (the sub-engine receiving the data request) does not actively send data to a target sub-engine (the sub-engine sending the data request), but waits for the data request sent by the target sub-engine. Unlike the traditional method, which performs a handshake via the signal lines of a dedicated data interface, the target sub-engine writes, through the interconnection bus, the data block size requested at this time into a specified register of the source sub-engine. After detecting the bus write to the specified register, the source sub-engine saves the request, and then sends no more than that size of data to the target sub-engine.
It should be noted that each sub-engine may serve as both the source sub-engine and the target sub-engine, so that two counters are provided in each sub-engine.
In some embodiments, the method further includes:
In some embodiments, the method further includes:
In order to realize passive flow control, a first counter needs to be implemented in the target sub-engine to store the remaining size of the current data cache. This remaining size is not the free space of the data cache at the current moment, but the free space minus the portion for which a request has already been issued and whose data has not yet reached the cache. The first counter operates according to the following rules:
At the same time, the source sub-engine also needs to implement a second counter for saving the amount of data to be output, and the second counter operates according to the following rules:
The core of the above-mentioned method, or of similar methods, is that at any time, the total size of the data requests sent from the target sub-engine to the source sub-engine does not exceed the size of the data cache, and the amount of data sent from the source sub-engine to the target sub-engine does not exceed the requested amount.
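The two-counter invariant can be modeled compactly. This is an illustrative sketch with assumed names, not the patent's register-level implementation: the target's first counter tracks cache space not yet promised to outstanding requests, and the source's second counter tracks data that has been granted but not yet sent.

```python
# Minimal model (assumed names) of the passive flow control above.
class Target:
    def __init__(self, cache_size: int):
        self.counter1 = cache_size   # free space minus outstanding requests

    def request(self, source: "Source", size: int) -> int:
        # Never request more than the unpromised cache space.
        size = min(size, self.counter1)
        self.counter1 -= size
        source.counter2 += size      # the grant arrives at the source
        return size

    def consume(self, nbytes: int):
        # Data leaves the cache (processed/forwarded): space is freed.
        self.counter1 += nbytes

class Source:
    def __init__(self):
        self.counter2 = 0            # granted-but-unsent amount

    def send(self, target: Target, nbytes: int) -> int:
        # Never send more than has been requested.
        nbytes = min(nbytes, self.counter2)
        self.counter2 -= nbytes
        return nbytes                # bytes actually placed on the bus

src, tgt = Source(), Target(cache_size=4096)
tgt.request(src, 4096)      # target asks for one 4 KB page
sent = src.send(tgt, 8192)  # source may send at most the granted 4 KB
print(sent)                 # 4096
tgt.consume(sent)           # cache drained; counter1 back to 4096
```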
In some embodiments, the method further includes:
To guarantee bus utilization, data requests may continue to be sent to other sub-engines after the first counter in the target sub-engine has increased to a preset value.
Implementations of the present disclosure are described below by taking a two-disk RAID0 write scenario of a RAID accelerator as an example.
Writing-in of a RAID0 requires:
The sub-engines are connected together by an advanced extensible interface (AXI) bus, and there is no dedicated wiring between the engines.
Assuming that the host sends a RAID0 write IO of 256 KB to the RAID accelerator, a disk page size is 4 KB, a data memory on the host end is discontinuous, and an address linked list is used for organization:
First, the parser parses the IO into four tasks as follows:
Then, the parser performs a start phase configuration as follows:
After receiving the task, the allocation sub-engine saves the start phase of the disk writing-in 1 sub-engine in its internal register, and then starts to execute. At the beginning of execution, the comparator recognizes that the phase at this time is the same as that in the register, and then issues the notification to the scheduler to request that the next task be dispatched to the disk writing-in 1 sub-engine.
The scheduler dispatches the next task to the disk writing-in 1 sub-engine, and the disk writing-in 1 sub-engine saves the start phase of the disk writing-in 2 sub-engine in the internal register.
The allocation sub-engine initializes the first counter to 4 KB according to its cache size (assuming that this is the size of one data page; it may also be less); then the data processing logic sends a data request of 4 KB to the DMA sub-engine and reduces the first counter to zero.
The DMA sub-engine receives the data request of 4 KB and increases the second counter to 4 KB; then the data processing logic sends a DMA data read request to the host one or more times according to the content of the address linked list; and the host sends data to the address of the allocation sub-engine via the PCIe bus.
The disk writing-in 1 sub-engine also initializes the first counter to 4 KB according to its cache size (assumed to be one data page size), and then sends a data request of 4 KB to the allocation sub-engine.
The allocation sub-engine receives data from the DMA sub-engine, sends it to the data processing module, and then outputs it to the disk writing-in 1 sub-engine. For every byte output to the disk writing-in 1 sub-engine, the second counter is decremented by one and the first counter is incremented by one. In order to ensure the utilization rate of the PCIe bus, whenever the first counter is greater than 1 KB, the allocation sub-engine requests data from the DMA sub-engine once.
The disk writing-in 1 sub-engine writes the received data into the disk in pages. When the processed data reaches 2 KB, the disk writing-in 1 sub-engine sends the notification to the scheduler to request that a task be dispatched to the disk writing-in 2 sub-engine.
The allocation sub-engine processes the data of the second page (under RAID0, this page needs to be written to disk 2) and sends the data to the disk writing-in 2 sub-engine; and the disk writing-in 2 sub-engine writes it into disk 2.
The foregoing steps are repeated until one IO operation is completed.
In the solution proposed in the present disclosure, the sub-engines are connected through a common interconnection bus and are scheduled via the task list and the scheduler, so as to ensure that the scheduler can schedule the sub-engines in an arbitrary order to realize the processing of a complex scenario. In addition, the task of a sub-engine is split into a plurality of operating phases, and the purpose of reducing delay is achieved by overlapping the operating phases between the sub-engines. Unlike the traditional method in which the next sub-engine starts to operate only after one sub-engine has completely finished, there may be a plurality of sub-engines serving the same IO task at the same time. The start phase of the next task is saved in the previous sub-engine and determined by that sub-engine, which then informs the scheduler to schedule the next sub-engine. When a task is processed, the target sub-engine sends a request for a data block to the source sub-engine via the interconnection bus, which is different both from the traditional method using signal-line connections to realize flow control and from the traditional method using a bus interconnection without any flow control. This enables the use of data caches smaller than the data block, thereby reducing costs.
Based on the same creative concept, according to another aspect of the present disclosure,
Based on the same creative concept, according to another aspect of the present disclosure,
Finally, it should be noted that, those skilled in the art can understand that all or part of the processes in the above method embodiments can be implemented by instructing related hardware through computer programs. The programs may be stored in a computer-readable storage medium. The programs, when being executed, may include the processes of the above method embodiments.
In addition, it should be appreciated that the computer-readable storage medium (e.g., memory) herein may be either a volatile memory or nonvolatile memory, or may include both the volatile memory and nonvolatile memory.
Those skilled in the art would also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate such interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described generally in terms of their functionality. Whether such functionality is implemented as software or as hardware depends upon the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the functions in various ways for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope disclosed in the embodiments of the present application.
The above are exemplary embodiments of the present disclosure, but it should be noted that various changes and modifications can be made without departing from the scope of the embodiments of the present disclosure defined by the claims. The functions, steps and/or actions of the method claims in accordance with the embodiments described herein need not be performed in any particular order. In addition, although the elements disclosed in the embodiments may be described or claimed in an individual form, they may also be understood as plural unless explicitly limited to a singular number.
It should be understood that a singular form “a” and “an” as used herein is intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that “and/or” as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed above are only for description, and do not represent the advantages and disadvantages of the embodiments.
Those skilled in the art can understand that all or part of the steps for implementing the above-mentioned embodiments can be completed by hardware, or can be completed by instructing related hardware through a program, and the program can be stored in a computer-readable storage medium. The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, and the like.
Those skilled in the art should understand that the discussion of any of the above embodiments is exemplary only, and is not intended to imply that the scope (including claims) of the embodiments of the present disclosure is limited to these examples. Under the idea of the embodiments of the present disclosure, the technical features in the above embodiments or different embodiments can also be combined, and there are many other changes in different aspects of the above embodiments of the present disclosure, which are not provided in details for the sake of brevity. Therefore, within the spirit and principle of the embodiments of the present disclosure, any omissions, modifications, equivalent replacements, improvements, etc., shall be included in the protection scope of the embodiments of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202111118002.7 | Sep 2021 | CN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2022/074613 | 1/28/2022 | WO |