The present invention relates to an accelerator offload device, an accelerator offload system, and an accelerator offload method.
Different types of processors excel at different workloads. Central processing units (CPUs) are highly versatile but are poor at operating a workload having a high degree of parallelism, whereas accelerators (hereinafter, appropriately referred to as ACCs), such as a field programmable gate array (FPGA)/(hereinafter, "/" means "or") a graphics processing unit (GPU)/an application specific integrated circuit (ASIC), can operate such a workload at high speed with high efficiency. Offload techniques, which improve overall operation time and operation efficiency by combining these different types of processors and offloading a workload that CPUs are poor at to ACCs, have been increasingly utilized.
Representative examples of a specific workload subjected to ACC offloading include encoding/decoding processing (forward error correction (FEC) processing) in a virtual radio access network (vRAN), audio and video media processing, and encryption/decryption processing.
As illustrated in
Hardware 10 includes a CPU 11 and an accelerator (ACC) 12.
ACC 12 is computing unit hardware that performs specific operations at high speed based on an input from CPU 11. Specifically, ACC 12 is a GPU or a programmable logic device (PLD) such as an FPGA.
As indicated by the white arrow a in
Techniques of transferring data in a server include New API (NAPI), Data Plane Development Kit (DPDK), and Kernel Busy Poll (KBP).
New API (NAPI) performs, upon arrival of a packet, packet processing in response to a software interrupt request after a hardware interrupt request (see Non-Patent Literature 1).
DPDK implements a packet processing function in the user space in which applications operate and, when a packet arrives, immediately retrieves the packet in the user space according to a polling model. Specifically, DPDK is a framework for performing control on a network interface card (NIC) in the user space, which has conventionally been performed by the Linux kernel (registered trademark). The largest difference from the processing by the Linux kernel is that DPDK has a polling-based reception mechanism called Poll Mode Driver (PMD). Normally, in the Linux kernel, an interrupt is generated upon arrival of data at the NIC, and reception processing is triggered by the interrupt. On the other hand, in PMD, a dedicated thread continuously checks for data arrival and performs reception processing. PMD is capable of performing high-speed packet processing by eliminating overheads such as context switching and interrupts. DPDK greatly improves the performance and throughput of packet processing, thereby securing more time for data plane application processing. However, DPDK exclusively uses computer resources such as CPUs and NICs.
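The contrast between interrupt-driven reception and PMD-style busy polling can be modeled by the following Python sketch. This is a conceptual illustration only; the class and function names are hypothetical stand-ins, not the actual DPDK API. A dedicated loop repeatedly drains an RX ring, counting how many poll iterations run even when no data has arrived.

```python
import collections

class PollModeReceiver:
    """Conceptual model of poll-mode reception: a dedicated loop repeatedly
    checks the RX ring instead of waiting for an interrupt on packet arrival."""
    def __init__(self):
        self.rx_ring = collections.deque()   # stands in for the NIC RX ring
        self.received = []
        self.poll_count = 0

    def poll_once(self):
        # One iteration of the busy-poll loop: drain whatever has arrived.
        self.poll_count += 1
        while self.rx_ring:
            self.received.append(self.rx_ring.popleft())

def run_pmd(packets, polls_between_arrivals=3):
    """Feed packets into the ring with a few empty polls in between;
    the empty polls model the CPU spinning while no data is present."""
    rx = PollModeReceiver()
    for pkt in packets:
        for _ in range(polls_between_arrivals):
            rx.poll_once()
        rx.rx_ring.append(pkt)
    rx.poll_once()
    return rx
```

The ratio of `poll_count` to packets received illustrates the trade-off noted above: polling detects arrivals without interrupt overhead, but the dedicated thread occupies a CPU core even when the ring is empty.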
Patent Literature 1 discloses an in-server network delay control device (KBP). KBP constantly monitors packet arrivals according to a polling model in the kernel. This restrains softIRQ and achieves low-latency packet processing.
Conventional methods of acquiring the operation result of an ACC in ACC offloading include (1) an interrupt method and (2) a polling method.
In the (1) interrupt method, an application detects completion of the operation via an interrupt. In the (2) polling method, completion of the operation by the ACC is immediately detected by performing busy polling (constantly monitoring a buffer in which data is stored when the operation is completed). Description thereof will be given in order.
As illustrated in
As illustrated in
In the (1) interrupt method, APL 1 detects completion of the operation performed by the ACC via an interrupt (see reference sign c in
As illustrated in
As illustrated in
The advantages and disadvantages of the (1) interrupt method and the (2) polling method as methods of acquiring the operation result of the ACC in ACC offloading are summarized as follows:
The (1) interrupt method has the advantage of achieving a high CPU utilization efficiency but has the concern of an increase in the processing time due to an interrupt processing overhead.
The (2) polling method has the advantage of detecting completion of the operation by the ACC at high speed by busy polling but has the concern of an increase in the power consumption due to wasteful use of CPU resources during the polling and a decrease in the CPU resource efficiency.
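The two acquisition methods can be contrasted with a small Python model. A `threading.Event` stands in for the completion interrupt, and the accelerator operation and its result value are illustrative stand-ins; this is a sketch of the control flow, not of any real ACC driver.

```python
import threading
import time

def acc_operation(done_event, result):
    """Stands in for the ACC: compute, store the result, signal completion."""
    time.sleep(0.05)           # models the accelerator's operation time
    result["value"] = 42        # hypothetical operation result
    done_event.set()            # corresponds to the completion interrupt

def interrupt_method():
    # (1) The application blocks until notified. The CPU is free meanwhile,
    # but waking the blocked thread adds overhead before the result is used.
    done, result = threading.Event(), {}
    threading.Thread(target=acc_operation, args=(done, result)).start()
    done.wait()
    return result["value"]

def polling_method():
    # (2) The application busy-polls the completion flag. Detection is
    # immediate, but the loop consumes CPU cycles the whole time.
    done, result = threading.Event(), {}
    threading.Thread(target=acc_operation, args=(done, result)).start()
    spins = 0
    while not done.is_set():
        spins += 1              # wasted cycles: the cost of low latency
    return result["value"], spins
```

The `spins` counter makes the polling method's concern concrete: every iteration before completion is CPU time spent only on monitoring.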
The present invention has been made in view of such a background, and an object of the present invention is to reduce the processing time and improve the CPU utilization efficiency.
In order to achieve the above-described object, an accelerator offload device that offloads specific processing of an application program to an accelerator includes: a request-related processing part; a request I/O part; a response I/O part; and a response-related processing part, wherein the request-related processing part is configured to perform predetermined processing required before performing offloading to the accelerator and then notify the request I/O part of a request to perform offloading, wherein the request I/O part is configured to operate on a first CPU core to perform request processing of notifying the accelerator of an offload request, wherein the response I/O part is configured to operate on a second CPU core different from the first CPU core to perform response processing of notifying the response-related processing part of operation completion of the accelerator, and wherein the response-related processing part is configured to perform an operation described in the application program by using an operation result of the accelerator.
The present invention can reduce the processing time and improve the CPU utilization efficiency.
Hereinafter, an accelerator offload system and the like in a mode for carrying out the present invention (hereinafter, referred to as “present embodiment”) will be described with reference to the drawings.
As illustrated in
ACC 12 is computing unit hardware that performs specific operations at high speed based on an input from CPU 11. Specifically, ACC 12 is a GPU or a PLD such as an FPGA.
Ring buffer 13 is provided in hardware 10, and a workload to be processed is copied into it. Request I/O part 150 and response I/O part 160 exchange data with the accelerator via ring buffer 13.
An application (APL) 1 (application program) is further deployed in user space 200. APL 1 is a program executed by an application thread (CPU) in user space 200.
Accelerator offload device 100 includes a management part 110, a task scheduler 120, a request-related processing part 130, a response-related processing part 140, a request I/O part (CPU #n) 150, and a response I/O part (CPU #m) 160. Task scheduler 120 includes a sleep control part 121. Request I/O part (CPU #n) 150 includes a sleep control part 151. Response I/O part (CPU #m) 160 includes a sleep control part 161.
In the following description, “sleep” means that the CPU executes a command having a small number of cycles, such as a pause command.
The notations of CPU #n and CPU #m (n and m are any natural numbers) represent use of different CPU cores.
Accelerator offload device 100 is deployed in user space 200. For example, request-related processing part 130 and response-related processing part 140 are implemented in APL 1, and request I/O part (CPU #n) 150 and response I/O part (CPU #m) 160 are implemented in high-speed data communication part 40 described later (as a library of a high-speed data communication layer configured with CUDA, OpenCL, BBDEV API, and the like).
Note that request I/O part (CPU #n) 150 and response I/O part (CPU #m) 160 may be included and implemented in request-related processing part 130 and response-related processing part 140, respectively, in APL 1.
Management part 110 manages a CPU core group composed of a plurality of CPU cores. Management part 110 determines a CPU core to be used by request-related processing part 130, response-related processing part 140, request I/O part 150, or response I/O part 160 from among the CPU core group.
Management part 110 allocates one CPU core to response I/O part 160 as a response-dedicated functional part from among the CPU core group.
Management part 110 secures, in advance, a CPU core group that may be used by the functional parts (request-related processing part 130, response-related processing part 140, request I/O part (CPU #n) 150, and response I/O part (CPU #m) 160). Here, an operator may determine how to use the CPU cores in advance such that another application does not use the CPU core group.
Management part 110 determines, for each of the functional parts (request-related processing part 130, response-related processing part 140, request I/O part (CPU #n) 150, and response I/O part (CPU #m) 160), a CPU core to be used by the functional part from among the CPU core group.
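The reservation and assignment performed by management part 110 can be sketched as follows. This is an illustrative Python model, not an actual implementation: in a real system, pinning would be done with mechanisms such as `sched_setaffinity`, and the part names and core counts below are assumptions for the example.

```python
def assign_cores(core_group, parts):
    """Model of the management part: a CPU core group is reserved in
    advance, and each functional part is handed the cores it may use.
    `parts` maps a functional part name to the number of cores it needs."""
    free = list(core_group)
    assignment = {}
    for name, need in parts.items():
        if need > len(free):
            raise ValueError(f"core group exhausted while assigning {name}")
        # Hand out the next `need` cores and remove them from the free pool,
        # so no two functional parts ever share a core.
        assignment[name], free = free[:need], free[need:]
    return assignment
```

Because cores are removed from the pool as they are assigned, the response I/O part is guaranteed a core that no other part (or other application, given the advance reservation) will use.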
When a task that needs to be offloaded to the accelerator occurs, task scheduler 120 registers the task in a task queue of request-related processing part 130. Task scheduler 120 registers the task as a task using a CPU core different from the CPU core used by response I/O part 160.
Task scheduler 120 includes a sleep control part configured to, when no task is to be operated on a CPU, cause a thread running on the CPU to sleep.
When a task that needs to be offloaded to ACC 12 occurs, task scheduler 120 registers the task in the task queue of request-related processing part 130 (see
Further, when the CPU operating frequency of CPU core #n used by each of the processing parts has been lowered, task scheduler 120 increases the CPU operating frequency, and, when the CPU idle state is in the power saving mode, task scheduler 120 causes a transition to the non-power saving mode.
Sleep control part 121 of task scheduler 120 causes request-related processing part 130 to sleep. At this time, for further power saving, the CPU operating frequency of CPU core #n being used may be lowered, and the CPU idle state may be set to the power saving mode.
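The sleep control described above can be modeled by the following Python sketch. A condition variable stands in for the low-cycle pause loop and its wake-up; the class name and the doubling "processing" are illustrative assumptions, not part of the actual design.

```python
import threading

class SleepControlledWorker:
    """Conceptual model of a sleep control part: when the task queue is
    empty, the worker thread blocks (standing in for a pause-instruction
    loop) and is woken when a task is registered."""
    def __init__(self):
        self.cv = threading.Condition()
        self.queue = []
        self.done = []
        self.stopped = False
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            with self.cv:
                while not self.queue and not self.stopped:
                    self.cv.wait()          # "sleep": no task to operate
                if self.stopped and not self.queue:
                    return
                task = self.queue.pop(0)
            self.done.append(task * 2)      # hypothetical task processing

    def submit(self, task):
        with self.cv:
            self.queue.append(task)
            self.cv.notify()                # wake-up on task registration

    def stop(self):
        with self.cv:
            self.stopped = True
            self.cv.notify()
        self.thread.join()
```

While blocked in `wait()`, the worker consumes essentially no CPU, which is the point of the sleep control; lowering the operating frequency and entering a power-saving idle state during this interval would save further power, as the text notes.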
Request-related processing part 130 performs a series of processing (predetermined processing) required before offloading to the accelerator. The series of processing required before offloading to the accelerator will be described later.
Request-related processing part 130 performs the series of processing required before ACC offloading and then notifies request I/O part (CPU #n) 150 of a request to perform ACC offloading. Note that when request I/O part 150 and request-related processing part 130 use different CPU cores, sleep control part 121 of task scheduler 120 may cause request-related processing part 130 to sleep at this timing.
Response-related processing part 140 performs operations described in the application program by using the operation result of ACC 12.
Response-related processing part 140 performs the operations described in APL 1 by using the operation result of ACC 12. Response-related processing part 140 may perform the processing by CPU core #m used by response I/O part (CPU #m) 160 or may perform the processing by another core (
The series of processing required before ACC offloading will be described.
A description will be given taking an example where encoding processing of FEC in a vRAN or virtual DU (vDU) is offloaded to ACC 12.
Request-related processing part 130 performs resource element demapping, equalization, inverse discrete Fourier transform (IDFT), channel estimation, demodulation, and descrambling.
Response-related processing part 140 performs frame processing (transmission processing of an Ethernet frame or the like).
Request-related processing part 130 performs frame processing (reception processing of an Ethernet frame or the like).
Response-related processing part 140 performs scrambling, modulation, layer mapping, precoding, and resource element mapping.
Request I/O part (CPU #n) 150 is composed of a CPU core and performs request processing of issuing an offload request to ACC 12.
Request I/O part (CPU #n) 150 issues an offload request to ACC 12. At this time, request I/O part (CPU #n) 150 copies the workload to be processed via ring buffer 13.
Response I/O part (CPU #m) 160 is composed of a CPU core different from the CPU core of request I/O part (CPU #n) 150 and performs response processing of notifying response-related processing part 140 of operation completion of the accelerator.
Response I/O part (CPU #m) 160 wakes up upon receipt of an interrupt. At this time, when the CPU operating frequency of CPU core #m used by response I/O part 160 has been lowered, task scheduler 120 increases the CPU operating frequency, and, when the CPU idle state is in the power saving mode, task scheduler 120 causes a transition to the non-power saving mode.
Response I/O part (CPU #m) 160 notifies response-related processing part 140 that ACC 12 has completed the operation to communicate pointer information on an area of ring buffer 13 where the operation result is stored.
High-speed data communication part 40 is a high-speed data communication layer configured with CUDA, OpenCL, BBDEV API, and the like. For example, high-speed data communication part 40 is CUDA Toolkit (registered trademark) for using a GPU manufactured by NVIDIA (registered trademark) or OpenCL (registered trademark) for operation using a heterogeneous processor. In addition, BBDEV API (registered trademark) is a development kit (library) that provides an accelerator I/O function for processing wireless access signals.
High-speed data communication part 40 incorporates the accelerator I/O function provided as libraries by above-described CUDA, OpenCL, BBDEV API, or the like into APL 1 in user space 200, thereby allowing APL 1 to have the accelerator I/O function for processing wireless access signals.
Hereinafter, a description will be given of an operation of accelerator offload device 100 of the accelerator offload system 1000 configured as described above.
In order to avoid the interrupt overhead caused by the save processing in the interrupt method, the present invention uses separate CPU cores for request processing and for response processing. In the present embodiment, accelerator offload device 100 includes a CPU core for request processing (request-related processing part 130 and request I/O part 150) and a CPU core for response processing (response-related processing part 140 and response I/O part 160). In other words, accelerator offload device 100 provides (allocates) at least one of a plurality of CPU cores as a CPU core serving as a response-dedicated functional part (response-related processing part 140 and response I/O part 160).
As the CPU core for response processing (response-dedicated functional part) is provided, a state where no in-progress processing is present is maintained at the time of interruption, which eliminates the save processing and thereby achieves low latency.
In order to reduce an increase in the power consumption caused by provision of the response-dedicated functional part, the present invention performs sleep control (including CPU operating frequency control and CPU idle state control) while there is no processing.
When there is no processing to be performed, sleep is performed and the CPU operating frequency and the CPU idle state are controlled to reduce the power consumption, thereby achieving power saving.
Ring buffer 13, which can be referenced by both the high-speed data communication layer and an ACC, is provided for data communication between APL 1 and ACC 12. The I/O parts of high-speed data communication part 40 exchange data with ACC 12 via ring buffer 13, thereby reducing the number of memory copies between APL 1 and ACC 12. Reducing unnecessary memory copies achieves high-speed data communication.
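The ring-buffer exchange described above can be sketched in Python as follows. This is a conceptual model under the assumption that both sides see the same slot array: only slot indices (pointer information) cross between the CPU side and the ACC side, so the payload itself is not copied between APL 1 and ACC 12. The `offload` helper and the squaring operation are illustrative stand-ins.

```python
class RingBuffer:
    """Minimal model of ring buffer 13: a fixed array of slots visible to
    both the high-speed data communication layer and the accelerator."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0

    def put(self, workload):
        idx = self.head
        self.slots[idx] = workload
        self.head = (self.head + 1) % len(self.slots)
        return idx                      # "pointer" handed to the other side

    def get(self, idx):
        return self.slots[idx]

def offload(ring, workload, acc_fn):
    # Request I/O part: place the workload and notify the ACC of the slot.
    idx = ring.put(workload)
    # ACC operates in place and leaves the result in the same slot.
    ring.slots[idx] = acc_fn(ring.get(idx))
    # Response I/O part: communicate only the slot index; the
    # response-related processing part reads the result from that slot.
    return idx
```

Because only `idx` travels between the parts, the number of memory copies is reduced to the single initial copy into the ring, which is the high-speed data communication property the text describes.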
As illustrated in
ACC 12 operates the offloaded workload, then notifies, by a hardware interrupt, response I/O part (CPU #m) 160 that the operation has been completed (see reference sign bb in
As illustrated in
At this time, as response-related processing part 140 and response I/O part (CPU #m) 160 perform no processing (are not involved in the request-related processing), these parts perform sleep control (including CPU operating frequency control and CPU idle state control) while performing no processing, thereby achieving power saving.
ACC 12 operates the offloaded workload and notifies, by a hardware interrupt (see reference sign bb in
The reason why response-related processing part 140 and response I/O part (CPU #m) 160 can immediately perform the post-processing of the processing 1 in response to the interrupt is that the processing 2 is handled exclusively by request-related processing part 130 and request I/O part (CPU #n) 150. Response-related processing part 140 and response I/O part (CPU #m) 160 perform sleep control until waking up in response to the next hardware interrupt.
As the in-progress processing is not suspended by an interrupt and there is no need to save intermediate data (the concern of the conventional (1) interrupt method), the interrupt overhead can be reduced by avoiding the save processing.
At this time, as request-related processing part 130 and request I/O part (CPU #n) 150 are not involved in the response-related processing (the processing is exclusively performed by response-related processing part 140 and response I/O part (CPU #m) 160), the application thread (CPU) can request ACC 12 to offload the next processing (processing 2) (see reference sign cc in
The reason why the in-progress processing is not suspended by the interrupt is that the response-related processing is performed exclusively by response-related processing part 140 and response I/O part (CPU #m) 160, so the resources of request-related processing part 130 and request I/O part (CPU #n) 150 remain dedicated to request-related processing, thereby increasing the efficiency of request-related processing part 130 and request I/O part (CPU #n) 150.
Response I/O part (CPU #m) 160 detects the completion of the operation by ACC 12 via an interrupt (see reference sign dd in
An operation of accelerator offload device 100 will be described with reference to the flowcharts illustrated in
In step S1, management part 110 (
In step S2, management part 110 determines, for each of the functional parts (request-related processing part 130, response-related processing part 140, request I/O part (CPU #n) 150, and response I/O part (CPU #m) 160), a CPU core to be used by the functional part from among the CPU core group and terminates the processing of the flow.
A description will be given of an example of determining the CPU core to be used by each functional part from among the CPU core group.
For example, determination is made such that request-related processing part 130 uses CPUs #n−α to #n−1, request I/O part 150 uses CPU #n, response I/O part 160 uses CPU #m, and response-related processing part 140 uses CPUs #m−β to #m−1. The above α and β are constants. Depending on the heaviness of the processing by request-related processing part 130 and response-related processing part 140, when the processing is heavy, the number of CPU cores usable by these parts is increased by increasing α and β.
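The allocation example above can be expressed as a small helper. The function itself is an illustrative sketch; the n, m, α, β names follow the text, and the concrete values in the example are assumptions.

```python
def core_layout(n, m, alpha, beta):
    """Compute the core assignment from the example: request-related
    processing on CPUs #n-alpha .. #n-1, request I/O on CPU #n,
    response I/O on CPU #m, response-related on CPUs #m-beta .. #m-1."""
    return {
        "request_related": list(range(n - alpha, n)),
        "request_io": [n],
        "response_io": [m],
        "response_related": list(range(m - beta, m)),
    }
```

Increasing `alpha` or `beta` widens the request-related or response-related core ranges, which is how the layout scales when those parts carry heavy processing.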
In step S11, when a task that needs to be offloaded to the ACC occurs, task scheduler 120 (
At this time, task scheduler 120 registers the task as a task using a CPU core different from the CPU core used by response I/O part 160.
Here, when request-related processing part 130 (
In addition, when the CPU operating frequency of CPU core #n used by a processing part has been lowered, task scheduler 120 increases the CPU operating frequency, and, when the CPU idle state is in the power saving mode, task scheduler 120 causes a transition to the non-power saving mode.
In step S12, request-related processing part 130 of accelerator offload device 100 performs the series of processing required before ACC offloading and then notifies request I/O part 150 of a request to perform ACC offloading.
Here, when request I/O part 150 and request-related processing part 130 use different CPU cores, sleep control part 121 of task scheduler 120 may cause request-related processing part 130 to sleep at this timing.
In step S13, request I/O part 150 (
In step S14, request I/O part (CPU #n) 150 determines whether a task is present in the task queue of request-related processing part 130.
When no task is present in the task queue of request-related processing part 130 (S14: No), in step S15, sleep control part 151 of request I/O part (CPU #n) 150 causes request I/O part (CPU #n) 150 to sleep and terminates the processing of the flow.
At this time, for further power saving, the CPU operating frequency of CPU core #n being used may be lowered, and the CPU idle state may be set to the power saving mode.
When a task is present in the task queue of request-related processing part 130 (S14: Yes), the processing proceeds to step S12.
In step S21, ACC 12 (
When the processing to be performed by response-related processing part 140 is heavy and new ACC offloading processing is completed before the processing of response-related processing part 140 is completed, a hardware interrupt may be raised to a different CPU core, and response I/O part 160 and response-related processing part 140 may be multi-threaded to perform the processing.
When the frequency of the interrupt increases/decreases, scaling out/in may be achieved by changing the interrupt destination CPU core.
In step S22, sleep control part 161 (
In step S23, response I/O part (CPU #m) 160 of accelerator offload device 100 notifies response-related processing part 140 that the ACC has completed the operation to communicate pointer information on an area of ring buffer 13 where the operation result is stored.
When request I/O part (CPU #n) 150 and request-related processing part 130 use different CPU cores, request I/O part 150 may be caused to sleep at this timing.
In step S24, response-related processing part 140 performs the operation described in APL 1 by using the operation result of ACC 12.
Here, response-related processing part 140 may perform the processing by CPU core #m used by response I/O part (CPU #m) 160 or may perform the processing by another core (see
In step S25, when there is no other task to be processed, sleep control part 121 of task scheduler 120 causes response I/O part (CPU #m) 160 and response-related processing part 140 to sleep and terminates the processing of the flow.
For further power saving, sleep control part 121 may lower the CPU operating frequency of CPU core #m being used and/or may set the CPU idle state to the power saving mode.
Here, the sleep control, CPU operating frequency setting, and CPU idle state setting of response I/O part 160 may be performed in step S23 described above.
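The request-side flow (S11 to S13) and the response-side flow (S21 to S24) described above can be modeled end to end with the following Python sketch. An `Event` stands in for the hardware interrupt, queues stand in for the task queue and the offload notification, and the "+1" pre-processing and summation are hypothetical stand-ins for the series of processing and for the ACC operation.

```python
import queue
import threading

def accelerator(offload_q, complete_event, result_box):
    """Stands in for ACC 12: take one offloaded workload, operate on it,
    then raise the completion 'interrupt' (event)."""
    workload = offload_q.get()
    result_box["result"] = sum(workload)     # hypothetical operation
    complete_event.set()

def request_side(task_q, offload_q, log):
    # S11-S13: pre-offload processing on the request cores, then notify
    # the ACC of the offload request.
    task = task_q.get()
    prepared = [x + 1 for x in task]          # stand-in pre-processing
    log.append("request: offloaded")
    offload_q.put(prepared)

def response_side(complete_event, result_box, log):
    # S21-S24: the dedicated response core sleeps until the completion
    # interrupt, then runs the application's post-processing.
    complete_event.wait()                     # sleep until the interrupt
    log.append(f"response: result={result_box['result']}")

def run_flow(task):
    task_q, offload_q = queue.Queue(), queue.Queue()
    ev, box, log = threading.Event(), {}, []
    threads = [
        threading.Thread(target=response_side, args=(ev, box, log)),
        threading.Thread(target=accelerator, args=(offload_q, ev, box)),
        threading.Thread(target=request_side, args=(task_q, offload_q, log)),
    ]
    for t in threads:
        t.start()
    task_q.put(task)
    for t in threads:
        t.join()
    return box["result"], log
```

Because the response side does nothing but wait for the event, the completion notification always arrives on a thread with no in-progress processing, which is the property the separate response-dedicated core is meant to guarantee.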
A description will be given of a task allocation example of task scheduler 120.
Task scheduler 120 distributes (schedules) tasks according to the availabilities of the task queues of the request-related processing parts 130. At this time, round robin may be used, or the tasks may be distributed in the ascending order of the number of remaining tasks of the request-related processing parts 130.
When tasks are registered in the plurality of request-related processing parts 130 at the same time, the plurality of request-related processing parts 130 may complete the tasks at the same time, and the completions may be communicated to request I/O part (CPU #n) 150 at the same time. Further, when there is a plurality of ACC offloading operation processors, the ACC offloading processing may be completed at the same time, hardware interrupts to response I/O part (CPU #m) 160 may occur at the same time, and thus a hardware interrupt may occur in the middle of the processing by response I/O part (CPU #m) 160. In view of this, it is conceivable to intentionally stagger the timings of distributing the tasks from task scheduler 120 to the request-related processing parts 130.
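The two distribution policies mentioned above (round robin, and ascending order of remaining tasks) can be sketched as follows. The function names are illustrative; queues are modeled as plain lists.

```python
import itertools

def distribute_round_robin(tasks, queues):
    """Distribute tasks to request-related processing parts in turn."""
    rr = itertools.cycle(range(len(queues)))
    for task in tasks:
        queues[next(rr)].append(task)

def distribute_least_loaded(tasks, queues):
    """Distribute each task to the queue with the fewest remaining tasks,
    i.e. in the ascending order of the number of remaining tasks."""
    for task in tasks:
        min(queues, key=len).append(task)
```

The least-loaded policy adapts to queues that already hold backlog, whereas round robin is oblivious to it; staggering the distribution timings, as the text suggests, would additionally desynchronize the completion interrupts.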
[Example of Case where Post-Response Processing is Heavy]
A description will be given of an example of a case where post-response processing is heavy.
As illustrated in
Accelerator offload device 100A includes a management part 110A, task scheduler 120, request-related processing part 130, response-related processing part 140, request I/O part (CPU #n) 150, and response I/O part (CPU #m) 160.
Management part 110A has, in addition to the function (see
Hereinafter, a description will be given of an operation of accelerator offload device 100A of accelerator offload system 1000A configured as described above.
As illustrated in
In accelerator offload device 100A illustrated in
Response I/O part (CPU #m) 160 concentrates on receiving (mediating) hardware interrupts from ACC 12, and upon receipt of a hardware interrupt (see reference sign bb in
At this time, as request-related processing part 130 and request I/O part (CPU #n) 150 are not involved in the response-related processing (the processing is exclusively performed by response-related processing part 140 and response I/O part (CPU #m) 160), the application thread (CPU) can request ACC 12 to offload the next processing (processing 2) (see reference sign cc in
Response I/O part (CPU #m) 160 concentrates on receiving (mediating) hardware interrupts from ACC 12, and upon receipt of a hardware interrupt (see reference sign ee in
With this, accelerator offload device 100A can avoid a state where, when a hardware interrupt occurs, response I/O part (CPU #m) 160 that receives the hardware interrupt is performing other processing.
The first embodiment illustrates a mode in which the request-related processing and the response-related processing are performed by different threads (different CPU cores). A mode in which request and response processing are performed by the same thread (CPU core) is also conceivable. This mode will be described as a second embodiment.
As illustrated in
Accelerator offload device 100B includes a management part 210, a task scheduler 220, request/response-related processing parts 230 (request-related processing parts 230 and response-related processing parts 230), and request/response I/O parts 250 (request/response I/O part (CPU #n) 250 and request/response I/O part (CPU #n+1) 250).
Request/response I/O part (CPU #n) and request/response I/O part (CPU #n+1) have the same configuration, and thus the same number is given to request/response I/O part (CPU #n) 250 and request/response I/O part (CPU #n+1) 250.
Task scheduler 220 includes a sleep control part 221. Request/response I/O part (CPU #n) and request/response I/O part (CPU #n+1) 250 each include a sleep control part 251.
Management part 210 secures, in advance, a CPU core group that may be used by functional parts (request/response-related processing parts 230, request/response I/O part (CPU #n) 250, and request/response I/O part (CPU #n+1) 250). Here, the operator may determine how to use the CPU cores in advance such that another application does not use the CPU core group.
Management part 210 determines, for each of the functional parts (request/response-related processing parts 230, request/response I/O part (CPU #n) 250, and request/response I/O part (CPU #n+1) 250), a CPU core to be used by the functional part.
When a task that needs to be offloaded to ACC 12 occurs, task scheduler 220 registers the task in the task queue of a request/response-related processing part 230 (see
Task scheduler 220 allocates tasks to a plurality of CPU cores in a distributed manner taking into account the timings of receiving the operation results from ACC 12. In order to maintain a state where no in-progress processing is present when a hardware interrupt for receiving the processing result from ACC 12 is generated, task scheduler 220 registers the task as a task using request/response I/O part (CPU #n+1) 250 that uses a CPU core different from the CPU core used by request/response I/O part (CPU #n) 250. When the request-related processing part 230 is sleeping, sleep control part 221 of task scheduler 220 wakes up the request-related processing part 230.
The request/response-related processing part 230 performs the series of processing required before ACC offloading and then notifies a request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) of a request to perform ACC offloading.
Request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) notifies ACC 12 of an offload request. At this time, request/response I/O part 250 copies the workload to be processed via ring buffer 13.
A request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) wakes up in response to an interrupt. At this time, when the CPU operating frequency of CPU core #n or #n+1 used by the request/response I/O part 250 has been lowered, task scheduler 220 increases the CPU operating frequency, and, when the CPU idle state is in the power saving mode, task scheduler 220 causes a transition to the non-power saving mode.
The request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) notifies a request/response-related processing part 230 that ACC 12 has completed the operation to communicate pointer information on an area of ring buffer 13 where the operation result is stored.
Hereinafter, an operation of accelerator offload device 100B of accelerator offload system 1000B configured as described above will be described.
Task scheduler 220 illustrated in
Task scheduler 220 allocates tasks in a distributed manner to a plurality of CPU cores in order to maintain a state where no in-progress processing is present when a hardware interrupt for receiving the processing result from the ACC is generated. Here, the processing 1 is handled by request/response I/O part (CPU #n) 250, and the processing 2 is handled by request/response I/O part (CPU #n+1) 250.
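The alternating allocation described above can be sketched with a small helper. The function and the core labels are illustrative assumptions; the point is only that consecutive offloads land on different request/response I/O parts.

```python
def schedule_alternating(tasks, io_parts=("CPU#n", "CPU#n+1")):
    """Model of task scheduler 220: consecutive offloads alternate between
    request/response I/O parts on different cores, so the completion
    interrupt for one task arrives on a core that is not in the middle of
    handling the next task."""
    return [(task, io_parts[i % len(io_parts)]) for i, task in enumerate(tasks)]
```

With two I/O parts, task k and task k+1 never share a core, which maintains the state where no in-progress processing is present when the hardware interrupt for task k is generated.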
As illustrated in
ACC 12 operates the offloaded workload, notifies, by a hardware interrupt, request/response I/O part (CPU #n) 250 that ACC 12 has completed the operation (see reference sign hh in
Task scheduler 220 allocates tasks to a plurality of CPU cores in a distributed manner taking into account the timings of receiving the operation results from ACC 12. In order to maintain a state where no in-progress processing is present when a hardware interrupt for receiving the processing result from ACC 12 is generated, task scheduler 220 registers the task as a task using request/response I/O part (CPU #n+1) 250.
APL 1 requests a request/response-related processing part 230 of accelerator offload device 100B to offload the next processing (processing 2). The request/response-related processing part 230 performs the series of processing required before ACC offloading and then notifies request/response I/O part (CPU #n+1) 250 of a request to perform ACC offloading. Request/response I/O part (CPU #n+1) 250 notifies ACC 12 of an offload request (see reference sign ii in
ACC 12 operates the offloaded workload, then notifies, by a hardware interrupt, request/response I/O part (CPU #n+1) 250 that the operation has been completed (see reference sign jj in
Task scheduler 220 registers a task as a task using request/response I/O part (CPU #n) 250.
As illustrated in
At this time, as the request-related processing part 230 and request/response I/O part (CPU #n) 250 perform no processing (are not involved in the request-related processing), the request-related processing part 230 and request/response I/O part (CPU #n) 250 perform sleep control (including CPU operating frequency control and CPU idle state control) while performing no processing.
ACC 12 operates the offloaded workload and notifies, by a hardware interrupt, request/response I/O part (CPU #n) 250 that the operation has been completed (see reference sign hh in
In order to maintain a state where no in-progress processing is present when a hardware interrupt for receiving the processing result from ACC 12 is generated, task scheduler 220 registers a task as a task using request/response I/O part (CPU #n+1) 250, which uses a CPU core different from CPU #n.
APL 1 requests request/response-related processing part 230 of accelerator offload device 100B to offload the next processing (processing 2) (see reference sign ii in
At this time, because request/response-related processing part 230 and request/response I/O part (CPU #n+1) 250 perform no processing (they are not involved in the request-related processing), they perform sleep control (including CPU operating frequency control and CPU idle state control).
ACC 12 operates the offloaded workload and notifies, by a hardware interrupt, request/response I/O part (CPU #n+1) 250 that the operation has been completed (see reference sign jj in
The first embodiment illustrates a mode in which the request-related processing and the response-related processing are performed by different threads (different CPU cores). In this case, in order to maintain a state where no in-progress processing is present when a hardware interrupt for receiving the processing result from the ACC is generated, task scheduler 120 (
Accelerator offload device 100B according to the present embodiment includes: task scheduler 220, which allocates tasks to a plurality of CPU cores in a distributed manner taking into account the timings of receiving the operation results from ACC 12; and request/response-related processing parts 230, request/response I/O part (CPU #n) 250, and request/response I/O part (CPU #n+1) 250, which perform request and response processing in the same thread (CPU core).
This makes it possible to prevent the CPU resource efficiency from deteriorating due to the CPU being unavailable for other processing.
An operation of accelerator offload device 100B will be described with reference to the flowcharts illustrated in
In step S31, management part 210 (
In step S32, management part 210 determines CPU cores to be used by request/response-related processing parts 230, request/response I/O part (CPU #n) 250, and request/response I/O part (CPU #n+1) 250 from among the CPU core group and terminates the processing of the flow.
An example of determining the CPU core to be used by each functional part from among the CPU core group will be described.
For example, the determination is made such that request/response-related processing parts 230 use CPUs #n−a to #n−1, request/response I/O part (CPU #n) 250 uses CPU #n, and request/response I/O part (CPU #n+1) 250 uses CPU #n+1, where a is a constant. The value of a is set according to the heaviness of the processing: when the processing is heavy, a is increased so that more CPU cores are usable by request/response-related processing parts 230.
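The core determination in step S32 can be sketched as a small function. The function name, the example values of n and a, and the dictionary layout are illustrative assumptions, not part of the embodiment.

```python
# Sketch of the CPU core determination in step S32:
# request/response-related processing parts 230 get CPUs #n-a .. #n-1,
# and the two request/response I/O parts get CPU #n and CPU #n+1.
# The constant a is increased when the processing is heavy.
def determine_cores(n, a):
    return {
        "processing_parts": list(range(n - a, n)),  # CPUs #n-a .. #n-1
        "io_part_0": n,                             # CPU #n
        "io_part_1": n + 1,                         # CPU #n+1
    }


# With n = 6 and a = 2 (heavier processing would use a larger a):
layout = determine_cores(n=6, a=2)
print(layout)
```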
In step S41, when a task that needs to be offloaded to the ACC occurs, task scheduler 220 (
Task scheduler 220 allocates tasks to a plurality of CPU cores in a distributed manner taking into account the timings of receiving the operation results from ACC 12.
Here, when the request/response-related processing part 230 (
When the CPU operating frequency of the CPU core used by each processing part has lowered, task scheduler 220 increases the CPU operating frequency, and when the CPU idle state is in the power saving mode, task scheduler 220 causes a transition to the non-power saving mode.
In step S42, the request/response-related processing part 230 of accelerator offload device 100B performs the series of processing required before ACC offloading and then notifies the request/response I/O part 250 of a request to perform ACC offloading.
Here, when the request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) uses a different CPU core, the request/response-related processing part 230 may be caused to sleep at this timing. Further, for further power saving, the CPU operating frequency of the CPU core being used may be lowered, and/or the CPU idle state may be set to the power saving mode.
In step S43, the request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) of accelerator offload device 100B notifies ACC 12 of an offload request (at this time, the request/response I/O part 250 copies a workload to be processed via ring buffer 13). The processing transitions to the <Response-Side Processing> (
In step S44, sleep control part 251 of the request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) causes the request/response-related processing part 230 to sleep and terminates the processing of the flow.
At this time, for further power saving, the CPU operating frequency of the CPU core being used may be lowered, and/or the CPU idle state may be set to the power saving mode.
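The sleep control and power-saving control in steps S44 and S52 can be modeled by the following sketch. This is a pure simulation: on Linux, the operating frequency would typically be changed through the cpufreq sysfs interface and the idle state through cpuidle, and the numeric frequency values here are illustrative assumptions.

```python
# Stand-in for the per-core power-saving controls used by sleep
# control parts 251 and 221: when a functional part has nothing to
# process, its thread sleeps, the CPU operating frequency of its core
# is lowered, and the CPU idle state is set to a power-saving mode.
class CoreState:
    def __init__(self, max_khz, min_khz):
        self.max_khz = max_khz
        self.min_khz = min_khz
        self.cur_khz = max_khz
        self.idle_power_saving = False
        self.sleeping = False

    def enter_sleep(self):
        # Step S44: sleep, then optionally deepen power saving.
        self.sleeping = True
        self.cur_khz = self.min_khz    # lower the operating frequency
        self.idle_power_saving = True  # power-saving idle state

    def wake(self):
        # Step S52: wake up and restore performance before handling
        # the operation result from the ACC.
        self.sleeping = False
        self.cur_khz = self.max_khz     # raise the operating frequency
        self.idle_power_saving = False  # non-power-saving mode


core = CoreState(max_khz=3_000_000, min_khz=800_000)
core.enter_sleep()
print(core.cur_khz, core.idle_power_saving)  # 800000 True
core.wake()
print(core.cur_khz, core.idle_power_saving)  # 3000000 False
```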
In step S51, ACC 12 (
In step S52, in a case where the request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) is sleeping, sleep control part 251 of the request/response I/O part 250 wakes up the request/response I/O part 250.
When the CPU operating frequency of the CPU core to be used has lowered, sleep control part 251 increases the CPU operating frequency, and when the CPU idle state is in the power saving mode, sleep control part 251 causes a transition to the non-power saving mode.
In step S53, the request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) notifies the request/response-related processing part 230 that ACC 12 has completed the operation to communicate pointer information on an area of ring buffer 13 where the operation result is stored.
At this time, when the CPU core used by the request/response-related processing part 230 and the CPU core used by the request/response I/O part 250 (request/response I/O part (CPU #n) 250 or request/response I/O part (CPU #n+1) 250) are different, the request/response I/O part 250 may be caused to sleep at this timing.
In addition, for further power saving, the CPU operating frequency of the CPU core being used may be lowered, and/or the CPU idle state may be set to the power saving mode.
In step S54, when the request/response-related processing part 230 is sleeping, the sleep control part 221 of task scheduler 220 wakes up the request/response-related processing part 230. In addition, when the CPU operating frequency of the CPU core being used has lowered, the sleep control part 221 increases the CPU operating frequency, and when the CPU idle state is in the power saving mode, the sleep control part 221 causes a transition to the non-power saving mode.
In step S55, the request/response-related processing part 230 performs the operation described in APL 1 by using the operation result of ACC 12.
In step S56, when there is no other task to be processed, sleep control part 221 of task scheduler 220 causes the request/response-related processing part 230 to sleep and terminates the processing of the flow.
In addition, for further power saving, the sleep control part 221 may lower the CPU operating frequency of the CPU core being used and/or may set the CPU idle state to the power saving mode.
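The completion path in steps S51 to S55 can be sketched as follows: the ACC writes its operation result into ring buffer 13, and the I/O part forwards only the pointer information (modeled here as a slot index) to the processing part, so the result itself is never copied again. All class and variable names are illustrative assumptions.

```python
# Simplified model of ring buffer 13 and the pointer-information
# hand-off of step S53: the ACC stores the operation result in a
# buffer slot, and only the slot reference travels between the I/O
# part and the processing part.
class RingBuffer:
    def __init__(self, size):
        self.slots = [None] * size
        self.size = size

    def write(self, slot, data):
        # ACC side: store the operation result in the ring buffer and
        # return "pointer information" on the area where it is stored.
        self.slots[slot % self.size] = data
        return slot % self.size

    def read(self, slot):
        # Processing-part side (step S55): use the result in place.
        return self.slots[slot % self.size]


ring = RingBuffer(size=8)
ptr = ring.write(3, "operation result of processing 1")
# The I/O part communicates only `ptr`; the processing part then
# performs the operation described in APL 1 using the stored result.
print(ring.read(ptr))
```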
A description will be given of an extended function of the task scheduler.
In any of the first embodiment and the second embodiment, task scheduler 120 or 220 (
In order to avoid a state in which a certain thread is performing other processing when an operation result is received from ACC 12, the timing of receiving the operation result from ACC 12 may be machine-learned from past results and inferred, and scheduling may be performed by using the inference result. For example, in the FEC processing in a vRAN, the FEC processing time varies depending on the data size or error rate. By learning this variation, it is possible to estimate the time from transmission of a request to the ACC to acquisition of a response.
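As a minimal illustration of such timing inference, past (data size, processing time) results can be fitted with a least-squares line and the fitted model used to estimate the response timing for a new request. The single-feature model, the synthetic training data, and the function names are illustrative assumptions; a practical model could also include the error rate as a feature.

```python
# Hedged sketch of timing inference from past results: fit
# time ~= b0 + b1 * size by ordinary least squares, then estimate the
# time from sending a request to the ACC until the response arrives.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1


# Hypothetical past results: FEC processing time grows with data size.
sizes = [100.0, 200.0, 300.0, 400.0]     # e.g. data sizes
times = [0.5 + 0.01 * s for s in sizes]  # e.g. milliseconds

b0, b1 = fit_line(sizes, times)
estimate = b0 + b1 * 250.0  # inferred time for a size-250 request
print(round(estimate, 6))
```

The scheduler could then avoid assigning other work to the receiving core around the estimated completion time.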
In any of the first embodiment and the second embodiment, task scheduler 120 or 220 (
The accelerator offload devices 100, 100A, and 100B according to the above embodiments are implemented by a computer 900 having the configuration illustrated in
Computer 900 includes a CPU 901, a RAM 902, a ROM 903, an HDD 904, an accelerator 905, an input/output interface (I/F) 906, a media interface (I/F) 907, and a communication interface (I/F) 908. Accelerator 905 corresponds to accelerator (ACC) 12 in
Accelerator 905 is an accelerator (device) 12 (
Computer 900 is connected to an external device 915 via communication I/F 908. Input/output I/F 906 is connected to an input/output device 916. Media I/F 907 reads/writes data from/to a recording medium 917.
CPU 901 operates according to a program stored in ROM 903 or HDD 904 and controls each component of accelerator offload devices 100, 100A, and 100B in
ROM 903 stores a boot program to be executed by CPU 901 when computer 900 is activated, a program that depends on the hardware of computer 900, and the like.
CPU 901 controls input/output device 916 including an input unit such as a mouse or a keyboard and an output unit such as a display or a printer via input/output I/F 906. CPU 901 acquires data from input/output device 916 and outputs generated data to input/output device 916 via input/output I/F 906. Note that a graphics processing unit (GPU) or the like may be used as a processor in conjunction with CPU 901.
HDD 904 stores a program to be executed by CPU 901, data to be used by the program, and the like. Communication I/F 908 receives data from another device via a communication network (e.g. network (NW)) and outputs the data to CPU 901 and also transmits data generated by CPU 901 to another device via the communication network.
Media I/F 907 reads a program or data stored in recording medium 917 and outputs the program or data to CPU 901 via RAM 902. CPU 901 loads a program for the desired processing from recording medium 917 onto RAM 902 via media I/F 907 and executes the loaded program. Recording medium 917 is an optical recording medium such as a digital versatile disc (DVD) or a phase change rewritable disk (PD), a magneto-optical recording medium such as a magneto optical disk (MO), a magnetic recording medium, a tape medium, a semiconductor memory, or the like.
For example, in a case where computer 900 functions as accelerator offload devices 100, 100A, or 100B configured as a device according to the present embodiment, CPU 901 of computer 900 implements the functions of accelerator offload devices 100, 100A, or 100B by executing the program loaded onto RAM 902. HDD 904 stores data in RAM 902. CPU 901 reads the program for the desired processing from recording medium 917 and executes the program. In addition, CPU 901 may read the program for the desired processing from another device via the communication network.
Accelerator offload devices 100, 100A, and 100B are to be deployed in user space 200, and the OS is not limited. In addition, they are not limited to being deployed under a server virtualized environment. Therefore, accelerator offload systems 1000 to 1000B can be applied to the configurations illustrated in
As illustrated in
Specifically, the server includes host OS 50 in which a virtual machine and an external process formed outside the virtual machine can operate, hypervisor 60, VM 70 having a virtual IF 71, and guest OS 80 operating in the virtual machine. Host OS 50 includes a ring buffer 13 managed by a kernel in a memory space in the server.
In accelerator offload system 1000C, accelerator offload device 100 is deployed in user space 200. Therefore, like DPDK, accelerator offload device 100 can reference a ring-structured buffer while bypassing the kernel. That is, accelerator offload device 100 does not use a ring buffer (ring buffer 13) or a poll list (not illustrated) in the kernel.
Accelerator offload device 100 can reference the ring-structured buffer (ring buffer 13) while bypassing the kernel and thus can instantly detect a packet arrival (i.e., polling model rather than interrupt model).
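The contrast between this polling model and an interrupt model can be sketched with a minimal busy-poll loop: a dedicated loop repeatedly checks the ring-structured buffer for an arrival instead of waiting for a kernel interrupt, so an arrival is detected on the very next iteration. The buffer and loop below are a simplified stand-in, not the actual DPDK or embodiment code.

```python
# Minimal stand-in for the kernel-bypass polling model: a dedicated
# loop checks the ring-structured buffer directly, with no interrupt
# or context switch between arrival and detection.
from collections import deque

ring = deque()  # stands in for the ring-structured buffer


def poll_once(ring):
    # One polling iteration: pull an arrival immediately if present.
    if ring:
        return ring.popleft()
    return None


ring.append("packet 1")
arrivals = []
for _ in range(3):  # a real dedicated thread would loop indefinitely
    pkt = poll_once(ring)
    if pkt is not None:
        arrivals.append(pkt)
print(arrivals)
```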
As illustrated in
With this configuration, in both host OS 50 and guest OS 80 in the system having the VM virtual server configuration, the notification interrupt of notifying of an ACC offloading result is applied to the notification to the APL 1 deployed in the guest.
As illustrated in
As illustrated in
With this configuration, in the system having the container configuration, the notification interrupt of notifying of an ACC offloading result is applied to the notification to an APL 1 deployed on the container.
The present invention can be applied to a system having a non-virtualized configuration such as a bare-metal configuration (
<CPU Pinning when Hyper-Threading is Used>
In a case of using the hyper-threading technique, which logically creates a plurality of CPU cores from a single physical CPU core, the cache hit rate may be improved by, for CPU cores #n1 and #n2 that are logical cores created on a physical CPU core #n, allocating request-related processing part 130 (
CPU cores #n1 and #n2 use the same physical CPU core and thus may share an L1 cache or L2 cache. In this case, when the request processing and response processing are allocated to use the same physical core, the cache hit rate can be improved in a case where there is data to be used in common in the request processing and the response processing.
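The pinning described above can be sketched as follows. The sibling-core numbering is an assumption (one common Linux enumeration); on a real machine the layout should be read from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list, and the actual pinning would use an affinity call such as os.sched_setaffinity.

```python
# Sketch of CPU pinning under hyper-threading: physical core #n
# exposes logical cores #n1 and #n2. Pinning the request-side thread
# to one sibling and the response-side thread to the other keeps both
# on the same physical core, so data shared between request and
# response processing can stay in the shared L1/L2 cache.
def sibling_logical_cores(physical_core, n_physical):
    # Assumed enumeration: logical cores k and k + n_physical are SMT
    # siblings on physical core k (verify via sysfs on the target).
    return physical_core, physical_core + n_physical


req_core, resp_core = sibling_logical_cores(physical_core=3, n_physical=8)
print(req_core, resp_core)  # request thread -> 3, response thread -> 11
# Actual pinning would then call os.sched_setaffinity(tid, {req_core})
# for the request-side thread and {resp_core} for the response-side one.
```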
As described above, an accelerator offload device 100 that offloads specific processing of an application program (APL 1) to an accelerator (ACC 12) includes: a request-related processing part 130 configured to perform predetermined processing required before offloading to the accelerator and notify a request I/O part 150 of a request to perform offloading; request I/O part 150, composed of a CPU core and configured to perform request processing of notifying the accelerator of an offload request; a response I/O part 160 composed of a CPU core different from the CPU core and configured to perform response processing of notifying a response-related processing part 140 of operation completion of the accelerator; and response-related processing part 140, configured to perform an operation described in the application program by using an operation result of the accelerator.
With this configuration, when a workload that the CPU is not good at is offloaded to the ACC, it is possible to reduce the processing time (achieve low latency) and improve the CPU utilization efficiency by reducing the overhead caused by a hardware interrupt at the time of receiving the ACC operation result.
Accelerator offload device 100 further includes a management part 110 configured to manage a CPU core group composed of a plurality of CPU cores. Management part 110 is configured to determine a CPU core to be used by request-related processing part 130, response-related processing part 140, request I/O part 150, or response I/O part 160 from among the CPU core group.
With this configuration, the CPU utilization efficiency can be improved by determining the CPU core to be used by request-related processing part 130, response-related processing part 140, request I/O part 150, or the response I/O part 160 from among the CPU core group.
In accelerator offload device 100, management part 110 allocates one CPU core to response I/O part 160 as a response-dedicated functional part from among the CPU core group.
With this configuration, the CPU utilization efficiency can be improved by allocating the CPU core to at least response I/O part 160 as the response-dedicated functional part from among the CPU core group.
Accelerator offload device 100 further includes a task scheduler 120 configured to, when a task that needs to be offloaded to the accelerator (ACC 12) occurs, register the task in a task queue of request-related processing part 130, and task scheduler 120 registers the task as a task using a CPU core different from the CPU core used by response I/O part 160.
With this configuration, it is possible to prevent the CPU resource efficiency from deteriorating due to the CPU being unavailable for other processing. Further, it is possible to avoid a state in which, in a case where the post-response processing is heavy or when a hardware interrupt is generated, a response I/O part (CPU #m) 160 that is to receive the hardware interrupt is performing other processing. This makes it possible to reduce the processing time and improve the CPU utilization efficiency.
In accelerator offload device 100, task scheduler 120 includes a sleep control part configured to, when there is no task to be processed on a CPU, cause a thread running on the CPU to sleep.
With this configuration, it is possible to achieve high power saving by performing sleep when there is no processing. Further, by controlling the CPU operating frequency and the CPU idle state, higher power saving can be achieved.
An accelerator offload system 1000 is an accelerator offload system including an accelerator offload device that offloads specific processing of an application program to an accelerator. An accelerator offload device 100 is deployed in a user space 200. Hardware 10 including the accelerator (ACC 12) includes a ring buffer 13 that copies a workload to be processed. A request I/O part 150 and a response I/O part 160 exchange data with the accelerator via ring buffer 13.
With this configuration, the number of unnecessary memory copies in data communication between APL 1 and ACC 12 is reduced via ring buffer 13. This makes it possible to further reduce the processing time (achieve low latency) and improve the CPU utilization efficiency.
Note that, in the processing described in the above embodiments, all or some pieces of the processing described as those to be automatically performed may be manually performed, or all or some pieces of the processing described as those to be manually performed may be automatically performed by a known method. Further, processing procedures, control procedures, specific names, and information including various types of data and parameters described in the specification and the drawings can be freely changed unless otherwise specified.
The constituent elements of the devices illustrated in the drawings are functionally conceptual ones and are not necessarily physically configured as illustrated in the drawings. In other words, a specific form of distribution and integration of individual devices is not limited to the illustrated form, and all or part thereof can be functionally or physically distributed and integrated in any unit according to various loads, usage conditions, and the like.
Some or all of the configurations, functions, processing parts, processing means, and the like described above may be implemented by hardware by, for example, being designed in an integrated circuit. Each of the configurations, functions, and the like may be implemented by software for interpreting and executing a program for causing a processor to implement each function. Information such as a program, table, and file for implementing each function can be held in a recording device such as a memory, hard disk, or solid state drive (SSD) or a recording medium such as an integrated circuit (IC) card, secure digital (SD) card, or optical disc.
This is a National Stage Application of PCT Application No. PCT/JP2022/010422, filed on Mar. 9, 2022. The disclosure of the prior application is considered part of the disclosure of this application, and is incorporated in its entirety into this application.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2022/010422 | 3/9/2022 | WO |