The present invention relates to an on-server data transmission device, an on-server data transmission method, and an on-server data transmission program.
Against the background of advances in virtualization technology achieved through NFV (Network Functions Virtualization), systems are being constructed and operated on a per-service basis. In addition, a mode called SFC (Service Function Chaining) is becoming mainstream in which, building on this per-service mode of system construction, service functions are divided into reusable modules and operated in independent virtual machine environments (VM: Virtual Machine, container, etc.), so that the service functions can be used as needed, as if they were components, and operability is improved.
A hypervisor environment consisting of Linux (registered trademark) and a KVM (kernel-based virtual machine) is known as a technology for forming a virtual machine. In this environment, a Host OS (an OS installed on a physical server is called a “Host OS”) in which a KVM module is incorporated operates as a hypervisor in a memory area called kernel space, which is different from user spaces. In this environment, the virtual machine operates in a user space, and a Guest OS (an OS installed on a virtual machine is called a Guest OS) operates in the virtual machine.
Unlike the physical server in which the Host OS operates, in the virtual machine in which the Guest OS operates, all hardware (HW), including network devices (such as Ethernet (registered trademark) card devices), is controlled via registers; this register-based control is needed for interrupt processing from the HW to the Guest OS and for writing from the Guest OS to the hardware. In such register-based control, performance is generally lower than in the Host OS environment because the notifications and processing that would otherwise be executed by physical hardware are emulated in software.
To deal with this degraded performance, a technique has been developed that reduces HW emulation from the Guest OS, in particular toward the Host OS and external processes present outside the virtual machine of the Guest OS, and improves the performance and versatility of communication through a high-speed and consistent interface. As such a technique, a device abstraction technique called virtio, that is, a para-virtualization technique, has been developed, has already been applied to many general-purpose OSes such as Linux (registered trademark) and FreeBSD (registered trademark), and is currently in practical use (see Patent Literatures 1 and 2).
In virtio, regarding data input/output such as console input/output, file input/output, and network communication, data exchange using a queue designed with a ring buffer is defined as the queue operations of a unidirectional data transfer transport. With the use of the queue specification in virtio, the number and size of queues suitable for each device are prepared at the time of activation of the Guest OS. Thus, communication between the Guest OS and the outside of its own virtual machine can be performed simply through an operation using a queue, without execution of hardware emulation.
HW 10 includes a network interface card (NIC) 11 (physical NIC) (interface part), and performs communication for data transmission and reception with a data processing application (APL) 1 in a user space 60 via a virtual communication channel constructed by a Host OS 20, a KVM 30, which is a hypervisor that constructs virtual machines, virtual machines (VM 1, VM 2) 40, and a Guest OS 50. In the following description, as indicated by the thick arrows in
Host OS 20 includes a kernel 21, a Ring Buffer 22, and a Driver 23. Kernel 21 includes a vhost-net module 221A, which is a kernel thread, a TAP device 222A, and a virtual switch (br) 223A.
TAP device 222A is a kernel device of a virtual network and is supported by software. Virtual machine (VM 1) 40 is configured such that Guest OS 50 and Host OS 20 can communicate via virtual switch (br) 223A created in a virtual bridge. TAP device 222A is a device connected to a virtual NIC (vNIC) of Guest OS 50 created in this virtual bridge.
Host OS 20 copies the configuration information (sizes of shared buffer queues, number of queues, identifiers, information on start addresses for accessing the ring buffers, etc.) constructed in the virtual machine of Guest OS 50 to vhost-net module 221A, and constructs, inside Host OS 20, information on the endpoint on the virtual machine side. This vhost-net module 221A is a kernel-level back end for virtio networking, and can reduce virtualization overhead by moving virtio packet processing tasks from the user area (user space) to vhost-net module 221A of kernel 21.
Guest OSes 50 include a Guest OS (Guest 1) installed on the virtual machine (VM 1) and a Guest OS (Guest 2) installed on the virtual machine (VM 2), and Guest OSes 50 (Guest 1, Guest 2) operate in virtual machines (VM 1, VM 2) 40. Taking Guest 1 as an example of Guest OSes 50, Guest OS 50 (Guest 1) includes a kernel 51, a Ring Buffer 52, and a Driver 53, and Driver 53 includes a virtio-driver 531.
Specifically, as PCI (Peripheral Component Interconnect) devices, there are respective virtio devices for console input/output, file input/output, and network communication in the virtual machine (the device for the console, which is called virtio-console, the device for file input/output, which is called virtio-blk, and the device for the network, which is called virtio-net, and their corresponding drivers included in the OS are each defined with a virtio queue). When Guest OS starts up, two data transfer endpoints (transmission/reception endpoints) for each device are created between Guest OS and the counterpart side, and a parent-child relationship for data transmission and reception is constructed. In many cases, the parent-child relationship is formed between the virtual machine side (child side) and the Guest OS (parent side).
The child side exists as configuration information of each device in the virtual machine, and requests, from the parent side, the size of each data area, the number of combinations of needed endpoints, and the type of the device. In accordance with the request from the child side, the parent side allocates and maintains memory for a shared buffer queue for accumulating and transferring the needed amount of data, and sends the address of the memory as a response to the child side so that the child side can access it. The operations of the shared buffer queue necessary for data transfer are uniformly defined in virtio and are performed in a state where both the parent side and the child side have agreed on the definition. Furthermore, the size of the shared buffer queue has also been agreed on by both sides (i.e., it is determined for each device). As a result, it is possible to operate the queue shared by both the parent side and the child side by merely communicating the address to the child side.
Each shared buffer queue prepared in virtio is prepared for one direction; for example, a virtual network device called a virtio-net device is constituted by three Ring Buffers 52 for transmission, reception, and control. Communication between the parent and the child is realized by writing to the shared buffer queue and performing a buffer update notification. That is, after writing to Ring Buffer 52, a notification is made to the counterpart. Upon receipt of the notification, the counterpart side uses the common operations of virtio to check which shared buffer queue contains the new data and how much new data there is, and retrieves a new buffer area. As a result, transfer of data from the parent to the child or from the child to the parent is achieved.
As described above, by sharing Ring Buffer 52 for mutual data exchange and the operation method (used in common in virtio) for each ring buffer between the parent and the child, communication between Guest OS 50 and the outside, which does not require hardware emulation, is realized. This makes it possible to realize transmission and reception of data between Guest OS 50 and the outside at a high speed compared to the conventional hardware emulations.
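As an illustration of the queue operations described above, the following is a minimal single-producer, single-consumer ring sketch. It is a simplification for explanation only: it does not reproduce the actual virtio vring layout, and the notification to the counterpart is only indicated as a comment.

```c
#include <stdint.h>
#include <stdbool.h>

#define QUEUE_SIZE 256  /* size agreed on by parent and child at setup time */

/* Simplified shared queue: one direction only, like each virtio ring. */
struct shared_queue {
    void    *buf[QUEUE_SIZE]; /* descriptors pointing into shared memory  */
    uint32_t head;            /* written by the producer                  */
    uint32_t tail;            /* written by the consumer                  */
};

/* Producer side: write an entry, then notify the counterpart. */
static bool queue_put(struct shared_queue *q, void *data)
{
    if (q->head - q->tail == QUEUE_SIZE)
        return false;                     /* ring is full                  */
    q->buf[q->head % QUEUE_SIZE] = data;  /* write to the shared buffer    */
    q->head++;
    /* notify_peer(q); e.g. an eventfd kick, so the peer checks the ring  */
    return true;
}

/* Consumer side: called after the buffer update notification arrives. */
static void *queue_get(struct shared_queue *q)
{
    if (q->tail == q->head)
        return NULL;                      /* no new data                   */
    return q->buf[q->tail++ % QUEUE_SIZE];
}
```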
If Guest OS 50 in the virtual machine communicates with the outside, the child side needs to connect to the outside and transmit and receive data as a relay between the outside and the parent side. Communication between Guest OS 50 and Host OS 20 is one example. Here, if the outside is Host OS 20, two patterns are present as existing communication methods.
In the first method (hereinafter referred to as “external communication method 1”), a child-side endpoint is constructed in the virtual machine, and communication between Guest OS 50 and Host OS 20 is realized by connecting this endpoint, in the virtual machine, to a communication endpoint (usually called a “TAP/TUN device”) provided by Host OS 20. This connection is constructed as follows and thus realizes communication from Guest OS 50 to Host OS 20.
In this case, Guest OS 50 operates in a memory area that is a user space having privileges different from a memory area called kernel space, in which the TAP driver and Host OS 20 operate. For this reason, at least one memory copy occurs in the communication from Guest OS 50 to Host OS 20.
In the second method (hereinafter referred to as “external communication method 2”), a technology called vhost-net exists as means for solving this. According to vhost-net, parent-side configuration information (sizes of shared buffer queues, number of queues, identifiers, information on start addresses for accessing ring buffers, etc.) once constructed in the virtual machine is copied into the vhost-net module 221A inside Host OS 20, and information on the endpoints of the child side is constructed inside the host. Vhost-net is a technology that enables operations on shared buffer queues to be carried out directly between Guest OS 50 and Host OS 20 by this construction. As a result, the number of copy operations is substantially zero, and data transfer can be realized at a higher speed than with external communication method 1 because the number of copy operations is one less than with virtio-net.
In this manner, in the case of Host OS 20 and Guest OS 50 connected by virtio, packet transfer processing can be sped up by reducing the number of virtio-net related memory copy operations.
Note that in kernel v4.10 (February 2017-) and later, the specifications of the TAP interface have changed, and the processing of packets inserted from the TAP device is completed in the same context as the processing of copying the packets to the TAP device. Accordingly, software interrupts (softIRQ) no longer occur.
The method of connecting and coordinating virtual machines is called Inter-VM Communication, and in large-scale environments such as data centers, virtual switches have typically been used for connections between VMs. However, since this method involves a large communication delay, faster methods have been newly proposed. For example, a method using special hardware called SR-IOV (Single Root I/O Virtualization), a method implemented in software using Intel DPDK (Intel Data Plane Development Kit) (hereinafter referred to as DPDK), which is a high-speed packet processing library, and the like have been proposed (see Non-Patent Literature 1).
DPDK is a framework for controlling, in a user space, a network interface card (NIC) that was conventionally controlled by the Linux (registered trademark) kernel. The biggest difference from the processing in the Linux kernel is that DPDK has a polling-based reception mechanism called a poll mode driver (PMD). Normally, with the Linux kernel, an interrupt occurs upon arrival of data at the NIC, and this interrupt triggers the execution of reception processing. In a PMD, on the other hand, a dedicated thread continuously checks for arrival of data and performs reception processing. High-speed packet processing can be performed by eliminating the overhead of context switching, interrupts, and the like. DPDK significantly increases packet processing performance and throughput, making it possible to ensure more time for processing of data plane applications.
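As a minimal sketch of the PMD-style reception described above, the following busy-poll loop uses the DPDK receive API. Port and queue initialization (EAL setup, device configuration, and the like) is omitted, and the per-packet handling is left as a placeholder comment.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Busy-poll receive loop on one dedicated lcore: no interrupts and no
 * context switches; the thread asks the NIC for packets continuously. */
static void rx_poll_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Returns immediately with 0..BURST_SIZE received packets. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                          bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* process_packet(bufs[i]); application-specific handling */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```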
DPDK exclusively uses computer resources such as a CPU (Central Processing Unit) and an NIC. For this reason, it is difficult to apply it to an application, such as SFC, that flexibly reconnects in units of modules. There is SPP (Soft Patch Panel), which is an application for mitigating this. SPP omits packet copy operations in the virtualization layer by adopting a configuration in which shared memory is prepared between VMs and each VM can directly reference the same memory space. Also, DPDK is used to speed up exchanging packets between a physical NIC and the shared memory. In SPP, the input destination and output destination of a packet can be changed by software by controlling the reference destination for the memory exchange by each VM. Through this process, SPP realizes dynamic connection switching between VMs, and between a VM and a physical NIC (see Non-Patent Literature 2).
As illustrated in
Moreover, a data processing APL 1A includes a dpdk (PMD) 2, which is a function part that performs polling in the Guest OS 50 section. That is, data processing APL 1A is an APL obtained by modifying data processing APL 1 illustrated in
As an extension of DPDK, SPP performs packet transfer based on the polling model, rapidly exchanging packets between Host OS 20 and Guest OS 50 and between Guest OSes 50 via shared memory with zero-copy operation, and enables the routing to be operated using a GUI.
As illustrated in
OS 70 has a kernel 71, a ring buffer 72, and a driver 73, and kernel 71 has a protocol processor 74.
Kernel 71 has the function of the core part of OS 70 (e.g., Host OS). Kernel 71 monitors hardware and manages the execution status of programs on a per-process basis. Here, kernel 71 responds to requests from data processing APL 1 and conveys requests from HW 10 to data processing APL 1. In response to a request from data processing APL 1, kernel 71 performs processing via a system call (by which a “user program operating in a non-privileged mode” requests processing from the “kernel operating in a privileged mode”).
Kernel 71 transmits packets to data processing APL 1 via Socket 75, and receives packets from data processing APL 1 via Socket 75.
Ring buffer 72 is managed by kernel 71 and is in the memory space in the server. Ring buffer 72 is a constant-sized buffer that stores messages output by kernel 71 as logs, and is overwritten from the beginning when the messages exceed a maximum size.
Driver 73 is a device driver for monitoring hardware in kernel 71. Incidentally, driver 73 depends on kernel 71 and must be replaced if the source code of the created (built) kernel is modified. In that case, the corresponding driver source code is obtained and the driver is rebuilt on the OS that will use it.
Protocol processor 74 performs protocol processing of L2 (data link layer)/L3 (network layer)/L4 (transport layer), which are defined by the Open Systems Interconnection (OSI) reference model.
Socket 75 is an interface for kernel 71 to perform inter-process communication. Socket 75 has a socket buffer and does not frequently cause a data copying process. The flow up to the establishment of communication via Socket 75 is as follows. 1. The server side creates an acceptance socket file for accepting clients. 2. The server side names the acceptance socket file. 3. The server side creates a socket queue. 4. The server side accepts the first connection from a client in the socket queue. 5. The client side creates a socket file. 6. The client side sends a connection request to the server. 7. The server side creates a connection socket file separately from the acceptance socket file. As a result of establishing communication, data processing APL 1 becomes able to call system calls, such as read( ) and write( ), to kernel 71.
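The flow above can be pictured with standard POSIX calls; the following is a minimal sketch using a UNIX domain socket. Error handling is omitted and the socket path is a hypothetical example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define SOCK_PATH "/tmp/example.sock"   /* hypothetical path */

/* Server side of steps 1-4 and 7: create, name, listen, accept. */
static int server_setup(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

    int lfd = socket(AF_UNIX, SOCK_STREAM, 0);            /* 1. acceptance socket    */
    unlink(SOCK_PATH);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));    /* 2. name it              */
    listen(lfd, 16);                                      /* 3. socket queue         */
    int cfd = accept(lfd, NULL, NULL);                    /* 4./7. connection socket */
    return cfd;   /* read()/write() system calls are now possible on cfd */
}

/* Client side of steps 5-6: create a socket and request a connection. */
static int client_connect(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);             /* 5. client socket file   */
    connect(fd, (struct sockaddr *)&addr, sizeof(addr));  /* 6. connection request   */
    return fd;
}
```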
In the above configuration, kernel 71 receives a notification of a packet arrival from NIC 11 via a hardware interrupt (hardIRQ) and schedules a software interrupt (softIRQ) for packet processing.
The above-described New API (NAPI), implemented in Linux kernel 2.5/2.6 and later versions, processes an arrived packet by a software interrupt (softIRQ) after the hardware interrupt (hardIRQ). As illustrated in
An overview of Rx-side packet processing of NAPI will be described below.
As illustrated in
The components deployed in the networking layer include: softIRQ 83, which is a handler called due to the generation of a processing request from netif_rx 82 to perform the requested processing (software interrupt); and do_softirq 84, which is a control function part that performs the actual part of the software interrupt (softIRQ). The components deployed in the networking layer further include: net_rx_action 85, which is a packet processing function part that is executed upon reception of the software interrupt (softIRQ); a poll_list 86, in which information on a net device (net_device), indicative of which device the hardware interrupt from NIC 11 comes from, is registered; netif_receive_skb 87, which creates a sk_buff structure (structure for enabling the kernel 71 to know the structure of the packet); and a ring buffer 72.
The components deployed in the protocol layer include: ip_rcv 88, arp_rcv 89, and the like, which are packet processing function parts.
The above-described netif_rx 82, do_softirq 84, net_rx_action 85, netif_receive_skb 87, ip_rcv 88, and arp_rcv 89 are program components (function names) used for packet processing in kernel 71.
The arrows (reference signs) d to o in
A hardware function part 11a of NIC 11 (hereinafter referred to as “NIC 11”) is configured to, upon reception of a packet in a frame (or upon reception of a frame) from a remote device, copy the arrived packet to ring buffer 72 by a Direct Memory Access (DMA) transfer (see reference sign d in
However, kernel 71 cannot notice the arrived packet simply by NIC 11 copying the arrived packet to ring buffer 72. In view of this, when the packet arrives, NIC 11 raises a hardware interrupt (hardIRQ) to hardIRQ 81 (see reference sign e in
netif_rx 82 has a function of performing actual processing. When hardIRQ 81 (handler) has started execution (see reference sign f in
In this way, in the device driver illustrated in
With the above-described processing, the hardware interrupt processing in device driver illustrated in
netif_rx 82 passes up, to softIRQ 83 (handler) via a software interrupt (softIRQ) (see reference sign h in
do_softirq 84 is a software interrupt control function part that defines functions of the software interrupt (there are various types of packet processing; the interrupt processing is one of them; it defines the interrupt processing). Based on the definition, do_softirq 84 notifies net_rx_action 85, which performs actual software interrupt processing, of a request for processing the current (corresponding) software interrupt (see reference sign j in
When the softIRQ's turn comes, net_rx_action 85 calls, according to the net_device registered in poll_list 86 (see reference sign k in
Thereafter, net_rx_action 85 notifies netif_receive_skb 87 (see reference sign m in
netif_receive_skb 87 creates a sk_buff structure, analyzes the content of the packet, and assigns processing to the protocol processor 74 arranged in the subsequent stage (see
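The interplay of hardIRQ, poll_list, and netif_receive_skb 87 described above corresponds to the NAPI pattern that a device driver implements. The following is a schematic sketch only: struct my_dev and the my_hw_*/my_ring_* helpers are hypothetical, and exact kernel function signatures (for example, the arguments of the NAPI registration call) vary between kernel versions.

```c
#include <linux/netdevice.h>
#include <linux/interrupt.h>

/* Hypothetical driver context. */
struct my_dev {
    struct napi_struct napi;
    /* ... device registers, Rx ring state, etc. ... */
};

/* Hypothetical hardware helpers assumed to be provided elsewhere. */
void my_hw_disable_rx_irq(struct my_dev *priv);
void my_hw_enable_rx_irq(struct my_dev *priv);
bool my_ring_has_packet(struct my_dev *priv);
struct sk_buff *my_ring_to_skb(struct my_dev *priv);

/* Hardware interrupt handler: mask further Rx IRQs and hand off to NAPI,
 * which registers the device in poll_list and raises the softIRQ. */
static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct my_dev *priv = data;

    my_hw_disable_rx_irq(priv);
    napi_schedule(&priv->napi);     /* queue softIRQ-side polling */
    return IRQ_HANDLED;
}

/* softIRQ context: net_rx_action() calls this poll function, which drains
 * the ring buffer and hands sk_buffs to netif_receive_skb(). */
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_dev *priv = container_of(napi, struct my_dev, napi);
    int work = 0;

    while (work < budget && my_ring_has_packet(priv)) {
        netif_receive_skb(my_ring_to_skb(priv));
        work++;
    }
    if (work < budget) {            /* ring drained: back to interrupt mode */
        napi_complete_done(napi, work);
        my_hw_enable_rx_irq(priv);
    }
    return work;
}
```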
Non-Patent Literature 3 describes a server network delay control device (KBP: Kernel Busy Poll). KBP constantly monitors packet arrivals according to a polling model in a kernel. With this, softIRQs are suppressed, and low-latency packet processing is achieved.
As illustrated in
Next, a DPDK system is described.
The DPDK system includes HW 110, an OS 140, a DPDK 150, which is high-speed data transfer middleware deployed in a user space 160, and data processing APL 1.
Data processing APL 1 performs packet processing prior to execution of an APL.
HW 110 performs data transmission/reception communication with data processing APL 1. In the following description, as shown in
HW 110 includes accelerator 120 and NICs 130 (physical NICs) for connecting to communication networks.
Accelerator 120 is computing unit hardware that performs a specific operation at high speed based on an input from the CPU. Specifically, accelerator 120 is, for example, a graphics processing unit (GPU) or a programmable logic device (PLD) such as a field programmable gate array (FPGA). In
Part of the processing by data processing APL 1 is offloaded to accelerator 120, to achieve performance and power efficiency that cannot be achieved only by software (CPU processing).
There will be cases where accelerator 120 described above is applied in a large-scale server cluster such as a data center that implements network functions virtualization (NFV) or a software defined network (SDN).
NIC 130 is NIC hardware that forms a NW interface. NIC 130 includes an Rx queue 131 and a Tx queue 132 that hold data in first-in first-out list structures. NIC 130 is connected to a remote device 170 via, for example, a communication network and performs packet transmission/reception.
Note that NIC 130 may be a SmartNIC, which is an NIC equipped with an accelerator, for example. A SmartNIC is a NIC capable of offloading burdensome processing, such as IP packet processing that causes a decrease in processing capacity, to reduce the load of the CPU.
DPDK 150 is a framework for performing NIC control in user space 160, and specifically, is formed with high-speed data transfer middleware. DPDK 150 includes poll mode drivers (PMDs) 151 (drivers capable of selecting data arrival either in a polling mode or in an interrupt mode), which are each a polling-based reception mechanism. In each PMD 151, a dedicated thread continuously checks arrivals of data and performs reception processing.
DPDK 150 implements a packet processing function in user space 160 in which the APL operates, and, from user space 160, performs dequeuing immediately when a packet arrives in a polling model, to shorten the packet transfer delay. That is, as DPDK 150 performs dequeuing of packets with polling (busy-polling the queues by CPU), there is no wait, and the delay is short.
Patent Literature 1: JP 2015-197874 A
Patent Literature 2: JP 2018-32156 A
Non-Patent Literature 1: New API Intel, [online], [retrieved on Jul. 5, 2021], the Internet <http://lwn.net/2002/0321/a/napi-howto.php3>
Non-Patent Literature 2: “Resource Setting (NIC)-DPDK Primer, Vol. 6 (in Japanese) (Resource Setting (NIC)—Introduction to DPDK, Part 6)”, NTT TechnoCross, [online], [retrieved on Jul. 5, 2021], the Internet <https://www.ntt-tx.co.jp/column/dpdk_blog/190610/>
Non-Patent Literature 3: Kei Fujimoto, Kenichi Matsui, Masayuki Akutsu, “KBP: Kernel Enhancements for Low-Latency Networking without Application Customization in Virtual Server”, IEEE CCNC 2021.
However, the packet transfer based on the interrupt model and the packet transfer based on the polling model have the following problems.
In the interrupt model, the kernel that receives an event (hardware interrupt) from the HW performs packet transfer through software interrupt processing for packet processing. As the interrupt model transfers packets through interrupt (software interrupt) processing, there is a problem in that, when contention with other interrupts occurs and/or when the interrupt destination CPU is in use by a process with a higher priority, a wait occurs, and thus the delay in packet transfer increases. In this case, if the interrupt processing is congested, the wait delay further increases.
For example, as illustrated in
A supplemental description will be given of the mechanism by which a delay occurs in the interrupt model.
In a general kernel, in packet transfer processing, packet transfer processing is performed in software interrupt processing after hardware interrupt processing.
When a software interrupt for packet transfer processing occurs, the software interrupt processing cannot be executed immediately under the conditions (1) to (3) described below. Thus, a wait in the order of milliseconds occurs due to the interrupt processing being mediated and scheduled by a scheduler such as ksoftirqd (a kernel thread for each CPU; executed when the load of the software interrupt becomes high).
Under the above conditions, the software interrupt processing cannot be executed immediately.
In addition, a NW delay in the order of milliseconds also occurs in the same manner in the packet processing by New API (NAPI) due to a contention with an interrupt processing (softIRQ), as indicated in the dashed box p in
As described above, KBP is able to suppress softIRQs by constantly monitoring packet arrivals according to a polling model in the kernel, and thus is able to achieve low-latency packet processing.
However, as the kernel thread that constantly monitors packet arrivals occupies a CPU core and uses the CPU time all the time, there is a problem in that the power consumption increases. Referring now to
As illustrated in
DPDK has a problem similar to that of the KBP.
In DPDK, a kernel thread occupies the CPU core to perform polling (busy-polling the queues by the CPU). Therefore, even in the intermittent packet reception illustrated in
As described above, as DPDK embodies the polling model in a user space, no softIRQ contention occurs. As KBP embodies the polling model in the kernel, no softIRQ contention occurs. Thus, low-latency packet transfer is possible. However, both DPDK and KBP unnecessarily use CPU resources to constantly monitor packet arrivals, regardless of whether a packet has arrived. Therefore, there is a problem in that the power consumption increases.
The present invention has been made in view of such a background, and the present invention aims to lower the CPU usage rate to save power while maintaining low latency.
To solve the above problem, an on-server data transmission device performs, in a user space, data transfer control on an interface part. An OS includes: a kernel; a ring-structured buffer in a memory space in which a server deploys the OS; and a driver capable of selecting a data arrival from the interface part either in a polling mode or in an interrupt mode. The on-server data transmission device includes: a data transfer part configured to launch a thread that monitors a packet arrival using a polling model; and a sleep control manager configured to manage data arrival schedule information and deliver the data arrival schedule information to the data transfer part, to perform sleep control on the data transfer part, wherein the data transfer part is configured to put the thread into a sleep state based on the data arrival schedule information delivered from the sleep control manager and cause a timer to expire immediately before an arrival of data to perform cancellation of the sleep state, causing the thread to wake up.
According to the present invention, it is possible to aim for saving power by lowering the CPU usage rate, while maintaining low latency.
Hereinafter, an on-server data transmission system and the like in a mode for carrying out the present invention (hereinafter, referred to as “the present embodiment”) will be described with reference to the drawings.
As illustrated in
In user space 160, a data processing APL 1 and a data flow timeslot management scheduler 2 are further deployed. Data processing APL 1 is a program to be executed in user space 160. Data flow timeslot management scheduler 2 transmits (see reference sign q in
HW 110 performs data transmission/reception communication with data processing APL 1. The data flow in which data processing APL 1 receives a packet from HW 110 will be hereinafter referred to as Rx-side reception, and the data flow in which data processing APL 1 transmits a packet to HW 110 will be hereinafter referred to as Tx-side transmission.
HW 110 includes an accelerator 120 and NICs 130 (physical NICs) for connecting to communication networks.
Accelerator 120 is computing unit hardware such as a GPU or an FPGA. Accelerator 120 includes a plurality of cores (core processors) 121, Rx queues 122 and Tx queues 123 that hold data in first-in-first-out list structures.
Part of the processing by data processing APL 1 is offloaded to accelerator 120, to achieve performance and power efficiency that cannot be achieved only by software (CPU processing).
NIC 130 is NIC hardware that forms a NW interface. NIC 130 includes an Rx queue 131 and a Tx queue 132 that hold data in first-in first-out list structures. NIC 130 is connected to a remote device 170 via, for example, a communication network and performs packet transmission/reception.
OS 140 is, for example, Linux (registered trademark). OS 140 includes a high-resolution timer 141 that performs timer management in greater detail than a kernel timer. High-resolution timer 141 uses hrtimer of Linux (registered trademark), for example. In hrtimer, the time at which a callback occurs can be specified using a nanosecond-resolution type called ktime_t. High-resolution timer 141 communicates the data arrival timing at the specified time to sleep controller 221 of data transfer part 220 described later (see reference sign u in
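hrtimer itself is a kernel-internal API; as a user-space illustration of the same idea of waking a thread immediately before a scheduled arrival, the following sketch arms a one-shot high-resolution timer with timerfd. The lead-time handling is simplified (it assumes a lead time of less than one second) and error handling is omitted.

```c
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Arm a one-shot timer that expires 'lead_ns' nanoseconds before the
 * scheduled arrival time 'arrival' (CLOCK_MONOTONIC), then block until it
 * fires; the polling thread sleeps here instead of busy-waiting. */
static void sleep_until_just_before(struct timespec arrival, long lead_ns)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);

    struct itimerspec its = { 0 };
    its.it_value = arrival;
    its.it_value.tv_nsec -= lead_ns;              /* wake slightly early   */
    if (its.it_value.tv_nsec < 0) {
        its.it_value.tv_nsec += 1000000000L;
        its.it_value.tv_sec  -= 1;
    }

    timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL);

    uint64_t expirations;
    read(tfd, &expirations, sizeof(expirations)); /* blocks until expiry   */
    close(tfd);
    /* resume polling the Rx queue here */
}
```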
On-server data transmission device 200 is a DPDK for performing NIC control in user space 160, and specifically, is formed of high-speed data transfer middleware.
On-server data transmission device 200 includes sleep control manager 210 and data transfer parts 220.
Like a DPDK deployed in the user space 160, on-server data transmission device 200 includes PMDs 151 (drivers capable of selecting data arrival either in a polling mode or in an interrupt mode) (see
Sleep control manager 210 manages a data arrival schedule, and performs sleep control on each data transfer part 220 in accordance with the data arrival timing.
Sleep control manager 210 collectively performs sleep/activation timing control on data transfer parts 220 (see reference sign t in
Sleep control manager 210 manages data arrival schedule information and delivers the data arrival schedule information to data transfer parts 220, to perform sleep control on data transfer parts 220.
Sleep control manager 210 includes a data transfer part manager 211, a data arrival schedule manager 212, and a data arrival schedule delivery part 213.
Data transfer part manager 211 holds information such as the number and process IDs (PIDs: Process Identifications) of data transfer parts 220 as a list.
In response to a request from data arrival schedule delivery part 213, data transfer part manager 211 transmits information such as the number and process IDs of data transfer parts 220 to data transfer parts 220.
Data arrival schedule manager 212 manages the data arrival schedule. Data arrival schedule manager 212 retrieves (see reference sign r in
In a case where a change is made to the data arrival schedule information, data arrival schedule manager 212 receives a notification of the change in the data arrival schedule information from data flow timeslot management scheduler 2 to detect the change in the data arrival schedule information. Alternatively, data arrival schedule manager 212 performs the detection by snooping data including the data arrival schedule information (see
Data arrival schedule manager 212 transmits (see reference sign s in
Data arrival schedule delivery part 213 retrieves information such as the number of data transfer parts 220 and the process IDs of data transfer parts 220 from data transfer part manager 211.
Data arrival schedule delivery part 213 delivers the data arrival schedule information to each data transfer part 220 (see reference sign t in
Each data transfer part 220 launches a thread (polling thread) that monitors packet arrivals using a polling model.
Based on the data arrival schedule information delivered from sleep control manager 210, data transfer part 220 puts the thread into a sleep state and causes a timer to expire immediately before an arrival of data to perform cancellation of the sleep state, causing the thread to wake up. Here, data transfer part 220 cancels the sleep state of the thread with a hardware interrupt, in preparation for reception of a packet at a timing not intended by the timer. The sleep/cancellation will be described later in [Sleep/Cancellation].
Each data transfer part 220 includes a sleep controller 221, a data arrival monitoring part 222, an Rx data transfer part 223 (packet dequeuer), and a Tx data transfer part 224.
Data arrival monitoring part 222 and Rx data transfer part 223 are function parts on the Rx side, and Tx data transfer part 224 is a function part on the Tx side.
Based on the data arrival schedule information from sleep control manager 210, sleep controller 221 suspends the data arrival monitoring and performs sleep control to transition into a sleep state when there are no incoming data arrivals.
Sleep controller 221 holds the data arrival schedule information received from data arrival schedule delivery part 213.
Sleep controller 221 sets (see reference sign v in
As illustrated in
Data arrival monitoring part 222 is activated immediately before data arrives, in accordance with the data arrival schedule information managed by sleep controller 221.
Data arrival monitoring part 222 monitors Rx queue 122 or 131 of accelerator 120 or NIC 130, and checks whether data has arrived.
Data arrival monitoring part 222 occupies the CPU core regardless of whether data has arrived and monitors whether data has arrived by polling. If an interrupt model is used here, the delay mentioned in relation to the conventional technique illustrated in
In a case where data has arrived in Rx queue 122 or 131, data arrival monitoring part 222 performs a dequeuing operation on the queues stored in Rx queue 122 or 131 (referencing the content of a packet pooled in the buffer and, taking into account the processing to be performed next on the packet, removing the corresponding queue entry from the buffer), and transfers the data to Rx data transfer part 223.
Rx data transfer part 223 transfers the received data to data processing APL 1. Like Tx data transfer part 224, Rx data transfer part 223 operates only when data arrives, and accordingly, does not unnecessarily use the CPU.
Tx data transfer part 224 stores received data into Tx queue 123 of accelerator 120 or Tx queue 132 of NIC 130.
Tx data transfer part 224 is activated by inter-process communication when data processing APL 1 sends data and returns to a CPU idle state when the data transfer is completed. Accordingly, unlike data arrival monitoring part 222, Tx data transfer part 224 does not unnecessarily use the CPU.
Data transfer part 220 puts the thread into a sleep state based on the data arrival schedule information received from sleep controller 221, and cancels the sleep state upon being triggered by a timer.
Based on the scheduling information about the data arrival timing (data arrival schedule information), data transfer part 220 causes the timer to expire immediately before an arrival of data and wakes up the data arrival monitoring part thread of data transfer part 220. For example, using the Linux kernel standard function hrtimer, a hardware interrupt of the timer is activated when the timer expires, and data arrival monitoring part 222 wakes up the thread.
In a case where data arrives at an unscheduled timing, the thread of data arrival monitoring part 222 is in a sleep state, and there is no timer expiration, which would otherwise occur in the normal case. In view of this, a hardware interrupt notifying of the packet arrival is to be activated when a packet arrives.
As described above, as packets are constantly monitored in a polling mode during normal operation, the hardware interrupt is not necessary, and thus the hardware interrupt function is halted by the driver (PMD).
However, when causing the polling thread to sleep, to allow for a case where data arrives at an unscheduled time, the mode is changed so that a hardware interrupt is raised when a packet arrives. Thus, when a packet arrives, a hardware interrupt is raised, and data arrival monitoring part 222 can wake up the thread in the corresponding hardware interrupt handler.
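As an illustration of this mode change, the following sketch follows the pattern of DPDK's Rx-interrupt control APIs (as used, for example, in the l3fwd-power sample): the Rx interrupt is enabled only while the thread sleeps and is disabled again when polling resumes. Queue registration details and error handling are omitted, and the exact headers required may differ by DPDK version.

```c
#include <rte_ethdev.h>
#include <rte_interrupts.h>   /* rte_epoll_wait(), RTE_EPOLL_PER_THREAD */

/* Before sleeping, re-enable the Rx interrupt so that an unscheduled packet
 * can still wake the thread; while polling, the interrupt stays masked. */
static void wait_with_fallback_irq(uint16_t port_id, uint16_t queue_id,
                                   int timeout_ms)
{
    struct rte_epoll_event ev;

    /* Register the Rx queue interrupt with the per-thread epoll instance. */
    rte_eth_dev_rx_intr_ctl_q(port_id, queue_id, RTE_EPOLL_PER_THREAD,
                              RTE_INTR_EVENT_ADD, NULL);

    rte_eth_dev_rx_intr_enable(port_id, queue_id);             /* interrupt mode on */
    rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, timeout_ms);  /* sleep             */
    rte_eth_dev_rx_intr_disable(port_id, queue_id);            /* back to polling   */
}
```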
Examples of acquisition of data arrival schedule information in the on-server data transmission system according to this embodiment are now described.
Examples of data flows in which the data arrival schedule has been determined include signal processing in a radio access network (RAN). In the signal processing in a RAN, the MAC scheduler of a MAC 4 (described later) manages timing of arrival of data in time division multiplexing.
Signal processing with a virtual RAN (vRAN) or a virtual distributed unit (vDU) often uses DPDK for high-speed data transfer. Applying the method of the invention, the sleep control on the data transfer part (DPDK, PMD, and the like) is performed in accordance with the data arrival timing managed by the MAC scheduler.
Examples of the method of acquiring the data arrival timing managed by the MAC scheduler include <Acquisition of Data Arrival Schedule Information from MAC Scheduler> (direct acquisition from the MAC scheduler) (see
<Acquisition of Data Arrival Schedule Information from MAC Scheduler>
As illustrated in
As remote devices connected to NICs 130, a radio unit (RU) 171 is connected to the reception side of a NIC 130, and a vCU 172 is connected to the transmission side of a NIC 130.
Sleep control manager 210 of on-server data transmission system 1000A acquires (see reference sign z in
Although an example applied to a vDU system has been described, the application may be made not only to a vDU but also to a vRAN system such as a vCU.
As illustrated in
FAPI 6 is an interface (IF) that is specified by Small Cell Forum (SCF) and that connects PHY (high) 3 and MAC 4 and exchanges data schedule information (see reference sign aa in
Sleep control manager 210 of on-server data transmission system 1000B snoops FAPI 6 to acquire data arrival schedule information (see reference sign bb in
As illustrated in
Transmission device 173 is a transmission device defined by the O-RAN community.
MAC 4 and transmission device 173 in user space 160 are connected via a collaborative transport interface (CTI) 7. CTI 7 is an IF specified by the O-RAN community for exchanging data schedule information and the like with a transmission device (see reference sign cc in
Sleep control manager 210 of on-server data transmission system 1000C snoops CTI 7 to acquire data arrival schedule information (see reference sign dd in
In the description below, operations of an on-server data transmission system are described.
As the basic operations of on-server data transmission systems 1000 (see
Step S10 surrounded by the dashed line in
In step S10 [external factor], if a change has been made to the data arrival schedule information, data flow timeslot management scheduler 2 (see
In step S11, data arrival schedule manager 212 (see
In step S12, data arrival schedule manager 212 transmits the data arrival schedule information to data arrival schedule delivery part 213 (see
In step S13, data arrival schedule delivery part 213 of sleep control manager 210 retrieves, from data transfer part manager 211 (see
In step S14, data arrival schedule delivery part 213 delivers the data arrival schedule information to each data transfer part 220 (see
In step S20 [external factor], when an addition/removal of a data transfer part 220 (see
In step S21, data transfer part manager 211 of sleep control manager 210 holds, as a list, information such as the number and process IDs of data transfer parts 220.
In step S22, in response to a request from data arrival schedule delivery part 213, data transfer part manager 211 communicates information such as the number and process IDs of data transfer parts 220, and then finishes the processing in this flow.
Operations of sleep control manager 210 have been described. Next, an operation of data transfer part 220 is described.
In step S31, sleep controller 221 (see
Here, a constant difference may exist between the data arrival timing managed by sleep control manager 210 (see
In step S32, sleep controller 221 (see
Note that, at this point of time, high-resolution timer 141 (see
An operation of sleep controller 221 has been described. Next, <Rx side> and <Tx side> operations of data transfer part 220 are described. The present invention is characterized in that the <Rx side> and the <Tx side> differ in operation.
In step S41, data arrival monitoring part 222 (see
Here, when data is received from the accelerator 120 or NIC 130 (see
In step S42, data arrival monitoring part 222 monitors Rx queue 122 or 131 (see
In step S43, data arrival monitoring part 222 determines whether data has arrived at Rx queue 122 or 131.
If data has arrived at Rx queue 122 or 131 (S43: Yes), in step S44, data arrival monitoring part 222 performs dequeuing of the data (queue) stored in Rx queue 122 or 131 (referencing the content of a packet pooled in the buffer and, taking into account the processing to be performed next on the packet, removing the corresponding queue entry from the buffer), and transfers the data to Rx data transfer part 223 (see
If no data has arrived at Rx queue 122 or 131 (S43: No), the flow returns to step S42.
In step S45, Rx data transfer part 223 transfers the received data to data processing APL 1 (see
Like Tx data transfer part 224 (see
In step S46, when there are no data arrivals even after a certain period of time specified by the operator has elapsed, sleep control manager 210 (see
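The Rx-side steps S41 to S46 can be summarized by the following sketch, in which all helper functions and the idle limit are hypothetical stand-ins for data arrival monitoring part 222, Rx data transfer part 223, and sleep controller 221.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the components described above. */
bool    rx_queue_has_data(void);        /* data arrival monitoring (S42/S43) */
void   *rx_queue_dequeue(void);         /* dequeue from the Rx queue (S44)   */
void    deliver_to_apl(void *data);     /* Rx data transfer part (S45)       */
void    sleep_until_next_arrival(void); /* sleep controller + timer (S46)    */
int64_t now_ns(void);
#define IDLE_LIMIT_NS 1000000LL         /* operator-specified period (assumed) */

/* One cycle of the Rx-side thread: woken by the timer just before the
 * scheduled arrival, it polls, transfers data, and goes back to sleep once
 * no data has arrived for the configured period. */
static void rx_cycle(void)
{
    int64_t idle_since = now_ns();

    for (;;) {
        if (rx_queue_has_data()) {                 /* S43: Yes             */
            deliver_to_apl(rx_queue_dequeue());    /* S44 -> S45           */
            idle_since = now_ns();
        } else if (now_ns() - idle_since > IDLE_LIMIT_NS) {
            sleep_until_next_arrival();            /* S46: back to sleep   */
            idle_since = now_ns();
        }                                          /* else: keep polling   */
    }
}
```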
In step S50 [external factor], data processing APL 1 (see
In step S51, Tx data transfer part 224 of data transfer part 220 stores the received data into Tx queue 123 or 132 (see
Tx data transfer part 224 is activated with inter-process communication when data processing APL 1 sends data, and returns to a CPU idle state when the data transfer is completed. Accordingly, unlike data arrival monitoring part 222 on the <Rx side>, Tx data transfer part 224 does not unnecessarily use the CPU.
Operations of data transfer parts 220 have been described.
Next, a description will be given of measures to be taken in a case where a certain time difference exists between the data arrival schedule held by sleep control manager 210 and the actual data arrival schedule. This is a supplementary explanation of step S31 in
This embodiment is based on the assumption of a use case, such as RAN, where a data arrival schedule is determined in advance. As a RAN system (APL side) does not allow any data arrival whose time difference is not constant, such data arrivals are excluded from cases to be addressed.
In step S61, data arrival monitoring part 222 (see
In step S62, when a data arrival difference ΔT is observed multiple times in a row, data arrival monitoring part 222 (see
In step S63, having been informed that the data arrival schedule is ahead by ΔT, sleep controller 221 (see
In step S71, data arrival monitoring part 222 (see
In step S72, when data has already arrived at the start of polling multiple times in a row, data arrival monitoring part 222 communicates with sleep controller 221 (see
In step S73, having been informed that the data arrival schedule is to be advanced by ΔS, sleep controller 221 advances the data arrival schedule by ΔS, and finishes the processing in this flow. By repeatedly performing the time correction by ΔS, it is possible to correct the schedule in a case where a delay exists in the data arrival schedule by a certain period of time.
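The corrections of steps S61 to S73 can be summarized by the following sketch; the structure, helper names, and confirmation count are hypothetical, and the ΔT/ΔS handling is simplified to a single observed difference value.

```c
#include <stdint.h>

#define DRIFT_CONFIRM_COUNT 3   /* "multiple times in a row" (assumed value) */

struct arrival_schedule {
    int64_t next_arrival_ns;    /* absolute time of the next expected data  */
    int     late_streak;
    int     early_streak;
};

/* Hypothetical correction logic: if the observed arrival repeatedly lags
 * the schedule by dT, delay the schedule by dT (S61-S63); if data has
 * repeatedly already arrived when polling starts, advance it by dS
 * (S71-S73). */
static void correct_schedule(struct arrival_schedule *s,
                             int64_t observed_delta_ns, int data_waiting)
{
    if (data_waiting) {                       /* data was already there (S71) */
        s->early_streak++;
        s->late_streak = 0;
        if (s->early_streak >= DRIFT_CONFIRM_COUNT)
            s->next_arrival_ns -= observed_delta_ns;   /* advance by dS (S73) */
    } else if (observed_delta_ns > 0) {       /* data arrived late (S61)      */
        s->late_streak++;
        s->early_streak = 0;
        if (s->late_streak >= DRIFT_CONFIRM_COUNT)
            s->next_arrival_ns += observed_delta_ns;   /* delay by dT (S63)   */
    } else {
        s->late_streak = s->early_streak = 0;
    }
}
```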
As described above, in on-server data transmission system 1000, on-server data transmission device 200 is deployed in user space 160. Therefore, like DPDK, each data transfer part 220 of on-server data transmission device 200 is able to, while bypassing the kernel, reference a ring-structured buffer (when a packet arrives at accelerator 120 or NIC 130, the packet is stored by direct memory access (DMA) to the ring-structured buffer created in a memory space managed by DPDK). That is, on-server data transmission device 200 does not use a ring buffer (ring buffer 72) (see
Data transfer part 220 is able to instantly notice a packet arrival by the polling thread constantly monitoring the ring-structured buffer (mbuf: a ring-structured buffer to which PMD 151 copies data by DMA) created in the memory space managed by this DPDK (meaning a polling model instead of an interrupt model).
In addition to the features observed in user space 160, on-server data transmission device 200 has the following features regarding the method for waking up the polling thread.
That is, for a workload whose data arrival timing has been already determined, on-server data transmission device 200 wakes up the polling thread using a timer based on scheduling information on the data arrival timing (data arrival schedule information). Note that an on-server data transmission device 200B (see
Operations of on-server data transmission device 200 are supplementarily described.
In on-server data transmission device 200, the polling thread in user space 160 monitors a ring buffer prepared in a memory space by accelerator 120 or NIC 130 (see
The regular operation (polling mode) has been described. Next, an operation in an unexpected interrupt mode is described.
On-server data transmission device 200 changes the mode of the drivers (PMDs 151) so that a hardware interrupt (hardIRQ) can be raised from accelerator 120 or NIC 130 (see
In this manner, the drivers (PMDs 151) of on-server data transmission device 200 have two modes, which are the polling mode and the interrupt mode.
As illustrated in
Like on-server data transmission device 200 illustrated in
On-server data transmission device 200A includes sleep control manager 210 and data transfer parts 220A.
Each data transfer part 220A includes a CPU frequency/CPU idle controller 225 (CPU frequency controller or CPU idle controller) in addition to the components of data transfer part 220 illustrated in
CPU frequency/CPU idle controller 225 performs control to vary the CPU operating frequency and the CPU idle setting. Specifically, CPU frequency/CPU idle controller 225 of a polling thread (on-server data transmission device 200A) activated by a hardware interrupt handler sets a lower CPU operating frequency of the CPU core used by the polling thread compared to the frequency used during normal operation.
Here, the kernel is able to change the operating frequency of the CPU core through the governor setting. CPU frequency/CPU idle controller 225 is able to set a lower CPU operating frequency compared to the frequency used during normal operation, using the governor setting or the like. Note that the CPU idle setting depends on the type of the CPU, and in a case where the CPU idle setting has been activated for the CPU core, the CPU idle setting can be canceled.
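As one way to realize such control from user space, the following sketch writes to the standard Linux cpufreq sysfs files. Whether the userspace governor and a given frequency are available depends on the platform, so this is an illustrative assumption rather than the only possible implementation.

```c
#include <stdio.h>

/* Lower the operating frequency of one CPU core through the cpufreq sysfs
 * interface; the paths follow the standard Linux layout, but available
 * governors and frequencies depend on the platform. */
static int set_cpu_frequency_khz(int cpu, long freq_khz)
{
    char path[128];
    FILE *fp;

    /* The "userspace" governor allows an explicit frequency to be set. */
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    fprintf(fp, "userspace\n");
    fclose(fp);

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    fprintf(fp, "%ld\n", freq_khz);
    fclose(fp);
    return 0;
}
```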
In the description below, operations of on-server data transmission system 1000D are described.
In step S41, when data arrival monitoring part 222 (see
When sleep control manager 210 (see
In this way, in on-server data transmission device 200A, each data transfer part 220A includes CPU frequency/CPU idle controller 225 and sets a CPU frequency/CPU idle state in conjunction with the sleep control on data arrival monitoring part 222 to achieve further power saving.
Note that the processing of lowering the CPU frequency and the processing of putting data arrival monitoring part 222 into a sleep state may be performed at the same time. In addition, data arrival monitoring part 222 may be put into a sleep state after completion of the packet transfer processing is confirmed.
On-server data transmission devices 200 and 200A are each to be an on-server data transmission device that launches a thread that monitors packet arrivals using a polling model in a kernel, and the OS is not limited to any particular kind. Also, there is no limitation to being in a server virtualization environment. Therefore, on-server data transmission systems 1000 to 1000D can be applied to the respective configurations illustrated in
As illustrated in
Specifically, the server includes: Host OS 20, on which a virtual machine and an external process formed outside the virtual machine can operate; and Guest OS 70, which operates in the virtual machine.
Host OS 20 includes: a kernel 91; a ring buffer 22 (see
On the other hand, Guest OS 70 includes: a kernel 181; a driver 73; a ring buffer 52 (see
In on-server data transmission system 1000E, on-server data transmission devices 200 or 200A are deployed in user spaces 160. Therefore, like DPDK, each data transfer part 220 of on-server data transmission devices 200 or 200A is able to reference the ring-structured buffer while bypassing the kernel. That is, on-server data transmission devices 200 or 200A do not use a ring buffer (ring buffer 72) (see
Data transfer part 220 is able to reference the ring-structured buffer (ring buffer 72) (mbuf: a ring-structured buffer to which PMD 151 copies data by DMA) while bypassing the kernel and instantly notice a packet arrival (meaning a polling model instead of an interrupt model).
In this manner, reduction of the delay is aimed for in the system having the virtual server configuration of a VM by, while bypassing the kernel, performing a packet transfer in the polling mode with a low delay in both Host OS 20 and Guest OS 70 when an arrival of data is scheduled. Further, power saving is aimed for when no data arrival is scheduled, by suspending monitoring of data arrival and by transitioning to a sleep state. As a result, it is possible to achieve both low latency and power saving by performing sleep control with timer control taking the data arrival timing into account. Furthermore, packet transfer can be performed with a shorter delay in the server, without any modification to the APL.
As illustrated in
In a virtual server system such as a container, it is possible to achieve both low latency and power saving by performing sleep control with timer control taking the data arrival timing into account. Furthermore, packet transfer can be performed with a shorter delay in the server, without any modification to the APL.
The present invention can be applied to a system with a non-virtualized configuration, such as in a bare-metal configuration. In a system having a non-virtualized configuration, it is possible to achieve both low latency and power saving by performing sleep control with timer control taking the data arrival timing into account. Furthermore, packet transfer can be performed with a shorter delay in the server, without any modification to the APL.
The present invention makes it possible to scale out against a network load by increasing the number of CPUs allocated to packet arrival monitoring threads in conjunction with receive-side scaling (RSS), which is capable of processing inbound network traffic with multiple CPUs when the number of traffic flows increases.
The present invention can also be applied to, as an example of a network system in which the data arrival schedule is determined, a high-speed packet transfer processing function part of a network system in which the data arrival timing needs to be guaranteed as in a time aware shaper (TAS) in a time sensitive network (TSN). It is possible to achieve both low latency and power saving in the network system in which the data arrival schedule is determined.
In the cases of the first and second embodiments, on-server data transmission device 200 or 200A is deployed in user space 160. A third embodiment includes, in the kernel, an on-server data transmission device 200B, which deploys a polling thread in the kernel and performs sleep control, instead of on-server data transmission device 200 or 200A deployed in user space 160.
As illustrated in
Adopting a configuration in which on-server data transmission device 200B configured to perform sleep control is deployed inside kernel 71 eliminates the need for on-server data transmission device 200 or 200A to be deployed in user space 160 (this case includes a mode in which on-server data transmission device 200 or 200A is deployed in the on-server data transmission system with general-purpose operations being taken into account and on-server data transmission device 200 or 200A is not used in an adaptive manner). The reason why the need for on-server data transmission device 200 or 200A is eliminated is now described. That is, in cases where DPDK is not used, a software interrupt that causes a delay problem occurs only inside kernel 71, and data transfer to/from data processing APL 1 is performed using socket 75 without any interrupt. Thus, data can be transferred to data processing APL 1 at high speed even if on-server data transmission device 200 or 200A is not present in user space 160.
OS 70 includes: kernel 71; ring buffer 22 (see
As described above, in on-server data transmission device 200B, at least data transfer part 220 (see
Data transfer part 220 of on-server data transmission device 200B includes data arrival monitoring part 222 (see
Specifically, regarding data transfer part 220 of on-server data transmission device 200B, an OS (OS 70) includes: a kernel (Kernel 71); a ring buffer (ring buffer 72) that is managed by the kernel, in a memory space in which the server deploys the OS; and a poll list (poll_list 86) (see
As described above, data transfer part 220 of on-server data transmission device 200B includes: data arrival monitoring part 222 configured to monitor (poll) the poll list; Rx data transfer part (packet dequeuer) 223 configured to, when a packet has arrived, reference the packet held in the ring buffer, and perform, based on the processing to be performed next, dequeuing to remove a corresponding queue entry from the ring buffer; and sleep controller 221 configured to, when no packet has arrived over a predetermined period of time, put the thread (polling thread) into a sleep state and cancel the sleep state with a hardware interrupt (hardIRQ) for the thread (polling thread) when a packet arrives.
With this configuration, on-server data transmission device 200B halts the software interrupts (softIRQs) that perform packet processing, which are the main cause of the NW delay, and executes a thread in which data arrival monitoring part 222 of on-server data transmission device 200B monitors packet arrivals and Rx data transfer part (packet dequeuer) 223 performs packet processing according to a polling model (no softIRQ) upon arrival of packets. In a case where there is no packet arrival over a predetermined period of time, sleep controller 221 causes the thread (polling thread) to sleep, so that the thread (polling thread) is in a sleep state while no packet is arriving. Sleep controller 221 cancels the sleep by a hardware interrupt (hardIRQ) when a packet arrives.
As described above, on-server data transmission system 1000G includes on-server data transmission device 200B that generates the polling thread in the kernel, and data transfer part 220 of on-server data transmission device 200B wakes up the polling thread upon being triggered by a hardware interrupt from NIC 11. In particular, data transfer part 220 is characterized in waking up the polling thread with a timer in a case where the polling thread is generated in the kernel. With this configuration, on-server data transmission device 200B is able to achieve both low latency and power saving by performing sleep management on the polling thread that performs packet transfer processing.
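As a schematic of such an in-kernel polling thread, the following sketch uses a kernel thread that sleeps on a wait queue and is woken from the hardware interrupt side. The my_* helpers are hypothetical, and the sketch omits the predetermined-time logic for deciding when to enter sleep.

```c
#include <linux/kthread.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(rx_waitq);
static bool packet_pending;

/* Hypothetical helpers assumed to be provided by the driver. */
bool my_poll_list_not_empty(void);   /* check the poll list            */
void my_dequeue_and_process(void);   /* reference ring buffer, dequeue */

/* Hardware interrupt side: mark work and wake the sleeping polling thread. */
static void on_packet_hardirq(void)
{
    packet_pending = true;
    wake_up_interruptible(&rx_waitq);
}

/* Kernel polling thread: poll while packets keep arriving, sleep otherwise
 * (no softIRQ is involved in the data path). */
static int rx_poll_thread(void *arg)
{
    while (!kthread_should_stop()) {
        wait_event_interruptible(rx_waitq,
                                 packet_pending || kthread_should_stop());
        packet_pending = false;

        while (my_poll_list_not_empty())
            my_dequeue_and_process();
    }
    return 0;
}
```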
On-server data transmission device 200, 200A, and 200B according to the above-described embodiments are each embodied by, for example, a computer 900 having a configuration such as illustrated in
Computer 900 has a CPU 901, a ROM 902, a RAM 903, an HDD 904, a communication interface (I/F: Interface) 906, an input/output interface (I/F) 905, and a media interface (I/F) 907.
CPU 901 operates according to a program stored in ROM 902 or HDD 904, and controls components of on-server data transmission device 200, 200A, or 200B illustrated in
CPU 901 controls, via input/output I/F 905, an input device 910 such as a mouse or a keyboard and an output device 911 such as a display. CPU 901 acquires data from input device 910 via input/output I/F 905, and outputs generated data to output device 911. A GPU (Graphics Processing Unit) or the like may be used together with CPU 901 as a processor.
HDD 904 stores programs to be executed by CPU 901, data to be used by the programs, and the like. Communication interface 906 receives data from another device via a communication network (e.g., network (NW) 920), sends the received data to CPU 901, and transmits data generated by CPU 901 to another device via the communication network.
Media I/F 907 reads a program or data stored in a recording medium 912 and provides the read program or data to CPU 901 via RAM 903. CPU 901 loads a program related to target processing from recording medium 912 onto RAM 903 via media I/F 907 and executes the loaded program. Recording medium 912 is an optical recording medium such as a digital versatile disc (DVD) or a phase change rewritable disk (PD), a magneto-optical recording medium such as a magneto-optical disk (MO), a magnetic recording medium, a tape medium, a semiconductor memory, or the like.
For example, when computer 900 functions as on-server data transmission device 200, 200A, or 200B configured as one device according to the present embodiment, CPU 901 of computer 900 embodies the functions of on-server data transmission device 200, 200A, or 200B by executing the program loaded on RAM 903. Data in RAM 903 are stored in HDD 904. CPU 901 reads a program related to target processing from recording medium 912 and executes it. In addition, CPU 901 may read a program related to target processing from another device via a communication network (NW 920).
As described above, an on-server data transmission device (on-server data transmission device 200) performs, in a user space, data transfer control on an interface part (accelerator 120 or NIC 130). An OS (OS 70) includes: a kernel (kernel 171); a ring buffer (mbuf: a ring-structured buffer to which PMD 151 copies data by DMA) in a memory space in which a server deploys the OS; and a driver (PMD 151) capable of selecting data arrival from the interface part (accelerator 120 or NIC 130) either in a polling mode or in an interrupt mode. The on-server data transmission device includes: a data transfer part (data transfer part 220) configured to launch a thread (polling thread) that monitors a packet arrival using a polling model; and a sleep control manager (sleep control manager 210) configured to manage data arrival schedule information and deliver the data arrival schedule information to the data transfer part to perform sleep control on the data transfer part. The data transfer part is configured to, based on the data arrival schedule information delivered from the sleep control manager, put the thread into a sleep state and cause a timer to expire immediately before an arrival of data to perform cancellation of the sleep state, causing the thread to wake up.
With this configuration, in order to perform sleep control on a plurality of data transfer parts in accordance with the data arrival timing, sleep control manager 210 collectively performs sleep/activation timing control on each data transfer part 220. When data arrival is scheduled, the delay is reduced by bypassing the kernel and performing packet transfer in the low-delay polling mode. Further, when no data arrival is scheduled, power is saved by suspending the monitoring of data arrival and transitioning to a sleep state. As a result, it is possible to achieve both low latency and power saving by performing sleep control with timer control that takes the data arrival timing into account.
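As a user-space illustration only, the following sketch sleeps until a guard interval before the next scheduled arrival and then resumes polling; get_next_arrival_ns(), poll_burst(), and the 50 us guard interval are hypothetical stand-ins introduced here and are not part of the disclosed implementation.

/* Sketch of schedule-driven sleep control in a user-space data transfer part.
 * get_next_arrival_ns() stands in for the data arrival schedule information
 * delivered by sleep control manager 210; poll_burst() stands in for the
 * polling-mode packet transfer. */
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define GUARD_NS 50000ULL   /* wake roughly 50 us before the scheduled arrival */

extern uint64_t get_next_arrival_ns(void);  /* hypothetical schedule query */
extern void poll_burst(void);               /* hypothetical polling burst  */

static void sleep_until_next_arrival(uint64_t next_arrival_ns)
{
	struct timespec wake;
	uint64_t wake_ns = next_arrival_ns - GUARD_NS;

	wake.tv_sec  = wake_ns / 1000000000ULL;
	wake.tv_nsec = wake_ns % 1000000000ULL;

	/* Absolute sleep: the timer expires immediately before the arrival,
	 * cancelling the sleep state so that the thread resumes polling. */
	while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &wake, NULL) == EINTR)
		;   /* retry if interrupted by a signal */
}

static void transfer_loop(void)
{
	for (;;) {
		uint64_t next = get_next_arrival_ns(); /* from sleep control manager 210 */
		sleep_until_next_arrival(next);
		poll_burst();                          /* polling-mode packet transfer   */
	}
}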
On-server data transmission device 200 can reduce the data transfer delay in the server by using a polling model instead of an interrupt model. That is, in on-server data transmission device 200, as in DPDK, each data transfer part 220 deployed in user space 160 can reference the ring-structured buffer while bypassing the kernel. Further, because the polling thread constantly monitors this ring-structured buffer, a packet arrival can be detected immediately (that is, a polling model rather than an interrupt model).
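For reference only, such a kernel-bypass polling loop can be sketched with the DPDK receive API as follows; handle_packet() and the port/queue identifiers are assumptions introduced for illustration, and this is not the disclosed program.

/* Illustrative DPDK-style busy-poll loop that detects packet arrivals by
 * repeatedly checking the RX ring, bypassing the kernel. handle_packet()
 * and the port/queue identifiers are hypothetical. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

extern void handle_packet(struct rte_mbuf *m);   /* hypothetical processing */

static void busy_poll_rx(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *bufs[BURST_SIZE];

	for (;;) {
		/* Dequeue up to BURST_SIZE packets from the RX ring without interrupts. */
		uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);

		for (uint16_t i = 0; i < n; i++) {
			handle_packet(bufs[i]);
			rte_pktmbuf_free(bufs[i]);
		}
		/* n == 0 means "poll again"; this is where the sleep control described
		 * above would instead put the thread to sleep until the next schedule. */
	}
}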
Furthermore, for a data flow in which the data arrival timing is fixedly determined, such as a time division multiplexing data flow as in signal processing in vRAN, performing sleep control on data transfer part 220 taking the data arrival schedule into account makes it possible to lower the CPU usage rate while maintaining low latency and achieve power saving. That is, it is possible to achieve both low latency and power saving by performing sleep control with timer control taking the data arrival timing into account, addressing the problem of unnecessary use of CPU resources in a polling model.
Further, a Guest OS (Guest OS 70) that operates in a virtual machine includes: a kernel (kernel 171); a ring buffer (mbuf: a ring-structured buffer to which PMD 151 copies data by DMA) in a memory space in which a server deploys the Guest OS; a driver (PMD 151) capable of selecting data arrival from an interface part (accelerator 120 or NIC 130) either in a polling mode or in an interrupt mode; and a protocol processor (protocol processor 74) configured to perform protocol processing on a packet on which dequeuing has been performed. The on-server data transmission device includes: a data transfer part (data transfer part 220) configured to launch a thread (polling thread) that monitors a packet arrival using a polling model; and a sleep control manager (sleep control manager 210) configured to manage data arrival schedule information and deliver the data arrival schedule information to the data transfer part to perform sleep control on the data transfer part. The data transfer part is configured to, based on the data arrival schedule information delivered from the sleep control manager, put the thread into a sleep state and cause a timer to expire immediately before an arrival of data to perform cancellation of the sleep state, causing the thread to wake up.
With this configuration, in a system having a virtual server configuration of a VM, it is possible for the server including a Guest OS (Guest OS 70) to lower the CPU usage rate while maintaining low latency, achieving power saving.
Moreover, a Host OS (Host OS 20) on which a virtual machine and an external process formed outside the virtual machine can operate includes: a kernel (kernel 91); a ring buffer (mbuf: a ring-structured buffer to which PMD 151 copies data by DMA) in a memory space in which the server deploys the Host OS; a driver (PMD 151) capable of selecting data arrival from an interface part (accelerator 120 or NIC 130) either in a polling mode or in an interrupt mode; and a TAP device (TAP device 222A), which is a virtual interface created by the kernel (kernel 91). The on-server data transmission device includes: a data transfer part (data transfer part 220) configured to launch a thread (polling thread) that monitors a packet arrival using a polling model; and a sleep control manager (sleep control manager 210) configured to manage data arrival schedule information and deliver the data arrival schedule information to the data transfer part to perform sleep control on the data transfer part. The data transfer part is configured to, based on the data arrival schedule information delivered from the sleep control manager, put the thread into a sleep state and cause a timer to expire immediately before an arrival of data to perform cancellation of the sleep state, causing the thread to wake up.
With this configuration, in a system having a virtual server configuration of a VM, it is possible for the server including a kernel (kernel 91) and a Host OS (Host OS 20) to lower the CPU usage rate while maintaining low latency, achieving power saving.
Further, in an on-server data transmission device (on-server data transmission device 200B), an OS (OS 70) includes: a kernel (kernel 171); a ring buffer (ring buffer 72) that is managed by the kernel, in a memory space in which the server deploys the OS; and a poll list (poll_list 86), in which information on a net device, indicating from which device a hardware interrupt (hardIRQ) from an interface part (NIC 11) has come, is registered. The on-server data transmission device includes: in the kernel, a data transfer part (data transfer part 220) configured to launch a thread that monitors a packet arrival using a polling model; and a sleep control manager (sleep control manager 210) configured to manage data arrival schedule information and deliver the data arrival schedule information to the data transfer part to perform sleep control on the data transfer part. The data transfer part includes: a data arrival monitoring part (data arrival monitoring part 222) configured to monitor (poll) the poll list; a packet dequeuer (Rx data transfer part 223) configured to, when a packet has arrived, reference the packet held in the ring buffer, and perform, based on the processing to be performed next, dequeuing to remove a corresponding queue entry from the ring buffer; and a sleep controller (sleep controller 221) configured to, based on the data arrival schedule information received from the sleep control manager, put the thread (polling thread) into a sleep state, and perform sleep cancellation by a hardware interrupt (hardIRQ) when the sleep state is to be canceled.
With this configuration, on-server data transmission device 200B can reduce the data transfer delay in the server by using a polling model instead of an interrupt model. In particular, for a data flow in which the data arrival timing is fixedly determined, such as a time division multiplexing data flow as in signal processing in vRAN, performing sleep control on data transfer part 220 taking the data arrival schedule into account makes it possible to lower the CPU usage rate while maintaining low latency and achieve power saving. That is, it is possible to achieve both low latency and power saving by performing sleep control with timer control that takes the data arrival timing into account, addressing the problem of unnecessary use of CPU resources in a polling model.
Data transfer part 220 is configured to, based on the data arrival schedule information received from sleep control manager 210, put the thread (polling thread) into a sleep state, and perform sleep cancellation by a hardware interrupt (hardIRQ) when the sleep state is to be canceled.
Accordingly, in addition to the above-described effects, effects (1) and (2) are further achieved.
(1) Software interrupts (softIRQs) at the arrivals of packets, which are the cause of the occurrence of a delay, are halted and the polling model is embodied in the kernel (kernel 171). That is, on-server data transmission system 1000G embodies the polling model rather than the interrupt model, which is the main cause of the NW delay, unlike NAPI of the existing technique. As the packet is immediately dequeued without a wait at the arrival of the packet, low-latency packet processing can be performed.
(2) The polling thread in on-server data transmission device 200B operates as a kernel thread and monitors packet arrivals in a polling mode. The kernel thread (polling thread) that monitors packet arrivals sleeps while there is no packet arrival. In a case where there is no packet arrival, the CPU is not used during the sleep, and thus an effect of power saving is obtained.
When a packet arrives, the polling thread in a sleep state is awoken (the sleep is canceled) by the hardIRQ handler at the arrival of the packet. As the sleep is canceled by the hardIRQ handler, the polling thread can be promptly started while avoiding softIRQ contention. Here, the cancellation of the sleep is characterized in that the sleep is canceled not by a timer provided therein but by the hardIRQ handler. Note that, in a case where the traffic load is known in advance, such as a case where a 30 ms sleep is known in advance as in the workload transfer rate illustrated in
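For illustration only, and paired with the polling-thread sketch given earlier, such a hardIRQ handler could merely flag the arrival and wake the sleeping thread, without raising a softIRQ; add_device_to_poll_list() is a hypothetical stand-in for registering the net device in the poll list.

/* Hypothetical hardIRQ handler paired with the polling-thread sketch above.
 * It flags the arrival and wakes the sleeping polling thread; no softIRQ is
 * raised, so packet processing remains in the polling model. */
#include <linux/interrupt.h>

static irqreturn_t nic_rx_hardirq(int irq, void *dev_id)
{
	add_device_to_poll_list(dev_id);   /* hypothetical: register the net device  */
	atomic_set(&pkt_pending, 1);       /* shared with the polling thread         */
	wake_up_interruptible(&pollq);     /* cancel the sleep of the polling thread */
	return IRQ_HANDLED;
}
/* registered at driver initialization, e.g.:
 *   request_irq(irq, nic_rx_hardirq, 0, "onserver-rx", dev); */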
As described above, on-server data transmission device 200B is able to achieve both low latency and power saving by performing sleep management on the polling thread that performs packet transfer processing.
The on-server data transmission device (on-server data transmission device 200A) is characterized in including a CPU frequency setting part (CPU frequency/CPU idle controller 225) configured to lower a CPU operating frequency of a CPU core used by the thread while in the sleep state.
As described above, on-server data transmission device 200A dynamically varies the CPU operating frequency in a manner depending on the traffic. In other words, when not using the CPU due to the sleep state, the effect of power saving is further enhanced by lowering the CPU operating frequency while in the sleep state.
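One conceivable realization, given here only as a sketch and assuming the standard Linux cpufreq sysfs interface with the "userspace" governor (not a statement of the disclosed implementation), is to write a lower target frequency for the core before sleeping and restore it on wake-up.

/* Sketch of lowering the operating frequency of a given CPU core via the
 * Linux cpufreq sysfs interface; assumes the "userspace" governor, and the
 * core number and frequencies in the usage comment are illustrative. */
#include <stdio.h>

static int set_core_khz(int core, long khz)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	fprintf(f, "%ld\n", khz);
	return fclose(f);
}
/* e.g., before sleeping:  set_core_khz(2, 800000);    lower to 800 MHz
 *       after waking up:  set_core_khz(2, 2400000);   restore to 2.4 GHz */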
The on-server data transmission device (on-server data transmission device 200A) includes a CPU idle setting part (CPU frequency/CPU idle controller 225) configured to set, to a power-saving mode, the CPU idle state of the CPU core used by the thread while in the sleep state.
In this manner, on-server data transmission device 200A dynamically changes the CPU idle state (a power saving function depending on the type of CPU, such as changing the operating voltage) in accordance with the traffic, so that the power saving effect can be further enhanced.
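As one illustrative possibility (assuming the Linux cpuidle sysfs interface; the state index and core number are examples and not part of the disclosure), deeper idle states of the core can be enabled while the polling thread sleeps and disabled again while it busy-polls.

/* Sketch of switching the CPU idle state of a core via the Linux cpuidle
 * sysfs interface; writing 0 to "disable" allows the deeper (power-saving)
 * state, writing 1 forbids it. Core and state numbers are illustrative. */
#include <stdio.h>

static int set_idle_state_disabled(int core, int state, int disabled)
{
	char path[160];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/disable",
		 core, state);
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	fprintf(f, "%d\n", disabled);
	return fclose(f);
}
/* e.g., while the polling thread sleeps:  set_idle_state_disabled(2, 3, 0);
 *       while it is busy-polling:         set_idle_state_disabled(2, 3, 1); */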
Note that among the processing described regarding the above-described embodiments, all or some of the processing described as being automatically performed can also be manually performed, or all or some of the processing described as being manually performed can also be performed automatically using a known method. Also, the processing procedure, the control procedure, specific names, and information including various types of data and parameters, which have been described in the above-presented description and drawings can be changed as appropriate unless otherwise specified.
Also, each constituent element of the illustrated devices is a functional concept, and does not necessarily need to be physically configured as illustrated in the drawings. That is, the specific forms of the distribution and integration of the devices are not limited to those illustrated in the drawings, and all or some of the specific forms can be functionally or physically distributed or integrated in any unit according to various types of loads, usage conditions, and the like.
Also, the above configurations, functions, processing parts, processing means, and the like may be embodied by hardware by designing a part or all of them with, for example, an integrated circuit, or the like. Also, each of the above configurations, functions, and the like may be embodied by software for the processor to interpret and execute a program for realizing each function. Information such as programs, tables, and files that embody each function can be stored in a memory, a recording device such as a hard disk, or an SSD (Solid State Drive), or a recording medium such as an IC (Integrated Circuit) card, an SD (Secure Digital) card, or an optical disk.
This is a National Stage Application of PCT Application No. PCT/JP2021/027049, filed on Jul. 19, 2021. The disclosure of the prior application is considered part of the disclosure of this application, and is incorporated in its entirety into this application.