The present invention relates to a parallel and distributed computing system in which a plurality of computers including a processor including a translation lookaside buffer (TLB), a physical memory, and a network interface controller (NIC) directly accessible to the physical memory are interconnected via a data link.
As described in Non Patent Literature 1, the inventors of the present application have been developing a memory-based communication facility (MBCF), which is a communication and/or synchronization mechanism based on a memory-based communication and/or synchronization scheme. The MBCF does not require any special communication and/or synchronization hardware; it uses a stock network interface card (NIC) to implement high-speed, high-performance communication and/or synchronization through remote memory operations by software alone.
Specifically, the MBCF is configured using a computer including a processor including a translation lookaside buffer (TLB), a physical memory, and a network interface controller (NIC) directly accessible to the physical memory. For example, a process of a transmission source computer (hereinafter, a transmission-side process) transmits an operation request packet including an identifier of an operation target process (hereinafter, a reception-side process) that defines a process of a transmission destination computer, an operation target address that defines a memory area of the reception-side process, a data size to be written, and a data sequence. Then, the transmission destination computer receives the operation request packet transmitted by the transmission-side process, and stores the data sequence in the memory area defined by the reception-side process and the operation target address.
In a parallel and distributed computing system in which a plurality of computers are coupled via a data link, an MBCF is provided as a means for implementing remote memory operations. In a conventional MBCF, a transmission-side process can rewrite a memory in a reception-side process without synchronizing with the reception-side process, and this is a major factor in the high flexibility and high performance of communication by the MBCF. On the other hand, even if the memory of the reception-side process is rewritten by the transmission-side process, the reception-side process cannot recognize which memory has been changed. In a case where a change in a memory content requires another data change in the reception-side process, not knowing where the change was made can be a major disadvantage.
Therefore, the present invention has been made to solve the above problems, and a main object thereof is to provide a means that enables a reception-side process on which a remote memory operation has been performed to recognize the content of the memory operation and perform necessary processing at low cost in a parallel and distributed computing system in which a plurality of computers including a processor including a translation lookaside buffer (TLB), a physical memory, and a network interface controller (NIC) directly accessible to the physical memory are interconnected via a data link.
That is, a parallel and distributed computing system according to the present invention is a parallel and distributed computing system in which a plurality of computers including a processor including a translation lookaside buffer (TLB), a physical memory, and a network interface controller (NIC) directly accessible to the physical memory are interconnected via a data link, wherein a process of a transmission source computer (hereinafter, a transmission-side process) transmits an operation request packet including an identifier of an operation target process (hereinafter, a reception-side process) that defines a process of a transmission destination computer, an operation target address that defines a memory area of the reception-side process, a data size to be written, a data sequence, and a structure address that defines a history memory area for temporarily recording an operation content history in the reception-side process, and the transmission destination computer receives the operation request packet, stores the data sequence in the memory area defined by the reception-side process and the operation target address, and records the operation target address, the data size written, and information of the transmission-side process in the history memory area as operation content.
According to such a parallel and distributed computing system, in the parallel and distributed computing system in which a plurality of computers are interconnected via a data link, and the plurality of computers perform communication and/or synchronization by MBCF with each other, an operation request packet transmitted by a transmission-side process to a reception-side process includes a structure address that defines a history memory area for temporarily recording an operation content history in the reception-side process, and a transmission destination computer receives the operation request packet and records the operation target address, the data size written, and information of the transmission-side process in the history memory area as operation content. Therefore, it is possible to provide a means by which the reception-side process can recognize the content of the memory operation and perform necessary processing at low cost by referring to the history memory area.
It is conceivable that a computer in which the reception-side process exists activates an asynchronous user function to the reception-side process at a time point when the operation content history is accumulated in the history memory area.
It is conceivable that the transmission destination computer saves, in the history memory area, data before being overwritten in the memory area of the reception-side process.
It is conceivable that the transmission-side process transmits an operation request packet including an identifier of an operation target process (hereinafter, a reception-side process) that defines a process of a transmission destination computer, an operation target address that defines a table area on a memory of the reception-side process, information of a row and a column in a table, a data size to be written, a data sequence, and a structure address that defines a history memory area for temporarily recording an operation content history in the reception-side process, and
the transmission destination computer receives the operation request packet, reads data from an area defining the table area defined by the reception-side process and the operation target address, obtains a memory address corresponding to the row and the column defined in the operation request packet, stores the data sequence from the memory address, and records the operation target address, information of the row and the column, the data size written, and information of the transmission-side process in the history memory area as operation content.
It is conceivable that the transmission destination computer activates an asynchronous user function to the reception-side process at a time point when the operation content history is accumulated in the history memory area.
It is conceivable that the transmission destination computer saves, in the history memory area, data before being overwritten in the table area of the reception-side process.
According to the present invention configured as described above, it is possible to provide a means for notifying the operation content to the reception-side process at low cost in the remote memory write operation by the MBCF.
100 parallel and distributed computing system
Hereinafter, a parallel and distributed computing system 100 according to an embodiment of the present invention will be described with reference to the drawings.
As illustrated in
As illustrated in
The parallel and distributed computing system 100 does not require any special communication and/or synchronization hardware; it uses a stock network interface card (NIC) 23 to construct a memory-based communication facility (MBCF) that implements high-speed, high-performance communication and/or synchronization through remote memory operations by software alone. Specifically, the parallel and distributed computing system 100 constructs the MBCF by an operating system (OS) stored in a kernel space of each computer 2.
The parallel and distributed computing system 100 has variations of various operation commands such as a WRITE command (MBCF_WRITE) for performing remote memory writing and a READ command (MBCF_READ) for performing remote memory reading described below.
For example, a process of a transmission source computer 2 (2X) (hereinafter, a transmission-side process) transmits an operation request packet including an identifier of an operation target process (hereinafter, a reception-side process) that defines a process of a transmission destination computer 2 (2Y), an operation target address that defines a memory area of the reception-side process, a data size to be written, and a data sequence, and the transmission destination computer 2 receives the operation request packet and stores the data sequence in the memory area defined by the reception-side process and the operation target address (MBCF_WRITE).
In addition, the transmission-side process transmits an operation request packet including an identifier of an operation target process (hereinafter, a reception-side process) that defines a process of the transmission destination computer 2 (2Y), an operation target address that defines a memory area of the reception-side process, a data size to be read, and a data storage area address of the transmission-side process, and the transmission destination computer 2 (2Y) receives the operation request packet, reads a data sequence from the memory area defined by the reception-side process and the operation target address, and returns the data sequence to the data storage area of the transmission-side process (MBCF_READ).
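The two commands described above can be illustrated with a minimal sketch. It assumes a simplified model in which each reception-side process memory is a Python `bytearray`; the names `WriteRequest`, `ReadRequest`, `Node`, `handle_write`, and `handle_read` are hypothetical and are not part of the MBCF itself.

```python
from dataclasses import dataclass


@dataclass
class WriteRequest:
    target_task: int   # identifier of the reception-side process
    target_addr: int   # operation target address in that process
    data: bytes        # data sequence; its length is the data size


@dataclass
class ReadRequest:
    target_task: int   # identifier of the reception-side process
    target_addr: int   # operation target address in that process
    size: int          # data size to be read
    reply_addr: int    # data storage area address of the transmission-side process


class Node:
    """Hypothetical receiver: one bytearray per task stands in for process memory."""

    def __init__(self):
        self.memory = {}

    def add_task(self, task_id, size):
        self.memory[task_id] = bytearray(size)

    def handle_write(self, req):
        # MBCF_WRITE-like step: store the data sequence at the target address.
        mem = self.memory[req.target_task]
        end = req.target_addr + len(req.data)
        if end > len(mem):
            raise ValueError("write outside the target memory area")
        mem[req.target_addr:end] = req.data

    def handle_read(self, req):
        # MBCF_READ-like step: read the data and return it together with the
        # address where the transmission-side process wants it stored.
        mem = self.memory[req.target_task]
        end = req.target_addr + req.size
        if end > len(mem):
            raise ValueError("read outside the target memory area")
        return req.reply_addr, bytes(mem[req.target_addr:end])


node = Node()
node.add_task(5, 16)
node.handle_write(WriteRequest(target_task=5, target_addr=4, data=b"abcd"))
reply_addr, payload = node.handle_read(
    ReadRequest(target_task=5, target_addr=4, size=4, reply_addr=0x100)
)
```

In this sketch the reception-side process takes no part in either operation, which mirrors the point that the transmission-side process does not synchronize with it.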
Here, the procedure of MBCF_WRITE will be described in detail with reference to
Next, with reference to
In the transmission-side process (request side task), parameters including an identifier [Ltask1] of the reception-side process (request destination task), an operation target address [Laddr1] of the reception-side process, an access key [AccessKey] for memory space operation of the reception-side process, a command type [MBCF_WRITE] of the MBCF, a data size [n] for performing remote writing, and a pointer [Laddr0] to the head of an area storing data to be written are prepared. Then, the MBCF request transmission system call is called with these parameters. Upon receiving the system call, the OS refers to the task table of the transmission-side process and converts the logical task ID indicating the reception-side process into a physical task ID [(Pnode2, Ptask5)]. Since the physical task ID includes Pnode2 which is a physical node ID, route information (delivery destination information) to the reception-side node can be set from this information. If the network to be used is Ethernet, the MAC address is used as the delivery destination information. This delivery destination information enables the NIC to deliver the operation request packet to the reception-side node. Then, the OS causes the NIC to transmit the operation request packet.
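The transmission-side steps above can be sketched as follows. This is only an illustration under stated assumptions: the dictionaries `TASK_TABLE` and `ROUTE_TABLE`, the function `build_request`, the packet field names, the MAC address, and the access key value are all hypothetical; only the labels Ltask1, Laddr1, Pnode2, Ptask5, and MBCF_WRITE come from the description.

```python
# Hypothetical task table of the transmission-side process:
# logical task ID -> physical task ID (physical node ID, physical task ID).
TASK_TABLE = {"Ltask1": ("Pnode2", "Ptask5")}

# Hypothetical routing information: physical node ID -> Ethernet MAC address.
# The address is made up (a locally administered address) for illustration.
ROUTE_TABLE = {"Pnode2": "02:00:00:00:00:02"}


def build_request(ltask, laddr, access_key, command, data):
    """Build an operation request packet as the OS might on the system call."""
    pnode, ptask = TASK_TABLE[ltask]       # logical -> physical task ID
    return {
        "dest_mac": ROUTE_TABLE[pnode],    # delivery destination information
        "ptask": ptask,                    # reception-side process on that node
        "addr": laddr,                     # operation target address
        "access_key": access_key,          # key for the memory space operation
        "command": command,                # MBCF command type
        "size": len(data),                 # data size n
        "data": bytes(data),               # data copied from the source area
    }


packet = build_request("Ltask1", 0x2000, 0x1234, "MBCF_WRITE", b"hello")
```

The packet is represented as a dictionary purely for readability; an actual implementation would presumably use a fixed binary layout that the NIC can transmit directly.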
Next, with reference to
Next, the operation of the command (MBCF_FIFO) for registering data to the FIFO queue in the parallel and distributed computing system 100 will be described.
In MBCF_FIFO, an operation of a transmission-side process (request side task) is almost the same as that of MBCF_WRITE. The difference is that MBCF_WRITE of the command is replaced with MBCF_FIFO, and a destination indicated in the reception-side process (reception-side task) by the destination operation target address [Laddr1] is not an area for storing data but a FIFO structure in which a plurality of pointers defining a FIFO queue are stored. Therefore, description of the operation of the transmission-side process (request side task) of MBCF_FIFO will be omitted.
When the buffer defined by the FIFO structure is almost full and not all the data carried by the MBCF_FIFO operation request packet can be stored, the operation of the MBCF reception routine at the request destination is canceled. More specifically, since none of the pointers of the FIFO structure are updated, the FIFO queue remains in the same state as if no data had been added to it.
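The all-or-nothing behavior described above can be sketched with a small ring buffer. This is a minimal illustration, assuming a byte-oriented queue with head and tail pointers; the class `FifoQueue` and its methods `put` and `get` are hypothetical names, not the actual FIFO structure of the MBCF.

```python
class FifoQueue:
    """Hypothetical byte FIFO backed by a fixed-size ring buffer."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.head = 0   # next position to read
        self.tail = 0   # next position to write
        self.used = 0   # number of bytes currently stored

    def put(self, data):
        # If the whole payload does not fit, cancel the operation:
        # no pointer is updated, so the queue looks untouched.
        if len(data) > len(self.buf) - self.used:
            return False
        for b in data:
            self.buf[self.tail] = b
            self.tail = (self.tail + 1) % len(self.buf)
        self.used += len(data)
        return True

    def get(self, n):
        # Read up to n bytes in arrival order.
        n = min(n, self.used)
        out = bytearray()
        for _ in range(n):
            out.append(self.buf[self.head])
            self.head = (self.head + 1) % len(self.buf)
        self.used -= n
        return bytes(out)


q = FifoQueue(8)
accepted = q.put(b"abcdef")                 # fits: 6 of 8 bytes used
state_before = (q.head, q.tail, q.used)
rejected = q.put(b"xyz")                    # needs 3 bytes, only 2 free
state_after = (q.head, q.tail, q.used)
```

Checking the free space before touching any pointer is what keeps a rejected request from leaving partial data in the queue.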
Then, in the parallel and distributed computing system 100 of the present embodiment, a command (MBCF_WRITE_wLOG) in which the reception-side process records the operation history of the transmission-side process is implemented.
Specifically, the transmission-side process transmits an operation request packet (MBCF_WRITE_wLOG operation request packet) including an identifier of a reception-side process that defines a process of the transmission destination computer 2 (2Y), an operation target address that defines a memory area of the reception-side process, a data size to be written, a data sequence, and a structure address that defines a history memory area for temporarily recording an operation content history in the reception-side process.
Then, the transmission destination computer 2 (2Y) receives the MBCF_WRITE_wLOG operation request packet and stores the data sequence in the memory area defined by the reception-side process and the operation target address. In addition, the transmission destination computer 2 (2Y) records the operation target address, the written data size, and the information of the transmission-side process in the history memory area as the operation content.
Here, the FIFO queue of the MBCF is used as the history memory area for temporarily recording the operation history. The operation of the FIFO queue is defined and implemented by MBCF_FIFO (described above) or MBCF_FIFO_READ. However, what the transmission destination computer 2 (2Y) registers in the FIFO queue serving as the history memory area is not the data carried in the operation request packet but the write operation content of the MBCF_WRITE_wLOG operation request packet, such as the identifier of the transmission-side process, the operation target address, and the write data size. Repurposing the FIFO queue in this way (registering the operation history instead of the data) saves the time and effort of developing a new function and reduces the amount of code.
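The reception-side handling of MBCF_WRITE_wLOG can be sketched as a write followed by the registration of a log record in a FIFO. This is a simplified illustration: the names `LogRecord` and `write_with_log` are hypothetical, the history is modeled with an unbounded `collections.deque`, and the behavior when the history FIFO is full is not modeled because the description does not specify it.

```python
from collections import deque, namedtuple

# Hypothetical log record: who wrote, where, and how much.
LogRecord = namedtuple("LogRecord", "sender addr size")


def write_with_log(memory, history, sender, addr, data):
    """Store the data, then register the operation content (not the data)."""
    end = addr + len(data)
    if end > len(memory):
        raise ValueError("write outside the target memory area")
    memory[addr:end] = data
    history.append(LogRecord(sender, addr, len(data)))


memory = bytearray(32)
history = deque()                      # stands in for the history FIFO queue
write_with_log(memory, history, "Ptask1", 8, b"\x01\x02\x03\x04")
write_with_log(memory, history, "Ptask7", 20, b"\xff")

# The reception-side process later drains the history in arrival order.
first = history.popleft()
```

Only the sender, address, and size are queued, so each history entry stays small regardless of how much data was written.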
In addition, when a certain amount or more of operation content history has accumulated in the history memory area, the transmission destination computer 2 (2Y) in which the reception-side process exists activates, for the reception-side process, an asynchronous user function that checks the content of the history memory area and performs the necessary operation. The transmission destination computer 2 (2Y) is also configured to save, in the history memory area, the data in the memory area of the reception-side process before it is overwritten, so that the reception-side process can roll back remote writes and perform various statistical calculations, including summations, at low cost.
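The threshold-triggered user function and the rollback enabled by saving the pre-overwrite data can be sketched as follows. This is an illustration under stated assumptions: the class `LoggedMemory` and its members are hypothetical, and the user function is called synchronously as a plain callback, which simplifies the asynchronous activation described above.

```python
class LoggedMemory:
    """Hypothetical receiver memory that logs each write with the old data."""

    def __init__(self, size, threshold, on_threshold):
        self.mem = bytearray(size)
        self.history = []                 # stands in for the history memory area
        self.threshold = threshold        # "certain amount" of accumulated entries
        self.on_threshold = on_threshold  # user function (called synchronously here)

    def write(self, sender, addr, data):
        end = addr + len(data)
        if end > len(self.mem):
            raise ValueError("write outside the target memory area")
        old = bytes(self.mem[addr:end])   # save the data before overwriting
        self.mem[addr:end] = data
        self.history.append((sender, addr, len(data), old))
        if len(self.history) >= self.threshold:
            self.on_threshold(self)

    def rollback(self):
        # Undo the logged writes in reverse order using the saved old data.
        while self.history:
            _sender, addr, size, old = self.history.pop()
            self.mem[addr:addr + size] = old


calls = []
lm = LoggedMemory(size=8, threshold=2,
                  on_threshold=lambda m: calls.append(len(m.history)))
lm.write("A", 0, b"xx")
lm.write("B", 1, b"yy")        # second entry reaches the threshold
after_writes = bytes(lm.mem)
lm.rollback()
after_rollback = bytes(lm.mem)
```

Undoing the entries newest first matters when writes overlap, as they do in this example, because each saved value reflects the memory state just before its own write.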
The usefulness of leaving a log of the write operation in the history memory will be specifically described below. Assume that a large data area exists in the reception-side process, and the statistical data is calculated in the reception-side process. Assume that the data area is in a tabular form, a numerical value is stored in each item, a reception-side process calculates a sum and an average of columns and a sum and an average of rows, and the sum and the average are stored as statistical data in another area. In a situation where there is no write operation log, since the reception-side process does not know which data has been updated, it is necessary to recalculate the sum and the average for all columns and all rows. On the other hand, if the address and size of the updated data are found from the history memory, it is possible to determine which row or which column needs to be recalculated from the information. Therefore, it is possible to limit the rows and columns to be recalculated to only where the update is made, and it is possible to significantly reduce the processing cost of the reception-side process. When the data before the update is left in the history memory, the sum of the rows and the columns can be calculated by subtracting the numerical value before the update from the sum before the recalculation and adding a new numerical value. Therefore, the sum of the rows and the columns can be obtained without even accessing the entire data of the rows or columns, and the average value can also be calculated.
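How a logged address and size identify the rows and columns to recalculate can be sketched as follows. The sketch assumes a row-major table of fixed-size items starting at a known base address; these layout assumptions and the function names `touched_cells` and `dirty_rows_and_cols` are hypothetical and are not stated in the description.

```python
def touched_cells(addr, size, base, n_cols, item_size):
    """Return the (row, column) pairs covered by a write of `size` bytes at `addr`."""
    first = (addr - base) // item_size
    last = (addr + size - 1 - base) // item_size
    return [divmod(index, n_cols) for index in range(first, last + 1)]


def dirty_rows_and_cols(log, base, n_cols, item_size):
    """Collect the rows and columns whose statistics need recalculation."""
    rows, cols = set(), set()
    for addr, size in log:
        for row, col in touched_cells(addr, size, base, n_cols, item_size):
            rows.add(row)
            cols.add(col)
    return rows, cols


# Hypothetical table: base address 1000, 4 columns, 8-byte items.
log = [
    (1048, 8),    # item index 6 -> row 1, column 2
    (1024, 16),   # item indices 3 and 4 -> (row 0, column 3) and (row 1, column 0)
]
rows, cols = dirty_rows_and_cols(log, base=1000, n_cols=4, item_size=8)
```

Only the rows and columns returned here need their sums and averages recomputed; all others are known to be unchanged.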
In addition, in the parallel and distributed computing system 100 of the present embodiment, not only a tabular write command (MBCF_TABLE) that does not leave an operation history but also a tabular write command (MBCF_TABLE_wLOG) that leaves an operation history is implemented. These commands are remote write commands that do not directly specify a remote write address, but specify an area in which information defining a table in a reception-side process is stored as an operation target address, and specify a data storage location by information of a row and a column in the table. Note that MBCF_TABLE is only different from MBCF_TABLE_wLOG in not leaving an operation history, and MBCF_TABLE_wLOG will be described below.
In MBCF_TABLE_wLOG, a transmission-side process transmits an MBCF_TABLE_wLOG operation request packet including an identifier of a reception-side process that defines a process of a transmission destination computer 2 (2Y), an operation target address that defines a table area on a memory of the reception-side process, information of a row and a column in the table, a data size to be written, a data sequence, and a structure address that defines a history memory area for temporarily recording an operation content history in the reception-side process.
Then, the transmission destination computer 2 (2Y) receives the MBCF_TABLE_wLOG operation request packet, reads data from an area defining the table area specified by the reception-side process and the operation target address, obtains a memory address corresponding to the row and the column specified by the transmission-side process and stored in the operation request packet, and stores the data sequence from the address. In addition, the transmission destination computer 2 (2Y) records the operation target address, the information of the row and the column, the written data size, and the information of the transmission-side process in the history memory area as the operation content.
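The address resolution performed for MBCF_TABLE_wLOG can be sketched as follows. The description does not specify how the table-defining information is laid out, so this sketch assumes a hypothetical descriptor of three little-endian 32-bit integers (base offset, number of columns, item size) and a row-major table; `DESC` and `table_write_with_log` are illustrative names only.

```python
import struct

# Hypothetical table descriptor stored at the operation target address:
# base offset of the table, number of columns, size of one item in bytes.
DESC = struct.Struct("<III")


def table_write_with_log(memory, history, sender, desc_addr, row, col, data):
    """Resolve (row, col) through the descriptor, write, and log the operation."""
    base, n_cols, item_size = DESC.unpack_from(memory, desc_addr)
    addr = base + (row * n_cols + col) * item_size
    end = addr + len(data)
    if end > len(memory):
        raise ValueError("write outside the target memory area")
    memory[addr:end] = data
    # Log the operation target address, row and column, size, and sender.
    history.append((sender, desc_addr, row, col, len(data)))
    return addr


memory = bytearray(256)
history = []
DESC.pack_into(memory, 0, 64, 4, 8)   # table at offset 64, 4 columns, 8-byte items
addr = table_write_with_log(memory, history, "Ptask1", 0, 2, 1,
                            struct.pack("<q", 42))
stored = struct.unpack_from("<q", memory, addr)[0]
```

Because the transmission-side process supplies only a row and a column, the reception side remains free to change the table's location or layout by updating the descriptor.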
Here, the FIFO queue of the MBCF is used as the history memory area for temporarily recording the operation history. The operation of the FIFO queue is defined and implemented by MBCF_FIFO or MBCF_FIFO_READ described above. However, what the transmission destination computer 2 (2Y) registers in the FIFO queue serving as the history memory area is not the data carried in the operation request packet but the write operation content of the MBCF_TABLE_wLOG operation request packet, such as the identifier of the transmission-side process, the operation target address, and the write data size. Repurposing the FIFO queue in this way (registering the operation history instead of the data) saves the time and effort of developing a new function and reduces the amount of code.
In addition, when a certain amount or more of operation content history has accumulated in the history memory area, the transmission destination computer 2 (2Y) in which the reception-side process exists activates, for the reception-side process, an asynchronous user function that checks the content of the history memory area and performs the necessary operation. The transmission destination computer 2 (2Y) is also configured to save, in the history memory area, the data in the memory area of the reception-side process before it is overwritten, so that the reception-side process can roll back remote writes and perform various statistical calculations, including summations, at low cost.
The usefulness of leaving a log of the write operation in the history memory will be specifically described below. Assume that a large tabular data area exists in the reception-side process and the statistical data is calculated in the reception-side process. Assume that a numerical value is stored in each item of the table, and the reception-side process calculates a total sum and an average of columns and a total sum and an average of rows and stores them as statistical data in another area. In a situation where there is no write operation log, since the reception-side process does not know which data has been updated, it is necessary to recalculate the sum and the average for all columns and all rows. On the other hand, if the row and column of the updated data are found from the history memory, it is possible to limit the rows and columns to be recalculated to only where the update is made, and it is possible to significantly reduce the processing cost of the reception-side process. When the data before the update is left in the history memory, the sum of the rows and the columns can be calculated by subtracting the numerical value before the update from the sum before the recalculation and adding a new numerical value. Therefore, the sum of the rows and the columns can be obtained without even accessing the entire data of the rows or columns, and the average value can also be calculated.
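The incremental update described above, in which the value before the update is subtracted from the existing sum and the new value is added, can be sketched as follows. The function names `apply_update` and `averages` and the sample table are hypothetical.

```python
def apply_update(row_sums, col_sums, row, col, old, new):
    """Adjust the affected row and column sums without rereading the table."""
    delta = new - old          # subtract the old value and add the new one
    row_sums[row] += delta
    col_sums[col] += delta


def averages(sums, count):
    """Derive averages from sums; `count` is the number of items per sum."""
    return [s / count for s in sums]


# Hypothetical 2 x 3 table and its precomputed statistics.
table = [[1, 2, 3],
         [4, 5, 6]]
row_sums = [sum(r) for r in table]            # [6, 15]
col_sums = [sum(c) for c in zip(*table)]      # [5, 7, 9]

# A history entry reports that (row 1, column 2) changed from 6 to 10.
apply_update(row_sums, col_sums, row=1, col=2, old=6, new=10)
col_avgs = averages(col_sums, count=len(table))
```

Each logged write costs a constant number of additions in this scheme, independent of the table size, which is the saving the description attributes to keeping the pre-update data in the history memory.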
According to the parallel and distributed computing system 100 of the present embodiment configured as described above, in the parallel and distributed computing system 100 in which the plurality of computers 2 are interconnected via the data link 3, and the plurality of computers 2 perform communication and/or synchronization by MBCF with each other, the operation request packet transmitted by the transmission-side process to the reception-side process includes the structure address that defines the history memory area for temporarily recording the operation content history in the reception-side process, and the transmission destination computer receives the operation request packet and records the operation target address, the written data size, and the information of the transmission-side process in the history memory area as the operation content. Therefore, the reception-side process can recognize the content of the memory operation and perform necessary processing at low cost.
In addition, the present invention is not limited to the above embodiment, and it goes without saying that various modifications can be made without departing from the gist of the present invention.
According to the present invention, it is possible to provide a means that enables a reception-side process on which a remote memory operation has been performed by MBCF to recognize the content of the memory operation and perform necessary processing at low cost.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2020-129496 | Jul 2020 | JP | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2021/022269 | 6/11/2021 | WO | |