The present invention relates to a system provided with a plurality of processing units and at least one memory device.
Conventionally, an information processing system has been proposed which is constituted by a plurality of processors, a network linking the processors, a shared memory connected to the network, and a data transfer processing device which is connected to the network and connected to an expansion storage device (see Japanese Patent Application Publication No. H8-340348).
Furthermore, a distributed shared memory-type parallel computer system has been proposed in which a plurality of processor elements, each including a computation processing device, a cache memory and a local memory unit, are connected via a network, and each of the computation processing devices in the plurality of processor elements can access any of the local memories in the plurality of processor elements, as shared memories having the same address space (see Japanese Patent Application Publication No. H11-102321).
Conventionally, there have existed systems in which a plurality of central processing units (CPUs) carry out processing in parallel, and a memory is shared by the plurality of CPUs. However, with a conventional shared memory, either the plurality of CPUs are assigned to a predetermined network and reference to the shared memory is enabled by connecting the shared memory to the network, or a plurality of memories connected respectively to each of a plurality of CPUs are managed as a single address space, thereby enabling each of the CPUs to refer to the memories of the other CPUs, or the like. Therefore, when transferring computation results between CPUs while a plurality of CPUs are carrying out parallel processing, it is necessary to transmit and receive the computation results via the CPUs, a network interface which performs complex arbitration involving a large number of components, or the like, and this impedes increasing the speed of the parallel processing.
The present invention was devised in view of the problem described above, an object thereof being to provide a system including a plurality of processing units, wherein the transfer of computation results between processing units is made faster, and the overall performance of the system is improved.
The present invention adopts the following measures in order to resolve the problems described above. More specifically, the present invention is a system, including: a plurality of processing units, each having one or a plurality of processing unit-side ports; and at least one memory device having two or more memory device-side ports, wherein the memory device is shared by a predetermined two or more processing units among the plurality of processing units, by connecting the processing unit-side ports and the memory device-side ports one-to-one logically.
Furthermore, the system may further include a notification unit which notifies that a computation result produced by any of the processing units among the predetermined two or more processing units has been written to the memory device, or that any of the processing units among the predetermined two or more processing units has read out, from the memory device, a computation result produced by another processing unit among the predetermined two or more processing units, to the other processing unit among the predetermined two or more processing units.
Moreover, the notification unit may notify the other processing unit when the computation result has been written to the memory device, or when the computation result has been read out from the memory device, by generating an interrupt to the other processing unit.
Furthermore, the memory device may also include an interrupt unit which issues an interrupt to the other processing unit; and the notification unit may notify the other processing unit when the computation result has been written to the memory device, or when the computation result has been read out from the memory device, by causing the interrupt unit to issue an interrupt to the other processing unit.
Moreover, the notification unit may issue the interrupt by issuing a predetermined instruction to the memory device.
Furthermore, the plurality of processing units may be connected via the memory devices so as to logically configure a mesh in which the processing units are nodes; and the predetermined two or more processing units may be processing units which are arranged adjacent to each other in the mesh among the plurality of processing units.
Moreover, the plurality of processing units may be connected via the memory devices so as to logically configure a one-dimensional or multi-dimensional torus; and the predetermined two or more processing units may be processing units which are arranged adjacent to each other in the torus among the plurality of processing units.
Furthermore, the plurality of processing units may each include: a computation result acquisition unit to acquire a computation result produced by another processing unit which shares the memory device with the processing unit, by reading out the computation result from the memory device; a computation unit to carry out computation using the computation result acquired by the computation result acquisition unit; and a computation result delivery unit to transfer a computation result produced by the computation unit to the other processing unit by writing the computation result to the memory device, thereby enabling the other processing unit to carry out computation using the computation result.
Moreover, the present invention can also be understood as an invention relating to a memory device. For example, the present invention is a memory device which can be shared by two or more processing units, including: a storage unit which stores a computation result written from the two or more processing units; and an interrupt unit which, upon receiving an instruction from a processing unit that has written the computation result to the memory device, or from a processing unit that has read out the computation result from the memory device, issues an interrupt to another processing unit among the two or more processing units.
The present invention can be understood as a computer system, an information processing device, a method executed by a computer, or a program which is executed in a computer. Furthermore, the present invention can also be understood as a recording medium on which such a program is recorded so as to be readable by a computer, or other device or machine, or the like. Here, a recording medium which is readable by a computer, or the like, is a recording medium on which information, such as data or programs, is stored by an electrical, magnetic, optical, mechanical or chemical action, and which can be read by the computer, or the like.
According to the present invention, in a system provided with a plurality of processing units, the transfer of computation results between the processing units is made faster and the overall performance of the system can be improved.
Embodiments of the system, memory device and method relating to the present disclosure are described below on the basis of the drawings. However, the embodiments given below are merely examples, and the system, memory device and method relating to the present disclosure are not limited to the specific configurations given below. In implementing the invention, concrete configurations corresponding to the embodiments can be adopted, as appropriate, and various improvements and modifications can be made.
<System Configuration>
In the system relating to the present embodiment, the transfer of computation results between the CPU 11a and the CPU 11b is made faster by connecting the CPUs 11a, 11b and the memory 12 as described above, whereby the overall performance of the system can be improved.
The computation result acquisition units 111a, 111b each acquire computation results produced by the other CPU 11a or 11b which is sharing the memory 12, by reading out the computation results from the memory 12.
The computation units 112a, 112b carry out computations by using computation results acquired by the computation result acquisition units 111a, 111b.
The computation result delivery units 113a, 113b transfer computation results produced by the computation units 112a, 112b, to the other CPU 11a or 11b sharing the memory 12, by writing the computation results to the memory 12, thereby enabling that other CPU 11a or 11b to carry out computations using those computation results. In this case, the other CPU 11a or 11b acquires the computation results by the computation result acquisition unit 111a, 111b.
The notification unit 114a, 114b notifies the other CPU 11a or 11b sharing the memory 12 that a computation result produced by the CPU 11a or 11b has been written to the memory 12, or read out from the memory 12. In the present embodiment, the notification units 114a, 114b provide a notification that a computation result has been written to the memory 12 or has been read out from the memory 12, by sending a predetermined message instructing an interrupt to the CPU 11a or 11b, to the memory 12. In the present embodiment, a message instructing writing or reading of a computation result produced by the computation result delivery units 113a, 113b is transmitted together with a predetermined message indicating an interrupt to the CPU 11a or 11b, by the notification units 114a, 114b. In this way, the notification unit 114a, 114b generates an interrupt to the other CPU 11a or 11b sharing the memory 12 (a so-called “door-bell interrupt”), thereby notifying the other processing unit, CPU 11a or 11b, that a computation result has been written to the memory 12, or read out from the memory 12.
Consequently, the memory 12 is provided with an interrupt unit 121 for issuing interrupts to the CPU 11a and the CPU 11b.
When the memory 12 receives, from a CPU 11a or 11b that has written or read out a computation result, a predetermined message instructing an interrupt to another CPU (a CPU other than the one carrying out the writing or reading, of the CPUs sharing the memory 12), the memory 12 generates an interrupt to that other CPU 11a or 11b. By adopting a configuration of this kind, the notification units 114a, 114b notify the CPUs which are sharing the memory 12 that a computation result has been written or read out. More specifically, the memory 12 is able to identify the CPU issuing the interrupt instruction, on the basis of the content of the received message or the port at which the message is received, and issues an interrupt to the CPU other than the CPU that has issued the interrupt instruction.
In the example illustrated in
Furthermore, when a computation result is written to the computation result write region 122b for the CPU 11b, and an interrupt instruction is issued from the CPU 11b to the memory 12, then the interrupt unit 121 issues an interrupt to the CPU 11a, and notifies the CPU 11a that a computation result produced by the CPU 11b has been written. Thereupon, when the computation result is read out from the computation result write region 122b for the CPU 11b, and an interrupt instruction is issued from the CPU 11a to the memory 12, then the interrupt unit 121 issues an interrupt to the CPU 11b, and notifies the CPU 11b that the computation result has been read out by the CPU 11a. If the memory 12 is shared by three or more CPUs, then this notification is sent to two or more CPUs. In this case, the interrupt unit 121 issues an interrupt, either substantially simultaneously or consecutively, to these two or more CPUs.
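The behavior of the interrupt unit 121 described above can be sketched in software as follows. This is a minimal illustrative model only, not the implementation of the embodiment: the class name SharedMemory, the methods attach, write, read and doorbell, and the use of integer port numbers are all assumptions introduced for the example. The sketch shows the memory identifying the issuing CPU by the port at which the instruction arrives, and consecutively interrupting every other CPU sharing the memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical; the source describes behavior, not an API.
InterruptLine = Callable[[str, int], None]  # (kind, source_port) -> None


@dataclass
class SharedMemory:
    """Toy model of a shared memory with one write region per port."""
    regions: Dict[int, bytes] = field(default_factory=dict)
    lines: Dict[int, InterruptLine] = field(default_factory=dict)

    def attach(self, port: int, line: InterruptLine) -> None:
        # One-to-one logical connection between a CPU port and a memory port.
        self.lines[port] = line
        self.regions[port] = b""

    def write(self, port: int, data: bytes) -> None:
        # A CPU writes into its own computation result write region.
        self.regions[port] = data

    def read(self, owner_port: int) -> bytes:
        # A CPU reads the region owned by the CPU on owner_port.
        return self.regions[owner_port]

    def doorbell(self, source_port: int, kind: str) -> List[int]:
        # The issuer is identified by the port the instruction arrived on;
        # every other attached CPU is interrupted, one after another.
        targets = [p for p in sorted(self.lines) if p != source_port]
        for p in targets:
            self.lines[p](kind, source_port)
        return targets


# Usage: three CPUs share one memory; the CPU on port 0 writes and rings.
log = []
mem = SharedMemory()
for port in (0, 1, 2):
    mem.attach(port, lambda kind, src, me=port: log.append((me, kind, src)))
mem.write(0, b"result")
notified = mem.doorbell(0, "write")
```

In this sketch the issuing CPU never receives its own interrupt, which corresponds to the memory issuing an interrupt only to CPUs other than the one that issued the interrupt instruction.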
When implementing the system, memory device and method relating to the present disclosure, another method may be used for notification, and an interrupt unit 121 such as that described above does not have to be provided in the memory 12. For example, a notification may be issued by a method (called a “spinlock”) wherein the CPU 11a or 11b on the side receiving a computation result repeatedly checks a write completion flag or read-out completion flag which has been set in the memory 12 by the CPU on the side that has carried out the writing or read-out (these flags differ from the writeable flag and the readable flag which are described below).
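The polling alternative can be illustrated by the following minimal sketch, in which two threads stand in for the writing CPU and the receiving CPU. The names PolledRegion, producer and consumer, the polling interval and the timeout are assumptions made for the example. In actual hardware, a memory barrier or equivalent ordering guarantee between writing the data and setting the flag would also have to be considered, which this sketch does not model.

```python
import threading
import time


class PolledRegion:
    """Hypothetical shared region holding a result and a completion flag."""

    def __init__(self):
        self.data = None
        self.write_done = False  # write completion flag set by the writer


def producer(region, value):
    region.data = value        # write the computation result first
    region.write_done = True   # then set the write completion flag


def consumer(region, poll_interval=0.001, timeout=2.0):
    # Repeatedly check the flag instead of waiting for an interrupt.
    deadline = time.monotonic() + timeout
    while not region.write_done:
        if time.monotonic() > deadline:
            raise TimeoutError("no computation result was written")
        time.sleep(poll_interval)
    region.write_done = False  # consume the notification
    return region.data


# Usage: the writer runs in another thread while the receiver polls.
region = PolledRegion()
writer = threading.Thread(target=producer, args=(region, 42))
writer.start()
value = consumer(region)
writer.join()
```

Compared with the interrupt-based notification, this approach needs no interrupt unit in the memory, at the cost of the receiving side spending cycles on repeated checks.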
Furthermore, the CPU 11a or 11b which has written or read a computation result to or from the computation result write region 122a or 122b may generate an interrupt directly in the CPU that is the destination of the notification. In this case, the CPUs must each be connected to the interrupt controller of the other counterpart CPU. If a method of this kind is adopted, depending on the length and/or congestion status of the signal lines between the CPUs and the memory, and the timing at which the interrupt is issued, there is a possibility that an interrupt may be issued before writing/read-out to or from the computation result write region 122a or 122b is actually completed, and the counterpart CPU may read from a storage region where writing of the computation result has not been completed, or the counterpart CPU may write to a storage region where reading of a computation result has not been completed. Therefore, a countermeasure may be adopted whereby an interrupt is issued after reliably detecting that writing or read-out has been completed.
Furthermore, the present embodiment describes a case in which a fixed computation result write region 122a, 122b is used by the CPU 11a, 11b as a region for writing a computation result about which the counterpart CPU is notified, but the region where the computation result is written does not have to be fixed. When the region where the computation result has been written is not fixed, then the CPUs 11a, 11b notify the counterpart device of the address where the computation result has been written. This notification may be carried out simultaneously with the interrupt described above, or may be carried out by writing the address where the computation result has been written, to a predetermined location in the memory 12.
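The second option, in which the address is written to a predetermined location in the memory, might look like the following sketch. The mailbox layout (an address and a length stored as two 32-bit little-endian integers at offset 0) and the names publish and fetch are purely hypothetical; the source does not specify any layout.

```python
import struct

MAILBOX_OFFSET = 0        # hypothetical predetermined location
MAILBOX_FORMAT = "<II"    # (address, length), 32-bit little-endian each
MAILBOX_SIZE = struct.calcsize(MAILBOX_FORMAT)


def publish(memory: bytearray, address: int, payload: bytes) -> None:
    """Write a result at a non-fixed address, then record where it is."""
    end = address + len(payload)
    if address < MAILBOX_OFFSET + MAILBOX_SIZE or end > len(memory):
        raise ValueError("payload does not fit outside the mailbox")
    memory[address:end] = payload
    struct.pack_into(MAILBOX_FORMAT, memory, MAILBOX_OFFSET,
                     address, len(payload))


def fetch(memory: bytearray) -> bytes:
    """Read the address from the mailbox, then read the result itself."""
    address, length = struct.unpack_from(MAILBOX_FORMAT, memory, MAILBOX_OFFSET)
    return bytes(memory[address:address + length])


# Usage: the writer picks address 16; the reader learns it from the mailbox.
memory = bytearray(64)
publish(memory, 16, b"partial sum")
received = fetch(memory)
```

The counterpart CPU still needs a notification (an interrupt or a polled flag) to know when the mailbox contents are valid; the sketch covers only how the address is conveyed.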
<Flow of Processing>
A write notification interrupt handler which is executed when a write notification interrupt is received from the memory 12 and a read-out notification interrupt handler which is executed when a read-out notification interrupt is received from the memory 12 are set respectively in each of the CPUs.
The write notification interrupt handler is a handler (not illustrated) which, when a write notification interrupt is received, raises a readable flag (sets the flag to TRUE), and if the CPU is sleeping, cancels the sleeping status of the CPU and returns the CPU to processing. The readable flag is a flag which is used in order to delay the read-out of the computation result in situations during computation result transfer processing where, for instance, a computation result has been written by another CPU but computation by the CPU in question has not been completed. In the present embodiment, the readable flag is held in the CPU 11 (for example, a register provided in the CPU 11).
Meanwhile, the read-out notification interrupt handler is a handler (not illustrated) which, when a read-out notification interrupt is received, raises a writable flag (sets the flag to TRUE), and if the CPU is sleeping, cancels the sleeping status of the CPU and returns the CPU to processing. The writable flag is a flag which is used in order to delay the writing of the computation result in situations during computation result transfer processing where, for instance, reading of a computation result produced by the other CPU has been completed, but computation by the CPU in question has not been completed. In the present embodiment, the writable flag is held in the CPU 11 (for example, a register provided in the CPU 11).
The CPU 11a carries out computation assigned to that CPU in the parallel processing (step S101). In a first cycle of processing, the CPU 11a carries out computation on the basis of previously determined data. In the second and subsequent cycles of processing, the CPU 11a carries out computation on the basis of a computation result produced by the other CPU (for example, the CPU 11b), which has been acquired by the CPU 11a in step S113 described below.
When computation has been completed, the CPU 11a checks the status of the writable flag in order to determine whether or not writing to the computation result write region 122a is possible, and when the writable flag indicates that writing to the computation result write region 122a is possible (when the flag is TRUE), then the CPU 11a sets the writable flag to a value (FALSE) indicating that writing to the computation result write region 122a is not permitted (step S102).
When it has been confirmed in step S102 that the writable flag has been raised (is set to TRUE), the CPU 11a writes the computation result to the computation result write region 122a of the memory 12 (step S103 and step S104). Here, the CPU 11a issues, to the memory 12, a write notification interrupt instruction relating to the other CPU (here, the CPU 11b) apart from the CPU 11a, which is sharing the memory 12 (step S105). The processing indicated in step S104 and step S105 is carried out by sending an instruction to the memory 12 from the CPU 11a, but the write instruction and the interrupt instruction may be sent separately or may be sent simultaneously as one message.
The interrupt unit 121 of the memory 12, upon receiving the write notification interrupt instruction from the CPU 11a, issues a write notification interrupt to inform the CPU 11b that writing of the computation result by the CPU 11a has been completed. In this way, the CPU 11b is notified of the fact that a computation result has been written to the memory 12.
Upon receiving the write notification interrupt from the memory 12, the CPU 11b executes the write notification interrupt handler described above. Thereafter, similarly to the CPU 11a, the CPU 11b which is executing the computation result transfer processing, acquires the computation result produced by the CPU 11a which is sharing the memory 12 with the CPU 11b, of the plurality of processing units, from the computation result write region 122a of the memory 12. In other words, by executing the processing described above, the CPU 11a can transfer the computation result to another processing unit (here, the CPU 11b), and enable the other processing unit to carry out computation using that computation result.
Meanwhile, when it is confirmed in step S102 that the writable flag has not been raised (is set to FALSE), then there is a possibility that read-out from the computation result write region 122a by another CPU, such as the CPU 11b, has not been completed, and therefore writing of a computation result cannot be carried out. Consequently, the CPU 11a sets itself to sleeping status (step S107). If a flag indicating a sleeping status is used to determine a sleeping status of the CPU in the interrupt handler described above, then the CPU 11a raises the sleeping flag immediately before entering a sleeping status. In this case, the series of processes from the process indicated in step S102 to the process wherein the CPU raises the sleeping flag and enters sleeping status are desirably atomic.
The CPU 11a which has entered a sleeping status executes the write notification interrupt handler or the read-out notification interrupt handler described above, upon receiving an interrupt from another CPU, such as the CPU 11b. Here, when the received interrupt is a read-out notification interrupt, then the CPU is started up by the read-out notification interrupt handler (the sleeping status is cancelled) and the writable flag is raised, and therefore the CPU carries out writing of the computation result (step S104). When a sleeping flag is used to determine the sleeping status, a process for lowering the sleeping flag is carried out after cancelling the sleeping status, but the series of processes for cancelling the sleeping status and lowering the sleeping flag are desirably atomic.
Next, the CPU 11a checks the status of the readable flag in order to determine whether or not read-out from the computation result write region 122b is possible, and when the readable flag indicates that read-out from the computation result write region 122b is possible (when the flag is set to TRUE), then the CPU 11a sets the readable flag to a value (FALSE) indicating that read-out from the computation result write region 122b is not permitted (step S111).
When it has been confirmed in step S111 that the readable flag has been raised (is set to TRUE), then the CPU 11a acquires the computation result produced by the CPU 11b which shares the memory 12 with the CPU 11a, of the plurality of processing units, by reading out the computation result from the computation result write region 122b of the memory 12 (step S112 and step S113). The computation result which is read out here is the computation result that was written to the memory 12 as a result of computation result transfer processing carried out by the CPU 11b in parallel with the computation result transfer processing carried out by the CPU 11a.
The CPU 11a instructs the memory 12 to issue an interrupt to the CPU 11b for notification of the completion of read-out, in order to notify the CPU 11b that read-out of the computation result has been completed (read-out notification interrupt instruction; step S114). The processing indicated in step S113 and step S114 is carried out by sending an instruction to the memory 12 from the CPU 11a, but the read-out instruction and the interrupt instruction may be sent separately or may be sent simultaneously as one message.
The interrupt unit 121 of the memory 12, upon receiving the read-out notification interrupt instruction from the CPU 11a, issues a read-out notification interrupt to inform the CPU 11b that read-out by the CPU 11a has been completed. In this way, the CPU 11b is notified of the fact that a computation result has been read out from the memory 12. Upon receiving the read-out notification interrupt, the CPU 11b executes the read-out notification interrupt handler mentioned above.
Meanwhile, when it is confirmed in step S111 that the readable flag has not been raised (the flag is set to FALSE), then there is a possibility that writing to the computation result write region 122b by another CPU, such as the CPU 11b, has not been completed, and therefore read-out of a computation result cannot be carried out. Consequently, the CPU 11a sets itself to sleeping status (step S116). If a flag indicating a sleeping status is used to determine a sleeping status of the CPU in the interrupt handler described above, then the CPU 11a raises the sleeping flag immediately before entering a sleeping status. In this case, the series of processes from the process indicated in step S111 to the process wherein the CPU raises the sleeping flag and enters sleeping status are desirably atomic.
The CPU 11a which has entered a sleeping status executes the write notification interrupt handler or the read-out notification interrupt handler described above, upon receiving an interrupt from another CPU, such as the CPU 11b. Here, when the received interrupt is a write notification interrupt, then the CPU is started up by the write notification interrupt handler (the sleeping status is cancelled) and the readable flag is raised, and therefore the CPU carries out read-out of the computation result (step S113). When a sleeping flag is used to determine the sleeping status, a process for lowering the sleeping flag is carried out after cancelling the sleeping status, but the series of processes for cancelling the sleeping status and lowering the sleeping flag are desirably atomic. Thereupon, the processing returns to step S101.
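The computation result transfer processing of steps S101 to S116 can be simulated with two threads standing in for the CPU 11a and the CPU 11b, as in the following minimal sketch. The classes Cpu and Memory, the helper take, the function run, and the placeholder computation (adding 1) are all assumptions introduced for illustration. In particular, the sketch assumes that the writable flag is initially TRUE, since each CPU's own write region is empty at the start; the source does not state the initial flag values. The condition variable makes the flag check and the transition to sleeping a single atomic step, which is the property described above as desirable.

```python
import threading


class Cpu:
    def __init__(self, name, seed):
        self.name = name
        self.value = seed            # previously determined data (first cycle)
        self.cond = threading.Condition()
        self.readable = False
        self.writable = True         # assumption: own write region starts free
        self.history = []

    def interrupt(self, kind):
        # Write / read-out notification interrupt handler: raise the flag
        # and wake the CPU if it is sleeping.
        with self.cond:
            if kind == "write":
                self.readable = True
            else:
                self.writable = True
            self.cond.notify_all()

    def take(self, flag):
        # Check the flag and, if it is lowered, sleep; both happen under
        # one lock so that a wake-up cannot be lost in between.
        with self.cond:
            while not getattr(self, flag):
                self.cond.wait()
            setattr(self, flag, False)


class Memory:
    def __init__(self):
        self.regions = {}
        self.cpus = {}

    def attach(self, cpu):
        self.cpus[cpu.name] = cpu
        self.regions[cpu.name] = None

    def doorbell(self, source, kind):
        # Interrupt every sharing CPU other than the issuer.
        for name, cpu in self.cpus.items():
            if name != source:
                cpu.interrupt(kind)


def run(cpu, peer_name, memory, cycles):
    for _ in range(cycles):
        result = cpu.value + 1                   # S101: placeholder computation
        cpu.take("writable")                     # S102, sleeping in S107
        memory.regions[cpu.name] = result        # S104: write result
        memory.doorbell(cpu.name, "write")       # S105: write notification
        cpu.take("readable")                     # S111, sleeping in S116
        cpu.value = memory.regions[peer_name]    # S113: read peer's result
        memory.doorbell(cpu.name, "read")        # S114: read-out notification
        cpu.history.append(cpu.value)


# Usage: two CPUs exchange results for three cycles.
a, b = Cpu("11a", 0), Cpu("11b", 100)
memory = Memory()
memory.attach(a)
memory.attach(b)
threads = [threading.Thread(target=run, args=(a, "11b", memory, 3)),
           threading.Thread(target=run, args=(b, "11a", memory, 3))]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
```

Because a CPU may overwrite its region only after the counterpart has signalled read-out, and may read only after the counterpart has signalled writing, the values each CPU receives are the same regardless of thread scheduling.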
In the flowcharts in
In the present embodiment, an example has been described in which an interrupt is carried out each time a computation result is written, and the computation result is transferred to the counterpart CPU, but the interrupt may also be carried out once in respect of a plurality of computation result write operations. In other words, the computation result write process may be carried out after the computation process in the processing illustrated in
<Variations>
The system according to the present disclosure should be provided with a plurality of processing units and at least one memory device, and is not limited to the configurations in
For example, when data for analysis objects is stacked three-dimensionally (in so-called “voxels”), in the case of finite element analysis, or the like, then each analysis object affects, and is affected by, the six analysis objects which are situated adjacent thereto. Therefore, conventionally, the CPUs which carry out analysis are connected in a torus or mesh configuration, and computation results are transmitted and received between the CPUs working in parallel, while carrying out analysis. In the system relating to the present embodiment, the plurality of CPUs may be connected via memories so as to logically configure a one-dimensional or multi-dimensional torus. Furthermore, the plurality of CPUs may be connected via memories so as to logically configure a mesh in which the CPUs are each placed at the nodes (intersections) (in other words, a torus configuration does not need to be adopted). However, with a system of this kind, there is a need to exchange data with adjacent nodes (CPUs) at each step of the processing, and therefore the time taken for this data exchange greatly affects the overall performance of the system. In particular, in a conventional parallel processing system which is connected in a torus or mesh configuration, the connection lines which interconnect the CPUs are long, and the delay increases. Therefore, the system, memory device and method relating to the present disclosure also exhibit a beneficial effect in achieving faster transfer of computation results between CPUs, in a system in which the CPUs are connected in a torus or mesh configuration. In the system relating to the present embodiment, a mode other than a torus or mesh configuration may be employed for the connection mode of the plurality of CPUs.
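The adjacency relationship in a logical torus or mesh can be illustrated by the following sketch, which lists the nodes adjacent to a given node. The function name neighbors and the representation of nodes as coordinate tuples are assumptions made for the example; in the system described here, each such adjacent pair of CPUs would be a candidate for sharing a memory.

```python
from typing import List, Tuple

Coord = Tuple[int, ...]


def neighbors(node: Coord, shape: Coord, torus: bool = True) -> List[Coord]:
    """Return the nodes adjacent to `node` in a logical torus or mesh."""
    result: List[Coord] = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            pos = node[axis] + step
            if torus:
                pos %= size              # wrap around at the edges
            elif not 0 <= pos < size:
                continue                 # a mesh has no wrap-around link
            candidate = node[:axis] + (pos,) + node[axis + 1:]
            # Skip self-links and duplicates, which arise on axes of size 1 or 2.
            if candidate != node and candidate not in result:
                result.append(candidate)
    return result


# Usage: a corner node of a 4 x 4 x 4 arrangement.
torus_adjacent = neighbors((0, 0, 0), (4, 4, 4), torus=True)
mesh_adjacent = neighbors((0, 0, 0), (4, 4, 4), torus=False)
```

In the three-dimensional torus every node has six adjacent nodes, matching the six adjacent analysis objects mentioned above, whereas a corner node of the mesh has only three.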
From
In a torus-shaped system (see
In the mode shown in
In the mode shown in
According to the embodiment described above, in a system provided with a plurality of CPUs, the transfer of computation results between CPUs is made faster and the overall performance of the system can be improved. As described above, the embodiment described above serves as an example and does not limit the system, memory device and method relating to the present disclosure to a concrete configuration. In implementing the invention, the concrete configurations corresponding to the embodiment can be adopted, as appropriate, and various improvements and modifications can be made.
This application is a continuation application of International Application PCT/JP2014/053539 filed on Feb. 14, 2014 and designated the U.S., the entire contents of which are incorporated herein by reference.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/JP2014/053539 | Feb 2014 | US
Child | 14669293 | | US