This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. JP2013-067651, filed on Mar. 27, 2013, the entire contents of which are incorporated herein by reference.
The present invention relates to an arithmetic processing apparatus including a cache memory and a plurality of processing units.
In a computer, processing is often dominated by a pattern in which a processing unit, e.g., a processor, executing a program accesses a memory, reads out data, processes the readout data and writes the processed data back to the memory. The processing unit will hereinafter be referred to also as a core.
For this reason, a high-speed, small-capacity memory called a cache is disposed between the processing unit and a memory existing outside the processing unit in order to improve memory access speed. Namely, a method is utilized that increases the effective speed at which the processing unit accesses the memory through the cache.
This cache technology widely utilizes "prefetch", which predicts the memory that will be accessed by the processing unit, reads the data beforehand from the external memory and writes the readout data to the cache. Prefetching is realized by, e.g., embedding a prefetch instruction, which instructs execution of prefetching, into a binary program at compile time.
On the other hand, shortening the clock cycle of the processing unit to attain a higher frequency has a limit as a way of improving calculation speed. Therefore, a method currently taken is to operate a multiplicity of processing units conducting the calculations in parallel. Further, a system is proposed in which an auxiliary processing unit previously acquires the data with an instruction such as the prefetch instruction before, e.g., the processing unit performs the calculation.
According to an aspect of the embodiments, an arithmetic processing apparatus includes a plurality of first processing units to be connected to a cache memory; a plurality of second processing units to be connected to the cache memory and to acquire, into the cache memory, data to be processed by the first processing units before each of the plurality of first processing units executes processing; and a schedule processing unit to control a schedule on which the plurality of second processing units acquire the data into the cache memory.
The object and advantage of the embodiment will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
An arithmetic processing apparatus according to one embodiment will hereinafter be described with reference to the drawings. A configuration of the following embodiment is an exemplification, and the present arithmetic processing apparatus is not limited to the configuration of the embodiment.
Incidentally, in the case of an arithmetic processing apparatus including a plurality of processing units performing the calculations, the timing of acquiring the data beforehand differs on a per-processing-unit basis. Accordingly, if the technology using an auxiliary processing unit for acquiring the data beforehand is expanded to an arithmetic processing apparatus including a plurality of processing units, a situation can occur in which the data is not yet prepared when one of the plurality of processing units performing the calculation needs the data.
The arithmetic processing apparatus according to a comparative example will be described with reference to
The calculation cores 1 acquire sequences of instructions of the computer program deployed in an executable manner on the memory 5 and data via the cache memory 4. Then, the calculation cores 1 process the acquired data by executing the acquired sequences of instructions, and store processed results in the memory 5 via the cache memory 4.
The sequences of instructions executed by the calculation cores 1 contain the prefetch instruction embedded by a compiler when compiling a source program. Each of the calculation cores 1, when acquiring the prefetch instruction, requests the assistant core 2 to execute the prefetch instruction.
The assistant core 2 executes the prefetch instruction in accordance with the request issued from the calculation core 1. The data are acquired into the cache memory 4 by executing the prefetch instruction. Accordingly, when the calculation core 1 processes the data, the processing target data already exist in the cache memory 4. Namely, the assistant core 2, serving as a core for executing the prefetch, assists the calculation core 1 in efficiently executing the process.
The cache memory 4 is a memory that, though small in capacity, can be read from and written to at high speed. The memory 5 has a larger capacity than the cache memory 4 but is slower in reading and writing data. The calculation cores 1 efficiently make use of the cache memory 4, thereby speeding up the processes of the arithmetic processing apparatus 50.
In the architecture of
On the other hand, the assistant core 2, when receiving a request for executing the prefetch, executes prefetching corresponding to the prefetch instruction 1 during a period in which, e.g., the calculation core 1-1 executes the instructions 4 and 5.
Meanwhile, the calculation core 1-2 acquires the prefetch instruction 2 next to the instruction 12. In the example of
An arithmetic processing apparatus 10 according to a working example will hereinafter be described with reference to
A configuration and an operation of the assistant core 2 are the same as those in the arithmetic processing apparatus 50 in the comparative example. The arithmetic processing apparatus 10 in the working example differs, however, from the arithmetic processing apparatus 50 in the comparative example in that the plurality of assistant cores 2 access the cache memory 4 in parallel via the crossbar 6A.
To be specific, the plurality of calculation cores 1 and the plurality of assistant cores 2 access the cache memory 4 in parallel via the crossbar 6A. For instance, similarly to the case of
Moreover, in the first working example, each of the assistant cores 2 includes a register 7 that is readable from the cache scheduler 3. Each of the assistant cores 2 individually sets, in the register 7, a busy flag (which may also be called an in-use flag) indicating whether that assistant core 2 is currently in use. A status of "the assistant core 2 is in use" can be exemplified by a status in which the assistant core 2 is executing prefetching.
The cache scheduler 3 includes cores for executing instructions deployed in an executable manner on, e.g., a main storage device, and the main storage device storing the sequences of instructions executed by those cores and the data processed by those cores. The cache scheduler 3 executes the sequences of instructions on the main storage device, thereby communicating with the plurality of calculation cores 1 and the plurality of assistant cores 2 via the crossbar 6B. Note that the crossbar 6B and the crossbar 6A may be configured as the same crossbar. Namely, such a configuration may be taken that the plurality of calculation cores 1, the assistant cores 2, the cache scheduler 3 and the cache memory 4 are connected by the crossbar 6A. Alternatively, the crossbar 6A may be configured to connect the respective memory banks of the cache memory 4 to the cores (core group) including the plurality of calculation cores 1 and the plurality of assistant cores 2, independently of the crossbar 6B. In this case, the crossbar 6B may be configured to connect the cache scheduler 3 to the cores (core group) including the plurality of calculation cores 1 and the plurality of assistant cores 2, independently of the crossbar 6A and the cache memory 4.
In any of these configurations, the cache scheduler 3 receives notification of the prefetch instruction from the calculation core 1 via the crossbar 6B. The prefetch instruction contains an address of the memory 5 that is the prefetch target.
The cache scheduler 3, when receiving the prefetch instruction from any one of the calculation cores 1, determines, from among the plurality of assistant cores 2, an assistant core 2 that is in the null status and thus enabled to execute the prefetch instruction. For example, the cache scheduler 3 accesses the register 7 and, if a plurality of assistant cores 2 are kept in the null status, selects any one of these assistant cores 2. The way of selection is not, however, particularly limited. For instance, the cache scheduler 3 may simply select the assistant core 2 whose null status is recognized first through the register 7. Note that in the configuration of
Then, the cache scheduler 3 requests the selected assistant core 2 kept in the null status to execute the prefetch instruction of which the calculation core 1 has notified. The assistant core 2 receiving the request for executing the prefetch instruction executes prefetching from the address of the memory 5 specified by the prefetch instruction. Accordingly, when the calculation core 1 accesses the memory 5, the data of the accessed address will already have been prepared in the cache memory 4.
Subsequently, the cache scheduler 3 determines whether or not a prefetch instruction is left in a waiting status in the queue (S4). If it is determined in S4 that there is a prefetch instruction in the waiting status, the cache scheduler 3 searches for a null assistant core 2 (S5). As described above, the cache scheduler 3 refers to the register 7 of each of the plurality of assistant cores 2, and may thereby determine whether each assistant core 2 is in the null status or not.
Then, as a result of the process in S5, if no null assistant core 2 exists (NO in S6), the cache scheduler 3 loops the control back to S1. Namely, the cache scheduler 3 repeats the process from determining whether there is a notification of the prefetch instruction or not. Whereas if it is determined in S6 that a null assistant core 2 exists (YES in S6), the cache scheduler 3 accesses, via the crossbar 6B, the null assistant core 2 searched for in S5 and requests this assistant core 2 to execute the prefetch instruction (S7). The cache scheduler 3 requests execution of the prefetch instruction by issuing a predetermined instruction to the assistant core 2. Thereafter, the cache scheduler 3 loops the control back to S1.
Similarly, the calculation core 1-2 recognizes the prefetch instruction next to the instruction 12. For example, if the prefetch instruction exists in the sequence of instructions decoded at the instruction fetch next to the instruction 12, the calculation core 1-2 notifies the cache scheduler 3 of the prefetch instruction. The cache scheduler 3, when receiving the notification of the prefetch instruction from the calculation core 1-2, searches for any one of the plurality of assistant cores 2 in the null status and requests this assistant core 2 to execute the prefetch instruction. In this case, the data prefetched by the prefetch instruction is assumed to be used by the instruction 15.
Unlike the case of the comparative example, in the first working example, the plurality of assistant cores 2 in the null status searched for by the cache scheduler 3 can access the memory banks of the cache memory 4 in parallel via the crossbar 6A. Accordingly, as illustrated in
The assistant core 2-1 and the assistant core 2-2 execute prefetching in parallel via the crossbar 6A and the plurality of memory banks of the cache memory 4. Therefore, unlike the arithmetic processing apparatus 50 in the comparative example, the arithmetic processing apparatus 10 in the first working example enables the plurality of assistant cores 2 to operate in parallel, on the basis of scheduling by the cache scheduler 3, in the case where execution of a plurality of prefetch instructions is requested by the plurality of calculation cores 1.
Namely, the cache scheduler 3, when receiving the prefetch request from the calculation core 1, searches for an assistant core 2 in the null status and requests that assistant core 2 to execute prefetching. As a result, in the first working example, even when the plurality of calculation cores 1 request prefetching in parallel, the assistant cores 2 in the null status can prefetch in parallel. Accordingly, in the first working example, it is feasible to enhance the possibility that the loading of the data into the cache memory 4 in response to the prefetch request of each calculation core 1 completes in time for the execution of the instruction requiring this data.
Still further, the assistant core 2 receiving the prefetch request from the cache scheduler 3 sets the busy flag in the register 7 readable from the cache scheduler 3, and clears the busy flag after the completion of prefetching. Hence, the cache scheduler 3 can easily manage the null status of the assistant core 2.
An arithmetic processing apparatus 10A according to a second working example will be described with reference to
A connection between the core group A and the core group B is established via a crossbar 6C. For example, for the calculation core 1-A of the core group A to access the cache memory 4-B, the access passes through the crossbar 6A-A in the core group A, the crossbar 6A-B in the core group B and the crossbar 6C between these core groups. Accordingly, the period of time expended for the calculation core 1-A in the core group A to access the cache memory 4-B outside the core group A is longer than the period of time expended for the calculation core 1-A to access the cache memory 4-A in the core group A, resulting in a lower access speed. The core group A is one example of a first group. Any one of the plurality of calculation cores 1-A in the core group A is one example of a first processing unit. Any one of the plurality of assistant cores 2-A in the core group A is one example of a second processing unit. The cache memory 4-A is one example of a first cache memory.
The same applies to a case where the calculation core 1-B in the core group B accesses the cache memory 4-A outside the core group B. The core group B is one example of a second group. Any one of the plurality of calculation cores 1-B in the core group B is an example of another part of the first processing unit. Any one of the plurality of assistant cores 2-B in the core group B is an example of another part of the second processing unit. The cache memory 4-B is one example of a second cache memory.
In the second working example, the cache scheduler 3, when receiving the prefetch notification, performs scheduling so that an assistant core belonging to the same core group as the calculation core giving the prefetch notification executes prefetching.
Then, the cache scheduler 3, if there is a prefetch instruction in the waiting status, determines the core group from the queue of the prefetch instruction. Then, the cache scheduler 3 searches for a null assistant core 2 belonging to the same core group as the calculation core 1 notifying of the prefetch instruction in the waiting status (S5A). For example, when the calculation core 1-A of the core group A notifies of the prefetch instruction, the cache scheduler 3 searches, with respect to the prefetch instruction retained in the queue of the core group A, for any one of the plurality of assistant cores 2-A of the core group A. Then, the cache scheduler 3 determines which cores among the plurality of assistant cores 2-A of the core group A are in the null status (S6). Subsequently, if some of the plurality of assistant cores 2-A of the core group A are in the null status, the cache scheduler 3 selects any one of the assistant cores 2-A of the core group A taking the null status, and requests the selected assistant core 2 to execute the prefetch instruction (S7). What has been described so far is based on the processing example in the core group A; the same processing, however, applies also to the core group B.
As discussed above, in the configuration of the second working example, with respect to the calculation cores 1, the assistant cores 2 and the cache memories 4 that are separated into the plurality of core groups, the assistant core 2 belonging to the same core group as the calculation core 1 notifying of the prefetch instruction prefetches into the cache memory 4 of the same core group. The calculation core 1 is therefore enabled to acquire the result of prefetching from the cache memory 4 of the core group to which the calculation core 1 itself belongs. Namely, the calculation core 1 can make use of the result of prefetching from the cache memory 4 within its own group at a higher speed than from the cache memory 4 of a different core group. What has been described so far exemplifies the core groups A and B; the same, however, applies also to a case in which the number of core groups is equal to or larger than "3".
<Non-Transitory Computer-Readable Recording Medium>
A program for making a computer or other machines and devices (hereinafter referred to as the computer, etc.) realize any one of the above functions can be recorded on a non-transitory recording medium readable by the computer, etc. Then, by making the computer, etc. read and execute the program on this recording medium, the function can be provided.
Herein, the recording medium readable by the computer, etc. means a recording medium capable of accumulating information such as data and programs electrically, magnetically, optically, mechanically or by chemical action, and readable by the computer, etc. Among these recording media, for example, a flexible disc, a magneto-optical disc, a CD-ROM, a CD-R/W, a DVD, a Blu-ray disc, a DAT, an 8 mm tape and a memory card such as a flash memory are given as those removable from the computer. Further, a hard disc, a ROM (Read-Only Memory), etc. are given as the recording media fixed within the computer, etc.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment(s) of the present invention(s) has (have) been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---
2013-067651 | Mar 2013 | JP | national |