Embodiments described herein relate generally to a compiler, an object code generation method, an information processing apparatus, and an information processing method.
Conventionally, multi-thread processing is known as a program execution model for multiple cores. In multi-thread processing, a plurality of threads, each serving as an execution unit, operate in parallel and exchange data through a main memory, thereby accomplishing parallel processing.
An example of an execution form of such parallel processing includes two elements, i.e., a runtime processing including a scheduler which assigns a plurality of execution unit elements to execution units (CPU cores), and threads which operate on the respective execution units. For parallel processing, synchronization between threads is significant; if the synchronization processing is not proper, problems such as a deadlock and data inconsistency occur. Hence, conventionally, synchronization between threads is maintained by scheduling an execution order of the threads and by performing the parallel processing based on the schedule.
Further, a heterogeneous multi-core framework demands a runtime environment which implicitly performs data copy between a main memory of a host CPU and memories of devices such as accelerators, including GPGPU (general-purpose computing on graphics processing units; a technology which applies the calculation resources of a GPU to general-purpose calculations other than image processing).
For example, buffer synchronization and a parallel runtime are considered important in an acceleration calculation environment. When a CPU and an accelerator such as a GPU card cooperate with each other to execute a large-scale calculation, buffers are defined and data is transferred to the memory on the calculating side in order to exchange data between the CPU and the GPU.
At this time, deciding at what timing and in which direction the data is transferred is complex, and tends to cause bugs to be mixed into the code. In particular, when which of the CPU, GPU1, GPU2, . . . is to perform a calculation changes in the course of tuning a program, the timing and direction of data transfer need to be reconsidered carefully.
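For example, in an explicit programming model such as the CUDA runtime API, every transfer is written by hand, and each call fixes both the timing and the direction of a copy. The following is a minimal sketch, in which the kernel scale and the doubling calculation are merely illustrative; it shows how each copy call is tied to the choice of the calculating device.

    #include <cuda_runtime.h>

    __global__ void scale(float *a, int n) {          // illustrative GPU kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    void calculate_on_gpu(float *host_a, int n) {
        float *dev_a;
        cudaMalloc(&dev_a, n * sizeof(float));
        // The timing and direction of each copy are fixed by hand in the code.
        cudaMemcpy(dev_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(dev_a, n);
        cudaMemcpy(host_a, dev_a, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev_a);
        // If this calculation is later moved to the CPU or to a second GPU
        // during tuning, every copy above must be revisited by hand.
    }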
Therefore, a method has been proposed in which a buffer view abstracting the buffers is defined, which memory holds the newest data is maintained in the data structure of the buffer view, and data is copied on demand only when necessary. With this method, data transfers need not be described explicitly in the program code, yet data is transferred properly as needed. Therefore, a reliable program can be written with simple code.
However, in the method of copying data on demand, whether a data copy is needed is not determined until a parallel calculation processing (hereinafter referred to as a kernel) is called. Therefore, a delay caused by the data copy needs to be accepted.
There is a demand for a technology which enables a more efficient acceleration calculation program to be implemented simply.
A general architecture that implements the various features of the embodiments will now be described with reference to the drawings. The drawings and the associated descriptions are provided to illustrate the embodiments and not to limit the scope of the invention.
Various embodiments will be described hereinafter with reference to the accompanying drawings.
In general, according to one embodiment, a compiler is applicable to a parallel computer including processors, wherein a source program is input to the compiler and a local code for each of the processors is generated. The compiler includes a generation module and an object code generation module. The generation module is configured to analyze the input source program, extract a data transfer point from a procedure described in the source program, and generate a call processing for data copy. The object code generation module is configured to generate an object code including the call processing.
The present embodiment relates to an object code generation method which can be embodied as an information processing apparatus or an information processing method, and is applicable to a compiler which receives a source program and generates a local code for each of the processors forming a parallel calculator. The object code generation method can generate local codes independently of the processor configuration.
The first embodiment will be described with reference to
The core blocks 34 are identified by block IDs. In the example of
The host CPU 12 may also be a multi-core processor. The example of
The device memory 14, which can be accessed by the calculation device 10, is connected to the calculation device 10, and the main memory 16 is connected to the host CPU 12. Since the two memories, i.e., the main memory 16 and the device memory 14, exist separately, data is copied (synchronized) between the device memory 14 and the main memory 16 before and after the calculation device 10 executes a processing. For this purpose, the main memory 16 and the device memory 14 are connected to each other. When a plurality of processings are performed successively, the copy need not be performed for each of the processings.
This data structure includes four elements, as shown in
Cpu_mem is a pointer indicating the position of the data A in the main memory 16, and Gpu_mem is a pointer indicating the position of the data A in the device memory 14.
The state of a BufferView is managed as one of four states, i.e., CPU only, GPU only, Shared, and Undefined (the number of states increases as the number of calculation devices increases).
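A minimal C++ sketch of such a data structure might look as follows. The Cpu_mem and Gpu_mem pointers follow the description above; treating a size field as one of the four elements, as well as the sync_for_gpu helper and its blocking copy, are assumptions made for illustration only.

    #include <cstddef>

    enum class State { CpuOnly, GpuOnly, Shared, Undefined };

    void copy_to_device(void *dst, const void *src, size_t n);  // hypothetical blocking copy

    struct BufferView {
        void  *cpu_mem;   // position of the data in the main memory 16
        void  *gpu_mem;   // position of the data in the device memory 14
        size_t size;      // size of the data
        State  state;     // which memory holds the newest data
    };

    // On-demand scheme: the state is inspected only when a kernel is about
    // to run on the GPU, so the start of the kernel is delayed by the copy.
    void sync_for_gpu(BufferView &bv) {
        if (bv.state == State::CpuOnly) {
            copy_to_device(bv.gpu_mem, bv.cpu_mem, bv.size);
            bv.state = State::Shared;   // both memories now hold the newest data
        }
    }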
According to the prior art, data copy is performed on demand. As shown in
In order to solve this, the copy of BufferView G may be started immediately after the end of the kernel KG. However, the programming then becomes complicated and spoils the convenience of the abstraction using BufferView.
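Written by hand, such an eager start might look like the fragment below, reusing the BufferView sketch above; the asynchronous start and wait primitives and the follow-on kernel are hypothetical. The explicit start/wait bookkeeping is exactly what the abstraction was meant to hide.

    void start_copy_gpu_to_cpu_async(BufferView &bv);  // hypothetical primitives
    void wait_copy(BufferView &bv);
    void run_kernel_KG_on_gpu(BufferView &g);   // defines (writes) G on the GPU
    void run_next_kernel_on_gpu();              // hypothetical follow-on kernel
    void use_g_on_cpu(BufferView &g);           // the CPU-side use of G

    void eager_copy_by_hand(BufferView &g) {
        run_kernel_KG_on_gpu(g);         // kernel KG finishes defining G
        start_copy_gpu_to_cpu_async(g);  // start the copy immediately, by hand
        run_next_kernel_on_gpu();        // the copy overlaps this execution
        wait_copy(g);                    // make sure G has arrived in the main memory
        use_g_on_cpu(g);                 // the use of G on the CPU
    }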
A general compiler employing the object code generation method according to the present embodiment schematically includes a syntax analysis section, an optimization converter, and a code generation section. The syntax analysis section reads a source program, analyzes its syntax, converts the program into an intermediate code, and stores the code in a memory. That is, the syntax of the source program is analyzed and an intermediate code is generated; thereafter, optimization, code generation, and output of an object code are performed. The optimization proceeds as a flow of control flow analysis, data dependence analysis, and various optimizations on the intermediate code. The analysis of a Def-Use chain described later is a data dependence analysis, and the insertion of a data transfer code is a function achieved by the various optimizations and the code generation section.
Here, an outline of an operation procedure of a general parallel compiler will be described with reference to
First, at the beginning of compilation, a configuration B21 of a target processor is specified. The configuration may also be specified by what is called a compiler directive. The compiler then reads a source program B25 in Step S22, analyzes the syntax of the source program B25, and converts the source program B25 into an intermediate form B26, which is an internal representation.
Next, the compiler performs various optimization conversions on the intermediate form (internal representation) B26 in Step S23, and generates a converted intermediate form B27.
Next, in Step S24, the compiler scans the converted intermediate form B27 and generates an object code B28 for each PE. As an operation example of the compiler, a machine language code is generated from a program in a C-family language.
In the present embodiment, as shown in
The Def-Use chain is also called a du-chain (definition-use chain). Creating a definition-use chain is substantially the same calculation as live-variable analysis. A variable is used in a statement s if s requires its value on the right-hand side; for example, in the statements a := b + c and a[b] := c, the variables b and c are used in the respective statements (a is not used). The problem concerning the du-chain is to obtain, for a point p at which a variable x is defined, the set of statements s which use x. The specific steps are as follows.
Step S71: Divide a program into basic blocks.
Step S72: Create a control flow graph.
Step S73: Analyze a data flow with respect to a BufferView and create a Def-Use chain.
Perform the following processings on the Def-Use chains of all the BufferViews.
Step S74A: Determine whether the processings have been performed for the Def-Use chains of all the BufferViews. If so, the processing loop up to Step S74C ends, and the whole processing is terminated.
Step S74B: Determine whether the execution device of the kernel which defines (Def) the BufferView and the execution device of the kernel which uses (Use) the BufferView are different from each other. If this determination results in Yes, the flow goes to the next Step S74C. If No, the flow returns to Step S74A.
Step S74C: Insert a code which starts the data copy immediately after the execution of the kernel which performs the Def. The code generating a call processing for this data copy is realized, for example, by a function.
A basic block is a sequence of consecutive statements which control enters at the top statement and leaves from the last statement without halting or branching partway. For example, a sequence of so-called three-address statements forms a basic block.
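A C++ sketch of the loop of Steps S74A to S74C might look as follows, taking the Def-Use chains created in Step S73 as input; the intermediate-representation types and the query helpers are hypothetical, and only the decision of Step S74B and the insertion of Step S74C are spelled out.

    #include <vector>

    struct Kernel;                    // hypothetical IR node for a kernel call
    struct DefUseChain {              // one chain per BufferView (Step S73)
        Kernel *def;                  // the kernel that defines (writes) the BufferView
        std::vector<Kernel*> uses;    // the kernels that use (read) it
    };

    int  device_of(const Kernel *k);          // hypothetical device query
    void insert_copy_start_after(Kernel *k);  // emits the call processing for the copy

    // Steps S74A-S74C: walk the Def-Use chain of every BufferView and insert
    // a copy-start call immediately after the defining kernel whenever some
    // use runs on a different device.
    void insert_copy_starts(std::vector<DefUseChain> &chains) {
        for (DefUseChain &c : chains) {                    // Step S74A: loop over all chains
            for (Kernel *use : c.uses) {
                if (device_of(c.def) != device_of(use)) {  // Step S74B
                    insert_copy_start_after(c.def);        // Step S74C
                    break;   // one start call suffices in this two-device model
                }
            }
        }
    }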
Further, as shown in
The second embodiment will be described with reference to
As shown in functional blocks in
As a result, in an SoC (System on Chip) which integrates a CPU and a GPU, the data copy of the embodiment is replaced with a prefetch into a cache, and becomes an effective measure to improve performance with a simple program description even when a CPU, a GPU, and any other accelerator share a memory. Here, mem is a pointer indicating the position of the data A in the shared cache 16B.
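On such a shared-memory configuration, the inserted copy-start call can thus degenerate into cache warming. A sketch using the GCC/Clang built-in __builtin_prefetch follows; the float element type, the stride, and the 64-byte cache line are assumptions.

    #include <cstddef>

    // Sketch: on a shared memory, "copying" the data defined by one kernel
    // reduces to warming the cache before the consuming kernel runs.
    void prefetch_for_next_kernel(const float *mem, size_t n) {
        for (size_t i = 0; i < n; i += 16)    // 16 floats = one 64-byte line
            __builtin_prefetch(&mem[i], /*rw=*/0, /*locality=*/3);
    }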
As described above, in an environment in which complicated and time-consuming GPU programming is simplified, the delay of data transfer is hidden automatically, so that a highly efficient program can be created.
According to the embodiments, the following are practiced in a runtime environment which abstracts the data buffers used for calculation and implicitly performs data copy between the memories of devices, such as accelerators including a GPU, and the main memory of a host CPU.
(1) A data copy is performed not on demand but as early as possible. In this manner, delays in data transfer are reduced and performance is improved.
(2) To copy data at an early time point, a data transfer point is obtained when the program is compiled, and a processing for calling the data copy is generated there.
(3) When a calculation is performed by a device which has a relatively low degree of parallelism, such as a multi-core CPU, the input data buffer is subdivided so that the data flows in the form of a stream and the calculation in the multi-core CPU starts early, as in the sketch below. System performance is thereby improved.
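A sketch of this subdivision follows, reusing the BufferView sketch above; the chunk size, the asynchronous chunk copy, and the per-chunk CPU kernel are hypothetical.

    void start_chunk_copy_async(BufferView &bv, size_t off, size_t len);  // hypothetical
    void wait_chunk(BufferView &bv, size_t off);                          // hypothetical
    void run_cpu_kernel_on_chunk(BufferView &bv, size_t off, size_t len); // hypothetical

    // Subdivide the input buffer and stream it: the calculation on one chunk
    // overlaps the transfer of the following chunks, so the CPU starts early.
    void stream_to_cpu(BufferView &bv, size_t chunk) {
        for (size_t off = 0; off < bv.size; off += chunk) {   // start all transfers
            size_t len = (bv.size - off < chunk) ? bv.size - off : chunk;
            start_chunk_copy_async(bv, off, len);
        }
        for (size_t off = 0; off < bv.size; off += chunk) {   // calculate as chunks arrive
            size_t len = (bv.size - off < chunk) ? bv.size - off : chunk;
            wait_chunk(bv, off);
            run_cpu_kernel_on_chunk(bv, off, len);
        }
    }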
According to the embodiments, a programmer can create a program which starts data copy at proper timings without describing any data transfer processing. Therefore, an efficient acceleration calculation program can be implemented simply.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Foreign Application Priority Data: Number 2013-019259; Date: Feb. 2013; Country: JP; Kind: national.
This application is a Continuation Application of PCT Application No. PCT/JP2013/058157, filed Mar. 21, 2013 and based upon and claiming the benefit of priority from Japanese Patent Application No. 2013-019259, filed Feb. 4, 2013, the entire contents of all of which are incorporated herein by reference.
Related U.S. Application Data: Parent: PCT/JP2013/058157, Mar. 2013 (US); Child: Ser. No. 14/015,670 (US).