1. Field of the Invention
The present invention relates to a method and apparatus for transferring data in a computer, and in particular to a method and apparatus for transferring data between a memory for a central processing unit (CPU) and a memory for a graphics processing unit (GPU).
2. Description of the Related Art
A system having heterogeneous parallel computing power may use different types of processors. The different types of processors may be designed with different architectures and may therefore prefer different data layouts. A representative example of such a system commonly seen today is a CPU-GPU system.
Data layouts may include SoA (structure of arrays) and AoS (array of structures). Usually, the SoA layout is suitable for GPUs and the AoS layout is suitable for conventional CPUs. When a GPU processes data in the AoS layout, memory-coalescing issues may occur. In order to achieve coalesced memory access and high memory performance, the data needs to be transformed from the AoS layout to the SoA layout before it is processed by a GPU. Similarly, when a CPU processes data in the SoA layout, the CPU needs to gather data from different memory locations, which results in scattered memory accesses. Therefore, the data should be transformed from the SoA layout to the AoS layout before it is processed by the CPU.
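The difference between the two layouts, and the software transformation that the embodiments below replace with hardware, can be pictured with a minimal Python sketch (the field names and sizes are illustrative only, not part of the invention):

```python
# Illustrative only: a structure with three fields (x, y, z) stored
# for four array entries in the two layouts discussed above.

# AoS: the fields of each structure are contiguous (CPU-friendly).
aos = ["x0", "y0", "z0", "x1", "y1", "z1",
       "x2", "y2", "z2", "x3", "y3", "z3"]

def aos_to_soa(data, struct_size):
    """Software AoS-to-SoA transposition; this per-element gather
    is the kind of burden a hardware transpose unit removes."""
    return [data[i] for f in range(struct_size)
                    for i in range(f, len(data), struct_size)]

# SoA: all values of one field are contiguous, so a GPU reading one
# field across many entries gets coalesced memory accesses.
soa = aos_to_soa(aos, 3)
print(soa[:4])  # the 'x' field of all four entries
```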
Currently, the transformation of data from the SoA layout to the AoS layout, or vice versa, is performed by software, which is inefficient and places a burden on the CPU-GPU system.
In one embodiment of the present invention, an apparatus for transferring data in a computer comprises a transpose unit. The transpose unit comprises a plurality of banks, a write circuit, a plurality of selectors, and a parallel-to-serial circuit. Each bank has a plurality of storage units. The write circuit is coupled with each bank and a source memory. The write circuit is configured to perform selections on the plurality of banks for storing data from the source memory. The plurality of selectors correspond to the plurality of banks. Each selector comprises an output and a plurality of inputs respectively coupled with the plurality of storage units of the corresponding bank. The parallel-to-serial circuit is coupled with the plurality of selectors and a destination memory. The outputs of the plurality of selectors connect in parallel with the parallel-to-serial circuit.
In one embodiment of the present invention, a method for transferring data in a computer is disclosed. The method comprises providing data in a source memory, wherein the data is formed by a plurality of data segments corresponding to each other and each data segment comprises a plurality of data elements; simultaneously retrieving one data element from each data segment; converting the retrieved data elements into a serial data stream; and transferring the serial data stream to a destination memory.
The objectives and advantages of the present invention will become apparent upon reading the following description and upon referring to the accompanying drawings in which:
Referring to
The control unit 21 is configured to control the at least one transpose unit 22 and two address generators 23 and 24, and to communicate with the CPU 11 to receive a signal for initializing the transfer of data between the main memory 12 and the GPU global memory 13. When the CPU 11 receives a call from an API (Application Programming Interface), the CPU 11 sends a signal to the control unit 21 through the system bus 10, and the control unit 21 accordingly begins transferring data. The CPU 11 may continue with its next job without waiting for the transfer to complete. In the case of a non-blocking instruction, the CPU 11 does not wait for the completion of the data transfer. For an instruction that needs the transferred data, the CPU 11 may wait for a completion signal from the control unit 21.
In one embodiment, the control unit 21 is configured to send a signal to the CPU 11 or GPU when the transfer of data is completed.
In one embodiment, after the apparatus 2 receives a data-transfer signal from the CPU 11, the apparatus 2 schedules the memory access and/or transpose operations on scalable data sets in a pipelined manner.
The transpose unit 22 is configured to convert the layout of data from AoS (array of structures) to SoA (structure of arrays), or vice versa. Although the embodiment of
The transpose unit 22 may be designed to transpose data of a predetermined size (n, m). The predetermined size (n, m) can be determined based on the memory bandwidths at which data is read from or stored into the main memory 12 by the CPU 11, and read from or stored into the GPU global memory 13 by the GPU. In one embodiment, supposing that the CPU memory bandwidth is c bytes, the GPU global memory bandwidth is g bytes, and each data element has a width of w bytes, the following relationship can be obtained:
(n×w, m×w)=(g, c)
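As a numeric illustration of this relationship (the bandwidth figures below are assumed for the example, not taken from any particular system): with g = 32 bytes, c = 16 bytes, and 4-byte elements (w = 4), the transpose size is (n, m) = (g/w, c/w) = (8, 4):

```python
def transpose_size(g_bytes, c_bytes, w_bytes):
    """Derive (n, m) from (n*w, m*w) = (g, c); the bandwidths are
    assumed to be whole multiples of the element width w."""
    assert g_bytes % w_bytes == 0 and c_bytes % w_bytes == 0
    return g_bytes // w_bytes, c_bytes // w_bytes

n, m = transpose_size(g_bytes=32, c_bytes=16, w_bytes=4)
print((n, m))  # (8, 4)
```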
In one embodiment, the apparatus 2 has only one transpose unit 22.
In another embodiment, the apparatus 2 can have two transpose units 22. The control unit 21 can determine whether to use one or two transpose units 22 according to the size (A, S) of the source data to be transferred, wherein A is the array size of the source data and S is the structure size of the source data. If the size (A, S) of the source data satisfies the following conditions (1) and (2), the control unit 21 uses one transpose unit 22 to transpose data; if the size (A, S) of the source data satisfies the following condition (3), the control unit 21 uses two transpose units 22 to transpose data.
(A,S)=(n, m) (1)
(A,S)=(sn, sm) where 0<sn×sm<n×m (2)
(A,S)=(ln, lm) where ln×lm>n×m (3)
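The selection rule of conditions (1) to (3) can be sketched as follows (a simplified reading in which a source no larger than one (n, m) tile uses one transpose unit and a strictly larger source uses two; the function name is illustrative):

```python
def transpose_units_needed(A, S, n, m):
    """Conditions (1) and (2): the (A, S) source fits within one
    (n, m) tile, so one transpose unit suffices.  Condition (3):
    the source exceeds one tile, so two units are used."""
    if A * S <= n * m:      # conditions (1) and (2)
        return 1
    return 2                # condition (3)

print(transpose_units_needed(A=8, S=4, n=8, m=4))    # 1
print(transpose_units_needed(A=64, S=16, n=8, m=4))  # 2
```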
In one embodiment, the control unit 21 uses two transpose units 22, and the two transpose units 22 are operated in a ping-pong like manner.
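The ping-pong operation can be pictured schematically (a software sketch of the alternation, not the hardware timing): while one transpose unit is drained toward the destination, the other is filled from the source, and the roles swap on every tile:

```python
def pingpong_transfer(tiles):
    """Schematic ping-pong: two buffers alternate between the
    'fill' role and the 'drain' role on successive tiles."""
    buffers = [[], []]
    drained = []
    fill = 0
    for tile in tiles:
        buffers[fill] = list(tile)         # fill one unit
        drained.extend(buffers[1 - fill])  # drain the other unit
        fill = 1 - fill                    # swap roles
    drained.extend(buffers[1 - fill])      # drain the final tile
    return drained

print(pingpong_transfer([[1, 2], [3, 4], [5, 6]]))  # [1, 2, 3, 4, 5, 6]
```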
In one embodiment, the control unit 21 is configured to control the transpose unit 22 to transfer data in an out-of-order manner, as shown in
Referring to
The write circuit 222 is configured to perform selections on the plurality of banks 221 for storing data from the main memory 12. The write circuit 222 can be controlled by the control unit 21. In one embodiment, the write circuit 222 selects one bank 221 at each clock cycle for storing data.
As shown in
Each bank 221 may comprise a plurality of storage units 2211. Each storage unit 2211 can store one of a plurality of sequentially arranged data elements in the main memory 12. In one embodiment, the element is one-byte data. In one embodiment, the number of the plurality of storage units 2211 is equivalent to the maximum number of data elements that can be read from or stored into the main memory 12 by the CPU 11.
Moreover, each selector 223 comprises a plurality of inputs 2231 that are respectively coupled with the plurality of storage units 2211 of the corresponding bank 221. The selector 223 is configured to individually output the data elements stored in the plurality of storage units 2211 of the corresponding bank 221.
Each selector 223 has an output, and the outputs of the plurality of selectors 223 connect in parallel with the parallel-to-serial circuit 224. The control unit 21 sends a Sel_out signal to each selector 223 when the transpose unit 22 is ready for outputting data, or when it is full. After receiving the Sel_out signal, each selector 223 outputs one w-byte data element, so that m×w bytes of data (i.e., m data elements) are generated in total. When the parallel-to-serial circuit 224 receives the data elements output from the plurality of selectors 223, the parallel-to-serial circuit 224 converts the data elements received in parallel into a serial data stream, which is then sent to the GPU global memory 13 via a serial output of the parallel-to-serial circuit 224.
In one embodiment, the selector 223 is a multiplexer. In one embodiment, at each cycle, after each selector 223 receives the Sel_out signal, data elements in the storage units 2211 having the same index number are outputted.
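This read-out path can be modeled behaviorally as follows (a software sketch of the description above, not RTL; the bank and element counts are illustrative): on each cycle, every selector outputs the storage unit at the same index of its bank, and the parallel-to-serial circuit emits those parallel outputs one after another:

```python
def read_out(banks):
    """Model of the selector / parallel-to-serial read-out: for each
    index, every selector outputs the element at that index of its
    bank, and the parallel outputs are serialized in order."""
    stream = []
    depth = len(banks[0])            # storage units per bank
    for index in range(depth):       # one Sel_out per index
        parallel = [bank[index] for bank in banks]  # selector outputs
        stream.extend(parallel)      # parallel-to-serial conversion
    return stream

# Four banks, each holding one data segment of an AoS source:
banks = [["x0", "y0", "z0"],
         ["x1", "y1", "z1"],
         ["x2", "y2", "z2"],
         ["x3", "y3", "z3"]]
print(read_out(banks))  # SoA order: all x's, then y's, then z's
```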
Referring to
Referring to
Referring to
Alternatively, in one embodiment, the apparatus 2 can be employed to convert data when the data is being transferred from a GPU memory as a source memory to a CPU main memory as a destination memory. In another embodiment, the above-mentioned apparatus 2 is configured to bidirectionally transfer and convert data between a memory closely interacting with a CPU and a memory closely interacting with a GPU, and in such embodiment, a transpose unit (TU) 22 will have a number s of banks 221 and each bank 221 will have a number s of registers or storage units 2211, wherein s=min(n, m).
The present invention provides a method for transferring data in a computer. The method can be implemented on a computer comprising a processor and a memory coupled with the processor. The method can be employed to transfer data from a main memory closely interacting with a central processing unit to a GPU global memory, or vice versa. The data is formed by a plurality of closely arranged data segments. The data segments correspond to each other; namely, the data segments have the same number of data elements, and the data elements of the data segments at the same corresponding positions have the same type. In one embodiment, the data is in the AoS layout. Each data segment is a structure and the plurality of data segments form an array.
In the embodiment of
The method then simultaneously retrieves one data element from each data segment to obtain a plurality of data elements arranged in parallel. Next, the method converts the retrieved data elements into a serial data stream and then sends the serial data stream to the GPU global memory.
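The steps just described can be sketched compactly (an illustrative software model, with data segments represented as Python lists):

```python
def transfer(segments):
    """Retrieve one element from each segment at a time (zip),
    serialize the retrieved parallel elements, and append them to
    the destination, yielding the SoA order."""
    destination = []
    for parallel_elements in zip(*segments):   # one element per segment
        destination.extend(parallel_elements)  # serial data stream
    return destination

segments = [[1, 2, 3], [10, 20, 30]]  # two corresponding data segments
print(transfer(segments))             # [1, 10, 2, 20, 3, 30]
```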
In one embodiment, the method retrieves data elements from the same corresponding position of each data segment to generate the serial data stream.
In one embodiment, the data elements of each serial data stream have the same type.
In one embodiment, the method uses one transpose unit having a plurality of banks each having a plurality of storage units to store data elements. In one embodiment, data elements of each data segment are stored in a corresponding bank. In one embodiment, the method uses two transpose units to store data elements, and corresponding banks of the two transpose units store data elements of a data segment.
In one embodiment, as shown in
In one embodiment, the method uses a selector 223, such as a multiplexer, to simultaneously retrieve one data element from each data segment or the banks of a transpose unit.
In one embodiment, the method uses a shift register to convert the retrieved data elements into a serial data stream.
In one embodiment, the method uses an address generator to determine an address for the data.
In one embodiment, the method uses an address generator to determine an address for transferred data in the global memory.
The data structures and code described in this detailed description are typically stored on a non-transitory computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The non-transitory computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing code and/or data now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a non-transitory computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the non-transitory computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code stored within the non-transitory computer-readable storage medium. Furthermore, the methods and processes described above can be included in hardware modules. For example, the hardware modules can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), and other programmable-logic devices now known or later developed. When the hardware modules are activated, the hardware modules perform the methods and processes included within the hardware modules.
The above-described embodiments of the present invention are intended to be illustrative only. Those skilled in the art may devise numerous alternative embodiments without departing from the scope of the following claims.