1. Field of the Invention
The present invention pertains in general to computation methods and more particularly to a computer system and computer-implemented method for computational acceleration of seismic data processing.
2. Discussion of Related Art
Seismic data processing, including three-dimensional (3D) and four-dimensional (4D) seismic data processing and depth imaging applications, is generally compute- and time-intensive due to the number of points involved in the calculation. For example, as many as a billion points (10^9 points) can be used in a computation. Generally, the greater the number of points, the longer the period of time required to perform the calculation. The calculation time can be reduced by increasing computational resources, for example by using multi-processor computers or by performing the calculation in a networked distributed computing environment.
Over the past decades, increases in central processing unit (CPU) speed have been used to boost computer capability so as to meet the computation requirements of seismic exploration. However, CPU speed is reaching a limit, and further improvement is becoming increasingly difficult. Computing systems using multiple cores or multiple processors are used to deliver unprecedented computational power. However, the performance gained by the use of multi-core processors is strongly dependent on software algorithms and their implementation. Conventional geophysical applications do not realize large speedup factors due to a lack of interaction or synergy between CPU processing power and parallelization of software.
The present invention addresses various issues relating to the above.
An aspect of the present invention is to provide a computer-implemented method for computational acceleration of seismic data processing. The method includes defining a specific non-uniform memory access (NUMA) scheduling for a plurality of cores in a processor according to data to be processed; and running two or more threads through each of the plurality of cores.
Another aspect of the present invention is to provide a system for computational acceleration of seismic data processing. The system includes a processor having a plurality of cores. A specific non-uniform memory access (NUMA) scheduling for the plurality of cores is defined according to data to be processed, and each of the plurality of cores is configured to run two or more of a plurality of threads.
Yet another aspect of the present invention is to provide a computer-implemented method for increasing processing speed in geophysical data computation. The method includes storing geophysical data in a computer readable memory; applying a geophysical process to the geophysical data for processing using a processor; defining a specific non-uniform memory access scheduling for a plurality of cores in the processor according to data to be processed by the processor; and running two or more threads through each of the plurality of cores.
Although the various steps of the method are described in the above paragraphs as occurring in a certain order, the present application is not bound by the order in which the various steps occur. In fact, in alternative embodiments, the various steps can be executed in an order different from the order described above or otherwise herein.
These and other objects, features, and characteristics of the present invention, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. In one embodiment of the invention, the structural components illustrated herein are drawn to scale. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
In the accompanying drawings:
In order to accelerate seismic processing and imaging applications or other data-intensive applications, different levels of parallelism and optimized memory usage can be implemented.
A cache memory is used by a core to reduce the average time to access main memory. The cache memory is a faster memory which stores copies of the data from the most frequently used main memory locations. When a core needs to read from or write to a location in main memory, the core first checks whether a copy of that data is in the cache memory. If a copy of the data is stored in the cache memory, the core reads from or writes to the cache memory, which is faster than reading from or writing to main memory. Most cores have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data.
For instance, in the example shown in
As shown in
In one embodiment, hyper-threading is implemented in new generation high-performance computing (HPC) machines such as those based on the Nehalem (e.g., using the Core i7 family) and Westmere (e.g., using the Core i3, i5 and i7 families) micro-architectures of Intel Corporation. Although the hyper-threading process is described herein as being implemented on one type of CPU family, the method described herein is not limited in any way to these examples of CPUs but can be implemented on any type of CPU architecture including, but not limited to, CPUs manufactured by Advanced Micro Devices (AMD) Corporation, Motorola Corporation, or Sun Microsystems Corporation, etc.
Because a geophysical dataset contains a very large number of data points and not enough fast cache memory is available to hold all of the data, the method further includes cache blocking the data among the cache memories allocated to the plurality of cores to divide the whole dataset into small data chunks or blocks, at S14. In one embodiment, a block of data fits within a cache memory allocated to a core. For example, in one embodiment, a first block of data fits into cache memory (1) 21, a second block of data fits into cache memory (2) 22, a third block of data fits into cache memory (3) 23, and a fourth block of data fits into cache memory (4) 24. In another embodiment, one or more data blocks can be assigned to one core. For example, two, three or more data blocks can be assigned to core (1) 11. In that case, core (1) 11 will be associated with two, three or more cache memories instead of one cache memory. In one embodiment, cache blocking restructures frequent operations on a large data array by sub-dividing the large data array into smaller data blocks or arrays. Each data point within the data array is provided within one block of data.
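By way of non-limiting illustration, the cache blocking described above can be sketched in Python as follows. The function names `cache_blocks` and `assign_blocks`, the 1024-byte cache size, the 4-byte data point size, and the round-robin assignment of blocks to cores are assumptions made for this sketch only and are not taken from the described embodiments.

```python
def cache_blocks(n_points, cache_bytes, bytes_per_point=4):
    """Return (start, stop) index ranges that tile range(n_points).

    Each range holds at most cache_bytes // bytes_per_point points, so one
    block fits in a cache of the given (assumed) size.
    """
    points_per_block = cache_bytes // bytes_per_point
    if points_per_block < 1:
        raise ValueError("cache is smaller than one data point")
    return [(start, min(start + points_per_block, n_points))
            for start in range(0, n_points, points_per_block)]


def assign_blocks(blocks, n_cores):
    """Deal blocks out to cores round-robin; a core may receive several blocks."""
    assignment = {core: [] for core in range(n_cores)}
    for i, block in enumerate(blocks):
        assignment[i % n_cores].append(block)
    return assignment


# Hypothetical sizes: 1000 points, 1024-byte cache, i.e. 256 points per block.
blocks = cache_blocks(n_points=1000, cache_bytes=1024)
per_core = assign_blocks(blocks, n_cores=4)
```

In this sketch every data point falls in exactly one block, and a dataset producing more blocks than cores would simply give some cores two or more blocks.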
The method further includes loading the plurality of data blocks into a plurality of single instruction multiple data (SIMD) registers (e.g., REG (1) 111 in core (1) 11, REG (2) 121 in core (2) 12, REG (3) 131 in core (3) 13 and REG (4) 141 in core (4) 14), at S16. Each data block is loaded into the SIMD registers of one core. In SIMD, a single instruction (e.g., addition, subtraction, etc.) is applied to multiple data elements of a block at once. In one embodiment, Streaming SIMD Extensions (SSE), a set of SIMD instructions for the x86 architecture designed by Intel Corporation, are applied to the data blocks so as to run the data-level vectorization computation. Different threads can be run with OpenMP or with POSIX Threads (Pthreads).
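By way of non-limiting illustration, the following Python sketch models the lane-wise behavior of a SIMD addition. Python does not expose hardware SIMD registers, so this is a scalar model only; the four-lane width (corresponding to four 32-bit values in a 128-bit SSE register) and the names `load_registers`, `simd_add` and `add_blocks` are assumptions made for this sketch.

```python
LANES = 4  # assumed register width: four 32-bit floats per 128-bit register


def load_registers(block):
    """Split a data block into LANES-wide tuples, zero-padding the last one."""
    padded = list(block) + [0.0] * (-len(block) % LANES)
    return [tuple(padded[i:i + LANES]) for i in range(0, len(padded), LANES)]


def simd_add(reg_a, reg_b):
    """Model one SIMD instruction: a single add applied to every lane."""
    return tuple(a + b for a, b in zip(reg_a, reg_b))


def add_blocks(block_a, block_b):
    """Add two equal-length data blocks register by register."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    out = []
    for reg_a, reg_b in zip(load_registers(block_a), load_registers(block_b)):
        out.extend(simd_add(reg_a, reg_b))
    return out[:len(block_a)]  # drop the padding lanes


result = add_blocks([1.0, 2.0, 3.0, 4.0, 5.0],
                    [10.0, 20.0, 30.0, 40.0, 50.0])
```

Here five data points occupy two modeled registers, so the whole block is added with two modeled instructions rather than five scalar additions.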
Seismic data processing and imaging applications using a multi-core platform pose numerous challenges. A first challenge may be in the temporal data dependence. Indeed, the geophysical process may include a temporally data-dependent process. A temporally data-dependent process comprises a time-domain tau-p transform process, a time-domain radon transform, time-domain data processing and imaging, or any combination of two or more of these processes. A tau-p transform is a transformation from a space-time domain into a wavenumber-shifted time domain (commonly described as the domain of intercept time, tau, and ray parameter or slowness, p). The tau-p transform can be used for noise filtering in seismic data. A second challenge may be in spatial stencil or spatially data-dependent computation. Indeed, the geophysical process may also include a spatially data-dependent process. The spatially data-dependent process includes a partial differential equation process (e.g., finite-difference modeling), an ordinary differential equation process (e.g., an eikonal solver), reservoir numerical simulation, or any combination of two or more of these processes.
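By way of non-limiting illustration, a time-domain tau-p transform can be sketched in Python as a simple slant stack, in which each output sample sums input samples taken along the line t = tau + p * x. The function name `tau_p_transform`, the nearest-sample lookup, and the small synthetic dataset are assumptions made for this sketch; a production implementation would typically interpolate between samples.

```python
def tau_p_transform(data, offsets, dt, slownesses):
    """Time-domain slant stack.

    data[ix][it] is the sample of trace ix at time it * dt, and offsets[ix]
    is the offset x of that trace. Returns out[ip][itau], the sum over all
    traces of d(x, tau + p * x), using nearest-sample lookup and skipping
    samples that fall outside the recorded time window.
    """
    n_t = len(data[0])
    out = [[0.0] * n_t for _ in slownesses]
    for ip, p in enumerate(slownesses):
        for ix, x in enumerate(offsets):
            shift = int(round(p * x / dt))  # time shift, in samples, for this trace
            for itau in range(n_t):
                it = itau + shift
                if 0 <= it < n_t:
                    out[ip][itau] += data[ix][it]
    return out


# Synthetic example: three traces holding one linear event t = 1 + 1.0 * x.
traces = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # x = 0, spike at t = 1
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # x = 1, spike at t = 2
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # x = 2, spike at t = 3
]
panel = tau_p_transform(traces, offsets=[0.0, 1.0, 2.0], dt=1.0,
                        slownesses=[0.0, 1.0, 2.0])
```

In this example the event stacks coherently only at p = 1.0 and tau = 1, which is the property that allows events of differing slowness to be separated for noise filtering.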
In one embodiment, to tackle the first challenge and perform the tau-p computation, for example, several copies of the original input datasets are generated and reorganized. The different data copies can be combined. In order to minimize memory access latency and data misses, the method includes cache blocking the data by dividing it into a plurality of blocks of data. In one embodiment, the data is divided into data blocks and fetched into an L1/L2 cache memory for fast access. The data blocks are then transferred via a pipeline technique to assigned SIMD registers to achieve SIMD computation and hence accelerate the overall data processing.
In one embodiment, to tackle the second challenge and perform the stencil computation, data are reorganized to take full advantage of memory hierarchies. First, the entire data set (e.g., provided in three dimensions) is partitioned into smaller data blocks. By partitioning into smaller data blocks (i.e., by cache blocking), capacity misses at different levels of cache memory (for example, the L3 cache) can be prevented.
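By way of non-limiting illustration, the following Python sketch applies a five-point finite-difference stencil to a two-dimensional grid one tile at a time and checks the result against an untiled sweep. The two-dimensional grid (used in place of a three-dimensional volume for brevity), the tile size of 8, and the names `laplacian_point` and `laplacian_blocked` are assumptions made for this sketch. In Python the tiling changes only the traversal order; the cache benefit would be realized in a compiled implementation.

```python
def laplacian_point(grid, i, j):
    """Five-point finite-difference stencil at interior point (i, j)."""
    return (grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            - 4.0 * grid[i][j])


def laplacian_blocked(grid, block=8):
    """Apply the stencil to all interior points, one block x block tile at a time."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i0 in range(1, n_rows - 1, block):              # loop over tiles
        for j0 in range(1, n_cols - 1, block):
            for i in range(i0, min(i0 + block, n_rows - 1)):   # loop inside a tile
                for j in range(j0, min(j0 + block, n_cols - 1)):
                    out[i][j] = laplacian_point(grid, i, j)
    return out


n = 20
grid = [[float(i * i + j * j) for j in range(n)] for i in range(n)]
blocked = laplacian_blocked(grid, block=8)

# Reference: the same stencil swept over the interior without tiling.
plain = [[0.0] * n for _ in range(n)]
for i in range(1, n - 1):
    for j in range(1, n - 1):
        plain[i][j] = laplacian_point(grid, i, j)
```

Because each output point depends only on the input grid, the tiled and untiled sweeps produce identical results, so the tile size can be tuned to the cache capacity without affecting correctness.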
Furthermore, in one embodiment, each data block can be further partitioned into a series of thread blocks, with each thread running through a single thread block (each thread block can be dedicated to one thread). By further partitioning each block into a series of thread blocks, each thread can fully exploit the locality within the shared cache or local memory. For example, in the case discussed above where two threads are run through one core (e.g., core (1) 11), the cache memory 21 associated with this core (core (1) 11) can be further partitioned or divided into two thread blocks wherein each thread block is dedicated to one of the two threads.
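By way of non-limiting illustration, the partitioning of a data block into thread blocks can be sketched in Python with the standard `threading` module. The function names `thread_blocks` and `process_block`, and the squaring operation applied to each sample, are assumptions made for this sketch. CPython threads do not execute Python bytecode in parallel, so the sketch shows only the partitioning and the dedication of one thread block to one thread.

```python
import threading


def thread_blocks(start, stop, n_threads):
    """Split [start, stop) into n_threads contiguous, nearly equal sub-ranges."""
    size, extra = divmod(stop - start, n_threads)
    ranges, lo = [], start
    for t in range(n_threads):
        hi = lo + size + (1 if t < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def process_block(data, start, stop, n_threads=2):
    """Square every sample of data[start:stop] in place, one thread per thread block."""
    def work(lo, hi):
        for i in range(lo, hi):
            data[i] = data[i] * data[i]

    threads = [threading.Thread(target=work, args=bounds)
               for bounds in thread_blocks(start, stop, n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


samples = [float(i) for i in range(10)]
process_block(samples, 0, 10, n_threads=2)  # two threads, as for one hyper-threaded core
```

Because the thread blocks do not overlap, the two threads never write to the same sample and no locking is required.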
Additionally, in another embodiment, each thread block can be decomposed into register blocks, and the register blocks can be processed using SIMD through a plurality of registers within each core. By decomposing each thread block into register blocks, data-level parallelism through SIMD may be used. For each computation step (e.g., mathematical operation), the input and output grids or points are each individually allocated as one large array. Because a NUMA system uses a “first touch” page mapping policy, a parallel initialization routine is used to initialize the data. The use of the “first touch” page mapping policy enables allocating memory close to the thread which initializes the memory. In other words, memory is allocated on a node close to the node containing the core on which the thread is running. Each data point is thereby assigned to the correct thread block. In one embodiment, when using NUMA-aware allocation, the computation speed is approximately doubled.
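By way of non-limiting illustration, the ownership pattern behind parallel "first touch" initialization can be modeled in Python as follows. Python does not control physical page placement, so this sketch models only the rule that the worker which first writes a point is the same worker that later computes on it; the index `tid` stands in for a thread pinned to one core, and the names `owned_range`, `initialize`, `compute` and `first_toucher`, together with the array sizes, are assumptions made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

N_POINTS, N_THREADS = 16, 4  # hypothetical sizes


def owned_range(tid):
    """Contiguous index range that worker `tid` both initializes and computes."""
    chunk = N_POINTS // N_THREADS
    return range(tid * chunk, (tid + 1) * chunk)


grid_in = [None] * N_POINTS        # one large input array, not yet touched
grid_out = [None] * N_POINTS       # one large output array
first_toucher = [None] * N_POINTS  # which worker wrote each point first


def initialize(tid):
    """Parallel initialization: each worker touches only its own range."""
    for i in owned_range(tid):
        grid_in[i] = float(i)
        first_toucher[i] = tid


def compute(tid):
    """Compute on the same range; report whether every point is locally owned."""
    for i in owned_range(tid):
        grid_out[i] = 2.0 * grid_in[i]
    return all(first_toucher[i] == tid for i in owned_range(tid))


with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    list(pool.map(initialize, range(N_THREADS)))         # parallel initialization
    local = list(pool.map(compute, range(N_THREADS)))    # same partition for the work
```

On a NUMA system with pinned threads, using the same partition for initialization and for computation in this manner is what would keep each thread's data on memory near the core running that thread.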
As shown in
In one embodiment, the method is implemented as a series of instructions which can be executed by a processing device within a computer. As can be appreciated, the term “computer” is used herein to encompass any type of computing system or device including a personal computer (e.g., a desktop computer, a laptop computer, or any other handheld computing device), or a mainframe computer (e.g., an IBM mainframe).
For example, the method may be implemented as a software program application which can be stored in a computer readable medium such as hard disks, CDROMs, optical disks, DVDs, magnetic optical disks, RAMs, EPROMs, EEPROMs, magnetic or optical cards, flash cards (e.g., a USB flash card), PCMCIA memory cards, smart cards, or other media. The program application can be used to program and control the operation of one or more CPUs having multiple cores.
Alternatively, a portion or the whole software program product can be downloaded from a remote computer or server via a network such as the internet, an ATM network, a wide area network (WAN) or a local area network.
Although the invention has been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred embodiments, it is to be understood that such detail is solely for that purpose and that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present invention contemplates that, to the extent possible, one or more features of any embodiment can be combined with one or more features of any other embodiment.
Furthermore, since numerous modifications and changes will readily occur to those of skill in the art, it is not desired to limit the invention to the exact construction and operation described herein. Accordingly, all suitable modifications and equivalents should be considered as falling within the spirit and scope of the invention.