The present invention relates generally to methods, systems, and apparatuses for calculating a Laplacian pyramid in a computing environment where operations related to computing individual pyramid layers may be performed in parallel. The disclosed methods, systems, and apparatuses may be used in, for example, various medical imaging applications.
An image pyramid is a type of multi-scale signal representation in which an image is subjected to repeated smoothing and subsampling. Several different types of image pyramids are known in the art. For example, in a Gaussian pyramid, each layer is a low-pass, Gaussian-filtered, and downsampled version of the previous layer. Downsampling is performed by removing (or otherwise ignoring) every second row and column of the image data. Another type of pyramid, referred to as a Laplacian pyramid, additionally involves upsampling. Each layer of the Laplacian pyramid, except for the coarsest one, stores the difference between the corresponding layer of the Gaussian pyramid and an upsampled, filtered version of the next coarser layer; the zero layer stores the difference between the upsampled coarser layer and the initial image. The coarsest layer of the Laplacian pyramid is the low-pass image, the same as in the Gaussian pyramid. To build the i-th layer of the Laplacian pyramid, the next, smaller, (i+1)-th layer of the Gaussian pyramid is upsampled and filtered, and then the difference between the i-th Gaussian pyramid layer and the upsampled, low-pass filtered (i+1)-th layer is computed. Upsampling is accomplished by inserting a zero row and a zero column after each existing row and column. The upsampled image may then be convolved with a Gaussian filter.
The Laplacian pyramid decomposition of an image is a common starting point for many multi-scale imaging algorithms in areas such as image enhancement, coding, stitching, and restoration. Because its implementation is time consuming and involves expensive convolutions and upsampling/downsampling steps, it can easily become the bottleneck of an entire image processing application. Therefore, it would be desirable to reduce the time required for performing Laplacian pyramid calculations so that the overall processing time of the image processing application can be minimized.
Embodiments of the present invention address and overcome one or more of the above shortcomings and drawbacks by providing methods, systems, and apparatuses which calculate a Laplacian pyramid using a parallel computing platform. Briefly, the operations associated with each pyramid layer are performed in parallel, and certain operations may be combined to minimize memory access requirements. This technology may be applied to various image processing applications. For example, for medical fluoroscopy procedures, the technology described herein may be applied to reduce the time required to process acquired images.
According to some embodiments, a first computer-implemented method for calculating a Laplacian pyramid in an image processing system comprising a parallel computing platform includes constructing a first layer of a Gaussian pyramid based on an original image. A plurality of Laplacian pyramid layers are constructed using a plurality of device kernels executing on a graphical processing device included in the parallel computing platform. Each respective Laplacian pyramid layer is constructed by a process implemented by one or more first device kernels and one or more second device kernels. The first device kernels are used to calculate a Gaussian pyramid layer based on an immediately preceding Gaussian pyramid layer. The second device kernels are used to calculate the respective Laplacian pyramid layer based on the immediately preceding Gaussian pyramid layer in parallel with calculation of the Gaussian pyramid layer. In one embodiment, two or more of the plurality of Laplacian pyramid layers may be calculated in parallel using the parallel computing platform.
In some embodiments of the aforementioned first method for calculating a Laplacian pyramid, each respective Gaussian pyramid layer is calculated using a single operation on a respective computation unit. The single operation combines upsampling the immediately preceding Gaussian pyramid layer to yield an upsampled layer and convolving the upsampled layer with a Gaussian filter to yield the Gaussian pyramid layer. In some embodiments, the single operation further comprises downsampling the Gaussian pyramid layer. In some embodiments, convolving the upsampled layer with the Gaussian filter to yield the Gaussian pyramid layer includes computing a plurality of horizontal convolutions using a horizontal filter and the upsampled layer and computing a plurality of vertical convolutions using a vertical filter and the upsampled layer. In one embodiment, the horizontal convolutions and the vertical convolutions are computed in separate device kernels included in the one or more first device kernels.
In some embodiments of the aforementioned first method for calculating a Laplacian pyramid, each respective Laplacian pyramid layer is calculated using a single operation on a respective computation unit. In this context, the single operation includes the steps of upsampling the immediately preceding Gaussian pyramid layer to yield an upsampled layer; smoothing the upsampled layer to yield a smoothed upsampled layer; and subtracting the smoothed upsampled layer from the original image or from a corresponding layer of the Gaussian pyramid to yield the respective Laplacian pyramid layer. In one embodiment, smoothing the upsampled layer to yield the smoothed upsampled layer includes the steps of computing a plurality of horizontal convolutions using a horizontal filter and the upsampled layer and computing a plurality of vertical convolutions using a vertical filter and the upsampled layer. The horizontal convolutions and the vertical convolutions may be computed in separate device kernels included in the one or more second device kernels.
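The horizontal/vertical split referenced above relies on the separability of the Gaussian filter: convolving with a 5×5 kernel that is the outer product of two 1-D kernels gives the same result as a 1×5 pass followed by a 5×1 pass. A small NumPy check of this equivalence (illustrative names, zero padding at the borders, symmetric binomial taps standing in for the Gaussian):

```python
import numpy as np

k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # illustrative 1-D taps
full = np.outer(k, k)                            # the equivalent 5x5 filter

def conv2d(img, kern):
    """Direct 2-D 'same' convolution with zero padding (reference only)."""
    kh, kw = kern.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kern[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def separable(img):
    """Horizontal 1x5 pass followed by a vertical 5x1 pass."""
    h = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, h)
```

The separable form replaces 25 multiply-adds per pixel with 10, which is why the two 1-D passes are the natural unit of work to assign to separate device kernels.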
According to other embodiments, a second computer-implemented method for calculating a Laplacian pyramid in an image processing system comprising a host computing unit and a graphical processing device includes copying an original image from a host memory at the host computing unit to a portion of device memory on the graphical processing device and constructing a first layer of a Gaussian pyramid based on the original image. A plurality of device kernels is executed on the graphical processing device to calculate the Laplacian pyramid. Each respective layer in the Laplacian pyramid is calculated using a set of device kernels. One or more first kernels in the set are configured to calculate a Gaussian pyramid layer based on an immediately preceding Gaussian pyramid layer. One or more second kernels in the set are configured to calculate a respective Laplacian pyramid layer based on the immediately preceding Gaussian pyramid layer. After the Laplacian pyramid is calculated, it is copied from the portion of device memory on the graphical processing device to the host memory.
Various additional features and/or enhancements may be added to the aforementioned second computer-implemented method for calculating a Laplacian pyramid. For example, in some embodiments, prior to copying the original image to the portion of device memory, the portion of device memory on the graphical processing device is allocated based on a size of the original image. After executing the plurality of device kernels, the portion of device memory is deallocated. In some embodiments, the set of device kernels described in the method is executed in parallel on the graphical processing device. In some embodiments, the first layer of the Gaussian pyramid is constructed at the graphical processing device using a third kernel configured to calculate the first layer of the Gaussian pyramid based on the original image. In some embodiments, each respective device kernel in the plurality of device kernels is executed independently by a distinct grid of thread blocks on the graphical processing device. In some embodiments, a plurality of second kernels is configured to calculate a plurality of Laplacian pyramid layers in parallel.
Similar to the first computer-implemented method for calculating a Laplacian pyramid described above, in some embodiments of the second method, each respective first kernel is configured to calculate the Gaussian pyramid layer based on the immediately preceding Gaussian pyramid layer using a single operation which combines upsampling the immediately preceding Gaussian pyramid layer, convolving the upsampled layer with a Gaussian filter to yield the Gaussian pyramid layer, and downsampling the Gaussian pyramid layer. Also, in some embodiments, each respective second kernel is configured to calculate the respective Laplacian pyramid layer using a single operation which combines upsampling the immediately preceding Gaussian pyramid layer, smoothing the upsampled layer to yield a smoothed upsampled layer, and subtracting the smoothed upsampled layer from the original image to yield the respective Laplacian pyramid layer.
In other embodiments, a system for calculating a Laplacian pyramid includes a processor and a graphical processing device. The processor is configured to construct a first layer of a Gaussian pyramid based on an original image. The graphical processing device is configured to execute a plurality of device kernels to calculate the Laplacian pyramid. Each respective Laplacian pyramid layer is calculated using a set of device kernels. One or more first device kernels in the set are configured to calculate a Gaussian pyramid layer based on an immediately preceding Gaussian pyramid layer. One or more second device kernels in the set are configured to calculate the respective Laplacian pyramid layer based on the immediately preceding Gaussian pyramid layer.
In one embodiment of the aforementioned system, each respective first device kernel is configured to calculate the Gaussian pyramid layer based on the immediately preceding Gaussian pyramid layer using a single operation which combines upsampling the immediately preceding Gaussian pyramid layer, convolving the upsampled layer with a Gaussian filter to yield the Gaussian pyramid layer, and downsampling the Gaussian pyramid layer. In another embodiment of the system, each respective second device kernel is configured to calculate the respective Laplacian pyramid layer using a single operation which combines upsampling the immediately preceding Gaussian pyramid layer, smoothing the upsampled layer to yield a smoothed upsampled layer, and subtracting the smoothed upsampled layer from the original image to yield the respective Laplacian pyramid layer.
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying drawings.
The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
The following disclosure describes the present invention according to several embodiments directed at performing fast computation of the Laplacian pyramid using a parallel computing platform and programming model such as the NVIDIA™ Compute Unified Device Architecture (CUDA). The techniques described herein are based, in part, on application of the principle of out-of-order execution to the computation of pyramid layers. Computation of the Laplacian layer i only utilizes one preexisting layer i+1 of the Gaussian pyramid (assuming that layer i has been acquired in advance). Thus, computation of other layers of the Laplacian pyramid may be performed at other, more convenient times. For example, in some embodiments of the present invention a respective Laplacian pyramid layer may be computed in parallel with computation of the next layer of the Gaussian pyramid. The two operations are expensive and approximately equal in length, which makes them ideal candidates for parallel execution, as described herein. The invention is applicable to various image processing applications including, but not limited to, image denoising and compression.
The imaging system 100 may include one or more computing units (not shown in
Continuing with reference to
The present invention may be implemented across various computing architectures. In the example of
Parallel portions of an application may be executed on the memory architecture 200 as “device kernels” or simply “kernels.” A kernel comprises parameterized code configured to perform a particular function. The parallel computing platform is configured to execute these kernels in an optimal manner across the memory architecture 200 based on parameters, settings, and other selections provided by the user. Additionally, in some embodiments, the parallel computing platform may include additional functionality to allow for automatic processing of kernels in an optimal manner with minimal input provided by the user.
The processing required for each kernel is performed by a grid of thread blocks (described in greater detail below). Using concurrent kernel execution, streams, and synchronization with lightweight events, the memory architecture 200 of
The device 210 includes one or more thread blocks 230 which represent the computation unit of the device. The term thread block refers to a group of threads that can cooperate via shared memory and synchronize their execution to coordinate memory accesses. For example, in
Continuing with reference to
Each thread can have one or more levels of memory access. For example, in the memory architecture 200 of
Using the techniques described herein, a parallel computing platform and programming model (including components such as the memory architecture 200 illustrated in
When computing individual pyramid layers, a single operation may be executed wherein the upsampling or downsampling is combined with low-pass filtering into one step. For example, in some embodiments, each convolution involves a separable 5×5 Gaussian filter, which may be split into horizontal and vertical filters of sizes 5×1 and 1×5. Each combined operation then includes two passes through the image (horizontal and vertical), each pass including both the elements of the 1-D convolution and the upsampling or downsampling in the direction of the pass through the image.
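One way to realize such a fused pass is to evaluate the 1-D convolution only at the samples that survive the downsampling, so the full-resolution filtered image is never materialized. A NumPy sketch of that idea (illustrative names; edge padding at the borders is an assumption, not specified by the text):

```python
import numpy as np

K = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # separable 5x1 / 1x5 filter

def fused_reduce_pass(img, axis):
    """One pass: 1-D convolution along `axis`, evaluated only at every
    second sample, so filtering and downsampling happen in a single step."""
    pad = [(2, 2) if a == axis else (0, 0) for a in range(img.ndim)]
    padded = np.pad(img, pad, mode="edge")
    out_len = img.shape[axis] // 2
    idx = np.arange(out_len) * 2  # positions of the retained samples
    # Accumulate the five filter taps directly at the subsampled positions.
    return sum(K[t] * np.take(padded, idx + t, axis=axis) for t in range(5))

def fused_reduce(img):
    """Horizontal pass, then vertical pass, as in the combined operation."""
    return fused_reduce_pass(fused_reduce_pass(img, 1), 0)
```

Compared with filtering the whole image and then discarding half the rows and columns, this evaluates only the outputs that are kept, roughly halving the arithmetic of each pass and avoiding a round trip through memory for the intermediate image.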
Continuing with reference to
Continuing with reference to
As shown in
The computer system 710 also includes a system memory 730 coupled to the bus 721 for storing information and instructions to be executed by processors 720. The system memory 730 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 731 and/or random access memory (RAM) 732. The system memory RAM 732 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM). The system memory ROM 731 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM). In addition, the system memory 730 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 720. A basic input/output system 733 (BIOS) containing the basic routines that help to transfer information between elements within computer system 710, such as during start-up, may be stored in ROM 731. RAM 732 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 720. System memory 730 may additionally include, for example, operating system 734, application programs 735, other program modules 736 and program data 737.
The computer system 710 also includes a disk controller 740 coupled to the bus 721 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 741 and a removable media drive 742 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid state drive). The storage devices may be added to the computer system 710 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
The computer system 710 may also include a display controller 765 coupled to the bus 721 to control a monitor or display 766, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. The computer system includes an input interface 760 and one or more input devices, such as a keyboard 762 and a pointing device 761, for interacting with a computer user and providing information to the processor 720. The pointing device 761, for example, may be a mouse, a trackball, or a pointing stick for communicating direction information and command selections to the processor 720 and for controlling cursor movement on the display 766. The display 766 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 761.
The computer system 710 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 720 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 730. Such instructions may be read into the system memory 730 from another computer readable medium, such as a hard disk 741 or a removable media drive 742. The hard disk 741 may contain one or more datastores and data files used by embodiments of the present invention. Datastore contents and data files may be encrypted to improve security. The processors 720 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 730. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
As stated above, the computer system 710 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processor 720 for execution. A computer readable medium may take many forms including, but not limited to, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as hard disk 741 or removable media drive 742. Non-limiting examples of volatile media include dynamic memory, such as system memory 730. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the bus 721. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
The computing environment 700 may further include the computer system 710 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 780. Remote computing device 780 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 710. When used in a networking environment, computer system 710 may include modem 772 for establishing communications over a network 771, such as the Internet. Modem 772 may be connected to bus 721 via user network interface 770, or via another appropriate mechanism.
Network 771 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 710 and other computers (e.g., remote computing device 780). The network 771 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-11 or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 771.
The embodiments of the present disclosure may be implemented with any combination of hardware and software. In addition, the embodiments of the present disclosure may be included in an article of manufacture (e.g., one or more computer program products) having, for example, computer-readable, non-transitory media. The media has embodied therein, for instance, computer readable program code for providing and facilitating the mechanisms of the embodiments of the present disclosure. The article of manufacture can be included as part of a computer system or sold separately.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.