The invention relates to volume rendering, in particular to multi-planar reformatting (MPR) using a computer system that includes a graphics processing unit (GPU).
MPR volume rendering is a standard method of displaying two-dimensional (2D) representations of three-dimensional (3D) data sets collected by medical imaging equipment, such as computed tomography (CT) scanners, magnetic resonance (MR) scanners, ultrasound scanners and positron-emission-tomography (PET) systems. These 3D data sets are sometimes referred to as volume data. In the early days of medical imaging, rendering of volume data was performed on vendor-specific software and hardware associated with the scanner. However, it has for a number of years been well known to implement volume rendering on general purpose computers, for example standard personal computers and workstations, using application software that does not rely on any bespoke hardware.
Medical image data sets are generally large. Sizes of between 0.5 Gigabytes and 8 Gigabytes are not uncommon. For example, a medical image data set might comprise 1024×1024×1024 16-bit voxels which corresponds to approximately 2 Gigabytes of data. From this an image comprising 1024×1024 16-bit pixels might be rendered. Furthermore, a common desire when viewing rendered images is to generate a sequence of images (sometimes referred to as a cine) which is viewed by a user as a movie. By viewing a cine a trained user is able to form a three-dimensional mental image of the object represented by the volume data. To be efficient, a frame rate for a cine of 15 frames per second (fps) might be considered desirable, with higher frame rates being preferred. It is also preferable if cines can be generated and displayed in real time.
Medical volume rendering is thus highly computationally intensive and the processing power of a modern general purpose computer's CPU is often inadequate for performing the task at an acceptable speed.
Modern personal computers and workstations generally include a graphics card, and in most cases the graphics card includes a Graphics Processing Unit (GPU). In terms of aggregate processing power, modern GPUs typically outperform a computer's central processing unit (CPU) by roughly an order of magnitude.
Thus the present invention is based on the premise that it would be desirable to harness the processing power available in a GPU to perform the volume rendering process. This is not a new idea.
Although not originally designed with this use in mind, GPUs do have sufficient general programmability that they can be applied to the task of volume rendering, in particular, to volume rendering in medicine, where the task is usually to render images of the internal organs of human patients. However, while a GPU might have sufficient raw computing power to perform medical image rendering, it is nonetheless a difficult task to implement a practical GPU-based medical image renderer.
Difficulties arise because the medical image volume data is typically larger than the memory available on the graphics card supporting the GPU, and because of the limited bandwidth available for the transfer of data from the system memory associated with the CPU to the graphics card. A modern graphics card will typically have a memory of around 256 or 512 Megabytes, for example. Further difficulties arise because volume data are generally stored linearly in system memory. For example, volume data comprising voxels aligned with x- y- and z-co-ordinate axes will generally be stored such that neighboring voxels in the x-axis occupy neighboring locations in system memory, neighboring voxels in the y-axis are separated by one row of x-axis voxels, and neighboring voxels in the z-axis are separated by the number of voxels in an xy-plane. This difficulty is especially important in MPR volume rendering because MPR rendering frequently requires access to voxels arranged in an arbitrarily oriented plane which includes voxels spread throughout system memory and not in a contiguous series which would be easier to access. Although for a given MPR view it is possible to duplicate the volume data in system memory in a more appropriate order, this is generally undesirable because of the cost in memory overheads.
One known way to help address these difficulties is to re-order the voxels in system memory into a series of regular gridded blocks, for example as described by Lichtenbelt et al. in “Introduction to Volume Rendering”, Hewlett-Packard Company, Prentice-Hall PTR, New Jersey, 1998 [1]. By doing this, groups of neighboring voxels in volume space can be more closely located in system memory. Furthermore, the size of individual blocks of voxels can be selected such that they can be processed separately from one another during rendering within the memory available on the graphics card supporting the GPU.
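By way of illustration, the mapping from a voxel's co-ordinates to its linear storage location, and to its block and within-block offset after re-ordering into regular gridded blocks, might be sketched as follows (an illustrative Python sketch; the function names are chosen for illustration only and do not form part of the referenced scheme):

```python
def linear_index(x, y, z, nx, ny):
    # Conventional linear storage: x varies fastest, then y, then z,
    # for a volume of nx voxels along x and ny voxels along y.
    return x + nx * (y + ny * z)

def block_of(x, y, z, b):
    # For cubic blocks of b*b*b voxels on a regular grid, return the
    # block co-ordinates and the voxel's offset within that block.
    return (x // b, y // b, z // b), (x % b, y % b, z % b)
```

For example, with 32-voxel blocks the voxel at (33, 5, 70) falls in block (1, 0, 2) at within-block offset (1, 5, 6).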
However, while Lichtenbelt et al.'s scheme can help to some extent in efficiently outsourcing the rendering computation from the CPU to the GPU, a number of performance restrictions remain. In particular, the system bus traffic associated with the transfer of blocks of voxels from system memory to the graphics card for processing by the GPU is highly variable. In the case that the block(s) containing the voxels required for rendering at a given instant are loaded in memory on the graphics card, the rendering process can proceed at the rate at which the GPU is able to process them. However, as voxels which are in a block which is not loaded to the graphics card memory become required by the rendering algorithm, it is necessary for the processing to halt while the new block is retrieved from system memory by the CPU and transferred to the graphics card for use by the GPU. This can lead to stilted and jerky performance, especially during real-time cine.
Accordingly, there is a need for an apparatus and method for providing GPU-based volume rendering which provides for more consistent performance.
According to a first aspect of the invention, there is provided an apparatus for rendering a multiplanar reformatting (MPR) image of volume data, the apparatus comprising: a central processing unit (CPU) coupled to a system memory storing the volume data; and a graphics processing unit (GPU) coupled to a GPU memory and via the computer system bus to the CPU and system memory, wherein the CPU is operable to predict an MPR image which may be required for display at a future time and to identify blocks of voxels comprising the volume data which are needed to render the predicted MPR image, the CPU being further operable to retrieve said blocks from the system memory and to queue them for transfer to the GPU memory, wherein the apparatus further comprises a scheduler arranged to control the transfer of at least some of the queued blocks to the GPU memory prior to the predicted MPR image becoming required for display, the GPU being operable to retrieve the blocks from the GPU memory once transferred there from the CPU and to render corresponding parts of the predicted MPR image if it becomes required for display and to assemble these parts into an MPR image.
By conceptually dividing the volume data into blocks, the processing power of the GPU can be employed to render MPR images notwithstanding the modest memory available to a typical GPU. Furthermore, by scheduling the transfer of blocks which the CPU predicts are likely to be needed in the future, the effects of inconsistent performance associated with irregular bus traffic with known schemes for GPU-based rendering of medical image volume data are reduced and more rapid rendering of sequences of images can be performed.
The apparatus may be configured such that when the GPU memory allocated for storing blocks is full, blocks are overwritten in the GPU memory according to a replacement protocol having regard to the fraction of the GPU memory allocated for storing blocks which is required to store the blocks needed to render an MPR image. This approach allows, for example, a least recently used replacement protocol to be used when it would be most efficient to do so, but provides for switching to another replacement protocol (such as a most recently used replacement protocol, for example) when the least recently used protocol would cause thrashing because there is not enough room in the GPU memory allocated for storing blocks to efficiently handle the number of blocks needed for an image being rendered.
Another way to avoid thrashing is to configure the GPU to process the blocks needed to render alternate MPR images to be displayed in a sequence in alternating forward and reverse series order.
The CPU may be operable to assemble the blocks of voxels comprising the volume data to be transferred to the GPU memory as 3D textures. This provides the GPU with access to the voxel data comprising blocks which the GPU can process in an efficient manner.
The blocks of voxels comprising the volume data may be arranged on an irregular grid, such as a staggered grid. This arrangement can help to reduce variability in bus traffic because it reduces the likelihood that many new blocks will need to be uploaded simultaneously to the GPU memory. This can happen if blocks are arranged on a regular grid and an MPR slab progressing through the volume data to generate a cine crosses a boundary of the regular arrangement of blocks.
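One possible staggered arrangement, offered purely as an illustration, offsets alternate layers of blocks along the z-axis by half a block in the x-direction, like brickwork, so that a slab advancing through the volume crosses the boundaries of only some of the blocks at any one position:

```python
def staggered_block_origin(i, j, k, b):
    # Origin of block (i, j, k) on a staggered ("brickwork") grid of
    # b-voxel blocks: every other layer of blocks along z is shifted by
    # half a block in x, so an MPR slab progressing through the volume
    # does not cross the boundaries of all blocks simultaneously.
    x0 = i * b - (b // 2 if k % 2 else 0)
    return (x0, j * b, k * b)
```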
The blocks of voxels comprising the volume data need not be specified in the system memory but can instead be defined by the CPU at the time they are retrieved. By generating blocks “on the fly” in this way, it is not necessary to reorder the volume data in system memory. Furthermore, the sizes, shapes and configurations of the blocks can be dynamically chosen in accordance with prevailing conditions.
The apparatus may further include a display for displaying a rendered MPR image to a user. Alternatively, the rendered image may be stored for later retrieval.
Rendering speed may be further increased if the GPU is operable to generate a series of MPR images of a region of the volume data which correspond to a hierarchy of different slab MPR thicknesses, and to store the MPR images in the GPU memory. This means an MPR image having an arbitrary slab MPR thickness can be rendered by accumulating appropriate ones of the hierarchy of different slab MPR thickness images without needing to render all of the voxels within the MPR slab.
The apparatus may further be operable to render a series of MPR images for display to a user at a controlled rate. For example, a rate that corresponds to a progression through the volume data at constant speed. The apparatus may further be operable to render successive images in a series of images from corresponding successive MPR slabs which overlap one another by a significant amount, for example, by greater than 50% of their thickness, e.g. greater than 60% or 70%, more preferably by a still higher amount, such as 80%, 90% or 95% so that the visual impression of the user is one of the slab gradually progressing through the volume, rather than jumping from one slice to another. This mode of use differs from the conventional approach of moving the slab between frames by a distance equal to or only slightly less than the slab thickness (e.g. with one sample spacing overlap). A GPU based system lends itself to the proposed mode of use in that there is only a low additional cost to the system when progressing in increments of a small fraction of the slab thickness in view of the fact that a large proportion of the slices making up the slab can be cached or otherwise stored in memory on the GPU.
According to a second aspect of the invention there is provided a method of rendering a multiplanar reformatting (MPR) image of volume data, the method comprising: predicting an MPR image which may be required for display at a future time; identifying blocks of voxels comprising the volume data which are needed to render the predicted MPR image; retrieving said blocks from a system memory; queuing said blocks for transfer to a graphics processing unit (GPU) memory; transferring at least some of the queued blocks to the GPU memory prior to the predicted MPR image becoming required for display; reading blocks from the GPU memory by a GPU configured to render parts of the predicted MPR image corresponding to the blocks should the predicted MPR image become required for display; and assembling the parts to form an MPR image.
According to a third aspect of the invention there is provided a computer program product comprising machine readable instructions for implementing the method of the second aspect of the invention.
The computer program product according to the third aspect of the invention may comprise a computer program on a carrier medium, for example, a storage medium or a transmission medium.
According to a fourth aspect of the invention there is provided a computer configured to perform the method of the second aspect of the invention.
According to a fifth aspect of the invention, there is provided an apparatus for rendering a multiplanar reformatting (MPR) image of volume data, the apparatus comprising: a CPU coupled to a system memory storing the volume data; and a GPU coupled to a GPU memory and via a bus to the CPU and system memory, wherein the CPU is operable to identify blocks of voxels comprising the volume data which are needed to render the MPR image, the CPU being further operable to retrieve said blocks from the system memory and to transfer them to the GPU memory for rendering of corresponding MPR image parts, wherein the blocks of voxels comprising the volume data are arranged on an irregular grid, for example a staggered grid.
According to a sixth aspect of the invention, there is provided an apparatus for rendering a multiplanar reformatting (MPR) image of volume data, the apparatus comprising: a CPU coupled to a system memory storing the volume data; and a GPU coupled to a GPU memory and via a bus to the CPU and system memory, wherein the CPU is operable to identify blocks of voxels comprising the volume data which are needed to render the MPR image, the CPU being further operable to retrieve said blocks from the system memory and to transfer them to the GPU memory for subsequent rendering of corresponding MPR image parts by the GPU, wherein the apparatus is configured such that when the GPU memory allocated for storing blocks is full, blocks are overwritten in the GPU memory according to a replacement protocol having regard to the fraction of the GPU memory allocated for storing blocks which is required to store the blocks needed to render an MPR image.
According to a seventh aspect of the invention there is provided an apparatus operable to render a series of MPR images for display to a user at a predetermined rate.
According to an eighth aspect of the invention there is provided an apparatus operable to render a series of MPR images for display to a user at a rate determined by the user.
The invention also provides a method for rendering cross-sectional images of volume data, including cross-sections with thickness, comprising:
defining volume data to be imaged, plane location and orientation parameters, optionally also one or more of thickness parameters, sample density, projection mode parameters, and display parameters;
dividing the volume data into blocks;
transferring said blocks to a graphics processor on demand based on the geometric relationship between the blocks and the cross section to be rendered; and
rendering the cross sectional image using the graphics processor.
The dividing can be viewed as a conceptual subdivision of the whole volume into blocks and individual blocks of data are gathered or created on demand between the dividing and transferring.
A cache of volume data blocks can be maintained on the graphics processor to accelerate rendering of subsequent cross-sectional images.
The transferring may involve a scheduling algorithm to transfer blocks to the graphics processor ahead of the time when they are needed.
The method can be applied to rendering a sequence of cross sectional images based on parallel planes, wherein the scheduling algorithm is based on the linear separation between the cross sectional planes.
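The scheduling for the parallel-plane case might be sketched as follows (illustrative Python; `blocks_for_slab` stands for any routine mapping a slab position to the blocks it intersects, and is an assumption of this sketch rather than a defined part of the method):

```python
def prefetch_queue(blocks_for_slab, slab_pos, step, resident, lookahead=2):
    # Predict the slabs for the next `lookahead` frames of a cine through
    # parallel planes separated by `step`, and queue any block not yet
    # resident in GPU memory, preserving frame order so that blocks for
    # nearer frames are transferred first.
    queue = []
    seen = set(resident)
    for n in range(1, lookahead + 1):
        for blk in blocks_for_slab(slab_pos + n * step):
            if blk not in seen:
                queue.append(blk)
                seen.add(blk)
    return queue
```

A transfer scheduler can then drain this queue at a rate chosen so as not to saturate the communication link to the graphics processor.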
The scheduling algorithm can also be based on a desired temporal interval between images. The scheduling algorithm can also include consideration of the communication link through which blocks will be transmitted to the graphics processor and be designed so as to avoid saturating the communication link.

The method can be applied to rendering a sequence of cross sectional images based on radial planes that share a common axis, wherein the scheduling algorithm is based on the angular separation between the cross sectional planes. The scheduling algorithm can be based on a desired temporal interval between images. The scheduling algorithm can also include consideration of the communication link through which blocks will be transmitted to the graphics processor and be designed so as to avoid saturating the communication link.

The method can be applied to rendering a sequence of cross sectional images that have spatial locality but a complex spatial relationship, wherein the scheduling algorithm is based on an estimate of the separation between the cross sectional planes. The complex spatial relationship can be based on successive planes perpendicular to a curve. The scheduling algorithm can also be based on a desired temporal interval between images. The scheduling algorithm can also include consideration of the communication link through which blocks will be transmitted to the graphics processor and be designed so as to avoid saturating the communication link.
The scheduling algorithm can be the sole arbiter of when blocks of volume data enter and leave the cache.
The scheduling algorithm can add blocks to the cache and a Least Recently Used (LRU) strategy is used to clear blocks from the cache. In conditions wherein the working set of data blocks is equal to or larger than the size of the cache, a Most Recently Used (MRU) replacement strategy is used instead.
The rendering algorithm can be designed to access blocks of volume data in an order that does not result in pathological cache performance when the working set exceeds the cache size. The block access order can be palindromic; in other words, blocks are accessed in alternating increasing and decreasing passes, or an approximation thereof.
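A palindromic access order might be sketched as follows (illustrative Python):

```python
def palindromic_order(blocks, frame):
    # Access blocks in increasing order on even frames and decreasing
    # order on odd frames. When the working set slightly exceeds the
    # cache size, this means the blocks evicted at the end of one pass
    # are the last ones needed on the next pass, so most accesses still
    # hit the cache rather than forcing a full re-load every frame.
    return list(blocks) if frame % 2 == 0 else list(reversed(blocks))
```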
The method can be applied to the rendering of a series of cross sectional images with thickness, and further comprise: maintaining a cache of cross sectional images that constitute sampling planes of the cross sectional region, and/or accumulated subsets of such images; and creating cross sectional images with thickness by accumulating an appropriate selection of cached images and if necessary additional cross-sectional images. The image cache may contain cross sectional images. The image cache may contain a hierarchy of accumulated images where level 0 of the hierarchy is cross sectional images, level 1 is an accumulation of every K images, level 2 is an accumulation of every K² images, and so forth. The lowest levels of the hierarchy can be elided wholly or in part, wherein in the latter case the lowest levels of the hierarchy are elided except close to the planes that delimit the cross sectional zone.
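The hierarchy and its use might be sketched as follows, using scalar values to stand in for whole cross sectional images and maximum as the accumulation mode (an illustrative Python sketch):

```python
def build_hierarchy(slices, K):
    # Level 0 holds the individual cross-sectional images (scalars here);
    # entry i of level L is the accumulation (maximum) of slices
    # [i * K**L, (i + 1) * K**L).
    levels = [list(slices)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([max(prev[i:i + K]) for i in range(0, len(prev), K)])
    return levels

def slab_max(levels, K, a, b):
    # Accumulate the maximum over slices [a, b) from cached entries.
    # At each step the largest aligned level whose span fits inside
    # [a, b) is used, so only O(log N) entries are combined.
    result = None
    while a < b:
        lvl = 0
        while (lvl + 1 < len(levels)
               and a % (K ** (lvl + 1)) == 0
               and a + K ** (lvl + 1) <= b):
            lvl += 1
        v = levels[lvl][a // (K ** lvl)]
        result = v if result is None else max(result, v)
        a += K ** lvl
    return result
```

For eight slices with K = 2, a slab spanning slices 2 to 5 is rendered from one level-1 entry covering slices 2..3 and one covering 4..5, rather than from four individual slices.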
The accumulation mode may be maximum, maximum of pixels excluding those with a predefined value or falling within a predefined value range, minimum, minimum of pixels excluding those with a predefined value or falling within a predefined value range, average, average of pixels excluding those with a predefined value or falling within a predefined value range, inverse exponential sum, inverse exponential sum of pixels excluding those with a predefined value or falling within a predefined value range, opacity-based volume rendering, or some other scheme.
The method may be applied to the rendering of a sequence of cross sectional images with thickness wherein there is substantial overlap between successive positions of the cross-sectional zone, such that the majority of image data required for a new image is present in the cache.
The method may be applied to the rendering of a sequence of cross sectional images with thickness wherein there is substantial overlap between successive positions of the cross-sectional zone, such that rendering a new image requires at most O(log(N)) image accumulations and at most O(log(N)) cross sectional image renderings, where N is the thickness of the cross sectional zone.
The invention also provides a system for rendering a sequence of cross-sectional images with thickness incorporating a feedback loop so that the cross-sectional zone being rendered can advance through the volume at a predetermined rate of millimeters per second.
The system may incorporate a user interface that allows the user to set the desired rate of progression through the volume in millimeters per second.
It will be understood that references to an MPR plane should not be construed to be limited to a flat plane, but should also include an arbitrary shape of plane. For example, non-flat planes are commonly used in curved MPR.
For a better understanding of the invention and to show how the same may be carried into effect reference is now made by way of example to the accompanying drawings in which:
FIGS. 4a and 4b show a flow diagram schematically representing a method of processing volume data to generate two dimensional images using the computer system of
FIGS. 6a to 6c schematically show section views of example block gridding patterns that may be used in embodiments of the invention; and
Different imaging modalities (e.g. CT, MR, PET, ultrasound) typically provide different image resolutions (i.e. voxel size), and the overall size of the volume imaged will further depend on the nature of the study. However, in the following description, by way of concrete example it will be assumed that the volume data comprise an array of 512×512×1024 16-bit voxels arranged on a regular Cartesian grid defined by x-, y- and z-axes, with the voxels being spaced by 0.5 mm along each axis. This corresponds to an overall imaged volume of around 25 cm×25 cm×50 cm, for example so as to encompass a human head. As is conventional, the volume data are aligned with transverse, sagittal and coronal planes. The xy-axes are in a transverse plane, the xz-axes are in a coronal plane and the yz-axes are in a sagittal plane.
As noted above, a common technique for generating 2D output images from volume data is known as multi-planar reformatting (MPR). MPR is a technique for presenting planar cross-sectional views through volume data to allow viewing of the data in any planar orientation. In zero thickness, or plane, MPR, output images are generated by sampling (typically involving interpolation) the volume data at locations corresponding to pixels in an output image plane passing through the volume data at a desired orientation and position. The specific mathematical processing applied to the volume data in order to generate such 2D images is well known and not described here.
A related form of MPR is known as MPR with thickness, or slab MPR. Slab MPR is often used where volume data are obtained on a grid which is denser than the image resolution required to be viewed by a user, to reduce noise, or to improve perception of anatomical structures in the data. In slab MPR, a planar slab of the volume data is identified which is parallel to the desired output image and which extends over a finite thickness in the vicinity of and perpendicular to the output image plane, i.e. along a viewing direction. The output image is obtained by collapsing this planar slab along the viewing direction according to a desired algorithm. Common collapse algorithms include determining the maximum, minimum or average signal value occurring for all voxels in the planar slab which project onto a single pixel in the output image. This signal value is then taken as the signal to be represented in the output image for that pixel. As with plane MPR, the mathematical processing applied to the volume data in order to generate slab MPR images is well known and not described here.
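The collapse step might be sketched as follows for small nested lists standing in for resampled slices (illustrative Python; a practical implementation would operate on GPU textures rather than Python lists):

```python
def collapse_slab(slab, mode="max"):
    # `slab` is a list of slices sampled on planes parallel to the output
    # image; all voxels projecting onto one output pixel share the same
    # (row, col) position across the slices. The slab is collapsed along
    # the viewing direction using the chosen algorithm.
    rows, cols = len(slab[0]), len(slab[0][0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            column = [s[r][c] for s in slab]
            if mode == "max":
                out[r][c] = max(column)
            elif mode == "min":
                out[r][c] = min(column)
            else:  # average
                out[r][c] = sum(column) / len(column)
    return out
```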
As previously noted, a common desire when studying medical image volume data is to view a series of parallel MPR images (which may be plane MPR or slab MPR images) in succession, for example to give the impression of an image slice moving through the volume data or of an image slab covering successive adjacent and/or overlapping zones of the anatomy. By presenting a cine of MPR images to a suitably trained user in this way, the user can form an accurate mental image of the object represented by the volume data.
The CPU 24 may execute program instructions stored within the ROM 26, the RAM 28 or the hard disk drive 30 to carry out processing of signal values associated with voxels of volume data that may be stored within the RAM 28 or the hard disk drive 30. The RAM 28 and hard disk drive 30 are collectively referred to as the system memory. The GPU may also execute program instructions to carry out processing of volume data passed to it from the CPU.
To assist in showing the different data transfer routes between features of the computer system 22, the common bus 42 shown in
FIGS. 4a and 4b show a flow diagram schematically representing a method of processing volume data to generate two dimensional images using the computer system 22 of
In Step S1, a user wishing to view a cine defines a starting MPR slab (MPR slab#1) from which a corresponding desired initial image (image#1) is to be generated. The user also defines a step size between cine frames and a direction of travel for the cine. The user defines the required parameters using the keyboard 38 and mouse 40 in combination with a menu of options displayed on the display 34, for example using conventional techniques. In this example, MPR slab#1 has a thickness of 5 mm (corresponding to ten voxels) and is arranged parallel to the x-axis and inclined at 45 degrees to both the y- and z-axes. The center of MPR slab#1 coincides with the center of the volume data 48. The defined step size between cine frames is 4 mm (i.e. a step of 80% of MPR slab thickness with an overlap of 20%), and the desired direction of cine through the volume data is perpendicular to MPR slab#1 and in the positive z-direction.
In this example, this division of the volume data into blocks of voxels is conceptual and there is no corresponding re-ordering of the volume data in system memory. The volume data will typically be stored in system memory in linear order. In other examples, however, the volume data may be re-ordered in system memory such that voxel data associated with each individual block are linearly accessible. In general a duplication of the volume data ordered in this way (rather than a replacement) would be used as other volume data analysis tools may require a copy of the volume data to remain in linear order in system memory. While this approach can be appropriate in some circumstances, the re-ordering of volume data in system memory has costs both in terms of the time taken to perform the re-ordering and memory requirements. The scheme also lacks flexibility in that block sizes and shapes cannot be easily changed once the re-ordering has been done without repeating the re-ordering. A compromise scheme might involve duplicating only a subset of the blocks in linear order in system memory, for example those blocks in and around an MPR slab being rendered.
In the present example, the volume data are considered to be divided into cubes of 32×32×32 voxels arranged on a regular grid. It is noted that for MPR processes employing interpolation, it is helpful for voxels at the boundary of one side of each block to be duplicated in the neighboring block for each of the axes. That is to say, the blocks overlap by one voxel. This allows interpolations to be made over all of the volume space spanned by the volume data. Accordingly, a 32×32×32 block might be considered to contain 31×31×31 useful voxels and to properly span the 512×512×1024 voxel volume data set, 17×17×34 blocks will be needed. The outer “fractional” blocks can either be padded to 32×32×32 voxels, or can be smaller than the other “whole” blocks.
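The block-grid arithmetic above might be sketched as follows (illustrative Python; the count follows the convention of the example above, in which each 32-voxel block contributes 31 useful voxels per axis):

```python
import math

def block_grid(volume_dims, block_size, overlap=1):
    # Successive block origins are (block_size - overlap) voxels apart
    # along each axis, since `overlap` boundary voxels are duplicated in
    # the neighboring block to allow interpolation across block edges.
    stride = block_size - overlap
    return tuple(math.ceil(d / stride) for d in volume_dims)
```

For the 512×512×1024 voxel example volume with 32-voxel blocks this gives the 17×17×34 block grid quoted above.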
Once MPR slab#1 has been defined, the next task of the computer system 22 is to render and display image#1 corresponding to MPR slab#1.
In Step S2, the CPU calculates which of the conceptual blocks 72 comprising the volume data 48 are intersected by MPR slab#1. Some of these blocks, namely those on the visible outer faces of the volume data 48 shown in
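The intersection calculation of Step S2 might be sketched as follows (illustrative Python; `normal` is assumed to be a unit vector perpendicular to the slab and `slab_center_dist` the signed distance of the slab's mid-plane from the origin along that normal, both assumptions of this sketch):

```python
def slab_intersects_block(normal, slab_center_dist, thickness, origin, size):
    # An axis-aligned cubic block (given by its origin and edge length)
    # intersects the slab when the distance from the block center to the
    # slab's mid-plane, measured along the unit normal, is no more than
    # half the slab thickness plus the block's half-extent projected
    # onto the normal.
    cx = origin[0] + size / 2
    cy = origin[1] + size / 2
    cz = origin[2] + size / 2
    center_dist = cx * normal[0] + cy * normal[1] + cz * normal[2]
    half_extent = (abs(normal[0]) + abs(normal[1]) + abs(normal[2])) * size / 2
    return abs(center_dist - slab_center_dist) <= thickness / 2 + half_extent
```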
In Step S3, the CPU retrieves the required blocks for MPR slab#1 from system memory. The CPU assembles the voxel values comprising each block in linear order following retrieval of the corresponding voxel values from system memory (where the volume data as a whole is arranged in linear order) and uploads the block data to the GPU memory via the GPU interface 60 and GPU memory I/O controller 62. The CPU conveniently assembles each block as a 3D texture in linear order for transfer to the GPU cache. Alternatively, each block may be assembled in a section of a larger 3D texture, for example using "texture atlas" techniques. In other examples, the CPU may "swizzle" the block data, or rearrange the block data in octal tree order before rendering (this could also be performed by the GPU following upload). It may or may not be practical to "swizzle" the block data depending on the efficiency of the swizzling operation. Because of locality and granularity effects in system memory, it will be more efficient when appropriate to retrieve block data from the linear order volume data in groups of at least a minimum number of consecutive bytes, typically 32 to 128 bytes depending on the system architecture.
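The gathering of one block out of the linearly ordered volume might be sketched as follows (illustrative Python, using a flat list to stand in for system memory):

```python
def gather_block(volume, nx, ny, origin, size):
    # Copy one size**3 block out of linearly ordered volume data (x
    # fastest, then y, then z) into a contiguous buffer, as would be
    # uploaded to the GPU as a small 3D texture. Each inner read copies
    # a run of `size` consecutive voxels, which suits the memory
    # granularity point made above.
    ox, oy, oz = origin
    block = []
    for z in range(oz, oz + size):
        for y in range(oy, oy + size):
            start = ox + nx * (y + ny * z)
            block.extend(volume[start:start + size])
    return block
```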
In Step S4, the GPU cycles through each of the blocks which have been transferred to the GPU cache and processes the relevant voxels (i.e. those falling within MPR slab#1) in each block in order to render the output image. This can be done using conventional volume rendering algorithms and is performed by circuitry in the processing engine 64. Images may be rendered by forming a maximum intensity projection, for example. In this way, image#1 is rendered as a collage of tiles where each tile represents an image part rendered from those voxels in a given block. The GPU may commence processing the blocks to render the individual image parts as soon as the first block is transferred from the CPU to the GPU memory (i.e. Step S3 and Step S4 may execute to some extent in parallel), or alternatively, the GPU might wait until all blocks associated with MPR slab#1 have been transferred to the GPU memory. The former will generally be the quicker scheme. Following the processing of a single block by the GPU to render a corresponding part of image#1, the block data are not overwritten in the GPU memory but are maintained in case the same block is required for a later rendering. This is particularly beneficial because it will often be the case during a cine that once an image has been rendered which corresponds to one MPR slab, the next image to be rendered will correspond to an MPR slab which is adjacent to and/or overlaps to some extent with the previous one. This means that in many cases a number of the same blocks will be required to render the second image as were used to render the first. Maintaining the blocks in the GPU memory therefore reduces duplication of loading from system memory and transfer to the GPU for later rendered images.
The portion of the GPU memory allocated to the storage of blocks of volume data in this way is referred to here as GPU block cache. It will be appreciated that the GPU memory is not configured as a hardware cache since it lacks a hardware tag store and hardware means for associative addressing and replacement. Instead, the present invention maintains the GPU block cache as a software abstraction implemented on ordinary GPU memory.
As the GPU memory allocated as GPU block cache becomes full, it becomes necessary to overwrite blocks as new blocks are uploaded from the CPU. A common cache replacement protocol is to replace the least recently used entry in the cache. This is known as an LRU protocol. In the present case, this would mean that the block which has remained unused in the GPU block cache for the longest period will be overwritten. This protocol generally works efficiently because it is likely that cached blocks which were not used in the latest rendering will also not be used in the next or subsequent renderings. This is because the MPR slab used for subsequent images typically progresses steadily through the volume data during a cine. Accordingly, once a block has been used and the MPR slab has progressed through that part of the volume data, the block will not be used again unless the cine changes direction.
However, in certain circumstances, the LRU protocol can be very inefficient and heavily detrimental to performance. For example, where the working set of blocks, that is the number of blocks required to render each individual image, exceeds the number of blocks which can be stored in the GPU block cache, an LRU protocol can lead to thrashing. By way of example, suppose the GPU block cache can store N blocks and the working set is N+1 blocks. As the initial image is rendered, the first N blocks (blocks 1 . . . N) are loaded into the block cache and the corresponding N parts of the output image are rendered. Block N+1 is then loaded into the GPU block cache in place of block 1 (since this is the least recently used block) and the corresponding final part of the image is rendered. To render the next image, it will frequently be the case that the same N+1 working blocks will be required. Accordingly, the GPU requires block 1 to be reloaded into the block cache. It does this by overwriting block 2 (the least recently used block) and renders the first part of the next image. However, the GPU now immediately requires block 2. Block 2 is thus loaded and overwrites block 3 in the block cache. Block 3 is then loaded and overwrites block 4, and so on. Accordingly, where the working set exceeds the GPU block cache size, the LRU protocol causes all blocks to be re-loaded for each subsequent image in the cine, i.e. thrashing occurs. One way this can be avoided is by overwriting the most recently used entry in the GPU block cache. This is known as an MRU protocol. In the above example having N+1 working blocks, an MRU protocol requires only a single block to be loaded per subsequent image (assuming the MPR slab continues to intersect the same N+1 blocks). For example, when image#1 has been rendered, the GPU block cache is holding blocks 1 to N−1 and N+1 (since block N, being the most recently used block, was overwritten by block N+1). 
When image#2 has been rendered, the GPU block cache is now holding blocks 1 to N−2, N and N+1 (since block N−1, being the most recently used block at that point, was overwritten by block N), and so on.
Accordingly, an efficient cache replacement protocol is to allow adaptive switching from an LRU protocol when the working set of blocks is fewer than the number of blocks which fit into the GPU RAM allocated as block cache to an MRU protocol when the working set exceeds the GPU block cache allocation. More generally, switching from an LRU protocol to an MRU protocol might occur when the amount of the GPU block cache required to store the working set of blocks exceeds a threshold, for example 50%, 75% or 100%, of the total GPU block cache size. Furthermore, the switch from an LRU protocol to an MRU protocol as the working set of blocks increases might occur at one threshold, while the switch from an MRU protocol to an LRU protocol as the working set of blocks decreases might occur at another, lower threshold. This can help prevent frequent switching of the cache replacement protocol which might otherwise occur if a single threshold is used and the computer system is typically operating with a working set at or around this threshold.
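The adaptive switching with two thresholds might be sketched as follows (the 75%/50% figures are example values in the spirit of those quoted above; the function itself is hypothetical):

```python
def choose_policy(current_policy, working_set_size, cache_capacity,
                  upper=0.75, lower=0.50):
    """Pick the cache replacement policy with hysteresis.

    Switch LRU -> MRU when the working set exceeds `upper` of the cache
    capacity, and only switch back MRU -> LRU once it drops below
    `lower`. The gap between the two thresholds prevents rapid toggling
    when the working set hovers around a single threshold.
    """
    fill = working_set_size / cache_capacity
    if current_policy == 'LRU' and fill > upper:
        return 'MRU'
    if current_policy == 'MRU' and fill < lower:
        return 'LRU'
    return current_policy

p = 'LRU'
p = choose_policy(p, 80, 100)   # 80% > 75%: switch to 'MRU'
p = choose_policy(p, 60, 100)   # 60% is still above 50%: stays 'MRU'
p = choose_policy(p, 40, 100)   # 40% < 50%: back to 'LRU'
```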
It will be appreciated that other techniques to avoid thrashing could also be used. For example, a palindromic rendering policy may be used whereby blocks are processed in forward and reverse order in alternate images. Thus, even-numbered images might be rendered from the bottom left corner to the top right corner of the MPR slab, whereas odd-numbered images might be rendered from top right to bottom left. Even with an LRU protocol this approach avoids thrashing.
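A palindromic ordering is trivial to express (an illustrative sketch; the helper name is hypothetical):

```python
def block_order(blocks, frame_index):
    """Palindromic rendering order: even-numbered frames process the
    blocks forwards, odd-numbered frames process them in reverse.

    With an LRU cache slightly too small for the working set, this
    ensures the blocks needed first by the next frame are the ones
    rendered (and hence cached) last by the current frame, so even an
    LRU protocol avoids thrashing.
    """
    order = list(blocks)
    return order if frame_index % 2 == 0 else order[::-1]

forward = block_order([1, 2, 3, 4], 0)   # even frame: [1, 2, 3, 4]
reverse = block_order([1, 2, 3, 4], 1)   # odd frame:  [4, 3, 2, 1]
```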
In Step S5, the rendering of image#1 is complete and the image is stored in a part of the GPU RAM allocated to image storage. This is done so that if the same image is required again, for example if the user stops the current cine and wants it to run backwards to review one or a series of particular images of interest, it can be displayed without being re-rendered.
In Step S6, the GPU transfers image#1 via the GPU display I/O controller for display on the display 34. The GPU then instructs the CPU that image#1 has been completed.
During the execution of Steps S4 to S6 by the GPU in rendering image#1, the CPU executes Steps T4 to T6 in parallel.
The CPU executes Step T4 following transfer of the blocks associated with MPR slab#1 to the GPU block cache in Step S3. In Step T4, the CPU calculates which future MPR slabs are likely to be required. In the present case, i.e. where the cine corresponds to a steadily progressing image plane moving through the volume data in a known direction with a known step size, the next required MPR slab (MPR slab#2) can be readily predicted. It is similarly easy to predict the next required MPR slab for a rotating cine or a cine combining rotation and translation through the volume data. (In other cases, for example where a cine is being scrolled forwards and backwards under interactive control of a user, it may be necessary to statistically predict which is the next most likely MPR slab to be required based on previous activity by the viewer, and also to assume that MPR slabs on both sides of the present MPR slab might be required.) The CPU then calculates which blocks comprising the volume data are intersected by MPR slab#2.
In Step T5, the CPU determines which of the blocks associated with MPR slab#2 (which at this stage are referred to as future blocks in that they are under consideration only as being likely to be needed in the future) are already in the GPU block cache. In many cases, the majority of future blocks will already be in the GPU block cache. The CPU retrieves the voxel data associated with those future blocks which are not already in the GPU block cache (referred to as predicted future blocks) from system memory and assembles the blocks in the same manner as described above with respect to Step S3.
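Steps T4 and T5 amount to a slab-to-block intersection followed by a set difference. A minimal sketch (considering, for simplicity, only the axis normal to a slab aligned with the block grid; all names and figures are illustrative):

```python
def blocks_for_slab(slab_start, slab_thickness, block_size):
    """Indices of the block layers (along the slab's normal axis)
    intersected by an MPR slab covering voxel layers
    [slab_start, slab_start + slab_thickness)."""
    first = slab_start // block_size
    last = (slab_start + slab_thickness - 1) // block_size
    return list(range(first, last + 1))

def predicted_future_blocks(next_slab_blocks, cached_blocks):
    """Future blocks not already resident in the GPU block cache: only
    these must be assembled by the CPU and queued for transfer."""
    return sorted(set(next_slab_blocks) - set(cached_blocks))

# An 8-voxel-thick slab advancing 6 voxels per frame through
# 32-voxel blocks: the next slab (starting at voxel 30) spans blocks
# 0 and 1, but block 0 is already cached from the current frame, so
# only block 1 is a predicted future block needing transfer.
current = blocks_for_slab(24, 8, 32)                 # -> [0]
future = blocks_for_slab(30, 8, 32)                  # -> [0, 1]
to_upload = predicted_future_blocks(future, current) # -> [1]
```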
In Step T6, the CPU queues the predicted future blocks in the CPU block cache. One predicted future block is schematically shown in the CPU cache 50 of
By way of example, the CPU might determine in Steps T4 and T5 that the current cine activity will require the transfer of:
10 predicted future blocks in 100 ms;
20 further predicted future blocks in 200 ms;
40 further predicted future blocks in 300 ms; and
15 further predicted future blocks in 400 ms.
Moreover, it might take 2 ms to transfer a block to the GPU block cache and a rule may be adopted that block transfers are preferably initiated 100 ms in advance. In these circumstances, the scheduler thread will instruct the immediate (i.e. at t=0 ms) transfer of the first 10 predicted future blocks. This takes 20 ms. The transfer of predicted future blocks is then halted for 80 ms. At t=100 ms, the next 20 predicted future blocks are transferred to the GPU block cache. This takes 40 ms. At t=200 ms, the next 40 predicted future blocks are transferred to the GPU block cache, and so on. If more than 50 predicted future blocks need to be uploaded at one time to meet a certain deadline, the scheduler thread may be configured to instigate transfer further in advance by making use of the halt time associated with a previous transfer activity.
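The scheduler's reasoning in this worked example can be sketched as follows (a hypothetical helper; the 2 ms cost and 100 ms lead time are the example figures from the text, and the early-start adjustment mirrors the final sentence above):

```python
def transfer_schedule(batches, ms_per_block, lead_time_ms=100):
    """Compute when the scheduler thread should start each batch of
    predicted future blocks.

    batches      -- list of (deadline_ms, n_blocks) pairs
    ms_per_block -- cost of transferring one block to the GPU cache

    Each transfer is preferably started `lead_time_ms` before its
    deadline; if that would miss the deadline, the start is pulled
    earlier into idle time left over from a previous batch.
    Returns a list of (start_ms, n_blocks) pairs.
    """
    schedule = []
    busy_until = 0                       # when the bus is next free
    for deadline, n_blocks in sorted(batches):
        start = max(busy_until, deadline - lead_time_ms)
        duration = n_blocks * ms_per_block
        if start + duration > deadline:
            # Too many blocks for the preferred window: begin earlier,
            # using the halt time of the preceding transfer activity.
            start = max(busy_until, deadline - duration)
        schedule.append((start, n_blocks))
        busy_until = start + duration
    return schedule

# The example from the text: 2 ms per block, batches due at
# t = 100, 200, 300 and 400 ms.
batches = [(100, 10), (200, 20), (300, 40), (400, 15)]
plan = transfer_schedule(batches, ms_per_block=2)
# -> [(0, 10), (100, 20), (200, 40), (300, 15)]
```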
In Step S7, the CPU, having been instructed by the GPU in Step S6 that image#1 is complete, instigates the rendering of the next image in the cine, i.e. image#2. This is done by first identifying the MPR slab corresponding to image#2, i.e. MPR slab#2. In cases where the cine represents a steady progression through the volume data, MPR slab#2 will be as predicted by the CPU in Step T4. However, in cases where the cine is responsive to user input, actual MPR slab#2 might not correspond to the most likely next MPR slab predicted in Step T4. This might be the case where the user stops the cine and instructs it to reverse, or to skip some distance, for example.
In Step S8, the CPU identifies which blocks are intersected by MPR slab#2. This is done in the same manner as described above in connection with Step S2 for MPR slab#1.
In Step S9, the CPU determines which of the blocks required to render image#2 are not already in the GPU block cache. Ideally, all of the necessary blocks will already be in the GPU block cache as a result of the predictive uploading of blocks associated with the previously executed Steps T4 to T6. If this is not the case, the CPU retrieves the voxel data associated with any required blocks which are not already in the GPU block cache from system memory and assembles the necessary blocks and transfers them to the GPU block cache in the same manner as described above with respect to Step S3.
In Steps S10 to S12 the GPU renders image#2 in the same manner as described above for Steps S4 to S6 with regard to image#1. Again, the processing in Step S10 can begin for any blocks intersected by MPR slab#2 which are already in the GPU block cache before completion of Step S9. Because the scheduler thread operating in Step T6 will have already transferred many (ideally all) of the blocks intersected by MPR slab#2 to the GPU block cache, the second image can be rendered quickly as there is reduced (ideally zero) delay associated with transferring data to the GPU block cache for subsequent processing during the rendering process.
After display of image#2 in Step S12, the GPU instructs the CPU that image#2 has been completed.
During the execution of Steps S10 to S12 by the GPU in rendering image#2, the CPU executes Steps T10 to T12 in parallel. These steps are similar to and will be understood from Steps T4 to T6 described above. During Steps T10 to T12, the CPU continues to queue predicted future blocks and the scheduler thread continues to transfer them to the GPU block cache where space is available.
In Step S13, the CPU, having been instructed by the GPU in Step S12 that image#2 is complete, instigates the rendering of the next image in the cine, i.e. image#3. This is done in the same manner as described above for Step S7 for image#2.
As indicated in Step S14, subsequent images are generated by repeating the method described above for steps S8 to S13 and T10 to T12 for image#2 for each subsequent image in the cine.
Although the above example employs a regular grid of 32×32×32 voxel blocks, it will be appreciated that other block sizes and shapes may equally be used. For example, the blocks may comprise 64×64×64 voxels, or may not have the same dimension along each axis, e.g. a regular grid of 16×16×64 voxel blocks may be used.
The most appropriate characteristic size of blocks to use will depend on a number of conflicting factors. This is because some aspects of the method benefit from using a small number of large blocks whereas other aspects benefit from using a large number of small blocks. An appropriate block size to use can thus be determined by taking into account the impact of these different factors in any particular implementation.
The main factor which favors a large number of small blocks relates to the way in which the blocks must span the volume of the MPR slab to be rendered. An array of smaller blocks will be able to more closely map to the volume of the MPR slab. This means using a large number of small blocks minimizes the amount of redundant data that needs to be transferred to the GPU block cache to render any given MPR slab. This is because there are fewer voxels not within the MPR slab itself, but which must be uploaded to the GPU block cache nonetheless because they are in a block which includes voxels which are within the MPR slab. In this regard having blocks which correspond to individual voxels would be ideal.
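The redundancy penalty of larger blocks can be quantified along the slab's normal axis (an illustrative sketch with hypothetical figures; only one axis is considered for simplicity):

```python
def uploaded_layers(slab_start, slab_thickness, block_size):
    """Voxel layers (along the slab normal) that must be uploaded to
    cover an MPR slab, given blocks spanning `block_size` layers:
    whole blocks are transferred even when only a few of their layers
    actually fall within the slab."""
    first = slab_start // block_size
    last = (slab_start + slab_thickness - 1) // block_size
    return (last - first + 1) * block_size

# An 8-layer slab starting at layer 30: 16-layer blocks upload
# 32 layers (two blocks), while 64-layer blocks upload 64 layers.
# Smaller blocks hug the slab more tightly, at the price of many more
# blocks to manage per image.
small = uploaded_layers(30, 8, 16)   # -> 32
large = uploaded_layers(30, 8, 64)   # -> 64
```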
However, this must be balanced against those factors which favor a small number of large blocks. For example, small numbers of blocks can be preferred because the number of geometric primitives that must be processed to render a given image rises with the number of blocks (because each block is rendered separately). The computational cost of performing clipping operations also rises with the number of separate blocks to be processed. A large number of blocks also requires a large number of state changes to be made by the GPU during processing. Having larger blocks is also preferred because it is more efficient for the CPU to retrieve data from system memory in contiguous blocks (e.g. as the block x-dimension increases). It is also generally more efficient to transfer larger blocks of data through a computer system's bus. There will also be relatively less wastage associated with the need to duplicate voxels at the boundaries of neighboring blocks in order to allow interpolation over all volume space when larger blocks are employed.
There is therefore a need to make a compromise between these competing factors when deciding on an appropriate size and shape of blocks. The decision will depend on factors such as the CPU and system architecture, the GPU architecture, the total number of voxels in the volume data and the orientation and thickness of the MPR plane. In the above described implementation, characteristic block dimensions on the order of 32 or 64 voxels have been found to be suitable. Furthermore, appropriate block configurations (size, shape, gridding pattern etc.) need not be predefined for a given implementation but may be selected interactively according to desired cine parameters (slab thickness, orientation, step size, etc.). This may be on the basis of a predefined set of different block configurations or on the basis of block configurations specifically generated for any given desired cine activity.
It will also be appreciated that the blocks need not be arranged on a regular grid and/or need not be cuboid in shape themselves. Such non-regular arrangements can help to reduce spikes in block transfer requirements which can be significant in some cine activities.
A common cine activity involves generating a series of images parallel to two axes of the volume data so as to scan along the third axis. For example, with reference to
a schematically shows a section view of an example of this cine activity taken in the xz-plane with blocks conceptually arranged on a regular grid. In this example the blocks are 16 voxels wide along the x-axis and 32 voxels wide along the z-axis. The cine activity effectively corresponds to a series of renderings of an MPR slab which is moving in the direction indicated by the arrow. In this example the slab is 8 voxels thick and advances at a rate of 6 voxels per cine frame and starts at the left-hand side of the figure. To assist explanation, columns of voxels will be referred to by number starting at 1 for the leftmost column. Images are rendered in a manner similar to that described above with reference to
b is similar to and will be understood from
The above described methods can provide for efficient and fast rendering of MPR images from 3D volume data using readily available non-specialized computer hardware. Whereas in the past it has been necessary for images to be displayed as soon as they become available because it is not possible to accurately predict how long an image might take to be rendered and because some images take unduly long to render (i.e. those requiring a large number of blocks to be retrieved from system memory and transferred to the GPU block cache), with the present invention, cine images can be displayed at a regular pre-set rate corresponding to a constant speed of progression through the volume data. The above described techniques can also allow real-time cineing of medical image data at a speed which opens up the possibility for specific cine activities which have not generally been possible to implement practically with previous methods.
For example, conventional cine activities have previously been based on step sizes between cine frames which are comparable to the thickness of the MPR slabs employed to generate the images. This has been considered necessary to allow cineing to proceed at a reasonable rate to allow a user to view a complete set of volume data in a reasonable time. With slab MPR algorithms such as maximum intensity projection, large step sizes are not considered to lead to the possibility of missing something important in the data because so long as there is some overlap between the MPR slabs used to generate successive cine frames, every voxel plays a role in generating the cine, even if the majority are never displayed. However, the approach of using large step sizes (for example having overlaps of only 20% of the thickness of the MPR slabs) nonetheless leads to the appearance of viewing a series of separate images and not a smooth movie-like scan through the volume data. However, with the faster rendering provided by the present invention it is now possible to provide real time cineing at a reasonable speed having significantly higher overlaps between the MPR slabs used to generate successive images (i.e. frames) in the cine than has previously been possible.
Cine speed can be increased further still when significantly overlapping MPR slabs are used to render successive images by using accumulated image caching as now described.
A part of the GPU RAM is allocated to accumulated image cache storage. The accumulated image cache of the GPU RAM stores a hierarchical set (i.e. a tree) of accumulated images. In principle, if unlimited GPU RAM were available, the lowest level of accumulated images in the hierarchy (level 0) would correspond to individual zero-thickness MPR planes through the volume data which are parallel to a desired cine plane. The next level (level 1) images correspond to the accumulation of a number, for example 4, of the level 0 images. The level 1 images are generated by accumulating the level 0 images using the accumulation operator used to generate the desired output images in the cine. For example, this might be the minimum, maximum or average projection (collapse) algorithm. The level 2 images correspond to the accumulation of a number of the level 1 accumulated images, for example, again 4. Level 2 images thus correspond to an accumulation of 16 level 0 images. This hierarchy may continue up to the highest level consistent with the maximum expected thickness of slab that may need to be rendered. In practice, it is unlikely that the GPU accumulated image cache will be sufficient to store all of the level 0 images. Accordingly, only level 1 accumulated images and higher might be stored in the GPU accumulated image cache.
Now, suppose an MPR slab having a thickness of 16 voxels is to be rendered for the first image in a cine using a maximum intensity projection accumulation operator. One way to render the slab would be to cast a short ray from each image pixel through the volume data perpendicular to the slab and to accumulate samples along the length of the ray. A functionally equivalent method is to generate a series of zero thickness MPR planes at each of the sample locations along these rays (i.e. images which correspond to the level 0 images described above) and to accumulate these zero thickness MPR planes together. If during this processing the GPU caches the level 1 and level 2 images, they may be re-used to render a future MPR slab that significantly overlaps the present slab. For example, if the next slab overlaps the first by 50%, two of the four level 1 accumulated images can be re-used and do not need to be re-rendered. As the cine progresses, the GPU accumulated image cache becomes increasingly populated with level 1 and level 2 accumulated images. When four level 2 images have been generated, they may be accumulated to form a level 3 image and so on.
As more and more accumulated images become cached, it becomes more likely that the GPU will be asked to render an image corresponding to an MPR slab which spans a region covered by the cache of accumulated images. In this case, the desired image can be generated by appropriate combining of the hierarchical set of accumulated images. For example, if the MPR slab has a thickness corresponding to the extent of 16 of the zero thickness MPR planes, and accumulated images for level 1 and above (each level corresponding to an accumulation of 4 accumulated images from the next lower level) are stored, there is a 1 in 16 chance that the MPR slab corresponds exactly to an already rendered level 2 image, and a 1 in 4 chance that it can be rendered by merely accumulating four level 1 images. For the remaining 3 in 4 cases, the image can be rendered merely by accumulating 3 level 1 images with 4 level 0 images (which will need to be generated if, as in this example case, level 0 images are not stored in the cache).
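The re-use of cached accumulated images can be sketched with NumPy for the maximum intensity case (the helper and sample data are illustrative; minimum projection works identically with `np.minimum`):

```python
import numpy as np

def accumulate(images, op=np.maximum):
    """Fold a list of images together with the projection operator
    (here maximum intensity)."""
    out = images[0].copy()
    for img in images[1:]:
        out = op(out, img)
    return out

# A 16-plane slab built from the cache: three cached level 1 images
# (4 planes each, covering planes 0..11) plus 4 freshly rendered
# level 0 planes (planes 12..15), instead of accumulating all 16
# level 0 planes from scratch.
planes = [np.full((2, 2), i, dtype=np.uint16) for i in range(16)]
level1 = [accumulate(planes[i:i + 4]) for i in (0, 4, 8)]  # cached
fresh = planes[12:16]                                      # level 0
slab = accumulate(level1 + fresh)
# Identical to the brute-force accumulation of all 16 planes.
```

The same operator-associativity argument underpins every level of the hierarchy: a level n image is just the accumulation of K level n−1 images.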
The accumulated images may be cached at the resolution of the final output display. However, because the resolution of volume data is generally less than the resolution of a final display device, it may be more appropriate to cache the images at a more modest resolution to save on storage requirements. The cached images may include blank padding parts for simplicity of operation or may alternatively be appropriately cropped to reduce storage requirements. Any regions of padding in the volume data should be carefully handled to ensure they do not contaminate accumulated images. This could be achieved with maximum and minimum projection accumulation by allocating padding pixels in rendered images a value that is ignored during accumulation. In the case of average projection, two values can be cached for each pixel in an accumulated image: one value provides a running total of the valid pixel values, while the other provides the number of valid pixels.
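The sum-and-count scheme for average projection might look as follows (a minimal sketch; the sentinel value and helper name are illustrative assumptions):

```python
import numpy as np

PAD = -1.0   # sentinel marking padding pixels (an illustrative choice)

def accumulate_average(planes):
    """Average projection in the presence of padding: alongside each
    accumulated image, cache a running total of the valid pixel values
    and a count of valid pixels, so padding never skews the average."""
    total = np.zeros(planes[0].shape)
    count = np.zeros(planes[0].shape)
    for p in planes:
        valid = p != PAD
        total[valid] += p[valid]
        count[valid] += 1
    # Pixels with no valid contribution at all remain padding; the
    # np.maximum guard merely avoids division by zero for those pixels.
    avg = np.where(count > 0, total / np.maximum(count, 1), PAD)
    return avg, total, count

# One pixel is padding in the first plane: its average uses only the
# single valid sample rather than being dragged down by the sentinel.
planes = [np.array([[1.0, PAD]]), np.array([[3.0, 4.0]])]
avg, total, count = accumulate_average(planes)   # avg -> [[2.0, 4.0]]
```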
Although it will not generally be feasible to store all level 0 images in the GPU accumulated image cache, in cases where the cine progression through the volume data is regular and predictable, it can improve efficiency to maintain a number of level 0 images in the vicinity of the leading and trailing edges of the MPR slab. This is because these are the level 0 images most likely to be required when a subsequent MPR slab has advanced only a small distance from a previous MPR slab.
Cache replacement protocols for the GPU accumulated image cache may be similar to those described above for the GPU block cache. For example, an LRU protocol may be used where the working set of accumulated images is smaller than allocated GPU RAM but an MRU protocol instigated when the working set of accumulated images becomes comparable to, or exceeds, the allocated GPU RAM.
Other schemes governing cache replacement can be based on the spatial distribution of the cached images through the volume data, rather than on a temporal basis. For example, low level accumulated images which are farthest from the location of the presently rendered MPR slab might be overwritten first. This can be achieved in one example by organizing the accumulated image cache in such a way as to allocate storage for N level 0 images, N/K level 1 images (where K is the number of images accumulated together at each increase in level, i.e. 4 in the above examples) and so on. Level 0 images can then be indexed by a direct mapping of the MPR plane position according to the formula: cache_index=plane_index MOD N. This type of approach removes the need to monitor temporal usage of the accumulated images in the cache and is a robust protocol, so long as the slab thickness does not exceed N images.
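The direct mapping itself is a one-liner (sketched here with an illustrative slot count):

```python
def cache_index(plane_index, n_slots):
    """Direct-mapped slot for a level 0 accumulated image:
    cache_index = plane_index MOD N. The slot is determined purely by
    the plane's position in the volume, so no temporal usage tracking
    is required. Planes N apart share a slot, which is safe so long as
    the slab thickness never exceeds N planes."""
    return plane_index % n_slots

# With N = 32 slots, plane 5 and plane 37 map to the same slot; a slab
# no more than 32 planes thick can never need both at once.
slot_a = cache_index(5, 32)    # -> 5
slot_b = cache_index(37, 32)   # -> 5
```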
Using the method of volume rendering described in relation to
Methods embodying the invention will often be used within a hospital environment. In this case, the methods may usefully be integrated into a stand-alone software application, or with a Picture Archiving and Communication System (PACS). A PACS is a hospital-based computerized network which can store volume data representing diagnostic images of different types in a digital format organized in a single central archive. For example, images may be stored in the Digital Imaging and Communications in Medicine (DICOM) format. Each image has associated patient information such as the name and date of birth of the patient also stored in the archive. The archive is connected to a computer network provided with a number of workstations, so that users all around the hospital site can access and process patient data as needed. Additionally, users remote from the site may be permitted to access the archive over the Internet.
In the context of the present invention, therefore, a plurality of image volume data sets can be stored in a PACS archive, and a computer-implemented method of generating 2D output images of a chosen one of the volume data sets according to embodiments of the invention can be provided on a workstation connected to the archive via a computer network. A user such as a radiologist, a consultant, or a researcher can thereby access any volume data set from the workstation, and generate and display images using methods embodying the invention.
In the described embodiments, a computer implementation employing computer program code for storage on a data carrier or in memory can be used to control the operation of the CPU and GPU of the computer system. The computer program can be supplied on a suitable carrier medium, for example a storage medium such as solid state memory, magnetic, optical or magneto-optical disk or tape based media. Alternatively, it can be supplied on a transmission medium, for example a medium with a carrier such as a telephone, radio or optical channel.
It will be appreciated that although particular embodiments of the invention have been described, many modifications/additions and/or substitutions may be made within the scope of the present invention. Accordingly, the particular examples described are intended to be illustrative only, and not limitative.