The present application relates to tomography imaging devices, e.g., computed tomography devices, and to tomography control systems.
Computed tomography (CT) is a general method for non-destructive imaging of objects or organisms using X-ray transmission measurements. CT scanners are used in a wide array of applications including healthcare, industrial inspection, transportation security, electron microscopy, and synchrotron imaging. In each case, a major challenge is to compute an accurate reconstruction of an object cross section from the available measurements. Reconstructed image quality may be limited due to many causes including limited amounts of data, limited data quality, excessive sensor noise, limited sensor resolution, motion blur, incomplete projections, missing projections, detector cross-talk, X-ray source spot size, X-ray source spectrum, approximations in material models, beam hardening, and X-ray scatter.
Model-Based Iterative Reconstruction (MBIR) can be used to improve the quality of reconstructed images in applications using X-ray computed tomography. However, MBIR is known to be very computationally expensive. The MBIR reconstruction is computed by minimizing a cost function, and a number of methods exist for performing this minimization. Some reconstruction methods, such as iterative coordinate descent (ICD), have very rapid convergence and are agnostic to the geometry of the CT scanner, which makes ICD desirable for use in applications such as security CT scanners. However, a disadvantage of ICD is that it tends to require memory access patterns that do not align with the organization of typical computer system caches. In practice, this poor alignment of memory accesses can dramatically slow down the rate at which data is accessed, and therefore can severely reduce the speed at which the ICD algorithm can run on high performance computers and graphical processing units (GPUs). Therefore, improvements are needed in the field.
According to one aspect, a tomography system is disclosed, comprising one or more computer processors having a first memory, the first memory having a first access speed, a second memory having a second access speed slower than the first access speed, and one or more program modules stored on a third memory and executable by the one or more processors to perform operations. The performed operations may comprise receiving measured data from a tomography scanner, storing the measured data in the second memory, creating an initial image from the received measured data, the initial image comprising a plurality of voxels, and updating a super-voxel in the initial image. The super-voxel may comprise a plurality of voxels whose corresponding data locations in the second memory substantially overlap, with the updating performed by retrieving the corresponding data from the second memory and storing said corresponding data in the first memory, the corresponding data comprising a subset of the measured data. The first memory may be organized in cache lines, and the updating may include rearranging the portions of the measured data so that successive accesses to the measured data in the first memory by at least one of the processors proceed sequentially along each of the cache lines. The third memory may be separate from the first memory and second memory or may be part of the first memory or second memory. The voxels of the super-voxel may also be substantially adjacent in the image. The measured data may form a sinogram. The performed operations may also include storing the measured data in a memory buffer, the memory buffer located in the first memory or the second memory, the measured data corresponding to the super-voxel.
According to another aspect, a method for iterative reconstruction of computed tomography data is disclosed, comprising receiving computed tomography measured data, creating an initial image from the measured data using a computer processor, and updating spatially localized super-voxels in two or more dimensions, each super-voxel corresponding to a plurality of voxels whose corresponding measured data in memory substantially overlap.
According to another aspect, a method for iterative image reconstruction of computed tomography data is disclosed, comprising receiving computed tomography measured data and creating super-voxels with a size and shape chosen so that memory access time is reduced, wherein the measured data corresponding to the super-voxels are localized in memory.
According to another aspect, a method for forward or back projection of computed tomography data is disclosed, comprising receiving computed tomography measured data and reorganizing stored measured data corresponding to a super-voxel so that the associated stored measured data are straightened into memory locations that allow efficient access by the computer, the super-voxel comprising a plurality of image voxels whose corresponding stored measured data substantially overlap.
In the following description and drawings, identical reference numerals have been used, where possible, to designate identical features that are common to the drawings.
The attached drawings are for purposes of illustration and are not necessarily to scale.
X-ray computed tomography, positron-emission tomography, and other tomography imaging systems are referred to herein generically as “CT” systems.
Various aspects relate to Super-Voxel Buffering for Fast Inversion. The terms “I,” “we,” “our” and the like throughout this description do not refer to any specific individual or group of individuals.
Throughout this description, some aspects are described in terms that would ordinarily be implemented as software programs. Those skilled in the art will readily recognize that the equivalent of such software can also be constructed in hardware, firmware, or micro-code. Because data-manipulation algorithms and systems are well known, the present description is directed in particular to algorithms and systems forming part of, or cooperating more directly with, systems and methods described herein. Other aspects of such algorithms and systems, and hardware or software for producing and otherwise processing signals or data involved therewith, not specifically shown or described herein, are selected from such systems, algorithms, components, and elements known in the art. Given the systems and methods as described herein, software not specifically shown, suggested, or described herein that is useful for implementation of any aspect is conventional and within the ordinary skill in such arts.
The scanner typically has a translating table 312 that contains the object 210. The table is designed to move through the rotating gantry 314 so that a series of slices can be measured. In some embodiments, the scanner 200 includes multiple rows of detectors as described above. In the illustrated example, the scanner 200 is referred to as multi-slice since it can measure projections through multiple slices in a single rotation of the gantry. In other embodiments, the object 210 being scanned passes through the gantry 314 as it rotates, in which case the scanner 200 is referred to as a helical scanner since the effective trajectory of the scan is a helix. In still further embodiments, the scanner may have no gantry and may instead comprise a series of X-ray sources at different locations that can be multiplexed to make a series of projection view measurements.
The sinogram can be stored in memory, e.g., RAM, in either row-major or column-major order format. In row-major order, consecutive elements of the rows of the array are contiguous in memory. In column-major order, consecutive elements of the columns are contiguous. In row-major format, consecutive elements in memory correspond to different view angles at the same channel whereas in column-major format, consecutive elements in memory correspond to different channels at the same view angle. When row-major format is used, the “fast variable” is the view index whereas when column-major format is used, the “fast variable” is the channel index.
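By way of illustration only, the following sketch (not part of the disclosed embodiments; Python and NumPy, with hypothetical array dimensions, are assumed) shows how the storage order determines the fast variable for a sinogram indexed by channel and view:

```python
import numpy as np

# Hypothetical sinogram: 1024 detector channels x 720 view angles.
n_channels, n_views = 1024, 720

# Row-major ("C") order: consecutive memory locations hold different view
# angles at the same channel, so the view index is the fast variable.
sino_row_major = np.zeros((n_channels, n_views), order='C')

# Column-major ("F") order: consecutive memory locations hold different
# channels at the same view angle, so the channel index is the fast variable.
sino_col_major = np.zeros((n_channels, n_views), order='F')

# The strides (bytes between consecutive indices) make the layouts explicit:
print(sino_row_major.strides)  # (5760, 8): stepping the view index moves 8 bytes
print(sino_col_major.strides)  # (8, 8192): stepping the channel index moves 8 bytes
```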
Recent research has demonstrated great potential for MBIR to improve the performance of X-ray CT-based imaging in a wide range of applications including explosive detection systems (EDS), medical imaging, scientific imaging, and materials imaging. MBIR works by computing a 3D volumetric reconstruction of an object using a cost-optimization framework that takes into account a model of the scanner geometry and noise characteristics, as well as a model of the 3D image statistics.
Some MBIR systems permit reductions in X-ray dosage for a variety of clinical applications including lung cancer screening (~80% reduction) and pediatric imaging (30-50% reduction).
In systems where dosage reduction is not as important, such as EDS scanners, MBIR offers other advantages. In particular, MBIR offers the potential to a) reduce cost by reducing false alarm rates, b) extend EDS effectiveness in new threat scenarios, and c) enable the design of novel scanner hardware to meet emerging DHS/TSA needs.
To realize this potential, the image reconstruction speed of MBIR must be improved beyond that of currently available systems. Table 1 presents statistics collected for a conceptual EDS system design, including an image size of n = 262,144 voxels. The numbers and simulations are based on realistic models of scanner geometry and assumed 3D image phantoms.
Based on the data in Table 1, Table 2 estimates the performance of a single-processor-core implementation for a full 3D reconstruction and determines the speedup required for an implementation of MBIR in a real-world EDS application. Assuming a 15-second processing time per bag (i.e., real-time processing at 240 bags per hour), the required speedup is approximately 900:1, a requirement which to date has limited the adoption of MBIR in EDS applications.
MBIR is based on the numerical solution of an optimization problem described by

x̂ = argmin_x { ½ (y − Ax)ᵀ D (y − Ax) + u(x) } ,  (1)
where y is an m-dimensional vector containing the measured CT data, x is an n-dimensional vector containing the reconstructed image, A is an m×n-dimensional matrix containing the forward model of the scanner geometry, D is an m×m-dimensional diagonal matrix containing the inverse variance of the scanner noise statistics, and u(x) is the regularizing prior model used to incorporate knowledge of image behavior. Each element of x then represents the value of a volume element, or voxel. In practice, x and y can be very large, so computing the solution to the optimization problem can be a formidable task.
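For illustration only, equation (1) can be evaluated as in the following sketch (Python/NumPy is assumed; D is stored as its diagonal, the prior u is passed in as a function, and the ½ scaling follows the convention of equation (1)):

```python
import numpy as np

def mbir_cost(y, A, d, x, u):
    """Evaluate the MBIR cost of equation (1) for a candidate image x.

    y : (m,) measured data; A : (m, n) forward model; d : (m,) diagonal
    entries of the inverse noise-variance matrix D; u : prior model function.
    """
    e = y - A @ x                    # error sinogram, e = y - A x
    return 0.5 * e @ (d * e) + u(x)  # weighted quadratic data term + prior

# Example with a simple quadratic prior u(x) = 0.1 * ||x||^2 (an assumption):
# cost = mbir_cost(y, A, d, x, lambda x: 0.1 * x @ x)
```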
Broadly speaking, there are two general approaches to solving this large inverse problem: simultaneous methods, such as gradient descent (GD), which update all voxels at once, and coordinate descent methods, such as iterative coordinate descent (ICD), which update voxels one at a time.
Simultaneous methods are widely used optimization approaches. However, they have a number of disadvantages for MBIR. Traditional GD has very slow convergence and may require hundreds of iterations to converge. Hence, in order to speed convergence, such methods must be combined with techniques such as preconditioning and ordered subsets. Simultaneous methods, while potentially promising, have the disadvantage that preconditioners require very sophisticated and difficult custom algorithmic tuning for each CT system. This imposes a great limitation on the commercial potential and economies of scale of simultaneous methods, since each reconstruction engine must be custom designed for its specific CT imaging system.
Ordered subset methods, which may be combined with preconditioning, are another key ingredient for speeding convergence. They work by grouping views into subsets so that each sub-iteration is much faster. Domain decomposition methods offer the potential for O(n log n) computation of simultaneous projections (Ax), as opposed to the direct O(n^(3/2)) complexity. This is a potentially huge speedup for real systems, in which one might have n ≈ 10^8. However, domain decomposition methods require a full set of views, so they do not work well when combined with ordered subset methods or with CT systems that use sparse views. This severely limits the application of domain decomposition methods.
Alternatively, ICD and NHICD optimization methods have been shown to exhibit very rapid and robust numerical convergence for a wide variety of MBIR applications spanning a wide range of geometries, noise statistics, and image models. Intuitively, coordinate descent methods are fast and robust because each pixel is updated to best minimize the cost function, so there is tight feedback in the optimization process. In practice, NHICD has been shown to converge in 3 to 6 iterations. Also, when properly implemented on older central processing unit (CPU) architectures, ICD has been demonstrated to use the hardware resources very efficiently, so the actual Wall-Time Per Iteration (WTPI) is low and comparable to that of the best implementations of simultaneous methods. The rapid convergence is therefore a decisive advantage of coordinate descent methods on multi-core platforms. However, a disadvantage of ICD is that it requires that memory be accessed along columns of the A matrix from equation (1). On GPUs with very wide buses and other specialized hardware, this constraint restricts the order of memory access and can degrade the theoretical WTPI of coordinate descent methods. Memory access can also be problematic on more modern CPUs with advanced multi-core architectures that incorporate parallel computation into their design. For example, a modern CPU may be capable of computing 32 floating-point multiply-accumulate operations per core in a single clock cycle. Keeping all processing hardware fully employed has become increasingly difficult as CPU and GPU architectures have become more sophisticated.
Once the image, the sinogram data, and the error sinogram are initialized, a coordinate descent algorithm such as ICD computes the reconstruction through a series of voxel updates until convergence occurs. To update a single voxel 502, it is necessary to access the voxel data 504 and then use this data to compute an update of the voxel. After the voxel is updated, the error sinogram must also be updated so that the relationship e = y − Ax is maintained. In practice, many voxel updates must occur in order to compute a single reconstructed image.
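A minimal sketch of one such voxel update is given below (an illustration only, not the disclosed method: Python/NumPy, a dense A, and a simple quadratic prior are assumed, whereas a real system matrix would be sparse and the prior model may differ):

```python
import numpy as np

def icd_update(j, x, e, A, d, beta, nbrs):
    """One ICD update of voxel j, maintaining the error sinogram e = y - A x.

    x : (n,) image; e : (m,) error sinogram; A : (m, n) forward model;
    d : (m,) diagonal of the noise weight matrix D; beta : prior strength;
    nbrs : indices of the voxels neighboring voxel j.
    """
    a_j = A[:, j]                  # the voxel data: column j of A
    theta1 = -a_j @ (d * e)        # first derivative of the data term at x[j]
    theta2 = a_j @ (d * a_j)       # second derivative of the data term
    # Closed-form step for the quadratic prior u(x) = beta * sum_k (x_j - x_k)^2:
    diff = x[j] - x[nbrs]
    delta = -(theta1 + 2 * beta * diff.sum()) / (theta2 + 2 * beta * len(nbrs))
    x[j] += delta                  # update the voxel ...
    e -= a_j * delta               # ... and restore e = y - A x
```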
In a modern high-performance computer or GPU, data to be processed is read from the main memory into a series of specialized cache memories before being processed. This is done because the cache memories are much smaller but also much faster than the main memory. In some cases, there is a series of ever smaller and faster cache memories. For example, an L1 cache memory is typically the fastest and smallest cache memory. By fast, we mean that the cache memory has lower read latency and can provide more data in a fixed amount of time. The processor may also have an L2 cache that is larger and slower than the L1 cache but still smaller and faster than main memory. There may also be L3 and L4 caches, and so on. When a processor reads a single data entry from main memory, it first checks whether the data entry is in the L1 cache. If it is in the L1 cache, the processor uses this value. If it is not in the L1 cache, then the processor hardware reads a cache line of data from the main memory. A cache line corresponds to a fixed, predetermined number of data values or bytes that is transferred from the main memory to the cache memory. Cache lines also start at predetermined locations in the main memory. Both the length of cache lines and their starting locations are fixed for each specific processor and typically cannot be changed. The key issue is that reading even one byte of data from main memory requires the entire cache line containing that byte to be read into the cache memory.
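The practical consequence of cache-line granularity can be demonstrated with a small experiment (illustrative Python/NumPy; the array size and the timings are machine-dependent assumptions): summing data laid out sequentially in memory uses every byte of each fetched cache line, whereas summing along a strided direction fetches a whole cache line for each value actually used.

```python
import time
import numpy as np

n = 4096
a = np.zeros((n, n))               # row-major (C order) by default

def time_outer_axis_sums(view):
    t0 = time.perf_counter()
    total = 0.0
    for row in view:               # each 'row' is one slice along the outer axis
        total += row.sum()
    return time.perf_counter() - t0

t_seq = time_outer_axis_sums(a)        # walks memory sequentially, cache-friendly
t_strided = time_outer_axis_sums(a.T)  # jumps n*8 bytes between elements
print(f"sequential: {t_seq:.3f}s   strided: {t_strided:.3f}s")
```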
Accelerators, such as graphical processing units (GPUs), have local memory that can be viewed as a cache memory managed by the processor that controls it. For data to be transferred efficiently into this logical cache memory, large blocks of data should be located contiguously in memory.
According to one embodiment of the present disclosure, the scanner system 200 is configured to group voxels into “super-voxels”. Each super-voxel 602 comprises a plurality of voxels 502 whose corresponding sinogram data substantially overlap, and the sinogram data associated with a super-voxel 602 are copied into a contiguous region of memory referred to herein as a super-voxel buffer (SVB) 802.
Once accessed, the SVB 802 is typically transferred automatically to the processor's cache memory, so repeated updates of the individual voxels in the super-voxel may be computed very efficiently and at very high speed. Therefore, multiple updates of voxels 502 in the super-voxel 602 can be computed very rapidly until the desired level of convergence of the MBIR algorithm is achieved on the super-voxel 602. Each update of a voxel 502 is typically designed to reduce or minimize the cost function of equation (1). This can be done using a variety of methods well known to one skilled in the art. These methods for updating the voxel 502 require access to the voxel data 504, which typically comprises the associated sinogram measurements, weights, and errors. Then, once updates of voxels 502 in the super-voxel 602 are complete, the super-voxel buffer 802 can be copied back to the sinogram memory. This super-voxel update is then performed repeatedly across the image until total convergence of the reconstructed image is achieved.
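As an illustration, the gather-update-scatter structure of a super-voxel update might look like the following sketch (Python/NumPy assumed; `icd_update` is the earlier illustrative sketch, and `sv_trace`, the set of sinogram indices a super-voxel projects onto, is a hypothetical precomputed mapping derived from the nonzero pattern of A):

```python
import numpy as np

def update_super_voxel(sv_voxels, sv_trace, x, err, weights, A, beta, nbrs_of):
    """Update all voxels of one super-voxel through a super-voxel buffer.

    sv_voxels : indices of the voxels in the super-voxel
    sv_trace  : sinogram indices covered by the super-voxel's projections
    """
    # Gather: copy the super-voxel's sinogram entries into small contiguous
    # buffers (the SVB) so that repeated voxel updates hit fast memory.
    svb_err = err[sv_trace].copy()
    svb_w = weights[sv_trace].copy()
    A_sv = A[sv_trace, :]              # rows of A restricted to the trace

    for j in sv_voxels:                # repeated updates now stay in the SVB
        icd_update(j, x, svb_err, A_sv, svb_w, beta, nbrs_of(j))

    # Scatter: copy the updated buffer back to the full sinogram memory.
    err[sv_trace] = svb_err
```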
According to further embodiments, multiple super-voxel buffers 802 may be created and processed in a pipelined manner to further improve system efficiency. In different embodiments the super-voxel buffer 802 may be physically realized as different cache memories in a processor's cache memory hierarchy, RAM memory, GPU RAM memory, or special memory added to the computer processing system. In some cases, the sinogram memory can be directly copied to an explicitly defined super-voxel buffer section of memory. This is referred to herein as an explicit super-voxel buffer. In other cases, software commands can be used to cause the existing memory control system of the hardware to copy the data of the super-voxel buffer 802 into cache memory or other fast memory. This is referred to herein as an implicit super-voxel buffer. Various examples use corner-turning memory or GPU 2-D mapped memory.
The mapping from the sinogram to the super-voxel buffer 802 can be determined from the forward matrix A. This is important since it means that the super-voxel buffer method can be implemented without custom tuning based on the system geometry. This means that the same general purpose inversion engine 302 can be used to solve a wide range of inversion problems for different applications by simply loading the system matrix, A, and the parameters of the prior model u(x).
Table 3 illustrates the effectiveness of the above-described super-voxel method on a specific CT reconstruction problem. Table 3 lists processor efficiency and execution time for the baseline algorithm and the super-voxel algorithm. In this problem, we directly compared the super-voxel MBIR reconstruction approach to an equivalent benchmark implementation of MBIR without the super-voxel method. In this particular example, the processor efficiency and execution time were measured for a reconstruction problem with the following parameters: image size: 512×512; number of slices: 1; number of view angles: 720; number of channels: 1024. Each method was run to approximate convergence with an approximate average of 4 updates per image voxel. The efficiency measures the percentage of floating-point operations per second achieved as compared to the theoretical maximum for the processor. Here we see that the super-voxel method yields an approximately 22× improvement in processor efficiency and an approximately 25× reduction in execution time.
The super-voxel algorithm can be implemented in further embodiments to work in various ways that are described in more detail below.
For example, in one embodiment, updates within a super-voxel 602 can be done in parallel. In this case, ICD voxel updates in the super-voxel 602 are computed independently. While the original ICD algorithm assumed sequential updates, parallel updates may be desirable since they allow a large number of processing elements in the hardware to be used. In addition, the update of a single voxel 502 can be parallelized by assigning multiple processing elements to different views or regions of the sinogram or SVB. Also, parallelization can be achieved in multiple ways by using both parallel voxel updates and parallel sinogram computations.
Because the size of a super-voxel 602 affects both the processing speed and convergence of the algorithm, in some examples super-voxels 602 have a uniform size and all voxels 502 are contained in a super-voxel 602 of this uniform size. Some embodiments include techniques to spatially pack voxels and update them in non-uniform sequences in order to speed convergence.
In some examples, to maximize cache efficiency, each super-voxel 602 may be substantially round or hexagonal so that the width of the super-voxel buffer 802 is minimized. In general, the best super-voxel shape represents a tradeoff between the two competing needs of minimizing the size of the super-voxel data and minimizing the overlap of the super-voxels. In practice, hexagonal super-voxels can pack optimally in 2D while providing a nearly round shape. In addition, super-voxels can be visited in non-sequential order so that more important super-voxels 602 are visited more frequently. In some cases, the same super-voxel 602 can also be updated two or more times in a row. This can be particularly useful if a particular super-voxel 602 has a large change in its first update. Both optimized packing and non-sequential updates can substantially improve convergence speed and reduce computation. In some cases, the updating of a super-voxel 602 or a voxel 502 in the super-voxel 602 can be skipped. This is particularly desirable if all the voxels 502 in the super-voxel are zero or if the voxels surrounding a particular voxel are zero, since this can be a strong indication that the voxel or super-voxel has already converged to its final value.
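For illustration, one way to construct such a hexagonal super-voxel footprint on the voxel grid is sketched below (an assumption about one reasonable construction, using Python/NumPy; the disclosed embodiments do not prescribe this particular formula):

```python
import numpy as np

def hex_super_voxel_mask(s):
    """Boolean mask of a flat-topped regular hexagon with circumradius s voxels."""
    r = int(np.ceil(s))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    # A flat-topped hexagon is the intersection of a horizontal slab with two
    # slanted slabs; its vertices lie at (+/-s, 0) and (+/-s/2, +/-s*sqrt(3)/2).
    return (np.abs(y) <= s * np.sqrt(3) / 2) & \
           (np.abs(x) + np.abs(y) / np.sqrt(3) <= s)

print(hex_super_voxel_mask(4).astype(int))  # voxels of one hexagonal super-voxel
```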
According to one embodiment, four levels of parallelism may be implemented according to the present disclosure, listed here from finest grain to coarsest grain: (1) intra-voxel parallelism, in which the update of a single voxel 502 is parallelized across views or regions of the sinogram; (2) intra-SV parallelism, in which multiple voxels 502 within a super-voxel 602 are updated in parallel; (3) inter-SV parallelism, in which multiple super-voxels 602 that are far apart in the image are updated in parallel; and (4) inter-slice parallelism, in which multiple slices are processed in parallel. Fine-grained parallelism is best exploited by computational hardware that is tightly coupled, whereas coarse-grained parallelism is best exploited by loosely coupled hardware.
Inter-SV parallelism (with each SV 602 containing more than one voxel 502) has fewer conflicts on shared data than intra-SV parallelism, reduced communication overhead, and sufficient work to keep every core busy. Intuitively, fewer conflicts mean that inter-SV parallelism allows each core to do more work between possible inter-core communications. In addition, the frequency of communication is reduced proportionally to the SV 602 size.
Moreover, these four major levels of parallelism are largely orthogonal, so the total number of parallel operations increases as the product of the number of parallel threads at each level. This is important because the number of cores in modern computing systems is rapidly increasing with each new generation of device. Hence, keeping all these cores efficiently busy is a crucial requirement for fast, efficient MBIR. In addition, it is important to note that these four levels of parallelism are hierarchical, since one slice comprises a plurality of SVs 602 and each SV 602 comprises a plurality of voxels 502.
For inter-SV parallelism, we operate on multiple SVs 602 that are far apart in the image space. The distance between these SVs 602 also minimizes interference and lock contention resulting from simultaneous SV updates in CT scanning systems. This is because when super-voxels are farther apart, the amount of overlap in the sinogram space is smaller, meaning there is less potential for conflict when updating the sinogram data from two different SVBs.
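One simple scheduling policy that realizes this is sketched below (illustrative Python under the assumption of a square grid of super-voxel positions; the stride value is arbitrary): super-voxels are processed in strided groups so that any two members of a group are far apart and their sinogram traces overlap little.

```python
from itertools import product

def distant_sv_groups(n_sv_rows, n_sv_cols, stride=4):
    """Yield groups of super-voxel grid positions. Any two positions within a
    group differ by a multiple of `stride` along each axis, so every pair is
    at least `stride` super-voxels apart (Chebyshev distance)."""
    for dr, dc in product(range(stride), repeat=2):
        yield [(r, c)
               for r in range(dr, n_sv_rows, stride)
               for c in range(dc, n_sv_cols, stride)]

groups = list(distant_sv_groups(8, 8))
# Each group's members can be dispatched to separate cores with low contention.
print(len(groups), "groups of", len(groups[0]), "super-voxels each")
```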
Table 1 shows the results of applying the SV-ICD and the parallel SV-ICD (PSV-ICD) algorithms on a multicore CPU (two Intel Xeon E5-2660 v3 2.6 GHz processors with 20 cores in total) using a standard tomographic data set [25,26]. The column labeled “Equits” represents the number of equivalent iterations required for convergence of the MBIR algorithm. The column labeled “Recon time” is the total time in seconds required to perform MBIR reconstruction, and the column labeled “Speedup” shows the speedup resulting from SV-ICD or PSV-ICD for each case. Notice that with only one core, the serial version of SV-ICD runs 55 times faster than the baseline ICD implementation of MBIR. The PSV-ICD algorithm, however, allows the use of multiple parallel cores and results in a total speedup of 496× when 20 cores are used.
Finally, it can be observed that the SV-ICD and PSV-ICD algorithms can be used to compute conventional forward and back projections without a full implementation of MBIR. In this case, SV-ICD and PSV-ICD are used to compute only the forward projection Ax or the back projection A^T y. This is important since the SV-ICD and PSV-ICD algorithms can then be used to implement either conventional FBP reconstructions or other types of iterative reconstruction based on simultaneous updates. These other iterative reconstruction algorithms include, but are not limited to, gradient descent, preconditioned gradient descent, conjugate gradient, preconditioned conjugate gradient, the simultaneous iterative reconstruction technique, the algebraic reconstruction technique, and variations of these using ordered subset techniques.
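For illustration, with the system matrix held in a sparse format (a SciPy sparse matrix is assumed here as a stand-in; a real A would be generated from the scanner geometry), the two projections reduce to the following operations:

```python
import numpy as np
import scipy.sparse as sp

m, n = 1000, 400                   # illustrative sinogram and image sizes
A = sp.random(m, n, density=0.02, format='csc')  # stand-in system matrix

x = np.random.rand(n)              # image estimate
y = np.random.rand(m)              # measured sinogram

Ax = A @ x                         # forward projection
ATy = A.T @ y                      # back projection
# CSC storage keeps each column of A contiguous, favoring the per-voxel
# column accesses used by ICD; A.tocsr() instead favors row-wise access.
```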
Steps of various methods described herein can be performed in any order except when otherwise specified, or when data from an earlier step is used in a later step. Exemplary method(s) described herein are not limited to being carried out by components particularly identified in discussions of those methods.
Various aspects provide more effective processing of CT data. A technical effect is to improve the functioning of a CT scanner by substantially reducing the time required to process CT data, e.g., by performing ICD or other processes to determine voxel values. A further technical effect is to transform measured data from a CT scanner into voxel data corresponding to the scanned object.
Various aspects described herein may be embodied as systems or methods. Accordingly, various aspects herein may take the form of an entirely hardware aspect, an entirely software aspect (including firmware, resident software, micro-code, etc.), or an aspect combining software and hardware aspects. These aspects can all generally be referred to herein as a “service,” “circuit,” “circuitry,” “module,” or “system.”
Furthermore, various aspects herein may be embodied as computer program products including computer readable program code (“program code”) stored on a computer readable medium, e.g., a tangible non-transitory computer storage medium or a communication medium. A computer storage medium can include tangible storage units such as volatile memory, nonvolatile memory, or other persistent or auxiliary computer storage media, removable and non-removable computer storage media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. A computer storage medium can be manufactured as is conventional for such articles, e.g., by pressing a CD-ROM or electronically writing data into a Flash memory. In contrast to computer storage media, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transmission mechanism. As defined herein, “computer storage media” do not include communication media. That is, computer storage media do not include communications media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se.
The invention is inclusive of combinations of the aspects described herein. References to “a particular aspect” (or “embodiment” or “version”) and the like refer to features that are present in at least one aspect of the invention. Separate references to “an aspect” (or “embodiment”) or “particular aspects” or the like do not necessarily refer to the same aspect or aspects; however, such aspects are not mutually exclusive, unless otherwise explicitly noted. The use of singular or plural in referring to “method” or “methods” and the like is not limiting. The word “or” is used in this disclosure in a non-exclusive sense, unless otherwise explicitly noted.
The invention has been described in detail with particular reference to certain preferred aspects thereof, but it will be understood that variations, combinations, and modifications can be effected within the spirit and scope of the invention.
The present patent application is related to and claims the priority benefit of U.S. Provisional Patent Application Ser. No. 62/116,235, filed Feb. 13, 2015 and U.S. Provisional Patent Application Ser. No. 62/247,771, filed Oct. 29, 2015. The contents of both of these applications are hereby incorporated by reference in their entirety into the present disclosure.