Embodiments of the present invention are directed to computer graphics and more particularly to processing of textures on a parallel processor.
Three dimensional (3D) computer graphics often use a technique known as rasterization to convert a two-dimensional image described in a vector format into pixels or dots for output on a video display or printer. Each pixel may be characterized by a location, e.g., in terms of vertical and horizontal coordinates, and a value corresponding to intensities of different colors that make up the pixel. Vector graphics represent an image through the use of geometric objects such as curves and polygons. On simple 3D rendering engines, object surfaces are normally transformed into triangle meshes, and then the triangles are rasterized in order of depth in the 3D scene.
Scan-line algorithms are commonly used to rasterize polygons. A scan-line algorithm overlays a grid of evenly spaced horizontal lines over the polygon. On each line, where there are successive pairs of polygon intersections, a horizontal run of pixels is drawn to the output device. These runs collectively cover the entire area of the polygon with pixels on the output device.
In certain graphics applications bitmapped textures are “painted” onto the polygon. In such a case each pixel value drawn by the output device is determined from one or more pixels sampled from the texture. As used herein, a bitmap generally refers to a data file or structure representing a generally rectangular grid of pixels, or points of color, on a computer monitor, paper, or other display device. The color of each pixel is individually defined. For example, a colored pixel may be defined by three bytes—one byte each for red, green and blue. A bitmap typically corresponds bit for bit with an image displayed on a screen, probably in the same format as it would be stored in the display's video memory or maybe as a device independent bitmap. A bitmap is characterized by the width and height of the image in pixels and the number of bits per pixel, which determines the number of colors it can represent.
The process of transferring a texture bitmap to a surface often involves the use of texture MIP maps (also known as mipmaps). Such mipmaps are pre-calculated, optimized collections of bitmap images that accompany a main texture, intended to increase rendering speed and reduce artifacts. They are widely used in 3D computer games, flight simulators and other 3D imaging systems. The technique is known as mipmapping. The letters “MIP” in the name are an acronym of the Latin phrase multum in parvo, meaning “much in a small space”.
Each bitmap image of the mipmap set is a version of the main texture, but at a certain reduced level of detail. Although the main texture would still be used when the view is sufficient to render it in full detail, the graphics device rendering the final image (often referred to as a renderer) will switch to a suitable mipmap image (or in fact, interpolate between the two nearest) when the texture is viewed from a distance, or at a small size. Rendering speed increases since the number of texture pixels (“texels”) being processed can be much lower than with simple textures. Artifacts may be reduced since the mipmap images are effectively already anti-aliased, taking some of the burden off the real-time renderer. If the texture has a basic size of 256 by 256 pixels (textures are typically square and must have side lengths equal to a power of 2), then the associated mipmap set may contain a series of 8 images, each half the size of the previous one: 128×128 pixels, 64×64, 32×32, 16×16, 8×8, 4×4, 2×2, 1×1 (a single pixel). If, for example, a scene is rendering this texture in a space of 40×40 pixels, then an interpolation of the 64×64 and the 32×32 mipmaps would be used. The simplest way to generate these textures is by successive averaging; however, more sophisticated algorithms (perhaps based on signal processing and Fourier transforms) can also be used. The increase in storage space required to store all of these mipmaps for a texture is a third, because the sum of the areas ¼ + 1/16 + 1/64 + 1/256 + . . . converges to ⅓. (This assumes compression is not being used.)
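By way of illustration only, the mipmap arithmetic described above may be sketched in Python as follows. The function names are hypothetical and are not part of any described embodiment; the sketch assumes a square texture whose side length is a power of 2.

```python
def mipmap_chain(base_size):
    """Side lengths of the reduced mipmap levels below a square base texture."""
    if base_size < 1 or base_size & (base_size - 1):
        raise ValueError("base_size must be a power of 2")
    sizes = []
    size = base_size // 2
    while size >= 1:
        sizes.append(size)
        size //= 2
    return sizes


def storage_overhead(base_size):
    """Extra texels needed for all mipmaps, as a fraction of the base texture."""
    extra = sum(s * s for s in mipmap_chain(base_size))
    return extra / (base_size * base_size)


def bracketing_levels(base_size, target_size):
    """The two adjacent levels (larger, smaller) that bracket target_size."""
    if target_size >= base_size:
        return base_size, base_size
    levels = [base_size] + mipmap_chain(base_size)
    for larger, smaller in zip(levels, levels[1:]):
        if smaller <= target_size <= larger:
            return larger, smaller
    return levels[-1], levels[-1]


print(mipmap_chain(256))           # eight levels, 128 down to 1
print(storage_overhead(256))       # close to one third
print(bracketing_levels(256, 40))  # the 64x64 and 32x32 levels
```

For the 256 by 256 texture of the example, the overhead computed this way is slightly below one third because the series is cut off at the 1×1 level.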
The blending between mipmap levels typically involves some form of texture filtering. As used herein, texture filtering refers to a method used to map texels (pixels of a texture) to points on a 3D object. A simple texture filtering algorithm may take a point on an object and look up the closest texel to that position. The resulting point then gets its color from that one texel. This simple technique is sometimes referred to as nearest neighbor filtering. More sophisticated techniques combine more than one texel per point. The most often used algorithms in practice are bilinear filtering and trilinear filtering using mipmaps. Anisotropic filtering and higher-degree methods, such as quadratic or cubic filtering, result in even higher quality images.
Texture filtering operations for electronic devices such as video games, computers and the like are typically performed using specially designed hardware referred to as graphics processors or graphics cards. Graphics cards typically have a large memory capacity that facilitates the handling of large textures. Unfortunately, typical graphics processors have clock rates that are slower than other processors, such as cell processors. In addition, graphics processors typically implement graphics processing functions in hardware. It would be more advantageous to perform graphics processing functions on a faster processor that can be programmed with appropriate software.
Cell processors are used in applications such as vertex processing for graphics. The processed vertex data may then be passed on to a graphics card for pixel processing. Cell processors are a type of microprocessor that utilizes parallel processing. The basic configuration of a cell processor includes a “Power Processor Element” (“PPE”) (sometimes called “Processing Element”, or “PE”), and multiple “Synergistic Processing Elements” (“SPE”). The PPEs and SPEs are linked together by an internal high speed bus dubbed “Element Interconnect Bus” (“EIB”). Cell processors are designed to be scalable for use in applications ranging from handheld devices to mainframe computers.
A typical cell processor has one PPE and up to 8 SPEs. Each SPE is typically a single chip or part of a single chip containing a main processor and a co-processor. All of the SPEs and the PPE can access a main memory, e.g., through a memory flow controller (MFC). The SPEs can perform parallel processing of operations in conjunction with a program running on the main processor. The SPEs have small local memories (typically about 256 kilobytes) that must be managed by software: code and data must be manually transferred to and from the local SPE memories.
Direct memory access (DMA) transfers of data into and out of the SPE local store are quite fast. A cell processor chip with SPUs may run at about 3 gigahertz. A graphics card, by contrast, may run at about 500 MHz, which is six times slower. However, a cell processor SPE usually has a limited amount of memory space (typically about 256 kilobytes) available for texture maps in its local store. Unfortunately, texture maps can be very large. For example, a texture covering 1900 pixels by 1024 pixels would require significantly more memory than is available in an SPE local store. Furthermore, DMA transfers of data into and out of the SPE can have a high latency.
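The size mismatch noted above can be made concrete with a short calculation. The sketch below assumes three bytes per pixel (one byte each for red, green and blue, as in the bitmap example given earlier) and ignores mipmaps; the names are hypothetical.

```python
LOCAL_STORE_BYTES = 256 * 1024  # approximate SPE local store size from the text


def texture_bytes(width, height, bytes_per_pixel=3):
    """Uncompressed size of a texture, assuming a fixed number of bytes per pixel."""
    return width * height * bytes_per_pixel


size = texture_bytes(1900, 1024)
print(size)                      # bytes needed for the base texture alone
print(size / LOCAL_STORE_BYTES)  # how many local stores that would fill
```

Under these assumptions the base texture alone is more than twenty times larger than the local store, before any mipmap levels are added.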
Thus, there is a need in the art for a method for performing texture mapping of pixel data that overcomes the above disadvantages.
To overcome the above disadvantages, embodiments of the invention are directed to methods and apparatus for performing texture mapping of pixel data. A block of texture fetches is received with a co-processor element having a local memory. Each texture fetch includes pixel coordinates for a pixel in an image. The co-processor element determines one or more corresponding blocks of a texture stored in the main memory from the pixel coordinates of each texture fetch and a number of blocks NB that make up the texture. Each texture block contains all mipmap levels of the texture and N is chosen such that a number N of the blocks can be cached in a local store of the co-processor element, where N is less than NB. One or more of the corresponding blocks of the texture are loaded to the local memory if they are not currently loaded in the local memory. The co-processor element performs texture filtering with one or more of the texture blocks in the local memory to generate a pixel value corresponding to one of the texture fetches.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
Although the following detailed description contains many specific details for the purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to the following details are within the scope of the invention. Accordingly, the exemplary embodiments of the invention described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed invention.
Embodiments of the present invention allow parallel processors, such as cell processors, to produce graphics without the use of specialized graphics hardware. According to an embodiment of the present invention, a texture unit fetches and blends image pixels from various levels of detail of a texture called mipmaps and returns the resultant value for a target pixel in an image. The texture unit is an approach to retrieving filtered texture data that may be implemented entirely in software. The texture unit utilizes no specialized hardware for this task. Instead, the texture unit may rely on standard parallel processor (e.g., cell processor) hardware and specialized software. Prior art graphics cards, which have specialized hardware for this task, generally run at a much slower clock rate than a cell processor chip. A cell processor chip with a power processor element (PPE) and multiple synergistic processor elements (SPEs) may run at about 3 gigahertz. A graphics card, by contrast, may run at about 500 MHz, which is six times slower. Certain embodiments of the present invention take advantage of an SPE's independent DMA manager and, in software, try to achieve the performance of a hardware unit, and do this with the limited SPU local store that is typically available.
Embodiments of the present invention allow for texture unit operations to be done in software on single or multiple co-processor units (e.g., SPUs in a cell processor) having limited local memory and no cache. Therefore, paging of texture data can be handled by software as opposed to hardware. Achieving these types of operations requires achieving random memory access using a processor with a very small local store and no cache for hardware paging textures in and out of main memory. In embodiments of the present invention memory management steps may be managed by software where dedicated hardware has traditionally been used to solve this problem.
Embodiments of the invention may be understood by referring simultaneously to
The texture blocks 116 are used to determine pixel values for each pixel in the image. In addition, all mipmap levels 118 may be embedded in each texture block 116 to limit the paging for any texture level to one at a time. Each pixel value may be structured as shown in Table I below.
Multiple fetch structures may be loaded into the local memory 104 at a time. For example, 1024 fetch structures of 16 bytes each may be loaded at a time for a total of 16 Kbytes.
At 204 a check is performed on one or more of the pre-fetched instructions to determine if one or more texture blocks 116 that are to be operated upon are already stored in the texture block section 106 of the local memory 104. If the requisite texture blocks are not already in the local store 104 they are transferred from the main memory 101. The local memory 104 may also include a list 111 that describes which texture blocks are currently loaded.
If the check performed at 204 reveals that a required block of the main texture 112 is not loaded in the texture block section 106 of the local memory 104 then, as indicated at 208, that block may be loaded from the main memory 101 to the local memory 104. The texture blocks 116 may be transferred based on a stream of texture coordinates that are converted to hashed values that coincide with the texture block that should be loaded for each texture coordinate.
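By way of example, and without limitation, the check at 204 and the load at 208 may be sketched as follows. The function and variable names are hypothetical, a Python set stands in for the list 111, and a dictionary copy stands in for the DMA transfer from main memory; placement and replacement of blocks within the local store are outside the scope of this sketch.

```python
def ensure_block_loaded(block_id, resident, main_memory, local_store):
    """Make block_id available in local_store; return True if it was already there."""
    if block_id in resident:
        return True  # hit: no transfer from main memory is needed
    # Miss: copy the block in (a stand-in for a DMA transfer) and record it.
    local_store[block_id] = main_memory[block_id]
    resident.add(block_id)
    return False


main_memory = {(0, 0): b"block-a", (1, 0): b"block-b"}
local_store, resident = {}, set()

print(ensure_block_loaded((1, 0), resident, main_memory, local_store))  # first use: miss
print(ensure_block_loaded((1, 0), resident, main_memory, local_store))  # second use: hit
```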
To facilitate transfer of texture blocks 116, the main texture 112 may be separated in a pre-process step into one or more groups 114 of texture blocks to facilitate efficient transfers to and from the storage locations 106. Each texture block 116 would contain all mipmap levels 118 for that part of the texture 112. In preferred embodiments, the texture group 114 is a square array of texture blocks 116.
From the u and v coordinates in the fetch structures and the number NB of blocks in the texture stored in main memory, a first set of hash equations can determine which texture block to fetch from main memory. By way of example, and without limitation, the main memory block coordinates (denoted MMU and MMV respectively) for the corresponding block may be determined as follows:
MMU=int((remainder(u))*sqrt(NB))
MMV=int((remainder(v))*sqrt(NB))
Here MMU and MMV refer to the row and column coordinates for the block containing the texture that is to be mapped to the pixel coordinates (u, v) on the image. The first set of hash equations multiplies the remainder of each coordinate (u, v) of a pixel in the image by the square root of the number of blocks in the texture and returns an integer number corresponding to a main memory block coordinate. To determine MMU, the remainder of the u coordinate of the pixel location is multiplied by the square root of the number of blocks and the result is converted to an integer value. By way of example, the result may be rounded down to the nearest integer.
A second set of hash equations may be used to determine where the 16 K texture block will go in the SPU cache. The block location corresponds to a square within a square array of N blocks. The second set of hash equations preferably retains the relative positions of blocks from the texture in main memory with respect to each other. The SPU memory block location may be determined as follows:
SPUMU=int((remainder(u))*sqrt(N))
SPUMV=int((remainder(v))*sqrt(N))
where N is the number of texture blocks to be cached in the local memory 104. For example, if 9 blocks are cached, N=9 and sqrt(N)=3. If remainder(u)=0.4 and remainder(v)=0.3, then SPUMU=int(0.4*3)=int(1.2), which rounds down to 1, and SPUMV=int(0.3*3)=int(0.9), which rounds down to zero. Thus, block 8 of the texture in main memory would be stored at the location corresponding to row 0, column 1 of the texture block location 106. The texture blocks 116 may then be paged in as needed based on a hashed value of the required texture coordinate addresses. The list 111 keeps track of which blocks 116 are currently loaded to facilitate the check performed at 204.
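The two sets of hash equations may be sketched together as shown below. This is a minimal sketch under stated assumptions: remainder( ) is taken to be the fractional part of the coordinate, NB and N are perfect squares, the value NB=36 is chosen purely for illustration, and the function names are hypothetical.

```python
import math


def grid_coord(t, count):
    """int(remainder(t) * sqrt(count)), clamped to the last valid index."""
    side = math.isqrt(count)
    if side * side != count:
        raise ValueError("count must be a perfect square")
    # t % 1.0 is the assumed meaning of remainder(t): the fractional part.
    return min(side - 1, int((t % 1.0) * side))


def place_fetch(u, v, nb, n):
    """Return ((MMU, MMV), (SPUMU, SPUMV)) for one texture fetch."""
    main_block = (grid_coord(u, nb), grid_coord(v, nb))
    local_slot = (grid_coord(u, n), grid_coord(v, n))
    return main_block, local_slot


print(place_fetch(0.4, 0.3, nb=36, n=9))   # local slot (1, 0), as in the N=9 example
print(place_fetch(0.55, 0.3, nb=36, n=9))  # a different main block, same local slot
```

Because N is less than NB, more than one main memory block can map to the same local slot under these equations, which is why a record of the currently loaded blocks is still needed.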
At 210, the co-processor 102 processes pixels from a texture block 116 in the local memory 104. By way of example, the co-processor 102 may perform the bi-linear filtering for the current mipmap level and a bi-linear filter of the next mipmap level and then do a linear interpolation of the two to get the final texture color value which will be returned in a stream of data as output. The output pixels may be stored in the output section 110 of the local memory 104 as indicated at 210. The texture unit 100 may output multiple pixel values at a time, e.g., 1024 pixel values of 16 bytes each for a total of 16 Kbytes of output at one time.
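The filtering step described above may be illustrated with the following sketch, which bilinearly samples two adjacent mipmap levels and linearly interpolates between the results. It is a simplified sketch, not a description of any particular embodiment: each level is a list of rows of single-channel floating point values, coordinates are clamped at the edges of a level, and the function names and the blend parameter are hypothetical.

```python
def bilinear(level, x, y):
    """Bilinearly sample a 2D list of floats at continuous texel coordinates."""
    height, width = len(level), len(level[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = level[y0][x0] * (1 - fx) + level[y0][x1] * fx
    bottom = level[y1][x0] * (1 - fx) + level[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def trilinear(fine, coarse, u, v, blend):
    """Blend bilinear samples of two levels; u, v and blend lie in [0, 1]."""
    a = bilinear(fine, u * (len(fine[0]) - 1), v * (len(fine) - 1))
    b = bilinear(coarse, u * (len(coarse[0]) - 1), v * (len(coarse) - 1))
    return a * (1 - blend) + b * blend


fine = [[0.0, 1.0], [2.0, 3.0]]  # a 2x2 mipmap level
coarse = [[4.0]]                 # the next, 1x1 level
print(trilinear(fine, coarse, 0.5, 0.5, 0.25))
```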
By way of example, and without limitation,
In the example depicted in
Each PPE group includes a number of PPEs PPE_0 . . . PPE_g. In this example a group of SPEs shares a single cache SL1. The cache SL1 is a first-level cache for direct memory access (DMA) transfers between local storage and main storage. Each PPE in a group has its own first level (internal) cache L1. In addition the PPEs in a group share a single second-level (external) cache L2. While caches are shown for the SPE and PPE in
An Element Interconnect Bus EIB connects the various components listed above. The SPEs of each SPE group and the PPEs of each PPE group can access the EIB through bus interface units BIU. The cell processor 400 also includes two controllers typically found in a processor: a Memory Interface Controller MIC that controls the flow of data between the EIB and the main memory MEM, and a Bus Interface Controller BIC, which controls the flow of data between the I/O and the EIB. Although the requirements for the MIC, BIC, BIUs and EIB may vary widely for different implementations, those of skill in the art will be familiar with their functions and circuits for implementing them.
Each SPE includes an SPU (SPU0 . . . SPUg). Each SPU in an SPE group has its own local storage area LS and a dedicated memory flow controller MFC that includes an associated memory management unit MMU that can hold and process memory-protection and access-permission information.
The PPEs may be 64-bit PowerPC Processor Units (PPUs) with associated caches. A CBEA-compliant system includes a vector multimedia extension unit in the PPE. The PPEs are general-purpose processing units, which can access system management resources (such as the memory-protection tables, for example). Hardware resources defined in the CBEA are mapped explicitly to the real address space as seen by the PPEs. Therefore, any PPE can address any of these resources directly by using an appropriate effective address value. A primary function of the PPEs is the management and allocation of tasks for the SPEs in a system.
The SPUs are less complex computational units than PPEs, in that they do not perform any system management functions. They generally have a single instruction, multiple data (SIMD) capability and typically process data and initiate any required data transfers (subject to access properties set up by a PPE) in order to perform their allocated tasks. The purpose of the SPU is to enable applications that require a higher computational unit density and can effectively use the provided instruction set. A significant number of SPUs in a system, managed by the PPEs, allow for cost-effective processing over a wide range of applications. The SPUs implement a new instruction set architecture.
MFC components are essentially the data transfer engines. The MFC provides the primary method for data transfer, protection, and synchronization between main storage of the cell processor and the local storage of an SPE. An MFC command describes the transfer to be performed. A principal architectural objective of the MFC is to perform these data transfer operations in as fast and as fair a manner as possible, thereby maximizing the overall throughput of a cell processor. Commands for transferring data are referred to as MFC DMA commands. These commands are converted into DMA transfers between the local storage domain and main storage domain.
Each MFC can typically support multiple DMA transfers at the same time and can maintain and process multiple MFC commands. In order to accomplish this, the MFC maintains and processes queues of MFC commands. The MFC can queue multiple transfer requests and issue them concurrently. Each MFC provides one queue for the associated SPU (MFC SPU command queue) and one queue for other processors and devices (MFC proxy command queue). Logically, a set of MFC queues is always associated with each SPU in a cell processor, but some implementations of the architecture can share a single physical MFC between multiple SPUs, such as an SPU group. In such cases, all the MFC facilities must appear to software as independent for each SPU. Each MFC DMA data transfer command request involves both a local storage address (LSA) and an effective address (EA). The local storage address can directly address only the local storage area of its associated SPU. The effective address has a more general application, in that it can reference main storage, including all the SPE local storage areas, if they are aliased into the real address space (that is, if MFC_SR1[D] is set to ‘1’).
An MFC presents two types of interfaces: one to the SPUs and another to all other processors and devices in a processing group. The SPUs use a channel interface to control the MFC. In this case, code running on an SPU can only access the MFC SPU command queue for that SPU. Other processors and devices control the MFC by using memory-mapped registers. It is possible for any processor and device in the system to control an MFC and to issue MFC proxy command requests on behalf of the SPU. The MFC also supports bandwidth reservation and data synchronization features. To facilitate communication between the SPUs and/or between the SPUs and the PPU, the SPEs and PPEs may include signal notification registers that are tied to signaling events. The PPEs and SPEs may be coupled by a star topology in which the PPE acts as a router to transmit messages to the SPEs. Such a topology may not provide for direct communication between SPEs. In such a case each SPE and each PPE may have a one-way signal notification register referred to as a mailbox. The mailbox can be used for SPE to host OS synchronization.
The IIC component manages the priority of the interrupts presented to the PPEs. The main purpose of the IIC is to allow interrupts from the other components in the processor to be handled without using the main system interrupt controller. The IIC is really a second level controller. It is intended to handle all interrupts internal to a CBEA-compliant processor or within a multiprocessor system of CBEA-compliant processors. The system interrupt controller will typically handle all interrupts external to the cell processor.
In a cell processor system, software often must first check the IIC to determine if the interrupt was sourced from an external system interrupt controller. The IIC is not intended to replace the main system interrupt controller for handling interrupts from all I/O devices.
There are two types of storage domains within the cell processor: local storage domain and main storage domain. The local storage of the SPEs exists in the local storage domain. All other facilities and memory are in the main storage domain. Local storage consists of one or more separate areas of memory storage, each one associated with a specific SPU. Each SPU can only execute instructions (including data load and data store operations) from within its own associated local storage domain. Therefore, any required data transfers to, or from, storage elsewhere in a system must always be performed by issuing an MFC DMA command to transfer data between the local storage domain (of the individual SPU) and the main storage domain, unless local storage aliasing is enabled.
An SPU program references its local storage domain using a local address. However, privileged software can allow the local storage domain of the SPU to be aliased into main storage domain by setting the D bit of the MFC_SR1 to ‘1’. Each local storage area is assigned a real address within the main storage domain. (A real address is either the address of a byte in the system memory, or a byte on an I/O device.) This allows privileged software to map a local storage area into the effective address space of an application to allow DMA transfers between the local storage of one SPU and the local storage of another SPU.
Other processors or devices with access to the main storage domain can directly access the local storage area, which has been aliased into the main storage domain using the effective address or I/O bus address that has been mapped through a translation method to the real address space represented by the main storage domain.
Data transfers that use the local storage area aliased in the main storage domain should do so as caching inhibited, since these accesses are not coherent with the SPU local storage accesses (that is, SPU load, store, instruction fetch) in its local storage domain. Aliasing the local storage areas into the real address space of the main storage domain allows any other processors or devices, which have access to the main storage area, direct access to local storage. However, since aliased local storage must be treated as non-cacheable, transferring a large amount of data using the PPE load and store instructions can result in poor performance. Data transfers between the local storage domain and the main storage domain should use the MFC DMA commands to avoid stalls.
The addressing of main storage in the CBEA is compatible with the addressing defined in the PowerPC Architecture. The CBEA builds upon the concepts of the PowerPC Architecture and extends them to addressing of main storage by the MFCs.
An application program executing on an SPU or in any other processor or device uses an effective address to access the main memory. The effective address is computed when the PPE performs a load, store, branch, or cache instruction, and when it fetches the next sequential instruction. An SPU program must provide the effective address as a parameter in an MFC command. The effective address is translated to a real address according to the procedures described in the overview of address translation in PowerPC Architecture, Book III. The real address is the location in main storage which is referenced by the translated effective address. Main storage is shared by all PPEs, MFCs, and I/O devices in a system. All information held in this level of storage is visible to all processors and to all devices in the system. This storage area can either be uniform in structure, or can be part of a hierarchical cache structure. Programs reference this level of storage using an effective address.
The main memory of a system typically includes both general-purpose and nonvolatile storage, as well as special-purpose hardware registers or arrays used for functions such as system configuration, data-transfer synchronization, memory-mapped I/O and I/O subsystems. There are a number of different possible configurations for the main memory. By way of example and without limitation, Table I lists the sizes of address spaces in main memory for a particular cell processor implementation known as Cell Broadband Engine Architecture (CBEA).
Note:
The values of “m,” “n,” and “p” are implementation-dependent.
The cell processor 400 may include an optional facility for managing critical resources within the processor and system. The resources targeted for management under the cell processor are the translation lookaside buffers (TLBs) and data and instruction caches. Management of these resources is controlled by implementation-dependent tables. Tables for managing TLBs and caches are referred to as replacement management tables RMT, which may be associated with each MMU. Although these tables are optional, it is often useful to provide a table for each critical resource, which can be a bottleneck in the system. An SPE group may also contain an optional cache hierarchy, the SL1 caches, which represent first level caches for DMA transfers. The SL1 caches may also contain an optional RMT.
In a preferred embodiment, only two blocks are loaded at a time into SPE LS. As shown in
In this embodiment a hash table look up is only applied to texture blocks being loaded from main memory. However, as described above, each texture is processed into blocks for paging in and out of the SPE local store memory 600. Each texture block contains extra bordering columns and rows of pixels to the left, right, top, and bottom. For a block on the edge of the texture, the border contains the columns and rows that wrap around from the opposite side of the texture. As a result, bilinear filtering of each pixel lookup can be done without the need to load an additional block of the texture. Also as before, each mipmap level may be built this same way and included in the block. This allows for bi-linear filtering of pixels even if the fetched pixels are located on the edge of a fetched texture block.
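By way of illustration, a pre-process step that cuts a bordered block out of a texture may be sketched as follows. The sketch assumes a single mipmap level stored as a list of rows and a border one texel wide; the function name and parameters are hypothetical.

```python
def extract_bordered_block(texture, bx, by, block_size, border=1):
    """Cut block (bx, by) out of a 2D texture, padded with wrapped neighbours."""
    height, width = len(texture), len(texture[0])
    x0, y0 = bx * block_size, by * block_size
    block = []
    for y in range(y0 - border, y0 + block_size + border):
        # The modulo makes border texels wrap around to the opposite edge.
        row = [texture[y % height][x % width]
               for x in range(x0 - border, x0 + block_size + border)]
        block.append(row)
    return block


# A 4x4 texture whose texel value encodes its position (row * 4 + column).
texture = [[row * 4 + col for col in range(4)] for row in range(4)]
for row in extract_bordered_block(texture, 0, 0, block_size=2):
    print(row)
```

With the border in place, the four texels needed for a bilinear sample at the edge of the block are all found inside the same block.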
Studies performed on how a texture unit in accordance with embodiments of the invention would work indicate that there is an 80-90% hit rate for texture fetch requests where the texture block that is needed is already in the cache and doesn't need to be transferred from main memory by DMA. Such a system may take advantage of this locality of fetches if the hash texture block look-up operation pre-fetches some number of fetches (e.g., about 100 fetches) ahead of actually processing the fetched data. If a new block needs to be loaded, it can be loaded into the second buffer while processing proceeds on the block stored in the first buffer. This way DMA time may be hidden as much as possible and the SPU will spend the vast majority of its time processing pixel data that is already in the cache.
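The dependence of the hit rate on the locality of the fetches may be illustrated with a small counting sketch. It does not reproduce the studies described above and does not model DMA timing or pre-fetching. The least-recently-used replacement among the buffers is an assumption made for the sketch, and the names are hypothetical.

```python
def simulate_hit_rate(block_stream, num_buffers=2):
    """Fraction of fetches whose texture block is already resident."""
    resident = []  # block identifiers, most recently used last
    hits = 0
    for block in block_stream:
        if block in resident:
            hits += 1
            resident.remove(block)
        elif len(resident) == num_buffers:
            resident.pop(0)  # assumed policy: evict the least recently used block
        resident.append(block)
    return hits / len(block_stream) if block_stream else 0.0


# Nine fetches that land in block "A" followed by one that lands in block "B".
stream = ["A"] * 9 + ["B"]
print(simulate_hit_rate(stream))
```

A stream with strong locality, such as the one above, yields a high hit rate, whereas a stream that alternates among more blocks than there are buffers yields a low one.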
The studies described above were performed using an operation accurate CELL simulator called MAMBO, a property of IBM. The code was compiled using Cell processor compilers. Thus, the same compiled code was used as if it were running on a CELL. Although MAMBO is a simulated environment not running at CELL speed, it still is a good gauge of this algorithm's behavior, e.g., Hit Rate. It is expected that embodiments of the invention implemented with the constraints described herein on an actual CELL processor would achieve Hit Rates consistent with those observed in the studies performed using the CELL simulator. The principal factor limiting the hit rate is not the speed at which the code is run but rather the locality of the texture fetches and cache allocations on the SPE. The randomness of the fetches and the effectiveness of the cache in avoiding loading of new blocks are the primary factors limiting the Hit Rate.
By way of example, the processing of the data in the cache may be bi-linear filtering of pixel data from one mipmap level or tri-linear filtering or blending between bi-linear filtered pixels from each mipmap level. The blending between mipmap levels typically involves some form of texture filtering. Such texture filtering techniques are well-known to those of skill in the art and are commonly used in computer graphics, e.g., to map texels (pixels of a texture) to points on a 3D object.
Several SPUs performing Texture Unit operations could be comparable to dedicated graphics hardware for moderate performance. In testing, a range of 80-95% hit rate of texture already in cache was found, minimizing the amount of loading of texture blocks from main memory. An entire software system built around this software texture unit could allow for any given TV, computer, or media center to have 3D rendering capabilities in software just by having a cell processor inside.
Embodiments of the present invention can be tremendously beneficial because the DMA bandwidth is minimized by the use of the specially created 64 k texture blocks that contain border data and their mipmaps. Also the SPU fetch time may be minimized using a fast early hash sort of these texture blocks to hide DMA latency when a new block needs to be loaded. This way, SPUs can spend their time blending pixels and packing the resultant pixels into the output buffers with very little time spent waiting on texture block DMA or having to worry about edge cases for bi-linear or tri-linear filtering.
Embodiments of the present invention allow for processor intensive rendering of highly detailed textured graphics in software on an SPU. Embodiments of the present invention avoid problems that would otherwise arise due to the large amount of random memory access that texturing operations typically require. With embodiments of the present invention, texture unit operation may be done by SPUs on a cell processor very efficiently once the textures are processed into the blocks. Therefore, even video game consoles, televisions, and telephones containing a Cell processor could produce advanced texture graphics without the use of specialized graphics hardware, thus saving cost considerably and raising profits.
While the above includes a complete description of the preferred embodiment of the present invention, it is possible to use various alternatives, modifications and equivalents. Therefore, the scope of the present invention should be determined not with reference to the above description but should, instead, be determined with reference to the appended claims, along with their full scope of equivalents. Any feature described herein, whether preferred or not, may be combined with any other feature described herein, whether preferred or not. In the claims that follow, the indefinite article “A” or “An” refers to a quantity of one or more of the item following the article, except where expressly stated otherwise. The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase “means for.”