One or more embodiments generally relate to a graphical processing unit (GPU) and, in particular, to a dynamic configurable GPU and reduction of GPU data movement.
Graphical processing units (GPUs) are primarily used to perform graphics rendering. A GPU typically contains a number of different physical storage structures. Examples of such structures may include: register file (RF), first level instruction cache (L1I$), first level data cache (L1D$), first level constant cache (L1C$), texture cache (T$), and second level cache (L2$). With this GPU architecture, at design time a trade-off occurs in determining an amount of chip area to dedicate to each of these physical structures. A factor for this determination is that the optimal area allocation is different for different applications that will run on the GPU. That is, the chosen allocation may be a compromise and cannot be tailored to each specific application.
One or more embodiments generally relate to a dynamic configurable GPU and reduction of GPU data movement. In one embodiment, a method provides for storage allocation for a graphical processing unit. In one embodiment, the method includes maintaining a unified storage structure for the graphical processing unit. In one embodiment, multiple physical storage structures are virtualized in the unified storage structure by dynamically forming multiple logical storage structures from the unified storage structure for the multiple physical storage structures.
In one embodiment a non-transitory computer-readable medium having instructions which when executed on a computer perform a method comprising maintaining a unified storage structure for a graphical processing unit. In one embodiment, multiple physical storage structures are virtualized in the unified storage structure by dynamically forming a plurality of logical storage structures from the unified storage structure for the multiple physical storage structures.
In one embodiment, a graphics processor for an electronic device comprises: one or more processing elements coupled to a memory heap device. In one embodiment, the memory heap device comprises: a physical memory structure including a plurality of logical storage structures representing a plurality of physical storage structures. In one embodiment, the plurality of logical storage structures are each mapped into the physical memory structure. In one embodiment, a shared memory storage device is dynamically shared between each of the plurality of logical storage structures.
These and other aspects and advantages of one or more embodiments will become apparent from the following detailed description, which, when taken in conjunction with the drawings, illustrate by way of example the principles of the one or more embodiments.
For a fuller understanding of the nature and advantages of the embodiments, as well as a preferred mode of use, reference should be made to the following detailed description read in conjunction with the accompanying drawings, in which:
The following description is made for the purpose of illustrating the general principles of one or more embodiments and is not meant to limit the inventive concepts claimed herein. Further, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations. Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.
One or more embodiments provide a dynamically configurable GPU and reduction of GPU data movement using a unified storage device for logically mapping physical storage devices to the unified storage device. In one embodiment, a unified heap architecture (UHA) is used for unifying the separate GPU physical memory storage structures into a single physical structure or UHA. In one embodiment, the various logical structures are then mapped into the heap. Examples of mapped storage structures are the register file, a plane equation table, a primitive mapping table, thread descriptor queues, a graphics state table, first level instruction cache, first level data cache, first level constant cache, texture cache, and second level cache.
In one embodiment, the storage structures that are mapped to the UHA may be divided into categories, such as (1) cache structures and (2) fixed structures. In one embodiment, some metadata is needed to represent the storage structures. In one embodiment, the UHA structure is implemented as multiple banks of random access memory (RAM), such as static RAM (SRAM), etc. In one or more embodiments, multiple alternatives are used to organize the UHA structure, such as: pointer-based mapping (pointers and optionally size descriptors are used in implementing the storage structures and point to locations within a storage structure); fixed mapping (there is a fixed mapping between a particular logical structure of the UHA heap structure and a location in the storage structure, and a separate portion of the UHA structure identifies the current use of the specific location of the UHA structure); and a hybrid of pointer-based and fixed mapping (i.e., some storage structures use pointers while other storage structures use fixed mapping).
In one embodiment, a method provides for storage allocation for a GPU. In one embodiment, the method includes maintaining a unified storage structure for the GPU. In one embodiment, multiple physical storage structures are virtualized in the unified storage structure by dynamically forming multiple logical storage structures from the unified storage structure for the multiple physical storage structures.
Any suitable circuitry, device, system or combination of these (e.g., a wireless communications infrastructure including communications towers and telecommunications servers) operative to create a communications network may be used to create communications network 110. Communications network 110 may be capable of providing communications using any suitable communications protocol. In some embodiments, communications network 110 may support, for example, traditional telephone lines, cable television, Wi-Fi (e.g., an IEEE 802.11 protocol), Bluetooth®, high frequency systems (e.g., 900 MHz, 2.4 GHz, and 5.6 GHz communication systems), infrared, other relatively localized wireless communication protocols, or any combination thereof. In some embodiments, the communications network 110 may support protocols used by wireless and cellular phones and personal email devices (e.g., a Blackberry®). Such protocols can include, for example, GSM, GSM plus EDGE, CDMA, quadband, and other cellular protocols. In another example, a long range communications protocol can include Wi-Fi and protocols for placing or receiving calls using VOIP, LAN, WAN, or other TCP-IP based communication protocols. The transmitting device 12 and receiving device 11, when located within communications network 110, may communicate over a bidirectional communication path such as path 13, or over two unidirectional communication paths. Both the transmitting device 12 and receiving device 11 may be capable of initiating a communications operation and receiving an initiated communications operation.
The transmitting device 12 and receiving device 11 may include any suitable device for sending and receiving communications operations. For example, the transmitting device 12 and receiving device 11 may include mobile telephone devices, television systems, cameras, camcorders, devices with audio/video capabilities, tablets, wearable devices, and any other device capable of communicating wirelessly (with or without the aid of a wireless-enabling accessory system) or via wired pathways (e.g., using traditional telephone wires). The communications operations may include any suitable form of communications, including for example, voice communications (e.g., telephone calls), data communications (e.g., e-mails, text messages, media messages), video communication, or combinations of these (e.g., video conferences).
In one embodiment, all of the applications employed by the audio output 123, the display 121, input mechanism 124, communications circuitry 125, and the microphone 122 may be interconnected and managed by control circuitry 126. In one example, a handheld music player capable of transmitting music to other tuning devices may be incorporated into the electronics device 120.
In one embodiment, the audio output 123 may include any suitable audio component for providing audio to the user of electronics device 120. For example, audio output 123 may include one or more speakers (e.g., mono or stereo speakers) built into the electronics device 120. In some embodiments, the audio output 123 may include an audio component that is remotely coupled to the electronics device 120. For example, the audio output 123 may include a headset, headphones, or earbuds that may be coupled to communications device with a wire (e.g., coupled to electronics device 120 with a jack) or wirelessly (e.g., Bluetooth® headphones or a Bluetooth® headset).
In one embodiment, the display 121 may include any suitable screen or projection system for providing a display visible to the user. For example, display 121 may include a screen (e.g., an LCD screen) that is incorporated in the electronics device 120. As another example, display 121 may include a movable display or a projecting system for providing a display of content on a surface remote from electronics device 120 (e.g., a video projector). Display 121 may be operative to display content (e.g., information regarding communications operations or information regarding available media selections) under the direction of control circuitry 126.
In one embodiment, input mechanism 124 may be any suitable mechanism or user interface for providing user inputs or instructions to electronics device 120. Input mechanism 124 may take a variety of forms, such as a button, keypad, dial, a click wheel, or a touch screen. The input mechanism 124 may include a multi-touch screen.
In one embodiment, communications circuitry 125 may be any suitable communications circuitry operative to connect to a communications network (e.g., communications network 110).
In some embodiments, communications circuitry 125 may be operative to create a communications network using any suitable communications protocol. For example, communications circuitry 125 may create a short-range communications network using a short-range communications protocol to connect to other communications devices. For example, communications circuitry 125 may be operative to create a local communications network using the Bluetooth® protocol to couple the electronics device 120 with a Bluetooth® headset.
In one embodiment, control circuitry 126 may be operative to control the operations and performance of the electronics device 120. Control circuitry 126 may include, for example, a processor, a bus (e.g., for sending instructions to the other components of the electronics device 120), memory, storage, or any other suitable component for controlling the operations of the electronics device 120. In some embodiments, a processor may drive the display and process inputs received from the user interface. The memory and storage may include, for example, cache, Flash memory, ROM, and/or RAM/DRAM. In some embodiments, memory may be specifically dedicated to storing firmware (e.g., for device applications such as an operating system, user interface functions, and processor functions). In some embodiments, memory may be operative to store information related to other devices with which the electronics device 120 performs communications operations (e.g., saving contact information related to communications operations or storing information related to different media types and media items selected by the user).
In one embodiment, the control circuitry 126 may be operative to perform the operations of one or more applications implemented on the electronics device 120. Any suitable number or type of applications may be implemented. Although the following discussion will enumerate different applications, it will be understood that some or all of the applications may be combined into one or more applications. For example, the electronics device 120 may include an automatic speech recognition (ASR) application, a dialog application, a map application, a media application (e.g., QuickTime, MobileMusic.app, or MobileVideo.app), social networking applications (e.g., Facebook®, Twitter®, etc.), an Internet browsing application, etc. In some embodiments, the electronics device 120 may include one or multiple applications operative to perform communications operations. For example, the electronics device 120 may include a messaging application, a mail application, a voicemail application, an instant messaging application (e.g., for chatting), a videoconferencing application, a fax application, or any other suitable application for performing any suitable communications operation.
In some embodiments, the electronics device 120 may include a microphone 122. For example, electronics device 120 may include microphone 122 to allow the user to transmit audio (e.g., voice audio) for speech control and navigation of applications 1-N 127, during a communications operation or as a means of establishing a communications operation or as an alternative to using a physical user interface. The microphone 122 may be incorporated in the electronics device 120, or may be remotely coupled to the electronics device 120. For example, the microphone 122 may be incorporated in wired headphones, the microphone 122 may be incorporated in a wireless headset, the microphone 122 may be incorporated in a remote control device, etc.
In one embodiment, the camera module 128 comprises one or more camera devices that include functionality for capturing still and video images, editing functionality, communication interoperability for sending, sharing, etc., photos/videos, etc.
In one embodiment, the GPU module 129 comprises processes and/or programs for processing images and portions of images for rendering on the display 121 (e.g., 2D or 3D images). In one or more embodiments, the GPU module may comprise GPU hardware and memory (e.g., a unified heap architecture (UHA) 410).
In one embodiment, the electronics device 120 may include any other component suitable for performing a communications operation. For example, the electronics device 120 may include a power supply, ports, or interfaces for coupling to a host device, a secondary input mechanism (e.g., an ON/OFF switch), or any other suitable component.
In one or more embodiments, the boundary between the logically mapped storage structures 411-419 may be chosen dynamically so that the allocation is the best allocation for a currently running application on an electronic device (e.g., electronic device 120) using the GPU 400. In one example embodiment, if one application benefits more from having a large register file than it benefits from having a large data cache, then more of the UHA 410 physical structure may be used as a register file for that application (e.g., allocating more of the UHA 410 storage to the vRF 417). Similarly, if another application may benefit more from a large data cache, then the GPU 400 using the UHA 410 may be configured to allocate more of the storage space of the UHA 410 to the logical data cache (e.g., vL1D$ 416). One or more embodiments provide the ability to tailor the allocation to the currently running application, which results in a more efficient working point (e.g., higher performance and/or lower power dissipation) as compared with architectures that are limited in memory structure size at the time of manufacturing (e.g., GPU 300).
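The per-application split described above can be illustrated with a minimal sketch. This is not from the specification; the function name, the capacity figure, and the fraction parameter are all hypothetical, and a real driver would choose the split from profiling data rather than a hand-supplied fraction.

```python
# Illustrative sketch: dividing a fixed UHA capacity between a virtual
# register file (vRF) and a virtual L1 data cache (vL1D$). All names
# and numbers are hypothetical.

def partition_uha(total_kib, rf_fraction):
    """Split UHA capacity between vRF and vL1D$.

    rf_fraction is the share of capacity given to the register file;
    the remainder backs the logical data cache.
    """
    if not 0.0 <= rf_fraction <= 1.0:
        raise ValueError("rf_fraction must be in [0, 1]")
    rf_kib = int(total_kib * rf_fraction)
    return {"vRF": rf_kib, "vL1D$": total_kib - rf_kib}

# A register-hungry shader gets a larger vRF; a streaming kernel gets a
# larger vL1D$ -- same physical storage, different split.
compute_split = partition_uha(256, 0.75)    # {'vRF': 192, 'vL1D$': 64}
streaming_split = partition_uha(256, 0.25)  # {'vRF': 64, 'vL1D$': 192}
```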
One or more embodiments provide for data movement elimination since the various physical structures are mapped logically in the same physical structure (e.g., UHA 410). In one example embodiment, instead of physically moving data from the second level cache into the texture cache, the texture cache may be implemented as vT$ 413 so that it simply references the data in its current location. In one embodiment, the ability to eliminate data movement results in a more power efficient design than architectures that require data movement between multiple physical devices (e.g., GPU 300).
In one embodiment, for both of cache and fixed structure categories, some metadata is needed. In one embodiment, in the case of cache structures, the metadata may consist of a tag array. In one embodiment, for fixed structures, the metadata contains the necessary information to fully identify the data region, e.g., one or more base pointers with associated size descriptors. In one embodiment, fixed structures may have more complex metadata structures than just a base pointer and size descriptor. In one example embodiment, if the fixed structure virtualizes an array of records, then the metadata may consist of a bit vector that indicates valid records in addition to the base pointer and the size descriptor. In another example embodiment, if the fixed structure virtualizes a circular buffer, the metadata may consist of a head and a tail pointer in addition to the base pointer and the size descriptor. In one embodiment, the metadata is not part of the UHA 600 itself, but is a necessary building block of the UHA 600.
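The metadata variants described above can be sketched as simple data shapes. This is an illustrative sketch only: the class and field names are hypothetical, not taken from the specification, and real hardware would hold these as packed bit fields rather than Python objects.

```python
# Sketch of the metadata shapes described above: a tag array for cache
# structures, a base pointer and size descriptor for fixed structures,
# and the extra fields for record arrays and circular buffers.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CacheMetadata:
    """Cache structures: metadata may consist of a tag array."""
    tags: List[int]

@dataclass
class FixedMetadata:
    """Fixed structures: a base pointer plus a size descriptor."""
    base: int
    size: int

@dataclass
class RecordArrayMetadata(FixedMetadata):
    """Array of records: adds a bit vector of valid records."""
    valid: List[bool] = field(default_factory=list)

@dataclass
class CircularBufferMetadata(FixedMetadata):
    """Circular buffer: adds head and tail pointers."""
    head: int = 0
    tail: int = 0

# An empty circular buffer has coincident head and tail pointers.
gst_meta = CircularBufferMetadata(base=0x1000, size=64)
```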
In one embodiment, the UHA 600 comprises metadata 620 representing all the logically mapped storage structures 411-419 as well as unified storage 610 that is dynamically shared between the logically mapped storage structures 411-419. In one embodiment, metadata for the logically mapped storage structures 411-419 is represented as metadata: PMT 601, PEQ 602, T$ 603, GST 604, TDQ 605, L1D$ 606, RF 607, L1I$ 608, L1C$ 609 and L2$ 630. In one embodiment, the shared unified storage 610 may comprise a banked unified storage device including multiple arrays of memory (e.g., RAM arrays 640).
In one embodiment, although all the metadata 620 is shown as grouped into a single metadata structure, the metadata structures 601-609 and 630 may alternatively be implemented as separate structures.
In one embodiment, a number of ways of organizing the metadata structures 601-609 and 630 for a UHA 600 may be implemented. In one example, different logical metadata structures may map to disjoint regions of the shared storage 610. In another example embodiment, the different metadata structures map to overlapping locations, which may result in avoided data moves between structures (which results in lower power dissipation).
In one embodiment, for pointer-based metadata mapping, the metadata structures use map functions (e.g., L1D$ map function 721, T$ map function 722, I$ map function 723, L2$ map function 724, etc.) for mapping into an array of tags and pointers, e.g., 4-way associative L1D$ tags with pointers array 731 (shown in example detail as array 740), 64-way associative T$ tags with pointers array 732, 4-way associative I$ tags with pointers 733, 4-way associative L2 tags with pointers 734, etc.
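The pointer-based scheme can be sketched as a set-associative tag array whose entries carry heap pointers, so a hit yields a location in the unified storage rather than data held in a dedicated cache array. The class, the modulo index function, and the explicit way argument are all illustrative assumptions, not the specification's design.

```python
# Hedged sketch of pointer-based mapping: each tag entry in a 4-way
# set-associative array carries a pointer into the unified storage.
# All names and the indexing scheme are illustrative assumptions.

class PointerTagArray:
    def __init__(self, num_sets, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # Each entry is (tag, pointer-into-unified-storage) or None.
        self.sets = [[None] * ways for _ in range(num_sets)]

    def _index(self, addr):
        return addr % self.num_sets

    def insert(self, addr, heap_ptr, way):
        self.sets[self._index(addr)][way] = (addr // self.num_sets, heap_ptr)

    def lookup(self, addr):
        """Return the unified-storage pointer on a hit, or None on a miss."""
        tag = addr // self.num_sets
        for entry in self.sets[self._index(addr)]:
            if entry is not None and entry[0] == tag:
                return entry[1]
        return None

l1d = PointerTagArray(num_sets=16)
l1d.insert(0x40, heap_ptr=0x2A00, way=0)
assert l1d.lookup(0x40) == 0x2A00  # hit: a heap location, not the data
assert l1d.lookup(0x41) is None    # miss
```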
In one example embodiment, the L2$ map function 724 is arranged for forming an L2 tag array 810 and the unified storage structure array 710, with way 0 811, way 1 812, way 2 813 and way 3 814. In one embodiment, the L2 tag array 810 represents the second level cache. In one example embodiment, there is a fixed mapping from each tag in the L2 tag array 810 to a location in the unified storage structure array 710. In one embodiment, the tags span the entire cache. In one example embodiment, the right side of the figure shows an example allocation of the unified storage structure array 710.
In one embodiment, the L1D$, T$ and I$ may have different associativity than the L2$ and may have different (but fixed) mapping functions into the shared data array. In one embodiment, the data (e.g., address) of the L1D$ tag array 816 is input to the L1D$ map function 819, the data from the T$ tag array 817 is input to the T$ map function 820, and the data from the I$ tag array 818 is input to the I$ map function 821, where each function then operates to provide separate mapping operations.
In one example embodiment, the GST metadata 604 is mapped into way 1 812, the RF metadata 607 indicates that the register file is mapped into way 2 813, and the caches are mapped into way 3 814.
In one example embodiment, the L2$ tags are augmented with a bit indicating if the line is part of the L2 cache or if it is used for another structure. Similarly, the tags for the other structures are augmented with a similar bit. In one example embodiment, the bit determines if the corresponding tag should participate in tag matching on a cache lookup. In one embodiment, the fixed mappings are carefully chosen to ensure that a cache always has at least one way enabled for each set. In one example embodiment, it would be improper to choose mappings that result in an entire row in the unified storage structure array 710 being “blacked out.”
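The augmented-bit idea and the no-blackout constraint can be sketched as follows. The class and method names are hypothetical, and "lending a way" here stands in for whatever mechanism reassigns a way's fixed heap locations to another structure.

```python
# Sketch of the "augmented bit": each tag carries an enable bit saying
# whether its fixed heap location currently belongs to this cache, and
# disabled ways are skipped during tag matching. Illustrative names only.

class FixedMappedCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        # Per set, per way: (tag, enable-bit).
        self.entries = [[(None, True) for _ in range(ways)]
                        for _ in range(num_sets)]

    def lend_way(self, set_idx, way):
        """Give this way's storage to another structure (e.g., the vRF)."""
        self.entries[set_idx][way] = (None, False)

    def enabled_ways(self, set_idx):
        return sum(1 for _tag, enabled in self.entries[set_idx] if enabled)

    def no_set_blacked_out(self):
        """The mapping constraint: every set keeps at least one way enabled."""
        return all(self.enabled_ways(s) > 0 for s in range(self.num_sets))

cache = FixedMappedCache(num_sets=8, ways=4)
cache.lend_way(0, 3)  # way 3 of set 0 now backs another structure
assert cache.enabled_ways(0) == 3 and cache.no_set_blacked_out()
```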
In one example embodiment, a hybrid mapping scheme is built where one or more structures (e.g., the L2$) have a fixed mapping (e.g., fixed mapping 800) and the other structures have a pointer-based mapping (e.g., pointer-based mapping 700).
In one embodiment, unused space in the UHA structure (e.g., UHA 410) is tracked using one or more memory management techniques.
In one example embodiment, a free-list based memory management technique is employed. In one example embodiment, unused portions of the UHA (e.g., UHA 410) are linked together in a free-list, with each free region storing a pointer to the next free region in the shared storage.
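A minimal free-list sketch follows. It is illustrative only: the region granularity is assumed, the chain of next-pointers is modelled as a Python list rather than pointers stored inside the shared storage, and the class name is hypothetical.

```python
# Sketch of free-list management: free regions of the shared storage are
# chained together, and allocation pops regions from the head of the
# list. Names and granularity are assumptions.

class FreeListAllocator:
    def __init__(self, num_regions):
        # In hardware each free region would store the index of the next
        # free region; here the chain is a Python list of region indices.
        self.free = list(range(num_regions))

    def alloc(self):
        """Pop a region off the free list; None if storage is exhausted."""
        return self.free.pop(0) if self.free else None

    def release(self, region):
        """Return a region to the free list."""
        self.free.append(region)

heap = FreeListAllocator(num_regions=4)
a = heap.alloc()   # region 0
b = heap.alloc()   # region 1
heap.release(a)    # region 0 goes back on the list
```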
In one example embodiment, a bit vector-based memory management technique is employed. In one example embodiment, a possible problem with the free-list based memory management technique is that the shared storage itself needs to be accessed to perform memory allocation operations, which consumes some of the available bandwidth. In one example embodiment, a separate metadata structure of bit vectors is maintained where each bit represents a specific region in the shared storage. In one example embodiment, the bits of the bit vectors indicate if the region is currently allocated or not.
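The bit-vector alternative can be sketched as below; the first-fit search and the class name are illustrative assumptions. The point the sketch makes is the one in the text: the allocation state lives in a separate metadata word, so claiming or freeing a region never touches the shared storage itself.

```python
# Sketch of bit-vector allocation: one bit per fixed-size region of the
# shared storage, kept in separate metadata so allocation does not
# consume shared-storage bandwidth. Illustrative names only.

class BitVectorAllocator:
    def __init__(self, num_regions):
        self.num_regions = num_regions
        self.allocated = 0  # bit i set => region i is in use

    def alloc(self):
        """Claim the first clear bit (first fit); None if all set."""
        for i in range(self.num_regions):
            if not (self.allocated >> i) & 1:
                self.allocated |= 1 << i
                return i
        return None

    def free(self, region):
        self.allocated &= ~(1 << region)

bv = BitVectorAllocator(8)
r0 = bv.alloc()   # region 0
r1 = bv.alloc()   # region 1
bv.free(r0)       # clearing the bit makes region 0 available again
```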
In one embodiment, a combination of two or more of the above mentioned techniques may be employed. In one example embodiment, the UHA may be semi-statically divided into multiple regions by the driver and allocations inside one of the regions may be controlled using bit vectors and the others may use a free-list.
In one example embodiment, draw commands enter the pipeline 1100 from the graphics driver (or optionally from a command processor) into the IA 1101. In one example embodiment, associated with a draw command is a graphics state (GS) (the current state of the OpenGL state machine for a pipeline implementing the OpenGL API, the current state of a Direct3D implemented pipeline, etc.). In one embodiment, the GS is written into the GST 304. In one example embodiment, the GST 304 is implemented as a circular buffer in the UHA. In one example embodiment, space for the GST 304 is originally allocated by the graphics driver and its location is defined by a base pointer and a size descriptor, and additionally has an associated tail and head pointer that describes the region that currently contains the GS. In one example embodiment, when a new GS is written into the GST 304, the tail pointer is modified, and when it is later removed from the GST 304 the head pointer is modified. GS is used by a large number of units in the pipeline 1100 (for simplification, not all connections are shown in pipeline 1100).
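The circular-buffer discipline described for the GST can be sketched as follows. The class, the slot arithmetic, and the occupancy counter are illustrative assumptions; the base/size/head/tail roles are the ones named in the text.

```python
# Sketch of the GST as a circular buffer: the driver allocates a fixed
# region (base + size), the tail pointer advances when a new graphics
# state is written, and the head pointer advances when one is removed.
# All names are hypothetical.

class CircularBuffer:
    def __init__(self, base, size):
        self.base, self.size = base, size
        self.head = self.tail = 0   # empty when no entries are held
        self.count = 0

    def push(self, _state):
        """Write a new GS entry; advances the tail pointer."""
        if self.count == self.size:
            raise RuntimeError("GST full")
        slot = self.base + self.tail
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return slot

    def pop(self):
        """Retire the oldest GS entry; advances the head pointer."""
        if self.count == 0:
            raise RuntimeError("GST empty")
        self.head = (self.head + 1) % self.size
        self.count -= 1

gst = CircularBuffer(base=0x1000, size=4)
gst.push("gs0")   # written at base + 0; tail moves to 1
gst.push("gs1")   # written at base + 1; tail moves to 2
gst.pop()         # gs0 retired; head moves to 1
```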
In one example embodiment, the IA 1101 fetches vertices and other information from memory. The IA 1101 writes primitive mappings into the PMT 301 that is virtualized into the UHA. In one embodiment, the PMT 301 is managed like a circular buffer. In one embodiment, the IA 1101 also creates VS 1102/1103 threads. In one example embodiment, the corresponding thread descriptors are written into the TDQ 305 (there may be multiple TDQs 305). In one example embodiment, arbitration logic will launch VS threads and PS threads onto the shader cores (the shader cores show up twice in the pipeline 1100 as VS 1102/1103 and PS 1106/1107 since a unified shader core is used). In one embodiment, the shader cores may also be used for a geometry shader, compute shader, hull shader, etc. In one embodiment, the TDQ 305 is virtualized as an array in the UHA.
In one embodiment, even as a thread is launched onto a shader core, it stays in the TDQ 305 virtualized array, but a corresponding bit in the TDQ metadata structure (e.g., TDQ metadata 605) is set to indicate that the thread has been launched.
In one example embodiment, when the VS 1102/1103 performs a memory request, it does so by first checking one or more of the caches in the memory hierarchy. In one embodiment, the caches 1110 are virtualized into the UHA. In one embodiment, the caches 1110 have traditional tag structures outside of the UHA that indicate if the data is present in the UHA. In one embodiment, if the data is present in the UHA, it may be located using a particular mapping function. In one embodiment, it is assumed that the mapping between a particular tag and a heap location is fixed and decided at design time.
In one embodiment, output from the VS 1102/1103 goes to the CCV unit 1104. In one embodiment, the CCV unit 1104 reads primitive mappings from the PMT 301 and output from the VS 1102/1103 and passes primitives that pass the clip and cull test to the RAST 1105. In one embodiment, if a primitive mapping is not needed anymore, the CCV unit 1104 instructs the PMT metadata structure (e.g., PMT metadata 601) to deallocate the corresponding entry.
In one embodiment, the RAST 1105 creates plane equations and writes them into the PEQ 302 which is virtualized by the UHA. In one example embodiment, the PEQ 302 has been allocated as a fixed size structure and is fully defined by a base pointer and a size descriptor. In one example embodiment, the RAST 1105 also creates pixel shader threads and writes them into the TDQ 305 and updates appropriate metadata (e.g., clearing bits to indicate that the threads are not yet launched). In one embodiment, in order to create pixel shader threads, register files need to be allocated just as for vertex shader threads.
In one embodiment, as threads are launched onto the shader core (enters the PS 1106/1107 stage), the corresponding metadata bits are set to indicate that they have been launched. In one embodiment, the PS 1106/1107 often does texture requests through the TEX 321, which accesses the PEQ 302. In one embodiment, the PS 1106/1107 also does memory accesses through the texture cache.
In one embodiment, output from the PS 1106/1107 goes to the depth and blend 1108 stage. In one embodiment, the depth and blend 1108 stage performs depth testing and blending as defined by the GS that is virtualized in the UHA. In one example embodiment, this is the last stage in the example graphics pipeline 1100.
In one embodiment, in block 1220 the plurality of logical storage structures are mapped into a physical device structure. In one embodiment, the physical device structure comprises a unified storage structure or UHA. In one embodiment, in block 1230 a storage device (e.g., storage device 610) is dynamically shared between the plurality of logical storage structures.
In one example embodiment, in block 1240 the physical device structure and shared storage device are used for a GPU (e.g., GPU 400).
In one embodiment, in process 1200 the metadata is stored in one or more dedicated metadata structures (e.g., metadata structures 620).
In one embodiment, in process 1200 a fixed mapping exists between the one or more metadata structures and locations in the shared storage device. In one embodiment, in process 1200 a portion of the one or more metadata structures contain pointers into the unified storage structure and a fixed mapping exists between metadata structures without pointers into the unified storage structure and the unified storage structure. In one embodiment, in process 1200 unused space in the unified storage structure is tracked using a combination of a free-list and metadata organized as bit vectors.
The communication interface 517 allows software and data to be transferred between the computer system and external devices through the Internet 550, mobile electronic device 551, a server 552, a network 553, etc. The system 500 further includes a communications infrastructure 518 (e.g., a communications bus, cross-over bar, or network) to which the aforementioned devices/modules 511 through 517 are connected.
The information transferred via communications interface 517 may be in the form of signals such as electronic, electromagnetic, optical, or other signals capable of being received by communications interface 517, via a communication link that carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, and/or other communication channels.
In one implementation of one or more embodiments in a mobile wireless device (e.g., a mobile phone, tablet, wearable device, etc.), the system 500 further includes an image capture device 520, such as a camera 128.
In one embodiment, the system 500 includes a graphics processing module 530 that may implement processing similar to that described regarding the UHA 410.
As is known to those skilled in the art, the aforementioned example architectures can be implemented in many ways, such as program instructions for execution by a processor, as software modules, microcode, as a computer program product on computer readable media, as analog/logic circuits, as application specific integrated circuits, as firmware, as consumer electronic devices, AV devices, wireless/wired transmitters, wireless/wired receivers, networks, multi-media devices, etc. Further, embodiments of said architecture can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements.
One or more embodiments have been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to one or more embodiments. Each block of such illustrations/diagrams, or combinations thereof, can be implemented by computer program instructions. The computer program instructions when provided to a processor produce a machine, such that the instructions, which execute via the processor create means for implementing the functions/operations specified in the flowchart and/or block diagram. Each block in the flowchart/block diagrams may represent a hardware and/or software module or logic, implementing one or more embodiments. In alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures, concurrently, etc.
The terms “computer program medium,” “computer usable medium,” “computer readable medium”, and “computer program product,” are used to generally refer to media such as main memory, secondary memory, removable storage drive, and a hard disk installed in a hard disk drive. These computer program products are means for providing software to the computer system. The computer readable medium allows the computer system to read data, instructions, messages or message packets, and other computer readable information from the computer readable medium. The computer readable medium, for example, may include non-volatile memory, such as a floppy disk, ROM, flash memory, disk drive memory, a CD-ROM, and other permanent storage. It is useful, for example, for transporting information, such as data and computer instructions, between computer systems. Computer program instructions may be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
Computer program instructions representing the block diagram and/or flowcharts herein may be loaded onto a computer, programmable data processing apparatus, or processing devices to cause a series of operations performed thereon to produce a computer implemented process. Computer programs (i.e., computer control logic) are stored in main memory and/or secondary memory. Computer programs may also be received via a communications interface. Such computer programs, when executed, enable the computer system to perform the features of the embodiments as discussed herein. In particular, the computer programs, when executed, enable the processor and/or multi-core processor to perform the features of the computer system. Such computer programs represent controllers of the computer system. A computer program product comprises a tangible storage medium readable by a computer system and storing instructions for execution by the computer system for performing a method of one or more embodiments.
Though the embodiments have been described with reference to certain versions thereof, other versions are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the preferred versions contained herein.