The novel features believed characteristic of the illustrative embodiments are set forth in the appended claims. The illustrative embodiments, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The illustrative embodiments provide for a compiler assisted re-configurable software implemented cache. With reference now to the figures and in particular with reference to
Logically, cell broadband engine architecture-compliant processor 100 defines four separate types of functional components: Power PC® processor element (PPE) 101 or 102, synergistic processor unit (SPU) 103, 104, 105, or 106, memory flow controller (MFC) 107, 108, 109, or 110, and the internal interrupt controller (IIC) 111. The computational units in cell broadband engine architecture-compliant processor 100 are Power PC® processor elements 101 and 102 and synergistic processor units 103, 104, 105, and 106. Each synergistic processor unit 103, 104, 105, and 106 must have dedicated local storage 112, 113, 114, or 115, a dedicated memory flow controller 107, 108, 109, or 110 with its associated memory management unit (MMU) 116, 117, 118, or 119, and replacement management table (RMT) 120, 121, 122, or 123, respectively. The combination of these components is referred to as synergistic processor element (SPE) group 124 or 125.
Cell broadband engine architecture-compliant processor 100 includes synergistic processor element groups 124 and 125, which share single SL1 caches 126 and 127, respectively. An SL1 cache is a first-level cache for direct memory access transfers between local storage and main storage. Power PC® processor element groups 101 and 102 share single second-level (L2) caches 128 and 129, respectively. While caches are shown for the synergistic processor element groups 124 and 125 and Power PC® processor element groups 101 and 102, they are considered optional in the cell broadband engine architecture. Also included in
Cell broadband engine architecture-compliant processor 100 may include multiple groups of Power PC® processor elements (PPE groups), such as Power PC® processor element group 101 or 102, and multiple groups of synergistic processor elements (SPE groups), such as synergistic processor element group 124 or 125. Hardware resources may be shared between units within a group. However, synergistic processor element groups 124 and 125 and Power PC® processor element groups 101 and 102 must appear to software as independent elements.
Each synergistic processor unit 103, 104, 105, and 106 in synergistic processor element groups 124 and 125 has its own local storage area 112, 113, 114, or 115 and dedicated memory flow controller 107, 108, 109, or 110 that includes an associated memory management unit 116, 117, 118, or 119, which can hold and process memory-protection and access-permission information.
Cell broadband engine architecture-compliant processor 100 includes one or more Power PC® processor element groups, such as Power PC® processor element groups 101 and 102. Power PC® processor element groups 101 and 102 consist of 64-bit Power PC® processor units (PPUs) 133, 134, 135, and 136 with associated L1 caches 137, 138, 139, and 140, respectively. A cell broadband engine architecture-compliant processor 100 system must include a vector multimedia extension unit (not shown) in the Power PC® processor element groups 101 and 102. Power PC® processor element groups 101 and 102 also contain replacement management tables (RMTs) 141, 142, 143, and 144 and bus interface units (BIUs) 145 and 146, respectively. Bus interface units 145 and 146 connect Power PC® processor element groups 101 and 102 to the element interconnect bus 132. Bus interface units 147 and 148 connect replacement management tables 120, 121, 122, and 123 to element interconnect bus 132.
Power PC® processor element groups 101 and 102 are general-purpose processing units, which can access system management resources, such as the memory-protection tables, for example. Hardware resources defined in the cell broadband engine architecture are mapped explicitly to the real address space as seen by Power PC® processor element groups 101 and 102. Therefore, either of Power PC® processor element groups 101 and 102 may address any of these resources directly by using an appropriate effective address value. A primary function of Power PC® processor element groups 101 and 102 is the management and allocation of tasks for the synergistic processor element groups 124 and 125 in a system.
Cell broadband engine architecture-compliant processor 100 includes one or more synergistic processor units 103, 104, 105, or 106. Synergistic processor units 103, 104, 105, and 106 are less complex computational units than Power PC® processor element groups 101 and 102, in that they do not perform any system management functions. Synergistic processor units 103, 104, 105, and 106 have a single instruction multiple data (SIMD) capability and typically process data and initiate any required data transfers, subject to access properties set up by Power PC® processor element groups 101 and 102, in order to perform their allocated tasks.
The purpose of synergistic processor units 103, 104, 105, and 106 is to enable applications that require a higher computational unit density and can effectively use the provided instruction set. A significant number of synergistic processor units 103, 104, 105, and 106 in a system, managed by Power PC® processor element group 101 or 102, allows for cost-effective processing over a wide range of applications.
Memory flow controllers 107, 108, 109, and 110 are essentially the data transfer engines. Memory flow controllers 107, 108, 109, and 110 provide the primary method for data transfer, protection, and synchronization between main storage and the local storage. Commands issued to memory flow controllers 107, 108, 109, and 110 describe the transfer to be performed. A principal architectural objective of memory flow controllers 107, 108, 109, and 110 is to perform these data transfer operations in as fast and as fair a manner as possible, thereby maximizing the overall throughput of cell broadband engine architecture-compliant processor 100.
Commands that transfer data are referred to as memory flow controller direct memory access commands. These commands are converted into direct memory access transfers between the local storage domain and main storage domain. Each of memory flow controllers 107, 108, 109, and 110 may typically support multiple direct memory access transfers at the same time and may maintain and process multiple memory flow controller commands.
In order to accomplish this, memory flow controllers 107, 108, 109, and 110 maintain and process queues of memory flow controller commands. Each of memory flow controllers 107, 108, 109, and 110 provides one queue for its associated synergistic processor unit 103, 104, 105, or 106, the memory flow controller synergistic processor unit command queue, and one queue for other processors and devices, the memory flow controller proxy command queue. Logically, a set of memory flow controller queues is always associated with each synergistic processor unit 103, 104, 105, or 106 in cell broadband engine architecture-compliant processor 100, but some implementations of the architecture may share a single physical memory flow controller between multiple synergistic processor units. In such cases, all the memory flow controller facilities must appear to software as independent for each synergistic processor unit 103, 104, 105, or 106.
Each memory flow controller direct memory access data transfer command request involves both a local storage address (LSA) and an effective address (EA). The local storage address can directly address only the local storage area of its associated synergistic processor unit 103, 104, 105, or 106.
The effective address has a more general application, in that it can reference main storage, including all the synergistic processor unit local storage areas, if they are aliased into the real address space.
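As an illustrative, non-limiting example of how a software implemented cache might issue such a memory flow controller direct memory access command on a synergistic processor unit, the following minimal sketch assumes the MFC intrinsics declared in spu_mfcio.h of the Cell SDK; the names fetch_block and ls_buffer are hypothetical and do not correspond to any element of the figures.

    #include <stdint.h>
    #include <spu_mfcio.h>

    /* Hypothetical local storage buffer; DMA transfers should be at least
     * 16-byte aligned, and 128-byte alignment is preferred. */
    static char ls_buffer[128] __attribute__((aligned(128)));

    /* Transfer one block from main storage (effective address) into local
     * storage, then wait until the transfer completes. */
    void fetch_block(uint64_t effective_addr)
    {
        const unsigned int tag = 0;        /* DMA tag group identifier (0-31)  */

        mfc_get(ls_buffer, effective_addr, sizeof(ls_buffer), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);      /* select the tag group to wait on  */
        mfc_read_tag_status_all();         /* block until the transfer is done */
    }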
Memory flow controllers 107, 108, 109, and 110 present two types of interfaces: one to the synergistic processor units 103, 104, 105, and 106 and another to all other processors and devices in a processing group.
Memory flow controllers 107, 108, 109, and 110 also support bandwidth reservation and data synchronization features.
Internal interrupt controller 111 manages the priority of the interrupts presented to Power PC® processor element groups 101 and 102. The main purpose of internal interrupt controller 111 is to allow interrupts from the other components in the processor to be handled without using the main system interrupt controller. Internal interrupt controller 111 is a second-level controller. Internal interrupt controller 111 is intended to handle all interrupts internal to cell broadband engine architecture-compliant processor 100 or within a multiprocessor system of cell broadband engine architecture-compliant processors 100. The system interrupt controller will typically handle all interrupts external to cell broadband engine architecture-compliant processor 100.
In a cell broadband engine architecture-compliant system, software must first check internal interrupt controller 111 to determine if the interrupt was sourced from an external system interrupt controller. Internal interrupt controller 111 is not intended to replace the main system interrupt controller for handling interrupts from all I/O devices.
The described illustrative embodiments provide a compiler optimized software implemented configurable cache that minimizes cache overhead for data that is accessed together. The optimizing compiler may dynamically re-configure the cache specific to different phases of a single execution and tailored to the requirements of that phase. The components of a software implemented cache that may be re-configured include total cache size, cache line size, number of lines, associativity, replacement policy, and a method used to determine where in the cache a particular data item should be placed when it is brought in. The optimizing compiler may classify application data based on its access properties, such as how often and how far apart the data is re-used, whether the data is being read, written, or both read and written, and whether the data is being shared across multiple threads of computation. This classification of data will result in the formation of one or more data classes. The optimizing compiler may use multiple co-existent cache configurations, one for each data class.
In exemplary compiling operation 200, optimizing compiler 202 performs analysis of source code 204 for all references to data that are contained in the code. This analysis includes alias analysis, data dependence analysis, and analysis of the properties of cacheable data, which is illustrated in
Optimizing compiler 202 then performs a data reference analysis, which is an analysis to determine certain properties associated with the cacheable data. These properties include, but are not limited to, whether the data is read, written, or both read and written, how often and how far apart a data item is referenced again, whether multiple threads of execution may share the data, the size of a data item, the alignment of the address at which a data item is located, the affinity with which a group of data items are referenced together, and the number of data references active at the same time within a code region.
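By way of a hypothetical illustration, such properties could be summarized in a per-datum record that the compiler carries through its later phases; the structure and field names below are purely illustrative and are not mandated by the illustrative embodiments.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical summary of the data reference analysis for one data item. */
    typedef struct {
        bool     is_read;          /* the item is read in the analyzed region        */
        bool     is_written;       /* the item is written in the analyzed region     */
        bool     is_shared;        /* multiple threads of execution may share it     */
        size_t   item_size;        /* size of the data item in bytes                 */
        size_t   item_alignment;   /* alignment of the address of the data item      */
        unsigned reuse_distance;   /* how far apart repeated references occur        */
        unsigned affinity_group;   /* items referenced together share one group id   */
        unsigned live_refs;        /* references active at the same time in a region */
    } data_ref_info;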
Once a determination is made as to what data is cacheable or non-cacheable, optimizing compiler 202 is able to generate compiled code 210. In compiled code 210, optimizing compiler 202 inserts modified cache lookup code 212 for cacheable data x 206, based on the software cache configuration to be used. Modified cache lookup code 212 is code that uses memory address bits as keys and provides a mapping from a memory address to the cache line that has the data corresponding to that address, when this data is contained in the cache. Typically, the cache is divided into a number of lines, and depending on the total cache size and the line size, a certain number of bits chosen from the address bits defines a number that is used as an index into the cache. This index serves as a key that uniquely identifies the cache line(s) that correspond to that address. Additionally, optimizing compiler 202 links cache manager code 214 to compiled code 210 in order to configure the cache and implement modified cache lookup code 212. Cache manager code 214 is separate code that interfaces with modified cache lookup code 212. That is, modified cache lookup code 212 provides an entry point into cache manager code 214 when an application executes. Cache manager code 214 is responsible for, among other things, deciding what policy to use when deciding where to place newly fetched data, what data to replace when no space is available, and whether to perform data pre-fetching.
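For concreteness, the following minimal sketch shows what such lookup code might look like for a direct-mapped configuration; the names cache_lookup, cache_miss_handler, LINE_SIZE_LOG2, and NUM_LINES_LOG2 are hypothetical stand-ins for whatever code the optimizing compiler and cache manager actually emit, and the parameter values are illustrative only.

    #include <stdint.h>

    #define LINE_SIZE_LOG2  7                      /* 128-byte lines (illustrative) */
    #define NUM_LINES_LOG2  6                      /* 64 lines, direct mapped       */
    #define NUM_LINES       (1u << NUM_LINES_LOG2)

    typedef struct {
        uint64_t tag;                              /* which main storage block is held */
        char     data[1u << LINE_SIZE_LOG2];       /* cached copy of that block        */
    } cache_line_t;

    /* Tags are assumed initialized to a value that cannot match a real address. */
    static cache_line_t cache_lines[NUM_LINES];

    /* Stand-in for the cache manager's miss handling: a real implementation would
     * choose a victim under the replacement policy, transfer the line from main
     * storage by direct memory access, and record the new tag. */
    static cache_line_t *cache_miss_handler(uint64_t addr, unsigned index)
    {
        cache_lines[index].tag = addr >> (LINE_SIZE_LOG2 + NUM_LINES_LOG2);
        return &cache_lines[index];
    }

    /* Modified cache lookup code: map a main storage address to the cached data,
     * entering the cache manager when the corresponding line is not present. */
    static inline char *cache_lookup(uint64_t addr)
    {
        unsigned index = (addr >> LINE_SIZE_LOG2) & (NUM_LINES - 1);   /* key bits  */
        uint64_t tag   = addr >> (LINE_SIZE_LOG2 + NUM_LINES_LOG2);    /* remainder */
        cache_line_t *line = &cache_lines[index];

        if (line->tag != tag)
            line = cache_miss_handler(addr, index);

        return &line->data[addr & ((1u << LINE_SIZE_LOG2) - 1)];
    }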
Each of local memory areas 304a, 304b, 304c, and 304d comprises content, such as application code, cache data, and other storage data, and may include unused space. Each of local memory areas 304a, 304b, 304c, and 304d is also connected to system memory 306 using direct memory access (DMA) controller 308. System memory 306 may be a memory such as memory 150 of
In this illustrative embodiment, local memory areas 304a, 304b, 304c, and 304d contain a software implemented cache, and usage of memory space may be optimized by an optimizing compiler, such as optimizing compiler 202 of
Software implemented cache configuration may be determined by the choice of values for the total cache size, cache line size, number of lines, associativity, replacement policy, and a method used to determine where in the cache a particular data item should be placed when it is brought in. Cache line size determines the basic unit of data that is transferred between the software cache and system memory each time that the cache has to bring in data or write back data. This affects factors such as bandwidth usage, spatial locality, and false sharing. Bandwidth refers to the capacity of the bus used to transfer data to or from the cache, and larger cache lines may unnecessarily use up more bandwidth. Spatial locality is when the application code accesses data at consecutive memory addresses in succession. Since larger cache lines result in the transfer of more consecutive data at a time, they are likely to benefit applications with spatial locality.
False sharing is when data at the same address is included in a cache line in the cache of more than one processing unit, but is actually accessed by code executing on only one of the processing units. False sharing occurs due to the fact that a single large cache line contains data located at multiple consecutive memory addresses. False sharing may lead to overhead associated with keeping the multiple cached copies of the same data coherent.
Cache line size and number of lines together determine the total size of the software implemented cache. Since the optimizing compiler aims to make optimal use of limited local memory available in the system, it will judiciously choose values for these parameters so that the software implemented cache size is balanced with the memory space requirements of other code and data used in an execution of the application.
A higher associativity factor provides more flexibility with regard to choosing a cache line to hold data corresponding to a particular memory address. Higher associativity may promote better use of the memory space occupied by the cache by reducing the number of cache conflicts, that is, by reducing the number of times that two different memory addresses map to the same cache line such that only one or the other can be contained in the cache at a given time. However, higher associativity also entails more overhead in the cache lookup code, so the optimizing compiler will choose an associativity factor that accounts for both the cache lookup overhead as well as data reference patterns in the code that may contribute to cache conflicts that have a detrimental effect on application performance.
The parameters of line size, number of lines, and associativity will also influence the cache lookup code modified and/or inserted by the optimizing compiler in the application program since the number of address bits used in the key will be variable for different points in the lookup. Cache lookup code is code that uses memory address bits as keys and provides a mapping from a memory address to the cache line that has the data corresponding to that address, when this data is contained in the cache. Cache lookup code depends on the cache configuration that will be used by the application in configuring the software cache that will be used during the operation of the application.
Positions of the address bits that are used to form the key(s) for cache lookup code may be arbitrarily chosen by the optimizing compiler in order to optimize the compiled code in conjunction with other software analyses, such as data placement. The software implemented cache configuration may also include a replacement policy, a write-back policy, and/or a pre-fetch policy. A replacement policy determines which data item to kick out of the cache in order to free up a cache line in case of conflicts. A write-back policy determines when to reflect changes to cached data back to system memory, and whether to use read-modify-write when writing back a cache line. A pre-fetch policy makes use of a compiler-defined rule to anticipate future data references and transfer corresponding data to cache ahead-of-time.
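Taken together, these parameters and policies might be captured in a configuration record such as the hypothetical sketch below; the type names, enumerators, and fields are illustrative only and do not define any required interface.

    #include <stddef.h>

    /* Hypothetical policy choices referenced by a cache configuration. */
    enum replacement_policy { REPLACE_LRU, REPLACE_FIFO, REPLACE_RANDOM };
    enum writeback_policy   { WRITE_BACK_ON_EVICT, WRITE_THROUGH };
    enum prefetch_policy    { PREFETCH_NONE, PREFETCH_NEXT_LINE, PREFETCH_COMPILER_RULE };

    /* Hypothetical description of one software implemented cache configuration.
     * Several such configurations may co-exist, one for each data class. */
    typedef struct {
        size_t total_size;                 /* total cache size in bytes              */
        size_t line_size;                  /* bytes transferred per cache line       */
        size_t num_lines;                  /* total_size / line_size                 */
        unsigned associativity;            /* lines per set (1 = direct mapped)      */
        unsigned index_shift;              /* compiler-chosen position of the key    */
        unsigned index_bits;               /*   bits within the memory address       */
        enum replacement_policy replace;   /* victim selection on a conflict         */
        enum writeback_policy   writeback; /* when dirty data returns to memory      */
        enum prefetch_policy    prefetch;  /* compiler-defined pre-fetch rule        */
    } cache_config;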
The cache configuration determined by the optimizing compiler is not required to be used for the entire execution of an application. Rather, the cache configuration may be re-configured dynamically depending on application characteristics and requirements. Additionally, the data may be divided into different data classes based on reference characteristics. Multiple cache configurations may be defined and associated with different data classes.
In this illustrative embodiment, there is flexibility to choose which particular x bits in the system memory address determine the index into tag array 514. Tag array 514 consists of 2^x entries 506 of y bits 504 each. The compiler may use this, in conjunction with data placement, to minimize cache conflicts for data that is accessed together, as shown in data array 516. Depending on data reference patterns and re-use characteristics, the replacement policy may be varied from the default least recently used (LRU) policy. Software implemented cache 518 may be dynamically re-configured specific to different phases of a single execution, tailored to the requirements of that phase. Application data may be classified based on its re-use/sharing/read-write properties, and multiple co-existent cache configurations may be used, one for each data class.
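A hypothetical fragment of lookup code with compiler-selectable index bit positions might resemble the following, where X_BITS corresponds to the x index bits, Y_BITS to the y tag bits, and INDEX_SHIFT to the compiler-chosen position of the x bits within the system memory address; all three values are illustrative.

    #include <stdint.h>

    #define X_BITS       6                     /* x index bits: tag array has 2^x entries */
    #define Y_BITS       20                    /* y tag bits stored per entry             */
    #define INDEX_SHIFT  7                     /* compiler-chosen position of the x bits  */

    static uint32_t tag_array[1u << X_BITS];   /* 2^x entries of y bits each              */

    /* Extract the compiler-chosen x bits of the address as the index and compare
     * the stored y-bit tag; returns nonzero on a hit. */
    static int tag_lookup(uint64_t addr, unsigned *index_out)
    {
        unsigned index = (unsigned)(addr >> INDEX_SHIFT) & ((1u << X_BITS) - 1);
        uint32_t tag   = (uint32_t)(addr >> (INDEX_SHIFT + X_BITS)) & ((1u << Y_BITS) - 1);

        *index_out = index;
        return tag_array[index] == tag;
    }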
References to objects A, B, C, D, and E in software code 618 are marked as cacheable as those objects are in common shared memory 610 and have to be fetched into software implemented cache 612 of processor 604 before use. When compiling software code 618, a compiler inserts modified lookup code to access this data via software implemented cache 612. References to objects “param”, “i”, and “local” in software code 618 are not marked as cacheable, as these objects are allocated on the local stack frame of software code 618 when it is executed on processor 604, and they do not exist in common shared memory 610.
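Although the actual contents of software code 618 are not reproduced here, a hypothetical fragment of the same general shape may help illustrate the distinction: references to A, B, C, D, and E would be rewritten by the compiler to go through software implemented cache 612, while param, i, and local are accessed directly on the local stack frame.

    /* Hypothetical fragment only; it is not the actual software code 618. */
    extern double A[], B[], C[], D[], E[];    /* reside in common shared memory 610 */

    double compute(int param)                 /* param, i, and local are stack-allocated */
    {
        double local = 0.0;
        int i;
        for (i = 0; i < param; i++)
            local += A[i] * B[i] + C[i] - D[i] * E[i];
        return local;
    }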
The compiler analyzes references to objects A, B, C, D, and E in software code 618 to determine how best to cache them in software implemented cache 612. As an illustrative example, although not an exhaustive listing of the kind of information gathered by the data reference analysis, the analysis may determine:
The data reference analysis results are used to determine the best cache configuration to use, including the total cache size, cache line size, number of lines, associativity, and default policies for placement and replacement of data in software implemented cache 612.
The lookup code for the cacheable references is modified in the compiler to carry the results of the data reference analysis. The modified lookup code is inserted into the compiled software code. The information in the modified lookup code is subsequently used at runtime to efficiently manage and/or configure software implemented cache 612. It is also possible to dynamically update the lookup code based on characteristics observed during execution.
Software code 618 represents exemplary code that is input to a compiler. Lookup code inserted by the compiler typically performs a function similar to a lookup in a hardware cache. In terms of
This basic functionality remains the same for the lookup code in the re-configurable cache. However, the number and position of the “x” and “y” bits may change. Additionally, when a function is invoked to handle a cache miss, the function may be passed information relating to the properties of the data reference that were earlier analyzed in the compiler.
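One hypothetical way for the lookup code to pass the earlier-analyzed reference properties to such a miss handling function is sketched below; miss_hints and handle_miss are illustrative names and do not correspond to any defined interface.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical hints derived from the compiler's data reference analysis. */
    typedef struct {
        bool     read_only;        /* the reference never writes the data             */
        bool     shared;           /* the data may be shared across threads           */
        unsigned data_class;       /* data class, selecting the cache configuration   */
        unsigned prefetch_lines;   /* lines to fetch ahead under the pre-fetch policy */
    } miss_hints;

    /* Cache manager entry point invoked by the lookup code on a miss; a real
     * implementation would select a victim under the replacement policy for
     * hints->data_class, transfer the line, and return its local address. */
    void *handle_miss(uint64_t addr, const miss_hints *hints);

    /* Example of compiler-inserted hints for a reference that the analysis
     * classified as read-only and shared; on a miss for effective address ea,
     * the lookup code might call handle_miss(ea, &hints_for_A). */
    static const miss_hints hints_for_A = {
        .read_only = true, .shared = true, .data_class = 0, .prefetch_lines = 1
    };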
Thus, the illustrative embodiments provide a compiler optimized software implemented cache that minimizes cache overhead for data accesses. The optimizing compiler may dynamically re-configure the cache specific to different phases of a single execution and tailored to the requirements of that phase. The optimizing compiler may classify application data based on access properties, such as how often and how far apart the data is re-used, whether the data is being read, written, or both read and written, and whether the data is being shared across multiple threads of computation. This classification of data will result in the formation of one or more data classes. The optimizing compiler may use multiple co-existent cache configurations, one for each data class.
The illustrative embodiments can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In an illustrative embodiment, implementation is in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the illustrative embodiments can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
The description of the different embodiments has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the illustrative embodiments in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the illustrative embodiments and the practical application, and to enable others of ordinary skill in the art to understand the illustrative embodiments for various embodiments with various modifications as are suited to the particular use contemplated.