Embodiments relate to control of a cache memory hierarchy of a processing device.
In typical processor-based systems, a processor couples to one or more memory devices with which it communicates information. As processor speeds continue to increase, memory, and the processor's interaction with memory, become a bottleneck to performance enhancements. This is the case, as memory bandwidth and latency continue to limit the performance of both single core and multi-core workloads. A large last level cache (LLC) within the processor can help reduce the fraction of memory requests served by the memory and improve performance. Typically, LLCs seek to increase hit rate within the LLC in order to reduce traffic to the memory. However, such operation does not take memory efficiency into account. As an example, data evicted from the LLC (victim data) and written to the memory may consume large amounts of memory bandwidth, reducing the bandwidth available for incoming memory traffic and resulting in lower delivered bandwidth from the memory. This situation thus adversely affects performance.
In various embodiments, techniques are provided to control at least a portion of a cache memory hierarchy of a processor to dynamically allocate, within the cache memory hierarchy, an explicit virtual write buffer. In this way, during periods of high communication bandwidths between the processor and a memory, write interference on a memory interconnect that couples the processor to the memory may be reduced. Various features of this virtual write buffer, including its creation, allocation, maintenance and so forth may be dynamically controlled based on operating conditions, including bandwidth(s) on the memory interconnect, virtual write buffer occupancy, cache hit rates and so forth. Note that in embodiments, dynamic sizing of the virtual write buffer (or its presence at all) may be based at least in part on cache hit rate, in addition to bandwidth. In this way, the cache memory may maintain a sufficient hit rate, while improving memory efficiency and thus gaining overall performance benefits.
It is noted that by converting some portion of a cache memory such as a portion of a last level cache (LLC) to be a virtual write buffer, hit rates may be compromised, potentially impacting performance. As such, embodiments provide machine learning techniques to identify and minimize this impact. To this end, a learning mechanism may be provided that periodically profiles hit rate of a workload (e.g., one or more applications, threads, processes or so forth) using cache hardware. Based at least in part on this hit rate information, a hit rate loss may be minimized, effectively trading off a small drop in LLC hit rate in order to improve memory efficiency and gain overall performance. In an example embodiment, the techniques described herein may enable a performance gain on memory-sensitive multicore workloads.
With the embodiments described herein in which a virtual write buffer is provided within an LLC, other write buffering resources may be minimized. For example, an integrated memory controller typically includes a small amount of storage for a write buffer. By providing a cache-based virtual write buffer, the size of this resource may be kept relatively small, e.g., on the order of approximately 2-4 kilobytes per memory controller channel. By maintaining the write buffers at a small size, performance may be enhanced, as increasing this write buffer is not a scalable solution given area, power and timing concerns.
Referring now to
As seen, processor 100 includes an execution circuit 102, an L1 instruction cache 104, an L1 data cache 106, an L2 cache 108, an LLC 115, and a memory controller 112. Execution circuit 102 may be a portion of a processor configured to execute instructions. In some implementations, a processor may have multiple cores, with each core having a processing unit and one or more caches.
In operation, execution circuit 102 may perform an instruction fetch after executing a current instruction. The instruction fetch may request a next instruction from L1 instruction cache 104 for execution by execution circuit 102. If the instruction is present in L1 instruction cache 104, an L1 hit may occur and the next instruction may be provided to execution circuit 102 from L1 instruction cache 104. If not, an L1 miss occurs, and L1 instruction cache 104 may request the next instruction from L2 cache 108, which includes a cache controller 109.
If the next instruction is in L2 cache 108, an L2 hit occurs and the next instruction is provided to L1 instruction cache 104. If not, an L2 miss occurs, and L2 cache 108 may request the next instruction from LLC 115.
If the next instruction is in LLC 115, an LLC hit occurs and the next instruction is provided to L2 cache 108 and/or to L1 instruction cache 104. If not, an LLC miss may occur and LLC 115 may request the next instruction from memory controller 112. Memory controller 112 may read a block 114 that includes the next instruction and fill block 114 into L2 cache 108, in a non-exclusive cache hierarchy implementation. Other fill techniques of course are possible. And understand that while an instruction-based cache fill example is given, the same operations occur for a data-based fill (with the exception that the data is finally filled back to L1 data cache 106).
In some implementations, a core 118 may include execution circuit 102 and one or more of caches 104, 106, or 108. For example, in
In addition to the above discussion of fill operations, eviction operations also may be performed within the cache memory hierarchy. As shown, due to capacity issues within a lower level cache (e.g., one of L1 instruction cache 104 or L1 data cache 106), a data block 107 may be evicted and stored temporarily in storage within L2 cache 108. Still further, due to capacity issues, evicted block 107 in turn may be evicted from L2 cache 108 and be provided to LLC 115 as evicted block 117.
In embodiments herein, when evicted data block 107 includes dirty data to be written back to memory 150, such dirty data may be maintained in a virtual write buffer 118 of LLC 115. As used herein, note that the term “virtual write buffer” is used to refer to a dedicated allocation of one or more cache lines (per set) within a cache memory in which dirty data are to be stored and maintained instead of cache lines storing clean data. Stated another way, a virtual write buffer is a dedicated cache memory storage for writeback data, so that such writeback data can be maintained for longer periods of time within the cache memory before an actual writeback to memory occurs, thus reducing memory traffic. At the same time, an upper bound on the size of the virtual write buffer may be maintained to ensure that hit rates within the cache memory do not impact performance to an undesired extent. As will be described in more detail herein, the presence of virtual write buffer 118 may be dynamically controlled based at least in part on a bandwidth on an interconnect 140 that couples processor 110 with memory 150.
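As a concrete illustration of this definition, the following Python sketch models one cache set and checks how many of its ways hold dirty data against a given virtual write buffer depth. The `Line` structure and `vwb_occupancy` name are illustrative only and do not correspond to any structure in the embodiments.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    dirty: bool  # True when the line holds writeback (dirty) data

def vwb_occupancy(cache_set, vwb_depth):
    """Return (dirty_count, overflown) for one set: the set is 'overflown'
    when every way reserved for the virtual write buffer holds dirty data."""
    dirty = sum(1 for line in cache_set if line.dirty)
    return dirty, dirty >= vwb_depth

# Example: a 4-way set with 3 dirty lines and a buffer depth of 3 is overflown.
s = [Line(0x1, True), Line(0x2, False), Line(0x3, True), Line(0x4, True)]
print(vwb_occupancy(s, 3))  # -> (3, True)
```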
In embodiments herein, a cache controller 120 may be configured to dynamically allocate virtual write buffer 118, dynamically control its size based at least in part on hit statistics, and adaptively drain entries from virtual write buffer 118 to memory 150 according to varying conditions, including bandwidth on interconnect 140 and/or capacity issues within virtual write buffer 118. In embodiments, virtual write buffer 118 may be implemented as an explicit write buffer. This explicit write buffer may be formed of at least a predetermined number of lines or ways within each set of LLC 115.
Note that memory controller 112 includes its own write buffer 113. However, embodiments may leverage virtual write buffer 118, implemented within already existing storage within LLC 115, such that the expense of an increased size of this additional memory structure within memory controller 112 can be mitigated. As such, virtual write buffer 118 is a separate structure from write buffer 113 of memory controller 112.
Referring now to
As illustrated in
Still referring to
For purposes of discussion herein, cache controller 240 may include a replacement circuit 245 which, when MLC 210 is at a capacity (or at least a set is at such capacity) may be used to identify an eviction candidate to be evicted from MLC 210 as an evicted cache line 235, in turn to be provided to LLC 250. In different implementations, replacement circuit 245 may perform replacement operations based on various replacement policies. For purposes of discussion herein, assume that evicted cache line 235 includes dirty data that is to be stored within LLC 250. Understand while shown with a single particular sub-circuit in the embodiment of
As illustrated in
Note that the virtual write buffer and its control may be implemented with very little area, as the buffer itself is formed of already existing cache lines within LLC 250. In this way, a portion of LLC 250 may be repurposed as an effective and broadly applicable virtual write buffer, controlled based at least in part on dynamic learning mechanisms.
This virtual write buffer may be configured to absorb write data. The buffered writes in turn drain out of the virtual write buffer in periods of low memory bandwidth. This virtual write buffer is dynamically adjustable and balances the conflicting goals of improving memory efficiency by absorbing writes and maintaining a good hit rate in LLC 250. The virtual write buffer dynamically grows inside LLC 250. To avoid interfering with reads when the virtual write buffer is present, an LLC fill operation seeks to find a clean victim to evict. As a result, the extent to which the virtual write buffer is allowed to grow in LLC 250 influences the availability of clean victims and the quality of victims. For example, if the virtual write buffer is allowed to grow too large, the possibility of replacing clean live blocks increases.
Of interest here, cache controller 270 includes a virtual write buffer control circuit 273. In the illustration shown, virtual write buffer control circuit 273 includes constituent sub-circuits, including an allocation circuit 277 and a drain circuit 279. Still further, control circuit 273 may maintain a set of counters 275. As relevant here, such counters may include a set of read hit counters 2760-276n and a set of write hit counters 2780-278n. In an embodiment, for a 16-way cache arrangement, there may be 16 read hit counters and 16 write hit counters, each of which may be implemented with 16 bits. Such counters may count, respectively, read and write hits within particular positions of a stack, organized by recency of access.
As described herein, cache controller 270 may maintain read and write hit statistics with regard to particular LRU positions of an LRU stack using counters 276, 278. Allocation circuit 277 may trigger allocation of a virtual write buffer within LLC 250, e.g., based on bandwidth information of a memory interconnect, which may be received from a memory controller (not shown for ease of illustration in
Hit histogram information based on the hit count information maintained by counters 276, 278 may be used to prevent hit rate loss in LLC 250. More specifically this information may be used to dynamically adjust the size of the virtual write buffer in order to limit the LLC hit rate loss. In an embodiment, the virtual write buffer is sized periodically based on two metrics. First, a bound is imposed on the percentage of sacrificed LLC hits due to implementation of the clean LRU replacement policy. Second, a bound is imposed on the probability of dirty inclusion victims.
A read hit histogram (RHH) is maintained using information from read hit counters 276. More specifically, this RHH records the number of LLC read hits in each LRU stack position. In one particular embodiment, the number of ways beginning from the tail of the LRU stack (namely the LRU position) that cover 1/16th of all LLC read hits may define a maximum stretch of the virtual write buffer. Let this be called MaxReadStretch. This value guarantees that if the write buffer becomes full, a clean LRU replacement policy will not sacrifice more than 1/16th of LLC hits.
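The MaxReadStretch computation described above can be sketched as follows. The function name and the histogram values are illustrative, not taken from any embodiment; the histogram is ordered MRU first, so the LRU tail is the last index.

```python
def max_stretch(hit_histogram, fraction):
    """Number of ways, counted from the LRU tail (last index), whose
    cumulative hits first cover `fraction` of all hits in the histogram;
    returns the full depth if the fraction is never covered."""
    total = sum(hit_histogram)
    if total == 0:
        return 0
    covered = 0
    for ways, hits in enumerate(reversed(hit_histogram), start=1):
        covered += hits
        if covered >= fraction * total:
            return ways
    return len(hit_histogram)

# 16 LRU positions, MRU first; read hits concentrate near the MRU end.
rhh = [500, 300, 200, 120, 80, 60, 40, 30, 25, 20, 15, 10, 8, 6, 4, 2]
print(max_stretch(rhh, 1 / 16))  # MaxReadStretch -> 8
```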
A write hit histogram (WHH) is also maintained using information from write hit counters 278. More specifically, this WHH records the number of LLC write hits in each LRU stack position. In one particular embodiment, the number of ways starting from the tail of the LRU stack that cover half of all LLC write hits may define a maximum stretch of the virtual write buffer. Let this be called MaxWriteStretch. Evicting a block beyond the MaxWriteStretch has a probability of generating a dirty inclusion victim equal to (#dirty blocks/#all blocks)*inclusion victim fraction*fraction of write hits covered by MaxWriteStretch. Assuming 1/3 dirty blocks, this leads to (1/3)*(1/4)*(1/2), or about a 4% chance of generating a dirty inclusion victim. Note that the inclusion victim fraction may be set as the ratio of MLC to LLC capacity.
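As a quick check of the arithmetic above, using the stated fractions (the function name is illustrative):

```python
def dirty_inclusion_victim_prob(dirty_fraction, inclusion_victim_fraction,
                                write_hits_covered):
    """Probability of generating a dirty inclusion victim when evicting
    beyond MaxWriteStretch: the product of the three factors in the text."""
    return dirty_fraction * inclusion_victim_fraction * write_hits_covered

# 1/3 dirty blocks, 1/4 inclusion victim fraction, 1/2 of write hits covered:
p = dirty_inclusion_victim_prob(1 / 3, 1 / 4, 1 / 2)
print(round(p, 3))  # -> 0.042, i.e., about a 4% chance
```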
In one particular embodiment, these two values may be used in determining an appropriate size of the virtual write buffer. In such embodiments, the write buffer size may be dynamically controlled to be the maximum of a predetermined (or initialization) value and the minimum of these stretch values, as follows: max(3, min(MaxWriteStretch, MaxReadStretch)). In this example, in other words, the minimum virtual write buffer capacity is set to 3 LLC ways. Of course, in other embodiments other values may exist, or other techniques may be used to determine the virtual write buffer size.
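The sizing rule can be stated directly in code; the function name and example inputs are illustrative only.

```python
def vwb_size(max_read_stretch, max_write_stretch, floor=3):
    """Virtual write buffer depth in ways: at least `floor` (3 LLC ways in
    the example embodiment), otherwise the smaller of the two profiled
    stretch values."""
    return max(floor, min(max_write_stretch, max_read_stretch))

print(vwb_size(6, 4))  # -> 4: the smaller stretch value wins
print(vwb_size(2, 5))  # -> 3: the floor dominates
```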
Different sets in the virtual write buffer fill up with dirty blocks at different rates. Some sets fill up quickly and put pressure on the clean LRU replacement policy for those sets. In embodiments there may be multiple criteria to be considered in determining when it is appropriate to drain entries from the virtual write buffer. In one embodiment, a first drain or scrub trigger may be based on the number of overflown sets in the virtual write buffer. As used herein, the term “overflown set” means a set in which all of the ways that constitute the virtual write buffer include dirty data. Note that different applications can tolerate different numbers of overflown sets. For example, an application with a reasonably high hit rate can sacrifice a bigger number of hits to delay a scrub operation. In one implementation, a lookup table (LUT) may be used to map hits per fill to the number of tolerable overflown sets. More specifically, the LUT may be used to identify how many overflown sets an application can tolerate. If the hits per fill is high for that application (e.g., hit rate is high), more overflown sets can be tolerated. The rationale is that since the hit rate is high, the memory bandwidth demand is lower, and hence a small decrease in hit rate will not hurt memory bandwidth demand. Note that RHH information may still cap the overall hit rate loss. In an embodiment, each set within LLC 250 may include an overflow indicator, e.g., a single bit, which when set is to indicate that the virtual write buffer for the set is full (namely, each way of the virtual write buffer, to the specified depth, stores dirty data). This information may be used to identify when a scrub may be triggered due to a capacity issue.
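One possible shape of such a LUT-based trigger is sketched below. The threshold values in the table are invented for illustration; no specific LUT contents are given in the embodiments.

```python
# Assumed LUT (illustrative values): higher hits-per-fill -> more overflown
# sets tolerated before a scrub is triggered.
TOLERABLE_OVERFLOWN_LUT = [   # (min_hits_per_fill, tolerable_overflown_sets)
    (0.75, 256),
    (0.50, 128),
    (0.25, 64),
    (0.00, 32),
]

def scrub_on_overflow(overflown_sets, hits_per_fill):
    """Trigger a scrub when the count of overflown sets exceeds what the
    current hits-per-fill can tolerate per the LUT."""
    for min_hpf, tolerable in TOLERABLE_OVERFLOWN_LUT:
        if hits_per_fill >= min_hpf:
            return overflown_sets > tolerable
    return True

print(scrub_on_overflow(100, 0.8))  # -> False: high hit rate tolerates 256
print(scrub_on_overflow(100, 0.1))  # -> True: low hit rate tolerates only 32
```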
A second criterion for triggering a scrub may be based on the number of reads pending on a memory channel. In a particular system configuration, an LLC may be configured with multiple banks, where each LLC bank is associated with a specific memory channel. If the number of reads pending on that channel is higher than a threshold, scrubbing is deferred until the number of pending reads falls below the threshold. In one embodiment, this threshold may be set to: (number of miss status holding registers (MSHRs), which is a measure of pending reads in a bank) × (number of LLC banks feeding to a DRAM channel). In other words, a write scrubbing is triggered in an LLC bank only if the memory channel is not saturated with the maximum number of reads. Of course, in other embodiments an LLC-wide analysis may be performed.
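This second criterion reduces to a simple comparison; the parameter names and example values below are illustrative.

```python
def scrub_allowed(pending_reads, mshrs_per_bank, banks_per_channel):
    """Defer scrubbing while the channel is saturated with reads: allow it
    only when pending reads fall below MSHRs-per-bank x banks-per-channel."""
    return pending_reads < mshrs_per_bank * banks_per_channel

# With 16 MSHRs per bank and 2 LLC banks per channel, the threshold is 32.
print(scrub_allowed(30, 16, 2))  # -> True: channel not saturated
print(scrub_allowed(40, 16, 2))  # -> False: scrub deferred
```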
In typical embodiments, each of these criteria may be considered independently, such that scrubbing is triggered when either criterion is met. In another embodiment, both criteria are to be satisfied for an LLC bank to enter the scrub mode.
In the scrub mode, cache controller 270 or other control circuit may analyze each set (e.g., one or more times or rounds). In each visit or analysis of a set, at most one dirty block closest to the LRU position may be scrubbed, in one embodiment. Note that this scrub operation may be implemented as a write of the dirty data to memory and a corresponding update to cache coherency metadata of the line to indicate that the line now stores clean data. Stated another way, this scrub of a line is a write to memory and an update to the cache coherency state of the line, without victimizing the line. In one embodiment, within a set, the search for a dirty block is restricted to the lowest N LRU stack positions to minimize over-scrubs. This variable N may be determined periodically by visiting the WHH and computing the number of ways, starting from the LRU tail, that cover at most 1/16th of all LLC write hits.
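The per-set victim search may be sketched as follows, with the set modeled simply as dirty bits in LRU-stack order (MRU first, LRU tail last); the function name is illustrative.

```python
def pick_scrub_victim(lru_stack, n):
    """Return the index of the dirty block closest to the LRU tail,
    searching only the lowest `n` stack positions; None if no dirty block
    is found there. Each entry is True when the block is dirty."""
    for i in range(len(lru_stack) - 1, max(len(lru_stack) - 1 - n, -1), -1):
        if lru_stack[i]:
            return i
    return None

# 8-way set; index 7 is the LRU tail. Dirty blocks sit at positions 1 and 5.
stack = [False, True, False, False, False, True, False, False]
print(pick_scrub_victim(stack, 3))  # -> 5: dirty block within the lowest 3
print(pick_scrub_victim(stack, 2))  # -> None: positions 6 and 7 are clean
```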
Referring now to
At a high level, method 300 may be used to allocate a virtual write buffer in the LLC based on system conditions and maintain the virtual write buffer during at least portions of operation, including inserting dirty lines into entries of the virtual write buffer and adaptively draining or scrubbing these dirty lines from the virtual write buffer. Maintenance may further include dynamic control of the virtual write buffer, including dynamic sizing of the virtual write buffer, dynamic allocation/deallocation of the virtual write buffer and so forth.
As illustrated, method 300 begins by monitoring a write bandwidth with a memory (block 310). More specifically, a write bandwidth of an interconnect that couples a processor to a memory such as a DRAM can be monitored. In embodiments, an integrated memory controller of the processor may maintain such statistics regarding channel usage. As an example, statistics may be maintained as to read and write bandwidths for read and write operations on the interconnect. In different implementations, such statistics may be maintained independently for multiple channels. Furthermore, while in the embodiment of
In any event, control next passes to diamond 315 to determine whether this monitored bandwidth exceeds a first bandwidth threshold. Although the scope of the present invention is not limited in this regard, this first bandwidth threshold may be set at a given level of bandwidth, e.g., as a percentage of maximum bandwidth. If the bandwidth is determined not to exceed this first bandwidth threshold, control passes back to block 310 for further monitoring of the bandwidth. Note that this bandwidth monitoring may occur periodically.
Still with reference to
Allocation of the virtual write buffer may include additional operations, such as updating a replacement policy within the LLC. For example, the virtual write buffer allocation may be performed by updating the replacement policy to a clean LRU policy. That is, to enable a virtual write buffer as described herein, the replacement policy may be set such that clean lines are preferentially evicted from the LLC and dirty lines are not selected as victims. Note that by evicting clean lines, there is no impact on memory bandwidth, as these lines may simply be dropped, since they are clean and thus include the same data as present in the memory. Understand that in some embodiments, additional operations to allocate the virtual write buffer may occur.
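A clean LRU victim selection of the kind described might be sketched as follows; the set is again modeled as dirty bits in LRU-stack order (MRU first), and the fallback behavior when every line is dirty is an assumption for illustration.

```python
def clean_lru_victim(lru_stack):
    """Clean LRU policy: pick the clean line nearest the LRU tail, so that
    dirty lines (the virtual write buffer) are not selected as victims.
    Fall back to the plain LRU line only if every line in the set is dirty.
    Each entry is True when the line is dirty."""
    for i in range(len(lru_stack) - 1, -1, -1):
        if not lru_stack[i]:
            return i            # clean victim: droppable, no writeback needed
    return len(lru_stack) - 1   # all dirty: evict the LRU line anyway

print(clean_lru_victim([True, False, True, True]))  # -> 1
print(clean_lru_victim([True, True, True, True]))   # -> 3
```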
At this point in operation, the virtual write buffer is allocated, such that when dirty lines are written into the LLC, they are more likely to be maintained within the LLC and be less likely to be evicted from the LLC (to the memory controller (more specifically, a write buffer within the memory controller) and in turn to the memory).
Still referring to
Still further with reference to
Note with regard to
Referring now to
As illustrated, method 400 begins by accumulating a hit counter for a given LRU stack position with an accumulated hit counter value (block 410). Note that this accumulation may begin at an LRU position by access to a hit counter associated with the LRU position (within a set of such hit counters). Thus in an initial iteration of the accumulation at block 410, this accumulated hit counter value may be set at an initialized value of zero. Understand that a cache controller may maintain independent hit counters for read and write hits. Still further, in an embodiment herein, multiple hit counters may be maintained, with a hit counter for each LRU position within an LRU stack. In an example of a cache arrangement having 16 ways, there thus may be 16 LRU positions. As such, a cache controller may maintain 16 read hit counters and 16 write hit counters. Understand that the cache controller may update the appropriate read/write hit counter on a given read or write hit to the corresponding LRU position when a request hits within that LRU position in one of the sets of the LLC.
Still with reference to
Still with reference to
Note that method 400 may proceed independently for read hit counters and write hit counters. As such, two different maximum stretch values may be set, one associated with read hits and the other associated with write hits. As described herein, both of these values may be used to determine a size or depth of the virtual write buffer. For example, the cache controller of the LLC may set the virtual write buffer depth to a smallest one of these two maximum stretch values, assuming that the value of whichever is the smaller maximum stretch value exceeds the baseline or predetermined virtual write buffer depth. Of course other examples are possible.
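Putting the pieces together, a hypothetical end-to-end sizing pass over the two histograms might look like this. The histogram values are invented for illustration, and both are ordered MRU first.

```python
def stretch(hist, fraction):
    """Ways from the LRU tail whose cumulative hits first reach `fraction`
    of the histogram total; full depth if the fraction is never reached."""
    total, covered = sum(hist), 0
    for ways, hits in enumerate(reversed(hist), start=1):
        covered += hits
        if covered >= fraction * total:
            return ways
    return len(hist)

rhh = [400, 300, 200, 100, 50, 25, 15, 10]   # read hits per LRU position
whh = [10, 10, 10, 10, 20, 40, 100, 200]     # write hits per LRU position
max_read_stretch = stretch(rhh, 1 / 16)      # -> 4
max_write_stretch = stretch(whh, 1 / 2)      # -> 1
depth = max(3, min(max_write_stretch, max_read_stretch))
print(depth)  # -> 3: the baseline floor dominates here
```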
Referring now to
As an option, it may be determined whether the identified dirty line is within a threshold distance of the LRU position itself (diamond 520). As an example, this threshold distance may restrict the search for dirty lines to the lowest N LRU stack positions, thus minimizing over scrubs. In other cases, this optional determination may not occur. In any event, control passes to block 530 where the identified dirty line may be written to the memory controller (for eventual write back to the memory). Note that this write of the identified dirty line does not cause an eviction of the dirty line. Instead as further shown in
Still with reference to
Still referring to
Referring now to
Coupled between front end units 610 and execution units 620 is an out-of-order (OOO) engine 615 that may be used to receive the micro-instructions and prepare them for execution. More specifically OOO engine 615 may include various buffers to re-order micro-instruction flow and allocate various resources needed for execution, as well as to provide renaming of logical registers onto storage locations within various register files such as register file 630 and extended register file 635. Register file 630 may include separate register files for integer and floating point operations. For purposes of configuration, control, and additional operations, a set of machine specific registers (MSRs) 638 may also be present and accessible to various logic within core 600 (and external to the core).
Various resources may be present in execution units 620, including, for example, various integer, floating point, and single instruction multiple data (SIMD) logic units, among other specialized hardware. For example, such execution units may include one or more arithmetic logic units (ALUs) 622 and one or more vector execution units 624, among other such execution units.
Results from the execution units may be provided to retirement logic, namely a reorder buffer (ROB) 640. More specifically, ROB 640 may include various arrays and logic to receive information associated with instructions that are executed. This information is then examined by ROB 640 to determine whether the instructions can be validly retired and result data committed to the architectural state of the processor, or whether one or more exceptions occurred that prevent a proper retirement of the instructions. Of course, ROB 640 may handle other operations associated with retirement.
As shown in
Referring now to
A floating point pipeline 730 includes a floating point (FP) register file 732 which may include a plurality of architectural registers of a given bit width such as 128, 256 or 512 bits. Pipeline 730 includes a floating point scheduler 734 to schedule instructions for execution on one of multiple execution units of the pipeline. In the embodiment shown, such execution units include an ALU 735, a shuffle unit 736, and a floating point adder 738. In turn, results generated in these execution units may be provided back to buffers and/or registers of register file 732. Of course understand while shown with these few example execution units, additional or different floating point execution units may be present in another embodiment.
An integer pipeline 740 also may be provided. In the embodiment shown, pipeline 740 includes an integer (INT) register file 742 which may include a plurality of architectural registers of a given bit width such as 128 or 256 bits. Pipeline 740 includes an integer execution (IE) scheduler 744 to schedule instructions for execution on one of multiple execution units of the pipeline. In the embodiment shown, such execution units include an ALU 745, a shifter unit 746, and a jump execution unit (JEU) 748. In turn, results generated in these execution units may be provided back to buffers and/or registers of register file 742. Of course understand while shown with these few example execution units, additional or different integer execution units may be present in another embodiment.
A memory execution (ME) scheduler 750 may schedule memory operations for execution in an address generation unit (AGU) 752, which is also coupled to a TLB 754. As seen, these structures may couple to a data cache 760, which may be an L0 and/or L1 data cache that in turn couples to additional levels of a cache memory hierarchy, including an L2 cache memory, and which may dynamically implement an adaptive write buffer as described herein.
To provide support for out-of-order execution, an allocator/renamer 770 may be provided, in addition to a reorder buffer 780, which is configured to reorder instructions executed out of order for retirement in order. Although shown with this particular pipeline architecture in the illustration of
Referring to
With further reference to
Referring to
Also shown in
Decoded instructions may be issued to a given one of multiple execution units. In the embodiment shown, these execution units include one or more integer units 935, a multiply unit 940, a floating point/vector unit 950, a branch unit 960, and a load/store unit 970. In an embodiment, floating point/vector unit 950 may be configured to handle SIMD or vector data of 128 or 256 bits. Still further, floating point/vector execution unit 950 may perform IEEE-754 double precision floating-point operations. The results of these different execution units may be provided to a writeback unit 980. Note that in some implementations separate writeback units may be associated with each of the execution units. Furthermore, understand that while each of the units and logic shown in
A processor designed using one or more cores having pipelines as in any one or more of
In the high level view shown in
Each core unit 1010 may also include an interface such as a bus interface unit to enable interconnection to additional circuitry of the processor. In an embodiment, each core unit 1010 couples to a coherent fabric that may act as a primary cache coherent on-die interconnect that in turn couples to a memory controller 1035. In turn, memory controller 1035 controls communications with a memory such as a DRAM (not shown for ease of illustration in
In addition to core units, additional processing engines are present within the processor, including at least one graphics unit 1020 which may include one or more graphics processing units (GPUs) to perform graphics processing as well as to possibly execute general purpose operations on the graphics processor (so-called GPGPU operation). In addition, at least one image signal processor 1025 may be present. Signal processor 1025 may be configured to process incoming image data received from one or more capture devices, either internal to the SoC or off-chip.
Other accelerators also may be present. In the illustration of
In some embodiments, SoC 1000 may further include a non-coherent fabric coupled to the coherent fabric to which various peripheral devices may couple. One or more interfaces 1060a-1060d enable communication with one or more off-chip devices. Such communications may be via a variety of communication protocols such as PCIe™, GPIO, USB, I2C, UART, MIPI, SDIO, DDR, SPI, HDMI, among other types of communication protocols. Although shown at this high level in the embodiment of
Referring now to
As seen in
With further reference to
As seen, the various domains couple to a coherent interconnect 1140, which in an embodiment may be a cache coherent interconnect fabric that in turn couples to an integrated memory controller 1150. Coherent interconnect 1140 may include a shared cache memory, such as an L3 cache, in some examples. In an embodiment, memory controller 1150 may be a direct memory controller to provide for multiple channels of communication with an off-chip memory, such as multiple channels of a DRAM (not shown for ease of illustration in
Referring now to
In turn, a GPU domain 1220 is provided to perform advanced graphics processing in one or more GPUs to handle graphics and compute APIs. A DSP unit 1230 may provide one or more low power DSPs for handling low-power multimedia applications such as music playback, audio/video and so forth, in addition to advanced calculations that may occur during execution of multimedia instructions.
As further illustrated, a shared cache 1235 may couple to various domains and may act as an LLC that has an adaptive write buffer as described herein. In turn, a communication unit 1240 may include various components to provide connectivity via various wireless protocols, such as cellular communications (including 3G/4G LTE), wireless local area protocols such as Bluetooth™, IEEE 802.11, and so forth.
Still further, a multimedia processor 1250 may be used to perform capture and playback of high definition video and audio content, including processing of user gestures. A sensor unit 1260 may include a plurality of sensors and/or a sensor controller to interface to various off-chip sensors present in a given platform. An image signal processor 1270 may be provided with one or more separate ISPs to perform image processing with regard to captured content from one or more cameras of a platform, including still and video cameras.
A display processor 1280 may provide support for connection to a high definition display of a given pixel density, including the ability to wirelessly communicate content for playback on such display. Still further, a location unit 1290 may include a GPS receiver with support for multiple GPS constellations to provide applications highly accurate positioning information obtained using such GPS receiver. Understand that while shown with this particular set of components in the example of
Referring now to
In turn, application processor 1310 can couple to a user interface/display 1320, e.g., a touch screen display. In addition, application processor 1310 may couple to a memory system including a non-volatile memory, namely a flash memory 1330 and a system memory, namely a dynamic random access memory (DRAM) 1335. As further seen, application processor 1310 further couples to a capture device 1340 such as one or more image capture devices that can record video and/or still images.
Still referring to
As further illustrated, a near field communication (NFC) contactless interface 1360 is provided that communicates in a NFC near field via an NFC antenna 1365. While separate antennae are shown in
A power management integrated circuit (PMIC) 1315 couples to application processor 1310 to perform platform level power management. To this end, PMIC 1315 may issue power management requests to application processor 1310 to enter certain low power states as desired. Furthermore, based on platform constraints, PMIC 1315 may also control the power level of other components of system 1300.
To enable communications to be transmitted and received, various circuitry may be coupled between baseband processor 1305 and an antenna 1390. Specifically, a radio frequency (RF) transceiver 1370 and a wireless local area network (WLAN) transceiver 1375 may be present. In general, RF transceiver 1370 may be used to receive and transmit wireless data and calls according to a given wireless communication protocol, such as a 3G or 4G protocol, e.g., in accordance with a code division multiple access (CDMA), global system for mobile communication (GSM), long term evolution (LTE) or other protocol. In addition, a GPS sensor 1380 may be present. Other wireless communications, such as receipt or transmission of radio signals, e.g., AM/FM and other signals, may also be provided. In addition, via WLAN transceiver 1375, local wireless communications can also be realized.
Referring now to
A variety of devices may couple to SoC 1410. In the illustration shown, a memory subsystem includes a flash memory 1440 and a DRAM 1445 coupled to SoC 1410. In addition, a touch panel 1420 is coupled to the SoC 1410 to provide display capability and user input via touch, including provision of a virtual keyboard on a display of touch panel 1420. To provide wired network connectivity, SoC 1410 couples to an Ethernet interface 1430. A peripheral hub 1425 is coupled to SoC 1410 to enable interfacing with various peripheral devices, such as may be coupled to system 1400 by any of various ports or other connectors.
In addition to internal power management circuitry and functionality within SoC 1410, a PMIC 1480 is coupled to SoC 1410 to provide platform-based power management, e.g., based on whether the system is powered by a battery 1490 or AC power via an AC adapter 1495. In addition to this power source-based power management, PMIC 1480 may further perform platform power management activities based on environmental and usage conditions. Still further, PMIC 1480 may communicate control and status information to SoC 1410 to cause various power management actions within SoC 1410.
Still referring to
As further illustrated, a plurality of sensors 1460 may couple to SoC 1410. These sensors may include various accelerometer, environmental and other sensors, including user gesture sensors. Finally, an audio codec 1465 is coupled to SoC 1410 to provide an interface to an audio output device 1470. Of course understand that while shown with this particular implementation in
Referring now to
Processor 1510, in one embodiment, communicates with a system memory 1515. As an illustrative example, the system memory 1515 is implemented via multiple memory devices or modules to provide for a given amount of system memory.
To provide for persistent storage of information such as data, applications, one or more operating systems and so forth, a mass storage 1520 may also couple to processor 1510. In various embodiments, to enable a thinner and lighter system design as well as to improve system responsiveness, this mass storage may be implemented via an SSD, or the mass storage may primarily be implemented using a hard disk drive (HDD) with a smaller amount of SSD storage acting as an SSD cache, enabling non-volatile storage of context state and other such information during power down events so that a fast power up can occur on re-initiation of system activities. Also shown in
Various input/output (I/O) devices may be present within system 1500. Specifically shown in the embodiment of
For perceptual computing and other purposes, various sensors may be present within the system and may be coupled to processor 1510 in different manners. Certain inertial and environmental sensors may couple to processor 1510 through a sensor hub 1540, e.g., via an I2C interconnect. In the embodiment shown in
Also seen in
System 1500 can communicate with external devices in a variety of manners, including wirelessly. In the embodiment shown in
As further seen in
In addition, wireless wide area communications, e.g., according to a cellular or other wireless wide area protocol, can occur via a WWAN unit 1556 which in turn may couple to a subscriber identity module (SIM) 1557. In addition, to enable receipt and use of location information, a GPS module 1555 may also be present. Note that in the embodiment shown in
To provide for audio inputs and outputs, an audio processor can be implemented via a digital signal processor (DSP) 1560, which may couple to processor 1510 via a high definition audio (HDA) link. Similarly, DSP 1560 may communicate with an integrated coder/decoder (CODEC) and amplifier 1562 that in turn may couple to output speakers 1563 which may be implemented within the chassis. Similarly, amplifier and CODEC 1562 can be coupled to receive audio inputs from a microphone 1565 which in an embodiment can be implemented via dual array microphones (such as a digital microphone array) to provide for high quality audio inputs to enable voice-activated control of various operations within the system. Note also that audio outputs can be provided from amplifier/CODEC 1562 to a headphone jack 1564. Although shown with these particular components in the embodiment of
Embodiments may be implemented in many different system types. Referring now to
Still referring to
Furthermore, chipset 1690 includes an interface 1692 to couple chipset 1690 with a high performance graphics engine 1638, by a P-P interconnect 1639. In turn, chipset 1690 may be coupled to a first bus 1616 via an interface 1696. As shown in
One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium which represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions which represent various logic within the processor. When read by a machine, the instructions may cause the machine to fabricate the logic to perform the techniques described herein. Such representations, known as “IP cores,” are reusable units of logic for an integrated circuit that may be stored on a tangible, machine-readable medium as a hardware model that describes the structure of the integrated circuit. The hardware model may be supplied to various customers or manufacturing facilities, which load the hardware model on fabrication machines that manufacture the integrated circuit. The integrated circuit may be fabricated such that the circuit performs operations described in association with any of the embodiments described herein.
The RTL design 1715 or equivalent may be further synthesized by the design facility into a hardware model 1720, which may be in a hardware description language (HDL), or some other representation of physical design data. The HDL may be further simulated or tested to verify the IP core design. The IP core design can be stored for delivery to a third party fabrication facility 1765 using non-volatile memory 1740 (e.g., hard disk, flash memory, or any non-volatile storage medium). Alternately, the IP core design may be transmitted (e.g., via the Internet) over a wired connection 1750 or wireless connection 1760. The fabrication facility 1765 may then fabricate an integrated circuit that is based at least in part on the IP core design. The fabricated integrated circuit can be configured to perform operations in accordance with at least one embodiment described herein.
Write mask registers 1815—in the embodiment illustrated, there are 8 write mask registers (k0 through k7), each 64 bits in size. In an alternate embodiment, the write mask registers 1815 are 16 bits in size. As previously described, in one embodiment of the invention, the vector mask register k0 cannot be used as a write mask; when the encoding that would normally indicate k0 is used for a write mask, it selects a hardwired write mask of 0xFFFF, effectively disabling write masking for that instruction.
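The hardwired-k0 behavior described above can be sketched as a small software model. This is purely illustrative; the function name and the list-of-integers register representation are assumptions, not part of any architectural definition:

```python
# Illustrative model: selecting a write mask, where the k0 encoding
# yields a hardwired all-ones mask (0xFFFF for 16 elements),
# effectively disabling write masking for that instruction.
def effective_write_mask(k_regs, index, width=16):
    """Return the write mask selected by mask-register encoding `index`."""
    if index == 0:
        return (1 << width) - 1                  # k0 encoding: hardwired 0xFFFF
    return k_regs[index] & ((1 << width) - 1)    # normal mask register

k_regs = [0] * 8                                 # k0..k7
k_regs[3] = 0b1010_1010_1010_1010
assert effective_write_mask(k_regs, 0) == 0xFFFF            # masking disabled
assert effective_write_mask(k_regs, 3) == 0b1010_1010_1010_1010
```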
General-purpose registers 1825—in the embodiment illustrated, there are sixteen 64-bit general-purpose registers that are used along with the existing x86 addressing modes to address memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
Scalar floating point stack register file (x87 stack) 1845, on which is aliased the MMX packed integer flat register file 1850—in the embodiment illustrated, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating point data using the x87 instruction set extension; while the MMX registers are used to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer, or different register files and registers.
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
In
The front end unit 1930 includes a branch prediction unit 1932 coupled to an instruction cache unit 1934, which is coupled to an instruction translation lookaside buffer (TLB) 1936, which is coupled to an instruction fetch unit 1938, which is coupled to a decode unit 1940. The decode unit 1940 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1940 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1990 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 1940 or otherwise within the front end unit 1930). The decode unit 1940 is coupled to a rename/allocator unit 1952 in the execution engine unit 1950.
The execution engine unit 1950 includes the rename/allocator unit 1952 coupled to a retirement unit 1954 and a set of one or more scheduler unit(s) 1956. The scheduler unit(s) 1956 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 1956 is coupled to the physical register file(s) unit(s) 1958. Each of the physical register file(s) units 1958 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1958 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 1958 is overlapped by the retirement unit 1954 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 1954 and the physical register file(s) unit(s) 1958 are coupled to the execution cluster(s) 1960. The execution cluster(s) 1960 includes a set of one or more execution units 1962 and a set of one or more memory access units 1964. The execution units 1962 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point).
While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 1956, physical register file(s) unit(s) 1958, and execution cluster(s) 1960 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 1964). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access units 1964 is coupled to the memory unit 1970, which includes a data TLB unit 1972 coupled to a data cache unit 1974 coupled to a level 2 (L2) cache unit 1976. In one exemplary embodiment, the memory access units 1964 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1972 in the memory unit 1970. The instruction cache unit 1934 is further coupled to a level 2 (L2) cache unit 1976 in the memory unit 1970. The L2 cache unit 1976 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1900 as follows: 1) the instruction fetch 1938 performs the fetch and length decoding stages 1902 and 1904; 2) the decode unit 1940 performs the decode stage 1906; 3) the rename/allocator unit 1952 performs the allocation stage 1908 and renaming stage 1910; 4) the scheduler unit(s) 1956 performs the schedule stage 1912; 5) the physical register file(s) unit(s) 1958 and the memory unit 1970 perform the register read/memory read stage 1914; 6) the execution cluster 1960 performs the execute stage 1916; 7) the memory unit 1970 and the physical register file(s) unit(s) 1958 perform the write back/memory write stage 1918; 8) various units may be involved in the exception handling stage 1922; and 9) the retirement unit 1954 and the physical register file(s) unit(s) 1958 perform the commit stage 1924.
The core 1990 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 1990 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1934/1974 and a shared L2 cache unit 1976, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
The local subset of the L2 cache 2004 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 2004. Data read by a processor core is stored in its L2 cache subset 2004 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 2004 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches and other logic blocks to communicate with each other within the chip. Each ring data-path is 1012 bits wide per direction.
Thus, different implementations of the processor 2100 may include: 1) a CPU with the special purpose logic 2108 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 2102A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 2102A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 2102A-N being a large number of general purpose in-order cores. Thus, the processor 2100 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 2100 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 2106, and external memory (not shown) coupled to the set of integrated memory controller units 2114. The set of shared cache units 2106 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 2112 interconnects the integrated graphics logic 2108, the set of shared cache units 2106, and the system agent unit 2110/integrated memory controller unit(s) 2114, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 2106 and cores 2102A-N.
In some embodiments, one or more of the cores 2102A-N are capable of multi-threading. The system agent 2110 includes those components coordinating and operating cores 2102A-N. The system agent unit 2110 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 2102A-N and the integrated graphics logic 2108. The display unit is for driving one or more externally connected displays.
The cores 2102A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 2102A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
The following examples pertain to further embodiments.
In one example, a processor includes: a cache memory to store a plurality of cache lines; and a cache controller to control the cache memory. The cache controller may include a control circuit to allocate a virtual write buffer within the cache memory in response to a bandwidth, on an interconnect to couple the processor with a memory, that exceeds a first bandwidth threshold. The cache controller may further include a replacement circuit to control eviction of cache lines from the cache memory.
In an example, the control circuit is to cause the replacement circuit to update a replacement policy in response to the allocation of the virtual write buffer.
In an example, the update to the replacement policy comprises a switch to a least recently used clean policy in which cache lines including unmodified data are to be preferentially evicted from the cache memory.
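As a rough sketch of the least recently used clean policy, assuming each set is kept ordered from most to least recently used and each line carries a dirty flag, victim selection might proceed as follows (the function and field names are hypothetical, chosen only for illustration):

```python
def select_victim(cache_set, lru_clean=True):
    """Pick an eviction victim from a set ordered MRU (index 0) to LRU.

    Under the LRU-clean policy, the least recently used *clean*
    (unmodified) line is preferentially evicted; a dirty line is chosen
    only when no clean line exists in the set.
    """
    for line in reversed(cache_set):    # scan from the LRU end
        if not lru_clean or not line["dirty"]:
            return line
    return cache_set[-1]                # all lines dirty: plain LRU fallback

ways = [{"tag": "A", "dirty": True},
        {"tag": "B", "dirty": False},
        {"tag": "C", "dirty": True}]    # MRU -> LRU order
assert select_victim(ways)["tag"] == "B"                     # clean line preferred
assert select_victim(ways, lru_clean=False)["tag"] == "C"    # plain LRU
```

Preferring clean victims means evictions cost no writeback bandwidth, which is the point of the policy switch during high-bandwidth periods.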
In an example, the control circuit is to initiate a drain of the virtual write buffer in response to the bandwidth on the interconnect being less than a second bandwidth threshold.
In an example, the replacement circuit, during the drain, is to write a cache line including modified data to the memory and maintain the cache line in the cache memory, where the cache line is within a threshold distance of a least recently used position.
In an example, the control circuit is further to update a state of the cache line including the modified data to a clean state.
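The drain behavior of the two preceding examples can be sketched as follows, under the simplifying assumptions that a set is ordered MRU to LRU and that memory is modeled as a dictionary (all names here are illustrative):

```python
def drain_virtual_write_buffer(cache_set, memory, threshold_distance=2):
    """Drain sketch: write back dirty lines within `threshold_distance`
    of the LRU position, keep them resident in the cache, and mark
    them clean.

    `cache_set` is ordered MRU (index 0) to LRU (last index);
    `memory` maps tag -> data.
    """
    n = len(cache_set)
    for pos, line in enumerate(cache_set):
        lru_distance = n - 1 - pos               # 0 at the LRU position
        if line["dirty"] and lru_distance < threshold_distance:
            memory[line["tag"]] = line["data"]   # write modified data back
            line["dirty"] = False                # line stays cached, now clean
```

After such a drain, the formerly dirty lines near the LRU position remain cached in a clean state, so a later eviction under the LRU-clean policy consumes no write bandwidth.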
In an example, the cache controller comprises a first set of hit counters associated with corresponding positions within a least recently used stack, and to be updated in response to read hits within the cache memory.
In an example, the cache controller comprises a second set of hit counters associated with corresponding positions within the least recently used stack, and to be updated in response to write hits within the cache memory.
In an example, the control circuit is to dynamically update a size of the virtual write buffer based on hit histogram information obtained from at least one of the first set of hit counters and the second set of hit counters.
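One possible, purely illustrative way such hit histogram information could drive sizing: walk from the LRU position toward the MRU position, counting ways whose read-hit counters show little reuse, and repurpose those ways as the virtual write buffer. The function name and threshold are assumptions for the sketch:

```python
def write_buffer_ways(read_hits, num_ways, min_reads_per_way=100):
    """Suggest how many LRU ways to repurpose as the virtual write buffer.

    `read_hits[pos]` is the hit counter for LRU-stack position `pos`,
    with position 0 the MRU end. Stop at the first position (scanning
    from the LRU end) whose counter shows significant read reuse, so
    the read hit rate is preserved.
    """
    n = 0
    for pos in range(num_ways - 1, 0, -1):   # LRU end toward MRU
        if read_hits[pos] >= min_reads_per_way:
            break                            # reads still land here; stop
        n += 1
    return n

# Positions 2 and 3 see few read hits, so two LRU ways can be repurposed.
assert write_buffer_ways([500, 300, 40, 5], num_ways=4) == 2
```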
In an example, the virtual write buffer comprises one or more ways of a plurality of sets of the cache memory.
In an example, the one or more ways comprises N least recently used ways of the plurality of sets of the cache memory, where N is dynamically controllable.
In an example, the cache controller is to initiate a drain of the virtual write buffer in response to a number of the plurality of sets in which the one or more ways store dirty data exceeding a threshold.
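This occupancy-based drain trigger can be sketched as follows, modeling the write-buffer ways of each set as its last (LRU) entries; the names and representation are assumptions of the sketch:

```python
def should_drain(sets, buffer_ways, occupancy_threshold):
    """Return True when the number of sets whose write-buffer ways
    (here, the last `buffer_ways` entries, i.e. the N LRU ways)
    hold dirty data exceeds `occupancy_threshold`."""
    dirty_sets = sum(
        1 for s in sets
        if any(line["dirty"] for line in s[-buffer_ways:])
    )
    return dirty_sets > occupancy_threshold
```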
In another example, a method comprises: monitoring a bandwidth of an interconnect that couples a processor to a memory; in response to the bandwidth exceeding a first bandwidth threshold, allocating a virtual write buffer in a cache memory of the processor; and dynamically controlling a size of the virtual write buffer based at least in part on hit histogram information.
In an example, the method further comprises: monitoring a consumption of the virtual write buffer; and initiating a draining of the virtual write buffer in response to the consumption exceeding a threshold.
In an example, the draining comprises: writing dirty data from a plurality of cache lines of the virtual write buffer to the memory; and updating a state of the plurality of cache lines of the virtual write buffer to a clean state.
In an example, the method further comprises initiating a draining of the virtual write buffer in response to the bandwidth being less than a second bandwidth threshold.
In an example, allocating the virtual write buffer comprises updating a replacement policy of the cache memory to preferentially evict clean data instead of dirty data.
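Taken together, the monitoring, allocation and drain steps of this example method can be sketched as a single control step per monitoring interval. The state representation and threshold values are assumptions of the sketch, not part of the example:

```python
def control_step(bandwidth, hi_threshold, lo_threshold, state):
    """One iteration of the adaptive loop: allocate the virtual write
    buffer when interconnect bandwidth exceeds the high threshold,
    and begin draining it once bandwidth falls below the low one."""
    if not state["allocated"] and bandwidth > hi_threshold:
        state["allocated"] = True        # carve out the virtual write buffer
        state["policy"] = "lru_clean"    # preferentially evict clean data
    elif state["allocated"] and bandwidth < lo_threshold:
        state["draining"] = True         # write dirty data back, mark clean
    return state

state = {"allocated": False, "policy": "lru", "draining": False}
control_step(0.9, hi_threshold=0.8, lo_threshold=0.3, state=state)
assert state["allocated"] and state["policy"] == "lru_clean"
```

Using two separate thresholds provides hysteresis, so the buffer is not repeatedly allocated and drained when bandwidth hovers near a single cutoff.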
In another example, a computer readable medium including instructions is to perform the method of any of the above examples.
In another example, a computer readable medium including data is to be used by at least one machine to fabricate at least one integrated circuit to perform the method of any one of the above examples.
In another example, an apparatus comprises means for performing the method of any one of the above examples.
In another example, a system comprises a processor that includes: a plurality of cores each including a first level cache memory and a cache memory hierarchy coupled to the plurality of cores. The cache memory hierarchy may include: the first level cache memory included in the plurality of cores; and a shared cache memory coupled to the first level cache memory. The shared cache memory may include: a cache controller to control the shared cache memory, the cache controller including a control circuit to, in response to a bandwidth on a memory interconnect that couples the processor with a memory exceeding a first bandwidth threshold, allocate a virtual write buffer within the shared cache memory and update a replacement policy to preferentially evict clean data from the shared cache memory. The processor may further include a memory controller to interact with the memory and maintain bandwidth information for the memory interconnect. The system may further include the memory interconnect to couple the processor to the memory, and the memory coupled to the processor via the memory interconnect.
In an example, the control circuit is to initiate a drain of the virtual write buffer in response to the bandwidth on the memory interconnect being less than a second bandwidth threshold, and where the cache controller, during the drain, is to write a cache line including modified data to the memory and maintain the cache line in the shared cache memory, where the cache line is within a threshold distance of a least recently used position.
In an example, the cache controller comprises a set of hit counters associated with corresponding positions within a least recently used stack of the shared cache memory, and to be updated in response to hits within the shared cache memory, and where the control circuit is to dynamically update a size of the virtual write buffer based on hit histogram information obtained from the set of hit counters.
Understand that various combinations of the above examples are possible.
Note that the terms “circuit” and “circuitry” are used interchangeably herein. As used herein, these terms and the term “logic” are used to refer to, alone or in any combination, analog circuitry, digital circuitry, hard wired circuitry, programmable circuitry, processor circuitry, microcontroller circuitry, hardware logic circuitry, state machine circuitry and/or any other type of physical hardware component. Embodiments may be used in many different types of systems. For example, in one embodiment a communication device can be arranged to perform the various methods and techniques described herein. Of course, the scope of the present invention is not limited to a communication device, and instead other embodiments can be directed to other types of apparatus for processing instructions, or one or more machine readable media including instructions that in response to being executed on a computing device, cause the device to carry out one or more of the methods and techniques described herein.
Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. Embodiments also may be implemented in data and may be stored on a non-transitory storage medium, which if used by at least one machine, causes the at least one machine to fabricate at least one integrated circuit to perform one or more operations. Still further embodiments may be implemented in a computer readable storage medium including information that, when manufactured into a SoC or other processor, is to configure the SoC or other processor to perform one or more operations. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.