A system-on-a-chip (SoC) commonly includes one or more processing devices, such as central processing units (CPUs) and cores, as well as one or more memories and one or more interconnects, such as buses. A processing device may issue a data access request to either read data from a system memory or write data to the system memory. For example, in response to a read access request, data is retrieved from the system memory and provided to the requesting device via one or more interconnects. The time delay between issuance of the request and arrival of requested data at the requesting device is commonly referred to as “latency.” Cores and other processing devices compete to access data in system memory and experience varying amounts of latency.
Caching is a technique that may be employed to reduce latency. Data that is predicted to be subject to frequent or high-priority accesses may be stored in a cache memory from which the data may be provided with lower latency than it could be provided from the system memory. As commonly employed caching methods are predictive in nature, an access request may result in a cache hit if the requested data can be retrieved from the cache memory or a cache miss if the requested data cannot be retrieved from the cache memory. If a cache miss occurs, then the data must be retrieved from the system memory instead of the cache memory, at a cost of increased latency. The more requests that can be served from the cache memory instead of the system memory, the faster the system performs overall.
Although caching is commonly employed to reduce latency, caching has the potential to increase latency when requested data too frequently cannot be retrieved from the cache memory. Display systems are known to be prone to latency-related failures. “Underflow” is one such failure, in which data arrives at the display system too slowly to fill the display in the intended manner.
One known solution that attempts to mitigate the above-described problem in display systems is to increase the sizes of buffer memories in display and camera system cores. This solution comes at the cost of increased chip area. Another known solution that attempts to mitigate the problem is to employ faster memory. This solution comes at costs that include greater chip area and power consumption.
Systems, methods, and computer programs are disclosed for reducing worst-case memory latency in a system comprising a system memory and a cache memory. One embodiment is a method comprising receiving a translation request from a memory client for a translation of a virtual address to a physical address. If the translation is not available at a translation buffer unit and a translation control unit in a system memory management unit, the translation control unit initiates a page table walk. During the page table walk, the method determines a page table entry for an intermediate physical address in the system memory. In response to determining the page table entry for the intermediate physical address, the method preloads data at the intermediate physical address to the system cache before the page table walk for a final physical address corresponding to the intermediate physical address is completed.
Another embodiment is a computer system comprising a system memory, a system cache, and a system memory management unit. The system memory management unit comprises a translation buffer unit and a translation control unit. The translation buffer unit is configured to receive a translation request from a memory client for a translation of a virtual address to a physical address. The translation control unit is configured to initiate a page table walk if the translation is not available at the translation buffer unit and the translation control unit. The computer system further comprises control logic for reducing worst-case memory latency in the system. The control logic is configured to: determine a page table entry for an intermediate physical address in the system memory; and in response to determining the page table entry for the intermediate physical address, preload data at the intermediate physical address to the system cache before the page table walk for a final physical address corresponding to the intermediate physical address is completed.
In the Figures, like reference numerals refer to like parts throughout the various views unless otherwise indicated. For reference numerals with letter character designations such as “102A” or “102B”, the letter character designations may differentiate two like parts or elements present in the same Figure. Letter character designations for reference numerals may be omitted when it is intended that a reference numeral encompass all parts having the same reference numeral in all Figures.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
The terms “component,” “database,” “module,” “system,” and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes, such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal).
The term “application” or “image” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, an “application” referred to herein may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
The term “content” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, “content” referred to herein may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
The term “task” may include a process, a thread, or any other unit of execution in a device.
The term “virtual memory” refers to the abstraction of the actual physical memory from the application or image that is referencing the memory. A translation or mapping may be used to convert a virtual memory address to a physical memory address. The mapping may be as simple as 1-to-1 (e.g., physical address equals virtual address), moderately complex (e.g., a physical address equals a constant offset from the virtual address), or the mapping may be complex (e.g., every 4 KB page mapped uniquely). The mapping may be static (e.g., performed once at startup), or the mapping may be dynamic (e.g., continuously evolving as memory is allocated and freed).
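By way of illustration only, the following C sketch shows the three kinds of mapping described above. The function names map_identity, map_offset, and map_paged, the particular constant offset, and the page_table array of page frame numbers are hypothetical and are introduced purely for this example.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u                      /* 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1) /* low 12 bits: offset within a page */

/* Identity mapping: the physical address equals the virtual address. */
uint64_t map_identity(uint64_t va) { return va; }

/* Constant-offset mapping: the physical address equals the virtual
 * address plus a fixed offset (the value here is illustrative only). */
uint64_t map_offset(uint64_t va) { return va + 0x80000000ull; }

/* Page-granular mapping: each 4 KB virtual page may map to an arbitrary
 * physical page; page_table[] is a hypothetical array of physical page
 * frame numbers indexed by virtual page number. */
uint64_t map_paged(uint64_t va, const uint64_t *page_table)
{
    uint64_t vpn    = va >> PAGE_SHIFT;   /* virtual page number    */
    uint64_t offset = va & PAGE_MASK;     /* offset within the page */
    return (page_table[vpn] << PAGE_SHIFT) | offset;
}
```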
In this description, the terms “communication device,” “wireless device,” “wireless telephone,” “wireless communication device,” and “wireless handset” are used interchangeably. With the advent of third generation (“3G”) and fourth generation (“4G”) wireless technology, greater bandwidth availability has enabled more portable computing devices with a greater variety of wireless capabilities. Therefore, a portable computing device may include a cellular telephone, a pager, a PDA, a smartphone, a navigation device, or a hand-held computer with a wireless connection or link.
As illustrated in
An SMMU 104 comprises a translation buffer unit (TBU) 112 and a translation control unit (TCU) 114. TBU 112 stores recent translations of virtual memory to physical memory in, for example, a translation look-aside buffer (TLB). If a virtual-to-physical address translation is not available in TBU 112, TCU 114 may perform a page table walk executed by a page table walker module 118. In this regard, the main functions of TCU 114 include address translation, memory protection, and attribute control. Address translation is a method by which an input address in a virtual address space is translated to an output address in a physical address space. Translation information is stored in page tables 116 that SMMU 104 references to perform address translation. There are two main benefits of address translation. First, address translation allows memory clients 102 to address a large physical address space. For example, a 32-bit processing device (i.e., a device capable of referencing 2^32 address locations) can have its addresses translated such that memory client 102 may reference a larger address space, such as a 36-bit address space or a 40-bit address space. Second, address translation allows processing devices to have a contiguous view of buffers allocated in memory, despite the fact that memory buffers are typically fragmented, physically non-contiguous, and scattered across the physical memory space.
Page tables 116 contain information necessary to perform address translation for a range of input addresses. Although not shown in
The process of traversing page tables 116 to perform address translation is known as a “page table walk.” A page table walk is accomplished by using a sub-segment of an input address to index into the translation sub-table, and finding the next address until a block descriptor is encountered. A page table walk comprises one or more “steps.” Each “step” of a page table walk involves: (1) an access to a page table 116, which includes reading (and potentially updating) it; and (2) updating the translation state, which includes (but is not limited to) computing the next address to be referenced. Each step depends on the results from the previous step of the walk. For the first step, the address of the first page table entry that is accessed is a function of the translation table base address and a portion of the input address to be translated. For each subsequent step, the address of the page table entry accessed is a function of the page table entry from the previous step and a portion of the input address.
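The following C sketch illustrates, under simplifying assumptions (a four-level walk, 512 entries per table, 4 KB pages, and a simplified descriptor encoding), how each step of a page table walk derives the address of the next page table entry from the previous entry and a sub-segment of the input address. The helper read_pte and the descriptor bits used here are assumptions for illustration and do not reflect any particular architecture.

```c
#include <stdint.h>

#define LEVELS        4     /* number of table levels (assumed)  */
#define BITS_PER_LVL  9u    /* 512 entries per 4 KB table        */
#define PAGE_SHIFT    12u   /* 4 KB pages                        */

/* Hypothetical helper: read one 64-bit page table entry from memory. */
extern uint64_t read_pte(uint64_t pte_addr);

/* Walk the page tables for input address 'va', starting from the
 * translation table base address 'ttbr'. Each step indexes the current
 * table with a sub-segment of 'va' and derives the next table (or
 * block) address from the entry just read, mirroring the step-by-step
 * dependency described above. Offset handling is simplified to 4 KB
 * pages; block descriptors at higher levels would map larger regions. */
uint64_t page_table_walk(uint64_t ttbr, uint64_t va)
{
    uint64_t table = ttbr;
    for (int level = 0; level < LEVELS; level++) {
        unsigned shift = PAGE_SHIFT + BITS_PER_LVL * (LEVELS - 1 - level);
        uint64_t index = (va >> shift) & ((1u << BITS_PER_LVL) - 1);
        uint64_t pte   = read_pte(table + index * sizeof(uint64_t));

        if ((pte & 0x3) == 0x1 || level == LEVELS - 1)   /* block/page descriptor */
            return (pte & ~0xFFFull) | (va & 0xFFF);     /* output address        */

        table = pte & ~0xFFFull;   /* table descriptor: next-level table address */
    }
    return 0; /* translation fault (not modeled in this sketch) */
}
```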
Having generally described the components of computing system 100, various embodiments of systems and methods for reducing a worst-case memory latency will now be described. It should be appreciated that, in the computing system 100, a worst-case memory latency refers to the situation in which address translation results in successive “misses” by TBU 112, TCU 114, and last level cache 108 (i.e., a TBU/TCU/LLC miss). An exemplary embodiment of a TBU/TCU/LLC miss is illustrated by steps 1-10 in
In step 1, a memory client 102 requests translation of a virtual address. Memory client 102 may send a request identifying a virtual address to TBU 112. If a translation is not available in the TLB, TBU 112 sends the virtual address to TCU 114 (step 2). TCU 114 may access a translation cache 117 and, if a translation is not available, may perform a page table walk comprising a number of table walks (steps 3, 4, and 5) to get a final physical address in the system memory 110. It should be appreciated that some intermediate table walks may already be stored in translation cache 117. Steps 3, 4, and 5 are repeated for all translations that TCU 114 does not have available in translation cache 117. The worst-case memory latency occurs when steps 3, 4, and 5 go to last level cache 108/system memory 110 for a next page table entry. At step 6, TCU 114 may send the final physical address to TBU 112. Step 7 involves TBU 112 requesting the read-data at the final physical address which it received from TCU 114. Steps 8 and 9 involve getting the read-data at the final physical address to TBU 112. Step 10 involves TBU 112 returning the read-data from the physical address back to the requesting memory client 102. Table 1 below illustrates an approximate structural latency, representing a worst-case memory latency scenario, for each of the steps illustrated in the embodiment of
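A minimal sketch of the request path outlined in steps 1-10 follows, assuming hypothetical helper functions for the TBU TLB lookup, the TCU translation cache lookup, the page table walk, and the last level cache and system memory reads. It is intended only to make the ordering of the steps concrete; the fetches performed during the walk itself (steps 3-5) are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

extern bool     tlb_lookup(uint64_t va, uint64_t *pa);        /* TBU TLB (steps 1-2)   */
extern bool     tcu_cache_lookup(uint64_t va, uint64_t *pa);  /* TCU translation cache */
extern uint64_t tcu_page_table_walk(uint64_t va);             /* steps 3-5             */
extern bool     llc_read(uint64_t pa, void *buf);             /* step 8, cache hit     */
extern void     sysmem_read(uint64_t pa, void *buf);          /* step 8, cache miss    */

void client_read(uint64_t va, void *buf)
{
    uint64_t pa;

    if (!tlb_lookup(va, &pa)) {             /* steps 1-2: TBU miss            */
        if (!tcu_cache_lookup(va, &pa))     /* TCU translation cache miss     */
            pa = tcu_page_table_walk(va);   /* steps 3-5: page table walk     */
        /* step 6: final physical address returned to the TBU */
    }

    if (!llc_read(pa, buf))                 /* steps 7-8: try last level cache  */
        sysmem_read(pa, buf);               /* worst case: go to system memory  */
    /* steps 9-10: read-data returned to the requesting memory client */
}
```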
As described below in more detail, the page table walk may comprise two stages. A first stage may determine the intermediate physical address. A second stage may involve resolving data access permissions at the end of which the physical address is determined. After obtaining the intermediate physical address during the first stage, TCU 114 may not be able to send the intermediate physical address to TBU 112 until access permissions are cleared by TCU 114 based on subsequent table walks. Although the intermediate physical address may not be sent to TBU 112 until the second stage is completed, the method 200 enables the data at the intermediate physical address to be preloaded into last level cache 108 before the second stage is completed. When TBU 112 does get the final physical address after all access permission checking page table walks have completed, the data at the final physical address will be available in last level cache 108 instead of having to go to system memory 110. In this manner, the method 200 may eliminate the structural latency associated with step 8 (
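The early preload described above may be sketched as follows, assuming hypothetical helpers stage1_walk, stage2_walk, and llc_preload. As soon as the stage 1 walk yields the intermediate physical address, a preload of the data at that address is issued to the last level cache, concurrently with the stage 2 permission-checking walks, so that the data is already cached when the final physical address reaches TBU 112.

```c
#include <stdint.h>

/* Hypothetical helpers; names and signatures are assumptions. */
extern uint64_t stage1_walk(uint64_t va);               /* stage 1: returns the IPA          */
extern uint64_t stage2_walk(uint64_t ipa);              /* stage 2: permission checks, final PA */
extern void     llc_preload(uint64_t addr, unsigned bytes);

/* Two-stage translation with early preload. Per the description, the
 * final physical address corresponds to the intermediate physical
 * address, so preloading at the IPA places the eventually requested
 * data in the last level cache before stage 2 completes. */
uint64_t translate_with_preload(uint64_t va, unsigned preload_bytes)
{
    uint64_t ipa = stage1_walk(va);      /* IPA is known at the end of stage 1   */

    llc_preload(ipa, preload_bytes);     /* preload before stage 2 has completed */

    return stage2_walk(ipa);             /* final PA sent to the TBU afterwards  */
}
```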
In some embodiments, TBU 112 may be configured to provide page offset information or other “hints” to TCU 114. Where a lowest page granule size comprises, for example, 4 KB, the TCU 114 may fetch page descriptors without processing the lower 12 bits of an address. It should be appreciated that, for the last level cache 108 to perform a prefetch, the TBU 112 may pass on a bit range (11:6) of the address to TCU 114. It should be further appreciated that the bit range (5:0) of the address is not required as the cache line size in the last level cache 108 may comprise 64 B. In this regard, the page offset information or other “hint” may originate from the memory clients 102 or the TBU 112. In either case, the TBU 112 will pass the hint, which may comprise information such as a page offset and a pre-load size, on to the TCU 114.
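A minimal sketch of this hint extraction follows, assuming a 4 KB page granule and a 64 B cache line as stated above; only bits 11:6 of the address are retained, since bits 5:0 fall within a single cache line. The helper name hint_offset is hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT       12u   /* 4 KB lowest page granule          */
#define CACHE_LINE_SHIFT  6u   /* 64 B last level cache line        */

/* Extract the cache-line-aligned page offset "hint" (bits 11:6) that
 * the TBU may pass to the TCU along with the translation request.
 * Bits 5:0 are dropped because a preload is cache-line granular. */
static inline uint32_t hint_offset(uint64_t va)
{
    return (uint32_t)((va & ((1u << PAGE_SHIFT) - 1)) >> CACHE_LINE_SHIFT);
}
```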
As illustrated in
It should be appreciated that the subsequent rows in
The third row in
The fourth row in
The fifth row in
It should be appreciated that the last row in the page table walk represents stage 1 and stage 2 page tables associated with the system memory 110. In this regard, the previous rows illustrate that the page table walk resulted in a TCU miss and a last level cache miss. At reference numeral 462 in the page table walk, TCU 114 may determine the intermediate physical address (IPA) 415. In response, TCU 114 may generate the data cache preload command 306 (
The PTE snooper module 502 may then use the page descriptor information captured from the PTE read data to initiate a prefetch to the system memory 110. The offset provided by the TCU 114 to the last level cache 108 may be added to the page address captured through the PTE read data to calculate the final system memory address for which the prefetch needs to be initiated. The offset may be 12 bits, as the page size is 4 KB in granularity. For the prefetch initiated by TCU 114, the offset may be only 6 bits wide (e.g., bits[11:6]), as addresses may be cache-line aligned (64 bytes).
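The address computation described above may be sketched as follows; the helper name prefetch_address and the 48-bit physical address field assumed for the page descriptor are illustrative only.

```c
#include <stdint.h>

#define PAGE_SHIFT       12u
#define CACHE_LINE_SHIFT  6u

/* Combine the page frame address captured from the snooped PTE read
 * data with the cache-line-aligned offset hint (bits 11:6) supplied by
 * the TCU, producing the system memory address to prefetch into the
 * last level cache. The 47:12 address field is a simplification. */
static inline uint64_t prefetch_address(uint64_t pte, uint32_t hint_11_6)
{
    uint64_t page_addr = pte & 0x0000FFFFFFFFF000ull;   /* 4 KB-aligned page address */
    return page_addr | ((uint64_t)hint_11_6 << CACHE_LINE_SHIFT);
}
```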
Having described the page table walk sequences associated with the embodiments of
A display controller 728 and a touch screen controller 730 may be coupled to the CPU 702. In turn, the touch screen display 707 external to the on-chip system 722 may be coupled to the display controller 728 and the touch screen controller 730.
Further, as shown in
As further illustrated in
As depicted in
Alternative embodiments will become apparent to one of ordinary skill in the art to which the invention pertains without departing from its spirit and scope. Therefore, although selected aspects have been illustrated and described in detail, it will be understood that various substitutions and alterations may be made therein without departing from the spirit and scope of the present invention, as defined by the following claims.