Portable computing devices (e.g., cellular telephones, smart phones, tablet computers, portable game consoles, wearable devices, and other battery-powered devices), Internet of things (IoT) devices (e.g., smart home appliances, automotive and other embedded systems), and other computing devices continue to offer an ever-expanding array of features and services, and provide users with unprecedented levels of access to information, resources, and communications. To keep pace with these service enhancements, such devices have become more powerful and more complex. Smart computing devices now commonly include a system on chip (SoC) comprising an application processor and one or more non-application SoC processing devices embedded on a single substrate. The application processor and the non-application SoC processing devices comprise memory clients that read data from and store data in a system memory.
The application processor may include a memory management unit (MMU) and the non-application SoC processing device(s) may include system MMUs configured to perform processing operations with reference to virtual memory addresses. In the process of supporting various virtual memory maintenance or optimization operations (e.g., changing address mappings, page permissions, etc.), stale page table entries in a translation lookaside buffer (TLB) cache may need to be invalidated. While invalidation of stale page table entries is a common process in systems with SMMUs, it comes at the cost of increased translation latencies during and after invalidation, particularly for real-time sensitive memory clients, such as, for example, a display processor and a camera digital signal processor. In complex SoCs, the amount of time to perform the TLB invalidation process is relatively long due to larger TLB cache sizes. For such systems, the invalidation duration parameter may be critical for real-time clients, which may not be able to sustain high translation latencies for longer durations, which may eventually cause display overrun or camera overflow.
Accordingly, there is a need in the art for improved systems for performing page table entry invalidation in systems with larger TLB cache sizes without unnecessarily increasing hardware cost.
Systems, methods, and computer programs are disclosed for emulating single cycle translation lookaside buffer invalidation. One embodiment of a method comprises defining a translation lookaside buffer (TLB) cache marking variable comprising a first marker value and a second marker value. A context bank marker associated with a translation context bank is initiated with one of the first marker value and the second marker value. A TLB cache entry table is stored in a memory. The TLB cache entry table specifies whether each of a plurality of TLB cache entries associated with the translation context bank has a corresponding entry marker set to the first marker value or the second marker value. In response to a TLB invalidate command associated with the translation context bank, the context bank marker is changed from the one of the first marker value and the second marker value to the other of the first marker value and the second marker value prior to initiating TLB invalidation. During the TLB invalidation associated with the translation context bank, the TLB cache entry table is accessed to determine whether each of the plurality of TLB cache entries has the corresponding entry marker set to the first marker value or the second marker value. If the entry marker for the TLB cache entry is set to a same value as the context bank marker, the method bypasses invalidation for the TLB cache entry and changes the entry marker to a different value than the context bank marker. If the entry marker for the TLB cache entry is set to a different value than the context bank marker, the method determines that the TLB cache entry comprises a stale entry and invalidates the TLB cache entry.
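The marking scheme summarized above can be modeled in software. The following Python sketch is illustrative only (the class and field names, such as TlbModel and cb_marker, are hypothetical and not part of the disclosure); it models a single invalidation round for one translation context bank:

```python
class TlbModel:
    """Illustrative software model of single-cycle TLB invalidation emulation
    for one translation context bank, using a 1-bit marking variable."""

    def __init__(self, num_entries):
        # Context bank marker: toggled on every TLB invalidate command.
        self.cb_marker = 0
        # TLB cache entry table: each entry carries its own entry marker.
        self.entries = [None] * num_entries

    def update(self, index, va, pa):
        # New entries are tagged with the *current* context bank marker,
        # so entries written after an invalidate command survive the sweep.
        self.entries[index] = {"valid": True, "marker": self.cb_marker,
                               "va": va, "pa": pa}

    def invalidate(self):
        # The "single-cycle" step: toggling one bit logically invalidates
        # every entry still tagged with the old marker value.
        self.cb_marker ^= 1

    def sweep(self):
        # Background walk of the entry table during the TLB invalidation.
        for entry in self.entries:
            if entry is None or not entry["valid"]:
                continue
            if entry["marker"] == self.cb_marker:
                # Bypass invalidation; per the described method, the entry
                # marker is then set to differ from the context bank marker.
                entry["marker"] = self.cb_marker ^ 1
            else:
                # Marker differs from the context bank marker: stale entry.
                entry["valid"] = False

# Hypothetical usage: populate, invalidate, update during the sweep window.
tlb = TlbModel(4)
for i in range(4):
    tlb.update(i, va=0x1000 + i, pa=0x2000 + i)
tlb.invalidate()                  # context bank marker toggles 0 -> 1
tlb.update(0, va=0x9000, pa=0xA000)  # new entry tagged with marker 1
tlb.sweep()                       # entry 0 bypassed; entries 1-3 invalidated
```

Because the invalidate step only toggles a single bit before the sweep begins, the cache behaves as if it were invalidated in one cycle: look-ups and updates tagged with the new marker value can proceed while the sweep lazily reclaims stale entries.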
Another embodiment of a system comprises an application processor, one or more memory clients, and a single-cycle translation lookaside buffer (TLB) invalidation emulator component. The application processor comprises a memory management unit (MMU) having a first TLB. The one or more memory clients have a corresponding system memory management unit (SMMU) comprising a corresponding second TLB. The single-cycle TLB invalidation emulator component is in communication with the MMU and the SMMU, and comprises logic configured to: define a translation lookaside buffer (TLB) cache marking variable comprising a first marker value and a second marker value; initiate a context bank marker associated with a translation context bank with one of the first marker value and the second marker value; store in a memory a TLB cache entry table specifying whether each of a plurality of TLB cache entries associated with the translation context bank has a corresponding entry marker set to the first marker value or the second marker value; in response to a TLB invalidate command associated with the translation context bank, change the context bank marker from the one of the first marker value and the second marker value to the other of the first marker value and the second marker value prior to initiating TLB invalidation; and, during the TLB invalidation associated with the translation context bank: access the TLB cache entry table to determine whether each of the plurality of TLB cache entries has the corresponding entry marker set to the first marker value or the second marker value; if the entry marker for the TLB cache entry is set to a same value as the context bank marker, bypass invalidation for the TLB cache entry and change the entry marker to a different value than the context bank marker; and if the entry marker for the TLB cache entry is set to a different value than the context bank marker, determine that the TLB cache entry comprises a stale entry and invalidate the TLB cache entry.
In the Figures, like reference numerals refer to like parts throughout the various views unless otherwise indicated. For reference numerals with letter character designations such as “102A” or “102B”, the letter character designations may differentiate two like parts or elements present in the same Figure. Letter character designations for reference numerals may be omitted when it is intended that a reference numeral encompass all parts having the same reference numeral in all Figures.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
The terms “component,” “database,” “module,” “system,” and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes, such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems by way of the signal).
The term “application” or “image” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, an “application” referred to herein, may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
The term “content” may also include files having executable content, such as: object code, scripts, byte code, markup language files, and patches. In addition, “content” referred to herein, may also include files that are not executable in nature, such as documents that may need to be opened or other data files that need to be accessed.
The term “task” may include a process, a thread, or any other unit of execution in a device.
The term “virtual memory” refers to the abstraction of the actual physical memory from the application or image that is referencing the memory. A translation or mapping may be used to convert a virtual memory address to a physical memory address. The mapping may be as simple as 1-to-1 (e.g., physical address equals virtual address), moderately complex (e.g., a physical address equals a constant offset from the virtual address), or the mapping may be complex (e.g., every 4 KB page mapped uniquely). The mapping may be static (e.g., performed once at startup), or the mapping may be dynamic (e.g., continuously evolving as memory is allocated and freed).
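For instance, a page-granular dynamic mapping of the kind described above might be modeled as follows. This is a minimal sketch assuming 4 KB pages; the table contents and names are hypothetical:

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

# Page table: virtual page number -> physical page number (hypothetical data).
page_table = {0x00080: 0x1F3A0, 0x00081: 0x04C21}

def translate(virtual_addr):
    """Convert a virtual address to a physical address via the page table."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    ppn = page_table[vpn]  # an unmapped page raises KeyError (a "fault")
    return ppn * PAGE_SIZE + offset
```

The page offset passes through unchanged; only the page number is remapped, which is what makes every 4 KB page independently relocatable.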
In this description, the terms “communication device,” “wireless device,” “wireless telephone”, “wireless communication device,” and “wireless handset” are used interchangeably. With the advent of third generation (“3G”), fourth generation (“4G”), and fifth generation (“5G”) wireless technology, greater bandwidth availability has enabled more portable communication devices with a greater variety of wireless capabilities. Therefore, the term “portable communication device” or “portable computing device” may refer to cellular telephones, smart phones, tablet computers, portable game consoles, wearable devices, Internet of things (IoT) devices (e.g., smart home appliances, automotive and other embedded systems), or other battery-powered computing devices.
As illustrated in
It should be further appreciated that application processor 102 may be independent from one or more additional processing devices residing on the SoC that may access system memory 134. In this regard, the independent SoC processing device(s) may be referred to as “non-application” processing device(s) or memory client(s) because they are distinct from application processor 102. In the embodiment of
Application processor 102 and non-application SoC processing device(s) (e.g., memory clients 104) may be configured to perform processing operations with reference to virtual memory addresses. In this regard, application processor 102 comprises a memory management unit (MMU) 142 and each non-application SoC processing device may comprise (or may be electrically coupled to) a subsystem MMU. In the embodiment of
MMU 142 comprises logic (e.g., hardware, software, or a combination thereof) that performs address translation for application processor 102. Although for purposes of clarity MMU 142 is depicted in
As illustrated in
Page tables 136 contain information necessary to perform address translation for a range of input addresses. Although not shown in
The process of traversing page tables 136 to perform address translation is known as a “page table walk.” A page table walk is accomplished by using a sub-segment of an input address to index into the translation sub-table, and finding the next address until a block descriptor is encountered. A page table walk comprises one or more “steps.” Each “step” of a page table walk involves: (1) an access to a page table 136, which includes reading (and potentially updating) it; and (2) updating the translation state, which includes (but is not limited to) computing the next address to be referenced. Each step depends on the results from the previous step of the walk. For the first step, the address of the first page table entry that is accessed is a function of the translation table base address and a portion of the input address to be translated. For each subsequent step, the address of the page table entry accessed is a function of the page table entry from the previous step and a portion of the input address. In this manner, the page table walk may comprise two stages. A first stage may determine the intermediate physical address. A second stage may involve resolving data access permissions at the end of which the physical address is determined.
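As a concrete illustration, the stepwise structure of a page table walk might be sketched as a two-level radix walk. The 10-bit index widths, 4 KB page size, and table layout here are assumptions for illustration, not taken from this description:

```python
PAGE_SHIFT = 12          # assumed 4 KB pages
LEVEL_BITS = 10          # assumed 10-bit index per level
LEVEL_MASK = (1 << LEVEL_BITS) - 1

def page_table_walk(root, memory, va):
    """Each step (1) accesses a page table and (2) updates the translation
    state by computing the next address from the descriptor just read."""
    table = root
    for level in (1, 0):  # two steps: level-1 table, then level-0 table
        # A sub-segment of the input address indexes into the sub-table.
        index = (va >> (PAGE_SHIFT + level * LEVEL_BITS)) & LEVEL_MASK
        descriptor = memory[table][index]      # step part 1: table access
        if descriptor is None:
            raise LookupError("translation fault")
        table = descriptor                     # step part 2: next address
    # After the last step, 'table' holds the physical page base address.
    return table + (va & ((1 << PAGE_SHIFT) - 1))
```

Note how each step's table address is a function of the previous step's descriptor and a portion of the input address, exactly as described above; only the first step uses the translation table base address.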
As further illustrated in
As further illustrated in
It should be appreciated that single-cycle PTE invalidation emulation provided via TLB cache marking scheme 108 may advantageously provide the benefits of single-cycle invalidation behavior for systems employing larger TLB cache sizes and real-time sensitive memory clients (e.g., display, camera, etc.) without unnecessarily increasing hardware costs.
As illustrated in
System 100 further comprises a memory (e.g., static random access memory (SRAM) 112) for storing a TLB cache entry table 114. As further illustrated in the embodiment of
As mentioned above, conventional TLB invalidation may take multiple clock cycles to complete because the hardware has to go through each index of the TLB cache for context bank 210 and virtual address values, and then invalidate if there is a match. For complex systems, the TLB cache size may be relatively large (e.g., 8K indexes). For example, in a 4-way associative cache, the minimum and maximum invalidation durations may be in the range of 4K-6K clock cycles. During the duration of 6K cycles of context bank 210 (CB) based invalidation, the content of the TLB cache remains unusable for the corresponding context bank 210 in order to avoid a “hit” to the stale entries that are yet to be invalidated. Furthermore, any update request to the same context bank 210 may be dropped during the entire 6K clock cycles. The TLB cache warmup with new valid entries may start only after the completion of invalidation, which may cause performance degradation for real-time clients (e.g., display, camera) because they may have started sending traffic based on new page tables much earlier. As a result, most of the traffic from these clients may get a “miss” even though they could have benefited from a “hit” on the newly updated entries. Conventional single-cycle invalidation may address these issues, but it comes at the expense of additional hardware cost because the tag for each index of the TLB cache must be “flopped” separately.
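The scan cost described above can be sketched as follows. This is an assumed model of conventional scan-based invalidation (not the disclosed scheme), and the way and index counts are illustrative:

```python
def conventional_cb_invalidate(tlb, context_bank):
    """Assumed model of conventional context-bank invalidation: every index
    of every way is compared, so the cost scales with the TLB cache size."""
    cycles = 0
    for way in tlb:                  # e.g., 4 ways in a 4-way cache
        for entry in way:            # e.g., thousands of indexes per way
            cycles += 1              # one compare per entry visited
            if entry["valid"] and entry["cb"] == context_bank:
                entry["valid"] = False
    # The cache is unusable for this context bank for all of these cycles.
    return cycles
```

With 8K indexes the scan runs for thousands of cycles, during which look-ups must be blocked to avoid hits on not-yet-invalidated stale entries; the marking scheme avoids exactly this blocking window.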
System 100 combines single-cycle PTE invalidation emulation via TLB cache marking scheme 108 and TLB cache entry marker table 114 to advantageously provide the benefits of single-cycle invalidation behavior for systems employing larger TLB cache sizes and real-time sensitive memory clients (e.g., display, camera, etc.) without unnecessarily increasing hardware costs. The 1-bit marker scheme illustrated in
Furthermore, it should be appreciated that single-cycle PTE invalidation emulation via TLB cache marking scheme 108 and TLB cache entry marker table 114 advantageously provides various performance benefits. For example, as described below in more detail, the TLB cache marking scheme 108 allows TCU cache updates and look-ups while invalidation is in progress for the same translation context bank 210 and ignores invalidation for any newly updated entries in the TCU cache. This ensures that the TCU cache is usable for both updates and look-ups even during self-invalidation. In addition, interleaved TBU traffic may be advantageously processed by, for example, data sharing at the TCU cache during self-invalidation, which may avoid redundant page table walks for a second TBU look-up. It should be further appreciated that the managed values of the context bank marker (column 202) and the associated entry markers (column 206) may be used to distinguish between new and old/stale entries in the cache. In response to a TLB invalidation command, the context bank marker value may be toggled to a different value prior to initiating invalidation. In this regard, any page table walk in progress before the start of invalidation may not get updated in the TBU and TCU caches. Any look-up of newly updated entries after the start of invalidation may get a “hit” during TCU cache look-up due to the modified context bank marker value. During invalidation, only entries matching the prior context bank marker value will get invalidated, and all newly updated entries will not get invalidated.
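One way to realize the look-up behavior described above is to fold the marker comparison into the hit condition, as in this hypothetical sketch (the function and field names are illustrative assumptions):

```python
def lookup_hits(entry, cb_marker, context_bank, va):
    """Hypothetical hit check during an in-progress invalidation: an entry
    only hits if its entry marker matches the current context bank marker,
    so stale entries awaiting the sweep can never produce a false hit."""
    return (entry["valid"]
            and entry["cb"] == context_bank
            and entry["va"] == va
            and entry["marker"] == cb_marker)

# Hypothetical usage: after the invalidate command toggles cb_marker to 1,
# a pre-toggle entry (marker=0) misses while a post-toggle entry (marker=1)
# hits, even though the background sweep has not yet reached either entry.
stale = {"valid": True, "cb": 0, "va": 0x1000, "marker": 0}
fresh = {"valid": True, "cb": 0, "va": 0x2000, "marker": 1}
```

Folding the one extra marker compare into the existing tag match is what lets the cache stay live for look-ups throughout the entire invalidation window.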
Having generally described the operation of the TLB cache marking scheme 108, an exemplary method 300 will be described with reference to
In response to a TLB invalidate command associated with the translation context bank CB0 (decision block 308), the corresponding context bank marker is changed (block 310) from the one of the first marker value and the second marker value to the other of the first marker value and the second marker value prior to initiating TLB invalidation. For example, if prior to receiving a TLB invalidate command the CB0 has an associated context bank marker with the first marker value (0), the context marker may be toggled (or otherwise changed in the event of a multi-bit scheme) to the second marker value (1) before initiating TLB invalidation. At block 312, TLB invalidation may be initiated. During TLB invalidation associated with the context bank 210 (CB0), at block 314, TLB cache entry marker table 114 may be accessed to determine whether each of the plurality of TLB cache entries associated with context bank 210 (CB0) has the corresponding entry marker set to the first marker value or the second marker value. At decision block 316, the system 100 determines whether the entry marker for each TLB cache entry has the same value as the current context bank marker (which was toggled/changed upon initiation of the TLB invalidation in block 310). If the entry marker for the TLB cache entry is set to a same value as the current context bank marker, the method 300 bypasses invalidation (block 318) for the TLB cache entry and changes the entry marker to a different value than the current context bank marker. If the entry marker for the TLB cache entry is set to a different value than the current context bank marker, the method 300 determines that the TLB cache entry comprises a stale entry and invalidates the TLB cache entry (block 320). As illustrated at decision block 322, upon a last TLB entry being processed, flow may return to, for example, block 306 and/or 308 for additional updates to the TLB cache entry marker table 114.
At block 412, a CB0 invalidation command may be initiated to begin CB0 invalidation. In response to the CB0 invalidation command, the context bank marker value may be toggled from CB0_marker=0 to CB0_marker=1. At block 414, the first cache entry may be invalidated because the entry marker has a value (entry_marker=0) and the context bank marker has a different value (CB0_marker=1). CB0 invalidation may progress through timing window 404 for additional cache entries. During CB0 invalidation, further cache updates (block 416) and look-ups (block 418) may be performed. At block 416, a cache update may be performed to the first cache entry with CB0_marker=1. The updated first cache entry has the following values: {valid=1, entry_marker=1, CB=0, virtual address}. At block 418, a look-up operation to an index=0 may be performed with CB0_marker=1. Upon completion of CB0 invalidation (block 420), the context marker has the value CB0_marker=1. After CB0 invalidation during timing window 406, further cache look-ups (block 422) and updates (block 424) may be performed. At block 422, a look-up to index=1 may be performed with CB0_marker=1. At block 424, a cache update may be performed to a second cache entry having an associated entry marker stored in SRAM 110 at location SRAM(1). After the cache update, the second cache entry has the following values: {valid=1, entry_marker=1, CB=0, virtual address}.
A display controller 528 and a touch screen controller 530 may be coupled to the CPU 502. In turn, the touch screen display 507 external to the on-chip system 522 may be coupled to the display controller 528 and the touch screen controller 530.
Further, as shown in
As further illustrated in
As depicted in
Alternative embodiments will become apparent to one of ordinary skill in the art to which the invention pertains without departing from its spirit and scope. Although selected aspects have been illustrated and described in detail, it will be understood that various substitutions and alterations may be made therein without departing from the spirit and scope of the present invention, as defined by the following claims.