This invention relates to the area of translation lookaside buffers and more specifically to translation lookaside buffer architectures for rapid design cycles.
Modern microprocessor systems typically utilize virtual addressing. Virtual addressing enables the system to effectively create a virtual memory space larger than an actual physical memory space. The process of breaking up the actual physical memory space into the virtual memory space is termed paging. Paging breaks up a linear address space of the physical memory space into fixed blocks called pages. Pages allow a large linear address space to be implemented with a smaller physical main memory plus cheap background memory. This configuration is referred to as “virtual memory.” Paging allows virtual memory to be implemented by managing memory in pages that are swapped to and from the background memory. Paging offers additional advantages, including reduced main memory fragmentation, selective memory write policies for different pages, and varying memory protection schemes for different pages. The presence of a paging mechanism is typically transparent to the application program.
The size of a page is a tradeoff between flexibility and performance. A small page size allows finer control over the virtual memory system but it increases the overhead from paging activity. Therefore many CPUs support a mix of page sizes, e.g. a particular MIPS implementation supports any mix of 4 kB, 16 kB, 64 kB, 256 kB, 1 MB, 4 MB and 16 MB pages.
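The page sizes listed above step by a factor of four, which the following minimal sketch makes concrete. It assumes the 32-bit virtual address used later in the description; the names `PAGE_SIZES`, `offset_bits` and `pages_in_space` are illustrative and do not come from the original text.

```python
# Page sizes named in the text, in bytes; each is 4x the previous one.
PAGE_SIZES = [4 << 10, 16 << 10, 64 << 10, 256 << 10, 1 << 20, 4 << 20, 16 << 20]

def offset_bits(page_size: int) -> int:
    """Number of low address bits that select a byte within one page."""
    return page_size.bit_length() - 1

def pages_in_space(page_size: int, va_bits: int = 32) -> int:
    """Number of pages needed to cover a 2**va_bits address space (assumed 32-bit)."""
    return (1 << va_bits) // page_size

if __name__ == "__main__":
    for size in PAGE_SIZES:
        print(size, offset_bits(size), pages_in_space(size))
```

Smaller pages mean more pages to track for the same address space, which is the overhead side of the tradeoff; larger pages leave fewer address bits to translate.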
A processor is then able to advantageously operate in the virtual address space using virtual addresses. Frequently, however, these virtual addresses must be translated into physical addresses, that is, actual memory locations. One way of accomplishing this translation of virtual addresses into physical addresses is the use of translation tables that are stored in main memory and regularly accessed. Translation tables are stored in main memory because they are typically large in size. Unfortunately, regularly accessing translation tables stored in main memory tends to slow overall system performance.
Modern microprocessor systems often use a translation lookaside buffer (TLB) to store or cache recently generated virtual-to-physical address translations in order to avoid the need to regularly access translation tables in main memory to accomplish address translation. A TLB is a special type of cache memory. As with other types of cache memories, a TLB is typically comprised of a relatively small amount of memory storage specially designed to be quickly accessible. A TLB typically incorporates both a tag array and a data array, as are provided in cache memories. Within the tag array, each tag line stores a virtual address. This tag line is then associated with a corresponding data line in the data array in which is stored a physical address translation for the virtual address. Thus, prior to seeking a translation of a virtual address from translation tables in main memory, a processor first refers to the TLB to determine whether the physical address translation of the virtual address is presently stored in the TLB. In the event that the virtual address and corresponding physical address are stored in the TLB, the TLB provides the corresponding physical address at an output port thereof, and a time- and resource-consuming access of main memory is avoided. To facilitate operation of the TLB and to reduce indexing requirements therefor, a content addressable memory (CAM) is typically provided within the TLB. CAMs are parallel pattern matching circuits. In a matching mode of operation the CAM permits searching of all of its data in parallel to find a match.
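The lookup order described above (TLB first, translation tables in main memory only on a miss) can be sketched as follows. This is a simplified software model, assuming 4 kB pages and using plain dictionaries for both the TLB and the translation table; `translate`, `tlb` and `page_table` are illustrative names, not part of the original text.

```python
PAGE_SHIFT = 12  # assumption: 4 kB pages, so the low 12 bits are the page offset

def translate(va: int, tlb: dict, page_table: dict) -> int:
    """Translate a virtual address, consulting the TLB before the page table."""
    vpn = va >> PAGE_SHIFT                    # virtual page number
    offset = va & ((1 << PAGE_SHIFT) - 1)     # byte offset within the page
    ppn = tlb.get(vpn)
    if ppn is None:
        # TLB miss: fall back to the (slow) translation table in main memory
        ppn = page_table[vpn]
        tlb[vpn] = ppn                        # cache the translation for next time
    return (ppn << PAGE_SHIFT) | offset
```

A second call with an address in the same page finds the translation in `tlb` and never touches `page_table`.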
Unfortunately, traditional TLBs require custom circuit design techniques to implement a CAM. Custom circuit design is disadvantageous since each TLB and associated CAM requires a significant design effort to implement in a processor system design. Of course, when a processor lacks CAM circuitry, signals from the processor propagate off chip to the CAM, thereby incurring delays.
It is therefore an object of this invention to provide a CAM architecture formed of traditional synthesisable circuit blocks.
In accordance with the invention there is provided a translation lookaside buffer (TLB) comprising: at least an input port for receiving a portion of a virtual address;
a random access memory; a set of registers; and, synthesisable logic for determining a hash value from the received portion of the virtual address and for comparing the hash value to a stored hash value within the set of registers to determine a potential that a physical address associated with the virtual address is stored within a line within the random access memory and associated with a register, from the set of registers, within which the hash value is stored.
In accordance with an aspect of the invention there is provided a translation lookaside buffer comprising: a random access memory; a first register associated with a line in the memory; and, a hashing circuit for receiving a virtual address other than a virtual address for which a translation is presently stored in the memory, for determining a hash value and for storing the hash value in the first register; and the hashing circuit for storing the virtual address and a translation therefor in the line in memory.
In accordance with yet another aspect of the invention there is provided a translation lookaside buffer comprising: RAM; and, synthesisable logic for determining from a virtual address at least one potential address within the RAM in fixed relation to which to search for a physical address associated with the virtual address, the at least one potential address being other than the one and only known address within the RAM in fixed relation to which the physical address associated with the virtual address is stored.
In accordance with yet another aspect of the invention there is provided a method of performing a virtual address lookup function for a translation lookaside buffer including RAM and synthesisable logic including the steps of: providing a virtual address to the synthesisable logic; hashing the provided virtual address to provide a hash result;
based on the hash result determining a memory location within the RAM relative to which is stored a virtual address identifier and a physical address related thereto;
comparing the virtual address to the virtual address identifier to determine if the physical address corresponds to the provided virtual address; and, when the physical address corresponds to the provided virtual address, providing the physical address as an output value.
The invention will now be described with reference to the drawings in which:
a illustrates a prior art transistor implementation of a SRAM circuit;
b illustrates a prior art transistor implementation of a CAM circuit;
a generally illustrates a translation lookaside buffer formed using synthesisable logic components and a random access memory;
b illustrates a translation lookaside buffer in more detail formed from synthesisable logic components;
c outlines the steps taken for operation of the TLB;
CAM circuits include storage circuits similar in structure to SRAM circuits. However, CAM circuits also include search circuitry offering an added benefit of a parallel search mode of operation, thus enabling searching of the contents of the CAM in parallel using hardware. When searching the CAM for a particular data value, the CAM provides a match signal upon finding a match for that data value within the CAM. A main difference between CAM and SRAM is that in a CAM, data is presented to the CAM representative of a virtual address and an address relating to the data is returned, whereas in a SRAM, an address is provided to the SRAM and data stored at that address is returned.
The cells of the CAM are arranged so that each row of cells holds a memory address and that row of cells is connected by a match line to a corresponding word line of the data array to enable access of the data array in that word line when a match occurs on that match line. In a fully associative cache each row of the CAM holds the full address of a corresponding main memory location and the inputs to the CAM require the full address to be input.
A prior art publication, entitled “A Reconfigurable Content Addressable Memory,” by Steven A Guccione et al., discusses the implementation of a CAM within an FPGA. As is seen in Prior Art
In the prior art publication the implementation of the CAM in an FPGA is discussed. Using gate level logic to implement a CAM often results in an undesirably large CAM. Flip-flops are used as the data storage elements within the CAM and as a result the size of the CAM circuit attainable using an FPGA is dependent upon the number of flip-flops available within the FPGA. Implementing the CAM in an FPGA quickly depletes many of the FPGA resources and as a result is not a viable solution. Unfortunately this has led prior designers to conclude that the CAM is only efficiently implemented at a transistor level.
The prior art publication also addresses implementing of a CAM using look up tables (LUTs) in an FPGA. Rather than using flip-flops within the FPGA to store the data to be matched, this implementation addresses the use of LUTs for storing of the data to be matched. By using LUTs rather than flip-flops a smaller CAM architecture is possible.
Unfortunately, forming CAMs from synthesisable elements is not easily done so prior art processors that offer CAM are provided with a CAM core within the processor. Providing a CAM core within the processor unfortunately makes the resulting circuit expensive because of the added design complexity. Such additional design complexity is ill-suited for small batch custom design processors.
Once the TLB data array 305 is accessed and a match is found between the VPN and an entry within the TLB data array 305a, the PPN 206 is retrieved and is provided to the cache memory 301 and used for comparison to the tag retrieved 302a from the tag array 302, a match being indicative of a cache “hit” 306. If a match is found between the VPN 203 and an entry within the TLB tag array 304a then a TLB hit signal 307 is generated. In this manner, the cache is only accessed using bits of the PPN 206. The above example illustrates the use of a direct mapped cache memory; however, the same translation of a VA to a PA is applicable to set-associative caches as well. When set-associative caches are used, those of skill in the art appreciate that the size of a cache way is less than or equal to the size of a virtual page.
Unfortunately, when a TLB is implemented in SRAM, an exhaustive search of the memory is required to support CAM functionality. Thus, when a TLB has storage for 1024 virtual addresses and their corresponding physical addresses, each address translation requires up to 1024 memory access and comparison operations. Such a CAM implementation is unworkable because the lookup time grows linearly with CAM size.
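The cost of emulating a CAM by searching an SRAM one row at a time can be illustrated with a short sketch. It models the stored virtual addresses as a Python list and counts reads; `sram_search` is a hypothetical helper written for this illustration only.

```python
def sram_search(tags: list, wanted: int):
    """Scan rows in order; return (index, reads) on a match, else (None, reads)."""
    for reads, tag in enumerate(tags, start=1):
        if tag == wanted:          # one memory access plus one comparison per row
            return reads - 1, reads
    return None, len(tags)         # a miss costs a read of every row
```

With 1024 stored addresses, a match in the last row or a miss costs 1024 reads, whereas a true CAM compares all rows in parallel.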
a generally illustrates a TLB 400 formed using synthesisable logic components 499 and a random access memory (RAM) 410. A VPN for translation is provided via a VPN_IN input port 450, where bits VPN_IN[31:12] are provided from the VA[31:0] to this input port 450. A page mask signal is provided via a CP0_PAGE_MASK input port 451. A CP0_TRANSLATION input signal is provided via a CP0_TRANSLATION input port 452. A TLB_TRANSLATION output signal is provided via TLB_TRANSLATION output port 453, in dependence upon a translation from a VA to a PA using the TLB 400.
b illustrates a TLB 400 in more detail formed from synthesizeable logic components, and in
The page mask encoder 408 is used for accepting the CP0_PAGE_MASK input signal on an input port thereof and for correlating this input signal to a 3-bit vector, MASK[2:0]. The 3-bit vector MASK[2:0] is further provided to a hashing circuit 406. The hashing circuit 406 receives VPN_IN[31:12] via a first input port 406a and MASK[2:0] via a second input port 406b. A hashed vector H_VPN[5:0] is provided from an output port 406c thereof via a hashing operation 481 of the hashing circuit 406. The hashed vector H_VPN[5:0] and the MASK[2:0] are further provided to each one of 48 registers 409, where each register consists of multiple flip-flops collectively referred to as 491. Each of the registers 409 has two output ports. A first output signal from a first output port thereof is provided to a comparator circuit 403. A second output signal from a second output port is provided to the second input port 406b on one of 48 hashing circuits 406. The first input port on this hashing circuit receives VPN_IN[31:12]. The hashing circuit 406 output port is coupled to one of 48 comparator circuits 403 for performing a comparison between the register output and the hashing circuit output signal. Each of the comparators, in dependence upon a comparison of two input signals, provides a ‘1’ if the signals are the same and a ‘0’ if they are different. The output signal hit_i from each of the 48 comparators is provided to one of 48 single-bit 2-input multiplexers 411. The output port of each of the multiplexers is coupled to a flip-flop 404. Each of the flip-flops 404 generates an output signal provided at the output port labeled try_i, where collectively these output signals try_i, for 0 ≤ i ≤ 47, are provided to a priority encoder circuit 401. The priority encoder circuit is further coupled to a binary decoder circuit 402, where the priority encoder circuit asserts a TLB_ENTRY[5:0] signal to the binary decoder circuit 402 and to the RAM 410.
Three output ports are provided within the TLB 400, an ENTRY_FOUND output port 454, an ENTRY_NOT_FOUND output port 455 and a TLB_TRANSLATION output port 453, for providing ENTRY_FOUND, ENTRY_NOT_FOUND, and TLB_TRANSLATION output signal, respectively.
An address for translation from a VA to a PA is stored in a random access memory (RAM) 410, with the RAM 410 preferably having 48 entries, in the form of lines. In use, whenever a new translation is to be performed, input signals VPN_IN, CP0_PAGE_MASK, and CP0_TRANSLATION are provided to the TLB circuit 400 via input ports 450, 451, and 452, respectively. Translations performed by the TLB are stored in RAM 410 for a given index, i. The given index identifies one of the lines 410a within the RAM that holds the translation to the PPN. The hashing circuit 406 computes the hash function H(VPN_IN, MASK) and stores the result in a corresponding 6-bit register h_i 490. The page mask is stored in the 3-bit register m_i 491.
When a translation is requested using the TLB, a VPN is provided via the input port 450 and the hash function H(VPN_IN, m_i) is computed for all i and compared to h_i. This yields a 48-bit vector 492 hit_0 . . . hit_47, which is subsequently loaded into a 48-bit register 493 try_0 . . . try_47. In order to determine whether the requested VPN_IN is present in the translation table stored in RAM 482, only those entries, or lines, in RAM are checked for which try_i is asserted. An entry in the 48-bit try vector is asserted if it yields a ‘1’ 483. Of course, there may be more than one bit asserted in the try vector, but the priority encoder 401 selects the entry with the lowest index to address entries within the RAM. The decoder 402 converts this index to a 48-bit one-hot vector 494 clr_0 . . . clr_47. When the clock pulse arrives from a clock circuit (not shown), the try vector is reloaded, except for the bit corresponding to the index just used to address the RAM, which is cleared. This process is repeated, one entry at a time 483. The process stops as soon as the requested entry is found 484, as indicated by the ENTRY_FOUND signal on the ENTRY_FOUND output port 454, or when all bits in the try vector are ‘0’. When all bits in the try vector are ‘0’, the ENTRY_NOT_FOUND signal is provided via the ENTRY_NOT_FOUND output port 455. In the first case the translation is successful and information for the translation is provided 485 from the RAM 410 using a TLB_TRANSLATION signal on the TLB_TRANSLATION output port 453. In the second case the translation is not successful and the TLB reports a TLB refill exception.
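The store and lookup sequence described above can be modelled in software as a rough sketch. The text does not specify the hash function H, so `hash_vpn` below is a hypothetical XOR-fold chosen only for illustration, and the treatment of the mask (each step of MASK dropping two low VPN bits) is likewise an assumption. The class name `SketchTLB` and its methods are illustrative; the loop stands in for what the hardware does one clock cycle at a time.

```python
ENTRIES = 48     # number of RAM lines, per the text
HASH_BITS = 6    # width of H_VPN[5:0], per the text

def hash_vpn(vpn: int, mask: int) -> int:
    """Hypothetical hash: drop 2*mask low VPN bits, then XOR-fold into 6 bits."""
    v = vpn >> (2 * mask)   # assumption: each mask step quadruples the page size
    h = 0
    while v:
        h ^= v & ((1 << HASH_BITS) - 1)
        v >>= HASH_BITS
    return h

class SketchTLB:
    def __init__(self):
        self.h = [None] * ENTRIES    # hash registers h_i (None = line unused)
        self.m = [0] * ENTRIES       # page mask registers m_i
        self.ram = [None] * ENTRIES  # line i holds (virtual address identifier, PPN)

    def store(self, i: int, vpn: int, mask: int, ppn: int) -> None:
        self.h[i] = hash_vpn(vpn, mask)
        self.m[i] = mask
        self.ram[i] = (vpn >> (2 * mask), ppn)

    def lookup(self, vpn: int):
        """Return (ppn, ram_reads), or (None, ram_reads) if no entry matches."""
        # hit_i: compare H(VPN_IN, m_i) with h_i for every line (parallel in hardware)
        try_bits = [self.h[i] is not None and hash_vpn(vpn, self.m[i]) == self.h[i]
                    for i in range(ENTRIES)]
        ram_reads = 0
        while any(try_bits):
            i = try_bits.index(True)          # priority encoder: lowest asserted index
            ram_reads += 1
            tag, ppn = self.ram[i]
            if tag == vpn >> (2 * self.m[i]):
                return ppn, ram_reads         # corresponds to ENTRY_FOUND
            try_bits[i] = False               # clr_i clears the bit just tried
        return None, ram_reads                # corresponds to ENTRY_NOT_FOUND
```

In this model a lookup whose hash matches no register returns without reading the RAM at all, and a hash collision costs one extra RAM read per colliding line that precedes the real entry.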
Preferably, the hash function H_VPN[5:0] is uniformly distributed for MASK[2:0] and for VPN_IN[31:12] input signals. In the case of a TLB miss, all entries within the RAM are looked up for which try_i is initially asserted. The number of cycles Nmiss is given by the following equation:
where p is the probability that a comparator output signal hit_i is asserted. The term:
gives the probability that exactly j bits in the try vector try_i are initially asserted. Having a uniform hashing function H with n bits at the output signal thereof, p = 2^−n, wherein the case of
In the case of a TLB hit, at least one access to the RAM 410 is required, as opposed to a TLB miss condition which is detected without accessing the RAM, since in a TLB miss condition the try vector try_i contains all zeros.
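As a sketch of the quantities described for the miss case, and assuming the 48 comparator outputs are independent and each asserted with probability p (an assumption made here for illustration), the probability of exactly j asserted bits and the resulting expected number of RAM lookups on a miss can be written as follows. Any fixed per-translation overhead in clock cycles is not recoverable from the text and is left out.

```latex
P\{\,j \text{ bits asserted}\,\} = \binom{48}{j}\, p^{j} (1-p)^{48-j},
\qquad
\mathbb{E}[\text{RAM lookups on a miss}]
  = \sum_{j=0}^{48} j \binom{48}{j}\, p^{j} (1-p)^{48-j} = 48\,p .
```

With a uniform 6-bit hash, p = 2^−6 and this expectation is 48/64 = 0.75 lookups per miss.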
The average number of cycles to perform a translation that hits in the TLB is given by the following formula:
For a TLB hit, there must be at least one ‘1’ in the try vector try_i. The only uncertainty is with the remaining elements within the vector. The variable k is used to represent the number of remaining entries that are set to ‘1’ within the try vector try_i, for k in the range 0 . . . 47. If k=0 then only one entry within the RAM is looked up. Therefore, since one clock cycle was used to find the translation in the first location for i=0, a total of two clock cycles are utilized to perform the translation. On average, it takes 2+k/2 cycles to return the requested translation from RAM 410.
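Averaging the 2 + k/2 figure over k gives a closed form, assuming each of the 47 remaining bits is independently asserted with probability p = 2^−n. This is one reading consistent with the paragraph above, offered as a sketch rather than as the formula of the original text.

```latex
N_{\text{hit}} \approx \sum_{k=0}^{47} \binom{47}{k}\, p^{k} (1-p)^{47-k} \left( 2 + \frac{k}{2} \right)
  = 2 + \frac{47\,p}{2}, \qquad p = 2^{-n} .
```

For n = 6 this evaluates to 2 + 47/128 ≈ 2.37 cycles, and the value tends to 2 as n grows.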
In terms of performing the translation and interrupt latency, the number of clock cycles required is examined for long lookup sequences, for instance having a k as high as 25 or more. The following relation:
gives the probability that the TLB will use 25 or more cycles to complete a translation. Table 2 lists, for a range of hash function widths (n), the average number of cycles it takes to find a translation Nhit, to detect a miss Nmiss and the probability that the TLB operation takes 25 cycles or more.
From Table 2 it is evident that P{N ≥ 25} is so small that even with a 4 bit hash function it takes more than 6000 years of continuous operation to run into a case where the TLB translation requires between 25 and 48 clock cycles.
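The order of magnitude of such a tail probability can be explored with a short calculation. The sketch below assumes that a long lookup requires at least 25 of the 48 try bits to be asserted and that the bits are independent with p = 2^−n; both are modelling assumptions, and the function name `prob_at_least` is illustrative. It does not attempt to reproduce the values of Table 2 or the time estimate, which also depend on the translation rate.

```python
from math import comb

ENTRIES = 48  # number of TLB lines, per the text

def prob_at_least(j_min: int, n_bits: int, entries: int = ENTRIES) -> float:
    """P(at least j_min of `entries` bits asserted), each with p = 2**-n_bits."""
    p = 2.0 ** -n_bits
    return sum(comb(entries, j) * p**j * (1 - p)**(entries - j)
               for j in range(j_min, entries + 1))

if __name__ == "__main__":
    for n in range(4, 9):
        print(n, prob_at_least(25, n))
```

Under these assumptions the probability falls steeply with each extra hash bit.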
The column Nhitq (“hit quick”) applies to the case where the VPN_IN is applied continuously to the TLB circuit 400. From this table it is evident that having n=5 or n=6 is sufficient when focusing on the most important number, which is Nhit. There is not much to be gained beyond 6 bits, since Nhit approaches 2.0 when n ≥ 20. A value of n=6 is used in the TLB circuit 400 since the hash function may not be very uniform. Therefore, the 6-bit hash function used within the TLB approximates the performance of a 5-bit truly uniform hash function.
Advantageously, when a VA is provided to the TLB it is propagated to the synthesized logic for each line and a result is provided, indicated by at least an asserted bit within the try_i vector of bits. Only those lines for which a result indicative of a match occurred are then physically accessed to provide the PPN. As such, only a small fraction of the TLB lines are accessed for the translation process, thus resulting in a substantial performance improvement.
Numerous other embodiments may be envisaged without departing from the spirit or scope of the invention.
The present patent application is a continuation of U.S. application Ser. No. 10/242,785, filed Sep. 13, 2002. The present patent application incorporates the above-identified application by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
4170039 | Beacom et al. | Oct 1979 | A |
4215402 | Mitchell et al. | Jul 1980 | A |
4638426 | Chang et al. | Jan 1987 | A |
4680700 | Hester et al. | Jul 1987 | A |
5136702 | Shibata | Aug 1992 | A |
5526504 | Hsu et al. | Jun 1996 | A |
5574875 | Stansfield et al. | Nov 1996 | A |
5574877 | Dixit et al. | Nov 1996 | A |
5752275 | Hammond | May 1998 | A |
5860147 | Doweck et al. | Jan 1999 | A |
6014732 | Naffziger | Jan 2000 | A |
6026476 | Rosen | Feb 2000 | A |
6205531 | Hussain | Mar 2001 | B1 |
6212603 | McInerney et al. | Apr 2001 | B1 |
6233652 | Mathews et al. | May 2001 | B1 |
6356990 | Aoki et al. | Mar 2002 | B1 |
6360220 | Forin | Mar 2002 | B1 |
6381673 | Srinivasan et al. | Apr 2002 | B1 |
6581140 | Sullivan et al. | Jun 2003 | B1 |
6625714 | Lyon | Sep 2003 | B1 |
6625715 | Mathews | Sep 2003 | B1 |
6687789 | Keller et al. | Feb 2004 | B1 |
6925464 | Kurupati | Aug 2005 | B2 |
20020073073 | Cheng | Jun 2002 | A1 |
20030037055 | Cheng et al. | Feb 2003 | A1 |
20030074537 | Pang et al. | Apr 2003 | A1 |
20030233515 | Honig | Dec 2003 | A1 |
20040170039 | MacDonald et al. | Sep 2004 | A1 |
Number | Date | Country |
---|---|---|
0150272 | Jul 2001 | WO |
Entry |
---|
Yamagata, et al., “A 288-kb Fully Parallel Content Addressable Memory Using a Stacked-Capacitor Cell Structure”, IEEE Journal of Solid-State Circuits, vol. 27, No. 12, p. 1927-1933, Dec. 1992. |
Guccione, et al., paper entitled “A Reconfigurable Content Addressable Memory”, Xilinx Inc., published May 18, 2000. |
Number | Date | Country |
---|---|---|
20120066475 A1 | Mar 2012 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 10242785 | Sep 2002 | US |
Child | 13298800 | | US |