The field of invention relates generally to hash mapping; and, more specifically, to hash mapping with a secondary table having linear probing.
A cache is often implemented with a hash map.
Because the key K1 can be a random value and because the cache's resources are tied to memory 105 having a range of addressing space 106 and corresponding data space 107, the various keys associated with the various items of data must be able to “map” to the cache's addressing space 106. A hash function 104 is used to perform this mapping. According to the depiction of
Thus, in order to store the data D1 in cache, its key K1 is provided to a hashing function 104. The hashing function 104 produces the appropriate address A1. The data structure containing the key K1 and the data D1 are then stored in the memory resources used to implement the cache (hereinafter referred to as a “table” 105 or “hash table” 105) at the A1 address.
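The store operation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table size, the byte-sum hashing function and the names `hash_address`, `store` and `load` are hypothetical and are not taken from the specification.

```python
TABLE_SIZE = 8  # assumed size of the addressing space

def hash_address(key: str) -> int:
    """Hypothetical hashing function: folds the key's bytes into a table address."""
    return sum(key.encode()) % TABLE_SIZE

table = [None] * TABLE_SIZE  # each slot holds a (key, data) pair or None

def store(key: str, data: str) -> int:
    """Store the data structure (key plus data) at the hashed address."""
    address = hash_address(key)
    table[address] = (key, data)
    return address

def load(key: str):
    """Return the data stored under key, or None if the slot holds another key."""
    entry = table[hash_address(key)]
    return entry[1] if entry is not None and entry[0] == key else None

a1 = store("K1", "D1")  # A1 is whatever address the hashing function yields
```

Storing the key alongside the data is what lets `load` tell a genuine match from a different key that happens to hash to the same address.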
Frequently the number of separate items of data which “could be” stored in the hash table 105 is much greater than the size of the hash table's addressing space 106. The mathematical behavior of the hashing function 104 is such that different key values can map to the same hash table address. For example, as depicted in
b shows a first approach that involves multiple hash tables. When a “collision” occurs, that is, when an attempt is made to store a second data structure that maps to a table location (also referred to as a “slot”) where a first data structure already resides (because the key values K1, K2 for the pair of data structures map to the same table address A1 and the first data structure was stored into the hash table before the second), the second data structure is stored into a next, “deeper” hash table. Here, the first table that is looked to is referred to as the primary table 109 and the second table that is looked to is referred to as the secondary table 110.
As an example, consider the situation depicted in
The second hashing function 112 produces a second address value AS1 from the K2 key that is to be used for accessing the secondary table 110. According to this example, the AS1 slot is empty and the second data structure is therefore stored in the AS1 slot of the secondary table 110. Depending on implementation, hash functions 111 and 112 may be the same hashing function or may be different hashing functions. Reading/writing a data structure from/to the primary table 109 should consume less time than reading/writing a data structure from/to the secondary table 110 because at least an additional table access operation is performed, if not an additional hash function operation.
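The primary/secondary approach can be sketched as follows. The table sizes and both hashing functions are assumptions chosen for illustration; in particular, `primary_hash` is deliberately contrived so that K1 and K2 collide, mirroring the example in the text.

```python
PRIMARY_SIZE = 4    # assumed
SECONDARY_SIZE = 8  # assumed

def primary_hash(key: str) -> int:
    """Contrived first hashing function: equal-length keys collide."""
    return len(key) % PRIMARY_SIZE

def secondary_hash(key: str) -> int:
    """Hypothetical second hashing function (may differ from the first)."""
    return sum(key.encode()) % SECONDARY_SIZE

primary = [None] * PRIMARY_SIZE
secondary = [None] * SECONDARY_SIZE

def insert(key: str, data: str):
    """Try the primary table first; on a collision fall through to the secondary."""
    a = primary_hash(key)
    if primary[a] is None:
        primary[a] = (key, data)
        return ("primary", a)
    s = secondary_hash(key)          # the extra work that makes this path slower
    if secondary[s] is None:
        secondary[s] = (key, data)
        return ("secondary", s)
    raise RuntimeError("collision in both tables; not handled in this sketch")

first = insert("K1", "D1")   # lands in the primary table
second = insert("K2", "D2")  # collides at A1, lands in the secondary table
```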
c shows another technique referred to as linear probing. According to the linear probing technique, rather than use a secondary table, when a collision occurs, an offset ΔA is summed with the address A1 produced by the hashing function to produce a second slot address A1+ΔA in the hash table 113 where the data structure that seeks to be inserted into the table 113 can be placed.
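A minimal sketch of linear probing within a single table is given below. The offset value, the table size, the wrap-around at the end of the table and the contrived hashing function are all assumptions made for the example.

```python
TABLE_SIZE = 7  # assumed
DELTA = 1       # assumed value of the offset ΔA

def hash_address(key: str) -> int:
    """Contrived hashing function: equal-length keys collide on purpose."""
    return len(key) % TABLE_SIZE

table = [None] * TABLE_SIZE

def insert(key: str, data: str) -> int:
    """Step by DELTA from the hashed address until an empty slot is found."""
    address = hash_address(key)
    for _ in range(TABLE_SIZE):
        if table[address] is None:
            table[address] = (key, data)
            return address
        address = (address + DELTA) % TABLE_SIZE  # wrap-around is an assumption
    raise RuntimeError("table is full")

slots = [insert(k, d) for k, d in [("K1", "D1"), ("K2", "D2"), ("K3", "D3")]]
```

Here the three colliding keys end up at A1, A1+ΔA and A1+2ΔA respectively.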
According to the exemplary depiction of
A method is described that involves hashing a key value to locate a slot in a primary table, then, hashing the key value to locate a first slot in a secondary table, then, linearly probing the secondary table starting from the first slot.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
a (prior art) shows a simple hash map;
b (prior art) shows a hash map implemented with primary and secondary tables;
c (prior art) shows a hash map implemented with linear probing;
a shows a deletion process for the hash map population of
b shows an insertion process for a hash map having a secondary table with linear probing that respects a deletion mark;
a and 6b show a hash table having a secondary table with linear probing where respective slots are flipped;
Referring to
When the data structure with key value K2 is to be inserted into the hash table, a hashing function is first performed 301 to produce a primary table 201 address value A1. Because of the prior insertion of the first data structure into the A1 slot of the primary table, a collision occurs at the primary table 201 (i.e., the answer to inquiry 302 is “yes”). According to the particular embodiment being described herein, a second hash function is performed 304 on the K2 key to identify the first slot AS1 in the secondary table 202 that the K2 key maps to. Again, assuming the entire secondary table 202 is empty, the AS1 slot of the secondary table 202 will not be occupied with any other data structure 305. As such, the second data structure will be inserted 306 into the AS1 slot of the secondary table 202.
When the data structure with key value K3 is to be inserted into the hash table, a hashing function is first performed 301 to produce a primary table 201 address value A1. Because of the prior insertion of the first data structure into the A1 slot of the primary table, a collision occurs at the primary table 201 (i.e., the answer to inquiry 302 is “yes”). A second hash function is performed 304 on the K3 key to identify the first slot AS1 in the secondary table 202 that the K3 key maps to. Because of the prior insertion of the second data structure into the AS1 slot of the secondary table 202, a collision occurs at the first slot in the secondary table 202 (i.e., the answer to inquiry 305 is “yes”).
As such, consistent with a linear probing scheme, the AS1 value is summed 307 with a fixed value ΔA to identify a next secondary slot AS1+ΔA within the secondary table 202 where the third data structure can be entered. According to the specific embodiment observed in
The process for the fourth data structure having the K4 key value will be the same as described above for the third data structure, except that the AS1+ΔA slot of the secondary table 202 will be occupied by the third data structure (i.e., the answer to the initial inquiry 308 will be “yes”). As such, again consistent with a linear probing scheme, another ΔA offset will be added 307 to the AS1+ΔA slot value producing a next slot address of AS1+2ΔA. The entry at the AS1+2ΔA slot will be empty 308 resulting in the insertion of the fourth data structure into the AS1+2ΔA slot.
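The insertion sequence for K1 through K4 can be traced with the sketch below. It assumes a secondary table of 31 slots and an offset of 7, and it uses contrived hashing functions so that all four keys map to the same A1 and AS1, as in the example; key matching and deletion marks are omitted here, and all function names are hypothetical.

```python
PRIMARY_SIZE = 4      # assumed
SECONDARY_SIZE = 31   # assumed secondary table size
DELTA = 7             # assumed linear probing offset ΔA

def primary_hash(key: str) -> int:
    return len(key) % PRIMARY_SIZE     # contrived: K1..K4 all map to A1

def secondary_hash(key: str) -> int:
    return len(key) % SECONDARY_SIZE   # contrived: K2..K4 all map to AS1

primary = [None] * PRIMARY_SIZE
secondary = [None] * SECONDARY_SIZE

def insert(key: str, data: str):
    a = primary_hash(key)                    # hash into the primary table
    if primary[a] is None:                   # no collision at the primary slot
        primary[a] = (key, data)
        return ("primary", a)
    s = secondary_hash(key)                  # hash into the secondary table
    for _ in range(SECONDARY_SIZE):
        if secondary[s] is None:             # empty slot: insert here
            secondary[s] = (key, data)
            return ("secondary", s)
        s = (s + DELTA) % SECONDARY_SIZE     # occupied: add ΔA and retry
    raise RuntimeError("secondary table overflow")

placements = [insert(f"K{i}", f"D{i}") for i in range(1, 5)]
```

With AS1 equal to 2 in this contrived setup, K2, K3 and K4 occupy slots AS1, AS1+ΔA and AS1+2ΔA, and the next probe slot AS1+3ΔA remains empty.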
According to the technique described herein, no flag or other identifier needs to be written into the third or fourth data structure to indicate the end of the K1-K2-K3-K4 “chain” of stored data structures. This is so because, according to the scheme being described, the next secondary table “linear probe” slot after the last data structure entry in the secondary table is guaranteed to be empty. As such, as observed in
In an embodiment, in order to guarantee the existence of an empty space at the end of a chain as described just above, the initial size of the secondary table (i.e., the number of slots in the secondary table) is made to be a prime number as is the linear probing offset ΔA. For example, in a further embodiment, the initial size of the secondary table is set equal to 2^p−1 where p=5, 7, 11, 17, or 19 (i.e., different secondary table initial sizes are possible with p=5 corresponding to the smallest secondary table size and p=19 corresponding to the largest secondary table size); and, the linear probing offset ΔA is equal to 7. Setting the probing offset equal to 7 allows for enhanced efficiency if the hash table structure will tend to store values that are divisible by 8 or 16. Techniques also exist for growing the size of the secondary table from its initial size if it overflows. These are discussed in more detail further ahead with respect to
With a basic insertion process being described, a deletion process is next described with respect to
According to the deletion process of
According to the process of
As such, again consistent with a linear probing scheme, the AS1 value is summed 505 with a fixed value ΔA to identify the next secondary slot AS1+ΔA within the secondary table 402. At the AS1+ΔA slot 405, the key K3 found in the AS1+ΔA slot will match the K3 key being searched for (i.e., the answer to inquiry 506 will be “yes”). Because of the matching K3 keys, a check 507 is made to see if the AS1+ΔA slot corresponds to the last substantive entry in the chain of substantive entries. The check is made simply by summing the AS1+ΔA value with the offset ΔA to obtain the next linear probe value AS1+2ΔA and seeing if the entry at the next linear probe slot 406 is empty or not. According to the example being described, the AS1+2ΔA entry 406 is non-empty; therefore the data entry D3 in the third data structure will be marked as “deleted” 508 rather than actually deleted 509.
As an example, if an attempt was made to delete the fourth data structure residing at the AS1+2ΔA slot, the check 507 into the next AS1+3ΔA secondary table slot would recognize that the slot is empty. As such, the fourth data structure would be actually deleted 509 from the table, resulting in an empty slot at the location AS1+2ΔA. Because the deletion, insertion and access algorithms identify the end of a collision chain by the presence of the empty secondary slot, deleting the fourth data structure as just described would essentially shorten the maximum search length for a key for subsequent deletions, insertions and accesses.
According to a further implementation, if a “chain” of entries marked as deleted runs to the empty slot, the entire chain of entries marked as deleted will be actually deleted. For example, if after the hash map state observed in
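The mark-versus-remove decision can be sketched as below. The sketch covers only the secondary table, assumes 31 slots with an offset of 7, and pre-populates the chain K2, K3, K4 by hand; the entry layout, the return strings and the function names are hypothetical.

```python
SECONDARY_SIZE = 31  # assumed
DELTA = 7            # assumed offset ΔA

def secondary_hash(key: str) -> int:
    return len(key) % SECONDARY_SIZE   # contrived: K2..K4 share AS1

secondary = [None] * SECONDARY_SIZE
for i, key in enumerate(["K2", "K3", "K4"]):   # chain at AS1, AS1+ΔA, AS1+2ΔA
    secondary[(2 + i * DELTA) % SECONDARY_SIZE] = {
        "key": key, "data": "D" + key[1], "deleted": False}

def delete(key: str) -> str:
    slot = secondary_hash(key)
    for _ in range(SECONDARY_SIZE):
        entry = secondary[slot]
        if entry is None:
            return "not found"                 # empty slot ends the chain
        if entry["key"] == key:
            following = (slot + DELTA) % SECONDARY_SIZE
            if secondary[following] is None:   # last entry in the chain
                secondary[slot] = None         # actually delete it
                return "removed"
            entry["deleted"] = True            # keep the chain intact, only mark
            return "marked"
        slot = (slot + DELTA) % SECONDARY_SIZE
    return "not found"

r3 = delete("K3")  # K4 follows K3, so K3 is only marked
r4 = delete("K4")  # the slot after K4 is empty, so K4 is removed
```

Marking rather than emptying a mid-chain slot matters because an empty slot is what later searches treat as the end of the chain.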
b shows an insertion process, which can be used to write new data structures or update existing ones, that contemplates the possible presence of primary or secondary table slot entries that are marked as deleted. As an example, it will be assumed that the hash map state is that observed in
According to the process of
As such, again consistent with a linear probing scheme, the AS1 value is summed 520 with a fixed value ΔA to identify the next secondary slot AS1+ΔA within the secondary table 402. Because of the presence of the third data structure at the AS1+ΔA slot 405, slot 405 is non-empty and the key K3 found in the AS1+ΔA slot will match the K3 key being searched for (i.e., the answer to inquiry 521 will be “no” and the answer to inquiry 523 will be “yes”). Because of the matching K3 keys, the deletion mark that appears at slot 405 will be removed and the third data structure will be re-inserted 524.
Because a key match during an insertion process may find an entry that is not marked deleted (i.e., the insertion may correspond to a simple write operation), or may find an entry that is marked for deletion (i.e., the insertion corresponds to a re-insertion of a previously deleted data structure), each of the insertion processes 514, 519, 524 also indicates that the deletion mark should be removed if one exists. Because insertion processes 512, 517, 522 that trigger off of the discovery of an empty space 511, 516, 521 (which marks the end of a collision chain) by definition cannot find a deletion mark at the empty space, no such process for removing a deletion mark exists.
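An insertion that respects deletion marks might look like the following sketch. It assumes the same contrived hashing functions, table sizes and offset as a simple test fixture, starts from a state in which K3 is marked deleted, and uses hypothetical names throughout.

```python
PRIMARY_SIZE, SECONDARY_SIZE, DELTA = 4, 31, 7   # all assumed

def primary_hash(key: str) -> int:
    return len(key) % PRIMARY_SIZE     # contrived collisions

def secondary_hash(key: str) -> int:
    return len(key) % SECONDARY_SIZE   # contrived collisions

def entry(key, data, deleted=False):
    return {"key": key, "data": data, "deleted": deleted}

primary = [None] * PRIMARY_SIZE
secondary = [None] * SECONDARY_SIZE
primary[2] = entry("K1", "D1")
secondary[2] = entry("K2", "D2")
secondary[9] = entry("K3", "D3", deleted=True)   # marked, not removed
secondary[16] = entry("K4", "D4")

def insert(key: str, data: str):
    a = primary_hash(key)
    if primary[a] is None or primary[a]["key"] == key:
        primary[a] = entry(key, data)            # writing clears any mark
        return ("primary", a)
    s = secondary_hash(key)
    for _ in range(SECONDARY_SIZE):
        found = secondary[s]
        if found is None or found["key"] == key:
            # empty slot: plain insert; key match: write and drop the mark
            secondary[s] = entry(key, data)
            return ("secondary", s)
        s = (s + DELTA) % SECONDARY_SIZE
    raise RuntimeError("secondary table overflow")

where = insert("K3", "D3-new")   # re-inserts the previously deleted entry
```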
Recall from the background that accesses to a primary table should take less time than accesses to a secondary table.
a shows an initial hash map state that conforms to the hash map state originally observed in
Because of the presence of the second data structure in the AS1 table location, the K2 key found at the AS1 table location will not match the K4 key being searched for (i.e., the answer to inquiry 705 will be “no”). As such, again consistent with a linear probing scheme, the AS1 value is summed 706 with a fixed value ΔA to identify the next secondary slot AS1+ΔA within the secondary table 602b. Because of the presence of the third data structure at the AS1+ΔA slot, the K3 key found at the AS1+ΔA slot will not match the K4 key value being searched for (i.e., the answer to inquiry 707 will be “no”). As such, the AS1+ΔA value is summed 706 with the fixed value ΔA to identify the next secondary slot AS1+2ΔA within the secondary table 602b (i.e., slot 604a).
Because of the presence of the fourth data structure at slot 604a, the K4 key found at slot 604a will match the K4 key value being searched for (i.e., the answer to inquiry 707 will be “yes”). Access will then be made to the data D4 of slot 604a (e.g., to perform a cache read), and a result will be randomly generated (e.g., a number will be generated through a random number generation process 710). If the result is a “looked for” result, a “hit” results. For example, if a random number generator generates any integer from 1.0 to 10.0 inclusive and the number 1.0 is the “looked for” number; and, if the number 1.0 actually results from the number generation process, then, a “hit” results.
If the result is a “hit” the accessed data structure at the secondary table slot 604a is “flipped” 713 with the primary table slot entry 603a.
The flip of slot entries based on a hit from a random event will cause more frequently used entries to reside in the primary table slot more often than less frequently used entries. For example, if the fourth data structure is very heavily accessed and the looked for value is 1.0 from a random generator that produces any integer between 1.0 and 10.0 inclusive, each access to the fourth data structure would have only a 10% chance of being flipped with the entry in the primary slot. However, a flip would be expected by about the tenth access to the fourth data structure. Because of the heavy usage of the fourth data structure, there is an expectation it would eventually reach the primary table slot.
Moreover, the percentage of total time that the data structure would spend in the primary slot is apt to be a result of the relative usages of collision chain siblings. For example, if the fourth data structure received significantly more accesses than all its other collision chain siblings, the fourth data structure could expect to spend a disproportionate amount of time in the primary table slot. As another example, if the fourth and first data structures both received significantly more accesses than their collision chain siblings, the fourth and first data structures could expect to spend disproportionate amounts of time in the primary table slot relative to their collision chain siblings. The amount of time the fourth and first data structures spend in the primary table slot relative to each other would be a result of the frequency of their accesses relative to one another. For example, if they had approximately equal usages they would expect to split the amount of time that their collision siblings were not in the primary table slot.
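The random flip on access can be sketched as follows. The random draw is passed in as a parameter so the behavior can be exercised deterministically; that parameter, the tuple-based entries, the contrived hashing functions and the table sizes are assumptions made for the example.

```python
import random

PRIMARY_SIZE, SECONDARY_SIZE, DELTA = 4, 31, 7   # all assumed
LOOKED_FOR = 1                                   # the "looked for" result

def primary_hash(key: str) -> int:
    return len(key) % PRIMARY_SIZE     # contrived collisions

def secondary_hash(key: str) -> int:
    return len(key) % SECONDARY_SIZE   # contrived collisions

primary = [None] * PRIMARY_SIZE
secondary = [None] * SECONDARY_SIZE
primary[2] = ("K1", "D1")
secondary[2], secondary[9], secondary[16] = ("K2", "D2"), ("K3", "D3"), ("K4", "D4")

def access(key: str, draw=lambda: random.randint(1, 10)):
    """Read the data for key; on a random hit, flip it into the primary slot."""
    a = primary_hash(key)
    if primary[a] is not None and primary[a][0] == key:
        return primary[a][1]                   # fast path: already in primary
    s = secondary_hash(key)
    for _ in range(SECONDARY_SIZE):
        found = secondary[s]
        if found is None:
            return None                        # end of the chain
        if found[0] == key:
            # guard (an assumption): never move an empty primary slot into the chain
            if primary[a] is not None and draw() == LOOKED_FOR:
                primary[a], secondary[s] = secondary[s], primary[a]
            return found[1]
        s = (s + DELTA) % SECONDARY_SIZE
    return None
```

With the default draw, each secondary-table access flips with probability 1/10, so heavily accessed entries tend to migrate toward the primary slot without any per-entry usage counters being kept.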
Note from
Referring back to
In an embodiment, the amount by which the secondary table is increased is its “initial size”. For example, from the discussion concerning
Thus, for example, if the initial size of the secondary table is set to 31 (which corresponds to a p value of 5), upon its overflow, the secondary table is resized to 62 (which corresponds to its “previous size” plus its “initial size”). If the secondary table again overflows, the secondary table is again resized by effectively increasing its size by the initial size. This corresponds to adding the initial size (31) to the secondary table's previous size (62) which results in a new secondary table size of 93.
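The sizing arithmetic can be checked with a short sketch. The function names are hypothetical, and the wrap-around probe sequence used in `slots_visited` is an assumption about how the offset is applied at the end of the table.

```python
def resize_sequence(p: int, overflows: int) -> list:
    """Secondary table sizes: start at 2**p - 1, add the initial size per overflow."""
    initial = 2 ** p - 1
    sizes = [initial]
    for _ in range(overflows):
        sizes.append(sizes[-1] + initial)   # previous size plus initial size
    return sizes

def slots_visited(size: int, delta: int = 7, start: int = 0) -> int:
    """Count distinct slots reached by repeatedly adding delta modulo size."""
    seen = set()
    slot = start
    while slot not in seen:
        seen.add(slot)
        slot = (slot + delta) % size
    return len(seen)

sizes = resize_sequence(5, 2)                   # [31, 62, 93]
coverage = [slots_visited(n) for n in sizes]    # [31, 62, 93]
```

For these three sizes an offset of 7 shares no common factor with the table size, so a probe sequence can reach every slot before repeating.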
In a further embodiment, an “overflow” condition is not “every slot in the secondary table is filled”. Here, because key chains are supposed to end with an empty slot, by definition, empty slots are supposed to be existing within the secondary table at the moment it is deemed to be overflowing sufficiently to trigger a resize to a new, larger size. Specifically, according to one embodiment, the increasing of the secondary table size to a next larger size is triggered if either of the following two conditions arise: 1) the number of separate linear probe chains being supported in the secondary table is less than the hash number size; 2) the number of slots in the secondary table is less than ¾ that of the hash number size. By triggering a resize on the occurrence of either of these events, an empty space can be guaranteed at the end of each linear probe chain.
Processes taught by the discussion above may be performed with program code such as machine-executable instructions which cause a machine (such as a “virtual machine”, a general-purpose processor disposed on a semiconductor chip or special-purpose processor disposed on a semiconductor chip) to perform certain functions. Alternatively, these functions may be performed by specific hardware components that contain hardwired logic for performing the functions, or by any combination of programmed computer components and custom hardware components.
An article of manufacture may be used to store program code. An article of manufacture that stores program code may be embodied as, but is not limited to, one or more memories (e.g., one or more flash memories, random access memories (static, dynamic or other)), optical disks, CD-ROMs, DVD ROMs, EPROMs, EEPROMs, magnetic or optical cards or other type of machine-readable media suitable for storing electronic instructions. Program code may also be downloaded from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a propagation medium (e.g., via a communication link (e.g., a network connection)).
It is believed that processes taught by the discussion above can be practiced within various software environments such as, for example, object-oriented and non-object-oriented programming environments, Java based environments (such as a Java 2 Enterprise Edition (J2EE) environment or environments defined by other releases of the Java standard), or other environments (e.g., a .NET environment, a Windows/NT environment each provided by Microsoft Corporation).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
| Number | Date | Country |
|---|---|---|
| 20060143168 A1 | Jun 2006 | US |