Embodiments of the present invention generally relate to the field of electronic storage systems. More specifically, embodiments of the present invention relate to high performance data management using open hash tables.
High performance systems typically manage large amounts of data, often including millions of individual items. Designing indices to keep track of the information is critical to the performance of these systems. The process of locating or deleting existing items or adding new items must be very efficient in both access time (e.g., read time and write time) and required storage space in memory.
A standard way of managing large numbers of independent data items is with hash tables. One familiar with the art will know that the two most common implementations are linked-list (chained) hash tables and open (open-addressed) hash tables. High performance systems often need to service multiple look-up, insertion, and deletion operations in parallel. One familiar with the art will understand that maintaining the integrity of a hash table in the face of parallel operations generally requires some form of serialization.
Linked list hash tables are generally implemented as an array of list headers. A hash function maps an item key into an index into this array (often called a hash bucket). The header points to a linked list of items that hashed into this “bucket”. Serialization is generally accomplished by adding a lock to each header. That lock protects all operations within that list. Each operation requires only a single lock, potentially keeping overhead low. If the number of list headers is much larger than the number of parallel requestors, the probability of conflict will be low and the parallelism will be high. There are, however, a few problems with linked list hash tables. Searching long lists can be time consuming, and the linked list pointers can significantly increase the required memory. These problems are well-addressed by open hash tables.
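The per-bucket locking scheme described above can be illustrated with a short sketch. This is not the disclosed apparatus; the bucket count, hash function, and class names are arbitrary illustrative choices. Note that each operation acquires exactly one lock, and that every node carries a `next` pointer, which is the memory overhead the passage refers to.

```python
import threading

class ListHashTable:
    """Illustrative linked-list hash table with one lock per bucket."""

    class _Node:
        __slots__ = ("key", "value", "next")
        def __init__(self, key, value, next=None):
            self.key, self.value, self.next = key, value, next

    def __init__(self, num_buckets=1024):
        self._buckets = [None] * num_buckets
        self._locks = [threading.Lock() for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self._buckets)

    def insert(self, key, value):
        i = self._index(key)
        with self._locks[i]:        # one lock serializes the whole list
            node = self._buckets[i]
            while node is not None:
                if node.key == key:
                    node.value = value
                    return
                node = node.next
            self._buckets[i] = self._Node(key, value, self._buckets[i])

    def lookup(self, key):
        i = self._index(key)
        with self._locks[i]:
            node = self._buckets[i]
            while node is not None:
                if node.key == key:
                    return node.value
                node = node.next
            return None
```

With many more buckets than concurrent requestors, two threads rarely contend for the same lock, which is the parallelism argument made above.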
For an open hash table, the item key is mapped into an initial location, which might contain the desired item. If multiple keys map to the same initial location, however, there will be a collision followed by an overflow search. There are numerous well-known algorithms for conducting overflow searches, but they all involve examining successive entries in a well-defined order. Unfortunately, it is much more difficult to serialize insertions into an open hash table. Overflows and secondary overflows may involve numerous cells, and parallel operations may cause deadlocks. Obtaining and releasing a large number of locks greatly increases the execution costs for an open hash table, and adding a lock to each cell uses almost as much memory as the pointers in a linked list. These issues limit the use of open hash tables in applications that must support parallel insertions. What is needed is a technique that offers the performance and efficiency of open hash tables while supporting parallel operations without the need for locks.
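The collision-then-overflow-search behavior can be sketched with the simplest well-known overflow order, linear probing. This single-threaded sketch is illustrative only (the disclosure does not prescribe a probe algorithm) and omits deletion, which in open addressing requires tombstones:

```python
class OpenHashTable:
    """Open-addressed table; overflow search by linear probing (single-threaded sketch)."""

    _EMPTY = object()   # marker for a never-used slot

    def __init__(self, capacity=16):
        self._slots = [self._EMPTY] * capacity

    def _probe(self, key):
        # Visit successive slots in a well-defined order, wrapping around.
        start = hash(key) % len(self._slots)
        for step in range(len(self._slots)):
            yield (start + step) % len(self._slots)

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is self._EMPTY or slot[0] == key:
                self._slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def lookup(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is self._EMPTY:   # search ends at the first empty slot
                return None
            if slot[0] == key:
                return slot[1]
        return None
```

An insertion may touch many cells before finding an empty one, which is why locking every visited cell is costly and deadlock-prone when insertions run in parallel.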
The present disclosure provides the performance and efficiency of open hash tables while supporting parallel operations without the need for locks. It does so by partitioning a single open hash table into multiple independent sub-tables.
According to one approach, an apparatus for servicing requests to an open hash table using sub-tables is disclosed. The apparatus includes a memory for storing a plurality of sub-tables, and a processor coupled to the memory that receives an incoming request which includes a key value to perform a first action in a first sub-table, calculates a routing hash value associated with the key value using a hash function, determines an index for the routing hash value by calculating a modulo (N) function of the routing hash value, and appends the incoming request to a queue associated with the first sub-table, where N is a number of sub-tables associated with the processor, the index corresponds to the first sub-table, and the processor is operable to retrieve the incoming request from the queue to perform the first action.
According to another approach, a method of servicing requests to an open hash table using sub-tables is disclosed according to embodiments of the present invention. The method includes receiving the incoming request comprising a key value, calculating a routing hash value associated with the key value using a hash function, determining an index for the routing hash value by calculating a modulo (N) function of the routing hash value, where N is a number of sub-tables associated with the request router and the index corresponds to a sub-table, appending the incoming request to a service queue associated with the sub-table, retrieving the incoming request from the service queue, and performing an operation requested by the incoming request in the sub-table.
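The routing flow recited above (hash the key, reduce it modulo N, append the request to the matching sub-table's service queue, and let a dedicated worker perform the operation) can be sketched as follows. The queue-and-thread machinery, the SHA-1 routing hash, and N = 4 are illustrative choices, not taken from the disclosure; because each sub-table is touched by exactly one worker, the sub-tables themselves need no locks.

```python
import hashlib
import queue
import threading

N = 4                                            # number of sub-tables (illustrative)
sub_tables = [dict() for _ in range(N)]          # stand-ins for open hash sub-tables
service_queues = [queue.Queue() for _ in range(N)]

def routing_hash(key: bytes) -> int:
    # Any deterministic hash works; SHA-1 is an arbitrary choice here.
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

def route(request):
    """Compute index = routing_hash(key) mod N; enqueue on that sub-table's queue."""
    key = request[1]                             # request = (op, key, *args)
    service_queues[routing_hash(key) % N].put(request)

def partition_manager(index):
    """Services one sub-table; as its sole writer, it needs no locks."""
    table = sub_tables[index]
    while True:
        request = service_queues[index].get()
        if request is None:                      # shutdown sentinel
            break
        op = request[0]
        if op == "insert":
            _, key, value = request
            table[key] = value
        elif op == "delete":
            table.pop(request[1], None)
        # A "locate" op would additionally need a reply channel, omitted here.
```

A usage pass would route a batch of requests, start one `partition_manager` thread per sub-table, and post a sentinel per queue to shut the workers down.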
The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention:
Reference will now be made in detail to several embodiments. While the subject matter will be described in conjunction with the alternative embodiments, it will be understood that they are not intended to limit the claimed subject matter to these embodiments. On the contrary, the claimed subject matter is intended to cover alternatives, modifications, and equivalents, which may be included within the spirit and scope of the claimed subject matter as defined by the appended claims.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the claimed subject matter. However, it will be recognized by one skilled in the art that embodiments may be practiced without these specific details or with equivalents thereof. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects and features of the subject matter.
Portions of the detailed description that follows are presented and discussed in terms of a method. Although steps and sequencing thereof may be disclosed in a figure herein describing the operations of this method (such as
Some portions of the detailed description are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits that can be performed on computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer-executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout, discussions utilizing terms such as “accessing,” “writing,” “including,” “storing,” “transmitting,” “traversing,” “associating,” “identifying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Computing devices, such as computer system 100, typically include at least one processor 101, and some amount of memory 102. Computer system 100 may also include communications devices (e.g., Communications Interface 103). Memory 102 may include volatile and/or nonvolatile, removable and/or non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Memory 102 may comprise RAM, ROM, NVRAM, EEPROM, flash memory or other memory technology. Communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of communications media.
With regard to
With regard still to
Embodiments of the apparatuses disclosed herein include a request router and multiple partition managers executed by a processor. The request router receives requests, determines which partition each request belongs in, and queues that request for service by the appropriate partition manager. Each partition manager services requests (e.g., requests to locate, insert, or delete entries in the hash table), performing the requested actions one at a time. Because all operations in a particular partition (sub-table) are performed by a single partition manager, there is no possibility of conflicting operations. As long as the number of partitions is greater than or equal to the maximum desired parallelism (the number of requests that can be processed at a single time), serialization through the partition managers will not become a bottleneck.
A request router executed by a processor determines which partition a particular request belongs in by computing a deterministic function on the request key. The most obvious approach is a simple hash (polynomial function of the bytes in the key, modulo the number of partitions). In another approach, the simple hash is used to index into a redirection table, whose entries specify which partition manager should service these requests. Adding a level of indirection makes it easier to adjust the assignment of sub-tables to partition managers for load balancing, failure, or the addition of new partition managers.
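The two routing variants described above can be sketched together: a simple polynomial hash of the key bytes, and a redirection table mapping sub-tables to partition managers. The table sizes and the 31-multiplier polynomial are illustrative assumptions, not values from the disclosure.

```python
NUM_SUB_TABLES = 64     # many more sub-tables than managers eases rebalancing
NUM_MANAGERS = 4

# redirection_table[s] names the manager currently serving sub-table s.
redirection_table = [s % NUM_MANAGERS for s in range(NUM_SUB_TABLES)]

def sub_table_of(key: bytes) -> int:
    # Simple polynomial hash of the key bytes, modulo the sub-table count.
    h = 0
    for b in key:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h % NUM_SUB_TABLES

def manager_of(key: bytes) -> int:
    # Level of indirection: key -> sub-table -> manager.
    return redirection_table[sub_table_of(key)]

def reassign(sub_table: int, new_manager: int):
    # Rebalancing edits table entries; which sub-table a key hashes to never changes.
    redirection_table[sub_table] = new_manager
```

The indirection means load balancing, failover, or adding managers only ever touches `redirection_table`; existing keys keep hashing to the same sub-table.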
There are many approaches to implementing the partition managers. Depending on the required hash table size, performance, scalability, and availability requirements, partition managers can be implemented in distinct nodes on a network, distinct CPUs in a multi-processor system, distinct cores in one or more multi-core CPUs, distinct processes, or any other mechanism that can support multiple independent logically sequential execution sequences.
There are also multiple approaches to implementing the passing of requests from the request router to the partition managers. If the request router and partition managers operate in a shared address space, the queues can be implemented with standard in-memory data structures. If they share a common file system that supports message queues, operation requests can be delivered through those files. If they share neither an address space nor a file system, operations can be forwarded from the request router to the partition managers as inter-process or network messages.
With regard to
With regard to
With regard to
Efficient Parallel Insertion into an Open Hash Table
With regard to
With regard still to
With regard to
With regard still to
While in some cases the association of sub-tables with partition managers is relatively static, according to some embodiments of the present invention, superior performance may be obtained by dynamically adjusting this association in response to changes in load and resources. As load increases, it may be desirable to add additional partition managers, either within a single computer system or by adding additional servers. When a new partition manager is added, some of the partitions currently managed by existing managers should be transferred to the new manager. If a partition manager becomes overloaded or fails, some or all of its partitions should be transferred to the surviving partition managers. For such transfers to work efficiently, the number of sub-tables should be significantly (e.g., 10×) greater than the maximum expected number of partition managers. If the number of sub-tables is not much larger than the number of partition managers, the achieved load distribution will likely be very uneven.
For some embodiments of this invention that forward requests to partition managers through queues, the reassignment of sub-tables to partition managers may be adjusted by updating the configured queue list. If all partition managers do not share a single address space, updates to the configured queue lists should be coordinated. One familiar with the art will be aware of multiple standard mechanisms for configuring application-specific parameters and notifying those applications of changes to their configuration.
For some embodiments of this invention that forward requests to partition managers via inter-process or network messages, the reassignment of sub-tables to partition managers can be adjusted by updating the routing table. If there are multiple client computer systems, either each client computer system must have an identical copy of the routing table, or misdirected requests must be forwarded to the correct server. This is a fundamental requirement of some consistent placement schemes. One familiar with the art will recognize that there are numerous well-known content distribution and distributed consensus protocols for achieving this.
One way to distribute sub-tables among partition managers is by taking the sub-table number modulo the number of partition managers. However, one skilled in the art will recognize that this approach results in considerable reassignment whenever the number of partition managers changes. Much greater efficiency may be obtained by minimizing the reassignment.
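The contrast drawn above can be made concrete with a sketch of a minimal-movement redistribution: instead of recomputing sub-table mod manager-count (which reshuffles most assignments), only the sub-tables that exceed a manager's fair-share quota, or whose manager was removed, are moved. The quota scheme and function shape are illustrative assumptions, not the disclosed algorithm.

```python
def rebalance(assignment, num_managers):
    """Redistribute sub-tables over num_managers, moving only the minimum.

    assignment: list where assignment[s] is the manager serving sub-table s.
    Returns a new list; entries within quota keep their old manager.
    """
    num_tables = len(assignment)
    base, extra = divmod(num_tables, num_managers)
    quota = [base + (1 if m < extra else 0) for m in range(num_managers)]

    load = [0] * num_managers
    surplus = []                         # sub-tables that must move
    result = list(assignment)
    for s, m in enumerate(assignment):
        if m < num_managers and load[m] < quota[m]:
            load[m] += 1                 # stays with its current manager
        else:
            surplus.append(s)            # manager over quota, or removed
    for s in surplus:
        # Hand each surplus sub-table to the most under-quota manager.
        m = min(range(num_managers), key=lambda x: load[x] - quota[x])
        result[s] = m
        load[m] += 1
    return result
```

Going from 2 managers to 3 over 8 sub-tables, this moves only the 2 sub-tables needed to populate the new manager, whereas the plain modulo mapping would reassign most of them.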
With regard to
With regard to
With regard to
One skilled in the art will recognize that it is critical to avoid oscillations in a feedback network. Two critical parameters are chosen at step 901: a polling rate (used in step 906), and a minimum imbalance threshold (used in step 903). If the polling rate is too fast, subsequent changes will be made before the system has reached a new equilibrium from the previous changes. If the imbalance threshold is set too low, the system will constantly be redistributing load and never reach an equilibrium. Ideal values will make the system responsive while avoiding wasteful oscillations, and are usually found through measurement and tuning.
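The interplay of the two parameters can be sketched as a small polling loop: loads are sampled at the polling rate, and a single sub-table is moved only when the spread between the busiest and idlest manager exceeds the imbalance threshold as a fraction of mean load. The callback shape, the relative-spread metric, and the one-move-per-poll policy are illustrative assumptions, not details from step 901/903/906.

```python
import time

def balance_loop(get_loads, move_one_sub_table,
                 poll_seconds=5.0, threshold=0.2, rounds=None):
    """Poll manager loads; rebalance only when imbalance exceeds threshold.

    get_loads() -> list of per-manager load figures.
    move_one_sub_table(src, dst) performs a single reassignment.
    rounds limits iterations for testing; None runs forever.
    """
    n = 0
    while rounds is None or n < rounds:
        loads = get_loads()
        mean = sum(loads) / len(loads)
        hi = max(range(len(loads)), key=loads.__getitem__)
        lo = min(range(len(loads)), key=loads.__getitem__)
        # Act only when the spread is a meaningful fraction of mean load,
        # so the system can settle between adjustments (avoids oscillation).
        if mean > 0 and (loads[hi] - loads[lo]) / mean > threshold:
            move_one_sub_table(hi, lo)
        n += 1
        if rounds is None or n < rounds:
            time.sleep(poll_seconds)     # the polling rate
```

Raising `poll_seconds` or `threshold` damps the loop at the cost of responsiveness; the tuning trade-off described above.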
Embodiments of the present invention are thus described. While the present invention has been described in particular embodiments, it should be appreciated that the present invention should not be construed as limited by such embodiments, but rather construed according to the following claims.