The field of invention pertains generally to the computing sciences, and, more specifically, to a nearest neighbor search logic circuit with reduced latency and power consumption.
A number of applications depend on finding one or more specific items of data in a large database where the location of the items in the database is unknown. That is, the database needs to be searched in order to find the items of data. A class of searches, referred to as nearest neighbor searches (or “kNN” searches), returns the k items of data in the database whose data content most closely matches an input query item.
The challenge of performing kNN searches is exacerbated by the emergence of data-centric (e.g., cloud) computing, “big data”, artificial intelligence, machine learning and other computationally intensive applications that operate on large databases, and by the general objective of keeping power consumption in check. Here, with database sizes becoming extremely large, the brute force method of comparing the input query item against every item of data in the database until the closest k matches are found is not feasible because too much time and energy are consumed per search query.
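For purposes of illustration only, the brute force approach can be sketched in software as follows; the distance metric and data representation are assumptions of the sketch and are not features of any embodiment. The sketch merely shows the M comparisons per query that become impractical as the database grows.

```python
import heapq

def brute_force_knn(query, database, k, distance):
    """Brute force kNN: compare the query against every one of the M database
    items and keep the k items whose distance to the query is smallest.
    'distance' is any metric appropriate to the data (e.g., Hamming or
    Euclidean); it is an assumed parameter of this illustrative sketch."""
    return heapq.nsmallest(k, database, key=lambda item: distance(query, item))
```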
A better approach, depicted in
Nevertheless, implementing the overall search as a customized function within a semiconductor chip (e.g., as a kNN search accelerator) with minimal latency and power consumption offers challenges, particularly in the case of large databases that are to be searched.
A better understanding of the present invention can be obtained from the following detailed description in conjunction with the following drawings, in which:
With respect to the first search stage 101, the hashing logic circuit 201 receives the input query term and generates an output “hash word” composed of B separate hash chunks or “bands” of S bits each. The CAM 202 contains entries that are the results of the same hashing algorithm used by the hashing logic circuit 201 applied to each of the database's (RAM's) data items. That is, if the RAM 203 has M data items, the CAM 202 has M entries, where each entry in the CAM 202 contains the hash word (composed of B bands of S bits) that results from application of the hashing algorithm to a different one of the RAM's data items.
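The relationship between the RAM's data items and the CAM's entries can be modeled in software as follows. This is only a behavioral sketch: the hashing function is assumed (any function returning B bands of S bits per item, such as the band-wise hashing described below), and the names and example values are illustrative rather than taken from any embodiment.

```python
B = 8    # bands per hash word (illustrative value)
S = 16   # bits per band (illustrative value)

def build_cam(ram_items, hash_word):
    """One CAM entry per RAM data item: the item's hash word, stored here as
    a list of B integers, each integer holding one S-bit band."""
    return [hash_word(item) for item in ram_items]
```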
The CAM 202 receives the hash word from the hashing logic circuit 201 and compares the hash word against the CAM's entries. However, as will be described immediately below with respect to
That is, referring to
In the particular example depicted in
In continuing comparisons for the CAM entries other than the first entry, the CAM compares the second band of each of these CAM entries in parallel against a corresponding second band of the hash word. Again, if the second band of any of these CAM entries exhibits a bit-for-bit match with the second band of the hash word, such entries are deemed to be sufficiently similar to the hash word for the first search stage, and no more comparisons are carried out for these entries going forward in order to save power. As observed in
The process then continues in succession with each next comparison being performed on the next one of the bands of the hash word and the corresponding next one of the bands of the (remaining) entries in the CAM that have not been deemed sufficiently near the hash word. As matches are found over the succession of comparisons, the number of entries deemed sufficiently near the hash word commonly grows and fewer CAM entries are subjected to comparisons, resulting in increased power savings.
Eventually, the last one of the bands of the hash word is compared to the last one of the bands of the remaining CAM entries that have not been deemed sufficiently close to the hash word. If the last band of any of the remaining CAM entries exhibits a match to the last band of the hash word, such CAM entries are the last to be deemed sufficiently close to the hash word and the first stage of the search process is complete.
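The band-by-band first stage comparison, including the power-saving early termination described above, can be summarized with the following behavioral sketch (it models the outcome of the comparisons, not the circuit timing; the data layout matches the build_cam sketch above).

```python
def first_stage_matches(query_bands, cam):
    """Band-by-band first stage search: an entry is deemed sufficiently near
    as soon as any one of its bands matches the corresponding band of the
    query's hash word bit for bit, and that entry is then excluded from all
    further comparisons (the hardware's power saving measure).  Returns one
    match/mis-match bit per CAM entry."""
    selected = [0] * len(cam)
    remaining = set(range(len(cam)))
    for b, query_band in enumerate(query_bands):        # bands in succession
        matched = {i for i in remaining if cam[i][b] == query_band}
        for i in matched:
            selected[i] = 1                             # sufficiently near
        remaining -= matched                            # no further compares
    return selected
```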
As observed in
With respect to the operation of the circuit itself, each multiplexer receives a particular key and the logical inverse of that key. The channel select input of each multiplexer receives its own respective bit of a control vector, where the control vector is essentially a random value (some bits are 1s and the other bits are 0s). Each multiplexer, therefore, presents at its output either its key or the logical inverse of its key depending on the bit of the control vector that is presented to its channel select input. The W outputs from the W multiplexers are then added. A particular bit of the summation result is chosen for the hash word bit. In an embodiment, the single generated bit corresponds to the most significant bit of the addition result determined by the adder tree.
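The generation of a single hash word bit described above can be modeled as follows. The key width and the treatment of each key as an unsigned integer are assumptions of this sketch; only the select-key-or-inverse, add, and take-the-most-significant-bit behavior is taken from the description.

```python
def hash_bit(keys, control_vector, key_width):
    """One hash word bit: each bit of the (random) control vector selects
    either its key or the key's logical inverse, the W selections are summed
    by the adder tree, and the most significant bit of the fixed-width sum
    is the generated bit."""
    mask = (1 << key_width) - 1
    total = sum(k if select else (~k & mask)
                for k, select in zip(keys, control_vector))
    # Width of the adder tree result: key width plus carry bits for W inputs.
    sum_width = key_width + max(1, (len(keys) - 1).bit_length())
    return (total >> (sum_width - 1)) & 1
```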
Each of the comparison circuits 501, 502 uses the bit line (“ML”) coupled to its respective comparison cells to perform a logical AND function across the collective output of those comparison cells. If all S comparisons performed by the S comparison cells indicate a match, the bit line will be pulled to a first logical value. By contrast, if one or more of these comparisons indicate a mismatch, the bit line will be pulled to a second logical value.
For example, in one embodiment, the bit line is passively pulled to a logic high (e.g., with a resistor), and the comparison cells are designed to provide a high impedance output state in the case of a match. In the case where all comparison cells indicate a match, the comparison cells do not influence the bit line and the bit line manifests a logic high by way of the weak pull-up. By contrast, if one or more of the comparison cells indicate a mis-match, the mis-matching cell(s) will actively drive the bit line low.
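The wired-AND behavior of a single band's bit line can be expressed with the following sketch (a logical model only; the pull-up resistor and high impedance states are represented by the initial value and the conditional drive).

```python
def match_line_value(entry_bits, query_bits):
    """Wired-AND of one band's S comparison cells: the line is passively
    pulled to logic high, and any mis-matching cell actively drives it low."""
    line = 1                                    # weak pull-up
    for stored, searched in zip(entry_bits, query_bits):
        if stored != searched:
            line = 0                            # mis-matching cell pulls low
    return line
```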
Comparison circuit 502 also includes an OR gate 503 that performs a logical OR on the aforementioned AND value from the match/mis-match bit line (ML2) and the band match/mis-match result 504 generated from the immediately prior band. Here, from the discussion of
If OR gate 503 observes a logic high at either of its inputs (e.g., its local comparisons all match for band 1, and/or the output of comparison circuit 501 indicates a match for band 0), the OR gate 503 will generate a logical high output. Essentially, a logical high at the output of the OR gate 503 means that comparison circuit 502 has observed a match on all of its S bits, and/or the preceding comparison circuit 501 that performed a comparison on a preceding band of S bits observed a match on all of its S bits.
The output of OR gate 503 is tied to a respective enable input of each of the comparison cells of its immediately following comparison circuit so that, if the OR gate 503 provides a logically high output, the following comparison circuit does not perform any comparison as a power saving measure. The OR gate of the following comparison circuit will also present a logic high in response to its prior OR gate 503 issuing a logic high. As such, once a comparison circuit issues an output indicating a match, the outputs of all subsequent comparison circuits will indicate a match, which disables the comparison cells for the remainder of the CAM entry. The OR gate from the last comparison circuit for the last band (band B-1) enters a final match/mis-match decision bit for the first CAM entry into an element of a vector register 505 reserved for the first CAM entry. With each CAM entry operating concurrently with the first CAM entry according to the band-by-band comparison sequence, all CAM entries will, in parallel, register a match/mis-match final result into the vector register 505 after the last band (band B-1) has been compared. The output of the vector register 505 presents the output of the first stage search.
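The chain of band comparison circuits for one CAM entry, with the OR gate propagation and the disabling of downstream comparison cells, can be summarized as follows (a behavioral sketch only; the band values are assumed to be held as integers as in the earlier sketches).

```python
def cam_entry_match(entry_bands, query_bands):
    """One CAM entry's chain of band comparison circuits: each circuit ANDs
    its per-bit comparisons (the match line), ORs that result with the prior
    band's result, and, once the OR output is high, all following circuits
    skip their comparisons.  Returns the final bit written into the entry's
    element of the vector register 505."""
    prior = 0                                          # no match seen yet
    for entry_band, query_band in zip(entry_bands, query_bands):
        if prior:
            continue                                   # comparison cells disabled
        band_match = 1 if entry_band == query_band else 0   # wired-AND of S cells
        prior = band_match | prior                     # OR gate
    return prior

# First stage output: one bit per CAM entry, registered in parallel, e.g.:
# stage1_vector = [cam_entry_match(entry, query_bands) for entry in cam]
```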
Referring to
Similar to the approach of the hashing logic circuit 201 and CAM 202 of the first stage search process, in which comparisons are made in discrete bands of bits, the full query vector is viewed as being composed of discrete chunks of bits (referred to as “domains”) and comparisons of the full query vector against selected data items are made on a domain-by-domain basis. In an embodiment, there are T total domains and D bits per domain. As such, the size of the query vector and of each data entry in the RAM is T×D bits.
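As a small illustrative helper (assuming the bits of a vector are held most significant bit first in a Python list), the domain decomposition described above is simply:

```python
def split_into_domains(vector_bits, T, D):
    """Split a T*D-bit vector (most significant bit first) into T domains of
    D bits each, highest order domain first."""
    assert len(vector_bits) == T * D
    return [vector_bits[t * D:(t + 1) * D] for t in range(T)]
```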
As observed in
Thus, in order to read the bits of each selected data item's first domain along the data item's crosswise bit line, according to a first phase, a first control signal that is coupled to the highest ordered bit in the first domain (SBS1) of each data item is activated. The storage cells of only those data items that were selected by the first search stage respond to the control signal. The highest ordered bit in the first domain of each selected data item is then presented on the data item's crosswise bit line for sampling by the compare and sort circuitry 704. As depicted in
Then, according to a second phase, a second control signal that is coupled to the second-highest ordered bit in the first domain (SBS2—not shown in
Again, the phases are performed in parallel across all selected data items; thus, the crosswise bit lines of the selected data items will present a succession of bits in parallel across the D phases of the readout process. As observed in
In particular, for each selected data item, the compare and sort circuit 704 first compares the sequence of D bits for the first domain that is received from the data item's crosswise bit line against the corresponding D bits for the first domain in the search vector. In an embodiment, the comparison is mathematically performed as an inner product. Here, the inner product yields a scalar value that can be deemed a “score”, whose value increases as more of the higher ordered bits match and decreases as more of the higher ordered bits do not match.
Notably, the inner product performed by the compare and sort circuit 704 can be pipelined with the readout process of the selected data entries. That is, for instance, while a next bit is being read out from the appropriate storage cell of each of the selected data entries, the prior bit is being compared or otherwise used in a calculation to determine whether the prior bit matches its counterpart in the search vector.
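One possible software rendering of the per-domain score is shown below. The exact inner product used by the compare and sort circuit 704 is not reproduced here; the sketch only captures the stated property that matches on higher ordered bits raise the score more than matches on lower ordered bits, and mismatches lower it correspondingly.

```python
def domain_score(entry_bits, query_bits):
    """Score one domain of D bits as they arrive bit-serially, most
    significant bit first: each position contributes a weight that halves
    toward the least significant bit, added on a match and subtracted on a
    mismatch (an assumed weighting, for illustration only)."""
    D = len(query_bits)
    score = 0
    for j, (stored, searched) in enumerate(zip(entry_bits, query_bits)):
        weight = 1 << (D - 1 - j)            # MSB carries the largest weight
        score += weight if stored == searched else -weight
    return score
```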
At the conclusion of the comparison of the D bits of the first domain, referring to
The process then repeats 604, 605, 606 with the remaining selected data entries for a next, lesser order domain of D bits. For each remaining selected entry, the entry's earlier score is accumulated with its newly determined score. However, as the process flows in the direction from the most significant bit to the least significant bit, later scores have less weight in the total score for the data entries than earlier scores. Again, the next one or more data entries having the lowest total score are eliminated from consideration (their status changes from selected to non-selected) 606. As observed in
The iterations then continue until the number of remaining data entries reaches k, or the last domain of D bits is processed. In the case of the former, the k remaining data entries are returned as the result of the search. In the case of the latter, the sorting circuit chooses the set of k entries having the highest total score.
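The overall second stage flow described above (score a domain, accumulate, eliminate the lowest scoring entries, repeat until k remain or the last domain is processed) can be sketched as follows, reusing the domain_score sketch above. The per-domain weighting and the number of entries eliminated per iteration are assumptions of the sketch.

```python
def second_stage_search(selected_ids, ram_domains, query_domains, k,
                        eliminate_per_round=1):
    """Domain-by-domain elimination.  ram_domains[i] and query_domains are
    lists of T domains of D bits, highest order domain first; selected_ids
    are the entries passed on by the first search stage."""
    T = len(query_domains)
    totals = {i: 0 for i in selected_ids}
    for t in range(T):
        weight = 1 << (T - 1 - t)                  # later domains weigh less
        for i in totals:
            totals[i] += weight * domain_score(ram_domains[i][t],
                                               query_domains[t])
        if len(totals) <= k:
            break
        # Drop the lowest scoring entries, but never go below k survivors.
        drop = sorted(totals, key=totals.get)[:min(eliminate_per_round,
                                                   len(totals) - k)]
        for i in drop:
            del totals[i]
    # If more than k entries remain after the last domain, keep the k highest.
    return sorted(totals, key=totals.get, reverse=True)[:k]
```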
By contrast, when the cell is to be read for purposes of performing the second stage of a search process as described at length above, the search word line 813 is activated. Here, search word line 813 is activated when the data entry that the storage cell belongs to is selected by the vector output of the first search stage process. The search word line 813 remains active unless and until the data entry is discarded from consideration as the nearest neighbor. When the domain that the cell belongs to is being read and it is the cell's turn/phase to provide its data on the differential crosswise bit lines 814, the search strobe line 815 is activated. With both the search word and search strobe lines 813, 815 being active, transistors M1, M2, M3 and M4 are all ON, which causes the data stored by latch 816 to be presented on the crosswise bit lines 814.
As observed in
A distance sorting network ripples higher score values from the distance compute circuits toward the end of the network. Thus, score values that ripple forward the least are candidates for elimination from consideration as the nearest neighbor.
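As a rough software analogue (not a model of the actual network), one ripple step of such a sorting network can be pictured as a row of adjacent compare-and-swap stages:

```python
def ripple_once(scores):
    """One ripple pass: adjacent compare-and-swap stages push higher score
    values toward the end of the list, so the values left near the front
    after repeated passes are the candidates for elimination."""
    s = list(scores)
    for i in range(len(s) - 1):
        if s[i] > s[i + 1]:
            s[i], s[i + 1] = s[i + 1], s[i]      # higher score ripples forward
    return s
```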
Here, as observed in
The instruction set architecture (ISA) of the processing cores could include a special instruction to execute the nearest neighbor search in any/all of the processor's caches. Such an instruction could specify a nearest neighbor search opcode and accept a search query vector and a target cache (e.g., L1 cache, L2 cache, L3 caches, all caches, etc.) as input operands.
An application processor or multi-core processor 1150 may include one or more general-purpose processing cores 1115 within its CPU 1101, one or more graphics processing units 1116, a memory management function 1117 (e.g., a memory controller) and an I/O control function 1118. The general-purpose processing cores 1115 typically execute the system and application software of the computing system. The graphics processing unit 1116 typically executes graphics intensive functions to, e.g., generate graphics information that is presented on the display 1103. The memory control function 1117 interfaces with the system memory 1102 to write/read data to/from system memory 1102.
Any of the system memory 1102 and/or non volatile mass storage 1120 can be composed of a three dimensional non volatile random access memory built, e.g., with an emerging non volatile storage cell technology. Examples include Optane™ memory from Intel Corporation, QuantX™ from Micron Corporation, and/or other types of resistive non-volatile memory cells integrated amongst the interconnect wiring of a semiconductor chip (e.g., resistive random access memory (ReRAM), ferroelectric random access memory (FeRAM), spin transfer torque random access memory (STT-RAM), etc.). Mass storage 1120, at least, can also be composed of flash memory (e.g., NAND flash).
The computing system can further include L1, L2, L3 or even deeper CPU-level caches that have associated nearest neighbor search logic circuitry as described at length above. Conceivably, system memory could also perform nearest neighbor searching with similar circuitry (e.g., with a DRAM having crosswise bit lines as described at length above).
Each of the touchscreen display 1103, the communication interfaces 1104-1107, the GPS interface 1108, the sensors 1109, the camera(s) 1110, and the speaker/microphone codec 1113, 1114 all can be viewed as various forms of I/O (input and/or output) relative to the overall computing system including, where appropriate, an integrated peripheral device as well (e.g., the one or more cameras 1110). Depending on implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 1150 or may be located off the die or outside the package of the applications processor/multi-core processor 1150. The power management control unit 1112 generally controls the power consumption of the system 1100.
Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific/custom hardware components that contain hardwired logic circuitry or programmable logic circuitry (e.g., FPGA, PLD) for performing the processes, or by any combination of programmed computer components and custom hardware components.
Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other types of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.