The present invention relates to the field of data searching, and particularly to improving efficiency and throughput of edit distance searching.
Fuzzy string searching (also referred to as approximate string searching) plays an increasingly important role in the modern big data era. Fuzzy string searching aims to find strings that approximately match a given pattern. Different metrics can be used to quantify the proximity between two strings, among which the edit distance (or Levenshtein distance) is the most widely used. The edit distance between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. Calculation of the edit distance is, however, computationally intensive: given two strings of lengths m and n, the computational complexity of an edit distance calculation is O(m·n).
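By way of illustration only, the edit distance defined above may be computed with the textbook dynamic program sketched below. The function name `edit_distance` is hypothetical and does not appear in the disclosure; the nested loops make the O(m·n) cost visible.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (illustrative sketch)."""
    # prev[j] is the distance between the first i-1 items of a and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i items of a into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep on a match
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory grows with the length of one string while the running time still grows with the product of both lengths.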
In general, given the search pattern p and the edit distance d, the objective of a fuzzy string search is to find all the strings whose edit distance from the search pattern p is no more than d. The most straightforward approach is to directly calculate the edit distance between p and all the strings in a brute-force manner, which is however subject to very high computational complexity. There are two options to reduce the computational complexity. In a first approach, one could pre-process all the strings to build indexed data structures (e.g., suffix trees), which could significantly reduce the fuzzy search computational complexity. However, the pre-processing stage also tends to have very high computational complexity. Hence this option is suitable only for scenarios where the same content will be searched for many times with different search patterns.
In a second approach, one could employ a two-stage search process to reduce the overall search computational complexity: the first stage carries out simple exact matching to filter out most strings that are guaranteed not to match, and the second stage carries out edit distance calculations on the strings that remain after the first stage. This method has been used in the well-known open-source fuzzy search software tool called agrep. The efficiency of this method, however, quickly degrades as the edit distance d increases. As a result, conventional CPU-based edit distance searching can only work well for relatively small values of d (e.g., 2 or 3).
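A minimal sketch of such a two-stage search follows. It assumes one common exact-matching filter, pigeonhole partitioning, in which the pattern is split into d+1 pieces and a string can only lie within d edits of the pattern if it contains at least one piece unchanged. This filter is an assumption chosen for illustration and is not asserted to be the filter used by agrep; the names `split_pattern` and `two_stage_search` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (same dynamic program as any textbook version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def split_pattern(pattern, d):
    """Split pattern into d+1 contiguous pieces of near-equal length."""
    base, extra = divmod(len(pattern), d + 1)
    pieces, start = [], 0
    for i in range(d + 1):
        size = base + (1 if i < extra else 0)
        pieces.append(pattern[start:start + size])
        start += size
    return pieces


def two_stage_search(pattern, strings, d):
    """Return the strings whose edit distance from pattern is no more than d."""
    pieces = split_pattern(pattern, d)
    # Stage 1: cheap exact matching. d edits can touch at most d of the d+1
    # pieces, so a true match must contain at least one piece verbatim.
    # (An empty piece, which arises when len(pattern) <= d, matches everything.)
    survivors = [s for s in strings if any(piece in s for piece in pieces)]
    # Stage 2: full edit distance calculation on the survivors only.
    return [s for s in survivors if edit_distance(pattern, s) <= d]
```

As d grows, the pieces become shorter and occur by chance more often, so fewer strings are filtered out in the first stage, which is consistent with the degradation described above.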
Accordingly, embodiments of the present disclosure are directed to systems and methods for improving the efficiency and throughput in the realization of edit distance searching. Aspects of this invention aim to improve the edit distance search throughput for large values of d (e.g., 6 and above) by leveraging hybrid CPU/FPGA computing platforms.
A first aspect provides a hybrid system for performing fuzzy string searches, comprising: an FPGA (field programmable gate array) appliance, having: a data input manager that receives an m-byte input pattern and loads an n-byte substring of the m-byte input pattern into a first set of registers, and streams input strings of searchable data through a second set of registers; an edit distance calculation engine having an array of processing elements (PEs) implemented using FPGAs coupled to the first and second set of registers, wherein the array of PEs calculates an edit distance for each input string of searchable data relative to the n-byte substring; and an output manager that identifies matching input strings having an edit distance less than a threshold, and forwards matching input strings to a CPU for software-based edit distance processing relative to the m-byte input pattern.
A second aspect provides a method for performing fuzzy string searches, comprising: receiving at an FPGA (field programmable gate array) appliance an m-byte input pattern to be searched for, and loading an n-byte substring of the m-byte input pattern into a first set of registers; streaming input strings of searchable data through a second set of registers; calculating an edit distance for each input string of searchable data relative to the n-byte substring using an edit distance calculation engine having an array of processing elements (PEs) implemented using FPGAs coupled to the first and second set of registers; and identifying matching input strings having an edit distance less than a threshold, and forwarding matching input strings to a CPU for software-based edit distance processing relative to the m-byte input pattern.
A third aspect provides an FPGA (field programmable gate array) appliance for performing fuzzy string searches, comprising: a data input manager that loads an n-byte input pattern into a first set of registers, and streams input strings of searchable data through a second set of registers; an edit distance calculation engine having an array of processing elements (PEs) implemented using FPGAs coupled to the first and second set of registers, wherein the array of PEs calculates an edit distance for each input string of searchable data relative to the n-byte input pattern, wherein the edit distance calculation engine is implemented with a parallel architecture that utilizes an array of n by (n+k+t) PEs, wherein n is a number of bytes that can be stored in the first set of registers, k is a maximum edit distance and t is a parallelism factor, and wherein each of the second set of registers is segmented such that each segment is configured to hold t bytes; and an output manager that identifies matching input strings having an edit distance less than a threshold.
The numerous advantages of the present invention may be better understood by those skilled in the art by reference to the accompanying figures in which:
Reference will now be made in detail to the presently preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings.
Shown in
FPGA appliance 18 generally comprises a data input manager 20 that receives a pattern to be searched for (i.e., an input pattern) and loads it into a first set of registers in the edit distance calculation engine 22, and that receives the data to be searched (i.e., searchable data) and streams it into a second set of registers in the edit distance calculation engine 22. During each clock cycle, a new byte of searchable data is loaded into the right-most register, and data from the other registers are shifted left. Data from the left-most register is removed. In this manner, a new string can be searched during each clock cycle. Searchable data may come from flash memory 12 or from another source, e.g., storage devices in a data center (not shown). Edit distance calculation engine 22 includes an array of hardware processing elements (PEs) arranged in a parallel architecture to generate edit distance calculations for the stream of inputted searchable data. Data output manager 24 receives the distance calculations and, e.g., filters them based on a predetermined threshold to identify matches.
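The streaming behavior described above may be modeled in software as follows. This is a behavioral sketch only, not a description of the hardware: it assumes one byte enters per clock cycle, models the second set of registers as a fixed-width window, and compares the pattern against the whole window contents, which simplifies whatever comparison the PE array actually performs. The names `stream_matches`, `width` and `threshold` are hypothetical.

```python
from collections import deque


def edit_distance(a, b):
    """Levenshtein distance; stands in for the PE array in this model."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def stream_matches(pattern, data, width, threshold):
    """Yield (start offset, window, distance) for windows below the threshold."""
    # Model of the second set of registers: appending on the right shifts the
    # other entries left, and the left-most entry is dropped once the window is full.
    registers = deque(maxlen=width)
    for cycle, byte in enumerate(data):
        registers.append(byte)          # one new byte per modeled clock cycle
        if len(registers) < width:
            continue                    # registers are still filling
        window = bytes(registers)
        distance = edit_distance(pattern, window)
        if distance < threshold:        # output manager: keep distances below the threshold
            yield cycle - width + 1, window, distance
```

For example, `list(stream_matches(b"hello", b"xxhelloxxhallo", 5, 2))` reports the windows starting at offsets 2 and 9, since each shift of the window presents a new candidate string for comparison.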
As described in further detail below, FPGA appliance 18 may be utilized as a preprocessing operation to search for a substring (e.g., the first n bytes) of an m-byte input pattern. Searchable input strings that result in a match by the preprocessing operation can then be fully evaluated against the full m-byte input pattern by edit distance calculation software 16 on the host 14. A learning system may further be utilized to select the optimal portion of the m-byte input pattern to be used by the edit distance calculation engine 22 during the preprocessing operation.
Calculation of edit distance can be formulated as a recursive computation through dynamic programming. This naturally matches to a two-dimensional data flow diagram, as illustrated in
Note that the straightforward implementation of FPGA-based edit distance calculation as illustrated in
As shown in
A second issue of the system shown in
An example of a hybrid embodiment is shown in
The FPGA-based edit distance calculation engine 18 contains an array of n by (n+r) PEs. To support a fuzzy search against an input pattern with the length of m>n and edit distance of k≤r, a length-n substring (Pn) is chosen from the complete length-m input pattern (Pm) as the partial search pattern being held in the FPGA appliance 18. The FPGA appliance is configured to receive (r−k) bytes of search data 48 per clock cycle, i.e., the total (n+r) registers are partitioned into a number of segments of (r−k) bytes each, and during each clock cycle the content of one segment is moved to the next segment. Once the FPGA appliance finds a match, it sends the matched content to the CPU 50 for further processing, where the CPU 50 calculates the edit distance against the full length-m search pattern.
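The division of labor between the FPGA appliance and the CPU may be sketched in software as shown below. The sketch makes several assumptions that are not taken from the disclosure: the hardware stage is emulated by an approximate substring match (a Sellers-style dynamic program whose first row is zero so that a match may begin anywhere), Pn is taken to be the first n characters of Pm, and a record is accepted when its distance is no more than k. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Full Levenshtein distance, as the CPU stage would compute in software."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def min_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)        # row 0 is all zeros: free starting position
    for i, pc in enumerate(pattern, start=1):
        curr = [i]
        for j, tc in enumerate(text, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (pc != tc)))
        prev = curr
    return min(prev)                    # free ending position


def prefilter(pn, records, k):
    """Emulated hardware stage: keep records in which Pn occurs within k edits."""
    return [r for r in records if min_substring_distance(pn, r) <= k]


def hybrid_search(pm, n, k, records):
    """Filter on the length-n partial pattern, then verify against the full Pm."""
    pn = pm[:n]                         # assumption: Pn is the first n characters of Pm
    candidates = prefilter(pn, records, k)
    # Software stage: only candidates incur the full O(m * len(record)) calculation.
    return [r for r in candidates if edit_distance(pm, r) <= k]
```

Because Pn is a substring of Pm, any record within k edits of Pm necessarily contains a region within k edits of Pn, so under these assumptions the first stage can produce false positives but cannot discard a true match.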
As illustrated in
In the example of
It is understood that the FPGA appliance 18 may be implemented in any manner, e.g., as an integrated circuit board or a controller card that includes a processing core, I/O and processing logic. Aspects of the processing logic may be implemented in hardware, software, or a combination thereof. For example, aspects of the processing logic may be implemented using field programmable gate arrays (FPGAs), ASIC devices, or other hardware-oriented systems.
Other aspects, such as I/O, may be implemented with a computer program product stored on a computer readable storage medium. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, etc. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Python, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by hardware and/or computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The foregoing description of various aspects of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to an individual in the art are included within the scope of the invention as defined by the accompanying claims.
This application claims priority to co-pending provisional application entitled, HYBRID SOFTWARE-HARDWARE IMPLEMENTATION OF EDIT DISTANCE SEARCH, Ser. No. 62/517,880, filed on Jun. 10, 2017, the contents of which are hereby incorporated by reference.
Number | Date | Country
---|---|---
62517880 | Jun 2017 | US