The present disclosure is directed to a method for searching and retrieving relevant documents in a large database efficiently and effectively, where the search query is an entire textual document. The method exploits the Hilbert curve to map multidimensional semantic embedding vectors into a one-dimensional vector, thereby reducing search time. The proposed method comprises four stages: mapping semantic embedding vectors from m-dimensions to 1-dimension, building an index table, searching the index, and performing a filtration stage. This invention enables faster retrieval of similar documents to a query document from a very large database by reducing the time complexity and computational overhead, while maintaining high-quality search results.
Searching for the most relevant documents for a query document in a very large database of documents is a challenging problem because of the competing demands of quality and efficiency. To achieve the desired search quality, the documents must be matched semantically to ensure that the query and reference documents have similar meanings. Therefore, semantic embedding vectors for sentences must be considered in the matching process instead of a single embedding vector for the entire document. However, sequential matching of long embedding vectors is time-consuming; the Hilbert curve is therefore used to map each multidimensional vector to a single-dimensional value, which in turn accelerates the matching and searching process.
An object of the present disclosure is to provide a method and system for document searching that includes a pre-processing phase in which an index file is created for a very large database of documents. This index file consists of three attributes: the embedding vector number, the number of the document that contains the embedding vector, and a Hilbert number that corresponds to the embedding vector. A further object is to provide an online search phase in which a set of Hilbert numbers is generated for the set of embedding vectors included in a query document. A further object is to provide a method of performing a binary search for each Hilbert number in the index file.
An aspect of the present disclosure is a method for a textual document search engine, that can include initializing the textual document search engine by inputting, into a memory, a plurality of documents, wherein each document of the plurality of documents has a plurality of sentences, and each sentence of each document has an m-dimensional semantic embedding vector, where m is an integer greater than 1; mapping, via a processing circuitry, the m-dimensional semantic embedding vectors for each document of the plurality of documents to 1-dimensional Hilbert numbers using a Hilbert curve transformation; constructing, via the processing circuitry, an index table with the Hilbert numbers; and storing the index table in the memory.
A further aspect of the present disclosure is a method for searching a textual document database with an entire query document, that can include loading an index table into a memory, wherein the index table Ψ={ζ, η, ξ} is a triple-attribute table where ζ is an embedding vector number with respect to all embedding vectors of the document set D, η is the number of the document that contains embedding vector ζ, and ξ is the Hilbert number that corresponds to the vector of number ζ; inputting, into the memory, a query document, which has a plurality of embedding vectors corresponding to sentences in the query document, wherein the query embedding vectors are mapped into Hilbert numbers using the Hilbert curve transformation; and searching, via a processing circuitry, the index table using the Hilbert numbers and retrieving candidate documents that are similar to the query document based on the Hilbert numbers.
The foregoing general description of the illustrative embodiments and the following detailed description thereof are merely exemplary aspects of the teachings of this disclosure, and are not restrictive.
A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
In the drawings, like reference numerals designate identical or corresponding parts throughout the several views. Further, as used herein, the words “a,” “an” and the like generally carry a meaning of “one or more,” unless stated otherwise. The drawings are generally drawn to scale unless specified otherwise or illustrating schematic structures or flowcharts.
Furthermore, the terms “approximately,” “approximate,” “about,” and similar terms generally refer to ranges that include the identified value within a margin of 20%, 10%, or preferably 5%, and any values therebetween.
An aspect of the present disclosure is a method for efficiently and effectively searching and retrieving relevant documents in a large database. The method exploits the Hilbert curve to map multidimensional semantic embedding vectors into a one-dimensional vector. The method preferably includes four stages: mapping semantic embedding vectors from m-dimensions to 1-dimension, building an index table, searching the index, and performing a filtration stage. The method accomplishes faster retrieval of similar documents from a very large database by reducing the time complexity and computational overhead, while maintaining high-quality search results.
Algorithm Formulation
Let D={d1, d2, . . . , dn} be a set of n documents, where each document di={i, si} has a number i and a set si of semantic embedding vectors for all sentences in the document di. Let si={ei1, ei2, . . . , ei|si|} be the set of m-dimensional semantic embedding vectors of document di, where |si| is the number of sentences in di.
For purposes of this disclosure, a large database of documents may include a number n of documents on the order of several thousand to several million, where each document, including a query document, can have from a few hundred to several thousand sentences. Documents can range from stand-alone documents, journal articles, and conference papers to entire books.
The method of the present disclosure exploits the Hilbert curve to map multidimensional semantic embedding vectors into a 1-dimensional vector to reduce the searching time. Each embedding vector must be converted into a Hilbert number and this number is used as a search key in the index file.
A Hilbert curve (also known as the Hilbert space-filling curve) is a continuous fractal space-filling curve. Because it is space-filling, the classic two-dimensional curve has Hausdorff dimension 2 and its image is precisely the unit square; the construction generalizes to any number of dimensions. In the present disclosure, the Hilbert number of a point is the non-negative integer index (the distance along the curve) assigned to that point by the Hilbert curve transformation.
The index file Ψ={ζ, η, ξ} is a triple-attribute table, where ζ is the embedding vector number with respect to all vectors of D, η is the document number that contains embedding vector ζ, and ξ is the Hilbert number corresponding to the vector of number ζ. For a query document q, the set of embedding vectors is mapped into Hilbert numbers, and the search is performed for each Hilbert number alone so that the corresponding document of the best match can be retrieved.
The method includes four stages: mapping semantic embedding vectors from m-dimensions to 1-dimension, building an index table, searching the index, and a filtration stage. The different stages are elaborated below.
Mapping m-D to 1-D
In the first stage, a Hilbert curve is used for mapping semantic embedding vectors from m dimensions to one dimension. A Hilbert curve is a space-filling curve that maps every m-dimensional point to a 1-dimensional point. A feature of the Hilbert curve is that it preserves spatial locality: points that are close in m-dimensional space tend to receive nearby positions on the curve. Using this property, the distances along the Hilbert curve are used for indexing the database so that documents similar to the query can be retrieved more efficiently.
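By way of non-limiting illustration, the following minimal Python sketch shows this locality property on a small two-dimensional curve; the open-source hilbertcurve package used here is an assumption of the sketch and not part of the disclosure.

```python
# Minimal locality demonstration (assumes: pip install hilbertcurve).
from hilbertcurve.hilbertcurve import HilbertCurve

# 2-D curve with 3 bits per coordinate: coordinates range over 0..7,
# Hilbert numbers over 0..63.
curve = HilbertCurve(p=3, n=2)
for point in ([0, 0], [0, 1], [1, 1], [7, 7]):
    # distance_from_point returns the integer index along the curve.
    print(point, "->", curve.distance_from_point(point))
# Points that are adjacent in the plane tend to receive nearby indices,
# while distant points such as [7, 7] receive distant indices.
```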
To use a Hilbert curve for mapping an m-dimensional point to a 1-dimensional point, two parameters must be specified: the number of dimensions (m) and the number of bits for each coordinate value (p), where both must be positive integers. For example, if the number of dimensions is 5 and the number of bits is 10, the maximum coordinate value of the embedding vectors is 2^10 − 1 = 1023 and the maximum number on the Hilbert curve is 2^50 − 1.
First, the negative real embedding vector values must be converted into positive integers. Two operations are performed on the coordinate values: shifting and scaling. All coordinate values are shifted by the absolute value of the global minimum coordinate value (α), and then all coordinate values are converted to integers by scaling and rounding (see the accompanying drawings).
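By way of non-limiting example, the following Python sketch combines the shift, scale, and Hilbert mapping operations described above; the hilbertcurve package, the function name to_hilbert_numbers, and the particular choice of scale factor are illustrative assumptions rather than the disclosed implementation.

```python
# Sketch of the m-D -> 1-D mapping stage (assumes numpy and the
# hilbertcurve package; names and scale factor are illustrative).
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve

def to_hilbert_numbers(vectors: np.ndarray, p: int = 10) -> list:
    """Map real-valued m-dimensional embedding vectors to Hilbert numbers."""
    m = vectors.shape[1]
    # Shift by the absolute value of the global minimum (alpha) so that
    # every coordinate becomes non-negative.
    alpha = abs(vectors.min())
    shifted = vectors + alpha
    # Scale into [0, 2^p - 1] and round to non-negative integers.
    top = shifted.max()
    scale = (2 ** p - 1) / top if top > 0 else 1.0
    points = np.rint(shifted * scale).astype(int)
    # p bits per coordinate, m dimensions.
    curve = HilbertCurve(p, m)
    return [curve.distance_from_point(pt.tolist()) for pt in points]

# Example: three 5-dimensional sentence embeddings.
rng = np.random.default_rng(0)
print(to_hilbert_numbers(rng.normal(size=(3, 5))))
```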
Building Index
In this stage, the index table Ψ is created and sorted according to the string representation of the Hilbert numbers. It includes three attributes: id, doc_no, and Hilbert_no (see the accompanying drawings).
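A non-limiting sketch of this index-building stage follows; the attribute names id, doc_no, and Hilbert_no follow the disclosure, while the CSV file layout and the zero-padding scheme for the string Hilbert numbers are illustrative assumptions.

```python
# Sketch of building and persisting the index table Psi: one row per
# sentence embedding, sorted by Hilbert number (stored as strings).
import csv

def build_index(doc_hilbert_numbers, path="index_psi.csv"):
    """doc_hilbert_numbers: iterable of (doc_no, hilbert_numbers) pairs,
    one pair per document, one Hilbert number per sentence."""
    rows, vec_id = [], 0
    for doc_no, numbers in doc_hilbert_numbers:
        for h in numbers:
            rows.append((vec_id, doc_no, str(h)))
            vec_id += 1
    # Zero-pad so that lexicographic order of the string Hilbert numbers
    # matches numeric order, then sort the table on that key.
    width = max(len(r[2]) for r in rows)
    rows = [(i, d, h.zfill(width)) for i, d, h in rows]
    rows.sort(key=lambda r: r[2])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(("id", "doc_no", "Hilbert_no"))
        writer.writerows(rows)
    return rows
```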
Performing a search using the index with a binary search algorithm is more efficient than performing a direct sequential search for the nearest similar document; in general, a binary search is much faster than a sequential search.
Since the query document has many embedding vectors, a list of candidate similar documents is created by searching for each embedding vector separately. The next subsection describes the searching process in detail.
Searching Process
Once the index table Ψ 106 is loaded into memory and a new query document 102 is entered, in 104, the query embedding vectors are mapped into Hilbert numbers (see the accompanying drawings).
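By way of non-limiting example, the following sketch outlines the search stage with frequency ranking: each query Hilbert number is located in the sorted index by binary search, the τ neighbors on each side are taken as candidates, and candidate documents are ranked by how many query sentences vote for them. The parameter names tau (the cutoff τ) and top_k are illustrative assumptions, and the index rows are assumed to be sorted on a consistently comparable Hilbert-number key (e.g., zero-padded strings or integers).

```python
# Sketch of the online search stage with frequency ranking.
import bisect
from collections import Counter

def search_index(rows, query_numbers, tau=50, top_k=10):
    """rows: index table sorted by Hilbert number, as (id, doc_no, hilbert_no).
    query_numbers: Hilbert numbers of the query document's sentences."""
    keys = [r[2] for r in rows]
    votes = Counter()
    for h in query_numbers:
        pos = bisect.bisect_left(keys, h)            # binary search step
        lo, hi = max(0, pos - tau), min(len(rows), pos + tau)
        # The tau left and tau right neighbors each vote for their document;
        # a set ensures one vote per document per query sentence.
        votes.update({rows[i][1] for i in range(lo, hi)})
    # Frequency ranking: documents named by the most query sentences win.
    return [doc for doc, _ in votes.most_common(top_k)]
```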
Time and Space Complexity Analysis
The time complexity of building the index table Ψ in Algorithm-1 can be analyzed as follows. Lines 4-10 consume O(ns) time, where n is the number of database documents and s is the average number of statements in each document; it is assumed that all documents, in both the database and the query, have the same number of statements. Lines 18-28 have time complexity O(nsmr), where m is the size of the embedding vectors and r is the number of bits used to represent vector values. Lines 29-30 consume O(ns) time. Line 31 represents the sorting of the index file of size ns and consumes O(ns log(ns)). Therefore, the total time complexity of building the index table in Algorithm-1 is O(ns log(ns) + nsmr), while the space complexity of Algorithm-1 is O(ns), the size of the index table.
The time complexity of the searching process in Algorithm-2 can be summarized as follows. Lines 4-5 take O(m) time; line 6 (Hilbert mapping) has time complexity O(mr); line 7 calls the binary search algorithm, Algorithm-3, which consumes O(log(ns)) time; line 8 calls Algorithm-4, which retrieves the τ right nearest neighbor documents and the τ left nearest neighbor documents having Hilbert numbers similar to the current Hilbert number i at position p of the index Ψ. Assuming that the statements of each document are contiguous in the index table, the time complexity of Algorithm-4 is O(sτ²). Line 9 appends a list of at most 2τ entries to another list Ĺ without duplication, with time complexity O(sτ²). According to the for loop in line 3, lines 4-9 are repeated s times, so the total time complexity of lines 3-9 is O(s(m + mr + log(ns) + sτ²)). Finally, line 10 calls Algorithm-5 to process the candidate documents by counting the frequency of each document number in list Ĺ among its sub-lists, and this process costs O(s²τ²). As a result, the overall time complexity of the searching process described in Algorithm-2 is O(smr + s log(ns) + s²τ²). The space complexity of Algorithm-2 is O(sτ).
Comparing the time complexity of building the index table with that of searching for the input query document shows that the searching process is very fast in contrast to the time needed to build the index table, whereas the search time depends heavily on the average number of statements in each document and the user cutoff variable τ.
Experiments were conducted on a laptop with an Intel® Core™ i5-2410M CPU running at 2.3 GHz and 4 GB RAM. In the experiments, the scalability of the proposed methods was evaluated on databases with different numbers of documents, from 50 to 400 in steps of 50. The elapsed time in seconds is reported for querying 20 documents (see Table 1), and the top 10 similar documents were retrieved.
As shown in Table 1, the brute force search method, which measures the Euclidean distance between embedding vectors, is the worst method for retrieving similar documents. As listed in the table, the elapsed time is 2335 seconds for querying 20 documents in a database of 50 records (i.e., about 117 seconds for querying a single record). The dashed line in Table 1 indicates that the elapsed time exceeds 3 hours, which makes the method not scalable and thus impractical for very large databases. In contrast, the dimensionality reduction methods, random projection and statistical reduction, have more efficient running times while preserving the quality of results, and produce the same set of similar records. The Hilbert reduction method is more efficient than random projection and statistical reduction because the Hilbert curve mapping reduces the long embedding vector to a single value.
From a theoretical point of view, the statistical reduction method is faster than random projection. To reduce n embedding vectors from length m to length k, the reduction process of random projection has time complexity O(nmk), while statistical reduction takes approximately 4nm operations, i.e., O(nm).
Although the three document search methods, random projection, statistical reduction, and Hilbert reduction, are fast, they perform a sequential search, which is slower than a binary search. Given the low efficiency of sequential search, indexing the database is of interest, especially with very large databases that prohibit searching entirely in RAM. Indexing a very large database enables fast, indirect access to similar records, and binary search can be exploited to accelerate the matching process. As shown in Table 1, an indexing method using a Hilbert curve is used with two types of evaluation for candidate similar records: frequency ranking, and Euclidean distance with greedy similarity matching. The Hilbert frequency ranking is much faster than the Hilbert similarity matching; however, the latter method is of higher quality in retrieving the correct records, as measured against the naïve method.
The scalability of the indexing process on very large databases is of particular interest. Note that the indexing process is the same for both methods, frequency and similarity; they differ only in the evaluation process. The indexing time is therefore reported for one method.
As illustrated in the accompanying drawings, the indexing time grows linearly with the number of database records n and is fitted by:
Indexing time in seconds = 28.945 × (n/50) + 15.977
As a consequence, for databases of 1000, 10000, 100000, and 1000000 records, the expected indexing times are 595, 5805, 57906, and 578916 seconds, respectively.
As listed in Table 1, searching the index table of 400 records for 20 query documents and verifying candidate similar documents using frequency ranking consumes less than a second (0.333 seconds), while verifying candidate records with similarity ranking consumes 355 seconds (about 18 seconds for a single query record). For different database sizes, the searching time is nearly constant with both evaluation methods, which is a promising indicator for very large index tables.
Table 2 (confusion matrix) describes the primitive performance metrics, true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
The performance of the Hilbert reduction method is measured using Accuracy, Precision, Recall, and F1-score, computed from the confusion-matrix counts as follows:

Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1-score = 2 × (Precision × Recall)/(Precision + Recall)
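For concreteness, the four measures can be computed directly from the confusion-matrix counts, as in the following non-limiting sketch (the example counts are arbitrary and do not reproduce any table in this disclosure):

```python
# Standard confusion-matrix measures; the sample counts are arbitrary.
def measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1-score": f1}

print(measures(tp=20, tn=60, fp=5, fn=5))
```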
Table 3 (confusion matrix) shows the primitive performance metrics for the top 25 similar records retrieved from a database of 100 records by the Hilbert reduction method when setting cutoff = 50, compared with the ground truth, the naïve method.
The confusion matrix for the performance of the random projection method is shown in Table 4.
The confusion matrix for the performance of the statistical reduction method is shown in Table 5.
The confusion matrix for the performance of the Hilbert indexing with similarity method is shown in Table 6.
The performance measures are listed in Table 7 for the best methods, omitting the Hilbert indexing with frequency method due to its poor performance.
The accuracy is computed for the top 25 similar records with cutoff = 50. As seen from Table 7, the Hilbert indexing-similarity method is competitive with random projection and statistical reduction in quality while being more efficient. As listed in Table 1, with a database of 400 records, the Hilbert-indexing-similarity method takes 600 seconds to build the index table and search, while the random projection method takes 1566 seconds.
The accuracy of Hilbert-indexing-similarity can be improved by increasing the cutoff. In an embodiment, the efficiency of this method can be enhanced by reducing the dimensions before mapping with the Hilbert curve, because much of the time is consumed by the Hilbert mapping function. Therefore, it is preferable to perform an appropriate reduction before mapping.
In S902, the method begins with four reference documents and one query document. An example of the documents with their extracted embedding vectors is shown in the accompanying drawings.
In S904, each m-dimensional vector is transformed into a 1-dimensional vector.
In this example, the steps follow the flowchart in the accompanying drawings.
In S906, a Hilbert curve transformation is performed to obtain a single-value embedding as the 1-dimensional vector, as shown in the accompanying drawings.
In S908, similarities are computed.
The problem now is how to process these big numbers. The method needs to find the difference between every two numbers in order to compute the similarity.
The method can be summarized by the following steps.
Now, in S910, new_code = previous_code + (max_length − j) × (difference at the mismatch position)/max_length.
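Since the exact semantics of the mismatch position j and of the "difference at the mismatch position" are not spelled out above, the following Python sketch is a hedged interpretation of the S910 update: the two Hilbert numbers are compared digit by digit as strings, and the first mismatching digit determines the increment, with more significant positions weighted more heavily.

```python
# Hedged interpretation of the S910 update rule; the digit-by-digit
# comparison and the meaning of "difference" are assumptions.
def relative_code(previous_code: float, a: str, b: str) -> float:
    """Update the running code from the first digit mismatch between two
    Hilbert numbers given as decimal strings."""
    max_length = max(len(a), len(b))
    a, b = a.zfill(max_length), b.zfill(max_length)
    for j in range(max_length):
        if a[j] != b[j]:
            difference = abs(int(a[j]) - int(b[j]))
            return previous_code + (max_length - j) * difference / max_length
    return previous_code  # identical numbers leave the code unchanged
```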
Now, in S912, it is enough to multiply each number by 100, as shown in the accompanying drawings.
Now, in S914, the new numbers are assigned to the document sentences as shown in the accompanying drawings.
To compute the semantic similarity between Qry-D1 and Ref-D1, in S916, the similarity is computed as shown in the accompanying drawings.
Then similarity between Qry-D1 and Ref-D1=0.80.
The computing hardware implements the methods for semantic search according to an exemplary aspect of the disclosure. The methods for semantic search may be a software program executed on the computer system 2000. In some embodiments, the methods for semantic search may be a computer program stored on a computer readable storage medium for execution on the computer system 2000. The computer readable storage medium may be any built-in storage medium (hard drive, solid state drive) or removable storage medium (DVD or the like). The computer system 2000 may be any general purpose computer, such as a laptop computer, desktop computer, or workstation, running an operating system, for example Ubuntu Linux OS, Windows, a version of Unix OS, or Mac OS. The computer system 2000 may include processing circuitry implementing one or more central processing units (CPU) 2050 having multiple cores. In some embodiments, the computer system 2000 may include a graphics board 2012 having multiple GPUs, each GPU having GPU memory. The graphics board 2012 may perform many of the mathematical operations of the disclosed machine learning methods. The computer system 2000 includes main memory 2002, typically random access memory (RAM), which contains the software being executed by the processing cores 2050 and GPUs 2012, as well as a non-volatile storage device 2004 for storing data and the software programs. Several interfaces for interacting with the computer system 2000 may be provided, including an I/O Bus Interface 2010, Input/Peripherals 2018 such as a keyboard, touch pad, and mouse, a Display Adapter 2016 and one or more Displays 2008, and a Network Controller 2006 to enable wired or wireless communication through a network 99. The interfaces, memory, and processors may communicate over the system bus 2026. The computer system 2000 includes a power supply 2021, which may be a redundant power supply.
Numerous modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.
The present application claims the benefit of priority to U.S. Provisional Application No. 63/514,579 filed Jul. 20, 2023, the entire contents of which are incorporated herein by reference.