The present disclosure is directed to a method for searching and retrieving relevant documents in a large database efficiently and effectively, where the search query is an entire textual document. The method exploits the Hilbert curve to map multidimensional semantic embedding vectors into a one-dimensional vector, thereby reducing search time. The proposed method comprises four stages: mapping semantic embedding vectors from m-dimensions to 1-dimension, building an index table, searching the index, and performing a filtration stage. This invention enables faster retrieval of similar documents to a query document from a very large database by reducing the time complexity and computational overhead, while maintaining high-quality search results.
Searching for the most relevant documents for a query document in a very large database of documents is a challenging problem because of the competing demands of quality and efficiency. To achieve the desired search quality, the documents must be matched semantically to ensure that the query and reference documents have similar meanings. Therefore, semantic embedding vectors for sentences must be considered in the matching process instead of a single embedding vector for the entire document. However, sequential matching of long embedding vectors is time-consuming; the Hilbert curve is therefore used to map each multidimensional vector to a single-dimensional value, which in turn accelerates the matching and searching process.
An object of the present disclosure is to provide a method and system for document searching that includes a pre-processing phase in which an index file is created for a very large database of documents. This index file consists of three attributes: the embedding vector number, the number of the document that contains the embedding vector, and a Hilbert number that corresponds to the embedding vector. A further object is to provide an online search phase in which a set of Hilbert numbers is generated for the set of embedding vectors included in a query document. A further object is to provide a method of performing a binary search for each Hilbert number in the index file.
An aspect of the present disclosure is a method for a textual document search engine, that can include initializing the textual document search engine by inputting, into a memory, a plurality of documents, wherein each document of the plurality of documents has a plurality of sentences, and each sentence of each document has an m-dimensional semantic embedding vector, where m is an integer greater than 1; mapping, via a processing circuitry, the m-dimensional semantic embedding vectors for each document of the plurality of documents to 1-dimensional Hilbert numbers using a Hilbert curve transformation; constructing, via the processing circuitry, an index table with the Hilbert numbers; and storing the index table in the memory.
A further aspect of the present disclosure is a method for searching a textual document database with an entire query document, that can include loading an index table into a memory, wherein the index table Ψ={ζ, η, ξ} is a triple-attribute table where ζ is an embedding vector number with respect to all embedding vectors of the document set D, η is the number of the document that contains embedding vector ζ, and ξ is the Hilbert number that corresponds to the vector of number ζ; inputting, into the memory, a query document, which has a plurality of embedding vectors corresponding to sentences in the query document, wherein the query embedding vectors are mapped into Hilbert numbers using the Hilbert curve transformation; and searching, via a processing circuitry, the index table using the Hilbert numbers and retrieving candidate documents that are similar to the query document based on the Hilbert numbers.
The foregoing general description of the illustrative embodiments and the following detailed description thereof are merely exemplary aspects of the teachings of this disclosure, and are not restrictive.
A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
In the drawings, like reference numerals designate identical or corresponding parts throughout the several views. Further, as used herein, the words “a,” “an” and the like generally carry a meaning of “one or more,” unless stated otherwise. The drawings are generally drawn to scale unless specified otherwise or illustrating schematic structures or flowcharts.
Furthermore, the terms “approximately,” “approximate,” “about,” and similar terms generally refer to ranges that include the identified value within a margin of 20%, 10%, or preferably 5%, and any values therebetween.
An aspect of the present disclosure is a method for efficiently and effectively searching and retrieving relevant documents in a large database. The method exploits the Hilbert curve to map multidimensional semantic embedding vectors into a one-dimensional vector. The method preferably includes four stages: mapping semantic embedding vectors from m-dimensions to 1-dimension, building an index table, searching the index, and performing a filtration stage. The method accomplishes faster retrieval of similar documents from a very large database by reducing the time complexity and computational overhead, while maintaining high-quality search results.
Algorithm Formulation
Let D={d1, d2, . . . , dn} be a set of n documents, where each document di={i, si} has a number i and a set si of semantic embedding vectors for all sentences in the document di. Let si={ei1, ei2, . . . , ei|si|} be the set of m-dimensional semantic embedding vectors of document di, where |si| is the number of sentences in di.
For purposes of this disclosure, a large database of documents may include a number n of documents on the order of several thousand to several million, where each document, including a query document, can have from a few hundred to several thousand sentences. Documents can range from stand-alone documents, journal articles, and conference papers to entire books.
The method of the present disclosure exploits the Hilbert curve to map multidimensional semantic embedding vectors into a 1-dimensional vector to reduce the searching time. Each embedding vector must be converted into a Hilbert number and this number is used as a search key in the index file.
A Hilbert curve (also known as the Hilbert space-filling curve) is a continuous fractal space-filling curve. Because it is space-filling, the classic two-dimensional curve has Hausdorff dimension 2 and its image is precisely the unit square; the construction generalizes to any number of dimensions. In the present disclosure, the Hilbert number of a point is the non-negative integer index (the distance along the curve) assigned to that point by the Hilbert curve transformation.
The index file Ψ={ζ, η, ξ} is a triple-attribute table, where ζ is the embedding vector number with respect to all vectors of D, η is the document number that contains embedding vector ζ, and ξ is the Hilbert number corresponding to the vector of number ζ. For a query document q, the set of embedding vectors is mapped into Hilbert numbers, and the search is performed for each Hilbert number alone so that the corresponding document of the best match can be retrieved.
The method includes four stages: mapping semantic embedding vectors from m-dimensions to 1-dimension, building an index table, searching the index, and a filtration stage. The different stages are elaborated below.
Mapping m-D to 1-D
In the first stage, a Hilbert curve is used for mapping semantic embedding vectors from m dimensions to one dimension. A Hilbert curve is a space-filling curve that maps every m-dimensional point to a 1-dimensional point. A feature of the Hilbert curve is that it preserves spatial locality: points that are close in m-dimensional space tend to receive nearby positions on the curve. Using this property, the distances along the Hilbert curve are used for indexing the database so that documents similar to the query can be retrieved more efficiently.
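By way of non-limiting illustration, the following minimal Python sketch shows this locality property on a small two-dimensional curve; the open-source hilbertcurve package used here is an assumption of the sketch and not part of the disclosure.

```python
# Minimal locality demonstration (assumes: pip install hilbertcurve).
from hilbertcurve.hilbertcurve import HilbertCurve

# 2-D curve with 3 bits per coordinate: coordinates range over 0..7,
# Hilbert numbers over 0..63.
curve = HilbertCurve(p=3, n=2)
for point in ([0, 0], [0, 1], [1, 1], [7, 7]):
    # distance_from_point returns the integer index along the curve.
    print(point, "->", curve.distance_from_point(point))
# Points that are adjacent in the plane tend to receive nearby indices,
# while distant points such as [7, 7] receive distant indices.
```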
To use a Hilbert curve for mapping an m-dimensional point to a 1-dimensional point, two parameters must be specified: the number of dimensions (m) and the number of bits for each coordinate value (p), where both must be positive integers. For example, if the number of dimensions is 5 and the number of bits is 10, the maximum coordinate value of the embedding vectors is 2^10 − 1 = 1023 and the maximum number on the Hilbert curve is 2^50 − 1.
First, the negative real embedding vector values must be converted into positive integers. Two operations are performed on the coordinate values: shifting and scaling. All coordinate values are shifted by the absolute value of the global minimum coordinate value (α), and then all coordinate values are converted to integers by scaling and rounding (see the accompanying drawings).
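By way of non-limiting example, the following Python sketch combines the shift, scale, and Hilbert mapping operations described above; the hilbertcurve package, the function name to_hilbert_numbers, and the particular choice of scale factor are illustrative assumptions rather than the disclosed implementation.

```python
# Sketch of the m-D -> 1-D mapping stage (assumes numpy and the
# hilbertcurve package; names and scale factor are illustrative).
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve

def to_hilbert_numbers(vectors: np.ndarray, p: int = 10) -> list:
    """Map real-valued m-dimensional embedding vectors to Hilbert numbers."""
    m = vectors.shape[1]
    # Shift by the absolute value of the global minimum (alpha) so that
    # every coordinate becomes non-negative.
    alpha = abs(vectors.min())
    shifted = vectors + alpha
    # Scale into [0, 2^p - 1] and round to non-negative integers.
    top = shifted.max()
    scale = (2 ** p - 1) / top if top > 0 else 1.0
    points = np.rint(shifted * scale).astype(int)
    # p bits per coordinate, m dimensions.
    curve = HilbertCurve(p, m)
    return [curve.distance_from_point(pt.tolist()) for pt in points]

# Example: three 5-dimensional sentence embeddings.
rng = np.random.default_rng(0)
print(to_hilbert_numbers(rng.normal(size=(3, 5))))
```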
Building Index
In this stage, the index table Ψ is created and sorted according to the string representation of the Hilbert numbers. It includes three attributes: id, doc_no, and Hilbert_no (see the accompanying drawings).
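A non-limiting sketch of this index-building stage follows; the attribute names id, doc_no, and Hilbert_no follow the disclosure, while the CSV file layout and the zero-padding scheme for the string Hilbert numbers are illustrative assumptions.

```python
# Sketch of building and persisting the index table Psi: one row per
# sentence embedding, sorted by Hilbert number (stored as strings).
import csv

def build_index(doc_hilbert_numbers, path="index_psi.csv"):
    """doc_hilbert_numbers: iterable of (doc_no, hilbert_numbers) pairs,
    one pair per document, one Hilbert number per sentence."""
    rows, vec_id = [], 0
    for doc_no, numbers in doc_hilbert_numbers:
        for h in numbers:
            rows.append((vec_id, doc_no, str(h)))
            vec_id += 1
    # Zero-pad so that lexicographic order of the string Hilbert numbers
    # matches numeric order, then sort the table on that key.
    width = max(len(r[2]) for r in rows)
    rows = [(i, d, h.zfill(width)) for i, d, h in rows]
    rows.sort(key=lambda r: r[2])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(("id", "doc_no", "Hilbert_no"))
        writer.writerows(rows)
    return rows
```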
Performing a search using the index with a binary search algorithm is more efficient than performing a direct sequential search for the nearest similar document; in general, a binary search is much faster than a sequential search.
Since the query document has many embedding vectors, a list of candidate similar documents is created by searching for each embedding vector separately. The next subsection describes the searching process in detail.
Searching Process
Once the index table Ψ 106 is loaded into memory and a new query document 102 is entered, in 104, the query embedding vectors are mapped into Hilbert numbers (see the accompanying drawings).
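By way of non-limiting example, the following sketch outlines the search stage with frequency ranking: each query Hilbert number is located in the sorted index by binary search, the τ neighbors on each side are taken as candidates, and candidate documents are ranked by how many query sentences vote for them. The parameter names tau (the cutoff τ) and top_k are illustrative assumptions, and the index rows are assumed to be sorted on a consistently comparable Hilbert-number key (e.g., zero-padded strings or integers).

```python
# Sketch of the online search stage with frequency ranking.
import bisect
from collections import Counter

def search_index(rows, query_numbers, tau=50, top_k=10):
    """rows: index table sorted by Hilbert number, as (id, doc_no, hilbert_no).
    query_numbers: Hilbert numbers of the query document's sentences."""
    keys = [r[2] for r in rows]
    votes = Counter()
    for h in query_numbers:
        pos = bisect.bisect_left(keys, h)            # binary search step
        lo, hi = max(0, pos - tau), min(len(rows), pos + tau)
        # The tau left and tau right neighbors each vote for their document;
        # a set ensures one vote per document per query sentence.
        votes.update({rows[i][1] for i in range(lo, hi)})
    # Frequency ranking: documents named by the most query sentences win.
    return [doc for doc, _ in votes.most_common(top_k)]
```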
Time and Space Complexity Analysis
The time complexity of building the index table Ψ in Algorithm-1 can be analyzed as follows. Lines 4-10 consume O(ns) time, where n is the number of database documents and s is the average number of statements in each document; it is assumed that all documents, in both the database and the query, have the same number of statements. Lines 18-28 have time complexity O(nsmr), where m is the size of the embedding vectors and r is the number of bits used to represent vector values. Lines 29-30 consume O(ns) time. Line 31 represents the sorting of the index file of size ns and consumes O(ns log(ns)). Therefore, the total time complexity of building the index table in Algorithm-1 is O(ns log(ns) + nsmr), while the space complexity of Algorithm-1 is O(ns), the size of the index table.
The time complexity of the searching process in Algorithm-2 can be summarized as follows. Lines 4-5 take O(m) time; line 6 (Hilbert mapping) has time complexity O(mr); line 7 calls the binary search algorithm, Algorithm-3, which consumes O(log(ns)) time; line 8 calls Algorithm-4, which retrieves the τ right nearest neighbor documents and the τ left nearest neighbor documents having Hilbert numbers similar to the current Hilbert number i at position p of the index Ψ. Assuming that the statements of each document are contiguous in the index table, the time complexity of Algorithm-4 is O(sτ²). Line 9 appends a list of at most 2τ entries to another list Ĺ without duplication, with time complexity O(sτ²). According to the for loop in line 3, lines 4-9 are repeated s times, so the total time complexity of lines 3-9 is O(s(m + mr + log(ns) + sτ²)). Finally, line 10 calls Algorithm-5 to process the candidate documents by counting the frequency of each document number in list Ĺ among its sub-lists, and this process costs O(s²τ²). As a result, the overall time complexity of the searching process described in Algorithm-2 is O(smr + s log(ns) + s²τ²). The space complexity of Algorithm-2 is O(sτ).
Comparing the time complexity of building the index table with that of searching for the input query document shows that the searching process is very fast in contrast to the time needed to build the index table, whereas the search time depends heavily on the average number of statements in each document and the user cutoff variable τ.
Experiments were conducted on a laptop with an Intel® Core™ i5-2410M CPU running at 2.3 GHz and 4 GB RAM. In the experiments, the scalability of the proposed methods was evaluated on databases with different numbers of documents, from 50 to 400 in steps of 50. The elapsed time in seconds is reported for querying 20 documents (see Table 1), and the top 10 similar documents were retrieved.
As shown in Table 1, the brute force search method, which measures the Euclidean distance between embedding vectors, is the worst method for retrieving similar documents. As listed in the table, the elapsed time is 2335 seconds for querying 20 documents in a database of 50 records (i.e., about 117 seconds for querying a single record). The dashed line in Table 1 indicates that the elapsed time exceeds 3 hours, which makes the method not scalable and thus impractical for very large databases. In contrast, the dimensionality reduction methods, random projection and statistical reduction, have more efficient running times while preserving the quality of results, and produce the same set of similar records. The Hilbert reduction method is more efficient than random projection and statistical reduction because the Hilbert curve mapping reduces the long embedding vector to a single value.
From a theoretical point of view, the statistical reduction method is faster than random projection. To reduce n embedding vectors from length m to length k, the reduction process of random projection has time complexity O(nmk), while statistical reduction takes approximately 4nm operations, i.e., O(nm).
Although the three document search methods, random projection, statistical reduction, and Hilbert reduction, are fast, they perform a sequential search, which is slower than a binary search. Given the low efficiency of sequential search, indexing the database is of interest, especially with very large databases that prohibit searching entirely in RAM. Indexing a very large database enables fast, indirect access to similar records, and binary search can be exploited to accelerate the matching process. As shown in Table 1, an indexing method using a Hilbert curve is used with two types of evaluation for candidate similar records: frequency ranking, and Euclidean distance with greedy similarity matching. The Hilbert frequency ranking is much faster than the Hilbert similarity matching; however, the latter method is of higher quality in retrieving the correct records, as measured against the naïve method.
The scalability of the indexing process on very large databases is of particular interest. Note that the indexing process is the same for both methods, frequency and similarity; they differ only in the evaluation process. The indexing time is therefore reported for one method.
As illustrated in the accompanying drawings, the indexing time grows linearly with the number of database records n and is fitted by:
Indexing time in seconds = 28.945 × (n/50) + 15.977
As a consequence, for databases of 1000, 10000, 100000, and 1000000 records, the expected indexing times are 595, 5805, 57906, and 578916 seconds, respectively.
As listed in Table 1, searching the index table of 400 records for 20 query documents and verifying candidate similar documents using frequency ranking consumes less than a second (0.333 seconds), while verifying candidate records with similarity ranking consumes 355 seconds (about 18 seconds for a single query record). For different database sizes, the searching time is nearly constant with both evaluation methods, which is a promising indicator for very large index tables.
Table 2 (confusion matrix) describes the primitive performance metrics, true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
The performance of the Hilbert reduction method is measured using Accuracy, Precision, Recall, and F1-score, computed from the confusion-matrix counts as follows:

Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1-score = 2 × (Precision × Recall)/(Precision + Recall)
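For concreteness, the four measures can be computed directly from the confusion-matrix counts, as in the following non-limiting sketch (the example counts are arbitrary and do not reproduce any table in this disclosure):

```python
# Standard confusion-matrix measures; the sample counts are arbitrary.
def measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1-score": f1}

print(measures(tp=20, tn=60, fp=5, fn=5))
```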
Table 3 (confusion matrix) shows the primitive performance metrics for the top 25 similar records retrieved from a database of 100 records by the Hilbert reduction method when setting cutoff = 50, compared with the ground truth, the naïve method.
The confusion matrix for the performance of the random projection method is shown in Table 4.
The confusion matrix for the performance of the statistical reduction method is shown in Table 5.
The confusion matrix for the performance of the Hilbert indexing with similarity method is shown in Table 6.
The performance measures are listed in Table 7 for the best methods, omitting the Hilbert indexing with frequency method due to its poor performance.
The accuracy is computed for the top 25 similar records with cutoff = 50. As seen from Table 7, the Hilbert indexing-similarity method is competitive with random projection and statistical reduction in quality while being more efficient. As listed in Table 1, with a database of 400 records, the Hilbert-indexing-similarity method takes 600 seconds to build the index table and search, while the random projection method takes 1566 seconds.
The accuracy of Hilbert-indexing-similarity can be improved by increasing the cutoff. In an embodiment, the efficiency of this method can be enhanced by reducing the dimensions before mapping with the Hilbert curve, because much of the time is consumed by the Hilbert mapping function. Therefore, it is preferable to perform an appropriate reduction before mapping.
In S902, the method begins with four reference documents and one query document. An example of the documents with their extracted embedding vectors is shown in the accompanying drawings.
In S904, each m-dimensional vector is transformed into a 1-dimensional vector.
In this example, the steps follow the flowchart in the accompanying drawings.
In S906, a Hilbert curve transformation is performed to obtain a single-value embedding as the 1-dimensional vector, as shown in the accompanying drawings.
In S908, similarities are computed.
The problem now is how to process these big numbers. The method needs to find the difference between every two numbers in order to compute the similarity.
The method can be summarized by the following steps.
Now, in S910, new_code = previous_code + (max_length − j) × (difference at the mismatch position)/max_length.
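Since the exact semantics of the mismatch position j and of the "difference at the mismatch position" are not spelled out above, the following Python sketch is a hedged interpretation of the S910 update: the two Hilbert numbers are compared digit by digit as strings, and the first mismatching digit determines the increment, with more significant positions weighted more heavily.

```python
# Hedged interpretation of the S910 update rule; the digit-by-digit
# comparison and the meaning of "difference" are assumptions.
def relative_code(previous_code: float, a: str, b: str) -> float:
    """Update the running code from the first digit mismatch between two
    Hilbert numbers given as decimal strings."""
    max_length = max(len(a), len(b))
    a, b = a.zfill(max_length), b.zfill(max_length)
    for j in range(max_length):
        if a[j] != b[j]:
            difference = abs(int(a[j]) - int(b[j]))
            return previous_code + (max_length - j) * difference / max_length
    return previous_code  # identical numbers leave the code unchanged
```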
Now, in S912, it is enough to multiply each number by 100, as shown in the accompanying drawings.
Now, in S914, the new numbers are assigned to the document sentences as shown in the accompanying drawings.
To compute the semantic similarity between Qry-D1 and Ref-D1, in S916, the similarity is computed as shown in the accompanying drawings.
Then similarity between Qry-D1 and Ref-D1=0.80.
The computing hardware implements the methods for semantic search according to an exemplary aspect of the disclosure. The methods for semantic search may be a software program executed on the computer system 2000. In some embodiments, the methods for semantic search may be a computer program stored on a computer readable storage medium for execution on the computer system 2000. The computer readable storage medium may be any built-in storage medium (hard drive, solid state drive) or removable storage medium (DVD or the like). The computer system 2000 may be any general purpose computer, such as a laptop computer, desktop computer, or workstation, running an operating system, for example Ubuntu Linux OS, Windows, a version of Unix OS, or Mac OS. The computer system 2000 may include processing circuitry implementing one or more central processing units (CPU) 2050 having multiple cores. In some embodiments, the computer system 2000 may include a graphics board 2012 having multiple GPUs, each GPU having GPU memory. The graphics board 2012 may perform many of the mathematical operations of the disclosed machine learning methods. The computer system 2000 includes main memory 2002, typically random access memory (RAM), which contains the software being executed by the processing cores 2050 and GPUs 2012, as well as a non-volatile storage device 2004 for storing data and the software programs. Several interfaces for interacting with the computer system 2000 may be provided, including an I/O Bus Interface 2010, Input/Peripherals 2018 such as a keyboard, touch pad, and mouse, a Display Adapter 2016 and one or more Displays 2008, and a Network Controller 2006 to enable wired or wireless communication through a network 99. The interfaces, memory, and processors may communicate over the system bus 2026. The computer system 2000 includes a power supply 2021, which may be a redundant power supply.
Numerous modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.
The present application claims the benefit of priority to U.S. Provisional Application No. 63/514,579 filed Jul. 20, 2023, the entire contents of which are incorporated herein by reference.