The present application claims the priority of European patent application titled “Method and Infrastructure for Processing a Text Search Query in a Collection of Documents,” Ser. No. 04107041.8, filed on Dec. 29, 2004, which is incorporated herein in its entirety.
The present invention generally relates to a method and an infrastructure for processing text search queries in a collection of documents. Particularly, the present invention utilizes current processor features such as single instruction multiple data (SIMD) units to further optimize Boolean query processing.
Text search in the context of database queries is becoming more and more important—most notably for XML processing. Current text search solutions tend to focus on “stand-alone systems”.
The purpose of a text search query is usually to find those documents in a collection of documents that fulfil certain criteria or search conditions, such as that the document contains certain words. In many cases, the “relevance” of documents fulfilling the given search conditions is calculated as well by using a process called scoring. Most often, users are only interested in seeing the “best” documents as the result of a text search query. Consequently, most search technology aims at producing the first N best results for relatively simple user queries as fast as possible.
In the context of database queries, especially to support XML, queries are complex, i.e. expressing many conditions, and all results are needed for combination with conditions on other database fields. As the size of document collections to be searched is constantly increasing, efficiency of text search query processing becomes an ever more important issue.
Text search query processing for full text search is usually based on “inverted indexes”. To generate inverted indexes for a collection of documents, all documents are analysed to identify the occurring words or search terms as index terms together with their positions in the documents. In an “inversion step” this information is sorted so that the index term becomes the primary sort criterion. The result is stored in a posting index comprising the set of index terms and a posting list for each index term of the set.
Most text search queries comprise Boolean conditions on index terms that can be processed by using an appropriate posting index.
Although this technology has proven to be useful, further improvements in search performance are desirable. What is therefore needed is a system, a computer program product, and an associated method for processing a text search query in a collection of documents that performs well, especially for complex queries returning all results.
The present invention satisfies this need, and presents a system, a computer program product, and an associated method (collectively referred to herein as “the system” or “the present system”) for processing a text search query in a collection of documents (further referenced herein as a document collection or collection).
A text search query of the present system comprises search conditions on search terms, the search conditions being translated into conditions on index terms. The documents of the document collection are grouped in blocks of N documents, respectively, before a block posting index is generated and stored. The block posting index comprises a set of index terms and a posting list for each index term of the set, enumerating all blocks in which the index term occurs at least once. Further, intrablock postings are generated and stored for each block and each index term. The intrablock postings comprise a bit vector of length N representing the sequence of documents forming the block, wherein each bit indicates the occurrence of the index term in the corresponding document. The conditions of a given query are processed by using the block posting index to obtain hit candidate blocks comprising documents that are candidates for fulfilling the conditions, evaluating the conditions on the bit vectors of the hit candidate blocks to verify the corresponding documents, and identifying the hit documents fulfilling the conditions.
The present system groups the documents of the collection in blocks to treat N documents together as a single block. Consequently, a block posting index is generated and stored for the blocks of the collection. In the context of this block posting index, a block comprising N documents takes the role of a single document in the context of a standard inverted index.
The block posting index according to the present system does not comprise any positional or occurrence information, thus allowing a quick processing of search conditions that do not require this kind of information, like Boolean conditions.
The present system evaluates the conditions of a given query by using the block posting index. Thus, it is possible to identify all blocks of the collection comprising a set of one or more documents fulfilling the conditions when taken together. That is, the resultant “hit candidate” blocks may but do not necessarily comprise a hit document. Consequently, processing the conditions of a given query on the block posting index has a certain filter effect as this processing reduces significantly the number of documents to be searched.
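The filter effect, and the reason a hit candidate block need not contain a hit document, can be illustrated with a minimal sketch. The block size of 4, the terms “queen” and “king”, and the helper names `block_is_candidate` and `hit_vector` are assumptions chosen for illustration only; they do not appear in the description above.

```python
# Hypothetical block of N = 4 documents. Bit i of a vector is set when the
# term occurs in document i of the block (all values are illustrative).
N = 4
intrablock = {"queen": 0b0001,   # occurs only in document 0
              "king":  0b0100}   # occurs only in document 2

def block_is_candidate(terms):
    """Block-level test: every term occurs somewhere in the block."""
    return all(intrablock.get(t, 0) != 0 for t in terms)

def hit_vector(terms):
    """Document-level test: AND the bit vectors of all terms."""
    result = (1 << N) - 1            # start with all N documents
    for t in terms:
        result &= intrablock.get(t, 0)
    return result
```

For the conjunctive query on “queen” and “king”, the block passes the block-level test because both terms occur in it, yet the combined bit vector is zero because no single document contains both terms.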
The present system validates the individual documents forming the “hit candidate” blocks. Therefore, the index structure of the present system comprises intrablock postings for each block of the collection and for each index term of the block posting index. The data structure of these intrablock postings comprises a bit vector for each block and each index term. This data structure allows a fast processing of the relevant information to validate the individual “hit candidate” documents.
There are different ways to evaluate the conditions on the bit vectors. For example, the present system may evaluate the bit vectors bit by bit. In one embodiment, the bit vector structure of the intrablock postings is exploited for parallel processing: a single instruction multiple data (SIMD) unit can be used to take advantage of current hardware features.
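The two evaluation styles can be compared in a short sketch. It assumes a block size of 8 and the condition “queen AND NOT king”; a Python integer stands in for the hardware register, so the second function only mimics what a SIMD unit would do in one instruction. The function names are hypothetical.

```python
N = 8  # assumed block size, kept small for readability

def eval_bit_by_bit(queen, king, n=N):
    """Evaluate 'queen AND NOT king' one document at a time."""
    result = 0
    for i in range(n):
        q = (queen >> i) & 1
        k = (king >> i) & 1
        if q and not k:
            result |= 1 << i
    return result

def eval_whole_vector(queen, king, n=N):
    """Evaluate the same condition on all n documents with bitwise operations."""
    mask = (1 << n) - 1          # keep only the n bits of the block
    return queen & ~king & mask
```

Both functions return the same result vector; the second form is the one that maps naturally onto wide registers.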
The various features of the present invention and the manner of attaining them will be described in greater detail with reference to the following description, claims, and drawings, wherein reference numerals are reused, where appropriate, to indicate a correspondence between the referenced items, and wherein:
Block posting lists are generated for each index term of a set of index terms (step 2), wherein each block posting list enumerates all blocks in which the corresponding index term occurs. The block posting lists may further comprise additional information, such as, for example, the number of occurrences of the corresponding index term for all blocks enumerated. These block posting lists are stored in a block posting index 20. The block posting index 20 is an inverted index. Consequently, the block posting index 20 may be generated as described above in connection with the state of the art wherein each block takes the role of a document. In one embodiment of the present invention, the block posting index 20 is generated by using an already existing index structure, such as, for example, a full posting index enumerating all occurrences of all index terms in all documents of the document collection 10. In any case, an appropriate method (not shown) is used for generating and storing the block posting index 20.
Besides the block posting index 20, intrablock postings are generated for each block and each index term (step 3) and are stored in an intrablock posting index 30. Each intrablock posting comprises a bit vector of length N representing the sequence of documents forming the block. Each bit of the bit vector indicates whether the index term related to the intrablock posting occurs in the document corresponding to the bit. The procedure of generating the intrablock postings (step 3) implies that the infrastructure according to the invention comprises an appropriate method for generating the bit vectors of length N.
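One possible way to generate both structures in a single pass is sketched below. It assumes that documents are already tokenized, that they are numbered from zero, and that document i belongs to block i // N; the function name `build_indexes` and the dictionary layout are illustrative choices, not the structures of the described embodiment.

```python
from collections import defaultdict

N = 4  # assumed block size for illustration

def build_indexes(docs, n=N):
    """docs: list of token lists; returns (block postings, intrablock postings)."""
    block_postings = defaultdict(list)   # term -> ascending list of block ids
    intrablock = defaultdict(int)        # (term, block id) -> n-bit vector as int
    for doc_id, tokens in enumerate(docs):
        block_id, offset = divmod(doc_id, n)
        for term in set(tokens):
            postings = block_postings[term]
            # Documents arrive in order, so a block id is appended at most once.
            if not postings or postings[-1] != block_id:
                postings.append(block_id)
            intrablock[(term, block_id)] |= 1 << offset
    return dict(block_postings), dict(intrablock)
```

Neither structure records positions or occurrence counts, which matches the observation that Boolean conditions do not need them.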
Intrablock scoring information is generated in step 4. This implies that the infrastructure according to the invention comprises an appropriate method for generating the scoring information. An example of intrablock scoring information will be described in connection with
The example illustrated in
The intrablock scoring information 24 is stored in a separate data structure. The number of occurrences of index term “queen” in a document is used as intrablock scoring information, which is 1 for the 45th document, i.e. document 167213, and 2 for the 56th document, i.e. document 167224 of the document collection 10. Any type of scoring information may be stored in the intrablock scoring index; the embodiment described here is just one possible implementation of the present invention.
The flowchart of
Processing a text search query in the document collection 10 is initiated by translating the search conditions of the query into conditions on the index terms of the index structure used. The infrastructure for processing a text search query comprises a method for translating the search conditions on search terms of a given text search query into conditions on index terms.
Query processing is initialized (step 100) which comprises among other procedures the translation of the search conditions into conditions on index terms.
Processing enters a loop at step 101. A next hit candidate block is retrieved (step 101). Retrieving the next hit candidate block comprises evaluating the query conditions by using the block posting index. Consequently, the query is not evaluated for a single document but for the blocks of the document collection 10. This processing can be performed using any of the well-known query processing methods on inverted index structures. The result of this evaluation is a hit candidate block comprising at least a set of documents fulfilling the conditions when taken together, i.e., a hit candidate block does not necessarily comprise a single hit document.
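For a purely conjunctive query, the block-level evaluation can be sketched as a merge of sorted block posting lists. This is one of the standard inverted-index techniques rather than a method prescribed by the description; the function names and the dictionary `block_postings` (term to ascending list of block numbers) are assumptions.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two ascending lists of block ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def candidate_blocks(block_postings, terms):
    """Blocks in which every query term occurs at least once."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((block_postings.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result
```

Each block returned this way still has to be validated on its bit vectors, since the terms may occur in different documents of the block.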
Step 102 verifies whether a next hit candidate block has been found. If not, query processing is finished (step 110). If a next hit candidate block has been found, the matches are determined in the hit candidate block (step 103), evaluating the conditions of the query on the corresponding bit vectors of the intrablock posting index.
Step 104 checks whether valid matches, i.e. hit documents, are found. If the intrablock postings have the form of 128-bit vectors, a complete 128-bit vector can be processed in one step by using an SIMD unit. If no SIMD unit is available, a 128-bit vector can be processed by four 32-bit units on a 32-bit architecture or by two 64-bit units on a 64-bit architecture. However, even without an SIMD unit, this evaluation scheme may be beneficial due to good cache locality. If the result vector is zero, no hit document has been found in the block and processing returns to step 101. If the result vector is non-zero, at least one hit has been validated successfully. The non-zero bit positions are decoded to determine the hit documents and the results are stored (step 105). In this way, a hit candidate block is validated and the hit documents within the block are identified.
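The fallback without an SIMD unit can be sketched for a 64-bit architecture, where each 128-bit vector is held as two 64-bit words. The sketch assumes a conjunctive condition, zero-based document numbering, and the mapping document id = block id × 128 + bit position; these conventions and the function names are illustrative assumptions.

```python
WORD_BITS = 64           # word size of the assumed architecture
BLOCK_SIZE = 128         # N: one bit per document, two 64-bit words per vector

def and_vectors(a, b):
    """AND two bit vectors word by word (two steps instead of one SIMD step)."""
    return [x & y for x, y in zip(a, b)]

def is_zero(vector):
    """True when no bit is set, i.e. the block contains no hit document."""
    return not any(vector)

def decode_hits(vector, block_id):
    """Translate the set bit positions of a result vector into document ids."""
    hits = []
    for w, word in enumerate(vector):
        while word:
            low = word & -word                    # isolate the lowest set bit
            bit = low.bit_length() - 1            # its position within the word
            hits.append(block_id * BLOCK_SIZE + w * WORD_BITS + bit)
            word ^= low                           # clear that bit and continue
    return hits
```

Keeping the words of one vector adjacent in memory is what gives the cache locality mentioned above, independent of whether the AND is executed in one step or several.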
Query processing further comprises the possibility of scoring the identified hit documents. Therefore, step 106 determines whether scoring is needed. If not, processing returns to step 101.
If scoring is needed, the intrablock scoring index is accessed to decode the intrablock scoring information of the hit document. This scoring information is recorded in a buffer (step 107). The buffer is used to accumulate the scoring information for several hit documents. In one embodiment, the buffer may be managed as a round-robin queue. Step 108 determines whether a buffer fill threshold is reached. If so, the score for all buffered results is calculated (step 109). Thus, the score calculation can be vectorized using appropriate hardware features available in the infrastructure, because the score calculation requires that the same mathematical formula is evaluated on the scoring information for each hit document.
If, for example, calculation is performed using 32-bit float values then a 128-bit SIMD unit can evaluate the same formula on four complete sets of scoring information in parallel. If no SIMD unit or alternative vector processor is available, this processing is performed element-wise. However, even without an SIMD unit, this evaluation scheme may be beneficial due to good cache locality. The results of the score calculation are added to the results as a block instead of individual inserts. The buffer space is freed up and processing returns to step 101.
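The buffering scheme of steps 107 to 109 can be sketched as follows. The class name `ScoreBuffer`, the fill threshold, and the formula `example_score` are hypothetical; the description does not specify a scoring formula. The list comprehension in `flush` merely marks the place where one formula is applied uniformly to all buffered entries, which is the part a SIMD unit could execute in parallel.

```python
def example_score(count):
    """Placeholder formula (an assumption): grows with the occurrence count."""
    return count / (count + 1.0)

class ScoreBuffer:
    def __init__(self, threshold, formula=example_score):
        self.threshold = threshold   # buffer fill threshold (step 108)
        self.formula = formula
        self.pending = []            # (document id, scoring information)
        self.results = []            # (document id, score)

    def add(self, doc_id, count):
        """Record scoring information for one hit document (step 107)."""
        self.pending.append((doc_id, count))
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        """Score all buffered hits at once and append them as a block (step 109)."""
        scored = [(doc_id, self.formula(count)) for doc_id, count in self.pending]
        self.results.extend(scored)  # one block insert instead of many single ones
        self.pending.clear()         # free the buffer space
```

A final call to `flush` after the last hit candidate block would be needed to score any entries that remain below the threshold.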
The content of
Method 300 is particularly suitable for complex Boolean queries returning all results. Complex queries with high-frequency terms and non-ranked queries also benefit. The block-based Boolean filtering proposed by the invention is efficient for many typical queries in a database context. Only modest changes to the existing code are necessary to implement the invention. The new index data structure can be generated from current indexes.
It is to be understood that the specific embodiments of the invention that have been described are merely illustrative of certain applications of the principle of the present invention. Numerous modifications may be made to the system and method for processing a text search query in a collection of documents described herein without departing from the spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
04107041.8 | Dec 2004 | EP | regional |