The present invention relates to the processing of electronic text generally.
In many business applications, information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g. ClearCase, CVS), Wikis, and backup and archiving solutions. Email, where each reply or forward operation in a thread often repeats some previously sent content, can also be seen as having evolving document versions.
Often it is desired to enable free-text search over such repositories, i.e. to enable submitting queries for which there may be a match in any version of any document. A straightforward way to support free-text search over corpora of versioned documents is to index each version of each document separately, essentially treating the versions as independent entities. However, due to the inherent extensive redundancy in versioned documents, indexing them in this way invariably means indexing portions of identical material numerous times, resulting in larger indices that take longer to build and search and that require more storage capacity.
There is now provided, in accordance with an embodiment of the present invention, a method including, for at least one document, indexing a single time, text which is repeated in multiple edited versions of the document, thereby generating a compact index. The method also includes conducting text searches in the index.
There is also provided, in accordance with another embodiment of the present invention, a search engine including an indexer to index a single time, text which is repeated in multiple edited versions of at least one document thereby generating a compact index, and a query manager to conduct text searches in the compact index.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
Applicants have realized that when successive versions of documents are not significantly different from their predecessors, the redundancies in the documents may be exploited in order to index the documents in a compact manner, while preserving the full retrieval capabilities supported by a traditional index of the documents, in which each document is indexed as an independent entity.
The present invention may thus provide a method and an apparatus for generating a compact index for versioned documents, and for conducting query-based searches therein.
As shown in
In accordance with the present invention, each versioned document dig denotes the ith version of a document in a group g of versioned documents. Furthermore, all of the versions of a document in a group g may be related to one another by a series of revisions, i.e. insert/delete/substitute transformations. The exemplary collection 20 of versioned documents dig shown in
The operation of versioned document indexer 15 is discussed in further detail with respect to
The operation of aligner 42 is discussed in further detail with respect to
Matrix MG3 shown in
Each versioned document, d13, d23, d33 and d43 of group G3 may then be represented by a string of letter symbols. As shown in
As shown in
Furthermore, in accordance with the present invention, each subsequent row i in alignment matrix M constructed by aligner 42 may be a binary representation of the ith versioned document of group g. Thus, in exemplary matrix MG3, each exemplary string STi, representing exemplary versioned document di3, is represented by binary values in row i of the matrix. Thus, string ST1 is represented in row 1 (the first row below row 0) of matrix MG3, string ST2 is represented in row 2, string ST3 is represented in row 3 and string ST4 is represented in row 4.
As shown in
In accordance with the present invention, each versioned document represented in row i of alignment matrix M may be reconstructed from its binary representation in row i by concatenating the symbols in M0,j such that Mi,j=1. Taking the example of string ST1 represented in row 1 of exemplary alignment matrix MG3, it may be seen that only the columns headed by the symbols A, B and C have the value of 1 in row 1 and thus, by concatenating them, the text string “ABC”, string ST1, is reconstructed.
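The reconstruction rule described above may be illustrated by the following minimal sketch. The matrix contents, the names HEADER and ROWS, and the function name reconstruct are hypothetical and chosen only for illustration; they are not the exemplary matrix MG3 of the figures.

```python
# Hypothetical alignment matrix: HEADER plays the role of row 0 (the symbols),
# and ROWS[i - 1] plays the role of row i (the binary representation of version i).
HEADER = ["A", "B", "X", "C"]
ROWS = [
    [1, 1, 0, 1],  # version 1
    [1, 1, 1, 1],  # version 2
    [1, 0, 1, 1],  # version 3
]


def reconstruct(header, rows, i):
    """Concatenate the row-0 symbols of every column j where M[i][j] == 1.

    The row index i is 1-based, matching the numbering used in the text.
    """
    return "".join(sym for sym, bit in zip(header, rows[i - 1]) if bit == 1)


if __name__ == "__main__":
    for i in range(1, len(ROWS) + 1):
        print(i, reconstruct(HEADER, ROWS, i))
```

In this hypothetical matrix, only the columns headed by A, B and C hold a 1 in row 1, so row 1 reconstructs to the string "ABC", in the same manner as string ST1 above.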
Aligner 42 may then generate a set GVDg of virtual documents for each group g of versioned documents dig in collection 20. For a group g comprising n versioned documents, i.e., where i=1, . . . n, aligner 42 may generate $n(n+1)/2$ virtual documents {vj,i, 1≦i≦j≦n}. Thus for the example shown in
In accordance with the present invention, each virtual document vji may contain the text in row 0 of alignment matrix M, corresponding to columns where there is a maximal run of 1s which starts at row i and ends at row j. Furthermore, the virtual documents may be ordered by a lexicographic ordering of the pair <j, i>, i.e. primarily by increasing values of the end of the runs of 1s, and within all runs ending at a particular index j, by increasing index of the beginning of the run.
Thus, as shown in row 50-1 of table 50 of
It will be appreciated that while there is only one maximal run of 1s in each column of exemplary alignment matrix MG3 for the example of
Row 50-2 of table 50 shows the virtual document in [i:j] notation which corresponds to each virtual document vj,i. Row 50-4 of table 50 shows the contents of each virtual document [i:j] (and accordingly, vj,i), which, in accordance with the present invention, may be the symbols in whose columns there is a run of 1s in [i:j] (i.e., rows i through j). Furthermore, in accordance with the present invention, when there are no runs of 1s in [i:j] in any column of a given alignment matrix M, the corresponding virtual document [i:j] may be empty.
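The assignment of symbols to virtual documents according to maximal runs of 1s may be sketched as follows. This is an illustrative sketch only: the matrix is hypothetical (it is not matrix MG3), and the function name virtual_documents and the dictionary layout keyed by the pair (j, i) are assumptions made for the example.

```python
def virtual_documents(header, rows):
    """Map each pair (j, i) to the symbols whose column has a maximal run of 1s
    starting at row i and ending at row j (rows are 1-based).

    Keys are created in lexicographic order of <j, i>, so iterating over the
    dictionary follows the ordering described in the text. Virtual documents
    with no matching run stay empty.
    """
    n = len(rows)
    vdocs = {(j, i): [] for j in range(1, n + 1) for i in range(1, j + 1)}
    for col, sym in enumerate(header):
        start = None
        for r in range(1, n + 2):
            # A sentinel 0 after the last row closes any run that reaches row n.
            bit = rows[r - 1][col] if r <= n else 0
            if bit and start is None:
                start = r
            elif not bit and start is not None:
                vdocs[(r - 1, start)].append(sym)
                start = None
    return vdocs


# Hypothetical three-version group: "ABC", "ABXC", "AXC".
HEADER = ["A", "B", "X", "C"]
ROWS = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
]

if __name__ == "__main__":
    for (j, i), symbols in virtual_documents(HEADER, ROWS).items():
        print(f"[{i}:{j}]", symbols)
```

For this hypothetical group, symbols A and C fall into virtual document [1:3], B into [1:2] and X into [2:3], while [1:1], [2:2] and [3:3] remain empty.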
It may thus be seen in
It will be appreciated that the example shown in
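Given k groups of versioned documents, where group g comprises $n_g$ versioned documents $d_1^{g}, \ldots, d_{n_g}^{g}$, aligner 42 may construct
$N = \sum_{g=1}^{k} n_g(n_g+1)/2$
virtual documents in accordance with the process described with respect to
The virtual documents may then be ordered as follows:
$v_{1,1}^{1}, \ldots, v_{n_1,n_1}^{1}, \ldots, v_{1,1}^{k}, \ldots, v_{n_k,n_k}^{k}$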
In accordance with the present invention, aligner 42 may then assign a serial number 1, . . . ,N to each virtual document, to serve as a document identifier (docid). It may be seen in
In accordance with the present invention, as explained hereinabove with respect to
As shown in
In the example of
As shown in
It will be appreciated that, in accordance with the present invention, the total number of posting elements that stem from group g of versioned documents in compact inverted index 60 may equal the total number of maximal runs of 1 in alignment matrix M constructed by aligner 42 for the group of virtual documents GVDg associated with said group g of versioned documents. As may be seen in matrix MG3 of
In contrast, the total number of posting elements which would be identified in a traditional index of a group of versioned documents g, i.e., in which each document is indexed as an independent entity, would be the total number of distinct appearances of tokens. With respect to alignment matrix M, the total number of distinct appearances of tokens may be equal to the number of 1s appearing in matrix M. For the example of group G3 this number is 30, as may be seen in matrix MG3 of
Thus, it may be seen that indexing the virtual documents VirDN in a group GVDg, which may, in accordance with the present invention, represent the original versioned documents dig in a group g, may produce a compact inverted index 60 having fewer posting elements than a traditional index of the documents in group g. For exemplary group G3 of versioned documents dig, the number of posting elements is reduced from 30 to 12, as explained hereinabove with respect to
It will be appreciated that the ability of the present invention to afford benefits resulting from a reduced index size, without attendant detractions regarding retrieval capability, may be afforded by the maintenance of a map correlating the virtual documents VirDN to the original versioned documents dig. In accordance with the present invention, this map may be provided in the form of predicate data 47.
Returning briefly to
from(X)=i
to(X)=j
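root(X)=docid($v_{1,1}^{k}$)
last(X)=docid($v_{n_k,n_k}^{k}$), where $n_k$ is the number of versioned documents in the group k to which X belongs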
It will be appreciated that the predicates from(X) and to(X) map a particular virtual document X to a particular run of 1s in its associated alignment matrix M. Specifically, the value of the predicate from(X) is the row of M in which the run of 1s associated with virtual document X begins. The value of the predicate to(X) is the row of M in which the run of 1s associated with virtual document X ends.
It will further be appreciated that the predicates root(X) and last(X) map a particular virtual document X to its source group g of versioned documents dig. Specifically, the value of the predicate root(X) is the docid of the first virtual document in the group GVDg to which X belongs. For exemplary group of virtual documents GVD3 of
The value of the predicate last(X) is the docid of the last virtual document in group GVDg to which X belongs. Thus for exemplary group of virtual documents GVD3 of
Exemplary predicate data 47 for the 22 virtual documents of exemplary collection 40 of
It may be seen in
thus determines that the exemplary groups G1, G2, G3 and G4 shown in
It may further be seen in
to the number n of versioned documents dig in group g.
Taking the example of group G1 in
gives six virtual documents VirDN. Six virtual documents VirDN are similarly indicated by the total number (six) of different runs of 1s possible in alignment matrix MG1, which would have three rows, each one corresponding to one versioned document dig: [1:1], [1:2], [2:2], [1:3], [2:3] and [3:3]. Each of these combinations is explicitly listed in the array of predicate data 47 shown in
It will also be appreciated that the categorization of virtual documents X into groups g is apparent in array of predicate data 47 by virtue of the fact that the values of the predicates root(X) and last(X) are shared by the virtual documents X belonging to a single group g. Thus, all of the virtual documents (1-6) of group G1 may be seen in
It will further be appreciated that in accordance with the present invention, the values of all four predicates (i.e., from(X), to(X), root(X), and last(X)) for each virtual document X, may be available in compact index 22 at the cost of only two integers per document. Firstly, a fifth predicate, P(X), may be defined as a function of the root(X) and last(X) predicates, namely:
That is, the value of the predicate P(X) may be equal to the value of root(X) except when X=root(X), at which time it may have the value of last(X). Exemplary values of P(X) for the 22 virtual documents of exemplary collection 40 of
Furthermore, the predicates root(X), last(X) and from(X) may be calculated from the two predicates to(X) and P(X) as follows:
Thus, by storing two integers per virtual document, i.e., the two predicates to(·) and P(·), all four predicates, (i.e., from(X), to(X), root(X), and last(X)) may be readily calculable.
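The recovery of all four predicates from the two stored integers may be sketched as follows. The formulas in this sketch are a reconstruction from the definitions given above (the <j, i> ordering of virtual documents within a group, consecutive docids, and the definition of P(X)); they are an assumption and are not quoted from the specification. The group sizes 3, 2, 4 and 2 are likewise assumed, chosen to be consistent with the 22 virtual documents of the exemplary collection, and the function names are hypothetical.

```python
def build_p_and_to(group_sizes):
    """Return two dicts, docid -> P and docid -> to, with docids starting at 1.

    Virtual documents of each group are numbered in lexicographic <j, i> order.
    """
    p, to = {}, {}
    docid = 1
    for n in group_sizes:
        root = docid
        last = root + n * (n + 1) // 2 - 1
        for j in range(1, n + 1):
            for _i in range(1, j + 1):
                # P(X) equals root(X), except at the root itself, where it is last(X).
                p[docid] = last if docid == root else root
                to[docid] = j
                docid += 1
    return p, to


def root_of(x, p):
    # Only the root document stores a value at or beyond its own docid.
    return x if p[x] >= x else p[x]


def last_of(x, p):
    return p[root_of(x, p)]


def from_of(x, p, to):
    # Inverts docid = root + to*(to-1)/2 + (from-1), which follows from the ordering.
    t = to[x]
    return x - root_of(x, p) - t * (t - 1) // 2 + 1


if __name__ == "__main__":
    P, TO = build_p_and_to([3, 2, 4, 2])  # assumed group sizes
    for x in sorted(P):
        print(x, from_of(x, P, TO), TO[x], root_of(x, P), last_of(x, P))
```

Under these assumptions, the third group occupies docids 10 through 19, and the virtual document with docid 14 decodes to the range [2:3].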
Returning now briefly to
In accordance with the present invention, to simplify the job of query manager 17, each forbidden term −C may be swapped with a virtual required term neg(C), which virtually appears in all of the documents in which C does not appear, and only in those documents. Formally then, a query Q may be a set of size |Q| of required terms (real and virtual), t1, . . . ,t|Q|.
During its search for terms t1, . . . ,t|Q|, query manager 17 may employ posting iterators pt1, . . . ,pt|Q| to mark the current position of the search in each posting list PLt1, . . . ,PLt|Q|. In the information retrieval (IR) literature, pt is also commonly known as the cursor of term t.
The operation of query manager 17 is discussed in further detail with respect to
In accordance with the present invention, query manager 17 may change the positions of iterators pt1, . . . ,pt|Q| in posting lists PLt1, . . . ,PLt|Q| in accordance with an algorithm provided in the present invention, which is a modification of the zig-zag join technique of Garcia-Molina et al. (Database System Implementation. Prentice Hall, 2000), in which the cursors of all required terms (real or virtual) are advanced in alternating order, until they align at some document id. The document at which the cursors align is that which is a match for the query.
At each step of a zig-zag join, a cursor that lags behind the most advanced cursor is chosen, and is advanced using a next operator to a point at or beyond the most advanced cursor. The algorithm provided in the present invention is a slight modification of the classic zig-zag join, since the cursor positions do not necessarily need to align at some particular virtual document, but rather on a set of virtual documents whose ranges intersect.
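For reference, the classic zig-zag join described above, in which cursors must align on a single document id, may be sketched as follows. This sketch illustrates only the classic technique and not the range-based modification provided in the present invention; the function names zigzag_join and advance are hypothetical.

```python
import bisect
import math


def advance(posting_list, target):
    """Return the first doc id >= target in a sorted posting list, or infinity."""
    k = bisect.bisect_left(posting_list, target)
    return posting_list[k] if k < len(posting_list) else math.inf


def zigzag_join(posting_lists):
    """Return the doc ids that appear in every sorted posting list."""
    results = []
    if not posting_lists or any(not pl for pl in posting_lists):
        return results
    # Start from the most advanced first posting among all terms.
    candidate = max(pl[0] for pl in posting_lists)
    while candidate != math.inf:
        aligned = True
        for pl in posting_lists:
            pos = advance(pl, candidate)
            if pos != candidate:
                # This cursor overshot the candidate and becomes the most advanced.
                candidate = pos
                aligned = False
                break
        if aligned:
            results.append(candidate)
            candidate += 1
    return results


if __name__ == "__main__":
    print(zigzag_join([[1, 3, 5, 7], [3, 4, 7, 9], [0, 3, 7]]))
```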
The standard outer shell document at-a-time evaluation provided in the present invention may be the following:
The search function enumerates all virtual documents which match the query Q. It outputs a virtual document if and only if the range of physical documents corresponding to it contains all of the required terms and none of the forbidden terms.
The nextCandidate function performs the zig-zag join and returns the virtual document id representing the next range on which all cursors intersect. The nextCandidate function employs the primitive next(pt, docid), the function location(root, from, to), and the function intersection(docid1, docid2).
In accordance with the present invention, the primitive next(pt, docid) sets pt to the first virtual document in the posting list of t whose id is greater than docid (or to ∞ if no such document exists) and returns that document id.
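The behavior of the primitive next(pt, docid) may be sketched as follows, assuming a posting list held as a sorted in-memory list of virtual document ids. The class name Cursor and its attributes are hypothetical and serve only to illustrate the described semantics.

```python
import bisect
import math


class Cursor:
    """A posting iterator over a sorted list of virtual document ids."""

    def __init__(self, postings):
        self.postings = postings
        self.pos = 0

    def next(self, docid):
        """Move to the first posting whose id is strictly greater than docid.

        Returns that id, or infinity if no such posting exists.
        """
        self.pos = bisect.bisect_right(self.postings, docid)
        if self.pos < len(self.postings):
            return self.postings[self.pos]
        return math.inf


if __name__ == "__main__":
    cursor = Cursor([2, 5, 9])
    print(cursor.next(0), cursor.next(5), cursor.next(9))
```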
The function location(root, from, to) returns the id of the virtual document corresponding to the range [from, to], given the id of the virtual root document (corresponding to the range [1, 1]) of a group of versioned documents. This may simply be calculated as:
The function intersection(docid1, docid2) returns the id of the virtual document that corresponds to the intersection of the ranges represented by docid1 and docid2, or ∞ if the ranges do not intersect.
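The two functions described above may be sketched as follows. The location formula is a reconstruction that follows from the <j, i> ordering of virtual documents and the consecutive assignment of docids within a group; it is an assumption rather than a quotation of the specification. For self-containment, the sketch passes each virtual document as a hypothetical (root, from, to) triple instead of a bare docid.

```python
import math


def location(root, frm, to):
    """Docid of the virtual document [frm:to] in the group whose root docid is root.

    Ranges ending before row `to` occupy to*(to-1)/2 docids, and the ranges
    ending at row `to` are ordered by increasing start row.
    """
    return root + to * (to - 1) // 2 + (frm - 1)


def intersection(a, b):
    """Docid of the range shared by a and b, or infinity if they do not intersect.

    a and b are (root, from, to) triples standing in for decoded docids.
    Ranges from different groups never intersect.
    """
    if a[0] != b[0]:
        return math.inf
    lo, hi = max(a[1], b[1]), min(a[2], b[2])
    return location(a[0], lo, hi) if lo <= hi else math.inf


if __name__ == "__main__":
    print(location(10, 1, 1), location(10, 2, 3))
    print(intersection((10, 1, 3), (10, 2, 4)))
```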
In accordance with the present invention, the function which may perform the zig-zag join and return the virtual document id representing the next range on which all cursors intersect is the following:
As shown in
Furthermore, as shown in
The virtual documents beyond [to,to+1] will either not intersect at all with the range of cursor CL, or will intersect with the suffix of the range of cursor CL. In
In graphs 60 and 70 shown in
In graph 60 each virtual document [i:j] is represented as an interval spanning row i to row j, by a hatching pattern filling the interval. In graph 70, the graphical intersection between virtual document [3:4] and each of the other virtual documents is shown by an overlay of the hatching pattern of virtual document [3:4] over the hatching pattern of every other interval. Thus the characteristics of intersection of ranges RDNI, RINT and RQINT, as a function of the range of the interval [i:j] of the leading cursor CL, are demonstrated.
As shown in
Conversely, when the hatching pattern on interval [3:4] of leading cursor CL is overlaid on the hatching patterns of each of the intervals of the virtual documents in range RINT, (i.e. virtual documents [1:3]-[4:5]) it may be seen that the hatching patterns always overlap. Thus it is shown in
Finally, when the hatching pattern on interval [3:4] of leading cursor CL is overlaid on the hatching patterns of each of the intervals of the virtual documents in range RQINT, (i.e. virtual documents [4:5]-[6:6]) it may be seen that the hatching patterns overlap in intervals [1:6], [2:6], [3:6] and [4:6], and that the hatching patterns do not overlap in intervals [5:5], [5:6] and [6:6]. Thus it is shown in
Furthermore, in accordance with the method of the modified zig-zag join provided in the present invention, if a lagging cursor is advanced and it hits a non-intersecting range, it is guaranteed to not intersect with the range of the leading cursor CL later, so that leading cursor CL may be switched.
As explained previously hereinabove, a forbidden term −C of query Q may be wrapped with a virtual cursor, which may use the underlying cursor to return the next interval in which C does not appear. In accordance with the present invention, the next function of the virtual cursor corresponding to a negative term may be implemented as follows:
It will be appreciated that the virtual cursor wrapper may remember the last position to which the underlying cursor was advanced. Furthermore, the next method of the wrapper may be called with a range of the form [X,X]. It will further be appreciated that for each group, the last physical document in the group may be identified as the document having the largest “to” value of any range in the group.
As discussed previously hereinabove with respect to
The greedy polynomial-time algorithm provided in the present invention may be used for groups of versioned documents which evolve in a linear fashion, i.e., the versions are sequential and do not branch. For document versions which evolve in a treelike fashion, the method of DFS traversal may be used to configure alignment matrix M.
In the example of
In accordance with the greedy polynomial-time algorithm provided in the present invention and as shown in
Initial matrix M1 may contain the string representing the first versioned document in its uppermost row, with a column allocated to each symbol in the string (i.e., each unit of text in the document version). The row below the uppermost row may be associated with the first versioned document, and may contain values of 1 in each cell. A value of 1 in a cell may indicate the appearance of the symbol associated with its column in the string associated with its row, as explained previously with respect to
Each matrix expansion may then be performed by computing the longest common subsequence (LCS) of the strings representing versioned document j and versioned document j-1, and then inserting new columns into matrix M(j-1) for all symbols in string j inserted relative to string j-1. Each expanded matrix Mj also includes a row added to matrix M(j-1) which contains a binary representation of versioned document j, as explained previously with respect to
Thus, in the example of
To finalize the creation of expanded matrix M2, a row containing the binary representation of STR2 is appended to matrix M2. The binary representation of STR1 is also updated to contain zero values in the columns inserted into matrix M2 since their symbols are not contained in STR1.
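The greedy expansion described above may be sketched as follows, assuming single-character symbols and linear (non-branching) version histories. The function names lcs_pairs and build_alignment, and the example strings, are hypothetical; the strings are not STR1, STR2 and STR3 of the figures.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def build_alignment(versions):
    """Return (header, rows): header is row 0, rows[i - 1] is the bit row of version i."""
    header = list(versions[0])
    rows = [[1] * len(header)]
    for prev, cur in zip(versions, versions[1:]):
        # Columns holding the symbols of the previous version, in order.
        prev_cols = [c for c, bit in enumerate(rows[-1]) if bit]
        match = {j: prev_cols[i] for i, j in lcs_pairs(prev, cur)}
        inserts, matched_cols, anchor = {}, set(), -1
        for j, sym in enumerate(cur):
            if j in match:
                anchor = match[j]
                matched_cols.add(anchor)
            else:
                # New symbol: insert a column right after the last matched column.
                inserts.setdefault(anchor, []).append(sym)
        new_header, new_rows, cur_row = [], [[] for _ in rows], []
        for col in range(-1, len(header)):
            if col >= 0:
                new_header.append(header[col])
                for old, new in zip(rows, new_rows):
                    new.append(old[col])
                cur_row.append(1 if col in matched_cols else 0)
            for sym in inserts.get(col, []):
                new_header.append(sym)
                for new in new_rows:
                    new.append(0)  # earlier versions do not contain the new symbol
                cur_row.append(1)
        header, rows = new_header, new_rows + [cur_row]
    return header, rows


if __name__ == "__main__":
    header, rows = build_alignment(["ABC", "ABXC", "AXC"])
    print(header)
    for row in rows:
        print(row)
```

In this sketch, each expansion compares a version only with its immediate predecessor, so a symbol that is deleted and later reinserted receives a new column rather than reusing its earlier one.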
Similarly, and as shown in
The method provided in the present invention may support such ranking in the following manner: Whenever query manager 17 returns a virtual document $v_{to,from}^{k}$ representing the range [from,to] of version group k, from the nextCandidate function as search results 30, results ranker 92 may score the to−from+1 physical versioned documents represented by that range. Query manager 17 may stream through the posting lists of all positive query terms, starting from virtual document $v_{from,1}^{k}$ and ending at $v_{to,to}^{k}$, and results ranker 92 may factor each query term occurrence within those virtual documents into the scores of the corresponding physical versioned documents.
The present invention may thus be able to return results matching any of the following criteria for every group k in which some document matched query Q: the earliest or latest document version matching query Q, the highest-scoring version with respect to query Q, or all of the versions matching query Q.
It will be appreciated that search engines typically associate inner-document locations with each indexed token, thus mapping adjacencies of tokens in a document. This enables both exact-phrase searching and proximity-based scoring (i.e., boosting the score of documents where query terms appear in close proximity to one another). It will further be appreciated that phrase matching and proximity-based scoring do not typically cross sentence boundaries.
As discussed previously hereinabove with respect to
The method provided in the present invention may maintain robust performance of exact-phrase queries and proximity-based searches when the unit of text used by aligner 42 is at least a sentence. Versioned document indexer 15 may align each versioned document by sentences, hashing each sentence into an integer value, and transforming each document into a sequence of integers. The integers may then be aligned, and when assigned to the virtual documents, each integer may be replaced by the sentence it represents. Sentences may thus be kept intact, and exact-phrase queries and proximity-based searches may be reliably performed.
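The transformation of a document into a sequence of sentence hashes, and the mapping of each integer back to its sentence, may be sketched as follows. The sentence splitter is deliberately naive, and the use of a truncated SHA-1 digest as the hash is an assumption; the specification does not name a particular hash function. All function names are hypothetical.

```python
import hashlib
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_id(sentence):
    """Hash a sentence into a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.sha1(sentence.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def to_int_sequence(text, table):
    """Turn a document into a sequence of integers, recording id -> sentence."""
    ids = []
    for sentence in split_sentences(text):
        h = sentence_id(sentence)
        table[h] = sentence
        ids.append(h)
    return ids


if __name__ == "__main__":
    table = {}
    v1 = to_int_sequence("One. Two. Three.", table)
    v2 = to_int_sequence("One. Two changed. Three.", table)
    print(v1)
    print(v2)
    print([table[h] for h in v2])
```

The resulting integer sequences may then be aligned in place of the symbol strings, with each integer replaced by its sentence when the virtual documents are populated.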
It will be appreciated that indexing documents aligned by sentences may result in smaller index space savings in comparison with documents aligned by individual words, since any change in a sentence between version i and i+1 of a document will require the re-indexing of the entire sentence in some new virtual document. On the other hand, the alignment phase may run much faster when the unit of text is a sentence, since the sequences to align may be much shorter.
It will further be appreciated that while the greedy polynomial-time algorithm discussed hereinabove with respect to
While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.