The drawings referenced herein form a part of the specification. Features shown in the drawing are meant as illustrative of only some embodiments of the invention, and not of all embodiments of the invention, unless otherwise explicitly indicated, and implications to the contrary are otherwise not to be made.
In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific exemplary embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized, and logical, mechanical, and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
The computer-readable medium 104 stores a number of text-based documents 106. The medium 104 also stores a base inverse index 108 that is generated for the documents 106. The generation of the inverse index 108 is beyond the scope of this patent application; the index can be generated in a conventional or other manner. The inverse index 108 is typically created for rapid keyword-based search of the documents 106. The inverse index 108 may be considered information regarding the occurrence of terms within the documents 106, sorted by the terms themselves.
The annotation mechanism 102 generally annotates the tokens, or terms, within the inverse index 108 to generate the annotated inverse index 108′. As such, the documents 106 are inherently annotated, as the annotated documents 106′, by virtue of the annotated inverse index 108′. Because the documents 106 are annotated by annotating the inverse index 108 of the documents 106, it can be said that the documents 106 are all annotated at the same time. That is, because the inverse index 108 pertains to all the documents 106, annotating the index 108 effectively annotates all the documents 106 at the same time. Various approaches by which the annotation mechanism 102 may annotate the inverse index 108, and thus the documents 106 from which the inverse index 108 was generated, are now described.
A base inverse index for the documents is received (202). The base inverse index is the inverse index prior to annotation thereof, and hence is described as being the base such index. It is presumed that a document collection D contains documents d(1) to d(N). The base inverse index has two ordered sets: an ordered list of unique tokens T with elements t(1) through t(M) that occur in the document collection D, and a set of location lists #L, where there is one list #l(i) for each unique token t(i). A location list is defined as an ordered list of pointers to the document collection D. Each pointer locates the document and the token offset of a single occurrence of the token t(i). Thus, the location list #l(i) for token t(i) can be used to locate every occurrence of t(i) within the document collection D.
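As an illustration only, the two ordered sets of the base inverse index can be sketched as follows. The function name build_base_index, the whitespace tokenization, and the (document number, token offset) pointer layout are assumptions made for this sketch; as noted above, the index may be generated in any conventional or other manner.

```python
from collections import defaultdict


def build_base_index(docs):
    """Return (T, L): the ordered unique tokens and their location lists.

    Assumptions for this sketch: documents are plain strings, tokens are
    whitespace-separated, and a pointer is a (doc_number, token_offset) pair.
    """
    lists = defaultdict(list)
    for doc_number, text in enumerate(docs):
        for offset, token in enumerate(text.split()):
            lists[token].append((doc_number, offset))
    tokens = sorted(lists)                           # ordered list T: t(1)..t(M)
    location_lists = [lists[t] for t in tokens]      # #l(i) for each t(i)
    return tokens, location_lists


docs = ["Dr. Mary Ann Smith met John", "John Smith called Mary"]
T, L = build_base_index(docs)
john = L[T.index("John")]  # every occurrence of the token "John"
```

Because documents and offsets are visited in increasing order, each location list comes out already ordered by document and then by offset.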
It is further noted that the base inverse index is an index of base entities, where the base entities are unique tokens within the corpus of documents. Two more complex entities can be derived from the base index: regexp, or regular-expression, entities; and dictionary entities. Thus, a regular-expression entity is defined (204). A regexp entity &ERname is defined as a token that matches a regular expression %Rname. For example, if %Rcapword is ([A-Z][a-z]*), then any &ERcapword is a token corresponding to a word with an initial capital letter.
A merge operation is also defined (206). The merge operation merge(#la, #lb) returns a location list in which each pointer occurs in location list #la, location list #lb, or both location lists #la and #lb. Therefore, the location list #LRcapword for all entities &ERcapword, for example, can be composed by using merge(#la, #lb) to combine all the lists #l(i) for the tokens t(i), where t(i) satisfies %Rcapword.
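The composition of #LRcapword by repeated merging can be sketched as below. This is a minimal sketch under stated assumptions: the toy index, the variable names, and the (document, offset, length) pointer layout are hypothetical, with the length field added so that the same layout can later describe multi-token sequences.

```python
import re


def merge(la, lb):
    """Union of two location lists, returned sorted and without duplicates."""
    return sorted(set(la) | set(lb))


# Hypothetical toy index: token -> location list of (doc, offset, length).
index = {
    "Mary": [(0, 1, 1), (1, 3, 1)],
    "Smith": [(0, 3, 1), (1, 1, 1)],
    "met": [(0, 4, 1)],
}

# The regular expression given in the text for capitalized words.
R_capword = re.compile(r"([A-Z][a-z]*)")

# Merge the location lists of every token that satisfies the expression.
L_Rcapword = []
for token, pointers in index.items():
    if R_capword.fullmatch(token):
        L_Rcapword = merge(L_Rcapword, pointers)
```

Only the unique tokens are tested against the regular expression; the documents themselves are never rescanned.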
A consecutive-intersection operation is also defined (208). The consecutive-intersection operation consint(#la, #lb) is the consecutive intersection of location lists #la and #lb, and returns a location list. For a pointer to be in the location list returned by consint(#la, #lb), it must point to a token sequence that consists of two consecutive subsequences @sa and @sb. Furthermore, the sequence @sa occurs in #la, and the sequence @sb occurs in #lb.
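One possible realization of consint(#la, #lb) is sketched below. It assumes, for illustration only, that each pointer is a (document, offset, length) triple, so that a returned pointer can span both subsequences @sa and @sb.

```python
def consint(la, lb):
    """Pointers to sequences formed by an #la sequence immediately
    followed by an #lb sequence in the same document."""
    # Group the #lb sequences by the position at which they start.
    lengths_by_start = {}
    for doc, start, length in lb:
        lengths_by_start.setdefault((doc, start), []).append(length)

    result = set()
    for doc, start, length in la:
        # @sb must begin exactly where @sa ends.
        for length_b in lengths_by_start.get((doc, start + length), []):
            result.add((doc, start, length + length_b))
    return sorted(result)


mary = [(0, 1, 1), (1, 3, 1)]   # hypothetical list for the token "Mary"
ann = [(0, 2, 1)]               # hypothetical list for the token "Ann"
mary_ann = consint(mary, ann)   # occurrences of "Mary Ann"
```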
Thereafter, for each dictionary entity of a dictionary, an index is determined as a consecutive intersection of all location lists of pointers within the dictionary entity (210). A dictionary entity &EDname is defined as a sequence of tokens that occur in the dictionary $Dname. This dictionary is simply a list of token sequences, which are typically ordered. For example, if $Dfname is a list of all first names, then any token sequence annotated as &EDfname is a first name. For the simple case in which all first names are one token in length, the location list #LDfname corresponding to all entities of type &EDfname can be composed by using merge(#la, #lb) to combine all the lists #l(i), where t(i) is in the dictionary $Dfname.
For the more complex case, in which the sequences in $Dfname are more than one token in length, the following is performed. Particularly, for each token sequence t(i1), t(i2), . . . , t(ix) in $Dfname, where x is the length of the sequence, consint(#la, #lb) is first employed to generate an index that is the consecutive intersection of the lists #l(i1), #l(i2), . . . , #l(ix). This index contains the pointers to all occurrences of the sequence t(i1) through t(ix) in the collection. It can be appreciated by those of ordinary skill within the art that the complex case automatically collapses to the simple case where the token sequence is one token in length, that is, where x is equal to one. As such, the consecutive-intersection operation defined in part 208 may be considered as being used to perform part 210 of the method 200.
Thereafter, the location lists of all the token sequences that are members of the dictionary are merged to result in a final location list for the dictionary (212). As such, the documents are annotated via the tokens of the dictionary entities annotating the base inverse index. For instance, the merge operation merge(#la, #lb) is used to combine the lists for each sequence in $Dfname to yield the final location list #LDfname. As such, the merge operation defined in part 206 may be considered as being used to perform part 212 of the method 200.
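Parts 210 and 212 can be sketched together as follows. The helper name dictionary_location_list, the toy index, and the (document, offset, length) pointer layout are assumptions of this sketch, and every dictionary entry is assumed to contain at least one token.

```python
from functools import reduce


def merge(la, lb):
    """Union of two location lists, sorted and without duplicates."""
    return sorted(set(la) | set(lb))


def consint(la, lb):
    """An #la sequence immediately followed by an #lb sequence."""
    lengths_by_start = {}
    for doc, start, length in lb:
        lengths_by_start.setdefault((doc, start), []).append(length)
    return sorted({(doc, start, length + length_b)
                   for doc, start, length in la
                   for length_b in lengths_by_start.get((doc, start + length), [])})


def dictionary_location_list(dictionary, index):
    """Final location list for a dictionary of token sequences."""
    final = []
    for sequence in dictionary:
        lists = [index.get(token, []) for token in sequence]
        # Part 210: chain consint over #l(i1)..#l(ix); for x == 1 the
        # reduction simply returns the single list unchanged.
        sequence_list = reduce(consint, lists)
        # Part 212: merge the per-sequence lists.
        final = merge(final, sequence_list)
    return final


index = {"Mary": [(0, 1, 1), (1, 3, 1)], "Ann": [(0, 2, 1)],
         "John": [(0, 5, 1), (1, 0, 1)]}
D_fname = [("Mary",), ("John",), ("Mary", "Ann")]
L_Dfname = dictionary_location_list(D_fname, index)
```

In this sketch a dictionary token that is absent from the index contributes an empty list, so any sequence containing it yields no pointers.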
It is noted that dictionary entities as in the method 200 of
Each derived entity is composed from preexisting simpler entities using a set of rules written in modified context-free grammar (CFG) (302). Consider the example &EXfullname→&EDfname &EDlname. This means that the derived entity &EXfullname is composed of two consecutive sequences @seq1 and @seq2, where @seq1 is an entity of type &EDfname and @seq2 is of type &EDlname, assuming that &EDlname is the dictionary entity of last names. From the definition of consint(#la, #lb), the location list #LXfullname for &EXfullname is obtained as follows: #LXfullname=consint(#LDfname, #LDlname). As such, &EXa→&EXb &EXc is generally interpreted as meaning that #LXa equals consint(#LXb, #LXc). Furthermore, &EXa→&EXb|&EXc is generally interpreted as meaning #LXa equals merge(#LXb, #LXc).
Therefore, extending the example further, $Dnameprefix may be a dictionary of common prefixes for names such as Mr., Mrs., Ms., Dr., and so on. A derived entity &EXperson can be composed that annotates sequences as a person so long as they are a first name, full name, last name, or name prefix followed by a sequence of capitalized words of at most two in length. Thus, &EXperson→&EDfname|&EXfullname|&EDlname|&EXnewname; &EXnewname→&EDnameprefix &EXcapword2; and, &EXcapword2→&ERcapword|&ERcapword &ERcapword.
The location list for &EXperson is composed from the simpler location lists recursively, by using the operators merge(#l1, #l2) and consint(#l1, #l2). Hence, #LXcapword2=merge(#LRcapword, consint(#LRcapword, #LRcapword)). Further, #LXnewname=consint(#LDnameprefix, #LXcapword2). Therefore, #LXperson=merge(#LDfname, merge(#LXfullname, merge(#LDlname, #LXnewname))).
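The recursive composition can be traced on a small example. The sketch below assumes two hypothetical documents, "Dr. Mary Ann Smith met John" and "John Smith called Mary", hand-written location lists for them, and a (document, offset, length) pointer layout; none of these appear in the description itself.

```python
def merge(la, lb):
    """Union of two location lists, sorted and without duplicates."""
    return sorted(set(la) | set(lb))


def consint(la, lb):
    """An #la sequence immediately followed by an #lb sequence."""
    lengths_by_start = {}
    for doc, start, length in lb:
        lengths_by_start.setdefault((doc, start), []).append(length)
    return sorted({(doc, start, length + length_b)
                   for doc, start, length in la
                   for length_b in lengths_by_start.get((doc, start + length), [])})


# Hypothetical base lists for the two example documents.
L_Dfname = [(0, 1, 1), (0, 5, 1), (1, 0, 1), (1, 3, 1)]   # Mary, John
L_Dlname = [(0, 3, 1), (1, 1, 1)]                         # Smith
L_Dnameprefix = [(0, 0, 1)]                               # Dr.
L_Rcapword = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 5, 1),
              (1, 0, 1), (1, 1, 1), (1, 3, 1)]

# Each production rule maps to one merge or consint call.
L_Xfullname = consint(L_Dfname, L_Dlname)
L_Xcapword2 = merge(L_Rcapword, consint(L_Rcapword, L_Rcapword))
L_Xnewname = consint(L_Dnameprefix, L_Xcapword2)
L_Xperson = merge(L_Dfname,
                  merge(L_Xfullname, merge(L_Dlname, L_Xnewname)))
```

In this example the pointers in L_Xnewname begin at offset 0 of the first document, so they span the name prefix "Dr." as well as the capitalized words that follow it.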
It is noted that one difficulty with the above approach is that the location list corresponding to &EXnewname can have pointers that span both the name-prefix and the sequence of the capitalized words. Therefore, it may be desirable to restrict the pointers so that they ignore the name-prefix. Another restriction that may be desired is that the capitalized words are also nouns, assuming that there is a noun entity annotator.
Therefore, the CFG is modified to include three operations (304). A parallel-intersection operation is defined (306). This operation parallelint(#la, #lb) is the parallel intersection of #la and #lb, returning the subset of pointers to sequences that are present in both #la and #lb. Thus, one modification of the CFG, using this parallel-intersection operation, is that &EXa→&EXb^&EXc is interpreted to mean that the entity &EXa corresponds to a sequence of tokens that have both &EXb and &EXc annotations, and both of which fully span the sequence. That is, given the production rule &EXa→&EXb^&EXc, the location list #LXa for &EXa is determined as #LXa=parallelint(#LXb, #LXc).
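A minimal sketch of the parallel intersection follows, assuming that a pointer is a (document, offset, length) triple and that two annotations fully span the same sequence exactly when their pointers are equal. The example lists are hypothetical.

```python
def parallelint(la, lb):
    """Pointers to sequences that are present in both #la and #lb."""
    return sorted(set(la) & set(lb))


L_Xnoun = [(0, 3, 1), (0, 4, 1)]       # hypothetical noun annotations
L_Rcapword = [(0, 1, 1), (0, 3, 1)]    # hypothetical capitalized words
L_Xncapword = parallelint(L_Xnoun, L_Rcapword)
```

A pointer that merely overlaps a sequence in the other list is not returned here, since both annotations are required to span the whole sequence.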
A first extension to the consecutive-intersection operation is also defined (308), as well as a second extension to the consecutive-intersection operation (310), where both of these operations are different from the consecutive-intersection operation defined in part 208 of
Thus, another modification of the CFG, using these two consecutive-intersection operations, is that &EXa→{&EXb} &EXc is interpreted to mean that entity &EXa is formed from two consecutive token sequences @seq1 and @seq2, where @seq1 is of type entity &EXb and @seq2 is of type entity &EXc. The curly brackets denote that when the location list for &EXa is computed, the pointers skip @seq1 and just point to @seq2. Thus, the location list #LXa for &EXa→{&EXb} &EXc is determined as #LXa=consintwp(#LXb, #LXc), and the location list #LXa for &EXa→&EXb {&EXc} is determined as #LXa=consintws(#LXb, #LXc).
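The two extended operations might be sketched as below. The behavior of consintwp follows the description of the curly brackets given above; the behavior of consintws, namely that the pointers skip @seq2 and just point to @seq1, is assumed here by symmetry. The helper _consecutive_pairs and the (document, offset, length) pointer layout are likewise assumptions of this sketch.

```python
def _consecutive_pairs(la, lb):
    """Yield (pointer_a, pointer_b) where pointer_b starts where pointer_a ends."""
    lengths_by_start = {}
    for doc, start, length in lb:
        lengths_by_start.setdefault((doc, start), []).append(length)
    for doc, start, length in la:
        for length_b in lengths_by_start.get((doc, start + length), []):
            yield (doc, start, length), (doc, start + length, length_b)


def consintwp(la, lb):
    """Consecutive intersection whose pointers skip the prefix @seq1."""
    return sorted({b for _, b in _consecutive_pairs(la, lb)})


def consintws(la, lb):
    """Consecutive intersection whose pointers skip the suffix @seq2
    (assumed by symmetry with consintwp)."""
    return sorted({a for a, _ in _consecutive_pairs(la, lb)})


L_Dnameprefix = [(0, 0, 1)]                        # hypothetical: "Dr."
L_Xcapword2 = [(0, 1, 1), (0, 1, 2), (1, 0, 1)]    # hypothetical capword runs
names_only = consintwp(L_Dnameprefix, L_Xcapword2)
prefixes_only = consintws(L_Dnameprefix, L_Xcapword2)
```

Unlike consint, neither operation lengthens a pointer: the neighboring sequence acts only as a condition on which pointers are kept.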
Using this modified CFG, then, each derived entity may be derived from a first sequence of tokens and a second sequence of tokens (312), an example of which has been described in relation to the initial description of part 302. Thus, an arbitrarily complex annotation may be composed from simpler annotations. For the person-name example, the final set of rules that use the above modification are: &EXperson→&EDfname|&EXfullname|&EDlname|&EXnewname; &EXnewname→{&EDnameprefix} &EXncapword2; &EXncapword2→&EXncapword|&EXncapword &EXncapword; and, &EXncapword→&EXnoun^&ERcapword.
It is assumed that &EXnoun is the annotation for all tokens that are nouns. The corresponding location lists are determined as follows. First, #LXncapword=parallelint(#LXnoun, #LRcapword). Second, #LXncapword2=merge(#LXncapword, consint(#LXncapword, #LXncapword)). Third, #LXnewname=consintwp(#LDnameprefix, #LXncapword2). Finally, fourth, #LXperson=merge(#LDfname, merge(#LXfullname, merge(#LDlname, #LXnewname))).
It is noted that the method 200 of
Therefore,
In general, as has been noted, a partial ordering of annotations of tokens within the documents is imposed (402). In particular, and in one embodiment, an array tokStatus of integers, of size equal to the total number of tokens within the document collection in question, is created. This array is initialized with zeros. A positive integer is associated with each annotation type so that the order of these integers reflects the partial ordering of the annotation types that is desired to be imposed. Annotation types that are at the same level and that can overlap have the same integer associated with them.
An apply-order operation is defined (404). This operation tokStatus.applyorder(x, #lp) takes as arguments the location list #lp of an annotation type and the associated integer x for that type. The operation returns the subset of pointers from #lp for which all the tokens in the pointed-to sequences have associated values in tokStatus less than or equal to x. In addition, the tokStatus values for the sequences that are returned are updated to the value x. Therefore, if any part of a token sequence has already been annotated as an entity with a higher value of x, this token sequence will be removed from the list of pointers in #lp.
Thus, the apply-order operation is employed to impose a desired partial ordering (406), as defined in the array tokStatus. To ensure the location lists correctly reflect the partial ordering of the entities, the apply-order operation is applied in descending order of x values. That is, the operation is performed beginning with the highest order annotation types.
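The apply-order operation can be sketched as follows. For brevity, the sketch assumes that each pointer is a (global token offset, length) pair over the whole collection, so that it indexes directly into tokStatus; the class name TokStatus and the example annotation levels are hypothetical.

```python
class TokStatus:
    """One integer per token in the collection, initialized to zero."""

    def __init__(self, total_tokens):
        self.status = [0] * total_tokens

    def applyorder(self, x, lp):
        """Keep the pointers in lp whose tokens all have status <= x,
        and raise the status of the kept tokens to x."""
        kept = []
        for start, length in lp:
            span = range(start, start + length)
            if all(self.status[i] <= x for i in span):
                for i in span:
                    self.status[i] = x
                kept.append((start, length))
        return kept


# Tokens 0..5 of a hypothetical document: Dr. Mary Ann Smith met John
tok_status = TokStatus(6)
person = [(1, 3)]                           # "Mary Ann Smith", level 2
capword = [(1, 1), (2, 1), (3, 1), (5, 1)]  # capitalized words, level 1

# Applied in descending order of x, as described above.
person_kept = tok_status.applyorder(2, person)
capword_kept = tok_status.applyorder(1, capword)
```

Only the capitalized word at offset 5 survives at level 1, because offsets 1 through 3 were already claimed by the higher-order person annotation.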
It is noted that as an alternative to determining tokStatus.applyorder(x, #lp) as a post-processing operation on a location list, this operation can be combined with the operation merge(#la, #lb). For instance, the operation tokStatus.merge(#la, #lb, x) can be defined as the operation that returns a location list which is a merge of the lists #la and #lb and which satisfies the constraints that tokStatus.applyorder(x, #lp) imposes on the resulting list. There may be efficiency reasons for using this alternative approach, since while the location lists are being merged the token sequences can be simultaneously checked against tokStatus.
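A single-pass version of the combined operation might look like the sketch below. It assumes that both input lists are already sorted and that each pointer is a (global token offset, length) pair; the class name TokStatus and the use of heapq.merge to interleave the two lists are choices made for this sketch only.

```python
import heapq


class TokStatus:
    """One integer per token in the collection, initialized to zero."""

    def __init__(self, total_tokens):
        self.status = [0] * total_tokens

    def merge(self, la, lb, x):
        """Merge two sorted location lists, keeping only the pointers
        that the apply-order constraint for level x allows."""
        out = []
        previous = None
        for pointer in heapq.merge(la, lb):    # one pass over both lists
            if pointer == previous:            # present in both lists
                continue
            previous = pointer
            start, length = pointer
            span = range(start, start + length)
            if all(self.status[i] <= x for i in span):
                for i in span:
                    self.status[i] = x
                out.append(pointer)
        return out


tok_status = TokStatus(6)
tok_status.status[1:4] = [2, 2, 2]   # tokens already claimed at level 2
la = [(1, 1), (5, 1)]
lb = [(3, 1), (4, 1), (5, 1)]
merged = tok_status.merge(la, lb, 1)
```

Each pointer is checked against tokStatus at the moment it is emitted by the merge, so no separate pass over the merged list is required.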
It is noted that, although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is thus intended to cover any adaptations or variations of embodiments of the present invention. Therefore, it is manifestly intended that this invention be limited only by the claims and equivalents thereof.