Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The present invention provides an improvement to the class of data structures that serves as indexes of a collection of documents keyed on index terms. One example of such a data structure is an inverted index that stores, for each respective index term in a plurality of index terms, a document posting list referencing documents in a document collection that contain the index term. Using the methods of the present invention, an individual document posting list can be efficiently modified without affecting other document posting lists in the inverted index.
In addition to traditional document postings of index terms, the data structures of the present invention can store vertical collections. Such vertical collections are treated in the same manner as document posting lists in the present invention. A “vertical collection” comprises a set of documents (e.g., URLs, websites, etc.) that relate to a common category. For example, web pages pertaining to sailboats could constitute a “sailboat” vertical collection. Web pages pertaining to car racing could constitute a “car racing” collection. However, there is no requirement that the documents in the “car racing” vertical collection have the index terms “car” or “racing”. Users search a vertical collection so that only documents relevant to the category represented by the vertical collection are returned to the user.
Server 100 will typically have a user interface 104 (including a display 106 and a keyboard 108), one or more processing units (CPUs) 102, a network or other communications interface 110 for connecting to the Internet and/or other form of network 122, memory 114, and one or more communication busses 112 for interconnecting these components. Memory 114 can include high speed random access memory (RAM) and can also include non-volatile memory, such as one or more magnetic disk storage devices 120 controlled by one or more controllers 118. Disk storage devices can be remotely located.
Memory 114 preferably stores:
an operating system 130 that includes procedures for handling various basic system services and for performing hardware dependent tasks;
a network communication module 132 that is used for connecting server 100 to various client computers (not shown) and possibly to other servers or computers via one or more communication networks 122 such as the Internet, other wide area networks, local area networks (e.g., a local wireless network can connect client computers to server 100), metropolitan area networks, and so on;
a query handler 134 for receiving search queries from a client computer;
a search engine 136 for searching a dynamic document index 142 for documents 148 in document repository 147 related to a search query and for forming a group of ranked documents that are related to the search query;
a hash table 138 for tracking the location of posting lists for index terms as well as vertical collections in dynamic document index 142;
a collection of free lists 140 for tracking availability of space in dynamic document index 142;
dynamic document index 142 for storing posting lists for index terms and/or vertical collections;
an optional vertical index construction module 144 for constructing one or more vertical collections;
a document index construction module 146 for constructing dynamic document index 142 from a set of documents 148 in document repository 147; and
an optional quality score index data structure 150 for tracking the quality score index of various documents 148 in document repository 147 for particular index terms.
The methods of the present invention begin, before a search query is received by query handler 134, with document index construction module 146. Document index construction module 146 constructs a document index by scanning documents 148 in document repository 147 for relevant index terms. The document index is illustrated below:
In some embodiments, the document index is constructed by document index construction module 146 by conventional indexing techniques. Exemplary indexing techniques are disclosed in United States Patent publication 2006/0031195, which is hereby incorporated by reference herein in its entirety. By way of illustration, in some embodiments, a given index term may be associated with a particular document when the index term appears more than a threshold number of times in the document. In some embodiments, a given index term may be associated with a particular document when the index term achieves more than a threshold score. Criteria that can be used to score a document relative to a candidate index term include, but are not limited to, (i) a number of times the index term appears in an upper portion of the document, (ii) a normalized average position of the index term within the document, (iii) a number of characters in the index term, and/or (iv) a number of times the document is referenced by other documents. High scoring documents are associated with the index term.
Typically, when a document is associated with an index term, the document is added to a posting list for the index term. In some embodiments, the document index stores the list of index terms and a posting list for each respective index term uniquely identifying the documents in a collection of documents that contain the respective index term. In some embodiments, the document index stores a collection of index terms, the identities of documents in a collection of documents that contain such index terms, and the relevance or other form of quality scores of these documents. Those of skill in the art will appreciate that there are numerous methods for associating index terms with documents in order to build a document index and all such methods can be used to construct document indexes used in the present invention.
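By way of illustration only, the association of index terms with documents described above can be sketched as follows. This is a minimal sketch rather than the indexing technique of the incorporated reference: the function name `build_document_index`, the `threshold` parameter, the dictionary input format, and the simple word tokenizer are all assumptions introduced here. A term is associated with a document when it appears more than `threshold` times in that document.

```python
import re
from collections import Counter, defaultdict


def build_document_index(documents, threshold=0):
    """Map each index term to a posting list of document identifiers.

    documents: dict of document identifier -> text (hypothetical input shape).
    threshold: hypothetical cutoff; a term is associated with a document
    only when it appears more than `threshold` times in that document.
    """
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # Naive tokenizer (an assumption): lowercase alphanumeric runs.
        counts = Counter(re.findall(r"[a-z0-9]+", documents[doc_id].lower()))
        for term, count in counts.items():
            if count > threshold:
                index[term].append(doc_id)
    return dict(index)


docs = {
    1: "Sailboat racing is a sport. A sailboat needs wind.",
    2: "Car racing and sailboat racing are both sports.",
    3: "The wind was calm.",
}
index = build_document_index(docs)                       # any occurrence counts
strict_index = build_document_index(docs, threshold=1)   # term must appear twice or more
```

With the stricter threshold, document 3 contributes no terms because none of its words repeats, while "racing" remains associated only with document 2.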
Advantageously, the document index constructed by document index construction module 146 is stored in a dynamic document index 142.
In populating dynamic document index 142, reconsider the document index:
In preferred embodiments, the document identifier list (or posting list) for each index term will occupy a different block 204 in dynamic document index 142. The size of a respective document identifier list (posting list) in the illustrated document index will dictate the bucket 202 in which the block 204 containing the respective posting list will be stored. For example, consider the dynamic document index 142 illustrated in
In general, a block 204 is stored in the bucket 202 that has the smallest characteristic size that will still accommodate the block. There are a number of sorting methods for identifying the suitable bucket 202 for storage of a given block 204 based on the data size of the block and all such methods are within the scope of the present invention. A method of examining the bucket 202 having the smallest characteristic data size and then examining buckets 202 characterized by sequentially larger data sizes has been outlined in the example above. Alternatively, one could start with the bucket 202 characterized by the largest data size and examine buckets 202 with sequentially smaller data sizes. In general, to store a block 204 in a given bucket 202, the size of the block 204 cannot exceed a maximum allowed block size for the given bucket (which in preferred embodiments is, in fact, the data size that characterizes the given bucket) but must exceed a minimum allowed block size for the given bucket. In preferred embodiments, the minimum allowed block size of a given bucket is determined by the characteristic data size of the bucket 202 that is sequentially smaller than the characteristic data size of the given bucket. Thus referring to
In some embodiments, any word found in any document in a corpus of documents 148 is stored as an index term in a block 204 together with the document posting list for the term. In some embodiments, certain words are excluded from the list of possible index terms stored in dynamic document index 142. For example, common words such as “a”, “the”, “but”, “and”, or “an” are excluded. In another example, an authorized user (e.g., a parent) may exclude certain words that are deemed to be offensive or inappropriate from dynamic document index 142. In some embodiments, any phrase found in any document in a corpus of documents 148 is stored as an index term in a block 204 together with the document posting list for the term.
There is no limit on the number of documents 148 that can be referenced in the posting list for an index term. For example, in some embodiments, between 10,000 and 100,000 documents 148 are referenced in the posting list for an index term, between 100,000 and 1×10^6 documents 148 are referenced in the posting list for an index term, between 1×10^6 and 1×10^7 documents 148 are referenced in the posting list for an index term, between 1×10^7 and 1×10^8 documents 148 are referenced in the posting list for an index term, or more than 1×10^8 documents 148 are referenced in the posting list for an index term within dynamic document index 142. As used here, the term “referenced” means that the posting list contains sufficient information to uniquely identify the document in a data store. The means used to uniquely identify the document is application specific. If the document is located in RAM memory, the document may be referenced by a pointer. Alternatively, a document may be referenced by a unique document identifier assigned to the document. Furthermore, there is no limit on the number of index terms to which a given document 148 may be associated. For instance, a given document may contain one hundred different index terms. Thus, one hundred different posting lists, one for each of the one hundred index terms, will reference the document. A given document 148 can be associated with between 0 and 100 index terms, between 0 and 1000 index terms, between 100 and 10,000 index terms, between 10,000 and 100,000 index terms, or more than 100,000 index terms in this way.
In the context of this application, documents 148 are understood to be any type of media that can be indexed and retrieved by a search engine, including web documents, images, multimedia files, text documents, PDFs or other image formatted files, ringtones, full track media, and so forth. A document 148 may have one or more pages, partitions, segments or other components, as appropriate to its content and type. Equivalently, a document 148 may be referred to as a “page,” as commonly used to refer to documents on the Internet. In fact, particularly long documents may be logically broken up by document index construction module 146 into separate documents. For example, a 100+ page PDF manual may be logically split into 100+ different documents, where each such document represents a different page of the PDF manual. No limitation as to the scope of the invention is implied by the use of the generic term “documents.”
In the present invention, there are many documents 148 indexed by document index construction module 146. Typically, there are more than one hundred thousand documents, more than one million documents, more than one billion documents, or even more than one trillion documents indexed by document index construction module 146. For the sake of illustration, document index construction module 146 has been construed as first creating a conventional document index and then populating dynamic document index 142. However, document index construction module 146 was presented in this manner solely to assist the reader in understanding how dynamic document indexes 142 of the present invention differ from conventional inverted indexes. In fact, there is no requirement that document index construction module 146 first construct a conventional inverted index prior to populating dynamic document index 142. Document index construction module 146 can construct posting lists for index terms found in a corpus of documents and populate dynamic document index 142 directly based on the size of each posting list constructed.
Advantageously, dynamic document index 142 can store data structures other than posting lists for index terms found in a corpus of documents. Each block 204 in dynamic document index 142 can store any data structure that contains the identity of a collection of documents that share some unique property. The example of a posting list for an index term is one such data structure. Each document referenced in the posting list has the unique property of containing the index term somewhere in the document. Another example of a collection of documents that share some unique property is a vertical collection. A vertical collection is a reference to a collection of documents 148 that have been identified on some basis as sharing some unique property. There is no requirement that this unique property be the presence of an index term within documents. Vertical collections and methods of using such vertical collections are described in more detail in U.S. patent application Ser. No. 11/404,687, filed Apr. 13, 2006, and Ser. No. 11/404,620, filed Apr. 13, 2006, which are each hereby incorporated by reference herein, in their entireties. Vertical index construction module 144 can use the vertical collections and document posting lists for index terms stored in dynamic document index 142 to construct a vertical index. Other data structures that can be stored in dynamic document index 142 include anchor collections, which include, for any given web page, the list of URLs that reference the web page as well as the text around each such reference. For example, consider the case in which there is a first page and a second page that references the first page. The anchor collection will include the identity of the second page as well as the text surrounding the reference in the second page to the first page (e.g., what the second page has to say about the first page). Thus, the anchor text provides, for a given URL, the referencing text of other pages that refer to the URL.
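By way of illustration only, an anchor collection of the kind described above might be assembled as in the following sketch. The function name `build_anchor_collections`, the `window` parameter (the number of characters of surrounding text kept on each side of a reference), the regular-expression link matcher, and the example URLs are all assumptions introduced here; a production system would more likely use a full HTML parser.

```python
import re
from collections import defaultdict

# Simplified matcher for <a href="...">...</a>; real HTML is more varied.
LINK = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def strip_tags(fragment):
    """Replace markup with spaces so only visible text remains."""
    return re.sub(r"<[^>]+>", " ", fragment)


def build_anchor_collections(pages, window=40):
    """Map each referenced URL to a list of (referencing URL, surrounding text).

    pages: dict of URL -> HTML source (hypothetical input shape).
    """
    anchors = defaultdict(list)
    for source_url in sorted(pages):
        html = pages[source_url]
        for match in LINK.finditer(html):
            before = strip_tags(html[:match.start()])[-window:]
            after = strip_tags(html[match.end():])[:window]
            text = " ".join((before + " " + strip_tags(match.group(2)) + " " + after).split())
            anchors[match.group(1)].append((source_url, text))
    return dict(anchors)


pages = {
    "http://a.example/": "<p>Welcome to Sailing 101.</p>",
    "http://b.example/review": (
        '<p>A great guide to sailboats: '
        '<a href="http://a.example/">Sailing 101</a> is worth reading.</p>'
    ),
}
anchor_collections = build_anchor_collections(pages)
```

Here the second page references the first, so the anchor collection for the first page records the identity of the second page together with what the second page says about it.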
In some embodiments, vertical collections are constructed using documents referenced in an inverted index that pertain to a particular non-hierarchical category. For example, one vertical collection may be constructed from documents referenced in an inverted index that pertain to movies, another vertical collection may be constructed from documents referenced in an inverted index that pertain to sports, and so forth. Vertical collections can be constructed, merged, or split in a relatively straightforward manner. In some embodiments, there are thousands of vertical collections set up in this manner. In some embodiments, there are millions of vertical collections set up in this manner. In preferred embodiments, each such vertical collection is stored in a block 204 of dynamic document index 142 in the same manner that document posting lists for index terms are individually stored in blocks 204.
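Because a vertical collection is stored in the same manner as a posting list, one plausible way to restrict a search to a vertical collection is to intersect the posting list for the search term with the membership of the vertical collection. The sketch below is an assumption for illustration, not a description of the method of the incorporated applications; the function name `search_vertical`, the tuple keys, and the document identifiers are hypothetical.

```python
def search_vertical(index, term, vertical):
    """Return documents in the term's posting list that belong to the vertical.

    index: dict keyed by ("term", name) or ("vertical", name), each mapping to
    a list of document identifiers (hypothetical layout).
    """
    members = set(index.get(("vertical", vertical), []))
    return [doc_id for doc_id in index.get(("term", term), []) if doc_id in members]


index = {
    ("term", "engine"): [2, 5, 7, 9],
    ("term", "racing"): [5, 9, 11],
    ("vertical", "car racing"): [5, 7, 11],
    ("vertical", "sailboat"): [2, 9],
}
car_results = search_vertical(index, "engine", "car racing")
sail_results = search_vertical(index, "engine", "sailboat")
```

Note that document 7 belongs to the "car racing" vertical collection even though it does not appear in the posting list for "racing", consistent with the absence of any requirement that members of a vertical collection contain particular index terms.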
In some embodiments, a first bucket 202 in dynamic document index 142 is characterized by a data size of 2^4 bytes, a second bucket 202 in dynamic document index 142 is characterized by a data size of 2^5 bytes, a third bucket 202 in dynamic document index 142 is characterized by a data size of 2^6 bytes, and a fourth bucket 202 in dynamic document index 142 is characterized by a data size of 2^7 bytes, and so forth through a bucket 202 characterized by a data size of 2^28 bytes, 2^29 bytes, 2^30 bytes, or an even larger value. Thus, some embodiments of the present invention provide a dynamic document index 142 containing buckets characterized by a data size of 2^4, 2^5, 2^6, 2^7, 2^8, 2^9, 2^10, 2^11, 2^12, 2^13, 2^14, 2^15, 2^16, 2^17, 2^18, 2^19, 2^20, 2^21, 2^22, 2^23, 2^24, 2^25, 2^26, 2^27, 2^28, 2^29, 2^30, or 2^31 bytes. There is no requirement that the characteristic data size of a bucket be a power of 2. Other characteristic data sizes are possible. One limitation on the absolute size of the buckets is that at least some of the blocks 204 allocated within a bucket are stored in memory 114 (RAM memory). Thus, as computers advance and RAM memory sizes increase, the largest characteristic data size of buckets 202 in dynamic document index 142 will increase without departing from the present invention.
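For the power-of-two embodiment, the choice of bucket for a block of a given size can be sketched as follows. The sketch examines buckets from the smallest characteristic data size upward and returns the first one that accommodates the block. The names `MIN_EXP`, `MAX_EXP`, and `bucket_exponent` are assumptions introduced here, and the range of 2^4 through 2^31 bytes is taken from the example above.

```python
MIN_EXP = 4    # smallest bucket in the example: 2**4 = 16 bytes
MAX_EXP = 31   # largest bucket in the example: 2**31 bytes


def bucket_exponent(block_size):
    """Return k such that the 2**k-byte bucket is the smallest that fits the block.

    A block fits a bucket when its size does not exceed 2**k; because smaller
    buckets are tried first, the size also exceeds 2**(k-1) unless k == MIN_EXP.
    """
    if block_size < 1 or block_size > 2 ** MAX_EXP:
        raise ValueError("block does not fit in any bucket")
    for k in range(MIN_EXP, MAX_EXP + 1):
        if block_size <= 2 ** k:
            return k


sizes = [10, 16, 17, 48, 64, 65]
chosen = [bucket_exponent(size) for size in sizes]
```

A 48-byte block and a 64-byte block both land in the 2^6 (64-byte) bucket, while a 65-byte block requires the 2^7 (128-byte) bucket.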
In preferred embodiments, a portion of each bucket 202 in dynamic document index 142 is stored in RAM memory (e.g., memory 114 of
In some embodiments an entire bucket is stored in RAM memory. In some embodiments the most recently used bucket is stored in RAM memory. However, as is known to those of skill in the art, the operating system of a computer system will frequently page data structures, or portions thereof, in and out of RAM memory. Thus, the number of blocks in any given bucket that is actually stored in RAM memory at any given time may vary over time. In some embodiments, there is a threshold indicator that states that for buckets below the threshold, the entire bucket is to be stored in RAM and for buckets above the threshold, only some of the blocks in the bucket are to be stored in RAM. This threshold may be a block size (e.g., 2^20 bytes). However, even in such embodiments, operating system paging may cause the amount of the buckets that is stored in RAM memory to vary from this general threshold specification.
In some embodiments, the percentage of blocks relegated to magnetic memory 120 is the same or different for each bucket 202 in dynamic document index 142. In some embodiments, a threshold number of blocks 204 in a given bucket are permitted in RAM memory 114 rather than limiting the number of blocks 204 in RAM memory 114 to a given percentage of the blocks 204 of a bucket 202. For instance, in some embodiments up to 100, up to 1000, up to 10^4, up to 10^5, up to 10^6, up to 10^7, up to 10^8, up to 10^9, or up to 10^10 blocks 204 in a given bucket 202 can be stored in RAM memory 114 while the remainder of the blocks in the bucket are stored in magnetic memory 120. In some embodiments, each of the blocks 204 in a given bucket 202 that is stored in magnetic memory 120 has a least used status. In some embodiments of the present invention, the portion of dynamic document index 142 stored in RAM memory 114 uses between 25 percent and 75 percent of all available RAM memory in server 100. In some embodiments, the portions of dynamic document index 142 that are stored in RAM memory 114 are on server 100 but the portions of dynamic document index 142 relegated to magnetic memory may be stored on computers or other devices containing computer readable media that are addressable by server 100 across Internet/network 122.
Referring to
Referring to
In some embodiments of the present invention, document posting 206 advantageously has additional information. In addition to providing the offset for each instance of a given index term in a referenced document, document posting 206 provides the context of each instance of the search term in the document. An example of a search term context in a referenced document is an identity of an HTML tag that encloses the instance of the search term in the document.
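By way of illustration only, a document posting that carries both an offset and an enclosing-tag context for each instance of a term might look like the following sketch. The class names `Occurrence` and `DocumentPosting`, the helper `make_posting`, and the choice to measure offsets in word positions are assumptions introduced here; the sketch also assumes well-formed HTML in which every opened tag is closed.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Occurrence:
    offset: int    # word position of the instance within the document text (assumption)
    context: str   # innermost enclosing HTML tag, or "" if none


@dataclass
class DocumentPosting:
    doc_id: int
    occurrences: list = field(default_factory=list)


class _TermScanner(HTMLParser):
    """Tracks the stack of open tags while scanning text for one term."""

    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.open_tags = []
        self.position = 0
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to, and including, the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        for word in data.split():
            if word.strip(".,;:!?").lower() == self.term:
                context = self.open_tags[-1] if self.open_tags else ""
                self.found.append(Occurrence(self.position, context))
            self.position += 1


def make_posting(doc_id, html, term):
    scanner = _TermScanner(term)
    scanner.feed(html)
    scanner.close()
    return DocumentPosting(doc_id, scanner.found)


html = ("<html><head><title>Sailboat guide</title></head>"
        "<body><p>Buy a sailboat today.</p></body></html>")
posting = make_posting(7, html, "sailboat")
```

The resulting posting records one instance inside the title tag and one inside a paragraph tag, which a ranking function could weight differently.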
Referring to
There is no limitation on the size of the data structure 502 referenced by the hash table. For example, each data structure 502 can have a predetermined size between 10 bits and 1000 bits. Larger and smaller data sizes are possible as well.
In
Referring to
Referring to
Methods for using the software modules and data structures of the present invention to modify a given block 204 without having to operate on, shuffle, or otherwise disturb any other blocks in dynamic document index 142 will now be described. In some embodiments, search engine 136 receives a query request that includes search terms. A lookup for a search term in the query request is then performed by query handler 134 using hash table 138 thereby identifying a data structure 502. Data structure 502 identifies a first bucket 202 in dynamic document index 142. Data structure 502 further identifies an offset into the first bucket. The block 204 identified by data structure 502 is retrieved from the identified bucket 202 at the offset specified by data structure 502. The block 204 is then modified. Once modified, the block 204 is restored to dynamic document index 142. Specifically, the block, in modified form, is restored to the original bucket at the original offset specified by data structure 502 when (i) the size of the block, in modified form, does not exceed a maximum allowed block size for the original bucket 202 and (ii) the block, in modified form, exceeds a minimum allowed block size for the original bucket. In typical embodiments, the maximum allowed block size is the characteristic size of the original bucket (e.g., 2^8 bytes). If the modified block no longer satisfies these criteria, the block is simply added to another bucket. For instance, in some embodiments, the block, in modified form, is added to one bucket in the dynamic document index 142 when the size of the block, in modified form, exceeds a maximum allowed block size for the original bucket and to another bucket in the dynamic document index 142 when the size of the block, in modified form, is less than a minimum allowed block size for the original bucket. To illustrate, consider the case in which the original bucket is characterized by a size of 2^6, or 64, bytes. Thus, each block 204 in original bucket 202 is allocated 64 bytes, whether the blocks use this much space or not. Say that the retrieved block uses 48 bytes before modification but uses 52 bytes after modification. In this instance, the size of the block does not exceed the maximum allowed block size for the first bucket (2^6 bytes, or 64 bytes) and the block exceeds a minimum allowed block size for the original bucket (say 2^5 bytes, or 32 bytes). In this instance, the block 204, in modified form, is returned to the original bucket 202 at the same offset where it initially resided. Consider, alternatively, that the retrieved block 204 uses 66 bytes after modification. In this instance the block, in modified form, exceeds a maximum allowed block size for the original bucket 202 (e.g., 2^6 or 64 bytes). Therefore the block, in modified form, is added to another bucket 202 in dynamic document index 142 that is characterized by a larger data size than the first bucket (e.g., 2^7 bytes or 128 bytes). Consider alternatively still, that the retrieved block, in modified form, has a size of only 30 bytes. In this instance the block, in modified form, is less than the minimum allowed block size for the original bucket 202 (e.g., 33 bytes). Therefore the block 204 is added, in modified form, to a bucket in dynamic document index 142 that is characterized by a smaller data size than the original bucket (e.g., 2^5 or 32 bytes). Free lists 140 are updated appropriately to reflect the location of the block 204. For instance, if block 204 is returned to the original offset of the original bucket 202, no free list 702 is updated. If the block is added to a new offset in a new bucket 202, the offset in the new bucket 202 is removed from the free list 702 for the new bucket 202 and the original offset in the original bucket 202 is added to the free list 702 for the original bucket.
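The modification procedure described above can be sketched as follows. This is a simplified in-memory model offered under stated assumptions, not the implementation of the invention: the class name `DynamicIndex`, its method names, the use of Python lists as buckets with slot indices as offsets, and the representation of a block as a byte string are all hypothetical. The hash table maps each term to a (bucket exponent, offset) pair in the role of data structure 502, and each bucket has its own free list of reusable offsets.

```python
from collections import defaultdict


class DynamicIndex:
    """Toy model: power-of-two buckets, per-bucket free lists, term -> location."""

    def __init__(self, min_exp=4, max_exp=31):
        self.min_exp, self.max_exp = min_exp, max_exp
        self.buckets = defaultdict(list)     # exponent -> list of block slots
        self.free_lists = defaultdict(list)  # exponent -> reusable offsets
        self.hash_table = {}                 # term -> (exponent, offset)

    def _bucket_for(self, size):
        for k in range(self.min_exp, self.max_exp + 1):
            if size <= 2 ** k:
                return k
        raise ValueError("block does not fit in any bucket")

    def _allocate(self, k, block):
        if self.free_lists[k]:
            offset = self.free_lists[k].pop()   # reuse a freed slot
            self.buckets[k][offset] = block
        else:
            offset = len(self.buckets[k])       # grow the bucket by one slot
            self.buckets[k].append(block)
        return offset

    def put(self, term, block):
        k = self._bucket_for(len(block))
        self.hash_table[term] = (k, self._allocate(k, block))

    def get(self, term):
        k, offset = self.hash_table[term]
        return self.buckets[k][offset]

    def modify(self, term, new_block):
        old_k, old_offset = self.hash_table[term]
        new_k = self._bucket_for(len(new_block))
        if new_k == old_k:
            # Still fits the original bucket: restore at the original offset.
            self.buckets[old_k][old_offset] = new_block
            return
        # Otherwise free the old slot and place the block in another bucket.
        self.buckets[old_k][old_offset] = None
        self.free_lists[old_k].append(old_offset)
        self.hash_table[term] = (new_k, self._allocate(new_k, new_block))


idx = DynamicIndex()
idx.put("sailboat", bytes(48))   # 48 bytes: stored in the 2**6 bucket
idx.put("racing", bytes(40))
trace = []
for new_size in (52, 66, 30):    # the three cases walked through above
    idx.modify("sailboat", bytes(new_size))
    trace.append(idx.hash_table["sailboat"])
```

In this run the block first stays at its original offset in the 64-byte bucket, then moves to the 128-byte bucket, then to the 32-byte bucket. The location of the "racing" block is never touched, and each vacated offset is recorded in the free list of the bucket it left.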
In some embodiments, a block 204 comprises a plurality of document postings and the above-referenced modifications that are made to a block 204 include adding one or more document postings to the plurality of document postings in the block. In some embodiments, the above-referenced modifications that are made to a block comprise removing one or more document postings from the plurality of document postings in the block.
All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
The present invention can be implemented as a computer program product that comprises a computer program mechanism embedded in a computer readable storage medium. For instance, the computer program product could contain the program modules shown in
Many modifications and variations of this invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The invention is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled.