The present invention generally relates to reading posting lists as part of searching an inverted index. More particularly, the invention relates to segmenting a posting list into a plurality of segments based on the size of the list.
The following definition of Information Retrieval (IR) is from the book Introduction to Information Retrieval by Manning, Raghavan and Schutze, Cambridge University Press, 2008:
An inverted index is a data structure central to the design of numerous modern information retrieval systems. In chapter 5 of Search Engines: Information Retrieval in Practice (Addison Wesley, 2010), Croft, Metzler and Strohman observe:
In a search system implemented using a computer, an inverted index often comprises two related data structures:
When processing a user's query, a computerized search system needs access to the postings of the terms that describe the user's information need. As part of processing the query, the search system aggregates information from these postings, by document, in an accumulation process that leads to a ranked list of documents to answer the user's query.
A large inverted index may not fit into a computer's main memory, requiring secondary storage, typically disk storage, to help store the posting file, lexicon, or both. Each separate access to disk may incur seek time on the order of several milliseconds if it is necessary to move the hard drive's read heads, which is very expensive in terms of runtime performance compared to accessing main memory.
Therefore, it would be helpful to minimize accesses to secondary storage for reading posting lists when searching an inverted index, in order to improve runtime performance.
The present invention satisfies the above-noted need by providing a posting list reader that reads a posting list efficiently during inverted index searching by reducing the number of accesses to secondary storage as compared to a traditional buffered reading strategy that repeatedly uses a uniform input buffer size.
The posting list reader of the present invention will be referred to as a segmenting posting list reader, to distinguish it from posting list readers in general. Further, a posting list segment refers to a sequence of adjacent postings within a posting list. A complete segmentation of a posting list breaks it up into one or more non-overlapping segments that together include all the postings of the list.
In accordance with the above, it is an object of the present invention to provide a segmenting posting list reader that can determine how many postings to read on each read request.
It is another object of the present invention to provide a segmenting reader to read short posting lists in a single burst of reading.
It is still another object of the present invention to provide a segmenting reader that automatically breaks long posting lists into segments according to, for example, a strategy that may vary with the requirements of evaluation logic, posting list organization, or other considerations. Each read request preferably reads the next segment in one burst of reading.
It is yet another object of the present invention to provide a segmenting reader with support for posting list segments of both exact and approximate size.
Finally, it is another object of the present invention to provide a segmenting posting list reader that learns, remembers and applies posting list segmentations with only a small amount of up-front configuration.
The present invention provides, in a first aspect, a method of reading a posting list. The method comprises determining by a processor a size of a posting list as part of searching an inverted index, segmenting the posting list for reading by the processor into a plurality of segments based on the size, and reading by the processor each of the plurality of segments into memory.
The present invention provides, in a second aspect, a computer system for reading a posting list. The computer system comprises a memory, and a processor in communication with the memory to perform a method. The method comprises determining a size of a posting list as part of searching an inverted index, segmenting the posting list for reading into a plurality of segments based on the size, and reading each of the plurality of segments into memory.
The present invention provides, in a third aspect, a program product for reading a posting list. The program product comprises a storage medium readable by a processor and storing instructions for execution by the processor for performing a method. The method comprises determining a size of a posting list as part of searching an inverted index, segmenting the posting list for reading into a plurality of segments based on the size, and reading each of the plurality of segments into memory.
These, and other objects, features and advantages of this invention will become apparent from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings.
One or more aspects of the present invention are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The distribution of posting list lengths in a search index is described by Zipf's law, which states that, given a corpus of natural language documents, the frequency of any word is inversely proportional to its rank in the frequency table.
Queries submitted to a search system are themselves small natural language documents, so they too adhere to Zipf's law. It follows that the relatively few long posting lists in a search index are also the most frequently accessed during query processing, and an efficient read strategy for long posting lists can therefore help a search system deliver fast query run times. Conveniently, the long posting lists are few, which makes it feasible to craft, and hold in memory, exact read strategies for these lists.
An information retrieval system 200 that searches an inverted index comprises components similar to those labeled InvertedIndexSearcher 202 and PostingListReader 204 in
Inverted index searcher 202 takes a query 206 as input and returns search results 208. Information contained in the query includes, at a minimum, a term or terms describing the user's information need. The query optionally includes other features such as, for example, Boolean constraints (AND, OR, NOT), term weights, phrase constraints, or proximity restrictions. The query may be expressed literally as submitted by the user, or it may already have been parsed and structured. The search results returned, at a minimum, comprise unique identifiers of the documents matching the query. Often, the search results are returned in order of descending relevance, and each search result may optionally include a variety of other information such as a score, date indexed, document last modified date, a copy of the document as it was indexed, the document's URL if applicable, document title, a “snippet” or keywords in context showing how the query matches the document, and application-specific metadata.
A given inverted index searcher instance searches a single inverted index. A large scale search engine may have multiple inverted index searcher instances, spread out on different servers in a server cluster. In this case, higher level components, not pictured here, are responsible for broadcasting queries across inverted index search services and integrating the results that come back.
When inverted index searcher 202 receives query 206, it forwards it to the evaluation logic 216, which is the code and associated data structures in the inverted index searcher that executes the query and produces a list of search results. The evaluation logic decides which posting lists to read and dispatches any needed posting list readers. The evaluation logic controls the details of reading, for example, how many posting list readers to use at once, how much of each posting list to read, the order in which lists are read, whether to read a given list all at once, whether to alternate between lists in successive bursts of reading, etc. In the example of
As it executes a search, an inverted index searcher requires data transfer from the posting file. As previously mentioned, a large search index may require implementing the posting file using secondary storage.
The main component in
The SPLR is implemented using several other software components that are introduced here and described in greater detail below. The purpose of the LexiconEntryToPostingListSegmentationMapper 304 is to provide a mapping from each lexicon entry to a segmentation of the associated posting list, thereby determining for each term in the index both the number of bursts of reading to fully read the posting list and the postings that will be read by each successive read request. The LexiconEntryToPostingListSegmentationMapper delegates work optionally to a PostingListLengthApproximationTable 306 and to a PostingListSegmentationTable 308. A PostingListLengthApproximationTable provides accurate estimates of posting list size, typically in bytes. The PostingListSegmentationTable stores segmentations of the relatively few but frequently accessed posting lists that are larger than a predetermined size. A PostingListReadLimiter 310 helps the SPLR learn segmentations of long posting lists that do not have segmentations in the PostingListSegmentationTable yet, by defining the boundaries between read bursts. An enhanced buffered reader 312 uses configurable predetermined buffer fill size strategies to read from secondary storage more efficiently than a conventional buffered reader. Finally, a BufferFillSizeSelectorFactory 314 manufactures predetermined buffer fill size strategies used to configure an enhanced buffered reader.
To describe the public interface of the SPLR, it is necessary to first define a LexiconEntry. A LexiconEntry is a record retrieved from the inverted index's lexicon. A LexiconEntry comprises at least three fields: term, document frequency, and posting file start offset. The term is an indexed word or phrase. The document frequency is the length of the term's posting list in number of postings. The posting file start offset is the offset, typically in bytes, in the posting file where the posting list of the term starts. A LexiconEntry consisting of only these 3 fields will be referred to below as a minimal lexicon entry.
A LexiconEntry may optionally include, for example, a posting file end offset and/or a posting list length. A posting file end offset is the offset, typically in bytes, in the posting file where the posting list of the term ends. A posting list length is the length of the posting list of the term, again, typically in bytes. If a lexicon entry has either or both of these fields, it will be referred to below as an extended lexicon entry.
As will become clear, whether a lexicon entry is minimal or extended affects whether a PostingListLengthApproximationTable is required in the implementation of the SPLR.
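By way of illustration, a minimal Java sketch of a lexicon entry value class consistent with the fields described above might look as follows; the field and method names are illustrative, and using negative values to mark the optional extended fields as unpopulated is an assumption of this sketch.

public class LexiconEntry {
    private final String term;                 // indexed word or phrase
    private final long documentFrequency;      // posting list length in number of postings
    private final long postingFileStartOffset; // byte offset where the posting list starts
    // Optional extended fields; a negative value means the field is not populated.
    private final long postingFileEndOffset;   // byte offset where the posting list ends
    private final long postingListLength;      // posting list length in bytes

    public LexiconEntry(String term, long documentFrequency, long postingFileStartOffset,
                        long postingFileEndOffset, long postingListLength) {
        this.term = term;
        this.documentFrequency = documentFrequency;
        this.postingFileStartOffset = postingFileStartOffset;
        this.postingFileEndOffset = postingFileEndOffset;
        this.postingListLength = postingListLength;
    }

    public String getTerm() { return term; }
    public long getDocumentFrequency() { return documentFrequency; }
    public long getPostingFileStartOffset() { return postingFileStartOffset; }

    // Returns the posting list length in bytes for an extended lexicon entry,
    // or -1 for a minimal lexicon entry.
    public long getPostingListLengthInBytes() {
        if (postingListLength >= 0) return postingListLength;
        if (postingFileEndOffset >= 0) return postingFileEndOffset - postingFileStartOffset;
        return -1;
    }
}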
The public interface of the SPLR preferably includes the following methods:
A discussion of the various software components, pictured in
The purpose of the PostingListReadLimiter is to give the SPLR a strategy whereby it can learn the complete segmentation of a long posting list.
The public interface to the PostingListReadLimiter consists of the following method: PostingListReadLimit getLimit(int readSequenceNumber). The getLimit method takes as input a readSequenceNumber, which is an integer greater than or equal to one. A posting list is read using one or more bursts of reading, one burst per segment. The first segment is designated readSequenceNumber 1, the second readSequenceNumber 2, and the readSequenceNumber increases by 1 for each successive burst of reading. The getLimit method returns a PostingListReadLimit that is used by the implementation of the SPLR's read( ) method to know when to stop reading during a burst with a given readSequenceNumber.
The details of how to best define the PostingListReadLimit will vary depending upon the posting list structure of the inverted index and associated evaluation logic.
In a score sorted index, the postings of each posting list are sorted into descending order by score, so that the evaluation logic gets the postings first with the highest scores, considered the most important. For example, in “Pruned Query Evaluation Using Pre-Computed Impacts,” In Proceedings 29th Annual International ACM SIGIR Conference (SIGIR 2006), pp. 372-379, Seattle, Wash., August 2006, incorporated herein by reference in its entirety, V. N. Anh and A. Moffat describe a technique to achieve fast search runtime and a guarantee of search result quality (i.e., relevance) using pruned query evaluation with score-at-a-time processing of an impact-sorted index. In their approach, the postings of each posting list are ordered by descending impact, where impact is a measure of the importance of a term in a document. In their approach, a posting list is read using a sequence of bursts of reading, and within each burst, each posting read contributes the same partial score value toward the score of each document encountered. With a score-sorted posting list organization, to help achieve efficient data access, it is preferable to align the segment boundaries of the present invention with the static score or impact boundaries that are built into the posting list.
With a score sorted index, the PostingListReadLimit is preferably defined as the minimum impact or score (more generally, the minimum relevance indicator) to read during a burst of reading. To enforce the limit, a burst of reading includes all remaining postings with a score greater than or equal to the minimum score that is the PostingListReadLimit for the current readSequenceNumber. The implementation of PostingListReadLimit getLimit(int readSequenceNumber) in this case is trivial. The PostingListReadLimiter has as part of its state an array of scores indexed by read sequence number, and the getLimit method simply does an array lookup and returns a score. The array of scores used by the PostingListReadLimiter is preferably configurable through a file or database read by the search system on startup.
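A minimal Java sketch of such a score-based limiter follows; in this sketch the PostingListReadLimit is represented directly as a minimum score (a double), and the configuration array is assumed to have been loaded at startup.

public class ScoreBasedPostingListReadLimiter {
    // Minimum score to read during burst 1, 2, 3, ..., indexed by read sequence number.
    private final double[] minimumScoreByReadSequence;

    public ScoreBasedPostingListReadLimiter(double[] minimumScoreByReadSequence) {
        this.minimumScoreByReadSequence = minimumScoreByReadSequence.clone();
    }

    // readSequenceNumber is 1-based; in this sketch the last configured score is
    // reused for any bursts beyond the configured schedule.
    public double getLimit(int readSequenceNumber) {
        int index = Math.min(readSequenceNumber, minimumScoreByReadSequence.length) - 1;
        return minimumScoreByReadSequence[index];
    }
}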
In a document sorted index, another common index organization that is simple and offers good compression characteristics, the postings of each posting list are sorted by document identifier. It is not possible to segment such an index for reading on score boundaries.
One example strategy to segment a posting list of a document sorted index is to make each successive burst of reading bigger, for example, doubling the size of each successive read. The intuition is to attempt to satisfy the evaluation logic's information need with minimal data transfer, but if the evaluation logic remains unsatisfied, then issue bigger and bigger reads to deliver the needed information with a relatively small number of separate accesses to secondary storage. To implement a strategy like this, the PostingListReadLimit is a minimum number of bytes, for example, to read during a burst. The burst of reading continues until the minimum number of bytes for the readSequenceNumber has been read or until end of list, whichever comes first. The implementation of PostingListReadLimit getLimit(int readSequenceNumber) is straightforward in this case. The PostingListReadLimiter has as part of its state an array of sizes in bytes indexed by read sequence number, and the getLimit method simply does an array lookup and returns a size. The array of sizes used by the PostingListReadLimiter is preferably configurable through a file or database read by the search system on startup.
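A corresponding Java sketch for a byte-count limiter with a doubling schedule follows; the first burst size and the representation of the limit as a plain byte count are illustrative assumptions of this sketch.

public class ByteCountPostingListReadLimiter {
    // Minimum number of bytes to read during burst 1, 2, 3, ..., indexed by read sequence number.
    private final long[] minimumBytesByReadSequence;

    // Builds a doubling schedule, for example 64 KB, 128 KB, 256 KB, ..., for numBursts bursts.
    public ByteCountPostingListReadLimiter(long firstBurstBytes, int numBursts) {
        minimumBytesByReadSequence = new long[numBursts];
        long size = firstBurstBytes;
        for (int i = 0; i < numBursts; i++) {
            minimumBytesByReadSequence[i] = size;
            size *= 2;
        }
    }

    // readSequenceNumber is 1-based; the last configured size is reused for any later bursts.
    public long getLimit(int readSequenceNumber) {
        int index = Math.min(readSequenceNumber, minimumBytesByReadSequence.length) - 1;
        return minimumBytesByReadSequence[index];
    }
}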
A PostingListSegmentationTable is a table of posting list segmentations randomly accessible by term, where a term is an indexed word or phrase. The segmentation information in the table may be complete or incomplete. The SPLR adds segmentation information as it becomes known.
A PostingListSegmentation object 406 describes a complete or partial segmentation of a posting list. Recall that a posting list segment is a sequence of adjacent postings within a posting list. A complete segmentation of a posting list breaks it up into one or more non-overlapping segments that together include all the postings of the list.
A PostingListSegmentation object has the following object state:
A PostingListSegmentation also has a convenience method numSegments( ) to return the number of segment lengths that are known. This is the length of the postingListSegmentLengths array.
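A minimal Java sketch of this object state follows; the complete and approximate flags and the withAdditionalSegment convenience method reflect how the object is used in the read sessions described later, and the exact representation is illustrative.

public class PostingListSegmentation {
    private final String term;                       // indexed word or phrase
    private final boolean complete;                  // true when all segments of the list are known
    private final boolean approximate;               // true when the segment lengths are estimates
    private final long[] postingListSegmentLengths;  // known segment lengths, typically in bytes

    public PostingListSegmentation(String term, boolean complete, boolean approximate,
                                   long[] postingListSegmentLengths) {
        this.term = term;
        this.complete = complete;
        this.approximate = approximate;
        this.postingListSegmentLengths = postingListSegmentLengths.clone();
    }

    public String getTerm() { return term; }
    public boolean isComplete() { return complete; }
    public boolean isApproximate() { return approximate; }
    public long getSegmentLength(int segmentIndex) { return postingListSegmentLengths[segmentIndex]; }

    // Number of segment lengths that are known, i.e. the length of the array.
    public int numSegments() { return postingListSegmentLengths.length; }

    // Convenience used while learning a segmentation: returns a copy with one more known segment.
    public PostingListSegmentation withAdditionalSegment(long segmentLength, boolean nowComplete) {
        long[] lengths = java.util.Arrays.copyOf(postingListSegmentLengths, numSegments() + 1);
        lengths[numSegments()] = segmentLength;
        return new PostingListSegmentation(term, nowComplete, approximate, lengths);
    }
}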
A PostingListSegmentationTable includes the following public methods:
In a search system that is under load, the get( ) and putRefined( ) methods may be called concurrently by multiple threads of execution. These methods should be synchronized to avoid erroneous behavior.
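A minimal Java sketch of the table follows; using a ConcurrentHashMap is one way to satisfy the synchronization requirement just noted, and is an assumption of this sketch.

import java.util.concurrent.ConcurrentHashMap;

public class PostingListSegmentationTable {
    private final ConcurrentHashMap<String, PostingListSegmentation> segmentationsByTerm =
            new ConcurrentHashMap<>();

    // Returns the segmentation stored for the term, or null when none has been learned yet.
    public PostingListSegmentation get(String term) {
        return segmentationsByTerm.get(term);
    }

    // Stores a refined (more complete) segmentation, replacing any previous entry for the term.
    public void putRefined(PostingListSegmentation segmentation) {
        segmentationsByTerm.put(segmentation.getTerm(), segmentation);
    }
}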
A PostingListLengthApproximationTable provides accurate estimates of posting list size, typically in bytes. The main method on a PostingListLengthApproximationTable is:
PostingListLengthApproximation getPostingListLengthApproximation (documentFrequency)—Returns a PostingListLengthApproximation for a posting list with the indicated document frequency (document frequency is the same thing as posting list length).
A PostingListLengthApproximation includes the following information: rangeId; average posting list length in bytes for this range; and standard deviation of posting list length in bytes for this range.
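A minimal Java sketch of the approximation result follows; the estimatedLengthInBytes convenience method reflects the "average plus a desired number of standard deviations" usage described later, and the field and method names are illustrative.

public class PostingListLengthApproximation {
    private final int rangeId;
    private final double averageLengthInBytes;
    private final double standardDeviationInBytes;

    public PostingListLengthApproximation(int rangeId, double averageLengthInBytes,
                                          double standardDeviationInBytes) {
        this.rangeId = rangeId;
        this.averageLengthInBytes = averageLengthInBytes;
        this.standardDeviationInBytes = standardDeviationInBytes;
    }

    public int getRangeId() { return rangeId; }

    // A conservative length estimate: the average plus the desired number of standard deviations.
    public long estimatedLengthInBytes(double numStandardDeviations) {
        return (long) Math.ceil(averageLengthInBytes + numStandardDeviations * standardDeviationInBytes);
    }
}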
For a detailed discussion of a PostingListLengthApproximationTable refer to U.S. Non-Provisional patent application entitled “ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE” (Attorney Docket No. 1634.068A) filed concurrently herewith.
The purpose of this component is to map a lexicon entry to a PostingListSegmentation. The PostingListSegmentation is useful to the SPLR, representing what is known about how to best break a given posting list into segments for reading.
The LexiconEntryToPostingListSegmentationMapper delegates work to a PostingListSegmentationTable and optionally to a PostingListLengthApproximationTable as will be spelled out below.
A LexiconEntryToPostingListSegmentationMapper has the following public methods:
Internally, the LexiconEntryToPostingListSegmentationMapper knows how to discriminate between long and short posting lists. A short posting list is one that is short enough to read in its entirety in one burst of reading. A long posting list is one that should be broken into multiple segments and read in pieces. The exact methodology to discriminate between long and short posting lists could vary and is left to the implementer. In one example, the inflection point on the graph of document frequency over term rank (see
As described above, different possible implementations of LexiconEntry include: a minimal lexicon entry that includes just term, document frequency and posting file start offset; and a more extended lexicon entry that adds posting file end offset or posting list length in bytes.
The implementation of getPostingListSegmentation varies depending upon whether a LexiconEntry is minimal or extended. Examples of Java-like pseudocode for these two scenarios are given below. In the pseudocode below, firstLongDocumentFrequency is the length of the shortest posting list that is considered long, as opposed to short, per the discussion above.
Example Pseudocode for getPostingListSegmentation, Minimal Lexicon Entry
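The following Java-like sketch is illustrative rather than definitive; it assumes the helper classes sketched earlier, and it assumes that firstLongDocumentFrequency and numStandardDeviations, along with references to the two tables, are data members of the mapper.

public PostingListSegmentation getPostingListSegmentation(LexiconEntry lexiconEntry) {
    String term = lexiconEntry.getTerm();
    if (lexiconEntry.getDocumentFrequency() >= firstLongDocumentFrequency) {
        // Long posting list: reuse a learned segmentation if one exists.
        PostingListSegmentation known = postingListSegmentationTable.get(term);
        if (known != null) {
            return known;
        }
        // Nothing learned yet: an incomplete, precise segmentation with no segments tells
        // the SPLR to learn the segmentation using the PostingListReadLimiter.
        return new PostingListSegmentation(term, false, false, new long[0]);
    }
    // Short posting list: one approximate segment covering the whole list. With a minimal
    // lexicon entry the length in bytes is unknown, so it is estimated from the document
    // frequency using the PostingListLengthApproximationTable.
    PostingListLengthApproximation approximation =
            postingListLengthApproximationTable.getPostingListLengthApproximation(
                    lexiconEntry.getDocumentFrequency());
    long estimatedLength = approximation.estimatedLengthInBytes(numStandardDeviations);
    return new PostingListSegmentation(term, true, true, new long[] { estimatedLength });
}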
Example Pseudocode for getPostingListSegmentation, Extended Lexicon Entry
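The following Java-like sketch for the extended case makes the same assumptions as the sketch above, except that the exact posting list length in bytes is taken from the lexicon entry, so no PostingListLengthApproximationTable is needed.

public PostingListSegmentation getPostingListSegmentation(LexiconEntry lexiconEntry) {
    String term = lexiconEntry.getTerm();
    if (lexiconEntry.getDocumentFrequency() >= firstLongDocumentFrequency) {
        // Long posting list: handled exactly as in the minimal lexicon entry case.
        PostingListSegmentation known = postingListSegmentationTable.get(term);
        if (known != null) {
            return known;
        }
        return new PostingListSegmentation(term, false, false, new long[0]);
    }
    // Short posting list: a single segment whose exact length in bytes comes from the
    // extended lexicon entry, so the segmentation is complete and precise.
    long exactLength = lexiconEntry.getPostingListLengthInBytes();
    return new PostingListSegmentation(term, true, false, new long[] { exactLength });
}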
The buffered reader used by the SPLR is an enhanced buffered reader that uses configurable predetermined buffer fill size strategies to read from secondary storage more efficiently than a conventional buffered reader.
For a detailed discussion of enhanced buffered readers, refer to U.S. Non-Provisional patent application entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A) filed concurrently herewith.
The BufferFillSizeSelectorFactory is used to make BufferFillSizeSelector objects for plugging into the enhanced buffered reader. A BufferFillSizeSelector object is a predetermined buffer fill size strategy. More specifically, a BufferFillSizeSelector is an ordered sequence of (fillSize, numTimesToUse) pairs, where fillSize indicates how much of an enhanced buffered reader's internal input buffer to fill when a buffer fill is needed, and numTimesToUse indicates how many times to use the associated fillSize.
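A minimal Java sketch of such a selector follows; representing each (fillSize, numTimesToUse) pair as a two-element array, and reusing the final stage once the schedule is exhausted, are simplifying assumptions of this sketch.

import java.util.ArrayList;
import java.util.List;

public class BufferFillSizeSelector {
    private final List<long[]> stages = new ArrayList<>(); // each entry: { fillSize, numTimesToUse }
    private int stageIndex = 0;
    private long usesOfCurrentStage = 0;

    public void addStage(long fillSize, long numTimesToUse) {
        stages.add(new long[] { fillSize, numTimesToUse });
    }

    // Returns the fill size for the next buffer fill, advancing through the stages;
    // the last stage is reused after all earlier stages are exhausted.
    public long nextFillSize() {
        long[] stage = stages.get(stageIndex);
        if (usesOfCurrentStage >= stage[1] && stageIndex < stages.size() - 1) {
            stageIndex++;
            usesOfCurrentStage = 0;
            stage = stages.get(stageIndex);
        }
        usesOfCurrentStage++;
        return stage[0];
    }
}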
The object state of the BufferFillSizeSelectorFactory includes maxBufferSize, which is the largest read system call that can be issued, typically in bytes, based on the maximum available input buffer size of the enhanced buffered reader. In one example, a large maxBufferSize (of 20 megabytes or so) is used on a commodity server with an index of 20 million web documents.
The BufferFillSizeSelectorFactory provides the following public methods:
In the discussion that follows, let “/” represent the operation of integer division, and “%” represent the operation of integer modulo.
To implement makePreciseBufferFillSizeSelector, there are two cases to consider, where numBytesToRead is the input to makePreciseBufferFillSizeSelector, and maxBufferSize is the largest read system call that can be issued in bytes:
A discussion of these cases follows.
Case 1: maxBufferSize>=numBytesToRead
Build a one-stage predetermined buffer fill size strategy as indicated below in Table I.
The above strategy, when installed in an enhanced buffered reader, will read exactly numBytesToRead bytes of data using a single system call.
Case 2: maxBufferSize<numBytesToRead
In this case, build a predetermined buffer fill size strategy that generally has two stages, as indicated in Table II. However, the second stage is not necessary when the maxBufferSize evenly divides numBytesToRead.
The above strategy, when installed in an enhanced buffered reader, will read exactly numBytesToRead bytes of data with the minimum possible number of read system calls.
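A minimal Java sketch of makePreciseBufferFillSizeSelector, following the two cases above, is shown below; it assumes maxBufferSize is a data member of the factory and uses the BufferFillSizeSelector sketched earlier.

public BufferFillSizeSelector makePreciseBufferFillSizeSelector(long numBytesToRead) {
    BufferFillSizeSelector selector = new BufferFillSizeSelector();
    if (maxBufferSize >= numBytesToRead) {
        // Case 1: the entire segment fits in a single read system call.
        selector.addStage(numBytesToRead, 1);
    } else {
        // Case 2: as many maximum-size reads as possible, plus one remainder read
        // when maxBufferSize does not evenly divide numBytesToRead.
        selector.addStage(maxBufferSize, numBytesToRead / maxBufferSize);
        long remainder = numBytesToRead % maxBufferSize;
        if (remainder > 0) {
            selector.addStage(remainder, 1);
        }
    }
    return selector;
}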
To implement makeApproximateBufferFillSizeSelector, there are two cases to consider, where approximateNumBytesToRead is input to makeApproximateBufferFillSizeSelector, and maxBufferSize is the largest read system call that can be issued in bytes:
Case 3: maxBufferSize>=approximateNumBytesToRead

Build a two-stage predetermined buffer fill size strategy as indicated below in Table III.
The above strategy, when installed in an enhanced buffered reader, will read approximateNumBytesToRead bytes of data using a single read system call and thereafter will perform as many additional system calls of the supplemental read size as necessary.
Case 4: maxBufferSize<approximateNumBytesToRead
In this case, build a predetermined buffer fill size strategy that generally has three stages, as indicated below in Table IV. However, the second stage is not necessary when the maxBufferSize evenly divides the approximateNumBytesToRead.
The above strategy, when installed in an enhanced buffered reader, will read approximateNumBytesToRead bytes of data with the minimum possible number of read system calls and thereafter will perform as many additional system calls of the supplemental read size as necessary.
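A minimal Java sketch of makeApproximateBufferFillSizeSelector, following cases 3 and 4 above, is shown below; supplementalReadSize (for example, a few disk blocks) is an assumed configuration value of the factory, and the supplemental stage is repeated as needed because the length being read is only an estimate.

public BufferFillSizeSelector makeApproximateBufferFillSizeSelector(long approximateNumBytesToRead) {
    BufferFillSizeSelector selector = new BufferFillSizeSelector();
    if (maxBufferSize >= approximateNumBytesToRead) {
        // Case 3: one read of the approximate length.
        selector.addStage(approximateNumBytesToRead, 1);
    } else {
        // Case 4: the minimum number of reads covering the approximate length.
        selector.addStage(maxBufferSize, approximateNumBytesToRead / maxBufferSize);
        long remainder = approximateNumBytesToRead % maxBufferSize;
        if (remainder > 0) {
            selector.addStage(remainder, 1);
        }
    }
    // Supplemental reads, repeated as many times as needed if the estimate was low.
    selector.addStage(supplementalReadSize, Long.MAX_VALUE);
    return selector;
}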
One example of a method of reading a posting list will now be described with reference to the flow diagram 500 of
Having described the SPLR and each of its subcomponents from
In a first example, the sequence diagram 600 in
The inverted index searcher begins a reading session by calling the open method 616 on the SPLR, passing in the lexicon entry of the posting list to read. The SPLR saves a reference to this lexicon entry as part of its state to help control the reading session. The SPLR calls getPostingListSegmentation 618 on the LexiconEntryToPostingListSegmentationMapper, forwarding the lexicon entry. The LexiconEntryToPostingListSegmentationMapper examines the document frequency of the lexicon entry and consults its method of discriminating between long and short posting lists. The LexiconEntryToPostingListSegmentationMapper determines that the posting list to read is short and calls getPostingListLengthApproximation 620 on the PostingListLengthApproximationTable, providing as input the document frequency of the lexicon entry. A PostingListLengthApproximation is returned to the LexiconEntryToPostingListSegmentationMapper 622, which then builds a complete, approximate PostingListSegmentation, incorporating the term from the lexicon entry and a single posting list segment length equal to the average posting list length in bytes plus the desired number of standard deviations from the PostingListLengthApproximation. The LexiconEntryToPostingListSegmentationMapper returns this newly built PostingListSegmentation to the SPLR 624, where it becomes part of the SPLR's state to control the reading session. The SPLR finishes the execution of its open( )method by initializing various miscellaneous state variables and finally seeking the enhanced buffered reader to the start of the posting list 626 by passing the posting file start offset of the lexicon entry to the enhanced buffered reader's seek method. At this point, the open method called by the inverted index searcher returns, and the SPLR is ready to accept a read call.
The inverted index searcher calls the SPLR's read( )method 628. Based on the state established during the open( )method, the SPLR recognizes that the posting list being read consists of a single segment with an approximate length in bytes. The SPLR forwards the approximate number of bytes to read to the BufferFillSizeSelectorFactory's makeApproximateBufferFillSizeSelector method 630. A predetermined buffer fill size strategy in the form of a BufferFillSizeSelector object is returned to the SPLR 632, which it installs in the enhanced buffered reader by calling setBufferFillSizeSelector 634. The SPLR next uses the enhanced buffered reader to read all of the postings in this relatively short posting list 636, forwarding each posting to the evaluation logic 638. Finally, the SPLR's read method returns false to the inverted index searcher 640, indicating that there are no more postings available to be read, and the inverted index searcher calls close 642 on the SPLR to end the reading session.
In a second example, the sequence diagram 700 in
The inverted index searcher begins a reading session by calling the open method on the SPLR 716, passing in the lexicon entry of the posting list to read. The SPLR saves a reference to this lexicon entry as part of its state to help control the reading session. The SPLR calls getPostingListSegmentation on the LexiconEntryToPostingListSegmentationMapper 718, forwarding the lexicon entry. The LexiconEntryToPostingListSegmentationMapper examines the document frequency of the lexicon entry and consults its method of discriminating between long and short posting lists. The LexiconEntryToPostingListSegmentationMapper determines that the posting list to read is long and calls get( ) on the PostingListSegmentationTable 720, passing in the term of the lexicon entry as the key for the lookup. The PostingListSegmentationTable consults its hash but finds no mapping from the term to a PostingListSegmentation. In this scenario, the posting list has not been read since the inverted index was deployed, and its segmentation is unknown. The get( ) call returns null to the LexiconEntryToPostingListSegmentationMapper 722, indicating that no segmentation information is available. In response, the LexiconEntryToPostingListSegmentationMapper creates a new incomplete, precise (i.e., complete=false, approximate=false) PostingListSegmentation, incorporating the term from the lexicon entry, and using an empty array of posting list segment lengths. This new empty PostingListSegmentation is returned to the SPLR 724, where it becomes part of the SPLR's state to control the reading session. The SPLR finishes the execution of its open( ) method by initializing various miscellaneous state variables and finally seeking the enhanced buffered reader to the start of the posting list 726 by passing the posting file start offset of the lexicon entry to the enhanced buffered reader's seek method. At this point, the open method called by the inverted index searcher returns, and the SPLR is ready to accept a read call.
The inverted index searcher calls the SPLR's read( )method 728. Based on the state established during the open( )method, the SPLR recognizes that the posting list consists of multiple segments, that the segment boundaries are unknown, and the segment boundaries need to be learned. Because this is the first call to read in this session, the SPLR forwards the value 1 (one) to the getLimit method of the PostingListReadLimiter 730. The PostingListReadLimiter returns a PostingListReadLimit 732, an indication of how far the SPLR may read during this first read call. With this information, the SPLR is almost ready to read postings. Since the SPLR does not know the size in bytes of the segment it is about to read, it calls setBufferFillSizeSelector 734 to install a default predetermined buffer fill size strategy on the enhanced buffered reader that always buffers several disk blocks worth of data whenever the buffered reader needs more data. This strategy is acceptable for learning a new segmentation, after which a better strategy will be available.
Before reading any postings, the SPLR is careful to note the current logical position of the enhanced buffered reader in the posting file 736. Knowing the read start position will allow the SPLR to know the length of the segment later when reading stops. The SPLR now uses the enhanced buffered reader to read postings 738, forwarding each one to the evaluation logic as soon as it is read 740, stopping when the PostingListReadLimit is reached or at end of posting list, whichever comes first. In this case, reading stops because the PostingListReadLimit is reached. Once again the SPLR gets the current logical position from the enhanced buffered reader 742. The difference between this second logical position and the first one that was obtained is the length of the segment just read. The SPLR creates and remembers an updated PostingListSegmentation object that includes the new segment length just learned. The SPLR then passes the updated PostingListSegmentation to the updatePostingListSegmentation method of the LexiconEntryToPostingListSegmentationMapper 744, to preserve the updated segmentation information for reuse by future read sessions. The LexiconEntryToPostingListSegmentationMapper simply forwards the PostingListSegmentation to the putRefined method of the PostingListSegmentationTable 746, where the PostingListSegmentation is stored for reuse. Because reading stopped due to the PostingListReadLimit (and not due to end of posting list), there are more postings to read and the SPLR's read method returns true 748 to the inverted index searcher to indicate this fact.
The inverted index searcher then calls the SPLR's read method a second time 750. Based on the state of the SPLR after the first read call, the SPLR recognizes that the posting list consists of multiple segments, more postings are available, but the extent of the next segment to read is unknown and has to be learned. Because this is the second call to read in this session, the SPLR forwards the value 2 (two) to the getLimit method of the PostingListReadLimiter 752. The PostingListReadLimiter returns a PostingListReadLimit 754, an indication of how far the SPLR may read during this second read call. The SPLR now follows the same steps it used during the first read call, installing a default predetermined buffer fill size strategy on the enhanced buffered reader 756, noting the read start position by getting the current logical position from the enhanced buffered reader 758, and reading postings 760 and forwarding each one to the evaluation logic 762. As before, reading stops when the PostingListReadLimit is reached or at end of posting list, whichever comes first. In this case, reading stops because end of posting list is reached.
The SPLR then gets the current logical position from the enhanced buffered reader 764. The difference between this second logical position and the first one that was obtained is the length of the segment just read. The SPLR creates and remembers an updated PostingListSegmentation object that includes both the new segment length just learned and the new knowledge that the segmentation of this posting list is complete (complete=true). The SPLR then passes the updated PostingListSegmentation to the updatePostingListSegmentation method of the LexiconEntryToPostingListSegmentationMapper 766, to preserve the updated segmentation information for reuse by future read sessions. The LexiconEntryToPostingListSegmentationMapper simply forwards the PostingListSegmentation to the putRefined method of the PostingListSegmentationTable 768, where the PostingListSegmentation is stored for reuse. Because reading stopped this time due to end of posting list, there are no more postings to read and the SPLR's read method returns false to the inverted index searcher to indicate this fact 770. Finally, the inverted index searcher calls close to close this read session 772.
In a third example, the sequence diagram 800 in
The inverted index searcher begins a reading session by calling the open method on the SPLR 816, passing in the lexicon entry of the posting list to read. The SPLR saves a reference to this lexicon entry as part of its state to help control the reading session. The SPLR calls getPostingListSegmentation on the LexiconEntryToPostingListSegmentationMapper 818, forwarding the lexicon entry. The LexiconEntryToPostingListSegmentationMapper examines the document frequency of the lexicon entry and consults its method of discriminating between long and short posting lists. The LexiconEntryToPostingListSegmentationMapper determines that the posting list to read is long and calls get( ) on the PostingListSegmentationTable 820, passing in the term of the lexicon entry as the key for the lookup. The PostingListSegmentationTable consults its hash and finds that the term is mapped to a complete, precise (i.e. complete=true, approximate=false) PostingListSegmentation with 2 segments. The get( ) call returns this PostingListSegmentation to the LexiconEntryToPostingListSegmentationMapper 822, which in turn simply returns it to the SPLR 824, where it becomes part of the SPLR's state to control the reading session. The SPLR finishes the execution of its open( )method by initializing various miscellaneous state variables and finally seeking the enhanced buffered reader to the start of the posting list 826 by passing the posting file start offset of the lexicon entry to the enhanced buffered reader's seek method. At this point, the open method called by the inverted index searcher returns, and the SPLR is ready to accept a read call.
The inverted index searcher calls the SPLR's read( )method 828. Based on the state established during the open( )method, the SPLR recognizes that the posting list being read consists of two segments of known sizes in bytes. The SPLR forwards the exact size in bytes of the first segment to the BufferFillSizeSelectorFactory's makePreciseBufferFillSizeSelector method 830. A predetermined buffer fill size strategy in the form of a BufferFillSizeSelector object is returned to the SPLR 832, which it installs in the enhanced buffered reader by calling setBufferFillSizeSelector 834. The SPLR next uses the enhanced buffered reader to read all of the postings in the first segment of this posting list 836, forwarding each posting to the evaluation logic 838. Finally, the SPLR's read method returns true to the inverted index searcher 840, indicating that there are more postings available to be read.
The inverted index searcher again calls the SPLR's read( ) method 842. Based on the state after the first read call, the SPLR recognizes that there is another segment of known size in bytes available to read. The SPLR forwards the exact size in bytes of the second segment to the BufferFillSizeSelectorFactory's makePreciseBufferFillSizeSelector method 844. A predetermined buffer fill size strategy in the form of a BufferFillSizeSelector object is returned to the SPLR 846, which it installs in the enhanced buffered reader by calling setBufferFillSizeSelector 848. The SPLR next uses the enhanced buffered reader to read all of the postings in the second segment of this posting list 850, forwarding each posting to the evaluation logic 852. Finally, the SPLR's read method returns false 854 to the inverted index searcher, indicating that there are no more postings available to be read, and the inverted index searcher closes the read session by calling close( ) on the SPLR 856.
The SPLR and its subcomponents, pictured in
The pseudocode below applies equally to all the scenarios listed above; thus, it is the common pseudocode for a SPLR implementation.
As a prerequisite to understanding the pseudocode for methods of the SPLR, it is helpful to first understand the data members that are part of its state. The following data members are initialized by sending object references to the SPLR's constructor.
The SPLR has additional state that is set up as part of a call to its open (lexiconEntry) method. These data members are documented here.
The SPLR has three public methods:
The open( ) method should be called first to prepare for reading. The read( ) method may be called multiple times; each call to read( ) reads a segment of postings, and the boolean return value indicates whether there is another segment available. Finally, a well-behaved client calls close( ) to signal the end of the reading session.
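By way of illustration, a minimal Java sketch of this public interface might look as follows; the interface name is illustrative, and LexiconEntry is as described earlier.

public interface SegmentingPostingListReader {
    // Prepares a reading session for the posting list named by the lexicon entry.
    void open(LexiconEntry lexiconEntry);

    // Reads the next segment, forwarding each posting to the evaluation logic;
    // returns true if another segment remains to be read.
    boolean read();

    // Ends the reading session.
    void close();
}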
The pseudocode below is Java-like. Java operators and Java-like syntax are used, and array indexes start at 0. Example pseudocode for each of the SPLR's public methods follows.
Example pseudocode for open method
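The following Java-like sketch of the open method is illustrative; it assumes the data members described above, including the saved lexiconEntry, the postingListSegmentation, a readNum counter, a mapper reference (LexiconEntryToPostingListSegmentationMapper), and a reader reference (the enhanced buffered reader).

public void open(LexiconEntry lexiconEntry) {
    // Save the lexicon entry; it helps control the reading session.
    this.lexiconEntry = lexiconEntry;
    // Obtain what is known about how to break this posting list into segments for reading.
    this.postingListSegmentation = mapper.getPostingListSegmentation(lexiconEntry);
    // No read calls have been made yet in this session.
    this.readNum = 0;
    // Position the enhanced buffered reader at the start of the posting list.
    reader.seek(lexiconEntry.getPostingFileStartOffset());
}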
Example Pseudocode for read method
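The following Java-like sketch of the read method is illustrative; it assumes the same data members as open plus a limiter (PostingListReadLimiter), a fillSizeSelectorFactory, a defaultFillSizeSelector that buffers several disk blocks at a time, and an evaluationLogic reference. The posting-decoding helpers readPostingsUntil, readPostingsUntilEndOfList, and readPostingsForExactly are hypothetical and elided here; each forwards every posting it reads to the evaluation logic.

public boolean read() {
    readNum++;
    if (!postingListSegmentation.isComplete()) {
        // Learning mode: the extent of this segment is not known yet.
        PostingListReadLimit limit = limiter.getLimit(readNum);
        reader.setBufferFillSizeSelector(defaultFillSizeSelector);
        long startPosition = reader.getLogicalPosition();
        boolean endOfList = readPostingsUntil(limit); // stops at the limit or at end of list
        long segmentLength = reader.getLogicalPosition() - startPosition;
        // Remember the newly learned segment, marking the segmentation complete at end of list,
        // and preserve it for reuse by future read sessions.
        postingListSegmentation =
                postingListSegmentation.withAdditionalSegment(segmentLength, endOfList);
        mapper.updatePostingListSegmentation(postingListSegmentation);
        return !endOfList;
    }
    int segmentIndex = readNum - 1;
    long segmentLength = postingListSegmentation.getSegmentLength(segmentIndex);
    if (postingListSegmentation.isApproximate()) {
        // A short list read as a single approximately sized segment: read to end of list.
        reader.setBufferFillSizeSelector(
                fillSizeSelectorFactory.makeApproximateBufferFillSizeSelector(segmentLength));
        readPostingsUntilEndOfList();
        return false;
    }
    // A known, precise segment: read exactly segmentLength bytes of postings.
    reader.setBufferFillSizeSelector(
            fillSizeSelectorFactory.makePreciseBufferFillSizeSelector(segmentLength));
    readPostingsForExactly(segmentLength);
    return readNum < postingListSegmentation.numSegments();
}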
Example pseudocode for close method
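The following Java-like sketch of the close method is illustrative; the description above requires only that close signal the end of the reading session, so clearing the per-session state is an assumption of this sketch.

public void close() {
    // End the reading session and discard per-session state.
    this.lexiconEntry = null;
    this.postingListSegmentation = null;
    this.readNum = 0;
}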
As evident in the pseudocode above, the implementation of the SPLR's read method has to handle different cases defined by the combination of the PostingListSegmentation state and the readNum. Recall that the readNum is 1 throughout the first call to read, 2 throughout the second call to read, and so on. The combination of the PostingListSegmentation (pls) state and the readNum defines cases as described in Table V below.
The definition of the cases in Table V depends upon how PostingListSegmentation objects are created by the LexiconEntryToPostingListSegmentationMapper. An awareness of this dependency is helpful for understanding and possibly evolving the pseudocode that was presented.
The PostingListSegmentationTable will be updated dynamically as the SPLR's read method is called. When the search service shuts down, the PostingListSegmentationTable is preferably saved to disk or other nonvolatile storage medium. To avoid losing the work of learning segmentations, the PostingListSegmentationTable could also be saved automatically (for example, every 5 or 10 minutes) if it has become dirty.
If the inverted index changes, the PostingListSegmentation table becomes invalid. On any index maintenance, all persistent and in-memory copies of this table must be deleted. The system can then re-learn the up-to-date segmentations.
The present invention includes the following aspects:
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer program product for efficient reading of posting lists as part of inverted index searching. The computer program product comprises a storage medium readable by a processor and storing instructions for execution by a processor for performing a method. The method includes, for instance, determining by a processor a size of a posting list as part of searching an inverted index, segmenting the posting list by the processor for reading into a plurality of segments based on the size, and reading by the processor each of the plurality of segments into memory.
Methods and systems relating to one or more aspects of the present invention are also described and claimed herein.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
In one aspect of the present invention, an application can be deployed for performing one or more aspects of the present invention. As one example, the deploying of an application comprises providing computer infrastructure operable to perform one or more aspects of the present invention.
As a further aspect of the present invention, a computing infrastructure can be deployed comprising integrating computer readable code into a computing system, in which the code in combination with the computing system is capable of performing one or more aspects of the present invention.
As yet a further aspect of the present invention, a process for integrating computing infrastructure comprising integrating computer readable code into a computer system may be provided. The computer system comprises a computer readable medium, in which the computer medium comprises one or more aspects of the present invention. The code in combination with the computer system is capable of performing one or more aspects of the present invention.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In one example, a computer program product includes, for instance, one or more computer readable media to store computer readable program code means or logic thereon to provide and facilitate one or more aspects of the present invention. The computer program product can take many different physical forms, for example, disks, platters, flash memory, etc.
Program code embodied on a computer readable medium may be transmitted using an appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
A data processing system 900, as shown in
Input/Output or I/O devices 908 (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives and other memory media, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiment with various modifications as are suited to the particular use contemplated.
This application claims priority under 35 U.S.C. §119 to the following U.S. Provisional Applications, which are herein incorporated by reference in their entirety: Provisional Patent Application Ser. No. 61/233,411, by Flatland et al., entitled “ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE,” filed on Aug. 12, 2009; Provisional Patent Application No. 61/233,420, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION,” filed on Aug. 12, 2009; and Provisional Patent Application Ser. No. 61/233,427, by Flatland et al., entitled “SEGMENTING POSTINGS LIST READER,” filed on Aug. 12, 2009. This application contains subject matter which is related to the subject matter of the following applications, each of which is assigned to the same assignee as this application and filed on the same day as this application. Each of the below listed applications is hereby incorporated herein by reference in its entirety: U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE” (Attorney Docket No. 1634.068A); and U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A).
Number | Date | Country
61233427 | Aug 2009 | US
61233420 | Aug 2009 | US
61233411 | Aug 2009 | US