The invention relates generally to computer systems, and more particularly to an improved system and method for aggregating a list of top ranked objects from ranked combination lists using an early termination algorithm.
There has been considerable past work on efficiently computing top objects by aggregating information from ranked lists of individual attributes of these objects. Efficient top-k aggregation plays a vital role in large-scale database and information retrieval systems. An important instance of this problem is query processing in search engines where k is small and the posting lists can be overwhelmingly long. One particularly well-studied approach to achieve efficiency in top-k aggregation includes early termination algorithms.
Early termination is an attractive option to ensure efficiency in top-k aggregation, and such algorithms have been developed in both database and IR contexts. See, for example, R. Fagin, A. Lotem, and M. Naor, Optimal Aggregation Algorithms for Middleware, JCSS, 66(4):614-656, 2003; S. Nepal and M. V. Ramakrishna, Query Processing Issues in Image (Multimedia) Databases, in 15th ICDE, pages 22-29, 1999; U. Güntzer, W.-T. Balke, and W. Kießling, Optimizing Multi-feature Queries for Image Databases, in 26th VLDB, pages 419-428, 2000; V. N. Anh, O. de Kretser, and A. Moffat, Vector-space Ranking with Effective Early Termination, in 24th SIGIR, pages 35-42, 2001; and V. N. Anh and A. Moffat, Compressed Inverted Files with Reduced Decoding Overheads, in 21st SIGIR, pages 290-297, 1998.
Two particularly interesting early termination algorithms are the Threshold Algorithm (TA) and the No Random-access Algorithm (NRA) proposed by Fagin, Lotem, and Naor. See R. Fagin, A. Lotem, and M. Naor, Optimal Aggregation Algorithms for Middleware, JCSS, 66(4):614-656, 2003. The Threshold Algorithm assumes random access capabilities to the list while the No Random-access Algorithm assumes only sequential access. These algorithms require aggregation functions to be monotone and proceed as follows. The input lists are scanned in parallel and the top k objects seen so far are stored. At each step, an upper bound on the best possible aggregated score of an object that is yet to be encountered is computed. If this upper bound is worse than the aggregated score of the k-th best object found so far, the algorithm stops. Note that the upper bound guarantees that the top k objects are correctly computed. However, these early termination algorithms fail to incorporate additional information such as combinations of attributes.
Another particularly well-studied approach to achieve efficiency in top-k aggregation includes pre-aggregation of some of the input lists. The use of combinations of attributes or pairs of terms to improve query processing has been addressed in several papers. See, for example, Long and Suel, Three-level Caching for Efficient Query Processing in Large Web Search Engines, In 14th WWW, pages 257-266, 2005. Long and Suel consider a three-level caching scheme for improving search engine performance, where the intermediate level is tasked to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. Unfortunately, incorporating additional information from using combinations of attributes has not been developed in early termination algorithms to achieve efficiency in top-k aggregation.
G. Das, D. Gunopulos, N. Koudas, and D. Tsirogiannis, Answering Top-k Queries Using Views, in 32nd VLDB, pages 451-462, 2006, consider the problem of answering top-k queries using views, where a view is a materialized version of a list that ranks values according to a positive linear combination of a subset of attributes of a relation. Their work relies on generic LP solvers and fails to provide combinatorial algorithms for the problem.
What is needed is a way of using additional information from combinations of attributes in early termination algorithms to achieve efficiency in top-k aggregation. Such a system and method should be able to return the top k results for applications where the posting lists can be overwhelmingly long.
The present invention provides a system and method for aggregating a list of top ranked objects from ranked combination lists using an early termination algorithm. Ranked lists of individual object attributes may be aggregated into ranked lists of combination object attributes. The ranked lists of object attributes, including ranked lists of individual object attributes as well as ranked lists of combination object attributes, may be scanned in parallel. A fixed number of top scoring objects may be stored in a results list of top ranked objects. An upper bound of the best possible aggregation scores of unseen objects in the ranked lists of object attributes may be computed to incorporate the extra information given by the combination lists of attributes. If the upper bound computed is less than the score of the top scoring objects in the results list, then the top scoring objects in the results list may be output.
In one embodiment for aggregating a list of top ranked objects from ranked combination lists using a generalized Threshold Algorithm for early termination, a list may be selected in round robin order from the ranked lists of individual attributes and the ranked combination lists of multiple attributes. The next score for an object may be read from the list, and the scores for the object may be retrieved from each of the other ranked lists. An upper bound threshold for unseen objects in the ranked lists may be computed by a mathematical program such as a linear program or an approximation program. If the upper bound threshold computed for unseen objects in the ranked lists of object attributes is less than the lowest score of an object in the results list, then the results list of top ranked objects from ranked combination lists may be output.
In another embodiment for aggregating a list of top ranked objects from ranked combination lists using a generalized No Random-access Algorithm for early termination, a list may be selected in round robin order from the ranked lists of individual attributes and the ranked combination lists of multiple attributes. The next score for an object may be read from the list. The best possible score and the worst possible score may be computed for each object seen from the ranked lists of object attributes. If the best possible score for every object seen that is not in the ranked list of results is less than the k-th largest of the worst scores computed for the objects seen, where k is the fixed number of results, then the results list of top ranked objects from ranked combination lists may be output.
The present invention may be used by many applications for aggregating a list of top ranked objects from ranked combination lists using an early termination algorithm. For example, information retrieval applications may use the present invention to output the top k most relevant documents given a multi-term query. In this case, the documents are the objects and the attribute lists are the posting lists for terms sorted by a relevance score. The relevance of a document for a multi-term query is defined to be an aggregation of the relevance scores for individual terms. Or, web search engines may use the present invention to find the top k web pages ranked according to an aggregation function to combine relevance scores of posting lists for terms. Or a database middleware system may use the present invention, given a set of objects and lists of object attributes ordered by attribute score, to find the top k objects ranked according to an aggregation function to combine attribute scores. For any of these applications, the present invention may aggregate a list of top ranked objects from ranked combination lists using an early termination algorithm.
Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer system 100 may include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer system 100 and includes both volatile and nonvolatile media. For example, computer-readable media may include volatile and nonvolatile computer storage media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer system 100. Communication media may include computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For instance, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
The system memory 104 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 106 and random access memory (RAM) 110. A basic input/output system 108 (BIOS), containing the basic routines that help to transfer information between elements within computer system 100, such as during start-up, is typically stored in ROM 106. Additionally, RAM 110 may contain operating system 112, application programs 114, other executable code 116 and program data 118. RAM 110 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by CPU 102.
The computer system 100 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, discussed above and illustrated in
The computer system 100 may operate in a networked environment using a network 136 to one or more remote computers, such as a remote computer 146. The remote computer 146 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer system 100. The network 136 depicted in
Aggregating a List of Top Ranked Objects from Ranked Combination Attribute Lists Using an Early Termination Algorithm
The present invention is generally directed towards a system and method for aggregating a list of top ranked objects from ranked combination lists using an early termination algorithm. Ranked lists of individual object attributes may be aggregated into ranked lists of combination object attributes. The ranked lists of object attributes, including ranked lists of individual object attributes as well as ranked lists of combination object attributes, may be scanned in parallel. A fixed number of top scoring objects may be stored in a results list of top ranked objects. An upper bound of the best possible aggregation scores of unseen objects in the ranked lists of object attributes may be computed to incorporate the extra information given by the combination lists of attributes. If the upper bound computed is less than the score of the top scoring objects in the results list, then the top scoring objects in the results list may be output.
As will be seen, the ranked lists of combinations of object attributes help the early termination algorithms discover new objects. For example, an object may be far down in lists Li and Lj, but be near the top in list Li,j. Additionally, the ranked lists of combinations of object attributes improve the bounds computed on the unseen elements. As will be understood, the various block diagrams, flow charts and scenarios described herein are only examples, and there are many other scenarios to which the present invention will apply.
Turning to
In various embodiments, a client computer 202 may be operably coupled to one or more servers 208 by a network 206. The client computer 202 may be a computer such as computer system 100 of
The server 208 may be any type of computer system or computing device such as computer system 100 of
The server 208 may be operably coupled to computer-readable storage such as storage 220 that may include objects 222 with attributes 224 and ranked attribute lists 226 that include objects 228 with a score 230. In an embodiment for query processing, the objects may represent web pages and the attributes may represent keywords of a query. In this case, a search engine may combine information from several different rankings of web pages to obtain the top k web-pages to answer user queries.
There may be many applications which may use the present invention for aggregating a list of top ranked objects from ranked combination lists using an early termination algorithm. In general, information retrieval applications may use the present invention to output the top k most relevant documents given a multi-term query. In this case, the documents are the objects and the attribute lists are the posting lists for terms. Within each posting list for a term, the documents that contain the term are sorted by a relevance score. The relevance of a document for a multi-term query is defined to be an aggregation of the relevance scores for individual terms. For instance, web search engines may use the present invention to find the top k web pages ranked according to an aggregation function to combine relevance scores of posting lists for terms. Typically, the number k of web pages desired is small and the posting lists can be overwhelmingly long. Or a database middleware system may use the present invention, given a set of objects and lists of object attributes ordered by attribute score, to find the top k objects ranked according to an aggregation function to combine attribute scores. For any of these applications, the present invention may aggregate a list of top ranked objects from ranked combination lists using an early termination algorithm.
In the classic scenario for database middleware, the database D may include a set of objects {R1, . . . ,Rn} where each object Ri has m different scores which may also be referred to as parameters (x1, . . . ,xm). The database may be considered to represent m sorted lists, L1, . . . ,Lm, and each element in list Li has a pair (R,xi) where xi is the i-th field of R. The lists are stored in decreasing sorted order by xi.
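The layout of the sorted lists can be illustrated with a minimal sketch. It assumes the aggregation is addition, so that a pairwise combination list is scored by the sum of two attribute scores; the function name `build_ranked_lists`, the 0-based attribute indices, and the sample scores are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def build_ranked_lists(objects):
    """objects: dict mapping object id -> tuple of m non-negative scores."""
    m = len(next(iter(objects.values())))
    # One list per attribute i, stored in decreasing order of x_i.
    singles = {
        (i,): sorted(((obj, xs[i]) for obj, xs in objects.items()),
                     key=lambda pair: pair[1], reverse=True)
        for i in range(m)
    }
    # One pre-aggregated list per attribute pair (i, j), scored by x_i + x_j
    # (assumption: the aggregation function is addition).
    pairs = {
        (i, j): sorted(((obj, xs[i] + xs[j]) for obj, xs in objects.items()),
                       key=lambda pair: pair[1], reverse=True)
        for i, j in combinations(range(m), 2)
    }
    return singles, pairs

# Hypothetical database with m = 3 scores per object.
db = {"R1": (9, 1, 4), "R2": (5, 6, 2), "R3": (2, 3, 8)}
singles, pairs = build_ranked_lists(db)
```

In this sample, R2 is not at the top of the list for attribute 0, yet it heads the combination list for attributes 0 and 1, which is the kind of object a combination list can surface early.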
Consider list Li
Also consider the aggregation function t(•) used in retrieving the top k elements to be monotone, that is: $t(x_1, \ldots, x_m) \le t(x'_1, \ldots, x'_m)$ whenever $x_i \le x'_i$ for every i. In the limited information case, t may be further limited by belonging to a family of symmetric decomposable functions. Consider $\rho = \{P_1, \ldots, P_k\}$ to be a partition of $\{1, 2, \ldots, m\}$. For example, if m=6, then a possible partition is $\rho = \{\{1,4,6\},\{2,5\},\{3\}\}$. The function t is considered ρ-decomposable if there exist a function t′ and functions $f_{P_1}, \ldots, f_{P_k}$ such that:
$$t(x_1, \ldots, x_m) = t'\big(f_{P_1}(\{x_j : j \in P_1\}), \ldots, f_{P_k}(\{x_j : j \in P_k\})\big)$$
In the example above, there may exist functions $f_{1,4,6}$, $f_{2,5}$, $f_3$ and a function t′ such that $t(x_1,x_2,x_3,x_4,x_5,x_6) = t'(f_{1,4,6}(x_1,x_4,x_6), f_{2,5}(x_2,x_5), f_3(x_3))$. There may be many functions that occur in practice which are decomposable. For example, if t=min(•), max(•) or sum(•), the decomposition may be t′=f=t.
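The decomposition can be checked on the example partition with a short sketch. The helper `make_decomposable` and the sample vector are hypothetical; the sketch only illustrates that choosing t′=f=t for sum, min, and max reproduces the original function on this input.

```python
def make_decomposable(partition, f, t_prime):
    """Return t with t(x) = t'(f(x restricted to P) for each block P)."""
    def t(xs):
        # Indices in the partition are 1-based, as in the text.
        return t_prime([f([xs[j - 1] for j in block]) for block in partition])
    return t

rho = [(1, 4, 6), (2, 5), (3,)]   # the partition from the example with m = 6
xs = [3, 1, 4, 1, 5, 9]           # hypothetical scores x_1 .. x_6

t_sum = make_decomposable(rho, sum, sum)
t_min = make_decomposable(rho, min, min)
t_max = make_decomposable(rho, max, max)
```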
The overall process of aggregating a list of top ranked objects may be represented by
The ranked lists of object attributes may be scanned in parallel at step 304. In an embodiment, the ranked lists of object attributes may include ranked lists of individual object attributes as well as ranked lists of combination object attributes. At step 306, a fixed number of top scoring objects may be stored in a results list of top ranked objects.
An upper bound of the best possible aggregation scores of unseen objects in the ranked lists of object attributes may be computed at step 308. In a generalized early termination algorithm, an upper bound on the aggregated score of yet unseen objects may be computed to incorporate the extra information given by the combination lists of attributes. In various embodiments, the upper bound may be computed by a mathematical program. For simple decomposable aggregation functions such as addition, this simplifies to a linear program that can be solved in polynomial time. Addition is a natural aggregation function that is of interest in particular for information retrieval, where the relevance score of a document to a multi-term query is the sum of the relevance scores of the document to each of the terms in the query. While the linear program gives an optimum upper bound, it can be expensive to solve, especially if the number of lists is large. In an embodiment, an approximation algorithm may be used that computes a threshold within a factor of two of the optimum upper bound. Importantly, this approximation algorithm also extends to combination lists constructed from more than two lists.
At step 310, it may be determined whether the upper bound computed is less than the total score of top scoring objects stored in the results list. If the upper bound computed is not less than the total score of top scoring objects in the results list, then processing may continue at step 304 and the ranked lists of object attributes may continue to be scanned in parallel. If the upper bound computed is less than the total score of top scoring objects in the results list, then the top scoring objects in the results list may be output at step 312 and processing may be finished.
At step 402, ranked lists of individual attributes may be received for objects with a score. The ranked lists of individual attributes may be aggregated into ranked combination lists of multiple attributes with a score for objects at step 404. At step 406, a list may be selected in round robin order from the ranked lists of individual attributes and the ranked combination lists of multiple attributes. At step 408, the next score for an object may be read from the list. And at step 410, the scores for the object may be retrieved from each of the other ranked lists. At step 412, the scores for the object retrieved from the ranked lists may be added.
It should be noted that the object may be added to the results list if there are fewer than a fixed number of objects in the results list. Assuming there are a fixed number of objects in the results list, it may then be determined whether the sum of the scores for the object is greater than the lowest score for an object in the results list at step 414. If so, then the object may be added to the results list at step 416 and the object with the lowest score may be removed from the results list at step 418. If it is determined that the sum of the scores for the object is not greater than the lowest score for an object in the results list at step 414, then the upper bound threshold for unseen objects in the ranked lists may be computed at step 420.
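The steps above can be sketched for the smallest interesting case: two individual lists and one pre-aggregated combination list, with addition as the aggregation function. This is an illustrative sketch under those assumptions, not the claimed implementation; the function name `generalized_ta`, the dictionary-based random access, and the sample data are hypothetical. For two attributes the threshold for unseen objects is min(f1+f2, f12), where f1, f2, and f12 are the last scores read from each list, and scanning stops once that threshold is less than the lowest score in the results list.

```python
import heapq

def generalized_ta(objects, k):
    """objects: dict id -> (x1, x2) with non-negative scores.
    Returns (top-k list of (score, id), number of sequential reads)."""
    total = {o: x1 + x2 for o, (x1, x2) in objects.items()}
    # Two ranked individual lists and one ranked combination list.
    lists = [
        sorted(((o, xs[0]) for o, xs in objects.items()), key=lambda p: -p[1]),
        sorted(((o, xs[1]) for o, xs in objects.items()), key=lambda p: -p[1]),
        sorted(total.items(), key=lambda p: -p[1]),
    ]
    frontier = [lst[0][1] for lst in lists]  # bound on unread scores per list
    pos = [0, 0, 0]
    results = []          # min-heap of (score, id) holding at most k objects
    in_results = set()
    reads = 0
    while any(p < len(lst) for p, lst in zip(pos, lists)):
        for i, lst in enumerate(lists):      # round-robin over the lists
            if pos[i] >= len(lst):
                continue
            obj, score = lst[pos[i]]
            pos[i] += 1
            reads += 1
            frontier[i] = score
            if obj not in in_results:
                s = total[obj]               # stands in for random access
                if len(results) < k:
                    heapq.heappush(results, (s, obj))
                    in_results.add(obj)
                elif s > results[0][0]:
                    _, dropped = heapq.heapreplace(results, (s, obj))
                    in_results.discard(dropped)
                    in_results.add(obj)
            # Any unseen object satisfies x1 <= f1, x2 <= f2, x1 + x2 <= f12.
            threshold = min(frontier[0] + frontier[1], frontier[2])
            if len(results) == k and threshold < results[0][0]:
                return sorted(results, reverse=True), reads
    return sorted(results, reverse=True), reads

# Hypothetical data: C is mid-ranked in both individual lists but tops L12.
sample = {"A": (10, 0), "B": (0, 10), "C": (6, 6), "D": (5, 1), "E": (1, 5)}
top1, reads1 = generalized_ta(sample, 1)
```

Without the combination list, the threshold would stay at f1+f2, so the scan would have to go deeper into both individual lists before it could stop.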
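A common problem in the design of the early termination condition for top-k algorithms, and in particular, TA and NRA, is to obtain an upper bound on the aggregated score for elements not yet seen. Consider that the score of each parameter i may be bounded by $\bar{x}_i$. Then, for every unseen element $U=(x_1,x_2,\ldots,x_m)$, $x_i \le \bar{x}_i$, and $t(U) \le t(\bar{x}_1,\bar{x}_2,\ldots,\bar{x}_m)$ given the monotonicity of the aggregation function. Where extra information may be known for the aggregated score of some of the elements, the upper bound may be expressed as a mathematical program. Consider a case, for instance, where m=3 and the aggregation function t is the sum of all elements, such that $t(x_1,x_2,x_3)=x_1+x_2+x_3$. If the bounds $\bar{x}_1,\bar{x}_2,\bar{x}_3$ are known, then an easy bound on t is $\bar{x}_1+\bar{x}_2+\bar{x}_3$. If, in addition, it is known that $x_1+x_2 \le \bar{x}_{1,2}$, t may also be bounded by $\bar{x}_{1,2}+\bar{x}_3$. Suppose that the values of $\bar{x}_{2,3}$ and $\bar{x}_{1,3}$ are also known; then t may be bounded by:
$$t \le \bar{x}_{2,3}+\bar{x}_1, \qquad t \le \bar{x}_{1,3}+\bar{x}_2, \qquad t \le \tfrac{1}{2}\left(\bar{x}_{1,2}+\bar{x}_{2,3}+\bar{x}_{1,3}\right),$$
where the last bound follows from adding the three pairwise constraints.
Given these five possible bounds on t, the minimum may be computed over all of them by
$$t \le \min\left\{\bar{x}_1+\bar{x}_2+\bar{x}_3,\ \bar{x}_{1,2}+\bar{x}_3,\ \bar{x}_{2,3}+\bar{x}_1,\ \bar{x}_{1,3}+\bar{x}_2,\ \tfrac{1}{2}\left(\bar{x}_{1,2}+\bar{x}_{2,3}+\bar{x}_{1,3}\right)\right\}.$$
This minimum is the optimum value of a linear program: maximize $x_1+x_2+x_3$, subject to $x_i \le \bar{x}_i$, ∀i and $x_i+x_j \le \bar{x}_{i,j}$, ∀i,j.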
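For m=3 with addition as the aggregation function, the five bounds can be evaluated directly. The sketch below is illustrative only; the function name `sum_upper_bound_m3` and the sample values are hypothetical. The five candidates are the sum of the three individual bounds, each pairwise bound plus the remaining individual bound, and half the sum of the three pairwise bounds, which follows from adding the three pairwise constraints. Each candidate is a valid upper bound on x1+x2+x3, so their minimum is as well.

```python
def sum_upper_bound_m3(x1, x2, x3, x12, x13, x23):
    """Smallest of five upper bounds on x1 + x2 + x3 for an unseen object.

    x1, x2, x3 bound the individual scores; x12, x13, x23 bound pairwise sums.
    """
    candidates = [
        x1 + x2 + x3,           # individual bounds only
        x12 + x3,               # pair (1, 2) plus attribute 3
        x13 + x2,               # pair (1, 3) plus attribute 2
        x23 + x1,               # pair (2, 3) plus attribute 1
        (x12 + x13 + x23) / 2,  # sum of the three pairwise constraints, halved
    ]
    return min(candidates)

# Hypothetical frontier values: the pairwise lists give the tightest bound.
tight = sum_upper_bound_m3(10, 10, 10, 12, 12, 12)
# Uninformative pairwise bounds fall back to the sum of individual bounds.
loose = sum_upper_bound_m3(10, 10, 10, 100, 100, 100)
```

In the first call the assignment x1=x2=x3=6 satisfies every constraint and attains the returned value, so the bound is tight for that input.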
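And, more generally, given the decomposition of the aggregation function t with the resulting functions $f_P$ and t′, as above, and upper bounds $\bar{x}_P$, the optimization may be expressed as a mathematical program: maximize $\tau = t'\big(f_{P_1}(\{x_j : j \in P_1\}), \ldots, f_{P_k}(\{x_j : j \in P_k\})\big)$, subject to $f_P(\{x_j : j \in P\}) \le \bar{x}_P$ for every P for which an upper bound $\bar{x}_P$ is known.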
For arbitrary functions $f_P$, this may be a complicated optimization problem. However, f may be the addition function in the context of information retrieval where the relevance of a document to a multi-term query is the sum of the relevance of the document to each of the terms in the query. In this case, t is also the addition function, and each list is a combination of at most two elements. So, $t(x_1, \ldots, x_m)=x_1+\ldots+x_m$, and a list $L_{ij}$ has scores of $x_i+x_j$. The mathematical program then simplifies to the linear program: maximize $x_1+\ldots+x_m$, subject to $x_i \le \bar{x}_i$, ∀i and $x_i+x_j \le \bar{x}_{i,j}$, ∀i,j. This linear program can be expensive to solve where the number of lists is large. To handle this, an approximation algorithm may be used that computes a threshold within a factor of two of the optimum upper bound. This approximation algorithm also extends to combination lists that involve more than two lists.
Values $y_i$ and $y_{ij}$ may be initially stored which will represent our best upper bounds for the values of $x_i$ and $x_i+x_j$. The next step may assign $y_i=\bar{x}_i$ and $y_{ij}=\bar{x}_{ij}$. Considering each of the paired constraints, $y_i+y_j \le y_{ij}$, it follows that $y_i \le \min(y_{i1},y_{i2},\ldots,y_{im})$ since all of the values y are positive. The $y_i$'s may be reduced until $y_i \le \min(y_{i1},y_{i2},\ldots,y_{im})$ is satisfied for all i and j. Since $y_{ij}$ is the bound on the sum $x_i+x_j$ and $y_i$ is a bound on the value of $x_i$, then $y_{ij} \le y_i+y_j$. The $y_{ij}$'s may be reduced until $y_{ij} \le y_i+y_j$ is satisfied for all i and j. By iteratively reducing the $y_i$'s until $y_i \le \min(y_{i1},y_{i2},\ldots,y_{im})$ is satisfied and the $y_{ij}$'s until $y_{ij} \le y_i+y_j$ is satisfied for all i and j, a set of values y may be found that satisfy these conditions.
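The iterative reduction described above can be sketched as follows. This covers only the tightening of the bounds, not the remainder of the approximation algorithm; the function name `tighten_bounds` and the dictionary layout are hypothetical, and scores are assumed to be non-negative.

```python
def tighten_bounds(single, pair):
    """single: dict i -> upper bound on x_i.
    pair: dict (i, j) -> upper bound on x_i + x_j.
    Returns tightened copies (y, y_pair) of both dictionaries."""
    y = dict(single)
    y_pair = dict(pair)
    changed = True
    while changed:
        changed = False
        # Reduce y_i to at most every pairwise bound it takes part in:
        # x_i <= x_i + x_j <= y_ij because scores are non-negative.
        for (i, j), bound in y_pair.items():
            for a in (i, j):
                if y[a] > bound:
                    y[a] = bound
                    changed = True
        # Reduce y_ij to at most y_i + y_j, which always bounds x_i + x_j.
        for (i, j) in y_pair:
            if y_pair[(i, j)] > y[i] + y[j]:
                y_pair[(i, j)] = y[i] + y[j]
                changed = True
    return y, y_pair

# Hypothetical bounds for m = 3.
y, y_pair = tighten_bounds({1: 10, 2: 10, 3: 2},
                           {(1, 2): 12, (1, 3): 5, (2, 3): 30})
```

Every reduction keeps the values valid as upper bounds, so the sum of the tightened $y_i$'s still bounds the score of any unseen object from above.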
Returning to step 422 of
At step 508, the next score for an object may be read from the list. And at step 510, the best possible score may be computed for each object seen from the ranked lists of object attributes. For instance, the upper bound for t(R) may be expressed as a mathematical program, where N may denote the set of variables that have been revealed, such as N={1,3,6}, that maximizes $t(y_1, \ldots, y_m)$, subject to: $y_i = x_i$ for $i \in N$, $y_i \le \bar{x}_i$ for $i \notin N$, and $f_P(\{y_j : j \in P\}) \le \bar{x}_P$ for every $P \not\subseteq N$.
At step 512, the worst possible score may be computed for each object seen from the ranked lists of object attributes. By substituting the value 0 for the scores not yet revealed, as in $t(x_1,0,x_3,0,0,x_6)$, the lower bound for t(R) may be expressed as a mathematical program, where N may denote the set of variables that have been revealed, such as N={1,3,6}, that minimizes $t(y_1, \ldots, y_m)$, subject to: $y_i = x_i$ for $i \in N$, $y_i \le \bar{x}_i$ for $i \notin N$, and $f_P(\{y_j : j \in P\}) \le \bar{x}_P$ for every $P \not\subseteq N$.
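The best and worst possible scores can be illustrated for two attributes and one combination list, with addition as the aggregation function. The sketch is an assumption-laden illustration rather than the general mathematical program: the function name `score_bounds` and the dictionary keys are hypothetical, and it assumes that a score read from the combination list equals x1+x2 exactly, so the object's total is then known.

```python
def score_bounds(seen, frontier):
    """seen: scores read so far for one object, keyed by 1, 2, or (1, 2).
    frontier: last score read from each list, keyed by 1, 2, and (1, 2).
    Returns (worst, best) bounds on x1 + x2 for that object."""
    if (1, 2) in seen:
        return seen[(1, 2)], seen[(1, 2)]      # total read directly
    if 1 in seen and 2 in seen:
        exact = seen[1] + seen[2]
        return exact, exact
    # Worst case: every score not yet revealed is 0.
    worst = sum(seen.get(i, 0) for i in (1, 2))
    # Best case: unrevealed scores sit at the frontier of their lists,
    # capped by the frontier of the combination list.
    best = sum(seen.get(i, frontier[i]) for i in (1, 2))
    best = min(best, frontier[(1, 2)])
    return worst, best

frontier = {1: 5, 2: 5, (1, 2): 10}           # hypothetical frontier values
partial = score_bounds({1: 6}, frontier)      # object seen only in list 1
unseen = score_bounds({}, frontier)           # object not seen anywhere
```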
It should be noted that the object may be added to the results list if there are fewer than a fixed number of objects in the results list. Assuming there are a fixed number of objects in the results list, it may then be determined whether the worst possible score for the object is greater than the lowest score for an object in the results list at step 514. If so, then the object may be added to the results list at step 516 and the object with the lowest score may be removed from the results list at step 518.
If it is determined that the worst possible score for the object is not greater than the lowest score for an object in the results list at step 514, then it may be determined whether a fixed number of objects have been read from the ranked lists of object attributes at step 520. If it is determined that there have not been a fixed number of objects read from the ranked lists of object attributes, then processing may continue at step 506 where a list may be selected in round robin order from the ranked lists. If it is determined that there have been a fixed number of objects read from the ranked lists of object attributes, then it may be determined at step 522 whether the best score for every object seen that is not in the ranked list of results is less than the k-th largest of the worst scores computed for the objects seen. Thus the generalized NRA algorithm may halt when at least k objects have been seen and, for every object U that is not in the top k, B(U)<M, where B(U) is the upper bound on the object score for U, and M is the k-th largest worst score with ties broken in favor of higher best scores.
If the best score for some object seen that is not in the ranked list of results is not less than the k-th largest of the worst scores computed for the objects seen, then processing may continue at step 506 where a list may be selected in round robin order from the ranked lists. Otherwise, if it is determined at step 522 that the best score for every object seen that is not in the ranked list of results is less than the k-th largest of the worst scores computed for the objects seen, then the results list of ranked objects may be output at step 524 and processing may be finished for aggregating a list of top ranked objects from ranked combination lists using a generalized No Random-access Algorithm for early termination.
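The halting test, B(U)<M for every object U outside the top k, can be sketched as a small predicate. The function name `nra_should_halt` and its arguments are hypothetical. As an added assumption, the sketch also takes `unseen_best`, an upper bound on the score of any object not yet seen in any list, and treats such objects like any other object outside the top k.

```python
def nra_should_halt(worst, best, k, unseen_best):
    """worst, best: dict id -> bound, for every object seen so far.
    unseen_best: upper bound on the score of any object not yet seen."""
    if len(worst) < k:
        return False                   # fewer than k objects seen so far
    # Rank by worst score, ties broken in favor of the higher best score.
    ranked = sorted(worst, key=lambda o: (worst[o], best[o]), reverse=True)
    top_k, rest = ranked[:k], ranked[k:]
    m_k = worst[top_k[-1]]             # M: the k-th largest worst score
    if unseen_best >= m_k:
        return False                   # an unseen object could still qualify
    return all(best[o] < m_k for o in rest)

# Hypothetical bounds after some rounds of sequential reads.
worst = {"A": 10, "B": 7, "C": 3}
best = {"A": 12, "B": 9, "C": 6}
done = nra_should_halt(worst, best, k=2, unseen_best=5)
```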
Thus the present invention may provide generalizations of the TA and NRA algorithms where some pre-aggregated ranked lists of combination object attributes are available in addition to ranked lists of singleton object attributes. Importantly, the generalizations compute appropriate upper and lower bounds using a mathematical program to incorporate the additional information available for combinations of object attributes. In the case of the addition aggregation function, a matching-based algorithm may be used for pairwise intersections of object attributes, and a linear program that can be approximated may be used for intersections of object attributes over a larger number of lists. Moreover, an exact combinatorial algorithm based on minimum cost perfect matching may be used for pairwise intersections of object attributes. The intersections of object attributes improve the performance of retrieval algorithms in the following ways. First, the ranked lists of combinations of object attributes help the algorithm discover new objects. For example, an object may be far down in lists Li and Lj, but be near the top in list Li,j. Secondly, the ranked lists of combinations of object attributes improve the bounds on the unseen elements as computed by the mathematical program.
As can be seen from the foregoing detailed description, the present invention provides an improved system and method for aggregating a list of top ranked objects from ranked combination lists using an early termination algorithm. Ranked lists of individual object attributes may be aggregated into ranked lists of combination object attributes. The ranked lists of object attributes, including ranked lists of individual object attributes as well as ranked lists of combination object attributes, may be scanned in parallel. A fixed number of top scoring objects may be stored in a results list of top ranked objects. An upper bound of the best possible aggregation scores of unseen objects in the ranked lists of object attributes may be computed to incorporate the extra information given by the combination lists of attributes. If the upper bound computed is less than the score of the top scoring objects in the results list, then the top scoring objects in the results list may be output. As a result, the system and method provide significant advantages and benefits needed in contemporary computing, and more particularly in online search applications.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.