Ranking information is increasingly important as more data becomes available. While the information may exist, such as on the Internet (or the World Wide Web), or in other storage, it may not be obtainable in a usable fashion. That is to say, the information desired by a user or an application may not be obtained because the search does not retrieve or order the desired information.
Comparison methodologies may be problematic as the results may be obtained on a pair-wise basis. For example, when performing a search, the accuracy of the results may be determined based on how well each obtained item, considered individually, matches an idealized sample. As a result, the returned results may suffer from the limitations of this type of per-item comparison.
Procedures for learning and ranking items in a listwise manner are discussed. A listwise methodology may consider a ranked list of individual items as a specific permutation of the items being ranked. In implementations, a listwise loss function may be used in ranking items. A listwise loss function may be a metric which reflects the departure or disorder from an exemplary ranking for one or more sample listwise rankings used in learning. In this manner, use of the loss function may result in a listwise ranking, for the plurality of items, that approximates the exemplary ranking.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different instances in the description and the figures may indicate similar or identical items.
Overview
Accordingly, techniques are described which may provide ranking on a “list basis”. For example, the particular order (or arrangement) of ranked items may be considered when ranking items such as documents, email correspondence to be filtered, web sites to be filtered, and so on. While searching and document retrieval are discussed, the techniques disclosed herein may be applicable to a variety of situations in which a ranking or ordering of items is at issue.
In implementations, ranking of items may be accomplished using a listwise loss function, which may be a metric indicating the differentiation (e.g., the divergence between distributions) between one or more sample rankings and an exemplary ranking, which may be considered an idealized ranking of results. For example, a probabilistic methodology may be used, in part, to normalize sample or in question items (e.g., a collection of documents to be ranked) so that the particular order of items within the ranking may more closely match the exemplary ranking in comparison to other methodologies. Exemplary probabilistic methodologies may include, but are not limited to, a permutation probability, a top k probability, and so on. A listwise basis may result in a more accurate match between an ordered sample query, or an in question ordered set of items, and the exemplary ranking, in comparison to a pairwise basis in which individual items are compared with a standard, scored, and arranged based on the per item score. In the present discussion, the items may be ordered in a manner such that the order (e.g., the particular permutation of results) may approximate that of the exemplary ranking.
Exemplary Environment
While the present discussion describes the subject matter with respect to searching for documents or pages on the Internet (e.g., a data source 104, having data storage capability 106, over a network 108), the principles discussed herein may be used in a variety of situations in which ranking accuracy is desired. For example, the techniques discussed herein may be used to rank retrieved files stored in local data storage 110 or for use in blocking unwanted electronic correspondence.
In the present instance, a computing system may be configured to apply list based ranking methodologies in order to obtain a listwise loss function. The listwise loss function may be used to increase the ranking accuracy for an in question set of items. The listwise loss function may be obtained through ranking module training; for instance, a neural network model and a gradient descent algorithm may be used for training the ranking module. For example, a ranking module may use a listwise loss function, obtained from comparing training samples to an exemplary listwise ranking, to order Internet (i.e., the World Wide Web) search results. A listwise ranking may be a particular permutation of a set of items. The accuracy of a particular ranking (a subject listwise ranking) may be based on how closely the ordered list matches that of the exemplary ranked list. For example, a test listwise ranking may form a particular permutation (such as the results of a search for documents or web pages) which may be compared, list to list, with the exemplary listwise ranking. This comparison may be used to determine a listwise loss function, which is a metric of how disordered the subject listwise ranking is from the presumed good set (i.e., the exemplary listwise ranking). In further instances, in question searches may be used for learning purposes as well. The ranking module may use the listwise loss function to order items so that the ranking more closely matches or approximates the exemplary listwise ranking. In this way, ranked sets of items may be compared, rather than comparing individual items from the subject set to a standard.
In contrast, pairwise analysis may involve comparing a sample to a standard and then ordering the items based on how closely each sample matches the standard. A pairwise ranking may result in a tight coupling with the classification of individual items, rather than returning a response in which the order or ranking of the items as a whole is considered. Pairwise rankings may also be biased in favor of queries having larger numbers of sample instances (e.g., document pairs). In addition, the relative order of the results compared in the foregoing fashion may vary (i.e., different permutations may result).
While a human graded exemplary listwise ranking is described, various techniques may be used to generate an idealized ranking list, or a ranking list which is otherwise assigned as a “ground truth”. For example, the exemplary listwise ranking may be used in deriving a ranking function, which may be obtained from the exemplary listwise ranking and a subject listwise ranking that is empirically obtained.
In a highly simplified example, a ranking module 112, using the techniques discussed herein, may be used to generate a resultant ranking 114 on a listwise basis for query documents “A-J”. A listwise loss function 116, as discussed herein, may be used in ranking the documents on a listwise basis. The listwise loss function may be derived from learning data 118, which may include sample listwise rankings 120 in comparison to an exemplary listwise ranking 122. The listwise loss function may then be used in ranking the unordered documents “A-J”.
In implementations, the ranking module may use a training set of queries Q={q(1), q(2), . . . , q(m)} in which individual queries q(i) within the set are associated with a list of documents (or items) d(i)=(d1(i), d2(i), . . . , dn(i)(i)), in which dj(i) denotes the j-th document and n(i) denotes the size of d(i). In addition, individual lists of documents d(i) may be associated with a list of judgments (scores) y(i)=(y1(i), y2(i), . . . , yn(i)(i)), in which yj(i) denotes the judgment on document dj(i) with respect to query q(i).
For example, yj(i) can be the number of times a web page dj(i) was selected, or “clicked on”, when dj(i) is retrieved and returned for query q(i) by a search engine, the association being that the more often the particular document or page is clicked on, the more relevant the item may be to the query. Thus, the higher the observed click rate for dj(i) and q(i), the stronger the relevance that may be assumed to exist between them.
A feature vector xj(i)=Ψ(q(i), dj(i)) may be applied to individual query-document pair(s) (q(i), dj(i)), i=1, 2, . . . , m; j=1, 2, . . . , n(i). Thus, a list of feature vectors x(i)=(x1(i), . . . , xn(i)(i)) and the corresponding list of judgments y(i)=(y1(i), . . . , yn(i)(i)) may together form a training instance.
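By way of illustration only, training data of this form might be assembled as in the following sketch, in which the feature map Ψ is stood in for by a hypothetical make_features helper and the judgments y(i) are taken to be click counts; the names, features, and values are illustrative assumptions rather than part of the described system:

```python
import numpy as np

def make_features(query, doc):
    # hypothetical feature map Psi(q, d); a real system would compute
    # query-document relevance features such as term matches, BM25 scores, etc.
    return np.array([float(len(query)), float(len(doc)), float(query in doc)])

# one training instance per query: a list of feature vectors and a list of judgments
queries = ["listwise ranking", "loss function"]
documents = {
    "listwise ranking": ["doc about ranking lists", "unrelated page", "ranking tutorial"],
    "loss function": ["loss functions overview", "cooking recipes"],
}
clicks = {  # judgments y_j, e.g., observed click counts per returned document
    "listwise ranking": [12, 0, 7],
    "loss function": [9, 1],
}

feature_lists  = [np.stack([make_features(q, d) for d in documents[q]]) for q in queries]
judgment_lists = [np.array(clicks[q], dtype=float) for q in queries]
```

The point of the sketch is only the shape of the data: one list of feature vectors and one list of judgments per query, which together form a training instance.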
A ranking function ƒ may be generated such that, for an individual feature vector xj(i) (corresponding to document dj(i)), it outputs a score ƒ(xj(i)). For a list of feature vectors x(i), a list of scores z(i)=(ƒ(x1(i)), . . . , ƒ(xn(i)(i))) may be obtained. The objective of learning may then be to minimize the total losses with respect to the training data,
where L may represent a listwise loss function.
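In equation form, this objective may be written as the minimization, over the ranking function ƒ, of the summed losses across the m training queries:

$$\min_{f}\ \sum_{i=1}^{m} L\!\left(y^{(i)}, z^{(i)}\right)$$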
When ranking, once a new query q(i′) and the search's associated documents d(i′) (or items) are given, feature vectors x(i′) may be derived from the results and the trained ranking function may be used to assign scores to the documents d(i′) for the current search. The documents d(i′) may then be ranked in descending order of the scores in a listwise manner.
Various probability models may be applied to determine a ranking function. The probability models may be used to define the listwise loss function “L” in the summation above. A list of scores may be mapped to a probability distribution using the probability models, with a selected metric indicating the divergence between the distribution for the subject listwise ranking and the distribution for the exemplary listwise ranking. The metric may be considered a representation of how the listwise ranked results for the subject search diverge from the exemplary listwise ranked results. Exemplary probability models include, but are not limited to, permutation probability and top k probability.
For a permutation probability model, the set of items (e.g., documents) to be ranked may be identified as 1, 2, . . . , n. A permutation π on the objects is defined as a bijection (e.g., a one-to-one and onto mapping) from {1, 2, . . . , n} to itself. The permutation may be expressed as π=&lt;π(1), π(2), . . . , π(n)&gt;, in which π(j) denotes the object at position j in the permutation. The set of possible permutations of n objects (e.g., items) is denoted as Ωn; “Ωn” may represent the set of possible (distinct) arrangements of the n items.
For a ranking function which assigns scores to the n objects, s may denote the list of scores s=(s1, s2, . . . , sn), where sj is the score of the j-th object. For purposes of the present discussion, the ranking function and the list of scores obtained from the ranking function may be referred to interchangeably (in a general sense) to aid in describing the subject matter.
In the present procedure, there may exist some uncertainty in the prediction of ranking lists (permutations) using the ranking function. While any individual permutation may be possible, the various permutations may be associated with different likelihood values based on the ranking function. As a result, some individual permutations may be considered more likely to occur in comparison to the other permutations within the set of possible permutations. A permutation probability may be associated with the ranking function to indicate the likelihood of a particular permutation given the list of scores.
As a result, if π is a permutation of the n objects, and φ(.) is an increasing and strictly positive function, the probability of permutation π given the list of scores s may be defined as a product of terms over the positions of the permutation, in which sπ(j) denotes the score of the object at position j of permutation π, and Ps(π) is the product, over j=1 to n, of φ(sπ(j)) divided by the sum of φ(sπ(k)) for k=j, . . . , n.
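Consistent with this description, the permutation probability may be written as:

$$P_s(\pi) = \prod_{j=1}^{n} \frac{\phi\!\left(s_{\pi(j)}\right)}{\sum_{k=j}^{n} \phi\!\left(s_{\pi(k)}\right)}$$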
For example, for three objects {1, 2, 3} having scores s=(s1, s2, s3), the probabilities of permutations π=&lt;1,2,3&gt; and π′=&lt;3,2,1&gt; may be computed from the definition above.
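Writing out the products gives:

$$P_s(\langle 1,2,3\rangle) = \frac{\phi(s_1)}{\phi(s_1)+\phi(s_2)+\phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2)+\phi(s_3)} \cdot \frac{\phi(s_3)}{\phi(s_3)}$$

$$P_s(\langle 3,2,1\rangle) = \frac{\phi(s_3)}{\phi(s_1)+\phi(s_2)+\phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_1)+\phi(s_2)} \cdot \frac{\phi(s_1)}{\phi(s_1)}$$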
The permutation probabilities Ps(π), π∈Ωn, may form a probability distribution over the set of permutations; that is to say, for each π∈Ωn, Ps(π)&gt;0, and the probabilities Ps(π) sum to 1 over all π∈Ωn.
Given any two permutations π and π′∈Ωn, if: (1) π(p)=π′(q) and π(q)=π′(p) with p&lt;q; (2) π(r)=π′(r) for r≠p,q; and (3) sπ(p)&gt;sπ(q); then Ps(π)&gt;Ps(π′). For the n objects, if s1&gt;s2&gt; . . . &gt;sn, then Ps(&lt;1, 2, . . . , n&gt;) is the highest permutation probability and Ps(&lt;n, n−1, . . . , 1&gt;) is the lowest permutation probability among the permutation probabilities of the n objects. Thus, the first property indicates that, for a permutation in which an object or item with a larger score is ranked ahead of another object (item) with a smaller score, if the respective positions (of the two items in the listwise ranking) are exchanged, the permutation probability of the resulting permutation will be lower than that of the original permutation.
As a result, if s1&gt;s2&gt; . . . &gt;sn, then Ps(&lt;1, 2, . . . , n&gt;) being highest indicates that, given the scores of n objects, the list of objects sorted in descending order of the scores may have the highest permutation probability, while the list of objects sorted in ascending order may have the lowest permutation probability (in comparison to each other). Thus, a listwise ranking in which the scores are ordered in descending order may occur more frequently than an ascending list in which the (relatively) highest match is at the terminal position in the list.
For a linear function φ(x)=αx, α&gt;0, the permutation probability may not vary with scale; that is, Ps and Pλs define the same distribution for all λ&gt;0, where λs denotes the score list obtained by multiplying the individual components of score list s by the positive constant λ.
For an exponential function φ(x)=exp(x), the permutation probability may not vary with translation; that is, Ps and Pλ+s define the same distribution for all λ∈ℝ, where λ+s denotes the score list obtained by adding the constant λ (individually) to the individual components of score list s.
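These two invariance properties may be summarized in equation form as:

$$P_{\lambda s}(\pi) = P_s(\pi) \quad \forall\, \lambda > 0 \quad \text{(for } \phi(x)=\alpha x,\ \alpha>0\text{)}$$

$$P_{\lambda + s}(\pi) = P_s(\pi) \quad \forall\, \lambda \in \mathbb{R} \quad \text{(for } \phi(x)=\exp(x)\text{)}$$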
Thus, for two lists of scores (used in a listwise comparison in which the listwise rankings are compared), the two corresponding permutation probability distributions may be calculated, and a metric between the two distributions may be taken as a listwise loss function representing the deviation of the subject listwise permutation from the exemplary one. However, since the number of permutations is on the order of n! (n factorial), the calculation may be problematic to perform in practice because of its extent, in comparison to other techniques.
Following a top k probability model, the top k probability of items or objects (j1, j2, . . . , jk) may represent the probability of those objects being ranked in the top k positions, given the scores of the objects. The top k subgroup Gk(j1, j2, . . . , jk) may contain the permutations in which the top k objects are (j1, j2, . . . , jk), in that order:
Gk(j1, j2, . . . , jk)={π∈Ωn|π(t)=jt, ∀t=1, 2, . . . , k};
and Gk is the collection of all top k subgroups:
Gk={Gk(j1, j2, . . . , jk)|jt=1, 2, . . . , n, ∀t=1, 2, . . . , k, and ju≠jv, ∀u≠v}
Thus, for the group, there may be n!/(n−k)! elements in the collection Gk. As a result of this methodology, the number of elements may be much smaller than the number of elements in Ωn for the permutation probability approach discussed above.
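For instance, with n = 5 objects and k = 2 (an illustrative case), the reduction is:

$$\frac{n!}{(n-k)!} = \frac{5!}{3!} = 20 \quad \text{versus} \quad |\Omega_5| = 5! = 120$$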
The top k probability of objects (j1, j2, . . . , jk) may be the probability of the subgroup Gk(j1, j2, . . . , jk) under the permutation probability Ps(π) given s. Thus, the top k probability of objects (j1, j2, . . . , jk) may equal the sum of the permutation probabilities of the permutations in which objects (j1, j2, . . . , jk) are ranked in the top k positions.
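In equation form, this summation may be written as:

$$P_s\!\left(G_k(j_1, j_2, \ldots, j_k)\right) = \sum_{\pi \,\in\, G_k(j_1, j_2, \ldots, j_k)} P_s(\pi)$$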
While the top k probabilities may be calculated through the summation in the above approach, in further implementations, the following more direct approach may be implemented.
For the top k probability Ps(Gk(j1, j2, . . . , jk)), a direct computation from the scores may be used, in which sjt denotes the score of object jt, the object ranked at position t of the top k positions, t=1, 2, . . . , k.
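One form consistent with the description above (where jk+1, . . . , jn denote the remaining objects in any fixed order, so that the denominator at step t sums φ over every object not already placed in the first t−1 positions) may be:

$$P_s\!\left(G_k(j_1, j_2, \ldots, j_k)\right) = \prod_{t=1}^{k} \frac{\phi\!\left(s_{j_t}\right)}{\sum_{l=t}^{n} \phi\!\left(s_{j_l}\right)}$$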
As a result, a metric between the corresponding top k probability distributions may be implemented as a listwise loss function for use in determining the accuracy for a listwise ranking.
For example, when using cross entropy as a metric (in which the metric indicates the tendency towards “disorder”), the listwise loss function L(y(i), z(i)) (above) may be expressed as the cross entropy between the corresponding top k probability distributions.
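With Gk as defined above, this cross entropy loss may be written as:

$$L\!\left(y^{(i)}, z^{(i)}\right) = -\sum_{\forall g \,\in\, \mathcal{G}_k} P_{y^{(i)}}(g)\, \log P_{z^{(i)}}(g)$$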
While other methodologies may be utilized (such as the permutation probability methodologies discussed above), a learning method may be utilized for optimizing the listwise loss function based on top k probability, using neural network modeling and gradient descent as the optimization algorithm. Continuing the example above, if the ranking module is used when searching the Internet or a database for relevant documents or items, the ranking module may employ a ranking function based on a neural network model. The neural network model ω may be denoted as the ranking function ƒω, wherein, given a feature vector xj(i), ƒω(xj(i)) is used in scoring. If φ in the top k probability above is set as an exponential function, which may not vary with translation (as discussed above), the top k probability may be expressed in terms of exponentials of the scores.
Given a query q(i), the ranking function ƒω can generate a score list z(i)(ƒω)=(ƒω(x1(i)), ƒω(x2(i)), . . . , ƒω(xn(i)(i))), from which the top k probability distribution over the documents may be calculated.
With Cross Entropy as the metric, the loss for query q(i) may be expressed in the cross entropy form given above, with the model distribution computed from the score list z(i)(ƒω).
In which “∀g∈Gk” indicates that the summation in the cross entropy loss is taken over each element g of the collection Gk (e.g., each top k subgroup).
The gradient of L(y(i),z(i)(ƒω)) with respect to parameter ω can then be calculated.
The above calculation provides the gradient descent update with respect to the parameter ω, for use with the listwise loss function in training the ranking module. In the foregoing manner, the module may optimize a listwise ranking for the subject items.
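The following is a minimal sketch of how such training might be carried out, under the simplifying assumptions of a linear scoring model standing in for the neural network ƒω, top 1 (k=1) probabilities, cross entropy loss, and plain gradient descent; the function names, learning rate, and epoch count are illustrative assumptions rather than the described implementation:

```python
import numpy as np

def top_one_probabilities(scores):
    # top 1 probability of each item: exp(s_j) / sum_l exp(s_l)  (softmax over the list)
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

def listwise_loss(judgments, scores):
    # cross entropy between the top 1 distributions of the judgments and the model scores
    p_y = top_one_probabilities(judgments)
    p_z = top_one_probabilities(scores)
    return -np.sum(p_y * np.log(p_z))

def train_ranking_function(feature_lists, judgment_lists, learning_rate=0.01, epochs=100):
    # feature_lists: one (n_i x d) array of feature vectors per query
    # judgment_lists: one length-n_i array of relevance judgments per query
    d = feature_lists[0].shape[1]
    w = np.zeros(d)                       # linear scoring model f_w(x) = w . x
    for _ in range(epochs):
        for x, y in zip(feature_lists, judgment_lists):
            z = x @ w                     # scores for the documents of this query
            # gradient of the cross entropy loss w.r.t. w for the linear model:
            # dL/dw = sum_j (P_z(j) - P_y(j)) * x_j
            grad = x.T @ (top_one_probabilities(z) - top_one_probabilities(y))
            w -= learning_rate * grad
    return w

def rank_documents(w, features):
    # score the documents of a new query and return indices in descending score order
    scores = features @ w
    return np.argsort(-scores)
```

In this sketch the quantity x.T @ (Pz − Py) is the analytic gradient of the cross entropy between the two top 1 distributions for a linear model, so each update moves the model's score distribution toward the distribution implied by the judgments, after which rank_documents orders a new result set by the learned scores.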
Generally, any of the functions described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or a combination of these implementations. The terms “module,” “functionality,” and “logic” as used herein generally represent software, firmware, hardware, or a combination thereof. In the case of a software implementation, for instance, the module, functionality, or logic represents program code that performs specified tasks when executed on a processor (e.g., CPU or CPUs). The program code can be stored in one or more computer readable memory devices, e.g., tangible memory and so on.
Exemplary Procedures
The following discussion describes a methodology that may be implemented utilizing the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, or software, or a combination thereof. The procedures are shown as a set of blocks that specify operations performed by one or more devices and are not necessarily limited to the orders shown for performing the operations by the respective blocks. A variety of other examples are also contemplated.
A sample listwise ranking may be obtained 202, such as from a sample or learning search, and compared 204 to an exemplary listwise ranking, such as for use in training. For example, the sample listwise ranking may be a ranked or ordered set of sample or training items or documents. For instance, a training set may include four individual items which are ranked in descending order based on the relevancy of the items.
The comparison 204 may be used to derive 206 a listwise loss function which generally indicates how the set of ranked results differs from an exemplary listwise ranking (e.g., how the obtained list is disordered with respect to the exemplary list). Thus, the metric may be applied in subsequent situations so that a resultant ranking may more closely match an expected or exemplary ranking. For example, the listwise loss function may indicate how the obtained ranking is disordered from the exemplary ranking. In a simplistic example, if the exemplary ranking is “A, B, C, D, E, F, G” and the obtained ranking is “A, B, C, D, F, E, G”, the listwise loss function may generally represent the disorder in the ranking of items “E” and “F” on a listwise basis.
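As an illustration of how this disorder might be quantified, the following sketch assigns hypothetical descending relevance values to the exemplary order and computes the top 1 (k=1) cross entropy between the exemplary and obtained lists; the numeric judgments are illustrative assumptions:

```python
import numpy as np

def top_one(scores):
    # top 1 (k = 1) probabilities: exp(s_j) / sum_l exp(s_l)
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# hypothetical judgments: the exemplary order "A, B, C, D, E, F, G" scored 7 down to 1
exemplary = np.array([7., 6., 5., 4., 3., 2., 1.])
# the obtained ranking "A, B, C, D, F, E, G" effectively swaps the scores of E and F
obtained = np.array([7., 6., 5., 4., 2., 3., 1.])

loss = -np.sum(top_one(exemplary) * np.log(top_one(obtained)))
minimum = -np.sum(top_one(exemplary) * np.log(top_one(exemplary)))
# loss exceeds minimum, and the excess reflects the disorder at the positions of E and F
```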
In implementations, comparing 204 the sample listwise ranking (of individual items) may be performed using a training set of queries Q={q(1), q(2), . . . , q(m)} in which the individual queries q(i) within the set are associated with a list of documents (or items) d(i)=(d1(i), d2(i), . . . , dn(i)(i)). Individual lists of documents d(i) may, in turn, be associated with a list of judgments (scores) y(i)=(y1(i), y2(i), . . . , yn(i)(i)), in which yj(i) denotes the judgment on document dj(i) with respect to query q(i).
For example, yj(i) can be the number of times a web page dj(i) was selected, or “clicked on”, when dj(i) is retrieved and returned for query q(i) by a search engine, the association being that the more often the particular document or page is clicked on, the more relevant the item may be to the query. Thus, the higher the observed click rate for dj(i) and q(i), the stronger the relevance that may be assumed to exist between them.
A feature vector xj(i)=Ψ(q(i),dj(i)) may be applied to individual query-document pair(s) (q(i), dj(i)), i=1, 2, . . . , m; j=1, 2, . . . , n(i). Thus, a list of feature vectors x(i)=(x1(i), . . . , xn(i)(i)) and the corresponding list of judgments y(i)=(y1(i), . . . , yn(i)(i)) may together form a training instance.
A ranking function ƒ may be generated such that, for an individual feature vector xj(i) (corresponding to document dj(i)), it outputs a score ƒ(xj(i)). For a list of feature vectors x(i), a list of scores z(i)=(ƒ(x1(i)), . . . , ƒ(xn(i)(i))) may be obtained. The objective of learning may then be to minimize the total losses with respect to the training data,
where L may represent a listwise loss function.
When ranking, once a new query q(i′) and the search's associated documents d(i′) (or items) are given, feature vectors x(i′) may be derived from the results and the trained ranking function may be used to assign scores to the documents d(i′) for the current search. The (items) documents d(i′) (from the current search) may then be ranked in descending order of the scores in a listwise manner.
As discussed previously with respect to the exemplary environment, various probability models may be applied in deriving the listwise loss function, including the permutation probability and the top k probability.
For a permutation probability model, the set of items (e.g., documents) to be ranked may be identified as 1, 2, . . . , n. A permutation π on the objects is defined as a bijection (e.g., a one-to-one and onto mapping) from {1, 2, . . . , n} to itself. The permutation may be expressed as π=&lt;π(1), π(2), . . . , π(n)&gt;, in which π(j) denotes the object at position j in the permutation. The set of possible permutations of n objects (e.g., items) is denoted as Ωn; “Ωn” may represent the set of possible (distinct) arrangements of the n items.
For a ranking function which assigns scores to the n objects, s may denote the list of scores s=(s1, s2, . . . , sn), where sj is the score of the j-th object.
Following a top k probability model, as discussed above with respect to the exemplary environment, the top k probability of objects (j1, j2, . . . , jk) may represent the probability of those objects being ranked in the top k positions, given the scores of the objects. The top k subgroup Gk(j1, j2, . . . , jk) may contain the permutations in which the top k objects are (j1, j2, . . . , jk), in that order:
Gk(j1, j2, . . . , jk)={π∈Ωn|π(t)=jt, ∀t=1, 2, . . . , k};
and Gk is the collection of all top k subgroups:
Gk={Gk(j1, j2, . . . , jk)|jt=1, 2, . . . , n, ∀t=1, 2, . . . , k, and ju≠jv, ∀u≠v}
Thus, for the group, there may be n!/(n−k)! elements in the collection Gk. As a result of this methodology, the number of elements may be much smaller than the number of elements in Ωn for the permutation probability approach discussed above.
The top k probability of objects (j1, j2, . . . , jk) may be the probability of the subgroup Gk(j1, j2, . . . , jk) under the permutation probability Ps(π) given s. Thus, the top k probability of objects (j1, j2, . . . , jk) may equal the sum of the permutation probabilities of the permutations in which objects (j1, j2, . . . , jk) are ranked in the top k positions.
While the top k probabilities may be calculated through the summation in the above approach, in further implementations, the following more direct approach may be implemented.
For the top k probability Ps(Gk(j1, j2, . . . , jk)), a direct computation from the scores may be used, in which sjt denotes the score of object jt, the object ranked at position t of the top k positions, t=1, 2, . . . , k.
As a result, a metric between the corresponding top k probability distributions may be implemented as a listwise loss function for use in increasing the accuracy for a listwise ranking. The metric may be derived from the sample listwise ranking (one or more may be used) and exemplary listwise rankings.
For example, when using cross entropy as a metric (in which the cross entropy metric indicates the tendency of the in question set towards “disorder” with respect to the exemplary listwise ranking), the listwise loss function L(y(i), z(i)) (above) may be expressed as the cross entropy between the corresponding top k probability distributions.
While other probability methodologies may be utilized when obtaining a loss function from the sample listwise ranking(s) and the exemplary listwise ranking, a learning method may be utilized for optimizing the listwise loss function based on top k probability, using neural network modeling and gradient descent as the optimization algorithm. Continuing the top k probability model directly above, a ranking function based on a neural network model may be used. The neural network model ω may be denoted as the ranking function ƒω, wherein, given a feature vector xj(i), ƒω(xj(i)) is used in scoring. If φ in the top k probability above is set as an exponential function, which may not vary with translation (as discussed above), the top k probability may be expressed in terms of exponentials of the scores.
Given a query q(i), the ranking function ƒω can generate a score list z(i)(ƒω)=(ƒω(x1(i)), ƒω(x2(i)), . . . , ƒω(xn(i)(i))), from which the top k probability distribution over the documents may be calculated.
With Cross Entropy as the metric, the loss for query q(i) may be expressed in the cross entropy form given above, with the model distribution computed from the score list z(i)(ƒω).
In which “∀g∈Gk” indicates that the summation in the cross entropy loss is taken over each element g of the collection Gk (e.g., each top k subgroup).
The gradient of L(y(i),z(i)(ƒω)) with respect to parameter ω can then be calculated.
The above calculation provides the gradient descent update with respect to the parameter ω, for use in learning with the listwise loss function during training. In the foregoing manner, the method may optimize a listwise ranking for the subject items.
The derived listwise loss function may be used as part of ranking 208 a plurality of items so that the in question ranking approximates the exemplary ranking on a listwise basis. If a plurality of items is retrieved in a query, the ordered listwise ranking may be arranged so that the overall ranking matches, or otherwise approximates, the exemplary listwise ranking. In this manner, the subject ranking may avoid the issues experienced with a pairwise methodology, in which the ranked order of the items may vary. Thus, the listwise loss function may be applied to a set including a plurality of items so that the obtained ranking approximates (e.g., matches, or more closely matches than other methodologies) the exemplary ranking. Thus, the subject ranking may be produced using the listwise loss function to “correct” for differences from the expected exemplary listwise ranking. The ranked, in question plurality of items may then be implemented 210, such as by presenting the ranking to a user, exporting the ranking to an application, and so on.
In additional implementations, procedures and computer-readable media including computer-executable instructions that may direct a computer to perform the discussed procedures are discussed. The present techniques may be used when ranking a plurality of items. The ranking may be achieved using a listwise loss function, which may be a learned metric representing the cross entropy, or the tendency away from order (e.g., away from the exemplary listwise ranking), for one or more sample listwise rankings. The listwise loss function may be obtained as described above with respect to the foregoing procedures.
Although the invention has been described in language specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed invention.